UBC Theses and Dissertations

Statistical methods for big tracking data Liu, Yang 2017



Statistical Methods for Big Tracking Data

by

Yang Liu

B.Sc., Nankai University, 2010
M.Sc., The University of British Columbia, 2012

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

THE FACULTY OF GRADUATE AND POSTDOCTORAL STUDIES
(Statistics)

The University of British Columbia
(Vancouver)

March 2017

© Yang Liu, 2017

Abstract

Recent advances in technology have led to large sets of tracking data, which bring new challenges in statistical modeling and prediction. Building on recent developments in Gaussian process modeling for spatio–temporal data and stochastic differential equations (SDEs), we develop a sequence of new models and corresponding inferential methods to meet these challenges. We first propose Bayesian Melding (BM) and downscaling frameworks to combine observations from different sources. To use BM for big tracking data, we exploit the properties of the processes, along with approximations to the likelihood, to break a high dimensional problem into a series of lower dimensional problems. To implement the downscaling approach, we apply the integrated nested Laplace approximation (INLA) to fit a linear mixed effects model that connects the two sources of observations. We apply these two approaches in a case study involving the tracking of marine mammals. Both of our frameworks have superior predictive performance compared with traditional approaches in both cross–validation and simulation studies.

We further develop the BM framework with stochastic processes that can reflect the time varying features of the tracks. We first develop a conditional heterogeneous Gaussian process (CHGP), but certain properties of this process make it extremely difficult to perform model selection. We also propose a linear SDE with splines as its coefficients, which we refer to as a generalized Ornstein–Uhlenbeck (GOU) process.
The GOU achieves flexible modeling of the tracks in both mean and covariance with a reasonably parsimonious parameterization. Inference and prediction for this process can be computed via the Kalman filter and smoother. BM with the GOU achieves a smaller prediction error and better credible intervals than the basic BM and downscaling models in cross–validation comparisons.

Following the success with the GOU, we further study a special class of SDEs called potential field (PF) models, which formulate the drift term as the gradient of another function. We apply the PF approach to modeling the tracks of marine mammals as well as basketball players, and demonstrate its potential in learning, visualizing, and interpreting the trends in the paths.

Preface

This dissertation is the original intellectual work of the author, Yang Liu, under the supervision of Prof. James V. Zidek.

Versions of Chapter 2 and Chapter 4 have been published [Liu Y, Battaile BC, Trites AW, Zidek JV, Bias correction and uncertainty characterization of dead-reckoned paths of marine mammals, Animal Biotelemetry 3:51, 2015] and [Liu Y, Zidek JV, Trites AW, Battaile BC, Bayesian data fusion approaches to predicting spatial tracks: Application to marine mammals, The Annals of Applied Statistics, Vol. 10, No. 3, 1517–1546, 2016]. I was the lead investigator, responsible for all methods development, algorithm design and implementation, simulation design and analysis, data analysis, and manuscript preparation. Battaile BC and Trites AW provided data for the analysis and background of the scientific study. Zidek JV was the supervisory author on this project and was involved throughout the project in concept formation and manuscript edits.

A version of Chapter 7 has also been published [Liu Y, Dinsdale DR, Jun SH, Briercliffe C, Bone J, Statistical learning of basketball strategy via SportVU tracking data: The potential field approach, Cascadia Symposium on Statistics in Sports, 2016].
I was the lead investigator, responsible for most methods development, algorithm design and implementation, data processing and analysis, and manuscript composition. Dinsdale DR was involved in the concept formation. All the other authors were involved in the results visualization, interpretation, and manuscript edits.

Table of Contents

Abstract
Preface
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgments
Dedication

1 Introduction

2 Background on Animal Tracking
  2.1 Satellite tags
  2.2 Dead–Reckoning
    2.2.1 Conventional bias correction for DR path
  2.3 Data sets in our case study

3 Review of Some Fundamental Statistical Methods
  3.1 Gaussian process in spatio–temporal modeling of big data
    3.1.1 Methods to deal with big data
    3.1.2 Methods for non-stationarity
    3.1.3 Basics of the integrated nested Laplace approximation
  3.2 Stochastic differential equations
  3.3 Kalman filter and smoother
  3.4 B-spline

4 Bayesian Data Fusion Approaches
  4.1 Introduction
  4.2 Bayesian melding
    4.2.1 Model inference
    4.2.2 Connection with the conventional correction
  4.3 Downscaling
  4.4 Simulation study
    4.4.1 Simulation evaluation of the approximations in the BM approach
    4.4.2 Comparison of all five approaches
  4.5 Case study results
    4.5.1 Goodness of normal approximation credible intervals
    4.5.2 Model selection for BM
    4.5.3 Model selection for downscaling
    4.5.4 Cross–validation comparison of the different approaches
    4.5.5 Combined paths and their uncertainty from BM and downscaling
    4.5.6 Paths in both dimensions and the distance traveled by the animal
  4.6 Discussion

5 Conditional Heterogeneous Gaussian Process
  5.1 Definition of CHGP and CHBB
  5.2 Attempts at model selection for the CHBB
    5.2.1 Metrics for model comparison
    5.2.2 Reason for their failure
  5.3 Reflections and future directions

6 Generalized Ornstein–Uhlenbeck Process and Its Application in Animal Tracking
  6.1 Generalized Ornstein–Uhlenbeck process
    6.1.1 Mean and covariance of GOU: our model choice
    6.1.2 Conditional distribution of GOU
    6.1.3 Sparse precision matrix of GOU
    6.1.4 Implementation of GOU in practice
  6.2 Application to Bayesian melding for animal's track
    6.2.1 DLM of the BM approach
    6.2.2 Model selection for BM-GOU and comparison with other approaches

7 Potential Field Modeling in Tracking
  7.1 Potential field for learning basketball strategies
    7.1.1 Potential field as a tensor product spline over the space
    7.1.2 Potential field using the other player's location as a covariate
  7.2 A mixture of potential fields model for seal foraging tracks

8 Summary, Discussion, and Future Work
  8.1 Summary
  8.2 Another examination of the approximation and model specification in Bayesian melding
    8.2.1 Why the variance parameters decrease
    8.2.2 Similar phenomenon in other data sets and models
    8.2.3 Ideas to stabilize the variance parameters
    8.2.4 The variance parameters and the CI
  8.3 Future work
    8.3.1 Future developments for Bayesian melding
    8.3.2 Approximation to non-linear SDE

Bibliography

A Inferential Methods for Bayesian Melding
  A.1 Explicit form of ⟨β, η | X, Y⟩
  A.2 Derivation of (4.11) and (4.12)
  A.3 Explicit expression for (4.12)
  A.4 Explicit expression of [φ, β, η_G | X_G, Y]
  A.5 Marginal distribution of η at the non-GPS points
  A.6 Integration over the variance parameters φ

B Some Mathematical Details for CHBB
  B.1 |C| = |C_0| |C_{1,1}| |C_{1,2}|
  B.2 Short BB are more likely to have a small variance estimate
  B.3 Derivation of KL divergence for CHBB

List of Tables

Table 2.1 The sample quantiles in minutes of the time gaps between two consecutive GPS observations in our data.

Table 4.1 RMSEs and actual coverage percentages of 95% credible intervals (in gray background) in L5OCV comparisons for different bias correction terms h(t) = ∑_{i=1}^Q β_i t^{i−1} with Q = 1, 2, 3, 4 in the BM approach.

Table 4.2 Selected DIC values for the downscaling models. "NA" means that INLA crashed when fitting this model. The AR1–AR2 model minimizes the DIC for Trip 1 Northing, AR2–RW1 for Trip 1 Easting and Trip 2 Easting, and RW2–RW1 for Trip 2 Northing. The RW1–RW1 model is included as a benchmark.

Table 4.3 RMSEs and actual coverage percentages of nominally 95% credible intervals (in gray background) for the L5OCV comparisons of the different downscaling models with different processes.

Table 4.4 RMSEs and actual coverage percentages of nominally 95% credible intervals (in gray background) for the L5OCV comparisons of all the approaches. Two approaches use only GPS data: linear interpolation (Linear) and the correlated random walk (CRAWL) with drift term.
Three approaches use both the GPS and DR path: conventional bias correction of DR (Convention), Bayesian Melding with Q = 1 (BM1), and downscaling with the RW1–RW1 model (DS11).

Table 4.5 Total distance (km) traveled in the two fur seal foraging trips, calculated via different methods.

Table 6.1 Computation times (in seconds) for three different approaches to evaluating the likelihood of models involving the GOU. T is the number of time points.

Table 6.2 The best GOU model, RMSEs, and actual coverage percentages of 95% credible intervals (in gray backgrounds) for the Bayesian Melding approach with the GOU process (BM-GOU). The CV-RMSE and coverage from the Bayesian Melding with the Brownian Bridge (BM-BB) and the downscaling approach (DS) are also included for comparison.

Table 6.3 The average width of posterior credible intervals from the BM-GOU and the BM-BB.

Table 7.1 Fitted coefficients (multiplied by 1000) of the potential fields corresponding to HOU's most commonly used lineup in Game 5. They reflect the likelihood of the ball moving toward "column name" when "row name" is in possession of the ball.

Table 8.1 Posterior modes of φ (σ̂²_H, σ̂²_D) from [φ | X_R ∪ X_G, Y] and cross–validation performance under different rates of thinning for the Trip 2 Northing data set. The cross–validation RMSE and coverage rate of the 95% CI (in brackets) from both LOOCV and L5OCV are reported.

Table 8.2 The estimates of variance parameters in the CRAWL and Matérn models with different subset rates for the Trip 2 (Northing) GPS data set. Thinning rate = 1 gives the original data set.

Table 8.3 Variance parameter estimates of the Brownian Motion and Matérn process for the motorcycle data set.
List of Figures

Figure 4.1 The (negative) longitude of the GPS observations of the foraging trips made by two northern fur seals in our case study. Both trips started and ended at Bogoslof Island (Alaska), where the females gave birth and nursed their pups, and the horizontal line indicates the longitude -168.035E of the island. The time unit is the proportion of the total time of this foraging trip.

Figure 4.2 Plots from our simulation, from left to right: the box plot of the posterior modes of log(σ²_H) obtained from the approximate and full likelihoods; the box plot of the posterior modes of log(σ²_D); and the scatter plot of the RMISE of the posterior mean of η from the approximate Bayesian Melding approach versus the RMISE of the posterior mean from the full Bayesian Melding approach.

Figure 4.3 Pooled RMISE for all five methods of estimating the animal's path, stratified by the data generating model.

Figure 4.4 The exact posterior density (solid line) and normal approximation density (dotted line) for a few η(t) in the Trip 1 Northing direction for a northern fur seal.

Figure 4.5 The reconstructed path for a northern fur seal in a selected period from all five approaches considered in our study: Bayesian Melding (BM1), downscaling (DS11), conventional bias correction, CRAWL, and linear interpolation. The dots are the GPS observations.

Figure 4.6 The Bayesian Melding and downscaling results for a northern fur seal undertaking Trip 1 Northing from Bogoslof Island, Alaska: points are the GPS observations and the dotted curve is the DR path. The solid curve is the posterior mean of η(t) in the case of BM, whose 95% credible intervals are shown by the gray solid curve. The dotted curve is the posterior mean of Y(t) in the downscaling approach, whose 95% credible intervals are shown by the gray dotted curve.
Figure 4.7 Zoomed Bayesian Melding and downscaling results for a northern fur seal undertaking Trip 1 Northing on 2009–07–23 and 2009–07–27: red points are the GPS observations. The solid curve is the posterior mean of η(t) in BM, whose 95% credible intervals are shown by the gray solid curve. The dotted curve is the posterior mean of Y(t) in downscaling, whose 95% credible intervals are shown by the gray dotted curve.

Figure 4.8 Posterior standard deviation (SD) from the Bayesian Melding and downscaling results for a northern fur seal undertaking Trip 1 Northing. The solid curve is from BM while the dotted curve is from downscaling.

Figure 4.9 GPS observations of the two trips and the corrected paths from our BM approach. The GPS measured latitude and longitude of the two trips of the northern fur seal began and ended at Bogoslof Island in the summer of 2009. The GPS observations in Trip 1 are indicated by the round points and those for Trip 2 by the triangular points. The pink curve is the corrected path from our BM model for Trip 1 and the blue curve is for Trip 2.

Figure 5.1 Five simulated realizations of a CHBB with knot set S = {0, 10, 20, ..., 60}. The start and end points of this CHBB are fixed at zero. The variance parameters are chosen to be σ²_0 = 10, σ²_{1,1} = σ²_{1,3} = σ²_{1,5} = 2 (shaded areas) and σ²_{1,2} = σ²_{1,4} = σ²_{1,6} = 0.1. The vertical dashed lines indicate the knots S.

Figure 5.2 Histogram of the selected knot s based on the LR statistic for the one-knot CHBB model M(1). The data are generated from the null model M(0) (a Brownian Bridge process with variance 1) and T = 100. The LR statistic is more likely to select a knot which is near the end points.

Figure 6.1 Examples of the Ornstein–Uhlenbeck process with ρ = 1, σ = 1, µ = 5 and different initial values η_0. The mean curve is plotted with the dashed lines while the realizations are plotted with the solid lines.

Figure 6.2 The posterior mean and 95% credible intervals from Bayesian Melding with the GOU and the Brownian Bridge for the Trip 2 Northing data set.

Figure 6.3 The posterior mean and 95% credible intervals from Bayesian Melding with the GOU and the Brownian Bridge for a period of the Trip 2 Northing data set.

Figure 7.1 Potential fields parameterized by tensor product splines fitted to the paths of the ball when HOU was on offence during Games 4 and 5 vs. LAC in the 2015 NBA Conference playoffs. LAC won Game 4 128–95, while HOU won Game 5 124–103. The colors show the values of the potential field H, whose lowest value is around the basket.

Figure 7.2 Movement paths for Trevor Ariza (left) and Jason Terry (right) in Game 5. The points increase in size as the play progresses (and the shot clock decreases). Different colors correspond to different plays.

Figure 7.3 The fitted potential fields for Trevor Ariza (left) and Jason Terry (right) in Game 5. We choose to show only the potential fields away from the center circle, so as to allow for clear visualization of the patterns in the potential fields. These two players tend to occupy different corners of the court.

Figure 7.4 The simple potential field and mixture of potential fields models fitted for the Trip 2 GPS data set collected in our case study. Top left: the potential field with the original Brillinger model, which does not fit the track well. Top right: the mixture potential field and the posterior clusters. Bottom left: the first component of the mixture.
It accounts for 73% of the observations and may be interpreted as the food map for the animal. Bottom right: the second component of the mixture. It accounts for 27% of the observations and reflects the returning behavior of the animal. The top right field is the weighted sum of the bottom two fields with weights 0.73 and 0.27.

Figure 8.1 The motorcycle data set.

Figure B.1 log(P(σ̂² < z)) for different n and z = 0.001, 0.01, 0.1.

Glossary

Some of the frequently used acronyms are listed here.

BB Brownian Bridge
BF Bayes Factor
BM Bayesian Melding
CRAWL Continuous-Time Correlated Random Walk Library
CI Credible Interval
DLM Dynamic Linear Model
DR Dead–Reckoning
DRA Dead–Reckoning Algorithm
GOU Generalized Ornstein–Uhlenbeck
GP Gaussian Process
GPS Global Positioning System
INLA Integrated nested Laplace approximation
IOU Integrated Ornstein–Uhlenbeck
KL Kullback–Leibler
LOOCV Leave–one–out Cross–Validation
L5OCV Leave–five–out Cross–Validation
LR Likelihood Ratio
MLE Maximum Likelihood Estimate
OU Ornstein–Uhlenbeck
RMSE Root Mean Squared Error
SDE Stochastic Differential Equation

Acknowledgments

My deepest debt of gratitude for profound advice and warm support goes to my supervisor, Professor James V. Zidek. As a PhD student, I am blessed to have worked with Prof. Zidek. Through him, I learned many things, such as how to decompose a complex data problem into small problems and tackle them piece by piece, how to convert our wild ideas into solid scientific research, and how to communicate complex statistical methods. All of these will benefit me for the rest of my life.

I would also like to thank my supervisory committee members, Prof. Nancy Heckman and Prof. Matías Salibián-Barrera, for their great help and suggestions. Part of this dissertation is a product of the joint research program between the Department of Statistics and the Marine Mammal Research Unit, Institute for the Oceans and Fisheries at UBC. Prof. Andrew W. Trites and Dr.
Brian C. Battaile provided the tracking data of northern fur seals and helped me to understand the scientific background as well as the behavior of those animals. It has been a great pleasure for me to work with all of them. Through my seven years at UBC, for both my master's and PhD, I have benefited so much from being immersed in the great research atmosphere of the Department of Statistics. I would like to thank every member of this department for their company and encouragement throughout my program. Especially, I would like to thank Prof. Jiahua Chen, Prof. William Welch, and Prof. Harry Joe for many helpful discussions.

My research was partially funded by the Natural Sciences and Engineering Research Council of Canada (NSERC) via the postgraduate scholarship–Doctoral (PGS-D). I am grateful for their recognition and support of my research.

Last but not least, I want to thank my amazing family, especially my wife Angela Jiayun Yao and my parents Shouhua Liu and Fengrong Zhang, for their love and support throughout all these years.

Dedication

To my wife and parents.

Chapter 1

Introduction

Recent technological advances have made it feasible to track many things, such as foraging trips of endangered animals (Wilson et al., 2007), the movements of basketball players (Miller et al., 2014), and the spread of infectious diseases (Ginsberg et al., 2009; Lazer et al., 2014b). Foraging trips, for example, reflect the feeding areas of animals, and can be used to identify critical habitat. Similarly, the movements of basketball players can be analyzed to identify playing strategies and evaluate the players (see e.g., Cervone et al., 2014; Liu et al., 2016). It is also essential to track diseases for controlling epidemics and predicting future outbreaks. Tracking helps us to understand the processes that determine movement patterns, and provides a means to forecast and plan for future changes.

The data collection technologies used for tracking involve observations from both direct and indirect sources. Direct observations are usually accurate but sparse in space and time, such as GPS observations of an animal's location in animal tracking, or the number of infected patients from epidemic reports. In contrast, indirect sources are usually inaccurate but data rich, and can help to fill in the gaps in the direct observations, such as inferred locations between GPS observations (i.e., Dead–Reckoned paths), or the search fraction of flu related queries on Google (Ginsberg et al., 2009). Combining different systems of observations in a statistically rigorous manner can ultimately result in more accurate and higher resolution tracks for objects of interest. This is the primary objective of our study.

After high resolution and accurate tracks have been inferred, it is important to further model and summarize the trends in them, which helps to convert the tracking data into practical knowledge, such as the animal's habitat preference or the basketball players' strategies. Modeling and learning the trends is another key objective of our study. A good statistical model for the tracks can in turn further improve our methods for combining different systems of observations.

The advancement in data collection also results in tracking data of extraordinarily high temporal resolution, like the 16 Hz animal tracking data and the 25 Hz basketball tracking data, both of which will be studied later in this thesis. Such high resolution tracking easily constitutes very big (long) data sets, even over a short period. This data rich environment brings remarkable detail to the tracked subjects, but also computational challenges in the analysis and modeling. We aim to design our methods such that they perform well in prediction and interpretation, but also have good computational efficiency and scalability.

The rest of this thesis documents our efforts and is organized as follows.
Chapter 2 reviews the background of marine mammal tracking and the properties of different systems for making observations. It also includes the details of a northern fur seal tracking study, which provides the data set used in much of our work. Our model and methodology development builds on some recent developments in Gaussian process modeling and stochastic differential equations, which are reviewed in Chapter 3. We propose and study two frameworks, Bayesian Melding and downscaling, to combine different systems of data in Chapter 4. Both perform well relative to competing methods on the seal tracking data. To further improve stochastic models of the tracks, we propose two new stochastic processes, a conditional heterogeneous Gaussian process (CHGP) and a generalized Ornstein–Uhlenbeck (GOU) process, in Chapters 5 and 6 respectively. Only the GOU process succeeds in reaching our modeling objectives. To further improve the interpretation of the tracks, we study the potential field approach for track modeling. It is applied to both the seal and basketball tracking data in Chapter 7. The discussion and future work are included in Chapter 8.

In what follows, we describe the mathematical notation used in this thesis. Regular characters, such as t, x, η, denote univariate variables or one dimensional quantities, while bold characters, such as Y, C, θ, denote multivariate variables, vectors, or matrices. We use {...} to denote a process or a collection of variables, (...)^T to denote a vector formed by its elements, and [...] to denote a density function. Due to the large amount of notation involved in the different contexts explored in this thesis, we have chosen to use the symbols that seem best in each context, i.e., they should be taken as "local" to each chapter or section. So, for example, y in Section 3.1 and Section 3.3 can have different meanings.
We will specify their meaning as they are first used.

Chapter 2

Background on Animal Tracking

Marine biologists have been attaching a variety of different electronic tags to marine animals to track their movements, describe their behaviors, and characterize their habitat preferences, e.g., Benoit-Bird et al. (2013a). Interactions of tracked animals with other animals or the environment allow ecological questions regarding population structure and dynamics to be addressed (Block et al., 2011; Benoit-Bird et al., 2013b). Free–ranging animals can serve as sensors to collect environmental (e.g., oceanographic) data, which are difficult or expensive to obtain via conventional means such as ships or satellites (Boehme et al., 2008, 2010; Nordstrom et al., 2013). These environmental data can help us to better understand ecosystems and the impact of climate change.

Accurately determining the locations of animals is a fundamental problem in animal tracking. One means of obtaining locations is to have tags carried by animals communicate with satellite systems such as the Global Positioning System (GPS) or the ARGOS satellites. However, as detailed in Sections 2.1 and 2.3, the need to communicate with the satellites dramatically limits the resolution of those devices. One means of collecting data on the animal's movements between observations from a satellite tag is to concurrently deploy a "Dead–Reckoning" (DR) tag consisting of an accelerometer, a magnetometer, a time–depth–recorder (TDR), and other supporting components (Wilson et al., 2007; Nordstrom et al., 2013). Such DR tags can sample at infra–second frequencies (e.g., 16 Hz) and provide a detailed record of an animal's movements. Data downloaded from the tag can be processed by a Dead–Reckoning algorithm (DRA) to reconstruct the Dead–Reckoned (DR) path of the animal (Wilson and Wilson, 1988; Johnson and Tyack, 2003; Wilson et al., 2007). The details of the DRA are further described in Section 2.2. Section 2.3 provides information about two data sets from tracking northern fur seals with both GPS and DR devices, which are used throughout this thesis.

2.1 Satellite tags

According to Hofmann-Wellenhof et al. (2012), the basic idea of GPS is to first obtain sights of over four GPS satellites and then calculate the current location by solving a system of equations based on the time and known positions of those satellites. The ARGOS system works in a similar fashion. The necessity of "seeing" the satellites is an impediment to tracking animals, especially marine animals. These satellite tags cannot work when the animal is diving under the water, which accounts for a large proportion of a foraging trip. Also, the satellite tags have a higher energy consumption rate than the DR tags. To save battery life, the satellite tags are usually programmed to be turned on only for a short period (e.g., half a minute) after a long period of hibernation (e.g., 15 minutes) (Battaile et al., 2015). Thus, satellite systems can only provide a sparse and irregularly spaced record of animal locations.

As sparse as those satellite observations are, they accurately record the animal's locations. Extensive experiments (Bryant, 2007; Hazel, 2009) have been carried out to test the precision of the GPS tags by comparing their readings at known locations to the locations' latitude and longitude. It has been found that the errors in latitude and longitude have mean zero and are independently normally distributed with similar standard deviations, which depend on the number of satellites observable to the device. When over ten satellites are available, the standard deviation is around 10 meters, while it increases to 70 meters when only five satellites are available.

To fill gaps in the observations from satellite tags, many statistical interpolation methods have been developed. McClintock et al. (2014) provide an extensive review of this literature.
Most of the approaches developed are state–space basedmodels either in continuous or discrete time and space, such as the continuous time5correlated random walk (CRAWL, Johnson et al., 2008) or robust state space mod-els (Jonsen et al., 2005). Recently, Fleming et al. (2016) proposed Kriging to in-terpolate the satellite observations and compared the performance of several Gaus-sian processes with different covariance structures. Hooten and Johnson (2017)developed a continuous-time stochastic integral equation framework, which canbe constructed to mimic behavioral characteristics in realistic paths. However, theinterpolation from those models depends heavily on the model assumptions, suchas the correlated random walk or the Kriging model. Those assumptions may bedifficult to justify in practice and lead to unrealistic predictions, as shown in Sec-tion 4.5. It is thus compelling to supplement the satellite observations with othersources of data.2.2 Dead–ReckoningInstead of trying to record the location directly, the DR tag is designed to collectinformation related to the animal’s movement, such as the animal’s velocity, ac-celeration and body position (orientation), with a high sampling frequency (e.g.,16 times per second or higher). All of this information can be measured locallyby the accelerometer, magnetometer, and time–depth–recorder in the DR tag with-out connecting to the satellites. These measurements are then fed into the Dead–Reckoning algorithm (DRA) for an “estimate” of the animal’s location. The DRAis based on the physical rules that produce measurements in the DR tag. The de-tailed implementation of the DRA may vary in different applications (Wilson andWilson, 1988; Elkaim et al., 2006; Wilson et al., 2007), but the basic idea is asfollows.1. Obtain the earth’s gravity vector g and magnetic field vector m at the loca-tions being studied.2. 
Correct the systematic bias in the accelerometer and magnetometer readings caused by inadequate calibration (Grewal et al., 2007).

3. Smooth the corrected accelerometer and magnetometer readings with a running mean or a low-pass filter.

4. Find the animal's orientation (often taken to be the tag's orientation, namely the three-dimensional direction of velocity) by solving Wahba's problem (Wahba, 1965), which is to find the rotation matrix $O$ that minimizes the $l_2$ distance between the rotated gravity and magnetic forces ($Og$ and $Om$) and the corrected and smoothed accelerometer reading $\tilde{a}$ and magnetometer reading $\tilde{m}$, namely,
\[
\arg\min_{O}\left\{\|\tilde{a} - Og\|^2 + \|\tilde{m} - Om\|^2\right\}.
\]

5. Obtain the animal's speed (norm of the velocity vector) by one of the following approaches:

(a) Assume it is a known constant.

(b) Assume it is measured by a speed meter (a wheel or paddle), which only measures the speed of the animal with respect to the water, not to the earth.

(c) From the data recorded by the TDR, calculate the velocity in the depth direction $v_z$. Given the animal's orientation $o = \{o_x, o_y, o_z\}$, the three-dimensional velocity is then $v = o v_z / o_z$.

6. Starting from a known point, integrate the velocity to obtain the animal's trajectory.

The DR path provides remarkably detailed information about an animal's movements, especially fine–scale fluctuations that the GPS cannot capture. However, the DR path can be biased because of poor measurements of swim speeds, systematic as well as random errors in the accelerometer and magnetometer sensors, undocumented animal movements caused by ocean currents, confounding between movement and gravitational acceleration, and discretization in the integration of the speed.
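As a concrete illustration, steps 5(c) and 6 can be sketched in a few lines. This is a minimal sketch of our own, not the thesis's implementation: the function name and inputs are hypothetical, and the corrections and smoothing of steps 2–4 are assumed to have been applied already.

```python
import numpy as np

def dead_reckon(orientations, vz, dt, start=(0.0, 0.0, 0.0)):
    """Sketch of DRA steps 5(c) and 6: scale each unit orientation vector o
    so that its vertical component matches the TDR-derived vertical speed vz,
    then integrate the velocities forward from a known starting point."""
    o = np.asarray(orientations, dtype=float)   # shape (n, 3), rows are unit vectors
    speeds = vz / o[:, 2]                       # speed making the vertical component vz
    v = o * speeds[:, None]                     # v = o * vz / oz
    return np.asarray(start) + np.cumsum(v * dt, axis=0)
```

For instance, a constant orientation $(0.6, 0, 0.8)$ with vertical speed $0.8$ gives a horizontal displacement of $0.6$ per unit time.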
All of these factors lead to biases and errors in the DR path (Wilson et al., 2007; Liu et al., 2015), which can be significant if not corrected using the relatively accurate GPS observations (by as much as 100 km at the end of a seven–day trip in the case study of the next chapter).

2.2.1 Conventional bias correction for the DR path

The conventional approach to correcting the DR path bias has been to add a linear bias correction term, which directly shifts the DR path to the locations indicated by the GPS observations (Wilson et al., 2007). This approach can be summarized as follows: denote the DR path (in one dimension) by $x_0, x_1, \ldots, x_T$ at times $t = 0, 1, 2, \ldots, T$ and the GPS observations at times $0$ and $T$ by $y_0, y_T$ respectively; assume, without loss of generality, that $x_0 = y_0 = 0$. The corrected path $\hat{\eta}_t$ is calculated as
\[
\hat{\eta}_t = x_t + \frac{t}{T}(y_T - x_T), \tag{2.1}
\]
which evenly distributes the bias $y_T - x_T$ over the individual time points. The DR path between two GPS observations is shifted directly to the locations indicated by the GPS observations, namely $\hat{\eta}_0 = y_0$ and $\hat{\eta}_T = y_T$. This procedure is repeated for all the sections separated by the GPS observations to correct the whole path.

Unfortunately, this conventional method of correcting the DR path bias is simplistic and fails to account for the measurement error in the GPS observations. It also fails to provide a statement about the uncertainty in the corrected path. As a result, the bio–logging community has concerns about the validity of the corrected path (Nordstrom et al., 2013) and has generally been reluctant to assign much significance to reconstructed locations. It is these concerns that prompted us to develop the Bayesian Melding and downscaling approaches as competing, statistically rigorous methods for track reconstruction that overcome the limitations of the conventional approach. We sought to correct the biased DR
We sought to correct the biased DRpaths and quantify the uncertainty in the corrected paths.2.3 Data sets in our case studyThe application in this paper involves tracking data from two lactating northern furseals captured and tagged on Bogoslof Island (Alaska, USA) as part of the BeringSea Integrated Research Program (BSIERP) (Benoit-Bird et al., 2013a; Nordstromet al., 2013). Two tags were glued to the fur of each seal with five minute epoxy: aDR “Daily Diary” tag and a TDR MK 10–F with Fastloc R© GPS technology (both8manufactured by Wildlife Computers). The accelerometer and magnetometer ofthe DR tag were set to sample 16 times per second (16Hz) while the TDR pressuresensor sampled at 1Hz. The GPS sensor was programmed to make one attempt toconnect with the satellite every 15 minutes.We produced the DR path for two foraging trips made by the two female seals(denoted as “Trip 1” and “Trip 2”) using the “TrackReconstruction” R package onthe 16Hz data set. This R package was developed based on Wilson and Wilson(1988); Wilson et al. (2007) and described in detail by Battaile (2014). We latersub–sampled the DR path to various frequencies to fit the computational capacityof methods used to get the results. We also projected the GPS observations aslongitude and latitude to Easting and Northing in kilometers (km) in a point–wisefashion as per Wilson et al. (2007).The two foraging trips made by the fur seals in our study were each approxi-mately 1 week in duration. Trip 1 was 7 days and had 274 valid GPS observations,while Trip 2 lasted about 7.5 days and had 130 GPS observations. A large pro-portion of the GPS locations had time gaps around 15 minutes (Table 2.1)—thedesigned time gap for the GPS device to record the locations. However, non–negligible proportions of them (13% in Trip 1 and 30% in Trip 2) were greaterthan 1 hour, and the longest time gap in both trips was longer than 10 hours. Thedistance between GPS observations can be found in the Figure 4.9. 
Therefore, the GPS observations were irregularly spaced in time and space, and the durations of the gaps between them were quite large, making it necessary to incorporate the high resolution DR path.

Table 2.1: The sample quantiles in minutes of the time gaps between two consecutive GPS observations in our data.

        Min    10%    25%    50%    75%     90%     Max
Trip 1  14.75  15.00  15.45  18.40  31.68   82.79   953.65
Trip 2  14.75  15.00  15.05  30.00  113.05  130.77  698.47

Chapter 3

Review of Some Fundamental Statistical Methods

In this chapter, we review fundamental statistical methods that serve as building blocks for our methodology development, including Gaussian processes, stochastic differential equations (SDEs) and the Kalman filter/smoother. For brevity, we review only the material that is used in or related to the following chapters and thus do not give a complete treatment of these topics.

3.1 Gaussian process in spatio–temporal modeling of big data

A Gaussian process (GP) is a stochastic process $\{\eta(\cdot)\}$, which is indexed by $t$ in our cases. It has the property that its realizations at any finite collection of points $t$ have a joint multivariate Gaussian distribution. The index $t$ can be in the time domain $\mathbb{R}^+$, the spatial domain $\mathbb{R}^2$, the spatio–temporal domain $\mathbb{R}^2 \times \mathbb{R}^+$, or even higher dimensional spaces. A GP is completely specified by its mean function $m(t) = E\eta(t)$ and covariance function $C(t, t') = \mathrm{Cov}(\eta(t), \eta(t'))$. That is, for an index set $\mathcal{T} = \{t_1, t_2, \ldots, t_n\}$, the random vector $\eta = (\eta(t_1), \eta(t_2), \ldots, \eta(t_n))^T$ follows a multivariate Gaussian distribution with mean $m = m(\mathcal{T})$ and covariance matrix $C = C(\mathcal{T}, \mathcal{T})$. The density function of $\eta$ is
\[
\pi(\eta; m, C) = (2\pi)^{-n/2} |C|^{-1/2} \exp\left(-\frac{1}{2}(\eta - m)^T C^{-1} (\eta - m)\right), \tag{3.1}
\]
where we use $\pi$ to denote a density function. The inverse of the covariance matrix is called the precision matrix, $Q = C^{-1}$.
An elegant theoretical result is that a zero entry $Q_{i,j}$ in the precision matrix indicates that $\eta(t_i)$ and $\eta(t_j)$ are conditionally independent given the remaining random variables $\eta_{-ij} = \eta \setminus \{\eta(t_i), \eta(t_j)\}$,
\[
\pi(\eta(t_i), \eta(t_j) \mid \eta_{-ij}) = \pi(\eta(t_i) \mid \eta_{-ij})\, \pi(\eta(t_j) \mid \eta_{-ij}).
\]
For $Q_{i,j} = 0$, the conditional independence property also implies the Markovian property,
\[
\pi(\eta(t_i) \mid \eta_{-ij}, \eta(t_j)) = \frac{\pi(\eta(t_i), \eta(t_j) \mid \eta_{-ij})}{\pi(\eta(t_j) \mid \eta_{-ij})} = \frac{\pi(\eta(t_i) \mid \eta_{-ij})\, \pi(\eta(t_j) \mid \eta_{-ij})}{\pi(\eta(t_j) \mid \eta_{-ij})} = \pi(\eta(t_i) \mid \eta_{-ij}).
\]
There are many other important properties of GPs, and they have been reviewed systematically by Rue and Held (2005).

GPs have been used ubiquitously in many statistical studies and applications, such as random effect models (Rue and Held, 2005), spatio–temporal modeling (Le and Zidek, 2006; Cressie and Wikle, 2015; Shaddick and Zidek, 2015) and machine learning for various regression and classification problems (Williams and Rasmussen, 2006), and the references therein. However, it is computationally expensive to use a GP for big $n$ problems. In this setting, $n$ denotes the number of points or indexes where the GP is observed, e.g., the number of distinct locations where a spatial GP is observed. Evaluating the likelihood or posterior density usually requires calculating the matrix inverse $C^{-1}$ or solving a system of linear equations $Mx = f$ (usually used to calculate $M^{-1}f$), where $M$ is a matrix of dimension $n$ and $f$ is a vector of length $n$. An example of such a quantity appears in Equation (4.9). Calculating $M^{-1}f$ for a general $M$ has time complexity $O(n^3)$, and storing the matrix $M$ requires $O(n^2)$ memory (Nychka et al., 2015).
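For illustration, the log of the density (3.1) is typically evaluated through a Cholesky factorization of $C$ (the $O(n^3)$ bottleneck), which yields both $|C|$ and the solve $C^{-1}(\eta - m)$ without forming the inverse explicitly. A minimal sketch of our own, not code from the thesis:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gp_logdensity(eta, m, C):
    """Log of the multivariate Gaussian density (3.1), computed via a
    Cholesky factorization of C instead of an explicit matrix inverse."""
    n = len(eta)
    chol = cho_factor(C, lower=True)
    r = eta - m
    quad = r @ cho_solve(chol, r)                     # (eta-m)^T C^{-1} (eta-m)
    logdet = 2.0 * np.sum(np.log(np.diag(chol[0])))   # log |C|
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)
```

The factorization still costs $O(n^3)$ time and $O(n^2)$ memory for a general dense $C$, which is exactly the bottleneck discussed above.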
These complexities can be formidable for big $n$, e.g., $n$ exceeding 500,000 as in our case study, and that prevents us from directly applying the GP to big $n$ data.

Moreover, those large data sets often display non-stationary features, e.g., different correlations and variances in different regions, which makes the traditional stationary (isotropic) GP inadequate for modeling these data. Therefore, several ingenious statistical methods have been developed to address the non-stationarity and big $n$ issues in the analysis of spatio–temporal data and computer experiments.

In Section 3.1.1, we review methods developed to tackle the big $n$ problem for GPs. Section 3.1.2 reviews some methods designed to address the non-stationarity issue. The integrated nested Laplace approximation (INLA), which is used in Chapter 4, is a method designed to handle both big $n$ and non-stationarity problems. We walk through the steps of INLA in Section 3.1.3.

3.1.1 Methods to deal with big data

To tackle big $n$ problems, the following four paradigms are commonly invoked: (1) sparse matrix operations; (2) special covariance structures; (3) low rank models; (4) likelihood approximations. One or more can be used in developing models and methods. We organize the following reviews based on the most fundamental ideas embraced by each paradigm.

Methods based on sparse matrix operations

The main bottleneck in using the GP for big $n$ data is the storage and computation of the big matrices, but they can be handled more efficiently if the big matrix is sparse, i.e., a large fraction of its entries are zero. Many special data structures and algorithms have been developed to handle the storage and computation of sparse matrices (see e.g., Bates and Maechler, 2016). The basic idea to reduce the storage cost is to store only the non-zero entries and their indexes.
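As a small illustration (using scipy's sparse formats, not any package from the thesis), a tridiagonal $n \times n$ matrix in coordinate storage keeps only its roughly $3n$ non-zero entries and their indexes instead of all $n^2$ entries:

```python
import numpy as np
from scipy import sparse

n = 1000
# Tridiagonal matrix: 2 on the diagonal, -1 on the two off-diagonals.
Q = sparse.diags([-np.ones(n - 1), 2.0 * np.ones(n), -np.ones(n - 1)],
                 offsets=[-1, 0, 1], format="coo")
# COO storage keeps (row, column, value) triplets for the non-zeros only.
print(Q.nnz)   # 2998 stored entries instead of n^2 = 1,000,000
```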
Computations with sparse matrices, such as decomposition, inversion and determinant calculation, can also be simplified by exploiting the zero entries (Pissanetzky, 1984).

However, the covariance matrices of most commonly used GPs are not sparse. The idea of covariance tapering (Furrer et al., 2006; Kaufman et al., 2008) is to force a covariance/correlation matrix into a sparse matrix. For example, the covariance of $\eta(t), \eta(t')$ is assumed to be zero if $|t - t'|$ is larger than a threshold. However, as pointed out by Banerjee et al. (2008) and Datta et al. (2016), tapering completely destroys the long range correlation and often fails to provide a reasonable approximation to the true covariance matrix.

The long range correlation can still be preserved if we work with a sparse precision matrix instead of the covariance matrix. The inverse of a non-diagonal sparse matrix is often not sparse¹, which means that the correlation can still be non-zero even when $|t - t'|$ is large. A GP with a sparse precision matrix is also known as a Gaussian Markov Random Field (GMRF; Rue and Held, 2005; Rue et al., 2009). As discussed at the beginning of this section, the "Markov" in the name denotes the conditional independence property of such a GP. The Brownian Motion, Brownian Bridge, autoregressive, and Ornstein–Uhlenbeck processes that we later work with can all be viewed as GMRFs. The GMRF is a fundamental idea used in the methods developed in Rue et al. (2009), Lindgren et al. (2011) and Nychka et al. (2015), which will be reviewed later. Besides sparse matrices, the Kalman filter/smoother with an expanded state space can also be used for the inference of a GMRF (Särkkä et al., 2013).
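A small numerical illustration of this point (our own example, not from the thesis): for an AR(1)-type covariance $C_{ij} = \phi^{|i-j|}$, the covariance matrix is completely dense, so long range correlation is preserved, yet its inverse, the precision matrix, is tridiagonal.

```python
import numpy as np

phi, n = 0.8, 6
# Dense covariance with non-zero correlation at every lag.
C = phi ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
# The precision matrix turns out to be tridiagonal: the Markov
# (conditional independence) structure lives in Q, not in C.
Q = np.linalg.inv(C)
```

Entries of `Q` more than one step off the diagonal are zero up to rounding, while `C[0, n-1] = phi**(n-1)` remains non-zero.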
It is unclear to us how the Kalman filter/smoother implementation compares to the sparse matrix implementation in terms of computation speed.

Methods based on special structures

Gramacy and Lee (2008) proposed a treed Gaussian process (TGP) model that splits the data domain into different sub-regions as in the Bayesian classification and regression tree (Chipman, 1998) and fits a Gaussian process within each sub-region. This approach avoids the big $n$ problem by partitioning the data and also deals with the non-stationarity via the local process in each sub-region. Nonetheless, the GP in each sub-region is fitted separately, which introduces an unnatural discontinuity into the global process. Although the discontinuity can be alleviated by averaging over the tree space, the averaging procedure and training multiple trees increase the computational burden of this method. Park et al. (2011) modified the TGP by imposing a continuity constraint on the boundaries of the sub-regions and obtained the continuous tree via a complex constrained numerical optimization routine.

¹This is an empirical statement. For example, the inverse of a tri-diagonal symmetric matrix is not sparse. We can also construct counterexamples such that the inverse of a sparse matrix is still sparse. To the best of our knowledge, there are no theorems that state what kind of sparse matrix has a non-sparse inverse.

The tree structure was also considered in the multi-resolution tree structured models of Huang et al. (2002) and Tzeng et al. (2005). When compared to the TGP, the multi-resolution models avoid searching for the tree split points by recursively partitioning the spatial domain of interest into equal sized sub-regions. A spatial autoregressive process (SAP) was proposed in Huang et al. (2002) and Tzeng et al.
(2005) for this tree structure, such that the covariance at two leaf locations equals the variance of the sub-region that is the parent of both leaf locations. This covariance structure enables an efficient "change–of–resolution" Kalman filter/smoother algorithm for evaluating the likelihood and posterior, whose time complexity is of order $O(n \log(n))$. But the SAP results in a "blocky" covariance structure and lacks the flexibility needed in modeling complex processes (Tzeng et al., 2005).

Low rank methods

Another way to tackle the big $n$ issue is to restrict the expensive computations, such as the matrix inversion, to a matrix of lower dimension $r$. This brings us to the low rank method, which can be viewed as approximating a GP by a linear transformation of a Gaussian random vector of dimension $r$. To illustrate how such a low rank matrix can simplify the computation, we work with the example $Y = \eta + \varepsilon$, where $\eta$ is the Gaussian random vector defined in (3.1) and $\varepsilon$ is also multivariate Gaussian with mean zero and covariance matrix $\sigma^2 I$ ($I$ is an $n$ by $n$ identity matrix)². It is easy to show that $Y$ has covariance matrix $C + \sigma^2 I$, whose inverse needs to be calculated when evaluating the likelihood or calculating the conditional distribution of $\eta$ given $Y$, i.e.,
\[
E(\eta \mid Y) = C (C + \sigma^2 I)^{-1} (Y - m), \qquad
\mathrm{Var}(\eta \mid Y) = C - C (C + \sigma^2 I)^{-1} C^T,
\]
where $m$ and $C$ are the mean and covariance matrix of $\eta$ from (3.1). As discussed at the beginning of this section, the storage and inversion of $C + \sigma^2 I$ are computationally expensive.

²This corresponds to the simple kriging model $Y(t) = \eta(t) + \varepsilon(t)$ (see e.g., Cressie and Wikle, 2015), where $\{\varepsilon(\cdot)\}$ is a white noise process that is independent of $\{\eta(\cdot)\}$ and has variance $\sigma^2$.

In the low rank method, the full rank covariance matrix $C$ is replaced by $K S K^T$, where $S$ is an $r \times r$ covariance matrix and $K$ is an $n \times r$ matrix. Equivalently, it assumes that the Gaussian random vector $\eta = K\xi$, where $\xi$ is another Gaussian random vector of dimension $r$.
To compute the inverse of $K S K^T + I$ (without loss of generality, assume $\sigma^2 = 1$), we can use the Sherman–Morrison–Woodbury formula (Henderson and Searle, 1981) to simplify it into
\[
(I + K S K^T)^{-1} = I - K (S^{-1} + K^T K)^{-1} K^T, \tag{3.2}
\]
which replaces the inverse of the original $n \times n$ matrix by the inverses of two $r \times r$ matrices ($S^{-1}$ and $(S^{-1} + K^T K)^{-1}$). When $r \ll n$, the right-hand side of (3.2) can be much faster to calculate than the left-hand side. Also, it can take less space to store the two lower dimensional matrices $K$ and $S$ than the big matrix $C$. In the MCMC inference of Bayesian models, the low rank method can also be used to break a high dimensional density into a sequence of lower dimensional ones, which are easier to sample from (see e.g., Datta et al., 2016). On the other hand, we would like to point out that the low rank method cannot preserve the original dependence structure of $\eta$, so that $K\xi$ should be viewed as an approximation to the original process $\eta$.

Many different approaches have been developed to construct $K$ and $S$, or equivalently to find a good approximation of $\eta$ based on $\xi$, including the fixed rank kriging (FRK) method of Cressie and Johannesson (2008), the Gaussian Predictive Process (GPP) of Banerjee et al. (2008), the Nearest-Neighbor Gaussian Process (NNGP) of Datta et al. (2016), and the multi-resolution Gaussian process (MRGP) of Nychka et al. (2015). These approaches were mostly developed for modeling spatial data with the Gaussian process, but can be generalized to GPs defined on other spaces. To use the same terminology as the original papers, we call the index $t$ a "location" in this subsection.

In FRK, a Gaussian process $\{\eta(\cdot)\}$ is approximated by
\[
\hat{\eta}(t) = \sum_{i=1}^{r} \phi_i(t)\, \xi_i = \phi(t)^T \xi, \tag{3.3}
\]
where $\phi(\cdot) = (\phi_1(\cdot), \phi_2(\cdot), \ldots, \phi_r(\cdot))^T$ is a vector of $r$ basis functions. The vector $\xi$ is usually defined as the random vector formed by a GP evaluated at a set of $r$ grid locations.
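Identity (3.2) is easy to verify numerically. A standalone sketch of our own with arbitrary simulated $K$ and $S$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 300, 10
K = rng.standard_normal((n, r))
A = rng.standard_normal((r, r))
S = A @ A.T + np.eye(r)                  # an r x r covariance matrix

# Left-hand side of (3.2): invert the full n x n matrix directly.
lhs = np.linalg.inv(np.eye(n) + K @ S @ K.T)
# Right-hand side: only r x r inverses are required.
rhs = np.eye(n) - K @ np.linalg.inv(np.linalg.inv(S) + K.T @ K) @ K.T
```

The two sides agree to numerical precision, but the right-hand side only ever inverts $r \times r$ matrices.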
The basis function $\phi_i(t)$ often decreases as the distance between location $t$ and the $i$-th grid location increases.

In the GPP approach, the lower-dimensional GP $\xi$ is first defined on a knot set $\mathcal{S}$ of size $r < n$, which may be a subset of the observed locations, and the GP is approximated by
\[
\tilde{\eta}(t) = C(t, \mathcal{S})\, C^{-1}(\mathcal{S}, \mathcal{S})\, \xi, \tag{3.4}
\]
where $C(\cdot, \cdot)$ is the covariance matrix. If $\phi(t)^T = C(t, \mathcal{S}) C^{-1}(\mathcal{S}, \mathcal{S})$ and the knots and grids are the same, the FRK and GPP approaches yield exactly the same approximation process. For both FRK and GPP, the dimension $r$ determines the goodness of the approximation as well as the computational complexity of the estimation and prediction procedures. For large $n$, $r$ has to be fairly large for $\hat{\eta}$ or $\tilde{\eta}$ to adequately approximate the true process (Datta et al., 2016), which, once again, leads to a computational burden. Also, Stein (2014) pointed out that these low rank approaches perform poorly in situations where the neighboring observations are strongly correlated.

To further improve the FRK approach, Nychka et al. (2015) considered a multi-resolution Gaussian process (MRGP), such that
\[
\hat{\eta}(t) = \sum_{l=1}^{L} \phi_l(t)^T \xi_l = \sum_{l=1}^{L} \sum_{j=1}^{r_l} \phi_{l,j}(t)\, \xi_{l,j}, \tag{3.5}
\]
where $\xi_l = \{\xi_{l,1}, \xi_{l,2}, \ldots, \xi_{l,r_l}\}$ is a Gaussian Markov process defined on a regular grid with spacing $\delta_l$ and each $\phi_{l,j}(\cdot)$ is a basis function similar to those in (3.3). This MRGP is a summation of FRK models over different resolutions, which are designed to capture correlations at different scales. In particular, they assume that $\delta_l = \delta_{l-1}/2$ and that the $\xi_l$ at each resolution are mutually independent. With the Markov process, a sparse precision matrix is introduced for the lower dimensional process, reducing the original $O(r^3)$ computational complexity to at most $O(r^2)$, which means that we can afford a larger $r$ than FRK with the same computational power.

In parallel with the MRGP, Datta et al.
(2016) modified the GPP into the NNGP model to overcome the aforementioned shortcomings of the low rank approaches. To avoid the computational burden of a large $r$ in the knot process, the joint density of $\xi$ on the knot set $\mathcal{S}$ in GPP is approximated by a product of $m$-nearest-neighbor conditional distributions, e.g., $\pi(\xi_1, \xi_2, \xi_3) \approx \pi(\xi_1)\, \pi(\xi_2 \mid \xi_1)\, \pi(\xi_3 \mid \xi_2)$ for $m = 1$. The nearest neighbor structure helps to reduce the computational complexity of evaluating this density from order $O(r^3)$ to $O(r m^3)$, which is a huge improvement when $m \ll r$. Also, in the MCMC for Bayesian inference of the NNGP, the non-knot locations are updated by conditioning on their $m$-nearest neighbors in the knot set, such that $C(t, \mathcal{S}) C^{-1}(\mathcal{S}, \mathcal{S})$ in (3.4) is replaced by $C(t, w(t))\, C^{-1}(w(t), w(t))$, where $w(t)$ are the $m$-nearest neighbors of $t$ in the knot set $\mathcal{S}$. This further reduces the computational burden of the NNGP.

Because the nearest neighbor structure in the NNGP can be viewed as a high order Markov structure, an NNGP is equivalent to an MRGP with $L = 1$ and a certain knot set and parameterization of the covariance matrix. However, both the MRGP and NNGP only consider constant variance and correlation parameters, i.e., the parameters in the basis functions $\phi_l(\cdot)$ of the MRGP or the covariance function $C(\cdot, \cdot)$ of the NNGP, over the whole study space. This essentially assumes that the same dependence structures are repeatedly observed over the whole space. Therefore, neither the NNGP nor the MRGP has the flexibility to deal with an inhomogeneous covariance structure over different sub-regions of the space. We call a process "inhomogeneous" (heterogeneous) when it has a non-stationary covariance function with non-constant variance/correlation parameters. For example, a standard Brownian Motion is non-stationary but still homogeneous.

Methods based on approximations

The big $n$ problem can also be tackled by constructing approximations to the full likelihood via a sequence of conditional likelihoods, as in Vecchia (1988) and Stein et al. (2004).
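The flavor of these conditional likelihood approximations can be seen in a small example of our own: for a Markov process such as a stationary AR(1), the factorization with one previous neighbor, $\pi(x) \approx \pi(x_1)\prod_i \pi(x_i \mid x_{i-1})$, reproduces the full likelihood exactly, while for non-Markov covariances it is only an approximation that omits long range terms.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

phi, n = 0.6, 8
# Stationary AR(1): joint covariance C[i,j] = phi^|i-j|, unit variances.
C = phi ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
x = np.random.default_rng(1).standard_normal(n)

full_loglik = multivariate_normal(mean=np.zeros(n), cov=C).logpdf(x)

# One-nearest-neighbor factorization: x_i | x_{i-1} ~ N(phi x_{i-1}, 1 - phi^2).
approx = norm.logpdf(x[0])
for i in range(1, n):
    approx += norm.logpdf(x[i], loc=phi * x[i - 1], scale=np.sqrt(1 - phi**2))
```

Here `approx` equals `full_loglik` because the AR(1) process is first-order Markov; each conditional is a cheap univariate density.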
Caragea and Smith (2007) reviewed and compared these approximation methods based on conditional likelihoods. Eidsvik and Shaby (2014) proposed another approximation method based on the pairwise composite likelihood. These likelihood approximation methods are mostly used for inference when fitting a GP model without a sparse precision matrix, such as the Matérn process. An oversimplified view is that they essentially break a large data set into several small ones and only evaluate the likelihood on these small data sets, which omits certain long range correlations. A concern about them is that the conditional distributions used may not result in a proper joint distribution (Datta et al., 2016).

In addition to the conditional/composite likelihood approximations, Lindgren et al. (2011) developed another approximation to GPs with Matérn covariance functions by representing the process as the solution to a Stochastic Partial Differential Equation (SPDE) and then solving the SPDE with a finite element method. This can be viewed as approximating a non-Markov GP by a GMRF, whose inference can then be handled by INLA. To illustrate the approximation process, consider a Matérn process observed on a regular two dimensional $n$ by $n$ grid $t_{1,1} = (1,1), t_{1,2} = (1,2), \ldots, t_{1,n} = (1,n), t_{2,1} = (2,1), t_{2,2} = (2,2), \ldots, t_{n,n} = (n,n)$. For expository simplicity, denote $\eta(t_{i,j})$ as $\eta_{i,j}$. When the smoothness parameter $\nu$ of the Matérn process approaches 0 and the grid size $n$ is large enough (such that we can ignore the boundary effects), the Matérn process approximately has the following conditional distribution:
\[
E(\eta_{i,j} \mid \eta_{-i,-j}) = \frac{1}{a}\left(\eta_{i-1,j} + \eta_{i+1,j} + \eta_{i,j-1} + \eta_{i,j+1}\right), \tag{3.6}
\]
\[
\mathrm{Var}(\eta_{i,j} \mid \eta_{-i,-j}) = \frac{1}{a}, \tag{3.7}
\]
where $a > 4$ is a parameter that depends on the range parameter of the Matérn process.
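The conditional specification (3.6)–(3.7) corresponds to a sparse precision matrix with $Q_{ii} = a$ and $Q_{ij} = -1$ for the four nearest grid neighbors, since for a GMRF $E(\eta_i \mid \text{rest}) = -Q_{ii}^{-1}\sum_{j \neq i} Q_{ij}\eta_j$ and $\mathrm{Var}(\eta_i \mid \text{rest}) = Q_{ii}^{-1}$. A sketch that builds this precision matrix on an $n \times n$ grid (our own helper, ignoring boundary effects as in the text):

```python
import numpy as np
from scipy import sparse

def grid_precision(n, a):
    """Sparse precision matrix of the GMRF in (3.6)-(3.7) on an n x n grid:
    a on the diagonal, -1 for each of the four nearest neighbors."""
    I = sparse.identity(n, format="csr")
    D = sparse.diags([-np.ones(n - 1), -np.ones(n - 1)], [-1, 1], format="csr")
    # Horizontal neighbors within each grid row, vertical neighbors across rows.
    return sparse.kron(I, a * I + D) + sparse.kron(D, I)
```

An interior site has diagonal entry $a$, four entries of $-1$, and zeros elsewhere in its row, so the matrix has $O(n^2)$ rather than $O(n^4)$ non-zeros.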
Notice that (3.6) and (3.7) define a GMRF through its conditional distributions (Rue and Held, 2005), i.e., $\eta_{i,j}$ is conditionally independent of all other $\eta_{i',j'}$ with $|i' - i| > 1$ or $|j' - j| > 1$ given its nearest neighbors. In addition to the limiting case of $\nu \to 0$, Lindgren et al. (2011) show that the Matérn process with $\nu = 1$ can be approximated by a GMRF whose conditional distribution is
\[
E(\eta_{i,j} \mid \eta_{-i,-j}) = \frac{2a}{4 + a^2}\left(\eta_{i-1,j} + \eta_{i+1,j} + \eta_{i,j-1} + \eta_{i,j+1}\right)
- \frac{1}{4 + a^2}\left(\eta_{i-2,j} + \eta_{i+2,j} + \eta_{i,j-2} + \eta_{i,j+2}\right)
- \frac{2}{4 + a^2}\left(\eta_{i-1,j-1} + \eta_{i-1,j+1} + \eta_{i+1,j-1} + \eta_{i+1,j+1}\right),
\]
\[
\mathrm{Var}(\eta_{i,j} \mid \eta_{-i,-j}) = \frac{1}{4 + a^2}.
\]
Similar approximations can be constructed for other integer values of $\nu$. For a Matérn process observed at irregular locations, Lindgren et al. (2011) first use a finite element method to construct a mesh (i.e., divide the space into triangles) and then build the approximating GMRF on this mesh. The combination of the SPDE approximation and INLA enables efficient computation of GPs for some big spatio–temporal data. It can also offer certain types of non-stationary modeling. However, Nychka et al. (2015) expressed concern about its computational burden for very large data sets and its lack of flexibility in non-stationary modeling.

3.1.2 Methods for non-stationarity

The non-stationarity issue of GPs in spatial analysis was first addressed by Sampson and Guttorp (1992). Their idea was to find a spatial deformation via multi-dimensional scaling that maps the non-stationary field into a stationary one. This method was later extended in different ways by Damian et al. (2001) as well as Schmidt and O'Hagan (2003). Bornn et al. (2012) addressed the non-stationarity issue by embedding the geographical plane in some higher-dimensional space to restore stationarity. This dimension expansion approach enjoys better flexibility and interpretability compared to the spatial deformation approach. But both approaches essentially add one or more dimensions to the domain over which the data are collected.
This intensifies the computational burden of the analysis.

Another approach to non-stationary modeling is the process convolution approach pioneered by Higdon et al. (1999) and further developed by Paciorek and Schervish (2006), Lemos and Sansó (2012) and Chu et al. (2014), among others. This approach is inspired by the definition of a spatial moving average process $\eta(s)$:
\[
\eta(s) = \int k(s - u)\, x(u)\, du,
\]
where $s, u$ are locations in a two-dimensional space, $\{x(\cdot)\}$ is a white noise process and $k(\cdot)$ is a convolution kernel that does not depend on the location $s$. Higdon et al. (1999) achieved a non-stationary spatial model by allowing $k(s - u)$ to vary with $s$ as $k_s(u)$, a so–called spatially evolving kernel. Certain parameterizations of the spatially evolving kernel result in a covariance matrix that is a weighted average of two covariance matrices, as in Paciorek and Schervish (2006) and Ba and Joseph (2012), where one covariance matrix models the long range dependence and the other captures the short range dependence. The idea of a spatially evolving kernel is also used in Hooten and Johnson (2017).

We observe that the non-stationarity issue is usually caused by parsimonious parameterizations of the model and can be alleviated by allowing more flexible parameterizations of the GP. The problem of non-stationarity then becomes that of finding a good balance between modeling flexibility, identifiability and computational cost.

3.1.3 Basics of the integrated nested Laplace approximation

The integrated nested Laplace approximation (INLA, Rue et al., 2009; Lindgren et al., 2011) is a novel approximation technique for Bayesian inference, especially for latent GMRF models.
Basically, the INLA method seeks the posterior mode via numerical optimization, approximates the integrals over the random effects or hyper-parameters via the Laplace approximation, and numerically integrates over the hyper-parameters on a selected grid based on the (approximate) likelihood. When combined with the approximation to the SPDE, INLA can handle inference for most commonly used covariance models in spatio–temporal modeling. In this subsection, we provide a brief review of its basic ideas; readers can refer to Rue et al. (2009), Martins et al. (2013), and Blangiardo and Cameletti (2015) for detailed discussion and examples.

Consider the following latent Gaussian model: the observations $y$ depend on a latent random process or field $\eta$ through an observation model (also called the likelihood model in Rue et al. (2009)); the latent process is usually assumed to be Gaussian, i.e., $\eta \mid \psi \sim N(0, C(\psi))$, where the covariance matrix $C$ depends on the hyper-parameters $\psi = \{\psi_1, \psi_2, \ldots\}^T$. For example,
\[
\text{Observation model: } y \mid \eta \sim N(\eta, C_y(\psi_2)),
\]
\[
\text{Latent model: } \eta \mid \psi \sim N(0, C_\eta(\psi_1)),
\]
\[
\text{Priors: } \pi(\log(\psi_j)) \propto 1, \quad j = 1, 2.
\]
Notice that we use $\pi(\cdot)$ to denote density functions, including prior, conditional, and joint densities. With this model, we are usually interested in calculating the posterior density $\pi(\psi \mid y)$ and the posterior predictive density $\pi(y^* \mid y)$. This can be done by Monte Carlo methods, such as MCMC or the Gibbs sampler, but they are usually computationally expensive.
The INLA approach was developed as a computationally simpler alternative to the Monte Carlo methods.

Nested Laplace approximation

The first building block of INLA is the Laplace approximation, which approximates an arbitrary density function $\pi$ at its mode $\hat{x}$ via a second order Taylor series approximation,
\[
\log(\pi(x)) \approx \log(\pi(\hat{x})) + \frac{\partial \log(\pi(\hat{x}))}{\partial x}(x - \hat{x}) + \frac{1}{2}\frac{\partial^2 \log(\pi(\hat{x}))}{\partial x^2}(x - \hat{x})^2
= \log(\pi(\hat{x})) + \frac{1}{2}\frac{\partial^2 \log(\pi(\hat{x}))}{\partial x^2}(x - \hat{x})^2.
\]
The second line follows because $\partial \log(\pi(x))/\partial x = 0$ at the mode $\hat{x}$. Letting $\hat{\sigma}^2 = -1/(\partial^2 \log(\pi(\hat{x}))/\partial x^2)$,
\[
\log(\pi(x)) \approx \log(\pi(\hat{x})) - \frac{1}{2\hat{\sigma}^2}(x - \hat{x})^2.
\]
Taking the exponential of the above equation, we find that $\tilde{\pi}$, the density of $N(\hat{x}, \hat{\sigma}^2)$, can be a good approximation to $\pi$, at least in a small neighborhood of $\hat{x}$. The density $\tilde{\pi}$ is referred to as the Laplace approximation density.

To apply this Laplace approximation to the posterior densities of interest, we first work with the following identity for conditional probabilities,
\[
\pi(\eta \mid \psi, y) = \frac{\pi(\psi, \eta \mid y)}{\pi(\psi \mid y)} \;\Rightarrow\; \pi(\psi \mid y) = \frac{\pi(\psi, \eta \mid y)}{\pi(\eta \mid \psi, y)}. \tag{3.8}
\]
The numerator above may not be available in closed form, but it is proportional to the product of three densities that are,
\[
\pi(\psi, \eta \mid y) = \frac{\pi(y, \psi, \eta)}{\pi(y)} \propto \pi(y, \psi, \eta) = \pi(y \mid \psi, \eta)\, \pi(\eta \mid \psi)\, \pi(\psi).
\]
The factor $\pi(y)$ does not depend on any of the unknown parameters $\eta$ or $\psi$. The denominator in (3.8) can be approximated by the Laplace approximation density $\pi(\eta \mid \psi, y) \approx \tilde{\pi}(\eta \mid \psi, y)$, yielding
\[
\pi(\psi \mid y) \approx \tilde{\pi}(\psi \mid y) = \left.\frac{\pi(\psi, \eta \mid y)}{\tilde{\pi}(\eta \mid \psi, y)}\right|_{\eta = \arg\max \pi(\eta \mid y, \psi)}.
\]
When the likelihood model is also Gaussian, $\tilde{\pi}(\eta \mid y, \psi) = \pi(\eta \mid y, \psi)$ exactly. For other models, such as logistic, Poisson, etc., the Laplace approximation density can be derived via a Taylor expansion of the likelihood function.
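A univariate numerical sketch of this building block (our own example, with a Gamma density standing in for a posterior; the mode search and second derivative are done numerically, mirroring INLA's numerical optimization):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import gamma, norm

alpha = 20.0
neg_logpi = lambda x: -gamma(alpha).logpdf(x)   # target: Gamma(alpha, 1)

# Find the mode x_hat numerically, then sigma^2 = -1 / (d^2 log pi / dx^2).
x_hat = minimize_scalar(neg_logpi, bounds=(1.0, 100.0), method="bounded").x
h = 1e-4   # central-difference second derivative of log pi at the mode
d2 = -(neg_logpi(x_hat - h) - 2.0 * neg_logpi(x_hat) + neg_logpi(x_hat + h)) / h**2
sigma2 = -1.0 / d2

laplace = norm(loc=x_hat, scale=np.sqrt(sigma2))   # the N(x_hat, sigma^2) density
```

Here the true mode is $\alpha - 1 = 19$, and the Laplace approximation density is accurate near the mode, as the text describes.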
This gives the "nested" part of the approximation, as one Laplace approximation density $\tilde{\pi}(\eta \mid \psi, y)$ is used to derive another approximate density $\tilde{\pi}(\psi \mid y)$.

Another "nested Laplace" approximation in INLA can be used to improve the approximation to $\pi(\eta \mid y, \psi)$ when the likelihood model is not Gaussian. It calculates the Laplace approximation iterating through all the elements of $\eta$:
\[
\pi(\eta_j \mid \psi, y) = \frac{\pi(\eta_j, \eta_{-j} \mid \psi, y)}{\pi(\eta_{-j} \mid \eta_j, \psi, y)} = \frac{\pi(\eta_j, \eta_{-j}, \psi \mid y)}{\pi(\psi \mid y)}\frac{1}{\pi(\eta_{-j} \mid \eta_j, \psi, y)} \propto \frac{\pi(\eta, \psi \mid y)}{\pi(\eta_{-j} \mid \eta_j, \psi, y)}
\]
\[
\propto \frac{\pi(y \mid \psi, \eta)\, \pi(\eta \mid \psi)\, \pi(\psi)}{\pi(\eta_{-j} \mid \eta_j, \psi, y)} \approx \left.\frac{\pi(y \mid \psi, \eta)\, \pi(\eta \mid \psi)\, \pi(\psi)}{\tilde{\pi}(\eta_{-j} \mid \eta_j, \psi, y)}\right|_{\eta_{-j} = \hat{\eta}_{-j}(\eta_j, \psi)} \triangleq \tilde{\pi}(\eta_j \mid \psi, y), \tag{3.9}
\]
where $\triangleq$ denotes "defined as" and $\hat{\eta}_{-j}(\eta_j, \psi) = \arg\max \tilde{\pi}(\eta_{-j} \mid \eta_j, y, \psi)$. This improved approximation comes at a high computational cost, because it needs to iterate through all elements of $\eta$.

Integration step of INLA

The approximate Laplace densities are put together to obtain the posterior density $\pi(\eta \mid y)$:
\[
\pi(\eta \mid y) = \int_{\psi} \pi(\eta \mid y, \psi)\, \pi(\psi \mid y)\, d\psi \approx \int_{\psi} \tilde{\pi}(\eta \mid y, \psi)\, \tilde{\pi}(\psi \mid y)\, d\psi.
\]
The integration is done numerically by first searching the surface of $\tilde{\pi}(\psi \mid y)$ to set up a grid of integration points $\psi_1, \psi_2, \ldots, \psi_L$. Each grid point is associated with its own weight $w_i \propto \tilde{\pi}(\psi_i \mid y)\, \Delta_i$, $i = 1, 2, \ldots, L$, where $\Delta_i$ is the area of the corresponding grid cell. Then
\[
\pi(\eta \mid y) \approx \sum_{i=1}^{L} w_i\, \tilde{\pi}(\eta \mid y, \psi_i).
\]
This integration scheme is also used in our Bayesian Melding approach in Section 4.2. If we are interested in a better approximation to the posterior marginal of $\eta_j$, $\tilde{\pi}(\eta \mid y, \psi)$ can be replaced by $\tilde{\pi}(\eta_j \mid \psi, y)$ from Equation (3.9).

3.2 Stochastic differential equations

Stochastic differential equations (SDEs) have been used in many different areas, such as physics, finance, and ecology (Iacus, 2009). They model the changes in a process through a stochastic model. An SDE is usually written in the form
\[
d\theta(t) = \mu(\theta(t), t)\, dt + \sigma(\theta(t), t)\, dW(t), \tag{3.10}
\]
where $\{W(\cdot)\}$ is the standard Wiener process, $\mu(\cdot)$ is the deterministic part, $\sigma(\cdot)$ scales the random noise $dW(t)$, and $dW(t)$ is a white noise process.
Equation (3.10) should be understood as an abbreviation for the following integral equation,
\[
\theta(t) = \theta(0) + \int_0^t \mu(\theta(s), s)\,ds + \int_0^t \sigma(\theta(s), s)\,dW(s).
\]
The third term $\int_0^t \sigma(\theta(s),s)\,dW(s)$ is a stochastic integral because of the $dW(s)$ term. Recall that the regular Riemann integral is defined as the limit of sums of step functions with respect to a partition of an interval $[0,t]$. Let $\Pi_n = \{0 = t_{n,0} < t_{n,1} < t_{n,2} < \ldots < t_{n,n} = t\}$ be any partition of the interval $[0,t]$ and $\|\Pi_n\| = \max_{i=0,1,\ldots,n-1}(t_{n,i+1} - t_{n,i})$. The stochastic integral is most commonly defined as
\[
\int_0^t X(s)\,dW_s = \lim_{\|\Pi_n\|\to 0} \sum_{i=0}^{n-1} X(t_{n,i})\big(W(t_{n,i+1}) - W(t_{n,i})\big), \tag{3.11}
\]
where we replace $\sigma(\theta(s),s)$ with $X(s)$ for notational simplicity.

The limit in (3.11) is the limit in quadratic mean (mean square). This stochastic integral is known as the Itô integral. It is important to notice that $X(\cdot)$ on the interval $[t_i, t_{i+1}]$ is approximated in (3.11) by a step function evaluated at the start of the interval, $X(t_i)$. This gives the Itô integral properties different from those of a regular Riemann integral. For example, $\int_0^t W(s)\,dW(s) = (W(t)^2 - t)/2$, different from the Riemann integral $\int_0^t s\,ds = t^2/2$. Replacing $X(t_i)$ by $X((t_i + t_{i+1})/2)$ leads to a different definition of the stochastic integral. A more formal treatment of the definitions of the stochastic integral can be found in Arnold (1974).

For the SDE formulated in (3.10), the functions $\mu(\theta, t)$ and $\sigma(\theta, t)$ have to satisfy a few conditions to ensure that a unique continuous solution to (3.10) exists. Notice that we use $\theta$ to represent the first argument of these two functions. First, $\mu(\cdot,\cdot)$ and $\sigma(\cdot,\cdot)$ must be measurable on the domain of $\theta$ and $t$, namely $\mathbb{R}^d \times [0,T]$, where $d$ is the dimension of $\theta$. Second, they need to satisfy the global Lipschitz condition: there exists a constant $K < +\infty$ with
\[
|\mu(\theta, t) - \mu(\theta', t)| + |\sigma(\theta, t) - \sigma(\theta', t)| < K|\theta - \theta'| \tag{3.12}
\]
for all $\theta, \theta' \in \mathbb{R}^d$ and $t \in [0,T]$.
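The left-endpoint convention in (3.11), and the example $\int_0^t W\,dW = (W(t)^2 - t)/2$, can be checked numerically. The following sketch (ours, not from the thesis) simulates one Wiener path on a fine partition and compares the Itô sum with the value given by Itô's formula for that same path.

```python
import random, math

rng = random.Random(7)
n, T = 100_000, 1.0
dt = T / n

W = 0.0         # Wiener path, W(0) = 0
ito_sum = 0.0   # left-endpoint (Ito) Riemann sum for int_0^1 W dW
for _ in range(n):
    dW = rng.gauss(0.0, math.sqrt(dt))
    ito_sum += W * dW        # integrand evaluated at the START of the interval
    W += dW

exact = (W ** 2 - T) / 2     # Ito's formula: int_0^T W dW = (W(T)^2 - T)/2
```

Evaluating the integrand at the midpoint instead of the start of each interval would converge to $W(T)^2/2$, the answer ordinary calculus suggests, which is exactly the difference between the two definitions mentioned above.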
The final condition requires that there exists a constant $C < +\infty$ such that
\[
|\mu(\theta, t)| + |\sigma(\theta, t)| < C(|\theta| + 1) \tag{3.13}
\]
for all $\theta \in \mathbb{R}^d$ and $t \in [0,T]$. This linear growth condition ensures that the solution of (3.10) does not explode in finite time (see e.g., Iacus, 2009). As shown in Arnold (1974), these three conditions also ensure that the solution is a Markovian diffusion process, in which case $\mu(\theta, t)$ and $\sigma(\theta, t)$ are referred to as the drift and diffusion coefficients, respectively. Heuristically, the Markovian property can be deduced from the fact that the change in the process depends only on its current state, not on previous states. The Markovian property enables efficient computation in the inference and prediction for this process.

Modeling the differential $d\theta(t)$ conveniently incorporates our knowledge about how the process changes and evolves, but this parameterization yields closed form solutions in only a few cases (Arnold, 1974; Iacus, 2009). One type of SDE with a closed form solution is the linear SDE discussed in Chapter 6. For the general SDE without a closed form solution in Chapter 7, approximation techniques are needed for inference and prediction. The simplest and most commonly used approximation method is the Euler approximation (see e.g., Iacus, 2009). For a general SDE as in (3.10), the Euler approximation defines the approximating process $\tilde\theta(t_0), \tilde\theta(t_1), \ldots, \tilde\theta(t_n)$ by
\[
\tilde\theta(t_{i+1}) - \tilde\theta(t_i) = \mu(\tilde\theta(t_i), t_i)(t_{i+1} - t_i) + \sigma(\tilde\theta(t_i), t_i)\big(W(t_{i+1}) - W(t_i)\big),
\]
for $i = 0, 1, 2, \ldots, n-1$. The approximating process $\tilde\theta$ is a non-standard Brownian motion that is locally constant in its dynamics, with mean and standard deviation determined by $\mu$ and $\sigma$. To improve upon the Euler approximation, Ozaki (1992) developed a local linear approximation method, which can be viewed as using a linear SDE to approximate a general SDE. Archambeau et al. (2007), Archambeau and Opper (2010), and Vrettas et al.
(2015) proposed a variational algorithm to minimize the Kullback-Leibler (KL) divergence between the true distribution and an approximating distribution. The approximating distribution can be constructed as a GP (Archambeau et al., 2007; Archambeau and Opper, 2010) or a linear SDE (Vrettas et al., 2015). However, we have not found a complete comparison of these approximation techniques in terms of their theoretical/empirical performance. Also, we note that even the most sophisticated of these techniques only approximates the SDE with a linear SDE.

3.3 Kalman filter and smoother

As shown later in Chapter 6, the linear SDE can be expressed as a linear State Space Model (SSM), alternatively known as the Dynamic Linear Model (DLM; see e.g., Petris et al., 2009). In general, it is formulated as
\[
\mathbf{Y}_t = \mathbf{F}_t\boldsymbol{\theta}_t + \mathbf{v}_t, \quad \mathbf{v}_t \sim N_q(\mathbf{0}, \mathbf{V}_t), \tag{3.14}
\]
\[
\boldsymbol{\theta}_t = \mathbf{G}_t\boldsymbol{\theta}_{t-1} + \mathbf{u}_t, \quad \mathbf{u}_t \sim N_p(\mathbf{0}, \mathbf{U}_t), \tag{3.15}
\]
for $t > 0$. The initial condition is $\boldsymbol{\theta}_0 \sim N(\mathbf{m}_0, \mathbf{C}_0)$.

$\mathbf{Y}_t$ is the $q$-dimensional observation vector and $\boldsymbol{\theta}_t$ is the $p$-dimensional state vector. In general, the dimension $p$ of the state vector can differ from the dimension $q$ of the observation vector. Equation (3.14) is referred to as the observation model, where $\mathbf{Y}_t$ is a linear transformation of $\boldsymbol{\theta}_t$ by a $q \times p$ matrix $\mathbf{F}_t$ plus a Gaussian noise $\mathbf{v}_t$ with mean zero and covariance $\mathbf{V}_t$. Equation (3.15) describes the state model, where the current state $\boldsymbol{\theta}_t$ is modeled as a linear transformation of the previous state $\boldsymbol{\theta}_{t-1}$ by a $p \times p$ matrix $\mathbf{G}_t$ plus the Gaussian noise $\mathbf{u}_t$ with mean zero and covariance $\mathbf{U}_t$. The state $\boldsymbol{\theta}_{t-1}$ and the two Gaussian noises $\mathbf{v}_t$, $\mathbf{u}_t$ are assumed to be mutually independent. The initial condition is considered fixed, i.e., $\mathbf{C}_0 = \mathbf{0}$, in our study.

Notice that the subscripts $t$ in all the notation above, $\mathbf{Y}_t, \boldsymbol{\theta}_t, \mathbf{F}_t, \ldots$, are simplifications of $\mathbf{Y}(t), \boldsymbol{\theta}(t), \mathbf{F}(t), \ldots$, and the time points can have arbitrary gaps, either equal or unequal.
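As a preview of the recursions derived in the rest of this section, the filter for a scalar DLM ($p = q = 1$) takes only a few lines. This Python sketch is ours (the parameter values are hypothetical); it also accumulates the log-likelihood from the one-step-forward predictive densities.

```python
import math

def kalman_filter(y, G, U, F, V, m0, C0):
    """Scalar Kalman filter for the DLM
        y_t     = F * theta_t     + v_t,  v_t ~ N(0, V)
        theta_t = G * theta_{t-1} + u_t,  u_t ~ N(0, U).
    Returns filtered means m_t, variances C_t, and the log-likelihood."""
    m, C, loglik = m0, C0, 0.0
    ms, Cs = [], []
    for yt in y:
        a = G * m                  # one-step-forward state mean, cf. (3.18)
        R = G * C * G + U          # one-step-forward state variance, cf. (3.19)
        f = F * a                  # predictive mean of y_t
        Q = F * R * F + V          # predictive variance of y_t
        loglik += -0.5 * (math.log(2 * math.pi * Q) + (yt - f) ** 2 / Q)
        K = R * F / Q              # Kalman gain
        m = a + K * (yt - f)       # filtered mean m_t
        C = R - K * F * R          # filtered variance C_t
        ms.append(m)
        Cs.append(C)
    return ms, Cs, loglik

# Local-level model with one observation y_1 = 2: by hand, the filter gives
# a_1 = 0, R_1 = 1, Q_1 = 2, hence m_1 = 1 and C_1 = 0.5.
ms, Cs, ll = kalman_filter([2.0], G=1.0, U=1.0, F=1.0, V=1.0, m0=0.0, C0=0.0)
```

The multivariate version replaces the scalar products by matrix products and the division by $Q_t$ with $\mathbf{Q}_t^{-1}$, exactly as in the formulas below.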
We use $\mathbf{y}_t$ to denote the observed value of $\mathbf{Y}_t$ from the DLM in (3.14) and (3.15), and $\psi$ to denote all the parameters in $\mathbf{m}_0, \mathbf{C}_0, \mathbf{F}_{1:T}, \mathbf{G}_{1:T}, \mathbf{V}_{1:T}, \mathbf{U}_{1:T}$. The notation $a\!:\!b$ represents the collection $\{a, a+1, a+2, \ldots, b\}$, e.g., $\boldsymbol{\theta}_{1:t} = \{\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_t\}$. For expository simplicity, we use $[\cdots]$ instead of $\pi(\cdot)$ to denote density functions hereafter in this thesis. Also, $\langle\cdot|\cdot\rangle$ is introduced to denote densities obtained by conditioning on $\psi$: $\langle\cdot|\cdot\rangle = [\cdot|\cdot,\psi]$. This simplified notation enables us to fit many equations on one line in the later chapters.

The Kalman filter and smoother can be used to calculate the likelihood $[\mathbf{y}_{1:T}|\psi]$ and the posterior $[\boldsymbol{\theta}_{1:T}|\mathbf{y}_{1:T},\psi]$ ($\langle\boldsymbol{\theta}_{1:T}|\mathbf{y}_{1:T}\rangle$ in our simplified notation). The Kalman filter starts with the filtering density, where "filtering" means predicting the state $\boldsymbol{\theta}_t$ given the observations up to the current time, $\mathbf{y}_{1:t}$. This is done recursively based on the previous filtering density. We denote the previous filtering density at time $t-1$ as
\[
\boldsymbol{\theta}_{t-1}|\mathbf{y}_{1:t-1} \sim N(\mathbf{m}_{t-1}, \mathbf{C}_{t-1}), \tag{3.16}
\]
for $t > 0$. At $t = 1$, this is the initial condition $\boldsymbol{\theta}_0 \sim N(\mathbf{m}_0, \mathbf{C}_0)$, which initializes the Kalman filter. For $t > 1$, the filtering densities can be derived as follows.

With the state model (3.15), the one-step-forward predictive density of the state is
\[
\langle\boldsymbol{\theta}_t|\mathbf{y}_{1:t-1}\rangle = \int \langle\boldsymbol{\theta}_t|\boldsymbol{\theta}_{t-1}\rangle\langle\boldsymbol{\theta}_{t-1}|\mathbf{y}_{1:t-1}\rangle\,d\boldsymbol{\theta}_{t-1}. \tag{3.17}
\]
It corresponds to the Gaussian density
\[
\boldsymbol{\theta}_t|\mathbf{y}_{1:t-1} \sim N(\mathbf{a}_t, \mathbf{R}_t),
\]
\[
\mathbf{a}_t = E(\boldsymbol{\theta}_t|\mathbf{y}_{1:t-1}) = \mathbf{G}_t\mathbf{m}_{t-1}, \tag{3.18}
\]
\[
\mathbf{R}_t = \mathrm{Var}(\boldsymbol{\theta}_t|\mathbf{y}_{1:t-1}) = \mathbf{G}_t\mathbf{C}_{t-1}\mathbf{G}_t^T + \mathbf{U}_t. \tag{3.19}
\]
The expression for the conditional mean (3.18) follows from
\[
\mathbf{a}_t = E(\boldsymbol{\theta}_t|\mathbf{y}_{1:t-1}) = E\big(E(\boldsymbol{\theta}_t|\boldsymbol{\theta}_{t-1},\mathbf{y}_{1:t-1})\big) = E(\mathbf{G}_t\boldsymbol{\theta}_{t-1}|\mathbf{y}_{1:t-1}) = \mathbf{G}_t\mathbf{m}_{t-1}.
\]
Similarly, the conditional variance (3.19) follows from
\[
\mathbf{R}_t = \mathrm{Var}\big(E(\boldsymbol{\theta}_t|\boldsymbol{\theta}_{t-1},\mathbf{y}_{1:t-1})\big) + E\big(\mathrm{Var}(\boldsymbol{\theta}_t|\boldsymbol{\theta}_{t-1},\mathbf{y}_{1:t-1})\big) = \mathrm{Var}(\mathbf{G}_t\boldsymbol{\theta}_{t-1}|\mathbf{y}_{1:t-1}) + \mathbf{U}_t = \mathbf{G}_t\mathbf{C}_{t-1}\mathbf{G}_t^T + \mathbf{U}_t.
\]
Considering the observation model (3.14), we can also derive the one-step-forward predictive density of $\mathbf{y}_t$,
\[
\langle\mathbf{y}_t|\mathbf{y}_{1:t-1}\rangle = \int \langle\mathbf{y}_t|\boldsymbol{\theta}_t\rangle\langle\boldsymbol{\theta}_t|\mathbf{y}_{1:t-1}\rangle\,d\boldsymbol{\theta}_t. \tag{3.20}
\]
It corresponds to the Gaussian density
\[
\mathbf{y}_t|\mathbf{y}_{1:t-1} \sim N(\mathbf{f}_t, \mathbf{Q}_t),
\]
\[
\mathbf{f}_t = E(\mathbf{y}_t|\mathbf{y}_{1:t-1}) = \mathbf{F}_t\mathbf{a}_t,
\]
\[
\mathbf{Q}_t = \mathrm{Var}(\mathbf{y}_t|\mathbf{y}_{1:t-1}) = \mathbf{F}_t\mathbf{R}_t\mathbf{F}_t^T + \mathbf{V}_t.
\]
The above quantities can be derived similarly to (3.18) and (3.19). We skip the details here; the reader can refer to Petris et al. (2009) for a complete treatment.

Notice that with these predictive densities, we can calculate the likelihood of the SSM,
\[
[\mathbf{y}_{1:T}|\psi] = \langle\mathbf{y}_{1:T}\rangle = \prod_{t=1}^T \langle\mathbf{y}_t|\mathbf{y}_{1:t-1}\rangle.
\]
Using Bayes' rule and the conditional independence properties of the SSM, the current filtering density equals
\[
\langle\boldsymbol{\theta}_t|\mathbf{y}_{1:t}\rangle = \frac{\langle\mathbf{y}_t|\boldsymbol{\theta}_t\rangle\langle\boldsymbol{\theta}_t|\mathbf{y}_{1:t-1}\rangle}{\langle\mathbf{y}_t|\mathbf{y}_{1:t-1}\rangle}, \tag{3.21}
\]
which depends on the observation model (3.14), the one-step-forward predictive density of the state (3.17), and the predictive density of the observation (3.20). As all three are Gaussian, the filtering density is also Gaussian,
\[
\boldsymbol{\theta}_t|\mathbf{y}_{1:t} \sim N(\mathbf{m}_t, \mathbf{C}_t),
\]
\[
\mathbf{m}_t = E(\boldsymbol{\theta}_t|\mathbf{y}_{1:t}) = \mathbf{a}_t + \mathbf{R}_t\mathbf{F}_t^T\mathbf{Q}_t^{-1}(\mathbf{y}_t - \mathbf{f}_t),
\]
\[
\mathbf{C}_t = \mathrm{Var}(\boldsymbol{\theta}_t|\mathbf{y}_{1:t}) = \mathbf{R}_t - \mathbf{R}_t\mathbf{F}_t^T\mathbf{Q}_t^{-1}\mathbf{F}_t\mathbf{R}_t.
\]
So far, we have shown how to derive the current filtering density $\langle\boldsymbol{\theta}_t|\mathbf{y}_{1:t}\rangle$ from the previous filtering density $\langle\boldsymbol{\theta}_{t-1}|\mathbf{y}_{1:t-1}\rangle$. An intermediate step (3.20) helps us calculate the likelihood of the DLM. This procedure is referred to as the Kalman filter algorithm (Kalman, 1960); it utilizes the Gaussian Markovian property of the DLM to sequentialize the calculation. The Kalman filter only moves forward in time. To calculate the posterior of $\boldsymbol{\theta}_{1:T}$ given all the observations, we also need to move backward in time. The corresponding procedure is usually referred to as the Kalman smoother.

As in the Kalman filter, we work recursively, now backward in time, starting from the smoothing density at the next time point,
\[
\boldsymbol{\theta}_{t+1}|\mathbf{y}_{1:T} \sim N(\mathbf{s}_{t+1}, \mathbf{S}_{t+1}).
\]
At $t = T-1$, this is the filtering density at the last time point, $\boldsymbol{\theta}_T|\mathbf{y}_{1:T} \sim N(\mathbf{m}_T, \mathbf{C}_T)$.
With the Markovian property, we can show that
\[
\langle\boldsymbol{\theta}_t|\boldsymbol{\theta}_{t+1},\mathbf{y}_{1:T}\rangle = \langle\boldsymbol{\theta}_t|\boldsymbol{\theta}_{t+1},\mathbf{y}_{1:t}\rangle = \frac{\langle\boldsymbol{\theta}_{t+1}|\boldsymbol{\theta}_t\rangle\langle\boldsymbol{\theta}_t|\mathbf{y}_{1:t}\rangle}{\langle\boldsymbol{\theta}_{t+1}|\mathbf{y}_{1:t}\rangle}.
\]
This helps us derive the smoothing density
\[
\langle\boldsymbol{\theta}_t|\mathbf{y}_{1:T}\rangle = \langle\boldsymbol{\theta}_t|\mathbf{y}_{1:t}\rangle \int \frac{\langle\boldsymbol{\theta}_{t+1}|\boldsymbol{\theta}_t\rangle}{\langle\boldsymbol{\theta}_{t+1}|\mathbf{y}_{1:t}\rangle}\langle\boldsymbol{\theta}_{t+1}|\mathbf{y}_{1:T}\rangle\,d\boldsymbol{\theta}_{t+1}.
\]
It depends on the filtering density $\langle\boldsymbol{\theta}_t|\mathbf{y}_{1:t}\rangle$ in (3.21), the state model $\langle\boldsymbol{\theta}_{t+1}|\boldsymbol{\theta}_t\rangle$ in (3.15), the one-step-forward predictive density $\langle\boldsymbol{\theta}_{t+1}|\mathbf{y}_{1:t}\rangle$ in (3.17), and the smoothing density at the next time point, $\langle\boldsymbol{\theta}_{t+1}|\mathbf{y}_{1:T}\rangle$. All these densities are Gaussian, and we can derive the smoothing density as
\[
\boldsymbol{\theta}_t|\mathbf{y}_{1:T} \sim N(\mathbf{s}_t, \mathbf{S}_t),
\]
\[
\mathbf{s}_t = \mathbf{m}_t + \mathbf{C}_t\mathbf{G}_{t+1}^T\mathbf{R}_{t+1}^{-1}(\mathbf{s}_{t+1} - \mathbf{a}_{t+1}),
\]
\[
\mathbf{S}_t = \mathbf{C}_t - \mathbf{C}_t\mathbf{G}_{t+1}^T\mathbf{R}_{t+1}^{-1}(\mathbf{R}_{t+1} - \mathbf{S}_{t+1})\mathbf{R}_{t+1}^{-1}\mathbf{G}_{t+1}\mathbf{C}_t,
\]
which is how the Kalman smoother computes the current smoothing density $\langle\boldsymbol{\theta}_t|\mathbf{y}_{1:T}\rangle$ from the next one, $\langle\boldsymbol{\theta}_{t+1}|\mathbf{y}_{1:T}\rangle$.

One key advantage of the Kalman filter and smoother is that they transform the joint density of $\mathbf{y}_{1:T}$ and $\boldsymbol{\theta}_{1:T}$ into sequential conditional densities, such that the likelihood and posterior can be evaluated in time linear in $T$. They also pair well with the conditional specification of our spline-based SDE, as the SDE only specifies the one-step-ahead conditional density instead of the joint density.

3.4 B-spline

The B-spline (or basis spline) is a popular class of basis functions used to approximate a non-periodic smooth function (see e.g., De Boor et al., 1978; Ramsay et al., 2009; Buderman et al., 2015). The B-spline basis functions are piecewise polynomials of order $k$ that are continuous at the knots. The order $k$ is one plus the degree of the polynomial; e.g., a constant function is of order one and a linear function is of order two. In addition, the B-spline has continuous derivatives up to the $(k-2)$th derivative. Given knots $t_0 < t_1 < \ldots$
$< t_n$ and a desired order $k$, the B-spline constructs $J = n + k - 1$ basis functions recursively, starting from $k = 1$, i.e., piecewise constant functions:
\[
B_{j,1}(t) = \begin{cases} 1 & t_j \le t < t_{j+1}, \\ 0 & \text{otherwise.} \end{cases}
\]
For order $k > 1$, the basis functions are constructed as follows:
\[
B_{j,k}(t) = \frac{t - t_j}{t_{j+k-1} - t_j}B_{j,k-1}(t) + \frac{t_{j+k} - t}{t_{j+k} - t_{j+1}}B_{j+1,k-1}(t).
\]
This recursive construction is implemented in the De Boor algorithm (see e.g., De Boor et al., 1978), which makes the B-spline very efficient to evaluate. Notice that we need to extend the knots beyond $t_0, t_n$ to handle the boundary cases; readers can refer to De Boor et al. (1978) and Ramsay et al. (2009) for a more detailed description. The knots $t_0, t_1, \ldots, t_n$ can have irregular spacing, but we only work with equally spaced knots in this thesis. We specify a set of B-spline basis functions via its number of basis functions $J$ and order $k$, i.e., B-spline($J$, $k$) hereafter.

The first derivative of the basis function $B_{j,k}$ with $k > 1$ is a linear combination of two basis functions from a set of B-splines with the same knots but order $k-1$:
\[
B'_{j,k}(t) = \frac{k-1}{t_{j+k-1} - t_j}B_{j,k-1}(t) - \frac{k-1}{t_{j+k} - t_{j+1}}B_{j+1,k-1}(t).
\]
Recall that $J = n + k - 1$. This indicates that the first derivative of a B-spline with $J$ basis functions and order $k$ is formed by a B-spline with $J-1$ basis functions and order $k-1$. This property will be used later in Chapter 6.

Chapter 4

Bayesian Data Fusion Approaches

In this chapter, we propose two Bayesian data fusion approaches to combine tracking data from different sources.

4.1 Introduction

The problem of combining disparate data sets is not new to statisticians. In environmental statistics, for example, various approaches have been developed to combine measurements with numerical (computer) model outputs. Two of the most common approaches are Bayesian Melding (BM; Fuentes and Raftery, 2005) and downscaling (Berrocal et al., 2010; Zidek et al., 2012).
Bayesian Melding was developed to combine direct observations of air-pollutant concentrations from a sparse network of monitoring stations with grid-cell outputs from a deterministic chemical transport (computer) model over a geographical domain, the latter based on known pollutant sources and geophysical information. In this approach, the direct observations and the computer model outputs are connected via a hidden process for the "true" air pollution level. That is, the monitoring station observations $Z_0(s)$ at location $s$ measure the true air pollutant level $Z(s)$ with some measurement error,
\[
Z_0(s) = Z(s) + e(s), \quad \text{for all } s,
\]
where $e(s) \stackrel{iid}{\sim} N(0, \sigma_e^2)$ is a white noise process. The notation $\cdot \stackrel{iid}{\sim} \cdot$ means that the random variables on the left-hand side are independently and identically distributed according to the distribution on the right-hand side. In addition, we assume the error process $\{e(\cdot)\}$ is independent of $\{Z(\cdot)\}$. The computer model outputs $Z_1(s)$ are assumed to be
\[
Z_1(s) = a(s) + b(s)Z(s) + \delta(s),
\]
where $a(s)$ is the systematic additive error, $b(s)$ is the multiplicative error, and $\delta(s)$ is additional noise. Usually, the multiplicative error $\{b(\cdot)\}$ is assumed to be a constant, and the additive error $\{a(\cdot)\}$ is modeled with a polynomial or a Gaussian process with exponential covariance function (Foley and Fuentes, 2008; Fuentes and Raftery, 2005; Sahu et al., 2010). The BM approach has been adapted for other uses, such as modeling hurricane surface winds (Foley and Fuentes, 2008), ozone levels (Liu et al., 2011), and wet deposition (Sahu et al., 2010). These applications have demonstrated the remarkable flexibility and effectiveness of the BM approach.

Despite its flexibility, the BM approach is computationally cumbersome when dealing with "change-of-support" problems (Berrocal et al., 2010; Zidek et al., 2012), namely where the observations are collected on different spatial scales, such as in-situ station measurements versus area averages. In response, Berrocal et al.
(2010) developed a spatial or spatio-temporal downscaling approach to combine the air pollutant data, using a linear mixed effects model. Using the same notation as above, this downscaling approach can be described by the following regression model with spatially or temporally correlated coefficients,
\[
Z_0(s) = \beta_0 + r_1(s) + \big(\beta_1 + r_2(s)\big)Z_1(s) + \varepsilon(s),
\]
where $\{r_1(\cdot)\}$ and $\{r_2(\cdot)\}$ are spatio-temporal random processes, such as Matérn processes. Note that the index $s$ can be time, location, or their combination. Recently, Rundel et al. (2015) further developed the downscaling approach to combine speciated PM2.5 (particulate matter with a diameter of 2.5 micrometers or less) levels from multiple monitoring networks with computer model outputs.

In this chapter, we adapt both the BM and downscaling approaches to combine multiple sources of observations for tracking objects. We use both approaches to combine the GPS locations and Dead-Reckoned paths of marine mammals and apply them to data from northern fur seals, a species that inhabits the North Pacific Ocean (Battaile et al., 2015). Unlike the combination of observations from monitoring networks and computer model outputs, we are less concerned with the spatial "change-of-support" problem in tracking, as the observations and model outputs in our application lie on the same time scale.

In the BM framework, we first choose a random process that reflects the nature of the tracked object or the physics of its evolution as the prior for $\{Z(\cdot)\}$. For example, we consider $\{Z(\cdot)\}$ to be a Brownian Bridge process in our application to northern fur seal tracking, which corresponds to the fact that the animals return to their breeding beaches to feed their young after a foraging trip. To track an infectious disease, we could model $\{Z(\cdot)\}$ with the susceptible-exposed-infected-recovered (SEIR) compartmental equations (Dukic et al., 2012).
All the systems of observations are linked to different transformations of $\{Z(\cdot)\}$: the direct observation $\{Z_0(\cdot)\}$ is $\{Z(\cdot)\}$ plus a white noise process, while the indirect observations $\{Z_1(\cdot)\}, \{Z_2(\cdot)\}, \ldots$ are functions of $\{Z(\cdot)\}$ plus other random processes acting as biases that reflect model error. Our BM framework can be summarized as follows: for all $t$,
\[
Z(t) \sim \text{a certain random process}, \tag{4.1}
\]
\[
Z_0(t) = Z(t) + \varepsilon_0(t),
\]
\[
Z_1(t) = g_1(Z(t)) + \xi_1(t),
\]
\[
Z_2(t) = g_2(Z(t)) + \xi_2(t),
\]
\[
\cdots
\]
where $\{\varepsilon_0(\cdot)\}$ is a white noise process. The functions $g_j(\cdot)$, $j = 1, 2, \ldots$, can be assumed to have certain parameterized forms with unknown parameters. The $\{\xi_j(\cdot)\}$, $j = 1, 2, \ldots$, are random processes representing the errors in the indirect observations. To make inference about $\{Z(\cdot)\}$, we need to calculate the posterior distribution of $\{Z(\cdot)\}|\{Z_0(\cdot)\},\{Z_1(\cdot)\},\{Z_2(\cdot)\},\ldots$, whose posterior mean can be
the smoothed/predicted "track" of the tracked object, and whose posterior credible intervals (CIs) reflect the uncertainty in the track. Note that the random process in (4.1), such as the Brownian Bridge or the SEIR process, only reflects our prior knowledge of the track; its posterior is updated via the observations.

The downscaling approach for tracking bypasses the modeling of $\{Z(\cdot)\}$ and builds a mixed effects model between the direct observations and the other systems of observation,
\[
Z_0(t) = \beta_0 + r_0(t) + \big(\beta_1 + r_1(t)\big)Z_1(t) + \big(\beta_2 + r_2(t)\big)Z_2(t) + \cdots + \varepsilon(t),
\]
for all $t$. The $\{r_j(\cdot)\}$, $j = 0, 1, 2, \ldots$, can be Gaussian processes as in Berrocal et al. (2010) and Zidek et al. (2012). As with the BM approach, the posterior mean and CI of the linear predictor provide the predicted "track" and its uncertainty.

For the tracking application, we need to first choose appropriate processes for the random components, such as $\{Z(\cdot)\}$, $\{\xi_j(\cdot)\}$, etc. Besides matching the physics of the tracked objects, we also need to take account of the computational burden. For example, the devices attached to a northern fur seal, which sample at 16 Hz, and the video tracking of NBA players (Liu et al., 2016), which samples at 25 Hz, both yield incredibly big data sets. As a result, we avoid the MCMC techniques that have been used in the past for both the BM and downscaling approaches. Instead, we fit the downscaling model with the integrated nested Laplace approximation (INLA) method developed in Rue et al. (2009) and Lindgren et al. (2011). Inspired by the sparse matrix techniques, likelihood approximations, and gradient-based numerical integration in the INLA approach, we exploit the properties of the processes and design approximations to the likelihood for the BM approach.

The following sections describe the BM and downscaling approaches for tracking and apply them to the case study of northern fur seals reviewed in Chapter 2. Section 4.2 describes our Bayesian Melding approach, while Section 4.3 describes the downscaling approach. We perform several simulation studies to evaluate our BM and downscaling approaches, which are reported in Section 4.4. Section 4.5 contains the case study results together with cross-validation comparisons. Conclusions and discussion are contained in Section 4.6.

4.2 Bayesian melding

In this section, we introduce the BM approach to combine the information from the accurate but sparse GPS observations with the biased but dense DR path. For simplicity, the two dimensions of the path (latitude and longitude) are dealt with separately. Abstractly, our theory concerns a one dimensional path over time, which we denote by $\eta(t)$ at discrete time points $t = 0, 1, 2, \ldots, T$. The time unit plays no essential role in our theory. The approach works just as well with unequally spaced time points, i.e., for arbitrary $t_0, t_1, \ldots, t_T$, but for expository simplicity we work with $0\!:\!T$ because the DR path is equally spaced.
As in the previous BM literature, we put a Gaussian process prior on $\eta(t)$,
\[
\boldsymbol{\eta}(0\!:\!T) \sim N\big(\mathbf{f}(0\!:\!T),\ \mathbf{R}(0\!:\!T, 0\!:\!T)\big), \tag{4.2}
\]
where $f(\cdot)$ denotes the process mean function and $\mathbf{R}$ its covariance matrix. $0\!:\!T$ denotes $0, 1, 2, \ldots, T$. Similarly, $\mathbf{f}(0\!:\!T)$ stands for the vector $(f(0), f(1), \ldots, f(T))^T$, while $\mathbf{R}(0\!:\!T, 0\!:\!T)$ is a $(T+1)\times(T+1)$ covariance matrix whose $(t, t')$ entry $R(t, t')$ equals $\mathrm{Cov}(\eta(t), \eta(t'))$. Throughout this thesis, boldface characters are used exclusively to represent vectors or matrices.

Various options are available for this Gaussian process. A common one (Fuentes and Raftery, 2005; Sacks et al., 1989) assumes that $f$ is a simple parametric model, e.g., a constant or a linear function of the index (time or location) as well as of additional covariates that may affect $\eta$. The covariance function is usually assumed to be $\sigma^2\rho(|t - t'|)$, where $\rho(\cdot)$ is an isotropic correlation function such as the Matérn or power exponential. However, this popular stationary Gaussian process is not suitable for our application. As noted above, the tracked animal must return to the starting location of its foraging trip (to reunite with her pup), which means that the start and end points of the track are fixed, as shown in Figure 4.1. Apart from the start and end points, the animal's path is unknown, and hence random in our Bayesian framework. Its variation is relatively large in the middle and small close to the known start and end points. These features of the path led us to model it with a Brownian Bridge process, whose mean and covariance functions are
\[
f(t) = A + (B - A)\frac{t}{T},
\]
\[
R(s, t) = \sigma_H^2\,\frac{\min(s, t)\,\big(T - \max(s, t)\big)}{T},
\]
where $\eta(0) = A$ and $\eta(T) = B$ are the known start and end points of the path, while $\sigma_H^2$ is the variance parameter. Notice that $R(0, \cdot) = R(\cdot, T) = 0$, in accordance with the known start and end points $\eta(0)$ and $\eta(T)$. Also, $R(t, t)$ increases with $t$ when $t < T/2$ and decreases with $t$ for $t > T/2$, reflecting the fact that the variation of the path is largest in the middle.
Another noteworthy property of our covariance matrix $\mathbf{R}$ is its form as the product of a scalar $\sigma^2$ and a matrix that depends only on the time points. To clearly represent the parameters of the Brownian Bridge process, we introduce the notation
\[
BB(A, B, T_S, T_E, \sigma^2) \tag{4.3}
\]
for a Brownian Bridge process that starts from $A$ at time $T_S$ and ends at $B$ at time $T_E$ with variance parameter $\sigma^2$, namely having mean function $f(t) = A + (B-A)(t - T_S)/(T_E - T_S)$ and covariance function $\sigma^2\big(\min(s,t) - T_S\big)\big(T_E - \max(s,t)\big)/(T_E - T_S)$.

Our choice of the Brownian Bridge prior is popular in the biology and ecology literature. According to Humphries et al. (2010), marine mammals tend to exhibit Brownian-like movements in environments with abundant food resources, such as the resource-filled ocean around Bogoslof Island where our case study was centered (Benoit-Bird et al., 2013b). Also, a Brownian Bridge model was proposed by Horne et al. (2007) to model the habitat usage of a wide range of animal species. This model was further improved by Kranstauber et al. (2012, 2014) and Sawyer et al. (2009). Pozdnyakov et al. (2014) studied different ways of estimating the parameters of the Brownian Bridge model. Many other examples where animal paths have been modeled with Brownian Bridge processes can be found in the references of the above papers.

It should be recognized that the Brownian Bridge prior does not mean that the animal's path, after being updated with the GPS observations and the DR path, is still a Brownian Bridge. We use the Brownian Bridge prior to motivate a proper covariance structure, whereby the beginning and end of the animal's path are known but there is more uncertainty about the middle part of this path.
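The mean and covariance of $BB(A, B, T_S, T_E, \sigma^2)$ in (4.3) are simple to code; the sketch below (ours, not from the thesis, with hypothetical toy values) checks the properties just described: zero variance at the pinned endpoints and larger variance in the middle of the trip.

```python
def bb_mean(t, A, B, TS, TE):
    """Mean of the Brownian Bridge BB(A, B, TS, TE, sigma2) at time t."""
    return A + (B - A) * (t - TS) / (TE - TS)

def bb_cov(s, t, sigma2, TS, TE):
    """Covariance of the Brownian Bridge between times s and t."""
    return sigma2 * (min(s, t) - TS) * (TE - max(s, t)) / (TE - TS)

# Toy bridge from A = 0 at TS = 0 to B = 3 at TE = 10 with sigma2 = 2:
# the mean interpolates linearly, the variance vanishes at both ends
# and peaks at the midpoint, exactly the shape argued for in the text.
A, B, TS, TE, s2 = 0.0, 3.0, 0.0, 10.0, 2.0
mid_mean = bb_mean(5.0, A, B, TS, TE)        # linear interpolation: 1.5
mid_var = bb_cov(5.0, 5.0, s2, TS, TE)       # maximal variance
```

With $T_S = 0$ and $T_E = T$ these expressions reduce to the $f(t)$ and $R(s,t)$ given above for the path prior.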
The Markovian structure of the Brownian Bridge also helps to simplify the Bayesian computation of combining this process with the DR path, as discussed later.

[Figure 4.1 here: longitude (degrees) versus time for Trips 1 and 2.]
Figure 4.1: The (negative) longitude of the GPS observations of the foraging trips made by two northern fur seals in our case study. Both trips started and ended at Bogoslof Island (Alaska), where the females gave birth and nursed their pups; the horizontal line indicates the longitude -168.035E of the island. The time unit is the proportion of the total time of this foraging trip.

The GPS observations are available at time points $t_0 = 0 < t_1 < t_2 < \ldots < t_K = T$, where the $t_k$'s form a subset of $\{0, 1, \ldots, T\}$. The GPS observations are further denoted by $Y(t_k)$, $k = 0, 1, \ldots, K$. They are unbiased observations of the true location:
\[
Y(t_k)|\eta(t_k) \stackrel{iid}{\sim} N(\eta(t_k), \sigma_G^2), \tag{4.4}
\]
for $k = 1, \ldots, K-1$. The known start and end points assumption implies that $Y(t_0) = \eta(t_0) = A$ and $Y(t_K) = \eta(t_K) = B$ are known.

Next, $X(t)$, $t = 0, 1, 2, \ldots, T$, denotes the observed DR path. To incorporate the bias of the DR path, we assume
\[
X(t) = \eta(t) + h(t) + \xi(t), \tag{4.5}
\]
where $h(\cdot;\beta)$ is a parametric function designed to capture the systematic bias trend in the DR path, and $\{\xi(\cdot)\}$ denotes another Gaussian process, independent of $\{\eta(\cdot)\}$, that captures any irregular components in the deviation of the DR path from the truth:
\[
\boldsymbol{\xi}(1\!:\!T) \sim N\big(\mathbf{0}, \mathbf{C}(1\!:\!T, 1\!:\!T)\big).
\]
For the parametric bias component, we have considered various models, e.g., $h(t) = \sum_{i=1}^Q \beta_i t^{i-1}$.
The bias process $\{\xi(\cdot)\}$ is assumed to be a Brownian motion process (a random walk of order 1), whose covariance function is therefore
\[
C(s, t) = \sigma_D^2 \min(s, t).
\]
We believe the Brownian motion process to be a reasonable approximation to the gradually accumulating error in the DRA: if the errors in the velocity estimates from the DRA at each time point, after removing the systematic bias $h(t)$, form an i.i.d. normal sequence, then the error in the integrated path is a Brownian motion.

The final ingredients in our BM model are the prior distributions of the parameters. For notational simplicity, all densities are denoted by square brackets $[\ldots]$ throughout this thesis. We take $\sigma_G^2$ to be a known constant based on previous extensive tests of the Fastloc GPS device (Bryant, 2007). The priors of the other two variance parameters are chosen to be $[\sigma_H^2] \propto 1/\sigma_H^2$ and $[\sigma_D^2] \propto 1/\sigma_D^2$, which are uniform priors on the log scale ($[\log(\sigma_H^2)] \propto 1$). For $\beta = (\beta_1, \beta_2, \ldots, \beta_Q)^T$, a flat prior $[\beta] \propto 1$ is used. All these parameters are assumed to be independent of each other a priori.

For expository simplicity in describing the joint distribution of all the data and parameters, the following notation is introduced:

- The unknown part of the true path is denoted by $\boldsymbol{\eta} = \boldsymbol{\eta}(1\!:\!(T-1))$, a $(T-1)$-dimensional vector.

- The GPS observations of the unknown part of the path are denoted by $\mathbf{Y} = (Y(t_1), Y(t_2), \ldots, Y(t_{K-1}))^T$, a $(K-1)$-dimensional vector.

- The DR path is $\mathbf{X} = \big(X(1\!:\!(T-1))^T,\ X(T) - Y(T)\big)^T$, a vector of dimension $T$.

- For the two unknown variance parameters, let $\phi = (\sigma_H^2, \sigma_D^2)^T$.

The joint likelihood of our model is
\[
[\mathbf{X}, \mathbf{Y}, \boldsymbol{\eta}, \beta, \phi] = [\phi]\,[\beta]\,[\boldsymbol{\eta}|\phi]\,[\mathbf{Y}|\boldsymbol{\eta}]\,[\mathbf{X}|\beta, \phi, \boldsymbol{\eta}]. \tag{4.6}
\]
To obtain an estimate of the animal's true path and its uncertainty, we need the posterior distribution
\[
[\beta, \boldsymbol{\eta}|\mathbf{X}, \mathbf{Y}] = \int \underbrace{[\beta, \boldsymbol{\eta}|\mathbf{X}, \mathbf{Y}, \phi]}_{(1)} \times \underbrace{[\phi|\mathbf{X}, \mathbf{Y}]}_{(2)}\,d\phi. \tag{4.7}
\]
We also include the $\beta$ term, which can be used to assess the bias of the DRA. The posterior mean, denoted by $\tilde\eta(t)$, serves as an estimate of the animal's path, and the posterior standard error, denoted by $\tilde\sigma(t)$, provides an uncertainty statement about the estimated path. The point-wise 95% credible interval for $\eta(t)$ can be constructed via a normal approximation,
\[
\big[\tilde\eta(t) - 1.96\,\tilde\sigma(t),\ \tilde\eta(t) + 1.96\,\tilde\sigma(t)\big]. \tag{4.8}
\]
In principle, the "exact" credible intervals can be found via the normal mixtures in our numerical integration scheme. Yet we found empirically in our case study that the exact credible intervals were almost indistinguishable from the normal-approximation intervals in (4.8). This is discussed in detail in Section 4.2.1.

4.2.1 Model inference

To calculate the posterior (4.7), we first fix the variance parameters $\phi$ and calculate part (1) of Equation (4.7), and then integrate over the posterior of $\phi$. The first part of this section shows how the components of (4.7) can be efficiently evaluated. We then use numerical integration on an adaptive grid, as in INLA (Rue et al., 2009), to marginalize over the randomness in $\phi$. The numerical integration is described in Appendix A.6.

For notational simplicity, $\langle\cdot|\cdot\rangle$ denotes $[\cdot|\cdot,\phi]$, that is, $\langle\boldsymbol{\eta}|\mathbf{X},\mathbf{Y}\rangle = [\boldsymbol{\eta}|\mathbf{X},\mathbf{Y},\phi]$. As we specify our model in a Gaussian and linear fashion, it is straightforward to show that $\langle\beta,\boldsymbol{\eta}|\mathbf{X},\mathbf{Y}\rangle$ is a multivariate Gaussian density,
\[
\langle\beta,\boldsymbol{\eta}|\mathbf{X},\mathbf{Y}\rangle \propto \exp\left\{-\frac{1}{2}\big(\zeta - \mathbf{M}_1^{-1}\mathbf{M}_2\big)^T\mathbf{M}_1\big(\zeta - \mathbf{M}_1^{-1}\mathbf{M}_2\big)\right\}, \tag{4.9}
\]
where $\zeta = (\beta^T, \boldsymbol{\eta}^T)^T$ and $\mathbf{M}_1, \mathbf{M}_2$ are derived in Appendix A.1.

Although the multivariate Gaussian posterior makes inference conceptually easy to implement, calculating its posterior mean $\mathbf{M}_1^{-1}\mathbf{M}_2$ and covariance matrix $\mathbf{M}_1^{-1}$ involves a matrix decomposition with computational complexity of order $O(T^3)$, a tremendous computational burden when $T$ is large.
It is possible to avoid the $O(T^3)$ matrix decomposition with certain sparse matrix techniques together with the Sherman-Morrison-Woodbury formula (Henderson and Searle, 1981), but those techniques still require the storage of huge matrices and complicated matrix calculations. This pushes us to further reduce the complexity of (4.9).

It is easily seen that we have more information (data) about $\boldsymbol{\eta}_G \triangleq \boldsymbol{\eta}(t_{0:K})$, where the GPS observations are available, than elsewhere. For $\boldsymbol{\eta}(0\!:\!T \setminus t_{0:K})$, we only have the DR path. So our first step breaks $\boldsymbol{\eta}$ into two sets, that is,
\[
\langle\beta,\boldsymbol{\eta}|\mathbf{X},\mathbf{Y}\rangle = \langle\boldsymbol{\eta}(0\!:\!T \setminus t_{0:K})|\beta,\boldsymbol{\eta}_G,\mathbf{X},\mathbf{Y}\rangle\,\langle\beta,\boldsymbol{\eta}_G|\mathbf{X},\mathbf{Y}\rangle. \tag{4.10}
\]
We can then use the Markovian property of the Brownian Bridge process (see e.g., Stirzaker and Grimmett, 2001) to simplify (4.10) as
\[
\langle\beta,\boldsymbol{\eta}|\mathbf{X},\mathbf{Y}\rangle = \left\{\prod_{k=0}^{K-1}\langle\boldsymbol{\eta}(t_k\!+\!1 : t_{k+1}\!-\!1)|\eta(t_k),\eta(t_{k+1}),\beta,\mathbf{X},\mathbf{Y}\rangle\right\}\times\langle\beta,\boldsymbol{\eta}_G|\mathbf{X},\mathbf{Y}\rangle. \tag{4.11}
\]
A hidden assumption in (4.11) is that $t_{k+1} > t_k + 1$ for all $k$, such that each period has at least one non-GPS observation. When $t_{k+1} = t_k + 1$, our implementation skips this period automatically.

In this way, we partition the long $\boldsymbol{\eta}$ series into small pieces separated by the GPS observations. The next step exploits the Markovian property of the Brownian motion and enables us to simplify the $k$th term in the first part of (4.11) as
\[
\langle\boldsymbol{\eta}(t_k\!+\!1 : t_{k+1}\!-\!1)|\eta(t_k),\eta(t_{k+1}),\beta,\mathbf{X},\mathbf{Y}\rangle = \langle\boldsymbol{\eta}(t_k\!+\!1 : t_{k+1}\!-\!1)|\eta(t_k),\eta(t_{k+1}),\beta,\mathbf{X}(t_k : t_{k+1})\rangle. \tag{4.12}
\]
All the derivations for (4.11) and (4.12) are provided in Appendix A.2. In (4.12), the posterior of $\eta(t)$ between two GPS points can be evaluated using the corresponding segment of the DR path together with the posterior distribution of the two GPS points and $\beta$. This remarkably reduces the memory cost when computing the posterior of the long sequence and enables us to easily parallelize the whole calculation.
More-over, both the Brownian Bridge and Brownian Motion processes conditioned ontwo end points are Brownian Bridge processes, such that,η (tk+1 : tk+1−1)|η(tk),η(tk+1)∼BB(η(tk),η(tk+1), tk, tk+1,σ2H)ξ (tk+1 : tk+1−1)|ξ (tk),ξ (tk+1)∼BB(ξ (tk),ξ (tk+1), tk, tk+1,σ2D).This fact is exploited to completely avoid the matrix inverse calculation when eval-uating (4.12), which further reduces the computational burden. The derivations areincluded in Appendix A.3. Also, we found that the bias correction in the mostsimplified BM approach (empirical Bayesian, β = 0) is a shrinkage of the conven-tional bias correction, which will account for the signal-noise ratio in the DR path.This is discussed in Section 4.2.2.However, the evaluation of 〈β ,ηG|X,Y〉 in (4.11) still requires us to deal withthe long sequence X. But Y is an unbiased observation of ηG and therefore〈ηG|X,Y〉 can be well approximated by 〈ηG|Y〉. This approximation is excep-tionally good when σ2D > σ2G. The coefficients β can be well inferred from thedifference between XG ,X(t1:K) and Y. Thus, we introduce the following approx-imation:〈β ,ηG|X,Y〉 ≈ 〈β ,ηG|XG,Y〉. (4.13)41With similar arguments, we can also approximate the posterior of φ by,[φ |X,Y]≈ [φ |XG,Y] (4.14)The explicit expressions for (4.13) and (4.14) are included in Appendix A.4. Oursimulations that mimic the real data sets have shown that the impact of the twoapproximation errors in (4.13) and (4.14) is negligible. We also verified througha thinned version of the real data set that the difference between the posterior ob-tained from (4.13) and (4.14) is not significant.In summary, the posterior of η and β is approximated as follows:[η ,β |X,Y] =∫[η ,β |X,Y,φ ][φ |X,Y]dφ=∫〈η (0 : T \ t0:K)|β ,ηG,X,Y〉〈ηG,β |X,Y〉[φ |X,Y]dφ=∫ {K−1∏k=0〈η (tk+1 : tk+1−1)|β ,η(tk),η(tk+1),X〉}〈ηG,β |X,Y〉[φ |X,Y]dφ=∫ {K−1∏k=0〈η (tk+1 : tk+1−1)|β ,η(tk),η(tk+1),X(tk : tk+1)〉}×〈ηG,β |X,Y〉[φ |X,Y]dφ≈∫ {K−1∏k=0〈η (tk+1 : tk+1−1)|β ,η(tk),η(tk+1),X(tk : tk+1)〉}×〈ηG,β |XG,Y〉[φ |XG,Y]dφ . 
The integration in expression (4.15) is calculated on an adaptive grid based on [φ | X_G, Y], which is discussed in Appendix A.6. Combining all these techniques to simplify computation reduces the computational time of our BM approach to a level similar to that of the DRA. For the two data sets we worked with, the DRA and the BM of the DR path and GPS took less than five minutes in total on a regular laptop. We implemented the BM approach in an R package "BayesianAnimalTracker" (Liu, 2014). The speed with which calculations can be done using this package, i.e., Table 3 of Liu et al. (2015), is a huge benefit for marine biologists who want to follow the animal while aboard a ship or on a remote island using a portable laptop without access to the Internet or any high performance computers.

4.2.2 Connection with the conventional correction

Compared to the conventional bias correction method in (2.1), our BM approach can account for both data and model uncertainty and provide a CI for the estimates of the animal's path. Moreover, there is an interesting connection between the BM posterior mean η̃ and the conventional corrected path η̂ in Equation (2.1). Without loss of generality, let K = 1 (no GPS observations except the known start and end points) and Y(0) = X(0) = 0. Then (2.1) can be written as

η̂(t) = Y(T) t/T + (X(t) − X(T) t/T).

For BM, if h(t) = 0 for all t and φ is known, the posterior mean η̃ under the above assumptions, as calculated by (A.15) in Appendix A.5, simplifies to

η̃(t) = Y(T) t/T + [σ²_H / (σ²_H + σ²_D)] (X(t) − X(T) t/T).

The first parts of η̂(t) and η̃(t) represent linear interpolations between Y(0) = 0 and Y(T), which determine the basic trend of the animal's path between two known points. The second part is a "bridge" constructed from X(t), which starts at X(0) = 0 and ends at 0 = X(T) − X(T) · T/T.
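The two expressions above differ only in the weight given to the bridge term. A toy numeric sketch of both corrected paths, with hypothetical values for X, Y(T), σ²_H, and σ²_D (illustrative Python, not the thesis code):

```python
import numpy as np

def corrected_paths(X, Y_T, T, sigma2_H, sigma2_D):
    """Toy comparison of the conventional correction and the most
    simplified BM posterior mean (h = 0, phi known, K = 1).
    X: DR path sampled at t = 0, 1, ..., T with X[0] = 0.
    Y_T: GPS observation at the end point t = T (with Y(0) = 0)."""
    t = np.arange(T + 1)
    trend = Y_T * t / T                    # linear interpolation of the GPS endpoints
    detail = X - X[T] * t / T              # "bridge" built from the DR path
    rho = sigma2_H / (sigma2_H + sigma2_D)  # shrinkage factor
    eta_hat = trend + detail               # conventional correction
    eta_tilde = trend + rho * detail       # simplified BM posterior mean
    return eta_hat, eta_tilde

X = np.array([0.0, 1.0, 3.0, 2.0, 4.0])
eta_hat, eta_tilde = corrected_paths(X, Y_T=2.0, T=4, sigma2_H=1.0, sigma2_D=1.0)
```

Both paths pass through the GPS end points; between them, BM shrinks the DR detail toward the linear trend (by ρ = 1/2 with these hypothetical variances).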
This bridge can be treated as the "detail" of the animal's path, which is then added to the basic trend of the first part.

The difference between the conventional method and the simplified BM approach is the weight on the "detail". In the conventional approach, the "detail" is directly added to the basic trend, while BM shrinks the detail by a factor of ρ = σ²_H / (σ²_H + σ²_D). According to our model, we cannot distinguish between η(t) and ξ(t) in X(t) at the non-GPS time points, as we only observe their sum, but we know that η(t) accounts for the σ²_H part of the total σ²_H + σ²_D variance (both are mean zero after Y(T) t/T is removed). In this way, a fraction ρ = σ²_H / (σ²_H + σ²_D) of the detail is treated as signal in η(t) and added to the basic linear trend.

Notice that we only compare the most simplified BM approach to the conventional approach. In practice, the BM η̃ is far more complicated than the form shown above, with the parametric part from β and the integration over φ. According to our simulations and the cross-validation of real data later in this chapter, our η̃ is remarkably better than the conventional η̂.

4.3 Downscaling

Using the same notation as in the previous section, we propose the following downscaling model for the GPS observations and DR path:

Y(t_k) = β_0 + γ_1(t_k) + (β_1 + γ_2(t_k)) X(t_k) + ε(t_k), (4.16)

where {γ_1(·)} and {γ_2(·)} are zero-mean Gaussian processes, such as random walks of order 1 and 2 (RW1, RW2) or autoregressive processes of order 1, 2, and 3 (AR1, AR2, AR3), and {ε(·)} is a white-noise process. For expository simplicity, let γ denote all the unknown and random components in (4.16), including β_0, β_1, γ_1(0 : T), γ_2(0 : T) and the hyperparameters governing them, e.g., the variance/correlation parameters of γ_1, γ_2, ε. The combined path of the tracked animal can be learned from the posterior

[Y(t) | X(t), X(t_{0:K}), Y(t_{0:K})] = ∫_γ [Y(t), γ | X(t), X(t_{0:K}), Y(t_{0:K})] dγ
= ∫_γ [Y(t) | γ, X(t), X(t_{0:K}), Y(t_{0:K})] [γ | X(t_{0:K}), Y(t_{0:K})] dγ,

for t ∈ (0 : T) \ t_{0:K}.
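To make the structure of model (4.16) concrete, here is a minimal simulation sketch with both γ processes taken as RW1 and hypothetical parameter values (illustrative Python, not the R-INLA fitting code):

```python
import numpy as np

def simulate_downscaling(X, beta0, beta1, sd_g1, sd_g2, sd_eps, rng):
    """Simulate observations from the downscaling model
    Y(t_k) = beta0 + gamma1(t_k) + (beta1 + gamma2(t_k)) * X(t_k) + eps(t_k),
    with gamma1, gamma2 as RW1 (cumulative-sum) Gaussian processes."""
    n = len(X)
    gamma1 = np.cumsum(rng.normal(0.0, sd_g1, n))  # RW1 intercept process
    gamma2 = np.cumsum(rng.normal(0.0, sd_g2, n))  # RW1 slope process
    eps = rng.normal(0.0, sd_eps, n)               # white-noise error
    return beta0 + gamma1 + (beta1 + gamma2) * X

rng = np.random.default_rng(1)
X = np.cumsum(rng.normal(size=200))                # a stand-in DR path
Y = simulate_downscaling(X, 0.0, 1.0, 0.05, 0.01, 0.1, rng)
```

The time-varying intercept γ_1 absorbs additive drift in the DR path, while the time-varying slope γ_2 allows a multiplicative distortion; both are smoothed by their random-walk priors when the model is fitted.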
Notice that our formulation of the downscaling approach does not include a "true" process η explicitly, as the BM approach does. Such an η can be calculated as η(t) = β_0 + γ_1(t) + (β_1 + γ_2(t)) X(t) for all t, which has the same posterior expectation as Y(t).

Traditionally, the above integral has been calculated using an MCMC approach such as the Gibbs sampler (Berrocal et al., 2010; Zidek et al., 2012). However, the MCMC approach is computationally expensive as well as technically challenging, and it sometimes encounters mixing or convergence problems. We therefore calculate the above posterior by the integrated nested Laplace approximation (INLA) via the R-INLA package (Martins et al., 2013), as reviewed earlier. For the downscaling approach, we use the default priors in R-INLA, which are all weakly informative. It is noteworthy that we do not put an informative prior on the variance parameter of ε(t), as we do for the BM approach in Equation (4.4), because ε(t) represents not only the measurement error in Y(t) but also the lack-of-fit errors of the downscaling model.

So far, we have not found any direct equivalence between the BM and downscaling models, but notice that in both cases the posterior is essentially calculated based only on the DR path at the GPS time points, X(t_{0:K}), not on the full set X(0 : T). This supports the use of our approximation to the likelihood in (4.13) and (4.14) from another perspective.

4.4 Simulation study

We performed several simulation studies to evaluate the performance of our approaches. These included a simulation to study the approximation in our BM approach (Section 4.4.1) and a comparison of all five approaches under three different data generating models (Section 4.4.2).

4.4.1 Simulation evaluation of the approximations in the BM approach

We used a simulation to evaluate the impact of our likelihood approximation in (4.13) and (4.14).
For expository reasons, let "full BM" denote the Bayesian Melding approach based on the full likelihood (the left hand side of (4.14)) and "approximate BM" denote the Bayesian Melding approach based on the approximate likelihood (the right hand side of (4.14)).

In this simulation, the data were generated according to our BM model: the true curve was simulated as a Brownian Bridge with T = 2000; the K = 125 GPS time points were randomly chosen from the T = 2000 time points; the GPS observations were i.i.d. normal observations of the true curve at these time points; and the DR path was the true curve plus the bias function h(t) and another Brownian Motion process. The parameters used in the simulation were the maximum likelihood estimates from our Trip 2 Northing data set. The results shown below are based on 1000 replicates. Similar findings were obtained in other settings and are thus omitted.

The first two panels of Figure 4.2 show box-whisker plots of the posterior modes of log(σ²_H) and log(σ²_D) from the full and approximate BM. As we used a uniform prior on the log scale, the posterior modes are equivalent to the maximum likelihood estimates based on either the full or the approximate likelihood. Both estimates were nearly unbiased for both parameters, but the approximate likelihood estimates were clearly less efficient than the full likelihood estimates. The ratios between the standard errors of the approximate likelihood estimates and the full likelihood estimates were 1.19 and 1.38 for log(σ²_H) and log(σ²_D), respectively. The increase in standard error, or equivalently the efficiency loss, was relatively small, as the approximate BM only uses a (125 + 125)/(2000 + 125) = 2/17 fraction of the observations used in the full BM for parameter inference. For example, if we assume the standard error of the estimates is proportional to 1/√n as in the i.i.d.
case, the ratio between the standard errors would be predicted to be about √(17/2) ≈ 2.915, far more than the ratio observed in the simulation study.

Moreover, the estimates of log(σ²_H) and log(σ²_D) are relatively unimportant, given that these are nuisance parameters in our Bayesian inference. What matters in the real application is the quality of the reconstructed path. Therefore, we calculated the root mean integrated squared error,

RMISE = √( Σ_{t=0}^{T} (η(t) − η̄(t))² / (T + 1) ),

between the true curve η(t) and different estimates η̄(t), where η̄(t) generically represents the posterior mean of η(t) from either the full or the approximate posterior, or the posterior mean of Y(t) in the downscaling approach.

The third panel of Figure 4.2 is the scatter plot of the RMISE among all the replicates in our simulation. Most of the points lie on the diagonal line, indicating little difference between the reconstructed η(t)'s from the approximate and full BM. This is a consequence of the small efficiency loss in our likelihood approximation and the fact that we marginalize over the variance parameters in the posterior.

Figure 4.2: Plots from our simulation, from left to right: the box plot of the posterior modes of log(σ²_H) obtained from the approximate and full likelihoods; the box plot of the posterior modes of log(σ²_D); and the scatter plot of the RMISE of the posterior mean of η from the approximate Bayesian Melding approach versus that from the full Bayesian Melding approach.

4.4.2 Comparison of all five approaches

The above simulation study suggests that our approximate BM inference procedure is quite accurate. In the following simulation, we compared all five candidate methods of estimating the animal's path: the two methods working with the GPS observations only (1.
Linear interpolation and 2. CRAWL with drift term); and the three methods working with both the GPS and DR path (3. Conventional bias correction of the DR path, 4. Our Bayesian Melding method, and 5. Our downscaling method). We only worked with the approximate BM because the full BM was very time consuming to run. The detailed model choices for BM and downscaling were the same as those selected for our real data application in the next section; that is, the BM model had a constant h(t) and the downscaling model had both γ_1 and γ_2 as RW1.

The data were generated from three models: the Bayesian Melding model mentioned above, the CRAWL model, and the downscaling model. When generating data from the CRAWL model, a path was first simulated from a CRAWL model fitted to the real data sets, and the DR path was generated by adding a Brownian Motion to this path. In the downscaling model, we fixed the observed DR path (from the real data) and generated the GPS observations from our downscaling model. The parameter settings of all the data generation models were the estimates from our real data from Trip 2. One thousand replicates were generated from each model.

To summarize the performance in both dimensions, we report the pooled RMISE from both the Northing and Easting dimensions, namely,

√( Σ_{t=0}^{T} [ (η_N(t) − η̄_N(t))² + (η_E(t) − η̄_E(t))² ] / (2(T + 1)) ),

where N stands for the Northing and E for the Easting dimension. The pooled RMISE serves as a general measure of the goodness of approximation in both dimensions. The box plots of the RMISE in the two dimensions separately were similar and are thus omitted. The box plots of the pooled RMISE are in Figure 4.3. Our BM and downscaling methods have smaller RMISEs than all the other competitors regardless of which model generated the real path.
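The pooled RMISE above is straightforward to compute; a small sketch (illustrative, with made-up inputs):

```python
import numpy as np

def pooled_rmise(eta_N, est_N, eta_E, est_E):
    """Pooled RMISE over the Northing and Easting dimensions:
    sqrt(sum[(eta_N - est_N)^2 + (eta_E - est_E)^2] / (2 * (T + 1)))."""
    eta_N, est_N = np.asarray(eta_N, float), np.asarray(est_N, float)
    eta_E, est_E = np.asarray(eta_E, float), np.asarray(est_E, float)
    n = eta_N.size  # number of time points, T + 1
    sq = np.sum((eta_N - est_N) ** 2 + (eta_E - est_E) ** 2)
    return np.sqrt(sq / (2 * n))

# A perfect Easting estimate combined with a constant 1-unit Northing
# error gives a pooled RMISE of sqrt(1/2).
r = pooled_rmise([0, 0, 0], [1, 1, 1], [2, 2, 2], [2, 2, 2])
```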
That is, even if the animal's path were generated from a CRAWL model (with an integrated OU process) instead of a Brownian Bridge, the BM approach with a mis-specified prior still manages to outperform CRAWL interpolation with the help of the DR path. In the comparison of BM and downscaling, their differences were very small. These findings are consistent with our findings in the cross-validation studies of the real data.

Figure 4.3: Pooled RMISE for all five methods of estimating the animal's path, stratified by the data generating model.

Another interesting finding is that the conventional bias correction of the DR path outperformed CRAWL when the data were generated from the CRAWL model but not when the data were generated from the downscaling model. This is because in the CRAWL data generation the error in the DR path is purely additive, while in the downscaling data generation the error is both additive and multiplicative. The conventional bias correction is designed to deal only with the additive error, not the multiplicative one. On the other hand, as the model parameters in the BM approach adapt to the observations, it still manages to predict the animal's path as well as the correctly specified downscaling model.

4.5 Case study results

We used our proposed BM and downscaling approaches to combine the DR path and GPS observations into high resolution paths of northern fur seals. To simplify computation and the comparison of different models, we thinned the original 16 Hz DR path (16 observations per second) to one observation per 5 minutes and added the GPS time points to this thinned time set. Notice that the thinning was done after the DR path was produced from the original 16 Hz data set.
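The resulting analysis time grid can be sketched as a coarse regular grid merged with the GPS time points (illustrative; the exact thinning procedure and time units in the thesis may differ):

```python
import numpy as np

def analysis_times(T_end, gps_times, step=300.0):
    """A 5-minute (300 s) grid over [0, T_end] merged with the GPS time
    points, as a sketch of the thinned time set described above."""
    grid = np.arange(0.0, T_end + step, step)
    return np.unique(np.concatenate([grid, np.asarray(gps_times, float)]))

t = analysis_times(1200.0, [130.0, 700.0])
# Five grid times (0, 300, ..., 1200) plus the two GPS times.
```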
The resulting Trip 1 data set had 2100 time points, among which 274 were GPS time points. The resulting Trip 2 data set had 2334 time points, among which 130 were GPS time points. It is noteworthy that our BM approach could easily be fitted to the original "big" data sets based on the 1 Hz DR path (547803 time points for Trip 1 and 661249 for Trip 2). Those results were reported in Liu et al. (2015) but are omitted here, as they are almost identical to those in Table 4.4 and Figures 4.6 and 4.7 below. Individually modeling the two dimensions of the paths of the two animals yielded four data sets, denoted Trip 1 Northing (latitude), Trip 1 Easting (longitude), Trip 2 Northing, and Trip 2 Easting, respectively.

We considered different bias functions h in our BM approach as well as different random processes γ_1, γ_2 for the downscaling approach. To compare these different models as well as the two approaches, we conducted leave-5-out cross-validation (L5OCV) experiments. In L5OCV, we removed 5 consecutive GPS observations at a time when fitting our models and compared the model predictions based on the DR path to the held-out GPS observations. This cross-validation scheme was designed to evaluate the predictive ability of our models when the time gaps between GPS observations are relatively large. We were less concerned with the models' predictive ability over short gaps, as there are natural constraints on the speed of the tracked animal: when the time gaps were small, the animal's movements were confined to a small range and the performance of all methods was similar, as in the leave-one-out cross-validations of Liu et al. (2015). L5OCV in our real data sets created gaps longer than roughly 90% of the gaps in the observed GPS time points, as shown in Table 2.1, and thus provided a good way to evaluate long-term predictive performance.
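The L5OCV fold construction can be sketched as follows, assuming non-overlapping blocks of 5 consecutive GPS indices (our reading of the scheme; the thesis's exact fold layout may differ):

```python
def l5ocv_folds(n_gps, block=5):
    """Index blocks for leave-5-out cross-validation: each fold holds out
    `block` consecutive GPS observations (non-overlapping blocks here)."""
    return [list(range(i, min(i + block, n_gps)))
            for i in range(0, n_gps, block)]

folds = l5ocv_folds(12)
# folds -> [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```

For each fold, the model is refitted without the held-out GPS observations and its predictions at those times are compared with the withheld values.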
We used the root mean squared error (RMSE) as a measure of prediction accuracy and also calculated the actual coverage percentage of the nominal 95% posterior credible intervals to examine whether the uncertainty in the combined path is calibrated properly.

In the rest of this section, the goodness of the normal approximation credible interval (4.8) in the BM approach is first studied in Section 4.5.1. Next, we summarize the model selection results for the BM and downscaling models (Sections 4.5.2 and 4.5.3, respectively). Then the cross-validation comparison of the BM and downscaling approaches with linear interpolation, CRAWL, and conventional bias correction is described in Section 4.5.4. The corrected paths and their credible intervals for both methods are included in Section 4.5.5. We further study the distance traveled by the animal in Section 4.5.6.

4.5.1 Goodness of normal approximation credible intervals

The normal mixture scheme for numerical integration described in Appendix A.6 provides an almost "exact" way to calculate the point-wise credible intervals for η(t); that is, we can numerically calculate quantiles of the normal mixture at each time point. However, calculating these exact credible intervals can be time consuming for a large number of time points. A simple solution is to approximate the exact normal mixture posterior distribution of η with a normal distribution with mean (A.18) and variance (A.19). In our case study, we find that the normal approximation interval in (4.8) provides credible intervals almost identical to the exact ones. Using the Trip 1 Northing direction as an example, we plotted the exact normal mixture density and the normal approximation density for a few randomly picked time points in Figure 4.4. Similar plots were found for the other data sets and time points and are thus omitted.

From Figure 4.4, it is clear that the normal approximation density is indistinguishable from the exact density. There are two reasons for this.
First, the "randomness" from φ is relatively small compared to the "randomness" of η conditioning on φ; i.e., the full Bayesian posterior [η | X, Y] is not very different from the empirical Bayesian posterior [η | φ̂, X, Y], where φ̂ is the posterior mode. As [η | φ̂, X, Y] is normally distributed, it is not surprising that [η | X, Y] is nearly normal. Second, in our case study, φ is well "estimated" from more than one hundred observations X_G, Y_G. The large sample size for two parameters guarantees that the posterior of φ is also centered around the posterior mode and approximately normal. The full posterior [η | X, Y] can thus be viewed as a multivariate Gaussian resulting from a Gaussian distribution with a Gaussian prior mean.

Figure 4.4: The exact posterior density (solid line) and normal approximation density (dotted line) for a few η(t) in the Trip 1 Northing direction for a northern fur seal.

4.5.2 Model selection for BM

In our BM approach, we used the Brownian Bridge and Brownian Motion processes with different bias functions h in the DR path. Among many possible parameterizations of h, we investigated only the polynomials h(t) = Σ_{i=1}^{Q} β_i t^{i−1} of order Q = 1 (constant) to Q = 4. The RMSE and actual coverage are shown in Table 4.1.

As seen in Table 4.1, the BM with different h(·) functions had very similar RMSEs; i.e., they differed little compared to the difference between the BM approach and linear interpolation in Table 4.4. The actual coverages of the credible intervals were reasonably close to the nominal 95% among the different Q's. This indicates that increasing the complexity of h(·) had little impact on the performance of our BM approach in our data sets.
There was an exception for the Trip 2 Northing direction, where increasing Q to 3 or 4 reduced the CV RMSE, but at the cost of lower coverage percentages for the credible intervals. This observation led us to choose the simple BM approach with Q = 1 for further comparisons.

Table 4.1: RMSEs and actual coverage percentages of 95% credible intervals (in gray background) in L5OCV comparisons for different bias correction terms h(t) = Σ_{i=1}^{Q} β_i t^{i−1} with Q = 1, 2, 3, 4 in the BM approach. Each cell gives the RMSE (km) followed by the coverage (%).

                  Q = 1         Q = 2         Q = 3         Q = 4
Trip 1 Northing   0.80  94.9    0.80  95.2    0.80  95.6    0.80  95.6
Trip 1 Easting    0.75  97.8    0.75  98.2    0.76  97.8    0.76  97.8
Trip 2 Northing   3.06  93.0    3.06  93.0    2.73  92.2    2.83  89.1
Trip 2 Easting    2.62  96.9    2.60  96.1    2.52  93.8    2.53  93.8

Perhaps our finding of little difference among the different BM models should not seem surprising, as we marginalized over the variance parameters σ²_H, σ²_D and β (the parameters in h) when evaluating the posterior [η | X, Y]. That marginalization naturally reduces reliance on a good mean model to correct the bias in the DR path. Therefore, we chose not to pursue more sophisticated parameterizations of h.

4.5.3 Model selection for downscaling

For the downscaling model (4.16), we considered random walks of order 1 and 2 (RW1, RW2) and autoregressive processes of order 1, 2, and 3 (AR1, AR2, AR3) for both γ_1 and γ_2, leading to 25 models in total. They are denoted in "XXX–YYY" form (e.g., RW1–AR2 denotes the downscaling model with γ_1 a random walk of order 1 and γ_2 an autoregressive process of order 2). Not every model fit our data sets well, and INLA failed to converge due to numerical issues when fitting certain models. It was not feasible with our computational resources to perform cross-validation for all 25 models. Instead, we screened these models using the conventional deviance information criterion (DIC; Spiegelhalter et al., 2002), which can also be calculated by the INLA package.
We found that the DIC was minimized by some simple models for our data sets and have listed the DIC values for them, together with the simplest RW1–RW1 model, in Table 4.2.

Table 4.2: Selected DIC values for the downscaling models. "NA" means that INLA crashed when fitting this model. The AR1–AR2 model minimizes the DIC for Trip 1 Northing, AR2–RW1 for Trip 1 Easting and Trip 2 Easting, and RW2–RW1 for Trip 2 Northing. The RW1–RW1 model is included as a benchmark.

                  RW1–RW1    RW2–RW1    AR1–AR2    AR2–RW1
Trip 1 Northing   -1608.41   -51.87     -1641.93   -1542.40
Trip 1 Easting    -487.04    -367.29    NA         -1683.86
Trip 2 Northing   -243.34    -853.31    -729.44    -827.04
Trip 2 Easting    -774.80    -91.57     -741.25    -829.27

None of the models involving AR3 components were selected by the DIC criterion, indicating that the dependence in the two random processes had short memory; i.e., they only depended on the previous two time points. Also, it is clear from Table 4.2 that none of these models dominated in all four data sets. The AR2–RW1 model achieved the smallest DIC in the two Easting data sets, while the RW2–RW1 model, which minimized the DIC for Trip 2 Northing, did not fit Trip 1 Northing well. However, the simplest RW1–RW1 model achieved a reasonable fit in all four data sets and thus was included in the following comparisons.

As pointed out by Spiegelhalter et al. (2014), DIC measures how well the model fits the observed data, with a certain penalty on model complexity, and it is not an ideal criterion for the predictive power of the models, which was our key objective in reconstructing the animal's path. We therefore further compared the four downscaling models via leave-5-out cross-validation and summarized the results in Table 4.3. Among the four downscaling models we considered, AR2–RW1 and AR1–AR2 were the first to be ruled out, as they had large prediction errors in the Trip 2 Northing and Easting data sets. Close examination of the cross-validation results found that they completely failed to correct the DR path for a certain period of the trip, resulting in errors as large as 100 km.
In addition, the credible intervals for AR2–RW1 and AR1–AR2 failed to achieve the nominal coverage percentage for Trip 1 Northing or Trip 2 Easting. As for the comparison between the RW2–RW1 and RW1–RW1 models, RW2–RW1 had smaller RMSEs in the two Northing data sets, while RW1–RW1 had smaller RMSEs in the two Easting data sets. Overall, these two models achieved similar performance in terms of RMSE and actual coverage percentage. We therefore chose the simpler RW1–RW1 model for further comparisons.

Table 4.3: RMSEs and actual coverage percentages of nominally 95% credible intervals (in gray background) for the L5OCV comparisons of the downscaling models with different γ_1, γ_2 processes.

                                 RW1–RW1   RW2–RW1   AR1–AR2   AR2–RW1
Trip 1 Northing   RMSE (km)      0.95      0.83      1.06      2.73
                  Coverage (%)   95.6      91.5      94.9      86.0
Trip 1 Easting    RMSE (km)      0.75      0.93      0.71      0.80
                  Coverage (%)   98.9      95.2      96.3      96.6
Trip 2 Northing   RMSE (km)      2.59      1.61      3.56      5.26
                  Coverage (%)   93.0      98.4      92.9      91.1
Trip 2 Easting    RMSE (km)      2.56      4.08      18.27     18.32
                  Coverage (%)   98.4      93.8      85.2

4.5.4 Cross-validation comparison of the different approaches

We compared the selected BM model with Q = 1 (BM1 for short) and the downscaling RW1–RW1 model (DS11 for short) with the competitors: linear interpolation, conventional bias correction, and CRAWL (Johnson et al., 2008). Linear interpolation provides the same mean curve as using the Brownian Bridge to interpolate the GPS observations. CRAWL interpolates the GPS observations via state space models whose process model is an integrated Ornstein–Uhlenbeck process with a drift term (assuming the correlation and process noise parameters are the same in both dimensions). Linear interpolation and CRAWL were based on the GPS observations only.
Conventional bias correction was previously described in (2.1). The CV RMSE and coverage percentages of the credible intervals are summarized in Table 4.4.

We first compared the two approaches using only the GPS observations (first two columns of Table 4.4) to the approaches that combine both the GPS observations and the DR path (last three columns of Table 4.4). The latter approaches had better predictive performance in general, which demonstrates the great value of the DR path in providing fine scale details of the animal's movement. In the comparison of linear interpolation and CRAWL, the more complex CRAWL had a larger RMSE than linear interpolation in the two Easting data sets, which indicates a poor fit for the CRAWL models. In addition, the coverage percentages of CRAWL were lower than the nominal level in the two data sets from Trip 2.

We also noticed that the conventional approach had a larger RMSE than linear interpolation for the Trip 1 Northing data set, which shows that the conventional approach failed to properly use the high resolution DR path. The same issue was not found with our BM and downscaling approaches, both of which achieved a smaller RMSE than the other three approaches uniformly across all four data sets. Moreover, the reduction in RMSE achieved by our BM and downscaling approaches was larger than the differences among the BM or downscaling variants with different model components, as shown in Tables 4.1 and 4.3.

In the comparison between the selected BM and downscaling models (last two columns of Table 4.4), BM1 had a smaller RMSE for Trip 1 Northing while DS11 had a smaller RMSE for Trip 2 Northing. They achieved similar RMSEs in the two Easting data sets and similar coverage percentages across all four data sets.
Thus, we conclude that the BM and downscaling approaches have similar performance.

Table 4.4: RMSEs and actual coverage percentages of nominally 95% credible intervals (in gray background) for the L5OCV comparisons of all the approaches. Two approaches use only the GPS data: linear interpolation (Linear) and the correlated random walk (CRAWL) with drift term. Three approaches use both the GPS and DR path: conventional bias correction of the DR path (Convention), Bayesian Melding with Q = 1 (BM1), and downscaling with the RW1–RW1 model (DS11). Linear and Convention do not produce credible intervals, so no coverage is reported for them.

                                 Linear   CRAWL   Convention   BM1     DS11
Trip 1 Northing   RMSE (km)      1.16     1.12    1.25         0.80    0.95
                  Coverage (%)            94.1                 94.9    95.6
Trip 1 Easting    RMSE (km)      1.13     1.28    1.04         0.75    0.75
                  Coverage (%)            93.8                 97.8    98.9
Trip 2 Northing   RMSE (km)      4.44     3.70    3.33         3.06    2.59
                  Coverage (%)            88.3                 93.0    93.0
Trip 2 Easting    RMSE (km)      3.84     3.98    2.67         2.62    2.56
                  Coverage (%)            90.6                 96.9    98.4

To further explain the results in Table 4.4, we plotted the corrected path for all five approaches and zoomed in on the time period 12:00 to 24:00 on 2009-07-23 in Trip 1 to better illustrate their differences (Figure 4.5). Similar plots were obtained by zooming into other periods and are thus omitted. From this plot, we see that the conventional corrected path went through the GPS observations directly, as did the linear interpolation, but it inferred dramatic changes between the GPS observations. The reconstructed paths from BM and downscaling aligned well with the GPS observations while retaining detail from the conventional bias correction. As discussed in Section 4.2.2, the BM corrected path shrinks the conventional bias correction toward the linear interpolation. This shrinkage removes the noise in the DR path in a statistical way and gives BM an advantage over the conventional approach.
From this plot, we also found that the corrected paths from BM and downscaling were very close.

However, the path from CRAWL showed some unrealistic trends, like the upswing before 19:00, while the DR path indicated that the animal was moving in the opposite direction. These unrealistic trends likely resulted from the model assumptions in CRAWL, which may not fit the data well. These unrealistic trends, and the resulting poor performance of CRAWL in the cross-validations, seem to derive from the lack of the fine detail provided by the DRA.

Figure 4.5: The reconstructed path for a northern fur seal in a selected period from all five approaches considered in our study: Bayesian Melding (BM1), downscaling (DS11), conventional bias correction, CRAWL, and linear interpolation. The dots are the GPS observations.

4.5.5 Combined paths and their uncertainty from BM and downscaling

Figure 4.5 only covered a small period of our data. We applied our proposed BM and downscaling approaches to the four data sets and found that they all successfully corrected the bias of the DR path and properly quantified the uncertainty. Therefore, we only present the plots of the corrected path for the Trip 1 Northing data set. Similar plots and analyses were obtained for the other three data sets and are thus omitted.

In Figure 4.6, we show the corrected paths from the BM (solid curve) and downscaling (dotted curve) approaches, which are the posterior mean of η(t) when BM was used and of Y(t) when the downscaling approach was used. The point-wise 95% credible intervals (gray solid curve for BM and purple dotted curve for downscaling) are included to represent the uncertainty around the corrected path.
We also included the original data: the GPS observations (round points) and the DR path (dashed curve), from which it is clear that the bias of the DR path grew dramatically over time and reached 100 km by the end of this trip. The location estimates in the DR path were not very useful for predicting the animal's location, but the fluctuations in the DR path matched the fluctuations of the GPS observations, meaning that the DR path has useful high-frequency information that can be exploited to fill in the gaps between GPS observations. This was successfully achieved by our proposed BM and downscaling approaches, as the corrected paths from both approaches lay close to the GPS observations.

Figure 4.6: The Bayesian Melding and downscaling results for a northern fur seal undertaking Trip 1 Northing from Bogoslof Island, Alaska. Points are the GPS observations and the dashed curve is the DR path. The solid curve is the posterior mean of η(t) in the case of BM, whose 95% credible intervals are shown by the gray solid curve. The dotted curve is the posterior mean of Y(t) in the downscaling approach, whose 95% credible intervals are shown by the gray dotted curve.

On the scale of Figure 4.6, the corrected paths and CIs from BM and downscaling were almost indistinguishable. To show how our BM and downscaling approaches worked at a fine scale, we zoomed into the Day 2 (2009-07-23) and Day 6 (2009-07-27) parts of this trip. We can confirm from Figure 4.7 that the corrected paths from the two approaches were similar, as the curves of the posterior means nearly overlay each other.
As well, the CIs from both approaches displayed a clear "bridge" structure: they were narrower at the GPS time points and wider in between the GPS observations. This is plausible because we have direct and accurate observations at the GPS time points and less accurate information when the GPS locations are not available. The error grows as the track moves away from the GPS observations and decreases near the next GPS observation. Also, the longer the gap between the GPS time points, the larger the error and thus the wider the credible intervals, which is seen by comparing Days 2 and 6. On Day 6, fewer GPS observations were available and the gaps were longer, which resulted in overall wider credible intervals.

Another interesting finding seen in Figure 4.7 is that the CIs from the downscaling approach were narrower than those from BM on Day 2, while they were wider on Day 6. This was caused by the two RW1 components in the downscaling model, as their variance was growing with time (Figure 4.8). On the other hand, the posterior SD for the BM approach was more stable because it was more constrained by the Brownian Bridge structure.

Figure 4.7: Zoomed Bayesian Melding and downscaling results for a northern fur seal undertaking Trip 1 Northing on 2009-07-23 and 2009-07-27: Red points are the GPS observations. The solid curve is the posterior mean of η(t) in BM, whose 95% credible intervals are shown by the gray solid curve.
The dotted curve is the posterior mean of Y(t) in downscaling, whose 95% credible intervals are shown by the gray dotted curve.

Figure 4.8: Posterior standard deviation (SD) from the Bayesian Melding and downscaling results for a northern fur seal undertaking Trip 1 Northing. The solid curve is from BM while the dotted curve is from downscaling.

4.5.6 Paths in both dimensions and the distance traveled by the animal

We plot the corrected paths from the BM approach in both dimensions for these two trips in Figure 4.9 and omit the similar paths from the downscaling approach, because the corrected paths from downscaling were almost indistinguishable from those of the BM on the scale of the whole trip (as seen in Figure 4.6). From Figure 4.9, it is easy to see that the corrected paths from our approaches not only connect the GPS observations, but also preserve the tortuosity exhibited by the DR path.

With these corrected paths from our BM approach, we more accurately calculated the distance traveled by the fur seals during their foraging trips (Table 4.5). Distances calculated using our approach were 40% greater for Trip 1 and 49% greater for Trip 2 than those calculated by linear interpolation of the GPS observations. This, once again, demonstrates that the corrected DR path is needed to fill in the gaps between the GPS observations. Calculating the distance traveled by the animal with the conventional bias correction method revealed it to be twice the distance calculated based on GPS observations, which further illustrates that our BM is a compromise between linear interpolation and conventional DR correction.

Table 4.5: Total distance (km) traveled in the two fur seal foraging trips calculated via different methods

Trip  Linear Interpolation  BM      Conventional
1     418.25                585.95  815.36
2     443.30                662.91  1023.88

Figure 4.9: GPS observations of the two trips and the corrected paths from our BM approach.
The GPS-measured latitudes and longitudes of the two trips of the northern fur seal began and ended at Bogoslof Island in the summer of 2009. The GPS observations in Trip 1 are indicated by the round points and those for Trip 2 are indicated by the triangular points. The pink curve is the corrected path from our BM model for Trip 1 and the blue curve is for Trip 2.

4.6 Discussion

We present a Bayesian Melding approach and a downscaling approach to combine sparse but accurate GPS observations with high-resolution but biased DR paths for the tracking of marine mammals. The posterior means from our BM and downscaling approaches both offer an accurate and high-resolution path for the tracked animals, and the posterior credible intervals provide a reasonable statement of the uncertainty in our inferences. The good predictive performance of our approaches is also supported by our simulation studies. Moreover, neither of our methods requires computationally expensive MCMC methods for computation. Our BM approach exploits the conditional independence property of the Brownian Bridge and Brownian Motion to dramatically reduce the heavy computational burden involved in dealing with large data sets. The downscaling approach is fitted via the computationally efficient INLA approach.
The quality of the likelihood approximation in BM and the correctness of downscaling were confirmed in our simulation study.

We performed cross-validation studies to compare different models in these two approaches and found that the predictive performance of the simplest approaches (i.e., BM with a constant bias term and downscaling with both random effects being first-order random walks) was as good as or even better than that of the more complex models, according to our empirical assessments. This finding is partially explained by the fact that we marginalized over the model parameters in the posterior distribution for our tracked subjects. In the comparison between BM and downscaling, we could not find any noteworthy differences between these two approaches in their prediction accuracy or the actual coverage percentage of their credible intervals. However, our implementation of BM is better because it is more scalable, handling big data sets with more than half a million time points on a regular computer (Liu et al., 2015), while the downscaling approach fitted by INLA can only work with thinned data sets on the same computer. Also, we can build BM on a process that reflects the nature of our tracked subject, which is discussed in Chapter 6.

McClintock et al. (2014) have shown many disadvantages of using a discrete-time formulation when working with satellite data, which may lead one to question the discrete-time formulation we used. Yet, animal movement (e.g., that of the fur seals we tracked) is ultimately powered by body movement (e.g., the stroking of flippers), and there is a maximum frequency of body movement that an animal is capable of achieving.
Therefore, the animal's movement is essentially discrete and can thus be sufficiently well described by a discrete process model with an observation frequency of no less than twice the maximum frequency of the animal's body movement (the Nyquist frequency; see e.g., Leis, 2011; Le and Zidek, 2006). The Nyquist frequency was taken into consideration in data collection, i.e., the original sampling frequency of the DR path was 16 Hz in our northern fur seal data. The DR path thus captures the fine detail in an animal's movements and provides observations of the path in "continuous time" with respect to the animal. Our discrete-time formulation in Bayesian Melding or downscaling is thus backed up by these "continuous time" observations, which are free of the shortcomings of discrete-time models for satellite data. In addition, the Bayesian Melding with Brownian Bridge and Brownian Motion has both processes defined in continuous time, so it can be easily modified into a continuous-time model.

One concern about our Brownian Bridge prior in the Bayesian Melding approach is that the GPS observations (Figure 4.1) appear to be much smoother than a Brownian Bridge process. Yet, it is important to recognize that the Brownian Bridge is only a prior distribution for the animal's path. The corrected path, as seen in Figure 4.7, retains the smoothness and does not become as tortuous as a simulated Brownian Bridge, because the corrected path is the posterior given both the GPS and DR paths. The smoothness of the DR path is preserved in the posterior. However, our BM approach can undoubtedly be further improved by replacing the Brownian Bridge process prior with some other process, which is the focus of the following chapters.

Chapter 5

Conditional Heterogeneous Gaussian Process

In this chapter, we propose a conditionally heterogeneous Gaussian process (CHGP) for the modeling of inhomogeneous spatio-temporal data.
Here "inhomogeneous" means that the process has a non-stationary covariance function with changing variance or correlation parameters. For a GP defined on the time domain, "inhomogeneous" indicates that the process has non-constant variance/correlation parameters. In a CHGP, a Gaussian process (GP) is specified first on a set of knots. Conditioning on these knots, the sub-intervals separated by the knots follow GPs of the same family but with possibly different parameter values. The knots and sub-intervals should enable us to handle large data sets efficiently, and the "conditionally heterogeneous" parameters can capture the non-stationary dependence structure of the data. It is also possible to learn the hidden behavior based on the different conditional parameters. We first study a special case of the CHGP, the conditionally heterogeneous Brownian Bridge (CHBB), as an initial investigation. Many metrics and algorithms are considered for model selection in the CHBB modeling, but none of them succeeds in choosing a reasonable knot structure automatically. This chapter documents our attempts and discusses the reasons why this is an unsuccessful idea.

The rest of this chapter is organized as follows. Section 5.1 introduces the CHGP and CHBB processes and their relationship to some other recent developments in GP modeling. Section 5.2 introduces and discusses our metrics and algorithms for model selection in the CHBB modeling. Our reflections on these efforts are discussed in Section 5.3.

5.1 Definition of CHGP and CHBB

As in Section 3.1, we find that most of the recent developments in spatio-temporal modeling with the GP are focused on either the non-stationarity issue or the big data issue, and few methods are good at dealing with both issues. Our proposal of the conditionally heterogeneous GP (CHGP) is designed to handle the non-stationarity and big data issues simultaneously. Similarly to the NNGP modeling (Datta et al., 2016), we begin with a set of knots in ℝ^d, S = {s_0, s_1,
..., s_K}, and define a multivariate Gaussian random vector

η(S) ∼ N(f(S), C(S, S; θ_0)),

where f(S) is the mean vector and C(S, S; θ_0) is the covariance matrix. In addition to the knots, we also observe, or aim to predict, the process at the locations T = {s_0, s_1, ..., s_K, s_{K+1}, s_{K+2}, ..., s_N}. We assume that the knots satisfy S ⊂ T for simplicity in discussion, but the model works as well without this assumption. We divide the non-knot locations T \ S = {s_{K+1}, ..., s_N} into J disjoint blocks T_j, j = 1, 2, ..., J, such that

T_j ∩ T_{j′} = ∅ for j ≠ j′,  and  ∪_{j=1}^{J} T_j = T \ S.

Conditioning on η(S) and θ_0, η(T_j) is assumed to follow another GP,

η(T_j) | η(S) ∼ N(F_j η(S), C*_j(T_j, T_j; θ_{1,j}, θ_0)),

where θ_{1,j} denotes the additional parameters in the conditional covariance matrix C*_j. For example, F_j and C*_j can be specified as in the Gaussian Predictive Process (GPP, Banerjee et al., 2008):

F_j = C(T_j, S; θ_0) C⁻¹(S, S; θ_0),   (5.1)
C*_j = C(T_j, T_j; θ_{1,j}) − C(T_j, S; θ_0) C⁻¹(S, S; θ_0) C(S, T_j; θ_0).   (5.2)

In this case, θ_{1,j} are the parameters in the marginal covariance matrix for η(T_j). We can also consider other parameterizations for F_j and C*_j, such as the radial basis functions¹ from the multi-resolution GP (MRGP, Nychka et al., 2015).

It is easy to verify that the CHGP reduces to a GP when S = ∅, or S = T, or θ_{1,j} = θ_0 for all j in expression (5.2). If F_k and C*_k are specified in the GPP manner as above, we can choose T_k to be the subset whose elements have the same m nearest neighbors in S. If F_k and C*_k are specified via radial basis functions as in the MRGP, T_k can be chosen to be a subset whose elements have non-zero correlation with the same set of elements in S.
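To make the GPP-style specification concrete, the following sketch evaluates (5.1) and (5.2) for a toy one-dimensional example. The exponential covariance, the knot locations and all parameter values here are our own illustrative choices, not from the thesis; it also checks numerically that with θ_{1,j} = θ_0 the block covariance collapses to the usual (positive semi-definite) GP conditional covariance.

```python
import numpy as np

def expcov(x1, x2, sigma2, phi):
    """Exponential covariance: sigma2 * exp(-|x1 - x2| / phi)."""
    return sigma2 * np.exp(-np.abs(x1[:, None] - x2[None, :]) / phi)

S = np.array([0.0, 1.0, 2.0, 3.0])      # knots (illustrative)
Tj = np.array([0.4, 0.5, 2.6])          # one non-knot block T_j

theta0 = dict(sigma2=1.0, phi=1.0)      # knot-level parameters theta_0
theta1j = dict(sigma2=2.5, phi=1.0)     # block-specific parameters theta_{1,j}

CSS = expcov(S, S, **theta0)
CTS = expcov(Tj, S, **theta0)
Fj = CTS @ np.linalg.inv(CSS)                     # (5.1)
Cstar = expcov(Tj, Tj, **theta1j) - Fj @ CTS.T    # (5.2)

# with theta_{1,j} = theta_0 the CHGP collapses to an ordinary GP and
# C*_j is the usual (positive semi-definite) Schur-complement covariance
Cstar_gp = expcov(Tj, Tj, **theta0) - Fj @ CTS.T
print(Cstar.shape, np.min(np.linalg.eigvalsh(Cstar_gp)) >= -1e-8)
```

In this sketch, letting the block use its own variance parameter (2.5 instead of 1.0) inflates the conditional covariance of T_j while the conditional mean F_j η(S) is still driven entirely by the knot-level parameters, mirroring the structure of (5.1)-(5.2).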
There can be different ways of choosing the T_k's, and they should be chosen to accommodate the non-stationary dependence structure. In addition to the single layer of knots specified above, we can consider multiple layers of knots, i.e., specify another knot set inside T_j, similarly to Katzfuss (2016).

A key difference in the definition of the CHGP compared with the NNGP (Datta et al., 2016) and the MRGP (Nychka et al., 2015) is that the CHGP allows the parameters θ to vary among the conditional sets T_j, which is designed for inhomogeneous modeling. Also, instead of specifying a fixed set of knots or the nearest neighbors as in the MRGP or NNGP, we attempt to compare the models with different knots and let the data choose an "optimal" set of knots. Our CHGP differs from the independent block (Stein et al., 2004) or the composite likelihood approach (Eidsvik and Shaby, 2014) in that the correlation between blocks in the CHGP is still carried by the knot set.

¹A radial basis function is a basis function that only depends on the distance.

A special case of the CHGP defined for the time interval T = [0, T] is the conditionally heterogeneous Brownian Bridge (CHBB) process. At the knots S = {s_0 = 0 < s_1 < s_2 < ... < s_K = T}, the random vector has the same joint distribution as a Brownian Bridge with variance parameter σ²_0,

η(S) = η(s_{0:K}) ∼ BB(0, 0, s_0, s_K, σ²_0),   (5.3)

where BB(A, B, t, t′, σ²) is defined the same as in (4.3). The start and end points η_0 = 0, η_T = 0 are fixed for the definition of the BB, and the knot set S always contains 0 and T. The knot set S separates T into K disjoint intervals (periods) T_k = (s_{k−1}, s_k), k = 1, ..., K. The process in each period is a Brownian Bridge process with (conditional) variance parameter σ²_{1,k}, conditioning on η(s_{k−1}) and η(s_k):

η(T_k) | η(s_{k−1}), η(s_k) ∼ BB(η(s_{k−1}), η(s_k), s_{k−1}, s_k, σ²_{1,k}).
(5.4)

In addition, we assume conditional independence of any η(t) and η(t′) when t and t′ are separated by a knot s_k,

η(t) ⊥ η(t′) | η(s_k),  ∀ t < s_k < t′.

Denote the set of all the variance parameters by σ² = {σ²_0, σ²_{1,1}, ..., σ²_{1,K}}. Then the joint distribution of the CHBB, [η | S, σ²], is

[η | S, σ²] = [η(S) | σ²_0] ∏_{k=1}^{K} [η(T_k) | η(s_{k−1}), η(s_k), σ²_{1,k}].   (5.5)

The joint distribution of the CHBB can be viewed as the distribution of a Gaussian process with changing variance parameters. Clearly, a CHBB reduces to a Brownian Bridge process when σ²_0 = σ²_{1,k} for all k, or S = T, or S = {0, T}. Notice that σ²_{1,k} can be different from σ²_0, which allows this process to have inhomogeneous features and adapt to the local data. For the modeling of an animal's track, σ²_0 in the CHBB can reflect the overall range of the foraging trip and the different σ²_{1,k} can reflect the animal's hidden states, e.g., foraging, traveling or sleeping during this trip.

As an illustration, Figure 5.1 plots five simulated realizations of a CHBB observed at T = {0, 1, ..., 60} with knot set S = {0, 10, ..., 60}. The start and end points of these five realizations are fixed at 0 and the variance parameters are set to be σ²_0 = 10, σ²_{1,1} = σ²_{1,3} = σ²_{1,5} = 2 and σ²_{1,2} = σ²_{1,4} = σ²_{1,6} = 0.1. From Figure 5.1, it is easy to see that the overall range of these CHBBs is mainly decided by σ²_0. Also, in the shaded periods, the non-knot points are far away from the lines connecting the knots because of a large σ²_{1,k}. On the other hand, in the non-shaded periods, the non-knot points stay relatively close to the lines connecting the knots. For the marine mammal tracking application, the shaded periods may represent the animal's foraging behavior, where the animal searches its surroundings for food, while the non-shaded periods may represent the animal's traveling behavior, where it swims towards a certain destination, such as its home.

Figure 5.1: Five simulated realizations of a CHBB with knot set S = {0, 10, 20, ..., 60}.
The start and end points of this CHBB are fixed at zero. The variance parameters are chosen to be σ²_0 = 10, σ²_{1,1} = σ²_{1,3} = σ²_{1,5} = 2 (shaded areas) and σ²_{1,2} = σ²_{1,4} = σ²_{1,6} = 0.1. The vertical dashed lines indicate the knots S.

The definition of the CHBB is different from that of the dynamic Brownian Bridge process in Gurarie et al. (2009) as well as Kranstauber et al. (2012, 2014), which assumes that the variance parameter is changing with time. They choose to estimate the variance parameter at time t by averaging over the estimates from all the moving windows that contain t. This produces a sequence of highly correlated variance parameters, and it is unclear to us how to choose the window size. If the window covers observations from two behavior states of the animal, it is difficult to explain what the variance parameter represents.

5.2 Attempts at model selection for the CHBB

An important feature of the CHBB and CHGP is that we allow the data to decide the knots and the conditional parameters. The conditional parameters can be estimated given the knots, and thus the key issue is how to select the knots S, which involves two sub-issues: how to enumerate the candidate S's and how to measure their differences. For the first, we consider sequentially adding knots to an empty knot set, or deleting knots from a full knot set, because S = {0, T} and S = T define the same model. We find empirically that the search direction has little impact on the selected knots once the metric for S is decided.
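As a concrete illustration of the construction in (5.3)-(5.5), a CHBB realization like those in Figure 5.1 can be simulated by first drawing the knot values from a Brownian bridge and then filling each period with a conditional bridge using its own variance parameter. The sketch below is our own illustrative code (not the thesis implementation), using the same parameter values as Figure 5.1:

```python
import numpy as np

rng = np.random.default_rng(1)

def bb_fill(x_left, x_right, t_left, t_right, times, sigma2, rng):
    """Sequentially sample a Brownian bridge with variance parameter sigma2
    at `times`, pinned to x_left at t_left and x_right at t_right."""
    xs, x_prev, t_prev = [], x_left, t_left
    for t in times:
        w = (t - t_prev) / (t_right - t_prev)
        mean = x_prev + w * (x_right - x_prev)                     # cond. mean
        var = sigma2 * (t - t_prev) * (t_right - t) / (t_right - t_prev)
        x_prev, t_prev = rng.normal(mean, np.sqrt(var)), t
        xs.append(x_prev)
    return xs

def simulate_chbb(knots, step, sig2_knot, sig2_periods, rng):
    """One CHBB realization: knot values from BB(0, 0, s_0, s_K, sig2_knot)
    as in (5.3), then each period filled conditionally as in (5.4)."""
    kv = [0.0] + bb_fill(0.0, 0.0, knots[0], knots[-1],
                         knots[1:-1], sig2_knot, rng) + [0.0]
    path = {knots[0]: 0.0}
    for k in range(len(knots) - 1):
        inner = range(knots[k] + step, knots[k + 1], step)
        vals = bb_fill(kv[k], kv[k + 1], knots[k], knots[k + 1],
                       inner, sig2_periods[k], rng)
        path.update(zip(inner, vals))
        path[knots[k + 1]] = kv[k + 1]
    return path

knots = list(range(0, 61, 10))                       # S = {0, 10, ..., 60}
path = simulate_chbb(knots, 1, 10.0, [2.0, 0.1, 2.0, 0.1, 2.0, 0.1], rng)
print(len(path), path[0], path[60])                  # 61 points, pinned at zero
```

The two-stage sampling mirrors the factorization (5.5): the knot layer carries the overall range through σ²_0, while each period's spread around the chord between its knots is governed by its own σ²_{1,k}.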
As a sparse knot set is preferred, sequentially adding knots is more efficient and is used in our following discussion.

5.2.1 Metrics for model comparison

To measure the differences between CHBB models with different knots, we have considered the following metrics: the likelihood ratio (LR) statistic, the Kullback–Leibler (KL) divergence and three Bayes factors: original, intrinsic and fractional. We use M to denote the CHBB model, including the knots S ⊆ T and the variance parameters σ² = {σ²_0, σ²_{1,1}, ..., σ²_{1,K}}. The bracketed subscript on M_{(i)} indexes the different models.

The LR statistic and KL divergence are two frequentist metrics for model comparison. The LR statistic D is twice the difference between the log likelihoods of two nested models M_{(0)} and M_{(1)},

D = −2 (log([y | S_{(0)}, σ̂²_{(0)}]) − log([y | S_{(1)}, σ̂²_{(1)}])),   (5.6)

where [y | S_{(i)}, σ²_{(i)}], i = 0, 1, is the joint density of y as in (5.5) and σ̂²_{(i)}, i = 0, 1, are the maximum likelihood estimates of σ² given S and y. Model M_{(0)} is nested in M_{(1)} when S_{(0)} ⊂ S_{(1)}.

The KL divergence measures the discrepancy between M_{(1)} and M_{(0)}. As with the LR statistic, we evaluate the divergence at the MLEs of the parameters,

KL(M_{(0)} || M_{(1)}) = ∫ [y | S_{(0)}, σ̂²_{(0)}] log([y | S_{(0)}, σ̂²_{(0)}] / [y | S_{(1)}, σ̂²_{(1)}]) dy.   (5.7)

A shortcoming of the LR statistic and KL divergence is that they depend on the MLEs of the parameters and do not account for the uncertainty in the parameters. This shortcoming can be overcome by using the Bayes factor (BF), which compares two models by the marginal probability of the data y,

[y | M_{(i)}] = ∫ [y | S_{(i)}, σ²_{(i)}] [σ²_{(i)}] dσ²_{(i)},  i = 0, 1,   (5.8)

where [σ²_{(i)}] is the prior for the variance parameters σ²_{(i)}. The Bayes factor is defined as the ratio of the two marginal distributions,

BF(M_{(1)}, M_{(0)}) = [y | M_{(1)}] / [y | M_{(0)}].   (5.9)

Notice that a proper prior for the variance parameters is needed in the calculation of the marginal distribution and the BF (O'Hagan, 1995).
In the absence of the prior knowledge needed for an informative prior, the improper (uninformative) prior [σ²] ∝ 1 or [log(σ²)] ∝ 1 is a convenient choice, but these improper priors lead to an ill-defined BF.

To illustrate this problem, we introduce a normalizing constant c and write the improper prior as c[σ²]. Notice that c does not exist, but we act as if c[σ²] were a proper prior. Consider M_{(0)}, a BB without knots; it has one variance parameter ς²_0. Its marginal density is

[y | M_{(0)}] = c ∫ [y | ς²_0] [ς²_0] dς²_0.   (5.10)

Meanwhile, M_{(1)} is a CHBB with K = 2 (three knots) and thus it has three variance parameters σ²_0, σ²_{1,1}, σ²_{1,2}. The marginal probability becomes

[y | M_{(1)}] = c³ ∫ [y | S_{(1)}, σ²_0, σ²_{1,1}, σ²_{1,2}] [σ²_0][σ²_{1,1}][σ²_{1,2}] dσ²_0 dσ²_{1,1} dσ²_{1,2}.   (5.11)

Dividing (5.11) by (5.10), the BF between M_{(0)} and M_{(1)} has a leading factor of c². This non-existent normalizing constant is part of the BF and thus demonstrates that it is inappropriate to use the BF with improper priors.

Two alternative forms of the BF have been proposed to construct the BF with improper priors. The intrinsic BF (Berger and Pericchi, 1996, and references therein) first calculates the posterior with part of the data and then uses this posterior as the prior for the remaining data. A concern about the intrinsic BF is that users have to manually decide which observations to use in forming the prior. Such arbitrariness is avoided by the fractional BF (O'Hagan, 1995), which considers the following fractional marginal density

Q_b(M_{(i)}) = ∫ [y | S_{(i)}, σ²_{(i)}] [σ²_{(i)}] dσ²_{(i)} / ∫ [y | S_{(i)}, σ²_{(i)}]^b [σ²_{(i)}] dσ²_{(i)},

where b ∈ (0, 1) is a small fraction that in effect decides how much data is used to "form the prior".

For the model selection of the CHBB, we have considered all three forms of the BF. The original BF is applied with informative priors. Small fractions b = 1/T, 2/T, 3/T, etc., are considered for the fractional BF. For the intrinsic BF, we use the posterior given the knots, [σ²_0 | y(S)], as the prior for T \ S.
For all the metrics, LR, KL and the BFs, a larger value indicates a better fit of M_{(1)}. Yet, it is unclear how big an increase indicates a substantially better fit, except for the LR statistic, where the AIC or BIC can be used. We sidestep this issue by listing all candidate models that can be visited and studying the top models with the biggest increases relative to the null model.

However, none of the above metrics offers satisfactory model selection results for either the simulated or the real data sets. They all tend to select two knots that are close to each other. This issue can be alleviated by a strongly informative prior in the original BF or a large b in the fractional BF, but both have the undesired feature of sensitivity to the prior choices. For an individual data set, we may compare a few priors and choose one that yields a seemingly reasonable knot structure, i.e., one matching our visual examination of the data. However, the prior comparison is redundant: it is easier to just pick the knots from the visual examination. Visual examination can also be challenging for big spatio-temporal data sets and is prone to human bias.

5.2.2 Reason for their failure

To explain why all the metrics fail, we use the following example. Consider y = {y_0, y_1, ..., y_T} to be a realization of a CHBB at the time set T = {0, 1, 2, ..., T}. M_{(0)} is the Brownian Bridge model,

M_{(0)}: y ∼ BB(y_0, y_T, 0, T, ς²).

The end points y_0, y_T are fixed. Model M_{(1)}(s) has a single knot at time s ∈ (1, T−1) (no empty periods),

M_{(1)}: y_s | y_0, y_T ∼ BB(y_0, y_T, 0, T, σ²_0),
        y_{1:(s−1)} | y_0, y_s ∼ BB(y_0, y_s, 0, s, σ²_{1,1}),
        y_{(s+1):(T−1)} | y_s, y_T ∼ BB(y_s, y_T, s, T, σ²_{1,2}).

Under M_{(0)}, twice the negative log likelihood is

−2 log([y | ς²]) = (T − 1)(log(2π) + log(ς²)) + log(|C|) + (y − f)ᵀ C⁻¹ (y − f) / ς²,

where C is the covariance matrix of a standard (variance parameter 1) Brownian Bridge on T and f is the mean of the Brownian Bridge as in (4.3). The MLE of ς² can be calculated as

ς̂² = (1/(T − 1)) (y − f)ᵀ C⁻¹ (y − f).
(5.12)

Plugging the MLE ς̂² back into the log likelihood, we have

−2 log([y | ς̂²]) = (T − 1)(1 + log(2π) + log(ς̂²)) + log(|C|).   (5.13)

Similarly, under M_{(1)},

−2 log([y | σ̂²_0, σ̂²_{1,1}, σ̂²_{1,2}]) = (1 + log(2π) + log(σ̂²_0)) + log(|C_0|)
    + (s − 1)(1 + log(2π) + log(σ̂²_{1,1})) + log(|C_{1,1}|)
    + (T − s − 1)(1 + log(2π) + log(σ̂²_{1,2})) + log(|C_{1,2}|).   (5.14)

The estimates σ̂²_0, σ̂²_{1,1}, σ̂²_{1,2} can be calculated following (5.12). The matrices C_0, C_{1,1}, C_{1,2} are the covariance matrices of a standard Brownian Bridge process at the corresponding time points.

Plugging (5.14) and (5.13) into (5.6), the LR statistic between M_{(0)} and M_{(1)} is

D = log(ς̂²/σ̂²_0) + (s − 1) log(ς̂²/σ̂²_{1,1}) + (T − s − 1) log(ς̂²/σ̂²_{1,2}).   (5.15)

A key simplification is the factorization of the matrix determinant, |C| = |C_0||C_{1,1}||C_{1,2}|. The proof is included in Appendix B.1.

With (5.15), we may further derive the joint distribution of the statistics for different choices of knot s under the null model M_{(0)}, denoted D_s, s = 2, 3, ..., T−2. The joint distribution Pr{D_2, D_3, ..., D_{T−2}} can then be used to calculate Pr{arg max_t D_t = s}, namely, the probability that knot s is selected as the best M_{(1)}. However, those probabilities become very complicated to derive in closed form and therefore we evaluate them by simulation. Figure 5.2 is the histogram of the selected s based on 1000 replicates of the BB with T = 100. This histogram is U-shaped and the s's near the two end points, s < 5 or s > 95, are more likely to be selected than the middle s's. This illustrates the tendency of model selection based on the LR statistic to result in short periods in the CHBB.

The exact form of D in (5.15) provides useful heuristics into this problem with the LR statistic. When one of the variance estimates σ̂²_0, σ̂²_{1,1}, σ̂²_{1,2} is close to zero, the metric becomes very large, and two very close knots are more likely to produce a small variance estimate, because the sample size is smaller.
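The U-shaped selection pattern of Figure 5.2 is easy to reproduce. The sketch below (our own code, under the stated null model) evaluates (5.15) by computing each quadratic form through the Markov factorization of the Brownian bridge, and also checks the small-sample behavior of the variance MLE that drives the problem; the replicate count is reduced relative to the thesis for speed:

```python
import numpy as np

rng = np.random.default_rng(0)

def quad_form(y, a, b):
    """(y - f)^T C^{-1} (y - f) for a standard Brownian bridge observed at
    integer times a..b with fixed end points, via the Markov factorization:
    y_t | y_{t-1}, y_b is again a one-step bridge."""
    q = 0.0
    for t in range(a + 1, b):
        m = y[t - 1] + (y[b] - y[t - 1]) / (b - t + 1)  # one-step cond. mean
        v = (b - t) / (b - t + 1)                       # one-step cond. variance
        q += (y[t] - m) ** 2 / v
    return q

def lr_stat(y, s, T):
    """The LR statistic D of (5.15) for the one-knot model M_(1)(s)."""
    varsigma2 = quad_form(y, 0, T) / (T - 1)
    chord = y[0] + s * (y[T] - y[0]) / T
    sig0 = (y[s] - chord) ** 2 / (s * (T - s) / T)      # single knot observation
    sig11 = quad_form(y, 0, s) / (s - 1)
    sig12 = quad_form(y, s, T) / (T - s - 1)
    return (np.log(varsigma2 / sig0)
            + (s - 1) * np.log(varsigma2 / sig11)
            + (T - s - 1) * np.log(varsigma2 / sig12))

T, picks = 100, []
for _ in range(200):
    # a standard Brownian bridge on 0..T: condition Brownian motion on w_T
    w = np.concatenate([[0.0], np.cumsum(rng.normal(size=T))])
    y = w - np.arange(T + 1) / T * w[T]
    picks.append(max(range(2, T - 1), key=lambda s: lr_stat(y, s, T)))

edge = float(np.mean([(s < 5) or (s > T - 5) for s in picks]))
print(edge)  # well above the ~0.06 expected under uniform knot selection

# the driver: with few observations the variance MLE is often tiny;
# under the null, sigma_hat^2_{1,1} is distributed as chi^2_{s-1} / (s - 1)
probs_small = [float(np.mean(rng.chisquare(s - 1, 100_000) / (s - 1) < 0.2))
               for s in (2, 5, 20)]
print(probs_small)  # decreasing in s
```

The decreasing tail probabilities in `probs_small` illustrate numerically the point made next: very short periods frequently produce a near-zero variance MLE, which inflates D through its log terms.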
For example, σ̂²_{1,1} is the MLE of the variance parameter of the BB conditional on y_0 and y_s, which has s − 1 observations. Under M_{(0)} with ς² = 1, σ̂²_{1,1} follows a gamma distribution with shape (s − 1)/2 and scale 2/(s − 1), as shown in Appendix B.2. As seen in Figure B.1, the probability Pr{σ̂²_{1,1} < ε} decreases with the increase of s, where 0 < ε < 1 is an arbitrary small number. Now suppose we have two candidate models, M_{(1)} and M′_{(1)}, with the same ς̂², σ̂²_0, σ̂²_{1,2}, but s = 2 in M_{(1)} while s = 20 in M′_{(1)}. It is more likely that M_{(1)} results in a smaller σ̂²_{1,1} and thus a larger LR statistic D. Our model selection procedure will then select M_{(1)}, which has a very short period, s = 2.

Figure 5.2: Histogram of the selected knot s based on the LR statistic for the one-knot CHBB model M_{(1)}. The data are generated from the null model M_{(0)} (a Brownian Bridge process with variance 1) and T = 100. The LR statistic is more likely to select a knot near the end points.

The sensitivity to small variance estimates is the main reason why model selection with the LR statistic tends to pick two knots that are very close to each other. This problem also carries over to the BFs, as the likelihood is a vital part of the BFs, especially the intrinsic and fractional BFs. For small periods, the likelihood is not very informative about σ², and thus different priors can heavily affect the BF. The KL divergence has a similar problem. The KL divergence for the above M_{(0)} and M_{(1)} is derived in Appendix B:

KL(M_{(0)} || M_{(1)}) = g(ς̂²/σ̂²_0) + (s − 1) g(ς̂²/σ̂²_{1,1}) + (T − s − 1) g(ς̂²/σ̂²_{1,2}),

where g(x) = x − log(x) − 1, with g(x) approaching infinity when x approaches either 0 or ∞. The KL divergence is also very big when any one of σ̂²_0, σ̂²_{1,1}, σ̂²_{1,2} is close to zero.

5.3 Reflections and future directions

The idea that led to the CHBB was inspired by the short-period updates in our Bayesian Melding approach in Chapter 4.
Our attempts may be viewed as trying to partition the GPS observations into different periods and to estimate a different variance parameter in each period. This might produce better predictions if we knew where to partition the GPS observations instead of selecting the partitions using the data. Yet, with such information, analyzing each period separately might work equally well. For the animal tracking data, it is possible to select the periods based on other information such as the diving records, or simply a visual examination of the data. Such selection methods are not practical for other spatial or spatio-temporal data sets, as there is usually no auxiliary information or it is extremely difficult to perform a visual examination. So it is necessary to have an automatic model selection procedure.

The problem with our current model selection procedure is that it tends to select short periods with small variance parameters. For the marine mammal tracking data, it results in too many behavior changing points. Also, we cannot tell the animal's different hidden behaviors from the variance parameters, because they are all small. This disagrees with our initial expectation that a CHBB would have periods with small variances and other periods with large variances, as in Figure 5.1. This might be fixed by designing informative priors on the σ²'s, or priors on the knot structures similar to Bayesian trees (Chipman, 1998; Gramacy and Lee, 2008). Another direction would be to discard all the metrics above and design a metric that encourages σ²_{1,1}, σ²_{1,2}, ... to be different. These ideas would require a lot of trial and error, which is part of our future work.

Another crucial difficulty in performing model selection for the CHBB or CHGP is that the number of models (i.e., different knot sets) is 2^n for data observed at n locations. Such a number is prohibitive if we try to enumerate all models to find the optimal model, even for moderate n.
We avoid this issue in this initial investigation by performing a sequential greedy search and not revisiting the added/deleted knots. Yet, there is no guarantee that such a greedy search would converge to the optimal model.

We start the study of the CHGP by working on the CHBB, because the BB seems to be a good model in our application to animal tracking and it has a simple mathematical form that enables us to derive many statistics explicitly. However, there is only one parameter, the variance parameter, in the BB. As seen in Section 8.2, the variance parameter estimates can be affected by model misspecification and the sampling rate, and might therefore not be a good indicator of changes in the conditional dependence. For future developments of the CHGP, we should start with a process that has correlation parameters and a predetermined knot structure.

Chapter 6

Generalized Ornstein–Uhlenbeck Process and Its Application in Animal Tracking

In Chapter 4, we showed that our Bayesian Melding (BM) approach with the Brownian Bridge process works well to combine the GPS observations and the DR path for accurate high-resolution track estimation. However, the constant variance σ²_H in the Brownian Bridge is not able to reflect the inhomogeneous features of the track. Attempting to change only σ²_H did not lead to a successful model in Chapter 5. In this chapter, we propose a generalized Ornstein–Uhlenbeck (GOU) process to replace the Brownian Bridge process. The GOU offers a flexible structure for both the mean and variance of the path. Unlike the "jumping" variances in the CHBB, we control the mean and covariance of the GOU to vary smoothly through our formulation based on B-splines, which also helps the GOU retain a reasonably parsimonious parameterization. In addition, a GOU is still a Gaussian Markov process, which enables efficient computation via the Kalman filter and smoother.

Our definition of the GOU can be easily modified for a multivariate process, but we focus on a univariate process in this chapter.
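As a toy illustration of why the Gaussian Markov structure matters computationally, the sketch below (our own code, not from the thesis) runs a Kalman filter on the exact one-step discretization of a plain Ornstein–Uhlenbeck process, the special case of the GOU with constant coefficients; all parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
rho, mu, sigma, tau = 1.0, 5.0, 1.0, 0.5   # tau: observation noise sd
dt, n = 0.1, 200

# exact one-step transition of the OU state: eta_{k+1} = F eta_k + c + w_k
F = np.exp(-rho * dt)
c = mu * (1 - F)
Q = sigma**2 / (2 * rho) * (1 - F**2)

# simulate a latent path and noisy observations of it
eta = np.empty(n)
eta[0] = 10.0
for k in range(1, n):
    eta[k] = F * eta[k - 1] + c + rng.normal(0.0, np.sqrt(Q))
y = eta + rng.normal(0.0, tau, n)

# scalar Kalman filter with an identity observation equation
m, P, ms = 10.0, 0.0, []
for k in range(n):
    if k > 0:
        m, P = F * m + c, F * P * F + Q        # predict
    K = P / (P + tau**2)                       # Kalman gain
    m, P = m + K * (y[k] - m), (1 - K) * P     # update
    ms.append(m)

rmse_raw = float(np.sqrt(np.mean((y - eta) ** 2)))
rmse_filt = float(np.sqrt(np.mean((np.array(ms) - eta) ** 2)))
print(rmse_filt < rmse_raw)  # filtering improves on the raw observations
```

The whole pass is linear in the number of time points, which is the property the chapter relies on when scaling to long tracks; a smoother pass (not shown) reuses the same quantities backwards in time.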
The rest of this chapter is organized as follows. Section 6.1 introduces the definition of the GOU process and its properties. Its application in our Bayesian Melding framework for combining the GPS observations and DR path for animal tracking is discussed in Section 6.2.

6.1 Generalized Ornstein–Uhlenbeck process

As reviewed in Section 3.2, only a few types of SDEs have closed-form solutions, and the Ornstein–Uhlenbeck (OU) process is one of them,

dη(t) = ρ(µ − η(t)) dt + σ dW(t),

with σ > 0. The initial value η_0 = η(0) is considered fixed in our study. Using Itô's lemma (see e.g., Arnold, 1974), the solution of the above SDE is

η(t) = η_0 e^{−ρt} + µ(1 − e^{−ρt}) + σ ∫_0^t e^{−ρ(t−u)} dW(u).

In addition, {η(·)} can be shown to be a Gaussian process, whose mean and covariance are

E η(t) = η_0 e^{−ρt} + µ(1 − e^{−ρt}),
Cov(η(s), η(t)) = (σ²/(2ρ)) e^{−ρ(s+t)} (e^{2ρ min(s,t)} − 1).

If ρ > 0, η(t) has a stationary distribution N(µ, σ²/(2ρ)) and exhibits mean-reverting behavior: when η(t) > µ, η(t) is expected to decrease towards µ; when η(t) < µ, η(t) is expected to increase towards µ. This behavior is illustrated in Figure 6.1 with two different initial values, η_0 = 0 and 10, and µ = 5. Notice that the condition ρ > 0 is only required to ensure the existence of a stationary distribution and the mean-reverting behavior; the definition of the OU process does not require ρ > 0. When ρ < 0, σ²/(2ρ) < 0 and e^{2ρ min(s,t)} < 1 still result in a positive variance. In this case, the mean and variance diverge to infinity as t → ∞ when µ − η_0 ≠ 0. When ρ = 0, lim_{ρ→0} (e^{2ρ min(s,t)} − 1)/(2ρ) = min(s,t) and the OU process reduces to a Brownian Motion process with variance σ².

The OU process has desirable properties and therefore has been widely used in financial models for interest rates and stock prices (see e.g., Chan et al., 1992). Zhu et al. (2011a,b) applied the OU and integrated OU processes to model longitudinal data.
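The stated mean and covariance can be checked numerically. The sketch below (our own code) compares an Euler–Maruyama simulation of the OU SDE against the closed forms, using the Figure 6.1 setting ρ = 1, σ = 1, µ = 5 with η_0 = 10:

```python
import numpy as np

rng = np.random.default_rng(42)
rho, mu, sigma, eta0 = 1.0, 5.0, 1.0, 10.0
T_end, dt, n_paths = 5.0, 0.005, 4000
n_steps = int(T_end / dt)

# Euler–Maruyama simulation of d eta = rho * (mu - eta) dt + sigma dW
eta = np.full(n_paths, eta0)
for _ in range(n_steps):
    eta += rho * (mu - eta) * dt + sigma * np.sqrt(dt) * rng.normal(size=n_paths)

# closed forms for E eta(t) and Var eta(t) = Cov(eta(t), eta(t)) from the text
mean_closed = eta0 * np.exp(-rho * T_end) + mu * (1 - np.exp(-rho * T_end))
var_closed = sigma**2 / (2 * rho) * (1 - np.exp(-2 * rho * T_end))
print(eta.mean(), mean_closed)   # both near 5.03
print(eta.var(), var_closed)     # both near 0.50
```

At t = 5 the process has nearly forgotten the initial value 10 and the sample moments match the closed forms up to Monte Carlo and discretization error, consistent with the stationary distribution N(µ, σ²/(2ρ)).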
But the mean function of the OU process is the exponential curve η_0 e^{−ρt} + µ(1 − e^{−ρt}), as shown by the dashed curves in Figure 6.1, which is not sufficiently flexible for modeling the tracks.

Figure 6.1: Examples of the Ornstein–Uhlenbeck process with ρ = 1, σ = 1, µ = 5 and different initial values η_0. The mean curves are plotted with dashed lines, while the realizations are plotted with solid lines.

A natural extension of the OU process is the linear SDE discussed in Chapter 8 of Arnold (1974),

  dη(t) = (A(t)η(t) + a(t)) dt + σ(t) dW(t). (6.1)

Notice that the linear SDE reduces to an OU process when A(t) = −ρ, a(t) = ρµ, σ(t) = σ. When A(t) = −1/(1 − t), a(t) = 0, and σ(t) = 1, the linear SDE yields a Brownian Bridge process defined on [0, 1] (see, e.g., Chapter IV of Rogers and Williams, 2000). In general, A(t), a(t), σ(t) need to satisfy a few mild conditions listed in Section 3.2 to guarantee the existence and uniqueness of the solution. The mean and covariance functions of the linear SDE are derived in Section 6.1.1. We consider only the linear SDE rather than other types of SDE, as the linear SDE is often used as an approximation basis for many other types of SDE (Ozaki, 1992; Vrettas et al., 2015). So even if we modeled with a non-linear SDE, we might be forced to use the linear one as an approximation in inference and prediction, unless we considered particle-based Monte Carlo methods such as sequential Monte Carlo (Del Moral et al., 2006) or particle Markov chain Monte Carlo (Andrieu et al., 2010), which are computationally intensive.

Although the linear SDE has been studied extensively in probability theory, we found few applications of this process beyond the OU special case. In our study, we investigate only the special case in which σ(t) is a constant function while A(t) and a(t) are modeled with basis functions,

  A(t) = Σ_{j=1}^{J_A} γ^{(A)}_j B′_{j,k_A}(t), (6.2)
  a(t) = Σ_{j=1}^{J_a} γ^{(a)}_j B_{j,k_a}(t). (6.3)
Both A(t) and a(t) have equally spaced knots in the time domain. We choose B-spline basis functions for this initial investigation, but other differentiable basis functions, such as other splines or Fourier bases, may well be suitable. Notice that A(t) is modeled by the first derivatives of the basis functions. As reviewed in Section 3.4, the first derivative of a B-spline(J, k) can be represented by a B-spline(J − 1, k − 1) with one less degree of freedom; modeling A(t) via B′_{j,k_A}(t), j = 1, 2, ..., J_A − 1, is therefore equivalent to using B_{j,k_A−1}(t), j = 1, 2, ..., J_A − 1. As explained later, using B′_{j,k_A}(t) helps to simplify the evaluation of the linear SDE. We fix γ^{(A)}_{J_A} = 0 in this investigation to ensure the identifiability of γ^{(A)}_j, j = 1, 2, ..., J_A − 1.

Modeling A(t) and a(t) through B-splines allows the process to have flexible (non-stationary) mean and variance functions, and we can easily control the level of flexibility by tuning the numbers of basis functions J_A, J_a. For notational simplicity, we call such a linear SDE with coefficients modeled by basis functions a "generalized Ornstein–Uhlenbeck process" (GOU). As discussed later, the GOU remains a Gaussian Markov process whose computation can be handled efficiently with sparse matrix or Kalman smoother techniques.

6.1.1 Mean and covariance of GOU: our model choice

As seen in Chapter 8 of Arnold (1974), the solution of (6.1) can be written in the following form:

  η(t) = Ψ(t) (η_0 + ∫_0^t Ψ(u)^{−1} a(u) du + ∫_0^t Ψ(u)^{−1} σ(u) dW(u)),

where

  Ψ(t) = exp{∫_0^t A(u) du} = exp{Σ_{j=1}^{J_A} γ^{(A)}_j (B_{j,k_A}(t) − B_{j,k_A}(0))}. (6.4)

Notice that Ψ(u)^{−1} = 1/Ψ(u), not the inverse function of Ψ(u). When A(t), a(t) satisfy the conditions discussed in Section 3.2, which can be met by simply constraining the spline coefficients to be bounded in our formulation (6.2) and (6.3), the above solution is a Gaussian process with mean function

  m(t) = E η(t) = Ψ(t) (η_0 + ∫_0^t Ψ(u)^{−1} a(u) du), (6.5)

and covariance function

  K(s, t) = Cov(η(s), η(t)) = Ψ(s) [∫_0^{min(s,t)} σ(u)² Ψ(u)^{−2} du] Ψ(t). (6.6)
In the above mean and covariance functions, we notice that the function A first needs to be integrated to calculate Ψ(t), which in turn needs to be integrated again for the mean or covariance function. Modeling A(·) with a function whose integral can be calculated in closed form, such as a B-spline or its first derivative, helps us avoid calculating both layers of integrals numerically. We choose to use the first derivative because its integral, namely the B-spline itself, can be easily calculated in software packages such as the R package "fda" (Ramsay et al., 2014). This helps us simplify the implementation of the GOU. If we chose to model A with a B-spline instead, the integral expression would need to be derived manually.

The covariance function of the GOU in (6.6) depends on both Ψ(t) and σ(t). To allow a dynamically changing variance in the GOU, it is sufficient to let one of A(t) and σ(t) vary; σ(t) = σ is fixed as a constant hereafter to avoid an overly complex parameterization of the GOU.

6.1.2 Conditional distribution of GOU

As seen in Chapter 8 of Arnold (1974), the conditional distribution of η(t) given η(s) = x, t > s, is Gaussian with mean and variance

  E(η(t) | η(s) = x) = Ψ(t) ∫_s^t Ψ(u)^{−1} a(u) du + x exp(∫_s^t A(u) du), (6.7)
  Var(η(t) | η(s) = x) = σ² Ψ(t)² ∫_s^t Ψ(u)^{−2} du. (6.8)

Notice that the conditional mean of η(t) given η(s) is a linear function of η(s), which enables us to express the GOU process at a finite set of time points T = {t_0 = 0, t_1, t_2, ..., t_n} as the state model of a dynamic linear model (DLM) (3.15). For expository simplicity, define θ_i = η(t_i) − E η(t_i) and θ_{i−1} = η(t_{i−1}) − E η(t_{i−1}), i.e., we first remove the mean of the GOU so that θ_i has mean zero.
Substituting t_i, t_{i−1} into (6.7) and (6.8), the conditional distribution of θ_i given θ_{i−1} = x is

  E(θ_i | θ_{i−1} = x) = x exp(∫_{t_{i−1}}^{t_i} A(u) du),
  Var(θ_i | θ_{i−1} = x) = σ² Ψ(t_i)² ∫_{t_{i−1}}^{t_i} Ψ(u)^{−2} du.

This is equivalent to the following linear transformation from θ_{i−1} to θ_i:

  θ_i = G_i θ_{i−1} + u_i,  u_i ∼ N(0, C_i), (6.9)

with

  G_i = exp{∫_{t_{i−1}}^{t_i} A(u) du} = exp{Σ_{j=1}^{J_A} γ^{(A)}_j (B_{j,k_A}(t_i) − B_{j,k_A}(t_{i−1}))},
  C_i = σ² Ψ(t_i)² ∫_{t_{i−1}}^{t_i} Ψ(u)^{−2} du.

Equation (6.9) will be used in the next section as part of a DLM representation of our Bayesian Melding approach.

6.1.3 Sparse precision matrix of GOU

According to Arnold (1974), the solution to the linear SDE satisfying the conditions in (3.12) and (3.13) is a Gaussian Markov process. Our GOU process is a special case of the linear SDE and thus a Gaussian Markov process. As seen in Rue and Held (2005), a realization of a Gaussian Markov process at any finite collection of points has a multivariate Gaussian distribution whose precision matrix is sparse. We can derive its covariance and precision matrices as follows. For a finite set of time points T = {t_0 = 0, t_1, t_2, ..., t_n}, t_i < t_j for all i < j, we denote the random variable at these time points by η_i = η(t_i). From (6.6), the covariance of η_i, η_j, i < j, is

  Cov(η_i, η_j) = K(t_i, t_j) = h_i h_j Σ_{k=1}^{i} d_k,

where

  d_i = ∫_{t_{i−1}}^{t_i} Ψ(u)^{−2} du,  h_i = σ Ψ(t_i).

Now define the diagonal matrices

  D = Diag({d_1, d_2, ..., d_n}),  H = Diag({h_1, h_2, ..., h_n}),

and let L be the lower-triangular matrix of 1's shown below. The inverse of L is a sparse matrix in which only the diagonal and lower off-diagonal elements are nonzero:

  L =
     1 0 0 ... 0
     1 1 0 ... 0
     1 1 1 ... 0
     . . .
     1 1 1 ... 1

  L^{−1} =
     1  0  0 ...  0
    −1  1  0 ...  0
     0 −1  1 ...  0
     . . .
     0  0 ... −1  1

Then the covariance and precision matrices for the random vector η = (η_1, ..., η_n)^T are

  Σ = H L D L^T H,
  Q = Σ^{−1} = H^{−1} L^{−T} D^{−1} L^{−1} H^{−1}. (6.10)

The precision matrix is the product of three diagonal matrices and two bidiagonal matrices, and is therefore tri-diagonal:

  Q_{i,i} = h_i^{−2} (d_i^{−1} + d_{i+1}^{−1}),  i < n,
  Q_{i,i+1} = Q_{i+1,i} = −h_i^{−1} h_{i+1}^{−1} d_{i+1}^{−1},  i < n,
  Q_{n,n} = h_n^{−2} d_n^{−1},
  Q_{i,j} = 0,  |i − j| > 1.

6.1.4 Implementation of GOU in practice

We discussed above two properties of the GOU process that enable its efficient computation. With the conditional distribution and the DLM representation (6.9), the Kalman filter and smoother discussed in Section 3.3 can be used to calculate the likelihood and posterior for the GOU. We can also consider the sparse matrix techniques discussed in Section 3.1.1, as the GOU process has a tri-diagonal precision matrix. A natural question is which approach is more efficient.

In theory, both the sparse matrix and the Kalman filter/smoother approaches, if implemented properly, have linear complexity O(n) in evaluating the likelihood or posterior. Notice that the sparse matrix approach usually has super-linear complexity, i.e., larger than O(n), because of the Cholesky decomposition step (Davis, 2006); but for our GOU process, the Cholesky decomposition can be calculated explicitly in linear time from (6.10).

Although the two approaches have the same theoretical complexity, they still differ in practice depending on the implementation and the programming language used. We implemented the DLM approach for the GOU with the R package "dlm" (Petris et al., 2009) and the sparse matrix approach with the R package "Matrix" (Bates and Maechler, 2016). The "dlm" package offers Kalman filter/smoother implementations in both the R and Fortran languages. A simulation was used to compare the three implementations, DLM with R (DLM-R), DLM with Fortran (DLM-F), and sparse matrix (SparseM), in terms of the computation time for evaluating different likelihoods involving the GOU.
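As a sanity check on the closed-form precision matrix, the following Python sketch (illustrative only; our actual implementation uses R, and the function names and test values here are hypothetical) builds Q from the tri-diagonal entries above, builds Σ directly from the covariance formula, and verifies numerically that QΣ equals the identity on a small example.

```python
def gou_precision(d, h):
    """Tri-diagonal precision matrix Q of a GOU realization, given
    d_i = integral of Psi(u)^-2 over (t_{i-1}, t_i] and h_i = sigma*Psi(t_i),
    using the closed-form entries of Q (0-indexed here)."""
    n = len(d)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i < n - 1:
            Q[i][i] = (1 / d[i] + 1 / d[i + 1]) / h[i] ** 2
            Q[i][i + 1] = Q[i + 1][i] = -1 / (h[i] * h[i + 1] * d[i + 1])
        else:
            Q[i][i] = 1 / (h[i] ** 2 * d[i])   # last diagonal entry
    return Q

def gou_covariance(d, h):
    """Direct covariance: Cov(eta_i, eta_j) = h_i*h_j*sum_{k<=min(i,j)} d_k."""
    n = len(d)
    return [[h[i] * h[j] * sum(d[: min(i, j) + 1]) for j in range(n)]
            for i in range(n)]

# Arbitrary positive d_i, h_i standing in for the B-spline-driven quantities.
d = [0.5, 1.2, 0.8, 2.0]
h = [1.0, 0.7, 1.5, 0.9]
Q, S = gou_precision(d, h), gou_covariance(d, h)
# The product Q @ S is (numerically) the 4x4 identity matrix.
```

The check works for any positive d_i and h_i because the tri-diagonal entries are exactly the expansion of H^{−1} L^{−T} D^{−1} L^{−1} H^{−1}.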
We only consider the likelihood evaluation because it is the most frequently used calculation in our applications; i.e., the likelihood is evaluated many times in the numerical optimization for the posterior mode and Hessian matrix.

The first likelihood in this simulation was that of a realization (free of observation noise) of the GOU at its true parameters. We also considered our Bayesian Melding (BM) model with the GOU, introduced in the next section, which can be viewed as the GOU process with measurement error and other bias components. We considered numbers of time points T ∈ {100, 200, 500, 1000}, close to the number of GPS time points in our real data. Multiple parameter settings were considered, but the conclusions were invariant to these settings, so we report only on one setting copied from a real data fit. Table 6.1 summarizes the total computation time for 100 replicates of these likelihood evaluations.

Table 6.1: Computation times (in seconds) for three different approaches to evaluating the likelihood of models involving the GOU. T is the number of time points.

         |   Likelihood of GOU     |  Likelihood of BM with GOU
  T      | DLM-R  DLM-F  SparseM   | DLM-R  DLM-F  SparseM
  100    |  1.89   0.28    0.73    |  1.69   0.32    2.44
  200    |  3.41   0.51    0.96    |  3.23   0.45    2.80
  500    |  7.59   0.50    1.25    |  7.41   0.67    3.40
  1000   | 15.31   0.96    2.00    | 14.66   1.13    5.40

From Table 6.1, we can see that the DLM with Fortran is the clear winner among the three approaches. Its advantage is moderate for small T and the likelihood of the GOU alone, but quite remarkable for large T and the complex BM model: it can be more than 10 times faster than DLM-R and approximately 5 times faster than the sparse matrix implementation when T = 1000. We must emphasize that Table 6.1 compares the software packages available to us for implementing our models; the advantage of the DLM with Fortran may be mainly due to its use of a faster language.
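To make the Kalman-filter likelihood evaluation concrete, here is a minimal scalar sketch in Python (the thesis uses the R package "dlm"; the function name and interface here are hypothetical). It computes the prediction-error-decomposition log-likelihood for a state model of the form (6.9) with one noisy observation per state, skipping missing (NA) observations, in O(n) time.

```python
import math

def kalman_loglik(y, G, C, var_obs, m0=0.0, P0=0.0):
    """Log-likelihood of observations y under the scalar DLM
      theta_i = G[i]*theta_{i-1} + N(0, C[i]),  y_i = theta_i + N(0, var_obs).
    Entries of y equal to None are treated as missing and skipped,
    mirroring how a Kalman filter marginalizes NA observations."""
    m, P, ll = m0, P0, 0.0
    for i, obs in enumerate(y):
        # Predict one step ahead through the state equation.
        m, P = G[i] * m, G[i] ** 2 * P + C[i]
        if obs is None:
            continue
        # Innovation obs - m has variance S; accumulate its Gaussian log-density.
        S = P + var_obs
        ll += -0.5 * (math.log(2 * math.pi * S) + (obs - m) ** 2 / S)
        # Measurement update.
        K = P / S
        m, P = m + K * (obs - m), (1 - K) * P
    return ll
```

Each time point costs a constant number of scalar operations, which is the O(n) behavior referred to above; the sparse matrix route achieves the same order only because the Cholesky factor of the tri-diagonal Q is available in closed form.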
Another advantage of using the DLM implementation is that the same code works for T = 100 and for large T > 500,000. If we were to use the sparse matrix approach, we would need to separate such a long sequence into short pieces, due to memory issues, and update them separately. That idea would require a large amount of derivation and programming, similar to what we did in Appendix A for the BM models with the BB. The DLM implementation requires less detailed programming and thus becomes the only implementation we consider hereafter.

6.2 Application to Bayesian Melding for an animal's track

We consider using the GOU process to replace the Brownian Bridge process in our Bayesian Melding approach to combine the GPS observations {Y(t_k)} and the DR path {X(t)}, namely,

  η(0 : T) ∼ GOU(γ^{(A)}, γ^{(a)}, σ), (6.11)
  Y(t_k) | η(t_k) ∼ N(η(t_k), σ²_G) independently, k = 0, 1, ..., K, (6.12)
  X(t) = η(t) + h(t) + ξ(t). (6.13)

As in Section 4.2, we consider a set of equally spaced time points T = {0, 1, 2, ..., T}, but our model works for irregularly spaced time points as well. X(t) denotes the DR path at time t, and Y(t_k) denotes the kth GPS observation, where {t_k, k = 0, 1, ..., K} is a subset of T. We consider the random bias process ξ(t) to be a Brownian Motion process with unknown variance σ²_D. The measurement error in the GPS observations has a fixed variance σ²_G = 0.0625. The systematic bias h(t) in the DR path is considered to be an unknown constant β, which is the best model according to our model selection analyses in Section 4.5.2. In addition, γ^{(A)} = {γ^{(A)}_1, γ^{(A)}_2, ..., γ^{(A)}_{J_A−1}} and γ^{(a)} = {γ^{(a)}_1, γ^{(a)}_2, ..., γ^{(a)}_{J_a}} denote all the unknown spline coefficients for A(t) and a(t) (γ^{(A)}_{J_A} = 0 being fixed). We represent all the unknown hyper-parameters with φ = {γ^{(A)}, γ^{(a)}, σ, σ²_D, β}. The priors for the hyper-parameters are

  [γ^{(A)}] ∝ 1, [γ^{(a)}] ∝ 1, [β] ∝ 1, [log(σ)] ∝ 1, [log(σ²_D)] ∝ 1.

6.2.1 DLM of the BM approach

As seen in (6.9), the zero-mean GOU can be written as the state model of a DLM. Similarly, Brownian Motion over the discrete time set T = {0, 1, 2, . . .
, T} can be formulated as

  ξ_t = ξ_{t−1} + ε_t,  ε_t ∼ N(0, σ²_D) independently,

where ξ_t = ξ(t) for notational simplicity. Given the hyper-parameter vector φ, we first remove the means from the observations:

  GPS: Y*(t) = Y(t) − m(t),
  DR path: X*(t) = X(t) − m(t) − h(t),

where m(t) is the mean of the GOU process as in (6.5). Then define O_i = (X*(t_i), Y*(t_i))^T and ζ_i = (η(t_i) − m(t_i), ξ_i)^T. Time points at which a GPS observation is not available are coded as "not available" (NA), which is automatically marginalized in the Kalman filter/smoother.

The Bayesian Melding model with the GOU (BM-GOU hereafter) is then written as

  O_i = F_i ζ_i + v_i,  v_i ∼ N_m(0, V_i),
  ζ_i = G_i ζ_{i−1} + u_i,  u_i ∼ N_p(0, U_i),

with

  F_i = [1 0; 1 1],
  G_i = [exp(∫_{t_{i−1}}^{t_i} A(u) du) 0; 0 1],
  V_i = [σ²_G 0; 0 0],
  U_i = [C_i 0; 0 σ²_D],

where C_i is as defined after (6.9) and rows of each matrix are separated by semicolons. With this state-space representation of our DLM, the Kalman smoother can be used to evaluate the posterior distributions

  [φ | X_G, Y],  [η | φ, X, Y],

where the notation is the same as in Section 4.2: X for the DR path, X_G for the DR path at the GPS time points, and Y for the GPS observations.

As in Section 4.2, the marginal posterior [η | X, Y] is approximated by

  [η | X, Y] = ∫ [η | φ, X, Y] [φ | X, Y] dφ ≈ ∫ [η | φ, X, Y] [φ | X_G, Y] dφ, (6.14)

where the integral in the second line is calculated via the numerical integration discussed in Appendix A.6. One may notice that the dimension of the hyper-parameters is much larger with the GOU than with the Brownian Bridge, which leads to a huge number of grid points if we perform a grid search in the direction of each eigenvector. For example, J_A = 3, J_a = 3 already leads to eight hyper-parameters (two γ^{(A)}_j's, three γ^{(a)}_j's, σ, σ²_D, and β), and 5 points in each dimension results in 5^8 = 390,625 total grid points.
We avoid this issue by searching only in the directions of the first few eigenvectors, which explain more than 95% of the variation in the hyper-parameters. This is acceptable because the purpose of the grid is to explore the likelihood surface.

6.2.2 Model selection for BM-GOU and comparison with other approaches

As in Section 4.2, we use leave-5-out cross validation (L5OCV) to perform model comparison and selection. For the B-spline models of A(t) and a(t) used in the GOU, we considered B-splines of order k_a, k_A in {2, 3, 4} (piecewise linear to cubic) and numbers of basis functions J_a, J_A in {k_a (k_A) + 1, ..., 6}, i.e., at least one knot besides the end points. This results in 81 models in total. Due to the large number of models and the complexity of evaluating the GOU, we screened the models based on the CV-RMSE of the empirical Bayes posterior

  [η | X, Y] ≈ [η | φ̂, X, Y],

where φ̂ is the posterior mode of φ. For the ten models with the smallest CV-RMSE under the empirical Bayes posterior, we further calculated the fully Bayesian posterior (6.14), whose CV-RMSE was used to decide the final model.

We report the model with the smallest CV-RMSE in the fully Bayesian version in Table 6.2, with its model specification (J_A, k_A, J_a, k_a) and its cross-validation performance. To compare the BM-GOU with the models we developed in Chapter 4, Table 6.2 also includes the CV-RMSE and coverage percentages from the selected Bayesian Melding model with the Brownian Bridge (BM-BB for short) and the downscaling model, which were shown before as the last two columns of Table 4.4.

Table 6.2: The best GOU model, RMSEs, and actual coverage percentages of 95% credible intervals (the second number in each pair) for the Bayesian Melding approach with the GOU process (BM-GOU). The CV-RMSE and coverage from the Bayesian Melding with the Brownian Bridge (BM-BB) and the downscaling approach (DS) are also included for comparison.

                   |    Best GOU      |  CV-RMSE and coverage percentages
                   | J_A k_A J_a k_a  | BM-GOU      BM-BB       DS
  Trip 1 Northing  |  5   2   6   4   | 0.80 95.2   0.80 94.9   0.95 95.6
  Trip 1 Easting   |  6   4   4   2   | 0.68 97.4   0.75 97.8   0.75 98.9
  Trip 2 Northing  |  6   2   6   3   | 1.83 93.8   3.06 93.0   2.59 93.0
  Trip 2 Easting   |  3   2   5   3   | 2.58 93.0   2.62 96.9   2.56 98.4

From the first four numeric columns of Table 6.2, the best models for the four data sets have J > 4 in at least one of A(t) and a(t), so that we have a flexible component for modeling the animal's track. This shows that GOU modeling can be useful in predicting the animal's track. When compared to the BM-BB and downscaling approaches, the BM-GOU performs at least as well. In particular, the BM-GOU is better than the BM-BB in three out of the four data sets, and they are tied for Trip 1 Northing. In Trip 2 Northing, the BM with the GOU manages to reduce the CV-RMSE by 40% when compared to the BM-BB. The actual coverage percentages of the credible intervals for the BM-GOU are also close to the nominal level.

To further compare the BM-GOU and BM-BB, we plot the posterior means and credible intervals from these two approaches for the Trip 2 Northing data set in Figure 6.2. Similar plots were obtained for the other data sets and are therefore omitted. As with the BM-BB, the posterior mean of the BM-GOU lies close to the GPS observations and indicates a good bias correction of the DR path.

The close-up of the middle part of this trip provided in Figure 6.3 shows that the
The BM-GOU borrowsstrength from neighboring DR observations to reduce the amount of uncertainty inthe prediction. This results in the narrower CIs as in Table 6.3, which summarizesthe average CIs width in the four data sets. Combining the results in Table 6.2and 6.3, we find that the BM-GOU provides narrower CIs than the BM-BB withoutsacrificing the coverage percentage, another advantage of the BM-GOU over BM-BB.Jul 21 Jul 23 Jul 25 Jul 27−2002040TimeNorthing (KM)llllllllll l lllllll ll l l lllllllll llllllllllll lllllllllllllllllllllllllllll lllllllllllllll GPS ObservationsBM−BB Posterior MeanBM−BB Posterior 95% CIBM−GOU Posterior MeanBM−GOU Posterior 95% CIFigure 6.2: The posterior mean and 95% credible intervals from BayesianMelding with GOU and Brownian Bridge for Trip 2 Northing data set.In summary, based on our cross–validation study, we find that the flexible meanand covariance structure of the GOU process can improve the prediction power of90Thu 21:00 Fri 07:00 Fri 17:00 Sat 03:0020304050TimeNorthing (KM)l llll l llllll ll llllll l ll l l l l lll GPS ObservationsBM−BB Posterior MeanBM−BB Posterior 95% CIBM−GOU Posterior MeanBM−GOU Posterior 95% CIFigure 6.3: The posterior mean and 95% credible intervals from BayesianMelding with GOU and Brownian Bridge for a period Trip 2 Northingdata set.Bayesian Melding approach. In addition, the model selection for GOU is mucheasier than that for the CHBB in Chapter 5, as the number of models for GOUis controlled by the number of basis functions in the splines while the CHBB cre-ates 2n models (n is the number of observations). This is a merit of modelingnon-stationarity in a continuous fashion over the discrete fashion as in the CHBB.However, we studied the posterior for γ (A),γ (a)h and did not find much useful in-formation that can help to further interpret the animal’s track. 
This motivates us to consider other forms of SDEs in the next chapter.

Chapter 7

Potential Field Modeling in Tracking

In the previous chapter, we demonstrated that SDEs can improve the prediction of our animals' spatial tracks when compared to the Brownian Bridge. In this chapter, we introduce more features into SDE modeling so that the fitted SDE can also be used to interpret the spatial tracks.

We work with a special form of the SDE whose drift term is formulated as the negative gradient of a function H:

  dr(t) = −∇H(r(t), t; β) dt + Σ^{1/2}(r(t), t; β) dW(t), (7.1)

where r(t) denotes the location on the track at time t in R^d. The error term W(t) is a d-dimensional standard Brownian Motion with independent increments in each dimension, and Σ(r(t), t) is a covariance matrix; we fix it to the diagonal matrix Σ(r(t), t) = σ² I_d for the following two case studies.

The key feature of (7.1) lies in the drift term −∇H(r(t), t). The function H(·, ·; β) can vary with location and time and depends on other parameters β. Its gradient ∇H = (∂H/∂x, ∂H/∂y, ...)^T determines the negative expected velocity of the object, so that the object is expected to visit locations with low values of H. This formulation comes from physics originally: imagine that H represents the height of a hill and a ball is dropped on top of this hill. Obviously, the ball will move down the hill, in the direction of locations with lower values of H. Brillinger et al. (2004), Brillinger et al. (2007), and Brillinger and Stewart (2010) introduced this approach for modeling the movement of various objects, such as elk, monk seals, and a soccer ball. Russell et al. (2016) extended the potential field model by including a semi-parametric potential surface and a separate motility surface. Hooten et al. (2017) reviewed similar models developed for animal telemetry data. The function H is called the potential field (PF) in these papers, as H(r, t) represents the potential for the object to move to location r at time t.
A lower value of H indicates a higher likelihood of being visited.

The PF function H offers a versatile interface for modeling the tracks. In addition to time and location, it can also include covariates, such as the locations of other animals (Brillinger et al., 2007) or a map of an ocean's temperature or salinity. The coefficients of the covariates help us understand how an object interacts with others or with the environment. Even without covariates, the PF can still offer an intuitive map highlighting attractive areas in the field of movement, as will be shown later in this chapter.

As discussed in Section 3.2, the solution to (7.1) may not have a closed form. We use the Euler approximation for inference:

  r(t_{k+1}) − r(t_k) ≈ −∇H(r(t_k), t_k; β)(t_{k+1} − t_k) + Σ^{1/2}(r(t_k), t_k; β) √(t_{k+1} − t_k) ε_k,

where ε_k, k = 0, 1, 2, ..., K, are d-dimensional white noise terms (independent in each dimension). After this approximation, the parameter β can be estimated via maximum likelihood. When the PF H(·, ·; β) is a linear function of β and the diffusion matrix Σ^{1/2}(·, ·; β) is a constant that depends on neither the location r nor the coefficients β, the maximum likelihood estimate can be calculated by least squares directly.

In this chapter, we further develop the PF approach for two applications: the tracks of basketball movement in National Basketball Association (NBA) games and the foraging paths of northern fur seals, discussed in Sections 7.1 and 7.2 respectively.

7.1 Potential field for learning basketball strategies

Since the beginning of the 2013-2014 season, the NBA has used SportVU technology (http://www.stats.com/sportvu/sportvu-basketball-media) to track the movement of the players and the basketball in a nearly continuous manner. In each NBA arena, six high-speed cameras have been installed to record images of each game at a rate of 25 readings per second. These images are then processed by specialized software to obtain the players' and ball's locations at the same rate.
This technology has already yielded an array of informative statistics such as speed, distance, and touches. All these extend the ability to summarize a player's or team's performance, thereby allowing teams to assess and optimize strategy. For example, this technology was used recently to help Duke University win the 2014-2015 NCAA basketball championship (http://www.stats.com/press-releases/duke-utilizes-stats-sportvu-player-tracking-technology-championship-season/).

The player tracking data encode a vast amount of information for each game and have sparked a lot of innovative statistical research. For example, Cervone et al. (2014) modeled basketball possessions (a possession is the period when a player or a team is in control of the basketball) using a Markov chain and derived the expected score from a possession to evaluate the efficiency of players' decisions. Cervone et al. (2016) further studied the impact of space occupation via hierarchical spatial models on both the offensive and defensive ends (i.e., expected points scored or defended resulting from the space occupation). Hidden Markov models and log-Gaussian Cox processes were recently used to quantify players' defensive skills and effectiveness in Franks et al. (2015). McIntyre et al. (2016) developed a multinomial logistic regression model to identify ball screens (a screen is a basketball tactic in which one player uses his body to block the defenders and thus create open space for a teammate) and then applied the identified screens to performance assessment. Neural network models were used to classify different types of offensive plays in Wang and Zemel (2016).

Many useful new statistics have arisen from the studies cited above to quantify players' or teams' performance, but few of them have aimed at learning a team's or player's playing strategies directly from the massive tracking data. The supervised learning approaches developed in McIntyre et al.
(2016) as well as Wang and Zemel (2016) can learn certain types of plays, but they require human-annotated data sets to train the classifiers (logistic regression or neural networks), which can be hard to obtain and are prone to labeling errors. It is also difficult for models learned in this fashion to adapt to the new types of plays created in the fast-evolving NBA game. This leads us to consider an unsupervised learning approach for the massive NBA tracking data set, such that play strategies can be learned and visualized without human-annotated training data sets. Our PF approach is designed as such an unsupervised learning approach.

In this section, we illustrate our approach with the NBA tracking data from the seven-game series between the Houston Rockets (HOU) and the Los Angeles Clippers (LAC) in the 2015 NBA Western Conference semi-finals. The raw player tracking data were downloaded from http://stats.nba.com via Python (the data used in our study were downloaded in September 2015 and are unfortunately no longer available to the public) and further processed in R (R Development Core Team, 2008). The raw tracking data contain the teams, the player names and jersey numbers, the game clock (the time remaining in the quarter), the shot clock (NBA rules specify that the offensive team must shoot within 24 seconds from the start of the play; this clock indicates how many seconds are left to shoot the basketball), and the movements of the ball and the players on the court. We further supplemented these data with the "play-by-play" record from http://nba.com in order to extract the time periods when a certain team (HOU or LAC) was in possession of the ball. Two different PF models are introduced below for different analytical purposes.

7.1.1 Potential field as a tensor product spline over the space

First, we consider the PF as a function over the basketball court, such that some areas of the court can be more attractive to the ball than others. This spatial function is formulated as a tensor product spline (De Boor et al., 1978),

  H(r(t), t; β) = Σ_{j=1}^{K} Σ_{k=1}^{K} β_{jk} S_j(x(t)) T_k(y(t)),

where S_j, T_k are spline functions in the x (long edge of the basketball court) and y (short edge of the basketball court) directions, and the location vector is r(t) = (x(t), y(t))^T. We fit such a PF with K = 5 to the basketball's movement paths from the offensive plays by HOU in Game 4 (G4) and Game 5 (G5) of this series. In G4, we have tracks of the basketball's movement in 89 HOU offensive plays, consisting of 18,255 records of (x, y) locations; in G5, there are 81 plays, corresponding to 15,346 moments. The fitted PFs are plotted in Figure 7.1.

Figure 7.1: Potential fields parameterized by tensor product splines fitted to the paths of the ball when HOU was on offence during Games 4 and 5 vs. LAC in the 2015 NBA Conference playoffs. LAC won Game 4 128-95, while HOU won Game 5 124-103. The colors show the values of the potential field H, whose lowest value is around the basket.

At first, it appears that the two plots in Figure 7.1 are similar.
In both games, we observe that the PFs take on high values around the half-court line (blue) and display a tendency to decrease steadily toward the basket (green, yellow, then red). The minimum value of the PF is attained near the basket, as can be seen from the red area surrounding this location. This is plausible, as the ball is supposed to move toward the basket. Moreover, the PFs decrease quickly near the half-court line, because the player dribbling the ball up the court is likely to rush over the half-court line to set up the offense (the gradient of the PF determines the velocity of the movement). The PFs for both games also indicate that Houston's usual offensive patterns are to move the ball inside or to move the ball to the corners for a 3-point shot (a shot from outside the big arc, the 3-point line). This can easily be seen from the red and yellow regions of the plots.

The same PF model can also be applied to a player's movement data. As an illustration, we work with the tracks of Jason Terry (point guard) and Trevor Ariza (small forward) of the HOU team in G5. In this game, we have 52 and 50 plays involving Trevor Ariza and Jason Terry, respectively. These are limited to possessions in which HOU was on offence while Trevor Ariza/Jason Terry was on the court. The paths of the two players, from the raw data, are shown in Figure 7.2. Different colors in this figure correspond to different plays. The path of each play is plotted as dots that increase in size as the play develops; that is, the size of a dot indicates how much time has elapsed on the shot clock, with larger dots indicating less time remaining.

Figure 7.2: Movement paths for Trevor Ariza (left) and Jason Terry (right) in Game 5. The points increase in size as the play progresses (and the shot clock decreases).
Different colors correspond to different plays.

From Figure 7.2, we see that the paths of these two players typically end up along the sidelines, just beyond the three-point line in the corners. However, given the amount of overlap, it is difficult to infer anything more from Figure 7.2. On the other hand, the fitted PFs for these two players (Figure 7.3) display some interesting trends. The PF for Ariza takes on low values in both corners, but with a preference for the right corner (R). Conversely, Terry seems to prefer the left corner (L), indicated by the red region at the top left of the plot. It is a good strategy for them to occupy different corners, as this creates more opportunities for inside-out passes.

We also note that the PF for Ariza takes on relatively low values near the basket, at a level similar to the left corner. This indicates that moving toward the basket (either by cutting or by driving to the basket) and staying in the corner are equally attractive options for Ariza. This pattern is not observed in Terry's PF. It agrees with the fact that Ariza drives to the basket more often than Terry, because Ariza is younger and more athletic.

Figure 7.3: The fitted potential fields for Trevor Ariza (left) and Jason Terry (right) in Game 5. We show only the potential fields away from the center circle, to allow clear visualization of the patterns in the potential fields. These two players tend to occupy different corners of the court.

7.1.2 Potential field using the other players' locations as covariates

The above PF with tensor product splines is useful for illustrating the patterns in the ball's or a player's movement. We can also use the PF approach to characterize how one player interacts with his four teammates by considering the following PF:

  H(r(t), t; β) = β_0 ||r_ball(t) − r_basket||²_2 + Σ_{j=1}^{4} β_j ||r_ball(t) − r_player_j(t)||²_2, (7.2)

where ||·||_2 denotes the l_2 norm (Euclidean distance).
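To illustrate how such a quadratic PF can be fitted by least squares after the Euler approximation, the following Python sketch (illustrative only; the actual analysis used the real SportVU data in R, and all names and values here are hypothetical) simulates a 2-D track under a simplified PF with a single attraction coefficient β_0 toward a fixed target, then recovers β_0 by regressing the Euler increments on the drift covariate.

```python
import random

def simulate_track(beta0, target, sigma, dt, n, seed=3):
    """Euler-discretized track under H(r) = beta0 * ||r - target||^2,
    so the drift is -grad H = -2*beta0*(r - target)."""
    rng = random.Random(seed)
    r = [(10.0, 10.0)]                       # arbitrary starting location
    for _ in range(n):
        x, y = r[-1]
        r.append((x - 2 * beta0 * (x - target[0]) * dt
                  + sigma * dt ** 0.5 * rng.gauss(0, 1),
                  y - 2 * beta0 * (y - target[1]) * dt
                  + sigma * dt ** 0.5 * rng.gauss(0, 1)))
    return r

def fit_beta0(track, target, dt):
    """No-intercept least squares for beta0: regress the increments
    r_{k+1} - r_k on the covariate -2*(r_k - target)*dt, stacking both
    coordinates, as the linear-in-beta PF permits."""
    num = den = 0.0
    for (x0, y0), (x1, y1) in zip(track, track[1:]):
        for diff, cov in (((x1 - x0), -2 * (x0 - target[0]) * dt),
                          ((y1 - y0), -2 * (y0 - target[1]) * dt)):
            num += cov * diff
            den += cov * cov
    return num / den

track = simulate_track(beta0=0.8, target=(0.0, 0.0), sigma=0.5, dt=0.04, n=2000)
beta_hat = fit_beta0(track, (0.0, 0.0), dt=0.04)  # close to the true 0.8
```

The full model (7.2) adds four more distance covariates, one per teammate, so the same idea becomes a five-column linear regression solved by ordinary least squares.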
This PF model can be fitted to the moments when a certain player is in possession of the basketball. The different possessions of the ball by the same player, even in the same play, are assumed to be mutually independent. Namely, if the player passes the ball to one of his teammates and later gets the ball back, the two periods (possessions) are treated as independent observations of the PF. The coefficient β_0 represents the likelihood of the ball moving toward the basket: if β_0 > 0, the player is likely to move the ball toward the basket, especially if the distance between the ball and the basket is large. On the other hand, if β_0 < 0, then the player is less likely to move the ball toward the basket. The interpretations of β_j, for j = 1, ..., 4, are similar to that of β_0: these coefficients denote the likelihood of the player moving the ball toward teammate j.

We fit this model to HOU's starting lineup in G5, which consists of small forward Trevor Ariza, center Dwight Howard, shooting guard James Harden, point guard Jason Terry, and power forward Josh Smith. We found 22 plays (3900 moments) by this lineup. We fit five separate models of (7.2) to the moments when each player is in possession of the ball. For expository simplicity, we denote the fitted coefficient for teammate j (or the basket) when player i is in possession by β_{i,j}; therefore β_{i,j} and β_{i',j} are from different models when i ≠ i'. For example, β_{Terry,Harden} is the coefficient corresponding to Terry moving the ball toward Harden (via passing or dribbling). Similarly, β_{Harden,Basket} is the coefficient corresponding to Harden moving the ball toward the basket. These coefficients are summarized in Table 7.1.

We begin by focusing on Houston's guard, Jason Terry. The negative coefficients for Terry toward all of the players except for Harden and the basket indicate that Terry is unlikely to move the ball toward any teammate other than Harden.
This corresponds to how HOU usually starts an offensive play: Terry brings the ball over the half court and passes to Harden to start the offense. James Harden has positive coefficients corresponding to Ariza and Terry as well as the basket. This agrees with what we observe by watching this game. Frequently, Harden would drive the ball to the basket, as indicated by the positive coefficient of 4.90; if he sees either Ariza or Terry open for a shot, he will make a pass to one of them. This tactic is effective because Harden has great slashing ability, often drawing defenders toward him. This often leaves one teammate open for a shot.

Table 7.1: Fitted coefficients (multiplied by 1000) of the potential fields corresponding to HOU's most commonly used lineup in Game 5. They reflect the likelihood of the ball moving toward "column name" when "row name" is in possession of the ball.

          Ariza  Smith  Howard  Harden  Terry  Basket
Ariza        NA  -3.35   -1.51    1.90   1.03    5.38
Smith      1.14     NA   -1.23   -1.09   0.44    1.59
Howard     1.19   3.98      NA   -5.28   2.97   -6.27
Harden     0.72  -2.65   -3.70      NA   1.73    4.90
Terry     -1.21  -1.36   -3.04    2.29     NA    4.21

From the third row of Table 7.1, the likelihood of Howard moving the ball toward the basket is negative (β_{Howard,Basket} = −6.27). This seems counterintuitive, but is reasonable. In this game, Howard usually receives the ball near the basket and typically has three options: dunk, execute a post move [9], or pass back to the perimeter. The latter two options were more frequent, which results in the ball moving away from the basket and thus a negative coefficient.

Meanwhile, we admit that this formulation of the PF for the interactions among players is not ideal. The likelihood of passing or driving is confounded with the distance between the players (or the basket), which leads to tricky cases in the interpretation. For example, imagine that Terry has already dribbled the ball close to the basket, free of the defenders, while Harden is still far away near the mid-court.
In this case, Terry will almost certainly shoot the basketball instead of passing to Harden. Yet, according to the PF model, Terry should be more likely to pass to Harden because he is far away (the distance from Terry to Harden is much larger than from Terry to the basket). This is a shortcoming of our current model and should be addressed in future work. We may borrow ideas from the successful modeling of animals' social networks as in Scharf et al. (2017) and the references within.

[9] Moves/actions that a basketball player makes near the basket to get a better shot.

7.2 A mixture of potential fields model for seal foraging tracks

The PF model can also be used to interpret the foraging tracks of northern fur seals. As in the previous section, the fitted PF can serve as a map of hot foraging spots for the seals. For this initial investigation, the PF is fitted only to the GPS observations in longitude and latitude, without using the high resolution DR paths. As there are only approximately a few hundred GPS observations, we consider a simpler parameterization of the PF instead of the tensor product splines of the previous section:

H(r, t; β) = β_1 x(t) + β_2 y(t) + β_3 x(t)^2 + β_4 y(t)^2 + β_5 x(t) y(t),    (7.3)

where r(t) = (x(t), y(t))^T are the longitude (Easting) and latitude (Northing) locations. This is also the PF model proposed for the movement of monk seals in Brillinger et al. (2007). Using Trip 2 as an example, the fitted PF (7.3) is shown in the top-left panel of Figure 7.4. The fitted PF does not offer a good interpretation of the animal's movement, as the area with the lowest PF is an area the animal circled around. This contradicts the interpretation of the PF: the area with the lowest PF value should be most attractive to the animal. The poor fit of this PF is plausible, as it is impractical to assume that a single PF governs the whole path of the seal. For most of the trip, the seal aims to collect as much food as possible, so a PF of food resources would explain the animal's movement.
Once the animal has accumulated enough food, it needs to return to the island to feed its pup; the island then becomes the only destination for the return part of the trip, which requires a different PF.

Brillinger et al. (2007) avoided this issue by manually chopping the trip into different periods. We attempt to tackle this issue by extending (7.3) into a mixture of PFs as follows:

H(x(t), y(t); β) = Σ_{k=1}^{K} Z_k(t) (β_1^{(k)} x(t) + β_2^{(k)} y(t) + β_3^{(k)} x(t)^2 + β_4^{(k)} y(t)^2 + β_5^{(k)} x(t) y(t)),

where {Z_1(t), Z_2(t), ..., Z_K(t)} are iid draws from a multinomial distribution with n = 1 and probability vector p = (p_1, p_2, ..., p_K)^T. For simplicity, we assume the error terms in the longitude and latitude components are independent and have the same variance, such that Σ(r(t), t) = σ^2 I_2. Parameter estimates in the Euler-approximated model can be calculated by the Expectation–Maximization (EM) algorithm of Dempster et al. (1977). For the current analysis, we only consider a two-component mixture and calculate the parameter estimates with the FlexMix package (Grün et al., 2008).

The mixture of PFs model appears to be better than the original PF model, as the maximum of the mixture PF is in an area where the animal appeared to stay for a long period of time. The advantage of the mixture of PFs model is more obvious if we study its two components separately, as in the bottom two panels of Figure 7.4. The maxima of the first PF are in two areas where the animal might have been foraging, as it went back and forth there. The second PF has only one maximum, at the island (the start and end point of this trip). The first PF may reflect the food resources in the ocean, which attract the foraging animals. The second PF reflects the animal's home, to which it returns after foraging.

The above interpretation can be verified with the cluster memberships [10] plotted in the top right panel.
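The EM machinery behind such a mixture (posterior responsibilities in the E-step, weighted least squares in the M-step, with each time point carrying its own latent label) can be sketched generically as a two-component mixture of linear regressions. This is an illustrative stand-in for the FlexMix fit, with our own function name, initialization heuristic, and simulated data; it is not the thesis code.

```python
import numpy as np

def em_mixture_regression(X, y, n_iter=100):
    """EM for a two-component mixture of linear regressions with a shared
    noise variance, mirroring the Euler-discretized mixture-of-PFs set-up
    in which each observation carries its own latent component label."""
    n, p = X.shape
    # Initialize with a hard split on the sign of pooled-fit residuals
    # (a simple heuristic that works when the components are well separated).
    r0 = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    R = np.column_stack([r0 > 0, r0 <= 0]).astype(float)   # responsibilities
    betas, sigma2 = np.zeros((2, p)), np.var(y)
    for _ in range(n_iter):
        for k in range(2):                                 # M-step: weighted LS
            w = R[:, k]
            betas[k] = np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (w * y))
        resid = y[:, None] - X @ betas.T                   # (n, 2) residuals
        pi = np.clip(R.mean(axis=0), 1e-12, None)          # mixing weights
        sigma2 = np.sum(R * resid**2) / n                  # shared noise variance
        logp = -0.5 * resid**2 / sigma2 + np.log(pi)       # E-step (log scale)
        logp -= logp.max(axis=1, keepdims=True)
        R = np.exp(logp)
        R /= R.sum(axis=1, keepdims=True)
    return betas, pi, sigma2, R

# Synthetic check: two well-separated regression components, 60/40 mix.
rng = np.random.default_rng(1)
n = 1500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
z = rng.random(n) < 0.6
beta_a, beta_b = np.array([2.0, -1.0]), np.array([-2.0, 1.0])
y = np.where(z, X @ beta_a, X @ beta_b) + rng.normal(0.0, 0.2, n)
betas, pi, sigma2, R = em_mixture_regression(X, y)
```

The final responsibilities `R` play the same role as the posterior cluster memberships used to color the top right panel of Figure 7.4.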
Those GPS observations that are close to their previous and next observations mostly belong to the first "foraging" component, while those GPS observations far from their previous and next observations are mostly in the second "returning home" component.

[10] Decided by the posterior probabilities from the EM algorithm.

Figure 7.4: The simple potential field and mixture of potential fields models fitted to the Trip 2 GPS data set collected in our case study. Top left: the potential field with the original Brillinger model, which does not fit the track well. Top right: the mixture potential field and the posterior clusters. Bottom left: the first component of the mixture; it accounts for 73% of the observations and may be interpreted as the food map for the animal. Bottom right: the second component of the mixture; it accounts for 27% of the observations and reflects the returning behavior of the animal. The top right field is the weighted sum of the bottom two fields with weights 0.73 and 0.27.

Chapter 8

Summary, Discussion, and Future Work

In the final chapter of this thesis, we first summarize our major contributions in Section 8.1. Then we discuss a potential problem of GP modeling under different sampling frequencies in Section 8.2.
Some additional future work is discussed in Section 8.3.

8.1 Summary

In this thesis, we developed statistical models for high resolution tracking data, in particular, marine mammal tracking data with multiple sources of observations. Chapter 2 reviewed the scientific background of marine mammal tracking and studied the properties of the two distinctive types of observations: the accurate but sparse GPS observations and the high-resolution but biased Dead–Reckoning (DR) path. Such data from a case study of tracking northern fur seals were used throughout this dissertation. The models we developed were built upon recent developments in spatio–temporal modeling with Gaussian processes (GP), combined with various forms of stochastic differential equations (SDE), which were reviewed in Chapter 3 with special emphasis on how to make inferences and predictions with them.

Chapter 4 first reviewed two Bayesian frameworks, Bayesian Melding (BM) and downscaling, commonly used to combine data from different sources for statistical inference. We adapted them to estimate the path of a moving object and applied them in our case study of tracking northern fur seals. To make the BM approach computationally feasible for big tracking data, we exploited the properties of the processes along with approximations to the likelihood to break the high dimensional problem into a series of lower dimensional problems. To implement the alternative, downscaling approach, we used the integrated nested Laplace approximation (INLA) to fit a linear mixed effects model that connects the two sources of observations. The predictions of the two approaches were compared by cross–validation as well as simulations. We showed that both approaches yield similar results: both provide accurate, high resolution estimates of tracks, as well as Bayesian credible intervals to characterize the uncertainty about the estimated paths.
They outperformed the conventional bias correction of the DR path and interpolation methods based on GPS observations only, in terms of both prediction accuracy and coverage percentage.

The methods developed in Chapter 4 cannot model the inhomogeneous features of the tracks. We first proposed the conditionally heterogeneous Gaussian process (CHGP) and the conditionally heterogeneous Brownian Bridge (CHBB) process in Chapter 5. Unfortunately, the idea of conditional heterogeneity produces too many candidate models and makes the model comparison metrics sensitive to small variance parameters. We failed to develop an adequate model selection method for the CHBB.

The generalized Ornstein–Uhlenbeck (GOU) process developed in Chapter 6 tackled the inhomogeneity issue from another angle. By modeling the coefficients of a linear SDE via splines, we achieved a flexible non-stationary mean and covariance structure with a parsimonious parametrization. With the Markovian property of the SDE, we were able to express the GOU as a dynamic linear model and thus perform its inference and prediction efficiently via the Kalman filter/smoother, which is also computationally scalable for big data. Our cross–validation studies show that the BM with GOU has better prediction accuracy than the BM and downscaling approaches from Chapter 4. The BM with GOU also achieved narrower credible intervals without sacrificing the coverage percentage.

To bring more interpretable structure into the SDE modeling, we studied the potential field (PF) approach in Chapter 7. The PF is an SDE with its drift term parameterized as the gradient of another function, which can be a function over space or include other covariates. The PF was effective in learning and visualizing the trends in the tracks, as illustrated with the marine mammal tracking data as well as the NBA tracking data.
We also designed a mixture of PFs for the animal track to accommodate the different behaviors of the animal.

In summary, we developed many statistical models that can be used to predict and interpret tracking data. They were all designed to be scalable and computationally efficient. For the animal tracking data, the modeling approaches we developed and tested can facilitate the processing of high resolution in–situ records of the hydrographic data collected by marine mammals, and can contribute to broadening knowledge about parts of the ocean that have been hard to observe. They can contribute to studies seeking to address the effects of climate change on the ocean, and also contribute to answering many biological and ecological questions about habitat preference and resource selection (Hooten et al., 2013). The Bayesian data fusion approaches studied in this thesis can be further applied in other scenarios for combining data from different sources. The GOU and PF approaches that we developed can also be used to model other types of tracks.

In addition, we empirically assessed the models and identified the best model choices based on cross–validation or visual examination in the applications considered in this study. The analyses demonstrated how users in other contexts may consider choosing models that best suit their applications, particularly where high dimensional data are involved.

8.2 Another examination of the approximation and model specification in Bayesian melding

In the Bayesian Melding approach developed in Chapter 4, one of the key approximations we utilized to simplify the posterior evaluation was

[φ | X, Y] ≈ [φ | X_G, Y].    (8.1)

A similar approximation is also used in the Bayesian Melding with the GOU in Chapter 6. The approximation in (8.1) enables us to infer the two variance parameters φ = (σ_H^2, σ_D^2)^T from the observations at the GPS time points instead of the whole data set.
In our case studies, X on the left hand side of (8.1) is a vector of size over half a million, while X_G on the right hand side only has a couple of hundred elements. This approximation was first designed for computational reasons, and we have shown in Section 4.4.1 that it has little impact on the inference of the variance parameters based on simulated data. However, what happens when we remove this approximation and work with [φ | X, Y] directly on our real data?

We do not have enough computing power to evaluate [φ | X, Y], but we can mimic what would happen by thinning X to various resolutions. We thin the original 1 Hz DR path at various rates R = ∞, 1200, 600, 300, 120, 60, and then combine the result with the DR path at the GPS time points. We denote such a data set by X_R ∪ X_G. A thinning rate of ∞ means that we only work with [φ | X_G, Y]; the smaller the rate, the more observations from X \ X_G are used in [φ | X_R ∪ X_G, Y]. Sparse matrices have to be used to make the computation feasible. The first two columns of Table 8.1 summarize the posterior modes of the two variance parameters.

To our surprise, the posterior modes of the variance parameters decrease dramatically when more of X is used in evaluating [φ | X_R ∪ X_G, Y]. As the width of the CIs is proportional to the square root of the variance parameters, the posterior CIs also get narrower, which results in poor coverage percentages in the cross–validation. On the other hand, the "signal–noise ratio" σ̂_H^2 / (σ̂_H^2 + σ̂_D^2) stays almost the same across the different values of R. Recall from Section 4.2.2 that this ratio is a key factor in determining the corrected path; the stable σ̂_H^2 / (σ̂_H^2 + σ̂_D^2) ensures that the posterior mean stays almost unchanged, and therefore the RMSEs in the cross–validation are nearly equal among the different thinning rates.

We discuss the cause of this seemingly surprising phenomenon in Section 8.2.1. Section 8.2.2 further illustrates the changes in the variance parameter estimates with other data sets and GP models.
The implications of this phenomenon and our proposals to alleviate it and fix the credible intervals are discussed in Sections 8.2.3 and 8.2.4.

Table 8.1: Posterior modes of φ (σ̂_H^2, σ̂_D^2) from [φ | X_R ∪ X_G, Y] and cross–validation performance under different rates of thinning for the Trip 2 Northing data set. The cross–validation RMSE and coverage rate of the 95% CI (in brackets) from both LOOCV and L5OCV are reported.

Rate   σ̂_H^2   σ̂_D^2   σ̂_H^2/(σ̂_H^2+σ̂_D^2)   L5OCV RMSE (Coverage)   LOOCV RMSE (Coverage)
∞      1.03    1.23    0.45                  3.06 (93.0%)            0.85 (100%)
1200   0.41    0.45    0.47                  2.99 (82.8%)            0.86 (95.3%)
600    0.25    0.27    0.48                  2.99 (78.9%)            0.86 (93.8%)
300    0.14    0.15    0.48                  3.01 (63.3%)            0.88 (85.9%)
120    0.061   0.066   0.48                  3.08 (53.9%)            0.92 (73.4%)
60     0.032   0.034   0.48                  3.17 (41.4%)            0.99 (61.7%)

8.2.1 Why the variance parameters decrease

In our simulation studies in Section 4.4.1, we found that the posterior modes of σ_H^2 and σ_D^2 are almost the same no matter whether [φ | X_G, Y] or [φ | X, Y] is used. The decrease of the variance parameters only happens in the real data set, so it is probably caused by model misspecification. We provide an explanation for the Brownian Motion process, i.e., the decrease in σ_D^2, but similar arguments apply to the σ_H^2 of the Brownian Bridge.

Given a sequence x_0, x_1, x_2, ..., x_n observed at equally spaced time points t_0 = 0, t_1 = δ, t_2 = 2δ, ..., t_n = nδ, we assume that x_0, x_1, ..., x_n are from a zero-mean Brownian Motion process with variance parameter σ_D^2. The posterior mode of [σ_D^2 | x_0, x_1, ..., x_n] under a uniform prior on the log scale, log([σ_D^2]) ∝ 1, equals the maximum likelihood estimate (MLE) of σ_D^2. The MLE can be computed in closed form,

σ̂_D^2 = (1/n) Σ_{i=1}^{n} (x_i − x_{i−1})^2 / δ,    (8.2)

because the increments (x_i − x_{i−1}) are iid N(0, σ_D^2 δ). When the x_i are really from a Brownian Motion, this estimate is "correct", i.e., unbiased and asymptotically consistent. In our simulation studies, X and its subset X_G are both from the correct model.
When both X and X_G have a fairly large n, the estimates of σ_D^2 should be similar, i.e., the probability that they differ greatly is small. This explains why we do not observe the dramatic change in the simulation study.

However, we have little idea of the behavior of σ̂_D^2 when the model is misspecified. Now, let us consider an extreme scenario in which the x_i are actually from a smooth function, x_i = s(t_i), e.g., s(t) = sin(t). Equation (8.2) is then roughly

σ̂_D^2 = (1/n) Σ_{i=1}^{n} (x_i − x_{i−1})^2 / δ = (1/n) Σ_{i=1}^{n} (s(t_i) − s(t_{i−1}))^2 / δ ≈ (1/n) Σ_{i=1}^{n} (s′(t_{i−1}))^2 δ.

If we assume |s′(t)|^2 ≈ C for all t [1], the estimate of σ_D^2 is approximately Cδ: the smaller δ is, the smaller σ̂_D^2. Under this misspecified model, when δ is large, the data appear wiggly to the model, which results in a big estimate of σ_D^2; the data appear smoother with smaller δ, which leads to a smaller σ̂_D^2.

Our real data are not such a smooth function, but they certainly have a smooth component, as seen in Figure 4.6. When more of X is included in the inference, the data used (the random bias part of the DR, ξ) appear smoother and resemble a Brownian Motion with a smaller variance. This explains the substantial decrease of the variance parameters in Table 8.1.

8.2.2 Similar phenomena in other data sets and models

The BM-GOU models developed in Chapter 6 also experience such a decrease in the variance parameters σ^2 and σ_D^2 when more of X is used in [φ | X_R ∪ X_G, Y], but the estimates of the B-spline coefficients γ^(A) and γ^(a) stay almost the same. A similar phenomenon can also be observed even when we work with a "smoother" process, such as the CRAWL and Matérn processes. The CRAWL works with only the GPS observations, so we create a similar situation by thinning them at different rates (e.g., a thinning rate of 2 keeps every other observation).
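The δ-dependence argued in Section 8.2.1 is easy to check numerically. The sketch below (our own illustration, not thesis code) applies the closed-form estimate (8.2) to a sine curve sampled on a fine grid and then thinned, showing that σ̂_D^2 shrinks roughly in proportion to the spacing δ rather than staying constant as it would under genuine Brownian Motion.

```python
import numpy as np

def bm_variance_mle(x, delta):
    """Closed-form estimate (8.2): mean squared increment divided by delta."""
    return np.mean(np.diff(x) ** 2) / delta

# A smooth signal observed on a fine grid, then thinned at several rates.
t = np.arange(0.0, 100.0, 0.01)          # fine grid with spacing 0.01 (illustrative)
x = np.sin(t)                            # s(t) = sin(t), so |s'(t)|^2 averages ~1/2
for rate in (1, 10, 100):                # keep every `rate`-th point
    print(rate, bm_variance_mle(x[::rate], 0.01 * rate))
# -> roughly 0.005, 0.05, 0.46: sigma_hat^2 grows about linearly with delta
```

This mirrors Table 8.1: finer sampling of a smooth component drives the Brownian Motion variance estimate toward zero, while for data truly generated by a Brownian Motion the estimate would be stable across thinning rates.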
For the Matérn model, we assume the GPS observations are from a Matérn process with smoothness parameter ν = 1.5 plus observation noise with a known variance σ_G^2 = 0.0625, which is a setting similar to that of our Bayesian Melding models. The estimates of the variance parameters in the CRAWL model and the Matérn process at different thinning rates for the Trip 2 (Northing) data set are summarized in Table 8.2, which shows that the variance parameters for these two models also generally decrease as the thinning rate decreases. Compared with Table 8.1, the changes in the variance parameters are relatively moderate and they do not decrease linearly as in Table 8.1.

[1] The square of the first derivative ranges around C.

Table 8.2: The estimates of the variance parameters in the CRAWL and Matérn models at different thinning rates for the Trip 2 (Northing) GPS data set. Thinning rate = 1 gives the original data set.

Thinning Rate         4        3       2      1
σ̂^2 in CRAWL     10137.5   9759.5  9172.8  542.3
σ̂^2 in Matérn      979.6   1079.1  1042.3  959.4

Besides the animal tracking data, we also performed a similar analysis on the motorcycle data set, first published in Silverman (1985). It contains a series of measurements of head acceleration in a simulated motorcycle accident, used to test crash helmets. These data are plotted in Figure 8.1. As with our animal tracks, the data have a smooth component. The variance parameter estimates using the Brownian Motion model or the Matérn process with ν = 1.5 are summarized in Table 8.3. Notice that we assume there is no observation noise, i.e., σ_G^2 = 0. Table 8.3 shows that the σ̂^2s change dramatically in both the Brownian Motion and Matérn models for the motorcycle data set. Unlike for the seal tracking data set, they do not change monotonically with the thinning rate. This is probably because many time points are quite close to each other in this data set and we chose not to consider observation noise in this illustration.
We only aim to demonstrate the changes in the variance parameter estimates and will not study this data set further.

Table 8.3: Variance parameter estimates of the Brownian Motion and Matérn processes for the motorcycle data set.

Thinning Rate                  4       3       2       1
σ̂^2 for Brownian Motion    609.6   495.1   991.1  2643.4
σ̂^2 for Matérn Process    2152.3  1895.6  2160.8  2401.4

Figure 8.1: The motorcycle data set (head acceleration, in g, against time, in milliseconds).

8.2.3 Ideas to stabilize the variance parameters

From the above tables, it is clear that the sampling rate can have a substantial impact on the estimates of the variance parameters, which in turn affect the credible intervals for the predictions. The problem is caused by the fact that the GP model does not fit the smooth component in the data.

Smoother GPs, such as the CRAWL and Matérn, can help to alleviate the dramatic change to some degree, but they do not completely remove this effect, as shown in Tables 8.2 and 8.3. These smoother GPs can also involve strong parametric assumptions on the data, which may not work well for real data. For example, we have shown that the CRAWL (integrated Ornstein–Uhlenbeck process) has worse predictive performance than linear interpolation in Section 4.5.4. The Matérn process is shown to have inconsistent parameter estimates under certain designs (Zhang, 2004), and we find it numerically difficult to calculate its estimates. Smoother GPs, such as the Matérn with ν > 0.5, do not have a Markovian structure and require extra effort to scale to big data, as reviewed in Section 3.1 (Lindgren et al., 2011).

Another idea is to model the smooth component parametrically with a flexible mean structure, such as a spline or polynomials (see, e.g., Buderman et al., 2015).
The limitation of this idea is that we need many parameters to ensure that the splines or polynomials are flexible enough, which makes the inference difficult given the limited data. Also, it is tricky to ensure the quality of extrapolation or interpolation in periods where we have no observations, as the parametric structure of the splines/polynomials has a strong effect on the predictions. We have made two attempts in this direction for the seal tracking study: the polynomials for the bias component of the DR, and the B-splines in the GOU process. Neither of them succeeded in stabilizing the variance parameters at different sampling frequencies, but the GOU achieved better predictive performance.

In particular, the nature of the seal tracking data set makes it difficult for either idea to succeed. Recall that the GPS observations are only available at a sparse set of time points. Even if the true path were from a smooth GP, like the CRAWL or Matérn, the dependence would be very weak if two GPS observations were far apart; so a wiggly process such as the Brownian Bridge/Motion may fit the data adequately. The sparsity of the GPS observations also makes it difficult to fit these mean curves or higher-order dependence models. The DR path has high resolution, but we cannot distinguish the true path from the noise except at those time points when GPS sightings are available. This is probably the main reason that the high order polynomials did not improve the performance in Section 4.5.2.

We also point out that a flexible mean and a smooth covariance can be confounded. For example, Watson (1984) and Silverman (1985) discussed a certain equivalence between regression splines and kriging. Fan et al. (2014) discussed a certain equivalence between higher order stochastic differential equations and Chebyshev splines.
We should be aware of this issue when trying both ideas simultaneously.

8.2.4 The variance parameters and the CI

If we are only interested in the mean of the predictions, the variance parameters actually do not matter, as shown by the CV-RMSE in Table 8.1. However, the variance parameters are important for the prediction uncertainty. We cannot fully explain why conditioning on X_G leads to "correct" coverage percentages in the L5OCV regardless of the model (i.e., BM-GOU, BM-BB, and DS), but not in the LOOCV. A similar question is why the thinning rate R = 1200 leads to almost correct coverage percentages in the LOOCV, but not the L5OCV. There are two aspects to this question: which cross–validation scheme we should use, and which thinning rate can lead to a "correct" variance parameter in the cross–validation.

The reason that we chose L5OCV over LOOCV when evaluating our model performance was discussed in Section 4.5. In short, the left-out observation is still very close to the nearby remaining observations in the LOOCV, so the dependence on the nearby remaining observations helps to determine an interval that covers the left-out observation. The L5OCV reduces such dependence, and its intervals are mainly decided by our models. We can confirm this point from Table 8.1, as the LOOCV always has a higher coverage percentage than the L5OCV regardless of the thinning rate.

The second part of this question is more complicated, as it involves both the behavior of our model and the true path. Ideally, we desire a model that captures the data's uncertainty and automatically adapts to different data and prediction requirements, i.e., achieves the same correct coverage percentage regardless of the thinning rate or which cross–validation scheme is used.
Stabilizing the variance parameters as discussed in the above subsection can be a first step toward tackling this problem.

Given this coverage issue, we cannot claim that our models result in CIs with correct coverage percentages for all other tracks with GPS and DR paths under any cross–validation scheme. First, we stress that the cross–validation scheme, i.e., how much to leave out, should be decided by the scientific objective of the study, i.e., how long a time gap the path is expected to be predicted over. Another way to decide the cross–validation scheme is to make the left-out data nearly independent of the training data (see, e.g., Arlot et al., 2010). After the CV scheme is decided, the amount of uncertainty in our model can be controlled via the priors in our framework. Informative priors can force the variance parameters toward values that lead to correct coverage percentages. The parameters of the prior may be chosen by a combination of prior knowledge and cross–validation.

This "informative prior + cross–validation selection of priors" scheme for using our Bayesian Melding model is not ideal, but it should work for practical purposes. As George Box said: "All models are wrong, but some are useful." Seals and basketball players do not move according to any mathematical formulas. The models developed in this project are our best effort to express these tracks statistically and use them for interpretation or prediction with moderate computational power.

8.3 Future work

Besides the issue of the variance parameters discussed above, there are many aspects of this thesis that can be further developed. A possible rescue of the CHGP was discussed at the end of Chapter 5. Similarly, the future development of the PF model for NBA tracking data was discussed in Section 7.1.
The rest of this section covers future directions for our BM framework and the PF approach.

8.3.1 Future developments for Bayesian melding

Our work has demonstrated the value of using Bayesian data fusion techniques to combine observations from different sources for tracking objects. We plan to adapt the ideas developed in this thesis to track the progress of infectious diseases. We will use the Google Flu Trends data as a biased but high resolution source of observations and the Centers for Disease Control (CDC) reports as an accurate but sparse set of observations. The bias of Google Flu Trends and its failure to predict some disease epidemics have been studied in Hooten et al. (2010) and Lazer et al. (2014b,a) and the references within. The media have also used it as a serious warning about the use of "big data". As noted by Salzberg (2014), "Big data can be great, but not when it is bad data." While agreeing with this comment in principle, we point out that "bad big data", such as the Google Flu Trends or the DR path in our work, can be good again when it can be combined with and corrected by good data. For example, Dugas et al. (2013) and Lazer et al. (2014b) used various regression models to combine the Google Flu Trends and CDC reports and achieved better predictions of infectious disease. Our BM and downscaling approaches also successfully combine the DR path with GPS observations.

Another interesting direction is to use an SDE parameterized by a PF as the prior process in our BM framework. The implementation of this idea is straightforward, as the Euler approximation can express the SDE as a DLM, and then the Kalman smoother can be used. The main challenge might be how to find a parsimonious parameterization of the PF in the spatio–temporal domain. In our current work, we considered coefficients that vary in the time domain in Chapter 6 and in the spatial domain in Chapter 7.
A PF reflecting changes in both the spatial and temporal domains is key to its success in interpreting tracks such as those of the fur seals. When working with a mixture of PFs in BM, we may need to design new approximation techniques to avoid obtaining Monte Carlo samples of the hidden classes. A starting point is marginalizing the latent classes in the PF and using the marginalized process in the BM framework.

8.3.2 Approximation to non-linear SDEs

In the above discussion of the PF, the nonlinear PFs were all approximated by the locally constant Euler approximation. Because our tracking data were observed at high resolution, the Euler approximation is accurate (Iacus, 2009). Yet, if a better approximation is desired, we can consider other approximation techniques, such as those of Ozaki (1992), Archambeau et al. (2007), Archambeau and Opper (2010), and Vrettas et al. (2015), which were briefly reviewed in Section 3.2. It is beyond the scope of this thesis to introduce their formulations and compare their differences. In particular, we find that the Euler and Ozaki (Ozaki, 1992) approximations can both be derived from the Kolmogorov backward equation (see, e.g., Arnold, 1974). An interesting idea is to derive similar approximations from the Kolmogorov forward equation or to combine them in some way.

Bibliography

Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342.

Archambeau, C., Cornford, D., Opper, M., and Shawe-Taylor, J. (2007). Gaussian process approximations of stochastic differential equations. Gaussian Processes in Practice, 1:1–16.

Archambeau, C. and Opper, M. (2010). Approximate inference for continuous-time Markov processes. Bayesian Time Series Models, pages 125–140.

Arlot, S., Celisse, A., et al. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79.
→ pages 113

Arnold, L. (1974). Stochastic Differential Equations. Wiley-Interscience. → pages 24, 25, 78, 79, 81, 82, 83, 115

Ba, S. and Joseph, V. R. (2012). Composite Gaussian process models for emulating expensive functions. The Annals of Applied Statistics, 6(4):1838–1860. → pages 20

Banerjee, S., Gelfand, A. E., Finley, A. O., and Sang, H. (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):825–848. → pages 13, 15, 66

Bates, D. and Maechler, M. (2016). Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.2-7.1. → pages 12, 84

Battaile, B. (2014). TrackReconstruction: Reconstruct animal tracks from magnetometer, accelerometer, depth and optional speed data. R package version 1.1. → pages 9

Battaile, B. C., Nordstrom, C. A., Liebsch, N., and Trites, A. W. (2015). Foraging a new trail with northern fur seals (Callorhinus ursinus): Lactating seals from islands with contrasting population dynamics have different foraging strategies, and forage at scales previously unrecognized by GPS interpolated dive data. Marine Mammal Science, 31(4):1494–1520. → pages 5, 33

Benoit-Bird, K. J., Battaile, B. C., Heppell, S. A., Hoover, B., Irons, D., Jones, N., Kuletz, K. J., Nordstrom, C. A., Paredes, R., Suryan, R. M., Waluk, C. M., and Trites, A. W. (2013a). Prey patch patterns predict habitat use by top marine predators with diverse foraging strategies. PloS ONE, 8(1):e53348. → pages 4, 8

Benoit-Bird, K. J., Battaile, B. C., Nordstrom, C. A., and Trites, A. W. (2013b). Foraging behavior of northern fur seals closely matches the hierarchical patch scales of prey. Marine Ecology Progress Series, 479:283–302. → pages 4, 36

Berger, J. O. and Pericchi, L. R. (1996). The intrinsic Bayes factor for model selection and prediction. Journal of the American Statistical Association, 91(433):109–122. → pages 71

Berrocal, V. J., Gelfand, A. E., and Holland, D. M. (2010). A spatio-temporal downscaler for output from numerical models. Journal of Agricultural, Biological, and Environmental Statistics, 15(2):176–197. → pages 31, 32, 34, 44

Blangiardo, M. and Cameletti, M. (2015). Spatial and Spatio-temporal Bayesian Models with R-INLA. John Wiley & Sons. → pages 20

Block, B. A., Jonsen, I. D., Jorgensen, S. J., Winship, A. J., Shaffer, S. A., Bograd, S. J., Hazen, E. L., Foley, D. G., Breed, G. A., Harrison, A.-L., Ganong, J. E., Swithenbank, A., Castleton, M., Dewar, H., Mate, B. R., Shillinger, G. L., Schaefer, K. M., Benson, S. R., Weise, M. J., Henry, R. W., and Costa, D. P. (2011). Tracking apex marine predator movements in a dynamic ocean. Nature, 475:86–90. → pages 4

Boehme, L., Kovacs, K., and Lydersen, C. (2010). Biologging in the global ocean observing system. Proceedings of OceanObs 09: Sustained Ocean Observations and Information for Society, 2(September):21–25. → pages 4

Boehme, L., Meredith, M. P., Thorpe, S. E., Biuw, M., and Fedak, M. (2008). Antarctic Circumpolar Current frontal system in the South Atlantic: Monitoring using merged Argo and animal-borne sensor data. Journal of Geophysical Research: Oceans, 113(C9). → pages 4

Bornn, L., Shaddick, G., and Zidek, J. V. (2012). Modeling nonstationary processes through dimension expansion. Journal of the American Statistical Association, 107(497):281–289. → pages 19

Brillinger, D., Preisler, H., Ager, A., and Wisdom, M. (2004). Stochastic differential equations in the analysis of wildlife motion. → pages 93

Brillinger, D. R. et al. (2007). A potential function approach to the flow of play in soccer. Journal of Quantitative Analysis in Sports, 3(1). → pages 93, 101

Brillinger, D. R. and Stewart, B. S. (2010). Stochastic modeling of particle movement with application to marine biology and oceanography. Journal of Statistical Planning and Inference, 140(12):3597–3607. → pages 93

Bryant, E. (2007). 2D location accuracy statistics for Fastloc RCores running firmware versions 2.2 & 2.3.
Wildtrack Telemetry Systems Ltd. → pages 5, 38

Buderman, F. E., Hooten, M. B., Ivan, J. S., and Shenk, T. M. (2015). A functional model for characterizing long-distance movement behaviour. Methods in Ecology and Evolution, (7):264–273. → pages 30, 112

Caragea, P. C. and Smith, R. L. (2007). Asymptotic properties of computationally efficient alternative estimators for a class of multivariate normal models. Journal of Multivariate Analysis, 98(7):1417–1440. → pages 18

Casella, G. (1985). An introduction to empirical Bayes data analysis. The American Statistician, 39(2):83–87. → pages 137

Cervone, D., Bornn, L., and Goldsberry, K. (2016). NBA court realty. SLOAN Sports Analytics Conference, Boston. → pages 94

Cervone, D., D'Amour, A., Bornn, L., and Goldsberry, K. (2014). POINTWISE: predicting points and valuing decisions in real time with NBA optical tracking data. Proceedings MIT Sloan Sports Analytics. → pages 1, 94

Chan, K. C., Karolyi, G. A., Longstaff, F. A., and Sanders, A. B. (1992). An empirical comparison of alternative models of the short-term interest rate. The Journal of Finance, 47(3):1209–1227. → pages 78

Chipman, H. (1998). Bayesian CART model search. Journal of the American Statistical Association, 93(443):935–948. → pages 13, 75

Chu, T., Wang, H., and Zhu, J. (2014). On semiparametric inference of geostatistical models via local Karhunen–Loève expansion. Journal of the Royal Statistical Society: Series B (Statistical Methodology), pages 817–832. → pages 19

Cressie, N. and Johannesson, G. (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):209–226. → pages 15

Cressie, N. and Wikle, C. K. (2015). Statistics for Spatio-temporal Data. John Wiley & Sons. → pages 11, 14

Damian, D., Sampson, P. D., and Guttorp, P. (2001). Bayesian estimation of semi-parametric non-stationary spatial covariance structures. Environmetrics, 12(2):161–178. → pages 19

Datta, A., Banerjee, S., Finley, A. O., and Gelfand, A. E. (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111(514):800–812. → pages 13, 15, 16, 17, 18, 65, 66

Davis, T. A. (2006). Direct Methods for Sparse Linear Systems, volume 2. SIAM. → pages 84

De Boor, C. (1978). A Practical Guide to Splines, volume 27. Springer-Verlag New York. → pages 30, 95

Del Moral, P., Doucet, A., and Jasra, A. (2006). Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436. → pages 79

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39:1–38. → pages 102

Dugas, A. F., Jalalpour, M., Gel, Y., Levin, S., Torcaso, F., Igusa, T., and Rothman, R. E. (2013). Influenza forecasting with Google flu trends. PLoS ONE, 8(2):e56176. → pages 114

Dukic, V., Lopes, H. F., and Polson, N. G. (2012). Tracking epidemics with Google flu trends data and a state-space SEIR model. Journal of the American Statistical Association, 107(500):1410–1426. → pages 33

Eidsvik, J. and Shaby, B. (2014). Estimation and prediction in spatial models with block composite likelihoods. Journal of Computational and Graphical Statistics, 23(2):295–315. → pages 18, 66

Elkaim, G. H., Decker, E. B., Oliver, G., and Wright, B. (2006). Marine Mammal Marker (MAMMARK) dead reckoning sensor for in-situ environmental monitoring. Proceedings of the ION/IEEE PLANS. → pages 6

Fan, R., Zhu, B., and Wang, Y. (2014). Stochastic dynamic models and Chebyshev splines. Canadian Journal of Statistics, 42(4):610–634. → pages 112

Fleming, C. H., Fagan, W. F., Mueller, T., Olson, K. A., Leimgruber, P., and Calabrese, J. M. (2016). Estimating where and how animals travel: An optimal framework for path reconstruction from autocorrelated tracking data.
Ecology. → pages 6

Foley, K. M. and Fuentes, M. (2008). A statistical framework to combine multivariate spatial data and physical models for hurricane surface wind prediction. Journal of Agricultural, Biological, and Environmental Statistics, 13(1):37–59. → pages 32

Franks, A., Miller, A., Bornn, L., Goldsberry, K., et al. (2015). Characterizing the spatial structure of defensive skill in professional basketball. The Annals of Applied Statistics, 9(1):94–121. → pages 94

Frühwirth-Schnatter, S. (2006). Finite Mixture and Markov Switching Models: Modeling and Applications to Random Processes. Springer. → pages 139

Fuentes, M. and Raftery, A. E. (2005). Model evaluation and spatial interpolation by Bayesian combination of observations with outputs from numerical models. Biometrics, 61(1):36–45. → pages 31, 32, 35

Furrer, R., Genton, M. G., and Nychka, D. (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15(3):502–523. → pages 12

Ginsberg, J., Mohebbi, M. H., Patel, R. S., Brammer, L., Smolinski, M. S., and Brilliant, L. (2009). Detecting influenza epidemics using search engine query data. Nature, 457(7232):1012–1014. → pages 1

Gramacy, R. B. and Lee, H. K. H. (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103(483):1119–1130. → pages 13, 75

Grewal, M. S., Weill, L. R., and Andrews, A. P. (2007). Global Positioning Systems, Inertial Navigation, and Integration. John Wiley & Sons. → pages 6

Grün, B., Leisch, F., et al. (2008). Flexmix version 2: Finite mixtures with concomitant variables and varying and constant parameters. Journal of Statistical Software, 28(4):1–35. → pages 102

Gurarie, E., Andrews, R. D., and Laidre, K. L. (2009). A novel method for identifying behavioural changes in animal movement data. Ecology Letters, 12(5):395–408. → pages 68

Hazel, J. (2009). Evaluation of fast-acquisition GPS in stationary tests and fine-scale tracking of green turtles. Journal of Experimental Marine Biology and Ecology, 374(1):58–68. → pages 5

Henderson, H. V. and Searle, S. R. (1981). On deriving the inverse of a sum of matrices. SIAM Review, 23(1):53–60. → pages 15, 40

Higdon, D., Swall, J., and Kern, J. (1999). Non-stationary spatial modeling. Bayesian Statistics, 6(1):761–768. → pages 19, 20

Hofmann-Wellenhof, B., Lichtenegger, H., and Collins, J. (2012). Global Positioning System: Theory and Practice. Springer Science & Business Media. → pages 5

Hooten, M., Johnson, D., McClintock, B., and Morales, J. (2017). Animal Movement: Statistical Models for Telemetry Data. Chapman & Hall/CRC, Boca Raton, Florida, USA. → pages 93

Hooten, M. B., Anderson, J., and Waller, L. A. (2010). Assessing North American influenza dynamics with a statistical SIRS model. Spatial and Spatio-temporal Epidemiology, 1(2):177–185. → pages 114

Hooten, M. B., Hanks, E. M., Johnson, D. S., and Alldredge, M. W. (2013). Reconciling resource utilization and resource selection functions. The Journal of Animal Ecology, 82(6):1146–54. → pages 106

Hooten, M. B. and Johnson, D. S. (2017). Basis function models for animal movement. Journal of the American Statistical Association, (just-accepted). → pages 6, 20

Horne, J. S., Garton, E., Krone, S., and Lewis, J. (2007). Analyzing animal movements using Brownian bridges. Ecology, 88(9):2354–2363. → pages 36

Huang, H., Cressie, N., and Gabrosek, J. (2002). Fast, resolution-consistent spatial prediction of global processes from satellite data. Journal of Computational and Graphical Statistics, 11(1):63–88. → pages 14

Humphries, N. E., Queiroz, N., Dyer, J. R. M., Pade, N. G., Musyl, M. K., Schaefer, K. M., Fuller, D. W., Brunnschweiler, J. M., Doyle, T. K., Houghton, J. D. R., Hays, G. C., Jones, C. S., Noble, L. R., Wearmouth, V. J., Southall, E. J., and Sims, D. W. (2010).
Environmental context explains Lévy and Brownian movement patterns of marine predators. Nature, 465(7301):1066–9. → pages 36

Iacus, S. M. (2009). Simulation and Inference for Stochastic Differential Equations: with R Examples. Springer Science & Business Media. → pages 23, 25, 115

Johnson, D., London, J., Lea, M.-A., and Durban, J. (2008). Continuous-time correlated random walk model for animal telemetry data. Ecology, 89(5):1208–1215. → pages 6, 54

Johnson, M. P. and Tyack, P. L. (2003). A digital acoustic recording tag for measuring the response of wild marine mammals to sound. IEEE Journal of Oceanic Engineering, 28(1):3–12. → pages 5

Jonsen, I., Flemming, J., and Myers, R. (2005). Robust state-space modeling of animal movement data. Ecology, 86(11):2874–2880. → pages 6

Kalman, R. E. (1960). A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 82(1):35–45. → pages 28

Katzfuss, M. (2016). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, (just-accepted). → pages 66

Kaufman, C. G., Schervish, M. J., and Nychka, D. W. (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103(484):1545–1555. → pages 12

Kranstauber, B., Kays, R., Lapoint, S. D., Wikelski, M., and Safi, K. (2012). A dynamic Brownian bridge movement model to estimate utilization distributions for heterogeneous animal movement. The Journal of Animal Ecology, 81(4):738–46. → pages 36, 68

Kranstauber, B., Safi, K., and Bartumeus, F. (2014). Bivariate Gaussian bridges: directional factorization of diffusion in Brownian bridge models. Movement Ecology, 2(1):5. → pages 36, 68

Lauritzen, S. L. (1996). Graphical Models. Clarendon Press, Oxford, UK. → pages 132

Lazer, D., Kennedy, R., King, G., and Vespignani, A. (2014a). Google flu trends still appears sick: An evaluation of the 2013-2014 flu season. SSRN Electronic Journal, pages 1–11. → pages 114

Lazer, D., Kennedy, R., King, G., and Vespignani, A. (2014b). The parable of Google Flu: traps in big data analysis. Science, 343(6176):1203–1205. → pages 1, 114

Le, N. D. and Zidek, J. V. (2006). Statistical Analysis of Environmental Space-time Processes. Springer. → pages 11, 63

Leis, J. W. (2011). Digital Signal Processing Using MATLAB for Students and Researchers. John Wiley & Sons. → pages 63

Lemos, R. T. and Sansó, B. (2012). Conditionally linear models for non-homogeneous spatial random fields. Statistical Methodology, 9(1-2):275–284. → pages 19

Lindgren, F., Rue, H., and Lindström, J. (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4):423–498. → pages 13, 18, 19, 20, 34, 111

Liu, Y. (2014). BayesianAnimalTracker: Bayesian Melding of GPS and DR Path for Animal Tracking. R package version 1.2. → pages 42

Liu, Y., Battaile, B. C., Zidek, J. V., and Trites, A. W. (2015). Bias correction and uncertainty characterization of Dead-Reckoned paths of marine mammals. Animal Biotelemetry, 3(51). → pages 7, 42, 49, 50, 62

Liu, Y., Dinsdale, D., Jun, S.-H., Briercliffe, C., and Bone, J. (2016). Statistical learning of basketball strategy: The potential field approach. Cascadia Symposium on Statistics in Sports, Vancouver. → pages 1, 34

Liu, Z., Le, N. D., and Zidek, J. V. (2011). An empirical assessment of Bayesian melding for mapping ozone pollution. Environmetrics, 22(3):340–353. → pages 32, 137

Martins, T. G., Simpson, D., Lindgren, F., and Rue, H. (2013). Bayesian computing with INLA: New features. Computational Statistics & Data Analysis, 67:68–83. → pages 20, 44

McClintock, B. T., Johnson, D. S., Hooten, M. B., Ver Hoef, J. M., and Morales, J. M. (2014). When to be discrete: the importance of time formulation in understanding animal movement. Movement Ecology, 2(1):21.
→ pages 5, 62

McIntyre, A., Brooks, J., Guttag, J., and Wiens, J. (2016). Recognizing and analyzing ball screen defense in the NBA. SLOAN Sports Analytics Conference, Boston. → pages 94

Miller, A., Bornn, L., Adams, R., and Goldsberry, K. (2014). Factorized point process intensities: A spatial analysis of professional basketball. arXiv preprint arXiv:1401.0942. → pages 1

Nordstrom, C. A., Benoit-Bird, K. J., Battaile, B. C., and Trites, A. W. (2013). Northern fur seals augment ship-derived ocean temperatures with higher temporal and spatial resolution data in the eastern Bering Sea. Deep Sea Research Part II, 94:257–273. → pages 4, 8

Nychka, D. W., Bandyopadhyay, S., Hammerling, D., Lindgren, F., and Sain, S. (2015). A multi-resolution Gaussian process model for the analysis of large spatial data sets. Journal of Computational and Graphical Statistics, 24(2):579–599. → pages 11, 13, 15, 16, 19, 66

O'Hagan, A. (1995). Fractional Bayes factors for model comparison. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):99–138. → pages 70, 71

Ozaki, T. (1992). A bridge between nonlinear time series models and nonlinear stochastic dynamical systems: A local linearization approach. Statistica Sinica, 2(1):113–135. → pages 25, 79, 115

Paciorek, C. J. and Schervish, M. J. (2006). Spatial modelling using a new class of nonstationary covariance functions. Environmetrics, 17(5):483–506. → pages 19, 20

Park, C., Huang, J., and Ding, Y. (2011). Domain decomposition approach for fast Gaussian process regression of large spatial data sets. The Journal of Machine Learning Research, 12:1697–1728. → pages 13

Petris, G., Petrone, S., and Campagnoli, P. (2009). Dynamic linear models. In Dynamic Linear Models with R, pages 31–84. Springer. → pages 26, 28, 84

Pissanetzky, S. (1984). Sparse Matrix Technology-electronic Edition. Academic Press. → pages 12

Pozdnyakov, V., Meyer, T., Wang, Y.-B., and Yan, J. (2014). On modeling animal movements using Brownian motion with measurement error. Ecology, 95(2):247–253. → pages 36

R Development Core Team (2008). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. → pages 95

Ramsay, J. O., Hooker, G., and Graves, S. (2009). Functional Data Analysis with R and MATLAB. Springer Science & Business Media. → pages 30

Ramsay, J. O., Wickham, H., Graves, S., and Hooker, G. (2014). fda: Functional Data Analysis. R package version 2.4.4. → pages 81

Rogers, L. C. G. and Williams, D. (2000). Diffusions, Markov Processes and Martingales: Volume 2, Itô Calculus, volume 2. Cambridge University Press. → pages 79

Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. CRC Press. → pages 11, 13, 18, 83

Rue, H., Martino, S., and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392. → pages 13, 20, 21, 34, 39, 137, 138

Rundel, C. W., Schliep, E. M., Gelfand, A. E., and Holland, D. M. (2015). A data fusion approach for spatial analysis of speciated PM2.5 across time. Environmetrics, 26(8):515–525. → pages 32

Russell, J. C., Hanks, E. M., Haran, M., and Hughes, D. P. (2016). A spatially-varying stochastic differential equation model for animal movement. arXiv preprint arXiv:1603.07630. → pages 93

Sacks, J., Welch, W., Mitchell, T., and Wynn, H. (1989). Design and analysis of computer experiments. Statistical Science, 4(4):409–423. → pages 35

Sahu, S. K., Gelfand, A. E., and Holland, D. M. (2010). Fusing point and areal level space–time data with application to wet deposition. Journal of the Royal Statistical Society: Series C (Applied Statistics), 59(1):77–103. → pages 32

Salzberg, S. (2014). Why Google Flu is a failure. Forbes.com [online], pages 03–24. → pages 114

Sampson, P. and Guttorp, P. (1992). Nonparametric estimation of nonstationary spatial covariance structure.
Journal of the American Statistical Association, 87(417):108–119. → pages 19

Särkkä, S., Solin, A., and Hartikainen, J. (2013). Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing: A look at Gaussian process regression through Kalman filtering. IEEE Signal Processing Magazine, 30(4):51–61. → pages 13

Sawyer, H., Kauffman, M. J., Nielson, R. M., and Horne, J. S. (2009). Identifying and prioritizing ungulate migration routes for landscape-level conservation. Ecological Applications, 19(8):2016–2025. → pages 36

Scharf, H. R., Hooten, M. B., Fosdick, B. K., Johnson, D. S., London, J. M., Durban, J. W., et al. (2017). Dynamic social networks based on movement. The Annals of Applied Statistics, 10(4):2182–2202. → pages 100

Schmidt, A. M. and O'Hagan, A. (2003). Bayesian inference for non-stationary spatial covariance structure via spatial deformations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3):743–758. → pages 19

Shaddick, G. and Zidek, J. V. (2015). Spatio-temporal Methods in Environmental Epidemiology. CRC Press. → pages 11

Silverman, B. W. (1985). Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society. Series B (Methodological), 47(1):1–52. → pages 110, 112

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Linde, A. (2014). The deviance information criterion: 12 years on. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(3):485–493. → pages 53

Spiegelhalter, D. J., Best, N. G., Carlin, B. P., and Van Der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):583–639. → pages 53

Stein, M. L. (2014). Limitations on low rank approximations for covariance matrices of spatial data. Spatial Statistics, 8:1–19. → pages 16

Stein, M. L., Chi, Z., and Welty, L. (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(2):275–296. → pages 18, 66

Stirzaker, G. G. D. and Grimmett, D. (2001). Probability and Random Processes. Oxford Science Publications. → pages 40, 131

Tzeng, S., Huang, H.-C., and Cressie, N. (2005). A fast, optimal spatial-prediction method for massive datasets. Journal of the American Statistical Association, 100(472):1343–1357. → pages 14

Vecchia, A. (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 50(2):297–312. → pages 18

Vrettas, M. D., Opper, M., and Cornford, D. (2015). Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations. Physical Review E, 91(1):012148. → pages 25, 79, 115

Wahba, G. (1965). A least squares estimate of satellite attitude. SIAM Review, 7(3):409–409. → pages 7

Wang, K.-C. and Zemel, R. (2016). Classifying NBA offensive plays using neural networks. SLOAN Sports Analytics Conference, Boston. → pages 94

Watson, G. (1984). Smoothing and interpolation by kriging and with splines. Journal of the International Association for Mathematical Geology, 16(6):601–615. → pages 112

Williams, C. K. and Rasmussen, C. E. (2006). Gaussian Processes for Machine Learning. The MIT Press. → pages 11

Wilson, R. P. and Wilson, M. P. (1988). Dead reckoning–a new technique for determining penguin movements at sea. Meeresforschung–Reports on Marine Research, 32:155–158. → pages 5, 6, 9

Wilson, R. P., Liebsch, N., Davies, I. M., Quintana, F., Weimerskirch, H., Storch, S., Lucke, K., Siebert, U., Zankl, S., Müller, G., Zimmer, I., Scolaro, A., Campagna, C., Plötz, J., Bornemann, H., Teilmann, J., and McMahon, C. R. (2007). All at sea with animal tracks; methodological and analytical solutions for the resolution of movement. Deep Sea Research Part II, 54(3-4):193–210. → pages 1, 4, 5, 6, 7, 8, 9

Zhang, H. (2004).
Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics. Journal of the American Statistical Association, 99(465):250–261. → pages 111

Zhu, B., Song, P. X.-K., and Taylor, J. M. (2011a). Stochastic functional data analysis: A diffusion model-based approach. Biometrics, 67(4):1295–1304. → pages 78

Zhu, B., Taylor, J. M., and Song, P. X.-K. (2011b). Semiparametric stochastic modeling of the rate function in longitudinal studies. Journal of the American Statistical Association, 106(496):1485–1495. → pages 78

Zidek, J. V., Le, N. D., and Liu, Z. (2012). Combining data and simulated data for space–time fields: application to ozone. Environmental and Ecological Statistics, 19(1):37–56. → pages 31, 32, 34, 44

Appendix A

Inferential Methods for Bayesian Melding

This section includes the detailed derivations of the inferential methods needed for BM.

A.1 Explicit form of 〈β, η|X, Y〉

Following the model and notation in (4.2) to (4.7), it is easy to find

〈β, η|X, Y〉 ∝ 〈β, η, X, Y〉 = 〈η〉〈Y|η〉〈X|β, η〉   (A.1)
∝ |R|⁻¹ᐟ² exp{−(1/2)(η − f)ᵀ R⁻¹ (η − f)}
× |D|⁻¹ᐟ² exp{−(1/2)(Y − Gη)ᵀ D⁻¹ (Y − Gη)}
× |C|⁻¹ᐟ² exp{−(1/2)(X − Zβ − η)ᵀ C⁻¹ (X − Zβ − η)},

where the following notation is used in addition to that from (4.2) to (4.7):

• f = (f(1), f(2), ..., f(T−1))ᵀ is a vector of length T−1 giving the prior mean of the random part of η.
• D = σ²_G I_{K−1} is the covariance matrix of Y conditional on η, where I_m stands for the identity matrix of dimension m.
• Z is the design matrix for h as in (4.5).
• G is a (K−1)×(T−1) matrix with G_{k,j} = 1 if j = t_k and G_{k,j} = 0 otherwise, for k = 1, 2, ..., K−1.
Notice that the first time point is removed.

• R = R(1 : (T−1), 1 : (T−1)) is the (T−1)×(T−1) covariance matrix for the Brownian Bridge.
• C = C(1 : T, 1 : T) is the T×T covariance matrix for the Brownian Motion error term ξ.

To further simplify (A.1), the following notation is needed:

• ζ = (βᵀ, ηᵀ)ᵀ, the joint vector of β and η, of length Q+T−1.
• U is a T×(Q+T−1) matrix with U(1 : T, 1 : Q) = Z, U(1 : (T−1), (Q+1) : (Q+T−1)) = I_{T−1} and U(T, (Q+1) : (Q+T−1)) = 0, which maps Uζ = h + η.
• V is a (T−1)×(Q+T−1) matrix with V(1 : (T−1), 1 : Q) = 0 and V(1 : (T−1), (Q+1) : (Q+T−1)) = I_{T−1}, such that Vζ = η.
• W is a (K−1)×(Q+T−1) matrix with W(1 : (K−1), 1 : Q) = 0 and W(1 : (K−1), (Q+1) : (Q+T−1)) = G, such that Wζ = Gη.

Some algebra simplifies (A.1) and yields

〈ζ|X, Y〉 ∝ |R|⁻¹ᐟ² |D|⁻¹ᐟ² |C|⁻¹ᐟ²   (A.2)
× exp{−(1/2)[(ζ − M1⁻¹M2)ᵀ M1 (ζ − M1⁻¹M2) + M3 − M2ᵀM1⁻¹M2]}
∝ exp{−(1/2)(ζ − M1⁻¹M2)ᵀ M1 (ζ − M1⁻¹M2)},

where

M1 = VᵀR⁻¹V + WᵀD⁻¹W + UᵀC⁻¹U,
M2 = VᵀR⁻¹f + WᵀD⁻¹Y + UᵀC⁻¹X,
M3 = fᵀR⁻¹f + YᵀD⁻¹Y + XᵀC⁻¹X.

The posterior distribution of β, η conditional on φ is a multivariate Gaussian density. The main computational complexity involved in evaluating this density comes from the calculation of M1⁻¹M2.

A.2 Derivation of (4.11) and (4.12)

It is well known that the Brownian Bridge and Brownian Motion processes are Markovian (Stirzaker and Grimmett, 2001), such that

[η(t)|η(t−1), η(t−2)] = [η(t)|η(t−1)],
[ξ(t)|ξ(t−1), ξ(t−2)] = [ξ(t)|ξ(t−1)],

where η(t) is the Brownian Bridge process in our model (4.2) and ξ(t) is the Brownian Motion process in (4.5). This Markov property directly implies the conditional independence of past and future given the present:

[η(t−1), η(t+1)|η(t)] = [η(t−1)|η(t)][η(t+1)|η(t)],
[ξ(t−1), ξ(t+1)|ξ(t)] = [ξ(t−1)|ξ(t)][ξ(t+1)|ξ(t)].

These properties help us derive (4.11) and (4.12). As an illustration, consider the case where T = 4 (recall that η(0) and η(4) are fixed, so only η(1), η(2), η(3) are random), and one GPS observation is available at t = 2.
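Since the posterior of ζ given φ is Gaussian with precision M1 and linear term M2, its mean can be obtained by solving the linear system M1 ζ = M2 rather than forming M1⁻¹ explicitly. The sketch below (Python; the matrices are small random stand-ins for the quantities built from R, C, D, U, V, W, not the thesis implementation) illustrates this standard trick.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: Q covariates and T time points, as in (A.2).
Q, T = 2, 50
n = Q + T - 1

# Random symmetric positive-definite stand-in for the precision matrix
# M1 = V'R^{-1}V + W'D^{-1}W + U'C^{-1}U, and a random linear term M2.
A = rng.normal(size=(n, n))
M1 = A @ A.T + n * np.eye(n)
M2 = rng.normal(size=n)

# Posterior of zeta is N(M1^{-1} M2, M1^{-1}): solve instead of inverting
# for the mean; the full covariance (if needed) is the explicit inverse.
post_mean = np.linalg.solve(M1, M2)
post_cov = np.linalg.inv(M1)

# Both routes agree; the solve is cheaper and numerically more stable.
assert np.allclose(post_mean, post_cov @ M2)
```

The same pattern extends to Cholesky-based solvers when M1 is large and sparse, which is the regime the thesis's approximations target.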
The first part of (4.10), 〈η(1 : T \ t_{1:K})|β, ηG, X, Y〉, in this situation is

〈η(1), η(3)|η(2), X(1 : 3), Y(2), β〉 = 〈η(1), η(3)|η(2), X(1 : 3), β〉,   (A.3)

as Y(2) only depends on η(2) and is independent of all the other random variables. For brevity, we omit the β term in all the following derivations and let Xi = X(i), ηi = η(i), ξi = ξ(i). Using the conditional independence properties of η and ξ together with certain variable transformations, we can simplify (A.3) into

〈η1, η3|η2, X1:3〉 = 〈η1, η2, η3, X1, X2, X3〉 / 〈η2, X1, X2, X3〉 = 〈η1, η2, η3, ξ1, ξ2, ξ3〉 / 〈η2, X1, X2, X3〉
= 〈η1, η2, η3〉〈ξ1, ξ2, ξ3〉 / 〈η2, X1, X2, X3〉 = 〈η1|η2〉〈η3|η2〉〈η2〉〈ξ1|ξ2〉〈ξ3|ξ2〉〈ξ2〉 / 〈η2, X1, X2, X3〉
= 〈η1|η2〉〈ξ1|ξ2〉 〈η3|η2〉〈ξ3|ξ2〉 〈η2〉〈ξ2〉 / 〈η2, X1, X2, X3〉
= 〈η1, X1|η2, X2〉〈η3, X3|η2, X2〉〈η2, X2〉 / 〈η2, X1, X2, X3〉
= 〈η1, X1|η2, X2〉〈η3, X3|η2, X2〉 / 〈X1, X3|η2, X2〉 = ...
= 〈η1, X1|η2, X2〉〈η3, X3|η2, X2〉 / [〈X1|η2, X2〉〈X3|η2, X2〉]
= 〈η1|η2, X1, X2〉〈η3|η2, X2, X3〉.

The above illustrates how we prove (4.11) and (4.12); we omit the lengthy proof of the general case. Alternatively, these expressions can be easily proved via the conditional independence properties of a graphical model, as in Lauritzen (1996).

A.3 Explicit expression for (4.12)

In the above subsection, we showed that

〈η(tk+1 : tk+1−1)|η(tk), η(tk+1), β, X, Y〉 = 〈η(tk+1 : tk+1−1)|η(tk), η(tk+1), β, X(tk : tk+1)〉.

We derive the explicit expression for the right-hand side above. First, it is easy to use the conditional independence property to find

〈η(tk+1 : tk+1−1)|η(tk), η(tk+1), β, X(tk : tk+1)〉
= 〈η(tk+1 : tk+1−1)|η(tk), η(tk+1), β, X(tk), X(tk+1)〉 〈X(tk+1 : tk+1−1)|η(tk), η(tk+1), β, X(tk), X(tk+1)〉
= 〈η(tk+1 : tk+1−1)|η(tk), η(tk+1)〉 〈X(tk+1 : tk+1−1)|η(tk), η(tk+1), β, X(tk), X(tk+1)〉.

Define two new variables,

ηc(tk+1 : tk+1−1) = η(tk+1 : tk+1−1)|η(tk), η(tk+1),

and

Xc(tk+1 : tk+1−1) = X(tk+1 : tk+1−1)|η(tk), η(tk+1), β, X(tk), X(tk+1),

such that

〈η(tk+1 : tk+1−1)|η(tk), η(tk+1), β, X(tk : tk+1)〉 = 〈ηc(tk+1 : tk+1−1)|Xc(tk+1 : tk+1−1)〉.
(A.4)

With the basic properties of the Brownian Bridge, it is easy to show that

ηc(tk+1 : tk+1−1) ∼ MVN(fk, σ²_H Rk),   (A.5)

where

fk(t) = f(t) + a′k(t)(η(tk) − f(tk)) + ak(t)(η(tk+1) − f(tk+1))   (A.6)
      = a′k(t) η(tk) + ak(t) η(tk+1),   (A.7)

ak(t) = (t − tk)/(tk+1 − tk),
a′k(t) = 1 − ak(t) = (tk+1 − t)/(tk+1 − tk),
Rk(s, t) = (s − tk)(tk+1 − t)/(tk+1 − tk),   tk < s ≤ t < tk+1,

and MVN(·, ·) denotes a multivariate normal distribution. The definition f(t) = A + (B − A)t/T is used to simplify (A.6) into (A.7) as follows:

a′k(t) f(tk) + ak(t) f(tk+1)
= [(tk+1 − t)/(tk+1 − tk)] (A + (B − A) tk/T) + [(t − tk)/(tk+1 − tk)] (A + (B − A) tk+1/T)
= A (tk+1 − tk)/(tk+1 − tk) + (B − A)[(tk+1 − t) tk + (t − tk) tk+1] / [(tk+1 − tk) T]
= A + (B − A)(tk+1 − tk) t / [(tk+1 − tk) T] = A + (B − A) t/T = f(t).

This leads to f(t) − a′k(t) f(tk) − ak(t) f(tk+1) = 0.

The definitions of Xc and X in (4.5) imply that Xc is the sum of a deterministic term and two independent Gaussian processes:

Xc(tk+1 : tk+1−1) = h(tk+1 : tk+1−1) + ηc(tk+1 : tk+1−1) + ξc(tk+1 : tk+1−1),

where h is defined in (4.5) and ξc is defined similarly to ηc:

ξc(tk+1 : tk+1−1) = ξ(tk+1 : tk+1−1)|ξ(tk), ξ(tk+1) ∼ MVN(gk, σ²_D Rk), with gk(t) = a′k(t)ξ(tk) + ak(t)ξ(tk+1).

In this way, the marginal distribution of Xc is

Xc(tk+1 : tk+1−1) ∼ MVN(uk, (σ²_H + σ²_D) Rk),   (A.8)

where

uk(t) = h(t) + fk(t) + gk(t)
     = h(t) + a′k(t)(η(tk) + ξ(tk)) + ak(t)(η(tk+1) + ξ(tk+1))
     = h(t) + a′k(t)(X(tk) − h(tk)) + ak(t)(X(tk+1) − h(tk+1)).

Also, the covariance between ηc and Xc is

Cov(ηc, Xc) = Cov(ηc, ηc) = σ²_H Rk.   (A.9)

From (A.5), (A.8), and (A.9), we can easily see that no matrices need to be inverted when computing (A.4): in the conditional covariance, the inverse of Rk is canceled by Rk, so that

Cov(ηc, Xc)(Cov(Xc))⁻¹ = σ²_H Rk ((σ²_H + σ²_D) Rk)⁻¹ = [σ²_H/(σ²_H + σ²_D)] I_{tk+1−tk−1},
Cov(ηc, Xc)(Cov(Xc))⁻¹ Cov(Xc, ηc) = σ²_H (1 − σ²_H/(σ²_H + σ²_D)) Rk = [σ²_H σ²_D/(σ²_H + σ²_D)] Rk.

We only need to calculate ρ = σ²_H/(σ²_H + σ²_D) for the conditional mean and covariance.
In this way, the desired posterior in (A.4) is

〈η(tk+1 : tk+1−1)|η(tk), η(tk+1), β, X(tk : tk+1)〉   (A.10)
= 〈ηc(tk+1 : tk+1−1)|Xc(tk+1 : tk+1−1)〉
∼ MVN(fk + ρ(X(tk+1 : tk+1−1) − uk), ρ σ²_D Rk).

A.4 Explicit expression of [φ, β, ηG|XG, Y]

Following the notation in Appendix A.1,

[φ, β, ηG|XG, Y] ∝ [φ][ηG|φ][Y|ηG][X(t1:K)|β, η(t1:K), φ]   (A.11)
∝ [φ] |RG|⁻¹ᐟ² exp{−(1/2)(ηG − fG)ᵀ RG⁻¹ (ηG − fG)}
× |D|⁻¹ᐟ² exp{−(1/2)(Y − ηG)ᵀ D⁻¹ (Y − ηG)}
× |CG|⁻¹ᐟ² exp{−(1/2)(XG − ZGβ − ηG)ᵀ CG⁻¹ (XG − ZGβ − ηG)},

where XG = X(t0:K) and RG = R(t0:K, t0:K); similar notation applies to ηG, fG, CG, ZG. We also introduce the following notation, analogous to that in Appendix A.1 — specifically, the sub-vectors and sub-matrices of those in Appendix A.1 with respect to the GPS observations:

• ζG = (βᵀ, ηGᵀ)ᵀ, the joint vector of β and ηG, of length Q+K−1;
• UG is a K×(Q+K−1) matrix with UG(1 : K, 1 : Q) = ZG, UG(1 : (K−1), (Q+1) : (Q+K−1)) = I_{K−1} and UG(K, (Q+1) : (Q+K−1)) = 0, which maps UGζG = hG + ηG;
• VG is a (K−1)×(Q+K−1) matrix with VG(1 : (K−1), 1 : Q) = 0 and VG(1 : (K−1), (Q+1) : (Q+K−1)) = I_{K−1}, such that VGζG = ηG.

Some algebra simplifies (A.11) to yield

[φ, β, ηG|XG, Y] ∝ (1/σ²_H)(1/σ²_D) |RG|⁻¹ᐟ² |D|⁻¹ᐟ² |CG|⁻¹ᐟ²   (A.12)
× exp{−(1/2)[(ζG − MG1⁻¹MG2)ᵀ MG1 (ζG − MG1⁻¹MG2) + MG3 − MG2ᵀMG1⁻¹MG2]},

where

MG1 = VGᵀRG⁻¹VG + VGᵀD⁻¹VG + UGᵀCG⁻¹UG,
MG2 = VGᵀRG⁻¹fG + VGᵀD⁻¹Y + UGᵀCG⁻¹XG,
MG3 = fGᵀRG⁻¹fG + YᵀD⁻¹Y + XGᵀCG⁻¹XG.

Following (A.12), it is easy to show that

[ζG|XG, Y, φ] ∼ MVN(MG1⁻¹MG2, MG1⁻¹).   (A.13)

Integrating ζG out, we get

[φ|XG, Y] ∝ (1/σ²_H)(1/σ²_D) |RG|⁻¹ᐟ² |D|⁻¹ᐟ² |CG|⁻¹ᐟ² |MG1|⁻¹ᐟ²   (A.14)
× exp{−(1/2)[MG3 − MG2ᵀMG1⁻¹MG2]}.

A.5 Marginal distribution of η at the non-GPS points

With (A.13), we can marginalize ηG, β out in (A.10) to obtain 〈η(tk+1 : tk+1−1)|X, Y〉, which is later used in the numerical integration. Let ζk = (βᵀ, η(tk), η(tk+1))ᵀ, and denote its mean and covariance matrix obtained from (A.13) by ζ̃k and Σ̃k, respectively.

The uk of (A.8) can be written as a linear transformation of ζk, such that uk = Bkζk, where Bk = [ρ(Zk − AkZGk), Ak].
Here {Z_k} denotes the rows of the design matrix, Z(t_k+1 : t_{k+1}−1, 1:Q), corresponding to this period of non-GPS observations; Z_{Gk} = Z(t_{k,k+1}, 1:Q) corresponds to the rows of the design matrix for the two GPS observations; and A_k is the matrix of the linear weights a′_k(t), a_k(t) as in Equation (A.5). Marginalizing ζ_k out in (A.10) results in

    〈η(t_k+1 : t_{k+1}−1) | X, Y〉 ∼ MVN(B_k ζ̃_k, B_k Σ̃_k B_k^T).    (A.15)

A.6 Integration over the variance parameters φ

In the BM literature (Liu et al., 2011), the integral in (4.15) was usually calculated by MCMC. However, the heavy computational burden of the MCMC technique has to be avoided in our application. Moreover, it may not be practical to store the MCMC samples of the high-dimensional parameter η. The first alternative is to avoid the integration via the empirical Bayesian approach as in Casella (1985):

    [β, η | X, Y] ≈ [β, η | X, Y, φ̂],

where φ̂ is the maximum likelihood estimate of φ after η, β are marginalized out:

    φ̂ = argmax_φ log([φ | X_G, Y]).    (A.16)

The empirical Bayesian approach is computationally simple, especially when we can explicitly evaluate the marginal likelihood. However, it fails to reflect the uncertainty in φ and thus underestimates the uncertainty in the posterior of η. To overcome this issue, we use a numerical integration method like that in INLA (Rue et al., 2009), which approximates the integral on a grid determined by the likelihood surface.

Let H denote the 2 × 2 Hessian matrix at φ̂ = (σ̂²_H, σ̂²_D)^T in (A.16) and Σ = H^{−1}. With the eigenvalue decomposition Σ = AΛA^T, the space of φ can be explored by φ(z) = φ̂ + AΛ^{1/2} z, where z is a 2 × 1 vector. To find the grid for numerical integration, we start from z = 0 and search in the positive direction of z_1; that is, we increase j ∈ N₊ and set z = (jδ_z, 0) as long as

    log([φ(0) | X_G, Y]) − log([φ(z) | X_G, Y]) < δ_π,

where δ_z is the step size and δ_π controls the magnitude of probability mass that will be included in the numerical integration.
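This directed search is easy to implement generically. The sketch below (illustrative; `build_grid`, `logpost`, and the toy surface are ours, not the thesis code) walks along each signed eigen-direction until the log-posterior has dropped by δ_π, then returns the grid points and their normalized weights.

```python
import numpy as np

def build_grid(logpost, phi_hat, hess_inv, delta_z=1.0, delta_pi=3.0, max_steps=50):
    """INLA-style exploration: step along eigen-directions of hess_inv until the
    log-posterior drops by delta_pi, then weight the resulting 2-D grid."""
    lam, A = np.linalg.eigh(hess_inv)
    to_phi = lambda z: phi_hat + A @ (np.sqrt(lam) * z)   # phi(z) = phi_hat + A Lam^{1/2} z
    lp0 = logpost(np.asarray(phi_hat, dtype=float))

    def n_steps(axis, sign):
        j = 0
        while j < max_steps:
            z = np.zeros(2)
            z[axis] = sign * (j + 1) * delta_z
            if lp0 - logpost(to_phi(z)) >= delta_pi:      # stop once mass is negligible
                break
            j += 1
        return j

    r1 = np.arange(-n_steps(0, -1), n_steps(0, +1) + 1)
    r2 = np.arange(-n_steps(1, -1), n_steps(1, +1) + 1)
    zs = np.array([[j1 * delta_z, j2 * delta_z] for j1 in r1 for j2 in r2])
    phis = np.array([to_phi(z) for z in zs])
    logw = np.array([logpost(p) for p in phis])
    w = np.exp(logw - logw.max())                         # stable normalization
    return phis, w / w.sum()
```

Because the z-space is whitened by Λ^{1/2}, a single step size δ_z is appropriate in every direction regardless of how elongated the likelihood surface is in φ-space.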
After searching on the positive side, we switch direction and search on the negative side of z_1. This procedure is repeated for both dimensions of z. For our BM model above, if the search stops after J₁⁺, J₂⁺ steps in the positive directions of z_1 and z_2 respectively and after J₁⁻, J₂⁻ steps in their negative directions, a grid of size (J₁⁺ + J₁⁻ + 1) × (J₂⁺ + J₂⁻ + 1) is used in the numerical integration, and the points on this grid are z_{j₁,j₂} = δ_z(j₁, j₂), with j₁ ∈ {−J₁⁻, −J₁⁻+1, ..., −1, 0, 1, ..., J₁⁺} and j₂ ∈ {−J₂⁻, −J₂⁻+1, ..., −1, 0, 1, ..., J₂⁺}. The integral in (4.15) is then approximated by

    [η, β | X, Y] ≈ Σ_{j₁=−J₁⁻}^{J₁⁺} Σ_{j₂=−J₂⁻}^{J₂⁺} w_{j₁,j₂} × [η, β | φ(z_{j₁,j₂}), X, Y],    (A.17)

where

    w_{j₁,j₂} = [φ(z_{j₁,j₂}) | X_G, Y] / Σ_{j₁} Σ_{j₂} [φ(z_{j₁,j₂}) | X_G, Y].

Note that (A.17) resembles Equation (5) of Rue et al. (2009). As the joint posterior distribution of η, β conditional on φ follows a multivariate Gaussian distribution, the posterior [η, β | X, Y] can be approximated by a mixture of multivariate Gaussian densities, whose mean and variance can be easily calculated as follows.

The distribution of ζ is multivariate normal conditional on φ, as in (A.13) and (A.15), and therefore the posterior density of ζ can be approximated by a mixture of multivariate normal densities. Let ζ̃^(i) and Σ̃^(i) be the posterior mean and covariance of ζ conditional on the i-th grid point φ(z_{j₁,j₂}) (with j₁, j₂ collapsed into a single index set and L = (J₁⁺ + J₁⁻ + 1) × (J₂⁺ + J₂⁻ + 1)). We simplify (A.17) into

    [ζ | X, Y] ≈ Σ_{i=1}^L w_i [ζ | φ^(i), X, Y] = Σ_{i=1}^L w_i Ψ(·; ζ̃^(i), Σ̃^(i)),

where Ψ represents the probability density function of the multivariate normal distribution. For our application, we are only interested in finding the posterior mean and variance of ζ. As in Frühwirth-Schnatter (2006), the posterior mean equals

    ζ̃ = Σ_{i=1}^L w_i ζ̃^(i),    (A.18)

and the posterior variance of the k-th element of ζ is

    σ̃²_k = Σ_{i=1}^L w_i [Σ̃^(i)(k, k) + (ζ̃^(i)(k) − ζ̃(k))²].    (A.19)
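The mixture moments (A.18)–(A.19) reduce to a couple of weighted sums. A minimal sketch (function name is ours):

```python
import numpy as np

def mixture_moments(w, means, covs):
    """Posterior mean and per-coordinate variance of a Gaussian mixture,
    following (A.18)-(A.19): E = sum_i w_i m_i, Var_k = sum_i w_i [S_i(k,k) + (m_i(k) - E(k))^2]."""
    w = np.asarray(w, dtype=float)
    means = np.asarray(means, dtype=float)          # shape (L, dim)
    covs = np.asarray(covs, dtype=float)            # shape (L, dim, dim)
    mean = w @ means                                # (A.18)
    diag = np.diagonal(covs, axis1=1, axis2=2)      # component variances, shape (L, dim)
    var = w @ (diag + (means - mean) ** 2)          # (A.19)
    return mean, var
```

This is just the law of total variance applied component-wise: within-component variance plus the spread of the component means around the overall mean.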
Appendix B

Some Mathematical Details for CHBB

We provide the proofs and derivations for some equations in Section 5.2.2.

B.1 |C| = |C_0||C_{1,1}||C_{1,2}|

For a Brownian Bridge at times T = {0, 1, 2, ..., T} with variance parameter 1, a decomposition of the joint distribution [y_{1:T−1} | y_0, y_T] gives

    [y_{1:T−1} | y_0, y_T] = [y_s | y_0, y_T][y_{1:s−1} | y_0, y_s, y_T][y_{s+1:T−1} | y_0, y_s, y_T]
                           = [y_s | y_0, y_T][y_{1:s−1} | y_0, y_s][y_{s+1:T−1} | y_s, y_T].    (B.1)

We use the conditional independence property of the BB to simplify the first line into the second. Both sides of (B.1) are normal densities:

    [y_{1:T−1} | y_0, y_T] = (2π)^{−(T−1)/2} |C|^{−1/2} exp{−(1/2)(y_{1:T−1} − f)^T C^{−1}(y_{1:T−1} − f)},
    [y_s | y_0, y_T] = (2π)^{−1/2} |C_0|^{−1/2} exp{−(1/2)(y_s − f_0)^T C_0^{−1}(y_s − f_0)},
    [y_{1:s−1} | y_0, y_s] = (2π)^{−(s−1)/2} |C_{1,1}|^{−1/2} exp{−(1/2)(y_{1:s−1} − f_{1,1})^T C_{1,1}^{−1}(y_{1:s−1} − f_{1,1})},
    [y_{s+1:T−1} | y_s, y_T] = (2π)^{−(T−s−1)/2} |C_{1,2}|^{−1/2} exp{−(1/2)(y_{s+1:T−1} − f_{1,2})^T C_{1,2}^{−1}(y_{s+1:T−1} − f_{1,2})},

where f, f_0, f_{1,1}, f_{1,2} are the mean functions of the Brownian Bridge as in (4.3) and all other notation is the same as in Section 5.2.2.

Equation (B.1) holds for all y. Let y_{0:T} = 0, which leads to f = 0, f_0 = 0, f_{1,1} = 0, f_{1,2} = 0; the exponential factors in the above normal densities are then all 1. Plugging them back into (B.1), we have

    (2π)^{−(T−1)/2} |C|^{−1/2} = (2π)^{−1/2} |C_0|^{−1/2} (2π)^{−(s−1)/2} |C_{1,1}|^{−1/2} (2π)^{−(T−s−1)/2} |C_{1,2}|^{−1/2}
      ⇒ |C|^{−1/2} = |C_0|^{−1/2} |C_{1,1}|^{−1/2} |C_{1,2}|^{−1/2}
      ⇒ |C| = |C_0||C_{1,1}||C_{1,2}|.

B.2 Short BBs are more likely to have a small variance estimate

As in (5.12), the MLE of the Brownian Bridge variance parameter is calculated as

    σ̂² = (1/n)(y − f)^T C^{−1}(y − f),

where n denotes the number of free observations, i.e., n = T − 1 for BB(y_0, y_T, 0, T, σ²_0). The notation σ² can denote any of ς², σ²_0, σ²_{1,1}, σ²_{1,2}.

Without loss of generality, assume the true value σ² = 1. By the properties of the multivariate normal distribution, (y − f)^T C^{−1}(y − f) follows χ²_n and therefore σ̂² follows a Gamma distribution with shape n/2 and scale 2/n. With this exact distribution, we can calculate the probability that σ̂² < z, where z is a small number, for different n.
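This tail probability is a one-line computation. A quick sketch (using scipy's Gamma distribution; function name is ours), with a check against the equivalent χ² formulation P(σ̂² < z) = P(χ²_n < nz):

```python
import numpy as np
from scipy.stats import gamma, chi2

def prob_small_sigma2(n, z):
    """P(sigma_hat^2 < z) when sigma_hat^2 ~ Gamma(shape n/2, scale 2/n), true sigma^2 = 1."""
    return gamma.cdf(z, a=n / 2, scale=2 / n)
```

The probability falls off rapidly with n, which is the point of Figure B.1: very short bridges produce deceptively small variance estimates fairly often.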
These probabilities are plotted in Figure B.1, which clearly shows that the probability of a small σ̂² decreases as n increases.

B.3 Derivation of the KL divergence for CHBB

The KL divergence between multivariate normals p = N(μ_0, Σ_0) and q = N(μ_1, Σ_1) is

    KL(p||q) = (1/2)(tr(Σ_1^{−1} Σ_0) + (μ_1 − μ_0)^T Σ_1^{−1}(μ_1 − μ_0) − k − log(|Σ_0|/|Σ_1|)).

[Figure B.1: log(P(σ̂² < z)) for different n and z = 0.001, 0.01, 0.1.]

For p = BB(y_0, y_T, 0, T, ς²) and q = BB(y_0, y_T, 0, T, σ²), the KL divergence simplifies into

    KL(p||q) = (T − 1) ς²/σ² − (T − 1) − (T − 1) log(ς²/σ²).    (B.2)

Notice that two BBs with different variance parameters still have the same mean when they are defined on the same time points; therefore KL(p||q) does not involve the mean.

For the single-knot CHBB in Section 5.2.2, we introduce the following for notational simplicity:

    p_0 = [y_s | y_0, y_T, ς²],          q_0 = [y_s | y_0, y_T, σ²_0],
    p_1 = [y_{1:s−1} | y_0, y_s, ς²],     q_1 = [y_{1:s−1} | y_0, y_s, σ²_{1,1}],
    p_2 = [y_{s+1:T−1} | y_s, y_T, ς²],   q_2 = [y_{s+1:T−1} | y_s, y_T, σ²_{1,2}].

Then

    KL(M^(0) || M^(1))
      = ∫ dy_{1:T−1} [y_{1:T−1} | y_0, y_T, ς²] log([y_{1:T−1} | y_0, y_T, ς²] / [y_{1:T−1} | y_0, y_T, σ²_0, σ²_{1,1}, σ²_{1,2}])
      = ∫∫∫ dy_s dy_{1:s−1} dy_{s+1:T−1} p_0 p_1 p_2 log(p_0 p_1 p_2 / (q_0 q_1 q_2))
      = ∫∫∫ dy_s dy_{1:s−1} dy_{s+1:T−1} p_0 p_1 p_2 log(p_0/q_0)
        + ∫∫∫ dy_s dy_{1:s−1} dy_{s+1:T−1} p_0 p_1 p_2 log(p_1/q_1)
        + ∫∫∫ dy_s dy_{1:s−1} dy_{s+1:T−1} p_0 p_1 p_2 log(p_2/q_2)
      = ∫ dy_s p_0 log(p_0/q_0) + ∫∫ dy_s dy_{1:s−1} p_0 p_1 log(p_1/q_1) + ∫∫ dy_s dy_{s+1:T−1} p_0 p_2 log(p_2/q_2)
      = KL(p_0||q_0) + ∫ dy_s p_0 KL(p_1||q_1) + ∫ dy_s p_0 KL(p_2||q_2).    (B.3)

The last two terms in (B.3) depend only on the variance parameters and not on the mean, and thus do not depend on y_s. This enables us to simplify (B.3) into

    KL(M^(0) || M^(1)) = KL(p_0||q_0) + KL(p_1||q_1) + KL(p_2||q_2).

Plugging (B.2) into the three terms,

    KL(M^(0) || M^(1)) = g(ς²/σ²_0) + (s − 1) g(ς²/σ²_{1,1}) + (T − s − 1) g(ς²/σ²_{1,2}),

where

    g(x) = x − log(x) − 1.
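The final expression is cheap to evaluate. A minimal sketch (function names are ours, not from the thesis):

```python
import numpy as np

def g(x):
    """g(x) = x - log(x) - 1; nonnegative, with g(x) = 0 iff x = 1."""
    return x - np.log(x) - 1.0

def kl_chbb(var_null, var0, var11, var12, T, s):
    """KL(M0 || M1) for the single-knot CHBB: the three segments contribute
    g(.) for the knot, (s-1) g(.) on the left, and (T-s-1) g(.) on the right."""
    r = var_null  # the common variance parameter (sigma-squared of the no-knot model M0)
    return g(r / var0) + (s - 1) * g(r / var11) + (T - s - 1) * g(r / var12)
```

Since g is zero exactly when its argument is 1, the divergence vanishes precisely when all three segment variances of M^(1) equal the single variance of M^(0), as the derivation requires.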

