Temporal Adjusted Prediction for Predicting IndianReserve Populations in CanadabyDerek ChoB.Sc., The University of British Columbia, 2014A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Statistics)The University of British Columbia(Vancouver)April 2018c© Derek Cho, 2018AbstractIn order to predict the population of Indian reserves in Canada for the 2016 Cen-sus, we can construct a suitable model using data from the Indian Register and pastcensuses. Linear mixed effects models are a popular method for predicting valuesof responses on longitudinal data. However, linear mixed effects models requirerepeated measures in order to fit a model. Alternative methods such as linear re-gression only require data from a single time point in order to fit a model, but itdoes not directly account for within-individual correlation when predicting. Sincewe are predicting the responses of the same set of individuals, we can expect re-sponses at the next time point to be strongly correlated with past responses for anindividual.We introduce a new method of prediction, temporal adjusted prediction (TAP),that addresses the issue of within-individual correlation in predictions and only re-quires data from a single time point to estimate model parameters. Predictions arebased on the last recorded response of an individual and adjusted based on changesto the values of their covariates and estimated regression coefficients that relate theresponse and the covariates. Predictions are made using a random intercept modelrather than a linear regression model. It is shown that if the random intercept ac-counts for a larger proportion of the random variation in the data than the randomerror term, then temporal adjusted prediction achieves a lower mean squared pre-diction error than linear regression.TAP performs better than linear regression when predicting on the same set ofindividuals at different time points. It also shows similar prediction performancecompared to linear mixed effects models estimated with maximum likelihood esti-mation despite only requiring data from one time point in order to fit a model.iiLay SummaryThis thesis introduces a new method of prediction for populations of Indian re-serves in Canada in 2016. Temporal adjusted prediction uses the population ofa reserve from the previous census in 2011 as a starting point for prediction andextrapolates based on the changes in population from an auxiliary data source be-tween the census dates in 2011 and 2016. In this case, the on-reserve populationsof a band from the Indian Register are used as the auxiliary data source since it ishighly correlated to the population of a reserve according to the census. One ofthe main benefits of temporal adjusted prediction is that it only requires data froma single time point to fit a model, while other similar methods require repeatedmeasures.iiiPrefaceThis dissertation is based on unpublished work the author, Derek Cho, performedduring a co-op work term at Statistics Canada. The temporal adjusted predictionmethod was developed by the author and used on data from the 2006, 2011 and2016 Censuses and Indian Register at Statistics Canada under the supervision ofScott McLeish and Rose Evra. Further work on the theoretical properties in Chap-ter 2, and simulations in Chapter 3 were done at UBC under the supervision ofProfessor Gabriela Cohen-Freue.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22 Temporal Adjusted Prediction . . . . . . . . . . . . . . . . . . . . . 52.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.2 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . 62.3 Linear Extrapolation . . . . . . . . . . . . . . . . . . . . . . . . 92.4 Temporal Adjusted Prediction . . . . . . . . . . . . . . . . . . . 122.4.1 Random Intercept . . . . . . . . . . . . . . . . . . . . . . 152.4.2 Comparing the MSPE of Linear Regression and TemporalAdjusted Prediction . . . . . . . . . . . . . . . . . . . . . 162.4.3 Robust Temporal Adjusted Prediction . . . . . . . . . . . 20v3 Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 233.1 Data Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 243.2 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . . 253.2.1 Default Settings . . . . . . . . . . . . . . . . . . . . . . . 253.2.2 Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . 263.2.3 Additional Covariates . . . . . . . . . . . . . . . . . . . 323.2.4 Additional Time Point . . . . . . . . . . . . . . . . . . . 354 Case Study: Prediction of Populations of Indians Reserves in Canada 414.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424.1.1 The Indian Register . . . . . . . . . . . . . . . . . . . . . 424.1.2 Differences Between the Indian Register and Census Pop-ulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 434.2 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . 464.2.1 Out-of-Scope Bands . . . . . . . . . . . . . . . . . . . . 474.2.2 Multi-Band Reserves and Multi-Reserve Bands . . . . . . 484.3 Prediction of Indian Reserve Populations in 2016 . . . . . . . . . 514.3.1 Exploratory Analysis . . . . . . . . . . . . . . . . . . . . 514.3.2 Prediction Results . . . . . . . . . . . . . . . . . . . . . 545 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61viList of TablesTable 2.1 Mean Squared Prediction Error of Temporal Adjusted Predic-tion and Linear Regression when varying Standard Deviationof Intercept . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19Table 3.1 Comparison of Median MSPE under Default Settings . . . . . 25Table 3.2 Comparison of Median MSPE with 5% Outliers . . . . . . . . 29Table 3.3 Comparison of Median MSPE with 10% Outliers . . . . . . . 31Table 3.4 Comparison of Median MSPE with Additional Covariates . . . 34Table 3.5 Comparison of Median MSPE with Data with an AdditionalTime Point under three Scenarios . . . . . . . . . . . . . . . . 37Table 4.1 Levels of Census Metropolitan Influenced Zone . . . . . . . . 47Table 4.2 List of Multi-Band Reserves and Multi-Band Multi-Reserve Clus-ters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50Table 4.3 Cross Tabulation of Province and Weighted Census MIZ . . . . 52Table 4.4 Estimated Robust Regression Coefficients for 2006 and 2011Reserve Populations . . . . . . . . . . . . . . . . . . . . . . . 53Table 4.5 Mean Square and Median Absolute Prediction Errors of 2016Indian Reserve Populations . . . . . . . . . . . . . . . . . . . 57viiList of FiguresFigure 2.1 Prediction of Selected Observation at Second Time Point viaLinear Regression . . . . . . . . . . . . . . . . . . . . . . . . 8Figure 2.2 Prediction of Selected Observation at Second Time Point viaExtrapolation . . . . . . . . . . . . . . . . . . . . . . . . . . 10Figure 2.3 Prediction of Selected Observation at Second Time Point viaTemporal Adjusted Prediction . . . . . . . . . . . . . . . . . 14Figure 2.4 Mean Square Prediction Error of Temporal Adjusted Data Us-ing Temporal Adjusted Prediction and Linear Regression . . . 19Figure 2.5 Potential Outliers in the Indian Reserves Population Data . . . 21Figure 3.1 Comparison of MSPE under Default Settings . . . . . . . . . 26Figure 3.2 Comparison of MSPE under Default Settings (Zoomed In) . . 27Figure 3.3 Comparison of Mean Squared Prediction Error with 5% Outliers 28Figure 3.4 Comparison of Mean Squared Prediction Error with 5% Out-liers (Zoomed In) . . . . . . . . . . . . . . . . . . . . . . . . 29Figure 3.5 Comparison of Mean Squared Prediction Error with 10% Outliers 31Figure 3.6 Comparison of Mean Squared Prediction Error with 10% Out-liers (Zoomed In) . . . . . . . . . . . . . . . . . . . . . . . . 32Figure 3.7 Comparison of Mean Squared Prediction Error with an Addi-tional Variable . . . . . . . . . . . . . . . . . . . . . . . . . 33Figure 3.8 Comparison of Mean Squared Prediction Error with an Addi-tional Variable (Zoomed In) . . . . . . . . . . . . . . . . . . 34Figure 3.9 Comparison of Mean Squared Prediction Error with Data withan Additional Time Point: Scenario (1) . . . . . . . . . . . . 37viiiFigure 3.10 Comparison of Mean Squared Prediction Error with Data withan Additional Time Point: Scenario (2) . . . . . . . . . . . . 38Figure 3.11 Comparison of Mean Squared Prediction Error with Data withan Additional Time Point: Scenario (3) . . . . . . . . . . . . 39Figure 4.1 2011 Census Reserve Populations vs. 2011 IR On-ReservePopulations . . . . . . . . . . . . . . . . . . . . . . . . . . . 44Figure 4.2 Census Reserve Populations vs. IR On-Reserve Populations in2006 and 2011 . . . . . . . . . . . . . . . . . . . . . . . . . 53Figure 4.3 Comparison of Prediction Error of 2016 Reserve Populations . 55Figure 4.4 Comparison of Prediction Error of 2016 Reserve Populations(Zoomed In) . . . . . . . . . . . . . . . . . . . . . . . . . . 56ixAcknowledgmentsThere are several people that I would like to thank for helping me in my studies.In each step of my educational journey, I have relied on the help of others to getme through. Without their help, I would certainly not have come as far as I have.Firstly, I want to thank my family for their constant support of my educationalendeavours. Without their love and encouragement, none of this would be possible.Next, I would like to thank my supervisor, Professor Gabriela Cohen-Freue,for her guidance, patience and mentorship. From the time of my undergraduatestudies, she has provided me several valuable opportunities for which I am trulygrateful. I would also like to thank my second reader Professor Matias Salibian-Barrera.Thanks to my past and current officemates Creagh, Jonathan, Nikolas, Saif,Sohrab and Yijun for your friendship. I will miss all of the conversations that wehad and the trips to the village for dinner together. I also want to thank Danny,Qiong, Derek and the rest of my classmates for helping make grad school such anenjoyable experience. Thanks to Andrea, Ali and Peggy for all of your help andpatience when answering my questions.A special thanks to my colleagues at Statistics Canada where I did my co-opwork term. I would like to thank Scott McLeish, Rose Evra and Tristan Cayn forbeing an awesome team of supervisors that I truly enjoyed working with. I willmiss our end of the work day conversations. I also want to thank Susan Wallace,Sylvie Godin, Jarod Dobson and the rest of the Social and Aboriginal StatisticsDivision for their help with my projects while I was at Statistics Canada. Thanks toEric McGregor from INAC for being helpful and providing me with the necessarydata from the Indian Register for this thesis. I would also like to thank Pamela Bestxfor her kindness and willingness to help when I was in a tough spot.Lastly, thanks to my friends Anthony, Gilbert, Louie, Raymond and Simonwho have been with me since elementary school. I would also like to thank Lydiafor her companionship and constant support of my studies. Thanks for coming toOttawa to visit me while I was on co-op.xiChapter 1IntroductionLinear regression models are a popular method for estimating the effects that ex-planatory variables have on a response. This relationship can then be used to pre-dict future responses given new observations. In particular, a linear model canbe fit to past data and then used to predict future responses given new values ofthe covariates. In many applications, if you are able to record new values of anexplanatory variable, responses at the same time point will likely be recorded aswell. However, this is not true in every case. For example, if you take data fromtwo related datasets, one of which that has no recorded data for the current timepoint, you can predict the missing values from the other dataset using the estimatedrelationship between these datasets based on the past. One such application is theprediction of Indian reserve populations in Canada in 2016 using data from theIndian Register (IR) and the Census of Population. The goal is to use past datafrom the 2011 Indian Register and the 2011 Census to construct a model and thenuse the data from the 2016 Indian Register to make predictions of the populationsof Indian reserves in Canada in 2016. Counts from census collection that are no-ticeably different from model predictions could warrant further investigation if thereason for the disparity isn’t immediately clear. This motivating example will bereferred to throughout this thesis.The Indian Register is an administrative database that contains information onall Registered Indians in Canada including band membership and whether or notthey live on a reserve[5]. The population of most Indian reserves in Canada are1comprised mostly of Registered Indians[20] [16]. Thus, the number of individualsbelonging to a First Nation or band, also referred to as band members, living ona reserve associated with that First Nation according to the IR should be approx-imately equal to the actual population of the reserve as measured by the censusin most cases. The IR is useful for the prediction of reserve populations becauseit contains up to date information on the population changes of reserves since theprevious census. The census and the IR populations of reserves for a given censusyear are highly correlated since a person listed as living on-reserve according tothe register is likely to appear on the census as well. For most reserves, the cen-sus population is very similar to the IR on-reserve population at the time of eachcensus, but for others, there are substantial differences between these two popula-tions. Some reserves have people that are listed as living on-reserve according tothe IR but do not appear on the census. Other reserves have people that appear onthe census, but are not listed as band members in the IR. Statistical methods arenecessary in order to model the relationship between the census and IR populationand to produce accurate predictions.In Chapter 2, we discuss two simple methods, linear regression and linear ex-trapolation which can be estimated using data from a single time point and usedto predict future responses. Chapter 2 also introduces temporal adjusted prediction(TAP) and its robust variant which are able to predict using a model also fit withdata from single time point while addressing the shortcomings of linear regressionand linear extrapolation. Chapter 3 shows the results of a simulation study compar-ing temporal adjusted prediction to other models for predicting longitudinal datain various situations. In Chapter 4, we examine the case study in our motivat-ing example by comparing the prediction accuracy of temporal adjusted predictionagainst that of selected methods for the 2016 Census population using data from theprevious censuses and the IR. Lastly, Chapter 5 discusses conclusions and futuredirections for the work presented in this thesis.1.1 Related WorkIn the current literature, linear mixed effects (LME) models are a popular methodfor fitting data with repeated measures. LME models are suitable for longitudinal2data because they allow the modelling of both between and within-individual vari-ation [10]. The effects of some variables may be common across individuals (fixedeffects), while others may be specific to an individual (random effects). However,in order to estimate the random effects of each individual, multiple measurementsor repeated measures for each individual must be available. If Indian reserve pop-ulations are measured at multiple time points, we can use the census and IR data tofit a LME model.However, data from the IR is only available between 2006 and 2016. Due tothe limited amount of time points for which complete data is available, we wantto use a model that can be fit using complete data from only one time point. Inthis situation, we only have complete data for the 2006 and 2011 census years.That way, we can fit a model using 2006 Census and IR data to predict Indianreserve populations for 2011. Since we have the true populations for 2011 fromthe 2011 Census, we can use this as a validation set to test how well our modelpredicts. If we fit an LME model using both 2006 and 2011 data, then we areleft without a validation set to test how well the model predicts since there is noresponse for 2016. Because of this, we are looking to use a method which can befit with complete data from at least one time point rather than at least two.Although a linear regression model estimated by least squares using data froma single time point can be used to predict future responses, it does not directlytake into account the fact that the responses of specific individuals are correlatedwith their past responses. In the context of the Indian reserve population data, theindividuals in the data are reserves and the covariates are the population of thereserves according to the IR in each census year and geographic variables such asthe province in which the reserve is located. It makes sense that the population of areserve in 2016 will be correlated with its population in 2011. This issue is furtherdiscussed in Section 2.2.Another issue that this thesis aims to address is the effect that outliers haveon parameter estimates. It is well known that even a small amount of outliersin the data can have significant effects on parameter estimates[21]. In the Indianreserve populations dataset, there are a number of observations which are consid-ered outliers due to their distance from the bulk of the data. Thus, a robust pro-cedure is needed to estimate parameters without being affected by contaminated3data. Several robust regression estimation methods exist such as M-estimation[2],MM-estimation[21] and Least Trimmed Squares[12] which also only require datafrom a single time point. However, all of these estimation methods do not accountfor the within-individual correlation of responses.In this thesis, we propose a new method of estimating the parameters of arandom intercept model using only the 2011 Census and IR data. The proposedestimated model can accurately predict the 2016 Census populations of reservesusing the 2016 IR data. This parameter estimation method only requires data froma single time point to fit a random intercept model, it is robust to outliers and ittakes into account the previous observed response for individuals when predicting.4Chapter 2Temporal Adjusted PredictionIn this chapter, we propose a new method, temporal adjusted prediction (TAP), forthe prediction of responses where data is complete at the first time point, but miss-ing a response at the second time point such as census and Indian Register datasetdiscussed in the Introduction. The notation used in this thesis is introduced in Sec-tion 2.1. Section 2.2 provides details on the motivation behind TAP and explainswhy linear regression is not appropriate for prediction in datasets where observa-tions on the same set of individuals are taken at different time points. Section 2.3explains how to predict via linear extrapolation. Section 2.4 covers the details ofthe temporal adjusted prediction method and introduces a robust variant which canfit the data well even when outliers are present.2.1 NotationFor this thesis, Yjk,k ∈ [1, · · · ,n] represents the response at the jth time point forthe kth individual and Yˆjk is the predicted value of the response at the jth time pointfor the kth individual. In the context of the reserve populations dataset, the term“individual” refers to a specific reserve rather than a person living on a reserve.The value of the ith covariate/auxiliary variable at the jth time point for the kthindividual is given by Xi jk, i ∈ [1, · · · , p],k ∈ [1, · · · ,n]. For example, X123 are thevalues of first or primary covariate at the second time point for the third individualin the dataset. In some instances, vector and matrix notation is used as well. The5responses at the jth time point for n individuals is denoted Y j = [Yj1, · · · ,Yjn]Tand X j is a n× p matrix of the covariates/auxiliary variables at the jth time pointfor all individuals. The vector notation also applies to the predicted responsesYˆ j = [Yˆj1, · · · ,Yˆjn]T .In both the linear regression and random intercept models, β = [β0,β1, · · · ,βp]is a (p+ 1)× 1 vector of population parameters that relate Yjk and Xi jk while βˆare the estimated values of the parameters β . In the random intercept model, β0represents the fixed component of the intercept while b0k represents the randomcomponent of the intercept for the kth individual. In the linear regression model,ε jk represents random error at time j for the kth individual which we assume tofollow a Normal(0,σ2) distribution. This differs from the random intercept modelwhere the random error term will be denoted as e jk which we assume to follow aNormal(0,φ 2) distribution.2.2 Linear RegressionAs mentioned in the Introduction, the motivation for temporal adjusted predictioncomes from the problem of predicting the populations of Indian reserves in Canadausing an auxiliary data source, the Indian Register. The on-reserve populationaccording to the Indian Register is strongly correlated with the reserve populationfrom the census for a given census year. The initial approach was to fit a linearregression model using least squares on past population data gathered from the2011 IR along with population and geographic data from the 2011 Census, andthen use the fitted model to predict the census population of a reserve in 2016based on data from the IR in 2016.The linear regression model using data at time point j is defined asYjk = β0+p∑i=1βiXi jk+ ε jk,k ∈ [1, · · · ,n],where Yjk is the response, Xi jk are the values of covariates, ε jk is the random errorfor the kth individual at the jth time point. The regression coefficients βi modelthe effects of each covariate on the response and are assumed to be constant overtime. In the context of the Indian reserve populations data, the response variable6Yjk, j ∈ [1,2,3],k ∈ [1, · · · ,n] corresponds to the population of the kth reserve asmeasured at three time points, the census in 2006, 2011, and 2016. The valueof the ith covariate for the kth reserve at the jth time point is given by Xi jk, j ∈[1,2,3]. Covariates include the population of a reserve according to the IR for agiven census year and geographical variables such as the province that a reserve islocated in.We fit a linear model using data from the most recent time point with com-plete data, both response and covariates. We estimate the set of coefficients thatresults in the minimum sum of squared residuals where the residuals are r j =Yˆ j −Y j. The estimated coefficients via least squares estimation are given byβˆ j = (XTj X j)−1X Tj Y j. In this thesis, unless otherwise specified, the parametersof linear regression models are estimated using least squares.When predicting with linear regression, we assume that the underlying modelbetween the response and explanatory variables remains constant over time, other-wise it does not make sense to predict using the estimated coefficients βˆ j from onetime point on data from another time point. In mathematical terms, we assume thatβˆ = βˆ 1 = βˆ 2 and so on, which is why we drop the subscript j. Predicted values ofthe response at time point j are given by Yˆ j = X Tj βˆ . Although the linear regressionfit is able to capture the linear pattern between the census and IR populations well,the resulting predictions of the reserve populations are poor. Even though the IRon-reserve and census populations are not perfectly correlated, in most cases it isreasonable to expect a proportional increase in the census population if we observean increase in the IR on-reserve population. In several cases, despite seeing anincrease in population on the IR between 2011 and 2016, the model predicted thata reserve would have a lower population in the 2016 Census compared to the 2011Census.This is counterintuitive since we expect the growth in a given reserve popula-tion to be reflected by both sources. However, it does not make sense to predict alower population for a reserve, given that we know past census population of thereserve and that we observe a growth in population according to the IR. If we donot have access to the previous census populations for the reserves that we are try-ing to predict on, then it makes sense to predict using linear regression. This issueis illustrated in the Figure 2.1.7Figure 2.1: Prediction of Selected Observation at Second Time Point via Lin-ear RegressionIn Figure 2.1, the diagonal blue line represents the estimated linear regressionline based on data simulated at the first time point. Predictions at a given valueX1 jk correspond to the height of the blue line on the Y-axis for that X1 jk value.For example, the dashed line on the left represents the X110 value at the first timepoint for the individual marked by a black triangle and the dotted line to the rightrepresents the X120 value at the second time point for this individual. The differencebetween the two represents the growth in X1 between the second and first timepoints for this individual. To make a prediction for the point labelled with a blacktriangle, we use the new value of X120 from the second time point and trace thepoint that intersects the regression line. The predicted value Yˆ20 for this individualis marked by the height of the blue triangle. From this plot, we can see that eventhough the value of X1 j0 for this individual grows over time and X1 and Y arepositively associated across reserves, the predicted value of Yˆ20 is less than theoriginal value ofY10. The following sections propose methods to address this issue.82.3 Linear ExtrapolationA naive but intuitive solution to this problem is to simply extrapolate linearly fromdata in the first time point to obtain predictions for a second time point by usinga primary covariate. In the example of Indian reserve populations in Canada, wecan produce a simple prediction of the 2016 Census population by calculating thecensus to IR reserve population ratio of a reserve based on data from 2011 and thenmultiplying by the 2016 IR on-reserve population. This process is repeated for eachindividual reserve. For this section, we refer to the IR on-reserve population as theprimary covariate and other explanatory variables as secondary covariates. Withlinear extrapolation we assume that the ratio of the response Yjk and the primarycovariate X1 jk is constant over time for each individual. For the kth individual, weassumeY1kX11k=Y2kX12k.Therefore, the predicted value of the response variable for an individual at thesecond time point using linear extrapolation is given by the equationYˆ2k =Y1kX11k∗X12k.The main benefits of the extrapolation are that it is intuitive and easy to predictwith. It uses past data points as a starting point for prediction and extrapolatesbased on an individual reserve’s census to IR population ratio. A ratio greater thanone indicates that there are people living on-reserve that are not listed in the IR.One possible explanation for this is that there is a non-band member living on-reserve and therefore would not be listed in the IR, but would still be counted onthe census. Other possible explanations for this discrepancy will be discussed inChapter 4. This ratio of overall population to registered band members living on-reserve is preserved for any prediction. With linear extrapolation, the predictionline for an individual passes through the origin and the previous observed point(X11k,Y1k). Therefore if the relationship between X11k and Y1k is positive, then anincrease in X1 jk over time results in an increase in the predicted value of Yˆjk.9Figure 2.2: Prediction of Selected Observation at Second Time Point via Ex-trapolationIn Figure 2.2, the red line represents the predicted value of Y for values of X12kfrom linear extrapolation for the individual marked by the black triangle. The slopeis calculated by taking the ratio of Y10 to X110 for the selected individual. As withFigure 2.1, the dashed line represents the value of X110 and the dotted line to theright represents the value of X120 for this individual. The predicted value via linearextrapolation for this individual is given by the height of the red triangle. Theheight of the blue triangle represents the predictions obtained via linear regressionas described in Section 2.2. Contrary to prediction with linear regression, whenX1 jk increases between time points for an individual, the predicted value for Yjkmust increase as well and vice versa.However, predicting using linear extrapolation has some drawbacks as well.In order to extrapolate and predict accurately, there needs to be a covariate that ishighly correlated with the response variable. Although this is true in the contextof the Indian reserve populations data, other datasets may not have a predictor thatis highly correlated with the response. As the correlation between the predictorand the response variable weakens, prediction accuracy from linear extrapolation10deteriorates. Further, linear extrapolation only allows the use of one covariate eventhough several may be relevant for the prediction of the response variable. In ad-dition, linear extrapolation cannot make predictions if the observed values of X11kare zero for an individual since it will lead to a division by zero error.In some sense, linear extrapolation can be thought of as a mixed effects modelwith a fixed intercept and random slope. Each individual has their own randomslope, which is the ratio of Y1k to X11k and a fixed intercept of zero. However, beingforced to have a fixed intercept of zero may not be desired in some situations. Forthe Indian reserves population dataset, this constraint is fine since an IR on-reservepopulation of zero will also mean a reserve population of zero according to thecensus for most reserves.Linear extrapolation does not predict well for individuals that have a large rela-tive increase or decrease in their primary covariate between time points. Althoughthe large increase in the primary covariate is not an issue by itself, if the ratio be-tween the response and covariate in the first time point is not accurate, then thedifference between the true response and the prediction is amplified. In the contextof our case with the reserve population data, suppose a reserve had a populationof 20 people according to the 2011 Census and 5 people according to the 2011 IR.The 15 people included in the census count are non-band members that also liveon this reserve. Therefore, this reserve has a census to IR population ratio of 4.If this reserve were to experience a migration of 95 new registered band membersmoving onto this reserve between 2011 and 2016, then the predicted number ofpeople living on this reserve according to the census would be 4× (95+5) = 400.However, in reality the true population is the 20 people who lived on-reserve in2011 along with the 95 registered band members who have moved onto the reservesince then. Therefore, the prediction is off by 400− (20+ 95) = 285 people be-cause the census to IR reserve population ratio is not constant across time for thisreserve.Cases like this are most common in reserves with small populations. If the IRpopulation is small, then the census to IR population ratio is more likely to be largesince the denominator is small. These cases are also more likely to be the onesthat see a large relative increase in population between censuses. An increase of20 people doesn’t affect a reserve with 1000 people on it, but the same increase in11a reserve with 10 people there would be extreme. Violations to the assumption ofa constant census to IR on-reserve population ratio could have negative effects onprediction results, especially for reserves with a low population according to theIR. Although linear extrapolation addresses the some of the drawbacks of linearregression, the combination of issues discussed above suggest that extrapolationmay not yield accurate predictions either.2.4 Temporal Adjusted PredictionThis thesis proposes a new method for predicting values of a response variablegiven auxiliary data measured at the same time point and past data that is used toconstruct a model that relates the auxiliary data to the response. Temporal adjustedprediction combines the strengths of both linear regression and linear extrapolationwhile attempting to address the drawbacks of each.The relationship between the response variable at the first time point Y1k andits covariates Xi1k can be modelled by the regression equationY1k = β0+p∑i=1βiXi1k+ ε1k,where we assume ε1k ∼ Normal(0,σ2).Our goal is to predict the values of Y2k given new data Xi2k. Therefore, weassume that the relationship between the response and the covariates at the secondtime point follows a similar modelY2k = β0+p∑i=1βiXi2k+ ε2k,where we assume ε2k ∼ Normal(0,σ2).Since we assume that the underlying relationship between the response variableYjk and the covariates Xi jk do not change over time, i.e. the regression coefficientsβ are the same in both models, we can combine these equations resulting inY2k−Y1k = (β0−β0)+β1(X12k−X11k)+β1(X22k−X21k)+ · · ·+βp(Xp2k−Xp1k)+(ε2k− ε1k).12Therefore, the predicted value for Y2k is given byYˆ2k = Y1k+p∑iβˆi(Xi2k−Xi1k). (2.1)The coefficients βˆ are estimated by least squares on the data from the first timepoint. If we assume that the relationship between these variables is constant overtime then we can apply these estimated coefficients to predict Y2k using Xi2k. Tem-poral adjusted prediction is similar to prediction with linear regression in the sensethat they share the same set of estimated coefficients. However, the difference be-tween the two is that for linear regression, predictions share a common intercept β0whereas with TAP, the intercept term is random. For TAP, the estimated regressionline will interpolate the previous observed response for a given individual.If complete data is available for more than one time point, then least squaresonly needs to be fit on the most recent complete data. For example, if we wantto predict Indian reserve populations in 2016 and have access to IR on-reservepopulations for 2016, 2011, 2006 and census Indian reserve populations for 2011and 2006, then we only need to estimate the regression coefficients with 2011 dataand then use the difference between 2016 and 2011 IR counts along with the 2011Census count for prediction. Data from 2006 is not used because we have completedata for 2011.Figure 2.3 shows the prediction mechanisms of all three methods described inthis chapter. As with Figure 2.2, the black triangle represents a selected individualthat we are interested in making predictions for. The blue and red lines show theestimated linear regression and linear extrapolation lines, respectively. The greenline shows the predicted values for this individual using temporal adjusted predic-tion. Its estimated regression line runs parallel to the linear regression line sinceboth models share the same estimated coefficients, but with an adjustment to theintercept so that it passes through the previous observed point for that individual.The height of the blue and red triangles represent the predictions for this individualat the second time point given by linear regression and extrapolation, respectively.The height of the green triangle represents the prediction given by temporal ad-justed prediction.Temporal adjusted prediction shares many of the same benefits of both methods13Figure 2.3: Prediction of Selected Observation at Second Time Point viaTemporal Adjusted Predictionand addresses their individual weaknesses. The requirement for the high correla-tion between the response and primary auxiliary variable is not as important as itis for linear extrapolation since we can include other explanatory variables whenfitting the model. These additional covariates may have an effect on the response aswell, and these effects can be considered in predictions. Predictions obtained fromthis method also suffer from less variability compared to those obtained from linearextrapolation since we calculate a common slope for all individuals rather than aslope that only pertains to a single individual. Although this may seem more rigid,fitting a model this way makes more sense since we are able to draw some inputfrom the rest of the dataset rather than just extrapolating based on data from a sin-gle individual. In other words, we are assuming that the auxiliary variable(s) havea common effect on the response rather than an individual effect on the response.TAP predictions are more flexible by allowing random intercepts compared to lin-ear regression which assumes a common intercept β0. In situations where a largeproportion of the error is inherent to specific individuals, TAP is able to accountfor this and gives better predictions than linear regression which assumes that all14of the error is random. This is further discussed in Subsection 2.4.1.2.4.1 Random InterceptData at time point j for k individuals can be modelled by the formulaYjk = β0+p∑i=1βiXi jk+ ε jk. (2.2)In the linear regression framework, the error term ε jk is assumed to follow aNormal(0,σ2) distribution. The error terms are also assumed to be independentand identically distributed.However, when we look at data over more than one time point for the sameset of individuals, some of the error terms can no longer be assumed independent.More specifically, the error term ε jk,k ∈ [1, · · · ,n] can be decomposed into twocomponents, b0k,k ∈ [1, · · · ,n], a random intercept to model the within individualvariation, and e jk,k ∈ [1, · · · ,n], the remaining random error in the model. Therandom intercept b0k is combined with the fixed component of the intercept β0 toget the intercept term for the random intercept model. For example, in the caseof prediction of populations of Indian reserves, suppose a group of individualsliving on a particular reserve are not found in the IR but are included in the censuscount. Although this difference may be considered as error when we fit a linearregression model on the data at the first time point, it can still be considered topredict the population of the reserve at the second time point since it is reasonableto expect that this group will still be living on a particular reserve during the nextcensus. To model the random effects of specific reserves, a random intercept termcan be included in the regression model. This random intercept, b0k,k ∈ [1, · · · ,n],is assumed to follow a Normal(0,τ2) distribution. The remaining portion of therandom error that is not explained by the random intercept term is denoted as e jkwhich is assumed to follow a Normal(0,φ 2) distribution. We assume the randomintercept term b0k and the random error term e jk are independent of each other.The typical random intercept model with the two sources of variation at timepoint j is given byYjk = (β0+b0k)+β1X1 jk+β2X2 jk+ · · ·+βpXp jk+ e jk.15The overall variability in the data is a combination of the random intercept andthe random error term. Therefore, if Var(b0k) = τ2 and Var(e jk) = φ 2, then thetotal variance in the random effects model is τ2+φ 2 since we assume that the twosources of variation are independent of each other. The variance of the error termin the linear regression model is equal to the combined variance of the error termsfrom the random intercept modelσ2 = τ2+φ 2.However, the variance of the random intercept cannot be directly estimated if welack data from at least two time points. TAP addresses this limitation using the re-sponse at the first time point instead of the random intercept to predict the responseat the second time point.2.4.2 Comparing the MSPE of Linear Regression and TemporalAdjusted PredictionSince we are interested in comparing the prediction performance of each method,we compare the mean squared prediction error (MSPE) of both linear regressionand TAP. The MSPE is given by MSPE = 1n ∑n(Y2k− Yˆ2k)2 where Y2k and Yˆ2k arethe observed and predicted response for the kth individual at the second time point.In this subsection, we examine the univariate case, although the results can begeneralized to include additional covariates.Since linear regression does not include a random intercept term, the responsesY1k and Y2k can be modelled byYjk = β0+β1X1 jk+ ε jk, (2.3)and used to predict the response at the second time pointYˆ2k = βˆ0+ βˆ1X12k,where βˆ0 and βˆ1 are estimated with least squares on data from the first time point.The prediction error is obtained by taking the difference between the predictedresponse and the actual response16PELR,k = Yˆ2k−Y2k = (βˆ0−β0)+(βˆ1−β1)X12k− ε2k.The expected MSPE is given by the variance of the prediction error of linear re-gression which isE[MSPELR,k] =Var(PELR,k)=Var(Yˆ2k−Y2k)=Var((βˆ0−β0)+(βˆ1−β1)X11k− ε2k)=Var(Yˆ2k)+Var(ε2k)= σ2{1n+X12k− X¯11k(n−1)s2X11k}+σ2.As n→ ∞, the first term in the MSPE goes to zero and the expected MSPE con-verges to σ2.For TAP, the predicted response at the second time point, Yˆ2k isYˆ2k = Y1k+ βˆ1(X12k−X11k),where βˆ1 is estimated via least squares on data from the first time point. Theprediction error of TAP is obtained byPETAP,k = Yˆ2k−Y2k = Y1k−Y2k+ βˆ1(X12k−X11k).The expected MSPE is given by the variance of the prediction error of TAP whichis17E[MSPETAP,k] =Var(PETAP,k)=Var(Yˆ2k−Y2k)=Var(Y1k−Y2k+ βˆ1(X12k−X11k))=Var(Y1k−Y2k)+Var(β1)∗ (X12k−X11k)2= 2φ 2+(φ 2)(n−1)s2X12k∗ (X12k−X11k)2.As n→ ∞, the second term in the expected MSPE goes to zero and the expectedMSPE converges to 2φ 2.Since the expected MSPE of TAP converges to φ 2 and the expected MSPEof linear regression converges to σ2 = φ 2 + τ2, the method that will predict moreaccurately depends on the underlying parameters φ and τ as n→∞. In cases whereall of the variation is random error, τ2 = 0, then linear regression should have anexpected MSPE that is half of the expected MSPE of TAP when n is large sincethe expected MSPETAP,k → 2φ 2 while MSPELR,k → σ2 = φ 2 + τ2 = φ 2. If thevariance of the random error term is equal to the variance of the random intercept,φ 2 = τ2, then the expected MSPE of each method converges to the same valuesince MSPETAP,k→ 2φ 2 while MSPELR,k→ σ2 = φ 2+ τ2 = 2φ 2.From this, we can conclude that if φ < τ , then E[MSPETAP,k] < E[MSPELR,k]as n→ ∞. In other words, when the variance of the random intercept is largerthan the variance from the random error term, we expect TAP to predict better thanlinear regression.Figure 2.4 and Table 2.1 compare the MSPE given by linear regression andtemporal adjusted prediction for different values of τ and fixed φ = 5. For thissimulation, datasets of n=500 observations are generated with a single covariatewith values of reponses and covariates from two time points. The random in-tercept b0k is generated for each trial from a Normal(0,τ2) distribution whereτ ∈ [0,10,20,30,40]. The results of 1000 independent trials are shown. Morespecific details about the simulation settings are described in Section 3.1.Figure 2.4 and Table 2.1 show that linear regression performs better than tem-poral adjusted prediction when τ = 0, but as τ increases, then temporal adjusted18Figure 2.4: Mean Square Prediction Error of Temporal Adjusted Data UsingTemporal Adjusted Prediction and Linear RegressionMethod τ = 0 τ = 10 τ = 20 τ = 30 τ = 40Temporal Adjusted Prediction 50.04 50.06 50.06 50.11 50.41Linear Regression 25.15 124.43 422.88 920.93 1623.87Table 2.1: Mean Squared Prediction Error of Temporal Adjusted Predictionand Linear Regression when varying Standard Deviation of Interceptprediction predicts Y2k more accurately than linear regression.These results are not unexpected since when τ = 0, there is no random interceptand this means that almost all of the data in both time points line up well with theestimated regression line. The only variability remaining in the model is producedby e1k and e2k. If the linear regression model fits well, then the prediction erroris due to the inability of the fitted linear regression to predict e2k. However, sincethe temporal adjusted prediction includes Y1k in its predictions, it includes e1k intoits predictions as well. Combined with the error from e2k, this explains why theMSPE for temporal adjusted prediction is roughly double that of linear regressionwhen there is no variation in the random intercept.The MSPEs of TAP and linear regression are in line with the theoretical ex-pected MSPEs calculated above. In these simulations, φ 2 = 25 and therefore the19MSPE of TAP for each value of τ is approximately 2φ 2 = 50. The observed MSPEof linear regression is approximately φ 2 + τ2 = 25+ [0,102,202,302,402] whenτ = [0,10,20,30,40].2.4.3 Robust Temporal Adjusted PredictionIt is well documented that least squares estimates are highly sensitive to outliers[21]. Least squares estimates are obtained by minimizing the total sum of squaredresiduals. Even one extreme point in the dataset can result in a large residual and aneven larger squared residual. In order to minimize the sum of squared residuals, theleast squares regression line will be skewed towards to this outlier. This effectivelycauses the outlier to have more influence on the estimated regression coefficientsthan non-outlying points. In order to provide additional robustness to outliers,estimated regression coefficients used in the temporal adjusted prediction modelare estimated using a robust procedure rather than being estimated by least squares.In our motivating example, when we plot the relationship between the Indianreserve populations according to the 2011 Census and the Indian Register in 2011,we can see that the relationship between these two variables is positive and stronglylinear. However, in Figure 2.5 we can see that some potential outliers exist in thedataset. These bands have low reserve populations according to the IR but highreserve populations according to the census. Possible causes of these outliers arediscussed in Chapter 4, but they present a problem when it comes to estimatingregression coefficients using least squares.Although it is possible to identify and remove outliers from the dataset man-ually in simple cases where outliers are obvious, classifying an observation as anoutlier is not always trivial. A point that may look extreme can possibly be ex-plained by another variable and may not be an outlier once additional factors areconsidered. Some points may not be obvious outliers but can still be influential. InFigure 2.5, some potential outliers are highlighted in blue. Some of these observa-tions are obvious outliers since they are located far away from the bulk of the data.However, for others it is not clear whether these observations are true outliers or ifthey occur due to natural variability.There are many robust regression methods in the current literature but for tem-20Figure 2.5: Potential Outliers in the Indian Reserves Population Dataporal adjusted prediction, we estimate the regression parameters β using MM-Estimation introduced by Yohai (1987) [21]. In robust statistics, a common mea-sure of an estimators’ resistance to outliers or robustness is the breakdown pointintroduced by Hampel (1971) [1]. The breakdown point of an estimator is thelargest proportion of outliers a sample can contain before the estimated parametersare broken [21]. The main benefits of using MM-estimation are that it has a highbreakdown point and is highly efficient when the errors are normally distributed[21].In order to define MM-estimator, we must first define scale M-estimators [2][21].As stated by Yohai (1987) [21], let ρ be a real function that satisfies the followingassumptions:1. ρ(0) = 02. ρ(−u) = ρ(u)3. 0≤ u≤ v =⇒ ρ(u)≤ ρ(v)4. ρ is continuous215. Let a= supρ(u), then 0 < a< ∞6. If ρ(u)< a and 0≤ u≤ v, then ρ(u)< ρ(v)The M-scale estimator s(u) of a sample u = (u1,u2, · · · ,un) is the value s thatsatisfies1nn∑i=1ρ(ui/s) = b, (2.4)where b= Eφ (ρ(u)) and φ is the standard normal distribution [21].The MM-estimator is calculated using the following three-step procedure. First,take an initial estimate T 0,n of the regression coefficients β with a high breakdownpoint. Next, compute the residuals r(T 0,n) and compute the scale M-estimate of theresiduals sn = s(r(T 0,n)) as in equation 2.4 such that a=maxρ0(u) and b/a= 0.5.Lastly, let ρ1 be function where ρ1(u) ≤ ρ0(u), supρ1(u) = supρ0(u) = a andψ1 = ρ ′1. The MM-estimate T 1,n is the solution ofn∑i=1ψ1(ri(β )/sn)xi = 0, (2.5)which verifies S(T 1,n)≤ S(T 0,n) whereS(β ) =n∑i=1ρ1(ri(β )/sn) = 0, (2.6)and ρ1(0/0) is defined as 0 [21].In order to apply MM-estimates to temporal adjusted prediction, we replacethe estimated coefficients via least squares βˆ in equation 2.1 with those estimatedvia MM-estimation. This is referred to as robust temporal adjusted prediction, orrobust TAP. For this thesis, MM-estimates are computed in R using the packagerobustbase [11].22Chapter 3Simulation StudyIn this chapter, we compare the prediction performance of temporal adjusted pre-diction and robust temporal adjusted prediction against those of other selectedmodels including linear regression, robust regression and extrapolation under avariety of simulation settings. Section 3.1 describes the method of data generationfor the simulations in this chapter which is based on the Indian reserve populationsdataset mentioned throughout this thesis. Features present in that dataset (includingoutliers, additional covariates and data from an additional time point) are individu-ally added to the simulations so that we can observe their effects on the predictionaccuracy for each of the models listed above.Section 3.2 shows the prediction performance, measured by the MSPE, of theselected models on our simulated data. Subsection 3.2.1 shows the results of simu-lations carried out under our default settings without the inclusion of any additionalfeatures. In Subsection 3.2.2, we add outliers into the dataset and then compare theprediction performance of each model. Subsection 3.2.3 discusses the effects ofadding additional covariates on the prediction accuracy of each model. Lastly, inSubsection 3.2.4, we add data from a third time point into the dataset to accommo-date the inclusion of a linear mixed effects model with a random intercept and fixedslopes, and then discuss the prediction performance of each model. The parame-ters of this linear mixed effects model are estimated using maximum likelihoodmethods.233.1 Data GenerationThis section explains how data is generated for the simulations performed in thischapter. The data is generated according to the random intercept model describedin Section 2.4.1. Specific values of parameters and distributions of independentcovariates are meant to model the Indian reserve populations dataset described inChapter 4. For example, in the Indian reserve populations dataset, the IR on-reservepopulation roughly follows an exponential distribution, so the distribution of thecovariate X1 jk in the data generation model also follows an exponential distribution.This section only describes the default settings of data generation that are commonacross all simulations. Differences in data generation methods that are specific toa certain setting are described in the corresponding subsection.For simulations in Subsection 3.2.1, we generate a dataset of n = 500 obser-vations for two time points according to a random intercept model with a singlecovariate and a random error term. The data for the first time point is generatedaccording to the formulaY1k = (β0+b0k)+β1X11k+ e1k,where X11k ∼Exp(0.001), β0 = 0, β1 = 0.82 and e1k ∼Normal(0,φ 2)with φ = 10.The random intercept term b0k follows a Normal(0,τ2) distribution with τ = 20. Itis important to note that the values of b0k are exactly the same for both Y1k and Y2ksince this random term models the variability that is inherent to a specific individualacross both time points.Since data for the second time point must be correlated with the first time point,we multiply the independent X11k variable by a growth factor α1 = 1.2 and then addsome random noise η1k ∼ Normal(0,102) to get X12k. This growth factor is meantto emulate the growth in population that reserves will experience between censuses.The value of X12k is given by X12k = α1 ∗X11k+η1k. Although this growth factoris only applied to X11k, growth in the response variable Y2k will be reflected whenY2k is calculated according to the formulaY2k = (β0+b0k)+β1X12k+ e2k.24The response at the second time point Y2k is generated using the same values ofβ0,b0k,β1 values used to generate Y1k. The random error term e2k ∼Normal(0,φ 2)with φ = 10. The random error terms at the first and second time points, e1k ande2k, are identically and independently distributed.All results shown in this chapter are based on 1000 independent trials in eachsetting.3.2 Simulation Results3.2.1 Default SettingsThe MSPE of 1000 independent trials under the default simulation settings areshown in the boxplots in Figure 3.1 with the median MSPEs reported in Table 3.1.A few extreme values of MSPE for extrapolation are not shown for some plotsbecause they would distort the Y-axis of the plot if included. However, they arestill factored into the median MSPEs shown in the tables. The number of omittedvalues for each plot is mentioned in the corresponding subsection. Some of thedistributions of MSPE are right skewed so the median is better summary statisticof different methods.Method MSPEExtrapolation 105.43Linear Regression 421.61Robust Regression 421.59Robust Temporal Adjusted Prediction 49.95Temporal Adjusted Prediction 49.93Table 3.1: Comparison of Median MSPE under Default SettingsTable 3.1 shows that TAP and robust TAP achieve the lowest median MSPEsfollowed by extrapolation and then by linear regression and robust regression. Thisresult is expected since it was already shown that TAP predicts better than linearregression when the variance of the random intercept term is large relative to thevariance of the random error term. Extrapolation usually achieves a MSPE some-where between linear regression and TAP.25Figure 3.1: Comparison of MSPE under Default SettingsIt should be noted that 92 MSPE points for the extrapolation method are notdisplayed in the Figure 3.1 since they exceed a maximum bound of 800 and in-cluding all values would distort the Y-axis in the plot. This is likely due to the factthat points with a very low X11k value can result in a very high Y1kX11k ratio. If X11k isvery small for an individual and it experiences even a moderate increase, then theprediction for that individual will be extremely poor. Even a single bad predictioncan badly affect the MSPE which occurs in a number of trials in this simulation.3.2.2 OutliersIn this section, we show the effects of adding outliers to the generated dataset onprediction performance. Some outliers are found in the Indian reserve populationsdataset and the outliers generated in this section are meant to mimic their location26Figure 3.2: Comparison of MSPE under Default Settings (Zoomed In)and effects on the fitted models. In the first simulation, we randomly replace 5%of the data with points that are generated followingX11k ∼ Exp(0.01),Y1k = (β0+b0k)+β1X11k+ e1k.where β0 = 0, β1 = 0.82 and e1k ∼ Normal(0,φ 2) with φ = 10. Note that theX11k values for the outliers are generated in a similar fashion to those from non-outliers with the exception that the rate parameter of the exponential distribu-tion changes from 0.001 to 0.01. This adjustment is made to reflect the fact thatmost outliers located in the reserve populations dataset have lower values IR on-reserve populations, but higher census populations for a given year. To match the27higher values of the response for the outliers in this simulation, we generate therandom intercept term from b0k from a non-standardized Student’s t-distributiontd. f .=1(µ = 1000,σ = 10). Since the generated dataset focuses on the same set ofindividuals over time, the outliers are also present in Y2k since Y1k and Y2k are bothgenerated using the same random intercept, b0k.The MSPEs of 1000 trials are shown in the boxplots in Figure 3.3. The corre-sponding median MSPEs are displayed in Table 3.2.Figure 3.3: Comparison of Mean Squared Prediction Error with 5% OutliersWhen the data contains outliers, the prediction performance of TAP and robustTAP is approximately the same as when it contained no outliers while all othermethods experience large increases in their MSPE. Once again, this behaviouris expected since, a robust estimation method should still be able to accuratelyestimate the true regression parameters when there is a small amount of outlier28Figure 3.4: Comparison of Mean Squared Prediction Error with 5% Outliers(Zoomed In)Method MSPEExtrapolation 32491.85Linear Regression 46411.11Robust Regression 50618.20Robust Temporal Adjusted Prediction 50.05Temporal Adjusted Prediction 195.46Table 3.2: Comparison of Median MSPE with 5% Outlierscontamination, while the least squares estimates may be more severely affected. Itmay seem counter-intuitive that robust regression predicts worse than linear regres-sion when outliers are included but this behaviour is expected. Robust regression29methods are able to estimate the underlying distribution of the non-contaminatedpoints while placing less weight on outliers whereas linear regression using leastsquares weighs the outliers more heavily than the non-contaminated data in thissimulation. When it comes to prediction, robust regression can predict the non-contaminated data well but poorly predicts the outliers since it is not trying to fitthose points. Linear regression does a mediocre job at fitting the outliers and thenon-contaminated points and does a mediocre job predicting both as well. As a re-sult, robust regression does a better job predicting the non-contaminated points, butlinear regression does better on predicting the overall dataset. Other robust mea-sures of prediction performance such as median absolute prediction error (MAPE)are a suitable measure of performance prediction when outliers are present. Thisperformance measure will be used in Chapter 4 along with MSPE.The vertical displacement of the outliers are included in the random interceptterm b0k and therefore do not affect the underlying relationship between X11k andY1k, although they may affect how well each method is able to estimate β1. Thisis why TAP is still able to predict relatively well when outliers are present. Apoint that is in an extreme location in the first time point will also be in an extremelocation in the second time point. If there is little growth between the first andsecond time points in those outlying points in X1 jk and Yjk, then the value of thefirst time point Y1k should approximate Y2k well even if the linear regression fit isnot able to accurately estimate β1. Although TAP is able to predict relatively wellwhen outliers are present, there are gains to be made by using its robust alternativeinstead since robust estimation methods are able to estimate β1 better and thereforepredict more accurately. By starting the predictions at an already outlying pointY1k,robust TAP overcomes the drawback of not being able to predict the outliers thatrobust regression suffers from. Also, note that 271 extreme values of MSPE areomitted from Figure 3.3 since their inclusion would distort the axes in Figure 3.3.In Figure 3.5 we show the prediction results when the proportion of outliersin the simulated data is increased to 10%. As more outliers are introduced intothe data, the results discussed above become even more defined. Robust TAP hasa similar MSPE compared to when the data contains only 5% outliers instead of10%. Other models show substantial increases in MSPE when the proportion ofoutliers is increased. TAP is the next best performing model in terms of median30Figure 3.5: Comparison of Mean Squared Prediction Error with 10% OutliersMethod MSPEExtrapolation 131135.92Linear Regression 84931.00Robust Regression 101153.51Robust Temporal Adjusted Prediction 49.90Temporal Adjusted Prediction 557.25Table 3.3: Comparison of Median MSPE with 10% OutliersMSPE followed by linear and robust regression and then extrapolation.31Figure 3.6: Comparison of Mean Squared Prediction Error with 10% Outliers(Zoomed In)3.2.3 Additional CovariatesIn this simulation, we examine the effects of adding two additional covariates intothe model, one quantitative and one binary. In addition to making highly variablepredictions when the values of X11k are low, another major drawback of linearextrapolation is that it cannot easily factor in the effects of additional covariates tomake predictions and therefore it will not be featured in this section. The responsevariable at the first time point, Y1k, is generated according toY1k = (β0+b0k)+β1X11k+β2X21k+β3X31k+ e1k,where X21k ∼ Normal(0,202) and X31k ∼ Bernoulli(0.5) with the corresponding32coefficients β2 = −20 and β3 = 100. Other parameters and variables, β1,X11k,e1kand b0k, are still created according to the default settings described in Section 3.2.1.The second covariate at the second time point X22k is generated in a similar manneras X12k. A growth factor α2 = 1.3 is applied to X21k and then some random errorη2k ∼ Normal(0,102) is addedX22k = α2 ∗X21k+η2k.Lastly, X31k does not change over time, hence X31k = X32k.Figure 3.7: Comparison of Mean Squared Prediction Error with an Addi-tional VariableThe results shown in Figure 3.7 and Table 3.4 are comparable to those in Sub-section 3.2.1. In fact, the median MSPEs shown in Table 3.4 are almost identical to33Figure 3.8: Comparison of Mean Squared Prediction Error with an Addi-tional Variable (Zoomed In)Method MSPELinear Regression 422.71Robust Regression 423.61Robust Temporal Adjusted Prediction 50.134Temporal Adjusted Prediction 50.10Table 3.4: Comparison of Median MSPE with Additional Covariatesthose shown in Table 3.1 which reports the median MSPE under the default simula-tion settings. All models are able to estimate the additional regression coefficientsβ2 and β3 well and therefore accurately predict the effects that X22 and X32 have onthe response Y2. The increase in prediction error comes from the additional vari-34ability introduced by η2k. From this simulation, we can conclude that none of theincluded estimation procedures are heavily impacted by the inclusion of additionalcovariates.3.2.4 Additional Time PointIn this subsection, we add an additional time point to the dataset to accommodatethe inclusion of a linear mixed effects model which is more suitable for fittinglongitudinal data compared to linear regression. Although temporal adjusted pre-diction can be considered a special case of linear mixed effects models, when wemention LME models in this thesis, we are referring to LME models where theparameters are estimated using maximum likelihood methods. The parameters ofTAP are estimated as described in Section 2.4. Since we are introducing an addi-tional time point, we need to define notation that is specific to this subsection. Mostparameters and variables take on the same meaning as described in Section 3.1with the exception of a new term β1 j. The parameter β1 j represents the true valueof the regression coefficient β1 at time point j. The data generation model for theresponse variable Yjk at time point j ∈ [1,2,3] is given byYjk = (β0+b0k)+β1 jX1 jk+ e jk,where X11k and X12k are generated the same way as described in the default settings.X13k is generated by multiplying X12k by a growth factor α1 = 1.2 and then addingrandom error η2k ∼ Normal(0,102). The random error term e jk ∼ Normal(0,52)at each time point j.In this subsection, we also examine the effects of violations to our constantslope assumption. In this section, we denote β1 j as the true regression coefficientthat represents the relationship between covariate X1 jk and response Yjk at timepoint j ∈ [1,2,3]. In previous subsections, we assumed that β1 j remains constantover time; however in practice this may not always be the case. The constantrelationship over time is one of the key assumptions for predicting with TAP andLME models. We investigate the effects of slight deviations of this assumption onprediction accuracy.In the case of the reserve populations dataset, fitting a linear regression model35at each of the available time points shows that there are small differences in theestimated coefficient β1 j at each time point and therefore the assumption of a con-stant relationship between X1 jk and Yjk may be not be true. We test our selectedmodels under three different scenarios to see the effects of violations to the as-sumption that β1 is constant. Model parameters are estimated with complete datafrom the first and second time points available, and used to predict Yˆ3k given X13k.Scenario (1): β1 is constant over time, β11 = β12 = β13 = 0.82This is the default setting where the relationship between X and Y is constant overtime.Scenario (2): β1 is not constant over time, β11 6= β12 = β13In this scenario, we test to see the effects of changing β11 to 0.84 while the othertwo coefficients β12 and β13 remain at 0.82. This modification should only affectthe linear mixed effects model since it relies on data from first and second timepoints to fit the model, whereas the other models only utilize data from the secondtime point. We test this to see how a small change in β11 can affect the predictionsproduced by the linear mixed effects model.Scenario (3): β1 is not constant over time, β11 = β13 6= β12This is similar to the second scenario except here we change β12 to 0.84 whilethe other two coefficients β11 and β13 remain at the default value of 0.82. Sinceall model parameters are estimated using data at the second time point, a changeto the coefficient β12 should negatively impact the prediction accuracy of all threemodels. The fitted models are trying to predict data points that are generated byslightly different parameter values.In this subsection, we fit a linear mixed effects model with a random intercept,temporal adjusted prediction, robust temporal adjusted prediction, linear regres-sion, robust regression and extrapolation models.The results in Figure 3.9 and Table 3.5 show that in Scenario (1) when theslope β1 is constant over time, the LME model gives the best predictions followedclosely by robust TAP, TAP and then linear extrapolation. Linear regression androbust regression are not shown in Figure 3.9 so that their inclusion does not distortthe axes.In Scenario (2) we see that even a small change in one of the coefficients hasa noticeable effect on the prediction ability of the LME model. The LME model36Figure 3.9: Comparison of Mean Squared Prediction Error with Data with anAdditional Time Point: Scenario (1)Method MSPE (1) MSPE (2) MSPE (3)Extrapolation 99.23 99.23 1796.79Linear Mixed Effects Model 37.70 990.14 5152.61Linear Regression 424.38 424.38 2076.60Robust Regression 424.30 424.30 2078.29Robust Temporal Adjusted Prediction 49.93 49.93 1702.12Temporal Adjusted Prediction 49.94 49.94 1702.06Table 3.5: Comparison of Median MSPE with Data with an Additional TimePoint under three Scenariosgoes from having the lowest median MSPE in the previous scenario to having the37Figure 3.10: Comparison of Mean Squared Prediction Error with Data withan Additional Time Point: Scenario (2)highest. The other models have exactly the same median MSPE because theirparameters are estimated using only data from the second time point so a changein β11 does not affect them.When β12 is modified as in Scenario (3), the prediction accuracy of all modelsdecreases. This is expected since the parameters of all models except for the LMEmodel are estimated using data exclusively from the second time point. If this datais generated with a different slope β12 at 0.84, then the fitted models are not able topredict the points from the third time point well since the model fitted on the dataat the second time point is not the same as the underlying model that generated thedata at the third time point. However, in this scenario the LME model still predictsworse than the other methods.38Figure 3.11: Comparison of Mean Squared Prediction Error with Data withan Additional Time Point: Scenario (3)From these results, we conclude that the LME model performs the best whenthe coefficient β1 j is constant over time. However, even a small change in the co-efficient between time points can have damaging effects on prediction accuracyfor the LME model. Although the prediction accuracy of other models is also af-fected when β1 j changes between time points, they do not suffer to the same extentas the LME model. From these results, we can see that the LME model is moresensitive to violations of the assumption of a constant relationship between thecovariates and the response variable, compared to TAP. In the Indian reserve pop-ulations dataset, we expect there to be small changes in the regression parametersbetween censuses. In the exploratory analysis, when robust regression models arefitted using data from 2006 and 2011, we see that there are small differences in the39regression parameters between the two time points. This will be further discussedin Section 4.3.1.40Chapter 4Case Study: Prediction ofPopulations of Indians Reservesin CanadaAs discussed during the Introduction, the main motivation behind the developmentof temporal adjusted prediction comes from the goal of predicting the populationof Indian reserves in Canada for the 2016 Census using the Indian Register (IR) asan auxiliary data source. These predictions are used to confront the actual countsof the 2016 Census. If there are large discrepancies between the actual countsand model expectations, then it could warrant further investigation to determinethe cause of the difference. Most reserve inhabitants are band members that areregistered on the IR, [20] [16] so the IR should provide up to date informationon the population of each reserve [6]. If there are changes to the population of areserve according to the IR, then these changes are expected to be reflected by thecensus as well. In Canada, the census is conducted every five years, on years endingwith a 1 or 6, since 1871. However, data from the IR on band populations is onlyavailable starting from 2006 so we can only use data from the 2006, 2011 and 2016Censuses and IR for this case study. Using the IR and census data from 2006 and2011, we construct a model that can predict the populations of reserves for 2016based on population counts from the 2016 IR. In this chapter, we take a deeperlook at the data from the 2006, 2011 and 2016 Censuses and the Indian Register41and compare the prediction accuracy of temporal adjusted prediction against theprediction accuracy of the other methods discussed in this thesis. In Section 4.1,we introduce some background information regarding the census and IR necessaryfor the understanding of this case study. Section 4.2 provides a more in-depthexplanation of the data used in this study and some of the steps required to create adataset ready for analysis. Lastly, Section 4.3 describes the analysis of the IndianRegister and census data. This section also compares the prediction performancesresults of temporal adjusted prediction and other models when used to predict thepopulation of Indian reserves in Canada for the 2016 Census.4.1 Background4.1.1 The Indian RegisterThe Indian Register is the official record of all Registered Indians in Canada. TheIR was created in 1951 by the government in order to identify individual First Na-tions people and the bands to which they belonged [5]. This information is usedto determine to which benefits people are entitled through treaty agreements withbands. In order for a person to become a Registered Indian, they have to prove theirdescent from a person who is also a Registered Indian. Additions and deletions tothe IR are made by the Indian Registrar, an employee of Indigenous and North-ern Affairs Canada (INAC). The Indian Registrar is responsible for determiningwhether or not a person is eligible for registration on the IR [5].Although the IR is a relational database that was originally created for admin-istrative purposes, the information that it contains can still be useful for statisticalanalysis and modelling. Based on the information available from the IR, we canobtain the number of registered band members living on a band’s own reserves.The variables that are necessary for calculation of own-band on-reserve popula-tion counts include the band membership, the location of residence (whether anindividual lives on their own bands reserve or not), and their activity status. Mostindividuals on the IR are active; individuals who are not active include those whoare presumed or confirmed as deceased, and those who have given up their Reg-istered Indian status. The IR only states whether an individual lives on a reserve42that belongs to their own band or not and does not differentiate between specificreserves if a band is associated with multiple reserves. The IR also indicates ifan individual lives on a reserve of another band, but since it does not list whichother specific band or reserve, this information is not used. The in-scope popula-tion from the Indian Register are individuals who are currently indicated as livingon the reserve of their own band and are also currently active. The data used in thisthesis captures the state of the Indian Register on May 31st of each census yearand is tabulated to produce counts of registered band members on each band’s ownreserves. The IR on-reserve population of a given First Nation is defined as thenumber of active band members living on a reserve or crown land that is associatedwith their own band according to the Indian Register on May 31st of a given censusyear.4.1.2 Differences Between the Indian Register and CensusPopulationsThe census population of a reserve is defined as the number of people living ina particular census subdivision (CSD) associated with a reserve as measured bythe Census of Population in Canada for a given census year in the month of May.Statistics Canada defines a CSD as an “area that is a municipality or an area that isa deemed to be equivalent to a municipality for statistical reporting purposes (e.g.,as an Indian reserve or an unorganized territory)”[17]. Ideally, the IR on-reservepopulation should approximate well the census reserve population if most reservesare inhabited mainly by members of the band that the reserves are associated with.The IR is updated more regularly than the census so it is expected to capture mostof the changes in a reserve’s population since the previous census. However, thereare many differences in population counts between these two sources.If the IR on-reserve count is reflected perfectly by the census, then we canexpect the census population of a reserve to be at least as large as the populationaccording the IR. There could be people who are not listed on the IR that live on-reserve (e.g. members of other bands, people who have not registered with theIR yet, general population, etc.) that are captured by the census. There are alsocases where people living on a reserve are not included in the IR. Often more thanone factor contributes to differences across these data sources. Figure 4.1 shows43that the differences between the census and IR on-reserve populations for bandsin 2011. Points that are below the blue identity line (X = Y ) are reserves wherethe IR population exceeds that of the census, suggesting that many people who arerecorded as living on-reserve according to the IR may not be living there at thetime of census collection.Figure 4.1: 2011 Census Reserve Populations vs. 2011 IR On-Reserve Pop-ulationsConversely, a number of reserves have populations according to the censusthat are much larger than the IR on-reserve population. The most extreme of thesecases is the Westbank First Nation located in British Columbia which had 396 bandmembers living on-reserve according to the IR in 2011, but 7068 people living thereaccording to the 2011 Census [16]. Some First Nations such as the Westbank FirstNation choose to lease parts of their reserve to non-band members and thereforecould have larger populations relative to the number of registered band membersliving there [9][4]. Since the population according to the census is so much higherthan the IR on-reserve population for these reserves, they represent possible out-liers that may have high influence on the estimated parameters of regression modelsthat we fit on the data.44There are many possible reasons why people listed as living on-reserve ac-cording to the IR may not appear in the census. The IR on-reserve population onlyconsiders registered band members who live on their own band’s reserve lands, butsome reserves could be residences for members of other bands as well. The term“registered band members” refers to band members who are also registered on theIR. Some First Nations people may be registered as a member of a band and livingon a reserve, but unless they are also listed on the IR, they are not included in theIR on-reserve population. The census captures the state of the country on the dateof the census, which was May 16 in 2006 and May 10 in 2011 and 2016. However,the Indian Register data corresponds to May 31st of each census year. If changesoccurred after census collection but before data was taken from the Indian Register,this would also lead to a discrepancy between the two counts.In addition, a shortcoming of using the IR as an auxiliary data source is that itis usually updated via self-reporting by band offices [6]. These updates often occurafter major life events of registered individuals such as births and deaths. A resultof this is that the information contained can be outdated if bands or members donot self-report soon after these changes occur. In addition, many children are notregistered with the IR as many of the benefits associated with being a RegisteredIndian are not relevant until a person reaches adulthood. This could lead to under-estimation of census populations since children would be included in the censuscount if they were living on-reserve at the time. Deaths of Registered Indians aresometimes reported late or not at all since it is up to band or family members tonotify INAC of these changes [7]. There are also possible issues related to theresidence of Registered Indians as reported on the IR. Similar to other life changessuch as births and deaths, moves on or off reserve may be reported late or not atall. If a person had a residence on-reserve and a residence off-reserve, it is possiblethat they are listed as on-reserve on the IR, but could be residing elsewhere on thedate of the census [7]. All of these factors combined lead to discrepancies betweenthe IR and census populations.Although IR counts cannot be used to predict reserve populations directly, ac-curate predictions can still be made using an appropriate model.454.2 Data DescriptionData used for this case study was obtained from Statistics Canada and Indigenousand Northern Affairs Canada. INAC provided tables from the IR that listed eachFirst Nation along with their own band on-reserve population for each census yearfrom 2006 to 2016. Bands with an own band on-reserve population of less than10 persons are censored due to data confidentiality concerns. The remainder ofthe data comes from tables that are publicly available on Statistics Canada’s web-site and through the software GeoSuite [13], also through Statistics Canada. Thedata obtained from Statistics Canada’s website includes census tables for 2011 and2016 which contain the population count for each census subdivision for all threecensuses of interest along with additional information such as the province of eachsubdivision and whether or not that area corresponds to a non-enumerated Indianreserve or not. This data was linked to the IR data using an additional table fromStatistics Canada which lists all census subdivisions that are associated with eachFirst Nation in Canada. Lastly, census metropolitan area (CMA) codes are obtainedusing GeoSuite. The CMA code is used to determine the census metropolitan influ-enced zone (MIZ) level of a reserve. Geosuite is a publicly available program thatis provided by Statistics Canada that allows users to explore the links between allstandard levels of geography and to identify geographic codes, names, unique iden-tifiers, and, where applicable, types, as well as land area and population dwellingcounts [14].The response variable for each census year is the population of each reserveaccording to the census. The covariates include the IR on-reserve population asdefined above, the province/territory that a band’s reserves are located in, and theweighted census MIZ level of a band’s reserves. The levels of provinces and territo-ries include British Columbia (BC), Alberta (AB), Saskatchewan (SK), Manitoba(MB), Ontario (ON), Quebec (QC), Atlantic (AT), Yukon (YT) and the North-west Territories (NT). The Atlantic provinces New Brunswick, Newfoundland andLabrador, Nova Scotia, and Prince Edward Island each contain only a few reservesand are grouped together for this analysis. In addition, there are no First Nationslocated in Nunavut and therefore Nunavut is not listed as a level in our analysis un-der province/territory. It should be noted that according to the IR, the Taku River46Tlingit and Liard First Nation are both registered in the Yukon, however their asso-ciated reserves are located in BC according to the census. For this analysis, thesebands will be considered as a part of BC.The census MIZ variable is a measure of a reserves’ proximity to a metropoli-tan area measured on a ordinal scale from 0 to 5. Definitions of the census MIZlevels are shown in Table 4.1[17]. The reason for the inclusion of this variable isthat bands with reserves located in or near census metropolitan areas/census ag-glomerations are hypothesized to have higher populations. CMAs are defined asareas consisting of one or more neighbouring municipalities situated around a corethat has “a total population of at least 100,000 of which 50,000 or more live inthe core”[17]. A census agglomeration (CA) is an area with “a core population ofat least 10,000”[17]. Because some bands have multiple reserves that may be lo-cated in locations with differing proximity to a CMA, a rounded weighted averageby the census population of each reserve in 2011 is used to compute the censusMIZ level for a band. Since reserve locations do not change between censuses, thesame values of province and census MIZ level are used for constructing regressionmodels in 2006, 2011 and 2016. Although areas can potentially be reclassified asCMAs or CAs for the 2016 Census, it is unlikely to have any noticeable effect onour predictions.Level Definition0 Located in a CMA or CA1 Strong metropolitan influenced zone2 Moderate metropolitan influenced zone3 Weak metropolitan influenced zone4 No metropolitan influenced zone5 Located in a territoryTable 4.1: Levels of Census Metropolitan Influenced Zone4.2.1 Out-of-Scope BandsAccording to the data obtained from INAC, there are 633 unique First Nations inCanada as of May 2016. However, many of these bands and the reserves asso-47ciated with them are considered as out of scope for this thesis. Firstly, not everyFirst Nation in Canada is associated with a reserve and therefore is not of interestto this analysis. As mentioned earlier, bands that had on-reserve populations ofless than 10 are censored and these bands and their associated reserves are alsonot included in this analysis. In addition, some bands in the territories are associ-ated with CSDs that are not considered Indian reserves. According to the censusdictionary provided by Statistics Canada, the following CSD types are associatedwith on-reserve populations: Indian reserves(IRI), Indian settlements (S-E´), Indiangovernment districts (IGD), Terres re´serve´es aux Cris (TC), Terres re´serve´es auxNaskapis (TK), and Nisga’a land (NL) [17]. Many bands in the territories are as-sociated with other CSD types, and they are considered out of scope. Lastly, somereserves in Canada are incompletely enumerated and do not have available countsfor the 2006, 2011, or 2016 Censuses [19]. The bands associated with incompletelyenumerated reserves are also considered out of scope for this analysis.It should be noted that two reserves are associated with the Saddle Lake CreeNation in Alberta, Saddle Lake 125 and White Fish Lake 128 [18]. White FishLake 128 is enumerated in 2006, 2011 and 2016, however Saddle Lake 125 is not[16] [15]. As a result, these reserves are also omitted from this study since the IRon-reserve population for Saddle Lake Cree Nation does not differentiate betweenthe enumerated and non-enumerated reserve.In the spring of 2011, The Narrows 49, the reserve associated with the FirstNation Lake St. Martin in Manitoba experienced flooding [3]. According to thecensuses in 2011 and 2016, the population of this reserve was 826 and 5, respec-tively. As the reserve is virtually uninhabited as of the 2016 Census, Lake St.Martin and its associated reserve are also excluded from this analysis.4.2.2 Multi-Band Reserves and Multi-Reserve BandsWhile most First Nations are associated with a single inhabited reserve, some FirstNations are associated with multiple inhabited reserves. These bands are referredto as multi-reserve bands. Since the Indian Register does not differentiate whichband members live on which reserve, it only states whether someone lives on areserve that is associated with their own band, it is difficult to perform reserve spe-48cific predictions. Instead, populations of each reserve associated with a band aretotalled and modelling is done with the total reserve population for a band ratherthan the individual reserve population. We make the assumption that each reserveassociated with that band grows proportionally between censuses. Predictions ofspecific reserve populations are split based on the population ratios from the pre-vious census. For example, suppose that a band with two reserves had a total of100 people living on them according to the 2011 Census. If 70 people lived onone reserve and 30 people lived on another, then the predicted total reserve popu-lation for 2016 would be split at a ratio of 70 to 30 as well since we assume that allreserves associated with a band grow at the same rate.In addition, some reserves are associated with multiple bands, and are referredto as multi-band reserves. Although the reserve only has a single population ac-cording to the census, each band associated with a multi-band reserve has its ownon-reserve population according to the IR. To handle these reserves, each band thatshares a reserve has their IR population combined with other bands that share thereserve and are treated as a single band when modelling and predicting. Since weare only interested in reserve specific predictions, we do not need to split predic-tions by band like we did for bands associated with multiple reserves.A few reserves are associated with multiple bands which were also associatedwith multiple reserves. These groups are referred to as multi-band multi-reserveclusters in this thesis. An example of this are the Iskatewizaagegan #39 Indepen-dent First Nation and the Shoal Lake No.40 First Nation, which are both associ-ated with the Shoal Lake 34B2 reserve[18]. However, independently, they are alsoassociated with reserves that are exclusive to themselves. Iskatewizaagegan #39Independent First Nation is associated with the reserves Shoal Lake (Part) 39A andShoal Lake (Part) 39A (Manitoba) while Shoal Lake No.40 First Nation is associ-ated with the reserves Shoal Lake (Part) 40 and Shoal Lake (Part) 40 (Manitoba).Since we do not know how many band members of each reserve reside on theirown reserve and on the shared reserve, we combine their IR on-reserve and censuspopulations together and treat these two bands as a single band. As with predic-tions for multiple reserve bands, predictions for each individual reserve in 2016 aresplit based on ratios observed for the 2011 Census.Lastly, it should be noted that the Treaty Four Reserve Grounds 77 located49in the province of Saskatchewan are excluded from this study. This reserve isshared between 33 bands [8] and had a population of 10 people according to the2011 Census [16]. If we considered this group of 33 bands sharing the reserve asa multi-band multi-reserve cluster, then our predictions would suffer from a lossof granularity. In this situation, omitting a reserve of around 10 people is lessconsequential compared to grouping 33 bands together and analyzing them as onecluster.First Nations Associated ReservesIskatewizaagegan #39 IndependentShoal Lake No. 40Shoal Lake (Part) 39AShoal Lake (Part) 39A (Manitoba)Shoal Lake (Part) 40Shoal Lake (Part) 40 (Manitoba)Shoal Lake 34B2BearspawChinikiWesleyBig Horn 144AEden Valley 216Stoney 142,143,144Nisga’a Village of GingolxNisga’a Village of GitwinksihlkwNisga’a Village of Laxgalts’apNisga’a Village of New AiyanshNisga’aAishihikChampagneChampagne Landing 10Kloo LakeKlukshuTable 4.2: List of Multi-Band Reserves and Multi-Band Multi-Reserve Clus-tersTable 4.2 lists the multi-band reserves and the multi-band multi-reserve clus-ters included in this case study. Non-enumerated bands in multi-band multi-reserveclusters such as the Mohawks of Kahnawa`:ke and the Mohawks of Kanesatake, andthe Six Nations of the Grand River considered out of scope and not listed in Table4.2. In the list of Indian band areas and the census subdivisions they include [18],the Nisga’a Village of Gingolx, Nisga’a Village of Gitwinksihlkw, Nisga’a Villageof Laxgalts’ap and Nisga’a Village of New Aiyansh are associated individuallywith the Nisga’a villages (NVL) Gingolx, Gitwinksihlkw, etc. However, accord-50ing to census dictionary in 2011, these villages make up the CSD, Nisga’a Lands(NL)[17] and are treated as one reserve in this analysis. The Aishihik and Cham-pagne First Nations are also associated with the CSD, Haines Junction. However,this CSD is considered a village (VL) [18] and its population is not included in thecensus population for this multi-band multi-reserve cluster.4.3 Prediction of Indian Reserve Populations in 2016This section presents the prediction results of Indian reserve populations in Canadain 2016 using models compared in Chapter 3. Models are fit using data from the2011 Census and IR and predictions are made using data from the 2016 IR. Cen-sus and IR data from 2006 are also used to check model assumptions and to fit amixed effects model as well. Although one of the main advantages of temporaladjusted prediction over mixed effects models is that it does not require repeatedmeasurements to estimate model parameters, since we have access to data for 2006,we compare the performance of a mixed effects model here too. Subsection 4.3.1covers the exploratory analysis of the dataset and Subsection 4.3.2 presents theprediction performance results of each model in terms of MSPE. For this section,figures and numbers reported will be in terms of bands and not reserves unless oth-erwise specified. For example, Figure 4.1 compares the census and IR on-reservepopulations of the 505 in-scope bands rather than the 771 reserves that they areassociated with. This is done because the IR does not have reserve specific infor-mation for the bands that are associated with multiple reserves. However, since weare more interested in reserve populations instead of band populations, predictionresults will be reported at the reserve level rather than at the band level.4.3.1 Exploratory AnalysisAfter removing out-of-scope bands from the data, we are left with 505 bands whichare associated 771 reserves. Table 4.3 shows a cross tabulation of the categor-ical variables province and census MIZ. BC and Ontario have the highest num-bers of bands while the Yukon and Northwest Territories only include one bandeach. There are other bands located in the territories, but as mentioned in Subsec-tion 4.2.1, they are out of scope since they are not associated with a CSD type that51corresponds to an Indian reserve. Most bands have reserves that are located in areaswith weak or no metropolitan influence. BC has the highest proportion of bandsthat are located in either a census metropolitan area or a census agglomeration.Census MIZ Code0 1 2 3 4 5 TotalAB 7 0 14 5 14 0 40AT 12 0 6 6 8 0 32BC 49 3 16 24 90 0 182MB 0 0 4 27 28 0 60ProvinceNT 0 0 0 0 0 1 1ON 9 3 12 22 46 0 92QC 4 0 2 18 7 0 31SK 2 2 31 18 14 0 67YT 0 0 0 0 0 1 1Total 83 8 85 120 207 2 506Table 4.3: Cross Tabulation of Province and Weighted Census MIZThe first step of this analysis is to observe whether or not the assumption of aconstant relationship between our covariates and response over time holds. If thiscondition isn’t reasonably satisfied, then our fitted models will likely not predictthe population of Indian reserves in Canada well. When plotting the data in 2006and 2011, shown in Figure 4.2, we notice a few things. Although our data followsa linear pattern in both 2006 and 2011, there are some potential outliers in thedata that may have undue influence on the regression fit. This suggests that arobust regression model may be more appropriate for this dataset. In order to checkwhether or not the relationship between the covariates and response is constantover time, we fit a robust regression model on data from both years and comparethe estimated coefficients shown in Table 4.4. The baseline levels of province andcensus MIZ are Alberta and 0 respectively. We are unable to estimate a coefficientfor the level census MIZ of 5 since there is perfect collinearity between this variableand the levels of province, Yukon and Northwest Territories. Recall that a reservewith a census MIZ of 5 corresponds to a reserve located in a territory. In this case,we have only two bands located in the territories, one in the Yukon and the other52in the Northwest Territories. The estimated coefficients corresponding to thesevariables are able to capture the error relating to this level perfectly and thereforethe coefficient for level of census MIZ of 5 cannot be estimated and is not includedin Table 4.4.Figure 4.2: Census Reserve Populations vs. IR On-Reserve Populations in2006 and 2011Variable βˆi,2006 βˆi,2011Intercept 69.642 111.967IR On-Reserve Population 0.841 0.814Province:AT 33.289 -16.435Province:BC 7.469 -22.928Province:MB -75.037 -88.959Province:NT -76.703 -29.765Province:ON 3.461 -29.765Province:QC 37.024 47.084Province:SK 1.198 -24.773Province:YT -295.297 -147.485Census MIZ:1 9.498 43.691Census MIZ:2 -12.518 -22.674Census MIZ:3 -17.447 -6.353Census MIZ:4 -55.108 -60.631Table 4.4: Estimated Robust Regression Coefficients for 2006 and 2011 Re-serve PopulationsThe estimated coefficients related to census MIZ are similar in both 2006 and2011 across all levels. The estimated intercept for each year is the estimated reserve53population according to the census for a band with an IR on-reserve population of0, located in the province of Alberta and within a CA or CMA. The estimatedcoefficient for the IR on-reserve population indicates the expected increase in thecensus population of a reserve for a one person increase in the IR on-reserve pop-ulation for the corresponding year, while holding all other variables constant. Theestimated coefficients for province and census MIZ are the estimated effects of thelevels of province and census MIZ relative to the baseline. For example, the co-efficient corresponding to Ontario in 2011 is -29.765, meaning that a band withreserves located in Ontario is expected to have 29.765 less people on the censusin 2011 compared to a band in Alberta who share the same values of its other co-variates. One interesting thing to note is that reserves located near urban areas,or lower levels of census MIZ, correspond to higher populations while more re-serves in more remote areas correspond to lower populations. The most importantdifference between the two fitted models is the difference between the estimatedcoefficients for the IR on-reserve populations in 2006 and 2011 because this co-variate is highly correlated with the response. The difference in the estimated co-efficients between 2006 and 2011 is 0.027. As shown previously in Section 3.2.4,this could have implications on the prediction accuracy of each model if the esti-mates in 2006 and 2011 do not accurately reflect the relationship between censusand IR on-reserve populations in 2016. If the true regression parameters in 2016are similar to the estimates in 2011, then we can expect predictions of reserve pop-ulations for 2016 from temporal adjusted prediction to be accurate. Although thisis not a formal test of a constant relationship between the covariates and responseover time, it is reasonable to expect that the model will not change drastically in2016 as well.4.3.2 Prediction ResultsWe fit each of the models compared in Chapter 3 on the 2011 Census and IR dataand then predict the 2016 Census population of each reserve using the fitted modeland the IR on-reserve populations from 2016. We also include the predictionsresults of the LME model which was fit using both 2006 and 2011 data since itrequires repeated measures. Figure 4.3 shows boxplots comparing the prediction54performance of each model, and Table 4.5 shows the MSPE of each model whenpredicting the populations of each Indian reserve. Note that the MSPE reported inTable 4.5 differs from the MSPE reported in tables in Chapter 3. Tables in Chapter3 reported the median MSPE of 1000 simulated datasets, while Table 4.5 reportsthe MSPE of a single dataset.Figure 4.3: Comparison of Prediction Error of 2016 Reserve PopulationsFigure 4.4 shows a truncated view of Figure 4.3 so that we can see the dis-tribution of prediction errors more closely. Most of the prediction errors for eachmethod seen in Figure 4.4 are close to zero, although linear regression and robustregression appear to have the largest spread. These results are consistent with thesimulation results shown in Chapter 3 where methods like TAP, robust TAP, linearextrapolation and LME are able to predict better than linear regression.Linear extrapolation was able to predict with the highest accuracy among allof the selected models in terms of MSPE, followed by robust TAP and TAP, thenLME, and lastly linear regression and robust regression. It is interesting to seethat linear extrapolation predicts the best in terms of MSPE when it consistently55Figure 4.4: Comparison of Prediction Error of 2016 Reserve Populations(Zoomed In)predicts worse than TAP and robust TAP in our simulations. On closer inspection,this result is likely due the fact that extrapolation is able to predict the populationof the outliers better than the other models. As shown in Figure 4.3, each modelhad a large negative residual which corresponds to the Tsinstikeptum 9 reserve inBC. The Tsinstikeptum 9 and Tsinstikeptum 10 reserves are both associated withthe Westbank First Nation which was mentioned earlier in this chapter as an outlier56Method MSPE MAPEExtrapolation 14110.14 21.79Linear Mixed Effects Model 17425.72 21.37Linear Regression 128787.89 58.45Robust Regression 139002.38 36.92Robust Temporal Adjusted Prediction 15617.08 20.30Temporal Adjusted Prediction 15682.89 20.47Table 4.5: Mean Square and Median Absolute Prediction Errors of 2016 In-dian Reserve Populationswhere the total population of its reserves as measured by the census greatly exceedsthe IR on-reserve population for each census year. Although none of the models areable to predict the population of Tsinstikeptum 9 well, linear extrapolation comesthe closest. The residuals for Tsinstikeptum 9 are shown in Figure 4.3 as the lowestpoint in each boxplot. The poor prediction of just one reserve can considerablyinflate the MSPE as shown by each model in Table 4.5.Since there are outliers present in the data, we also compare another measureof prediction error that isn’t as severely affected by large residuals, the medianabsolute prediction error (MAPE). This measure is calculated in a similar mannerto MSPE, except we take the median of the absolute value of the residuals ratherthan the mean of the squared residuals. These values are also shown in Table 4.5.Robust TAP predicts with the lowest MAPE followed closely by TAP, LME,linear extrapolation and then by robust regression and linear regression. Theseresults do not exactly align with those observed in any of the scenarios discussedin Subsection 3.2.4. When fitting a robust regression model on the 2016 data aswe did on the 2006 and 2011 data in Subsection 4.3.1, the estimated coefficient forthe IR on-reserve population is 0.829. This means that our dataset most closelyresembles Scenario (3) of Subsection 3.2.4 where the true coefficient β1 is notconstant over time, although the results from the simulation and case study differ.In that simulation, it was shown that the prediction accuracy of all models areaffected because the true regression coefficient was different in the third time pointcompared to the first two time points. However, LME is the most badly affectedmodel among TAP, robust TAP and extrapolation. In this case study, we see that57LME is not as badly affected as it was in the simulations. However, under bothperformance metrics, LME still predicts worse than TAP and robust TAP in thiscase study.Although we can look at different performance metrics that show one model isbetter than another, from the results of this case study, we can at least conclude thatthe prediction accuracy of TAP is comparable to that of the LME model. We showthat TAP is able to produce similar prediction results as LME with the improvementthat it only requires data from a single time point to fit the model.58Chapter 5ConclusionIn this thesis we propose a new prediction model, temporal adjusted prediction,which is suitable for prediction of responses in datasets with observations for a setof individuals at multiple time points. Although linear mixed effects models canbe used in some instances, it requires repeated measures in order to fit a model. Insituations when repeated measures are not available, other models such as linearregression can be fit using data from a single time point. However, they do nottake into account the within-individual correlation when making predictions. Incases where observations are made on the same set of individuals, future values ofa response will be correlated with past values of the response.As shown in Section 2.4, predictions using TAP are based on a random in-tercept model even though it does not directly predict the random intercept term.Instead, the previous response values are used as starting points for predictions.A consequence of this is that additional error is introduced into the predictions.However, if the variance of the random intercept is greater than the variance of therandom error, then the expected MSPE of TAP is lower than that of linear regres-sion. As shown in simulations in Chapter 3 and the case study in Chapter 4, evenwhen repeated measures are available, TAP is shown to have similar predictionaccuracy compared to LME models with parameters estimated using maximumlikelihood methods despite requiring less data to fit a model.There are many possible extensions of the work presented in this thesis. Furthersimulations can be done comparing TAP to other models, especially linear mixed59effect models, and other performance metrics can be examined as well. When look-ing at MSPE alone, it was not conclusive that TAP or LME models were superiorto each other. Next, further research can be done on robust TAP by using differentestimation methods such as M-estimation, S-estimation, or least trimmed squaresto estimate the regression parameters. In addition, the simulations in this thesisonly examined the effects of outliers in the response because these were similar tothe outliers observed in the Indian reserves population data. However, research canbe done on how extreme values in the covariates affect the overall fit and predic-tion accuracy of robust TAP. Finally, one of the main weaknesses of TAP is that isdoes not predict as well as linear regression when the variation due to the randomintercept is small compared to random error. Improvements in this situation willmake TAP more widely applicable.60Bibliography[1] F. R. Hampel. A general qualitative definition of robustness. The Annals ofMathematical Statistics, 42(6):1887–1896, 1971. → pages 21[2] P. J. Huber. Robust statistics. Wiley, New York, 1981. ISBN9780471418054;0471418056;. → pages 4, 21[3] Indigenous and Northern Affairs Canada. Lake St. Martin floodingchronology of events. Gatineau, Canada, 2013. URLhttps://www.aadnc-aandc.gc.ca/eng/1367603834058/1367603955720.(accessed December 4, 2017). → pages 48[4] Indigenous and Northern Affairs Canada. Land management. Gatineau,Canada, 2014. URLhttps://www.aadnc-aandc.gc.ca/eng/1100100034737/1100100034738.(accessed January 4, 2017). → pages 44[5] Indigenous and Northern Affairs Canada. What is Indian status. Gatineau,Canada, 2017. URLhttps://www.aadnc-aandc.gc.ca/eng/1100100032463/1100100032464.(accessed December 4, 2017). → pages 1, 42[6] Indigenous and Northern Affairs Canada. How to update the IndianRegister. Gatineau, Canada, 2017. URLhttps://www.aadnc-aandc.gc.ca/eng/1100100032475/1100100032476.(accessed January 4, 2017). → pages 41, 45[7] Indigenous and Northern Affairs Canada. Registered Indian population byresidence and gender, 2013: summary statistics. Gatineau, Canada, 2017.URLhttps://www.aadnc-aandc.gc.ca/eng/1394032502014/1394032901691.(accessed January 4, 2017). → pages 4561[8] Indigenous and Northern Affairs Canada. Reserve/settlement/village Detail:Treaty Four Reserve Grounds 77. Gatineau, Canada, 2017. URLhttp://fnp-ppn.aandc-aadnc.gc.ca/fnp/Main/Search/RVDetail.aspx?RESERVE NUMBER=09329&lang=eng. (accessed December 4, 2017). →pages 50[9] Indigenous and Northern Affairs Canada. Westbank First NationSelf-Government. Gatineau, Canada, 2017. URLhttp://www.aadnc-aandc.gc.ca/eng/1100100031766/1100100031768.(accessed January 4, 2017). → pages 44[10] N. M. Laird and J. H. Ware. Random-effects models for longitudinal data.Biometrics, 38(4):963–974, 1982. → pages 3[11] M. Maechler, P. Rousseeuw, C. Croux, V. Todorov, A. Ruckstuhl,M. Salibian-Barrera, T. Verbeke, M. Koller, E. L. T. Conceicao, andM. Anna di Palma. robustbase: Basic Robust Statistics, 2017. URLhttp://robustbase.r-forge.r-project.org/. R package version 0.92-8. → pages22[12] P. J. Rousseeuw. Least median of squares regression. Journal of theAmerican Statistical Association, 79(388):871–880, 1984. → pages 4[13] Statistics Canada. Geosuite, 2016 Census. Catalogue no. 92-150-X, . →pages 46[14] Statistics Canada. Geosuite, reference guide, 2016 Census. Catalogue no.92-150-G, . → pages 46[15] Statistics Canada. Population and dwelling counts, for Canada and censussubdivisions (municipalities), 2016 and 2011 Censuses 100% data.Population and dwelling count highlight tables, 2016 Census. StatisticsCanada Catalogue no. 98-402-X2016001. Ottawa, Canada, 2017. URLhttp://www12.statcan.gc.ca/census-recensement/2011/dp-pd/hlt-fst/pd-pl/Table-Tableau.cfm?LANG=Eng&T=301&S=3&O=D. (accessed June 6,2017). → pages 48[16] Statistics Canada. Population and dwelling counts, for Canada and censussubdivisions (municipalities), 2011 and 2006 Censuses. Population anddwelling count highlight tables, 2011 Census. Statistics Canada Catalogueno. 98-310-XWE2011002. Ottawa, Canada, No date. URLhttp://www12.statcan.gc.ca/census-recensement/2011/dp-pd/hlt-fst/pd-pl/62Table-Tableau.cfm?LANG=Eng&T=301&S=3&O=D. Last updated August 9,2016. (accessed May 25, 2017). → pages 2, 41, 44, 48, 50[17] Statistics Canada. 2011 Census dictionary. Statistics Canada OnlineCatalogue no. 98-301-XWE. Ottawa, Canada, No date. URL http://www12.statcan.gc.ca/census-recensement/2011/ref/dict/index-eng.cfm.Last updated November 27, 2015. (accessed November 4, 2017). → pages43, 47, 48, 51[18] Statistics Canada. NHS aboriginal population profile,2011. Statistics CanadaCatalogue no. 99-011-XWE2011007. Ottawa, Canada, No date. URLhttp://www12.statcan.gc.ca/nhs-enm/2011/dp-pd/aprof/index.cfm?Lang=E.Last updated November 30, 2015. (accessed November 4, 2017). → pages48, 49, 50, 51[19] Statistics Canada. Guide to the Census of Population, 2016. StatisticsCanada Catalogue no. 98-304-X. Ottawa, Canada, No date. URL http://www12.statcan.gc.ca/census-recensement/2016/ref/98-304/index-eng.cfm.Last updated November 29, 2017. (accessed December 4, 2017). → pages48[20] Statistics Canada. Aboriginal peoples in Canada: First Nations people, Mtisand Inuit. Statistics Canada Catalogue no. 99-011-X. Ottawa, Canada, Nodate. URL http://www12.statcan.gc.ca/nhs-enm/2011/as-sa/99-011-x/99-011-x2011001-eng.cfm#a9. Last updated September 15, 2015.(accessed December 4, 2017). → pages 2, 41[21] V. J. Yohai. High breakdown-point and high efficiency robust estimates forregression. The Annals of Statistics, 15(2):642–656, 1987. → pages 3, 4, 20,21, 2263
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Temporal adjusted prediction for predicting Indian...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Temporal adjusted prediction for predicting Indian reserve populations in Canada Cho, Derek 2018
pdf
Page Metadata
Item Metadata
Title | Temporal adjusted prediction for predicting Indian reserve populations in Canada |
Creator |
Cho, Derek |
Publisher | University of British Columbia |
Date Issued | 2018 |
Description | In order to predict the population of Indian reserves in Canada for the 2016 Census, we can construct a suitable model using data from the Indian Register and past censuses. Linear mixed effects models are a popular method for predicting values of responses on longitudinal data. However, linear mixed effects models require repeated measures in order to fit a model. Alternative methods such as linear regression only require data from a single time point in order to fit a model, but it does not directly account for within-individual correlation when predicting. Since we are predicting the responses of the same set of individuals, we can expect responses at the next time point to be strongly correlated with past responses for an individual. We introduce a new method of prediction, temporal adjusted prediction (TAP), that addresses the issue of within-individual correlation in predictions and only requires data from a single time point to estimate model parameters. Predictions are based on the last recorded response of an individual and adjusted based on changes to the values of their covariates and estimated regression coefficients that relate the response and the covariates. Predictions are made using a random intercept model rather than a linear regression model. It is shown that if the random intercept accounts for a larger proportion of the random variation in the data than the random error term, then temporal adjusted prediction achieves a lower mean squared prediction error than linear regression. TAP performs better than linear regression when predicting on the same set of individuals at different time points. It also shows similar prediction performance compared to linear mixed effects models estimated with maximum likelihood estimation despite only requiring data from one time point in order to fit a model. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2018-04-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0365773 |
URI | http://hdl.handle.net/2429/65456 |
Degree |
Master of Science - MSc |
Program |
Statistics |
Affiliation |
Science, Faculty of Statistics, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2018-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2018_may_derek_cho.pdf [ 800.81kB ]
- Metadata
- JSON: 24-1.0365773.json
- JSON-LD: 24-1.0365773-ld.json
- RDF/XML (Pretty): 24-1.0365773-rdf.xml
- RDF/JSON: 24-1.0365773-rdf.json
- Turtle: 24-1.0365773-turtle.txt
- N-Triples: 24-1.0365773-rdf-ntriples.txt
- Original Record: 24-1.0365773-source.json
- Full Text
- 24-1.0365773-fulltext.txt
- Citation
- 24-1.0365773.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.24.1-0365773/manifest