
Statistical models for agroclimate risk analysis

by Mohamadreza Hosseini

B.Sc., Amirkabir University, 2003
M.Sc., McGill University, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies (Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)

November, 2009

© Mohamadreza Hosseini 2009

Abstract

In order to model the binary process of precipitation and the dichotomized temperature process, we use the conditional probability of the present given the past. We find necessary and sufficient conditions for a collection of functions to correspond to the conditional probabilities of a discrete–time categorical stochastic process X1, X2, .... Moreover, we find parametric representations for such processes, in particular rth–order Markov chains. To dichotomize the temperature process, quantiles are often used in the literature. We propose using a two–state definition of the quantiles, considering the "left quantile" and "right quantile" functions instead of the traditional definition. This has various advantages, such as a symmetry relation between the quantiles of the random variables X and −X. We show that the left (right) sample quantile tends to the left (right) distribution quantile at p ∈ [0, 1] if and only if the left and right distribution quantiles are identical at p, and diverges almost surely otherwise. In order to measure the loss incurred in estimating (or approximating) a quantile, we introduce a loss function that is invariant under strictly monotonic transformations and call it the "probability loss function." Using this loss function, we introduce measures of distance among random variables that are invariant under continuous strictly monotonic transformations. We use these distance measures to show that optimal overall fits to a random variable are not necessarily optimal in the tails. This loss function is also used to find equivariant estimators of the parameters of distribution functions. We develop an algorithm to approximate the quantiles of large datasets, which works by partitioning the data or using existing partitions (possibly of unequal size). We establish the deterministic precision of this algorithm and show how it can be adjusted to achieve a customized precision. We then develop a framework for optimally summarizing very large datasets using quantiles and for combining such summaries in order to make inferences about the original dataset. Finally, we show how these higher–order Markov models can be used to construct confidence intervals for the probability of frost–free periods.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Thesis introduction
2 Exploratory analysis of the Canadian weather data
2.1 Introduction
2.2 Data description
2.3 Temperature and precipitation
2.4 Daily values, distributions
2.5 Correlation
2.5.1 Temporal correlation
2.5.2 Spatial correlation
2.6 Summary and conclusions
3 rth-order Markov chains
3.1 Introduction
3.2 Markov chains
3.3 Consistency of the conditional probabilities
3.4 Characterizing density functions and rth–order Markov chains
3.5 Functions of r variables on a finite domain
3.5.1 First representation theorem
3.5.2 Second representation theorem
3.5.3 Special cases of functions of r finite variables
3.6 Generalized linear models for time series
3.7 Simulation studies
3.8 Concluding remarks
4 Binary precipitation process
4.1 Introduction
4.2 Models for 0-1 precipitation process
4.3 Exploratory analysis of the data
4.4 Comparing the models using BIC
4.5 Changing the location and the time period
5 On the definition of "quantile" and its properties
5.1 Introduction
5.2 Definition of median and quantiles of data vectors and random samples
5.3 Defining quantiles of a distribution
5.4 Left and right extreme points
5.5 The quantile functions as inverse
5.6 Equivariance property of quantile functions
5.7 Continuity of the left and right quantile functions
5.8 Equality of left and right quantiles
5.9 Distribution function in terms of the quantile functions
5.10 Two-sided continuity of lq/rq
5.11 Characterization of left/right quantile functions
5.12 Quantile symmetries
5.13 Quantiles from the right
5.14 Limit theory
5.15 Summary and discussion
6 Probability loss function
6.1 Introduction
6.2 Degree of separation between data vectors
6.3 "Degree of separation" for distributions: the "probability loss function"
6.4 Limit theory for the probability loss function
6.5 The probability loss function for the continuous case
6.6 The supremum of δX
6.6.1 "c-probability loss" functions
7 Approximating quantiles in large datasets
7.1 Introduction
7.2 Previous work
7.3 The median of the medians
7.4 Data coarsening and quantile approximation algorithm
7.5 The algorithm and computations
8 Quantile data summaries
8.1 Introduction
8.2 Generalization to weighted vectors
8.2.1 Partition operator
8.2.2 Quantile data summaries
8.3 Optimal probability indices for vector data summaries
8.4 Other loss functions
8.4.1 Optimal index vectors for assigning quantiles to a random sample
9 Quantile distribution distance and estimation
9.1 Introduction
9.2 Quantile–specified parameter families
9.2.1 Equivariance of quantile–specified families estimation
9.2.2 Continuous distributions with the order statistics family of estimators
9.3 Probability divergence (distance) measures
9.4 Quantile distance measures
9.4.1 Quantile distance invariance under continuous strictly monotonic transformations
9.4.2 Quantile distance closeness of empirical distribution and the true distribution
9.4.3 Quantile distance and KS distance closeness
9.4.4 Quantile distance for continuous variables
9.4.5 Equivariance of estimation under monotonic transformations using the quantile distance
9.4.6 Estimation using quantile distance
10 Binary temperature processes
10.1 Introduction
10.2 rth–order Markov models for extreme minimum temperatures
10.2.1 Exploratory analysis for binary extreme minimum temperatures
10.2.2 Model selection for extreme minimum temperature
10.3 rth–order Markov models for extreme maximum temperatures
10.3.1 Exploratory analysis for extreme maximum temperatures
10.3.2 Model selection for extreme maximum temperature
10.4 Probability of a frost–free period for Medicine Hat
10.5 Possible applications of the models
11 Conclusions and future research
11.1 Introduction
11.2 Summary
11.3 Future research
11.3.1 rth-order Markov chains
11.3.2 Approximating quantiles and data summaries
11.3.3 Parameter estimation using probability loss and quantile distances
Bibliography

Appendices

A Climate review
A.1 Organizations and resources
A.2 Definitions and climate variables
A.3 Climatology
A.3.1 General circulations
A.3.2 Topography of Canada
A.4 Some interesting facts about Canadian geography and weather
B Extracting Canadian Climate Data from Environment Canada dataset
B.1 Introduction
B.2 Using Python to extract data
B.3 New functions to write stations' data
B.4 Concluding remarks
C Algorithms and Complexity
D Notations and Definitions

List of Tables

2.1 The summary statistics for the mean annual maximum temperature, minimum temperature and precipitation at the Calgary site.
2.2 Confidence intervals for the mean annual maximum temperature, minimum temperature and precipitation at the Calgary site.
2.3 Lines fitted to annual mean minimum temperature and annual mean precipitation against annual mean maximum temperature.
2.4 The regression line parameters for the fitted lines for each variable with respect to time for the Calgary site.
2.5 The regression line parameters for the fitted lines for each variable with respect to time for the Banff site.
2.6 The regression line parameters for the fitted lines for each variable with respect to time for the Medicine Hat site.
3.1 The estimated parameters for the model Z_{t-1} = (1, Y_{t-1}, cos(ωt)) with parameters β = (−1, 1, −0.5). The standard deviation of the parameters is computed once using GN (theo. sd) and once using the generated samples (sim. sd).
3.2 BIC values for several models competing for the role of the true model, where Z_{t-1} = (1, Y^1, COS), β = (−1, 1, −0.5).
3.3 BIC values for several models competing for the role of the true model given by Z_{t-1} = (1, Y^1, Y^2, COS), β = (−1, 1, 1, −0.5).
4.1 BIC values for models including N^l, the number of precipitation days during the past l days, for the Calgary site.
4.2 BIC values for models including N^l, the number of wet days during the past l days, and Y^1, the precipitation occurrence of the previous day, for the Calgary site.
4.3 BIC values for models including N^l, the number of wet days during the past l days, and seasonal terms for the Calgary site.
4.4 BIC values for models including N^l, the number of PN days during the past l days, Y^1, the precipitation occurrence of the previous day, and seasonal terms for the Calgary site.
4.5 BIC values for Markov models of different order with a small number of parameters for the Calgary site.
4.6 BIC values for Markov models of different order plus seasonal terms for the Calgary site.
4.7 BIC values for models including seasonal terms and the occurrence of precipitation during the previous day for the Calgary site.
4.8 BIC values for 2nd–order Markov models for precipitation at the Calgary site.
4.9 BIC values for 2nd–order Markov models for precipitation at the Calgary site plus seasonal terms.
4.10 BIC values for models including several covariates such as temperature, seasonal terms and a year effect for precipitation at the Calgary site.
4.11 BIC values for several models for the binary process of precipitation in Calgary, 1990–1994.
4.12 BIC values for several models for precipitation occurrence in Medicine Hat, 2000–2004.
5.1 Earthquake intensities.
5.2 Rain acidity data.
6.1 A class's marks in mathematics and physics. The third column contains the raw physics marks before the physics teacher scaled them.
7.1 The table of data.
7.2 Comparing the exact method with the proposed algorithm in R, run on a laptop with 512 MB of memory and a 1500 MHz processor, m = 1000, d = 500. "DOS" stands for the degree of separation in the original vector. "DOS bound" is the theoretical degree of separation obtained by Theorem 7.4.1.
7.3 Comparing the exact method with the proposed algorithm in R (run on a laptop with 512 MB of memory and a 1500 MHz processor) to compute the quantiles of MT (daily maximum temperature) over 25 stations with data from 1940 to 2004.
9.1 Comparing the standard normal with various distributions using quantile distance, where U denotes the uniform distribution and χ2 the Chi-squared distribution.
9.2 Comparing the standard normal on the tails with some distributions using quantile distance, where U denotes the uniform distribution and χ2 the Chi-squared distribution.
9.3 Assessment of maximum likelihood estimation and quantile distance estimation using several measures of error for a sample of size 20. In the table, s.e. stands for the standard error.
9.4 Assessment of maximum likelihood estimation and quantile distance estimation using several measures of error for a sample of size 100. In the table, s.e. stands for the standard error.
10.1 BIC values for models including N^k for the extreme minimum temperature process e(t) at the Medicine Hat site.
10.2 BIC values for several models for the extreme minimum temperature e(t) at the Medicine Hat site.
10.3 BIC values for models including N^k for the extremely hot process E(t).
10.4 BIC values for several models for the extremely hot process E(t).
10.5 BIC values for models including N^k for the extremely cold process e(t) at the Medicine Hat site.
10.6 BIC values for several models including N^k and seasonal terms for the extremely cold process e(t) at the Medicine Hat site.
10.7 BIC values for several models for the extremely cold process e(t) at the Medicine Hat site.
10.8 Theoretical and simulation-estimated standard deviations for the extremely cold process e(t) at the Medicine Hat site.

List of Figures

2.1 Alberta site locations for temperature (deg C) data. There are 25 stations with temperature data available over Alberta.
2.2 Alberta site locations for precipitation (mm) data. There are 47 stations with precipitation data available over Alberta.
2.3 The number of years available for sites with temperature (deg C) data.
2.4 The number of years available for sites with precipitation (mm) data.
2.5 The elevation (meters) of sites with temperature data available.
2.6 The elevation (meters) of sites with precipitation data available.
2.7 The time series of daily maximum temperature (deg C) at the Calgary site from 2000 to 2003.
2.8 The time series of daily minimum temperature (deg C) at the Calgary site from 2000 to 2003.
2.9 The time series of daily precipitation (mm) at the Calgary site from 2000 to 2003.
2.10 The time series of monthly maximum temperature (deg C) at the Calgary site, 1995–2005.
2.11 The time series of monthly minimum temperature means (deg C) at the Calgary site, 1995–2005.
2.12 The time series of monthly precipitation means (mm) at the Calgary site, 1995–2005.
2.13 The annual mean maximum temperature (deg C) for the Calgary site for all available years.
2.14 The annual mean minimum temperature (deg C) for the Calgary site for all available years.
2.15 The annual mean precipitation (mm) for the Calgary site for all available years.
2.16 The histogram of annual maximum temperature means (deg C) for Calgary with a normal curve fitted to the data.
2.17 The normal qq–plot for annual maximum temperature means (deg C) for Calgary.
2.18 The histogram of annual minimum temperature means (deg C) for Calgary with a normal curve fitted to the data.
2.19 The normal qq–plot for annual minimum temperature means (deg C) for Calgary.
2.20 The histogram of annual precipitation means (mm) for Calgary with a normal curve fitted to the data.
2.21 The normal qq–plot for annual precipitation means for Calgary.
2.22 The time series plots of maximum temperature (deg C), minimum temperature (deg C) and precipitation (mm) annual means for Calgary. The bottom time series plot is minimum temperature, the middle one is precipitation and the top curve is maximum temperature.
2.23 The regression line fitted to maximum temperature and minimum temperature annual means for Calgary.
2.24 The regression line fitted to maximum temperature and precipitation annual means for Calgary.
2.25 The regression line fitted to summer minimum temperature means against time for Calgary.
2.26 The time series of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.27 The histogram of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.28 The normal qq–plots of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.29 The time series of daily minimum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.30 The histogram of daily minimum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.31 The normal qq–plots of daily minimum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.32 The time series of daily precipitation at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.33 The histogram of daily precipitation at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.34 The confidence intervals for the daily mean maximum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
2.35 The confidence intervals for the daily mean minimum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
2.36 The confidence intervals for the probability of precipitation (mm) at the Calgary site for the days of the year. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
2.37 The confidence intervals for the standard deviation of each day of the year for maximum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
2.38 The confidence intervals for the standard deviation of each day of the year for minimum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
2.39 The confidence intervals for the standard deviation (sd) of each day of the year for the probability of precipitation (0-1 precipitation process) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals. The plot shows sd ≤ 1/2; this is because sd = √(p(1 − p)), which has a maximum value of 1/2.
2.40 The distribution of each day of the year for MT (deg C) from Jan 1st to Dec 1st. The year has been divided into two halves. In each half, rainbow colors are used to show the change of the distribution.
2.41 The distribution of each day of the year for mt (deg C) from Jan 1st to Dec 1st. The year has been divided into two halves. In each half, rainbow colors are used to show the change of the distribution.
2.42 The histogram of daily precipitation greater than 0.2 mm at the Calgary site with a Gamma density curve fitted using maximum likelihood.
2.43 The qq–plots of daily precipitation greater than 0.2 mm at the Calgary site with a Gamma curve fitted using maximum likelihood.
2.44 The Gamma fit of each day of 4 months for precipitation (mm). In each month, rainbow colors are used to show the change of the distribution.
2.45 The maximum likelihood estimate of α, the shape parameter of the Gamma distribution fitted to the precipitation amounts.
2.46 The confidence interval for the MOM estimate of the shape parameter, α, of the Gamma distribution fitted to daily precipitation amounts. The dotted line is the upper bound and the solid line the lower bound. As seen in the figure, the upper bounds at the beginning and end of the year become very large. We have not shown them because otherwise the pattern in the rest of the year could not be seen.
2.47 The 1st-order transition probabilities. The dotted line is the probability of precipitation if it happened the day before (p̂11) and the dashed line is the probability of precipitation if it did not happen the day before (p̂01).
2.48 The 2nd–order transition probabilities for precipitation at the Calgary site: p̂111 (solid) against p̂011 (dotted).
2.49 The 2nd–order transition probabilities for precipitation at the Calgary site: p̂001 (solid) against p̂101 (dotted).
2.50 The correlation and covariance plot for maximum temperature at the Calgary site for Jan 1st and the 732 subsequent days.
2.51 The correlation plot for maximum temperature (deg C) at the Calgary site for Jan 1st and the 732 subsequent days.
2.52 The correlation plot for minimum temperature (deg C) at the Calgary site for Jan 1st and the 732 subsequent days.
2.53 The correlation plot for precipitation (mm) at the Calgary site for Jan 1st and the 732 subsequent days.
2.54 The correlation plot for maximum temperature (deg C) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot-dashed) and the 30 subsequent days.
2.55 The correlation plot for minimum temperature (deg C) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot-dashed) and the 30 subsequent days.
2.56 The correlation plot for precipitation (mm) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot-dashed) and the 30 subsequent days.
2.57 The correlation plot for maximum temperature and minimum temperature (deg C) between Calgary and Medicine Hat.
2.58 The correlation plot for precipitation (mm) between Calgary and Medicine Hat.
2.59 The correlation plot for maximum temperature (deg C) with respect to distance (km).
2.60 The correlation plot for minimum temperature (deg C) with respect to distance (km).
2.61 The correlation plot for precipitation (mm) with respect to distance (km).
2.62 The correlation plot for the precipitation (mm) 0-1 process with respect to distance (km).
3.1 The distribution of parameter estimates for the model with the covariate process Z_{t-1} = (1, Y_{t-1}, cos(ωt)) and parameters (β1 = −1, β2 = 1, β3 = −0.5).
4.1 The transition probabilities for the Banff site. The dotted line represents p̂11 (the estimated probability of precipitation if precipitation occurs the day before) and the dashed represents p̂01 (the estimated probability of precipitation if precipitation does not occur the day before).
4.2 The solid curve represents p̂111 (the estimated probability of precipitation if precipitation occurs on both of the two previous days) and the dashed curve represents p̂011 (the estimated probability that precipitation occurs if precipitation occurs the day before and does not occur two days ago) for the Banff site.
4.3 The solid curve represents p̂001 (the estimated probability of precipitation occurring if it does not occur during the two previous days) and the dotted curve is p̂101 (the estimated probability that precipitation occurs if precipitation does not occur the day before but occurs two days ago) for the Banff site.
4.4 Banff's estimated mean annual probability of precipitation calculated from historical data.
4.5 Calgary's estimated mean annual probability of precipitation calculated from historical data.
4.6 The logit function: logit(x) = log(x/(1 − x)).
4.7 The logit of the estimated probability of precipitation in Banff for different days of the year.
5.1 An example of a distribution function with discontinuities and flat intervals.
5.2 The left quantile (lq) function for the distribution function given in Example 5.7. Notice that this function is left continuous and increasing.
5.3 The right quantile (rq) function for the distribution function given in Example 5.7. Notice that this function is right continuous and increasing.
5.4 The LQ function for Example 5.7. Notice that this function is increasing and left continuous.
5.5 The RQ function for Example 5.7. Notice that this function is increasing and right continuous.
5.6 For the vector x = (−2, −2, 2, 2, 4, 4, 4, 4), the left (top) and right (bottom) quantile functions are given.
5.7 The solid line is the distribution function of {Xi}. Note that for the distribution of the Xi and p = 0.5, lqFX(p) = 0, rqFX(p) = 3. Let h = rq(p) − lq(p) = 3. The dotted line is the distribution function of the {Yi}, which coincides with that of the {Xi} to the left of lqFX(p) and is a backward shift of 3 units for values greater than rqFX(p). Note that for the {Yi}, lqFY(p) = rqFY(p) = 1.
7.1 Comparing the approximated quantiles to the exact quantiles for N = 10^7. The circles are the exact quantiles and the + are the corresponding approximated quantiles.
7.2 Comparing the approximated quantiles to the exact quantiles for MT (daily maximum temperature) over 25 stations in Alberta, 1940–2004. The circles are the exact quantiles and the + the approximated quantiles.
9.1 The order statistics family members that estimate lqX(1/2) and lqX(P(Z ≤ 1)) for a random sample of length 25, obtained by generating samples of size 1 to 1000 from a standard normal distribution.
9.2 The order statistics family members that estimate lqX(1/2) and lqX(P(Z ≤ 1)) for a random sample of length 20, obtained by generating samples of size 1 to 1000 from a standard normal distribution.
9.3 The Cauchy distribution's distance, for different scale parameters (and location parameter 0), to the standard normal. In the plots QD1 = QDX, QD2 = QDY and QD = QD1 + QD2, where X is the standard normal and Y is the Cauchy.
9.4 The distribution function of the standard normal (solid) compared with the optimal Cauchy (location parameter 0) picked by quantile distance minimization with scale parameter 0.66 (dashed curve), the Cauchy with scale parameter 1 (dotted) and the Cauchy with scale parameter 0.5 (dot-dashed).
9.5 The Cauchy distribution's distance, for different scale parameters (and location parameter 0), to the standard normal on the tails. In the plots QD1 = QDX, QD2 = QDY and QD = QD1 + QD2, where X is the standard normal and Y is the Cauchy.
9.6 The distribution function of the standard normal (solid) compared with the optimal Cauchy picked by tail quantile distance minimization with scale parameter 0.12 (dashed curve), the Cauchy with scale parameter 0.65 (dotted) and the Cauchy with scale parameter 0.01 (dot-dashed).
9.7 Comparing the standard normal distribution (solid) with the optimal Cauchy picked by quantile distance (dashed) and the optimal Cauchy picked by tail quantile distance minimization (dotted).
9.8 Histograms of the parameter estimates using the quantile distance and maximum likelihood methods for a sample of size 20.
9.9 Histograms of the parameter estimates using the quantile distance and maximum likelihood methods for a sample of size 100.
10.1 The estimated probability of a freezing day for the Banff site for different days of the year, computed using the historical data.
10.2 The estimated probability of a freezing day for the Medicine Hat site for different days of the year, computed using the historical data.
10.3 The estimated 1st–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Banff site. The dotted line represents the estimated probability of "e(t) = 1 if e(t − 1) = 1" (p̂11) and the dashed, "e(t) = 1 if e(t − 1) = 0" (p̂01).
10.4 The estimated 1st–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Medicine Hat site. The dotted line represents the estimated probability of "e(t) = 1 if e(t − 1) = 1" (p̂11) and the dashed, "e(t) = 1 if e(t − 1) = 0" (p̂01).
10.5 The estimated 2nd–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Banff site with p̂111 (solid) compared with p̂011 (dotted), both calculated from the historical data.
10.6 The estimated 2nd–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Banff site with p̂001 (solid) compared with p̂101 (dotted), calculated from the historical data.
10.7 The estimated 2nd–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Medicine Hat site with p̂111 (solid) compared with p̂011 (dotted), calculated from the historical data.
10.8 The estimated 2nd–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Medicine Hat site with p̂001 (solid) compared with p̂101 (dotted), calculated from the historical data.
10.9 The estimated probability of a hot day (maximum temperature ≥ 27 deg C) for different days of the year for the Banff site, calculated from the historical data.
10.10 The estimated probability of a hot day (maximum temperature ≥ 27 deg C) for different days of the year for the Medicine Hat site, calculated from the historical data.
10.11 The estimated 1st–order transition probabilities for the binary process of extremely hot temperatures for the Banff site. The dotted line represents the estimated probability of "E(t) = 1 if E(t − 1) = 1" (p̂11) and the dashed, "E(t) = 1 if E(t − 1) = 0" (p̂01).
10.12 The estimated 1st–order transition probabilities for the binary process of extremely hot temperatures for the Medicine Hat site. The dotted line represents the estimated probability of "E(t) = 1 if E(t − 1) = 1" (p̂11) and the dashed, "E(t) = 1 if E(t − 1) = 0" (p̂01).
10.13 The estimated 2nd–order transition probabilities for the binary process of extremely hot temperatures for the Banff site with p̂111 (solid) compared with p̂011 (dotted), calculated from the historical data.
10.14 The estimated 2nd–order transition probabilities for the binary process of extremely hot temperatures for the Banff site with p̂001 (solid) compared with p̂101 (dotted), calculated from the historical data.
10.15 The estimated 2nd–order transition probabilities for the binary process of extremely hot temperatures for the Medicine Hat site with p̂111 (solid) compared with p̂011 (dotted), calculated from the historical data.
10.16 The estimated 2nd–order transition probabilities for the binary process of extremely hot temperatures for the Medicine Hat site with p̂001 (solid) compared with p̂101 (dotted), calculated from the historical data.
10.17 Medicine Hat's estimated mean annual probability of frost calculated from the historical data.
10.18 Normal curve fitted to the distribution of 50 samples of the estimated parameters.
B.1 Canada site locations

Acknowledgements

I would like to thank my supervisors, Prof. Jim Zidek and Prof. Nhu Le, for what they have taught me in statistics and much more, and for their encouragement, ideas and financial support through various RA positions during my PhD studies. I feel very grateful and lucky to have them as my supervisors. I should also thank Prof. Matias Salibian-Barrera on my supervisory committee for giving me great ideas and feedback. I would also like to thank other people in the statistics department at UBC, Prof. Paul Gustafson, Prof. John Petkau, Prof. Constance Van Eden and Prof. Ruben Zamar, to whom I owe much of what I know. I also thank Mike Marin (instructor at UBC) for many interesting discussions about statistics and science, and Viena Tran for helping me with many administrative issues. I would like to thank my friend Dr. Nathaniel Newlands for insightful comments and good suggestions, and Ralph Wright (Alberta Agriculture, Food and Rural Development) for making useful comments about the definition of the extremes. Finally, I would like to express my deepest appreciation to all the people who have helped me learn and love statistics and mathematics throughout my life, from my grandfather, who graciously taught me mathematics when I was a child, to my mother, who has been an inspiration, and the high school teachers who encouraged me to study mathematics as a university major.

Dedication

To my lovely parents, my amazing brother Alireza, my sweet sister Fatima, and my best friends: Mostafa Aghajanpour, Mahmoud Sohrabi, Masoud Feizbakhsh, Behruz Khajali, Ali Mehrabian, Prof. Masoud Asgharian, Prof. Niky Kamran, Mirella Simoneova, Kiyouko Futaeda, May Yun, Yuki Ezaki, Soheil Keshmiri, Naoko Yoshimi and Mike Marin.

Chapter 1

Thesis introduction

This thesis develops a mathematical and statistical framework for modeling stochastic processes over time. In particular, it develops models for the occurrence of precipitation and of extreme (high or low) temperature events. This is important for Canada's agriculture, since agricultural production depends on weather and water availability. We study the quantiles of data and of distributions in detail and develop a framework for approximating quantiles in large datasets and for making inferences. We also study categorical Markov chains of higher order and apply them to precipitation and temperature processes. However, the methodologies and theories developed here are general and can be used in many other applications where such processes are encountered (such as physics, chemistry, climatology, economics and so on).

Sample quantiles and the quantile function are fundamental concepts in statistics. In the study of extreme events they are often used to pick appropriate thresholds. We use the quantiles specifically to pick thresholds for the temperature process.
This motivates us to study the concept of quantiles and extend their classic definition to provide a more intuitively appealing alternative. This alternative also enables us to obtain interesting asymptotic results about their sample counterparts and gives a framework for approximating quantiles and making inferences. In fact, weather datasets (observed weather or the output of climate models) are very large in size. This makes computing the quantiles of such large datasets computationally intensive. Along with this alternative definition, we present an algorithm for computing/approximating quantiles in large datasets.

The data used in this thesis come from the climate data CD published by Environment Canada [10], which includes the daily observed precipitation and temperature data for several stations from 1895 to 2007 (the years varying with the station). The data are saved in several binary files. We have written a Python module to extract the data in desired formats. The guide to using this module is in Appendix B. For most of the analysis, however, we have used the "homogenized" dataset for Alberta. This dataset is adjusted for changes of instruments and of the locations of the stations. More information about the datasets is given in Appendix B and Chapter 2.

Chapter 2 presents results from the exploratory analysis of the dataset. We look at the variables' daily time series, monthly-mean time series, annual-mean time series and the distributions of the daily and annual mean values. We also look at the relations between the variables as well as some long–term trends, using simple techniques such as linear regression. For example, it seems that the mean summer daily minimum temperature has increased over time at some locations in Alberta. Then we study the seasonal patterns of these variables over the course of the year. As expected, there is a strong seasonal component in these processes. For example, we observe that the daily temperature is more variable in the colder seasons than in the warmer ones. The daily values of the minimum and maximum temperature seem to be described fairly well by a Gaussian process. However, some deviations from the Gaussian assumption are seen in the tails. This is particularly important in modeling extreme events and will help us in later chapters to choose our approach to modeling the occurrence of such extremes. As part of the exploratory analysis, we look at the precipitation occurrence process. A question that has been addressed by several authors (e.g. Tong in [45] and Gabriel et al. in [18]) is the Markov order of such a chain. The exploratory analysis using transition probability plots (estimated as sketched below) leads to the conjecture that a 1st–order Markov chain should be appropriate. This is studied in detail in later chapters. We also look at the spatial–temporal correlation functions of these processes. Several interesting features are observed. For example, for the maximum and minimum temperature the correlation seems to be stationary over time. Also, geodesic distance seems to describe the spatial correlation of temperature well. For precipitation, on the other hand, not much spatial correlation is observed. This could be due to the fact that we have only 47 precipitation stations available over Alberta and that precipitation is more variable over space than temperature.
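As a small, self-contained illustration of the transition-probability estimates behind this conjecture, the following Python sketch computes p̂01 and p̂11 from a 0/1 occurrence series. The data layout (a plain list of 0s and 1s) and the function name are illustrative assumptions, not taken from the thesis code.

```python
# Sketch: empirical first-order transition probabilities for a 0/1 occurrence
# series, of the kind plotted as p^_01 and p^_11 in the exploratory analysis.

def transition_probabilities(y):
    """Estimate P(Y_t = 1 | Y_{t-1} = j) for j = 0, 1 from a binary sequence."""
    counts = {0: [0, 0], 1: [0, 0]}   # counts[j][k] = #{t : y_{t-1} = j, y_t = k}
    for prev, curr in zip(y[:-1], y[1:]):
        counts[prev][curr] += 1
    p01 = counts[0][1] / max(sum(counts[0]), 1)   # precipitation after a dry day
    p11 = counts[1][1] / max(sum(counts[1]), 1)   # precipitation after a wet day
    return p01, p11

# Example with a short artificial sequence:
y = [0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0]
print(transition_probabilities(y))
```

Estimating these probabilities separately for each day of the year (as in the plots) amounts to applying the same counting restricted to transitions landing on that calendar day.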
Let us denote a general weather process by Xt, where t denotes time. The main approach we take to model the process is discrete–time categorical rth–order Markov chains (r a natural number), where we make the following assumption about the conditional probabilities: P(Xt | Xt−1, ...) = P(Xt | Xt−1, ..., Xt−r). "Categorical chain" here means that Xt takes only a finite number of possible states. For example, it can be a two–state space of the occurrence/non-occurrence of precipitation. Dichotomizing the temperature process, we can consider processes such as freezing/not freezing. Processes with more than two states can also be considered, for example a process with three states: not warm/warm/hot.

Chapter 3 studies rth–order categorical Markov chains in general. We present a new representation theorem for such chains that expresses the above conditional probability as a linear combination of the monomials of the past process values Xt−1, ..., Xt−r. We show the existence and uniqueness of such a representation. In the stationary case, since the conditional probability is the same for all time points, some more work on the consistency shows that this representation characterizes all stationary categorical rth–order Markov chains. For the binary case the result is a corollary of a theorem stated in [6]. However, the statement of the theorem in [6] is flawed, as also pointed out by Cressie et al. [14]. We present a rigorous statement along with a constructive proof of the theorem. For discrete–time categorical chains with more than two states this theorem does not seem especially useful. We prove a new theorem for this case that gives us a representation for all discrete–time categorical chains (rather than only binary ones). In order to estimate the parameters of such a model in the binary case and make inferences about them, we use the "time series following generalized linear models" framework described in [27]. The inferences are similar to those for generalized linear models. However, because of dependencies over time, some extensions of the usual theory are needed. Maximizing the "partial likelihood" gives "consistent" estimators, as shown in [48]. We apply the partial likelihood theory to our proposed rth–order Markov models. Simulations show that the partial likelihood and the representation together give satisfactory results for the binary case. We also check by simulation studies the performance of the Bayesian information criterion (BIC), developed in [42] and elsewhere, for picking optimal models, and we obtain satisfactory results. This allows us not only to pick the order of the Markov chain but also to compare several Markov chains of the same order. Another advantage of this model over existing ones is its capacity to accommodate other continuous variables. For example, we can add seasonal processes to get a non–stationary chain. [In previous studies regarding the order of the chain, e.g. [45] and [18], it was assumed that the precipitation chain is stationary.] We can also add covariate processes, such as the temperature of the previous day, to the model. We then apply these techniques to the binary precipitation process in Alberta and pick appropriate models. A 1st–order non-stationary model (with one seasonal term) seems to be the most appropriate based on the BIC method for model selection.
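To make the fitting step concrete, here is a minimal Python sketch of a non-stationary 1st–order binary chain fitted by logistic regression on the lagged value plus one seasonal term, which is one instance of the partial-likelihood maximization just described; the covariate layout mirrors the process Z_{t-1} = (1, Y_{t-1}, cos(ωt)) used in the simulation studies. It assumes numpy and statsmodels are installed; the function names and simulation setup are illustrative, not the thesis implementation.

```python
# Sketch: partial-likelihood fit of a non-stationary 1st-order binary Markov
# chain via logistic regression on (1, Y_{t-1}, cos(omega * t)).

import numpy as np
import statsmodels.api as sm

def fit_markov_logistic(y, period=365.25):
    """y: 0/1 numpy array of daily occurrences, indexed by day t = 1, 2, ..."""
    t = np.arange(1, len(y) + 1)
    omega = 2 * np.pi / period
    # response is Y_t; covariates are (1, Y_{t-1}, cos(omega t))
    X = np.column_stack([np.ones(len(y) - 1), y[:-1], np.cos(omega * t[1:])])
    return sm.Logit(y[1:], X).fit(disp=False)

# Simulate from beta = (-1, 1, -0.5) and check that the estimates recover it.
rng = np.random.default_rng(1)
beta = np.array([-1.0, 1.0, -0.5])
n, omega = 5000, 2 * np.pi / 365.25
y = np.zeros(n, dtype=int)
for t in range(1, n):
    eta = beta[0] + beta[1] * y[t - 1] + beta[2] * np.cos(omega * (t + 1))
    y[t] = rng.random() < 1 / (1 + np.exp(-eta))   # Bernoulli draw with logit link
print(fit_markov_logistic(y).params)
```

Comparing the BIC of several such fits (with different lags or seasonal terms) is how the model selection described above can be carried out in practice.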
To apply these techniques to the temperature processes, we need a way of dichotomizing the temperature process. Usually certain quantiles are chosen in order to do so. Computing the quantiles of large datasets can be computationally challenging. Very large datasets are often encountered in climatology, either from a multiplicity of observations over time and space or as outputs of deterministic models (sometimes in petabytes = 1 million gigabytes). Loading a large data vector and sorting it is sometimes impossible due to memory limitations or computing power. We show that a previously proposed algorithm for approximating the median, the "median of the medians", performs poorly. Instead, we propose a new algorithm, an extension of the algorithm proposed in [3], that gives good approximations to the exact quantiles. In fact, we derive the precision of the algorithm. The algorithm partitions the data, "coarsens" the partitions at every iteration, puts the coarsened vectors together and sorts the result instead of the original vector.

Working on the quantiles, in order to find theory justifying the usefulness and accuracy of the algorithm, motivated us to think about the definition of the quantile function and of the quantiles of data vectors. The quantile function of a random variable X with distribution function F is traditionally defined as q(p) = inf{u | F(u) ≥ p}. Applying this to the fair coin example with 0, 1 as outputs, we get q(1/2) = 0. This is counterintuitive, given that the distribution has equal mass on 0 and 1. Also, a standard definition of the quantiles of a data vector does not exist. [For example, Hyndman et al. [25] point out that there are many definitions of quantiles in different packages.] For example, suppose a data vector has an even number of points; then there is no point exactly in the middle, in which case the average of the two middle values is often proposed as the median. We argue that this is not a good definition. In fact, we present an alternative way of defining quantiles that is motivated by an intuitive experiment and resolves all the above problems. We propose using the two-state definition of right and left quantiles instead of a single quantile. The left quantile is defined as above and the right quantile is defined to be rq(p) = sup{u | F(u) ≤ p}. We also define left and right quantiles of data vectors and study the limit properties of the sample quantiles. For example, it turns out that the sample left and right quantiles converge to the distribution quantiles if and only if the left and right quantiles are equal. This again shows another interesting aspect of the definition of quantiles and confirms that it is not redundant. This definition is an extension of the concept of upper and lower medians in the robustness literature. Also, in some books (e.g. [41]) rq(p) is taken to be the definition of the quantile. However, we do not know of any study of their properties or a claim that considering both can lead to many interesting results. We also show that the widely claimed equivariance property of the traditional (left) quantile function under strictly increasing transformations (for example in [21] and [29]) is false. However, we show that the left (right) quantile is equivariant under left (right) continuous increasing transformations. We also provide a neat result for continuous decreasing transformations. We also show that the probability that the random variable falls strictly between the right and left quantiles is zero and that the left and right quantiles are identical except on at most a countable subset of [0, 1].
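To fix ideas, the following Python sketch applies the inf/sup definitions to a data vector through its empirical distribution function. The edge-case conventions (p = 0 or 1) and function names are choices made for illustration and may differ from the formal definitions developed in Chapter 5.

```python
# Sketch: left and right quantiles of a data vector, obtained by applying
# lq(p) = inf{u : F(u) >= p} and rq(p) = sup{u : F(u) <= p} to the empirical
# distribution function of the vector.

import math

def left_quantile(x, p):
    xs = sorted(x)
    n = len(xs)
    if p <= 0:
        return xs[0]                          # convention for the left extreme
    return xs[max(math.ceil(n * p), 1) - 1]   # smallest order statistic with F >= p

def right_quantile(x, p):
    xs = sorted(x)
    n = len(xs)
    if p >= 1:
        return xs[-1]                         # convention for the right extreme
    return xs[min(math.floor(n * p), n - 1)]  # sup of the set where F <= p

# Fair-coin example: both outcomes carry equal mass, and the pair of quantiles
# reflects that instead of arbitrarily reporting 0 as "the" median.
print(left_quantile([0, 1], 0.5), right_quantile([0, 1], 0.5))        # 0 1
print(left_quantile([-2, -2, 2, 2, 4, 4, 4, 4], 0.5),
      right_quantile([-2, -2, 2, 2, 4, 4, 4, 4], 0.5))                # 2 4
```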
Since our objective is to approximate the exact quantiles with our algorithm, we need a way of assessing the accuracy of such an approximation (a loss function). We introduce a new loss function that is invariant under strictly monotonic transformations of the data or of the random variable. This loss function is very natural: in summary, the loss of estimating a quantile z by z′ is the probability that the random variable falls between these two values. In other words, we use the mass of the random variable itself between the two values to judge the goodness of the approximation. We also prove limit theorems showing that the empirical loss function tends to the loss function of the distribution. This loss function might be a useful tool in many other contexts and is an interesting topic for future research. We show by simulations and real data that the algorithm performs well. We then apply it to the weather data to pick the 95% quantile of the daily maximum temperature. After picking the quantiles, we use the rth–order Markov techniques and partial likelihood to find appropriate models to describe the temperature.

Using this loss function and the theory developed for the quantiles, we introduce measures of "distance" among distribution functions over the reals (random variables) that are invariant under continuous strictly monotonic transformations. We use these distance measures to show that optimal overall fits to a random variable are not necessarily optimal in the tails (and hence not appropriate for studying extremes). We also find "optimal" ways of picking a limited number of probabilities 0 ≤ p1 < ... < pk ≤ 1 to summarize a random variable by its corresponding quantiles. Finally, we show how these higher–order Markov models can be used to construct confidence intervals for the probability of a frost–free week at the beginning of August at Medicine Hat (in Alberta).

The last chapter provides a summary of the work and the conclusions. It also points out some interesting questions that are not answered in this thesis and outlines a research proposal for the future.

Chapter 2

Exploratory analysis of the Canadian weather data

2.1 Introduction

This chapter performs an exploratory analysis of the homogenized climate dataset for the province of Alberta in Canada. We have access to daily maximum temperature (MT), daily minimum temperature (mt) and precipitation (PN). The temperature data have been provided to us by L.A. Vincent and the precipitation data by Eva Mekis, both from Environment Canada. This dataset has been homogenized for changes of instruments, changes of the locations of the stations and so on. More information about these data can be found in [34] and [47]. These data are a homogenized part of a larger dataset published by Environment Canada (2007), which is stored in binary format; a Python module for extracting it is provided in Appendix B.

This chapter uses several graphical and analytical tools to examine the behavior of selected climate variables. Looking at the data, we will see some interesting features that suggest future research. Section 2 describes the dataset; for example, plots of the station locations and elevations are given. In Section 3, we look at the daily and annual time series of temperature and precipitation. The normality of the distribution of the annual values and the associations between different variables are investigated.
We have also investigated the seasonal patterns as well as the long–term patterns of the different variables over the course of the year. For example, the mean summer daily minimum temperature shows a significant increasing pattern over the course of the past century in Calgary and some other locations. Section 4 looks at the distributions of the daily values. For example, a normal distribution seems to describe the temperature daily values, and a Gamma distribution the precipitation daily values. Confidence intervals for the mean/standard deviation in the normal case and for the shape/scale parameters in the Gamma case are given. Section 5 looks at the spatial and temporal correlation of the different variables.

Figure 2.1: Alberta site locations for temperature (deg C) data. There are 25 stations with temperature data available over Alberta.

2.2 Data description

The temperature data come from 25 stations over Alberta which operated from 1895 to 2006. The PN data involve 47 stations from 1895 to 2006. Different stations have different intervals of data available. For example, the PN data for Caldwell are available from 1911 to 1990. Figures 2.1 and 2.2 respectively depict the locations of the stations for temperature (both MT and mt) and PN. The number of years available for each station is plotted against the location in Figures 2.3 and 2.4. Another available variable for the location of the stations is the elevation. Figures 2.5 and 2.6 show the elevation in meters. As seen in the plots, some stations have both temperature and precipitation data.

Figure 2.2: Alberta site locations for precipitation (mm) data. There are 47 stations with precipitation data available over Alberta.

Figure 2.3: The number of years available for sites with temperature (deg C) data.

Figure 2.4: The number of years available for sites with precipitation (mm) data.

Figure 2.5: The elevation (meters) of sites with temperature data available.

Figure 2.6: The elevation (meters) of sites with precipitation data available.

2.3 Temperature and precipitation

To get some initial impression of the data, we look at the time series of MT, mt and PN at a fixed location. We use the Calgary site since it has a long period of data available and includes both temperature and precipitation. Looking at the maximum and minimum temperature, we see the periodic trend over the course of a year, as shown in Figures 2.7 and 2.8, which illustrate the MT and mt daily values from 2000 to 2003.
A regular seasonal trend is seen in both processes. Looking at the PN plot in Figure 2.9, we observe a large number of zeros. Moreover, seasonal patterns are hard to see by looking at daily values. To illustrate the seasonal patterns better, we look at the monthly averages for MT, mt and PN over the period 1995 to 2005 in Figures 2.10, 2.11 and 2.12. Now the seasonal patterns for precipitation can be seen better in Figure 2.12. Next we look at the mean annual values of the three variables for all available years that have less than 10 missing days (Figures 2.13, 2.14 and 2.15). Table 2.1 gives a summary of these annual means.

Figure 2.7: The time series of daily maximum temperature (deg C) at the Calgary site from 2000 to 2003.

Figure 2.8: The time series of daily minimum temperature (deg C) at the Calgary site from 2000 to 2003.

Figure 2.9: The time series of daily precipitation (mm) at the Calgary site from 2000 to 2003.

Figure 2.10: The time series of monthly maximum temperature (deg C) at the Calgary site, 1995-2005.

Figure 2.11: The time series of monthly minimum temperature means (deg C) at the Calgary site, 1995-2005.

Figure 2.12: The time series of monthly precipitation means (mm) at the Calgary site, 1995-2005.

Figure 2.13: The annual mean maximum temperature (deg C) for the Calgary site for all available years.

Figure 2.14: The annual mean minimum temperature (deg C) for the Calgary site for all available years.

Figure 2.15: The annual mean precipitation (mm) for the Calgary site for all available years.

Variable     |   min | 1st quartile | median |  mean | 3rd quartile |   max
MT (deg C)   |  7.59 |         9.64 |  10.37 | 10.36 |        11.19 | 13.46
mt (deg C)   | -4.83 |        -3.40 |  -2.54 | -2.66 |        -1.95 |  0.07
PN (mm)      |  0.68 |         1.12 |   1.28 |  1.29 |         1.39 |  2.51

Table 2.1: The summary statistics for the mean annual maximum temperature, min temperature and precipitation at the Calgary site.

Assuming stochastic normality and independence of the observations, we can obtain confidence intervals for all three variables and these are given in Table 2.2. The confidence intervals are fairly narrow.

Variable     | 95% confidence interval
MT (deg C)   | (10.14, 10.57)
mt (deg C)   | (-2.85, -2.47)
PN (mm)      | (1.24, 1.35)

Table 2.2: Confidence intervals for the mean annual maximum temperature, min temperature and precipitation at the Calgary site.

Figure 2.16: The histogram of annual maximum temperature means (deg C) for Calgary with a normal curve fitted to the data.
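As an illustration of how the intervals in Table 2.2 can be computed under the normality and independence assumptions, here is a minimal Python sketch. The array annual_mt is a hypothetical stand-in for the annual mean maximum temperatures plotted in Figure 2.13, not the actual Calgary values.

import numpy as np
from scipy import stats

# Hypothetical annual mean maximum temperatures (deg C), one value per year;
# the thesis uses the Calgary annual means with fewer than 10 missing days.
annual_mt = np.array([9.8, 10.4, 11.1, 10.0, 10.7, 9.6, 10.9, 10.2, 10.5, 9.9])

n = len(annual_mt)
xbar = annual_mt.mean()
se = annual_mt.std(ddof=1) / np.sqrt(n)      # standard error of the mean
tcrit = stats.t.ppf(0.975, df=n - 1)         # two-sided 95% critical value
print("95%% CI: (%.2f, %.2f) deg C" % (xbar - tcrit * se, xbar + tcrit * se))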
To investigate the shape of the distribution of the annual means, we look at the histogram of each variable with a fitted normal curve in Figures 2.16, 2.18 and 2.20. The corresponding normal qq-plots (quantile-quantile) are also given in Figures 2.17, 2.19 and 2.21 to assess the normality assumption. Both the histogram and the qq-plot for MT support the normality assumption. The histogram for mt is slightly left skewed. For PN, some deviation from the normality assumption is seen. This is expected since the daily PN process is very far from normal to start with; hence, even averaging over the whole year has not quite given us a normal distribution.
We plot all three variables (annual mean MT, mt and PN) in the same graph, Figure 2.22. As shown in that figure, MT and mt show the same trends over time. To get an idea of how the two variables are related, we fit a regression line, taking mt as the response and MT as the explanatory variable. As seen in Figure 2.23, the regression fit looks very good. We repeat this analysis, this time taking MT as the explanatory variable and PN as the response. As shown in Figure 2.24, the fit is still reasonable, but the association is not as strong. As shown in Table 2.3, both fits are significant. One can criticize the use of a simple regression since the independence assumption might not be satisfied. Finding more reliable and sensible relationships among the variables needs a multivariate model taking account of correlation and other aspects of the processes. Also note that these are annual averages, which are not as correlated over time as the daily values, as seen in the annual time series plots.

Figure 2.17: The normal qq-plot for annual maximum temperature means (deg C) for Calgary.

Figure 2.18: The histogram of annual minimum temperature means (deg C) for Calgary with a normal curve fitted to the data.

Figure 2.19: The normal qq-plot for annual minimum temperature means (deg C) for Calgary.

Figure 2.20: The histogram of annual precipitation means (mm) for Calgary with a normal curve fitted to the data.

Figure 2.21: The normal qq-plot for annual precipitation means (mm) for Calgary.

Variables   | Intercept |  Slope | p-value for intercept | p-value for slope
mt (deg C)  |    -10.40 |  0.746 |            2 x 10^-16 |        2 x 10^-16
PN (mm)     |      2.13 | -0.082 |         1.49 x 10^-14 |            0.0005

Table 2.3: Lines fitted to annual mean minimum temperature and annual mean precipitation against annual mean maximum temperature.

Next we look at the change in the seasonal means for all three variables. As noted above, there are missing data, particularly near the beginning of the time series; this has caused the gap at the beginning of most plots. To get a longer time series of means, we first compute the monthly means allowing 3 missing days and then compute the annual mean using the monthly means. This is reasonable since nearby days have similar values. We do the regression analysis for three locations: Calgary, Banff and Medicine Hat.
We fit regression lines to the annual means, spring means, summer means, fall means and winter means of each of MT, mt and PN with respect to time. The results are given in Tables 2.4, 2.5 and 2.6. We have only included fits that turned out to be significant. Note that PN does not appear in any of the tables. The annual and summer mean minimum temperatures show an increase in all three locations. Figure 2.25 depicts one of the time series (the mt summer mean for Calgary) with the fitted regression line.

Figure 2.22: The time series plots of maximum temperature (deg C), minimum temperature (deg C) and precipitation (mm) annual means for Calgary. The bottom time series is minimum temperature, the one in the middle is precipitation and the top curve is maximum temperature.

Figure 2.23: The regression line fitted to maximum temperature and minimum temperature annual means for Calgary.

Figure 2.24: The regression line fitted to maximum temperature and precipitation annual means for Calgary.

Figure 2.25: The regression line fitted to summer minimum temperature means against time for Calgary.

Variable    | Season | Intercept |   Slope | p-value for intercept | p-value for slope
mt (deg C)  | Year   |    -24.72 |   0.112 |             2 x 10^-5 |            0.0001
mt (deg C)  | Spring |    -30.05 |   0.138 |                0.0008 |            0.0024
mt (deg C)  | Summer |    -20.11 |  0.0144 |             6 x 10^-7 |         3 x 10^-11

Table 2.4: The regression line parameters for the fitted lines for each variable with respect to time for the Calgary site.

Variable    | Season | Intercept |   Slope | p-value for intercept | p-value for slope
MT (deg C)  | Year   |    -12.99 |  0.0105 |                 0.019 |            0.0002
MT (deg C)  | Spring |     -17.0 |  0.0048 |                 0.075 |             0.009
MT (deg C)  | Fall   |    -12.64 |  0.0106 |                  0.19 |            0.0326
mt (deg C)  | Year   |     -37.0 | 0.01666 |            2 x 10^-10 |         2 x 10^-8
mt (deg C)  | Spring |     -49.8 |  0.0229 |             5 x 10^-9 |            10^-7
mt (deg C)  | Summer |     -36.8 |  0.0212 |            2 x 10^-15 |        2 x 10^-16

Table 2.5: The regression line parameters for the fitted lines for each variable with respect to time for the Banff site.

Variable    | Season | Intercept |  Slope | p-value for intercept | p-value for slope
MT (deg C)  | Year   |     -24.6 | 0.0185 |               0.00102 |         3 x 10^-6
MT (deg C)  | Spring |    -34.24 | 0.0235 |                 0.009 |            0.0005
mt (deg C)  | Year   |    -39.98 | 0.0197 |            5 x 10^-10 |         2 x 10^-9
mt (deg C)  | Spring |    -39.81 | 0.0196 |             5 x 10^-5 |         9 x 10^-5
mt (deg C)  | Summer |    -10.93 | 0.0112 |                0.0199 |         7 x 10^-6
mt (deg C)  | Fall   |    -24.66 | 0.0122 |                0.0110 |            0.0137

Table 2.6: The regression line parameters for the fitted lines for each variable with respect to time for the Medicine Hat site.

2.4 Daily values, distributions

This section studies the daily values of all three variables. To that end, we pick four days of the year: Jan 1st, April 1st, July 1st and October 1st. Let us look at the time series, histograms and normal qq-plots of each variable on these dates over the years. Figures 2.26 to 2.31 give the results. In fact the plots show that a normal distribution fits the daily MT and mt data for the selected days fairly well. However, some deviations from the normal distribution are seen, particularly in the tails. We also tried the first day of each month and observed similar results.
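A minimal sketch of how the values for one calendar date can be pulled out of a daily record and checked against a normal distribution is given below. The series daily is synthetic and only stands in for a station record; the thesis assesses normality graphically with histograms and qq-plots, and the Shapiro-Wilk test used here is just one convenient numerical check, not the method of the thesis.

import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(5)

# Synthetic stand-in for a daily maximum-temperature record (deg C).
dates = pd.date_range("1900-01-01", "1999-12-31", freq="D")
mt = (15 * np.cos(2 * np.pi * (dates.dayofyear - 200) / 365.25)
      + rng.normal(0, 5, len(dates)))
daily = pd.Series(mt, index=dates)

# All July 1st values across the available years.
july1 = daily[(daily.index.month == 7) & (daily.index.day == 1)]

# Fit a normal distribution and test the normality assumption numerically.
mu, sigma = stats.norm.fit(july1)
stat, pval = stats.shapiro(july1)
print("fitted normal: mean %.1f, sd %.1f; Shapiro-Wilk p-value %.3f"
      % (mu, sigma, pval))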
Figure 2.26: The time series of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.

Figure 2.27: The histogram of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.

Figure 2.28: The normal qq-plots of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.

Figure 2.29: The time series of daily minimum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.

Figure 2.30: The histogram of daily minimum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.

Figure 2.31: The normal qq-plots of daily minimum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.

Figure 2.32: The time series of daily precipitation at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.

We plot the histogram for PN as well (Figure 2.33). The distribution is far from normal because of the high frequency of dry (no PN) days. Next, we use the available years to compute confidence intervals for the mean of every given day of the year for MT and mt. For PN, we construct confidence intervals for the probability of PN. [A PN day is defined to be a day with PN > 0.2 (mm).
This is because any precipitation amount less than 0.2 (mm) is barely measurable.] Figures 2.34 to 2.36 give the confidence intervals for the means. The confidence intervals for the standard deviations (obtained by bootstrap techniques) are given in Figures 2.37 to 2.39. A regular seasonal pattern is seen in the means and standard deviations. For example, the maximum for MT and mt occurs around the 200th day of the year (in July) and the minimum occurs at the beginning and the end of the year. Comparing the plots of the means and the standard deviations, we observe that warmer days have smaller standard deviations than colder days. For example, the minimum standard deviation for the maximum and minimum temperature happens around the 200th day of the year, which corresponds to the warmest period of the year. The plots seem to indicate that a simple periodic function suffices to model the seasonal patterns. Contrary to MT and mt, for the 0-1 PN process the standard deviation is highest in June, when the probability of precipitation is close to 1/2. As shown above, the distribution of daily PN values is far from normal. This time, after removing the zeros, we fit a Gamma distribution to PN (Figure 2.42). The Gamma qq-plots are given in Figure 2.43 and reveal a fairly good fit.

Figure 2.33: The histogram of daily precipitation at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.

Figure 2.34: The confidence intervals for the daily mean maximum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.

Figure 2.35: The confidence intervals for the daily mean minimum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.

Figure 2.36: The confidence intervals for the probability of precipitation at the Calgary site for the days of the year. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.

Figure 2.37: The confidence intervals for the standard deviation of each day of the year for maximum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.

Figure 2.38: The confidence intervals for the standard deviation of each day of the year for minimum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
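The bootstrap intervals for the daily standard deviations mentioned above can be obtained with a simple percentile bootstrap; the sketch below shows one such version. The array day_values is synthetic and only stands in for the maximum temperatures recorded on one calendar day (say July 1st) across the available years.

import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for the MT values observed on one calendar day across years.
day_values = rng.normal(loc=23.0, scale=4.0, size=100)

# Percentile bootstrap for the standard deviation of that day.
n_boot = 5000
boot_sd = np.empty(n_boot)
for b in range(n_boot):
    sample = rng.choice(day_values, size=day_values.size, replace=True)
    boot_sd[b] = sample.std(ddof=1)

lo, hi = np.percentile(boot_sd, [2.5, 97.5])
print("bootstrap 95%% CI for the daily sd: (%.2f, %.2f) deg C" % (lo, hi))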
Figure 2.39: The confidence intervals for the standard deviation (sd) of each day of the year for the probability of precipitation (0-1 precipitation process) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals. The plot shows sd ≤ 1/2; this is because sd = √(p(1 − p)), which has a maximum value of 1/2.

Figure 2.40: The distribution of each day of the year for MT (deg C) from Jan 1st to Dec 1st. The year has been divided into two halves. In each half, rainbow colors are used to show the change of the distribution.

Figure 2.41: The distribution of each day of the year for mt (deg C) from Jan 1st to Dec 1st. The year has been divided into two halves. In each half, rainbow colors are used to show the change of the distribution.

Figure 2.42: The histogram of daily precipitation greater than 0.2 mm at the Calgary site with the Gamma density curve fitted using maximum likelihood.

Figure 2.43: The qq-plots of daily precipitation greater than 0.2 mm at the Calgary site with the Gamma curve fitted using maximum likelihood.

Figure 2.44: The Gamma fit for each day of 4 months for precipitation (mm). In each month, rainbow colors are used to show the change of the distribution.

Figures 2.40, 2.41 and 2.44 reveal the result of our investigation of the change in the distribution over a period of time. For MT and mt, we have done this over the course of the year. The figures show how the distribution deforms continuously over the year. We can also notice changes in the mean and standard deviation over the year. For PN, we have done the same only for 4 different months because of the high irregularity of the process.
Next, we look at the parameters of the Gamma distribution fitted to PN over the course of a year. If we use the maximum likelihood estimates (MLE), which we used above to form the Gamma curves, the confidence intervals obtained by the bootstrap method are very wide (they tend rapidly to infinity). Hence, we use the "method of moments estimates" (MOM) to obtain confidence intervals.
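Both estimators just mentioned are easy to compute. The sketch below fits the Gamma by maximum likelihood with scipy and also gives the standard closed-form method-of-moments estimates (shape = mean^2/variance, scale = variance/mean). The wet-day amounts are simulated here and only stand in for the Calgary PN values above 0.2 mm.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Synthetic stand-in for wet-day precipitation amounts (daily PN > 0.2 mm).
wet = rng.gamma(shape=0.8, scale=5.0, size=400)

# Maximum likelihood fit with the location fixed at zero.
shape_mle, _, scale_mle = stats.gamma.fit(wet, floc=0)

# Method-of-moments estimates in closed form.
m, v = wet.mean(), wet.var(ddof=1)
shape_mom, scale_mom = m * m / v, v / m

print("MLE: shape %.3f, scale %.3f" % (shape_mle, scale_mle))
print("MOM: shape %.3f, scale %.3f" % (shape_mom, scale_mom))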
Figure 2.45: The maximum likelihood estimate of α, the shape parameter of the Gamma distribution fitted to the precipitation amounts.

The MOM confidence intervals are given in Figure 2.46. When using the MLE estimates, since there is no closed form for them, we need to use Newton's method to find the maximizing values. MOM, on the other hand, gives closed form solutions. This advantage might explain the better behavior of the MOM estimates in forming the confidence intervals. However, even the MOM confidence intervals do not look satisfactory and are rather wide and irregular, especially at the beginning and end of the year.
We can also consider the 0-1 process of PN (1 for wet and 0 for dry) and compute the transition probabilities for PN (Figure 2.47). The figure shows that the probability of PN changes continuously over the year and can be modeled by a simple periodic function. Considering the 0-1 process of PN as a chain leads to the interesting question of the order of the Markov chain. Let us denote a PN occurrence by 1 and a non-occurrence by 0. Let xt = 1 denote PN on day t and xt = 0 denote no PN, and let p_{xt−r ··· xt}(t) denote the probability of observing xt on day t of the year conditional on the past states (xt−r, ···, xt−1). In Figure 2.47, we have plotted the estimated p̂11(t) and p̂01(t) for different days of the year.

Figure 2.46: The confidence interval for the MOM estimate of the shape parameter, α, of the Gamma distribution fitted to daily precipitation amounts. The dotted line is the upper bound and the solid line the lower bound. As seen in the figure, the upper bounds at the beginning and end of the year become very large; we have not shown them because otherwise the pattern in the rest of the year could not be seen.

Figure 2.47: The 1st-order transition probabilities. The dotted line is the probability of precipitation if it happened the day before (p̂11) and the dashed line is the probability of precipitation if it did not happen the day before (p̂01). The clear gap between these two estimated probabilities indicates that a 1st-order Markov chain should be preferred to a 0th-order one.

Figure 2.48: The 2nd-order transition probabilities for the precipitation at the Calgary site: p̂111 (solid) against p̂011 (dotted).

Figures 2.48 and 2.49 plot p̂111 against p̂011 and p̂001 against p̂101. The estimated probabilities seem to be close and overlap heavily over the course of the year. Hence a 1st-order Markov chain seems to suffice for describing the binary process of PN.

Figure 2.49: The 2nd-order transition probabilities for the precipitation at the Calgary site: p̂001 (solid) against p̂101 (dotted).

2.5 Correlation

The correlation in a spatial-temporal process can depend on time and space. This section studies the temporal and spatial patterns of the correlation function separately.
Figure 2.50: The correlation and covariance plots for maximum temperature at the Calgary site for Jan 1st and the 732 subsequent days.

2.5.1 Temporal correlation

Here we look at the correlation/covariance of the variables as a function of time. The location is taken to be the Calgary site. First we look at the correlation/covariance of a given day with the subsequent days. We pick Jan 1st and compute the correlation/covariance with the following days: Jan 2nd, Jan 3rd, etc. Figure 2.50 shows that the correlation and covariance have the same trends for maximum temperature. Figures 2.51 to 2.53 show a decreasing trend in the correlation over time for MT, mt and PN. The decrease is far from linear and looks exponential. The plots also indicate that only a few subsequent days are appreciably correlated and, in particular, that two days that are one year apart can be considered independent. This assumption might be useful in building a spatial-temporal model.

Figure 2.51: The correlation plot for maximum temperature (deg C) at the Calgary site for Jan 1st and the 732 subsequent days.

Figure 2.52: The correlation plot for minimum temperature (deg C) at the Calgary site for Jan 1st and the 732 subsequent days.

Figure 2.53: The correlation plot for precipitation (mm) at the Calgary site for Jan 1st and the 732 subsequent days.

Next we look at the correlation of responses on other days of the year with their 30 subsequent days. Our goal is to see if the correlation function has the same behavior over the course of a year. We pick Feb 1st, April 1st, July 1st and Oct 1st. Figures 2.54 to 2.56 show similar patterns. Finally, we look at the correlation between two fixed locations over the course of the year (by changing the day). The results are given in Figures 2.57 and 2.58. Strong correlations and clear seasonal patterns are seen for MT and mt. This seems to indicate, in particular, that the temperature process is not stationary. The correlation in the middle of the year, around day 200, which corresponds to the summer season, seems to be smaller than the correlation at the beginning and end of the year, which correspond to the cold season.

Figure 2.54: The correlation plot for maximum temperature (deg C) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot-dash) and the 30 subsequent days.

Figure 2.55: The correlation plot for minimum temperature (deg C) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot-dash) and the 30 subsequent days.

Figure 2.56: The correlation plot for precipitation (mm) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot-dash) and the 30 subsequent days.
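The lag correlations just described can be computed across years as in the following sketch. The year-by-day matrix temps is synthetic (a seasonal cycle plus autocorrelated noise) and only stands in for the Calgary record; the thesis follows Jan 1st over 732 subsequent days, while here we stay within a single year for brevity.

import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for daily maximum temperatures, one row per year.
n_years, n_days = 100, 365
day = np.arange(n_days)
seasonal = 12 * np.cos(2 * np.pi * (day - 200) / 365.25)
noise = np.zeros((n_years, n_days))
eps = rng.normal(0, 4, size=(n_years, n_days))
for k in range(1, n_days):
    noise[:, k] = 0.7 * noise[:, k - 1] + eps[:, k]   # AR(1) day-to-day noise
temps = seasonal + noise

# Correlation of Jan 1st with each of the next 30 days, computed across years.
base = temps[:, 0]
lag_corr = [np.corrcoef(base, temps[:, k])[0, 1] for k in range(1, 31)]
print(np.round(lag_corr, 2))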
Figure 2.57: The correlation plot for maximum temperature and minimum temperature (deg C) between Calgary and Medicine Hat.

Figure 2.58: The correlation plot for precipitation (mm) between Calgary and Medicine Hat.

Figure 2.59: The correlation plot for maximum temperature (deg C) with respect to distance (km).

2.5.2 Spatial correlation

This subsection looks at the spatial correlation by fixing the time to a few dates distributed over the year's climate regime: January 1st, April 1st, July 1st and Oct 1st. We plot the correlation with respect to the geodesic distance (km) on the surface of the earth. Figures 2.59 to 2.62 show the results for MT, mt, PN and 0-1 PN respectively. For MT and mt, we observe a clear decreasing trend with respect to distance. The trend for PN does not seem to be regular.

Figure 2.60: The correlation plot for minimum temperature (deg C) with respect to distance (km).

Figure 2.61: The correlation plot for precipitation (mm) with respect to distance (km).

Figure 2.62: The correlation plot for the precipitation (mm) 0-1 process with respect to distance (km).

2.6 Summary and conclusions

This section summarizes our findings from the exploratory analysis.
• There is a strong seasonal trend in the temperature and precipitation processes. See Figures 2.7, 2.8, 2.11 and 2.36.
• The summer average minimum temperature has increased at several locations over the past century. See Figure 2.25.
• mt and MT are highly correlated. See Figure 2.23.
• The distributions of daily maximum temperature and minimum temperature are rather close to the Gaussian distribution in the center, with some deviations seen in the tails. See Figures 2.27 and 2.29.
• The temperature process in Alberta is less variable in the warm seasons and the converse holds for the precipitation process. See Figures 2.37, 2.38 and 2.39.
• The distribution of the daily temperature varies continuously over the course of the year. This could not be shown for precipitation.
(This might be because we need more data.)
• The correlation between two sites depends on the time of the year; the sites are more correlated in the cold seasons. This might be because there are more (strong) global weather regimes in the cold seasons influencing the whole region.
• The correlation over time for MT, mt and PN seems stationary and decreases with a nonlinear (roughly exponential) trend with respect to the time difference.
• The spatial correlations for MT and mt are strong and decrease almost linearly with respect to the geodesic distance.
• The spatial correlation for PN is not strong. This might be because the sites are too far apart to capture the spatial correlation of PN.
The future chapters investigate some of these items. In particular, after developing some theory regarding Markov chains, we investigate the order of the binary precipitation process. Then we turn to modeling the occurrence of extreme temperatures. Instead of using a Gaussian process to model the temperature and using that to infer about the occurrence of the extremes, we use a categorical chain. This is because of the deviations from normality in the tails pointed out above.

Chapter 3

rth-order Markov chains

3.1 Introduction

This chapter studies rth-order categorical Markov chains and, more generally, categorical discrete-time stochastic processes. By "categorical", we mean chains that have a finite number of possible states at each time point. Such chains have important applications in many areas, one of which is modeling weather processes such as precipitation over time. In fact, we use these chains to model the binary process of precipitation as well as dichotomized temperature processes.
In rth-order Markov chains, the conditional probability of the present given the past is modeled. Such a conditional probability is a function of the past r states, where each of them takes only finitely many possible values. It is useful and intuitively appealing to specify or model a discrete process over time by the conditional probabilities rather than the joint distribution. However, one must check the consistency of such a specification, i.e., prove that it corresponds to a full joint distribution. In the case of discrete-time categorical processes, we prove a theorem that shows the conditional probabilities can be used to specify the process. We also prove a representation theorem which states that every such conditional probability, after an appropriate transformation, can be written as a linear summation of monomials of the past processes. In fact, we represent all categorical discrete-time stochastic processes over time, in particular rth-order Markov chains and, more particularly, stationary rth-order Markov chains. For the binary case the result is a consequence of an expansion theorem due to Besag [6]. To generalize the result to arbitrary categorical Markov chains, we prove a new expansion theorem which extends the result from the binary case to arbitrary categorical rth-order Markov chains. The result simplifies the task of modeling categorical stochastic processes. Since we have written the conditional probability as a linear combination, we can simply add other covariates as linear terms to the model to build non-stationary chains. For example, we can add seasonal terms or geographical coordinates (longitude and latitude). The theory of "partial likelihood"
allows us to estimate the parameters of such chain models for the binary case. By restricting the degree of those polynomials or by requiring that some of their coefficients be equal, we can find simpler models. Simulation studies show that the "BIC" criterion (Bayesian information criterion) combined with the partial likelihood works well, in that it recovers the correct simulation model. Since we are only dealing with the categorical case, all the density functions in this chapter are densities with respect to the counting measure on the real line.
Specifying a categorical chain over time (with positive joint densities) using conditional probabilities of the present given the past is quite common in statistics and probability. However, we did not find a rigorous result giving sufficient and necessary conditions for a collection of functions to correspond to the conditionals of a unique stochastic process. The proof is given in Theorem 3.5.6. This is an easy consequence of Lemma 3.3.2, which states that the "ascending" joint densities uniquely determine such a stochastic process. Another commonly used technique in statistics is transforming a discrete probability density from (0, 1) to the real numbers using a transformation such as the "log", for example in logistic regression. This is done to remove the restrictions on these quantities and ease the modeling of such probabilities. Theorem 3.4.1 provides a characterization of all such density functions given any bijective transformation between the positive numbers and the reals. Hence any positive discrete density function (mass function) corresponds to a unique real-valued function, and any arbitrary real-valued function corresponds to a positive density function (after fixing the transformation and one element with positive probability). We do not know of a result in this generality elsewhere. Modeling such an arbitrary real-valued function on a finite domain is obviously easier. In order to find a parametric form for an arbitrary real-valued function of finitely many finite-valued variables, for the binary case we use a corollary of a result stated by Besag [6], who used such functions in modeling Markov random fields. However, Besag did not provide a rigorous proof, and the statement of the theorem is flawed, as also pointed out by Cressie et al. in [14]. They also state a correct version of the theorem without offering a proof. We provide a rigorous statement and proof in Theorem 3.5.1. The corollary can only be obtained if the flaw in the statement is fixed. In order to extend to stochastic processes that can have more than two states at some times, we prove a new representation theorem in Theorem 3.5.6. Some novel simplified models with fewer parameters for such processes are given in Subsection 3.5.3, and many of them are investigated in later chapters to model precipitation and the occurrence of extreme temperature events.

3.2 Markov chains

Let {Xt}t∈T be a stochastic process on the index set T, where T = Z, T = N (the integers or natural numbers respectively) or T = {0, 1, ···, n}. It is customary to call {Xt}t∈T a chain, since T is countable and has a natural ordering. {Xt}t∈T is called an rth-order Markov chain if

P(Xt | Xt−1, ···) = P(Xt | Xt−1, ···, Xt−r), for all t such that t, t − r ∈ T.

We call the Markov chain homogenous if

P(Xt = xt | Xt−1 = xt−1, ···, Xt−r = xt−r) = P(Xt′ = xt | Xt′−1 = xt−1, ···, Xt′−r = xt−r),

for all t, t′ ∈ T such that t − r and t′ − r are also in T.
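As a concrete illustration of the homogenous case, the following sketch estimates P(Xt = 1 | Xt−1, ···, Xt−r) for a 0-1 chain by counting how often a 1 follows each length-r context. The simulated series is only a stand-in for a wet/dry precipitation record, and this simple empirical counting is just an illustration; the thesis itself fits parametric forms by partial likelihood.

from collections import Counter
import numpy as np

rng = np.random.default_rng(4)

# Synthetic stand-in for a 0-1 (dry/wet) precipitation series.
x = (rng.random(5000) < 0.35).astype(int)

def transition_probs(x, r=1):
    """Estimate P(X_t = 1 | X_{t-1}, ..., X_{t-r}) of a homogenous rth-order
    binary chain by counting how often a 1 follows each length-r context."""
    ones, totals = Counter(), Counter()
    for t in range(r, len(x)):
        context = tuple(x[t - r:t])
        totals[context] += 1
        ones[context] += int(x[t])
    return {c: ones[c] / totals[c] for c in totals}

print(transition_probs(x, r=1))  # contexts (0,) and (1,)
print(transition_probs(x, r=2))  # contexts (0,0), (0,1), (1,0), (1,1)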
Note that Markovness can be defined as a local property. We call {Xt}t∈T locally rth-order Markov at t if

P(Xt | Xt−1, ···) = P(Xt | Xt−1, ···, Xt−r).

Hence, we can have chains with a different Markov order at different times.
Let Xt be the binary random variable for precipitation on day t, with 1 denoting the occurrence of precipitation and 0 non-occurrence. In particular, consider the precipitation (PN) for the Calgary site from 1895 to 2006. This process can be considered in two possible ways:
1. Let X1, X2, ···, X366 denote the binary random variables of precipitation for the days of a year. Suppose we repeatedly observe this chain year-by-year from 1895 to 2006 and take these observed chains to be independent and identically distributed from one year to the next. With this assumption, techniques developed in [4] can be applied in order to infer the Markov order of the chain. However, this approach presents three issues. Firstly, the independence of the successive chains seems questionable; in particular, the end of any one year will be autocorrelated with the beginning of the next. Secondly, this model unrealistically assumes the 0-1 precipitation stochastic process is identically distributed over all years. Thirdly, and more technically, leap years have 366 days while non-leap years have 365. We can resolve this last issue by formally assuming a missing data day in the non-leap years, by dropping the last day in the leap years or by using other methods. However, none of these approaches seems completely satisfactory.
2. Alternatively, we could consider the observations of Calgary daily precipitation as coming from a single process that spans the entire time interval from 1895 to 2006. In this case, we will show below that we can still build models that bring in the seasonality effects within a year.

3.3 Consistency of the conditional probabilities

To represent a stochastic process, we only need to specify the joint probability distributions for all finite collections of states. The Kolmogorov extension theorem then guarantees the existence and uniqueness of an underlying stochastic process from which these distributions derive, provided they are consistent as described below. (See [9] for example.)
To state the version of that celebrated theorem we require, let T denote some interval (that can be thought of as "time"), and let n ∈ N = {1, 2, . . .}. For each k ∈ N and finite sequence of times t1, ···, tk, let νt1···tk be a probability measure on (Rn)k. Suppose that these measures satisfy two consistency conditions:
1. Permutation invariance. For all permutations π of 1, ···, k (a one-to-one map from the set {1, ···, k} onto itself) and measurable sets Fi ⊂ Rn,

ν_{tπ(1) ··· tπ(k)}(F1 × ··· × Fk) = ν_{t1 ··· tk}(Fπ−1(1) × ··· × Fπ−1(k)).

2. Marginalization consistency. For all measurable sets Fi ⊆ Rn and m ∈ N,

ν_{t1 ··· tk}(F1 × ··· × Fk) = ν_{t1 ··· tk tk+1 ··· tk+m}(F1 × ··· × Fk × Rn × ··· × Rn).

Then there exists a probability space (Ω, F, P) and a stochastic process X : T × Ω → Rn such that

ν_{t1 ··· tk}(F1 × ··· × Fk) = P(Xt1 ∈ F1, . . . , Xtk ∈ Fk),

for all ti ∈ T, k ∈ N and measurable sets Fi ⊆ Rn, i.e. X has the νt1···tk as its finite-dimensional distributions. (See [37] for more details.)
Remark. Note that Condition 1 is equivalent to

ν_{tπ(1) ··· tπ(k)}(Fπ(1) × ··· × Fπ(k)) = ν_{t1 ··· tk}(F1 × ··· × Fk).
This is seen by replacing F1 × ··· × Fk by Fπ(1) × ··· × Fπ(k) in the first equality.
Remark. We are only concerned with the case n = 1. This is because we consider stochastic processes, i.e. collections of random variables from the same sample space to R1 = R.
When working on (higher order) Markov chains over the index set N, it is natural to consider the conditional distributions of the present, time t, given the past instead of the finite joint distributions, in other words

Pt(x0, ···, xt) = P(Xt = xt | Xt−1 = xt−1, ···, X0 = x0),

for {Xt}t∈N∪{0}, plus the starting distribution

P0(x0) = P(X0 = x0).

However, that raises a fundamental question: does there exist a stochastic process whose conditional distributions match the specified ones and, if so, is it unique? We answer this question affirmatively in this section for the case of discrete-time categorical processes, in particular higher order categorical Markov chains. We also restrict ourselves to chains for which all the joint probabilities are positive.
Let M0, M1, ··· ⊂ R be the state spaces for times 0, 1, ···, where each of them is of finite cardinality. A probability measure on the finite space M0 can be represented through its density function, a positive function P0 : M0 → R satisfying the condition

Σ_{m ∈ M0} P0(m) = 1.

The following theorem ensures the consistency of our probability model.

Theorem 3.3.1 Suppose M0, M1, ··· ⊂ R, |Mt| = ct < ∞, t = 0, 1, ···. Let P0 : M0 → R be the density of a probability measure on M0 and, more generally for n = 1, . . ., let Pn(x0, x1, ···, xn−1, ·) be a positive probability density on Mn for every (x0, ···, xn−1) ∈ M0 × ··· × Mn−1. Then there exists a unique stochastic process (up to distributional equivalence) on a probability space (Ω, Σ, P) such that

P(Xn = xn | Xn−1 = xn−1, ···, X0 = x0) = Pn(x0, x1, ···, xn−1, xn).

To prove this theorem, we first consider a related problem whose solution is used in the proof. More precisely, we consider stochastic processes {Xn}n∈N∪{0}, where the state space for Xn is Mn, n = 0, 1, 2, ···, and finite. Suppose pn : M0 × M1 × ··· × Mn → R is the joint probability distribution (density) of the random vector (X0, . . . , Xn), i.e.

pn(x0, ···, xn) = P(X0 = x0, ···, Xn = xn).

We call such a sequence of functions, {pn}n∈N, the "ascending joint distributions" of the stochastic process {Xn}n∈N∪{0}. It is clear that given a family of functions {pn}n∈N, other joint distributions such as P(Xt1 = xt1, ···, Xtk = xtk) are obtainable by summing over the appropriate components. Now consider the inverse problem. Given the {pn}n∈N and some type of consistency between them, is there a (unique) stochastic process that matches these joint distributions? The following lemma gives an affirmative answer.

Lemma 3.3.2 Suppose Mt ⊂ R, t = 0, 1, ···, are finite, p0 : M0 → R represents a probability density function (i.e. Σ_{x0 ∈ M0} p0(x0) = 1) and the functions pn : M0 × M1 × ··· × Mn → R+ ∪ {0} satisfy the following (consistency) condition:

Σ_{xn ∈ Mn} pn(x0, ···, xn) = pn−1(x0, ···, xn−1).
Then there exist a unique stochastic process (up to distributional equivalence) {Xt }t∈N∪{0} such that P (X0 = x0 , · · · , Xn = xn ) = pn (x0 , · · · , xn ) Proof Existence: By the Kolmogorov extension theorem quoted above, we only need to show there exists a consistent family of measures (density functions) {qt1 ,··· ,tk |k ∈ N, (t1 , · · · , tk ) ∈ Nk }, such that q1,··· ,t = pt . We define such a family of functions, prove they are measures and consistent. For any sequence, t1 , · · · , tk , let t = max{t1 , · · · , tk } and define X qt1 ,··· ,tk (xt1 , · · · , xtk ) = pt (x1 , · · · , xt ). xu ∈Mu ,u∈{1,··· ,t}−{t1 ,··· ,tk }  We need to prove three things: 67  3.3. Consistency of the conditional probabilities a) Each qt1 ,··· ,tk is a density function. It suffices to show that qt is a measure because the qt1 ,··· ,tk are sums of such measures and so are measures themselves. But pt is nonnegative by assumption. It only remains to show that pt sums up to one. For t = 1 it is in the assumptions of the theorem. For t > 1, it can be done by induction because of the following identity X  xi ∈Mi ,i=0,1,··· ,t  pt (x0 , · · · , xt ) =  X  xi ∈Mi ,i=0,1,··· ,t−1  pt−1 (x0 , · · · , xt−1 )  where the right hand side is obtained by the assumption  P  Mn  pn = pn−1 .  b) In order to satisfy the first condition of Kolmogorov extension theorem, we need to show qt1 ,··· ,tk (xt1 , · · · , xtk ) = qtπ(1) ,··· ,tπ(k) (xtπ(1) , · · · , xtπ(k) ), for π a permutation of {1, 2, · · · , k}. But this is obvious since max{t1 , · · · , tk } = max{tπ(1) , · · · , tπ(k) }. c) In order to satisfy the second condition of Kolmogorov extension theorem, we need to show X qt1 ,··· ,ti ,··· ,tk (xt1 , · · · , xti , · · · , xtk ) = qt1 ,··· ,tˆi ,··· ,tk (xt1 , · · · , xˆti , · · · , xtk ), xti ∈Mti  where the notationˆ above a component means that component is omitted. To prove this, we consider two cases: Case I: t = max{t1 , · · · , tk } = max{t1 , · · · , tˆi , · · · , tk }: X  xti ∈Mti  X  xti ∈Mti  qt1 ,··· ,ti ,··· ,tk (xt1 , · · · , xti , · · · , xtk ) = X  xu ∈Mu ,u∈{1,··· ,t}−{t1 ,··· ,ti ,··· ,tk }  X  xu ∈Mu ,u∈{1,··· ,t}−{t1 ,··· ,tˆi ,··· ,tk }  pt (x0 , · · · , xt ) = pt (x0 , · · · , xt ) =  pt1 ,··· ,tˆi ,··· ,tk (xt1 , · · · , xˆti , · · · , xtk ) 68  3.3. Consistency of the conditional probabilities Case II: max{t1 , · · · , tˆi , · · · , tk } = t′ < t = ti :  X  xti ∈Mti  X  xti ∈Mti  qt1 ,··· ,ti ,··· ,tk (xt1 , · · · , xti , · · · , xtk ) = X  xu ∈Mu ,u∈{1,··· ,t}−{t1 ,··· ,ti ,··· ,tk }  X  xu ∈Mu ,u∈{1,··· ,t}−{t1 ,··· ,tˆi ,··· ,tk }  X  xu ∈Mu ,u∈{1,··· ,t′ }−{t1 ,··· ,tˆi ,··· ,tk }  xv ∈Mv  X  ,v∈{t′ +1,··· ,t}  X  xu ∈Mu ,u∈{1,··· ,t′ }−{t1 ,··· ,tˆi ,··· ,tk }  pt (x0 , · · · , xt ) = pt (x0 , · · · , xt ) = ft (x0 , · · · , xt ) =  pt′ (x0 , · · · , xt′ ) =  qt1 ,··· ,tˆi ,··· ,tk (xt1 , · · · , xˆti , · · · , xtk ).  Uniqueness: Suppose {Yt }t∈N∪{0} is another stochastic process satisfying the conditions of the theorem with the p′t1 ,··· ,tk as the joint measures. p′1,··· ,t = pt = p1,··· ,t , by the assumption. Taking the appropriate sums on the two sides, we get p′t1 ,··· ,tk = pt1 ,··· ,tk . Now the uniqueness is a straight consequence of the Kolmogorov Extension Theorem. Remark. Note that we did not impose the positivity of the functions for this case. Now we are ready to prove Theorem 3.3.1. Proof Existence: In Lemma 3.3.2, let p0 = P0 , p1 : M0 × M1 → R, p1 (x0 , x1 ) = p0 (x0 )P1 (x0 , x1 ), .. . 
pn : M1 ×M2 ×· · ·×Mn → R, pn (x0 , · · · , xn ) = pn−1 (x0 , · · · , xn−1 )Pn (x0 , · · · , xn ).  69  3.4. Characterizing density functions and rth–order Markov chains To see that the {pi } satisfy the conditions of Lemma 3.3.2, note that X pn (x0 , · · · , xn ) = X  xn ∈Mn  xn ∈Mn  pn−1 (x0 , · · · , xn−1 )Pn (x0 , · · · , xn ) =  pn−1 (x0 , · · · , xn−1 )  X  xn ∈Mn  Pn (x0 , · · · , xn ) =  pn−1 (x0 , · · · , xn−1 ).  Lemma 3.3.2 shows the existence of a stochastic process with joint distributions matching the pi . Furthermore, the positivity of the {Pi } implies that of the {pi }. Thus all the conditionals exist for such a process and they match the Pi by the definition of the conditional probabilities. Uniqueness. Any stochastic process satisfying the above conditions, has a joint distribution that matches those of the {pi } and hence by the above theorem they are unique.  3.4  Characterizing density functions and rth–order Markov chains  The previous section saw discrete–time categorical processes represented in terms of conditional probability density functions. However such densities on finite domains satisfy certain restrictions that can make modeling them difficult. That leads to the idea of linking them to unrestricted functions on R in much the same spirit as a single probability can profitably be logit transformed in logistic regression. To begin, let X be a random variable with probability density p defined on a finite set M = {m1 , · · · , mn }. The section finds the class of all possible such ps with p(mi ) > 0, i = 1, · · · , n and g : R → R+ , a fixed bijection. For example g(x) = exp(x). The following theorem characterizes the relationship between p and g. While particular examples of the following theorem are used commonly in statistical modeling we are not aware of a reference which contains this result or the proof in this generality. Theorem 3.4.1 Let g : R → R+ a bijection. For every choice of probability density p on M = {m1 , · · · , mn }, n ≥ 2, there exists a unique function f : M − {m1 } → R, such that 70  3.4. Characterizing density functions and rth–order Markov chains  p(m1 ) = p(x) =  1 1+ h(x)  1+  P  P  ,  (3.1)  , x 6= m1 ,  (3.2)  y∈M −{m1 } h(y)  y∈M −{m1 } h(y)  where h = g ◦ f . Moreover, h(x) = p(x)/p(m1 ). Inversely, for an arbitrary function f : M − {m1 } → R, the p defined above is a density function. Proof p(x) Existence: Suppose p : M → (0, 1) is given. Let h(x) = p(m , x 6= m1 and 1) −1 f : M − {m1 } → R, f (x) = g ◦ h(x). Obviously h = g ◦ f . Moreover 1 1+  P  and  y∈M −{m1 } h(y)  1+  1 1+  P  y∈M −{m1 } p(y)/p(m1 )  =  1 = p(m1 ) 1 + (1 − p(m1 ))/p(m1 ) h(x)  P  =  y∈M −{m1 } h(y)  =  p(x)/p(m1 ) = p(x), 1 + (1 − p(m1 ))/p(m1 )  thereby establishing the validity of equations (3.1) and (3.2). Uniqueness: Suppose for f1 , f2 , we get the same p. Let h1 = g ◦ f1 , h2 = g ◦ f2 , by dividing 3.2 by 3.1 for h1 and h2 , we get h1 (x) = p(x)/p(m1 ) = h2 (x) hence g ◦ f1 = g ◦ f2 . Since g is a bijection f1 = f2 . Corollary 3.4.2 Fixing a bijection g and m1 ∈ M , every density function corresponds to an arbitrary vector of length n − 1 over R. Example Consider the binomial distribution with a trials and probability of success π and the transformation g(x) = exp x. Then M = {0, 1, · · · , a}. Let m1 = 0 then for x 6= 0   n x −1 f (x) = g (h(x)) = log p(x)/p(0) = log p (1 − p)n−x /(1 − p)n = x   n log + x log{p/(1 − p)}. x 71  3.4. Characterizing density functions and rth–order Markov chains Theorem 3.4.3 Fix a bijection g : R → R+ , mn1 ∈ Mn . 
Let Mn , n = 0, 1, · · · be finite subsets of R with cardinality greater than or equal to 2 and Mn′ = Mn − {mn1 }, ∀n. Then every categorical stochastic process with positive joint distribution on the Mn having initial density P0 : M0 → R and conditional probabilities Pn at stage n given the past, can be uniquely represented by means of unique functions:  g0 : M0′ → R .. . gn : M0 × · · · × Mn−1 × Mn′ → R .. .  for n = 1, . . . , where P0 (m01 ) = P0 (x) =  1 1+  h0 (x) 1+  P  y∈M0 −{m01 }  P  ,  (3.3)  , x 6= m01 ∈ M0 ,  (3.4)  y∈M0 −{m01 } h0 (y)  h0 (y)  (X0 =x) and h0 = g ◦ g0 . Moreover h0 (x) = PP(X 1 . 0 =m0 ) The conditional probabilities Pn are given by  Pn (x0 , · · · , xn−1 , mn1 ) = Pn (x0 , · · · , xn−1 , x) =  1 1+  h(x) 1+  P  y∈Mn −{mn 1}  P  ,  (3.5)  , x 6= mn1 ∈ Mn ,  (3.6)  hn (y) y∈Mn −{mn 1}  hn (y)  (Xn =x|Xn−1 =xn−1 ,··· ,X0 =x0 ) . where, hn = g ◦ gn . Moreover hn (x0 , · · · , x) = PP(X 1 n =mn |Xn−1 =xn−1 ,··· ,X0 =x0 ) Conversely, any collection of arbitrary functions g0 , g1 , · · · gives rise to a unique stochastic process by the above relations.  Proof The result is immediate by Theorems 3.3.1 and 3.4.1.  72  3.5. Functions of r variables on a finite domain Remark. We can view the arbitrary functions g0 , · · · , gn on M0′ , M0 × M1′ , · · · , M0 ×· · ·×Mn−1 ×Mn′ as arbitrary functions g0 on M0′ , g1 (., x1 ), x1 6= m11 on M0 and gn (., xn ), xn 6= mn1 on M0 × · · · × Mn−1 . As a check we can compute the number of free parameters of such a stochastic process on M0 , · · · , Mn . We can specify such a process by c0 c1 · · · cn − 1 parameters by specifying the joint distribution on M0 × M1 × · · · × Mn . If we specify the stochastic process using the above theorems and the gi functions, we need (m0 − 1) + m0 (m1 − 1) + m0 m1 (m2 − 1) + · · · + m0 m1 · · · mn−1 (mn − 1) which is the same number after expanding the terms and canceling out. Remark. In the case of rth–order Markov chains, gn (x0 , · · · , xn ) only depends on the last r + 1 components for n > r. Remark. In the case of homogenous rth–order Markov chains, Mi = M0 , ∀i. Fix m0 ∈ M0 and suppose |M0 | = c0 . We only need to specify g0 to gr , which are completely arbitrary functions. We only need to ′ specify g0 on M0′ , g1 on M0 × M1′ to gr on M0 × · · · × Mr+1 . This also shows everyPhomogenous Markov chain of order at most r is characterized by (c0 − 1) ri=0 cr0 = cr+1 − 1 elements in R. We could have also counted 0 all such Markov chains by noting they are uniquely represented by the joint probability density pr+1 on M0r+1 which has cr+1 − 1 free parameters (since 0 it has to sum up to 1). To describe processes using Markov chains, we need to find appropriate parametric forms. We investigate the generality of these forms in the following section and use the concept of partial likelihood to estimate them. We find appropriate parametric representations of gn which are functions of n + 1 finite variables. In the next section we study the properties of such functions. We call a variable “finite” if it only takes values in a finite subset of R.  3.5  Functions of r variables on a finite domain  This section studies the properties of functions of r variables with finite domain. First, we present a result of Besag [6] who studied such functions in the context of Markov random fields. However the statement of the result in his paper is inaccurate and moreover it gives no rigorous proof of his result. We present a rigorous statement, proof of the result and generalization of Besag’s theorem.  73  3.5. 
Functions of r variables on a finite domain  3.5.1  First representation theorem  This subsection presents a corrected version of a theorem stated by Besag in [6] and a constructive proof. Then we generalize this theorem and apply it to stationary binary Markov chains to get a parametric representation. Q Theorem 3.5.1 Suppose, f : i=1,··· ,r Mi → R, Mi being finite with |Mi | = ci and 0 ∈ Mi , ∀i, 1 ≤ i ≤ r. Let Mi′ = Mi − {0}. Then there exist a unique family of functions {Gi1 ,··· ,ik : Mi′1 ×Mi′2 ×· · ·×Mi′k → R, 1 ≤ k ≤ r, 1 ≤ i1 < i2 < · · · < ik ≤ r}, such that  f (x1 , · · · , xr ) = f (0, · · · , 0) +  r X i=1  xi Gi (xi ) + · · · + X  1≤i1 <i2 <···<ik ≤r  (xi1 · · · xik )Gi1 ,··· ,ik (xi1 , · · · , xik )  + · · · + (x1 x2 · · · xr )G12···r (x1 , · · · , xr ).  . Remark. In [6], Besag claims that {Gi1 ,··· ,ik : Mi1 × Mi2 × · · · × Mik → R} (without removing one element from each set) are unique. Proof Denote by IA the indicator function of a set A and Nk = {(x1 , · · · , xr ) :  r X i=1  I{0} (xi ) ≤ k}.  Existence: The proof is by induction. For i = 1, · · · , r, define Gi : Mi′ → R,  Gi (xi ) =  f (0, · · · , 0, xi , 0, · · · , 0) − f (0, · · · , 0) , xi  where xi is the ith coordinate. Then let f1 (x1 , · · · , xr ) = f (0, · · · , 0) + Pr i=1 xi Gi (xi ). Note that f1 = f on N1 . Next define Gi1 ,i2 : Mi′1 × Mi′2 → R by  74  3.5. Functions of r variables on a finite domain  Gi1 ,i2 (xi1 , xi2 ) = f (0, · · · , 0, xi1 , 0, · · · , 0, xi2 , 0, · · · , 0) − f1 (0, · · · , 0, xi1 , 0, · · · , 0, xi2 , 0, · · · , 0) , x i1 x i2 th where, xi1 , xi2 are the ith 1 and i2 coordinates, respectively. Using the {Gi1 ,i2 }, we can define f2 on N2 by  f2 (x1 , · · · , xr ) = f (0, · · · , 0) +  r X  xi Gi (xi ) +  i=1  X  xi1 xi2 Gi1 ,i2 (xi1 , xi2 ).  1≤i1 <i2 ≤r  Or equivalently, f2 (x1 , · · · , xr ) = f1 (x1 · · · , xr ) +  X  xi1 xi2 Gi1 ,i2 (xi1 , xi2 ).  1≤i1 <i2 ≤r  It is easy to see that f2 = f on N2 . In general, suppose we have defined Gi1 ,··· ,ik−1 and fk−1 , let Gi1 ,··· ,ik (xi1 , · · · , xik ) = f (0, · · · , 0, xi1 , 0, · · · , 0, xik , 0, · · · , 0) − fk−1 (0, · · · , 0, xi1 , 0, · · · , 0, xik , 0, · · · , 0) , x i1 · · · x ik for (xi1 , · · · , xik ) ∈ Mi′1 × · · · × Mi′k . Also let fk (x1 , · · · , xr ) = fk−1 (x1 , · · · , xr )+  X  1≤i1 <i2 <···<ik ≤r  xi1 · · · xik Gi1 ,··· ,ik (xi1 , · · · , xik )  We claim f = fk on Nk . To see that, fix x = (x1 , · · · , xr ). If x has less than k nonzero elements, the second term in the above expansion will be zero and fk (x1 , · · · , xr ) = fk−1 (x1 , · · · , xr ) = f (x1 , · · · , xr ), by the induction hypothesis and we are done. However if x has exactly k nonzero elements x = (x1 , · · · , xr ) = (0, · · · , 0, xj1 , 0, · · · , 0, xjk , 0 · · · ). 75  3.5. Functions of r variables on a finite domain Then X  1≤i1 <i2 <···<ik ≤r  xi1 · · · xik Gi1 ,··· ,ik (xi1 , · · · , xik ) = xj1 · · · xjk Gj1 ,··· ,jk (xj1 , · · · , xjk ).  Hence fk (x1 , · · · , xr ) = fk−1 (x1 , · · · , xr ) + (xj1 , · · · , xjk )Gj1 ,··· ,jk (xj1 , · · · , xjk )  = fk−1 (x1 , · · · , xr )+ f (· · · , 0, xj1 , 0, · · · , 0, xjk , 0, · · · ) − fk−1 (· · · , 0, xj1 , 0, · · · , 0, xjk , 0, · · · ) xj 1 · · · xj k xj 1 · · · xj k  = f (x1 , · · · , xr ). Q By induction, f = fr on Nr = i=1,··· ,r Mi . Hence, the family of functions satisfies the conditions. 
Uniqueness: To prove uniqueness, suppose  {Gi1 ,··· ,ik : Mi′1 ×Mi′2 ×· · ·×Mi′k → R, 1 ≤ k ≤ r, 1 ≤ i1 < i2 < · · · < ik ≤ r}, and {Hi1 ,··· ,ik : Mi′1 ×Mi′2 ×· · ·×Mi′k → R, 1 ≤ k ≤ r, 1 ≤ i1 < i2 < · · · < ik ≤ r}, are two families of functions satisfying the equation. Also assume fkG and fkH are the summation functions as defined above corresponding to the two families. We need to show Gi1 ,··· ,ik = Hi1 ,··· ,ik on Mi′1 × · · · × Mi′k . We use induction on k. It is easy to verify the result for the case k = 1. Now suppose x = (xi1 , · · · , xik ) ∈ Mi′1 × Mi′2 × · · · × Mi′k . Then by definition Gi1 ,··· ,ik (xi1 , · · · , xik ) =  G (0, · · · , 0, x , 0, · · · , 0, x , 0, · · · , 0) f (0, · · · , 0, xi1 , 0, · · · , 0, xik , 0, · · · , 0) − fk−1 i1 ik , x i1 · · · x ik  and Hi1 ,··· ,ik (xi1 , · · · , xik ) =  H (0, · · · , 0, x , 0, · · · , 0, x , 0, · · · , 0) f (0, · · · , 0, xi1 , 0, · · · , 0, xik , 0, · · · , 0) − fk−1 i1 ik . x i1 · · · x ik  76  3.5. Functions of r variables on a finite domain G = f H . Hence we are done. But by induction hypothesis fk−1 k−1  We can think of this representation of f as an expansion around (0, · · · , 0). However, (0, · · · , 0) has no intrinsic role and we can generalize the above theorem as follows. Q Theorem 3.5.2 Suppose, f : M = i=1,··· ,r Mi → R, Mi being finite and |Mi | = ci . For any fixed (µ1 , · · · , µr ) ∈ M , let Mi′ = Mi − {µi }. Then there exist unique functions {Hi1 ,··· ,ik : Mi′1 ×Mi′2 ×· · ·×Mi′k → R, 1 ≤ k ≤ r, 1 ≤ i1 < i2 < · · · < ik ≤ r}, such that f (x1 , · · · , xr ) = f (µ1 , · · · , µr ) + X  1≤i1 <i2 <···<ik ≤r  r X i=1  (xi − µi )Hi (xi ) + · · · +  (xi1 − µi1 ) · · · (xik − µik )Hi1 ,··· ,ik (xi1 , · · · , xik )+  · · · + (x1 − µ1 )(x2 − µ2 ) · · · (xr − µr )H12···r (x1 , · · · , xr ).  Proof Let Ni = Mi −µi (meaning that we subtract µi from allQ elements of Mi ) so that Ni and Mi have the same cardinality. Also let N = i=1,··· ,r Ni and Ni′ = Ni − {0}. Then define a bijective mapping φi : Ni → Mi , φi (xi ) = xi + µi . This will induce a bijective mapping Φ between N and M that takes (0, · · · , 0) Q to (µ1 , · · · , µr ). Now consider f ◦ Φ : i=1,··· ,r Ni → R. By the previous theorem, unique functions {Gi1 ,··· ,ik : Ni′1 × Ni′2 × · · · × Ni′k → R, 1 ≤ k ≤ r, 1 ≤ i1 < i2 < · · · < ik ≤ r} exist such that  f ◦ Φ(x1 , · · · , xr ) = f ◦ Φ(0, · · · , 0) + X  1≤i1 <i2 <···<ik ≤r  r X i=1  xi Gi (xi ) + · · · +  xi1 · · · xik Gi1 ,··· ,ik (xi1 , · · · , xik ) + · · · + x1 x2 · · · xr G12···r (x1 , · · · , xr ). 77  3.5. Functions of r variables on a finite domain Hence, f (φ1 (x1 ), · · · , φr (xr )) = f (φ1 (0), · · · , φr (0)) + X  1≤i1 <i2 <···<ik ≤r  r X i=1  xi Gi (xi ) + · · · +  xi1 · · · xik Gi1 ,··· ,ik (xi1 , · · · , xik ) + · · · + x1 x2 · · · xr G12···r (x1 , · · · , xr ).  We conclude, f (x1 + µ1 , · · · , xr + µr ) = f (µ1 , · · · , µr ) + X  1≤i1 <i2 <···<ik ≤r  r X i=1  xi Gi (xi ) + · · · +  xi1 · · · xik Gi1 ,··· ,ik (xi1 , · · · , xik ) + · · · + x1 x2 · · · xr G12···r (x1 , · · · , xr ).  This gives f (x1 , · · · , xr ) = f (µ1 , · · · , µr ) + X  1≤i1 <i2 <···<ik ≤r  r X (xi − µ1 )Gi (xi − µi ) + · · · + i=1  (xi1 − µi1 ) · · · (xik − µik )Gi1 ,··· ,ik (xi1 − µi1 , · · · , xik − µik )+  · · · + (x1 − µ1 )(x2 − µ2 ) · · · (xr − µr )G12···r (x1 − µ1 , · · · , xr − µr ).  To prove the existence, let Hi1 ,··· ,ik (xi1 , · · · , xik ) = Gi1 ,··· ,ik (xi1 − µi1 , · · · , xik − µik ). The uniqueness can be obtained as in the previous theorem. 
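The recursive construction in the proof of Theorem 3.5.1 is easy to implement and check numerically. The following R sketch (an illustration of ours with made-up values, not code from the thesis) builds the G functions for r = 2 and verifies that the resulting expansion reproduces an arbitrary f exactly.

```r
## A numerical check of the construction in the proof of Theorem 3.5.1
## (our own sketch, not code from the thesis): r = 2, with M1 and M2 small
## finite sets containing 0, and f an arbitrary (here random) function.
set.seed(1)
M1 <- c(0, 1, 2)
M2 <- c(0, 1)
f  <- matrix(rnorm(length(M1) * length(M2)), nrow = length(M1),
             dimnames = list(as.character(M1), as.character(M2)))
fval <- function(x1, x2) f[as.character(x1), as.character(x2)]

## The G functions, defined on M1' = M1 - {0} and M2' = M2 - {0}.
G1  <- function(x1) (fval(x1, 0) - fval(0, 0)) / x1
G2  <- function(x2) (fval(0, x2) - fval(0, 0)) / x2
G12 <- function(x1, x2)
  (fval(x1, x2) - fval(0, 0) - x1 * G1(x1) - x2 * G2(x2)) / (x1 * x2)

## Reconstruct f from the expansion and compare with the original values.
expansion <- function(x1, x2) {
  out <- fval(0, 0)
  if (x1 != 0) out <- out + x1 * G1(x1)
  if (x2 != 0) out <- out + x2 * G2(x2)
  if (x1 != 0 && x2 != 0) out <- out + x1 * x2 * G12(x1, x2)
  out
}
grid <- expand.grid(x1 = M1, x2 = M2)
max(abs(mapply(expansion, grid$x1, grid$x2) - mapply(fval, grid$x1, grid$x2)))
## the maximum absolute difference is zero up to rounding error
```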
We call this expression the Besag expansion around (µ1 , · · · , µr ). Corollary 3.5.3 In the case of binary {0, 1} variables, the G functions are simply real numbers, since Mi′1 × · · · × Mi′k has exactly one element: (1, · · · , 1). Hence, we have found a linear representation of f in terms of the xi1 · · · xik . Corollary 3.5.4 Suppose that {Xt } is an rth–order Markov chain, Xt taking values in Mt = {0, 1} and the conditional probability P (Xt = 1|Xt−1 , · · · , X0 ), 78  3.5. Functions of r variables on a finite domain is well-defined and in (0,1). Let g : R → R+ be a given bijective transformation. Then gt (xt−1 , · · · , x0 ) = g−1 {  P (Xt = 1|Xt−1 = xt−1 , · · · , X0 = x0 ) }, P (Xt = 0|Xt−1 = xt−1 , · · · , X0 = x0 )  is a function of t variables, (xt−1 , · · · , x0 ), for t < r and is a function of r variables, (xt−1 , · · · , xt−r ), for t > r. Hence there exist unique parameters αt0 , {αti1 ,··· ,it }1≤i1 ,··· ,it ≤t for t < r and αt0 , {αti1 ,··· ,ir }1≤i1 ,··· ,ir ≤r for t ≥ r such that for t < r: g−1 {  P (Xt = 1|Xt−1 , · · · , X0 ) }= P (Xt = 0|Xt−1 , · · · , X0 ) t X Xt−i αti + · · · + αt0 + i=1  X  αti1 ,··· ,ik Xt−i1  1≤i1 <i2 <···<ik ≤t  · · · Xt−ik + · · · +  αt12···t Xt−1 Xt−2 · · · X0 .  and for t ≥ r: g−1 {  P (Xt = 1|Xt−1 , · · · , X0 ) }= P (Xt = 0|Xt−1 , · · · , X0 ) r X t α0 + Xt−i αti + · · · + i=1  X  1≤i1 <i2 <···<ik ≤r  αti1 ,··· ,ik Xt−i1  · · · Xt−ik + · · · +  αt12···r Xt−1 Xt−2 · · · Xt−r .  Moreover, given any collection of parameters, αt0 , {αti1 ,··· ,it }1≤i1 ,··· ,it ≤t for t < r and αt0 , {αti1 ,··· ,ir }1≤i1 ,··· ,ir ≤r for t ≥ r a unique stochastic process (upto distribution) is specified using the above relations. In the case of homogenous Markov chains the αt0 , αti1 ,··· ,ik do not depend on t for t > r. The above corollary shows that the conditional probability of a Markov chain after an appropriate transformation can be uniquely represented as a linear combination of monomial products of previous states. 79  3.5. Functions of r variables on a finite domain One might conjecture that the same result holds for all categorical– valued Markov chains (with a finite number of states) using the above theorem. This is not true in general since the {Gi1 ,··· ,ik } are functions. In the next section, we prove another representation theorem which paves the way for the categorical case. As it turns out, we need more terms in order to write down the transformed conditional probability as a linear combination of past processes.  3.5.2  Second representation theorem  In this section, we prove a new representation theorem for functions of r finite variables. We start with the trivial finite–valued one–variable function and then extend the result to r–variable functions. The proof for the general case is non–trivial and is done again by induction. Lemma 3.5.5 Suppose f : M → R, M ⊂ R being finite of cardinality c. Let d = c − 1. Then f has a unique representation of the form X f (x) = αi xi , ∀x ∈ M. 0≤i≤d  Remark. The lemma states that, if we consider the vector space V = {f : M → R}, then the monomial functions {pi }0≤i≤d , where pi : M → R, pi (x) = xi form a basis for V . Proof First note that the dimension of V is c. To show this, suppose M = {m1 , · · · , mc } and consider the following isomorphism of vector spaces, I : V → Rc f 7→ (f (m1 ), · · · , f (mc )). It only remains to show that {pi }0≤i≤d is an independent set. To prove this suppose, X αi xi = 0, ∀x ∈ M. 
0≤i≤d  P That would mean that the d–th degree polynomial p(x) = 0≤i≤d αi xi has at least c = d + 1 disjoint roots which is greater than its degree. This contradicts the fundamental theorem of algebra.  80  3.5. Functions of r variables on a finite domain Theorem 3.5.6 (Categorical Expansion Theorem) Suppose MQ i is a finite subset of R with |Mi | = ci , i = 1, 2, · · · , r. Let di = ci − 1, M = i=1,··· ,r Mi and consider the vector space of functions over R, V = {f : M → R} with the function addition as the addition operation of the vector space and the scalar product of a real number to the function as the scalarQproduct of the vector space. Then this vector space is of dimension C = i=1,··· ,r ci and {xi11 · · · xirr }0≤i1 ≤d1 ,··· ,0≤ir ≤dr forms a basis for it. Proof To show that the dimension of the vector space is C, suppose M = {m1 , · · · , mc } and consider following the isomorphism of vector spaces: I : V → RC , {xi11  f 7→ (f (m1 ), · · · , f (mC )).  To show that · · · xirr }0≤i1 ≤di ,··· ,0≤ir ≤dr forms a basis, we only need to show that it is an independent collection since there are exactly C elements in it. We proceed by induction on r. The case r = 1 was shown in the above lemma. Suppose we have shown the result for r − 1 and we want to show it for r. Assume a linear combination of the basis is equal to zero. We can arrange the terms based on powers of xr : p0 (x1 , · · · , xr−1 )+xr p1 (x1 , · · · , xr−1 )+· · ·+xdr r pd (x1 , · · · , xr−1 ) = 0, (3.7) ∀(x1 , · · · , xr ) ∈ M1 × · · · × Mr .  Fix the values of x′1 , · · · , x′r−1 ∈ M1 × · · · × Mr−1 . Then Equation (3.7) is zero for cr values of xr . Hence by Lemma 3.5.5, all the coefficients: p0 (x′1 , · · · , x′r−1 ), p1 (x′1 , · · · , x′r−1 ), · · · , pd (x′1 , · · · , x′r−1 ), are zero and we conclude: p0 (x1 , · · · , xr−1 ) = 0, p1 (x1 , · · · , xr−1 ) = 0, · · · , pd (x1 , · · · , xr−1 ) = 0, ∀(x1 , · · · , xr−1 ) ∈ M1 × · · · × Mr−1 . Again by the induction assumption all the coefficients in these polynomials are zero. Hence, all the coefficients in the original linear combination in Equation (3.7) are zero.  81  3.5. Functions of r variables on a finite domain Corollary 3.5.7 Suppose Xt is a categorical stochastic process, where Xt takes values in Mt , |Mt | = ct = dt +1 < ∞. Also assume that the conditional probability P (Xt = xt |Xt−1 = xt−1 , · · · , X0 = x0 ),  is well–defined and in (0,1). Fix m1t ∈ Mt . Let g : R → R+ be a bijective transformation, then there are unique parameters {αti0 ,··· ,it }t∈N,0≤i0 ≤dt −1,0≤i1 ≤dt−1 ,0≤i2 ≤dt−2 ,··· ,0≤it ≤d0 , such that P (Xt = xt |Xt−1 = xt−1 , · · · , X0 = x0 ) = Pt (x0 , · · · , xt ), where Pt (x0 , · · · , xt−1 , mt1 ) = Pt (x0 , · · · , xt−1 , x) =  h(x) 1+  P  1 1+  P  y∈M −{m1 } ht (y)  for ht (x0 , · · · , xt ) = g ◦ gt (x0 , · · · , xt−1 , xt ) and X gt (x0 , · · · , xt−1 , xt ) =  ,  (3.8)  , x 6= mt1 ∈ Mt ,  (3.9)  y∈M −{mt1 } ht (y)  0≤i0 ≤dt −1,0≤i1 ≤dt−1 ,··· ,0≤it ≤d0  t 0 αti0 ,··· ,it xit−0 · · · xit−t ,  (x0 , · · · , xt ) ∈ M0 × · · · × Mt−1 × Mt′ .  On the other hand any set of arbitrary parameters αti0 ,··· ,it gives rise to a unique stochastic process with the above equations. Corollary 3.5.8 Suppose that {Xt } is an rth–order Markov chain where Xt takes values in Mt a finite subset of real numbers, |Mt | = ct = dt + 1 < ∞, the conditional probability P (Xt = xt |Xt−1 = xt−1 , · · · , X0 = x0 ), is well–defined and belongs to (0, 1). Fix m1t ∈ Mt , let Mt′ = Mt − {m1t } and suppose g : R → R+ is a given bijective transformation. 
Then gt (xt , · · · , x0 ) = g−1 {  P (Xt = xt |Xt−1 = xt−1 , · · · , X0 = x0 ) }, P (Xt = m1t |Xt−1 = xt−1 , · · · , X0 = x0 ) 82  3.5. Functions of r variables on a finite domain is a function of t + 1 variables for t < r, (xt , · · · , x0 ) and is a function of r + 1 variables,(xt , · · · , xt−r ), for t > r. Hence there exist parameters {αti0 ,··· ,it }0≤i0 ≤dt −1,0≤i1 ≤dt−1 ,··· ,0≤it ≤d0 , for t < r and {αti0 ,··· ,ir }0≤i0 ≤dt −1,0≤i1 ≤dt−1 ,··· ,0≤ir ≤dt−r , for t ≥ r such that for t < r:  g−1 {  P (Xt = xt |Xt−1 = xt−1 , · · · , X0 = x0 ) }= P (Xt = m1t |Xt−1 = xt−1 , · · · , X0 = x0 ) X 0 t αti0 ,··· ,it xit−0 · · · xit−t ,  0≤i0 ≤dt −1,0≤i1 ≤dt−1 ,··· ,0≤it ≤d0  (x0 , · · · , xt ) ∈ M0 × · · · Mt−1 × Mt′ ,  and for t ≥ r: g−1 {  P (Xt = xt |Xt−1 = xt−1 , · · · , X0 = x0 ) }= P (Xt = m1t |Xt−1 = xt−1 , · · · , X0 = x0 ) X 0 r αti0 ,··· ,ir xit−0 · · · xit−r  0≤i0 ≤dt −1,0≤i1 ≤dt−1 ,··· ,0≤ir ≤dt−r  (x0 , · · · , xt ) ∈ M0 × · · · Mt−1 × Mt′ .  Moreover any collection of arbitrary parameters {αti0 ,··· ,it }0≤i0 ≤dt −1,0≤i1 ≤dt−1 ,··· ,0≤it ≤d0 , for t < r, and {αti0 ,··· ,ir }0≤i0 ≤dt −1,0≤i1 ≤dt−1 ,··· ,0≤ir ≤dt−r , for t ≥ r, specify a unique stochastic process (upto distribution) by the above relations. In the case of homogenous Markov chains the αti1 ,··· ,ir do not depend on t for t > r. One might question the usefulness of such a representation. After all we have exactly as many parameters in the model as the values of the original function. In the following, we explain the importance of linear representations of such functions.  83  3.5. Functions of r variables on a finite domain 1. A vast amount of theory has been developed to deal with linear models. Generalized linear models in the case of independent sequence of random variables is a powerful tool. As we will see in sequel, these ideas can be imported into time series using the concept of partial likelihood. 2. Although we have as many parameters in the model as the values of the original function, the representation gives us a convenient framework for modeling, in particular for making various model reductions by omitting some terms or assuming certain coefficients are equal. 3. Although this is a representation for stationary rth–order Markov chains (or representation for arbitrary locally rth–order chains at time t), this representation allows us to accommodate other explanatory variables simply as additive linear terms and extend the model to non–stationary cases. This cannot be done in the same way if we try to model the original values of the function. Example As an example consider a categorical response variable Y and r categorical explanatory variables X1 , · · · , Xr , are given. Suppose the Xi takes values in the Mi which include 0. Our purpose is to model Y based on X1 , · · · , Xr . In order to do that, we consider the conditional probability P (Y = y|X1 = x1 , · · · , Xr = xr ). Again, we assume that the conditional probability is well-defined everywhere and takes values in (0, 1). The above theorem shows that after applying a transformation the conditional probability can be written as a linear combination of multiples of powers of the Xi . Although, the theorem above shows the form of the conditional probability in general and paves the way to the estimation of the conditional probabilities by estimating the parameters, the large number of parameters makes this a challenging task which might be impractical in some cases. 
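As a concrete special case of this example (a sketch of ours with simulated data; the variable names are hypothetical), take a binary response and two binary explanatory variables. The transformed conditional probability is then exactly a linear combination of 1, X1, X2 and X1 X2, that is, a saturated logistic regression, which can be fit directly with glm in R:

```r
## Saturated logistic model for a binary response and two binary covariates
## (hypothetical simulated data, not from the thesis):
##   logit P(Y = 1 | X1, X2) = a0 + a1*X1 + a2*X2 + a12*X1*X2.
## With 4 parameters for 4 cells, the fit reproduces the cell probabilities.
set.seed(2)
n  <- 5000
x1 <- rbinom(n, 1, 0.5)
x2 <- rbinom(n, 1, 0.5)
p  <- plogis(-0.5 + 1.2 * x1 - 0.8 * x2 + 0.6 * x1 * x2)  # true probabilities
y  <- rbinom(n, 1, p)

fit <- glm(y ~ x1 * x2, family = binomial)  # main effects plus interaction
coef(fit)                                   # estimates of a0, a1, a2, a12

## Fitted cell probabilities agree with the raw cell proportions:
cells <- expand.grid(x1 = 0:1, x2 = 0:1)
cbind(cells,
      fitted = predict(fit, newdata = cells, type = "response"),
      raw    = as.vector(tapply(y, list(x1, x2), mean)))
```

With more explanatory variables, or more categories per variable, the same construction applies, but the number of product terms grows multiplicatively, which is the practical difficulty noted above.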
In the next section, we introduce some classes of r variable functions that can be useful for some applications.  84  3.5. Functions of r variables on a finite domain  3.5.3  Special cases of functions of r finite variables  The first class of functions we introduce are obtained by power restrictions. We simply assume that gt can be represented only by powers less than k. Suppose Xt takes values in 0, 1, · · · , ct − 1. Then for a k-restricted power stationary rth–order Markov chain, the gt , t > r is given by: X  0≤i1 ≤d1 ,··· ,0≤ir ≤dr ,  P  j ij ≤k  i1 ir αi1 ,··· ,ir Xt−1 · · · Xt−r .  In particular, we can let k = 1 and get β0 +  X  βi Xt−i .  i  This is useful especially for binary Markov chains. The second class of functions are useful in the case when relationships exist between the states in terms of a semi–metric d. Suppose {Xt } is an rth–order Markov chain and Xt takes values in the same finite set M = {1, · · · , m}. Also let d : M × M → R,  be a semi–metric being a mapping on M that satisfies the following conditions: d ≥ 0;  d(x, z) ≤ d(x, y) + d(y, z);  d(x, x) = 0.  Then we introduce the following model: k  g−1 {  X P (Xt = j|Xt−1 , · · · , Xt−r ) } = α0,j + αi,j d(j, Xt−i ) P (Xt = 1|Xt−1 , · · · , Xt−r ) i=1  for j = 2, · · · , m. For this model P (Xt = 1|Xt−1 , · · · , Xt−r ) = 1 −  X  j=2,··· ,m  P (Xt = j|Xt−1 , · · · , Xt−r ).  Finally, we introduce a simple class for the binary Markov chain of order r. For any bijective transformation g : R → R+ g−1 {  P (Xt = 1|Xt−1 , · · · , Xt−r ) } = α0 + α1 Nt−1 , P (Xt = 0|Xt−1 , · · · , Xt−r ) 85  3.6. Generalized linear models for time series Pr where Nt−1 = j=1 Xt−j . For example in the 0-1 precipitation process example seen in the Introduction, Nt−1 counts the number of the days out of r days before today that had some precipitation.  3.6  Generalized linear models for time series  Generalized linear models were developed to extend ordinary linear regression to the case that the response is not normal. However, that extension required the assumption of independently observed responses. The notion of partial likelihood was introduced to generalize these ideas to time series where the data are dependent. What follows in this section is a summary of the first chapter in Kedem and Fokianos [27], which we have included for completeness. Definition Let Ft , t = 1, 2, · · · be an increasing sequence of σ–fields, F0 ⊂ F1 ⊂ F2 , · · · and let Y1 , Y2 , · · · be a sequence of random variables such that Yt is Ft –measurable. Denote the density of Yt , given Ft , by ft (yt ; θ), where θ ∈ Rp is a fixed parameter. The partial likelihood (P L) is given by P L(θ; y1 , · · · , yN ) =  N Y  ft (yt ; θ).  t=1  Example As an example, suppose Yt represents the 0-1 P N process in Calgary, while M Tt denotes the maximum daily temperature process. We can define Ft as follows: 1. Ft = σ{Yt−1 , Yt−2 , · · · }. In this case, we are assuming the information available to us is the value of the process on each of the previous days. 2. Ft = σ{Yt−1 , Yt−2 , · · · M Tt−1 , M Tt−2 , · · · }. In this case, we are assuming we have all the information regarding the 0-1 process of precipitation and maximum temperature for previous days. 3. Ft = σ{Yt−1 , Yt−2 , · · · M Tt , M Tt−1 , M Tt−2 , · · · }. In this case, we add to the information in 2 the knowledge of today’s maximum temperature. The vector θ that maximizes the above equation is called the maximum partial likelihood (MPLE). Wong [48] has studied its properties. 
Its consistency, asymptotic normality and efficiency can be shown under certain regularity conditions. 86  3.6. Generalized linear models for time series In this report, we are mainly interested in the case: Ft = σ{Yt−1 , Yt−2 , · · · }. We assume that the information Ft is given as a vector of random variables and denote it by Zt , which we call the covariate process: Zt = (Zt1 , · · · , Ztp )′ . Zt might also include the past values of responses Yt−1 , Yt−2 , · · · . Let µt = E[Yt |Ft−1 ], be the conditional expectation of the response given the information we have up to the time t. Kedem and Fokianos in [27] address time series following generalized linear models satisfying certain conditions about the so-called random and systematic components: • Random components: For t = 1, 2, · · · , N f (yt ; θt , φ|Ft−1 ) = exp{  yt θt − b(θt ) + c(yt ; φ)}. at (φ)  • The parametric function αt (φ) is of the form φ/wt , where φ is the dispersion parameter, and wt is a known parameter called “weight parameter”. The parameter θt is called the natural parameter. • Systematic components: For t = 1, 2, · · · , N , g(µt ) = ηt =  p X  ′ βj Z(t−1)j = Zt−1 β,  j=1  for some known monotone function g called the link function. Example Binary time series: As an example consider {Yt }, a binary time series. Let us denote by πt the probability of success given Ft−1 . Then for t = 1, 2, · · · , N , f (yt; θt , φ|Ft−1 ) = exp(yt log(  πt ) + log(1 − πt )) 1 − πt  with E[Yt |Ft−1 ] = πt , b(θt ) = − log(1 − πt ) = log(1 + exp(θt )), V (πt ) = πt (1 − πt ), φ = 1, and wt = 1. The canonical link gives rise to the so–called “logistic model”: g(πt ) = θt (πt ) = log(  πt ′ ) = ηt = Zt−1 β. 1 − πt 87  3.6. Generalized linear models for time series In the notation of Corollary 3.5.4, Yt = Xt , πt = P (Xt = 1|Xt−1 , · · · , Xt−r ) ′ and Zt−1 = (1, Xt−1 , · · · , Xt−r , Xt−1 Xt−2 , · · · , Xt−1 · · · Xt−r ). We can also ′ consider other covariate processes such as Zt−1 = (1, Xt−1 , · · · , Xt−r ) and so on. In order to study the asymptotic behavior of the maximum likelihood estimator, we consider the conditional information matrix. To establish large sample properties, the stability of the conditional information matrix and the central limit theorem for martingales are required. Proofs may be found in Kedem and Fokianos [27].  Inference for partial likelihood The definitions of partial likelihood and exponential family of distributions imply that the log partial likelihood is given by N X  N X yt θt − b(θt ) log f (yt ; θt , φ|Ft−1 ) = { + c(yt , φ)} = l(β) = αt (φ) t=1 t=1  N N ′ β) − b(u(z ′ )) X X yt u(zt−1 t−1 { + c(yt , φ)} = lt , α (φ) t t=1 t=1  where u(.) = (g◦µ(.))−1 = µ−1 (g−1 (.)), so that θt = u(zt−1 β). We introduce the notation, ∂ ∂ ′ ▽=( ,··· , ) ∂β1 ∂βp and call ▽l(β) the partial score. To compute the gradient, we can use the chain rule in the following manner ∂lt ∂lt ∂θt ∂µt ∂ηt = . ∂βj ∂βj ∂µt ∂ηt ∂βj Some algebra shows SN (β) = ▽l(β) =  N X  Z(t−1)  t=1  σt2 (β)  ∂µt Yt − µt (β) , ∂ηt σt2 (β)  where, = V ar[Yt |Ft−1 ]. The partial score process is defined from the partial sums as St (β) = ▽l(β) =  t X s=1  Z(s−1)  ∂µs Ys − µs (β) . ∂ηs σs2 (β) 88  3.6. Generalized linear models for time series One can show the terms in the above sums to be orthogonal: ∂µt Yt − µt (β) ∂µs Ys − µs (β) ] = 0, s < t. Z(s−1) 2 ∂ηt σt (β) ∂ηs σs2 (β) Also, E[SN (β) = 0]. The cumulative information matrix is defined by E[Z(t−1)  GN (β) =  N X  Cov[Z(t−1)  t=1  ∂µt Yt − µt (β) |Ft−1 ]. 
∂ηt σt2 (β)  The unconditional information matrix is simply Cov(SN (β)) = FN (β) = E[GN (β)]. Next let HN (β) = −▽▽′ l(β).  Kedem and Fokianso [27] show that  HN (β) = GN (β) − RN (β), where  N  1 X ′ RN (β) = Zt−1 dt (β)Zt−1 (Yt − µt (β)), αt (φ) t=1  [∂ 2 u(ηt )/∂ηt2 ].  and dt (β) = St satisfies the martingale property:  E[St+1 (β)|Ft−1 ] = St (β). To prove the consistency and other properties of the estimators, we need: Assumption A: A1. The true parameter β belongs to an open set B ⊂ R. A2. The covariate vectorP Zt almost surely lies in a non random compact p ′ ′ set Γ of R , such that P [ N t=1 Zt−1 Zt−1 > 0] = 1. In addition, Zt−1 β lies −1 almost surely in the domain H of the inverse link function h = g for all Zt−1 ∈ Γ and β ∈ B. A3. The inverse link function h, defined in (A2), is twice continuously differentiable and |∂h(λ)/∂λ| = 6 0. R A4. There is a probability measure ν on Rp such that Rp zz ′ ν(dz) is positive definite, and such that for Borel sets A ⊂ Rp , N 1 X I[Zt−1 ∈A] → ν(A). N t=1  89  3.7. Simulation studies Theorem 3.6.1 Under assumption A the maximum likelihood estimator is almost surely unique for all sufficiently large N , and 1. the estimator is consistent and asymptotically normal, p  β̂ → β in probability, and √ d N (β̂ − β) → Np (0, G−1 (β)), in distribution as N → ∞, for some matrix G. 2. The following limit holds in probability, as N → ∞: √ 1 p N (β̂ − β) − √ G−1 (β)SN (β) → 0. N We follow Kedem and Fokianos [27], who used similar models, to assume the above conditions for our models. However, we conjecture that the above assumptions hold for the partial likelihood of stationary rth–order Markov chains (with strictly positive joint distribution) in terms of our parametric linear form at least for the binary case. In fact assumptions A1. to A3. are easy to check and only A4. poses some challenge. We leave this for future research and use several simulation studies to check the consistency of the estimators in next section as well as Chapter 4 and Chapter 10. For more discussion regarding the assumptions and consistency see [27].  3.7  Simulation studies  This section presents the results of some simulation studies about the partial likelihood applied to categorical rth–order Markov chains. We also investigate the performance of the BIC to pick the appropriate (“true”) model. In particular, we generate samples from a seasonal Markov chain Yt where, Zt−1 = (1, Yt−1 , cos(ωt)), ω =  2π . 366  We consider this Markov chain over 5 years from 2000 to 2005 and assume logit{P (Yt = 1|Zt−1 )} = β ′ Zt−1 , where β = (−1, 1, −0.5). 90  3.7. Simulation studies To generate samples for this chain, we need an initial value of the past two states, which we take it to be (1, 1). We denote the process Yt−k by Y k for simplicity. To check the performance of the partial likelihood and estimates of the variance using GN , we generate 50 chains with this initial value and then compare the parameter estimates with the true parameters. We also compare the theoretical variances with the experimental variances. Table 3.7 shows that the parameter estimates are fairly close to the true values. Also the experimental and theoretical variances are similar. sim. sd  theo. sd  β̂1  β̂2  β̂3  sd(β̂1 )  sd(β̂2 )  sd(β̂3 )  sd(β̂1 )  sd(β̂2 )  sd(β̂3 )  -0.99  1.0  -0.42  0.07  0.10  0.07  0.06  0.12  0.07  Table 3.1: The estimated parameters for the model Zt−1 = (1, Yt−1 , cos(ωt)) with parameters β = (−1, 1, −0.5). The standard deviation for the parameters is computed once using GN (theo. 
sd) and once using the generated samples (sim. sd).

In Kedem and Fokianos [27] other simulation studies have been done to check the validity of this method. To check the normality of the parameter estimates, we plot histograms of the three parameter estimates in Figure 3.1. The figure shows that the parameter estimates have a distribution close to Gaussian. Next we check the performance of the BIC criterion in picking the optimal (“true”) model. We use the same model as above and then compute the BIC for a few models to see if BIC picks the right one. We denote Yt−k by Y k and cos(ωt) by COS for simplicity. For an assessment, we simulate a few other chains.

Figure 3.1: The distribution of parameter estimates for the model with the covariate process Zt−1 = (1, Yt−1, cos(ωt)) and parameters (β1 = −1, β2 = 1, β3 = −0.5).

Model: Zt−1                              BIC      parameter estimates
(1)                                      2380.0   (-0.605)
(1, Y 1)                                 2267.1   (-1.03, 1.11)
(1, Y 1, Y 2)                            2273.7   (-1.064, 1.091, 0.101)
(1, Y 1, COS)                            2217.7   (-1.00, 0.970, -0.558)
(1, Y 1, SIN)                            2274.4   (-1.037, 1.117, 0.026)
(1, Y 1, COS, SIN)                       2225.1   (-1.00, 0.970, -0.559, 0.028)
(1, Y 1, Y 2, Y 1 Y 2)                   2281.1   (-1.055, 1.0615, 0.0647, 0.077)
(1, Y 1, Y 2, Y 1 Y 2, COS)              2232.4   (-0.985, 0.943, -0.0870, 0.0915, -0.564)
(1, Y 1, Y 2, Y 1 Y 2, COS, SIN)         2239.8   (-0.981, 0.957, -0.0946, 0.0723, -0.575, 0.0232)

Table 3.2: BIC values for several models competing for the role of the true model, where Zt−1 = (1, Y 1, COS), β = (−1, 1, −0.5).

As we see in Table 3.2, the true model has the smallest BIC, showing that BIC performs well in this case. Also note that models which include the covariates of the true model have accurate estimates for the parameters associated with (1, Y 1, COS), while giving very small magnitudes for the other parameters.

Model: Zt−1                              BIC      parameter estimates
(1)                                      2537.3   (0.0799)
(1, Y 1)                                 2329.5   (-0.649, 1.417)
(1, Y 1, Y 2)                            2245.5   (-1.022, 1.144, 0.998)
(1, Y 1, COS)                            2265.9   (-0.553, 1.236, -0.617)
(1, Y 1, SIN)                            2336.7   (-0.648, 1.415, -0.0433)
(1, Y 1, COS, SIN)                       2273.0   (-0.552, 1.235, -0.617, -0.0480)
(1, Y 1, Y 2, Y 1 Y 2)                   2251.3   (-1.08, 1.287, 1.140, -0.278)
(1, Y 1, Y 2, Y 1 Y 2, COS)              2213.7   (-0.936, 1.11, 0.966, -0.175, -0.511)
(1, Y 1, Y 2, Y 1 Y 2, COS, SIN)         2221.2   (-0.927, 1.101, 0.940, -0.160, -0.549, -0.0441)
(1, Y 1, Y 2, COS)                       2206.8   (-0.899, 1.0263, 0.875, -0.515)

Table 3.3: BIC values for several models competing for the role of the true model, given by Zt−1 = (1, Y 1, Y 2, COS), β = (−1, 1, 1, −0.5).

Table 3.3 presents the true model in the last row. Ignore that row for a moment. The smallest BIC among the remaining models corresponds to (1, Y 1, Y 2, Y 1 Y 2, COS), which has a component Y 1 Y 2 added to the true model. However, the coefficients of this model are very close to those of the true model and the coefficient for Y 1 Y 2 is relatively small in magnitude. Taking the last row into account, the true model has the smallest BIC again and the parameter estimates are close to the correct values.

3.8 Concluding remarks

In summary, this chapter shows that a categorical discrete–time stochastic process can be represented using a small number of ascending joint distributions

P(X0 = x0), P(X0 = x0, X1 = x1), P(X0 = x0, X1 = x1, X2 = x2), · · · .
As a corollary of the above, we showed that a categorical discrete–time stochastic process can be represented using the conditional probabilities P (X0 = x0 ), P (X1 = x1 |X0 = x0 ), P (X2 = x2 |X0 = x0 , X1 = x1 ), · · · . A parametric form was found for the conditional probability distribution of categorical discrete–time stochastic processes. The parameters can be estimated for stationary binary Markov chains using partial likelihood.  93  Chapter 4  Binary precipitation process 4.1  Introduction  This chapter studies the Markov order of the 0-1 precipitation process (P N from now on). Many authors such as Anderson et al. in [4] and Barlett in [5] have developed techniques to test different assumptions about the order of the Markov chain. For example in [4], Anderson et al. develop a Chi-squared test to test that a Markov chain is of a given order against a larger order. In particular, we can test the hypothesis that a chain is 0th–order Markov against a 1st–order Markov chain, which in this case is testing independence against the usual (1st–order) Markov assumption. (This reduces simply to the well–known Pearson’s Chi-squared test.) Hence, to “choose” the Markov order one might follow a strategy of testing 0th– order against 1st–order, testing 1st–order against 2nd–order and so on to rth–order against (r + 1)th–order, until the test rejects the null hypothesis and then choose the last r as the optimal order. However, some drawbacks are immediately seen with this method: 1. The choice of the significance level will affect our chosen order. 2. The method only works for chains with several independent observations of the same finite chain. 3. We cannot account for some other explanatory variables in the model, for example the maximum temperature. Issues like this have led researchers to think about other methods of order selection. Akaike in [2], using the information distance and Schwartz in [42] using Bayesian methods develop the AIC and BIC, respectively. Other methods and generalizations of the above methods have been proposed by some authors such as Hannan in [20], Shibita in [44] and Haughton in [22]. Many authors have studied the order of precipitation processes at different locations on Earth. Gabriel et al. in [18] use the test developed in Anderson et al. [4] to show that the precipitation in Tel-aviv is a 1st–order 94  4.2. Models for 0-1 precipitation process Markov chain. Tong in [45] used the AIC for Hong Kong, Honolulu and New York and showed that the process is 1st–order in Hong Kong and Honolulu but 0th–order in New York. In a later paper, [46], Tong and Gates use the same techniques for Manchester and Liverpool in England and also reexamined the Tel–aviv data. Chin in [12] studies the problem using AIC over 100 stations (separately) in the United States over 25 years. He concludes that the order depends on the season and geographical location. Moreover, he finds a prevalence of first order conditional dependence in summer and higher orders in winter. Other studies have been done by several authors using similar techniques over other locations. For example, Moon et al. in [35] study this issue at 14 location in South Korea. This report investigates the Markov order for a cold–climate region. The Markov order of the precipitation in this region might be different due to a large fraction of precipitation being being in the form of snowfall. The report also drops the homogeneity (stationarity) condition usually imposed in studying the Markov order. 
In fact the model proposed here can accommodate both continuous variables (here time, and potentially geographical location and other explanatory variables) and categorical variables (e.g. precipitation occurred/not occurred on a given day). An issue with increasing the order of a Markov chain is the exponential increase in the number of parameters in the model. Here, as a special case, we propose models in which increasing the order of the Markov chain adds only one parameter. Other authors such as Raftery in [40] and Ching in [13] have proposed other methods to reduce the number of parameters. The dataset used in this study contains more than 110 years of daily precipitation for some stations. This allows us to look at some properties of the precipitation process, such as stationarity, more closely.

4.2 Models for 0-1 precipitation process

In the light of the Categorical Expansion Theorem (Theorem 3.5.6) from the previous chapter, we know all the possible forms of rth–order Markov chains for binary data. Since this theorem gives us linear forms, time series following generalized linear models (TGLM) provides a method to estimate the parameters. For two reasons it is beneficial to study simpler models rather than a full model:

1. There are a large number of parameters to estimate in the full model.

2. There are better interpretations for the parameters in simpler models.

We introduce a few processes that are useful in modeling precipitation:

• Yt represents the occurrence of precipitation on day t. Here Yt is a binary process with 1 denoting precipitation and 0 denoting its absence on day t.

• N^l_{t−1} = Σ_{j=1}^{l} Yt−j represents the number of P N days in the past l days.

• Binary processes for modeling m years, say l1 to l2. Here, we define the binary processes Alt, l ∈ [l1, l2], by Alt = 1 if t belongs to the year l, and Alt = 0 otherwise. This is a binary deterministic process to model the year effect.

• Seasonal (deterministic) processes: cos(ωt) and sin(ωt), where ω = 2π/366. We can also consider higher order terms in the Fourier series, cos(ωnt) and sin(ωnt), where n is a natural number.

Some possibly interesting models present themselves when Zt−1 is a covariate process. The probability of precipitation today depends on the value of that covariate process, and those processes might include:

• Zt−1 = (1, N^l_{t−1}). This model assumes that the probability of P N today only depends on the number of P N days during the l previous days.

• Zt−1 = (1, N^l_{t−1}, Yt−1). This model assumes that the P N occurrence today depends on the P N occurrence yesterday and the number of P N occurrences during the l previous days.

• Zt−1 = (1, cos(ωt), sin(ωt), N^l_{t−1}, Yt−1).

• Zt−1 = (1, cos(ωt), sin(ωt), Yt−1).

• Zt−1 = (1, Yt−1, · · · , Yt−r). This is a special case of a Markov chain of order r. No interaction between the days is assumed. In this model increasing the order of the Markov chain by one corresponds to adding one parameter to the model.
First, we make the plot of transition probabilities for a few locations. We pick Calgary and Banff, which have a rather long period of data available for P N . We have also repeated the procedure for some other locations such as Edmonton and seen similar results. Figures 4.1 to 4.7 show the plots for Banff. For Calgary see plots in Chapter 2. Figure 4.1 plots the estimated 1st–order transition probabilities p̂11 (the probability of precipitation if precipitation occurs the day before) and p̂01 (the probability of precipitation if it does not occur the day before). These transition probabilities are estimated using the observed data. For example p̂11 for January 5th is estimated by nn111 , where n11 is the number of pairs of days (Jan. 4th, Jan. 5th) with precipitation and n1 is the number of Jan. 5th with precipitation during available years. Figures 4.2 and 4.3 show similar plots for estimated 2nd–order transition probabilities. Figures 4.4 and 4.5 give the estimated annual probability of precipitation for Banff and Calgary computed by dividing the number of wet days of a year by the number of days in that year. The plot of the logit function and the transformed estimated probability of precipitation in Banff are shown in Figures 4.6 and 4.7. We summarize the conclusions and conjectures based on the exploratory analysis of the data as followings: • The binary P N process is not stationary. Figure 4.1 shows that the transition probabilities change over time and depend on the season. • Figure 4.1 also suggests the transition probabilities change continuously over time. Although a high variation is seen in the higher order probabilities, a generally continuous trend is observed. There is a periodic trend for the transition probabilities over the course of the year 97  4.3. Exploratory analysis of the data and a simple periodic function should suffice modeling these probabilities. • Figure 4.1 suggests p11 and p01 differ over the course of the year, so a 0th–order Markov chain (independent) does not seem appropriate. • Figure 4.2 plots the curves p̂111 , p̂011 and Figure 4.3 plots the curves p̂001 , p̂101 . They have considerable overlaps over the course of the year. Therefore a 2nd–order Markov chain does not seem necessary. • Figures 4.4 and 4.5 show the estimated probability of precipitation for different years, computed by averaging through the days of a given year. The probability of precipitation seems to differ year–to–year. It also seems that consecutive years have similar probability and hence assuming that different years are identically distributed and independent does not seem reasonable. The probability of precipitation has increased over the past century for Calgary, while for Banff the probability of precipitation seems to have been changing with a more irregular pattern. • Figure 4.6 shows the plot of the logit function, while Figure 4.7 shows the result of applying the logit function to the estimated probabilities. We observe how the logit function transforms the values between 0 and 1 to a wider range in R. Since logit is an increasing function the peaks are observed at the same time as the original values. The Categorical Expansion Theorem (Theorem 3.5.6) shows the general form for binary rth–order Markov processes. Table 4.8 compares all possible 2nd–order Markov chains (including the constant process). We discuss the implications of these possible models and use the following abbreviations: Y k = Yt−k , COS = cos(ωt), SIN = sin(ωt), COS2 = cos(2ωt) and SIN 2 = sin(2ωt). 
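For reference, the estimated transition probabilities plotted in the figures below can be computed directly from the 0-1 record as conditional relative frequencies, one value per day of the year. The following R sketch is our own illustration, with hypothetical column names (year, doy for the day of the year, and wet for the 0-1 indicator); it is not the code used to produce the figures.

```r
## Day-of-year transition probabilities from a daily 0-1 precipitation record
## stored in a data frame `prec` with columns year, doy and wet (hypothetical
## format).  p01(d): P(wet on day d | dry on day d-1); p11(d): P(wet | wet).
trans_prob <- function(prec) {
  prec <- prec[order(prec$year, prec$doy), ]
  wet_prev <- c(NA, head(prec$wet, -1))     # yesterday's indicator
  wet_prev[prec$doy == 1] <- NA             # do not pair across year boundaries
  ok  <- !is.na(wet_prev)
  p01 <- tapply(prec$wet[ok & wet_prev == 0], prec$doy[ok & wet_prev == 0], mean)
  p11 <- tapply(prec$wet[ok & wet_prev == 1], prec$doy[ok & wet_prev == 1], mean)
  list(p01 = p01, p11 = p11)
}

## Example use, assuming `banff` holds the Banff record in this format:
## tp <- trans_prob(banff)
## plot(as.numeric(names(tp$p11)), tp$p11, type = "l", lty = 3, ylim = c(0, 1),
##      xlab = "Day of the year", ylab = "Probability")
## lines(as.numeric(names(tp$p01)), tp$p01, lty = 2)
```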
Figure 4.1: The transition probabilities for the Banff site. The dotted line represents p̂11 (the estimated probability of precipitation if precipitation occurs the day before) and the dashed represents p̂01 (the estimated probability of precipitation if precipitation does not occur the day before).

Figure 4.2: The solid curve represents p̂111 (the estimated probability of precipitation if during both two previous days precipitation occurs) and the dashed curve represents p̂011 (the estimated probability that precipitation occurs if precipitation occurs the day before and does not occur two days ago) for the Banff site.

Figure 4.3: The solid curve represents p̂001 (the estimated probability of precipitation occurring if it does not occur during the two previous days) and the dotted curve is p̂101 (the estimated probability that precipitation occurs if precipitation does not occur the day before but occurs two days ago) for the Banff site.

Figure 4.4: Banff’s estimated mean annual probability of precipitation calculated from historical data.

Figure 4.5: Calgary’s estimated mean annual probability of precipitation calculated from historical data.

Figure 4.6: The logit function: logit(x) = log(x/(1 − x)).

Figure 4.7: The logit of the estimated probability of precipitation in Banff for different days of the year.

Some proposed models:

• Zt−1 = 1: The probability of P N ’s occurrence does not depend on the previous days. In other words days are independent.

• Zt−1 = (1, Y 1): The probability of P N today depends only on the day before and given the latter’s value, it is independent of the other previous days.

• Zt−1 = (1, Y 2): The probability of P N given the information for the day before yesterday is independent of other previous days, in particular yesterday! This does not seem reasonable.

• Zt−1 = (1, Y 1, Y 2): This model includes both Y 1 and Y 2. One might suspect that it has all the information and therefore is the most general 2nd–order Markov model. However, note that in the model the transformed conditional probability is a linear combination of the past two states:

logit{P (Y = 1|Y 1, Y 2)} = α0 + α1 Y 1 + α2 Y 2,

which implies

logit{P (Y = 1|Y 1 = 0, Y 2 = 0)} = α0,
logit{P (Y = 1|Y 1 = 1, Y 2 = 0)} = α0 + α1,
logit{P (Y = 1|Y 1 = 0, Y 2 = 1)} = α0 + α2,
and
logit{P (Y = 1|Y 1 = 1, Y 2 = 1)} = α0 + α1 + α2.

We conclude that

logit{P (Y = 1|Y 1 = 1, Y 2 = 0)} − logit{P (Y = 1|Y 1 = 0, Y 2 = 0)}
= logit{P (Y = 1|Y 1 = 1, Y 2 = 1)} − logit{P (Y = 1|Y 1 = 0, Y 2 = 1)} = α1.
In other words, the model implies that no matter what the value Y 2 has, the differences between the conditional probabilities given Y 1 = 1 and given Y 1 = 0 (in the logit scale) are the same. • Zt−1 = (1, Y 1 Y 2 ): Among other things, this model implies that the conditional probabilities given (Y 1 = 0, Y 2 = 1), (Y 1 = 1, Y 2 = 0) or (Y 1 = 0, Y 2 = 0) are the same.  104  4.4. Comparing the models using BIC • Zt−1 = (1, Y 1 , Y 1 Y 2 ): Among other things this model implies that the conditional probabilities given any of the pairs (Y 1 = 0, Y 2 = 0) or (Y 1 = 0, Y 2 = 0) are the same. • Zt−1 = (1, Y 2 , Y 1 Y 2 ): The interpretation is similar to the previous case. • Zt−1 = (1, Y 1 , Y 2 , Y 1 Y 2 ) : This is the full 2nd–order stationary Markov model with no restrictive assumptions as shown by Categorical Expansion Theorem. The above explanations show that one must be careful about the assumptions made about any proposed model. Including/dropping various covariates can lead to implications that might be unrealistic.  4.4  Comparing the models using BIC  This section uses the methods developed previously to find appropriate models for the 0-1 P N process. We use the P N data for Calgary from 2000 to 2004. We compare several models using the BIC criterion. The partial likelihood is computed and then maximized using the “optim” function in “R”. Using “Time Series Following Generalized Linear Models” as discussed by Kedem et al. in [27], for binary time series with the canonical link function, we have: P (Yt = 1|Zt−1 ) = logit−1 (αZt−1 ), and, P (Yt = 0|Zt−1 ) = 1 − logit−1 (αZt−1 ). We conclude that the log partial likelihood is equal to: N X t=1  X  1≤t≤N,Yt =1  log P (Yt |Zt−1 ) =  log(logit−1 (αZt−1 )) +  X  1≤t≤N,Yt =0  log(1 − logit−1 (αZt−1 )).  105  4.4. Comparing the models using BIC To ensure that the maximum picked by “optim” in the R package is close to the actual maximum, several initial values were chosen randomly until stability was achieved. In order to find an optimal model to describe a binary (0-1) P N process, we can include several factors such as previous values of the process, seasonal terms, previous maximum temperature values and so on. We have done this comparison in several tables. The smallest BIC in the tables is shown by boldface. Table 4.1 shows the constant process 1 and N l , the number of wet days during l previous days, as predictors. Note that N 1 = Y 1 . The BIC criterion in this case picks the simplest model which includes only the previous day. Hence a 1st–order Markov chain is chosen among these particular lth–order chains. Model: Zt−1 (1, N 1 ) (1, N 2 ) (1, N 3 ) (1, N 4 ) (1, N 5 ) (1, N 6 ) (1, N 7 ) (1, N 8 ) (1, N 9 ) (1, N 10 ) (1, N 11 ) (1, N 12 ) (1, N 13 ) (1, N 14 ) (1, N 15 )  BIC 2268.1 2294.5 2293.4 2292.7 2296.9 2305.9 2311.3 2317.2 2322.1 2325.6 2330.4 2335.7 2336.3 2340.5 2342.6  parameter estimates (−1.035, 1.268) (−1.097, 0.726) (−1.181, 0.559) (−1.244, 0.462) (−1.281, 0.390) (−1.292, 0.331) (−1.308, 0.291) (−1.317, 0.258) (−1.32, 0.232) (−1.34, 0.212) (−1.34, 0.193) (−1.34, 0.177) (−1.36, 0.168) (−1.35, 0.155) (−1.36, 0.146)  Table 4.1: BIC values for models including N l , the number of precipitation days during the past l days for the Calgary site. Table 4.2 compares models with predictors: 1, Y l and N l , l = 1, 2, · · · , 30. Since Y 1 = N 1 the first row is obviously an over–parameterized model. The smallest BIC corresponds to the model (1, Y 1 , N 28 ). Even the model (1, Y 1 , N 4 ) shows an improvement over (1, Y 1 ). 
Hence by adding the number of P N days to the simple model (1, Y 1 ), an improvement is achieved.  106  4.4. Comparing the models using BIC Model: Zt−1 1  1  (1, Y , N ) (1, Y 1 , N 2 ) (1, Y 1 , N 3 ) (1, Y 1 , N 4 ) (1, Y 1 , N 5 ) (1, Y 1 , N 6 ) (1, Y 1 , N 7 ) (1, Y 1 , N 8 ) (1, Y 1 , N 9 ) (1, Y 1 , N 10 ) (1, Y 1 , N 11 ) (1, Y 1 , N 12 ) (1, Y 1 , N 13 ) (1, Y 1 , N 14 ) (1, Y 1 , N 15 ) (1, Y 1 , N 16 ) (1, Y 1 , N 17 ) (1, Y 1 , N 18 ) (1, Y 1 , N 19 ) (1, Y 1 , N 20 ) (1, Y 1 , N 21 ) (1, Y 1 , N 22 ) (1, Y 1 , N 23 ) (1, Y 1 , N 24 ) (1, Y 1 , N 25 ) (1, Y 1 , N 26 ) (1, Y 1 , N 27 ) (1, Y 1 , N 28 ) (1, Y 1 , N 29 ) (1, Y 1 , N 30 )  BIC  parameter estimates  2275.6 2270.2 2258.3 2250.6 2247.5 2248.2 2247.1 2247.5 2247.6 2247.4 2248.3 2249.6 2248.1 2249.7 2249.5 2249.0 2245.3 2246.8 2246.8 2245.6 2246.0 2247.6 2245.9 2246.0 2246.8 2246.6 2246.2 2244.7 2245.4 2246.2  (-1.04, -0.40, 1.67) (-1.10, 0.94, 0.255) (-1.21, 0.88, 0.279) (-1.28, 0.88, 0.254) (-1.32, 0.91, 0.221) (-1.34, 0.95, 0.187) (-1.37, 0.97, 0.167) (-1.39, 0.99, 0.149) (-1.40, 1.01, 0.136) (-1.42, 1.02, 0.126) (-1.43, 1.04, 0.115) (-1.43, 1.05, 0.105) (-1.46, 1.06, 0.102) (-1.46, 1.07, 0.0945) (-1.47, 1.07, 0.0905) (-1.49, 1.08, 0.0872) (-1.51, 1.08, 0.0853) (-1.53, 1.08, 0.0831) (-1.55, 1.08, 0.0820) (-1.56, 1.08, 0.0787) (-1.56, 1.08, 0.0749) (-1.55, 1.09, 0.0703) (-1.58, 1.09, 0.0701) (-1.58, 1.09, 0.0678) (-1.58, 1.10, 0.0647) (-1.59, 1.10, 0.0632) (-1.60, 1.10, 0.0618) (-1.62, 1.10, 0.0615) (-1.62, 1.10, 0.0593) (-1.622, 1.11, 0.0571)  Table 4.2: BIC values for models including N l , the number of wet days during the past l days and Y 1 , the precipitation occurrence of the previous day for the Calgary site. Table 4.3 compares models with predictors (1, N l , COS, SIN ). We have added (COS, SIN ) to capture the seasonality in the precipitation over a year. (1, N 1 , COS, SIN ) (which is the same as (1, Y 1 , COS, SIN )) is the winner. Note that this model is better than the simpler model (1, Y 1 ) or the model (1, Y 1 , N 28 ).  107  4.4. Comparing the models using BIC Model: Zt−1 1  BIC  (1, N , COS, SIN ) (1, N 2 , COS, SIN ) (1, N 3 , COS, SIN ) (1, N 4 , COS, SIN ) (1, N 5 , COS, SIN ) (1, N 6 , COS, SIN ) (1, N 7 , COS, SIN ) (1, N 8 , COS, SIN ) (1, N 9 , COS, SIN ) (1, N 10 , COS, SIN )  2222.5 2254.6 2260.1 2264.1 2270.8 2280.5 2286.7 2293.0 2293.1 2302.2  parameter estimates (-1.00, (-1.02, (-1.07, (-1.11, (-1.12, (-1.11, (-1.11, (-1.09, (-1.08, (-1.07,  1.10, -0.588, 0.0999) 0.592, -0.564, 0.0977) 0.443, -0.538, 0.0961) 0.359, -0.518, 0.0959) 0.295,-0.508, 0.0971) 0.240, -0.510, 0.0999) 0.205, -0.508, 0.101) 0.176, -0.511, 0.103) 0.153, -0.513, 0.105) 0.136, -0.516, 0.107)  Table 4.3: BIC values for models including N l , the number of wet days during the past l days and seasonal terms for the Calgary site. Table 4.4 includes Y 1 , seasonal terms and N l for l = 1, 2, · · · , 10 as predictors. The model with predictors (1, Y 1 , N 5 , COS, SIN ), which includes a combination of seasonal terms and number of precipitation days has the smallest BIC so far. Note that both the seasonal terms and the number of precipitation days prior to the day we are looking at, are indicators of “weather conditions”. There are natural cycles throughout the year that can inform us about the weather conditions of a particular day of the year. These natural cycles are modeled by the periodic functions COS and SIN . 
Also by looking at a short period prior to the current day (short–term past), we might be able to determine the weather conditions. Precipitation may not follow a very regular seasonal pattern similar to temperature as shown in the exploratory analysis. Which one of these variables (seasonal or short–term past) is more important or necessary might depend on the location and other factors.  108  4.4. Comparing the models using BIC Model: Zt−1 1  BIC  1  (1, Y , N , COS, SIN ) (1, Y 1 , N 2 , COS, SIN ) (1, Y 1 , N 3 , COS, SIN ) (1, Y 1 , N 4 , COS, SIN ) (1, Y 1 , N 5 , COS, SIN ) (1, Y 1 , N 6 , COS, SIN ) (1, Y 1 , N 7 , COS, SIN ) (1, Y 1 , N 8 , COS, SIN ) (1, Y 1 , N 9 , COS, SIN ) (1, Y 1 , N 10 , COS, SIN )  2230.0 2229.2 2224.8 2222.1 2221.7 2223.3 2223.7 2224.7 2225.5 2226.0  parameter estimates (-1.00, (-1.03, (-1.10, (-1.14, (-1.16, (-1.16, (-1.17, (-1.16, (-1.16, (-1.16,  -2.31, 3.41, -0.589, 0.0999) 0.977, 0.0997, -0.576, 0.0985) 0.895, 0.156, -0.546, 0.0946) 0.89, 0.147, -0.525, 0.0941) 0.922, 0.124, -0.515, 0.0934) 0.959, 0.0954, -0.517, 0.0946) 0.978, 0.0822, -0.513, 0.0947) 0.997, 0.0682, -0.515, 0.0945) 1.0129, 0.0582, -0.515, 0.0961) 1.026, 0.0502, -0.517, 0.0958)  Table 4.4: BIC values for models including N l , the number of P N days during the past l days, Y 1 , the precipitation occurrence of the previous day and seasonal terms for the Calgary site. Table 4.5 compares models with different number of predictors from (1, Y 1 ) to (1, Y 1 , · · · , Y 7 ). The first model is a 1st–order Markov chain and the last one is a 7th– order chain. The optimal model picked is: (1, Y 1 , Y 2 , Y 3 ). Comparing this table to Table 4.2, we see that (1, Y 1 , N 3 ) is superior to (1, Y 1 ), (1, Y 1 , Y 2 ) and (1, Y 1 , Y 2 , Y 3 ). Note that (1, Y 1 , N 3 ) is equivalent to (1, Y 1 , Y 2 + Y 3 ). Hence, including Y 2 and Y 3 and giving them the same weight is better than not including them, including one of them or including both of them. Model: Zt−1 (1, Y 1 ) (1, Y 1 , Y 2 ) (1, Y 1 , Y 2 , Y 3 ) (1, Y 1 , · · · , Y 4 ) (1, Y 1 , · · · , Y 5 ) (1, Y 1 , · · · , Y 6 ) (1, Y 1 , · · · , Y 7 )  BIC 2268.1 2270.2 2263.3 2263.9 2268.5 2335.4 2286.7  parameter estimates (-1.034, 1.27) (-1.11, 1.20, 0.23) (-1.21, 1.19, 0.140, 0.410) (-1.28, 1.16, 0.133, 0.334, 0.281) (-1.32, 1.15, 0.121, 0.328, 0.232, 0.192) (-1.34, 1.15, 0.0837, 0.357, 0.213, 0.135, 0.115) (-1.51, 1.33, -0.113, 0.378, 0.418, 0.204, -0.0050, 0.214)  Table 4.5: BIC values for Markov models of different order with small number os parameters for the Calgary site. Table 4.6 compares models with different Markov orders plus the seasonal terms. The model (1, Y 1 , COS, SIN ) is the winner. Hence, whether we include the seasonal terms or not, the model that only depends on the previous day is the winner. 109  4.4. Comparing the models using BIC Model: Zt−1  BIC 1  (1, COS, SIN, Y ) (1, COS, SIN, Y 1 , Y 2 ) (1, COS, SIN, Y 1 , Y 2 , Y 3 ) (1, COS, SIN, Y 1 , · · · , Y 4 ) (1, COS, SIN, Y 1 , · · · , Y 5 ) (1, COS, SIN, Y 1 , · · · , Y 6 ) (1, COS, SIN, Y 1 , · · · , Y 7 )  parameter estimates  2222.6 2229.1 2230.4 2247.3 2243.4 2501.6 2447.3  (-1.0, (-1.0, (-1.1, (-1.1, (-1.3, (-1.2, (-1.1,  -0.5, -0.5, -0.5, -0.5, -0.4, -1.5, -0.2,  0.1, 1.1) 0.1, 1.0, 0.1) 0.1, 1.0, 0.02, 0.3) 0.1, 1.0, 0.03, 0.2, 0.15) 0.2, 1.4, -0.4, -0.1, 1.0, -0.15) 0.4, 0.2, 0.8, 0.9, 0.9, -0.6, -0.2) 0.07, 0.8, -0.02, 0.3, 0.4, -0.07, 0.4, -0.3)  Table 4.6: BIC values for Markov models with different order plus seasonal terms for the Calgary site. 
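The comparisons in these tables can be reproduced with standard software. Because the canonical logit link is used, the log partial likelihood written at the start of this section has the same form as an ordinary Bernoulli log-likelihood, so in this binary case glm can be used in place of a hand-coded call to optim. The sketch below is our own illustration, not the thesis code; the data layout is hypothetical and the BIC values may differ slightly from the tables depending on how the first few days of the record are handled.

```r
## Build lagged covariates from a 0-1 series y with day-of-year vector doy,
## fit a few candidate covariate processes by logistic regression and
## compare their BIC values.
lag_k <- function(y, k) c(rep(NA, k), head(y, -k))   # the process Y^k
omega <- 2 * pi / 366

fit_bic <- function(y, doy, l = 5) {
  dat <- data.frame(
    y   = y,
    Y1  = lag_k(y, 1),
    Nl  = Reduce(`+`, lapply(1:l, function(k) lag_k(y, k))),  # N^l_{t-1}
    COS = cos(omega * doy),
    SIN = sin(omega * doy))
  dat <- na.omit(dat)              # drop the first l days (no complete past)
  models <- list(
    "1, Y1"               = y ~ Y1,
    "1, Y1, COS"          = y ~ Y1 + COS,
    "1, Y1, Nl, COS"      = y ~ Y1 + Nl + COS,
    "1, Y1, Nl, COS, SIN" = y ~ Y1 + Nl + COS + SIN)
  sapply(models, function(fm) BIC(glm(fm, family = binomial, data = dat)))
}

## Example use, assuming `calgary` has columns wet (0-1) and doy:
## fit_bic(calgary$wet, calgary$doy, l = 5)
```

Fitting all candidate models on the same reduced data set (after dropping the first l days) keeps the BIC values directly comparable.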
Table 4.7 studies seasonality more. We consider the possibility that there are more/less terms of the Fourier series of a periodic function over the year. It turns out that the model with (1, Y 1 , COS) is the optimal model so far. Hence, only one term seem to suffice modeling the seasonal nature of the process. Model: Zt−1 (1, COS) (1, SIN ) (1, COS, SIN ) (1, Y 1 , COS) (1, Y 1 , SIN ) (1, Y 1 , COS, SIN ) (1, Y 1 , COS, SIN, COS2) (1, Y 1 , COS, SIN, SIN 2) (1, Y 1 , COS, SIN, COS2, SIN 2)  BIC 2322.7 2424.3 2327.3 2216.9 2273.9 2222.6 2229.7 2230.0 2237.2  parameter estimates (-0.556, -0.717) (-0.523, 0.115) (-0.568, -0.738, 0.119) (-1.00 , 1.10, -0.587) (-1.03, 1.26, 0.0933) (-1.004, 1.102, -0.589, 0.100) (-1.00, 1.10, -0.586, 0.0998, 0.0247) (-1.00, 1.10, -0.590, 0.101, 0.0125) (-1.01, 1.11, -0.575, 0.0978, 0.0236, -0.0101)  Table 4.7: BIC values for models including seasonal terms and the occurrence of precipitation during the previous day for the Calgary site. Table 4.8 compares all stationary 2nd–order Markov models. The smallest BIC corresponds to (1, Y 1 ).  110  4.4. Comparing the models using BIC Model: Zt−1 (1) (1, Y 1 ) (1, Y 2 ) (1, Y 1 , Y 2 ) (1, Y 1 Y 2 ) (1, Y 1 , Y 1 Y 2 ) (1, Y 2 , Y 1 Y 2 ) (1, Y 1 , Y 2 , Y 1 Y 2 )  BIC 2419.6 2268.0 2392.8 2270.2 2335.5 2272.7 2342.3 2277.7  parameter estimates (-0.528) (-1.04, 1.27) (-0.756, 0.590) (-1.110, 1.197, 0.256) (-0.779, 1.134) (-1.040, 1.113, 0.282) (-0.757, -0.113, 1.225) ( -1.103, 1.177, 0.234, 0.048)  Table 4.8: BIC values for 2nd–order Markov models for precipitation at the Calgary site. Table 4.9 compares all 2nd–order Markov chains with a seasonal COS term. The model (1, Y 1 , COS) is the winner. Model: Zt−1 (1, COS) (1, COS, Y 1 ) (1, COS, Y 2 ) (1, COS, Y 1 Y 2 ) (1, COS, Y 1 , Y 2 ) (1, COS, Y 1 , Y 1 Y 2 ) (1, COS, Y 2 , Y 1 Y 2 ) (1, COS, Y 1 , Y 2 , Y 1 Y 2 )  BIC 2322.7 2216.8 2317.4 2223.5 2276.1 2223.9 2280.9 2231.0  parameter estimates (-0.567, (-1.005, (-0.708, (-0.760, (-1.033, (-1.004, (-0.709, (-1.028,  -0.738) -0.587, 1.106) -0.679, 0.372) -0.618, 0.905) -0.575, 1.080, 0.103) -0.580, 1.041, 0.120) -0.632, -0.244, 1.093) -0.575, 1.065, 0.085, 0.037)  Table 4.9: BIC values for 2nd–order Markov models for precipitation at the Calgary site plus seasonal terms. Table 4.10 also includes the maximum and minimum temperature of the day before, as predictors of some of the models which performed better in the above tables. We have also included the annual processes A1 , · · · , A5 to one of the models. Finally, we have included the model (1, Y 1 , N 5 , COS). This model has a combination of the seasonal term COS and the short–term past process N 5 which did the best when combined with the seasonal terms and Y 1 in Table 4.4. It turns out that including M T and mt does not improve the BIC as well as does the annual terms. However, (1, Y 1 , N 5 , COS) has the smallest BIC in all the models, which is a seasonal Markov chain of order 5 with only 4 parameters. Also the simpler model, (1, Y 1 , COS), has a close BIC to (1, Y 1 , N 5 , COS). 111  4.5. 
Changing the location and the time period Model: Zt−1 1  (1, COS, Y ) (1, Y 1 , COS, M T 1 ) (1, Y 1 , COS, mt1 ) (1, Y 1 , COS, M T 1 , mt1 ) (1, Y 1 , COS, A1 , · · · , A5 ) (1, Y 1 , N 5 , COS, M T 1 ) (1, Y 1 , N 5 , COS, SIN, M T 1 , mt1 ) (1, Y 1 , N 5 , COS, M T 1 , mt1 ) (Y 1 , N 5 , COS, M T 1 , A1 , · · · , A5 ) (Y 1 , N 5 , COS, A1 , · · · , A5 ) (1, Y 1 , M T 1 ) (1, Y 1 , N 5 , COS) (1, Y 1 , N 5 , COS, M T 1 )  BIC 2216.8 2221.7 2224.2 2227.4 2241.2 2297.3 2516.8 2393.9 2697.1 2447.1 2251.5 2215.8 2223.8  parameter estimates (-1.005, -0.587, 1.106) (-0.84, 1.0, -0.74, -0.012) (-1.0, 1.0, -0.65, -0.0055) (-0.65, 0.99, -0.67, -0.025, 0.022) ( 1.1, -0.5, -0.9, -1.2, -1.1, -1.0, -0.7) (-2.13, 0.9, 0.4, 0.6, 0.2, 0.04) (1.4, 0.04, 0.2, 0.7, 0.8, -0.2, 0.3) ( 1.4, 0.7, -0.1, -0.5, 0.5, -0.1, 0.2) (1.23, -0.64, -2.0, -0.10, 2.0, 1.2, 2.2, 1.2, 1.8) (0.1, 0.1, -0.7, -0.39, -0.01, -0.2, -0.9, -1) (-1.2, 1.3, 0.021) (-1.1, 0.9, 0.1, -0.5) (-1.2, 0.9, 0.1, -0.4, 0.0)  Table 4.10: BIC values for models including several covariates as temperature, seasonal terms and year effect for precipitation at the Calgary site.  4.5  Changing the location and the time period  This section compares various models for a different time period and location. Table 4.11 compares various models for the 0-1 P N process in Calgary between 1990 and 1994 which is a 5–year period. In Table 4.12, we have compared several models for 0-1 P N process over Medicine Hat site between 2000 and 2004. Table 4.11 shows that among the compared models (1, Y 1 , COS) has the smallest BIC. In particular the BIC for this model is smaller than the BIC for (1, Y 1 , N 5 , COS) which has the smallest BIC for Calgary 2000– 2004. However (1, Y 1 , COS) was the second optimal model also for Calgary 2000–2004 with a close BIC to the optimal. Including the maximum and minimum temperature to the model increases the BIC again.  112  4.5. Changing the location and the time period Model: Zt−1 1  (1, Y ) (1, Y 1 , Y 2 ) (1, Y 1 , COS) (1, Y 1 , N 5 ) (1, Y 1 , N 10 ) (1, Y 1 , N 15 ) (1, Y 1 , COS, SIN ) (1, Y 1 , N 5 , COS) (1, Y 1 , N 5 , SIN ) (1, Y 1 , N 5 , COS, SIN ) (1, Y 1 , N 10 , COS) (1, Y 1 , N 10 , COS, SIN ) (1, Y 1 , N 5 , COS, M T 1 ) (1, Y 1 , N 5 , COS, mt1 )  BIC 2312.7 2318.8 2228.8 2303.3 2287.9 2282.7 2231.9 2236.4 2307.8 2239.4 2236.4 2239.4 2244.3 2244.1  parameter estimates (-0.931, (-0.967, (-0.858, (-1.168, (-1.581, (-1.486, (-0.855, (-0.864, (-1.160, (-0.849, (-0.847, (-0.847, (-0.433, (-0.910,  1.275) 1.238, 0.126) 1.036, -0.712) 1.012, 0.168) 1.015, 0.132) 1.045, 0.105) 1.026, -0.715 , 0.152) 1.032, 0.004, -0.709) 1.011, 0.164, 0.125) 1.031, -0.004, -0.718, 0.152) 1.030, -0.002, -0.721, 0.153) 1.030, -0.002, -0.721 , 0.153) 1.046, -0.096, -1.078, -0.021) 1.011, 0.031, -0.584, 0.006)  Table 4.11: BIC values for several models for the binary process of precipitation in Calgary, 1990–1994 Table 4.12 shows that the smallest BIC corresponds to (1, Y 1 , COS). However, several models have similar BIC values. Also, including the maximum and minimum temperature increases the BIC here. 
Model: Zt−1                   BIC      parameter estimates
(1, Y 1)                      2202.9   (-1.138, 1.094)
(1, Y 1, Y 2)                 2207.9   (-1.183, 1.051, 0.181)
(1, Y 1, N 5)                 2203.6   (-1.275, 0.921, 0.119)
(1, Y 1, N 10)                2228.9   (-0.858, 1.036, -0.712)
(1, Y 1, N 15)                2200.5   (-1.420, 0.980, 0.065)
(1, Y 1, N 20)                2202.5   (-1.421, 1.008, 0.048)
(1, Y 1, COS)                 2201.2   (-1.134, 1.067, -0.224)
(1, Y 1, COS, SIN)            2202.9   (-1.132, 1.052, -0.225, 0.177)
(1, Y 1, N 5, COS)            2203.9   (-1.252, 0.924, 0.101, -0.201)
(1, Y 1, N 5, SIN)            2206.6   (-1.263, 0.922, 0.109, 0.158)
(1, Y 1, N 5, COS, SIN)       2206.6   (-1.239, 0.925, 0.091, -0.204, 0.163)
(1, Y 1, N 10, COS)           2201.9   (-1.336, 0.958, 0.073, -0.183)
(1, Y 1, N 10, COS, SIN)      2205.1   (-1.311, 0.958, 0.065, -0.187, 0.151)
(1, Y 1, N 5, COS, M T 1)     2306.5   (-1.455, 2.099, -0.130, 0.041, 0.004)
(1, Y 1, N 5, COS, mt1)       2211.1   (-1.238, 0.937, 0.087, -0.267, -0.005)
(1, Y 1, N 15, COS)           2202.7   (-1.363, 0.981, 0.053, -0.175)

Table 4.12: BIC values for several models for precipitation occurrence in Medicine Hat, 2000–2004.

In summary, in all three cases (1, Y 1, COS) is either the optimal model or second to the optimal (using BIC). We have also computed BIC for Calgary over a long time period of close to 100 years, and surprisingly the same simple model (1, Y 1, COS) was optimal.

Chapter 5

On the definition of “quantile” and its properties

5.1 Introduction

This chapter points out deficiencies in the classical definition (as well as some other widely used definitions) of the median and, more generally, the quantile and the so-called quantile function. Moreover, redefining it appropriately gives us a basis on which we can find necessary and sufficient conditions for the sample quantiles to converge for arbitrary distribution functions. In the next chapter, we define a “degree of separation” function to measure the goodness of the approximation (or estimation). We argue that this function can be viewed as a natural loss function for assessing estimations and approximations. One characteristic of this loss function is its invariance under strictly monotonic transformations of the random variable, in particular re-scaling. In this chapter, we use the terms data vector, approximation, estimation, exact and true quantiles repeatedly. To clarify what we mean by these terms, we give the following explanations:

• Data vector: A vector of real numbers. We do not consider these values as random in general. We use the term random vector or random sample for a vector of random variables. We define the quantile for data vectors, but the same definition applies to a random sample.

• Approximation and exact value: Suppose a very large data vector is given. We can compute the exact mean/median of such a vector by using all the data and the definition of mean/median. One can approximate the mean/median using various techniques. Note that both the approximation and exact terms are used for data vectors of (non-random) numbers.

• Estimation and true value: Estimation means finding functions of the random sample to estimate parameters of the underlying distribution. The parameters are called the true values.

The sample definition of quantiles varies across textbooks. In [24], Hyndman et al. point out many different definitions used in statistical packages for quantiles of a sample. In [17], Freund et al. point out various definitions for quartiles of data and propose a new definition using the concept of “hinge”.
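To make the point about differing software conventions concrete, the short sketch below evaluates three common sample-quantile rules on the small vector (1, 2, 3, 4, 5, 6), which is revisited later in this chapter, and shows that they disagree at p = 1/2. The first two rules are close in spirit to the left and right quantiles developed below, and the third is the linear-interpolation rule used by default in many packages; the function names are ours and this is only an illustrative sketch, not code from the thesis.

```python
# Three common conventions for the p-th sample quantile of a data vector.
# The helper names and the example data are ours, for illustration only.
import math

def q_lower(x, p):
    """Smallest order statistic y_k with k/n >= p (an inf-based convention)."""
    y = sorted(x)
    k = max(math.ceil(len(y) * p), 1)
    return y[k - 1]

def q_upper(x, p):
    """Largest order statistic y_k with (k - 1)/n <= p (a sup-based convention)."""
    y = sorted(x)
    k = min(math.floor(len(y) * p) + 1, len(y))
    return y[k - 1]

def q_interp(x, p):
    """Linear interpolation between order statistics (default in many packages)."""
    y = sorted(x)
    h = (len(y) - 1) * p
    lo = math.floor(h)
    hi = min(lo + 1, len(y) - 1)
    return y[lo] + (h - lo) * (y[hi] - y[lo])

x = [1, 2, 3, 4, 5, 6]
print(q_lower(x, 0.5), q_upper(x, 0.5), q_interp(x, 0.5))   # 3 4 3.5
```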
The traditional definition of quantiles for a random variable X with distribution function F , lqX (p) = inf{x|F (x) ≥ p}, appears in classic works as [38]. We call this the “left quantile function”. In some books (e.g. [41]) the quantile is defined as rqX (p) = sup{x|F (x) ≤ p}, this is what we call the “right quantile function”. Also in robustness literature people talk about the upper and lower medians which are a very specific case of these definitions. However, we do not know of any work that considers both definitions, explore their relation and show that considering both has several advantages. A physical motivation is given for the right/left definition of quantiles. It is widely claimed that (e.g. Koenker in [29] or Hao and Naiman in [21]) the traditional quantile function is invariant under monotonic transformations. We show that this does not hold even for strictly increasing functions. However, we prove that the traditional quantile function is invariant under non-decreasing left continuous transformations. We also show that the right quantile function is invariant under non-decreasing right continuous transformations. A similar neat result is found for continuous decreasing transformations using the Quantile Symmetry Theorem also proved in this chapter. Suppose we know that a data point is larger than a known number of other data points and smaller than another known number of data points. Of interest are the quantiles to which this data point corresponds. Lemma 5.2.4 gives a result about this. We will use this lemma later to establish the precision of our proposed algorithm for approximating quantiles of large datasets. Quantiles are often used as the inverse of distribution functions. In general neither the distribution function nor the quantile function are invertible. However Lemma 5.5.1 shows how quantiles can be used to characterize sets of the form {x|F (x) < p}, a case that is equivalent to (−∞, lqF (p)). 116  5.1. Introduction Lemma 5.7.1 shows the left continuity of the left quantile function and the right continuity of the right quantile function. Section 5.8 finds necessary and sufficient conditions for the left and right quantile functions to be equal at p ∈ [0, 1]. We also find out that the left and right quantile functions coincide except for at most a countable number of values in [0,1]. Then we characterize the image of the the left and right quantile functions and show that the image corresponds to “heavy” points (heavy point is a point that the probability of being in a neighborhood around that point is positive). Section 5.9 shows that given any of lq, rq and F uniquely determines the other two and formulas are given in order to find them. We also show that if one of lq and rq is two-sided continuous then so is the other one. Lemma 5.10.1 shows that the strict monotonicity of the distribution function F on its “real domain” {x|0 < F (x) < 1} is equivalent to two-sided continuity of lq/rq. Conversely, strict monotonicity of lq/rq corresponds to continuity of F. Section 5.12 presents the desirable “Quantile Symmetry Theorem”, a result that could be only obtained by considering both left and right quantiles. This relation can help us prove several other useful results regarding quantiles. Also using the quantile symmetry theorem, we find a relation for the equivariance property of quantiles under non-increasing transformations. Section 5.14 studies the limit properties of left and right quantile functions. 
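As a small numerical illustration of the two definitions just described, the sketch below evaluates lqF (p) = inf{x|F (x) ≥ p} and rqF (p) = sup{x|F (x) ≤ p} for the fair-coin distribution (outcomes 0 and 1, each with probability 1/2), which reappears as a running example later in the chapter. At p = 1/2 the two functions return 0 and 1 respectively, the two-state behaviour that motivates working with both. The discrete implementation via searchsorted and the function names are ours, for illustration only.

```python
# Left and right quantile functions for a discrete distribution given by its
# atoms and probabilities; here a fair coin with outcomes 0 and 1.
# The implementation and names are illustrative, not from the thesis.
import numpy as np

atoms = np.array([0.0, 1.0])
probs = np.array([0.5, 0.5])
cdf = np.cumsum(probs)                   # F evaluated at the atoms: [0.5, 1.0]

def lq(p):
    """lqF(p) = inf{x : F(x) >= p}: first atom whose cumulative probability reaches p."""
    return atoms[np.searchsorted(cdf, p, side="left")]

def rq(p):
    """rqF(p) = sup{x : F(x) <= p}: first atom whose cumulative probability exceeds p."""
    return atoms[np.searchsorted(cdf, p, side="right")]

print(lq(0.5), rq(0.5))   # 0.0 1.0  (the left and right medians differ)
print(lq(0.3), rq(0.3))   # 0.0 0.0  (for other p in (0, 1) the two definitions agree)
```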
In Theorem 5.14.7, we show that if left and right quantiles are equal, i.e. lqF (p) = rqF (p), then both sample versions lqFn , rqFn are convergent to the common distribution value. We found an equivalent statement in Serfling [43] with a rather similar proof. The condition for convergence there is said to be lqF (p) being the unique solution of F (x−) < p ≤ F (x) which can be shown to be equivalent to lqF (p) = rqF (p). Note how considering both left and right quantiles has resulted in a cleaner, more comprehensible condition for the limits. In a problem Serfling asks to show with an example that this condition cannot be dropped. We show much more by proving that if lqF (p) 6= rqF (p) then both rqFn (p) and rqFn (p) diverge almost surely. The almost sure divergence result can be viewed as an extension to a well-known result in probability theory which says that if X1 , X2 , · · · an i.i.d sequence P from a fair coin with -1 denoting tail and 1 denoting head and Zn = ni=1 Xi then P (Zn = 0 i.o.) = 1. The proof in [9] uses the Borel–Cantelli Lemma to get around the problem of dependence of Zn . This is equivalent to saying for the fair coin both lqFn (1/2) and rqFn (1/2) diverge almost surely. For the general case, we use the Borel–Cantelli Lemma again. But we also need a lemma (Lemma 5.14.10) which uses the Berry–Esseen Theorem in 117  5.2. Definition of median and quantiles of data vectors and random samples its proof to show the deviations of the sum of the random variables can become arbitrarily large, a result that is easy to show as done in [9] for the simple fair coin example. Finally, we show that even though in the case that lqF (p) 6= rqF (p), lqFn , rqFn are divergent; for large ns they will fall in (lqF (p) − ǫ, lqF (p)] ∪ [rqF (p), rqF (p) + ǫ). In fact we show that lim inf lqFn (p) = lim inf rqFn (p) = lqF (p) n→∞  n→∞  and lim sup lqFn (p) = lim sup rqFn (p) = rqF (p). n→∞  n→∞  The proof is done by constructing a new random variable Y from the original random variable X with distribution function FX by shifting back all the values greater than rqX (p) to lqX (p). This makes lqY (p) = rqY (p) in the new random variable. Then we apply the convergence result to Y .  5.2  Definition of median and quantiles of data vectors and random samples  This section presents a way to define quantiles of data vectors and random samples. We confine our discussion to data vectors since the definition for random samples is merely a formalistic extension. Suppose, we are given a very long data vector. The goal is to find the median of this vector. Let us denote the data vector by x = (x1 , · · · , xn ). Suppose y = (y1 , · · · , yn ) is an increasing sorted vector of elements of x = (x1 , · · · , xn ). Then usually the y +y median of x is defined to be y(n+1)/2 if n is odd and n/2 2(n+2)/2 if n is even. Essentially the median is defined so that half data lies below it and half lies above it. However, when n is even, any value between yn/2 and y(n+2)/2 serves this purpose and taking the average of the two values seems arbitrary. Intuitively, the quantile should have the following properties: 1. It should be a member of the data vector. In other words if x = (x1 , · · · , xn ) is the data vector then the quantile should be equal to one of xi , i = 1, · · · , n. 2. Equivariance: If we transform the data using an increasing continuous transformation of R, find the quantile and transform back, we should get the same result, had we found the quantile of the original data. 118  5.2. 
Definition of median and quantiles of data vectors and random samples More formally, if we denote the quantile of a data vector x for p ∈ (0, 1) by qx (p) then for any φ : R → R strictly increasing and bijective qx (p) = y ⇔ qφ(x) (p) = φ(y). 3. Symmetry: The p-th quantile of the data vector x = (x1 , · · · , xn ) should be the negative of (1 − p)-th quantile of data vector −x = (−x1 , · · · , −xn ): qx (p) = −q−x (1 − p). Particularly, the median of x should be the image of the median of the image of x with respect to 0. 4. The “amount” of data between qx (p1 ) and qx (p2 ) should be p2 − p1 of the the “data amount” of the whole vector if p1 < p2 . 5. If we “cut” a sorted data vector up until the p1 -th quantile and compute the p2 -th quantile for the new vector, we should get the p1 p2 -th quantile of the original vector. For example the median of a sorted vector upto its median should be the first quartile. This chapter develops a definition for quantiles that satisfies the first three conditions. We will address the last two conditions in later chapters and develop a framework in which they are satisfied. Consider the example x = (0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 10). We see that the median by the usual definition is 1.5 not apparent in the observed data. Also if we take bijective, increasing and continuous transformation φ(x) = x3 , we see that the classic definition does not satisfy the second property. The median and quantiles can be defined both for distributions and data vectors (and random samples). For a random variable X having a distribution function F , the p-th quantile is traditionally defined as qF (p) = inf{x|F (x) ≥ p}.  (5.1)  This can be used to define the quantiles of a data vector using the empirical (sample) distribution function Fn , Fn (x) =  n X  1(−∞,xi ] (x).  i=1  119  5.2. Definition of median and quantiles of data vectors and random samples With this definition of the quantile, the equivariance property holds and the result is a realizable data value. This definition faces another issue however. Consider flipping a fair coin with outcomes: 0,1. Then the distribution of X is given by  x<0  0 1/2 0 ≤ x < 1 FX (x) =  1 x≥1  Hence by definition 5.1, qF (p) = 0, p ≤ 1/2 and qF (p) = 1, p > 1/2. This all seems to be reasonable other than qF (p) = 0, p = 1/2. Based on the symmetry of the distribution there should not be any advantage for 0 over 1 to be the median. For the quantiles of the data vectors the same issue occurs. For example, consider x = (1, 2, 3, 4, 5, 6) and apply definition 5.1 to Fn corresponding to this data vector. We will get 3 as the median but in fact 4 should to be as eligible by symmetry. Before to get to our definition of quantile we provide the following motivating examples. Example A student decided to buy a new memory chip for his computer. He needed to choose between the available RAM sizes (1 GB, 2GB etc) in his favorite store. In a trade-off between price and speed, he decided to get a RAM chip that is at least as large as 2/3 RAMs bought in the store during the day before. He could access the information regarding all RAMs bought the day before, in particular their size. He entered the size data into the R package he had recently downloaded for free. He had heard about the quantiles in his elementary statistics course so he decided to compute the quantile of the data for p = 2/3. When he computed that he got 2.666 (GB). 
He knew a RAM of size 2.666 does not exist and concluded this must be a result of an interpolation procedure in R. Since the closest integer to 2.666 is 3 he concluded that 3 GB is the size he is looking for. He went back to the store asking for 3 GB RAM and was told they have never sold such a RAM in that store! He thought there must be an error in the dataset so he looked the data again 1, 1, 1, 1, 2, 2, 2, 2, 4, 4, 4, 4 Surprisingly there was no 3. R had interpolated 2 and 4 to give 2.66 and mislead the student. Example A supervisor asked 2 graduate students to summarize the following data regarding the intensity of the earthquakes in a specific region: 120  5.2. Definition of median and quantiles of data vectors and random samples row number 1 2 3 4 5 6 7 8 9 10  ML (Richter)  A (shaking amplitude)  4.21094 4.69852 4.92185 5.12098 5.21478 5.28943 5.32558 5.47828 5.59103 5.72736  1.62532 × 104 4.99482 × 104 8.35314 × 104 13.21235 × 104 16.39759 × 104 19.47287 × 104 21.16313 × 104 30.08015 × 104 38.99689 × 104 53.37772 × 104  Table 5.1: Earthquakes intensities Earthquake intensity is usually measured in ML scale, which is related to A by the following formula: ML = log10 A. In the data file handed to the students (Table 5.1), the data is sorted with respect to ML in increasing order from top to bottom. Hence the data is arranged decreasingly with respect to A from top to bottom. The supervisor asked two graduate students to compute the center of the intensity of the earthquakes using this dataset. One of the students used A and the usual definition of median and so obtained (16.39759 × 104 + 19.47287 × 104 )/2 = 17.93523 × 104 . The second student used the ML and the usual definition of median to find (5.21478 + 5.28943)/2 = 5.252105. When the supervisor saw the results he figured that the students must have used different scales. Hence he tried to make the scales the same by transforming one of the results 105.252105 = 17.86920 × 104 . To his surprise the results were not quite the same. He was bothered to notice that the definition of median is not invariant under the change of scale which is continuous strictly increasing.  121  5.2. Definition of median and quantiles of data vectors and random samples Example A scientist asked two of his assistants to summarize the following data regarding the acidity of rain: row number 1 2 3 4 5 6 7 8 9 10  pH  aH  4.7336 4.8327 4.8492 5.0050 5.0389 5.2487 5.2713 5.2901 5.5731 5.6105  18.4672 × 10−6 14.6994 × 10−6 14.1514 × 10−6 9.8855 × 10−6 9.1432 × 10−6 5.6403 × 10−6 5.3543 × 10−6 5.1274 × 10−6 2.6724 × 10−6 2.4519 × 10−6  Table 5.2: Rain acidity data pH is defined as the cologarithm of the activity of dissolved hydrogen ions (H + ). pH = − log10 aH. In the data file handed to the students (Table 5.2) the data is sorted with respect to pH in increasing order from top to bottom. Hence the data is arranged decreasingly with respect to aH from top to bottom. The scientist asked the two assistant to compute the 20th and 80th percentile of the data to get an idea of the variability of the acidity. First assistant used the pH scale and the traditional definition of the quantile qF (p) = inf{x|F (x) ≥ p}, where F is the empirical distribution of the data. He got the following two numbers qF (0.2) = 4.8327 and qF (0.8) = 5.2901  (5.2)  these values are positioned in row 2 and 8 respectively. 
The second assistant also used the traditional definition of the quantiles and the aH scale to get qF (0.2) = 2.6724 × 10−6 and qF (0.8) = 14.1514 × 10−6 ,  (5.3)  which correspond to row 9 and 3. 122  5.2. Definition of median and quantiles of data vectors and random samples The scientist noticed the assistants used different scales. Then he thought since one of the scales is in the opposite order of the other and 0.2 and 0.8 are the same distance from 0 and 1 respectively, he must get the other assistant’s result by transforming one. So he transformed the second assistant’s results given in Equation 5.3 (or by simply looking at the corresponding rows, 9 and 3 under pH), to get 5.5731 and 4.8492, which are not the same as the first assistants result in Equation 5.2. He noticed the position of these values are only one off from the previous values (being in row 9 and 3 instead of 8 and 2). Then he tried the same himself for 25th and 75th percentile using both scales pH : qF (0.25) = 4.8492 and qF (0.75) = 5.2901, which are positioned at 3rd and 8th row. aH : qF (0.25) = 5.1274 × 10−6 and qF (0.25) = 14.1514 × 10−6 , which are positioned at row 8th and 3rd. This time he was surprised to observe the symmetry he expected. He wondered when such symmetry exist and what is true in general. He conjectured that the asymmetric definition of the traditional quantile is the reason of this asymmetry. He also thought that the symmetry property is off at most by one position in the dataset. To define the quantile, we perform a thought experiment and use our intuition to decide how it should be defined. Suppose a data vector x = (x1 , · · · , xn ) is given. Define the sort operator which permutes the components of a vector to give a vector with non-decreasing coordinates by sort(x) = (y1 , · · · , yn ). In statistics yi defined as above is called the i–th order statistics of x and is usually denoted by x(i) or xi:n . [This definition extends to random vectors (X1 , · · · , Xn ) as well.] The concept of quantile should only depend on sort(x). Let z = (z1 , · · · , zr ) be the non–decreasing subvector of all distinct elements of x.PIf zi is repeated mi times, we say zi has multiplicity mi and therefore ri=1 mi = n. Now imagine, a uniform bar of length 1. Cut the bar from left to right to r parts of lengths mn1 , · · · , mnr proportional to 123  5.2. Definition of median and quantiles of data vectors and random samples the multiplicity of the zi . Assign a unique color to every zi , i = 1, · · · , r and color its piece with that color. Then reassemble the stick from left to right in the original order. To define the p-th quantile measure a length p from the left hand of the bar (whose total length is one). Determine the reassembled bar’s color at that point. However, this protocol fails at the end points as well as the points where two colors meet. Since each color is an equally eligible choice, we are led to the idea in defining the quantiles of a two–state solution at these points, giving us the left and right quantiles. But proceeding with our bar analogy, the intersection points and boundary points are: m1 m1 + m2 m1 + · · · + mr−1 , ,··· , , 1. 
n n n By the above discussion, if p is not an intersection/boundary point both left and right quantiles, which we denote by lqx and rqx respectively should be the same and equal to  0 < p < mn1  z1 m1 +···+mi−1 i z < p < m1 +···+m lqx (p) = rqx (p) = n n  i m1 +···+mr−1 zr <p<1 n 0,  For the intersection points, if p =  m1 +···+mi−1 n  then  lqx (p) = zi−1 and rqx (p) = zi . For the boundary points we define lqx (0) = −∞, rqx (0) = z1 , lqx (1) = zr , rqx (1) = ∞. As a convention, for a sorted vector y of length n, we define y0 = −∞ and yn+1 = ∞. Lemma 5.2.1 Suppose x is a data vector of length n and y = sort(x) = (y1 , · · · , yn ). Also let y0 = −∞ and yn+1 = ∞. For 0 < p < 1, let [np] denote the integer part of np. Then a) np = [np] ⇒ lqx (p) = y[np], rqx (p) = y[np]+1. b) np > [np] ⇒ lqx (p) = y[np]+1 , rqx (p) = y[np]+1. c) y = sort(x) and pi = i/n, i = 0, 1, · · · , n, implies y = (lqx (p1 ), · · · , lqx (pn )) = (rqx (p0 ), · · · , rqx (pn−1 )). 124  5.2. Definition of median and quantiles of data vectors and random samples Proof a) Let np = h ∈ N. There are four cases: 1. For h = 0 and h = n the result is trivial by the definition of y0 and yn+1 . 2. 0 < h < m1 ⇒ 0 < p < m1 /n and by definition lqx (p) = rqx (p) = z1 . But yh = yh+1 = z1 . 3. There exists 1 < i ≤ r such that m1 +· · ·+mi−1 < h < m1 +· · ·+mi ⇒ m1 +···+mi−1 i < p < m1 +···+m and by definition lqx (p) = rqx (p) = zi . n n But yh = yh+1 = zi since m1 + · · · + mi−1 < h < m1 + · · · + mi . i 4. h = m1 + · · · + mi , i < r ⇒ p = m1 +···+m , i < r. By definition since n this is an intersection point lqx (p) = zi and rqx (p) = zi+1 . But zi = yh and zi+1 = yh+1 .  b) Let h = [np] ⇒ nh < p < h+1 n . Since h and h + 1 differ exactly by one unit, there exists an i such that m1 + · · · + mi−1 h h+1 m1 + · · · + mi ≤ <p< ≤ . n n n n Then by definition lqx (p) = rqx (p) = zi . But since m1 + · · · + mi−1 < h + 1 ≤ m1 + · · · + mi , yh+1 = zi . c) Straightforward consequence of the definition. Suppose y ′ ∈ {y1 , · · · , yn }, for future reference, we define some additional notations for data vectors. Definition The minimal index of y ′ , m(y ′ ) and the maximal index of y ′ , M (y ′ ) are defined as below: m(y ′ ) = min{i|yi = y ′ }, M (y ′ ) = max{i|yi = y ′ }. It is easy to see that in y = sort(x) = (y1 , · · · , yn ) all the coordinates between m(y ′ ) and M (y ′ ) are equal to y ′ . Also note that if y ′ = zi then M (y ′ ) − m(y ′ ) + 1 = mi is the multiplicity of zi . We use the notation mx and Mx whenever we want to emphasize that they depend on the data vector x. 125  5.2. Definition of median and quantiles of data vectors and random samples Lemma 5.2.2 Suppose x = (x1 , · · · , xn ), y = sort(x) and z a non–decreasing vector of all distinct elements of x. Then a) m(zi+1 ) = M (zi ) + 1, i = 0, · · · , r − 1. b) Suppose φ is a bijective increasing transformation over R, mφ (x)(φ(zi )) = mx (zi ), and Mφ(x) (φ(zi )) = Mx (zi ), for i = 1, · · · , r. Proof a) is straightforward. b) Note that mx (y ′ ) = min{i|yi = y ′ } = min{i|φ(yi ) = φ(y ′ )} = mφ(x) (φ(y ′ )). A similar argument works for Mx . We also define the position and standardized position of an element of a data vector. Definition Let x = (x1 , · · · , xn ) be a vector and y = sort(x) = (y1 , · · · , y n). Then for y ′ ∈ {y1 , · · · , yn }, we define posx (y ′ ) = {mx (y ′ ), mx (y ′ ) + 1, · · · , Mx (y ′ )}, where pos stands for position. Then we define the standardized position of y ′ to be sposx (y ′ ) = (  mx (y ′ ) − 1 Mx (y ′ ) , ). 
n n  In the following lemma we show that for every p ∈ spos(y ′ ) (and only p ∈ spos(y ′ )), we have rq(p) = lq(p) = y ′ . For example if 1/2 ∈ spos(y ′ ) then y ′ is the (left and right) median. Lemma 5.2.3 Suppose x = (x1 , · · · , xn ), y = sort(x) = (y1 , · · · , yn ) and y ′ ∈ {y1 , · · · , yn }. Then p ∈ sposx(y ′ ) ⇔ lqx (p) = rqx (p) = y ′ . Proof Let z = (z1 , · · · , zr ) be the reduced vector with multiplicities m1 , · · · , mr . Then y ′ = mi for some i = 1, · · · , r. 126  5.2. Definition of median and quantiles of data vectors and random samples case I: If i = 2, · · · , r, then  m(y ′ ) = m1 + · · · + mi−1 + 1,  and M (y ′ ) = m1 + · · · + mi . case II: If i = 1, then m(y ′ ) = 1 and M (y ′ ) = m1 . ′  ′  ′  ′  In any of the above cases for p ∈ ( m(yn)−1 , M n(y ) ) and only p ∈ ( m(yn)−1 , M n(y ) ) rqx (p) = lqx (p) = zi , by definition. Now we prove a lemma that will become useful later on. It is easy to see that if u ∈ pos(y ′ ) then (  u−1 u , ) ⊂ spos(y ′ ). n n  We conclude that u−1 u , ) ⊂ spos(y ′ ). n n ′ In fact spos(y ) can possibly have a few points on the edge of the intervals u not in ∪u∈pos(y′ ) ( u−1 n , n ). ∪u∈pos(y′ ) (  Lemma 5.2.4 Suppose x is a data vector of length n and y ′ is an element of this vector. Also assume y ′ ≥ xi , i ∈ I, y ′ ≤ xj , j ∈ J, I ∩ J = φ, Then there exist a p in words lq(p) = rq(p) = y ′ .  ( |I|−1 n ,1  I, J ⊂ {1, 2, · · · , n}. −  |J| n )  that belongs to spos(y ′ ). In other  Proof From the assumption, we conclude that pos(y ′ ) includes a number between |I| and n − |J|. Let us call it u0 . Hence ( u0n−1 , un0 ) ⊂ spos(y ′ ). Since |I| ≤ u0 ≤ n − |J|, we conclude that spos(y ′ ) intersects with ∪|I|≤u≤n−|J|(  u−1 u |I| − 1 |J| , )⊂( ,1 − ). n n n n  127  5.3. Defining quantiles of a distribution  5.3  Defining quantiles of a distribution  So far, we have only defined the quantile for data vectors. Now we turn to defining the quantile for distribution functions. The p-th quantile for a random variable X with distribution function F as pointed out above is traditionally defined to be q(p) = inf{u|F (u) ≥ p}. We showed by an example above the asymmetry issue to which that definition can lead. We show that the issue arises due to the flatness of F in an interval. To get around this problem as the case of data vectors, we define the left and right quantile for the distribution F as follows: lqF (p) = inf{u|F (u) ≥ p}, and rqF (p) = inf{u|F (u) > p}. If there are more than one random variables in the discussion, to avoid confusion, we use the notations lqFX , rqFX . Also when there is no chance of confusion, we simply use lq, rq. The reason for this definition should become clear soon. First let us apply this definition to the fair coin example. If p 6= 1/2 then both lqF (p) and rqF (p) will be the same and give us the same value. However, lqF (1/2) = 0 and rqF (1/2) = 1. This is exactly what one would hope for. To see the consequences of this definition, we prove the following lemma: Lemma 5.3.1 (Quantile Properties Lemma) Suppose X is a random variable on the probability space (Ω, Σ, P ) with distribution function F : a) F (lqF (p)) ≡ P (X ≤ lqF (p)) ≥ p. b) lqF (p) ≤ rqF (p). c) p1 < p2 ⇒ rqF (p1 ) ≤ lqF (p2 ). This and (b) imply that lqF (p1 ) ≤ rqF (p1 ) ≤ lqF (p2 ) ≤ rqF (p2 ). d) rqF (p) = sup{x|F (x) ≤ p}. 128  5.3. Defining quantiles of a distribution e) P (lqF (p) < X < rqF (p)) = 0. In other words if lqF (p) < rqF (p) then F is flat in the interval (lqF (p), rqF (p)). f ) P (X < rqF (p)) ≤ p. 
g) If lqF (p) < rqF (p) then F (lqF (p)) = p and hence P (X ≥ rqF (p)) = 1−p. h) lqF (1) > −∞, rqF (0) < ∞ and P (rqF (0) ≤ X ≤ lqF (1)) = 1. i) lqF (p) and rqF (p) are non–decreasing functions of p. j) Suppose F has a jump at x, in other words P (X = x) > 0, which is equivalent to limy→x− F (y) < F (x). Then lqF (F (x)) = x. k) x < lqF (p) ⇒ F (x) < p and x > rqF (p) ⇒ F (x) > p. Proof a) Take a strictly decreasing sequence {xn } in R that tends to lq(p). For every xn , F (xn ) ≥ p since xn > lq(p). Otherwise F (xn ) < p ⇒ F (y) < p, ∀y ≤ xn . Hence (−∞, xn ] ∩ {y|F (y) ≥ p} = ∅. We conclude that lq(p) = inf{y|F (y) ≥ p} ≥ xn > lq(p), which is a contradiction. Now since F is right continuous lim F (xn ) = F (lq(p)).  n→∞  But F (xn ) ≥ p, ∀n ∈ N. Hence limn→∞ F (xn ) ≥ p. b) Note that {u|F (u) > p} ⊂ {u|F (u) ≥ p}. c) Note that {x|F (x) ≥ p2 } ⊂ {x|F (x) > p1 } if p2 > p1 . 129  5.3. Defining quantiles of a distribution d) Suppose p ∈ [0, 1] is given. Let A = {x|F (x) > p} and B = {x|F (x) ≤ p}. We want to show that inf A = sup B. Consider two cases: 1) Suppose inf A < sup B. Then pick inf A < y < sup B. We get a contradiction as follows: inf A < y ⇒ F (y) > p. Otherwise, since F is increasing F (y) ≤ p ⇒ y < x, ∀x ∈ A ⇒ y ≤ inf A. y < sup B ⇒ F (y) ≤ p. Otherwise, since F is increasing F (y) > p ⇒ y > x, ∀x ∈ B ⇒ y ≥ sup B. We conclude F (y) > p and F (y) ≤ p, a contradiction. 2) Suppose sup B < inf A. Take sup B < y < inf A. sup B < y ⇒ F (y) > p. Otherwise, F (y) ≤ p ⇒ y ∈ B ⇒ y ≤ sup B. y < inf A ⇒ F (y) ≤ p. Otherwise F (y) > p ⇒ y ∈ A ⇒ y ≥ inf A. Once more F (y) > p and F (y) ≤ p which is a contradiction.  e) Suppose F is not flat in that interval. ∃v1 < v2 ∈ (lq(p), rq(p)) such that F (v2 ) > F (v1 ). F (v2 ) > F (v1 ) ≥ F (lq(p)) ≥ p. This is a contradiction since v2 < rq(p). f) Take an increasing sequence xn ↑ rqF (p), then note that P (X ≤ xn ) ≤ p since xn < rqF (p). Let An = {X ≤ xn } and A = {X < rqF (p)} then limn→∞ An = A, by continuity of the probability (See [9]): P (X < rqF (p)) = P ( lim An ) = lim P (An ) ≤ p. n→∞  n→∞  g) By a) F (lqF (p)) = P (X ≤ lqF (p)) ≥ p. Suppose P (X ≤ lqF (p)) > p. This implies that lqF (p) ≥ rqF (p). By b) we get lqF (p) = rqF (p), which is a contradiction. h) Note that lqF (0) = inf{x|F (x) ≥ 0} = inf R = −∞. Suppose rqF (0) = ∞. Then {x|F (x) > 0} = ∅ ⇒ ∀x ∈ R, F (x) = 0, a contradiction to the properties of a distribution function F . 130  5.3. Defining quantiles of a distribution Also note that rqF (1) = inf{x|F (x) > 1} = inf ∅ = ∞. Suppose lqF (1) = −∞. Then inf{x|F (x) ≥ 1} = −∞ ⇒ ∀x ∈ R, F (x) ≥ 1 ⇒ ∀x ∈ R, F (x) = 1, a contradiction. For the second part note that rqF (0) ≤ lqF (1) by (c). Then  P (rqF (0) ≤ X ≤ lqF (1)) =  1 − P (lqF (1) < X < rqF (1)) − P (lqF (0) < X < rqF (0)) = 1 − 0 − 0,  by part (e). i) Trivial. j) Suppose P (X = x) > 0 then limy→x− F (y) = P (X < x) < P (X < x) + P (X = x) = F (x). Now assume that limy→x− F (y) < F (x), then P (X < x) < F (x) ⇒ P (X = x) > 0.  To prove that in this case lqF (F (x)) = x, let p = F (x) we want to show lqF (p) = x. Note that F (x) = p gives lqF (p) ≤ x. On other hand for any y < x, we know that F (y) < p, by a) y cannot be lqF (p). Hence x = lqF (F (x)).  k) First part follows from the definition of lq and the second part from part (d).  The following lemma is useful in proving that a specific value is the left or right quantile for a given p. Lemma 5.3.2 (Quantile value criterion) a) lqF (p) is the only a satisfying (i) and (ii), where (i) F (a) ≥ p, (ii) x < a ⇒ F (x) < p. 
131  5.4. Left and right extreme points b) rqF (p) is the only a satisfying (i) and (ii), where (i) x < a ⇒ F (x) ≤ p, (ii) x > a ⇒ F (x) > p. Proof a) Both properties hold for lqF (p) by previous lemma. If both a < b satisfy them, then F (a) ≥ p by (i). But since b satisfies the properties and a < b, by (ii), F (a) < p which is a contradiction. b) Both properties hold for rqF (p) by previous lemma. If both a < b satisfy them, then we can get a contradiction similar to above.  5.4  Left and right extreme points  In Lemma 5.3.1, we showed these properties about rqX (0) and lqX (1): rqX (0) < ∞,  lqX (1) > −∞,  rqX (0) ≤ lqX (1), and P (rqX (0) ≤ X ≤ lqX (1)) = 1. The above states that all the mass is between these two values. We will show in the next lemma that these values are also the minimal values to satisfy this property. This is the motivation for the following definition. Definition We call rqF (0) the “left extreme” and lqF (1) the “right extreme” of the distribution function F . Lemma 5.4.1 (Left and right extreme points property) Suppose X is a random variable with distribution function F . a) The right extreme lqF (1) is the smallest a satisfying P (X ≤ a) = 1. In other words min{P (X ≤ a) = 1} = lqF (1). a  132  5.5. The quantile functions as inverse b) The left extreme rqF (0) is the biggest a satisfying P (X ≥ a) = 1. max{P (X ≥ a) = 1} = rqF (0). a  c) Consider the following subset of R2 I 2 = {(a, b) ∈ R2 |P (X ∈ [a, b]) = 1}. Then ∩(a,b)∈I 2 [a, b] = [rqX (0), lqX (1)]. Proof a) In Lemma 5.3.1, we showed F (lqF (1)) = 1. Also F (a) < 1 for a < lqF (1) by the definition of lqF . b) In Lemma 5.3.1, we showed P (X ≥ rqX (0)) = 1. Suppose a > rqX (0). Then since rqX (p) = inf{x|F (x) > 0}, ∃c ∈ {x|F (x) > 0}, c < a ⇒ ∃c < a, F (c) > 0 ⇒  ∃c, P (X < a) ≥ F (c) > 0 ⇒  P (X ≥ a) = 1 − P (X < a)  <  1.  c) This is straightforward from a) and b).  5.5  The quantile functions as inverse  The following lemma shows that lqX and rqX can be considered as the inverse of the distribution function in some sense. Lemma 5.5.1 (Quantile functions as inverse of the distribution function) a) F (x) < p ⇔ x < lqX (p). (i.e. {x|F (x) < p} = (−∞, lqF (p)).) b) {x|F (x) ≤ p} = (−∞, rqX (p)] or (−∞, rqX (p)). c) If F is continuous at rqX (p) then {x|F (x) ≤ p} = (−∞, rqX (p)]. d) {x|F (x) ≥ p} = [lqX (p), ∞). e) {x|F (x) > p} = (rqX (p), ∞) or [rqX (p), ∞). f ) If F is continuous then {x|F (x) > p} = (rqX (p), ∞). Proof 133  5.5. The quantile functions as inverse a) (⇒) is true because otherwise if x ≥ lqX (p) ⇒ F (x) ≥ F (lqX (p)) ≥ p, which is a contradiction. To show (⇐) note that by the definition of lqX (p), if F (x) ≥ p then x ≤ lqX (p). b) We need to show that (1) (−∞, rqX (p)) ⊂ {x|F (x) ≤ p} and (2) {x|F (x) ≤ p} ⊂ (−∞, rqX (p)]. For (1), suppose x < rqX (p). We claim F (x) ≤ p. Otherwise if F (x) > p by the definition of rqX (p), rqX (p) ≤ x. For (2), suppose F (x) ≤ p. Then since rqX (p) = sup{x|F (x) ≤ p}, we conclude x ≤ rqX (p). c) By Part (b), it suffices to show F (rqX (p)) = p. This is shown in the next lemma. d) R.H.S ⊂ L.H.S by Lemma 5.3.1 part (a). L.H.S ⊂ R.H.S by the definition of lq. e) Note that x > rqF (p) then F (x) > p by Lemma 5.3.1 part (k). Also F (x) > p ⇒ rqF (p) ≤ x by definition of rq. f) This is a consequence of part (e) and next lemma.  For the continuous distribution functions, we have the following lemma. Lemma 5.5.2 (Continuous distributions inverse) If F is continuous F (x) = p ⇔ x ∈ [lqX (p), rqX (p)]. 
Proof If x < lqX (p) then we already showed that F (x) < p. Also if x > lqX (p) then rqX (p) = sup{x|F (x) ≤ p} ⇒ F (x) > p. (Because otherwise if F (x) ≤ p ⇒ rqX (p) ≥ p.) It remains to show that F (lqX (p)) = F (rqX (p)) = p. But by Lemma 5.3.1, we have F (lqX (p)) ≥ p. Hence it suffices to show that F (rqX (p)) ≤ p. But by Part (f) of Lemma 5.3.1 and continuity of F F (rqF (x)) = P (X ≤ rqF (x)) = P (X < rqF (x)) ≤ p.  134  5.6. Equivariance property of quantile functions  5.6  Equivariance property of quantile functions  Example (Counter example for Koenker–Hao claim) Suppose X is distributed uniformly on [0,1]. Then lqX (1/2) = 1/2. Now consider the following strictly increasing transformations ( x −∞ < x < 1/2 φ(x) = . x + 5 x ≥ 1/2 Let T = φ(X) then the distribution   0      t P (T ≤ t) = 1/2    t−5     1  of T is given by t≤0 0 < t ≤ 1/2 1/2 < t ≤ 5 + 1/2 . 5 + 1/2 < t ≤ 5 + 1 t>5+1  It is clear form above that lqT (1/2) = 1/2 6= φ(lqX (1/2)) = φ(1/2) = 5 + 1/2. We start by defining φ≤ (y) = {x|φ(x) ≤ y}, φ⋆ (y) = sup φ≤ (y), and φ≥ (y) = {x|φ(x) ≥ y}, φ⋆ (y) = inf φ≥ (y). Then we have the following lemma. Lemma 5.6.1 Suppose φ is non-decreasing. a) If φ is left continuous then φ(φ⋆ (y)) ≤ y. b) If φ is right continuous then φ(φ⋆ (y)) ≥ y. Proof  135  5.6. Equivariance property of quantile functions a) Suppose xn ↑ φ⋆ (y) a strictly increasing sequence. Then since xn < φ⋆ (y), we conclude xn ∈ φ≤ (y) ⇒ φ(xn ) ≤ y. Hence limn→∞ φ(xn ) ≤ y. But by left continuity φ(xn ) ↑ φ(φ⋆ (y)). b) Suppose xn ↓ φ⋆ (y) a strictly decreasing sequence. Then since xn > φ⋆ (y), we conclude xn ∈ φ≥ (y) ⇒ φ(xn ) ≥ y. Hence limn→∞ φ(xn ) ≥ y. But by right continuity φ(xn ) ↓ φ(φ⋆ (y)).  Theorem 5.6.2 (Quantile Equivariance Theorem) Suppose φ : R → R is non-decreasing. a) If φ is left continuous then lqφ(X) (p) = φ(lqX (p)). b) If φ is right continuous then rqφ(X) (p) = φ(rqX (p)).  Proof a) We use Lemma 5.3.2 to prove this. We need to show (i) and (ii) in that lemma for φ(lqX (p)). First note that (i) holds since Fφ(X) (φ(lqX (p))) = P (φ(X) ≤ φ(lqX (p))) ≤ P (X ≤ lqX (p)) ≥ p. For (ii) let y < φ(lqX (p)). Then we want to show that Fφ(X) (y) < p. It is sufficient to show φ⋆ (y) < lqX (p). Because then P (φ(X) ≤ y) ≤ P (X ≤ φ⋆ (y)) < p.  To prove φ⋆ (y) < lqX (p), note that by the previous lemma φ(φ⋆ (y)) ≤ y < φ(lqX (p)). b) We use Lemma 5.3.2 to prove this. We need to show (i) and (ii) in that lemma for φ(rqX (p)). To show (i) note that if y < φ(rqX (p)), P (φ(X) ≤ y) ≤ P (φ(X) < φ(rqX (p))) ≤ P (X < rqX (p)) ≤ p. 136  5.7. Continuity of the left and right quantile functions To show (ii), suppose y > φ(rqX (p)). We only need to show φ⋆ (y) > rqX (p) because then P (φ(X) ≤ y) ≥ P (X < φ⋆ (y)) > p. But by previous lemma φ(φ⋆ (y)) ≥ y > φ(rqX (p)). Hence φ⋆ (y) > rqX (p).  5.7  Continuity of the left and right quantile functions  Lemma 5.7.1 (Continuity of quantile functions) Suppose F is a distribution function. Then a) lqF is left continuous. b) rqF is right continuous. Proof a) Suppose pn ↑ p be a strictly increasing sequence in [0,1]. Then since lqF is increasing, lqF (pn ) is increasing and hence has a limit we call y. We need to show y = lqF (p). We show this in two steps: 1. y ≤ lqF (p): Let A = {x|F (x) ≥ p}. Then for any x ∈ A: F (x) ≥ p ⇒ F (x) ≥ pn ⇒ x ≥ lqF (pn ) ⇒ x ≥ sup lqF (pn ) ⇒ x ≥ y. n∈N  Hence lqF (p) = inf A ≥ y. 2. y ≥ lqF (p): We only need to show that F (y) ≥ p. But y ≥ lqF (pn ), ∀n ⇒ F (y) ≥ F (lqF (pn )) ≥ pn , ∀n ⇒ F (y) ≥ p. 
b) Take a strictly decreasing sequence pn ↓ p, we need to show rqF (pn ) → rq(p). The limit of rqF (pn ) exists since rq is non–decreasing. Let y = inf n∈N rqF (pn ). We proceed in two steps: 1. rqF (p) ≤ y: rqF (p) ≤ rqF (pn ), ∀n ∈ N ⇒ rqF (p) ≤ inf rqF (pn ) = y. n∈N  137  5.7. Continuity of the left and right quantile functions 2. rqF (p) ≥ y: Since rqF (p) = sup{x|F (x) ≤ p} by Lemma 5.3.1, we only need to show z < y ⇒ F (z) ≤ p. But if F (z) > p then F (z) > pn for some n ∈ N ⇒ z ≥ rqF (pn ) for some n ∈ N. Hence, y > z ≥ rq(pn ) for some n ∈ N, which is a contradiction to y = inf n∈N rq(pn ).  FX is a function that ranges over [0, 1]. Once F hits 1 it will remain one. Similarly before F becomes positive it is always zero. This is the motivation for the following definition. Definition Suppose F is a distribution function. We define the real domain of F to be RD(F ) = {x|0 < F (x) < 1}. Lemma 5.7.2 Suppose F is a distribution function. Then RD(F ) = (rq(0), lq(1)) or RD(F ) = [rq(0), lq(1)). Proof We proceed in two steps (a),(b). (a) RD(F ) ⊂ [rq(0), lq(1)): Note that (a) ⇔ [rq(0), lq(1))c ⊂ RD(F )c , where c stands for taking the compliment of a set in R. If x ∈ [rq(0), lq(1))c then x < rq(0) or x ≥ lq(1). x < rq(0) then F (x) = 0 by the definition of rq(0). x ≥ lq(1) then F (x) ≥ F (lq(1)) ≥ 1 ⇒ F (x) = 1. (b) (rq(0), lq(1)) ⊂ RD(F ): x > rq(0) ⇒ F (x) > 0. (This is because rq(0) = sup{x|F (x) ≤ 0}.) x < lq(1) ⇒ F (x) < 1. (This is because lq(1) = inf{x| F (x) = 1}.) Definition For a random variable X with distribution function F , we define the L-quantile and R-quantile functions on R: LQF : R → R, LQF = lqF ◦ F, RQF : R → R, RQF = rqF ◦ F. 138  5.7. Continuity of the left and right quantile functions Lemma 5.7.3 (Properties of LQ and RQ) a) LQF , RQF are non–decreasing. b)LQF (x) ≤ x ≤ RQF (x). c) LQF , RQF are left continuous and right continuous, respectively. d) lqF (F (x)) = rqF (F (x)) ⇒ LQF (x) = RQF (x) = x. e) We have the following equalities: LQF (v) = inf{u|F (u) = F (v)}, RQF (v) = sup{u|F (u) = F (v)}. f ) P (LQF (x) < X < RQF (x)) = 0. Proof a) This result follows from the fact that lqF , rqF and F are non–decreasing. b) LQF (x) = inf{y|F (y) ≥ F (x)}. Since x ∈ {y|F (y) ≥ F (x)}, x ≥ LQF (x). RQF (x) = sup{y|F (y) ≤ F (x)}. Since x ∈ {y|F (y) ≤ F (x)}, RQF (x) ≥ x. c) Suppose xn ↓ x is a strictly decreasing sequence, then F (xn ) ↓ F (x) since F is right continuous. Hence rqF (F (xn )) ↓ rqF (F (x)) since rqF is right continuous by Lemma 5.7.1. To prove LQF is left continuous, let xn ↑ x be a strictly increasing sequence and let pn = F (xn ). Then since {pn } is an increasing and bounded sequence, pn → p′ . Also let F (x) = p. We consider two cases: 1. p = p′ . In this case pn ↑ p is a strictly increasing sequence. Since lqF is left continuous, limn→∞ LQF (xn ) = limn→∞ lqF (pn ) = lqF (p) = LQF (x). 2. p′ < p. This means F has a jump at x. By Lemma 5.3.1 j), LQF (x) = lqF (F (x)) = x. Let y = limn→∞ lqF (F (xn )). We claim y ≥ x. Otherwise since F (x) = p and F has a jump at p, F (y) < p ⇒ F (y) < pn , for some n ∈ N. But y = supn∈N lq(F (xn )). Hence y ≥ lq(F (xn )) and F (y) ≥ F (lq(pn )) ≥ pn > p a contradiction. Thus y = limn→∞ lqF (F (xn )) ≥ x.  Also note that lqF (pn ) ≤ lqF (F (x)) = x, ∀n ⇒ y = supn∈N lqF (pn ) ≤ lqF (F (x)) = x. We conclude y = x. In other words y = limn→∞ LQF (xn ) = LQF (x).  d) This result is a straightforward consequence of b). e) This result follows immediately from the definition of these quantiles. 139  5.7. 
Continuity of the left and right quantile functions f) P (LQF (x) < X < RQF (x)) = P (lqF (F (x)) < X < rqF (F (x))) = 0, by Lemma 5.3.1.

Example Suppose the distribution function F depicted in Figure 5.1 is given as follows:

F (x) = (2 arctan(x)/π + 1)/5,        x ≤ 0
        1/5,                          0 ≤ x ≤ 1
        x/5,                          1 ≤ x < 2
        3/5,                          2 ≤ x < 3
        (2 arctan(x − 3)/π + 4)/5,    x ≥ 3

Then lqF (0.2) = 0, rqF (0.2) = 1, lqF (0.5) = rqF (0.5) = 2 and lqF (0.55) = rqF (0.55) = 2. We have also plotted lq, rq, LQ and RQ in Figures 5.2 to 5.5.

Figure 5.1: An example of a distribution function with discontinuities and flat intervals.
Figure 5.2: The left quantile (lq) function for the distribution function given in Example 5.7. Notice that this function is left continuous and increasing.
Figure 5.3: The right quantile (rq) function for the distribution function given in Example 5.7. Notice that this function is right continuous and increasing.
Figure 5.4: The LQ function for Example 5.7. Notice that this function is increasing and left continuous.
Figure 5.5: The RQ function for Example 5.7. Notice that this function is increasing and right continuous.

If we are given a data vector, we can compute the sample distribution and then compute the left and right quantile functions. In the sequel, we show that this gives the same values as the definition of the left and right quantiles of a data vector.

Lemma 5.7.4 Suppose a data vector x is given and Fn is its sample distribution. Then lqx (p) = lqFn (p) and rqx (p) = rqFn (p).

Proof We show this for non–intersection points. Similar arguments work for intersection points. If p is not an intersection point, then (m1 + · · · + mi−1)/n < p < (m1 + · · · + mi)/n and rqx (p) = lqx (p) = zi . We want to show that inf{u|Fn (u) ≥ p} is also zi , where

Fn (u) = (1/n) Σ_{i=1}^{n} I_{(−∞, xi ]} (u).

But it follows that:

Fn (zi ) = (m1 + · · · + mi )/n;
lqFn (p) = inf{u|Fn (u) ≥ p};
rqFn (p) = inf{u|Fn (u) > p}.

Since Fn is a step function, the right hand side of the two above equations can only be one of −∞, z1 , · · · , zr , ∞. The first u that makes Fn greater than or equal to p is zi , proving the assertion.

Lemma 5.7.4 guarantees that our definition of quantile for data vectors is consistent with the definition for distributions. Lemma 5.3.1 shows that if a distribution function F is flat then rq and lq might differ. To study further when lq and rq are equal, we define the concepts of heavy and weightless points in the next section.

5.8 Equality of left and right quantiles

This section finds necessary and sufficient conditions for the left and right quantiles to be equal. We start with some definitions.

Definition Suppose X is a random variable with distribution function F . A point x ∈ R is called a weightless point of F if there exists a neighborhood (an open interval) around x on which F is flat. We call a point heavy if it is not weightless. Denote the set of all heavy points by H.
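As a quick numerical check of the example above (the distribution function plotted in Figure 5.1), the sketch below implements F, approximates lq and rq by searching a grid with spacing 0.01, and recovers the stated values lqF (0.2) = 0, rqF (0.2) = 1 (up to one grid step) and lqF (0.5) = rqF (0.5) = 2. It also flags a point inside the flat piece (0, 1) as weightless in the sense just defined. The grid search, the eps tolerance and the function names are assumptions made for illustration only.

```python
# Numerical check of the piecewise distribution function from the example above
# (Figure 5.1).  The grid and the eps tolerance are arbitrary illustrative choices.
import numpy as np

def F(x):
    x = np.atleast_1d(np.asarray(x, dtype=float))
    conds = [x <= 0,
             (x > 0) & (x <= 1),
             (x > 1) & (x < 2),
             (x >= 2) & (x < 3),
             x >= 3]
    vals = [(2 * np.arctan(x) / np.pi + 1) / 5,
            1 / 5,
            x / 5,
            3 / 5,
            (2 * np.arctan(x - 3) / np.pi + 4) / 5]
    return np.select(conds, vals)

grid = np.arange(-500, 801) / 100.0      # x from -5.00 to 8.00 in steps of 0.01
Fg = F(grid)

def lq(p):
    """Approximate lqF(p) = inf{x : F(x) >= p} on the grid."""
    return grid[Fg >= p].min()

def rq(p):
    """Approximate rqF(p) = inf{x : F(x) > p} on the grid."""
    return grid[Fg > p].min()

def is_weightless(x, eps=1e-3):
    """True if F is (numerically) constant on the small interval (x - eps, x + eps)."""
    lo, hi = F(np.array([x - eps, x + eps]))
    return bool(hi == lo)

print(lq(0.2), rq(0.2))   # 0.0 and 1.01: true values are 0 and 1 (rq is one grid step off)
print(lq(0.5), rq(0.5))   # 2.0 and 2.0: p = 0.5 falls inside the jump of F at x = 2
print(is_weightless(0.5), is_weightless(2.0))   # True False
```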
Definition A point x ∈ R is called a super heavy point if P (X ∈ (x − ǫ, x]) > 0, P (X ∈ [x, x + ǫ)) > 0, ∀ǫ > 0. We denote the set of super heavy points by SH. Obviously any super heavy point is heavy. We can also define right heavy points and left heavy points. Definition A point x ∈ R is called a right heavy point if P (X ∈ [x, x + ǫ)) > 0, ∀ǫ > 0. We show the set of all right heavy points by RH. A point x ∈ R is called a left heavy point if P (X ∈ (x − ǫ, x]) > 0, ∀ǫ > 0. We denote the set of all such points by LH. Obviously any heavy point is either right heavy or left heavy. Also a super heavy point is both right heavy and left heavy. 144  5.8. Equality of left and right quantiles Lemma 5.8.1 Suppose X is a random variable with distribution function F . Also suppose that u1 < u2 are heavy points and F is flat on [u1 , u2 ] i.e. F (u1 ) = F (u2 ). Then lq(p) = u1 and rq(p) = u2 , where p = F (u1 ) = P (X ≤ u1 ). Proof 1. lq(p) = u1 : Since F (u1 ) = p, lq(p) ≤ u1 . Suppose lq(p) < u1 . Then P (lq(p) < X < u2 ) > 0, since u1 is a heavy point. We can rewrite above as P (lq(p) < X ≤ u1 ) + P (u1 < X < u2 ) > 0, the second term is zero by the flatness assumption. Hence P (lq(p) < X ≤ u1 ) > 0. But then P (X ≤ lq(p)) = p(X ≤ u1 ) − P (lq(p) < X ≤ u1 ) < p, which is a contradiction to Lemma 5.3.1 a). 2. rq(p) = u2 : From F (u) = p for all u1 ≤ u < u2 , we conclude rq(p) ≥ u2 . To prove the inverse, note that for any u1 < u3 < u2 , F (u3 ) = p since F is flat on [u1 , u2 ]. Since rq(p) = sup{x|F (x) ≤ p} by Lemma 5.3.1, rq(p) ≥ u2 . Now note that since u2 is heavy, for any u3 > u2 , P (u1 < X < u3 ) > 0 ⇒ F (u3 ) = F (u1 ) + P (u1 < X < u3 ) > p. Hence only values less than or equal to u2 are in {x|F (x) ≤ p}. We conclude the sup is at most u2 . In other words rq(p) ≤ u2 .  Lemma 5.8.2 Suppose X is a random variable with distribution function F . Then v is a weightless point ⇔ v ∈ (LQF (v), RQF (v)). 145  5.8. Equality of left and right quantiles Proof (⇐): This is trivial by Lemma 5.7.3 part (f). (⇒): If v ∈ / (LQF (v), RQF (v)) ⇒ LQF (v) = RQF (v) = v by Lemma 5.7.3. RQF (v) = v ⇒ inf{x| F (x) > F (v)} = v ⇒  F (x) > F (v), ∀x > v ⇒ P (v < X ≤ x) > 0, ∀x > v ⇒ P (v < X < x) > 0, ∀x > v,  where the last (⇒) is because for any x > v, we can take v < x′ < x and note that P (v < X < x) ≥ P (x < X ≤ x′ ) > 0. We conclude v is a right heavy point which is a contradiction. For a weightless point v, there is an interval (a, b) such that v ∈ (a, b) and F is flat in that interval. It is useful to consider the flat interval around v. This is the motivation for the following definition. Definition Suppose X is a random variable with distribution function F and v is a weightless point of F . Then we define the weightless interval of v, I(v) by I(v) = ∪a<b,F (a)=F (b)=F (v) (a, b). Lemma 5.8.3 Suppose F is a distribution function and v is a weightless point of this distribution function. Then I(v) = (LQF (v), RQF (v)). Proof (L.H.S ⊂ R.H.S): x ∈ (a, b) for some a, b where F (a) = F (b) = F (v) then F (x) = F (v). Take x1 , x2 such that a < x1 < x < x2 < b then F (x1 ) = F (x2 ) = F (v) ⇒  LQF (v) ≤ x1 < x < x2 ≤ RQF (v) ⇒ x ∈ (LQF (v), RQF (v)).  (R.H.S ⊂ L.H.S): This is trivial since if v is weightless then LQF (v) < RQF (v). Let a = LQF (v) and b = RQF (v) then (a, b) ⊂ I(v) by definition of I(v). Corollary 5.8.4 For any weightless point v, its weightless interval is indeed an open interval. 
Lemma 5.8.5 Suppose X is a random variable with a distribution function F and v, v ′ are weightless points then I(v) = I(v ′ ) or I(v) ∩ I(v ′ ) = ∅. 146  5.8. Equality of left and right quantiles Proof Suppose I(v) ∩ I(v ′ ) 6= ∅. Fix u ∈ I(v) ∩ I(v ′ ). But F (u) = F (v) ⇒ LQF (u) = lqF (F (u)) = lqF (F (v)) = LQF (v) and RQF (u) = rqF (F (u)) = rqF (F (v)) = RQF (v). Hence by the previous lemma I(u) = (LQF (u), RQF (u)) = (LQF (v), RQF (v)) = I(v). A similar argument shows that I(u) = I(v ′ ) and this completes the proof.  Theorem 5.8.6 Suppose X is a random variable with distribution function F , then a) Let N be the set of all weightless points. Then N is measurable and of probability zero. b) The ranges of rqF and lqF do not intersect N . In other words range(rqF ) ∪ range(lqF ) ⊂ H. c) Any heavy point is either lqF (p) or rqF (p) (or both) for some p ∈ [0, 1]. In other words H ⊂ range(rqF ) ∪ range(lqF ). More precisely, if x is right heavy then x ∈ range(rqF ) and if x is left heavy then x ∈ range(lqF ). d) x = lq(p) = rq(p) for some p ∈ [0, 1] if and only if x is a super heavy point. Also H − SH is countable. Proof a) Suppose v is a weightless point and consider I(v) = (LQF (v), RQF (v)). Then by Lemma 5.7.3, all the points in I(v) are weightless. We showed that I(v) ∩ I(v ′ ) 6= ∅, then I(v) = I(v ′ ). Hence N can be written as a disjoint union of the form: N = ∪v∈N ′ I(v), 147  5.8. Equality of left and right quantiles for some N ′ ⊂ N . Pick a rational number qv ∈ I(v), v ∈ N ′ (“the Axiom of choice” from set theory is not needed to pick a rational number from an interval (a, b) because one can take a rational number by comparing the expansion of a and b in the base 10). But I(v) ∩ I(v ′ ) = ∅, v 6= v ′ ∈ N ′ ⇒ qv 6= qv′ . This shows N ′ is countable since the set of rational numbers is countable. Hence, N is a countable union of intervals and is measurable. Moreover, X P (X ∈ I(v)) = 0. P (N ) = P (∪v∈N ′ I(v)) = v∈N ′  b) Suppose z ∈ N . Then there exist a, b such that a < z < b and P (a < X < b) = 0. Take a′ , b′ such that a < a′ < z < b′ < b. Suppose z = lqF (p) for some p. Then P (X ≤ z) ≥ p and also P (X ≤ a′ ) = P (X ≤ z) ≥ p. This is a contradiction since z is the left quantile. Similarly, suppose z = rqF (p) for some p. Then since z < b′ , F (b′ ) > p while a′ < z gives F (a′ ) ≤ p. Hence P (a′ ≤ X ≤ b′ ) > 0, a contradiction. c) Assume x is right heavy. Then let p = F (x). We claim that rqF (p) = x. Suppose rqF (p) = x′ < x then F (x) = p is a contradiction to rqF (p) = sup{y|F (y) ≤ p}. On the other hand for any x′ > x, pick x < x′′ < x′ . We have F (x′′ ) > p since x is right heavy. Since rqF (p) = inf{y|F (y) > p} and F (x′′ ) > p then x′ > rqF (p). We conclude that rqF (p) = x. Now suppose x is left heavy. Let p = F (x). We claim lqF (p) = x. First note that for any x′ < x, F (x′ ) < F (x) = p since x is left heavy. Hence lqF (p) ≥ x. But F (x) = p and since lqF (p) = inf{y|F (y) ≥ p} we are done. d) The necessary and sufficient conditions follow immediately from c). To show that H − SH is countable, we prove LH − SH and RH − SH are countable. To that end, for any x ∈ LH−SH consider Ix = (LQ(x), RQ(x)). Since x is not super heavy this interval has positive length. Also note that x < y, x, y ∈ H implies Ix ∩ Iy = ∅. To prove this, note that since x, y are left heavy, LQ(x) = x and LQ(y) = y. We conclude Ix = (x, RQ(x)) Iy = (y, RQ(y)). If Ix ∩ Iy is nonempty then we conclude x < y < RQ(x). Then 0 = P (X ∈ (x, RQ(x))) ≤ P (X ∈ (x, y)) > 0. 148  5.8. 
Equality of left and right quantiles (P (X ∈ (x, y)) > 0 since y is left heavy.) This is a contradiction and hence Ix ∩ Iy = ∅. Now pick a rational number qx ∈ Ix . Then Ix ∩ Iy = ∅ ⇒ qx 6= qy . Since the set of rational numbers is countable LH − SH is countable. A similar argument works for RH − SH. Lemma 5.8.7 Suppose X is a random variable with distribution function F . Then the set A = {p| p ∈ [0, 1], lqF (p) 6= rqF (p)} is countable. Proof For every p ∈ A let J(p) = (lqF (p), rqF (p)). Then for every x ∈ J(p), F (x) = p. (F (x) ≥ F (lqF (p)) ≥ p. Now if F (x) > p, we get a contradiction to x < lqX (p).) We conclude p, p′ ∈ A, p 6= p′ ⇒ J(p) ∩ J(p′ ) = ∅. The intervals are disjoint, every interval has a positive length and their union is a subset of [0, 1]. Hence there are only countable number of such intervals. We conclude A is countable. The following lemma gives sufficient and necessary conditions for lqX = rqX , ∀p ∈ (0, 1). Lemma 5.8.8 lqX (p) = rqX (p), p ∈ (0, 1) iff FX is strictly increasing. Proof (⇒) lqX (p) = inf{x|FX (x) ≥ p} =  inf{x|x ≥ FX−1 (p)} =  inf{x|x > FX−1 (p)} = rqX (p)  .  (⇐): If Fx is not strictly increasing then ∃x2 < x1 s.t FX (x1 ) = FX (x2 ). Then let p = FX (x1 ). We also have p = FX (x2 ). Hence lqX (p) = inf{FX (x) ≥ p} ≤ x1 , and rqX (p) = sup{FX (x) ≤ p} ≥ x2 , which is a contradiction.  149  5.9. Distribution function in terms of the quantile functions  5.9  Distribution function in terms of the quantile functions  It is interesting to understand the connections amongst lq, rq and F . We answer the following question: Question: Given one of lq, rq or F , are the other two uniquely determined? The answer to this question is affirmative and the following theorem says much more. Theorem 5.9.1 Suppose F is a distribution function. Then a) For p0 ∈ (0, 1), lq(p0 ) = limp→p− rq(p0 ). Hence, the function rq uniquely 0 determines lq. b) For p0 ∈ (0, 1), rq(p0 ) = limp→p+ lq(p0 ). Hence lq uniquely determines 0 rq. c) lq or rq continuous at p0 ∈ (0, 1) ⇒ lq(p0 ) = rq(p0 ). d) lq(p0 ) = rq(p0 ) ⇒ lq and rq are continuous at p0 . e) lq is continuous at p ⇔ rq is continuous at p. f ) F (x) = inf{p|lq(p) > x}. g) F (x) = inf{p|rq(p) > x}. Proof a) Take a strictly increasing sequence pn ↑ p0 in [0, 1]. Then pn−1 < pn < pn+1 ⇒ lq(pn−1 ) < rq(pn ) < lq(pn+1 ),  (5.4)  by Lemma 5.3.1, part (c). By the left continuity of lq, lq(pn ) → lq(p0 ). Applying the Sandwich Theorem about the limits from elementary calculus to the Equation (5.4), we conclude that rq(pn ) → lq(p0 ).  150  5.9. Distribution function in terms of the quantile functions b) Take a strictly decreasing sequence pn ↓ p0 in [0, 1]. Then pn−1 > pn > pn+1 ⇒ rq(pn−1 ) > lq(pn ) > rq(pn+1 ),  (5.5)  again by Lemma 5.3.1, part (c). By the right continuity of rq, rq(pn ) → rq(p0 ). Applying the Sandwich Theorem for limits to Equation (5.5), we conclude that lq(pn ) → rq(p0 ). c) Suppose lq is continuous at p0 . Then limp→p+ lq(p) = lq(p0 ). But by the 0 previous parts of this theorem, we also have limp→p+ = rq(p0 ). Similar 0 arguments work if rq is continuous at p0 . d) To prove lq is continuous at p0 note that lim lq(p0 ) = lq(p0 ) = rq(p0 ) = lim lq(p0 ), p→p+ 0  p→p− 0  where the first equality comes from the left continuity of lq and the last one comes from (b). Similar arguments work for rq. e) This result follows immediately from the previous two parts. f) Let A = {p|lq(p) > x}. We want to show that F (x) = inf A. To do that we first show that F (x) ≤ inf A. 
By Lemma 5.7.3, lq(F (x)) ≤ x ⇒ F (x) ≤ a, ∀a ∈ A ⇒ F (x) ≤ inf A. It remains to show that inf A ≤ F (x). Suppose to the contrary that F (x) < inf A. Then take F (x) < p0 < inf A to get lq(p0 ) ≤ x, p0 > F (x) ⇒ F (lq(p0 )) ≤ F (x), p0 > F (x). But by Lemma 5.3.1 part (a), p0 ≤ F (lq(p0 )). Hence p0 ≤ F (lq(p0 )) ≤ F (x), p0 > F (x), which is a contradiction. 151  5.10. Two-sided continuity of lq/rq g) Let B = {p|rq(p) > x} and A be as the previous part. Then F (x) = inf A ≤ inf B. It only remains to show that inf B ≤ F (x). Otherwise, we can pick p0 , F (x) < p0 < inf B so that rq(p0 ) ≤ x, p0 > F (x) ⇒ p0 ≤ F (rq(p0 )) ≤ F (x), p0 > F (x), which is a contradiction.  5.10  Two-sided continuity of lq/rq  Lemma 5.10.1 Suppose F is a distribution function for the random variable X and lq, rq are its corresponding left and right quantile functions. Then a) F is continuous ⇔ lq is strictly increasing on (0, 1). b) F is strictly increasing on RD(F ) = {x|0 < F (x) < 1} = (rq(0), lq(1)) or [rq(0), lq(1)) ⇔ lq is continuous on (0, 1). Proof a) (⇒): F is continuous iff P (X = x) = 0, ∀x ∈ R. If the R.H.S does not hold then x = lq(p1 ) = lq(p2 ), p1 < p2 . Then for every y < x, we have F (y) < p1 . Hence P (X < x) = lim P (X ≤ y) ≤ p1 < p2 . y→x−  But F (x) ≥ p2 since lq(p2 ) = x and we conclude P (X = x) ≥ p2 − p1 , a contradiction. (⇐): If F is not continuous then P (X = x) = ǫ > 0 for some x ∈ R. Let p = F (x) then P (X < x) = p − ǫ. Pick p1 < p2 in the interval (p − ǫ, p) then lq(p1 ) = lq(p2 ) = x. b) (⇒): lq is left continuous. Hence if it is not continuous then lim lq(p) = rq(p0 ) 6= lq(p0 ).  p→p+ 0  Hence F is flat on (lq(p0 ), rq(p0 )) 6= ∅, which is a contradiction to F being increasing. 152  5.11. Characterization of left/right quantile functions (⇐): Suppose F is not continuous on RD(F ), then there exist a, b ∈ R such that F is flat on [a, b]: F (a) = F (b) = p ∈ (0, 1). But then lq(p) ≤ a and rq(p) ≥ b. Hence lq(p) 6= rq(p), which means lq is not continuous. Remark. We can replace lq is the above lemma by rq. A similar argument can be done for the proof.  5.11  Characterization of left/right quantile functions  The characterization of the distribution function is a well–known result in probability. Here we characterize the left and right quantile functions of a distribution. We start by some simple lemmas which we need in the proof. Lemma 5.11.1 Suppose An ⊂ R, n ∈ N . Then inf ∪n∈N An = inf (inf An ) n∈N  Proof a) inf ∪n∈N An ≥ inf n∈N (inf An ): a ∈ ∪n∈N An ⇒ ∃m ∈ N, a ∈ Am ⇒ ∃m ∈ N, a ≥ inf Am ⇒ a ≥ inf (inf An ). n∈N  Hence, inf ∪n∈N An ≥ inf n∈N (inf An ). b) inf ∪n∈N An ≤ inf n∈N (inf An ): inf ∪n∈N An ≤ inf Am , ∀m ∈ N ⇒ inf ∪n∈N An ≤ inf (inf An ). n∈N  Lemma 5.11.2 Suppose h : (0, 1) → R is a non–decreasing function. Then G(x) = inf{p ∈ (0, 1)|h(p) > x} is a distribution function. Proof a) We claim G is non–decreasing. Suppose x1 < x2 then let A = {p|h(p) > x1 } and B = {p|h(p) > x1 }. Then G(x1 ) = inf A and G(x2 ) = inf B. But clearly B ⊂ A hence G(x1 ) ≤ G(x2 ). 153  5.11. Characterization of left/right quantile functions b) limx→∞ G(x) = 1: First note that such a limit exist and is bounded by 1. (Because the domain of h is (0,1)). Assume limx→∞ G(x) = q < 1, take q < q ′ < 1 then take x0 > h(q ′ ). Let A = inf{p|h(p) > x0 } such that G(x0 ) = inf A. Then (p ∈ A ⇒ h(p) > x0 > h(q ′ ) ⇒ p > q ′ ) ⇒ G(x0 ) = inf A ≥ q ′ > q. We have shown there is an x0 such that G(x0 ) > q this is a contradiction to limx→∞ G(x) = q since G is non-decreasing. 
c) Suppose that limx→−∞ G(x) = q > 0 then take 0 < q ′ < q and x0 < h(q ′ ). Let A = inf{p|h(p) > x0 } such that G(x0 ) = inf A. We have h(q ′ ) > x0 ⇒ q ′ ∈ A ⇒ inf A ≤ q ′ ⇒ G(x0 ) ≤ q ′ < q This contradicts limx→−∞ G(x) = q > 0 since G is non–decreasing. d) G is right continuous: limx→x+ G(x) = x0 . Suppose xn ↓ x0 . In the 0 previous lemma, let An = {p|h(p) > xn } and A = ∪n∈N An = {p|h(p) > x0 }. Then G(x0 ) = inf A = inf ∪n∈N An = inf (inf An ) = inf G(xn ) = lim G(x). n∈N  n∈N  x→x+ 0  Theorem 5.11.3 (Quantile function characterization theorem) Suppose a function h : (0, 1) → R is given. Then (a) h is a left quantile function for some random variable X iff h is left continuous and non–decreasing. (b) h is a right quantile function for some random variable X iff h is right continuous and non–decreasing.  Proof If h is a left quantile function, then h is left continuous and non– decreasing as we showed in previous sections. Also if h is right continuous function then h is non–decreasing and right continuous. For the inverse of both a) and b) define G as in the above lemma. We will prove that h is lqG in a) and rqG in b). (a) Let A = {x|G(x) ≥ p0 }, we want to show h(p0 ) = inf A. 154  5.11. Characterization of left/right quantile functions (i) inf A ≤ h(p0 ): Otherwise if inf A > y > h(p0 ), then: inf A > y ⇒ inf{x|G(x) ≥ p0 } > y ⇒ G(y) < p0 ⇒ inf{p ∈ (0, 1)|h(p) > y} < p0 ⇒  ∃p ∈ (0, 1), h(p) > y, p < p0 ⇒  ∃p ∈ (0, 1)h(p0 ) ≥ h(p) > y,  which is a contradiction. (ii) inf A ≥ h(p0 ) : x ∈ A ⇒ G(x) ≥ p0 ⇒ inf{p ∈ (0, 1)|h(p) > x} ≥ p0 . Hence, ∀p < p0 , h(p) ≤ x ⇒ lim h(p) ≤ x ⇒ h(p0 ) ≤ x, p→p− 0  by left continuity of h. Hence ∀x ∈ A, h(p0 ) ≤ x ⇒ h(p0 ) ≤ inf A. (b) Let A = {x|G(x) > p0 }, we want to show h(p0 ) = inf A. (i) inf A ≤ h(p0 ): Otherwise if inf A > y > h(p0 ), then inf A > y ⇒ y ∈ / A ⇒ G(y) ≤ p0 ⇒ inf{p′ ∈ (0, 1)|h(p′ ) > y} ≤ p0 ⇒ ∀p > p0 , inf{p′ |h(p′ ) > y} < p ⇒  ∀p > p0 , ∃p′ ∈ (0, 1), h(p′ ) > y, p′ < p ⇒  ∀p > p0 , ∃p′ ∈ (0, 1), h(p) ≥ h(p′ ) > y ⇒ h(p0 ) ≥ y  which is a contradiction. (ii) inf A ≥ h(p0 ) : x ∈ A ⇒ G(x) > p0 ⇒ inf{p ∈ (0, 1)|h(p) > x} > p0 ⇒ p0 ∈ / {p ∈ (0, 1)|h(p) > x} ⇒ h(p0 ) ≤ x.  Hence h(p0 ) ≤ inf A.  Now we characterize the quantile functions of data vectors. See Figure 5.6 for an example of quantile functions for the vector x = (−2, −2, 2, 2, 2, 2, 4, 4, 4, 4). 155  0 −4 −2  lq(p)  2  4  5.11. Characterization of left/right quantile functions  0.2  0.4  0.6  0.8  1.0  0.0  0.2  0.4  0.6  0.8  1.0  0 −4 −2  rq(p)  2  4  0.0  p  Figure 5.6: For the vector x = (−2, −2, 2, 2, 4, 4, 4, 4) the left (top) and right (bottom) quantile functions are given.  156  5.12. Quantile symmetries Theorem 5.11.4 (Data vector quantile function characterization theorem) a) h : (0, 1) → R is a left quantile function for a data vector x iff h is a left continuous step function with no steps (jumps) or a finite number of steps (jumps) at some points 0 < a1 < a2 < · · · < ak < 1 where ai = n1 ni , for some n, ni ∈ N. b) h : (0, 1) → R is a right quantile function for a data vector x iff h is a right continuous step function with no steps (jumps) or finite number of steps (jumps) at some points 0 < a1 < a2 < · · · < ak < 1 where ai = n1 ni for some n, ni ∈ N. Proof We only prove a) and b) is obtained either by repeating a similar argument or using the Quantiles Symmetry Theorem (Theorem 5.12.3), which we prove in next sections. 
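As an illustration only, the following minimal Python sketch computes the left and right quantile functions of a data vector for the example above. It is not taken from the thesis; it assumes the order–statistic form that follows from the definitions lq(p) = inf{x | Fn (x) ≥ p} and rq(p) = inf{x | Fn (x) > p} for a sorted vector of length n, namely the ⌈np⌉–th and the (⌊np⌋ + 1)–th smallest elements respectively.

import math

def lq(x, p):
    # left quantile of a data vector: inf{t : F_n(t) >= p};
    # for 0 < p <= 1 this is the ceil(n*p)-th smallest element.
    y = sorted(x)
    n = len(y)
    return y[max(math.ceil(n * p), 1) - 1]

def rq(x, p):
    # right quantile of a data vector: inf{t : F_n(t) > p};
    # for 0 <= p < 1 this is the (floor(n*p) + 1)-th smallest element.
    y = sorted(x)
    n = len(y)
    return y[min(math.floor(n * p) + 1, n) - 1]

x = [-2, -2, 2, 2, 2, 2, 4, 4, 4, 4]
for p in (0.2, 0.5, 0.6, 0.9):
    print(p, lq(x, p), rq(x, p))
# p = 0.2 -> lq = -2, rq = 2  (a jump point of F_n, so lq != rq)
# p = 0.5 -> lq = rq = 2
# p = 0.6 -> lq = 2, rq = 4   (another jump point)
# p = 0.9 -> lq = rq = 4

The output is a step function in p, as described by the characterization theorem above: lq is left continuous and rq is right continuous, and the two differ only at the finitely many jump probabilities i/n.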
a) (⇒) For x = (x1 , · · · , xn ), it is clear that lqx is a step function with jumps at points proportional to 1/n and we proved the left continuity before. a) (⇐) The result is easy to show if h has no jumps. Let h′ = limx→+∞ h(x) and suppose h is given with jumps at a1 < a2 < · · · < ak , a1 = n1 (1/n), · · · , ak = nk (1/n). Let b1 = a1 , b2 = a2 − a1 , · · · , bk = ak − ak−1 , bk+1 = 1 − ak . Then bi = n1 mi , i = 1, 2, · · · , k+1 with m1 = n1 , m2 = n2 −n1 , · · · , mk = nk −nk−1 P and finally mk+1 = n − ki=1 mi . Then let x be a data vector with h(ai ) repeated mi times. We claim that h = lqx . First note that x is of length n. For 0 < p ≤ aP1 , we have lqx (p) = h(a1 ) = h(p). For ai−1 < p ≤ ai , i ≤ k, we P have  ni−1 n  =  i−1 j=1  n  mj  <p≤  i j=1  mj  n  =  ni n.  Hence  lqx (p) = h(ai ) = h(p), ai−1 < p ≤ ai , i ≤ k. For ak < p < 1, we have  nk n  =  Pk  j=1  n  mj  < p < 1,  lqx (p) = h′ = h(p), ak < p < 1.  5.12  Quantile symmetries  This section studies the symmetry properties of distribution functions and quantile functions. Symmetry is in the sense that if X is a random variable with left/right quantile function, some sort of symmetry between the 157  5.12. Quantile symmetries quantile functions of X and −X should exist. We only treat the quantile functions for distributions here but the results can readily be applied to data vectors by considering their empirical distribution functions. Here consider different forms of distribution functions. The usual one is defined to be FXc (x) = P (X ≤ x). But clearly one could have also considered FXo (x) = P (X < x), GcX (x) = P (X ≥ x) or GoX (x) = P (X > x) to characterize the distribution of a random variable. We call F c the left– closed distribution function, F o the left–open distribution function, Gc the right–closed and Go the right–open distribution function. Like the usual distribution function these functions can be characterized by their limits in infinity, monotonicity and right continuity. First note that c F−X (x) = P (−X ≤ x) = P (X ≥ −x) = GcX (−x).  Since the left hand side is right continuous, GcX is left continuous. Also note that FXc (x) + GoX (x) = 1 ⇒ GoX (x) = 1 − FXc (x),  FXo (x) + GcX (x) = 1 ⇒ FXo (x) = 1 − GcX (x). The above equations imply the following: a) Go and F c are right continuous. b) F o and Gc are left continuous. c) Go and Gc are non–decreasing. d) limx→∞ F (x) = 1 and limx→−∞ F (x) = 0 for F = F o , F c . e) limx→∞ G(x) = 0 and limx→−∞ G(x) = 1 for G = Go , Gc . It is easy to see that the above given properties for F o , Go , Gc characterize all such functions. The proof can be given directly using the properties of the probability measure (such as continuity) or by using arguments similar to the above. Another lemma about the relation of F c , F o , Go , Gc is given below. Lemma 5.12.1 Suppose F o , F c , Go , Gc are defined as above. Then a) if any of F c , F o , Go , Gc are continuous, all of other are continuous too. b) F c being strictly increasing is equivalent to F o being strictly increasing. c) if F c is strictly increasing, Go is strictly decreasing. d) Gc being strictly decreasing is equivalent to Go being strictly increasing. 158  5.12. Quantile symmetries Proof a) Note that limy→x− F c (x) = limy→x− F o (x) and limy→x+ F c (x) = limy→x+ F o (x). If these two limits are equal for either F c or F o they are equal for the others as well. b) If either F c or F o are not strictly increasing then they are constant on [x1 , x2 ], x1 < x2 . Take x1 < y1 < y2 < x2 . 
Then F o (x1 ) = F o (x2 ) ⇒ P (y1 ≤ X ≤ y2 ) = 0 ⇒ F c (y1 ) = F c (y2 ). Also we have F c (x1 ) = F c (x2 ) ⇒ P (y1 ≤ X ≤ y2 ) = 0 ⇒ F o (y1 ) = F o (y2 ). c) This is trivial since Go = 1 − F c . d) If Gc is strictly decreasing then F o is strictly increasing since Gc = 1−F o . By part b), F c strictly is increasing. Hence Go = 1 − F c is strictly decreasing. The relationship between these distribution functions and the quantile functions are interesting and have interesting implications. It turns out that we can replace F c by F o in some definitions. Lemma 5.12.2 Suppose X is a random variable with open and closed left distributions F o , F c as well as open and closed right distributions Go , Gc . Then a) lqX (p) = inf{x|FXo (x) ≥ p}. In other words, we can replace F c by F o in the left quantile definition. b) rqX (p) = inf{x|FXo (x) > p}. In other words, we can replace F c by F o in the right quantile definition. Proof a) Let A = {x|FXo (x) ≥ p} and B = {x|FXc (x) ≥ p}. We want to show that inf A = inf B. Now A ⊂ B ⇒ inf A ≥ inf B. But inf B < inf A ⇒ ∃x0 , y0 , inf B < x0 < y0 < inf A. Then  159  5.12. Quantile symmetries  inf B < x0 ⇒ ∃b ∈ B, b < x0 ⇒ ∃b ∈ R, p ≤ P (X ≤ b) ≤ P (X ≤ x0 ) ⇒ P (X ≤ x0 ) ≥ p ⇒ P (X < y0 ) ≥ p.  On the other hand y0 < inf A ⇒ y0 ∈ / A ⇒ P (X < y0 ) < p, which is a contradiction, thus proving a). b) Let A = {x|FXo (x) > p} and B = {x|FXc (x) > p}. We want to show inf A = inf B. Again, A ⊂ B ⇒ inf A ≥ inf B. But inf B < inf A ⇒ ∃x0 , y0 , inf B < x0 < y0 < inf A. Then inf B < x0 ⇒ ∃b ∈ B, b < x0 ⇒ ∃b ∈ R, p < P (X ≤ b) ≤ P (X ≤ x0 ) ⇒ P (X ≤ x0 ) > p ⇒ P (X < y0 ) > p.  On the other hand, / A ⇒ P (X < y0 ) ≤ p, y0 < inf A ⇒ y0 ∈ which is a contradiction. Using the above results, we establish the main theorem of this section which states the symmetry property of the left and right quantiles. Theorem 5.12.3 (Quantile Symmetry Theorem) Suppose X is a random variable and p ∈ [0, 1]. Then lqX (p) = −rq−X (1 − p). Remark. We immediately conclude rqX (p) = −lq−X (1 − p), by replacing X by −X and p by 1 − p. 160  5.12. Quantile symmetries Proof  R.H.S = − sup{x|P (−X ≤ x) ≤ 1 − p} = inf{−x|P (X ≥ −x) ≤ 1 − p} =  inf{x|P (X ≥ x) ≤ 1 − p} =  inf{x|1 − P (X ≥ x) ≥ p} = inf{x|1 − Gc (x) ≥ p} =  inf{x|F o (x) ≥ p} = lqX (p).  Now we show how these symmetries can become useful to derive other relationships/definitions for quantiles. Lemma 5.12.4 Suppose X is a random variable with distribution function F . Then lqX (p) = sup{x|F c (x) < p}. Proof o (x) > 1 − p} = lqX (p) = −rq−X (1 − p) = − inf{x|F−X  − inf{x|1 − Gc−X (x) > 1 − p} = sup{−x|Gc−X (x) < p} = sup{−x|P (−X ≥ x) < p} = sup{x|P (X ≤ x) < p} = sup{x|F c (x) < p}.  In the previous sections, we showed that both lqX and rqX are equivariant under non-decreasing continuous transformations: lqφ(X) (p) = φ(lqX (p)), where φ is non-decreasing left continuous. Also rqφ(X) (p) = φ(rqX (p)), for φ : R → R non-decreasing right continuous. However, we did not provide any results for decreasing transformations. Now we are ready to offer a result for this case. 161  5.12. Quantile symmetries Theorem 5.12.5 (Decreasing transformation equivariance) a) Suppose φ is non-increasing and right continuous on R. Then lqφ(X) (p) = φ(rqX (1 − p)). b) Suppose φ is non-increasing and left continuous on R. Then rqφ(X) (p) = φ(lqX (1 − p)). Proof a) By the Quantile Symmetry Theorem, we have lqφ(X) (p) = −rq−φ(X) (1 − p). But −φ is non-decreasing right continuous, hence the above is equivalent to −(−φ(rqX (1 − p))) = φ(rqX (1 − p)). 
b) By the Quantile symmetry Theorem rqφ(X) (p) = −lq−φ(X)(1−p) = − − φ(lqX (1 − p)) = φ(lqX (p)), since −φ is non-decreasing and left continuous. Lemma 5.12.6 Suppose X is a random variable and F c , F o , Gc , Go are the corresponding distribution functions. Then we have the following inequalities: a) F c (lq(p)) ≥ p. (Hence F c (rq(p)) ≥ p.) b) F o (rq(p)) ≤ p. (Hence F o (lq(p)) ≤ p.) c) Go (lq(p)) ≤ 1 − p. (Hence Go (rq(p)) ≤ 1 − p.) d) Gc (rq(p)) ≥ 1 − p. (Hence Gc (lq(p)) ≥ 1 − p.) Proof We already showed a). b) Suppose there F o (rq(p)) = p + ǫ for some positive ǫ. Then since F o is left continuous lim F o (x) = p + ǫ. x→rq(p)+  Hence there exist x0 < rq(p) such that F (x0 ) ≥ F o (x0 ) > p + ǫ/2. This is a contradiction to rq(p) being the inf of the set {x|F (x) > p}. c) and d) are straightforward consequence of a) and b) since F c + Go = 1 and F o + Gc = 1.  162  5.13. Quantiles from the right  The quantile functions as the inverse of an open distribution function Lemma 5.12.7 Suppose X is a random variable with distribution function F and open distribution function F o . a) {x|F o (x) < p} = (−∞, lqF (p)) or (−∞, lqF (p)]. b) {x|F o (x) ≤ p} = (−∞, rqF (p)]. c) If F o is continuous then {x|F o (x) < p} = (−∞, lqF (p)]. d) {x|F o (x) > p} = (rqF (p), ∞). e) {x|F o (x) ≥ p} = (lqF (p), ∞) or [lqF (p), ∞) Proof The proof is very similar to Lemma 5.5.1 and we skip the details.  5.13  Quantiles from the right  So far, we have defined left/right quantiles using the classic distribution function F c . We also showed that in quantile definitions F c can be replaced by F o . FXc (x) = P (−∞ < X ≤ x) measures the probability from minus infinity. When we define left/right quantiles, we seek to find points where this probability from minus infinity reaches (passes) a certain value. One could also consider GcX (x) = P (x ≤ X < ∞) and define another version of quantile functions which seek points where the probability from plus infinity reaches or passes a point. This is a motivation to define the “left/right quantile functions from the right”. By indicating from the right we clarify that the probability is compute from the right hand side i.e. plus infinity. The previously defined left and right quantile functions should be called “left/right quantile functions from the left”. Definition Suppose X is a random variable with closed right distribution function GcX (x) = P (X ≥ x). Then we define the “left quantile function from the right” as follows lqf rX (p) = sup{x|GcX (x) > p}. Definition Suppose X is a random variable with closed right distribution function GcX (x) = P (X ≥ x). Then we define the right quantile function from the right as follows 163  5.13. Quantiles from the right  rqf rX (p) = sup{x|GcX (x) ≥ p}. Using the symmetries in the definition of these quantities, we will show that we have already characterized left/right from the right quantile functions. We need the following lemma. Lemma 5.13.1 Suppose X is a random variable with quantile functions lqX , rqX . Then a) rqX (p) = sup{x|F o (x) ≤ p}. b) lqX (p) = sup{x|F o (x) < p} Proof a) Let A = {x|F c (x) ≤ p} and B = {x|F o (x) ≤ p}. First note that A ⊂ B ⇒ sup A ≤ sup B. To show that the sups are indeed equal, note sup A < sup B ⇒ ∃x0 , y0 , sup A < x0 < y0 < sup B. Then sup A < x0 ⇒ F c (x0 ) > p, and y0 < sup B ⇒ ∃b ∈ B, y0 < b ⇒ ∃b, F o (b) ≤ p, y0 < b ⇒ F o (y0 ) ≤ p. But F c (x0 ) > p, F o (y0 ) ≤ p, which is a contradiction. b) Let A = {x|F c (x) < p} and B = {x|F o (x) < p}. First note that A ⊂ B ⇒ sup A ≤ sup B. 
To show that the sups are indeed equal, note sup A < sup B ⇒ ∃x0 , y0 , sup A < x0 < y0 < sup B. Then sup A < x0 ⇒ F c (x0 ) ≥ p, and y0 < sup B ⇒ ∃b ∈ B, y0 < b ⇒ ∃b, F o (b) < p, y0 < b ⇒ F o (y0 ) < p. 164  5.14. Limit theory But F c (x0 ) ≥ p, F o (y0 ) < p, which is a contradiction.  Lemma 5.13.2 (Quantile functions from the right) a) lqf rX (p) = rqX (1 − p). b) rqrfX (p) = lqX (1 − p). Proof a) lqrfX (p) = sup{x|GcX (x) > p} = sup{x|FXo (x) ≤ p} = rqX (1 − p). b) rqrfX (p) = sup{x|GcX (x) ≥ p} = sup{x|FXo (x) < 1 − p} = lqX (1 − p).  5.14  Limit theory  To prove limit results, we need some limit theorems from probability theory that we include here for completeness and without proof. Their proofs can be found in standard probability textbooks and appropriate references are given below. If we are dealing with two samples, X1 , · · · , Xn and Y1 , · · · , Yn , to avoid confusion we use the notation Fn,X and Fn,Y to denote their empirical distribution functions respectively. Definition Suppose X1 , X2 , · · · , is a discrete–time stochastic process. Let F(X) be the σ-algebra generated by the process and F(Xn , Xn+1 , · · · ) the σ-algebra generated by Xn , Xn+1 , · · · . Any E ∈ F(X) is called a tail event if E ∈ F(Xn , Xn+1 , · · · ) for any n ∈ N. Definition Let {An }n∈N be any collection of sets. Then {An i.o.}, read as An happens infinitely often is defined by: {An i.o.} = ∩i∈N ∪∞ j=i Aj . 165  5.14. Limit theory Theorem 5.14.1 (Kolmogorov 0–1 law): E being a tail event implies that P (E) is either 0 or 1. Proof See [9].  Theorem 5.14.2 (Glivenko–Cantelli Theorem): Suppose, X1 , X2 , · · · , i.i.d, has the sample distribution function Fn . Then lim sup |Fn (x) − F (x)| → 0, a.s..  n→∞ x∈R  Proof See [7]. Here, we extend the Glivenko–Cantelli Theorem to F o , Go and Gc . Lemma 5.14.3 Suppose X is a random variable and consider the associated distribution functions FXo , GoX and GcX with corresponding sample distribuo , Go c tion functions FX,n X,n and GX,n . Then sup |GoX,n − GoX | → 0, a.s., x∈R  o sup |FX,n − FXo | → 0, a.s., x∈R  and sup |GcX,n − GcX | → 0, a.s.. x∈R  Proof Note that FXc + GoX = 1 ⇒ GoX = 1 − FXc , and c c FX,n + GoX,n = 1 ⇒ GoX,n = 1 − FX,n .  Since Glivenko–Cantelli Theorem holds for FXc it also holds for GoX . o (x) = To show the result for FXo , note that FXo (x) = Go−X (−x) and FX,n o c c G−X,n (−x). Also to show the result for GX note that GX = 1 − FXo and o . GcX,n = 1 − FX,n  166  5.14. Limit theory Theorem 5.14.4 (Borel–Cantelli lemma): Suppose (Ω, F, P ) is a probability space. Then P 1. An ∈ F and ∞ 1 P (An ) < ∞ ⇒ P (An i.o) = 0. P 2. An ∈ F independent events with ∞ 1 P (An ) = ∞ ⇒ P (An i.o) = 1, where i.o. stands for infinitely often. Proof See [9]. Theorem 5.14.5 (Berry–Esseen bound): Let X1 , X2 , · · · , be i.i.d with E(Xi ) = 0 < ∞, E(Xi2 ) = σ and E(|Xi |3 ) = ρ. If Gn is the distribution of √ X1 + · · · + Xn /σ n and Φ(x) is the distribution function of a standard normal random variables then √ |Gn (x) − Φ(x)| ≤ 3ρ/σ 3 n. Corollary 5.14.6 Let X1 , X2 , · · · , be i.i.d with E(Xi ) = µ < ∞, E(|Xi − µ|2 ) = σ and E(|Xi − µ|3 ) = ρ. If Gn is the distribution of (X1 + · · · + Xn − √ √ nµ)/σ n = n( X̄nσ−µ ) and Φ(x) is the distribution function of a standard normal random variable then √ |Gn (x) − Φ(x)| ≤ 3ρ/σ 3 n. Proof This corollary is obtained by applying the theorem to Yi = Xi − µ. √ Now let An = (X1 + · · · + Xn − nµ)/σ n. Then √ |P (An > x)−(1−Φ(x))| = |P (An ≤ x)−Φ(x)| = |Gn (x)−Φ(x)| < 3ρ/σ 3 n. 
Also √ |P (x < An ≤ y)−(Φ(y)−Φ(x)))| ≤ |Gn (y)−Φ(y)|+|Gn (x)−Φ(x)| ≤ 6ρ/σ 3 n. These inequalities show that for any ǫ > 0 there exist N such that n > N, Φ(z2 ) − Φ(z1 ) − ǫ < P (z1 <  √  n(  X̄n − µ ) ≤ z2 ) < Φ(z2 ) − Φ(z1 ) + ǫ, σ  for z1 < z2 ∈ R ∪ {−∞, ∞}. It is interesting to ask under what conditions lqFn and rqFn tend to lqF and rqF as n → ∞. Theorem 5.14.7 gives a complete answer to this question. 167  5.14. Limit theory Theorem 5.14.7 (Quantile Convergence/Divergence Theorem) a) Suppose rqF (p) = lqF (p) then rqFn (p) → rqF (p), a.s., and lqFn (p) → lqF (p), a.s.. b) When lqF (p) < rqF (p) then both rqFn (p), lqFn (p) diverge almost surely. c) Suppose lqF (p) < rqF (p). Then for every ǫ > 0 there exists N such that n > N, lqFn (p), rqFn (p) ∈ (lqF (p) − ǫ, lqF (p)] ∪ [rqF (p), rqF (p) + ǫ). d) lim sup lqFn (p) = lim sup rqFn (p) = rqF (p), a.s., n→∞  n→∞  and lim inf lqFn (p) = lim inf rqFn (p) = rqF (p), a.s.. n→∞  n→∞  Proof a) Since, lqF (p) = rqF (p), we use qF (p) to denote both. Suppose ǫ > 0 is given. Then F (qF (p) − ǫ) < p ⇒ F (qF (p) − ǫ) = p − δ1 , δ1 > 0, and F (qF (p) + ǫ) > p ⇒ F (qF (p) + ǫ) = p + δ2 , δ2 > 0. By the Glivenko–Cantelli Theorem, Fn (u) → F (u) a.s., uniformly over R. We conclude that Fn (qF (p) − ǫ) → F (qF (p) − ǫ) = p − δ1 , a.s., 168  5.14. Limit theory and Fn (qF (p) + ǫ) → F (qF (p) + ǫ) = p + δ2 , a.s.. Let ǫ′ =  min(δ1 ,δ2 ) . 2  Pick N such that for n > N :  p − δ1 − ǫ′ < Fn (qF (p) − ǫ) < p − δ1 + ǫ′ , p + δ2 − ǫ′ < Fn (qF (p) + ǫ) < p + δ2 + ǫ′ . Then Fn (qF (p) − ǫ) < p − δ1 + ǫ′ < p lqFn (p) ≥ qF (p) − ǫ  ⇒  and  rqFn (p) ≥ qF (p) − ǫ.  Also p < p + δ2 − ǫ′ < Fn (qF (p) + ǫ) lqFn (p) ≤ qF (p) + ǫ  ⇒  and  rqFn (p) ≤ qF (p) + ǫ.  Re-arranging these inequalities we get: qF (p) − ǫ ≤ lqFn (p) ≤ qF (p) + ǫ, and qF (p) − ǫ ≤ rqFn (p) ≤ qF (p) + ǫ. b) This needs more development in the sequel and the proof follows. c) This also needs more development in the sequel and the proof follows. d) If lqF (p) = rqF (p) the result follows immediately from (a). Otherwise suppose lqF (p) < rqF (p). Then by (b) lqFn (p) diverges almost surely. Hence lim sup lqFn (p) 6= lim inf lqFn (p), a.s. . But by (c), ∀ǫ > 0, ∃N, n > N lqFn (p) ∈ (lqF (p) − ǫ, lqF (p)] ∪ [rqF (p), rqF (p) + ǫ). This means that every convergent subsequence of lqFn (p) has either limit lqF (p) or rqF (p), a.s.. Since lim sup lqFn (p) 6= lim inf lqFn (p), a.s., we conclude lim sup lqFn (p) = rqF (p) and lim inf lqFn (p) = lqF (p), a.s.. A similar argument works for rqFn (p).  169  5.14. Limit theory  To investigate the case lqF (p) 6= rqF (p) more, we start with the simplest example namely a fair coin. Suppose X1 , P X2 , · · · an i.i.d sequence with P (Xi = −1) = P (Xi = 1) = 12 and let Zn = ni=1 Xi . Note that Zn ≤ 0 ⇔ lqFn (1/2) = −1,  Zn > 0 ⇔ lqFn (1/2) = 1,  and Zn < 0 ⇔ rqFn (1/2) = −1,  Zn ≥ 0 ⇔ rqFn (1/2) = 1.  Hence in order to show that lqFn (1/2) and lqFn (1/2) diverge almost surely, we only need to show that P ((Zn < 0 i.o.) ∩ (Zn > 0 i.o.)) = 1. We start with a theorem from [9]. Theorem 5.14.8 Suppose Xi is as above. Then P (Zn = 0 i.o.) = 1. Proof The proof of this theorem in [9] uses the Borel–Cantelli Lemma part 2.  Theorem 5.14.9 Suppose, X1 , X2 , · · · i.i.d. and P (Xi = −1) = P (Xi = 1) = 1/2. Then lqFn (1/2) and rqFn (1/2) diverge almost surely. Proof Suppose, A = {Zn = −1 i.o.} and B = {Zn = 1 i.o.}. It suffices to show that P (A ∩ B) = 1. But ω ∈ A ∩ B ⇒ lqFn (p)(ω) = −1, i.o. and lqFn (p)(ω) = 1, i.o. Hence lqFn (p)(ω) diverges. 
Note that P (A) = P (B) by the symmetry of the distribution. Also it is obvious that both A and B are tail events and so have probability either zero or one. To prove P (A ∩ B) = 1, it only suffices to show that P (A ∪ B) > 0. Because then at least one of A and B has a positive probability, say A. P (A) > 0 ⇒ P (A) = 1 ⇒ P (B) = P (A) = 1 ⇒ P (A ∩ B) = 1. Now let C = {Zn = 0, i.o.}. Then P (C) = 1 by Theorem 5.14.8. If Zn (ω) = 0 then either Zn+1 (ω) = 1 or Zn+1 (ω) = −1. Hence if Zn (ω) = 0, i.o. then at least for one of a = 1 or a = −1, Zn (ω) = a, i.o.. We conclude that 170  5.14. Limit theory ω ∈ A ∪ B. This shows C ⊂ A ∪ B ⇒ P (A ∪ B) = 1. To generalize this theorem, suppose X1 , X2 , · · · , arbitrary i.i.d process and lqF (p) < rqF (p). Define the process ( 1 Xi ≥ rqF (p) Yi = 0 Xi ≤ lqF (p). (Note that P (lqX (p) < X < rqX (p)) = 0.) Then the sequence Y1 , Y2 , · · · is i.i.d., P (Yi = 0) = p and P (Yi = 1) = 1 − p. Also note that lqFn,Y (p) diverges a.s. ⇒ lqFn,X (p) diverges a.s. Hence to prove the theorem in general it suffices to prove the theorem for the Yi process. However, we first prove a lemma that we need in the proof. Lemma 5.14.10 Let Y1 , Y2 , · · · i.i.d with Pn P (Yi = 0) = p = 1 − q > 0 and P (Yi = 1) = 1 − p = q > 0. Let Sn = i=1 Yi , 0 < α, k ∈ N. Then there exists a transformation φ(k) (to N) such that P (Sφ(k) − φ(k)q < −k) > 1/2 − α, P (Sφ(k) − φ(k)q > k) > 1/2 − α. Remark. For α = 1/4, we get P (Sφ(k) − φ(k)q < −k) > 1/4, P (Sφ(k) − φ(k)q > k) > 1/4. Proof Since the first three moments of Yi are finite (E(Yi ) = q, E(|Yi − q|2 ) = q(1 − q) = σ, E(|Yi − q|3 ) = q 3 (1 − q) + (1 − q)3 q = ρ), we can apply √ the Berry-Esseen theorem to n Ȳnσ−µ . By a corollary of that theorem, for α 2 > 0 there exists an N1 such that 1 − Φ(z) −  √ Ȳn − µ α α < P( n > z) < 1 − Φ(z) + , 2 σ 2  and  √ Ȳn − µ α α < P( n < −z) < Φ(z) + , 2 σ 2 for all z ∈ R and n > N1 . Now for the given integer k pick N2 such that Φ(z) −  171  5.14. Limit theory  1 α 1 α k − < Φ( √ ) < + . 2 2 2 2 σ N2 This is possible because Φ is continuous and Φ(0) = 1/2. Now let k φ(k) = max{N1 , N2 }, z = p . σ φ(k)  Then since φ(k) ≥ N1  and  p Ȳφ(k) − µ α P ( φ(k) > z) > 1 − Φ(z) − > 1/2 − α, σ 2  p Ȳφ(k) − µ α P ( φ(k) < −z) > Φ(z) − > 1/2 − α. σ 2 These two inequalities are equivalent to P ((Sφ(k) − φ(k)q) < −k) > 1/2 − α, and P ((Sφ(k) − φ(k)q) > k) > 1/2 − α. If we put α = 1/4, we get P ((Sφ(k) − φ(k)q) < −k) > 1/4, and P ((Sφ(k) − φ(k))q > k) > 1/4.  We are now ready to prove Part b) of Theorem 5.14.7. Proof [Theorem 5.14.7, Part b)] For the process {Yi } as defined above, let n1 = 1, mk = nk + φ(nk ) and nk+1 = mk + φ(mk ). Then define Dk = (Ynk +1 + · · · + Ymk − (mk − nk )q < −nk ), Ek = (Ymk +1 + · · · + Ynk+1 − (nk+1 − mk )q > mk ), CK = Dk ∩ Ek . Since {Ck } involve non–overlapping subsequences of Ys , they are independent events. Also Dk and Ek are independent. Now note that 172  5.14. Limit theory  Ynk +1 + · · · + Ymk − (mk − nk )q < −nk ⇒  Y1 + · · · + Ymk < −nk + (mk − nk )q + nk ⇒ mk − n k Ȳmk < q<q⇒ mk lqFn,Y (p) = rqFn,Y = 0 ⇒ {Ck , i.o.} ⊂ {lqFn,Y (p) = rqFn,Y = 0, i.o.}.  Similarly, Ymk +1 + · · · + Ynk+1 − (nk+1 − mk )q > mk  ⇒ Y1 + · · · + Ynk+1 > (nk+1 − mk )q + mk mk + (nk+1 − mk )q ⇒ Ȳnk+1 > >q =1−p nk+1 ⇒ lqFn,Y (p) = rqFn,Y (p) = 1  ⇒ {Ck , i.o.} ⊂ {lqFn,Y (p) = rqFn,Y (p) = 1, i.o.}. 
Let us compute the probability of Ck : P (Ck ) =  P (Ynk +1 + · · · + Ymk − (mk − nk )q < −nk )×  P (Ymk +1 + · · · + Ynk+1 − (nk+1 − mk )q > mk ) = P (Y1 + · · · + Yφ(nk ) − φ(nk )q < −nk )×  P (Y1 + · · · + Yφ(mk ) − φ(mk )q > mk ) > 1/4.1/4 = 1/16. We conclude that  ∞ X k=1  P (Ck ) = ∞.  By the Borel–Cantelli Lemma, P (Ck , i.o.) = 1. We conclude that P (lqFn,Y (p) = rqFn,Y (p) = 0, i.o.) = 1, and P (lqFn,Y (p) = rqFn,Y (p) = 1, i.o.) = 1.  173  5.14. Limit theory Hence, P ({lqFn,Y (p) = rqFn,Y (p) = 0, i.o.}∩{lqFn,Y (p) = rqFn,Y (p) = 1, i.o.}) = 1.  Proof (Theorem 5.14.7, part (c)) Suppose that rqF (p) = x1 6= lqF (p) = x2 and a is an arbitrary real number. Let h = x2 − x1 . We define a new chain Y as follows: ( Xi Xi ≤ lqFX (p) Yi = Xi − h Xi ≥ rqFX (p). (See Figure 5.7.) Then Y1 , Y2 , · · · is an i.i.d sample. We drop the index i from Yi and Xi in the following for simplicity and since the Yi (as well as the Xi ) are identically distributed. We claim lqFY Y (p) = rqFY (p) = lqFX (p). To prove lqFY (p) = lqFX (p), note that FY (lqFX (p)) = P (Y ≤ lqFX (p)) ≥ P (X ≤ lqFX (p)) ≥ p ⇒ lqFY (p) ≤ lqFX (p). (The first inequality is because Y ≤ X.) Moreover for any y < lqFX (p), FY (y) = FX (y) < p. (Since X, Y < lqFX (p) ⇒ X = Y .) Hence lqFY (p) ≥ lqFX (p) and we are done. To show rqFY (p) = lqFX (p), note that rqFY (p) ≥ lqFY (p) = lqFX (p). It only remains to show that rqFY (p) ≤ lqFX (p). Suppose y > lqFX (p) and let δ = y − lqFX (p) > 0. First note that P ({Y ≤ lqFX (p) + δ}) =  P ({Y ≤ lqFX (p) + δ and X ≥ rqFX (p)} ∪  {Y ≤ lqFX (p) + δ and X ≤ lqFX (p)}) =  P ({X − h ≤ lqFX (p) + δ and X ≥ rqFX (p)} ∪  {X ≤ lqFX (p) + δ and X ≤ lqFX (p)}) =  P ({rqFX (p) ≤ X ≤ rqFX (p) + δ} ∪ {X ≤ lqFX (p)}) = P ({X ≤ rqFX (p) + δ}).  Hence, FY (y) = P (Y ≤ lqFX (p) + δ) = P (X ≤ rqFX (p) + δ) > p ⇒ 174  5.14. Limit theory rqFY (p) ≤ y, ∀y > lqFX (p). We conclude that rqFY (p) ≤ lqFY (p). To complete the proof of part (c) observe that for every ǫ > 0, we may suppose that lqFn,Y (p) ∈ (qFY (p) − ǫ, qFY (p) + ǫ). Then lqFn,X (p), rqFn,X (p) ∈ (lqFX (p) − ǫ, rqFX (p) + ǫ).  (5.6)  This is because from lqFn,Y (p) ∈ (qFY (p) − ǫ, qFY (p) + ǫ), we may conclude that Fn,Y (qFY (p) + ǫ) > p ⇒ Fn,X (rqFX (p) + ǫ) > p ⇒ lqFn,X (p), rqFn,X (p) < rqFX (p) + ǫ, and Fn,Y (qFY (p) − ǫ) < p ⇒ FnX (lqFX (p) − ǫ) < p ⇒ lqFn,X (p), rqFn,X (p) > lqFX (p) − ǫ. But by part (a) of Theorem 5.14.7, lqFn,Y (p) → qFY (p) and rqFn,Y (p) → qFY (p). Hence for given ǫ > 0 there exists an integer N such that for any n > N, lqFn,Y (p) ∈ (qFY (p) − ǫ, qF,Y (p) + ǫ). By (5.6), we have shown that for every ǫ > 0 there exists N such that for every n > N qFn,X (p), rqFn,X (p) ∈ (lqFX (p) − ǫ, rqFX (p) + ǫ), since P (Xi ∈ (lqFX (p), rqFX (p)) for some i ∈ N) = 0. We can conclude that P (lqFn,X (p) ∈ (lqFX (p), rqFX (p)) for some i ∈ N) = 0 and P (rqFn,X (p) ∈ (lqFX (p), rqFX (p)) for some i ∈ N) = 0. Hence with probability 1 qFn,X (p), rqFn,X (p) ∈ (lqFX (p) − ǫ, lqFX (p)] ∪ [rqFX (p), rqFX (p) + ǫ).  175  0.0  0.2  0.4  F  0.6  0.8  1.0  5.14. Limit theory  −2  −1  0  1  2  3  4  5  x  Figure 5.7: The solid line is the distribution function of {Xi }. Note that for the distribution of the Xi and p = 0.5, lqFX (p) = 0, rqFX (p) = 3. Let h = rq(p)−lq(p) = 3. The dotted line is the distribution function of the {Yi } which coincides with that of {Xi } to the left of lqFX (p) and is a backward shift of 3 units for values greater than rqFX (p). Note that for the {Yi }, lqFY (p) = rqFY (p) = 1.  176  5.15. 
Summary and discussion  5.15  Summary and discussion  This section highlights the results obtained for a two state-definition for quantiles and discuss why these results show such a consideration is useful.  Justifications and consequences of using left and right quantile functions 1. The equivariance property (under non-decreasing continuous transformations) of lqX and rqX makes them equivariant under the change of scale. This is a nice theoretical property. Also from a practical view it means that if we compute the quantile in one scale it can be easily calculated in another scale. 2. Considering lqX , rqX allowed us to find a symmetry relation on quantiles: lqX (p) = −rq−X (1 − p). 3. We found a nice formula for continuous non-increasing transformations: lqφ(X) (p) = φ(rqX (1 − p)). 4. We showed that lqFn (p) the traditional sample quantile function and rqFn (p) tend to the distribution version if and only if lqF (p) = rqF (p). Hence finding a sufficient and necessary condition that is easy to formulate in terms of lqF and rqF . 5. If we start with only the traditional quantile function lqF , then rqF (p) would arise in the limit lim sup lqFn (p) = rqF (p). n→∞  6. It is widely claimed that the “median” minimizes the absolute error E|X − a|. In next chapters, we show that argmina E|X − a| = [lqX (1/2), rqX (1/2)]. We observe both lqX (p) and rqX (p) would arise if we intend to use this as a way defining quantiles. A generalization from 1/2 to arbitrary p is left for future research. 177  5.15. Summary and discussion 7. We offered a physical motivation using a uniform bar to define quantiles for data vectors which resulted in a definition that coincide with lqX , rqX . 8. If we only use the traditional quantile function, for p = 0, we get lqX (0) = ∞ in general. However rqX (0) < ∞ is a useful value in the sense that it is the maximum a satisfying P (X ≥ a) = 1. Also rqX (1) = −∞ in general. However lqX (1) > −∞ in general and is a useful value since it is the minimum a satisfying P (X ≤ a) = 1. 9. Middle values of lqX (p), rqX (p) (for example a specific weighted combination of the two) or the whole interval [lqX (p), rqX (p)] are not preferable as a definition. This is because we showed that the range of lqX and rqX is exactly the set of heavy points. Points where the probability of being in any positive radius of them is positive. 10. From a practical point of view giving a value that has already occurred as quantile we can expect the same value or a close value happen again in the future. More formally, suppose a random sample X1 , · · · , Xn is given and we want to compute the sample qunatile. Then lqFn (p) and rqFn (p) are one of Xi s by definition. If we denote XF a future value meaning that XF is identically distributed and independent from X1 , · · · , Xn P (XF ∈ (Xi − ǫ, Xi + ǫ)) > 0. A middle value might not satisfy such a property. 11. We found out a clean nice way to show in what sense exactly lqX and rqX are close. We showed P (lqX (p) < X < rqX (p)) = 0. For data vectors this means the two values are side by side in the sorted vector. 12. We showed that lqX (p) and rqX (p) coincide except for at most a countable subset of the reals. 13. We showed that even though lqX (p) ≤ rqX (p) in general, they are not too far apart since for a very small positive value ǫ lqX (p) ≤ rqX (p) ≤ lqX (p + ǫ). 178  5.15. Summary and discussion 14. Given one of lqF or rqF , the other one can be obtained by taking the limits lqF (p0 ) = lim rqF (p), p↑p0  and rqF (p0 ) = lim lqF (p). 
(The second limit is taken as p ↓ p0 .)

15. In order to invert F , the pair lqF , rqF gives clean expressions for sets such as {x|F (x) > p}, which equals (rqX (p), ∞) if F is continuous at rqX (p).

16. For a continuous distribution function, we have a simple formula for the inverse based on lqF and rqF :

F −1 (p) = [lqX (p), rqX (p)].

17. The left (right) quantile function at a given probability p can be described simply as the minimal value at which the distribution function reaches (passes) p.

In some practices, fixing only one of lq or rq might be sufficient. This is because lq and rq are close in terms of the probability of the underlying random variable; for example, in data vectors lq and rq are at most one element apart in terms of their position in the sorted data vector.

In most elementary statistics textbooks and statistical software, quantiles are given as a one-state solution, generally a weighted combination of the left and right quantiles. In order to teach the right and left quantile functions, we suggest using a simple example such as x = (1, 2, 3, 4) to show that there is no value in the middle and that the left median (2) and the right median (3) are natural to consider. One can then point out that this generalizes from p = 1/2 to any p without getting into details. It can also be pointed out that the left (right) quantile function at a given probability p is simply the minimal value at which the distribution function reaches (passes) p. In more advanced courses, perhaps for mathematics, statistics or science students, the teacher might like to show how the quantiles can be defined using the bar of length 1. Finally, the mathematical formulas can be given to students with the appropriate mathematical background (i.e., familiarity with the definitions of sup and inf and their existence property for the real numbers).

In case an interpolation procedure is to be used, we suggest that the interpolation be between lqX (p) and rqX (p). Surprisingly, this is not what standard software does. For example, for x = (0, 0, 0, 0, 0, 1, 1, 1, 1, 1), the R package returns 0.32 as the quantile for p = 0.48. But in this vector the 0s cover the first 50 percent of the data and 0.48 is strictly less than 0.50, so we expect 0 to be the quantile. The value 0.32 is greater than both lqx (0.48) and rqx (0.48), which are equal to 0.

Chapter 6

Probability loss function

6.1 Introduction

This chapter develops a "loss function" to assess the goodness of an approximation or an estimator of a quantile of a distribution (or a data vector). Suppose a quantile q of a very large data vector is approximated by q̂. Several classic losses can be considered, for example the absolute error L(q, q̂) = |q − q̂| or the squared error L(q, q̂) = (q − q̂)2 , which was proposed by Gauss. Quoting from [30]: "Gauss proposed the square of the error as a measure of loss or inaccuracy. Should someone object to this specification as arbitrary, he writes, he is in complete agreement. He defends his choice by an appeal to mathematical simplicity and convenience." An obvious problem with these losses is their lack of invariance under re-scaling of the data. We propose a loss function that is invariant under strictly monotonic transformations. We also show that the sample version of this loss function tends uniformly to the distributional version. This loss function can also be used to find optimal ways to summarize a data vector and to define a measure of distance among random variables, as shown in the next chapters.
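To see the re-scaling problem concretely, here is a small numerical sketch; the numbers are hypothetical and chosen only for illustration. Under the absolute error, which of two candidate approximations of a quantile looks better can change after a strictly increasing transformation of the scale, whereas the loss proposed below depends only on how much probability (how many observations) falls between the candidate and the target and is therefore unaffected by such transformations.

import math

q = 50.0                      # hypothetical "true" quantile on the original scale
qhat1, qhat2 = 40.0, 61.0     # two hypothetical approximations

print(abs(q - qhat1), abs(q - qhat2))   # 10.0 vs 11.0: qhat1 looks better

phi = math.log                          # a strictly increasing re-scaling
print(abs(phi(q) - phi(qhat1)), abs(phi(q) - phi(qhat2)))
# about 0.223 vs 0.199: after re-scaling, qhat2 looks better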
We define the loss of estimating/approximating q by q̂ to be the probability that the random variable falls between the two values. A limited version of this concept, only for data vectors, can be found in the computer science literature, where ǫ-approximations are used to approximate quantiles of large datasets. (See for example [32].) However, this concept has not been introduced as a measure of loss and the definition is limited to data vectors rather than arbitrary distributions.

6.2 Degree of separation between data vectors

Our purpose is to find good approximations to the median and other quantiles. It is not clear how such approximations should be assessed. We contend that such a method should not depend on the scale of the data; in other words, it should be invariant under monotonic transformations. We define a function δ that measures a natural "degree of separation" between data points of a data vector x. For the sake of illustration, consider the example sort(x) = (1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 7). Suppose we want to define the degree of separation of 3, 4 and 7 in this example. Since 4 comes right after 3, we consider their degree of separation to be zero. There are 3 elements between 4 and 7, so it is appealing to measure their degree of separation as 3; but since the degree of separation should be relative, we can also divide by n = 11, the length of the vector, and get δ(4, 7) = 3/11. We can generalize this idea to get a definition for all pairs in R. With the same example, suppose we want to compute the degree of separation between 2.5 and 4.5, which are not members of the data vector. Since there are 5 elements of the data vector between these two values, we define their degree of separation as 5/11. More formally, we give the following definition.

Definition Suppose z < z ′ and let ∆x (z, z ′ ) = {i|z < xi < z ′ }. Then we define

δx (z, z ′ ) = |∆x (z, z ′ )|/n,

and δx (z, z) = 0. We call δx the "degree of separation" (DOS) or the "probability loss function" associated with x.

We then have the following lemma about the properties of δ.

Lemma 6.2.1 The degree of separation δx has the following properties:
a) δx ≥ 0.
b) y < y ′ < y ′′ ⇒ δx (y, y ′′ ) ≥ δx (y, y ′ ).
c) If z < z ′ and z, z ′ are elements of x, then δx (z, z ′ ) = (mx (z) − Mx (z ′ ) − 1)/n. [For the definition of m(z) and M (z) see Chapter 5.]
d) δφ(x) (φ(z), φ(z ′ )) = δx (z, z ′ ) if φ is a strictly monotonic transformation.
e) y = sort(x) and y ′ = yi < y ′′ = yj ⇒ δx (y ′ , y ′′ ) ≤ (j − i − 1)/n.

Proof Both a) and b) are straightforward. We obtain c) as a straightforward consequence of the definition of mx (y ′ ) and Mx (y ′ ). To show d), suppose z < z ′ and φ is strictly decreasing. (The strictly increasing case is similar.) Then φ(z ′ ) < φ(z) and hence

∆φ(x) (φ(z), φ(z ′ )) = {i|φ(z ′ ) < φ(xi ) < φ(z)} = {i|z < xi < z ′ } = ∆x (z, z ′ ).

Finally, e) is true because |∆x (y ′ , y ′′ )| = |{l|yi < xl < yj }| ≤ j − i − 1.

All the definitions and results above can be applied to random vectors X = (X1 , · · · , Xn ) as well. In that case, lqX (p), rqX (p) and δX (z, z ′ ) are random. To develop our theory, we need to study the asymptotic behavior of these statistics. We do so in later sections.
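The degree of separation is simple to compute directly from its definition. The short Python sketch below (an illustration, not code from the thesis) implements δx and reproduces the numbers of the worked example above, including the invariance of Lemma 6.2.1 d).

def dos(x, z, zp):
    # degree of separation delta_x(z, z') = |{i : min(z,z') < x_i < max(z,z')}| / n
    lo, hi = min(z, zp), max(z, zp)
    return sum(lo < xi < hi for xi in x) / len(x)

x = [1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 7]
print(dos(x, 3, 4))       # 0.0: no observation lies strictly between 3 and 4
print(dos(x, 4, 7))       # 0.2727... = 3/11: the elements 5, 6, 6 lie between
print(dos(x, 2.5, 4.5))   # 0.4545... = 5/11: the elements 3, 3, 4, 4, 4 lie between

# invariance under a strictly monotonic transformation (Lemma 6.2.1 d)
import math
y = [math.exp(xi) for xi in x]
print(dos(y, math.exp(4), math.exp(7)))   # again 3/11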
6.3  “Degree of separation” for distributions: the “probability loss function”  We define a degree of separation for distributions which corresponds to the notion of “degree of separation” defined for data vectors to measure separation between data points. Definition Suppose X has a distribution function F . Let δF (z ′ , z) = δF (z, z ′ ) = lim F (u) − F (z ′ ) = P (z ′ < X < z), z > z ′ , u→z −  and δF (z, z) = 0, z ∈ R. We also denote this by δX whenever a random variable X with distribution F is specified. We call δX the “degree of separation” or the “probability loss function” associated with X. The following lemma is a straightforward consequence of the definition. Lemma 6.3.1 Suppose x = (x1 , · · · , xn ) is a data vector with the empirical distribution Fn . Then δFn (z, z ′ ) = δx (z, z ′ ), z, z ′ ∈ R. This lemma implies that to prove a result about the degree of separation of data vectors, it suffices to show the result for the degree of separation of random variables. Theorem 6.3.2 Let X, Y be random variables and FX , FY , their corresponding distribution functions. a) Assume Y = φ(X), for a strictly increasing or decreasing function φ : R → R. Then δFX (z, z ′ ) = δFY (φ(z), φ(z ′ )), z < z ′ ∈ R. b) δF (z, z ′ ) ≤ δF (z, z ′′ ), z ≤ z ′ ≤ z ′′ . c) δF (z1 , z3 ) ≤ δF (z1 , z2 ) + δF (z2 , z3 ) + P (X = z2 ). 183  6.3. “Degree of separation” for distributions: the “probability loss function” d) Suppose, p ∈ [0, 1]. Then δF (lqF (p), rqF (p)) = 0. e) Suppose, p1 < p2 ∈ [0, 1]. Then δF (lqF (p1 ), rqF (p2 )) ≤ p2 − p1 . This immediately implies δF (lqF (p1 ), lqF (p2 )) ≤ p2 − p1 and δF (rqF (p1 ), lqF (p2 )) ≤ p2 − p1 by b). Remark. We may restate Part (c), for data vectors: Suppose x has length n and z2 is of multiplicity m, (which can be zero). Then the inequality in (c) is equivalent to δx (z1 , z3 ) ≤ δx (z1 , z2 ) + δx (z2 , z3 ) + m/n. Proof a) Note that for a strictly increasing function φ, we have P (z < X < z ′ ) = P (φ(z) < φ(X) < φ(z ′ )). Now suppose φ is strictly decreasing. Then z < z ′ ⇒ φ(z ′ ) < φ(z). Let Y = φ(X). Then δX (z, z ′ ) = P (z < X < z ′ ) = P (φ(z ′ ) < φ(X) < φ(z)) = δY (φ(z), φ(z ′ )). b) This is trivial. c) Consider the case z1 < z2 < z3 . (The other cases are easier to show.) Then δF (z1 , z3 ) = P (z1 < X < z3 ) = P (z1 < X < z2 )+P (X = z2 )+P (z2 < X < z3 ) = δ(z1 , z2 ) + δ(z2 , z3 ) + P (X = z2 ). d) This result is a straightforward consequence of Lemma 5.3.1 b) and c). e) This result follows from δF (lq(p1 ), rq(p2 )) = P (lq(p1 ) < X < rq(p2 )) = P (X < rq(p2 )) − P (X ≤ lq(p1 )) ≤ p2 − p1 . The last inequality being a result of Lemma 5.3.1 a) and d). Remark. We call part c) of the above theorem the pseudo–triangle inequality. Here we give two examples about using the probability loss function and its interpretation. 184  6.3. “Degree of separation” for distributions: the “probability loss function” Example We showed above that the triangle property does not hold for the probability loss function and that might lead to the criticism that this definition is not intuitively appealing. By an example, we now show why it makes sense that the triangle property should not hold for such a situation. Suppose a few mathematicians are standing in a line Euclid, Khawarzmi, Khayyam, Gauss, Von Neumann. 
If we were to ask Khwarzmi about his distance from Euclid, he would answer: “0, since I am right beside him.” If we ask Khwarazmi again about his distance to Khayyam, he will say that “my distance is 0 since I am right beside him.” However if we were to ask Euclid about his distance to Khayyam he would answer: “One unit (person) since Khwarzmi is in the middle.” We observe that this distance does not satisfy the triangle property as well. In this example the people sitting in the middle are the relevant factors. If we deal with a vector of sorted observations, then observations in the middle are the relevant factors. Example A student is told that he will receive a scholarship if he ranks first in an exam in his class in either of the subjects mathematics and physics. The teacher of the courses differ and take a practice exam in each subject. They return the students back their marks out of 100. They also publish the lists of all the marks after removing the names, to give the students a feeling of how they did in the class. Table 6.1 shows the marks in mathematics and physics.  185  6.3. “Degree of separation” for distributions: the “probability loss function” Mathematics  Physics  Physics before scaling  80 65 63 61 54 54 53 50 49 48 47 47 46 44 30  90 89 86 85 83 82 79 79 76 75 72 72 69 68 55  81.0 79.2 74.0 72.2 68.9 67.2 62.4 62.4 57.8 56.2 51.8 51.8 47.6 46.2 30.2  Table 6.1: A class marks in mathematics and physics. The third column are the raw physics marks before the physics teacher scaled them. Reza got 63 in math and 75 in physics. He decided to focus on just one subject that gives him a better chance in order to win the scholarship. He compared his mark in math with the best student in math: 63 against 80. So he needed |best mark − Reza’s mark| = 80 − 63 = 17 more marks to be as good as the best student. Then he compared his physics mark to the best student in physics. He found he needs 90-75=15 marks to be as good as him. So he thought it’s better to focus on physics. But then he realized that different teachers use different exam and scoring methods. He had heard that the physics teacher scales the marks upward by the formula √ new mark = 100 × old mark. So the student calculated the untransformed values and put the result in the third column. Now he noticed that his new mark is 56.2 while the best mark is 91. The difference this time is 24.8 which is a larger difference than before. According to his “decision-making tool”, the absolute difference, he should focus on math since the absolute difference for math was only 17. But what if the mathematics teacher had used another transformation to re-scale the marks without him knowing it? This made him see a disadvantage to using the absolute value difference. Instead he realized, he can use the number 186  6.4. Limit theory for the probability loss function of the students between himself and the best student as a measure of the difficulty of getting the best mark. He noticed his decision in this case will be independent of how the teachers re-scaled the marks. In the math case there is only one and for physics there are 8 students between him and the best student. Hence he decided that he should focus on math. This example was under the assumption that other students do not change their study habits or do not have access to the marks. 
If the other students had access to their marks or were ready to change their study focus, we need to take into account other possible actions of the other students and the problem will become game-theoretical in nature, a very interesting problem on its own right. The solution for that problem we conjecture to be the same.  6.4  Limit theory for the probability loss function  Theorem 6.4.1 Suppose X1 , X2 , · · · , is a sequence of i.i.d random variables with distribution function F . Then as n → ∞, δFn (z, z ′ ) → δF (z, z ′ ), a.s., uniformly in z, z ′ ∈ R. In other words sup |δFn (z, z ′ ) − δF (z, z ′ )| → 0, a.s..  z>z ′ ∈R  Proof If z = z ′ , the result is trivial. Suppose z > z ′ . We need to show that lim Fn (u) − Fn (z ′ ) → lim F (u) − F (z ′ ),  u→z −  a.s. u→z −  (6.1)  as n → ∞, uniformly in z > z ′ ∈ R. Suppose ǫ > 0 is given. By GlivenkoCantelli Theorem there exist N ∈ N such that for every n > N : ǫ |Fn (u) − F (u)| < , a.s., ∀u ∈ R. 2 Now for n > N , |( lim Fn (u) − Fn (z ′ )) − ( lim F (u) − F (z ′ ))| ≤ u→z −  u→z −  | lim (Fn (u)−F (u))|+|Fn (z ′ )−F (z ′ )| = lim |Fn (u)−F (u)|+|Fn (z ′ )−F (z ′ )|. u→z −  u→z −  187  6.5. The probability loss function for the continuous case But since |Fn (u) − F (u)| < 2ǫ , limu→z − |Fn (u) − F (u)| ≤ 2ǫ . Also |Fn (z ′ ) − F (z ′ )| < 2ǫ . Hence |( lim Fn (u) − Fn (z ′ )) − ( lim F (u) − F (z ′ ))| < ǫ. u→z −  6.5  u→z −  The probability loss function for the continuous case  This section studies the probability loss function when the distribution function is continuous. The results are given in the following lemmas, which show some of its desirable properties in the continuous case. Lemma 6.5.1 (Probability loss for continuous distributions) Suppose X is a random variable with distribution function FX . Then δX (lqX (p1 ), rqX (p2 )) = p2 − p1 , p2 > p1 , ∀p1 , p2 ∈ [0, 1] iff FX is continuous. Proof If FX is continuous then for p1 < p2 and by Lemma 5.5.2, δ(lqX (p1 ), rqX (p2 )) = P (lqX (p1 ) < X < rqX (p2 )) = P (X < rqX (p2 )) − P (X ≤ lqX (p)) = F (rqX (p2 )) − F (lqX (p2 )) = p2 − p1 .  If F is not continuous then there exists an x0 such that a = PX (X = x0 ) > 0. Let p1 = P (X < x0 ) + a/3 and p2 = P (X < x0 ) + a/2. Clearly lqX (p1 ) = x0 and rqX (p2 ) = x0 . Hence δ(lqX (p1 ), rqX (p2 )) = 0 6= p2 − p1 .  Lemma 6.5.2 Suppose δ(lqX (p1 ), rqX (p2 )) = δ(rqX (p1 ), lqX (p2 )) = a, p1 < p2 . Then also a = δ(lqX (p1 ), lqX (p2 )) = δ(rqX (p1 ), lqX (p2 )) = δ(rqX (p1 ), rqX (p2 )). Moreover, if X is continuous, all the above are equal to p2 − p1 . 188  6.6. The supremum of δX Proof The result follows immediately from the fact that all the three quantities are greater than or equal to δ(rqX (p1 ), lqX (p2 )) = a and smaller than or equal to δ(lqX (p1 ), rqX (p2 )) = a. The second part is straightforward using the previous lemma.  6.6  The supremum of δX  This section investigates how large the probability loss can become under various scenarios. The results are given in the following lemmas. Lemma 6.6.1 Let Dist be the set of all distribution functions. Then sup δF (lqF (p1 ), lqF (p2 )) = p2 − p1 , p2 > p1 , p1 , p2 ∈ (0, 1).  F ∈Dist  Proof This follows from the fact that δF (lqF (p1 ), lqF (p2 )) ≤ p2 − p1 in general, as shown in Lemma 6.3.2 and δF (lqF (p1 ), lqF (p2 )) = p2 − p1 for continuous variables. The same is true for data vectors as shown in the following lemma. 
Lemma 6.6.2 Suppose the supremum in the following is taken over all data vectors, then sup δx (lqx (p1 ), lqx (p2 )) = p2 − p1 , p2 > p1 , p1 , p2 ∈ (0, 1). x  Proof We know that δx (lqx (p1 ), lqx (p2 )) ≤ p2 − p1 . To show that the supremum attains the upper bound, let xn = (1, · · · , n). Then lqxn (p1 ) = [np1 ] or [np1 ] + 1. Also lqxn (p2 ) = [np2 ] or [np2 ] + 1. Then ∆, the number of elements of x between lqxn (p1 ) and lqxn (p2 ) satisfies: [np2 ] − [np1 ] − 1 ≤ ∆ ≤ [np2 ] − [np1 ] + 1 ⇒ np2 − 1 − np1 − 1 − 1 ≤ ∆ ≤ np2 − np1 + 1 ⇒ −3/n ≤ δxn (p1 , p2 ) − (p2 − p1 ) ≤ 1/n. This shows that δxn (p1 , p2 ) tends to p2 − p1 uniformly for all p1 < p2 ∈ [0, 1].  189  6.6. The supremum of δX Lemma 6.6.3 Suppose p1 , p2 , · · · , pm ∈ [0, 1] and m = 2k. Then sup max{δx (lqx (p1 ), lqx (p2 )), δx (lqx (p3 ), lqx (p4 )), · · · , δx (lqx (pm−1 ), lqx (pm ))} x  = max{|p2 − p1 |, · · · , |pm − pm−1 |}. Proof The supremum is less than or equal to the left hand side by Lemma 5.3.1. Let xn = (1, 2, · · · , n). Without loss of generality suppose p1 < p2 , p3 < p4 , · · · , p2k−1 < p2k . By the properties of quantiles of data vectors: lqxn (pi ) = x[npi ] = [npi ] or lqxn (pi ) = x[npi ]+1 = [npi ] + 1. Also, lqxn (pi+1 ) = x[npi+1] = [npi+1 ] or lqxn (pi+1 ) = x[npi+1]+1 = [npi+1 ] + 1. Then, δxn (lqxn (pi ), lqxn (pi+1 )) ≥ n1 ([npi+1 ]−[npi ]−1) ≥ n1 (npi+1 −npi −2) = (pi+1 − pi ) − n2 . Hence δxn (lqxn (pi ), lqxn (pi+1 )) > |pi+1 − pi | −  2 , i = 1, · · · , m − 1. n  The inequality shows the supremum is greater than = max{|p2 − p1 | −  2 2 , · · · , |pm − pm−1 | − }, n n  for all n ∈ N. Now let n → +∞ to get the conclusion.  Lemma 6.6.4 Suppose p1 , p2 , · · · , pm ∈ [0, 1] and a1 , a1 , · · · , a2m ∈ [0, 1]. Then Z a2 Z a4 sup[ δx (lqx (p1 ), lqx (p))dp + δx (lqx (p2 ), lqx (p))dp+ x  a1  a3  ··· + =  Z  a2  a1  a2 m  Z  |p − p1 |dp +  δx (lqx (pm ), lqx (p))dp]  a2m−1  Z  a4  a3  |p − p2 |dp + · · · +  Z  a2 m  a2m−1  |p − pm |dp.  Proof The proof is similar to the previous lemmas and we skip the details.  190  6.6. The supremum of δX  6.6.1  “c-probability loss” functions  This section introduces a family of loss functions that are very similar to the probability loss function but might be more useful in some contexts, particularly when the distribution function is not continuous. A defect of the probability loss function is: it can be equal to zero even if a 6= b, a, b ∈ R. Also we noted that even though it resembles a metric it is not one. For example the triangle inequality does not hold. We introduce the “cprobability loss function” to solve these problems. Definition Suppose X is a random variable, δX its associated probability loss function and c ≥ 0. Then let c δX (a, b) = δX (a, b) + c(1 − 1{0} (a − b)),  where 1{0} is the indicator function at zero. Note that the c-probability loss is the sum of two losses. The first, δX (a, b), is the probability of being between the two values (a and b), the second, c(1 − 1{0} (a − b)), is the penalty for a and b not being equal. One question is what value of c should be chosen as the “penalty” of not being equal to the true value. It turns out that the value of c is not very important for many purposes as shown in the following lemma. Lemma 6.6.5 (Properties of the c-probability loss functions) c (a, b) = c ⇔ a 6= b and δ (a, b) = 0. a) δX X c (a, b) = 0 or δ c (a, b) ≥ c. b) δX X c is invariant under strictly monotonic transformations. c) δX d) Let d = sup P (X = x0 ). Then if c ≥ d, δc satisfies the triangle inequality. 
(Here the supremum is taken over x0 ∈ R.)
e) δX^c (lqX (p), rqX (p)) ≤ c. (It is either zero or c.)
f) Suppose δX^c is given for some c > 0. Then we can obtain any other δX^d for d ≥ 0.

Proof a) and b) are trivial.
c) Both δX and c(1 − 1{0} (a − b)) are invariant under strictly monotonic transformations.
d) We use the pseudo–triangle inequality for the probability loss function. Take z1 , z2 , z3 ∈ R. We need to show δX^c (z1 , z3 ) ≤ δX^c (z1 , z2 ) + δX^c (z2 , z3 ). If z1 = z3 , the result is trivial. Otherwise c(1 − 1{0} (z1 − z3 )) = c and

δX^c (z1 , z3 ) = δX (z1 , z3 ) + c ≤ δX (z1 , z2 ) + δX (z2 , z3 ) + P (X = z2 ) + c
≤ δX (z1 , z2 ) + δX (z2 , z3 ) + c(1 − 1{0} (z1 − z2 )) + c(1 − 1{0} (z2 − z3 )) = δX^c (z1 , z2 ) + δX^c (z2 , z3 ).

e) Trivial by the properties of lq, rq and δX shown in Lemma 5.3.1.
f) Suppose δX^c is given. If δX^c (a, b) = 0 then a = b and hence δX^d (a, b) = 0. If a ≠ b then δX^c (a, b) = δX (a, b) + c. From this we can obtain δX (a, b) = δX^c (a, b) − c and hence δX^d (a, b) = δX^c (a, b) − c + d.

If X1 , X2 are i.i.d. with the same distribution as X, then δX (X1 , X2 ) (or δX^c (X1 , X2 )) can be considered as a measure of disparity of the common distribution. The following lemma shows that the expectation of this quantity is constant for all continuous random variables!

Lemma 6.6.6 Suppose X is a continuous random variable. Then E(δX (X1 , X2 )) = 1/3, where X1 , X2 are i.i.d. with the same distribution as X. Also E(δX^c (X1 , X2 )) = 1/3 + c.

Proof We know that FX (X1 ) and FX (X2 ) are uniformly distributed on (0, 1) and independent. Hence

E(δX (X1 , X2 )) = E(|F (X1 ) − F (X2 )|) = ∫0^1 ∫0^1 |p1 − p2 | dp1 dp2 = 2 ∫0^1 ∫p2^1 (p1 − p2 ) dp1 dp2 = ∫0^1 (1 − 2p2 + p2^2 ) dp2 = 1/3.

E(δX^c (X1 , X2 )) = 1/3 + c is obtained by noting that P (X1 = X2 ) = 0 for continuous random variables.

Chapter 7

Approximating quantiles in large datasets

7.1 Introduction

This chapter develops an algorithm for approximating the quantiles of petascale (petabyte = one million gigabytes) datasets and uses the "probability loss function" to assess the quality of the approximation. The need for such an approximation does not arise for the sample average, another common data summary. That is because if we break the data into equal partitions and calculate the mean of every partition, the mean of the obtained means is equal to the total mean. It is also easy to recover the total mean from the means of unequal partitions if their lengths are known. However, computer memories, several gigabytes (GBs) in size, cannot handle large datasets that can be petabytes (PBs) in size. For example, a laptop with 2 GBs of memory, using the well–known R package, could find the median of a data file of about 150 megabytes (MBs) in size; however, it crashed for larger files. Since large datasets are commonly assembled in blocks, say by day or by district, that need not be a serious limitation, except insofar as the quantiles computed in that way cannot be used to find the overall quantile. Nor would it help to sub–sample these blocks, unless these (possibly dependent) sub–samples could be combined into a grand sub–sample whose quantile could be computed. That will not usually be possible in practice. The algorithm proposed here is a "worst–case" algorithm in the sense that no matter how the data are arranged, we will reach the desired precision. This is of course not true if we sample from the data, because there is a (perhaps small) probability that the approximation could be poor.
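As a small illustration of this contrast between the mean and the median (this is not the algorithm developed in this chapter; the data and block sizes below are arbitrary), the overall mean can be recovered exactly from per-block summaries, while the median of block medians generally differs from the overall median:

import random, statistics

random.seed(1)
data = [random.lognormvariate(0, 1) for _ in range(10_000)]

# three blocks of unequal length, as data are often stored in practice
blocks = [data[:2_000], data[2_000:5_500], data[5_500:]]

# the overall mean is exactly recoverable from per-block means and sizes ...
sizes = [len(b) for b in blocks]
means = [statistics.fmean(b) for b in blocks]
combined = sum(m * s for m, s in zip(means, sizes)) / sum(sizes)
print(combined, statistics.fmean(data))              # agree up to rounding

# ... but the median of the block medians need not be the overall median
medians = [statistics.median(b) for b in blocks]
print(statistics.median(medians), statistics.median(data))   # generally differ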
We also address the following question:

Question: If we partition the data–file into a number of sub–files and compute the medians of these, is the median of the medians a good approximation to the median of the data–file?

We first show that the median of the medians does not approximate the exact median well in general, even after imposing conditions on the number of partitions or their length. However, for our proposed algorithm, we show how the partitioning idea can be employed differently to get good approximations. "Coarsening" is introduced to summarize a data vector so that the quantiles of the original vector can be inferred from the summaries. We then present the "d-coarsening" quantile algorithm, which partitions the data (or uses previously defined partitions) into possibly non-equal partitions, summarizes each partition by coarsening, and infers the quantiles of the original data vector from the summaries. We then establish the deterministic accuracy of the algorithm in Theorem 7.4.1; the accuracy is measured in terms of the probability loss function of the original data vector. This is an extension of the work of Alsabti et al. in [3] to the case of partitions of unequal size. Theorem 7.4.1 still requires the partition sizes to be divisible by d, the coarsening factor. In order to extend the results further to the case where the partition sizes are not divisible by d, we investigate how the quantiles of a data vector with missing or contaminated data relate to the quantiles of the original data in Lemma 7.4.3 and Lemma 7.4.4. Also, in Lemma 7.5.1, we show how much accuracy is lost if the quantiles of a coarsened vector are used in place of the quantiles of the original data vector. Finally, we investigate the performance of the algorithm using both simulations and real climate datasets.

7.2 Previous work

Finding quantiles and using them to summarize data is of great importance in many fields. One example is climate studies, where very large datasets arise; for example, the datasets created by computer climate models are petabytes in size. At NCAR (the National Center for Atmospheric Research in Boulder, Colorado), the climate data (outputs of computer models) are saved on several disks, and to access different parts of these data a robot needs to change disks from a very large storage space. Another case where we confront large datasets is in dealing with data streams, which arise in many different applications such as finance and high–speed networking. For many applications, approximate answers suffice. In computer science, quantiles are important to both database implementers and database users. They can also be used by business intelligence applications to derive summary information from huge datasets. As pointed out by Gurmeet et al. in [32], a good quantile approximation algorithm should

1. not require prior knowledge of the arrival or value distribution of its inputs.
2. provide explicit and tunable approximation guarantees.
3. compute results in a single pass.
4. produce multiple quantiles at no extra cost.
5. use as little memory as possible.
6. be simple to code and understand.

Finding quantiles of data vectors and sorting them are closely related problems, since once a vector is sorted any given quantile can be found instantly. A good account of early work in sorting algorithms can be found in [28]. Munero et al.
in [36] showed for P –pass algorithms (algorithms that scan the data P times) Θ(N/P ) storage locations are necessary and sufficient, where N is the length of the dataset. (See Appendix C for the definitions of complexity functions such as Θ.) It is well–known that the worst-case complexity of sorting is n log2 n + O(1) as shown in [33]. In [39], Paterson discusses the progress made in the so–called “selection” problem. He lets Vk (n) be the worst–case minimum number of pairwise comparisons required to find the k–th largest out of n “distinct elements”. In particular M (n) = Vk (n) for k = ⌈n/2⌉. In [8], it is shown that the lower bound for Vk (n) is n + min{k − 1, n − k} − 1, an achieved upper bound by Blum is 5.43n. Better upper bounds have been achieved through the years. The best upper bound so far is 2.9423N and the lower bound is (2 + α)N where α is of order 2−40 . Yao in [49], showed that finding approximate median needs Ω(N ) comparisons in deterministic algorithms. Using sampling this can be reduced to O( ǫ12 log(δ−1 )) independent of N , where ǫ is the accuracy of the approximation in terms of the “probability loss” in our notation. In [36], Munero et al. showed that O(N 1/p ) is necessary and sufficient to find an exact φ–quantile in p passes. Often an exact quantile is not needed. A related problem is finding space–efficient one–pass algorithms to find approximate quantiles. A summary of the work done in this subject and a new method is given in [1]. Two approximate quantile algorithms using only a constant amount of memory were given by Jain [26] Agrawal et al. in [1]. No guarantee for the error was given. Alsabti et al. in [3], provide an algorithm and guaranteed error 195  7.3. The median of the medians in one pass. This algorithm works by partitioning the data into subsets, summarizing each partition and then finding the final quantiles using the summarized partitions. The algorithm in this chapter is an extension of this algorithm to the case of partitions of unequal length.  7.3  The median of the medians  A proposed algorithm to approximate the median of a very large data vector partitions the data into subsets of equal length, computes the median for each partition and then computes the median of the medians. For example, suppose n = lm and break the data to m vectors of size l. One might conjecture that by picking l or m sufficiently large the median of the medians would ensure close proximity to the exact median. We show by an example that taking l and m very large will not help to get close to the exact median. Let l = 2b + 1 and m = 2a + 1.  partition number 1 2 . . . a a+1 a+2 . . . 2a+1  Partition  Median of the partition b  (1, 2, · · · , b, b + 1, 10 , · · · (1, 2, · · · , b, b + 1, 10b , · · · . . . (1, 2, · · · , b, b + 1, 10b , · · · (1, 2, · · · , b, b + 1, 10b , · · · (10b , 10b , · · · , 10b ) . . . (10b , 10b , · · · , 10b )  b  , 10 ) , 10b )  , 10b ) , 10b )  b+1 b+1 . . . b+1 10b 10b . . . 10b  Table 7.1: The table of data Example Table 7.1 shows the dataset partitioned into m = 2a + 1 vectors of equal length. Every vector is of length l = 2b + 1. The first a + 1 vectors are identical and 10b is repeated b times in them. The last a vectors are also identical with all components equal to 10b . The median of the medians turns out to be b + 1. However, the median of the dataset is 10b . We show that b + 1 is in fact “almost” the first quantile. This is because (b + 1) is smaller  196  7.4. 
Data coarsening and quantile approximation algorithm than all 10b ’s. There are (a + 1)b + a(2b + 1) data points equal to 10b . Hence b + 1 is smaller than this fraction of the data points: 2a + 2 b a 1 1 3 (a + 1)b + a(2b + 1) = + ≈1× + ≈ . (2a + 1)(2b + 1) 2a + 1 4b + 2 2a + 1 4 2 4 With a similar argument, we can show that b + 1 is greater than almost a quarter of the data points (the ones equal to 1, 2, · · · , b). Hence b + 1 is “almost” the first quantile. One can prove a rigorous version of the the following statement. The median of the medians is “almost” between the first and the third quartile. We only give a heuristic argument for simplicity. To that end, let n = lm and m = 2a + 1 and l = 2b + 1. Let M be the exact median and M ′ be the median of the medians. Order the obtained medians of each partition and denote them by M1 , · · · , Mm . By definition M ′ ≥ Mj , j ≤ a and M ′ ≤ Mj , j ≥ a + 1. Each Mj , j ≤ a is less than or equal to b data points in its partition. Hence, we conclude that M ′ is less than or equal to ab data points. Similarly M ′ is greater than or equal to ab data points (which are ab 1 disjoint for the data points used before). But ab n = (2a+1)(2b+1) ≈ 4 . Hence, M ′ is greater than or equal to 1/4 data points and less than or equal to 1/4 data points.  7.4  Data coarsening and quantile approximation algorithm  This section introduces an algorithm to approximate quantiles in very large data vectors. As we demonstrated in the previous section the median of medians algorithm is not necessarily a good approximation to the exact median of a data vector even if we have a large number of partitions and large length of the partitions. The algorithm is based on the idea of “data coarsening” which we will discuss shortly. The proposed algorithm can give us approximations to the exact quantile of known precisions in terms of degree of separation. After stating the algorithm, we prove some theorems that give us the precision of the algorithm. The results hold for partitions of non–equal length.  197  7.4. Data coarsening and quantile approximation algorithm Definition Suppose a data vector x of length n = n1 n2 is given, n1 , n2 > 1 ∈ N. Also let sort(x) = y = (y1 , · · · , yn ). Then the n2 –coarsening of x, Cn2 (x) is defined to be (yn2 , y2n2 , · · · , y(n1 −1)n2 ). Note that Cn2 (x) has length n1 − 1. Let pi = i/n1 , i = 1, 2, · · · , (n1 − 1). Then Cn2 (x) = (lqx (p1 ), · · · , lqx (pn1 −1 )). We can immediately generalize the coarsening operator. Suppose sort(x) = (y1 , · · · , yn ), and n2 < n is given. Then by The Quotient–Remainder Theorem from elementary number theory, there exist n1 ∈ N ∪ {0} and r < n2 such that n = n1 n2 + r. Define Cn2 (x) = (yn2 , · · · , yn2 (n1 −1) ). The expression is similar to before. However, there are n2 + r elements after yn2 (n1 −1) in the sorted vector y. In this sense this coarsening is not fully symmetric. We show that if n2 is small compared to n this lack of symmetry has a small effect on the approximation of quantiles. Pm Suppose x is a data vector of length n = i=1 li . We introduce the coarsening algorithm to find approximations to the large data vectors. d–Coarsening quantiles algorithm: 1. Partition x into vectors of length l1 , · · · , lm . (Or use pre–existing partitions, e.g. partitions of data saved in various files on the hard disk of a computer.) x1 = (x1 , · · · , xl1 ), x2 = (xl1 +1 , · · · , xl1 +l2 ), · · · , xm = (xPm−1 lj +1 , · · · , xn ) j=1  2. 
Sort each xl , l = 1, 2, · · · , m and let y l = sort(xl ), l = 1, · · · , m: y 1 = (y11 , · · · , yl11 ), · · · , y m = (y1m , · · · , ylmm ). 3. d–Coarsen every vector: 1 m (yd1 , · · · , y(c ), · · · , (ydm , · · · , y(c ), m −1)d 1 −1)d  j and for simplicity drop d and use the notation wij = yid . 1 m w1 = (w11 , · · · , w(c ), · · · , wm = (w1m , · · · , w(c ). m −1) 1 −1)  198  7.4. Data coarsening and quantile approximation algorithm 4. Stack all the above vectors into a single vector and call it w. Find rqw (p) (or lqw (p)) and call it µ. Then µ is our approximation to rqx (p) (or lqx (p)). Pm Theorem 7.4.1PSuppose x is of length n = i=1 li , m ≥ 2 and li = m ci d. Let C = i=1 ci . Apply the coarsening algorithm to x and find µ to approximate rqx (p) (or lqx (p)). Then µ is a (left and right) quantile in the interval [p − ǫ, p + ǫ], m+1 where ǫ = C−m . In other words δx (µ, rqx (p)) ≤ ǫ and δx (µ, lqx (p)) ≤ ǫ. 1 3 When li = cd, i = 1, · · · , m, ǫ = m+1 m−1 c−1 ≤ c−1 .  We need an elementary lemma in the proof of this theorem. Lemma 7.4.2 (Two interval distance lemma) Suppose two intervals I = [a, b] and J = [c, d] subsets of R are given. Then sup{|p − q|, p ∈ I, q ∈ J} = max{|a − d|, |b − c|}. Proof sup{|p − q|, p ∈ I, q ∈ J} ≥ max{|a − d|, |b − c|} is trivial because a, b ∈ I and c, d ∈ J. To show the converse note that |p − q| = p − q or q − p, p ∈ I, q ∈ J. But p − q ≤ b − c, and q − p ≤ d − a. Hence |p − q| ≤ max{b − c, d − a} ≤ max{|b − c|, |a − d|}. This completes the proof. Proof of Theorem 7.4.1. P Pm Let n′ = m i=1 (ci − 1) = i=1 ci − m = C − m and MC = {(i, j)|i = 1, 2 · · · , m, j = 1, · · · , ci −1}, the index set of w. Also let c = max{c1 , · · · , cm }. h ′ Suppose, h−1 n′ ≤ p < n′ , h = 1, · · · , n . Then since µ = rqw (p), there ′ are disjoint subsets of MC , K and K such that |K| = h, |K ′ | = n′ − h, µ ≥ wji , (i, j) ∈ K and µ ≤ wji , (i, j) ∈ K ′ . (This is because if we let v = sort(w), rqw (p) = vh since [n′ p] = h − 1.) 199  7.4. Data coarsening and quantile approximation algorithm K, K ′ are not necessarily unique because of possible repetitions among the wti . Hence we impose another condition on K and K ′ . If (i, t) ∈ K then (i, u) ∈ / K ′ , u < t. It is always possible to arrange for this condition. For suppose, (i, t) ∈ K and (i, u) ∈ K ′ , u < t. Then µ ≥ wit and µ ≤ wui , hence wti ≤ wiu . But since u < t we have wti ≤ wiu by the definition of wi . We conclude that wti = wiu . Now we can simply exchange (i, t) and (i, u) between K and K ′ . If we continue this procedure after finite number of steps we will get K and K ′ with the desired property. Now define •  K1 = {(i, 1)|(i, 1) ∈ K}, with |K1 | = k1 and I1 = {(i, j)|j ≤ d, (i, 1) ∈ K}, Then |I1 | = k1 d. Also note that if (i, j) ∈ I1 , µ ≥ w1i ≥ yji .  • Let  K2 = {(i, 2)|, (i, 2) ∈ K},  with |K2 | = k2 and I2 = {(i, j)|d < j ≤ 2d, (i, 2) ∈ K}.  Then |I2 | = k2 d. Also note that if (i, j) ∈ I2 , µ ≥ w2i ≥ yji . • Let  Kt = {(i, t)|(i, t) ∈ K},  with |Kt | = kt and It = {(i, j)|(t − 1)d < j ≤ td, (i, t) ∈ K}.  Then |It | = kt d. Also note that if (i, j) ∈ It , µ ≥ wti ≥ yji . • Let  Kc−1 = {(i, (c − 1))|(i, c − 1) ∈ K},  with |Kc−1 | = kc−1 and I(c−1) = {(i, j)|(c − 2)d < j ≤ (c − 1)d, (i, c − 1) ∈ K}.  i ≥ yji . Then |Ic−1 | = kc−1 d. Also note that if (i, j) ∈ I(c−1) , µ ≥ w(c−1)  200  7.4. Data coarsening and quantile approximation algorithm Note that K = ∪c−1 t=1 Kt , |K| = k1 , + · · · + kc−1 . Since the Kt are disjoint the It are also disjoint. 
Let I = ∪c−1 t=1 It then |I| = d(k1 + · · · + kc−1 ) = d|K|. Also note that (i, j) ∈ I ⇒ µ ≥ yji . Similarly define, •  K1′ = {(i, 1)|(i, 1) ∈ K ′ }, |K1′ | = k1′ , and I1′ = {(i, j)|d < j ≤ 2d, (i, 1) ∈ K ′ }.  Then |I1′ | = k1′ d. Also note that if (i, j) ∈ I1′ , µ ≤ w1i ≤ yji . • Let  K2′ = {(i, 2)|(i, 2) ∈ K ′ }, |K2′ | = k2′ ,  and I2′ = {(i, j)|2d < j ≤ 3d, (i, 2) ∈ K ′ }.  Then |I2′ | = k2′ d. Also note that if (i, j) ∈ I2′ , µ ≤ w2i ≤ yji . • Let  Kt′ = {(i, t)|(i, t) ∈ K ′ }, |Kt′ | = k′ t,  and It′ = {(i, j)|td < j ≤ (t + 1)d, (i, t) ∈ K ′ }.  Then |It′ | = kt′ d. Also note that if (i, j) ∈ It′ then µ ≤ wti ≤ yji . •  ′ ′ ′ Kc−1 = {(i, (c − 1))|(i, c − 1) ∈ K ′ }, |Kc−1 | = kc−1 ,  and ′ = {(i, j)|j > (c − 1)d, (i, c − 1) ∈ K ′ }. Ic−1  ′ | = k′ ′ i i Then |Ic−1 c−1 d. Also note that if (i, j) ∈ Ic−1 ⇒ µ ≤ w(c−1) ≤ yj .  Then |I| = |K|d and |I ′ | = |K ′ |d. We claim that I ∩ I ′ = ∅. To see this note that because of how the second components in It and It′ are defined, it is only possible that It+1 = {(i, j)|td < j ≤ (t + 1)d, (i, t + 1) ∈ K} and It′ = {(i, j)|td < j ≤ (t + 1)d, (i, t) ∈ K ′ } intersect for some t = 1, · · · , c − 2. But if they intersect then there exist i, t such that (i, t + 1) ∈ K and (i, t) ∈ K ′ which is against our assumption regarding K and K ′ . Hence by Lemma 5.2.4, µ is a quantile between  201  7.4. Data coarsening and quantile approximation algorithm  [  |K|d n − |K ′ |d hd n − (n′ − h)d h m+h , ] = [ Pm , Pm ]=[ , ]. n n c d c d C C i=1 i i=1 i  But we know that  h−1 h , ). C −m C −m We are dealing with two interval in one of them µ is a quantile and the other contains p. We showed in Lemma 7.4.2 if two intervals [a, b] and [c, d] are given, the sup distance between two elements of the two intervals is p∈[  max{|a − d|, |b − c|}. Applying this to the above two intervals we get, max{| which is equal to, max{|  h−1 h−1 h m+h − |, | − |}, C C−m C −m C  mC − m2 − hm + C C − hm |, | |}. C(C − m) C(C − m)  But m2 + hm ≤ m2 + (C − m)m = mC. Hence |  mC − m2 − hm + C mC − m2 − hm + C mC + C m+1 |= ≤ = . C(C − m) C(C − m) C(C − m) C−m  Also |  C − hm C + mC m+1 |≤ ≤ . C(C − m) C(C − m) C −m  m+1 Hence the max is smaller than ǫ = C−m and we conclude that µ is a quantile ′ for p which is at most as far as ǫ to p. The case li = cd is easily obtained by replacing C = mc and noting that m+1 m−1 ≤ 3 m ≥ 2.  In most applications, usually the data partitions are not divisible by d. For example the data might be stored in files of different length with common factors. Another situation involves a very large file that is needed to be read in successive stages because of memory limitations. Suppose that 202  7.4. Data coarsening and quantile approximation algorithm we need a precision ǫ (in terms of degree of separation) and based on that we find an appropriate c and m. Note that n might not be divisible by mc. First we prove two lemmas. These lemmas show what happens to the quantiles if we throw away a small portion of the data vector or add some more data to it. The first lemma is for a situation that we have thrown away or ignored a small part of the data. The second lemma is for a situation that a small part of the data are contaminated or includes outliers. In both cases, we show how the quantiles computed in the “imperfect” vectors correspond to the quantiles of the original vector. In both case x stands for the imperfect vector and w is the complete/clean data. 
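Before stating the two lemmas, a minimal R sketch of the d-coarsening algorithm above may help fix ideas. It is only an illustration of the setting of Theorem 7.4.1, where every partition length is a multiple of the coarsening factor d: the helper-function names, the block sizes and the use of the left quantile in the final step are assumptions made for this example. The reported "dos" is the fraction of the data strictly between the approximate and the exact quantile, i.e. the probability loss δx, and it should fall below the theoretical bound ǫ = (m + 1)/(C − m).

lq <- function(x, p) sort(x)[ceiling(length(x) * p)]   # left quantile of a data vector

coarsen <- function(x, d) {                            # d-coarsening of one block
  y <- sort(x)
  ci <- length(x) / d                                  # assumes length(x) = ci * d
  y[d * seq_len(ci - 1)]                               # (y_d, y_2d, ..., y_{(ci-1)d})
}

approx.lq <- function(blocks, d, p) {
  w <- unlist(lapply(blocks, coarsen, d = d))          # stack the coarsened blocks
  lq(w, p)                                             # quantile of the summaries
}

set.seed(2)
d  <- 50
ci <- c(20, 35, 10, 60)                                # unequal blocks: C = 125, m = 4
blocks <- lapply(ci * d, rnorm)
x  <- unlist(blocks)

p   <- 0.5
mu  <- approx.lq(blocks, d, p)
ex  <- lq(x, p)
dos <- sum(x > min(mu, ex) & x < max(mu, ex)) / length(x)  # delta_x(mu, lq_x(p))
eps <- (length(ci) + 1) / (sum(ci) - length(ci))           # bound from Theorem 7.4.1
c(approx = mu, exact = ex, dos = dos, bound = eps)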
Lemma 7.4.3 (Missing data quantile summary lemma) Suppose x = (x1 , · · · , xn ), sort(x) = (y1 , · · · , yn ) and y ′ = lqx (p), p ∈ [0, 1]. Consider a vector x⋆ of length n⋆ and let w = stack(x, x⋆ ). Then y ′ = n⋆ lqw (p′ ), where p′ ∈ [p − ǫ, p + ǫ] and ǫ = n+n ⋆. Similarly if y ′ = rqx (p) and p ∈ [0, 1], y ′ = rqw (p′ ), where p′ ∈ [p−ǫ, p+ǫ] n⋆ and ǫ = n+n ⋆. Proof We prove the result for lqx only and a similar argument works for rqx . Let z = sort(w) then lqz = lqw . For p = 1 the result is easy to see. ′ Otherwise, ni ≤ p < i+1 n for some i = 0, · · · , n−1. But then y = lqx (p) = yi . ⋆ ′ In the new vector z since we have added n elements y = zj for some j, j i ≤ j < i + n⋆ . Hence y ′ = lqz ( n+n ⋆ ). From np − 1 < i ≤ np, we conclude np − 1 i j i + n⋆ np + n⋆ < ≤ < ≤ . n + n⋆ n + n⋆ n + n⋆ n + n⋆ n + n⋆ Hence,  n⋆ (1 − p) − 1 j n⋆ (1 − p) < − p < ⇒ n + n⋆ n + n⋆ n + n⋆ |  ⋆  j n⋆ (1 − p) − 1 n⋆ (1 − p) − p| < max{| |, | |}. n + n⋆ n + n⋆ n + n⋆ ⋆  ⋆  ⋆  (1−p) n (1−p)−1 n n −1 1 But | nn+n | ≤ max{ n+n ⋆ | ≤ n+n⋆ and | ⋆ , n+n⋆ } since p ranges in n+n⋆ [0, 1]. We conclude that that  |  j n⋆ − p| < . n + n⋆ n + n⋆  203  7.4. Data coarsening and quantile approximation algorithm Lemma 7.4.4 (Contaminated data quantile summary lemma) Suppose x = (x1 , · · · , xn ), sort(x) = (y1 , · · · , yn ) and y ′ = lqx (p), p ∈ [0, 1]. Consider the vector w = (x1 , x2 , · · · , xn−n⋆ ) then y ′ = lqw (p′ ), where p′ ∈ n⋆ [p − ǫ, p + ǫ] and ǫ = n−n ⋆. Similarly if y ′ = rqx (p) and p ∈ [0, 1], y ′ = rqw (p′ ), where p′ ∈ [p−ǫ, p+ǫ] n⋆ and ǫ = n−n ⋆. Proof We only show the case for lqx and a similar argument works for rqx . Let z = sort(w). Then lqz = lqw . If p = 1 the result is easy to see. Otherwise, i i+1 ′ n ≤ p < n for some i = 0, · · · , n − 1. But then y = lqx (p) = yi . In ⋆ the new vector z since we have removed n elements y ′ = zj for some j, j i − n⋆ ≤ j ≤ i. Hence y ′ = lqz ( n−n ⋆ ). From np − 1 < i ≤ np, we conclude ⋆ ⋆ np − 1 − n < j ≤ np ⇒ np − n ≤ j ≤ np. Hence j n⋆ p −n⋆ + n⋆ p ≤ − p ≤ ⇒ n − n⋆ n − n⋆ n − n⋆ j n⋆ | − p| ≤ . n − n⋆ n − n⋆ In the case that the partitions are not divisible by d, we can use the same algorithm with generalized coarsening. The error will increase obviously and the next two lemmas say by how much. Lemma 7.4.5 Suppose x has length n = lm + r, 0 ≤ r < l and m = cd. To find lqx (p), apply the algorithm in the previous theorems to a sub–vector of x of length lm. Then the obtained quantile is a quantile for a number in 1 r [p − ǫ, p + ǫ], where ǫ = m+1 m−1 c−1 + lm+r . Proof The result is a straightforward consequence of the Theorem 7.4.1 and the Lemma 7.4.3. P Lemma P 7.4.6 Suppose x has length n = m i=1 li and li = ci d + ri , ri < d. m Let R = i=1 ri . Then apply the algorithm above to x to find lqx (p), using the generalized coarsening. The obtained quantile is a quantile for a number m+1 R in [p − ǫ, p + ǫ] where ǫ = C−m + R+Cd . Proof Let li′ = ci d. Consider x′ a sub–vector of x consisting of (y11 , · · · , yl1′ ), (y12 , · · · , yl2′ ), · · · , (y1m , · · · , ylmm ′ ). 1  2  204  7.5. The algorithm and computations P ′ Then x′ has length m the ali=1 li . By Lemma 7.4.3 p-th quantile found byP m+1 gorithm is a quantile in [p − ǫ1 , p + ǫ1 ], ǫ1 = C−m for x′ . x has R = m i=1 ri elements more than x′ . Hence the obtained quantile is a quantile for x for R a number in [p − ǫ, p + ǫ], ǫ = ǫ1 + R+Cd .  7.5  The algorithm and computations  Suppose a data vector x has length n. To find the quantiles of this vector, we only need to sort x. 
Since then for any p ∈ (0, 1), we can find the first h such that p ≥ h/n. Note that sort(x) = (lqx (1/n), lqx (2/n), · · · , lqx (1)) = (rqx (0), rqx (1/n), · · · , rqx (  n−1 )). n  We only focus on left quantiles here. Similar arguments hold for the right quantile. Obviously, the longer the vector x, the finer the resulting quantiles are. Now imagine that we are given a very long data vector which cannot even be loaded on the computer memory. Firstly, sorting this data is a challenge and secondly, reporting the whole sorted vector is not feasible. Assume that we are given the sorted data vector so that we do not need to sort it. What would be an appropriate summary to report as the quantiles? As we noted also the sorted vector itself although appropriate, maybe of such length as to make further computation and file transfer impossible. The natural alternative would be to coarsen the data vector and report the resulting coarsened vector. To be more precise, suppose, length(x) = n = n1 n2 and y = sort(x) = (y1 , · · · , yn ). Then we can report y ′ = Cn2 (y) = (yn2 , · · · , y(n1 −1)n2 ). This corresponds to (lqy′ (1/n2 ), · · · , lqy′ (1)). How much will be lost by this coarsening? Suppose, we require the left quantile corresponding to (h − 1)/n < p ≤ h/n, h = 1, · · · , n. Then x would give us yh . But since (h − 1)/n < p ≤ h/n np < h ≤ np + 1. 205  7.5. The algorithm and computations Also suppose for some h′ = 1, · · · , n1 , (h′ − 1)/(n1 − 1) < p ≤ (h′ )/(n1 − 1) ⇒ (h′ − 1) < p(n1 − 1) ≤ h′ ⇒ (n1 − 1)p ≤ h′ < p(n1 − 1) + 1. Then (h − 1)(n1 − 1)/n < h′ < h(n1 − 1)/n + 1, and (h − 1)(n1 − 1)n2 /n < h′ n2 < h(n1 − 1)n2 /n + n2 .  (7.1)  Using the coarsened vector, we would report yh′ (n2 ) as the approximated quantile for p. The degree of separation between this element and the exact quantile using Equation 7.1 is less than or equal to  max{  |h − (h − 1)(n1 − 1)n2 /n| |h(n1 − 1)n2 /n + n2 − h| , }. n n  This equals max{|  −hn2 − n1 n2 + n2 −hn2 + nn2 |, | |}. n2 n2  But  |  −hn2 − n1 n2 + n2 n2 (n1 + n − 1) n2 (n1 + n) 1 n2 |= < = + , 2 2 2 n n n n n  and  −hn2 + nn2 n2 . |< 2 n n Hence the degree of separation is less than 1/n + 1/n1 . We have proved the following lemma. |  Lemma 7.5.1 Suppose x is a data vector of the length n = n1 n2 and y = sort(x), y ′ = Cn2 (y). Then if we use the quantiles of y ′ in place of x, the accuracy lost in terms of the probability loss of x (δx ) is less than 1/n+1/n1 . The algorithm proposes that instead of sorting the whole vector and then coarsening it, coarsen partitions of the data. The accuracy of the quantiles obtained in this way is given in the theorems of the previous section. This allows us to load the data into the memory in stages and avoid program failure due to the length of the data vector. We are also interested in the 206  7.5. The algorithm and computations performance of the method in terms of speed, and do a simulation study using the “R” package (a well–known software for statistical analysis) to assess this. In order to see theoretical results regarding the complexity of the special case of the algorithm for equal partitions see [3]. For the simulation study, we create a vector, x, of length n = 107 . We apply the algorithm for m = 1000, c = 20, d = 500. We create this vector in a loop of length 1000. During each iteration of the loop, we generate a random mean for a normal distribution by first sampling from N (0, 100). Then we sample 10,000 points from a normal distribution with this mean and standard deviation 1. 
We compare two scenarios: 1. Start by a NULL vector x and in each iteration add the full generated vector of length 10000 to x. After the loop has completed its run, sort the data vector which now has length 107 by the command sort in R and use this to find the quantiles. 2. Start with a NULL vector w. During each iteration after generating the random vector, d–coarsen the data by d = 500. (Hence m = 1000, c = 20.) In order to do that computing, first apply the sort command to the data and then simply d–coarsen the resulting sorted vector. During each iteration, add the coarsened vector to w. After all the iterations, sort w and use it to approximate quantiles. Remark. The first part corresponds to the straightforward quantiles’ calculation and the second corresponds to our algorithm. Note that in the real examples instead of the loop, we could have a list of 1000 data files and still this example serves as a way of comparing the straightforward method and our algorithm. Remark. Note that if we wanted to create an even longer vector say of length 1010 then the first method would not even complete because the computer would run out of memory in saving the whole vector x. Remark. The final stage of the algorithm can use the fact that w is built of ordered vectors to make the algorithm even faster. We will leave that a problem to be investigated in the future. We have repeated the same procedure for n = 2×107 , m = 1000, d = 500 and n = 108 , m = 1000, d = 500. The results of the simulation are given in Table 7.2, in which “DOS” stands for the degree of separation between the exact median and the approximated median. The “DOS bound” bounds the degree of separation obtained by the theorems in the previous section. For n = 107 , n = 2 × 107 significant time accrue by using the algorithm. For a vector of length 108 , R crashed when we tried to sort the original vector 207  7.5. The algorithm and computations and only the algorithm could provide results. For all cases the exact and approximated quantiles are close. In fact the dos is significantly smaller than the dos bound. This is because this is a “worst–case” bound. The exact and approximated quantiles for n = 107 are plotted in Figure 7.1. Length Exact median value Algorithm median value DOS DOS bound Time for exact median Time for the algorithm  n = 107 1.847120 1.866882 0.00012 0.05268421 186 sec 6 sec  n = 2 × 107 1.857168 1.846463 −6.475 × 10−5 0.02566667 461 s 18 s  n = 108 NA 1.846027 NA 0.005030151 NA 98 s  Table 7.2: Comparing the exact method with the proposed algorithm in R run on a laptop with 512 MB memory and a processor 1500 MHZ, m = 1000, d = 500. “DOS” stands for degree of separation in the original vector. “DOS bound” is the theoretical degree of separation obtained by Theorem 7.4.1. Next, we apply the algorithm on a real dataset. The dataset includes the daily maximum temperature for 25 stations over Alberta during the period 1940–2004. We focus on the 95th percentile. The results are given in Table 7.3. The algorithm finds the percentile more quickly but the time difference is not as large as the simulation. This is because most of the time of the algorithm and the exact computation is spent on reading the files from the hard drive. The dos bound is about 0.01 (on the 0–1 probability scale). The true degree of separation is about 0.001. The estimated quantiles and the exact quantiles are plotted in Figure 7.2. 
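For reference, the two scenarios compared above can be written out as a short R sketch. The version below shrinks the problem (m = 100 blocks rather than 1000) so that it runs quickly, and it deliberately keeps the naive vector-growing inside the loop; the settings are illustrative and the numbers it produces are not those of Table 7.2 or Table 7.3.

set.seed(3)
m <- 100; block.len <- 10^4; d <- 500                  # smaller than in the text

x <- NULL                                              # scenario 1: keep all the data
w <- NULL                                              # scenario 2: keep only coarsened blocks
for (i in 1:m) {
  block <- rnorm(block.len, mean = rnorm(1, 0, 100), sd = 1)
  x <- c(x, block)
  y <- sort(block)
  w <- c(w, y[d * seq_len(block.len / d - 1)])         # d-coarsen and accumulate
}

exact  <- sort(x)[ceiling(length(x) * 0.5)]            # exact (left-quantile) median
approx <- sort(w)[ceiling(length(w) * 0.5)]            # the algorithm's median
c(exact = exact, approx = approx)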
Notice that the exact and approximated values match except at the very beginning (very close to zero) and end (when it is close to 1), where we see that the circles (corresponding to exact quantiles) and the +s (corresponding to the approximated quantiles) do not completely match. This difference is at most 0.01 in terms of dos in any case.  208  −200  −100  0  p  100  200  300  7.5. The algorithm and computations  0.0  0.2  0.4  0.6  0.8  1.0  lq(p)  Figure 7.1: Comparing the approximated quantiles to the exact quantiles N = 107 . The circles are the exact quantiles and the + are the corresponding approximated quantiles.  209  −30  −20  −10  0  p  10  20  30  40  7.5. The algorithm and computations  0.0  0.2  0.4  0.6  0.8  1.0  lq(p)  Figure 7.2: Comparing the approximated quantiles to the exact quantiles for M T (daily maximum temperature) over 25 stations in Alberta 1940–2004. The circles are the exact quantiles and the + the approximated quantiles.  210  7.5. The algorithm and computations Exact 95th percentile Algorithm 95th percentile DOS DOS bound time for exact median time for the algorithm  27 C 26.7 C 0.001278726 0.01052189 8 min 6 sec 7 min 29 sec  Table 7.3: Comparing the exact method with the proposed algorithm in R (run on a laptop with 512 MB memory and processor 1500 MHZ) to compute the quantiles of M T (daily maximum temperature) over 25 stations with data from 1940 to 2004.  211  Chapter 8  Quantile data summaries 8.1  Introduction  This chapter introduces techniques to summarize data (using quantiles), manipulate and combine such summaries. “Weighted data vectors”, which are an extension of data vectors are introduced. The operators sort and stack are extended to weighted data vectors and the operator comp (compress) is introduced to compress a data vector as much as possible with no loss of information. In the quantile definition chapter, we expressed a few appealing properties that quantiles should satisfy. We established the equivariance and symmetry properties and left the following to later: 1. The “amount” of data between qx (p1 ) and qx (p2 ) should be a p2 − p1 , p1 < p2 fraction of the “data amount” of the whole data. 2. If we cut a sorted data vector up until the p1 -th quantile and compute the p2 -th quantile for the new vector, we should get the p1 p2 -th quantile of the original vector. For example the median of a sorted vector upto its median should be the first quartile of the original vector. A natural definition for the “amount of data” between a, b would be the number of data points between a, b divided by the length of the whole vector. However, by this definition there is no hope of establishing property (1) knowing that p2 − p1 can be irrational. Also for the second property one might conjecture that if we define the cut operator to be the sorted vector from left to lqx (p1 ) (or rqx (p1 )) then this property holds. However, consider x = (1, 2) and a cut of length 0.6. Then we get the same vector x′ = (1, 2) after the cut using this definition since lqx (0.6) = 2. Now the 0.7th left (or right) quantile of the cut vector x′ is lqx′ (0.7) = 2. However, lqx (0.6 × 0.7) = lqx (0.42) = 1. 212  8.1. Introduction In the following, we define the cut operator for p ∈ (0, 1) in a way that it ends with lqx (p) but satisfies property (2). The idea can be explained in the example by considering the vector x = (1, 2) as a weighted vector with weights (1/2, 1/2) and give 2 less “weight” than 1 after the cut. 
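A short numerical check of this counterexample, using the left quantile of an ordinary data vector, lqx(p) = sort(x)[ceiling(np)], written as a small R helper for the purpose of the illustration:

lq <- function(x, p) sort(x)[ceiling(length(x) * p)]   # left quantile of a data vector

x <- c(1, 2)
x.cut <- sort(x)[1:which(sort(x) == lq(x, 0.6))]       # naive cut up to lq_x(0.6): still (1, 2)
lq(x.cut, 0.7)                                         # 2
lq(x, 0.6 * 0.7)                                       # 1, so property (2) fails for the naive cut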
In summary, this chapter provides a framework to establish these properties, using the “partition” operator and the “cut” operator. When dealing with summarized data the following general question is a fundamental one: Question: Suppose x is a data vector which consists of m subvectors x1 , · · · , xm . In other words x = stack(x1 , · · · , xm ). Assume we do not have access to the xi but to the wi , their summaries (possibly a result of coarsening of the xi ). Then how can we approximate the quantiles of the original data vector x and assess how good this approximation is? We have already encountered such a problem in Chapter 7, where we answered the question in some specific cases. We do not answer the question in general in this chapter but provide a framework to formalize and answer these type of questions. In computer science quantiles are sometimes used to summarize large datasets. A good summary of the work for creating quantile summaries of datasets in a single pass is given in [19]. In order to make a summary (of length k) of a data vector using the quantiles, one has various choices to pick certain probability indices p1 ≤ p2 ≤ · · · ≤ pk , and save the corresponding quantiles. Using the probability loss function, we find an optimal way of doing this. Then we consider the problem of finding argminE(L(X, a)), a  for various L (loss) functions. It is widely claimed that if L is the absolute value function, the argmin is the median of X. We show that the argmin is in fact [lqX (1/2), rqX (1/2)]. We also find the argminE(δX (X, a)). a  Finally, we find optimal “probability index vectors” to assign quantiles to a random sample X1 , · · · , Xn , which can be used to make a quantile–quantile plot. Some previous techniques to make a q–q plot are discussed in [24]. 213  8.2. Generalization to weighted vectors  8.2  Generalization to weighted vectors  This section extends the definitions and ideas developed before (quantiles, probability loss function, sorting, stacking etc.) from ordinary data vectors to weighted vectors. A weighted vector has two extra components compared to an ordinary vector: a weight allocation and a data amount. This allows us to summarize information in some cases. For example, consider the vector (1, 1, 1, 1, 1, 1, 1, 1, 1, 2). We observe that 1 is repeated 9 times and 2 only one time. We can summarize this by giving the elements (1, 2) a weight allocation (0.9, 0.1) and a data amount 10 which is the length of the vector in this case. Weighted vectors also enable us to define the “cut” operator to cut data vectors. Definition We call a triple χ = (x, wχ , nχ ) a weighted vector if length(x) = P length(wχ ) = lx , x = (x1 , · · · , xl ), wχ = (w1χ , · · · , wlχ ), li=1 wiχ = 1 and nχ a positive real number. Note that nχ is not necessarily equal to the length of x. We call wχ the “weight vector” of χ and nχ the “data amount” of χ. Remark. Note that in order to specify a weight vector w, we do not need to specify the last component since the weights must sum up to one. Examples: 1. χ = ((1, 2, 3), (1/3, 1/3, 1/3), 3). This is equivalent to an ordinary vector of length 3 in a sense we make clear soon. 2. χ = ((1, 2, 3), (1/3, 1/3, 1/3), 6). Notice this weighted vector has the same elements as before with a data amount of 6 which is two times the previous vector. This vector is equivalent to the ordinary vector x = (1, 1, 2, 2, 3, 3). 3. χ = ((1, 1, 2, 3), (1/6, 1/6, 1/3, 1/3), 3). This is equivalent to vector given in 1. Note that one is repeated two times here. 
However, the sum of the weights for 1 is 1/6+1/6=1/3 which is the same as the vector defined in 1. 4. χ = ((1), (1), 1/2). Here we only have 1/2 data amount. i.e. we have less than one observation! (1/2 of an observation to be precise.) √ 5. χ = ((1, 2), (1/2, 1/2), 3).  214  8.2. Generalization to weighted vectors The first vector, x, in the definition χ = (x, wχ , nχ ), is the vector of possible values, the second one, wχ , is the corresponding weights for elements of x and the third component, nχ , is a measure of how fine the vector is. A vector is called an ordinary vector if the length of x, lx , is equal to nχ and wiχ = wjχ , i, j ∈ 1, · · · , lx . The ordinary vector corresponds to the usual data vectors. Denote the space of all weighted vectors by Υ. We define some operations and an equivalence relation on Υ. Definition Suppose χ = (x, wχ , nχ ) then comp(χ) = ξ = (y, wξ , nξ ), where y = (y1 , · · · , yr ) is a non-decreasing vector of all disjoint elements of x, P wiξ = xj =yi wjχ and nξ = nχ . It is clear that comp (compress operator) is an operator from Υ to Υ. Then we define an equivalence relation on Υ. Definition χ ∼ ξ in Υ iff comp(χ) = comp(ξ). Clearly, ∼ is an equivalence relation. Let us define a transformation of a weighted vector. Definition Suppose χ = (x, wχ , nχ ) is a weighted vector and φ a transformation of R (not necessarily increasing). Then φ(χ) = ζ = (z, wζ = wχ , nζ = nχ ), where zi = φ(xi ), i = 1, 2, · · · , lx . For ordinary vectors x, y, comp(x) = comp(y) iff sort(x) = sort(y). Also comp leaves the last component of a weighted vector (the data amount) unchanged. Since x and wχ have the same length, we can show an element of Υ by pair consisting of a matrix of dimension 2 × lx and a number nχ :   x1 · · · xlx χ=( , nχ ) w1χ · · · wlχx  Given a weighted vector χ = (x, wχ , nχ ), we can naturally define a distribution function as follows. Definition Suppose χ = (x, wχ , nχ ) is a weighted vector. The the empirical distribution of χ is defined as X χ Fχ (a) = wi . i, xi ≤a  215  8.2. Generalization to weighted vectors Remark. If χ is an ordinary vector then Fχ is the usual empirical function. Then we extend the definition of the stack operator to weighted vectors. Definition Suppose χ = (x, wχ , nχ ) and ξ = (y, wξ , nξ ) are given then stack : Υ × Υ → Υ, (χ, ξ) 7→ ζ = (z, wζ , nχ + nξ ),  where (z, wζ ) in the matrix notation is given by x1 χ  w1χ nχn+nξ  ··· ···  xlx  y1 χ  wlχx nχn+nξ  ξ  w1ξ nχn+nξ  ··· ···  yly  ξ  wlξy nχn+nξ  !  .  Remark. In the definition, notice how the data amounts are used to adjust the weights. Remark. For ordinary vectors x, y the stack operator coincide to concatenating x and y. Lemma 8.2.1 (Stack operator properties) a) The stack operator preserves the equivalence relation defined above, i.e. χ1 ∼ ξ1 , χ2 ∼ ξ2 , then stack(χ1 , χ2 ) ∼ stack(ξ1 , ξ2 ) b) stack(χ1 , stack(χ2 , χ3 )) ∼ stack(stack(χ1 , χ2 ), χ3 ) Proof a) Suppose χi = (xi , wχi , nχi ), ξi = (y i , wξi , nξi ) and χi ∼ ξi for i = 1, 2. Let χ = comp(stack(χ1 , χ2 )), comp(stack(ξ1 , ξ2 )) = ξ. We need to show χ = ξ. Let χ = (x, wχ , nχ ) and ξ = (y, wξ , nξ ). From χi = ξi for i = 1, 2, we conclude nχi = nξi , i = 1, 2, which in turn gives nχ = nχ1 + nχ2 = nξ1 + nξ2 = nξ . Also x = y since both x and y are increasingly sorted and every element in x is an element of x1 or x2 which have the same elements as y 1 or y 2 . Now to show wiχ = wiξ , i = 1, 2, · · · , lx , suppose xi = yi be the corresponding element in x = y. 
Assume that the corresponding weight for xi in χ1 is w and in χ2 is w′ . Then the corresponding weight in ξ1 and ξ2 must be w and 216  8.2. Generalization to weighted vectors w′ respectively by the assumed equivalence relations. Hence wiχ and wiξ are equal to n χ2 n χ1 ′ w. χ1 + w . , n + n χ2 n χ1 + n χ2 and nξ2 nξ1 ′ w. ξ1 + w . , n + nξ2 nξ1 + nξ2 which are equal. b) Let χ = (x, wχ , nχ ) = comp[stack(χ1 , stack(χ2 , χ3 ))] and ′  ′  χ′ = (x′ , wχ , nχ ) = comp[stack(stack(χ1 , χ2 ), χ3 ))]. We show χ = χ′ . Firstly, note that ′  nχ = nχ1 + (nχ2 + nχ3 ) = (nχ1 + nχ2 ) + nχ3 = nχ . x = x′ is trivial. Fix xi = x′i in x = x′ . Suppose its corresponding weight in χj is equal to wj , j = 1, 2, 3. To show that the corresponding weights ′ wiχ and wiχ are equal, note that the corresponding weight of xi in χ is a combination of its weights in χ1 and stack(χ2 , χ3 ): 3  wiχ  = w1  n χ1  n χ1 n χ2 n χ3 n χ2 + n χ +[w +w ] 2 3 + (nχ2 + nχ3 ) n χ2 + n χ3 nχ2 + nχ3 nχ1 + (nχ2 + nχ3 )  and the corresponding weight of xi in χ′ is a combination of its weights in stack(χ1 , χ2 ) and χ3 : ′  wiχ = [w1  n χ1 n χ2 n χ1 + n χ2 n χ3 +w ] +w . 2 3 n χ1 + n χ2 nχ1 + nχ2 (nχ1 + nχ2 ) + nχ3 (nχ1 + nχ2 ) + nχ3  But the previous two expressions are equal and the proof is complete. This lemma implies that we can use the notation stack(χ1 , · · · , χm ).  Definition of quantiles and DOS for weighted vectors Now let us get to the definition of quantiles. We can proceed exactly in the same way as we did before by having in mind a bar of length one. Or alternatively, we can apply the quantile function definition for usual distributions to the empirical distribution of a weighted vector Fχ . This time, we proceed in a slightly different fashion which is equivalent to these 217  8.2. Generalization to weighted vectors methods. Suppose χ = (x, wχ , nχ ) is given and ζ = comp(χ) = (z, wζ , nχ ). We assume z has length lz . First, we define lqindχ : (0, 1] → {1, 2, · · · , lz }, and rqindχ : [0, 1) → {1, 2, · · · , lz }, the “left quantile index” and “right quantile index” functions and then define the left and right quantile functions using the index functions. If ζ = comp(χ) = (z, wζ , nx ) then we define lqχ (p) = zlqindχ (p) , p ∈ (0, 1], lqχ (p) = −∞, p = 0, and rqχ (p) = zrqindχ (p) , p ∈ [0, 1), rqχ (p) = ∞, p = 1. Let ζ = comp(χ). lqindχ and rqindχ are defined as follows: • p = 0 then lqindχ (p) not defined and rqindχ (p) = 1. • 0 < p < w1ζ then lqindχ (p) = rqindχ (p) = 1. • p = w1ζ then lqindχ (p) = 1 and rqindχ (p) = 2. .. . ζ • w1ζ + · · · + wi−1 < p < w1ζ + · · · + wiζ then lqindχ (p) = rqindχ (p) = i.  • p = w1ζ + · · · + wiζ then lqindχ (p) = i, rqindχ (p) = i + 1. .. . • p = 1 then lqindχ (p) = lz and rqindχ is not defined. Remark. It is easy to see that χ ∼ ξ then lqχ = lqξ , rqχ = rqξ . Remark. For ordinary vectors, this is equivalent to the definition given in the previous sections. 218  8.2. Generalization to weighted vectors Remark. Consider the natural distribution function Fχ corresponding to a weighted vector χ then lqχ = lqFχ and rqχ = rqFχ . Hence, lqχ , rqχ satisfy all the properties proved for left and right quantile functions of a distribution function. Definition We generalize the degree of separation (probability loss function) δχ on the set of weighted vectors as follows: δχ : R × R → R+ ∪ {0}, X δχ (z ′ , z) = δχ (z, z ′ ) = wjχ , z < z ′ , z<xj <z ′  and δχ (z, z) = 0. Lemma 8.2.2 (Properties of the probability loss function for weighted vectors) a) δχ = δFχ . 
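These case definitions translate directly into code. The sketch below is an illustrative R implementation: the list representation of a weighted vector (values and weights only, since the data amount plays no role in the quantiles) and the function names are assumptions made for the example.

comp <- function(values, weights) {                    # compress: merge repeated values
  o <- order(values)
  v <- values[o]; w <- weights[o]
  z <- unique(v)
  list(z = z, w = sapply(z, function(zi) sum(w[v == zi])))
}

lq.wv <- function(chi, p) {                            # left quantile of a weighted vector
  if (p == 0) return(-Inf)
  chi$z[which(cumsum(chi$w) >= p)[1]]                  # smallest i with w_1 + ... + w_i >= p
}

rq.wv <- function(chi, p) {                            # right quantile of a weighted vector
  if (p == 1) return(Inf)
  chi$z[which(cumsum(chi$w) > p)[1]]                   # smallest i with w_1 + ... + w_i > p
}

chi <- comp(c(1, 1, 2, 3), c(1/6, 1/6, 1/3, 1/3))      # Example 3 above (data amount omitted)
c(lq.wv(chi, 1/3), rq.wv(chi, 1/3))                    # 1 and 2: p = w_1 is a boundary point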
b) δχ only depends on comp(χ). c) δχ satisfies the pseudo–triangle property. Proof a) and b) are trivial and c) follows from a) and pseudo–triangle property for the probability loss functions for distributions.  8.2.1  Partition operator  This section introduces the partition operator to partition data into arbitrarily sized partitions. This allows us to address the two remaining properties for quantiles we pointed out in the introduction (in Lemma 8.2.5). The idea behind the definition of the partition operator can be explained as follows. Suppose a weighted vector χ = (x, wχ , nχ ) is given Pmand we want to partition it to smaller vectors with weights (p1 , · · · , pm ), i=1 pi = 1. Consider a bar of length 1 and then color it from left to right using colors corresponding to the xi with length wiχ . Then cut the bar from left to right using the given weights (p1 , · · · , pm ). Now each one of the small bars is the partitions we needed. More formally, we have the following definition: P Definition Suppose P = (p1 , p2 , · · · , pm ) is given, such that m i=1 pi = 1. χ χ Then a P-partition of a weighted data vector χ = (x, w , n ) is denoted 219  8.2. Generalization to weighted vectors by part(P, χ) = (χ1 , · · · , χm ) and is a collection of m weighted vectors 1 1 m m χ1 = (x1 , wχ , nχ = nχ .p1 ), · · · , χm = (xm , wχ , nχ = nχ .pm ) defined as follows: P P 1. x1 = (xs1 , · · · , xt1 ), s1 = 1, v1 = 1≤j≤t1 wj ≥ p1 , 1≤j<t1 wj < p1 P P 2. x2 = (xs2 , · · · , xt2 ), v2 = 1≤j≤t2 wj − p1 ≥ p2 , 1≤j<t2 wj − p1 < ( t1 + 1 v1 = p1 p2 , s 2 = t1 v1 > p1 .. . P P P k. xk = (xsk , · · · , xtk ), vk = 1≤j≤tk wj − k−1 j=1 pj ≥ pk , 1≤j<t2 wj − ( Pk−1 tk−1 + 1 vk−1 = pk−1 . j=1 pj < pk , sk = tk−1 vk−1 > pk−1 .. . The corresponding weight vectors and data amounts are defined as: 1 1. wχ = p11 (wsχ1 , wsχ2 , · · · , wtχ1 − (v1 − p1 )), .. . ( χ χ χ 1 vk−1 = pk−1 k pk (wsk , wsk +1 , · · · , wtk − (vk − pk )) χ k. w = 1 . χ χ vk−1 > pk−1 pk (vk−1 − pk−1 , wsk +1 , · · · , wtk − (vk − pk )) .. . Lemma 8.2.3 If χ = (x, wχ , nχ ) is an ordinary vector and lx = nχ = n1 + · · · + nm . Let P = ( nnχ1 , · · · , nnm χ ) then the P-partition of χ is simply obtained by starting from the left and partitioning x to vectors of length n1 , n2 , · · · , nm . Proof This is a straightforward conclusion of the definition. Lemma 8.2.4 Suppose χ = (x, wχ , nχ ) is partitioned by some P = (p1 , · · · , pm ) to χ1 , · · · , χm then stack(χ1 , · · · , χm ) ∼ χ. ′  ′  Proof Let χ′ = stack(χ1 , · · · , χm ) and suppose χ′ = (x′ , wχ , nχ ). Then clearly x′ and x have the same distinct elements. (Although it might be the  220  8.2. Generalization to weighted vectors case that x′ 6= x since some elements of x are repeated more than once in x.) Also m X ′ nχ = pi n χ = n χ . i=1  In order to show that for z an element of the vector x, its corresponding weight is equal in χ and χ′ , suppose z is equal to xi1 , · · · , xir in x with corresponding wiχ1 , · · · , wiχr . Then the weight corresponding to z in χ Pr weights is equal to k=1 wiχk . Now note that any of xik , k = 1, · · · , r, corresponds to one or two elements in stack(χ1 , · · · , χm ) by the definition of the partitions operator. It can be the case that xik only appears in χs or in χs , χs+1 if xik is at the end of the partition χs and at the beginning of the next. In the first case when xik only appears in χs , its weight in χs will be p1s wiχk and χ  χ χ s 1 hence its weight contribution in stack(χ1 , · · · , χm ) will be nn.p χ p wi = wi . 
s k k In the second case its weight in χs will be p1s (wiχk − (vs − ps )) and in χs+1 1 will be ps+1 (vs − ps ). Hence its weight contribution in stack(χ1 , · · · , χm ) χ  χ  1 coming from χs , χs+1 is nnχps p1s (wiχk − (vs − ps )) + n npχs+1 ps+1 (vs − ps ) = wiχk . Summing up all the weights in stack(χ1 , · · · , χm ), we get the same value of Pr χ k=1 wik .  Using the partition operator, we can easily define the cut operator as follows. Definition Let D = {(a, b)| a, b ∈ (0, 1), a < b}. Then cut : Υ × D → Υ is defined to be cut(χ, p1 , p2 ) = χ2 , where χ2 is the second component of part(P, comp(χ)) = (χ1 , χ2 , χ3 ), the result of applying a partition operator with weights P = (p1 , p2 − p1 , 1 − p2 ) to comp(χ). We also define left cut and right cuts, lcut, rcut : (0, 1) → R, lcut(χ, p) = χ1 , rcut(χ, 1 − p) = χ2 ,  where χ1 and χ2 are the first and second component of the partition of χ by P = (p, 1 − p). Lemma 8.2.5 Suppose χ = (x, wχ , nχ ) is a weighted vector and (p1 , p2 ) in D. Then 221  8.2. Generalization to weighted vectors a) The amount of data in cut(χ, p1 , p2 ) is nχ (p2 − p1 ). b) cut(χ, p1 , p2 ) starts with rqχ (p1 ) and ends with lqχ (p2 ). c) The vector of lcut(χ, p) ends with lqχ (p). d) The vector of rcut(χ, p) starts with rqχ (1 − p). e) Suppose p1 , p2 ∈ (0, 1) then lcut(lcut(χ, p1 ), p2 ) = lcut(χ, p1 p2 ). f ) Suppose p1 , p2 ∈ (0, 1) then rcut(rcut(χ, p1 ), p2 ) = rcut(χ, p1 p2 ). Proof a) is trivial. To prove b), consider the definition of the partition operator as given in Definition 8.2.1 for arbitrary P = (p′1 , · · · , p′m ). For the first partition, xs1 = x1 = lqχ (p′1 ) and for xt1 , we have X X wj ≥ p′1 , and wj < p′1 , 1≤j≤t1  1≤j<t1  which concludes lqχ (p′1 ) = xt1 . For the k-th partition, ( tk−1 + 1 vk−1 = p′k−1 . sk = tk−1 vk−1 > p′k−1 If vk−1 = p′k−1 , then  P  P ′ p′i . Hence rqχ ( k−1 i=1 pi ) = Pk P ′ 1≤j<tk wj < 1≤j≤tk wj ≤ i=1 pi and  wj = P  1≤j≤tk−1  Pk−1 i=1  xt +1 = xsk . For tk , we have Pk−1 Pk k ′ ′ i=1 pk . Hence lqχ ( i=1 pk ) = xtk . To finish the proof, let m = 3 and p′1 = p1 , p′2 = p2 − p1 , p′3 = 1 − p2 and note that cut(χ, p1 , p2 ) corresponds to the second component of the partition operator of P = (p′1 , p′2 , p′3 ) on χ. The proof of c) is similar to b). d) can be either done by a similar direct proof or by using the Quantile Symmetry Theorem. To prove e) let χ1 = lcut(χ, p1 ) = ((x1 , · · · , xt1 ), (w11 , · · · , wt11 ), nχ .p1 ) χ1,2 = lcut(lcut(χ, p1 ), p2 ) = ((x1 , · · · , xt1,2 ), (w11,2 , · · · , wt1,2 ), nχ .p1 .p2 ) 1 χ12 = lcut(χ, p1 p2 ) = ((x1 , · · · , xt12 ), (w112 , · · · , wt12 ), nχ .p1 .p2 ) 1  We want to show χ1,2 = χ12 . It is clear that their data amount is equal. By applying the definition of lcut to the above three equations, we conclude the following: X X wj < p1 , wj ≥ p1 , (8.1) 1≤j<t  X  1≤j<t1,2  1≤j≤t  wj1 < p2 ,  X  1≤j≤t1,2  wj1 ≥ p2 ,  (8.2)  222  8.2. Generalization to weighted vectors  X  1≤j<t12  If j < t1 then wj1 = 1 p1  1 p1 wj .  X  wj < p1 p2 ,  1≤j≤t12  wj ≥ p1 p2 .  (8.3)  Hence from the first equation in 8.2, we conclude  X  1≤j<t1,2  wj < p2 ⇒  X  wj < p1 p2 .  1≤j<t1,2  Now consider two cases: Case I: t1,2 < t1 . In this case, similarly, from the second equation in 8.2, we conclude X 1 X wj ≥ p2 ⇒ wj ≥ p1 p2 . p1 1≤j≤t1,2  1≤j≤t1,2  Case II: t1,2 = t1 . In this case note that for j < t1,2 = t1 , we still have wj1 = p11 wj and for j = t1,2 = t, we have wj1 ≤ p11 wj . But X  1≤j≤t1,2 =t  wj1 = 1 ⇒  X  1≤j≤t1,2 =t  wj1 ≥ p1 ≥ p1 p2 .  
P P In both cases, we showed that 1≤j≤t1,2 wj ≥ p1 p2 and 1≤j≤t1,2 =t1 wj1 ≥ p1 ≥ p1 p2 . We conclude that t1,2 = t12 . In order to show that the weight vectors of χ1,2 and χ12 are the same, note that they have the same length. We only need to show that they match on all the components except for the last one because the equality of the last one will follow. But if j < t1,2 = t12 then wj1,2 = p12 ( p11 wj ) and wj12 = p11p2 wj . f) can be done either by a similar argument as e) or using the Quantile Symmetry Theorem.  Remark. Part a) and e) address the two remaining properties we were seeking in the introduction.  8.2.2  Quantile data summaries  Here, we formally define quantile data summaries. They arise when a large data vector is summarized by a smaller vector and possibly some other information about the original vector and how the summary is been created. A large vector might have been partitioned into smaller vectors and the smaller vectors might have been summarized. First we define a probability index vector which is needed to define quantile data summaries. 223  8.2. Generalization to weighted vectors Definition A vector P = (p1 , · · · , pk ) is called a probability index vector if 0 ≤ p1 < · · · < pk ≤ 1. Definition Suppose χ = (x, wχ , nχ ), a weighted vector and a probability index vector P = (p1 , · · · , pm ) is given such that 0 ≤ p1 < p2 < · · · < pm ≤ 1. Then a P-quantile summary of χ is defined to be qs(P, χ) = (lqχ (p1 ), · · · , lqχ (pm )). Definition A summary triple is defined to be a triple (qs(P, χ), P, nχ ), where qs is the summarized vector as defined above, P is the summary probability index vector and nχ is the data amount of the original vector. We also define an ǫ-summary for ǫ < 1/2. Definition Let h = [1/ǫ]. Then the ǫ-summary for χ is defined to be the triple (qs(ǫ, χ), ǫ, nχ ): qs(ǫ, χ) = (lqχ (ǫ), lqχ (2ǫ), · · · , lqχ ((h − 1)ǫ)). Note that [0, ǫ), [ǫ, 2ǫ), · · · , [(h − 1)ǫ, 1] is a partition of [0,1] to intervals of the same length ǫ other than the last one, which can be greater than ǫ. However it is less than 2ǫ. If ǫ = 1/s for a natural number s, then the 1/s summary is going to be qs(1/s, χ) = (lqχ (1/s), lqχ (2/s), · · · , lqχ ((s − 1)/s)). Remark. For an ordinary vector x = (x1 , · · · , xn ), suppose n = n1 n2 . Then we defined the n2 –coarsening operator to be Cn2 (x) = (lqx (p1 ), · · · , lqx (pn1 −1 ), where pi = i/n, i = 1, · · · , n1 − 1. This is the same as qs(ǫ, x), for ǫ = 1/n1 . Hence the coarsening operator is a special case of creating an ǫ-summary. We also define summary lists. Definition Suppose χ = stack(χ1 , · · · , χm ) and m probability index vectors P1 , · · · , Pm are given. Then let ξ i = qs(χi , Pi ). Then the list 224  8.3. Optimal probability indices for vector data summaries  1  ξ1 P1 nχ  .. ..  ξ =  ... . .  m χ ξm Pm n    is called a quantile summary list of χ. Note that ξ is not a matrix in general since the length of the summary indices might differ. Quantile summary vectors or quantile summary lists are to be used to infer the original vector χ. They can be used as “inputs” to procedures for approximating lqχ . The formal definition of a data summary procedure is defined below. Definition Suppose χ is a weighted vector and input is a quantile summary list. Then a quantile summary procedure is defined to be a left quantile function: proc(input, χ) : [0, 1] → R. “proc” tries to approximate the quantiles of the original vector χ using the input. It is desirable to find procedures that have good accuracy. 
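As a concrete illustration of the ǫ-summary defined above, the following R sketch summarizes an ordinary data vector and stores the result as a summary triple (summarized vector, ǫ, data amount); the list layout and the simulated input are assumptions made only for the example.

lq <- function(x, p) sort(x)[ceiling(length(x) * p)]   # left quantile of a data vector

eps.summary <- function(x, eps) {
  h <- floor(1 / eps)
  probs <- eps * seq_len(h - 1)                        # eps, 2*eps, ..., (h-1)*eps
  list(qs = sapply(probs, lq, x = x), eps = eps, n = length(x))
}

set.seed(4)
x <- rnorm(10^5)
s <- eps.summary(x, 0.01)                              # stores h - 1 = 99 quantiles
c(length(s$qs), s$n)                                   # 99 values summarize 100000 points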
Example The d-coarsening algorithm can be viewed as an example of the above framework. There the vector χ is simply an ordinary vector of length n which is a concatenation of x1 , · · · , xl . The summary list consists of dcoarsening of partitions x1 , · · · , xl . In other words x1 , · · · , xm which are of length li = ci d are summarized by Pi = (1/ci , · · · , (ci − 1)/ci , i = 1, · · · , m) to w1 , · · · , wm . Finally the “proc” is simply the left quantile function of the concatenation of w1 , · · · , wm . The accuracy in terms of the probability loss Pm m+1 was bounded by ǫ = C−m , C = i=1 ci . In other words sup δx (proc(input, x)(p), lqx (p)) ≤ ǫ.  p∈(0,1)  8.3  Optimal probability indices for vector data summaries  Suppose a data vector x or a distribution X is given. The data vector x might be too long to carry around or save in the memory. Similarly the distribution X might be too complicated or unknown. To make inferences about a data vector x or the distribution of X, we might use a summary or 225  8.3. Optimal probability indices for vector data summaries some other procedure. For example, we might save a vector data summary instead of the vector x of length n, where n is very large: qs(P, x) = (lqx (p1 ), · · · , lqx (pm )), p1 < · · · < pm . The following question motivates our ensuing development: Question: How should P = {p1 , · · · , pm } be chosen to provide good approximation/prediction to x (or X)? A natural way to approximate x or X is to estimate all the quantiles. (This is equivalent to approximating or estimating the whole data vector x or the distribution function of X.) We are given an input. In the case of a data vector it is usually a quantile data summary and in the case of the random variable X it might be a random sample. Then a “procedure” can be employed to approximate/estimate the quantiles of x or X. For any given p the left quantile lqx (p) or lqX (p) is approximated/estimated by the procedure using the input. We denote this value by proc(input, x)(p) or proc(input, X)(p). Then a loss L can be used to assess the goodness of such a procedure: L(proc(input, x)(p), lqx (p)). To assess the overall goodness of such a procedure, we can use either the sup loss or the integral loss: sup L(proc(input, x)(p), lqx (p)), p∈[0,1]  or  Z  L(proc(input, x)(p), lqx (p))dp.  p∈[0,1]  For simplicity, we restrict to data vectors from here. We use the probability loss δx as the most natural choice. We want to minimize this loss n order to find optimal ways to summarize data (create input) and find optimal procedures. Definition We define the crudity of the procedure proc at p given the input to be crud(proc(input, x)(p)) = δx (proc(input, x)(p), lqx (p)). Also the “sup crudity” and “integral crudity” are respectively given by  226  8.3. Optimal probability indices for vector data summaries  SC(proc(input, x)) = sup δx (proc(input, x)(p), lqx (p)), p∈[0,1]  and IC(proc(input, x)) =  Z  δx (proc(input, x)(p), lqx (p))dp. p∈[0,1]  Using the above framework, we look for good procedures to summarize data vectors and later distribution functions. A quantile data summary was defined to be qs(P, x) = (lqx (p1 ), · · · , lqx (pm )), p1 < · · · < pm , for a probability index vector P = (p1 , · · · , pm ). There is a natural procedure associated with this input that is a quantile data summary, which we define below. Definition Suppose x is a data vector which has been summarized by P = (p1 , · · · , pm ). 
Then we define the shortest distance quantile procedure of x associated with P to be proc(input, x)(p) = lqx (pi ), i = argmin{|p − pj |, j = 1, · · · , m}. j  If there were more than one minimum above, take the smaller value. We denote this procedure by shproc(x, P). The shortest distance procedure be specified by the notation “7→” as shown below: 1 1. 0 ≤ p ≤ p1 + p2 −p 7→ lqx (p1 ). 2 p2 −p1 2 7→ lqx (p2 ). 2. p1 + 2 < p ≤ p2 + p3 −p 2 .. . m. pm−1 + pm −p2 m−1 < p ≤ 1 7→ lqx (pm ). The largest loss in the first part of the procedure is the maximum of the two values, p2 − p1 δx (lqx (0), lqx (p1 )), δx (lqx (p1 ), lqx (p1 + )). (8.4) 2 For the second part, it is the maximum of δx (lqx (p1 +  p2 − p1 p3 − p2 ), lqx (p2 )), δx (lqx (p2 ), lqx (p2 + )). 2 2  (8.5)  227  8.3. Optimal probability indices for vector data summaries For the m-th part it is the maximum of δx (lqx (pm−1 +  pm − pm−1 ), lqx (pm )), δx (lqx (pm ), lqx (1)). 2  (8.6)  We use quantile data summaries to save space and memory for operations on very large datasets. Hence, we have a limitation on m. The interesting question is what is an optimal index set P of length m to summarize data vectors? In the beginning, we usually do not have any information about x so the P should be chosen in way that works well for all possible data vectors. Hence, we settle for either argmin sup SC(shproc(input, x)(p), lqx (p)) = x  P  argmin sup sup δx (shproc(input, x)(p), lqx (p)), x p∈[0,1]  P  or  argmin sup IC(shproc(input, x)(p), lqx (p)) = x  P  argmin sup P  x  Z  1  δx (shproc(input, x)(p), lqx (p))dp.  0  We sort out the sup crudity case first. By Lemma 6.6.3, taking the sup of the max over all x in Equations 8.4, 8.5 and 8.6, we get the maximum of the following quantities: 1 1. p1 , p2 −p 2 . p3 −p2 1 2. p2 −p 2 , 2 . p3 −p2 p4 −p3 3. 2 , 2 . .. .  m. pm −p2 m−1 , 1 − pm . Hence, sup sup δx (shproc(input, x)(p), lqx (p)) = x p∈[0,1]  max {p1 ,  p∈[0,1]  p2 − p1 p2 − p1 p3 − p2 p3 − p2 p4 − p3 pm − pm−1 , , , , ,··· , , 1 − pm }. 2 2 2 2 2 2  After omitting the repetitions, we need to minimize: max{p1 ,  p2 − p1 p3 − p2 p4 − p3 pm − pm−1 , , ,··· , , 1 − pm }, 2 2 2 2 228  8.3. Optimal probability indices for vector data summaries over all p1 < p2 < · · · < pm ∈ [0, 1]. We claim that p1 =  1 1 , p2 − p1 = 1/m, p3 − p2 = 1/m, · · · , pm−1 = 1/m, pm = 1 − , 2m 2m  is the solution. Note that in this case the max is equal to 1/2m. We show that we cannot do better. Let α1 = p1 , p2 − p1 α2 = , 2 p3 − p2 α3 = , 2 .. . pm − pm−1 αm = , 2 αm+1 = pm . We have α1 +2α2 +· · ·+2αm +αm+1 = 1. The αi are non-negative, there are 1 + 2(m − 2) + 1 of them (counting the ones with multiple 2 two times) and 1 they sum up to 1. If all of them are less than 2m the sum will be less than 1. Hence we conclude the maximum is obtained when they are all equal to 1/2m. Now let us do the integral crudity case. We claim the solution is the same. We compute the integral in the following, using 6.6.4 in the second equality: Z 1 sup δx (lqx (p), shproc(input, x)(p))dp = x  0  Z sup[ x  p1 +  p2 −p1 2  δx (lqx (p1 ), lqx (p))dp +  0  Z  p2 +  p1 +  +··· +  Z  p3 −p2 2  p2 −p1 2  δx (lqx (p2 ), lqx (p))dp  1  pm−1 +  pm −pm−1 2  δx (lqx (pm ), lqx (p))dp]  229  8.3. 
Optimal probability indices for vector data summaries  = = + = +  Z Z Z Z Z  p1 +  p2 −p1 2  0  |p − p1 |dp +  p1  (p1 − p)dp +  0  p2 +  p3 −p2 2  p2 p1  pdp + 0 p3 −p2 2  0  Z  p1 +  Z  p2 + p1 +  p−p1 2  p1  p2 −p1 2  p2 −p1 2  pdp +  0  Z  |p − p2 |dp + · · · +  (p − p1 )dp +  (p − p2 )dp + · · · + Z  p3 −p2 2  Z  Z  Z  1 pm−1 +  p2 −p1 2  |p − pm |dp  p2  p1 +  p2 −p1 2  (p2 − p)dp  pm pm−1 +  pm −pm−1 2  pm −pm−1 2  (pm − p)dp +  Z  1 pm  (p − pm )dp  pdp  0  pdp + · · · +  Z  pm −pm−1 2  pdp +  0  Z  1−pm  pdp 0  = (1/2)α21 + α2 + · · · + α2m + (1/2)α2m+1 = (1/2)(α21 + 2α2 + · · · + 2α2m + α2m+1 ), pm −pm−1 1 , αm+1 = 1 − pm . where α1 = p1 , α2 = p2 −p 2 , · · · , αm = 2 We have the restriction α1 + 2α2 + · · · + 2αm + αm+1 − 1 = 0 and αi ≥ 0. In order to minimize  α21 + 2α22 + · · · + 2α2m + α2m+1 , we use Lagrange Multiplier’s Method. Let f (x1 , · · · , xm+1 ) = x21 +2x22 · · ·+2x2m +x2m+1 −λ(x1 +2x2 +· · ·+2xm +xm+1 −1). Taking the partial derivatives and putting them equal to zero, we get: ∂g ∂x1 = 2x1 − λ = 0, ∂g ∂x2 = 4x2 − 2λ = 0,  .. .  ∂g ∂xm = 4xm − 2λ = 0, ∂g ∂xm+1 = 2xm+1 − λ = 0.  By summing up the equations we get: 2(x1 + 2x2 + · · · + 2xm + xm+1 ) − 2λ(m − 1) − 2λ = 2 − 2λ(m − 1) − 2λ = 0. 1 Hence λ = m . This gives xi = 1 · · · = pm − pm−1 = m .  1 2m .  Hence p1 = pm =  1 2m  and p2 − p1 = 230  8.4. Other loss functions  8.4  Other loss functions  It is well-known that argminEX (X − a)2 , a  is the mean, when it exists. This fact is used in classical statistics for estimation of parameters and regression. It is also widely claimed that argminEX |X − a|, a  is “the median”. In particular for data vectors x = (x1 , · · · , xn ), this will take the form n 1X argmin |xi − a|. n a i=1  It is not clear what is meant by “the median”? For data vectors does it mean that the classic median (the middle value when there is odd number of elements and the average of the two middle values otherwise) is the unique solution? In general, is the answer unique? What is the connection of the solution to the left and right quantiles? We provide answers to some of these questions in the following theorem. Theorem 8.4.1 Suppose X is a random variable and E|X − a| is finite for some a ∈ R then 1 1 argminE|X − a| = [lqX ( ), rqX ( )]. 2 2 a  Proof  E|X − a| =  Z  R  |X − a|dP =  Z  X>a  (X − a)dP +  Z  X<a  (a − X)dP.  We prove the theorem in three steps: 1. If a < lqX (1/2) then E|X − a| > E|X − lqX (1/2)|. 2. If a > rqX (1/2) then E|X − a| > E|X − rqX (1/2)|. 3. If lqX (1/2) ≤ a, b ≤ rqX (1/2) then E|X − a| = E|X − b|. Step 1. Let b = lqX (1/2) and ǫ = b − a > 0. Then 231  8.4. Other loss functions  E|X − b| = = ≤ =  Z  Z  (X − b)dP + (b − X)dP X<b Z Z (X − a − ǫ)dP + (a + ǫ − X)dP Z X≥b Z X<b |X − a| − ǫdP + (|X − a| + ǫ)dP X≥b  X≥b  X<b  E|X − a| − ǫ(P (X ≥ b) − P (X < b)).  But P (X ≥ b) − P (X < b) is non-negative since P (X < lqX (1/2)) ≤ 1/2. Hence E|X − b| ≤ E|X − a|. To show that the equality cannot happen take a < a′ < b and let ǫ′ = a′ − a then Z Z E|X − a′ | = (X − a′ )dP + (a′ − X)dP ′ ′ X≥a X<a Z Z = (X − a − ǫ′ )dP + (a + ǫ′ − X)dP X≥a′ X<a′ Z Z ≤ |X − a| − ǫ′ dP + (|X − a| + ǫ′ )dP X≥a′  =  X<a  ′  E|X − a| − ǫ (P (X ≥ a′ ) − P (X < a′ )).  But P (X ≥ a′ ) − P (X < a′ ) is positive since P (X < a′ ) < 1/2 and a′ < lqX (p) ⇒ P (X < a′ ) < 1/2. Hence E|X − a′ | < E|X − a|. But also since a′ < b, we have E|X − b| ≤ E|X − a′ | < E|X − a|. Step 2. 
For a > rqX (1/2) = c one can either repeat a similar argument to that in Step 1 or use the Quantile Symmetry Theorem as we do here. Consider the random variable −X. Then a > rqX (1/2) ⇒ −a < −rqX (1/2) = lq−X (1/2) Now since −a < −c = lq−X (1/2) by applying Step 1 to −X, we get E| − X − (−c)| < E| − X − (−a)| ⇒ E|X − a| < E|X − c|. Step 3. If lqX (1/2) = rqX (1/2) the result is trivial. Otherwise let b = lqX (1/2) < rqX (1/2) = c and a < a′ ∈ [b, c]. By Lemma 5.3.1 if lqX (p) < rqX (p). So P (X ≤ lqX (p)) = p and P (X ≥ rqX (p)) = 1 − p. Hence P (X ≤ b) = P (X ≥ c) = 1/2. Let ǫ = a′ − a. Then 232  8.4. Other loss functions  Z  E|X − a| =  Z  =  Z  =  b<X<c  X≥c  X≥c  |X − a|dP +  Z  X≥c  (X − a − ǫ)dP + Z (X − a′ )dP +  Z  (X − a)dP +  X≤b  X≤b  Z  X≤b  +ǫ/2 − ǫ/2  (a − X + ǫ)dP  (a′ − X)dP  = E|X − a′ |.  Corollary 8.4.2 Suppose FX is continuous and ∃a ∈ R, E|X − a| < ∞. Then argminE|X − a| = {a|F (a) = 1/2}. a  Proof Note that if F is continuous F (a) = p ⇔ a ∈ [lqX (p), rqX (p)], by Lemma 5.5.2. Now let us find argminE(δF (X, a)). a  We solve the problem for continuous variables only here and leave the general case as an interesting open problem. Our conjecture is that the same result holds in general. Lemma 8.4.3 Suppose X be a random variable with continuous distribution function F . Then argminE(δF (X, a)) = [lqX (1/2), rqX (1/2)]. a  Proof If F is continuous then F (X) ∼ U (0, 1). Also δF (X, a) = |F (X) − F (a)|. Z argminE(δF (X, a)) = argmin |F (X) − F (a)|dP. a  a  Ω  The last expression is minimized if F (a) equals the median of the uniform. We conclude F (a) = 1/2 and the proof is complete.  233  8.4. Other loss functions  8.4.1  Optimal index vectors for assigning quantiles to a random sample  Given a sample X1 , · · · , Xn , i.i.d ∼ X, we can find the sample order statistics X(i) , i = 1, · · · , n. Suppose we want to assign these order statistics to quantiles, lqX (pi ), i = 1, · · · , n, of the true distribution of X. In other words, what is the optimal index vector P = (p1 , · · · , pn ) to assign lqX (pi ) to X(i) . This can be used to make a qq–plot. We define the optimal vector to be the index vector that minimizes the expected probability loss n  E[  1X δX (X(i) , lqX (pi ))]. n i=1  We only solve the problem for continuous variables and leave the general case as an open problem. Under the continuity assumption, we have E[  n  n  i=1  i=1  1X 1X δX (X(i) , lqX (pi ))] = E(|FX (X(i) ) − pi |), n n  which is minimized if and only if the individual terms E(|FX (X(i) ) − pi |) are minimized. Since FX is a continuous random variable, FX (X(i) ) is also continuous. Hence the minimum is obtained by solving P (FX (X(i) ) ≤ x) = 1/2 by the corollary of Theorem 8.4.1. By Lemma 5.5.1, this is equivalent to P (X(i) ≤ rqF (x)) = 1/2. The distribution of the order statistics, X(i) is given by n   X n P (X(i) ≤ y) = F (y)j (1 − F (y))n−j , j j=i  as discussed by Casella and Berger in [11]. Hence, the minimum is obtained by solving n   X n j=i  j  j  n−j  F (rqX (x)) (1 − F (rqX (x)))  =  n   X n j=i  j  xj (1 − x)n−j = 1/2,  which does not have a closed form solution in general. Also note that the solution does not on F . However, the solution always exists and Pndepend n j is unique since j=i j x (1 − x)n−j is increasing, continuous on (0,1) and ranges between 0 and 1. We also prove that the resulting index vector is symmetric in the sense that pn−i+1 = 1 − pi , i = 1, 2, · · · , n. For the proof, consider the random sample (Y1 , · · · , Yn ) = (−X1 , · · · , −Xn ). 
Then the sorted vector is (Y(1) , · · · , Y(n) ) = (−X(n) , · · · , −X(1) ). Hence Y(i) = 234  8.4. Other loss functions −X(n−i+1) . Suppose p1 , · · · , pn is an optimal summary index vector. Then pi is the solution of the first equation below argminE|FY (Y(i) ) − a| = argminE|1 − FX (Y(i) ) − a| = a  a  argminE|FX (X(n−i+1) ) − (1 − a)|. a  But if we let b = 1− a the solution to the last equation is b = 1− a = pn−i+1 . We conclude that pi = 1 − pn−i+1 . As examples, we solve the equation for n = 1, 2, where closed form solutions exist. n = 1. Then X(1) = X1 . It is easy to see that the solution is p = 1/2. n = 2. Then we want to solve two equations 2   X 2  xj (1 − x)2−j = 1/2,  2   X 2  xj (1 − x)2−j = 1/2,  j=1  and  j=2  j  j  which are equivalent to 2x(1 − x) + x2 = 1/2, and x2 = 1/2, We get p1 = √12 and p2 = 1 − √12 . Note that in general for n, the last equation is xn = 1/2. Hence pn = √ √ n n 1/ 2 and p1 = 1 − 1/ 2.  235  Chapter 9  Quantile distribution distance and estimation 9.1  Introduction  This chapter uses the probability loss function as a basis for estimating unknown parameters of a distribution and defining a distance among distribution functions. The “probability loss” and “c-probability loss” functions were introduced to measure the distance between quantiles. This is not the same as any other specific loss functions that have been proposed in statistical decision theory [30], where the loss function, L, is the loss of the statistician in estimating the true parameter vector θ = (θ1 , · · · , θk ) ∈ Θ, by an estimator θ̂(X1 , · · · , Xn ) which is a function of the data (a random sample X1 , · · · , Xn drawn from the distribution parameterized by θ). The estimator is then chosen in such a way that L(θ̂(X1 , · · · , Xn ), θ) becomes small in some sense. However, it is not possible to use the probability loss function in the same manner for parameter estimation. We defined δX (z ′ , z) = δX (z, z ′ ) = P (z ′ < X < z), z ′ ≤ z, z, z ′ ∈ R. Now it is clear that δX (θ, a) cannot even be evaluated since θ is a kdimensional vector and k is possibly greater than 1. This chapter presents two methods to estimate the parameters of distributions. More theoretical and applied development is necessary to justify such estimation procedures which we leave for future research. The first method derives from considering families of distributions that are identified by their values on certain quantiles and the second method from defining a distance among distributions and then trying to minimize that distance. These methods are designed to give estimates that are equivariant under continuous strictly monotonic transformations. The distances associated with probability measures in this section are based on the distances between the quantiles using the probability loss function and they are invariant under monotonic transformations. This property does not hold in classical methods. For example the sample mean x̄, an estimator of the location parameter 236  9.2. Quantile–specified parameter families for normal distribution is equivariant under linear transformations but not all continuous strictly monotonic transformations. Quantile distance allows us to measure closeness of distributions to each other. We also define a quantile distance for the tails of the distributions. We show that even though two distributions are very close in terms of “overall quantile distance”, they might not be very close in terms of “tail quantile distance”. 
This shows that to study extremes (for example extremely hot temperature) if we use a good overall fit, our results might not be reliable. We use this observation in the next chapter in choosing our method of studying extreme temperature events.  9.2  Quantile–specified parameter families  This section considers families of distributions that are identified by their values on certain quantiles. In this case the parameters in the vector θ = (θ1 , · · · , θk ) are certain quantiles. Then we use the “probability loss” or the “c-probability loss” to characterize the loss and thus yield optimal parameter estimators. Definition A family of random variables {Xθ }θ=(θ1 ,··· ,θk )∈Θ , and a probability index vector P = (p1 , · · · , pk ), 0 ≤ p1 < p2 < · · · < pk ≤ 1 are called a left–quantile–specified family if (θ1 , · · · , θk ) = (lqXθ (p1 ), · · · , lqXθ (pk )),  and the distribution of Xθ is know given θ. Note that this implies that θ ∈ Θ then θ1 ≤ θ2 ≤ · · · θk . We can similarly define: Definition A family of random variables {Xθ }θ=(θ1 ,··· ,θk )∈Θ , and a probability index vector P = (p1 , · · · , pk ), 0 ≤ p1 < p2 < · · · < pk ≤ 1 are called a right–quantile–specified family (θ1 , · · · , θk ) = (rqXθ (p1 ), · · · , rqXθ (pk )),  and the distribution of Xθ is know given θ. Note that this implies that θ ∈ Θ then θ1 ≤ θ2 ≤ · · · ≤ θk . Example Consider the family {U (0, 2a)}a∈R+ , of uniformly distributed random variables on (0, 2a), a > 0. Then, we can express this family as the quantile–specified family {Xθ }θ∈R+ with P = (1/2). The reason is if Xθ ∼ U (0, 2a) then θ = lqXθ (1/2) = a. 237  9.2. Quantile–specified parameter families Example Consider the family N = {N (µ, σ 2 )| − ∞ < µ < +∞, σ 2 > 0}. Then we claim this is a quantile–specified family. To verify that claim let P = (1/2, p2 ) where p2 = P (Z ≤ 1) and Z has the standard normal distribution. Let µ = lqX (1/2) = θ1 , and µ + σ 2 = lqX (p2 ) = θ2 . Then we can equivalently represent N by {Xθ }θ=(θ1 ,θ2 )∈Θ , where Θ = {(θ1 , θ2 )|θ1 < θ2 }. Because (µ, σ 2 ) is in 1:1 correspondence with θ = (θ1 , θ2 ) as defined above, where P (X ≤ µ + σ 2 ) = P (Z ≤ 1) = p2 .  Note that this representation is not unique. For example, we can take P = (1/2, p2 ) with p2 = P (Z ≤ 2). Then the alternate re-parametrization in terms of variables is µ = lqX (1/2) = θ1 , and µ + 2σ 2 = lqX (p2 ) = θ2 . It should be clear that if the goal is to infer the parameters of the original family, i.e. a in U (0, 2a) and (µ, σ 2 ) then it is desirable that the θi are simple functions of the original parameters and the original parameters be easily obtainable from the θi . Linear combinations seem to be the easiest to handle. We suggest the following framework to estimate the parameters: • Express the original parameterized family Xβ as a quantile specified family Xθ with P = (p1 , · · · , pk ). • Use argminDi ∈F E[L(θi , Di (input)], i = 1, · · · , k  where input is the information available to us, usually a random sample, (X1 , · · · , Xn ),  Di is an estimator of θi = lqX (pi ) (a function of the random sample), L is a loss function and F is the class of the estimators. The loss c , c > 0. functions of our interest are L = δXθ and L = δX θ 238  9.2. Quantile–specified parameter families • Using the estimated parameters solve for the original parameters, the βi . c , c > 0 depends on the unknown distribution function X . 
Note that δX θ θ Many issues in the above framework need to be addressed including: the existence and uniqueness of the argmin, properties of the estimators and so on which we leave for future research. In next subsections we show the Equivariance property of the method and apply it to a particular class of estimators using simulations.  9.2.1  Equivariance of quantile–specified families estimation  Here, we show the equivariance property of estimation using quantile–specified families in the following lemmas. Lemma 9.2.1 Suppose {Xθ }θ∈Θ is left–quantile–specified with P = (p1 , · · · , pk ), and φ is a continuous strictly increasing transformation which induces a map on Rk : Φ : Rk → Rk , (θ1 , · · · , θk ) 7→ (φ(θ1 ), · · · , φ(θk )).  Let Θ′ = Φ(Θ), θ ′ = Φ(θ) for θ ∈ Θ and consider the family of distributions Yθ′ = φ(Xθ ). Then {Yθ′ }θ′ ∈Θ′ is also a left–quantile–specified family with the same index vector P = (p1 , · · · , pk ). Proof Suppose the distribution of Xθ is specified by Fθ . Then P (Yθ′ ≤ a) = P (φ(Xθ ) ≤ a)  = Fθ (φ−1 (a)) = FΦ−1 (θ′ ) (φ−1 (a)). Hence the distribution of Yθ′ is known given θ ′ . It remains to show that for θ ′ ∈ Θ′ , (θ1′ , · · · , θk′ ) = (lqYθ′ (p1 ), · · · , lqYθ′ (pk )).  But  (lqYθ′ (p1 ), · · · , lqYθ′ (pk )) =  (lqφ(Xθ ) (p1 ), · · · , lqφ(Xθ ) (pk )) =  (φ(lqXθ (p1 )), · · · , φ(lqXθ (pk ))) = (φ(θ1 ), · · · , φ(θk )) = (θ1′ , · · · , θk′ )  . 239  9.2. Quantile–specified parameter families  Lemma 9.2.2 Suppose {Xθ }θ∈Θ is left–quantile–specified with P = (p1 , · · · , pk ), and φ is a continuous strictly decreasing transformation which induces a map on Rk : Φ : Rk → Rk , (θ1 , · · · , θk ) 7→ (φ(θk ), · · · , φ(θ1 )).  Let Θ′ = Φ(Θ), θ ′ = Φ(θ) for θ ∈ Θ and consider the family of distributions Yθ′ = φ(Xθ ). Then {Yθ′ }θ′ ∈Θ′ is a right–quantile–specified family with the index vector P = (1 − pk , · · · , 1 − p1 ). Proof Suppose the distribution of Xθ is specified by Fθ . Then since Fθ the left closed distribution of Xθ is known, the right closed distribution of Xθ , GcX (Xθ ) is also known. Then P (Yθ′ ≤ a) = P (φ(Xθ ) ≤ a) = P (Xθ ≥ φ−1 (a)) = Gcθ (φ−1 (a)) = GcΦ−1 (θ′ ) (φ−1 (a)), where Gcθ is the right closed distribution function. Hence the distribution of Yθ′ is known given θ ′ . It remains to show that for θ ′ ∈ Θ′ , (θ1′ , · · · , θk′ ) = (rqYθ′ (1 − pk ), · · · , rqYθ′ (1 − p1 )). But (rqYθ′ (1 − pk ), · · · , rqYθ′ (1 − p1 )) =  (rqφ(Xθ ) (1 − pk ), · · · , rqφ(Xθ ) (1 − p1 )) =  (φ(lqXθ (pk )), · · · , φ(lqXθ (p1 ))) = (φ(θk ), · · · , φ(θ1 )) = (θ1′ , · · · , θk′ ).  For a parameter θ, we want to find argminD∈F E(δX (lqX (p), D)) 240  9.2. Quantile–specified parameter families where F is a family of estimators for θ and D ∈ F is a function D : Rn → R, where n is the size of the sample and D(X1 , · · · , Xn ) is the estimator of θ = lqX (p). Lemma 9.2.3 Suppose a random sample X1 , · · · , Xn is given, Xθ is a left– quantile–specified family with θ = lqX (p), φ a strictly monotonic continuous transformation on R, F is a family of estimators to estimate θ and the following argmin is nonempty argminD∈F E(δX (lqX (θ), D)), and let F ′ = φ(F). Then a) if φ is strictly increasing argminD′ ∈F ′ E(δφ(X) (lqφ(X) (p), D ′ )) = φ(argminD∈F E(δX (lqX (p), D))) b) if φ is strictly decreasing argminD′ ∈F ′ E(δφ(X) (lqφ(X) (p), D ′ )) = φ(argminD∈F E(δX (rqX (1 − p), D))) Proof We only prove a) and b) is similar. 
min E(δφ(X) (lqφ(X) (p), D ′ ))  D ′ ∈F ′  = min E(δφ(X) (φ(lqX )(p), φ(D))) D∈F  = min E(δX (lqX (p), D)) D∈F  Note that for a general family of estimators, F argminD∈F E(δX (lqX (p), D)) depends on the unknown distribution X by δX . We suggest two possible ways to get around this issue: • Restrict to a family F that argminD∈F E(δX (lqX (p), D)) does not depend on the distribution. 241  9.2. Quantile–specified parameter families • Use the empirical distribution to approximate the expression E(δX (lqX (p), D)). We will not explore the second method here and leave it for future research. Next subsection shows an important instance of the first method.  9.2.2  Continuous distributions with the order statistics family of estimators  Suppose that the desired distribution X is continuous then E(δX (lqX (p), D)) = E|FX (lqX (p)) − FX (D)| = E|p − FX (D)|. Now suppose a random sample X1 , · · · , Xn is given and we want to estimate lqX (p). We restrict to an important family of estimators, order statistics: F = {X1:n , · · · , Xn:n }. Then for i = 1, · · · , n:  E|p − FX (Xi:n )|,  does not depend on FX . This is because the distribution of FX (Xi:n ) does not depend on FX . It can be obtained as shown below: Gi (y) = P (FX (Xi:n ) ≤ y) = P (Xi:n ≤ lqX (y)) = n   X n P (X1 , · · · , Xj ≤ lqX (y) and Xj+1 , · · · , Xn > lqX (y)) = j j=i n   n   X X n n j j n−j = P (X ≤ lqX (y)) P (X > lqX (y)) y (1 − y)n−j . j j j=i  j=i  By taking the derivative of the above expression we can find the density function gi (p) and conclude: Z 1 E|p − FX (Xi:n )| = |p − y|gi (y)dy. 0  For a given p we want to find the i that minimize above which does not on FX . We can approach this problem theoretically to find such an i. Or we could try to estimate these integral using numerical methods. However, here we use simulation for two examples and leave the general case for future research. 242  9.3. Probability divergence (distance) measures Example Consider a family of continuous variables, quantile–specified by P = (1/2, P (Z ≤ 1)) where Z is the standard normal. Suppose a random sample X1 , · · · , Xn is given and we want to estimate lqX (1/2) and lqX (P (Z ≤ 1)) using the family of estimators, order statistics: F = {X1:n , · · · , Xn:n }. We estimate the parameters for n = 25 and n = 20. In order to minimize the loss we can approximate the loss by approximating the integral in Equation 9.2.2 or approximating E|p − FX (Xi:n )|, using an arbitrary continuous distribution such as standard normal to do the simulations. For a large number M , we create M samples of length n from normal and for every sample we find the i that minimize the loss. Then for every i, we compute the mean of such losses and find out which has the smallest mean loss. We do that for M = 1, · · · , 1000. The results for n = 25 are given in Figure 9.1. We see that for large M the estimator for lqX (1/2) is X13:25 and for lqX (P (Z < 1)) it is X22:25 . The results for n = 20 are given in Figure 9.2. The estimator for lqX (1/2) has changed between X10:20 and X11:21 and it is X18:20 for lqX (P (Z ≤ 1)). This shows that the argmin is not necessarily unique.  9.3  Probability divergence (distance) measures  In probability theory, physics and statistics several measures have been introduced as the “distance” of two probability measures (or random variables). These measures have several applications, one of which is parameter estimation. We list some of these measures in this section. 
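Before turning to these distance measures, it may help to make the order–statistic selection of Section 9.2.2 concrete. The sketch below is an illustration only, assuming NumPy; it is not the code used to produce Figures 9.1 and 9.2. It repeats that Monte Carlo experiment: since FX(Xi:n) is distributed as a uniform order statistic, the index i minimizing E|p − FX(Xi:n)| can be found by simulating uniform samples alone, which is exactly why the answer does not depend on FX.

```python
import numpy as np

def optimal_order_statistic(n, p, n_sims=100_000, seed=1):
    """Monte Carlo approximation of argmin_i E|p - F_X(X_{i:n})|.

    F_X(X_{i:n}) has the same law as the i-th order statistic of a
    Uniform(0,1) sample, so no particular F_X needs to be specified.
    """
    rng = np.random.default_rng(seed)
    u = np.sort(rng.uniform(size=(n_sims, n)), axis=1)  # each row: sorted uniform sample
    mean_loss = np.abs(u - p).mean(axis=0)              # estimates E|p - U_{(i)}| for i = 1..n
    return int(np.argmin(mean_loss)) + 1                # 1-based index i

if __name__ == "__main__":
    print(optimal_order_statistic(25, 0.5))     # 13, i.e. X_{13:25}, as in Figure 9.1
    print(optimal_order_statistic(25, 0.8413))  # 22, i.e. X_{22:25}; 0.8413 ~ P(Z <= 1)
```

For n = 25 this reproduces the choices X13:25 for p = 1/2 and X22:25 for p = P(Z ≤ 1) reported above; for n = 20 the Monte Carlo noise can move the median index between two adjacent order statistics, reflecting the non-uniqueness of the argmin noted in the text.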
The next section then introduces new measures of distance among probability measures using the c-probability loss functions (c ≥ 0). • The Kullback-Leibler (KL) distance: Suppose P, Q are probability measures and P is absolutely continuous with respect to Q. Then dP consider the Radon-Nikodym derivative of P with respect to Q, dQ [See [9]]. Then we define: DKL (P, Q) =  Z  Ω  log  dP dP. dQ  If P and Q have density functions over R, p(x), q(x) then 243  9.3. Probability divergence (distance) measures  15 13 11  optimal order  P(Z<0)  0  200  400  600  800  1000  600  800  1000  23 22 21 20  optimal order  24  P(Z<1)  0  200  400  simulations number  Figure 9.1: The order statistics family members that estimate lqX (1/2) and lqX (P (Z ≤ 1)) for a random sample of length 25 obtained by generating samples of size 1 to 1000 from a standard normal distribution  244  9.3. Probability divergence (distance) measures  11 10 9 8  optimal order  12  P(Z<0)  0  200  400  600  800  1000  600  800  1000  17.0 16.0 15.0  optimal order  18.0  P(Z<1)  0  200  400  simulations number  Figure 9.2: The order statistics family members that estimate lqX (1/2) and lqX (P (Z ≤ 1)) for a random sample of length 20 obtained by generating samples of size 1 to 1000 from a standard normal distribution  245  9.3. Probability divergence (distance) measures  Z  p(x)log(  R  p(x) )dx. q(x)  The symmetric version of this distance is called Kullback-Jeffreys DKJ (P, Q) = DKL (P, Q) + DKL (Q, P ). We show that the Kullback-Leibler distance is invariant under bijective differentiable monotonic transformations when the density functions exists and are positive everywhere on the real line. Let g be a monotonic, bijective and differentiable (bijective and differentiable will automatically imply strictly monotonic) transformation and X, Y random variables with density functions fX (x) and fY (x), positive on R. Then the density functions of g(X) and g(Y ) are respectively (g−1 )′ (x)fX (g−1 (x)) and (g−1 )′ (x)fY (g−1 (x)). Hence  DKL (φ(X), φ(Y )) = R∞  −∞  (g−1 )′ f  R∞  −∞ (g  X (g  −1 )′ f  −1 ′ −1 −1 (x)) log (g ) fX (g (x)) dx (g −1 )′ fY (g −1 (x))  X (g  =  −1 −1 (x)) log (fX (g (x)) dx. fY (g −1 (x))  We use the change of variable x = g(y). Then dx = (g−1 )′ dy and the proof is complete. For the strictly decreasing case note that the density function of g(X) and g(Y ) are respectively −(g−1 )′ (x)fX (g−1 (x)) and −(g−1 )(x)′ fY (g−1 (x)) and a similar argument works. We leave the general case (where the density function does not exist or is not positive over all the real line) as an open(?) problem. • Let P and Q be two probability distributions over a space Ω such that P is absolutely continuous with respect to Q. Then, for a convex function f such that f (1) = 0, the f -divergence of Q from P is If (P, Q) =  Z  Ω  f    dP dQ    dQ.  Note that the same argument as the one for KL distance shows that this distance is invariant for monotonic differentiable bijective transformations when the density functions exist and are positive.  246  9.3. Probability divergence (distance) measures • The Kolmogorov-Smirnov distance: Suppose X, Y are random variables on R with distribution functions FX and FY . Then KS(X, Y ) = sup |FX (x) − FY (x)|. x∈R  The Gilvenko-Cantelli Theorem states that if X1 , · · · , Xn is a random sample drawn from the distribution Fθ0 and Fn , the empirical distribution function lim KS(Fθ0 , Fn ) > ǫ = 0, a.s..  n→∞  Note that the KS metric is invariant under monotonic transformations. 
Take φ to be strictly monotonic on R. Then  sup |Fφ(X) (x) − Fφ(Y )(x) | =  x∈R sup |FX (φ−1 (x)) x∈R sup |FX (φ−1 (x)) φ−1 (x)∈R  − FY (φ−1 (x))| = − FY (φ−1 (x))| =  sup |FX (x) − FY (x)|. x∈R  Although the KS metric is invariant under strictly monotonic transformations, it is not intuitively very appealing as we show in the following example. Example Consider X ∼ U (0, 1), Y ∼ U (1/2, 3/2) and let Z be distributed as FZ :   0    1/2 FZ (z) =  z    1  z<0 0 ≤ z ≤ 1/2 . 1/2 < z < 1 z≥1  Then we have KS(X, Y ) = KS(X, Z) = 1/2. But we observe that FZ matches FX on (1/2, 1) while FX and FY differ by 1/2 on (0, 1). Another way to see the defect is the quantiles of Z and X match half of the time but the quantiles of X and Y are off as much as one half of a unit at all times. 247  9.4. Quantile distance measures To overcome the above problem one might (naively) suggest using an integral version IKS(X, Y ) =  Z  x∈R  |FX (x) − FY (x)|dx.  However, this definition is not well-defined. To see that consider FX (x) = 1 − 8/x, x > 8 and FY (x) = 1 − 9/x, x > 9. Then |FX (x) − FY (x)| = 1/x on [8, ∞], which does not have finite integral. It is also not invariant under strictly monotonic transformations for if φ is strictly monotonic and differentiable, IKS(φ(X), φ(Y )) =  Z  x∈R  |FX (φ−1 (x)) − FY (φ−1 (x))|dx.  In the right hand side of the above equation the factor (φ−1 )′ , that would make the distance invariant under transformations, is missing. • Lévy distance: Suppose (Ω, Σ, Pθ )θ∈Θ be a statistical space, where the Pθ are probability measures on Ω with σ-field Σ. Then we define  Lev(Fθ1 , Fθ2 ) = inf{ǫ > 0|Fθ1 (x − ǫ) < Fθ2 (x) < Fθ1 (x + ǫ), ∀x ∈ R}. It can be shown that convergence in the Lévy metric implies weak convergence for distribution function in R [31]. It is shift invariant but not scale invariant as discussed in [31].  9.4  Quantile distance measures  This section introduces the quantile distance measure to measure the distance among distribution functions on R (or random variables). We begin with a general definition using the quantiles and then consider interesting particular cases. The intuition behind all these metrics lies in their capability to measure the separation in the quantiles of two random variables. Definition Suppose a statistical space (Ω, P, {Xθ }θ∈Θ ) and a loss function L defined over the extended real numbers R ∪ {−∞, +∞} are given. Also let E be a measurable subset of (0,1) and dµE is a measure on E. Then we can define the following two measures of distance between Xθ1 and Xθ2 ,  248  9.4. Quantile distance measures  SQDLE (Xθ1 , Xθ2 ) = sup L(lqXθ1 (p), lqXθ2 (p)), p∈E  and IQDLE (Xθ1 , Xθ2 )  =  Z  p∈E  L(lqXθ1 (p), lqXθ2 (p))dµE ,  which we call the sup quantile distance and integral quantile distance respectively. Remark. Note that in general SQDLE and IQDLE are neither well-defined nor metrics on the space of random variables.. Remark. We can also take L(rqXθ1 (p), rqXθ2 (p)) in the above definitions. Remark. The natural choice for E is (0, 1) and the measure µ = L, where L is the Lebègues measure on (0, 1). However, one might choose another E depending on the purpose. For example E = (0.8, 1) might be more appropriate if the purpose is modeling the high extremes. c , δ c Remark. Interesting choices for L are δXθ1 , δX Xθ1 + δXθ2 and δXθ + θ1 1 c . Note that in all these cases the quantile distance is defined since these δX θ2 quantities are bounded respectively by 1, 1 + c, 2, 2 + 2c. 
The rest of this report focuses on quantile distances obtained from cprobability losses (c ≥ 0). (Note that c = 0 corresponds to the usual probability loss.)  9.4.1  Quantile distance invariance under continuous strictly monotonic transformations  This subsection show the invariance of quantile distance under strictly monotonic tranformations in the following lemmas. Lemma 9.4.1 (Quantile distance invariance under continuous strictly increasing transformations) Suppose X, Y are random variables, let Z E IQDδc (X, Y ) = L(lqX (p), lqY (p))dµE , X  E  and SQDδEc (X, Y ) = sup L(lqX (p), lqY (p)), X  p∈E  where E ⊂ (0, 1), c ≥ 0 and µE is a measure on E. Then IQDδEc (X, Y ) = IQDδEc X  φ(X)  (φ(X), φ(Y )), 249  9.4. Quantile distance measures and SQDδEc (X, Y ) = SQDδEc X  φ(X)  (φ(X), φ(Y )),  for all φ : R → R continuous and strictly increasing transformations. Proof The proof attains from noting that δφ(X) (lqφ(X) (p), lqφ(Y ) (p)) = δφ(X) (lqφ(X) (p), lqφ(Y ) (p)) + c(1 − 1{0} (lqφ(X) (p) − lqφ(Y ) (p))) = δφ(X) (φ(lqX (p)), φ(lqY (p))) + c(1 − 1{0} (lqX (p) − lqY (p))) = δX (lqX (p), lqY (p)) + c(1 − 10 (lqX (p) − lqY (p))) = c δX (lqX (p), lqY (p)).  c + δ c , which follows immeRemark. The above lemma is also true for δX Y diately.  Lemma 9.4.2 If E a measurable subset of [0,1] then the two following distance measures are equal: Z LQDδEX (X, Y ) = δX (lqX (p), lqY (p))dp, E  and RQDδEX (X, Y  )=  Z  δX (rqX (p), rqY (p))dp. E  The following two measures are also equal: Z E LQDδX +δY (X, Y ) = (δX + δY )(lqX (p), lqY (p))dp, E  and RQDδEX +δY (X, Y ) =  Z  (δX + δY )(rqX (p), rqY (p))dp. E  Proof We prove the first part part of the lemma and the second part is deduced from the first. We showed in the quantile definition section that the set {p|lqX (p) 6= rqX (p)} is countable. Hence, {p|lqX (p) 6= rqX (p)} ∪ {p|lqY (p) 6= rqY (p)},  250  9.4. Quantile distance measures is also countable. In the complement of this set δX (lqX (p), lqY (p)) = δX (rqX (p), rqY (p)). Hence the integral values are the same. Remark. Note that the above theorem also holds for any measure µ on any E ⊂ (0, 1) which is continuous with respect to the Lebègue measure. Because of this lemma we will not worry about the left or right quantile in the definitions. The following lemma establishes a relationship between LQDδX and c . LQDδX Lemma 9.4.3 Let E be a measurable subset of [0,1] and kE = L{p ∈ E|lqX (p) 6= lqY (p)}, where L is the Lebègue measure. Let Z E c LQDδc (X, Y ) = δX (lqX (p), lqY (p))dp, X  E  and LQDδEX (X, Y  )=  Z  δX (lqX (p), lqY (p))dp.  E  Then  LQDδEc (X, Y ) = LQDδEX (X, Y ) + ckE . X  Proof c (X, Y ) LQDδX =  Z  E  Z Z  Z  Z lqX (p)6=lqY (p),p∈E  lqX (p)=lqY (p),p∈E  lqX (p)6=lqY (p),p∈E  c δX (lqX (p), lqY (p))dp = c δX (lqX (p), lqY (p))dp + c δX (lqX (p), lqY (p))dp =  δX (lqX (p), lqY (p))dp + lqX (p)=lqY (p),p∈E  [δX (lqX (p), lqY (p)) + c(1 − 1{0} )(lqX (p) − lqY (p))]dp = LQDδEX (X, Y ) + ckE . 251  9.4. Quantile distance measures  Remark. Note that the same is true for RQDδEc and RQDδEX . Also X  L{p ∈ E|lqX (p) 6= lqY (p)} = L{p, p ∈ E|rqX (p) 6= rqY (p)}, because lqX , rqX and lqY , rqY are unequal only on a measure zero set. Hence the constant kE is the same as before and RQDδEc (X, Y ) = RQDδEX (X, Y ) + ckE . 
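Before stating the next lemmas, note that the invariance of Lemma 9.4.1 (with c = 0) is easy to check numerically in the continuous case, where δX(lqX(p), lqY(p)) = |p − FX(lqY(p))|. The following sketch is an illustration only, assuming SciPy; the normal pair and the transformation φ(t) = exp(t) are arbitrary choices, not taken from the thesis. It approximates the integral quantile distance on a grid of p values for (X, Y) and for (φ(X), φ(Y)) and obtains the same value.

```python
import numpy as np
from scipy.stats import norm, lognorm

# X ~ N(0,1), Y ~ N(1, 1.5^2); phi(t) = exp(t), so
# phi(X) ~ lognorm(s=1, scale=1) and phi(Y) ~ lognorm(s=1.5, scale=e).
X, Y = norm(0, 1), norm(1, 1.5)
phiX, phiY = lognorm(s=1.0, scale=1.0), lognorm(s=1.5, scale=np.exp(1))

p = np.linspace(0.0005, 0.9995, 2000)  # grid over (0,1), avoiding the endpoints

def lqd(FX, FY, p):
    """Riemann-sum approximation of the integral of |p - F_X(lq_Y(p))| over (0,1)."""
    return np.mean(np.abs(p - FX.cdf(FY.ppf(p))))

print(lqd(X, Y, p))        # LQD_{delta_X}(X, Y)
print(lqd(phiX, phiY, p))  # LQD_{delta_{phi(X)}}(phi(X), phi(Y)): same value
```

Any continuous strictly increasing φ gives the same agreement, since FX ∘ lqY is unchanged by the transformation.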
X  Lemma 9.4.4 Suppose E a measurable subset of [0,1] then the two following distance measures are equal Z E c LQDδc (X, Y ) = δX (lqX (p), lqY (p))dp, X  p∈E  and RQDδEc (X, Y X  Z  )=  p∈E  Also these two measures are equal Z LQDδEc +δc (X, Y ) = X  Y  p∈E  and RQDδEc +δc X Y  (X, Y ) =  Z  p∈E  c δX (rqX (p), rqY (p))dp.  c (δX + δYc )(lqX (p), lqY (p))dp,  c (δX + δYc )(rqX (p), rqY (p))dp.  Proof This is a straightforward consequence of the previous two lemmas.  Remark. Note that the above theorem also holds for any measure µ on any E ⊂ (0, 1) which is continuous with respect to the Lebègue measure. Lemma 9.4.5 (Quantile distance invariance under continuous strictly monotonic transformations) Suppose X, Y are random variables and let QD E (X, Y ) = LQDδEX (X, Y ),  (9.1)  QDcE (X, Y ) = LQDδEc (X, Y ),  (9.2)  X  252  9.4. Quantile distance measures where, E ⊂ (0, 1) symmetric, meaning p ∈ E ⇔ (1 − p) ∈ E, and µ is absolutely continuous with respect to the Lebègue measure and symmetric on E in the sense that if A is measurable then so is 1 − A while µ(A) = µ(1 − A). Then 9.1 and 9.2 are invariant under continuous strictly monotonic transformations, i.e. a) QD E (φ(X), φ(Y )) = LQDδEφ(X) (φ(X), φ(Y )) = QD E (X, Y ) = QDδEX (X, Y ), b) QDcE (φ(X), φ(Y )) = LQDδEc  φ(X)  (φ(X), φ(Y )) = QDcE (X, Y ) = QDδEc (X, Y ). X  Proof For φ continuous and strictly increasing transformations, we have shown the result in Lemma 9.4.1. Suppose φ is continuous and strictly decreasing. a) We use lqφ(X) (p) = φ(rqX (1 − p)) which we proved above using quantile symmetries: δφ(X) (lqφ(X) (p), lqφ(Y ) (p)) = δφ(X) (φ(rqX (1 − p)), φ(rqY (1 − p))) =  δ−φ(X) (−φ(rqX (1 − p)), −φ(rqY (1 − p))),  where the last equality is because δX (a, b) = δ−X (−a, −b). Now since −φ is continuous and increasing, the above is equal to δX (rqX (1 − p), rqY (1 − p)). We use this result in the following: Z E QD (X, Y ) = δX (lqX (p), lqY (p))dµE E Z = δX (rqX (1 − p), rqY (1 − p))dµE . E  Then we do a change of variable p → (1 − p) and by symmetry of µ, we find that the above is equal to Z δX (rqX (p), rqY (p))dµE . E  But by the previous lemmas and since µ is continuous with respect to the Lebègue measure, this is equal to Z δX (lqX (p), lqY (p))dµE . E  253  9.4. Quantile distance measures b) We only consider continuous and strictly decreasing functions φ: LQDδEc  φ(X)  Z  E  (φ(X), φ(Y )) =  c(1 − 1{0} (lqφ(X) (p) − lqφ(Y ) (p)))dp + LDQδφ(X) (X, Y ) = ckE + LDQE δφ(X) (φ(X), φ(Y )),  where, kE = µ{p ∈ E|lqφ(X) (p) 6= lqφ(Y ) (p)} =  µ{p ∈ E|φ(rqX (1 − p)) 6= φ(rqY (1 − p))} = µ{p ∈ E|rqX (1 − p) 6= rqY (1 − p)} = µ{p ∈ E|rqX (p) 6= rqY (p)} = µ{p ∈ E|lqX (p) 6= lqY (p)}.  We showed in a) that E LDQE δφ(X) (φ(X), φ(Y )) = LDQδX (X, Y )  and because we just showed that kE = µ{p ∈ E, |(lqX (p)) 6= (lqY )(p)}, we conclude DQE δc  E E = ckE +LDQE δφ(X) (φ(X), φ(Y )) = ckE +LDQδX (X, Y ) = LQDδc (X, Y ).  9.4.2  Quantile distance closeness of empirical distribution and the true distribution  φ(X)  X  The next theorem shows that the quantile distance between the sample distribution and the true distribution tends to zero when the sample size becomes large. Theorem 9.4.6 Let X1 , X2 , · · · be an i.i.d. random sample drawn from an arbitrary distribution function F . Then (a) SQDδX (F, Fn ) = sup δF (lqFn (p), lqF (p)) → 0., a.s., p∈(0,1)  254  9.4. Quantile distance measures and (b) IQDδX (F, Fn ) =  Z  δF (lqFn (p), lqF (p)) → 0., a.s..  p∈(0,1)  Proof We only need to prove (a) since (b) is a straightforward consequence of (a). 
Clearly lqFn (p) = Xi:n for p ∈ ((i − 1)/n, i/n], i = 1, 2, · · · , n. Also Fnc (Xi:n ) ≥ i/n and Fno (Xi:n ) ≤ (i − 1)/n. Pick an N large enough in the Glivenko-Cantelli Theorem such that n > N ⇒ |Fn (x) − F (x)| < ǫ, and |Fno (x) − F o (x)| < ǫ, uniformly in x. Consider two cases: Case I: Xi:n < lqF (p). Then δF (lqFn (p), lqF (p)) = δF (Xi:n , lqF (p)) = o  F (lqF (p)) − F c (Xi:n ) ≤ F o (lqF (p)) − Fnc (Xi:n ) + ǫ ≤ p − i/n + ǫ ≤ ǫ.  Case II: Xi:n > lqF (p). Then δF (lqFn (p), lqF (p)) = δF (Xi:n , lqF (p)) = o  F (Xi:n ) − F c (lqF (p)) ≤ Fno (Xi:n ) + ǫ − p  ≤ (i − 1)/n + ǫ − p ≤ ǫ.  i Since this holds for i = 1, 2, · · · , n and (0, 1) = ∪i=1,2,··· ,n ( i−1 n , n ], the supremum is also less than ǫ.  9.4.3  Quantile distance and KS distance closeness  Clearly if X ∼ Y , then LQDLE (X, Y ) = 0. In the following theorem we study c , c ≥ 0 and E = [0, 1]. The Kolmogorov the inverse question for L = δX Smirnoff distance was defined to be KS(X, Y ) = sup |FX (x) − FY (x)|. x∈R  We also define the “open Kolmogorov Smirnoff” distance as KS o (X, Y ) = sup |FXo (x) − FYo (x)|. x∈R  255  9.4. Quantile distance measures Lemma 9.4.7 Suppose X, Y are random variables, then KS o (X, Y ) = KS(X, Y ). To prove the lemma, we show that KS(X, Y ) ≤ ǫ ⇔ KS o (X, Y ) ≤ ǫ. Suppose KS(X, Y ) ≤ ǫ. If the R.H.S does not hold then there exist x ∈ R such that FXo (x) > FYo (x) + ǫ. Since FXo is left continuous, we conclude there is a y < x such that FXo (y) > FYo (x) + ǫ. Hence, FXc (y) ≥ FXo (y) > FYo (x) + ǫ ≥ FYc (y) + ǫ, which is a contradiction. Inversely, suppose KS o (X, Y ) ≤ ǫ. If the L.H.S does not hold then there exist x ∈ R such that FXc (x) > FYc (x) + ǫ. Since FYc is right continuous, we conclude there is y > x such that FXc (x) > FYc (y) + ǫ. Hence, FXo (y) ≥ FXc (x) > FYc (x) + ǫ ≥ FYo (y) + ǫ, which is a contradiction.  Lemma 9.4.8 Kolmogorov Smirnoff closeness implies Quantile distance closeness. More formally if for two random variables X, Y , KS(X, Y ) ≤ ǫ then SQDδX (X, Y ) = sup δX (lqX (p), lqY (p)) ≤ ǫ. p∈(0,1)  Proof For p ∈ (0, 1), suppose lqX (p) < lqY (p). Then δ(lqX (p), lqY (p)) = FXo (lqY (p)) − FXc (lqY (p)) ≤ 256  9.4. Quantile distance measures F o (lqY (p)) + ǫ − p ≤ p + ǫ − p = ǫ.  The discussion for lqY (p) < lqX (p) is similar.  Remark. By symmetry also KS(X, Y ) ≤ ǫ ⇒ SQDδY (X, Y ) ≤ ǫ. The converse needs the continuity assumption: Lemma 9.4.9 Suppose X, Y are continuous random variables. Then quantile distance closeness implies Kolmogorov Smirnoff distance closeness. More formally, suppose SQDδX (X, Y ) = sup δX (lqX (p), lqY (p)) ≤ ǫ p∈(0,1)  and SQDδY (X, Y ) = sup δY (lqX (p), lqY (p)) ≤ ǫ. p∈(0,1)  Then KS(X, Y ) ≤ ǫ. Proof Suppose the result is not true and there exists x such that |FX (x) − FY (x)| ≥ ǫ. Then let p1 = FX (x) and p2 = FY (x) and without loss of generality assume p2 > p1 . Since FY (x) = p2 , lqY (p2 ) ≤ x. But lqX (p2 ) = y > x. Otherwise p2 ≤ FX (lqX (p2 )) = FX (x) = p1 which is a contradiction. δX (lqX (p2 ), lqY (p2 )) = FXo (y) − FY (x) = FX (y) − FY (x) ≥ p2 − p1 > ǫ,  which is a contradiction. Note that we have used continuity of X in the second equality. Remark. This is not true in general. Consider X with P (X = 0) = 1 and Y with P (Y = 1) = 1. Then FX (1/2) − FY (1/2) = 1 and SQDδX (X, Y ) + SQDδY (X, Y ) = 0. In the next theorem we show that if the quantile distance between two variables are zero and one of them is continuous then they are identically distributed. 
Theorem 9.4.10 Suppose F1 , F2 distribution functions, F1 continuous and their quantile distance is zero. In other words, sup δF1 (lqF1 (p), lqF2 (p)) = 0.  p∈(0,1)  Then F1 = F2 . 257  9.4. Quantile distance measures Proof Suppose the result does not hold. Then we have two cases. Case I: ∃x, p1 = F1 (x) < F2 (x) = p2 . F1 (x) = p1 ⇒ lqF1 (p2 ) = y > x,  and Hence  F2 (x) = p2 ⇒ lqF2 (p2 ) = z ≤ x.  δF1 (lqF1 (p2 ), lqF2 (p2 )) = F1 (y) − F1 (z) ≥ F1 (y) − F1 (x) ≥ p2 − p1 . Case II: ∃x, p1 = F1 (x) > F2 (x) = p2 . Take p3 ∈ (p2 , p1 ). Then F1 (x) = p1 ⇒ lqF1 (p3 ) = y ≤ x. However if lqF1 (p3 ) = x, we conclude F1 (lqF1 (p3 )) = F1 (x) ⇒ p3 = p1 , which is a contradiction. Note that we have used the continuity of F1 in F1 (lqF1 (p3 )) = p3 . Also F2 (x) = p2 ⇒ lqF2 (p3 ) = z > x. Hence  δF1 (lqF1 (p3 ), lqF2 (p3 )) = δF1 (y, z) = F1 (z)−F1 (y) ≥ F1 (x)−F1 (y) ≥ p1 −p3 .  Here we prove an easy lemma regarding the continuity of δ. Lemma 9.4.11 Suppose F is a continuous distribution function. For any fixed b ∈ R, δF (a, b) is a continuous function in a. Proof Note that δF (a, b) = |F (b) − F (a)| because F is a continuous function. Lemma 9.4.12 Suppose F1 , F2 are distribution functions, F1 is continuous and δF1 (lqF1 (p0 ), lqF2 (p0 )) = ∆ > 0, for some p0 ∈ (0, 1) then there exist 0 < ǫ < p0 such that δF1 (lqF1 (p), lqF2 (p)) > ∆/3, p ∈ (p0 − ǫ, p0 ). 258  9.4. Quantile distance measures Proof Since F1 is continuous δF1 (lqF1 (p), lqF2 (p)) = |p − F1 (lqF2 (p))|. Let lqF2 (p0 ) = x1 and F1 (x1 ) = p1 . Then |p0 − p1 | = ∆. By continuity of F1 there exist ǫ′ > 0 such that x ∈ (x1 − ǫ′ , x1 + ǫ′ ) ⇒ F1 (x) ∈ (p1 −  ∆ ∆ , p1 + ). 3 3  By left continuity of lqF2 for ǫ′ positive, there exists an 0 < ǫ < min(∆/3, p0 ) such that p ∈ (p0 − ǫ, p0 ) ⇒ lqF2 (p) ∈ (x1 − ǫ′ , x1 ). Hence for p ∈ (p0 − ǫ, p0 ), we have F1 (lqF2 (p)) ∈ (p1 − ∆/3, p1 + ∆/3). Hence δF1 (lqF1 (p), lqF2 (p)) = |p − F1 (lqF2 (p))| ≥ |p0 − p1 | − ǫ −  ∆ ≥ ∆/3. 3  Lemma 9.4.13 Suppose F1 , F2 are distribution functions and F1 is continuous. Also assume Z 1 IDQδF1 (F1 , F2 ) = δF1 (lqF1 (p), lqF2 (p)) = 0. 0  Then F1 = F2 . Proof The assumption implies that δF1 (lqF1 (p), lqF2 (p)) = 0, ∀p ∈ (0, 1). For otherwise if δF1 (lqF1 (p0 ), lqF2 (p0 )) = ∆ > 0, for some p0 . By the previous lemma there exist 0 < ǫ < p0 such that δF1 (lqF1 (p), lqF2 (p)) > ∆/3, p ∈ (p0 − ǫ, p0 ). This implies that Z  0  1  δF1 (lqF1 (p), lqF2 (p)) ≥ ǫ∆,  which is a contradiction. Now we can use Lemma 9.4.10 to conclude F1 = F2 .  259  9.4. Quantile distance measures  9.4.4  Quantile distance for continuous variables  From now on we only consider continuous variables and the probability loss function with c = 0, δX . Some results can be generalized to the general distributions but we leave that for future research. We use the simpler notations:  QDX (X, Xθ ) = LQDδX (X, Xθ ) =  Z  0  1  δX (lqX (p), lqXθ (p))dp.  Also QD(X, Xθ ) = QDX (X, Xθ ) + QDXθ (X, Xθ ). Quantile distance in the continuous case can be obtained by: Z 1 QDX (X, Xθ ) = δX (lqX (p), lqXθ (p))dp = 0 Z 1 Z 1 |FX ◦ lqX (p) − FX ◦ lqXθ (p)|dp = |p − FX ◦ lqXθ (p)|dp. 0  0  We can also consider the quantile distance closeness in the tails. Consider the tails to correspond to probabilities E = (0, 0.025) ∪ (0.0975, 1). Then L(E) = 0.05 (L being the Lèbegue measure) and we can define tail QDX (X, Xθ )  δX (lqX (p), lqXθ (p))dp/0.05 = Z Z |FX ◦ lqX (p) − FX ◦ lqXθ (p)|dp/0.05 = |p − FX ◦ lqXθ (p)|dp/0.05. 
E  =  Z  E  E  We have divided the integral by 0.05 the length of E to make this measure comparable to the overall measure over [0,1], which has length 1. Then we compute the quantile distance of the standard normal to some known distributions. Both the overall quantile distance and the tail quantile distance are calculated (by approximating the integrals) and the results are given in Table 9.1 and 9.2. For the overall quantile distance we observe that QDX and QDY have almost the same value. A theoretical result regarding this observation is desirable and we leave this for future research. This is not true in general for the tail distance. Then we find the closest Cauchy with scale parameter in (0,4) (and location parameter=0) to the standard normal. Once using the quantile distance and once using the tail quantile distance. We find the quantile distance of 260  9.4. Quantile distance measures the standard normal to all Cauchy distributions with scale parameters on the grid (0.01, 0.02, · · · , 4.00) (and location parameter=0). The results are given in Figures 9.3 and 9.5 respectively. For the overall quantile distance the optimal Cauchy is the one with scale parameter 0.66 and for the tail quantile distance, the optimal Cauchy is the one with scale parameter 0.12. Figure 9.4 depicts the normal distribution functions compared with a few Cauchy distributions including the optimal and Figure 9.6 depicts the normal distribution in the upper tail with a few Cauchy distributions including the optimal in tails with scale parameter 0.12. Figure 9.7 depicts the standard normal distribution compared with the optimal Cauchy for the overall quantile distance and the optimal Cauchy for the tail quantile distance. We conclude that a fit that is optimally might not be optimal on the tails. We use this fact later in choosing our method to model extreme temperature events. Distribution  QDX (X, Y )  QDY (X, Y )  QD  Y = N (1, 1) Y = N (0.5, 1) Y = N (0, 2) Y = t(1) Y = t(10) Y = t(100) Y = Cauchy(scale = 1) Y = χ2 (1) U (−0.5, 0.5) U (−1, 1) U (−2, 2) U (−3, 3)  0.2605080 0.138301 0.1024215 0.06382985 0.0078747 0.000795163 0.06376941 0.2190132 0.1522836 0.06562216 0.05612716 0.1171562  0.2605080 0.138301 0.1024207 0.0637436 0.007872528 0.0007951621 0.06376579 0.2190249 0.1522991 0.06563009 0.0561283 0.1171562  0.5210159 0.276602 0.2048422 0.1275734 0.01574723 0.001590325 0.1275352 0.4380381 0.3045827 0.1312522 0.1122555 0.2343124  Table 9.1: Comparing standard normal with various distributions using quantile distance, where U denotes the uniform distribution and χ2 the Chi-squared distribution.  261  0.15  1  2  3  4  0  1  2  3  4  0  1  2  3  4  0.15  0  QD  0.1 0.2 0.3 0.4  0.05  QD2  0.05  QD1  9.4. Quantile distance measures  scale parameter  Figure 9.3: Cauchy distribution’s distance with different scale parameter (and location parameter=0) to the standard normal. In the plots QD1 = QX and QD2 = QDY and QD = QD1 + QD2, where X is the standard normal and Y is the Cauchy.  262  0.0  0.2  0.4  F(x)  0.6  0.8  1.0  9.4. Quantile distance measures  −3  −2  −1  0  1  2  3  x  Figure 9.4: The distribution function of standard normal (solid) compared with the optimal Cauchy (and location parameter=0) picked by quantile distance minimization with scale parameter=0.66 (dashed curve), Cauchy with scale parameter=1 (dotted) and Cauchy with scale parameter=0.5 (dot dashed).  263  QD1 QD  1  2  3  4  0  1  2  3  4  0  1  2  3  4  0.00 0.10 0.20 0.30  0  0.05 0.15 0.25  QD2  0.00  0.10  0.20  9.4. 
Quantile distance measures  scale parameter  Figure 9.5: Cauchy distribution’s distance with different scale parameter (and location parameter=0) to the standard normal on the tails. In the plots QD1 = QX and QD2 = QDY and QD = QD1 + QD2, where X is the standard normal and Y is the Cauchy.  264  0.90 0.80  0.85  F(x)  0.95  1.00  9.4. Quantile distance measures  2.0  2.2  2.4  2.6  2.8  3.0  x  Figure 9.6: The distribution function of standard normal (solid) compared with the optimal Cauchy picked by tail quantile distance minimization with scale parameter=0.12 (dashed curve), Cauchy with scale parameter=0.65 (dotted) and Cauchy with scale parameter=0.01 (dot dashed).  265  0.0  0.2  0.4  F(x)  0.6  0.8  1.0  9.4. Quantile distance measures  −3  −2  −1  0  1  2  3  x  Figure 9.7: Comparing the standard normal distribution (solid) with optimal Cauchy picked by quantile distance (dashed) and the optimal Cauchy picked by tail quantile distance minimization (dotted).  266  9.4. Quantile distance measures Distribution  tail QDX (X, Y )  QDYtail (X, Y )  QDtail (X, Y )  Y = N (1, 1) Y = N (0.5, 1) Y = N (0, 2) Y = t(1) Y = t(10) Y = t(100) Cauchy(scale = 1) Y = χ2 (1) U (−0.5, 0.5) U (−1, 1) U (−2, 2) U (−3, 3)  0.05075276 0.01824013 0.01249034 0.0125000 0.007631262 0.0009740074 0.0125000 0.25006521 0.3004565 0.1523052 0.01313629 0.01083494  0.05075276 0.01824013 0.11206984 0.1184949 0.011192379 0.0010122519 0.1180231 0.06467072 0.0125000 0.0125000 0.01205279 0.10054194  0.10150552 0.03648026 0.12456018 0.1309949 0.018823642 0.0019862594 0.1305231 0.31473593 0.3129565 0.1648052 0.02518908 0.11137688  Table 9.2: Comparing standard normal on the tails with some distributions using quantile distance, where U denotes the uniform distribution and χ2 the Chi-squared distribution.  9.4.5  Equivariance of estimation under monotonic transformations using the quantile distance  Suppose a family of distributions {Xθ }θ∈Θ , Θ ⊂ Rk is given. Also assume φ is a continuous and strictly monotonic transformation on R. Consider the family of distributions {Yθ = φ(Xθ )}θ∈Θ . Then the family {Yθ }θ∈Θ is parameterized by the same parameters since P (Yθ < a) = P (φ(Xθ ) < a) = P (Xθ < φ−1 (a)). Then the following lemma shows the equivariance property of quantile distance estimation. Lemma 9.4.14 Suppose a random variable X and a family of distributions {Xθ }θ∈Θ are given, A = argminθ∈Θ  Z  0  1  δX (lqX (p), lqXθ (p))dp,  is nonempty and φ is a continuous and strictly monotonic transformation. Let Z 1 B = argminθ∈Θ δφ(X) (lqφ(X) (p), lqφ(Xθ ) (p))dp. 0  Then A = B. In other words if Xθ is an optimal estimator of X, then φ(Xθ ) is an optimal estimator of φ(X). 267  9.4. Quantile distance measures Proof This is trivial by invariance properties of quantile distance under continuous strictly monotonic transformations. Remark. The above is also true if we use replace the integral quantile distance by the sup quantile distance.  9.4.6  Estimation using quantile distance  Here we only consider estimation using integral quantile distance. In order to estimate a distribution X using a parameterized family {Xθ }θ∈Θ , one can try to find Z 1  argminθ∈Θ  0  δX (lqX (p), lqXθ (p))dp.  However, the above expression depends on δX an unknown. The available information to us is usually a random sample X1 , · · · , Xn . Remark. If we use the empirical distribution instead of the distribution of X is above, we get: Z 1 δFn (lqFn (p), lqXθ (p))dp. 
argminθ∈Θ 0  The argmin can be checked again to be equivariant under continuous and strictly monotonic transformations. Tables 9.3 and 9.4 compare the maximum likelihood estimation to the quantile distance estimation method for a sample of size N = 20 and N = 100 respectively. In each case we generate 50 samples of length N and estimate the parameters using both methods. Then we assess the performance by a few measures: mean absolute error, mean square error, mean probability loss error and mean quantile distance. In both cases maximum likelihood has done slightly better in terms of all errors except the quantile distance error in which case the quantile distance estimation has done significantly better. The histogram for both estimation methods for N = 20 and N = 100 are given in Figures 9.8 and 9.9 respectively. For both maximum likelihood and quantile distance estimations for N = 100 the parameters have a symmetric (close to normal) distribution.  268  9.4. Quantile distance measures Error type Mean probability loss error for µ = lqN(µ,σ 2 ) (1/2) Mean probability loss for σ 2 + µ = lqN(µ,σ 2 ) (P (Z < 1)) Mean abs. error for µ Mean abs. error for σ Mean square error µ Mean square error for σ Mean QD error  QD error  s.e. of QD error  ML error  s.e. ML error  0.077  0.061  0.077  0.055  0.185  0.114  0.176  0.096  0.198 0.159 0.064 0.041 0.035  0.160 0.127 0.089 0.065 0.009  0.196 0.132 0.058 0.025 0.122  0.143 0.085 0.077 0.028 0.073  Table 9.3: Assessment of Maximum likelihood estimation and quantile distance estimation using several measures of error for a sample of size 20. In the table s.e. stands for the standard error. Error type Mean probability loss for µ = lqN(µ,σ 2 ) (1/2) Mean probability loss for σ 2 + µ = lqN(µ,σ 2 ) (P (Z < 1)) Mean abs. error for µ Mean abs. error for σ Mean square error µ Mean square error for σ Mean QD error  QD error  s.e. of QD error  M L error  s.e. ML error  0.028  0.020  0.027  0.020  0.157  0.046  0.165  0.038  0.070 0.079 0.007 0.009 0.014  0.051 0.052 0.009 0.011 0.003  0.068 0.061 0.007 0.005 0.045  0.051 0.039 0.009 0.005 0.026  Table 9.4: Assessment of Maximum likelihood estimation and quantile distance estimation using several measures of error for a sample of size 100. In the table s.e. stands for the standard error.  269  1.0  Density  0.5  1.0  0.0  0.0  0.5  Density  1.5  1.5  9.4. Quantile distance measures  −0.6  −0.2 0.0  0.2  0.4  0.6  −0.6  Density 0.6  0.8  1.0  QD sd estimate  0.2  0.6  1.2  0.0 0.5 1.0 1.5 2.0 2.5  1.5 1.0  Density  0.5 0.0 0.4  −0.2  ML mean estimate  2.0  QD mean estimate  0.6  0.8  1.0  1.2  1.4  ML sd estimate  Figure 9.8: Histograms for the parameter estimates using quantile distance and maximum likelihood methods for a sample of size 20.  270  4 3 0  1  2  Density  3 2 0  1  Density  4  9.4. Quantile distance measures  −0.2  −0.1  0.0  0.1  0.2  −0.2  −0.1  0.0  0.1  0.2  ML mean estimate  3 0  0  1  2  Density  2 1  Density  3  4  4  QD mean estimate  0.8  0.9  1.0  1.1  QD sd estimate  1.2  0.85  0.95  1.05  1.15  ML sd estimate  Figure 9.9: Histograms for the parameter estimates using quantile distance and maximum likelihood methods for a sample of size 100.  271  Chapter 10  Binary temperature processes 10.1  Introduction  This chapter uses the theory developed in previous chapters to find appropriate models for extreme temperature events. We consider both low and high temperatures. The temperature is measured in degrees centigrade. 
We define a day with minimum temperature (mt) less than zero as extremely cold and denote it by e:

e(t) = 1 if mt(t) ≤ 0 (deg C), and e(t) = 0 if mt(t) > 0 (deg C).

Taking 0 (deg C) to be the cut–off for low temperature seems reasonable in the absence of any other considerations, since it is the usual definition of a frost. In agriculture, where most plants contain a lot of water, this can be considered an important cut–off. No comparably natural cut–off exists for extremely high temperature. To define extreme events, we ask the following questions:
1. Should the definition of an extreme event depend on the purpose of our model?
2. Should it depend on the time of the year and location?
3. What should be the cut–off (threshold) to define an extreme event?
4. Should we use a certain quantile as the cut–off? In that case, which quantile should be used?
We provide some answers in the following:
1. The answer to the first question is clearly affirmative. For example, a high temperature day for agricultural purposes is different from one for energy–providing purposes. Even for the farmer, different crops may have different tolerances to hot or cold weather.
2. The answer to the second question depends on the model's purpose. We might want to vary the definition over time and space for some purposes.
3. We do not know of any natural cut–off for high temperatures like that for low temperature.
4. Quantiles have long been used to determine extreme events. Choosing the level of the quantile depends on the purpose. Some extreme–value modelers pick the quantile high enough to ensure the validity of the assumptions underlying their models, as Embrechts et al. discuss in [16]. For example, a well–known result asserts that P(X − u < v | X > u) follows a known distribution (an extreme value distribution, e.g. Pareto) when u is large. [See [16].] We do not favor such methods of choosing the threshold. The threshold should be picked primarily to reflect our needs in the real problem rather than to satisfy the assumptions of the models. If the models do not satisfy the conditions, we should find other models rather than move the threshold up.
Based on the above discussion, the statistician's knowledge alone cannot define the extreme events. Ralph Wright (personal communication) in AAFRD (Agriculture and Rural Development in Alberta, Canada) raises similar points. In particular, he said the following about droughts:
"Drought is really defined by the impact that the moisture deficit has on a specific use or uses. Its definition can vary both with time of year and from place–to–place. Drought can be short–term or long–term. For example, one month of hot dry weather can significantly reduce crop yields, despite the fact that normal amounts of precipitation have been received over the past year. On the other hand, crops may do fine in dry weather conditions if precipitation has been received in a timely manner and temperatures have been favorable. However under the same conditions, a dam operator in the same area may have severe shortages in the reservoir and declare drought like conditions (e.g. with low winter snow–fall and poor spring run–off). You will need to define your drought based on whom or what is being impacted by the water shortage."
Since we do not have any standard definition of an extremely hot day, we use the data.
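In the example developed next, the data-driven threshold is a high quantile of daily maximum temperatures pooled over stations and years, computed with the partition-based quantile algorithm of earlier chapters. Purely to illustrate the idea of summarizing partitions by quantiles and then combining the summaries, the following crude sketch assumes NumPy and simulated station records; it is not the thesis algorithm and carries none of its precision guarantees.

```python
import numpy as np

def left_quantile(x, p):
    """Left quantile of a data vector: the smallest value v with F_n(v) >= p."""
    x = np.sort(np.asarray(x))
    k = max(int(np.ceil(p * len(x))) - 1, 0)
    return x[k]

def approx_global_quantile(partitions, p, m=1000):
    """Summarize each partition by m left quantiles on a probability grid, then
    take a weighted quantile of the pooled summaries (weights proportional to
    partition sizes).  A crude stand-in for the combination step of Chapters 7-8."""
    grid = (np.arange(1, m + 1) - 0.5) / m
    values, weights = [], []
    for x in partitions:
        values.append([left_quantile(x, q) for q in grid])
        weights.append(np.full(m, len(x) / m))
    values, weights = np.concatenate(values), np.concatenate(weights)
    order = np.argsort(values)
    cum = np.cumsum(weights[order]) / weights.sum()
    idx = min(np.searchsorted(cum, p), len(values) - 1)
    return values[order][idx]

# hypothetical example: three "stations" with simulated daily maxima
rng = np.random.default_rng(0)
stations = [rng.normal(15, 9, size=n) for n in (20000, 23000, 24000)]
print(approx_global_quantile(stations, 0.95))                 # approximate 95th percentile
print(left_quantile(np.concatenate(stations), 0.95))          # exact value, for comparison
```

In practice each summary would be formed once per station (or per data file) and only the summaries combined, which is the point of the approach for very large datasets.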
In our example, to define a binary process of (hot)/(not hot) for temperature, we pick the global spatial/temporal 95th percentile using the data from 25 stations over Alberta that had daily maximum temperature (MT) data from 1940 to 2004. The 95th percentile was computed using the quantile algorithm developed in previous chapters and turned out to be 26.7. The exact value was also found and turned out to be q = 27 (deg C). We then define the binary process of extremely hot temperature as:

E(t) = 1 if MT(t) ≥ q, and E(t) = 0 if MT(t) < q,

where q = 27 (deg C) here. In order to study extreme events (e.g. for MT) three approaches come to mind:
1. Model the whole daily MT process and use that to infer about the extremes. For MT, we have shown that a Gaussian distribution fits the daily values fairly well. However, in the tails, usually of paramount concern, the fit does not do well, as shown in the qq–plots in Chapter 2. Another difficulty with this approach is picking a covariance function to model the covariance over time. Also, in Chapter 9 we showed that even though two distributions are very close in terms of overall quantile distance, they might not be very close in terms of tail quantile distance (Figure 9.7). This shows that if we use a good overall fit to study extremes (for example extremely hot temperature), our results might not be reliable.
2. Use a specified threshold and model the values exceeding the threshold. This approach has several drawbacks. Firstly, we cannot answer the question of how often or in what periods of the year the extremes happen, because we model the actual extreme values and ignore the non–extreme values. Secondly, a strong assumption of independence is needed for this method. Thirdly, we need to pick the threshold high enough to make the model reasonable, as mentioned before. This might not be an optimal threshold from a practical point of view.
3. Based on a real problem, use a threshold to define a new binary process of (extreme)/(not extreme) values and then model that binary process. This is the method we use, and it does not have the issues mentioned in 1 and 2 because the threshold is not taken to satisfy some statistical property and we make few assumptions about the binary chain.

10.2  rth–order Markov models for extreme minimum temperatures

This section looks for appropriate models for the binary process e(t) of cold/not cold temperature days. This is a binary process, and the Categorical Expansion Theorem (Theorem 3.5.6) gives the form of all such rth–order Markov chains. Here we also consider other covariates such as the minimum temperature of the previous day and two days ago, as well as seasonal (deterministic) covariates. The next subsection uses graphical tools and exploratory techniques to investigate the properties the model should have. Then we use the BIC criterion to compare several proposed models. We use partial likelihood techniques to estimate parameters, as proposed by Kedem et al. in [27].

10.2.1  Exploratory analysis for binary extreme minimum temperatures

Here we perform an exploratory analysis of the binary process e(t) using two stations, Banff and Medicine Hat, which have data from 1895 to 2006. The transition probabilities are computed from the historical data, treating years as independent observations; a sketch of this computation is given below.
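A minimal sketch of this computation follows, assuming pandas and a hypothetical long-format table for one station with one row per day and columns year, day (day of year) and the 0–1 variable e; it is not the code used to produce the figures.

```python
import pandas as pd

def transition_probabilities(df):
    """Estimate p01(d) = P(e_t = 1 | e_{t-1} = 0) and p11(d) = P(e_t = 1 | e_{t-1} = 1)
    for each day of year d, pooling years as independent replicates."""
    df = df.sort_values(["year", "day"]).copy()
    df["e_prev"] = df.groupby("year")["e"].shift(1)   # e(t-1), lagged within each year
    df = df.dropna(subset=["e_prev"])
    probs = df.groupby(["day", "e_prev"])["e"].mean().unstack("e_prev")
    return probs.rename(columns={0.0: "p01", 1.0: "p11"})

# usage (hypothetical file name):
# df = pd.read_csv("medicine_hat_frost.csv")
# probs = transition_probabilities(df)
```

The two resulting columns are the day-of-year estimates of p01 and p11 plotted in Figures 10.3 and 10.4; days on which a conditioning state never occurs (for example, no freezing days in mid-summer at Medicine Hat) simply yield missing values, as seen in Figure 10.4.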
The results are summarized a follows: • Figures 10.1 and 10.2 plot the probability of a freezing day over the course of a year for the Banff and Medicine Hat stations, respectively. A regular seasonal pattern is seen. Medicine Hat seems to have a much longer frost–free period. • Figures 10.3 and 10.4 plot the estimated transition probabilities, p̂01 and p̂11 for the Banff and Medicine Hat stations. If the chain were a 0th–order Markov chain then these two curves would overlap. This is not the case and Markov chain at least of 1st–order seems necessary. In the p̂01 curve for both Banff and Medicine Hat, high fluctuations are seen at the beginning and end of the year which corresponds to the cold season. This is not surprising because there are very few pairs in the data with a freezing day followed by a non–freezing day in a cold season in Alberta. • In Figure 10.4, pˆ11 is missing for a period over the summer. This is because no freezing day is observed over this period in the summer and hence p̂11 could not be estimated. 275  0.6 0.4 0.0  0.2  Probability of a freezing day  0.8  1.0  10.2. rth–order Markov models for extreme minimum temperatures  0  100  200  300  Day of the year  Figure 10.1: The estimated probability of a freezing day for the Banff site for different days of a year computed using the historical data. • Figures 10.5 and 10.6 give the plots for the 2nd–order transition probabilities. They overlap substantially and hence a 2nd–order Markov chain does not seem to be necessary.  10.2.2  Model selection for extreme minimum temperature  This section finds models for the extreme minimum temperature process e(t). Here Zt−1 denotes the covariate process. We investigate the following predictors: • ek (t) ≡ e(t − k). Was it an extremely cold day k days ago? • mtk (t) ≡ mt(t − k), the actual minimum temperature k days ago. • N k , the number of freezing days during the k previous days. • SIN , COS, SIN 2 and COS2 which are abbreviations for sin(ωt), 2π cos(ωt), sin(2ωt) and cos(2ωt), respectively (with ω = 366 ). 276  0.6 0.4 0.0  0.2  Probability of a freezing day  0.8  1.0  10.2. rth–order Markov models for extreme minimum temperatures  0  100  200  300  Day of the year  Figure 10.2: The estimated probability of a freezing day for the Medicine Hat site for different days of a year computed using the historical data.  277  0.6 0.4 0.0  0.2  Probability  0.8  1.0  10.2. rth–order Markov models for extreme minimum temperatures  0  100  200  300  Day of the year  Figure 10.3: The estimated 1st–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Banff site. The dotted line represents the estimated probability of “e(t) = 1 if e(t − 1) = 1” (pˆ11 ) and the dashed, “e(t) = 1 if e(t − 1) = 0” (pˆ01 ).  278  0.6 0.4 0.0  0.2  Probability  0.8  1.0  10.2. rth–order Markov models for extreme minimum temperatures  0  100  200  300  Day of the year  Figure 10.4: The estimated 1st–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Medicine Hat site. The dotted line represents the estimated probability of “e(t) = 1 if e(t − 1) = 1” (pˆ11 ) and the dashed, “e(t) = 1 if e(t − 1) = 0” (pˆ01 ).  279  0.6 0.4 0.0  0.2  Probability  0.8  1.0  10.2. 
Figure 10.5: The estimated 2nd–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Banff site, with p̂111 (solid) compared with p̂011 (dotted), both calculated from the historical data.

Figure 10.6: The estimated 2nd–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Banff site, with p̂001 (solid) compared with p̂101 (dotted), calculated from the historical data.

Figure 10.7: The estimated 2nd–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Medicine Hat site, with p̂111 (solid) compared with p̂011 (dotted), calculated from the historical data.

Figure 10.8: The estimated 2nd–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Medicine Hat site, with p̂001 (solid) compared with p̂101 (dotted), calculated from the historical data.

Table 10.1 compares models with a constant and N^k as the covariate process. The optimal model picked by the BIC criterion is the model with the covariates Zt−1 = (1, N^11).

Model: Zt−1     BIC       parameter estimates
(1, N^1)        1251.7    (-2.144, 4.260)
(1, N^2)        1166.5    (-2.501, 2.490)
(1, N^3)        1142.9    (-2.653, 1.755)
(1, N^4)        1121.6    (-2.773, 1.371)
(1, N^5)        1111.2    (-2.852, 1.125)
(1, N^6)        1093.1    (-2.932, 0.961)
(1, N^7)        1087.4    (-2.977, 0.835)
(1, N^8)        1081.7    (-3.015, 0.739)
(1, N^9)        1077.1    (-3.047, 0.663)
(1, N^10)       1066.5    (-3.089, 0.605)
(1, N^11)       1056.4    (-3.130, 0.557)
(1, N^12)       1059.5    (-3.135, 0.511)
(1, N^13)       1062.3    (-3.140, 0.472)
(1, N^14)       1072.8    (-3.126, 0.437)
(1, N^15)       1080.9    (-3.118, 0.406)
(1, N^16)       1091.9    (-3.102, 0.379)
(1, N^17)       1104.2    (-3.083, 0.354)
(1, N^18)       1112.1    (-3.075, 0.334)
(1, N^19)       1118.6    (-3.068, 0.315)
(1, N^20)       1126.5    (-3.058, 0.299)

Table 10.1: BIC values for models including N^k for the extreme minimum temperature process e(t) at the Medicine Hat site.

Model: Zt−1                       BIC      parameter estimates
(1)                               2539.9   (-0.0251)
(1, e^1)                          1251.7   (-2.144, 4.260)
(1, e^2)                          1473.6   (-1.856, 3.683)
(1, e^1, e^2)                     1157.7   (-2.501, 3.085, 1.896)
(1, e^1, e^2, e^1 e^2)            1162.4   (-2.586, 3.389, 2.190, -0.593)
(1, mt^1)                         963.7    (0.109, -0.400)
(1, mt^1, mt^2)                   954.0    (0.091, -0.329, -0.082)
(1, COS, SIN)                     984.0    (-0.070, 4.292, 1.324)
(1, COS, SIN, COS2, SIN2)         984.2    (-0.502, 4.505, 1.399, -0.464, -0.493)
(1, COS, SIN, COS2)               986.7    (-0.258, 4.359, 1.335, -0.353)
(1, COS, SIN, SIN2)               984.4    (-0.217, 4.365, 1.360, -0.402)
(1, mt^1, mt^2, mt^3)             940.7    (0.062, -0.319, -0.009, -0.094)
(1, mt^1, mt^2, mt^1 mt^2)        943.4    (0.211, -0.339, -0.084, -0.0091)
(1, e^1, COS, SIN)                901.5    (-1.008, 1.840, 3.325, 1.013)
(1, mt^1, COS, SIN)               855.3    (-0.074, -0.234, 2.394, 0.746)
(1, mt^1, mt^2, COS, SIN)         861.9    (-0.076, -0.247, 0.023, 2.504, 0.785)

Table 10.2: BIC values for several models for the extreme minimum temperature process e(t) at the Medicine Hat site.
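The BIC comparisons reported in Tables 10.1 and 10.2 amount to fitting a logistic model for e(t) on each candidate covariate set by maximizing the (partial) likelihood and then ranking the fits by BIC. A minimal sketch of such a comparison is given below; it is not the thesis code. It assumes a covariate table like the one sketched above, uses the Logit model from statsmodels, and computes BIC explicitly as k log n − 2 log L so that the penalty term is unambiguous.

```python
# Hypothetical sketch: fit a logistic model for e(t) on each candidate covariate
# set and rank the candidates by BIC = k*log(n) - 2*logL.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def bic_table(df: pd.DataFrame, candidate_models: dict) -> pd.DataFrame:
    """candidate_models maps a label to the list of covariate columns in Zt-1;
    the intercept (the '1') is added to every model."""
    rows = []
    for label, cols in candidate_models.items():
        X = sm.add_constant(df[cols]) if cols else pd.DataFrame({"const": 1.0}, index=df.index)
        res = sm.Logit(df["e"], X).fit(disp=0)          # maximum likelihood fit
        k, n = len(res.params), len(df)
        rows.append({"model": label,
                     "BIC": k * np.log(n) - 2.0 * res.llf,
                     "estimates": np.round(np.asarray(res.params), 3)})
    return pd.DataFrame(rows).sort_values("BIC")

# Hypothetical candidate sets mirroring a few rows of Table 10.2
# (`covariates` as built in the previous sketch):
# candidates = {"(1)": [],
#               "(1, e^1)": ["e1"],
#               "(1, e^1, COS, SIN)": ["e1", "COS", "SIN"],
#               "(1, mt^1, COS, SIN)": ["mt1", "COS", "SIN"]}
# print(bic_table(covariates, candidates))
```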
Table 10.2 compares several models, some of which include seasonal terms and continuous covariates. The optimal model is (1, mt^1, COS, SIN), which includes the minimum temperature of the previous day and seasonal terms. The model (1, e^1, COS, SIN) has a larger BIC but is preferable to all models other than (1, mt^1, COS, SIN) and (1, mt^1, mt^2, COS, SIN). Note that it is not possible to compute the probability of events in the long–term future using (1, mt^1, COS, SIN), since we do not know mt except perhaps at the present time. Hence the optimal applicable model seems to be (1, e^1, COS, SIN).

10.3  rth–order Markov models for extreme maximum temperatures

This section finds appropriate models for the binary process of extremely hot temperatures E(t) as defined above. To define a hot day, we use the 95th percentile of the data from 25 stations over Alberta that had daily MT data from 1940 to 2004. The 95th percentile turns out to be q = 27 (deg C). We computed it once with the fast quantile algorithm developed in Chapter 7 and once with an exact method; the algorithm gave the approximate value 26.7, which is very close to the exact value. (See Table ?? for more details on the computation.)

10.3.1  Exploratory analysis for extreme maximum temperatures

This section uses exploratory data analysis techniques to study the binary process E(t). Again we use two stations for this purpose, the Banff and Medicine Hat sites, which have data from 1895 to 2006. The transition probabilities are computed using the historical data, treating years as independent observations. The results are summarized as follows:

• Figures 10.9 and 10.10 plot the probabilities of a hot day over the course of a year for the Banff and Medicine Hat stations, respectively. A regular seasonal pattern is seen, and Medicine Hat seems to have a much longer period of hot days.

• Figures 10.11 and 10.12 plot the estimated transition probabilities p̂01 and p̂11 for Banff and Medicine Hat. If the chain were a 0th–order Markov chain, these two curves would overlap. This is not the case, so a Markov chain of at least 1st order seems necessary. In the p̂01 curve for both Banff and Medicine Hat, large fluctuations are seen in the middle of the year, which corresponds to the warm season. This is not surprising, because there are very few pairs in the data with a hot day followed by a not–hot day in the warm season in Alberta.

• In Figure 10.12, p̂11 is missing for a period over the cold season. This is because no hot day is observed during this period in the cold season, and hence p̂11 could not be estimated.

• Figures 10.13 and 10.14 give the plots for the 2nd–order transition probabilities. They overlap heavily, and hence a 2nd–order Markov chain does not seem to be necessary.

10.3.2  Model selection for extreme maximum temperature

Here, we use the following abbreviations:

• E^k(t) = E(t − k). Was it an extremely hot day k days ago?

• MT^k(t) = MT(t − k), the actual maximum temperature k days ago.

• N^k, COS, SIN, SIN2 and COS2, as in the previous sections.

Figure 10.9: The estimated probability of a hot day (maximum temperature ≥ 27 (deg C)) for different days of the year for the Banff site, calculated from the historical data.
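The second-order check used in the exploratory analyses (comparing, for example, p̂011 with p̂111 by day of year, as in Figures 10.5-10.8 and 10.13-10.14) can be estimated along the same lines as the first-order transition probabilities. The sketch below is again hypothetical rather than the thesis code, assuming the same year/day-of-year layout as the earlier sketches; the 0/1 column of interest ("e" or "E") is passed as an argument.

```python
# Hypothetical sketch: second-order transition probabilities by day of year,
# e.g. p011(d) = P(X(t)=1 | X(t-2)=0, X(t-1)=1) and p111(d), pooled over years.
import pandas as pd

def second_order_transitions(df: pd.DataFrame, col: str = "E") -> pd.DataFrame:
    df = df.sort_values(["year", "doy"]).copy()
    g = df.groupby("year")[col]
    df["x1"], df["x2"] = g.shift(1), g.shift(2)          # X(t-1) and X(t-2)
    df = df.dropna(subset=["x1", "x2"])
    probs = df.groupby(["doy", "x2", "x1"])[col].mean().unstack(["x2", "x1"])
    probs.columns = [f"p{int(a)}{int(b)}1" for a, b in probs.columns]  # p001, p011, p101, p111
    return probs   # NaN where a history (x2, x1) is never observed on that day

# If the curves for p011 and p111 (and for p001 and p101) essentially overlap,
# a 2nd-order chain adds little beyond the 1st-order one.
```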
Figure 10.10: The estimated probability of a hot day (maximum temperature ≥ 27 (deg C)) for different days of the year for the Medicine Hat site, calculated from the historical data.

Figure 10.11: The estimated 1st–order transition probabilities for the binary process of extremely hot temperatures for the Banff site. The dotted line represents the estimated probability of “E(t) = 1 if E(t − 1) = 1” (p̂11) and the dashed, “E(t) = 1 if E(t − 1) = 0” (p̂01).

Figure 10.12: The estimated 1st–order transition probabilities for the binary process of extremely hot temperatures for the Medicine Hat site. The dotted line represents the estimated probability of “E(t) = 1 if E(t − 1) = 1” (p̂11) and the dashed, “E(t) = 1 if E(t − 1) = 0” (p̂01).