Statistical models for agroclimate risk analysis Hosseini, Mohamadreza 2009-12-01

Full Text

Statistical models for agroclimate risk analysis

by

Mohamadreza Hosseini
B.Sc., Amirkabir University, 2003
M.Sc., McGill University, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in The Faculty of Graduate Studies (Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
November, 2009
© Mohamadreza Hosseini 2009

Abstract

In order to model the binary process of precipitation and the dichotomized temperature process, we use the conditional probability of the present given the past. We find necessary and sufficient conditions for a collection of functions to correspond to the conditional probabilities of a discrete-time categorical stochastic process X1, X2, ···. Moreover, we find parametric representations for such processes and in particular for rth-order Markov chains.

To dichotomize the temperature process, quantiles are often used in the literature. We propose using a two-state definition of the quantiles, considering the "left quantile" and "right quantile" functions instead of the traditional definition. This has various advantages, such as a symmetry relation between the quantiles of the random variables X and −X. We show that the left (right) sample quantile tends to the left (right) distribution quantile at p ∈ [0,1] if and only if the left and right distribution quantiles are identical at p, and that it diverges almost surely otherwise. In order to measure the loss of estimating (or approximating) a quantile, we introduce a loss function that is invariant under strictly monotonic transformations and call it the "probability loss function." Using this loss function, we introduce measures of distance among random variables that are invariant under continuous strictly monotonic transformations. We use these distance measures to show that optimal overall fits to a random variable are not necessarily optimal in the tails. This loss function is also used to find equivariant estimators of the parameters of distribution functions.

We develop an algorithm to approximate quantiles of large datasets, which works by partitioning the data or using existing partitions (possibly of unequal size). We derive the deterministic precision of this algorithm and show how it can be adjusted to obtain customized precisions. We then develop a framework to optimally summarize very large datasets using quantiles and to combine such summaries in order to make inferences about the original dataset.

Finally, we show how these higher-order Markov models can be used to construct confidence intervals for the probability of frost-free periods.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
1 Thesis introduction
2 Exploratory analysis of the Canadian weather data
  2.1 Introduction
  2.2 Data description
  2.3 Temperature and precipitation
  2.4 Daily values, distributions
  2.5 Correlation
    2.5.1 Temporal correlation
    2.5.2 Spatial correlation
  2.6 Summary and conclusions
3 rth-order Markov chains
  3.1 Introduction
  3.2 Markov chains
  3.3 Consistency of the conditional probabilities
  3.4 Characterizing density functions and rth-order Markov chains
  3.5 Functions of r variables on a finite domain
    3.5.1 First representation theorem
    3.5.2 Second representation theorem
    3.5.3 Special cases of functions of r finite variables
  3.6 Generalized linear models for time series
  3.7 Simulation studies
  3.8 Concluding remarks
4 Binary precipitation process
  4.1 Introduction
  4.2 Models for 0-1 precipitation process
  4.3 Exploratory analysis of the data
  4.4 Comparing the models using BIC
  4.5 Changing the location and the time period
5 On the definition of "quantile" and its properties
  5.1 Introduction
  5.2 Definition of median and quantiles of data vectors and random samples
  5.3 Defining quantiles of a distribution
  5.4 Left and right extreme points
  5.5 The quantile functions as inverse
  5.6 Equivariance property of quantile functions
  5.7 Continuity of the left and right quantile functions
  5.8 Equality of left and right quantiles
  5.9 Distribution function in terms of the quantile functions
  5.10 Two-sided continuity of lq/rq
  5.11 Characterization of left/right quantile functions
  5.12 Quantile symmetries
  5.13 Quantiles from the right
  5.14 Limit theory
  5.15 Summary and discussion
6 Probability loss function
  6.1 Introduction
  6.2 Degree of separation between data vectors
  6.3 "Degree of separation" for distributions: the "probability loss function"
  6.4 Limit theory for the probability loss function
  6.5 The probability loss function for the continuous case
  6.6 The supremum of δX
    6.6.1 "c-probability loss" functions
7 Approximating quantiles in large datasets
  7.1 Introduction
  7.2 Previous work
  7.3 The median of the medians
  7.4 Data coarsening and quantile approximation algorithm
  7.5 The algorithm and computations
8 Quantile data summaries
  8.1 Introduction
  8.2 Generalization to weighted vectors
    8.2.1 Partition operator
    8.2.2 Quantile data summaries
  8.3 Optimal probability indices for vector data summaries
  8.4 Other loss functions
    8.4.1 Optimal index vectors for assigning quantiles to a random sample
9 Quantile distribution distance and estimation
  9.1 Introduction
  9.2 Quantile-specified parameter families
    9.2.1 Equivariance of quantile-specified families estimation
    9.2.2 Continuous distributions with the order statistics family of estimators
  9.3 Probability divergence (distance) measures
  9.4 Quantile distance measures
    9.4.1 Quantile distance invariance under continuous strictly monotonic transformations
    9.4.2 Quantile distance closeness of empirical distribution and the true distribution
    9.4.3 Quantile distance and KS distance closeness
    9.4.4 Quantile distance for continuous variables
    9.4.5 Equivariance of estimation under monotonic transformations using the quantile distance
    9.4.6 Estimation using quantile distance
10 Binary temperature processes
  10.1 Introduction
  10.2 rth-order Markov models for extreme minimum temperatures
    10.2.1 Exploratory analysis for binary extreme minimum temperatures
    10.2.2 Model selection for extreme minimum temperature
  10.3 rth-order Markov models for extreme maximum temperatures
    10.3.1 Exploratory analysis for extreme maximum temperatures
    10.3.2 Model selection for extreme maximum temperature
  10.4 Probability of a frost-free period for Medicine Hat
  10.5 Possible applications of the models
11 Conclusions and future research
  11.1 Introduction
  11.2 Summary
  11.3 Future research
    11.3.1 rth-order Markov chains
    11.3.2 Approximating quantiles and data summaries
    11.3.3 Parameter estimation using probability loss and quantile distances
Bibliography
Appendices
A Climate review
  A.1 Organizations and resources
  A.2 Definitions and climate variables
  A.3 Climatology
    A.3.1 General circulations
    A.3.2 Topography of Canada
  A.4 Some interesting facts about Canadian geography and weather
B Extracting Canadian Climate Data from the Environment Canada dataset
  B.1 Introduction
  B.2 Using Python to extract data
  B.3 New functions to write stations' data
  B.4 Concluding remarks
C Algorithms and Complexity
D Notations and Definitions
List of Tables

2.1 The summary statistics for the mean annual maximum temperature, minimum temperature and precipitation at the Calgary site.
2.2 Confidence intervals for the mean annual maximum temperature, minimum temperature and precipitation at the Calgary site.
2.3 Lines fitted to annual mean minimum temperature and annual mean precipitation against annual mean maximum temperature.
2.4 The regression line parameters for the fitted lines for each variable with respect to time for the Calgary site.
2.5 The regression line parameters for the fitted lines for each variable with respect to time for the Banff site.
2.6 The regression line parameters for the fitted lines for each variable with respect to time for the Medicine Hat site.
3.1 The estimated parameters for the model Zt−1 = (1, Yt−1, cos(ωt)) with parameters β = (−1, 1, −0.5). The standard deviation for the parameters is computed once using GN (theo. sd) and once using the generated samples (sim. sd).
3.2 BIC values for several models competing for the role of the true model, where Zt−1 = (1, Y1, COS), β = (−1, 1, −0.5).
3.3 BIC values for several models competing for the role of the true model given by Zt−1 = (1, Y1, Y2, COS), β = (−1, 1, 1, −0.5).
4.1 BIC values for models including Nl, the number of precipitation days during the past l days, for the Calgary site.
4.2 BIC values for models including Nl, the number of wet days during the past l days, and Y1, the precipitation occurrence of the previous day, for the Calgary site.
4.3 BIC values for models including Nl, the number of wet days during the past l days, and seasonal terms for the Calgary site.
4.4 BIC values for models including Nl, the number of PN days during the past l days, Y1, the precipitation occurrence of the previous day, and seasonal terms for the Calgary site.
4.5 BIC values for Markov models of different order with a small number of parameters for the Calgary site.
4.6 BIC values for Markov models of different order plus seasonal terms for the Calgary site.
4.7 BIC values for models including seasonal terms and the occurrence of precipitation during the previous day for the Calgary site.
4.8 BIC values for 2nd-order Markov models for precipitation at the Calgary site.
4.9 BIC values for 2nd-order Markov models for precipitation at the Calgary site plus seasonal terms.
4.10 BIC values for models including several covariates such as temperature, seasonal terms and a year effect for precipitation at the Calgary site.
4.11 BIC values for several models for the binary process of precipitation in Calgary, 1990–1994.
4.12 BIC values for several models for precipitation occurrence in Medicine Hat, 2000–2004.
5.1 Earthquake intensities.
5.2 Rain acidity data.
6.1 A class's marks in mathematics and physics. The third column contains the raw physics marks before the physics teacher scaled them.
7.1 The table of data.
7.2 Comparing the exact method with the proposed algorithm in R, run on a laptop with 512 MB memory and a 1500 MHz processor, m = 1000, d = 500. "DOS" stands for the degree of separation in the original vector. "DOS bound" is the theoretical degree of separation obtained by Theorem 7.4.1.
7.3 Comparing the exact method with the proposed algorithm in R (run on a laptop with 512 MB memory and a 1500 MHz processor) to compute the quantiles of MT (daily maximum temperature) over 25 stations with data from 1940 to 2004.
9.1 Comparing the standard normal with various distributions using quantile distance, where U denotes the uniform distribution and χ2 the Chi-squared distribution.
9.2 Comparing the standard normal on the tails with some distributions using quantile distance, where U denotes the uniform distribution and χ2 the Chi-squared distribution.
9.3 Assessment of maximum likelihood estimation and quantile distance estimation using several measures of error for a sample of size 20. In the table s.e. stands for the standard error.
9.4 Assessment of maximum likelihood estimation and quantile distance estimation using several measures of error for a sample of size 100. In the table s.e. stands for the standard error.
10.1 BIC values for models including Nk for the extreme minimum temperature process e(t) at the Medicine Hat site.
10.2 BIC values for several models for the extreme minimum temperature e(t) at the Medicine Hat site.
10.3 BIC values for models including Nk for the extremely hot process E(t).
10.4 BIC values for several models for the extremely hot process E(t).
10.5 BIC values for models including Nk for the extremely cold process e(t) at the Medicine Hat site.
10.6 BIC values for several models including Nk and seasonal terms for the extremely cold process e(t) at the Medicine Hat site.
10.7 BIC values for several models for the extremely cold process e(t) at the Medicine Hat site.
10.8 Theoretical and simulation estimated standard deviations for the extremely cold process e(t) at the Medicine Hat site.
List of Figures

2.1 Alberta site locations for temperature (deg C) data. There are 25 stations available with temperature data over Alberta.
2.2 Alberta site locations for precipitation (mm) data. There are 47 stations available with precipitation data over Alberta.
2.3 The number of years available for sites with temperature (deg C) data.
2.4 The number of years available for sites with precipitation (mm) data.
2.5 The elevation (meters) of sites with temperature data available.
2.6 The elevation (meters) of the sites with precipitation data available.
2.7 The time series of daily maximum temperature (deg C) at the Calgary site from 2000 to 2003.
2.8 The time series of daily minimum temperature (deg C) at the Calgary site from 2000 to 2003.
2.9 The time series of daily precipitation (mm) at the Calgary site from 2000 to 2003.
2.10 The time series of monthly maximum temperature (deg C) at the Calgary site, 1995–2005.
2.11 The time series of monthly minimum temperature means (deg C) at the Calgary site, 1995–2005.
2.12 The time series of monthly precipitation means (mm) at the Calgary site, 1995–2005.
2.13 The annual mean maximum temperature (deg C) for the Calgary site for all available years.
2.14 The annual mean minimum temperature (deg C) for the Calgary site for all available years.
2.15 The annual mean precipitation (mm) for the Calgary site for all available years.
2.16 The histogram of annual maximum temperature means (deg C) for Calgary with a normal curve fitted to the data.
2.17 The normal qq-plot for annual maximum temperature means (deg C) for Calgary.
2.18 The histogram of annual minimum temperature means (deg C) for Calgary with a normal curve fitted to the data.
2.19 The normal qq-plot for annual minimum temperature means (deg C) for Calgary.
2.20 The histogram of annual precipitation means (mm) for Calgary with a normal curve fitted to the data.
2.21 The normal qq-plot for annual precipitation means for Calgary.
2.22 The time series plots of maximum temperature (deg C), minimum temperature (deg C) and precipitation (mm) annual means for Calgary. The time series plot at the bottom is minimum temperature, the one in the middle is precipitation and the top curve is maximum temperature.
2.23 The regression line fitted to maximum temperature and minimum temperature annual means for Calgary.
2.24 The regression line fitted to maximum temperature and precipitation annual means for Calgary.
2.25 The regression line fitted to summer minimum temperature means against time for Calgary.
2.26 The time series of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.27 The histogram of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.28 The normal qq-plots of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.29 The time series of daily minimum temperature for Calgary for four given dates: January 1st, April 1st, July 1st and October 1st.
2.30 The histogram of daily minimum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.31 The normal qq-plots of daily minimum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.32 The time series of daily precipitation at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.33 The histogram of daily precipitation at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
2.34 The confidence intervals for the daily mean maximum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
2.35 The confidence intervals for the daily mean minimum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
2.36 The confidence intervals for the probability of precipitation (mm) at the Calgary site for the days of the year. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
2.37 The confidence intervals for the standard deviation of each day of the year for maximum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
2.38 The confidence intervals for the standard deviation of each day of the year for minimum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
2.39 The confidence intervals for the standard deviation (sd) of each day of the year for the probability of precipitation (0-1 precipitation process) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals. The plot shows sd ≤ 1/2; this is because sd = √(p(1−p)), which has a maximum value of 1/2.
2.40 The distribution of each day of the year for MT (deg C) from Jan 1st to Dec 1st. The year has been divided into two halves. In each half rainbow colors are used to show the change of the distribution.
2.41 The distribution of each day of the year for mt (deg C) from Jan 1st to Dec 1st. The year has been divided into two halves. In each half rainbow colors are used to show the change of the distribution.
2.42 The histogram of daily precipitation greater than 0.2 mm at the Calgary site with a Gamma density curve fitted using maximum likelihood.
2.43 The qq-plots of daily precipitation greater than 0.2 mm at the Calgary site with a Gamma curve fitted using maximum likelihood.
2.44 The Gamma fit for each day of 4 months for precipitation (mm). In each month rainbow colors are used to show the change of the distribution.
2.45 The maximum likelihood estimate for α, the shape parameter of the Gamma distribution fitted to the precipitation amounts.
2.46 The confidence interval for the MOM estimate of the shape parameter, α, of the Gamma distribution fitted to daily precipitation amounts. The dotted line is the upper bound and the solid line the lower bound. As seen in the figure, the upper bounds at the beginning and end of the year become very large. We have not shown them because otherwise the pattern in the rest of the year could not be seen.
2.47 The 1st-order transition probabilities. The dotted line is the probability of precipitation if it happened the day before (p̂11) and the dashed is the probability of precipitation if it did not happen the day before (p̂01).
2.48 The 2nd-order transition probabilities for precipitation at the Calgary site: p̂111 (solid) against p̂011 (dotted).
2.49 The 2nd-order transition probabilities for precipitation at the Calgary site: p̂001 (solid) against p̂101 (dotted).
2.50 The correlation and covariance plot for maximum temperature at the Calgary site for Jan 1st and the 732 subsequent days.
2.51 The correlation plot for maximum temperature (deg C) at the Calgary site for Jan 1st and the 732 subsequent days.
2.52 The correlation plot for minimum temperature (deg C) at the Calgary site for Jan 1st and the 732 subsequent days.
2.53 The correlation plot for precipitation (mm) at the Calgary site for Jan 1st and the 732 subsequent days.
2.54 The correlation plot for maximum temperature (deg C) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot dash) and the 30 subsequent days.
2.55 The correlation plot for minimum temperature (deg C) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot dash) and the 30 subsequent days.
2.56 The correlation plot for precipitation (mm) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot dashed) and the 30 subsequent days.
2.57 The correlation plot for maximum temperature and minimum temperature (deg C) between Calgary and Medicine Hat.
2.58 The correlation plot for precipitation (mm) between Calgary and Medicine Hat.
2.59 The correlation plot for maximum temperature (deg C) with respect to distance (km).
2.60 The correlation plot for minimum temperature (deg C) with respect to distance (km).
2.61 The correlation plot for precipitation (mm) with respect to distance (km).
2.62 The correlation plot for the precipitation (mm) 0-1 process with respect to distance (km).
3.1 The distribution of parameter estimates for the model with the covariate process Zt−1 = (1, Yt−1, cos(ωt)) and parameters (β1 = −1, β2 = 1, β3 = −0.5).
4.1 The transition probabilities for the Banff site. The dotted line represents p̂11 (the estimated probability of precipitation if precipitation occurs the day before) and the dashed represents p̂01 (the estimated probability of precipitation if precipitation does not occur the day before).
4.2 The solid curve represents p̂111 (the estimated probability of precipitation if precipitation occurs during both of the two previous days) and the dashed curve represents p̂011 (the estimated probability that precipitation occurs if precipitation occurs the day before and does not occur two days ago) for the Banff site.
4.3 The solid curve represents p̂001 (the estimated probability of precipitation occurring if it does not occur during the two previous days) and the dotted curve is p̂101 (the estimated probability that precipitation occurs if precipitation does not occur the day before but occurs two days ago) for the Banff site.
4.4 Banff's estimated mean annual probability of precipitation calculated from historical data.
4.5 Calgary's estimated mean annual probability of precipitation calculated from historical data.
4.6 The logit function: logit(x) = log(x/(1−x)).
4.7 The logit of the estimated probability of precipitation in Banff for different days of the year.
5.1 An example of a distribution function with discontinuities and flat intervals.
5.2 The left quantile (lq) function for the distribution function given in Example 5.7. Notice that this function is left continuous and increasing.
5.3 The right quantile (rq) function for the distribution function given in Example 5.7. Notice that this function is right continuous and increasing.
5.4 The LQ function for Example 5.7. Notice that this function is increasing and left continuous.
5.5 The RQ function for Example 5.7. Notice that this function is increasing and right continuous.
5.6 For the vector x = (−2,−2,2,2,4,4,4,4) the left (top) and right (bottom) quantile functions are given.
5.7 The solid line is the distribution function of {Xi}. Note that for the distribution of the Xi and p = 0.5, lqFX(p) = 0, rqFX(p) = 3. Let h = rq(p) − lq(p) = 3. The dotted line is the distribution function of the {Yi}, which coincides with that of {Xi} to the left of lqFX(p) and is a backward shift of 3 units for values greater than rqFX(p). Note that for the {Yi}, lqFY(p) = rqFY(p) = 1.
7.1 Comparing the approximated quantiles to the exact quantiles, N = 10^7. The circles are the exact quantiles and the + are the corresponding approximated quantiles.
7.2 Comparing the approximated quantiles to the exact quantiles for MT (daily maximum temperature) over 25 stations in Alberta, 1940–2004. The circles are the exact quantiles and the + the approximated quantiles.
9.1 The order statistics family members that estimate lqX(1/2) and lqX(P(Z ≤ 1)) for a random sample of length 25, obtained by generating samples of size 1 to 1000 from a standard normal distribution.
9.2 The order statistics family members that estimate lqX(1/2) and lqX(P(Z ≤ 1)) for a random sample of length 20, obtained by generating samples of size 1 to 1000 from a standard normal distribution.
9.3 The Cauchy distribution's distance, for different scale parameters (and location parameter = 0), to the standard normal. In the plots QD1 = QDX and QD2 = QDY and QD = QD1 + QD2, where X is the standard normal and Y is the Cauchy.
9.4 The distribution function of the standard normal (solid) compared with the optimal Cauchy (location parameter = 0) picked by quantile distance minimization with scale parameter = 0.66 (dashed curve), the Cauchy with scale parameter = 1 (dotted) and the Cauchy with scale parameter = 0.5 (dot dashed).
9.5 The Cauchy distribution's distance, for different scale parameters (and location parameter = 0), to the standard normal on the tails. In the plots QD1 = QDX and QD2 = QDY and QD = QD1 + QD2, where X is the standard normal and Y is the Cauchy.
9.6 The distribution function of the standard normal (solid) compared with the optimal Cauchy picked by tail quantile distance minimization with scale parameter = 0.12 (dashed curve), the Cauchy with scale parameter = 0.65 (dotted) and the Cauchy with scale parameter = 0.01 (dot dashed).
9.7 Comparing the standard normal distribution (solid) with the optimal Cauchy picked by quantile distance (dashed) and the optimal Cauchy picked by tail quantile distance minimization (dotted).
9.8 Histograms for the parameter estimates using quantile distance and maximum likelihood methods for a sample of size 20.
9.9 Histograms for the parameter estimates using quantile distance and maximum likelihood methods for a sample of size 100.
10.1 The estimated probability of a freezing day for the Banff site for different days of the year, computed using the historical data.
10.2 The estimated probability of a freezing day for the Medicine Hat site for different days of the year, computed using the historical data.
10.3 The estimated 1st-order transition probabilities for the 0-1 process of extreme minimum temperatures for the Banff site. The dotted line represents the estimated probability of "e(t) = 1 if e(t−1) = 1" (p̂11) and the dashed, "e(t) = 1 if e(t−1) = 0" (p̂01).
10.4 The estimated 1st-order transition probabilities for the 0-1 process of extreme minimum temperatures for the Medicine Hat site. The dotted line represents the estimated probability of "e(t) = 1 if e(t−1) = 1" (p̂11) and the dashed, "e(t) = 1 if e(t−1) = 0" (p̂01).
10.5 The estimated 2nd-order transition probabilities for the 0-1 process of extreme minimum temperatures for the Banff site with p̂111 (solid) compared with p̂011 (dotted), both calculated from the historical data.
10.6 The estimated 2nd-order transition probabilities for the 0-1 process of extreme minimum temperatures for the Banff site with p̂001 (solid) compared with p̂101 (dotted), calculated from the historical data.
10.7 The estimated 2nd-order transition probabilities for the 0-1 process of extreme minimum temperatures for the Medicine Hat site with p̂111 (solid) compared with p̂011 (dotted), calculated from the historical data.
10.8 The estimated 2nd-order transition probabilities for the 0-1 process of extreme minimum temperatures for the Medicine Hat site with p̂001 (solid) compared with p̂101 (dotted), calculated from the historical data.
10.9 The estimated probability of a hot day (maximum temperature ≥ 27 deg C) for different days of the year for the Banff site, calculated from the historical data.
10.10 The estimated probability of a hot day (maximum temperature ≥ 27 deg C) for different days of the year for the Medicine Hat site, calculated from the historical data.
10.11 The estimated 1st-order transition probabilities for the binary process of extremely hot temperatures for the Banff site. The dotted line represents the estimated probability of "E(t) = 1 if E(t−1) = 1" (p̂11) and the dashed, "E(t) = 1 if E(t−1) = 0" (p̂01).
10.12 The estimated 1st-order transition probabilities for the binary process of extremely hot temperatures for the Medicine Hat site. The dotted line represents the estimated probability of "E(t) = 1 if E(t−1) = 1" (p̂11) and the dashed, "E(t) = 1 if E(t−1) = 0" (p̂01).
10.13 The estimated 2nd-order transition probabilities for the binary process of extremely hot temperatures for the Banff site with p̂111 (solid) compared with p̂011 (dotted), calculated from the historical data.
10.14 The estimated 2nd-order transition probabilities for the binary process of extremely hot temperatures for the Banff site with p̂001 (solid) compared with p̂101 (dotted), calculated from the historical data.
10.15 The estimated 2nd-order transition probabilities for the binary process of extremely hot temperatures for the Medicine Hat site with p̂111 (solid) compared with p̂011 (dotted), calculated from the historical data.
10.16 The estimated 2nd-order transition probabilities for the binary process of extremely hot temperatures for the Medicine Hat site with p̂001 (solid) compared with p̂101 (dotted), calculated from the historical data.
10.17 Medicine Hat's estimated mean annual probability of frost, calculated from the historical data.
10.18 Normal curve fitted to the distribution of 50 samples of the estimated parameters.
B.1 Canada site locations.

Acknowledgements

I would like to thank my supervisors, Prof. Jim Zidek and Prof. Nhu Le, for what they have taught me in statistics and a lot more, and for their encouragement, ideas and financial support through various RA positions during my PhD studies. I feel very grateful and lucky to have them as my supervisors.
I should also thank Prof. Matias Salibian-Barrera on my supervisory committee for giving me great ideas and feedback. I would also like to thank other people in the Statistics Department at UBC, Prof. Paul Gustafson, Prof. John Petkau, Prof. Constance van Eeden and Prof. Ruben Zamar, to whom I owe a lot of what I know. I also thank Mike Marin (instructor at UBC) for various interesting discussions about statistics and science, and Viena Tran for helping me with many administrative issues. I would like to thank my friend Dr. Nathaniel Newlands for insightful comments and good suggestions, and Ralph Wright (Alberta Agriculture, Food and Rural Development) for useful comments about the definition of the extremes.

Finally, I would like to express my deepest appreciation to all the people who helped me learn and love statistics and mathematics throughout my life, from my grandfather, who graciously taught me mathematics when I was a child, to my mother, who has been an inspiration, and the high school teachers who encouraged me to study mathematics as a university major.

Dedication

To my lovely parents,
my amazing brother: Alireza,
my sweet sister: Fatima,
my best friends: Mostafa Aghajanpour, Mahmoud Sohrabi, Masoud Feizbakhsh, Behruz Khajali, Ali Mehrabian, Prof. Masoud Asgharian, Prof. Niky Kamran, Mirella Simoneova, Kiyouko Futaeda, May Yun, Yuki Ezaki, Soheil Keshmiri, Naoko Yoshimi and Mike Marin.

Chapter 1
Thesis introduction

This thesis develops a mathematical and statistical framework for modeling stochastic processes over time. In particular, it develops models for the occurrence of precipitation and of extreme (high or low) temperature events. This is important for Canada's agriculture, since agricultural production depends on weather and water availability.

We study the quantiles of data and distributions in detail and develop a framework for approximating quantiles in large datasets and for inference. We also study categorical Markov chains of higher order and apply them to precipitation and temperature processes. However, the methodologies and theories developed here are general and can be used in many other applications where such processes are encountered (such as physics, chemistry, climatology, economics and so on).

Sample quantiles and the quantile function are fundamental concepts in statistics. In the study of extreme events they are often used to pick appropriate thresholds. We use the quantiles specifically to pick thresholds for the temperature process. This motivates us to study the concept of quantiles and to extend their classic definition to provide a more intuitively appealing alternative. This alternative also enables us to obtain interesting asymptotic results about their sample counterparts and a framework to approximate quantiles and make inferences. In fact, weather datasets (observed weather or the output of climate models) are very large, which makes computing the quantiles of such datasets computationally intensive. Along with this alternative definition, we present an algorithm for computing/approximating quantiles in large datasets.

The data used in this thesis come from the climate data CD published by Environment Canada [10], which includes the daily observed precipitation and temperature data for several stations from 1895 to 2007 (the years varying with the station). The data are saved in several binary files. We have written a Python module to extract the data in desired formats. The guide to using this module is in Appendix B. For most of the analysis, however, we have used the "homogenized" dataset for Alberta.
This dataset has been adjusted for changes of instruments and of the locations of the stations. More information about the datasets is given in Appendix B and Chapter 2.

Chapter 2 presents results from the exploratory analysis of the dataset. We look at the variables' daily time series, monthly mean time series, annual mean time series and the distribution of the daily/annual mean values. We also look at the relations between the variables as well as some long-term trends, using simple techniques such as linear regression. For example, it seems that the mean summer daily minimum temperature has increased over time at some locations in Alberta. Then we study the seasonal patterns of these variables over the course of the year. As expected, there is a strong seasonal component in these processes. For example, we observe that the daily temperature is more variable in the colder seasons than in the warmer ones. The daily values for the minimum and maximum temperature seem to be described fairly well by a Gaussian process. However, some deviation from the Gaussian assumption is seen in the tails. This is particularly important in modeling extreme events and will help us in later chapters to choose our approach to modeling the occurrence of such extremes. As a part of the exploratory analysis, we look at precipitation occurrence. A question that has been addressed by several authors (e.g. Tong in [45] and Gabriel et al. in [18]) is the Markov order of such a chain. The exploratory analysis using the transition probability plots leads to the conjecture that a 1st-order Markov chain should be appropriate. This is studied in detail in later chapters. We also look at the spatial-temporal correlation function of these processes. Several interesting features are observed. For example, for the maximum and minimum temperature the correlation seems to be stationary over time. Also, the geodesic distance seems to describe the spatial correlation for temperature well. For precipitation, on the other hand, not much spatial correlation is observed. This could be because only 47 precipitation stations are available over Alberta and precipitation is more variable over space than temperature.

Let us denote a general weather process by Xt, where t denotes time. The main approach we take to model the process is discrete-time categorical rth-order Markov chains (r a natural number), where we make the following assumption about the conditional probabilities:

P(Xt | Xt−1, ···) = P(Xt | Xt−1, ··· , Xt−r).

"Categorical chain" here means that Xt takes only a finite number of possible states. For example, it can be a two-state space of the (occurrence)/(non-occurrence) of precipitation. Dichotomizing the temperature process, we can consider processes such as (freezing)/(not freezing). Processes with more than two states can also be considered, for example a process with three states: (not warm)/(warm)/(hot).
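To make this modeling assumption concrete, the sketch below simulates a binary rth-order Markov chain whose conditional probability of a wet day depends, through a logistic link, on the previous r days and a seasonal cosine covariate. The example call uses the covariate vector (1, Yt−1, cos(ωt)) with β = (−1, 1, −0.5), mirroring the simulation setup of Chapter 3; the function name and the simulation details are illustrative assumptions, not the thesis code.

```python
import numpy as np

def simulate_binary_markov(n_days, beta, r=1, period=365.25, seed=0):
    """Simulate a binary rth-order Markov chain with a logistic link.

    The conditional probability of X_t = 1 depends only on the r previous
    values and a seasonal cosine covariate:
        logit P(X_t = 1 | past) = beta[0] + beta[1]*X_{t-1} + ...
                                  + beta[r]*X_{t-r} + beta[r+1]*cos(2*pi*t/period)
    """
    rng = np.random.default_rng(seed)
    omega = 2 * np.pi / period
    x = np.zeros(n_days, dtype=int)
    for t in range(r, n_days):
        past = x[t - r:t][::-1]                      # (X_{t-1}, ..., X_{t-r})
        eta = beta[0] + beta[1:r + 1] @ past + beta[r + 1] * np.cos(omega * t)
        p = 1.0 / (1.0 + np.exp(-eta))               # inverse logit
        x[t] = rng.binomial(1, p)
    return x

# First-order chain with one seasonal term, beta = (-1, 1, -0.5) as in Table 3.1.
wet = simulate_binary_markov(10_000, beta=np.array([-1.0, 1.0, -0.5]), r=1)
print("simulated wet-day frequency:", wet.mean())
```

The same logistic-link form is what allows covariates such as seasonal terms or the previous day's temperature to be added to the chain, as discussed below.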
Chapter 3 studies rth-order categorical Markov chains in general. We present a new representation theorem for such chains that expresses the above conditional probability as a linear combination of the monomials of the past process values Xt−1, ··· , Xt−r. We show the existence and uniqueness of such a representation. In the stationary case, since the conditional probability is the same for all time points, some more work on the consistency shows that this representation characterizes all stationary categorical rth-order Markov chains. For the binary case the result is a corollary of a theorem stated in [6]. However, the expression of the theorem in [6] is flawed, as also pointed out by Cressie et al. [14]. We present a rigorous statement along with a constructive proof of the theorem. For discrete-time categorical chains with more than two states this theorem does not seem especially useful. We prove a new theorem for this case that gives us a representation for all discrete-time categorical chains (rather than only binary ones). In order to estimate the parameters of such a model in the binary case and make inferences about them, we use "time series following generalized linear models" as described in [27]. The inferences are similar to those for generalized linear models. However, because of the dependencies over time, some extensions of the usual theory are needed. Maximizing the "partial likelihood" gives us "consistent" estimators, as shown in [48]. We apply the partial likelihood theory to our proposed rth-order Markov models. Simulations show that the partial likelihood and the representation together give us satisfactory results for the binary case. We also check, by simulation studies, the performance of the Bayesian information criterion (BIC), developed in [42] and elsewhere, for picking optimal models, and we get satisfactory results. This allows us not only to pick the order of the Markov chain but also to compare several Markov chains of the same order. Another advantage of this model over existing ones is the capacity to accommodate other continuous variables. For example, we can add some seasonal processes to get a non-stationary chain. [In previous studies regarding the order of the chain, e.g. [45] and [18], it was assumed that the precipitation chain is stationary.] We can also add covariate processes such as the temperature of the previous day to the model. Then we apply these techniques to the binary precipitation process in Alberta and pick appropriate models. A 1st-order non-stationary chain (with one seasonal term) seems to be the most appropriate based on the BIC method for model selection.

To apply these techniques to the temperature processes, we need a way of dichotomizing the temperature process. Usually certain quantiles are chosen in order to do so. Computing the quantiles of large datasets can be computationally challenging. Very large datasets are often encountered in climatology, either from a multiplicity of observations over time and space or as outputs from deterministic models (sometimes in petabytes = 1 million gigabytes). Loading a large data vector and sorting it is sometimes impossible due to memory limitations or computing power. We show that a previously proposed algorithm for approximating the median, the "median of the medians", performs poorly. Instead, we propose a new algorithm that can give us good approximations to the exact quantiles, which is an extension of the algorithm proposed in [3]. In fact, we derive the precision of the algorithm. The algorithm partitions the data, "coarsens" the partitions at every iteration, puts the coarsened vectors together and sorts the result instead of the original vector.
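The introduction only sketches the partition/coarsen/merge idea (the precise algorithm and its precision guarantee are developed in Chapter 7), so the snippet below is a minimal illustration of that idea rather than the thesis algorithm: each partition is "coarsened" to a fixed number of evenly spaced order statistics, the coarsened vectors are pooled, and the quantile is read from the pooled summary. The function names and the coarsening size k are assumptions.

```python
import numpy as np

def coarsen(block, k):
    """Keep k evenly spaced order statistics of a sorted block."""
    block = np.sort(block)
    idx = np.linspace(0, len(block) - 1, k).round().astype(int)
    return block[idx]

def approx_quantile(blocks, p, k=100):
    """Approximate the p-quantile of the concatenation of all blocks,
    holding only k values per block at a time."""
    pooled = np.concatenate([coarsen(b, k) for b in blocks])
    pooled.sort()
    return pooled[int(p * (len(pooled) - 1))]

# Example: 50 partitions of 10^5 values each (5 million values in total),
# never sorted as a single vector.
rng = np.random.default_rng(1)
blocks = [rng.normal(size=100_000) for _ in range(50)]
print("approximate 95% quantile:", approx_quantile(blocks, 0.95))
print("exact 95% quantile:      ", np.quantile(np.concatenate(blocks), 0.95))
```

With equal-sized partitions the pooled order statistics can simply be concatenated as above; partitions of unequal size would require weighting the coarsened values, which is where the weighted-vector framework of Chapter 8 comes in.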
Working on the quantiles, the search for theory justifying the usefulness and accuracy of the algorithm motivated us to think about the definition of the quantile function and of the quantiles of data vectors. The quantile function of a random variable X with distribution function F is traditionally defined as

q(p) = inf{u | F(u) ≥ p}.

Applying this to the fair coin example with 0, 1 as outputs, we get q(1/2) = 0. This is counterintuitive given that the distribution has equal mass on 0 and 1. Also, a standard definition of the quantiles does not exist for a data vector. [For example, Hyndman et al. [25] point out that there are many definitions of quantiles in different packages.] For example, suppose a data vector has an even number of points; then there is no point exactly in the middle, in which case the average of the two middle values is often proposed as the median. We argue that this is not a good definition. In fact, we present an alternative way of defining quantiles that is motivated by an intuitive experiment and resolves all the above problems. We propose using the two-state definition of right and left quantiles instead of a single quantile. The left quantile is defined as above and the right quantile is defined to be

rq(p) = sup{u | F(u) ≤ p}.

We also define left and right quantiles for data vectors and study the limit properties of the sample quantiles. For example, it turns out that the sample left and right quantiles converge to the distribution quantiles if and only if the left and right quantiles are equal. This again shows another interesting aspect of the definition of quantiles and confirms that it is not redundant. This definition is an extension of the concept of the upper and lower median in the robustness literature. Also, in some books (e.g. [41]) rq(p) is taken to be the definition of quantiles. However, we do not know of any study of their properties or a claim that considering both can lead to many interesting results. We also show that the widely claimed equivariance property of the traditional (left) quantile function under strictly increasing transformations (for example in [21] and [29]) is false. However, we show that the left (right) quantile is equivariant under left (right) continuous increasing transformations. We also provide a neat result for continuous decreasing transformations. We also show that the probability that the random variable lies between the right and left quantiles is zero, and that the left and right quantiles are identical except on at most a countable subset of [0,1].

Since our objective is to approximate the exact quantiles by our algorithm, we need a way of assessing the accuracy of such an approximation (a loss function). We introduce a new loss function that is invariant under strictly monotonic transformations of the data or the random variable. This loss function is very natural: in summary, the loss of estimating a quantile z by z′ is the probability that the random variable falls between these two values. In other words, we use the mass of the random variable itself between the two values to judge the goodness of the approximation. We also prove limit theorems showing that the empirical loss function tends to the loss function of the distribution. This loss function might be a useful tool in many other contexts and is an interesting topic for future research.
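To make the two definitions and the probability loss concrete, the sketch below computes left and right quantiles of a sample in the empirical-distribution sense suggested above, and evaluates an empirical probability loss between two candidate quantile values as the fraction of the sample strictly between them; the fair coin then gives lq(1/2) = 0 and rq(1/2) = 1. The boundary conventions and function names are assumptions made for illustration, not the thesis's exact formulation.

```python
import numpy as np

def lq(x, p):
    """Empirical left quantile: inf{u : F_n(u) >= p}."""
    x = np.sort(np.asarray(x))
    n = len(x)
    return x[max(int(np.ceil(n * p)), 1) - 1]

def rq(x, p):
    """Empirical right quantile: sup{u : F_n(u) <= p} (capped at the maximum)."""
    x = np.sort(np.asarray(x))
    n = len(x)
    return x[min(int(np.floor(n * p)) + 1, n) - 1]

def probability_loss(x, z1, z2):
    """Fraction of the sample strictly between z1 and z2 (an empirical
    version of the probability loss)."""
    lo, hi = min(z1, z2), max(z1, z2)
    x = np.asarray(x)
    return np.mean((x > lo) & (x < hi))

coin = [0, 1]                              # the fair coin example
print(lq(coin, 0.5), rq(coin, 0.5))        # 0 1: the two "medians"

sample = np.random.default_rng(2).normal(size=1000)
q_hat = lq(sample, 0.95)
# Mass between the sample estimate and 1.6449, the 0.95 quantile of N(0,1).
print(probability_loss(sample, q_hat, 1.6449))
```

Because the loss is measured by the probability mass between the two values, applying any strictly increasing transformation to both the data and the candidate quantiles leaves it unchanged, which is the invariance property emphasized above.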
We show by simulations and on real data that the algorithm performs well. Then we apply it to the weather data to pick the 95% quantile of the daily maximum temperature. After picking the quantiles, we use the rth-order Markov techniques and partial likelihood to find appropriate models to describe the temperature. Using this loss function and the theory developed for the quantiles, we introduce measures to compute a "distance" among distribution functions over the reals (random variables) that are invariant under continuous strictly monotonic transformations. We use these distance measures to show that optimal overall fits to a random variable are not necessarily optimal in the tails (and hence not appropriate for studying extremes). We also find "optimal" ways of picking a limited number of probabilities 0 ≤ p1 < ··· < pk ≤ 1 to summarize a random variable by its corresponding quantiles.

Finally, we show how these higher-order Markov models can be used to construct confidence intervals for the probability of a frost-free week at the beginning of August at Medicine Hat (in Alberta).

The last chapter provides a summary of the work and the conclusions. It also points out some interesting questions that are not answered in this thesis and a research proposal for the future.

Chapter 2
Exploratory analysis of the Canadian weather data

2.1 Introduction

This chapter performs an exploratory analysis of the homogenized climate dataset for the province of Alberta in Canada. We have access to daily maximum temperature (MT), daily minimum temperature (mt) and precipitation (PN). The temperature data have been provided to us by L.A. Vincent and the precipitation data by Eva Mekis, both from Environment Canada. This dataset has been homogenized for changes of instrument, changes of the location of the stations and so on. More information about these data can be found in [34] and [47]. These data are a homogenized part of a larger dataset published by Environment Canada (2007), which is in binary format; a Python module to extract it is provided in Appendix B.

This chapter uses several graphical and analytical tools to examine the behavior of selected climate variables. Looking at the data, we will see some interesting features that suggest future research.

Section 2 describes the dataset. For example, the location plots of the stations and their elevation plots are given. In Section 3, we look at the daily and annual time series of temperatures and precipitation. The normality of the distribution of annual values and the associations between different variables are investigated. We have also investigated the seasonal patterns as well as the long-term patterns of different variables over the course of the year. For example, the mean summer daily minimum temperature shows a significant increasing pattern over the course of the past century in Calgary and some other locations. Section 4 looks at the distribution of the daily values. For example, a normal distribution seems to describe the temperature daily values and a Gamma distribution the precipitation daily values. Confidence intervals for the mean/standard deviation in the normal case and for the shape/scale parameters in the Gamma case are given. Section 5 looks at the spatial and temporal correlation of different variables.

2.2 Data description

The temperature data come from 25 stations over Alberta which operated from 1895 to 2006. The PN data involve 47 stations from 1895 to 2006. Different stations have different intervals of data available; for example, the PN data for Caldwell are available from 1911 to 1990. Figures 2.1 and 2.2 respectively depict the locations of the stations for temperature (both MT and mt) and PN. The number of years available for each station is plotted against the location in Figures 2.3 and 2.4.

Figure 2.1: Alberta site locations for temperature (deg C) data. There are 25 stations available with temperature data over Alberta. (Axes: Longitude W (degrees) vs Latitude N (degrees).)
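Since the station metadata reduce to a longitude, latitude, elevation and a count of available years per site, location and data-availability maps of the kind shown in Figures 2.1 to 2.4 take only a few lines to draw once such a table is in hand. The sketch below assumes a small in-memory list of hypothetical station records (the names, year counts and record layout are illustrative, not the format of the Environment Canada files).

```python
import matplotlib.pyplot as plt

# Hypothetical station metadata: (name, longitude W, latitude N, years of data).
stations = [
    ("Calgary",      -114.01, 51.05, 112),
    ("Banff",        -115.57, 51.18,  98),
    ("Medicine Hat", -110.72, 50.03, 105),
]

lon = [s[1] for s in stations]
lat = [s[2] for s in stations]
yrs = [s[3] for s in stations]

# Location map, colouring each site by the length of its available record.
plt.scatter(lon, lat, s=yrs, c=yrs)
plt.colorbar(label="Available years of data")
plt.xlabel("Longitude W (degrees)")
plt.ylabel("Latitude N (degrees)")
plt.title("Alberta temperature stations")
plt.show()
```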
Another available variable for the location of the stations is the elevation. Figures 2.5 and 2.6 show the elevations in meters. As seen in the plots, some stations have both temperature and precipitation data.

Figure 2.2: Alberta site locations for precipitation (mm) data. There are 47 stations available with precipitation data over Alberta.
Figure 2.3: The number of years available for sites with temperature (deg C) data.
Figure 2.4: The number of years available for sites with precipitation (mm) data.
Figure 2.5: The elevation (meters) of sites with temperature data available.
Figure 2.6: The elevation (meters) of sites with precipitation data available.

2.3 Temperature and precipitation

To get some initial impression of the data, we look at the time series of MT, mt, and PN at a fixed location. We use the Calgary site since it has a long period of data available and includes both temperature and precipitation.

Looking at the maximum and minimum temperature, we see the periodic trend over the course of a year, as shown in Figures 2.7 and 2.8, which illustrate the MT and mt daily values from 2000 to 2003. A regular seasonal trend is seen in both processes.

Looking at the PN plot in Figure 2.9, we observe a large number of zeros. Moreover, seasonal patterns are hard to see by looking at daily values. To illustrate the seasonal patterns better, we look at the monthly averages of MT, mt and PN over the period 1995 to 2005 in Figures 2.10, 2.11 and 2.12. Now the seasonal patterns for precipitation can be seen better in Figure 2.12.

Next we look at the mean annual values of the three variables for all available years that have fewer than 10 missing days (Figures 2.13, 2.14 and 2.15). Table 2.1 gives a summary of these annual means.

Figure 2.7: The time series of daily maximum temperature (deg C) at the Calgary site from 2000 to 2003.
Figure 2.8: The time series of daily minimum temperature (deg C) at the Calgary site from 2000 to 2003.
Figure 2.9: The time series of daily precipitation (mm) at the Calgary site from 2000 to 2003.
Figure 2.10: The time series of monthly maximum temperature means (deg C) at the Calgary site, 1995-2005.
Figure 2.11: The time series of monthly minimum temperature means (deg C) at the Calgary site, 1995-2005.
Figure 2.12: The time series of monthly precipitation means (mm) at the Calgary site, 1995-2005.
Figure 2.13: The annual mean maximum temperature (deg C) for the Calgary site for all available years.
Figure 2.14: The annual mean minimum temperature (deg C) for the Calgary site for all available years.
Figure 2.15: The annual mean precipitation (mm) for the Calgary site for all available years.

Variable     min     1st quartile   median   mean    3rd quartile   max
MT (deg C)   7.59    9.64           10.37    10.36   11.19          13.46
mt (deg C)   -4.83   -3.40          -2.54    -2.66   -1.95          0.07
PN (mm)      0.68    1.12           1.28     1.29    1.39           2.51

Table 2.1: The summary statistics for the mean annual maximum temperature, minimum temperature and precipitation at the Calgary site.

Assuming normality and independence of the observations, we can obtain confidence intervals for all three variables; these are given in Table 2.2. The confidence intervals are fairly narrow.

Variable     95% confidence interval
MT (deg C)   (10.14, 10.57)
mt (deg C)   (-2.85, -2.47)
PN (mm)      (1.24, 1.35)

Table 2.2: Confidence intervals for the mean annual maximum temperature, minimum temperature and precipitation at the Calgary site.

Figure 2.16: The histogram of annual maximum temperature means (deg C) for Calgary with a normal curve fitted to the data.

To investigate the shape of the distribution of the annual means, we look at the histogram of each variable with a fitted normal curve in Figures 2.16, 2.18 and 2.20. The corresponding normal qq-plots (quantile-quantile) are also given in Figures 2.17, 2.19 and 2.21 to assess the normality assumption. Both the histogram and the qq-plot for MT validate the normality assumption. The histogram for mt is slightly left skewed. For PN, some deviation from the normality assumption is seen. This is expected since the daily PN process is very far from normal to start with. Hence, even averaging over the whole year has not quite given us a normal distribution.

We plot all three variables (annual mean MT, mt and PN) on the same graph, Figure 2.22. As shown in that figure, MT and mt show the same trends over time. To get an idea of how the two variables are related, we fit a regression line, taking mt as the response and MT as the explanatory variable. As seen in Figure 2.23, the regression fit looks very good. We repeat this analysis, this time taking MT as the explanatory variable and PN as the response. As shown in Figure 2.24, the fit is still reasonable, but the association is not as strong. As shown in Table 2.3, both fits are significant.
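The intervals in Table 2.2 are standard t-based intervals for a mean under the normality and independence assumptions just stated. A minimal sketch of the computation (Python with SciPy; the array annual_means is a hypothetical stand-in for the vector of annual means of one variable, e.g. the annual mean MT at Calgary):

import numpy as np
from scipy import stats

def mean_ci(x, level=0.95):
    """t-based confidence interval for the mean of an i.i.d. normal sample."""
    x = np.asarray(x, dtype=float)
    n = x.size
    se = x.std(ddof=1) / np.sqrt(n)            # standard error of the mean
    t = stats.t.ppf(0.5 + level / 2, df=n - 1)
    return x.mean() - t * se, x.mean() + t * se

# annual_means would hold one value per year with fewer than 10 missing days
annual_means = np.random.normal(10.4, 1.0, size=100)   # synthetic stand-in
print(mean_ci(annual_means))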
One can criticize the use of a simple regression here since the independence assumption might not be satisfied. Finding more reliable and sensible relationships among the variables needs a multivariate model taking account of correlation and other aspects of the processes. Also note that these are annual averages, which are not as correlated over time as the daily values, as seen in the annual time series plots.

Figure 2.17: The normal qq-plot of annual maximum temperature means (deg C) for Calgary.
Figure 2.18: The histogram of annual minimum temperature means (deg C) for Calgary with a normal curve fitted to the data.
Figure 2.19: The normal qq-plot of annual minimum temperature means (deg C) for Calgary.
Figure 2.20: The histogram of annual precipitation means (mm) for Calgary with a normal curve fitted to the data.
Figure 2.21: The normal qq-plot of annual precipitation means (mm) for Calgary.

Variables    Intercept   Slope    p-value for intercept   p-value for slope
mt (deg C)   -10.40      0.746    2 x 10^-16              2 x 10^-16
PN (mm)      2.13        -0.082   1.49 x 10^-14           0.0005

Table 2.3: Lines fitted to annual mean minimum temperature and annual mean precipitation against annual mean maximum temperature.

Next we look at the change in the seasonal means of all three variables. As we noted above, there are missing data, particularly near the beginning of the time series. This has caused the gap at the beginning of most plots. To get a longer time series of means, we first compute the monthly means allowing 3 missing days and then compute the annual mean using the monthly means. This is reasonable since nearby days have similar values. We do the regression analysis for three locations: Calgary, Banff and Medicine Hat. We fit a regression line to the annual means, spring means, summer means, fall means and winter means of each of MT, mt and PN with respect to time. The results are given in Tables 2.4, 2.5 and 2.6. We have only included fits that turned out to be significant. Note that PN does not appear in any of the tables. The annual minimum temperature and summer mean minimum temperature show an increase in all three locations. Figure 2.25 depicts one of the time series (the mt summer mean for Calgary) with the regression line fitted.

Figure 2.22: The time series plots of maximum temperature (deg C), minimum temperature (deg C) and precipitation (mm) annual means for Calgary. The bottom time series is minimum temperature, the middle one is precipitation and the top curve is maximum temperature.
Figure 2.23: The regression line fitted to maximum temperature and minimum temperature annual means for Calgary.
Figure 2.24: The regression line fitted to maximum temperature and precipitation annual means for Calgary.
Figure 2.25: The regression line fitted to summer minimum temperature means against time for Calgary.
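The fits reported in Tables 2.4 to 2.6 are ordinary least-squares lines of a seasonal (or annual) mean against the year. A minimal sketch of one such fit (Python with SciPy; years and summer_mt are hypothetical arrays holding the year and the corresponding summer mean of mt for one station; linregress returns the slope p-value directly, and the intercept p-value is reconstructed here from the intercept standard error, which is available in recent SciPy versions):

import numpy as np
from scipy import stats

# hypothetical inputs: one summer mean of mt per year (a synthetic stand-in here)
years = np.arange(1900, 2006)
summer_mt = 7.5 + 0.01 * (years - 1950) + np.random.normal(0.0, 0.8, years.size)

fit = stats.linregress(years, summer_mt)
df = years.size - 2
p_intercept = 2 * stats.t.sf(abs(fit.intercept / fit.intercept_stderr), df)

print("intercept:", round(fit.intercept, 2), "slope (deg C/yr):", round(fit.slope, 4))
print("p-value (slope):", fit.pvalue, "p-value (intercept):", p_intercept)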
Variable     Season   Intercept   Slope     p-value for intercept   p-value for slope
mt (deg C)   Year     -24.72      0.112     2 x 10^-5                0.0001
mt (deg C)   Spring   -30.05      0.138     0.0008                   0.0024
mt (deg C)   Summer   -20.11      0.0144    6 x 10^-7                3 x 10^-11

Table 2.4: The regression line parameters for the fitted lines for each variable with respect to time for the Calgary site.

Variable     Season   Intercept   Slope     p-value for intercept   p-value for slope
MT (deg C)   Year     -12.99      0.0105    0.019                    0.0002
MT (deg C)   Spring   -17.0       0.0048    0.075                    0.009
MT (deg C)   Fall     -12.64      0.0106    0.19                     0.0326
mt (deg C)   Year     -37.0       0.01666   2 x 10^-10               2 x 10^-8
mt (deg C)   Spring   -49.8       0.0229    5 x 10^-9                10^-7
mt (deg C)   Summer   -36.8       0.0212    2 x 10^-15               2 x 10^-16

Table 2.5: The regression line parameters for the fitted lines for each variable with respect to time for the Banff site.

Variable     Season   Intercept   Slope     p-value for intercept   p-value for slope
MT (deg C)   Year     -24.6       0.0185    0.00102                  3 x 10^-6
MT (deg C)   Spring   -34.24      0.0235    0.009                    0.0005
mt (deg C)   Year     -39.98      0.0197    5 x 10^-10               2 x 10^-9
mt (deg C)   Spring   -39.81      0.0196    5 x 10^-5                9 x 10^-5
mt (deg C)   Summer   -10.93      0.0112    0.0199                   7 x 10^-6
mt (deg C)   Fall     -24.66      0.0122    0.0110                   0.0137

Table 2.6: The regression line parameters for the fitted lines for each variable with respect to time for the Medicine Hat site.

2.4 Daily values, distributions

This section studies the daily values of all three variables. To that end, we pick four days of the year: Jan 1st, April 1st, July 1st and October 1st. Let us look at the time series, histograms and normal qq-plots for each variable over the years. Figures 2.26 to 2.31 give the results. The plots show that a normal distribution fits the data for daily MT and mt on the selected days fairly well. However, some deviations from the normal distribution are seen, particularly in the tails. We also tried the first day of each month and observed similar results.

Figure 2.26: The time series of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
Figure 2.27: The histogram of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
Figure 2.28: The normal qq-plots of daily maximum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
Figure 2.29: The time series of daily minimum temperature for Calgary for four given dates: January 1st, April 1st, July 1st and October 1st.
Figure 2.30: The histogram of daily minimum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
Figure 2.31: The normal qq-plots of daily minimum temperature at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
Figure 2.32: The time series of daily precipitation at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.

We plot the histogram for PN as well (Figure 2.33). The distribution is far from normal because of the high frequency of no-PN (dry) days.

Next, we use the available years to compute confidence intervals for the mean of every given day of the year for MT and mt. For PN, we construct confidence intervals for the probability of PN. [A PN day is defined to be a day with PN > 0.2 (mm). This is because any precipitation amount less than 0.2 (mm) is barely measurable.] Figures 2.34 to 2.36 give the confidence intervals for the means. The confidence intervals for the standard deviations (obtained by bootstrap techniques) are given in Figures 2.37 to 2.39. A regular seasonal pattern is seen in the means and standard deviations. For example, the maximum for MT and mt occurs around the 200th day of the year (in July) and the minimum occurs at the beginning and the end of the year. Comparing the plots of the means and the standard deviations, we observe that warmer days have smaller standard deviations than colder days. For example, the minimum standard deviation for the maximum and minimum temperature happens around the 200th day of the year, which corresponds to the warmest period of the year. The plots seem to indicate that a simple periodic function suffices to model the seasonal patterns. Contrary to MT and mt, for the 0-1 PN process, the standard deviation is highest in June, when the probability of precipitation is close to 1/2.

Figure 2.33: The histogram of daily precipitation at the Calgary site for four given dates: January 1st, April 1st, July 1st and October 1st.
Figure 2.34: The confidence intervals for the daily mean maximum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
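The bootstrap intervals for the daily standard deviations (Figures 2.37 to 2.39) can be obtained by resampling, for a fixed calendar day, the values observed across the available years. A minimal percentile-bootstrap sketch (Python with NumPy; day_values is a hypothetical array of the observations of one variable on one fixed day across years, and the number of resamples is an arbitrary choice):

import numpy as np

def bootstrap_sd_ci(day_values, level=0.95, n_boot=2000, seed=0):
    """Percentile bootstrap confidence interval for the standard deviation."""
    rng = np.random.default_rng(seed)
    x = np.asarray(day_values, dtype=float)
    sds = np.array([rng.choice(x, size=x.size, replace=True).std(ddof=1)
                    for _ in range(n_boot)])
    alpha = 1.0 - level
    return np.quantile(sds, alpha / 2), np.quantile(sds, 1 - alpha / 2)

# synthetic stand-in for, e.g., all July 1st maximum temperatures across years
day_values = np.random.normal(23.0, 4.0, size=110)
print(bootstrap_sd_ci(day_values))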
As shown above, the distribution of daily PN values is far from normal. This time, after removing the zeros, we fit a Gamma distribution to PN (Figure 2.42). The Gamma qq-plots are given in Figure 2.43 and reveal a fairly good fit.

Figure 2.35: The confidence intervals for the daily mean minimum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
Figure 2.36: The confidence intervals for the probability of precipitation at the Calgary site for the days of the year. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
Figure 2.37: The confidence intervals for the standard deviation of each day of the year for maximum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
Figure 2.38: The confidence intervals for the standard deviation of each day of the year for minimum temperature (deg C) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals.
Figure 2.39: The confidence intervals for the standard deviation (sd) of each day of the year for the probability of precipitation (0-1 precipitation process) at the Calgary site. The dashed line shows the upper bound and the solid line the lower bound of the confidence intervals. The plot shows sd ≤ 1/2; this is because sd = √(p(1−p)), which has a maximum value of 1/2.
Figure 2.40: The distribution of each day of the year for MT (deg C) from Jan 1st to Dec 1st. The year has been divided into two halves. In each half, rainbow colors are used to show the change of the distribution.
Figure 2.41: The distribution of each day of the year for mt (deg C) from Jan 1st to Dec 1st. The year has been divided into two halves. In each half, rainbow colors are used to show the change of the distribution.
Figure 2.42: The histogram of daily precipitation greater than 0.2 mm at the Calgary site with a Gamma density curve fitted using maximum likelihood.
Figure 2.43: The qq-plots of daily precipitation greater than 0.2 mm at the Calgary site with a Gamma curve fitted using maximum likelihood.
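The Gamma curves in Figures 2.42 and 2.43 are fitted by maximum likelihood to the wet-day amounts. A minimal sketch (Python with SciPy; wet is a hypothetical stand-in for the daily precipitation amounts above the 0.2 mm threshold, and fixing the location parameter at zero is our choice here, since the amounts are positive):

import numpy as np
from scipy import stats

# synthetic stand-in for wet-day precipitation amounts (> 0.2 mm) on a given date
wet = stats.gamma.rvs(a=0.9, scale=5.0, size=300, random_state=0)

# maximum likelihood fit of a two-parameter Gamma (location fixed at 0)
shape, loc, scale = stats.gamma.fit(wet, floc=0.0)
print("shape (alpha):", shape, "scale:", scale)

# quantile pairs for a Gamma qq-plot: theoretical vs. empirical
p = (np.arange(1, wet.size + 1) - 0.5) / wet.size
theoretical = stats.gamma.ppf(p, a=shape, scale=scale)
empirical = np.sort(wet)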
Figure 2.44: The Gamma fit for each day of 4 months for precipitation (mm). In each month, rainbow colors are used to show the change of the distribution.

Figures 2.40, 2.41 and 2.44 reveal the result of our investigation of the change in the distribution over a period of time. For MT and mt, we have done this over the course of the year. The figures show how the distribution deforms continuously over the year. We can also notice changes in the mean and standard deviation over the year. For PN, we have done the same only for 4 different months because of the high irregularity of the process.

Next, we look at the parameters of the Gamma distribution fitted to PN over the course of a year. If we use the maximum likelihood estimates (MLE), which we have used above to form the Gamma curves, the confidence intervals obtained by the bootstrap method are very wide (they tend rapidly to infinity). Hence, we use the "method of moments estimates" (MOM) to obtain confidence intervals. The MOM confidence intervals are given in Figure 2.46. When using the MLE estimates, since there is no closed form for them, we need to use Newton's method to find the maximizing values. However, MOM gives us a closed form solution. This advantage might explain the better behavior of the MOM estimates in forming the confidence intervals. However, even the MOM confidence intervals do not look satisfactory and are rather wide and irregular, especially at the beginning and end of the year.

Figure 2.45: The maximum likelihood estimate of α, the shape parameter of the Gamma distribution fitted to the precipitation amounts.

We can also consider the 0-1 process of PN (1 for wet and 0 for dry) and compute the transition probabilities for PN (Figure 2.47). The figure shows that the probability of PN changes continuously over the year and can be modeled by a simple periodic function.

Considering the 0-1 process of PN as a chain leads to the interesting question of the order of the Markov chain. Let us denote by 1 a PN occurrence and 0 otherwise. Suppose x_t = 1 denotes PN on day t and x_t = 0 denotes no PN, and let p_{x_{t-r}···x_t}(t) denote the probability of observing x_t on day t of the year conditional on the chain (x_{t-r}, ···, x_{t-1}). In Figure 2.47, we have plotted the estimated p̂_{11}(t) and p̂_{01}(t) for different days of the year.

Figure 2.46: The confidence interval for the MOM estimate of the shape parameter, α, of the Gamma distribution fitted to daily precipitation amounts. The dotted line is the upper bound and the solid line the lower bound. As seen in the figure, the upper bounds at the beginning and end of the year become very large. We have not shown them because otherwise the pattern in the rest of the year could not be seen.
Figure 2.47: The 1st-order transition probabilities. The dotted line is the probability of precipitation if it happened the day before (p̂_{11}) and the dashed line is the probability of precipitation if it did not happen the day before (p̂_{01}).
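The estimates plotted in Figure 2.47 are simple relative frequencies: for each day of the year, p̂_{01}(t) and p̂_{11}(t) are the fractions of years in which day t was wet given that day t−1 was dry or wet, respectively. A minimal sketch (Python; wet is a hypothetical 0-1 array indexed by year and day of year, with leap days assumed to be handled elsewhere); the closed-form MOM estimates of the Gamma parameters referred to above (shape = mean²/variance, scale = variance/mean) are also included:

import numpy as np

def transition_probs(wet):
    """wet: 0-1 array of shape (n_years, 365); returns (p01_hat, p11_hat) for days 2..365."""
    prev, curr = wet[:, :-1], wet[:, 1:]
    p01 = (curr * (1 - prev)).sum(axis=0) / np.maximum((1 - prev).sum(axis=0), 1)
    p11 = (curr * prev).sum(axis=0) / np.maximum(prev.sum(axis=0), 1)
    return p01, p11

def gamma_mom(x):
    """Method-of-moments estimates of the Gamma shape and scale."""
    m, v = np.mean(x), np.var(x, ddof=1)
    return m * m / v, v / m            # (alpha_hat, scale_hat)

# synthetic stand-in: 100 years of daily wet/dry indicators
rng = np.random.default_rng(1)
wet = (rng.random((100, 365)) < 0.35).astype(int)
p01_hat, p11_hat = transition_probs(wet)
print(gamma_mom(rng.gamma(0.8, 5.0, 200)))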
Figure 2.48: The 2nd-order transition probabilities for the precipitation at the Calgary site: p̂_{111} (solid) against p̂_{011} (dotted).

The clear gap between these two estimated probabilities indicates that a 1st-order Markov chain should be preferred to a 0th-order one. Figures 2.48 and 2.49 plot p̂_{111} against p̂_{011} and p̂_{001} against p̂_{101}. The estimated probabilities seem to be close and overlap heavily over the course of the year. Hence a 1st-order Markov chain seems to suffice for describing the binary process of PN.

2.5 Correlation

The correlation in a spatial-temporal process can depend on time and space. This section studies the temporal and spatial patterns of the correlation function separately.

Figure 2.49: The 2nd-order transition probabilities for the precipitation at the Calgary site: p̂_{001} (solid) against p̂_{101} (dotted).
Figure 2.50: The correlation and covariance plots for maximum temperature at the Calgary site for Jan 1st and the 732 subsequent days.

2.5.1 Temporal correlation

Here we look at the correlation/covariance of the variables as a function of time. The location is taken to be the Calgary site. First we look at the correlation/covariance of a given day with the subsequent days. We pick Jan 1st and compute the correlation/covariance with the following days: Jan 2nd, Jan 3rd, etc. Figure 2.50 shows that the correlation and covariance have the same trends for maximum temperature. Figures 2.51 to 2.53 show a decreasing trend in the correlation over time for MT, mt and PN. The decrease is far from linear and looks to be exponential. The plots also indicate that only a few subsequent days are appreciably correlated and, in particular, two days that are one year apart can be considered independent. This assumption might be useful in building a spatial-temporal model.

Figure 2.51: The correlation plot for maximum temperature (deg C) at the Calgary site for Jan 1st and the 732 subsequent days.
Figure 2.52: The correlation plot for minimum temperature (deg C) at the Calgary site for Jan 1st and the 732 subsequent days.
Figure 2.53: The correlation plot for precipitation (mm) at the Calgary site for Jan 1st and the 732 subsequent days.

Next we look at the correlation of responses on other days of the year with their 30 subsequent days. Our goal is to see if the correlation function has the same behavior over the course of a year. We pick Feb 1st, April 1st, July 1st and Oct 1st. Figures 2.54 to 2.56 show similar patterns.

Finally, we look at the correlation of two fixed locations over the course of the year (by changing the day). The results are given in Figures 2.57 and 2.58. Strong correlation and clear seasonal patterns are seen for MT and mt. This seems to indicate in particular that the temperature process is not stationary. The correlation in the middle of the year, around day 200, which corresponds to the summer season, seems to be smaller than the correlation at the beginning and end of the year, which correspond to the cold season.
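The temporal correlations just described are ordinary sample correlations, computed across years, between the values on a fixed calendar day and the values lagged by k days. A minimal sketch of one plausible way to compute them (Python with NumPy; daily is a hypothetical one-dimensional array of consecutive daily values, missing values and leap days are ignored, and the exact convention used in the thesis may differ):

import numpy as np

def lag_correlation(daily, start_index, max_lag, period=365):
    """Correlation between a fixed calendar day and the days lagged by 1..max_lag.

    daily: 1-D array of consecutive daily values spanning many years;
    start_index: index of the reference day in the first year (e.g. Jan 1st);
    period: number of days treated as one year (leap days ignored here).
    """
    base_idx = np.arange(start_index, daily.size - max_lag, period)
    base = daily[base_idx]
    corrs = np.empty(max_lag)
    for k in range(1, max_lag + 1):
        corrs[k - 1] = np.corrcoef(base, daily[base_idx + k])[0, 1]
    return corrs

# synthetic stand-in: 50 "years" of 365 days with a seasonal cycle plus noise
t = np.arange(50 * 365)
daily = 10 * np.sin(2 * np.pi * t / 365) + np.random.normal(0, 3, t.size)
print(lag_correlation(daily, start_index=0, max_lag=30)[:5])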
Figure 2.54: The correlation plot for maximum temperature (deg C) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot-dash) and the 30 subsequent days.
Figure 2.55: The correlation plot for minimum temperature (deg C) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot-dash) and the 30 subsequent days.
Figure 2.56: The correlation plot for precipitation (mm) at the Calgary site for Feb 1st (solid), April 1st (dashed), July 1st (dotted) and Oct 1st (dot-dash) and the 30 subsequent days.
Figure 2.57: The correlation plot for maximum temperature and minimum temperature (deg C) between Calgary and Medicine Hat.
Figure 2.58: The correlation plot for precipitation (mm) between Calgary and Medicine Hat.
Figure 2.59: The correlation plot for maximum temperature (deg C) with respect to distance (km).

2.5.2 Spatial correlation

This subsection looks at the spatial correlation by fixing the time to a few dates distributed over the year's climate regime: January 1st, April 1st, July 1st and October 1st. We plot the correlation with respect to the geodesic distance (km) on the surface of the earth. Figures 2.59 to 2.62 show the results for MT, mt, PN and 0-1 PN respectively. For MT and mt, we observe a clear decreasing trend with respect to distance. The trend for PN does not seem to be regular.

Figure 2.60: The correlation plot for minimum temperature (deg C) with respect to distance (km).
Figure 2.61: The correlation plot for precipitation (mm) with respect to distance (km).
Figure 2.62: The correlation plot for the precipitation (mm) 0-1 process with respect to distance (km).
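The spatial correlations in Figures 2.59 to 2.62 pair, for a fixed date, the sample correlation between two stations (computed across years) with the geodesic distance between the stations. A minimal sketch, using the haversine formula as an approximation to the geodesic distance on a spherical earth (Python; values is a hypothetical array of one variable on one fixed date, one column per station, and lat/lon are the station coordinates in degrees):

import numpy as np
from itertools import combinations

def haversine_km(lat1, lon1, lat2, lon2, radius=6371.0):
    """Approximate geodesic (great-circle) distance in km on a spherical earth."""
    p1, p2 = np.radians(lat1), np.radians(lat2)
    dp, dl = p2 - p1, np.radians(lon2 - lon1)
    a = np.sin(dp / 2) ** 2 + np.cos(p1) * np.cos(p2) * np.sin(dl / 2) ** 2
    return 2 * radius * np.arcsin(np.sqrt(a))

def distance_vs_correlation(values, lat, lon):
    """values: (n_years, n_stations) array for one calendar date; returns (distance, correlation) pairs."""
    pairs = []
    for i, j in combinations(range(values.shape[1]), 2):
        d = haversine_km(lat[i], lon[i], lat[j], lon[j])
        r = np.corrcoef(values[:, i], values[:, j])[0, 1]
        pairs.append((d, r))
    return np.array(pairs)

# synthetic stand-in: 80 years at 6 stations with made-up coordinates
rng = np.random.default_rng(2)
values = rng.normal(size=(80, 6))
lat = np.array([51.0, 51.2, 50.0, 53.5, 52.3, 49.7])
lon = np.array([-114.1, -115.6, -110.7, -113.5, -117.0, -112.8])
print(distance_vs_correlation(values, lat, lon)[:3])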
2.6 Summary and conclusions

This section summarizes our findings from the exploratory analysis.

• There is a strong seasonal trend in the temperature and precipitation processes. See Figures 2.7, 2.8, 2.11 and 2.36.
• The summer average minimum temperature has increased at several locations over the past century. See Figure 2.25.
• mt and MT are highly correlated. See Figure 2.23.
• The distributions of daily maximum temperature and minimum temperature are rather close to the Gaussian distribution in the center, with some deviations seen in the tails. See Figures 2.27 and 2.29.
• The temperature process in Alberta is less variable in the warm seasons, and the converse holds for the precipitation process. See Figures 2.37, 2.38 and 2.39.
• The distribution of the daily temperature varies continuously over the course of the year. This could not be shown for precipitation. (This might be because we need more data.)
• The correlation between two sites depends on the time of the year. They are more correlated in cold seasons. This might be because there are more (strong) global weather regimes in the cold seasons influencing the whole region.
• The correlation over time for MT, mt and PN seems stationary and decreases with a nonlinear (roughly exponential) trend with respect to the time difference.
• The spatial correlations for MT and mt are strong and decrease almost linearly with respect to the geodesic distance.
• The spatial correlation for PN is not strong. This might be because the sites are too far apart to capture the spatial correlation of PN.

The future chapters investigate some of these items. In particular, after developing some theory regarding Markov chains, we investigate the order of the binary precipitation process. Then we turn to modeling the occurrence of extreme temperatures. Instead of using a Gaussian process to model the temperature and using that to infer about the occurrence of the extremes, we use a categorical chain. This is because of the deviations from normality in the tails, as pointed out above.

Chapter 3

rth-order Markov chains

3.1 Introduction

This chapter studies rth-order categorical Markov chains and, more generally, categorical discrete-time stochastic processes. By "categorical", we mean chains that have a finite number of possible states at each time point. Such chains have important applications in many areas, one of which is modeling weather processes such as precipitation over time. In fact, we use these chains to model the binary process of precipitation as well as dichotomized temperature processes. In rth-order Markov chains, the conditional probability of the present given the past is modeled. Such a conditional probability is a function of the past r states, where each of them takes only finitely many possible values.

It is useful and intuitively appealing to specify or model a discrete process over time by the conditional probabilities rather than the joint distribution. However, one must check the consistency of such a specification, i.e. prove that it corresponds to a full joint distribution. In the case of discrete-time categorical processes, we prove a theorem that shows the conditional probabilities can be used to specify the process. Also, we prove a representation theorem which states that every such conditional probability, after an appropriate transformation, can be written as a linear combination of monomials of the past processes. In fact, we represent all categorical discrete-time stochastic processes over time, in particular rth-order Markov chains and, more particularly, stationary rth-order Markov chains. For the binary case the result is a consequence of an expansion theorem due to Besag [6].
To generalize the result to arbitrary categorical Markov chains (rather than binary only), we prove a new expansion theorem.

The result simplifies the task of modeling categorical stochastic processes. Since we have written the conditional probability as a linear combination, we can simply add other covariates as linear terms to the model to build non-stationary chains. For example, we can add seasonal terms or geographical coordinates (longitude and latitude). The theory of "partial likelihood" allows us to estimate the parameters of such chain models for the binary case. By restricting the degree of those polynomials or by requiring that some of their coefficients be the same, we can find simpler models. Simulation studies show that the "BIC" criterion (Bayesian information criterion) combined with the partial likelihood works well, in that they recover the correct simulation model. Since we are only dealing with the categorical case, all the density functions in this chapter are densities with respect to the counting measure on the real line.

Specifying a categorical chain over time (with positive joint densities) using conditional probabilities of the present given the past is quite common in statistics and probability. However, we did not find a rigorous result giving sufficient and necessary conditions for a collection of functions to correspond to the conditionals of a unique stochastic process. The proof is given in Theorem 3.3.1. It is an easy consequence of Lemma 3.3.2, which states that the "ascending" joint densities uniquely determine such a stochastic process.

Another commonly used technique in statistics is transforming a discrete probability density from (0,1) to the real numbers using a transformation such as the "log", for example in logistic regression. This is done to remove the restrictions on these quantities and ease the modeling of such probabilities. Theorem 3.4.1 provides a characterization of all such density functions given any bijective transformation between the positive numbers and the reals. Hence any positive discrete density function (mass function) corresponds to a unique real-valued function on a finite domain, and any arbitrary real-valued function on a finite domain corresponds to a positive density (after fixing the transformation and one element with positive probability). We do not know of a result in this generality elsewhere. Modeling such an unrestricted function on a finite domain is obviously easier than modeling the density directly.

In order to find a parametric form for an arbitrary real-valued function of finitely many finite variables, for the binary case, we use a corollary of a result stated by Besag [6], who used such functions in modeling Markov random fields. However, Besag did not provide a rigorous proof, and the statement of the theorem is flawed, as also pointed out by Cressie et al. in [14]. They also state a correct version of the theorem without offering a proof. We provide a rigorous statement and proof in Theorem 3.5.1. The corollary can only be obtained if the flaw in the statement is fixed. In order to extend to stochastic processes that can have more than two states at some times, we prove a new representation theorem in Theorem 3.5.6. Some novel simplified models with fewer parameters for such processes are given in Subsection 3.5.3, and many of them are investigated in later chapters to model precipitation and extreme temperature event occurrences.
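As a concrete illustration of the kind of model this linear representation buys us, the following sketch builds the design matrix of a logit-link model for a binary chain in which the conditional probability of X_t depends linearly on the last r states, their product, and a pair of seasonal harmonic terms. This is only one plausible parameterization consistent with the description above, not necessarily the one used later in the thesis; fitting the logistic regression below maximizes the product of the conditional probabilities, which is the partial likelihood in this setting (Python; statsmodels is assumed to be available, and the series x is synthetic):

import numpy as np
import statsmodels.api as sm

def design_matrix(x, day_of_year, r=2):
    """Lagged 0-1 covariates, their product, and one seasonal harmonic pair."""
    t = np.arange(r, x.size)
    lags = np.column_stack([x[t - k] for k in range(1, r + 1)])
    interaction = np.prod(lags, axis=1, keepdims=True)
    season = np.column_stack([np.sin(2 * np.pi * day_of_year[t] / 365.0),
                              np.cos(2 * np.pi * day_of_year[t] / 365.0)])
    X = np.column_stack([lags, interaction, season])
    return sm.add_constant(X), x[t]

# synthetic stand-in for a long daily wet/dry series at one station
rng = np.random.default_rng(3)
x = (rng.random(5000) < 0.4).astype(int)
day_of_year = np.arange(5000) % 365 + 1

X, y = design_matrix(x, day_of_year, r=2)
fit = sm.Logit(y, X).fit(disp=False)      # maximizes the product of conditionals
print(fit.params)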
3.2 Markov chains

Let {X_t}_{t ∈ T} be a stochastic process on the index set T, where T = Z, T = N (the integers or natural numbers, respectively) or T = {0, 1, ···, n}. It is customary to call {X_t}_{t ∈ T} a chain, since T is countable and has a natural ordering. {X_t}_{t ∈ T} is called an rth-order Markov chain if

\[ P(X_t \mid X_{t-1}, \dots) = P(X_t \mid X_{t-1}, \dots, X_{t-r}), \quad \forall t \text{ such that } t, t-r \in T. \]

We call the Markov chain homogeneous if

\[ P(X_t = x_t \mid X_{t-1} = x_{t-1}, \dots, X_{t-r} = x_{t-r}) = P(X_{t'} = x_t \mid X_{t'-1} = x_{t-1}, \dots, X_{t'-r} = x_{t-r}), \]

for all t, t' ∈ T such that t − r and t' − r are also in T. Note that Markovness can be defined as a local property. We call {X_t}_{t ∈ T} locally rth-order Markov at t if

\[ P(X_t \mid X_{t-1}, \dots) = P(X_t \mid X_{t-1}, \dots, X_{t-r}). \]

Hence, we can have chains with a different Markov order at different times.

Let X_t be the binary random variable for precipitation on day t, with 1 denoting the occurrence of precipitation and 0 non-occurrence. In particular, consider the precipitation (PN) at the Calgary site from 1895 to 2006. This process can be considered in two possible ways:

1. Let X_1, X_2, ···, X_366 denote the binary random variables of precipitation for the days of a year. Suppose we repeatedly observe this chain year by year from 1895 to 2006 and take these observed chains to be independent and identically distributed from one year to the next. With this assumption, techniques developed in [4] can be applied in order to infer the Markov order of the chain. However, this approach presents three issues. Firstly, independence of the successive chains seems questionable; in particular, the end of any one year will be autocorrelated with the beginning of the next. Secondly, this model unrealistically assumes the 0-1 precipitation stochastic process is identically distributed over all years. Thirdly, and more technically, leap years have 366 days while non-leap years have 365. We can resolve this last issue by formally assuming a missing data day in the non-leap years, by dropping the last day in the leap years, or by using other methods. However, none of these approaches seems completely satisfactory.

2. Alternatively, we could consider the observations of Calgary daily precipitation as coming from a single process that spans the entire time interval from 1895 to 2006. In this case, we will show below that we can still build models that bring in the seasonality effects within a year.

3.3 Consistency of the conditional probabilities

To represent a stochastic process, we only need to specify the joint probability distributions of all finite collections of states. The Kolmogorov extension theorem then guarantees the existence and uniqueness of an underlying stochastic process from which these distributions derive, provided they are consistent as described below. (See [9] for example.)

To state the version of that celebrated theorem we require, let T denote some interval (that can be thought of as "time"), and let n ∈ N = {1, 2, ...}. For each k ∈ N and finite sequence of times t_1, ···, t_k, let ν_{t_1 ··· t_k} be a probability measure on (R^n)^k. Suppose that these measures satisfy two consistency conditions:

1. Permutation invariance. For all permutations π of 1, ···, k (a bijective map from a set onto itself) and measurable sets F_i ⊂ R^n,

\[ \nu_{t_{\pi(1)} \dots t_{\pi(k)}}(F_1 \times \dots \times F_k) = \nu_{t_1 \dots t_k}\big(F_{\pi^{-1}(1)} \times \dots \times F_{\pi^{-1}(k)}\big). \]
2. Marginalization consistency. For all measurable sets F_i ⊆ R^n and m ∈ N,

\[ \nu_{t_1 \dots t_k}(F_1 \times \dots \times F_k) = \nu_{t_1 \dots t_k t_{k+1} \dots t_{k+m}}(F_1 \times \dots \times F_k \times \mathbb{R}^n \times \dots \times \mathbb{R}^n). \]

Then there exists a probability space (Ω, F, P) and a stochastic process X : T × Ω → R^n such that

\[ \nu_{t_1 \dots t_k}(F_1 \times \dots \times F_k) = P(X_{t_1} \in F_1, \dots, X_{t_k} \in F_k), \]

for all t_i ∈ T, k ∈ N and measurable sets F_i ⊆ R^n, i.e. X has the ν_{t_1 ... t_k} as its finite-dimensional distributions. (See [37] for more details.)

Remark. Note that Condition 1 is equivalent to

\[ \nu_{t_{\pi(1)} \dots t_{\pi(k)}}(F_{\pi(1)} \times \dots \times F_{\pi(k)}) = \nu_{t_1 \dots t_k}(F_1 \times \dots \times F_k). \]

This is seen by replacing F_1 × ··· × F_k by F_{π(1)} × ··· × F_{π(k)} in the first equality.

Remark. We are only concerned with the case n = 1. This is because we consider stochastic processes, i.e. collections of random variables from the same sample space to R^1 = R.

When working with (higher order) Markov chains over the index set N, it is natural to consider the conditional distributions of the present, time t, given the past, instead of the finite joint distributions; in other words,

\[ P_t(x_0, \dots, x_t) = P(X_t = x_t \mid X_{t-1} = x_{t-1}, \dots, X_0 = x_0), \]

for {X_t}_{t ∈ N ∪ {0}}, plus the starting distribution

\[ P_0(x_0) = P(X_0 = x_0). \]

However, that raises a fundamental question: does there exist a stochastic process whose conditional distributions match the specified ones and, if so, is it unique? We answer this question affirmatively in this section for the case of discrete-time categorical processes, in particular higher order categorical Markov chains. We also restrict ourselves to chains for which all the joint probabilities are positive. Let M_0, M_1, ··· ⊂ R be the state spaces for times 0, 1, ···, where each of them is of finite cardinality. A probability measure on the finite space M_0 can be represented through its density function, a positive function P_0 : M_0 → R satisfying the condition

\[ \sum_{m \in M_0} P_0(m) = 1. \]

The following theorem ensures the consistency of our probability model.

Theorem 3.3.1 Suppose M_0, M_1, ··· ⊂ R, with |M_t| = c_t < ∞, t = 0, 1, ···. Let P_0 : M_0 → R be the density of a probability measure on M_0 and, more generally for n = 1, 2, ..., let P_n(x_0, x_1, ···, x_{n-1}, ·) be a positive probability density on M_n for every (x_0, ···, x_{n-1}) ∈ M_0 × ··· × M_{n-1}. Then there exists a unique stochastic process (up to distributional equivalence) on a probability space (Ω, Σ, P) such that

\[ P(X_n = x_n \mid X_{n-1} = x_{n-1}, \dots, X_0 = x_0) = P_n(x_0, x_1, \dots, x_{n-1}, x_n). \]

To prove this theorem, we first consider a related problem whose solution is used in the proof. More precisely, we consider stochastic processes {X_n}_{n ∈ N ∪ {0}}, where the state space of X_n is M_n, n = 0, 1, 2, ···, each finite. Suppose p_n : M_0 × M_1 × ··· × M_n → R is the joint probability distribution (density) of the random vector (X_0, ..., X_n), i.e.

\[ p_n(x_0, \dots, x_n) = P(X_0 = x_0, \dots, X_n = x_n). \]

We call such a sequence of functions, {p_n}_{n ∈ N}, the "ascending joint distributions" of the stochastic process {X_n}_{n ∈ N ∪ {0}}. It is clear that, given such a family of functions {p_n}_{n ∈ N}, other joint distributions such as

\[ P(X_{t_1} = x_{t_1}, \dots, X_{t_k} = x_{t_k}) \]

are obtainable by summing over the appropriate components. Now consider the inverse problem. Given the {p_n}_{n ∈ N} and some type of consistency between them, is there a (unique) stochastic process that matches these joint distributions? The following lemma gives an affirmative answer.
Lemma 3.3.2 Suppose M_t ⊂ R, t = 0, 1, ···, are finite, p_0 : M_0 → R represents a probability density function (i.e. \(\sum_{x_0 \in M_0} p_0(x_0) = 1\)), and the functions p_n : M_0 × ··· × M_n → R^+ ∪ {0} satisfy the following (consistency) condition:

\[ \sum_{x_n \in M_n} p_n(x_0, \dots, x_n) = p_{n-1}(x_0, \dots, x_{n-1}). \]

Then there exists a unique stochastic process (up to distributional equivalence) {X_t}_{t ∈ N ∪ {0}} such that

\[ P(X_0 = x_0, \dots, X_n = x_n) = p_n(x_0, \dots, x_n). \]

Proof
Existence: By the Kolmogorov extension theorem quoted above, we only need to show that there exists a consistent family of measures (density functions)

\[ \{ q_{t_1, \dots, t_k} \mid k \in \mathbb{N},\; (t_1, \dots, t_k) \in (\mathbb{N} \cup \{0\})^k \}, \]

such that q_{0,1,···,t} = p_t. We define such a family of functions and prove that they are measures and are consistent. For any sequence t_1, ···, t_k, let t = max{t_1, ···, t_k} and define

\[ q_{t_1, \dots, t_k}(x_{t_1}, \dots, x_{t_k}) = \sum_{x_u \in M_u,\; u \in \{0, \dots, t\} - \{t_1, \dots, t_k\}} p_t(x_0, \dots, x_t). \]

We need to prove three things:

a) Each q_{t_1,···,t_k} is a density function. Nonnegativity is immediate since p_t is nonnegative by assumption and each q_{t_1,···,t_k} is a sum of values of p_t. It only remains to show that p_t sums to one. For t = 0 this is among the assumptions of the lemma. For t > 0 it follows by induction from the identity

\[ \sum_{x_i \in M_i,\; i = 0, 1, \dots, t} p_t(x_0, \dots, x_t) = \sum_{x_i \in M_i,\; i = 0, 1, \dots, t-1} p_{t-1}(x_0, \dots, x_{t-1}), \]

where the right hand side is obtained from the assumption \(\sum_{x_n \in M_n} p_n = p_{n-1}\).

b) In order to satisfy the first condition of the Kolmogorov extension theorem, we need to show

\[ q_{t_1, \dots, t_k}(x_{t_1}, \dots, x_{t_k}) = q_{t_{\pi(1)}, \dots, t_{\pi(k)}}(x_{t_{\pi(1)}}, \dots, x_{t_{\pi(k)}}), \]

for π a permutation of {1, 2, ···, k}. But this is obvious, since max{t_1, ···, t_k} = max{t_{π(1)}, ···, t_{π(k)}}.

c) In order to satisfy the second condition of the Kolmogorov extension theorem, we need to show

\[ \sum_{x_{t_i} \in M_{t_i}} q_{t_1, \dots, t_i, \dots, t_k}(x_{t_1}, \dots, x_{t_i}, \dots, x_{t_k}) = q_{t_1, \dots, \hat{t}_i, \dots, t_k}(x_{t_1}, \dots, \hat{x}_{t_i}, \dots, x_{t_k}), \]

where the hat above a component means that the component is omitted. To prove this, we consider two cases.

Case I: t = max{t_1, ···, t_k} = max{t_1, ···, t̂_i, ···, t_k}. Then

\[ \sum_{x_{t_i} \in M_{t_i}} q_{t_1, \dots, t_i, \dots, t_k}(x_{t_1}, \dots, x_{t_i}, \dots, x_{t_k}) = \sum_{x_{t_i} \in M_{t_i}} \; \sum_{x_u \in M_u,\; u \in \{0, \dots, t\} - \{t_1, \dots, t_i, \dots, t_k\}} p_t(x_0, \dots, x_t) \]
\[ = \sum_{x_u \in M_u,\; u \in \{0, \dots, t\} - \{t_1, \dots, \hat{t}_i, \dots, t_k\}} p_t(x_0, \dots, x_t) = q_{t_1, \dots, \hat{t}_i, \dots, t_k}(x_{t_1}, \dots, \hat{x}_{t_i}, \dots, x_{t_k}). \]

Case II: max{t_1, ···, t̂_i, ···, t_k} = t' < t = t_i. Then

\[ \sum_{x_{t_i} \in M_{t_i}} q_{t_1, \dots, t_i, \dots, t_k}(x_{t_1}, \dots, x_{t_i}, \dots, x_{t_k}) = \sum_{x_{t_i} \in M_{t_i}} \; \sum_{x_u \in M_u,\; u \in \{0, \dots, t\} - \{t_1, \dots, t_i, \dots, t_k\}} p_t(x_0, \dots, x_t) \]
\[ = \sum_{x_u \in M_u,\; u \in \{0, \dots, t\} - \{t_1, \dots, \hat{t}_i, \dots, t_k\}} p_t(x_0, \dots, x_t) = \sum_{x_u \in M_u,\; u \in \{0, \dots, t'\} - \{t_1, \dots, \hat{t}_i, \dots, t_k\}} \; \sum_{x_v \in M_v,\; v \in \{t'+1, \dots, t\}} p_t(x_0, \dots, x_t) \]
\[ = \sum_{x_u \in M_u,\; u \in \{0, \dots, t'\} - \{t_1, \dots, \hat{t}_i, \dots, t_k\}} p_{t'}(x_0, \dots, x_{t'}) = q_{t_1, \dots, \hat{t}_i, \dots, t_k}(x_{t_1}, \dots, \hat{x}_{t_i}, \dots, x_{t_k}). \]

Uniqueness: Suppose {Y_t}_{t ∈ N ∪ {0}} is another stochastic process satisfying the conditions of the lemma, with the p'_{t_1,···,t_k} as its joint measures. Then p'_{0,1,···,t} = p_t = p_{0,1,···,t} by assumption. Taking the appropriate sums on the two sides, we get p'_{t_1,···,t_k} = p_{t_1,···,t_k}. Now the uniqueness is a straightforward consequence of the Kolmogorov extension theorem.

Remark. Note that we did not impose positivity of the functions in this case.

Now we are ready to prove Theorem 3.3.1.

Proof
Existence: In Lemma 3.3.2, let

\[ p_0 = P_0, \]
\[ p_1 : M_0 \times M_1 \to \mathbb{R}, \qquad p_1(x_0, x_1) = p_0(x_0) P_1(x_0, x_1), \]
\[ \vdots \]
\[ p_n : M_0 \times M_1 \times \dots \times M_n \to \mathbb{R}, \qquad p_n(x_0, \dots, x_n) = p_{n-1}(x_0, \dots, x_{n-1}) P_n(x_0, \dots, x_n). \]
To see that the {p_i} satisfy the conditions of Lemma 3.3.2, note that

\[ \sum_{x_n \in M_n} p_n(x_0, \dots, x_n) = \sum_{x_n \in M_n} p_{n-1}(x_0, \dots, x_{n-1}) P_n(x_0, \dots, x_n) = p_{n-1}(x_0, \dots, x_{n-1}) \sum_{x_n \in M_n} P_n(x_0, \dots, x_n) = p_{n-1}(x_0, \dots, x_{n-1}). \]

Lemma 3.3.2 shows the existence of a stochastic process with joint distributions matching the p_i. Furthermore, the positivity of the {P_i} implies that of the {p_i}. Thus all the conditionals exist for such a process, and they match the P_i by the definition of the conditional probabilities.

Uniqueness: Any stochastic process satisfying the above conditions has joint distributions that match the {p_i}, and hence by the above lemma it is unique.

3.4 Characterizing density functions and rth-order Markov chains

The previous section saw discrete-time categorical processes represented in terms of conditional probability density functions. However, such densities on finite domains satisfy certain restrictions that can make modeling them difficult. That leads to the idea of linking them to unrestricted functions on R, in much the same spirit as a single probability can profitably be logit-transformed in logistic regression.

To begin, let X be a random variable with probability density p defined on a finite set M = {m_1, ···, m_n}. This section finds the class of all possible such p with p(m_i) > 0, i = 1, ···, n, given a fixed bijection g : R → R^+, for example g(x) = exp(x). The following theorem characterizes the relationship between p and g. While particular examples of the following theorem are commonly used in statistical modeling, we are not aware of a reference which contains this result or its proof in this generality.

Theorem 3.4.1 Let g : R → R^+ be a bijection. For every choice of probability density p on M = {m_1, ···, m_n}, n ≥ 2, there exists a unique function f : M − {m_1} → R such that

\[ p(m_1) = \frac{1}{1 + \sum_{y \in M - \{m_1\}} h(y)}, \tag{3.1} \]

\[ p(x) = \frac{h(x)}{1 + \sum_{y \in M - \{m_1\}} h(y)}, \qquad x \neq m_1, \tag{3.2} \]

where h = g ∘ f. Moreover, h(x) = p(x)/p(m_1). Conversely, for an arbitrary function f : M − {m_1} → R, the p defined above is a density function.

Proof
Existence: Suppose p : M → (0,1) is given. Let h(x) = p(x)/p(m_1) for x ≠ m_1, and let f : M − {m_1} → R, f(x) = g^{-1} ∘ h(x). Obviously h = g ∘ f. Moreover,

\[ \frac{1}{1 + \sum_{y \in M - \{m_1\}} h(y)} = \frac{1}{1 + \sum_{y \in M - \{m_1\}} p(y)/p(m_1)} = \frac{1}{1 + (1 - p(m_1))/p(m_1)} = p(m_1) \]

and

\[ \frac{h(x)}{1 + \sum_{y \in M - \{m_1\}} h(y)} = \frac{p(x)/p(m_1)}{1 + (1 - p(m_1))/p(m_1)} = p(x), \]

thereby establishing the validity of equations (3.1) and (3.2).

Uniqueness: Suppose that for f_1 and f_2 we get the same p. Let h_1 = g ∘ f_1 and h_2 = g ∘ f_2. Dividing (3.2) by (3.1) for h_1 and h_2, we get h_1(x) = p(x)/p(m_1) = h_2(x), hence g ∘ f_1 = g ∘ f_2. Since g is a bijection, f_1 = f_2.

Corollary 3.4.2 Fixing a bijection g and m_1 ∈ M, every density function corresponds to an arbitrary vector of length n − 1 over R.

Example. Consider the binomial distribution with a trials and probability of success π, and the transformation g(x) = exp(x). Then M = {0, 1, ···, a}. Let m_1 = 0; then for x ≠ 0,

\[ f(x) = g^{-1}(h(x)) = \log\{p(x)/p(0)\} = \log\left\{ \binom{a}{x} \pi^x (1-\pi)^{a-x} / (1-\pi)^a \right\} = \log\binom{a}{x} + x \log\{\pi/(1-\pi)\}. \]
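A quick numerical check of Theorem 3.4.1 with g(x) = exp(x), using the binomial example above: starting from the density p, we form f on M − {0}, then rebuild the density from f via equations (3.1) and (3.2) and recover p exactly. This is a small sketch for illustration only (Python with SciPy; the values of a and π are arbitrary):

import numpy as np
from scipy.stats import binom

a, pi = 5, 0.3
M = np.arange(a + 1)                     # state space {0, 1, ..., a}, with m1 = 0
p = binom.pmf(M, a, pi)                  # the target density

# f on M - {0}, using g = exp so that g^{-1} = log and h(x) = p(x)/p(0)
f = np.log(p[1:] / p[0])

# rebuild the density from f via (3.1) and (3.2)
h = np.exp(f)
denom = 1.0 + h.sum()
p_rebuilt = np.concatenate(([1.0 / denom], h / denom))

print(np.allclose(p, p_rebuilt))          # True: the correspondence is one-to-one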
Theorem 3.4.3 Fix a bijection g : R → R^+ and elements m_{n1} ∈ M_n. Let M_n, n = 0, 1, ···, be finite subsets of R with cardinality greater than or equal to 2, and let M'_n = M_n − {m_{n1}} for all n. Then every categorical stochastic process with positive joint distributions on the M_n, having initial density P_0 : M_0 → R and conditional probabilities P_n at stage n given the past, can be uniquely represented by means of unique functions

g_0 : M'_0 → R, ..., g_n : M_0 × ··· × M_{n-1} × M'_n → R, ...,

for n = 1, 2, ..., where

\[ P_0(m_{01}) = \frac{1}{1 + \sum_{y \in M_0 - \{m_{01}\}} h_0(y)}, \tag{3.3} \]

\[ P_0(x) = \frac{h_0(x)}{1 + \sum_{y \in M_0 - \{m_{01}\}} h_0(y)}, \qquad x \neq m_{01} \in M_0, \tag{3.4} \]

and h_0 = g ∘ g_0. Moreover, h_0(x) = P(X_0 = x)/P(X_0 = m_{01}). The conditional probabilities P_n are given by

\[ P_n(x_0, \dots, x_{n-1}, m_{n1}) = \frac{1}{1 + \sum_{y \in M_n - \{m_{n1}\}} h_n(x_0, \dots, x_{n-1}, y)}, \tag{3.5} \]

\[ P_n(x_0, \dots, x_{n-1}, x) = \frac{h_n(x_0, \dots, x_{n-1}, x)}{1 + \sum_{y \in M_n - \{m_{n1}\}} h_n(x_0, \dots, x_{n-1}, y)}, \qquad x \neq m_{n1} \in M_n, \tag{3.6} \]

where h_n = g ∘ g_n. Moreover,

\[ h_n(x_0, \dots, x_{n-1}, x) = \frac{P(X_n = x \mid X_{n-1} = x_{n-1}, \dots, X_0 = x_0)}{P(X_n = m_{n1} \mid X_{n-1} = x_{n-1}, \dots, X_0 = x_0)}. \]

Conversely, any collection of arbitrary functions g_0, g_1, ··· gives rise to a unique stochastic process via the above relations.

Proof The result is immediate from Theorems 3.3.1 and 3.4.1.

Remark. We can view the arbitrary functions g_0, ···, g_n on M'_0, M_0 × M'_1, ···, M_0 × ··· × M_{n-1} × M'_n as an arbitrary function g_0 on M'_0, arbitrary functions g_1(·, x_1), x_1 ≠ m_{11}, on M_0, and in general arbitrary functions g_n(·, x_n), x_n ≠ m_{n1}, on M_0 × ··· × M_{n-1}. As a check, we can count the free parameters of such a stochastic process on M_0, ···, M_n. We can specify such a process by c_0 c_1 ··· c_n − 1 parameters by specifying the joint distribution on M_0 × M_1 × ··· × M_n. If we specify the stochastic process using the above theorems and the g_i functions, we need (c_0 − 1) + c_0(c_1 − 1) + c_0 c_1 (c_2 − 1) + ··· + c_0 c_1 ··· c_{n-1}(c_n − 1), which is the same number after expanding the terms and canceling.

Remark. In the case of rth-order Markov chains, g_n(x_0, ···, x_n) only depends on the last r + 1 components for n > r.

Remark. In the case of homogeneous rth-order Markov chains, M_i = M_0 for all i. Fix m_0 ∈ M_0 and suppose |M_0| = c_0. We only need to specify g_0 to g_r, which are completely arbitrary functions: g_0 on M'_0, g_1 on M_0 × M'_1, and so on up to g_r on M_0 × ··· × M_{r-1} × M'_r. This also shows that every homogeneous Markov chain of order at most r is characterized by (c_0 − 1) Σ_{i=0}^{r} c_0^i = c_0^{r+1} − 1 elements of R. We could also have counted all such Markov chains by noting that they are uniquely represented by the joint probability density p_r on M_0^{r+1}, which has c_0^{r+1} − 1 free parameters (since it has to sum to 1).

To describe processes using Markov chains, we need to find appropriate parametric forms. We investigate the generality of these forms in the following section and use the concept of partial likelihood to estimate them. We find appropriate parametric representations of the g_n, which are functions of n + 1 finite variables. In the next section we study the properties of such functions. We call a variable "finite" if it only takes values in a finite subset of R.

3.5 Functions of r variables on a finite domain

This section studies the properties of functions of r variables with finite domains. First, we present a result of Besag [6], who studied such functions in the context of Markov random fields. However, the statement of the result in his paper is inaccurate and, moreover, it gives no rigorous proof of the result. We present a rigorous statement and proof of the result and a generalization of Besag's theorem.

3.5.1 First representation theorem

This subsection presents a corrected version of a theorem stated by Besag in [6] and a constructive proof.
Then we generalize this theorem and apply it to stationary binary Markov chains to get a parametric representation.

Theorem 3.5.1 Suppose f : ∏_{i=1,···,r} M_i → R, with the M_i finite, |M_i| = c_i and 0 ∈ M_i for all i, 1 ≤ i ≤ r. Let M'_i = M_i − {0}. Then there exists a unique family of functions

\[ \{ G_{i_1, \dots, i_k} : M'_{i_1} \times M'_{i_2} \times \dots \times M'_{i_k} \to \mathbb{R},\; 1 \le k \le r,\; 1 \le i_1 < i_2 < \dots < i_k \le r \}, \]

such that

\[ f(x_1, \dots, x_r) = f(0, \dots, 0) + \sum_{i=1}^{r} x_i G_i(x_i) + \dots + \sum_{1 \le i_1 < \dots < i_k \le r} (x_{i_1} \cdots x_{i_k}) G_{i_1, \dots, i_k}(x_{i_1}, \dots, x_{i_k}) + \dots + (x_1 x_2 \cdots x_r) G_{12 \cdots r}(x_1, \dots, x_r). \]

Remark. In [6], Besag claims that the {G_{i_1,···,i_k} : M_{i_1} × M_{i_2} × ··· × M_{i_k} → R} (without removing one element from each set) are unique.

Proof Denote by I_A the indicator function of a set A, and let

\[ N_k = \Big\{ (x_1, \dots, x_r) : \sum_{i=1}^{r} I_{M'_i}(x_i) \le k \Big\}, \]

the set of points with at most k nonzero coordinates.

Existence: The proof is by induction. For i = 1, ···, r, define G_i : M'_i → R,

\[ G_i(x_i) = \frac{f(0, \dots, 0, x_i, 0, \dots, 0) - f(0, \dots, 0)}{x_i}, \]

where x_i is the ith coordinate. Then let f_1(x_1, ···, x_r) = f(0, ···, 0) + Σ_{i=1}^{r} x_i G_i(x_i). Note that f_1 = f on N_1.

Next define G_{i_1, i_2} : M'_{i_1} × M'_{i_2} → R by

\[ G_{i_1, i_2}(x_{i_1}, x_{i_2}) = \frac{f(0, \dots, 0, x_{i_1}, 0, \dots, 0, x_{i_2}, 0, \dots, 0) - f_1(0, \dots, 0, x_{i_1}, 0, \dots, 0, x_{i_2}, 0, \dots, 0)}{x_{i_1} x_{i_2}}, \]

where x_{i_1} and x_{i_2} are the i_1th and i_2th coordinates, respectively. Using the {G_{i_1,i_2}}, we can define f_2 by

\[ f_2(x_1, \dots, x_r) = f(0, \dots, 0) + \sum_{i=1}^{r} x_i G_i(x_i) + \sum_{1 \le i_1 < i_2 \le r} x_{i_1} x_{i_2} G_{i_1, i_2}(x_{i_1}, x_{i_2}), \]

or equivalently,

\[ f_2(x_1, \dots, x_r) = f_1(x_1, \dots, x_r) + \sum_{1 \le i_1 < i_2 \le r} x_{i_1} x_{i_2} G_{i_1, i_2}(x_{i_1}, x_{i_2}). \]

It is easy to see that f_2 = f on N_2.

In general, suppose we have defined G_{i_1,···,i_{k-1}} and f_{k-1}; let

\[ G_{i_1, \dots, i_k}(x_{i_1}, \dots, x_{i_k}) = \frac{f(0, \dots, 0, x_{i_1}, 0, \dots, 0, x_{i_k}, 0, \dots, 0) - f_{k-1}(0, \dots, 0, x_{i_1}, 0, \dots, 0, x_{i_k}, 0, \dots, 0)}{x_{i_1} \cdots x_{i_k}}, \]

for (x_{i_1}, ···, x_{i_k}) ∈ M'_{i_1} × ··· × M'_{i_k}. Also let

\[ f_k(x_1, \dots, x_r) = f_{k-1}(x_1, \dots, x_r) + \sum_{1 \le i_1 < \dots < i_k \le r} x_{i_1} \cdots x_{i_k} G_{i_1, \dots, i_k}(x_{i_1}, \dots, x_{i_k}). \]

We claim f = f_k on N_k. To see this, fix x = (x_1, ···, x_r). If x has fewer than k nonzero elements, the second term in the above expansion is zero and

\[ f_k(x_1, \dots, x_r) = f_{k-1}(x_1, \dots, x_r) = f(x_1, \dots, x_r), \]

by the induction hypothesis, and we are done. However, if x has exactly k nonzero elements, say

\[ x = (x_1, \dots, x_r) = (0, \dots, 0, x_{j_1}, 0, \dots, 0, x_{j_k}, 0, \dots), \]

then

\[ \sum_{1 \le i_1 < \dots < i_k \le r} x_{i_1} \cdots x_{i_k} G_{i_1, \dots, i_k}(x_{i_1}, \dots, x_{i_k}) = x_{j_1} \cdots x_{j_k} G_{j_1, \dots, j_k}(x_{j_1}, \dots, x_{j_k}). \]

Hence

\[ f_k(x_1, \dots, x_r) = f_{k-1}(x_1, \dots, x_r) + x_{j_1} \cdots x_{j_k} \, \frac{f(\dots, 0, x_{j_1}, 0, \dots, 0, x_{j_k}, 0, \dots) - f_{k-1}(\dots, 0, x_{j_1}, 0, \dots, 0, x_{j_k}, 0, \dots)}{x_{j_1} \cdots x_{j_k}} = f(x_1, \dots, x_r). \]

By induction, f = f_r on N_r = ∏_{i=1,···,r} M_i. Hence the family of functions satisfies the conditions.

Uniqueness: To prove uniqueness, suppose

\[ \{ G_{i_1, \dots, i_k} : M'_{i_1} \times \dots \times M'_{i_k} \to \mathbb{R},\; 1 \le k \le r,\; 1 \le i_1 < \dots < i_k \le r \} \]

and

\[ \{ H_{i_1, \dots, i_k} : M'_{i_1} \times \dots \times M'_{i_k} \to \mathbb{R},\; 1 \le k \le r,\; 1 \le i_1 < \dots < i_k \le r \} \]

are two families of functions satisfying the equation. Also let f^G_k and f^H_k be the partial-sum functions, as defined above, corresponding to the two families. We need to show G_{i_1,···,i_k} = H_{i_1,···,i_k} on M'_{i_1} × ··· × M'_{i_k}. We use induction on k. It is easy to verify the result for the case k = 1. Now suppose x = (x_{i_1}, ···, x_{i_k}) ∈ M'_{i_1} × M'_{i_2} × ··· × M'_{i_k}. Then by definition

\[ G_{i_1, \dots, i_k}(x_{i_1}, \dots, x_{i_k}) = \frac{f(0, \dots, 0, x_{i_1}, 0, \dots, 0, x_{i_k}, 0, \dots, 0) - f^G_{k-1}(0, \dots, 0, x_{i_1}, 0, \dots, 0, x_{i_k}, 0, \dots, 0)}{x_{i_1} \cdots x_{i_k}} \]

and

\[ H_{i_1, \dots, i_k}(x_{i_1}, \dots, x_{i_k}) = \frac{f(0, \dots, 0, x_{i_1}, 0, \dots, 0, x_{i_k}, 0, \dots, 0) - f^H_{k-1}(0, \dots, 0, x_{i_1}, 0, \dots, 0, x_{i_k}, 0, \dots, 0)}{x_{i_1} \cdots x_{i_k}}. \]

But by the induction hypothesis f^G_{k-1} = f^H_{k-1}, and hence we are done.
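A small numerical check of the constructive proof above: for a randomly chosen f on M_1 × M_2 with M_1 = M_2 = {0, 1, 2}, we compute f(0,0), the G_i and G_{12} exactly as in the proof and verify that the expansion reproduces f. This sketch (Python) only illustrates the bookkeeping; the state sets and the random f are arbitrary choices of ours.

import numpy as np
import itertools

M = [0, 1, 2]                       # M_1 = M_2 = {0, 1, 2}, with 0 in each
rng = np.random.default_rng(0)
f = {x: rng.normal() for x in itertools.product(M, M)}

# main-effect functions G_1, G_2 on M'_i = {1, 2}
G1 = {x1: (f[(x1, 0)] - f[(0, 0)]) / x1 for x1 in (1, 2)}
G2 = {x2: (f[(0, x2)] - f[(0, 0)]) / x2 for x2 in (1, 2)}

def f1(x1, x2):                     # first-order partial sum
    return f[(0, 0)] + (x1 * G1[x1] if x1 else 0.0) + (x2 * G2[x2] if x2 else 0.0)

# interaction function G_12 on M'_1 x M'_2
G12 = {(x1, x2): (f[(x1, x2)] - f1(x1, x2)) / (x1 * x2)
       for x1 in (1, 2) for x2 in (1, 2)}

def expansion(x1, x2):
    out = f1(x1, x2)
    if x1 and x2:
        out += x1 * x2 * G12[(x1, x2)]
    return out

print(all(np.isclose(expansion(*x), f[x]) for x in f))   # True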
Functions of r variables on a finite domainBut by induction hypothesis fGk−1 = fHk−1. Hence we are done.We can thinkof thisrepresentation of f as an expansion around(0,··· ,0).However, (0,··· ,0) has no intrinsic role and we can generalize the abovetheorem as follows.Theorem 3.5.2 Suppose, f : M = producttexti=1,···,r Mi → R, Mi being finite and|Mi| = ci. For any fixed (µ1,··· ,µr) ∈ M, let M′i = Mi −{µi}. Then thereexist unique functions{Hi1,···,ik : M′i1×M′i2×···×M′ik → R, 1 ≤ k ≤ r, 1 ≤ i1 < i2 < ··· < ik ≤ r},such thatf(x1,··· ,xr) = f(µ1,··· ,µr) +rsummationdisplayi=1(xi −µi)Hi(xi)+···+summationdisplay1≤i1<i2<···<ik≤r(xi1 −µi1)···(xik −µik)Hi1,···,ik(xi1,··· ,xik)+···+ (x1 −µ1)(x2 −µ2)···(xr −µr)H12···r(x1,··· ,xr).Proof Let Ni = Mi−µi (meaning that we subtract µi from all elements ofMi) so that Ni and Mi have the same cardinality. Also let N =producttexti=1,···,r Niand N′i = Ni −{0}. Then define a bijective mappingφi : Ni → Mi,φi(xi) = xi +µi.Thiswillinducea bijective mappingΦ between N andM that takes (0,··· ,0)to (µ1,··· ,µr). Now consider f ◦ Φ : producttexti=1,···,r Ni → R. By the previoustheorem, unique functions{Gi1,···,ik : N′i1×N′i2×···×N′ik → R, 1 ≤ k ≤ r, 1 ≤ i1 < i2 < ··· < ik ≤ r}exist such thatf ◦Φ(x1,··· ,xr) = f ◦Φ(0,··· ,0) +rsummationdisplayi=1xiGi(xi)+···+summationdisplay1≤i1<i2<···<ik≤rxi1 ···xikGi1,···,ik(xi1,··· ,xik) +···+x1x2 ···xrG12···r(x1,··· ,xr).773.5. Functions of r variables on a finite domainHence,f(φ1(x1),··· ,φr(xr)) = f(φ1(0),··· ,φr(0)) +rsummationdisplayi=1xiGi(xi) +···+summationdisplay1≤i1<i2<···<ik≤rxi1 ···xikGi1,···,ik(xi1,··· ,xik) +···+x1x2 ···xrG12···r(x1,··· ,xr).We conclude,f(x1 +µ1,··· ,xr +µr) = f(µ1,··· ,µr) +rsummationdisplayi=1xiGi(xi) +···+summationdisplay1≤i1<i2<···<ik≤rxi1 ···xikGi1,···,ik(xi1,··· ,xik) +···+x1x2···xrG12···r(x1,··· ,xr).This givesf(x1,··· ,xr) = f(µ1,··· ,µr) +rsummationdisplayi=1(xi −µ1)Gi(xi −µi) +···+summationdisplay1≤i1<i2<···<ik≤r(xi1 −µi1)···(xik −µik)Gi1,···,ik(xi1 −µi1,··· ,xik −µik)+···+ (x1 −µ1)(x2 −µ2)···(xr −µr)G12···r(x1 −µ1,··· ,xr −µr).To prove the existence, letHi1,···,ik(xi1,··· ,xik) = Gi1,···,ik(xi1 −µi1,··· ,xik −µik).The uniqueness can be obtained as in the previous theorem.We call this expression the Besag expansion around (µ1,··· ,µr).Corollary 3.5.3 In the case of binary {0,1} variables, the G functionsare simply real numbers, since M′i1 × ··· × M′ik has exactly one element:(1,··· ,1). Hence, we have found a linear representation of f in terms ofthe xi1 ···xik.Corollary 3.5.4 Suppose that {Xt} is an rth–order Markov chain, Xt tak-ing values in Mt = {0,1} and the conditional probabilityP(Xt = 1|Xt−1,··· ,X0),783.5. Functions of r variables on a finite domainis well-defined and in (0,1). Let g : R → R+ be a given bijective transfor-mation. Thengt(xt−1,··· ,x0) = g−1{P(Xt = 1|Xt−1 = xt−1,··· ,X0 = x0)P(Xt = 0|Xt−1 = xt−1,··· ,X0 = x0)},is a function of t variables, (xt−1,··· ,x0), for t < r and is a function of rvariables, (xt−1,··· ,xt−r), for t > r. 
Hence there exist unique parametersαt0, {αti1,···,it}1≤i1,···,it≤t for t < r and αt0,{αti1,···,ir}1≤i1,···,ir≤r for t ≥ r suchthatfor t < r:g−1{P(Xt = 1|Xt−1,··· ,X0)P(Xt = 0|Xt−1,··· ,X0)} =αt0 +tsummationdisplayi=1Xt−iαti +···+summationdisplay1≤i1<i2<···<ik≤tαti1,···,ikXt−i1 ···Xt−ik +···+αt12···tXt−1Xt−2 ···X0.and for t ≥ r:g−1{P(Xt = 1|Xt−1,··· ,X0)P(Xt = 0|Xt−1,··· ,X0)} =αt0 +rsummationdisplayi=1Xt−iαti +···+summationdisplay1≤i1<i2<···<ik≤rαti1,···,ikXt−i1 ···Xt−ik +···+αt12···rXt−1Xt−2 ···Xt−r.Moreover, given any collection of parameters, αt0, {αti1,···,it}1≤i1,···,it≤t fort < r and αt0,{αti1,···,ir}1≤i1,···,ir≤r for t ≥ r a unique stochastic process(upto distribution) is specified using the above relations.In the case of homogenous Markov chains the αt0, αti1,···,ik do not depend ont for t > r.The above corollary shows that the conditional probability of a Markov chainafter an appropriate transformation can be uniquely represented as a linearcombination of monomial products of previous states.793.5. Functions of r variables on a finite domainOne might conjecture that the same result holds for all categorical–valued Markov chains (with a finite number of states) using the above the-orem. This is not true in general since the {Gi1,···,ik} are functions. In thenext section, we prove another representation theorem which paves the wayfor the categorical case. As it turns out, we need more terms in order towrite down the transformed conditional probability as a linear combinationof past processes.3.5.2 Second representation theoremIn this section, we prove a new representation theorem for functions of rfinite variables. We start with the trivial finite–valued one–variable functionand then extend the result to r–variable functions. The proof for the generalcase is non–trivial and is done again by induction.Lemma 3.5.5 Suppose f : M → R, M ⊂ R being finite of cardinality c.Let d = c−1. Then f has a unique representation of the formf(x) =summationdisplay0≤i≤dαixi, ∀x ∈ M.Remark. The lemma states that, if we consider the vector space V ={f : M → R}, then the monomial functions {pi}0≤i≤d, where pi : M →R, pi(x) = xi form a basis for V.Proof First note that the dimension of V is c. To show this, supposeM = {m1,··· ,mc} and consider the following isomorphism of vector spaces,I : V → Rcf mapsto→ (f(m1),··· ,f(mc)).It only remains to show that {pi}0≤i≤d is an independent set. To prove thissuppose, summationdisplay0≤i≤dαixi = 0, ∀x ∈ M.That would mean that the d–th degree polynomial p(x) =summationtext0≤i≤d αixi hasat least c = d+ 1 disjoint roots which is greater than its degree. This con-tradicts the fundamental theorem of algebra.803.5. Functions of r variables on a finite domainTheorem 3.5.6 (Categorical Expansion Theorem) Suppose Mi is a finitesubset of R with |Mi| = ci, i = 1,2,··· ,r. Let di = ci−1,M =producttexti=1,···,r Miand consider the vector space of functions over R, V = {f : M → R} withthe function addition as the addition operation of the vector space and thescalar product of a real number to the function as the scalar product of thevector space. 
Then this vector space is of dimension C = producttexti=1,···,r ci and{xi11 ···xirr }0≤i1≤d1,···,0≤ir≤dr forms a basis for it.Proof To show that the dimension of the vector space is C, supposeM = {m1,··· ,mc} and consider following the isomorphism of vector spaces:I : V → RC,f mapsto→ (f(m1),··· ,f(mC)).To show that {xi11 ···xirr }0≤i1≤di,···,0≤ir≤dr forms a basis, we only need toshow that it is an independent collection since there are exactly C elementsin it. We proceed by induction on r. The case r = 1 was shown in the abovelemma. Suppose we have shown the result for r − 1 and we want to showit for r. Assume a linear combination of the basis is equal to zero. We canarrange the terms based on powers of xr:p0(x1,··· ,xr−1)+xrp1(x1,··· ,xr−1)+···+xdrr pd(x1,··· ,xr−1) = 0, (3.7)∀(x1,··· ,xr) ∈ M1 ×···×Mr.Fix the values of x′1,··· ,x′r−1 ∈ M1 ×···×Mr−1. Then Equation (3.7) iszero for cr values of xr. Hence by Lemma 3.5.5, all the coefficients:p0(x′1,··· ,x′r−1),p1(x′1,··· ,x′r−1),··· ,pd(x′1,··· ,x′r−1),are zero and we conclude:p0(x1,··· ,xr−1) = 0,p1(x1,··· ,xr−1) = 0,··· ,pd(x1,··· ,xr−1) = 0,∀(x1,··· ,xr−1) ∈ M1 ×···×Mr−1.Again by the induction assumption all the coefficients in these polynomialsare zero. Hence, all the coefficients in the original linear combination inEquation (3.7) are zero.813.5. Functions of r variables on a finite domainCorollary 3.5.7 Suppose Xt is a categorical stochastic process, where Xttakes values in Mt, |Mt| = ct = dt+1 < ∞. Also assume that the conditionalprobabilityP(Xt = xt|Xt−1 = xt−1,··· ,X0 = x0),is well–defined and in (0,1). Fix m1t ∈ Mt. Let g : R → R+ be a bijectivetransformation, then there are unique parameters{αti0,···,it}t∈N,0≤i0≤dt−1,0≤i1≤dt−1,0≤i2≤dt−2,···,0≤it≤d0,such thatP(Xt = xt|Xt−1 = xt−1,··· ,X0 = x0) = Pt(x0,··· ,xt),wherePt(x0,··· ,xt−1,mt1) = 11 +summationtexty∈M−{mt1} ht(y), (3.8)Pt(x0,··· ,xt−1,x) = h(x)1 +summationtexty∈M−{m1} ht(y),x negationslash= mt1 ∈ Mt, (3.9)for ht(x0,··· ,xt) = g ◦gt(x0,··· ,xt−1,xt) andgt(x0,··· ,xt−1,xt) =summationdisplay0≤i0≤dt−1,0≤i1≤dt−1,···,0≤it≤d0αti0,···,itxi0t−0 ···xitt−t,(x0,··· ,xt) ∈ M0 ×···×Mt−1 ×M′t.On the other hand any set of arbitrary parameters αti0,···,it gives rise to aunique stochastic process with the above equations.Corollary 3.5.8 Suppose that {Xt} is an rth–order Markov chain where Xttakes values in Mt a finite subset of real numbers, |Mt| = ct = dt + 1 < ∞,the conditional probabilityP(Xt = xt|Xt−1 = xt−1,··· ,X0 = x0),is well–defined and belongs to (0,1). Fix m1t ∈ Mt, let M′t = Mt−{m1t} andsuppose g : R → R+ is a given bijective transformation. Thengt(xt,··· ,x0) = g−1{ P(Xt = xt|Xt−1 = xt−1,··· ,X0 = x0)P(Xt = m1t|Xt−1 = xt−1,··· ,X0 = x0)},823.5. Functions of r variables on a finite domainis a function of t + 1 variables for t < r, (xt,··· ,x0) and is a function ofr + 1 variables,(xt,··· ,xt−r), for t > r. 
Hence there exist parameters{αti0,···,it}0≤i0≤dt−1,0≤i1≤dt−1,···,0≤it≤d0, for t < rand{αti0,···,ir}0≤i0≤dt−1,0≤i1≤dt−1,···,0≤ir≤dt−r, for t ≥ rsuch that for t < r:g−1{ P(Xt = xt|Xt−1 = xt−1,··· ,X0 = x0)P(Xt = m1t|Xt−1 = xt−1,··· ,X0 = x0)} =summationdisplay0≤i0≤dt−1,0≤i1≤dt−1,···,0≤it≤d0αti0,···,itxi0t−0 ···xitt−t,(x0,··· ,xt) ∈ M0 ×···Mt−1 ×M′t,and for t ≥ r:g−1{ P(Xt = xt|Xt−1 = xt−1,··· ,X0 = x0)P(Xt = m1t|Xt−1 = xt−1,··· ,X0 = x0)} =summationdisplay0≤i0≤dt−1,0≤i1≤dt−1,···,0≤ir≤dt−rαti0,···,irxi0t−0 ···xirt−r(x0,··· ,xt) ∈ M0 ×···Mt−1 ×M′t.Moreover any collection of arbitrary parameters{αti0,···,it}0≤i0≤dt−1,0≤i1≤dt−1,···,0≤it≤d0, for t < r,and{αti0,···,ir}0≤i0≤dt−1,0≤i1≤dt−1,···,0≤ir≤dt−r, for t ≥ r,specify a unique stochastic process (upto distribution) by the above relations.In the case of homogenous Markov chains the αti1,···,ir do not depend on tfor t > r.One might question the usefulness of such a representation. After all wehave exactly as many parameters in the model as the values of the originalfunction. In the following, we explain the importance of linear representa-tions of such functions.833.5. Functions of r variables on a finite domain1. A vast amount of theory has been developed to deal with linear mod-els. Generalized linear models in the case of independent sequence ofrandom variables is a powerful tool. As we will see in sequel, theseideas can be imported into time series using the concept of partiallikelihood.2. Although we have as many parameters in the model as the values of theoriginal function, the representation gives us a convenient frameworkfor modeling, in particular for making various model reductions byomitting some terms or assuming certain coefficients are equal.3. Although this is a representation for stationary rth–order Markovchains (or representation for arbitrary locally rth–order chains at timet), this representation allows us to accommodate other explanatoryvariables simply as additive linear terms and extend the model tonon–stationary cases. This cannot be done in the same way if we tryto model the original values of the function.Example As an example consider a categorical response variable Y and rcategorical explanatory variablesX1,··· ,Xr,are given. Suppose the Xi takes values in the Mi which include 0. Ourpurpose is to model Y based on X1,··· ,Xr. In order to do that, we considerthe conditional probabilityP(Y = y|X1 = x1,··· ,Xr = xr).Again, we assume that the conditional probability is well-defined everywhereand takes values in (0,1). The above theorem shows that after applying atransformation the conditional probability can be written as a linear com-bination of multiples of powers of the Xi.Although, the theorem above shows the form of the conditional prob-ability in general and paves the way to the estimation of the conditionalprobabilities by estimating the parameters, the large number of parametersmakes this a challenging task which might be impractical in some cases. Inthe next section, we introduce some classes of r variable functions that canbe useful for some applications.843.5. Functions of r variables on a finite domain3.5.3 Special cases of functions of r finite variablesThe first class of functions we introduce are obtained by power restrictions.We simply assume that gt can be represented only by powers less than k.Suppose Xt takes values in 0,1,··· ,ct − 1. 
Then for a k-restricted powerstationary rth–order Markov chain, the gt, t > r is given by:summationdisplay0≤i1≤d1,···,0≤ir≤dr,summationtextj ij≤kαi1,···,irXi1t−1 ···Xirt−r.In particular, we can let k = 1 and getβ0 +summationdisplayiβiXt−i.This is useful especially for binary Markov chains.The second class of functions are useful in the case when relationshipsexist between the states in terms of a semi–metric d. Suppose {Xt} is anrth–order Markov chain and Xt takes values in the same finite set M ={1,··· ,m}. Also letd : M ×M → R,be a semi–metric being a mapping on M that satisfies the following condi-tions:d ≥ 0;d(x,z) ≤ d(x,y) +d(y,z);d(x,x) = 0.Then we introduce the following model:g−1{P(Xt = j|Xt−1,··· ,Xt−r)P(Xt = 1|Xt−1,··· ,Xt−r)} = α0,j +ksummationdisplayi=1αi,jd(j,Xt−i)for j = 2,··· ,m. For this modelP(Xt = 1|Xt−1,··· ,Xt−r) = 1−summationdisplayj=2,···,mP(Xt = j|Xt−1,··· ,Xt−r).Finally, we introduce a simple class for the binary Markov chain of orderr. For any bijective transformation g : R→ R+g−1{P(Xt = 1|Xt−1,··· ,Xt−r)P(Xt = 0|Xt−1,··· ,Xt−r)} = α0 +α1Nt−1,853.6. Generalized linear models for time serieswhere Nt−1 = summationtextrj=1 Xt−j. For example in the 0-1 precipitation processexample seen in the Introduction, Nt−1 counts the number of the days outof r days before today that had some precipitation.3.6 Generalized linear models for time seriesGeneralized linear models were developed to extend ordinary linear regres-sion to the case that the response is not normal. However, that extensionrequired the assumption of independently observed responses. The notionof partial likelihood was introduced to generalize these ideas to time serieswhere the data are dependent. What follows in this section is a summaryof the first chapter in Kedem and Fokianos [27], which we have included forcompleteness.Definition Let Ft, t = 1,2,··· be an increasing sequence of σ–fields, F0 ⊂F1 ⊂ F2,··· and let Y1,Y2,··· be a sequence of random variables such thatYt is Ft–measurable. Denote the density of Yt, given Ft, by ft(yt;θ), whereθ ∈ Rp is a fixed parameter. The partial likelihood (PL) is given byPL(θ;y1,··· ,yN) =Nproductdisplayt=1ft(yt;θ).Example As an example, suppose Yt represents the 0-1 PN process inCalgary, while MTt denotes the maximum daily temperature process. Wecan define Ft as follows:1. Ft = σ{Yt−1,Yt−2,···}. In this case, we are assuming the informationavailable to us is the value of the process on each of the previous days.2. Ft = σ{Yt−1,Yt−2,··· MTt−1,MTt−2,···}. In this case, we are assum-ing we have all the information regarding the 0-1 process of precipita-tion and maximum temperature for previous days.3. Ft = σ{Yt−1,Yt−2,··· MTt,MTt−1,MTt−2,···}. In this case, we addto the information in 2 the knowledge of today’s maximum tempera-ture.The vector θ that maximizes the above equation is called the maximumpartial likelihood (MPLE). Wong [48] has studied its properties. Its con-sistency, asymptotic normality and efficiency can be shown under certainregularity conditions.863.6. 
Generalized linear models for time seriesIn thisreport, we aremainly interested in the case: Ft = σ{Yt−1,Yt−2,···}.We assume that the information Ft is given as a vector of random variablesand denote it by Zt, which we call the covariate process:Zt = (Zt1,··· ,Ztp)′.Zt might also include the past values of responses Yt−1,Yt−2,···.Let µt = E[Yt|Ft−1], be the conditional expectation of the response giventhe information we have up to the time t.Kedem and Fokianos in [27] address time series following generalizedlinear models satisfying certain conditions about the so-called random andsystematic components:• Random components: For t = 1,2,··· ,Nf(yt;θt,φ|Ft−1) = exp{ytθt −b(θt)at(φ)+c(yt;φ)}.• The parametric function αt(φ) is of the form φ/wt, where φ is thedispersion parameter, and wt is a known parameter called “weightparameter”. The parameter θt is called the natural parameter.• Systematic components: For t = 1,2,··· ,N,g(µt) = ηt =psummationdisplayj=1βjZ(t−1)j = Z′t−1β,for some known monotone function g called the link function.Example Binary time series: As an example consider {Yt}, a binary timeseries. Let us denote by πt the probability of success given Ft−1. Then fort = 1,2,··· ,N,f(yt;θt,φ|Ft−1) = exp(yt log( πt1−πt)+ log(1−πt))with E[Yt|Ft−1] = πt, b(θt) = −log(1−πt) = log(1 + exp(θt)), V(πt) =πt(1−πt), φ = 1, and wt = 1.The canonical link gives rise to the so–called “logistic model”:g(πt) = θt(πt) = log( πt1−πt) = ηt = Z′t−1β.873.6. Generalized linear models for time seriesIn the notation of Corollary 3.5.4, Yt = Xt, πt = P(Xt = 1|Xt−1,··· ,Xt−r)and Z′t−1 = (1,Xt−1,··· ,Xt−r,Xt−1Xt−2,··· ,Xt−1 ···Xt−r). We can alsoconsider other covariate processes such as Z′t−1 = (1,Xt−1,··· ,Xt−r) andso on.In order to study the asymptotic behavior of the maximum likelihoodestimator, we consider the conditional information matrix. To establishlarge sample properties, the stability of the conditional information matrixand the central limit theorem for martingales are required. Proofs may befound in Kedem and Fokianos [27].Inference for partial likelihoodThe definitions of partial likelihood and exponential family of distributionsimply that the log partial likelihood is given byl(β) =Nsummationdisplayt=1logf(yt;θt,φ|Ft−1) =Nsummationdisplayt=1{ytθt −b(θt)αt(φ)+c(yt,φ)} =Nsummationdisplayt=1{ytu(z′t−1β)−b(u(z′t−1))αt(φ) +c(yt,φ)} =Nsummationdisplayt=1lt,where u(.) = (g◦µ(.))−1 = µ−1(g−1(.)), so that θt = u(zt−1β). We introducethe notation,▽ = ( ∂∂β1,··· , ∂∂βp)′and call ▽l(β) the partial score. To compute the gradient, we can use thechain rule in the following manner∂lt∂βj =∂lt∂βj∂θt∂µt∂µt∂ηt∂ηt∂βj.Some algebra showsSN(β) = ▽l(β) =Nsummationdisplayt=1Z(t−1)∂µt∂ηtYt −µt(β)σ2t(β) ,where, σ2t(β) = Var[Yt|Ft−1]. The partial score process is defined from thepartial sums asSt(β) = ▽l(β) =tsummationdisplays=1Z(s−1)∂µs∂ηsYs −µs(β)σ2s(β) .883.6. 
One can show the terms in the above sums to be orthogonal:
$$E\left[Z_{t-1}\frac{\partial \mu_t}{\partial \eta_t}\frac{Y_t - \mu_t(\beta)}{\sigma_t^2(\beta)}\; Z_{s-1}\frac{\partial \mu_s}{\partial \eta_s}\frac{Y_s - \mu_s(\beta)}{\sigma_s^2(\beta)}\right] = 0, \quad s < t.$$
Also, $E[S_N(\beta)] = 0$.

The cumulative information matrix is defined by
$$G_N(\beta) = \sum_{t=1}^{N} \mathrm{Cov}\left[Z_{t-1}\frac{\partial \mu_t}{\partial \eta_t}\frac{Y_t - \mu_t(\beta)}{\sigma_t^2(\beta)} \,\Big|\, \mathcal{F}_{t-1}\right].$$
The unconditional information matrix is simply $\mathrm{Cov}(S_N(\beta)) = F_N(\beta) = E[G_N(\beta)]$. Next let $H_N(\beta) = -\nabla\nabla' l(\beta)$. Kedem and Fokianos [27] show that $H_N(\beta) = G_N(\beta) - R_N(\beta)$, where
$$R_N(\beta) = \frac{1}{\alpha_t(\phi)}\sum_{t=1}^{N} Z_{t-1}\, d_t(\beta)\, Z_{t-1}'\,(Y_t - \mu_t(\beta)),$$
and $d_t(\beta) = \partial^2 u(\eta_t)/\partial \eta_t^2$. The score process $S_t$ satisfies the martingale property
$$E[S_{t+1}(\beta) \mid \mathcal{F}_t] = S_t(\beta).$$
To prove the consistency and other properties of the estimators, we need:

Assumption A:
A1. The true parameter $\beta$ belongs to an open set $B \subset \mathbb{R}^p$.
A2. The covariate vector $Z_t$ almost surely lies in a nonrandom compact set $\Gamma$ of $\mathbb{R}^p$, such that $P[\sum_{t=1}^{N} Z_{t-1}Z_{t-1}' > 0] = 1$. In addition, $Z_{t-1}'\beta$ lies almost surely in the domain $H$ of the inverse link function $h = g^{-1}$ for all $Z_{t-1} \in \Gamma$ and $\beta \in B$.
A3. The inverse link function $h$, defined in (A2), is twice continuously differentiable and $|\partial h(\lambda)/\partial\lambda| \neq 0$.
A4. There is a probability measure $\nu$ on $\mathbb{R}^p$ such that $\int_{\mathbb{R}^p} zz'\,\nu(dz)$ is positive definite, and such that for Borel sets $A \subset \mathbb{R}^p$,
$$\frac{1}{N}\sum_{t=1}^{N} I_{[Z_{t-1}\in A]} \to \nu(A).$$

Theorem 3.6.1 Under Assumption A the maximum likelihood estimator is almost surely unique for all sufficiently large $N$, and
1. the estimator is consistent and asymptotically normal,
$$\hat{\beta} \xrightarrow{p} \beta$$
in probability, and
$$\sqrt{N}(\hat{\beta} - \beta) \xrightarrow{d} N_p(0, G^{-1}(\beta))$$
in distribution as $N \to \infty$, for some matrix $G$;
2. the following limit holds in probability, as $N \to \infty$:
$$\sqrt{N}(\hat{\beta} - \beta) - \frac{1}{\sqrt{N}} G^{-1}(\beta) S_N(\beta) \xrightarrow{p} 0.$$

We follow Kedem and Fokianos [27], who used similar models, in assuming the above conditions for our models. However, we conjecture that the above assumptions hold for the partial likelihood of stationary rth–order Markov chains (with strictly positive joint distribution) in terms of our parametric linear form, at least for the binary case. In fact, assumptions A1 to A3 are easy to check and only A4 poses some challenge. We leave this for future research and use several simulation studies to check the consistency of the estimators in the next section as well as in Chapter 4 and Chapter 10. For more discussion regarding the assumptions and consistency see [27].

3.7 Simulation studies

This section presents the results of some simulation studies of the partial likelihood applied to categorical rth–order Markov chains. We also investigate the performance of the BIC in picking the appropriate ("true") model. In particular, we generate samples from a seasonal Markov chain $Y_t$ where
$$Z_{t-1} = (1, Y_{t-1}, \cos(\omega t)), \quad \omega = \frac{2\pi}{366}.$$
We consider this Markov chain over 5 years from 2000 to 2005 and assume
$$\mathrm{logit}\{P(Y_t = 1 \mid Z_{t-1})\} = \beta' Z_{t-1},$$
where $\beta = (-1, 1, -0.5)$.

To generate samples for this chain, we need an initial value of the past two states, which we take to be (1, 1). We denote the process $Y_{t-k}$ by $Y^k$ for simplicity.

To check the performance of the partial likelihood and the estimates of the variance using $G_N$, we generate 50 chains with this initial value and then compare the parameter estimates with the true parameters. We also compare the theoretical variances with the experimental variances. Table 3.1 shows that the parameter estimates are fairly close to the true values. Also the experimental and theoretical variances are similar.
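The simulation and fitting just described can be sketched in a few lines of R. The code below is our own minimal illustration, not the code used for the thesis: the function names are ours, the initial state and day count are simplified, and the standard errors come from the numerical Hessian of the log partial likelihood rather than from an explicit computation of $G_N$.

## Minimal sketch (illustrative, not the thesis code): simulate the seasonal
## binary Markov chain with logit{P(Y_t=1|Z_{t-1})} = beta' Z_{t-1},
## Z_{t-1} = (1, Y_{t-1}, cos(w t)), and recover beta by maximizing the
## log partial likelihood with optim().
set.seed(1)
beta.true <- c(-1, 1, -0.5)
w <- 2 * pi / 366
n <- 5 * 366                       # roughly five years of daily values
logit.inv <- function(x) 1 / (1 + exp(-x))

sim.chain <- function(beta, n) {
  y <- numeric(n)
  y.prev <- 1                      # initial state, as in the text
  for (t in 1:n) {
    z <- c(1, y.prev, cos(w * t))
    y[t] <- rbinom(1, 1, logit.inv(sum(beta * z)))
    y.prev <- y[t]
  }
  y
}

## negative log partial likelihood for Z_{t-1} = (1, Y_{t-1}, cos(w t))
neg.logPL <- function(beta, y) {
  t.idx <- 2:length(y)
  eta <- beta[1] + beta[2] * y[t.idx - 1] + beta[3] * cos(w * t.idx)
  p <- logit.inv(eta)
  -sum(y[t.idx] * log(p) + (1 - y[t.idx]) * log(1 - p))
}

y <- sim.chain(beta.true, n)
fit <- optim(c(0, 0, 0), neg.logPL, y = y, hessian = TRUE)
fit$par                            # should be close to (-1, 1, -0.5)
sqrt(diag(solve(fit$hessian)))     # rough standard errors from the observed information

The same pattern, with a different covariate construction inside neg.logPL, applies to the precipitation models of Chapter 4, which are likewise fitted by maximizing the partial likelihood with optim.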
             β1       β2       β3
estimate   -0.99     1.0     -0.42
sim. sd     0.07     0.10     0.07
theo. sd    0.06     0.12     0.07

Table 3.1: The estimated parameters for the model $Z_{t-1} = (1, Y_{t-1}, \cos(\omega t))$ with parameters $\beta = (-1, 1, -0.5)$. The standard deviation of the parameters is computed once using $G_N$ (theo. sd) and once using the generated samples (sim. sd).

In Kedem and Fokianos [27] other simulation studies have been done to check the validity of this method.

To check the normality of the parameter estimates, we plot the histograms of the three parameter estimates in Figure 3.1. The figure shows that the parameter estimates have a distribution close to Gaussian.

Next we check the performance of the BIC criterion in picking the optimal ("true") model. We use the same model as above and then compute the BIC for a few models to see if BIC picks the right one. We denote $Y_{t-k}$ by $Y^k$ and $\cos(\omega t)$ by $COS$ for simplicity. For an assessment, we simulate a few other chains.

Figure 3.1: The distribution of parameter estimates for the model with the covariate process $Z_{t-1} = (1, Y_{t-1}, \cos(\omega t))$ and parameters $(\beta_1 = -1, \beta_2 = 1, \beta_3 = -0.5)$.

Model: Z_{t-1}                       BIC      parameter estimates
(1)                                  2380.0   (-0.605)
(1, Y1)                              2267.1   (-1.03, 1.11)
(1, Y1, Y2)                          2273.7   (-1.064, 1.091, 0.101)
(1, Y1, COS)                         2217.7   (-1.00, 0.970, -0.558)
(1, Y1, SIN)                         2274.4   (-1.037, 1.117, 0.026)
(1, Y1, COS, SIN)                    2225.1   (-1.00, 0.970, -0.559, 0.028)
(1, Y1, Y2, Y1Y2)                    2281.1   (-1.055, 1.0615, 0.0647, 0.077)
(1, Y1, Y2, Y1Y2, COS)               2232.4   (-0.985, 0.943, -0.0870, 0.0915, -0.564)
(1, Y1, Y2, Y1Y2, COS, SIN)          2239.8   (-0.981, 0.957, -0.0946, 0.0723, -0.575, 0.0232)

Table 3.2: BIC values for several models competing for the role of the true model, where $Z_{t-1} = (1, Y^1, COS)$, $\beta = (-1, 1, -0.5)$.

As we see in Table 3.2, the true model has the smallest BIC, showing that BIC performs well in this case. Also note that models which include the covariates of the true model have accurate estimates for the parameters associated with $(1, Y^1, COS)$, while giving very small magnitudes for the other parameters.

Model: Z_{t-1}                       BIC      parameter estimates
(1)                                  2537.3   (0.0799)
(1, Y1)                              2329.5   (-0.649, 1.417)
(1, Y1, Y2)                          2245.5   (-1.022, 1.144, 0.998)
(1, Y1, COS)                         2265.9   (-0.553, 1.236, -0.617)
(1, Y1, SIN)                         2336.7   (-0.648, 1.415, -0.0433)
(1, Y1, COS, SIN)                    2273.0   (-0.552, 1.235, -0.617, -0.0480)
(1, Y1, Y2, Y1Y2)                    2251.3   (-1.08, 1.287, 1.140, -0.278)
(1, Y1, Y2, Y1Y2, COS)               2213.7   (-0.936, 1.11, 0.966, -0.175, -0.511)
(1, Y1, Y2, Y1Y2, COS, SIN)          2221.2   (-0.927, 1.101, 0.940, -0.160, -0.549, -0.0441)
(1, Y1, Y2, COS)                     2206.8   (-0.899, 1.0263, 0.875, -0.515)

Table 3.3: BIC values for several models competing for the role of the true model given by $Z_{t-1} = (1, Y^1, Y^2, COS)$, $\beta = (-1, 1, 1, -0.5)$.

Table 3.3 presents the true model in the last row. Ignore that row for a moment. The smallest BIC then corresponds to $(1, Y^1, Y^2, Y^1Y^2, COS)$, which has a component $Y^1Y^2$ added to the true model. However, the coefficients of this model are very close to those of the true model and the coefficient for $Y^1Y^2$ is relatively small in magnitude.
The true model has the smallest BIC againand the parameter estimates are close to the correct values.3.8 Concluding remarksIn summary, this chapter shows that a categorical discrete–time stochasticprocess can be represented using a small number of ascending joint distri-butionsP(X0 = x0),P(X0 = x0,X1 = x1),P(X0 = x0,X1 = x1,X2 = x2),··· .As a corollary of the above, we showed that a categorical discrete–timestochastic process can be represented using the conditional probabilitiesP(X0 = x0),P(X1 = x1|X0 = x0),P(X2 = x2|X0 = x0,X1 = x1),··· .A parametric form was found for the conditional probability distributionof categorical discrete–time stochastic processes. The parameters can beestimated for stationary binary Markov chains using partial likelihood.93Chapter 4Binary precipitation process4.1 IntroductionThis chapter studies the Markov order of the 0-1 precipitation process (PNfrom now on). Many authors such as Anderson et al. in [4] and Barlettin [5] have developed techniques to test different assumptions about theorder of the Markov chain. For example in [4], Anderson et al. developa Chi-squared test to test that a Markov chain is of a given order againsta larger order. In particular, we can test the hypothesis that a chain is0th–order Markov against a 1st–order Markov chain, which in this caseis testing independence against the usual (1st–order) Markov assumption.(This reduces simply to the well–known Pearson’s Chi-squared test.) Hence,to “choose” the Markov order one might follow a strategy of testing 0th–order against 1st–order, testing 1st–order against 2nd–order and so on torth–order against (r + 1)th–order, until the test rejects the null hypothesisand then choose the last r as the optimal order. However, some drawbacksare immediately seen with this method:1. The choice of the significance level will affect our chosen order.2. The method only works for chains with several independent observa-tions of the same finite chain.3. We cannot account for some other explanatory variables in the model,for example the maximum temperature.Issues like this have led researchers to think about other methods of orderselection. Akaike in [2], using the information distance and Schwartz in[42] using Bayesian methods develop the AIC and BIC, respectively. Othermethods and generalizations of the above methods have been proposed bysome authors such as Hannan in [20], Shibita in [44] and Haughton in [22].Many authors have studied the order of precipitation processes at dif-ferent locations on Earth. Gabriel et al. in [18] use the test developed inAnderson et al. [4] to show that the precipitation in Tel-aviv is a 1st–order944.2. Models for 0-1 precipitation processMarkov chain. Tong in [45] used the AIC for Hong Kong, Honolulu and NewYork and showed that the process is 1st–order in Hong Kong and Honolulubut 0th–order in New York. In a later paper, [46], Tong and Gates usethe same techniques for Manchester and Liverpool in England and also re-examined the Tel–aviv data. Chin in [12] studies the problem using AIC over100 stations (separately) in the United States over 25 years. He concludesthat the order depends on the season and geographical location. Moreover,he finds a prevalence of first order conditional dependence in summer andhigher orders in winter. Other studies have been done by several authorsusing similar techniques over other locations. For example, Moon et al. in[35] study this issue at 14 location in South Korea.This report investigates the Markov order for a cold–climate region. 
TheMarkov order of the precipitation in this region might be different due to alarge fraction of precipitation being being in the form of snowfall. The re-port also drops the homogeneity (stationarity) condition usually imposed instudying the Markov order. In fact the model proposed here can accommo-date both continuous (here time and potentially geographical location andother explanatory variables) and categorical variables (e.g. precipitationoccurred/not occurred on a given day).An issue with increasing the order of a Markov chain is the exponentialincrease in number of parameters in the model. Here as a special case, wepropose models that increase with the order of Markov chain by adding only1 parameter. Other authors such as Raftery in [40] and Ching in [13] haveproposed other methods to reduce the number of parameters. The datasetused in this study contains more than 110 years of daily precipitation forsome stations. This allows us to look at some properties of the precipitationprocess such as stationarity more closely.4.2 Models for 0-1 precipitation processIn the light of Categorical Expansion Theorem (Theorem 3.5.6), from theprevious chapter, we know all the possible forms of rth–order Markov chainsfor binary data. Since, this theorem gives us linear forms, time series follow-ing generalized linear models (TGLM) provides a method to estimate theparameters. For two reasons it is beneficial to study simpler models ratherthan a full model:1. There are a large number of parameters to estimate in the full model.2. There are better interpretations for the parameters in simpler models.954.2. Models for 0-1 precipitation processWe introduce a few processes that are useful in modeling precipitation:• Yt represents the occurrence of precipitation on day t. Here Yt is abinaryprocess with 1 denoting precipitation and 0 denoting its absenceon day t.• Nlt−1 = summationtextlj=1 Yt−j represents the number of PN days in the past ldays.• Binary processes for modeling m years, say l1 to l2. Here, we definethe binary processes Alt, l ∈ [l1,l2] byAlt =braceleftBigg1, if t belongs to the year l0, otherwsie .This is a binary deterministic process to model the year effect.• Seasonal processes (deterministic):cos(ωt) and sin(ωt), ω = 2π366.We can also consider higher order terms in the Fourier series cos(ωnt)and sin(ωnt), where n is a natural number.Some possibly interesting models present themselves when Zt−1 is a co-variate process. The probability of precipitation today depends on the valueof that covariate process, and those processes might include:• Zt−1 = (1,Nlt−1). This model assumes that the probability of PNtoday only depends on the number of PN days during l previous days.• Zt−1 = (1,Nlt−1,Yt−1). This model assumes that the PN occurrencetoday depends on the PN occurrence yesterday and the number ofPN occurrences during l previous days.• Zt−1 = (1,cos(ωt),sin(ωt),Nlt−1,Yt−1).• Zt−1 = (1,cos(ωt),sin(ωt),Yt−1).• Zt−1 = (1,Yt−1,··· ,Yt−r). This is a special case of Markov chain oforder r. No interaction between the days is assumed. In this modelincreasing the order of Markov chain by one corresponds to adding oneparameter to the model.964.3. Exploratory analysis of the data• Zt−1 = (1,Yt−1,··· ,Yt−r,Yt−1Yt−2). In this model, the interactionbetween the previous day and two days ago is included.• Zt−1 = (1,cos(ωt),sin(ωt),Yt−1,··· ,Yt−r). In this model, two sea-sonal terms are added to the previous model.• Zt−1 = (A1t,··· ,Akt,Yt−1,··· ,Yt−r). 
This model has a different inter-cept for various years (year effect).4.3 Exploratory analysis of the dataThe data includes the daily precipitation for 48 stations over Alberta from1895 to 2006.First, we make the plot of transition probabilities for a few locations.We pick Calgary and Banff, which have a rather long period of data avail-able for PN. We have also repeated the procedure for some other locationssuch as Edmonton and seen similar results. Figures 4.1 to 4.7 show theplots for Banff. For Calgary see plots in Chapter 2. Figure 4.1 plots theestimated 1st–order transition probabilities ˆp11 (the probability of precip-itation if precipitation occurs the day before) and ˆp01 (the probability ofprecipitation if it does not occur the day before). These transition probabil-ities are estimated using the observed data. For example ˆp11 for January 5this estimated by n11n1 , where n11 is the number of pairs of days (Jan. 4th, Jan.5th) with precipitation and n1 is the number of Jan. 5th with precipitationduring available years. Figures 4.2 and 4.3 show similar plots for estimated2nd–order transition probabilities. Figures 4.4 and 4.5 give the estimatedannual probability of precipitation for Banff and Calgary computed by di-viding the number of wet days of a year by the number of days in that year.The plot of the logit function and the transformed estimated probability ofprecipitation in Banff are shown in Figures 4.6 and 4.7. We summarize theconclusions and conjectures based on the exploratory analysis of the dataas followings:• The binary PN process is not stationary. Figure 4.1 shows that thetransition probabilities change over time and depend on the season.• Figure 4.1 also suggests the transition probabilities change continu-ously over time. Although a high variation is seen in the higher orderprobabilities, a generally continuous trend is observed. There is a pe-riodic trend for the transition probabilities over the course of the year974.3. Exploratory analysis of the dataand a simple periodic function should suffice modeling these probabil-ities.• Figure 4.1 suggests p11 and p01 differ over the course of the year, so a0th–order Markov chain (independent) does not seem appropriate.• Figure 4.2 plots the curves ˆp111, ˆp011 and Figure 4.3 plots the curvesˆp001, ˆp101. They have considerable overlaps over the course of the year.Therefore a 2nd–order Markov chain does not seem necessary.• Figures 4.4 and 4.5 show the estimated probability of precipitation fordifferent years, computed by averaging through the days of a givenyear. The probability of precipitation seems to differ year–to–year. Italso seems that consecutive years have similar probability and henceassuming that different years are identically distributed and indepen-dent does not seem reasonable. The probability of precipitation hasincreased over the past century for Calgary, while for Banff the prob-ability of precipitation seems to have been changing with a more ir-regular pattern.• Figure 4.6 shows the plot of the logit function, while Figure 4.7 showsthe result of applying the logit function to the estimated probabilities.We observe how the logit function transforms the values between 0and 1 to a wider range in R. Since logit is an increasing function thepeaks are observed at the same time as the original values.The Categorical Expansion Theorem (Theorem 3.5.6) shows the generalform for binary rth–order Markov processes. Table 4.8 compares all possi-ble 2nd–order Markov chains (including the constant process). 
We discuss the implications of these possible models and use the following abbreviations: $Y^k = Y_{t-k}$, $COS = \cos(\omega t)$, $SIN = \sin(\omega t)$, $COS2 = \cos(2\omega t)$ and $SIN2 = \sin(2\omega t)$.

Some proposed models:
• $Z_{t-1} = 1$: The probability of PN's occurrence does not depend on the previous days. In other words, days are independent.
• $Z_{t-1} = (1, Y^1)$: The probability of PN today depends only on the day before and, given the latter's value, it is independent of the other previous days.

Figure 4.1: The transition probabilities for the Banff site. The dotted line represents $\hat{p}_{11}$ (the estimated probability of precipitation if precipitation occurs the day before) and the dashed line represents $\hat{p}_{01}$ (the estimated probability of precipitation if precipitation does not occur the day before).

Figure 4.2: The solid curve represents $\hat{p}_{111}$ (the estimated probability of precipitation if precipitation occurs during both of the two previous days) and the dashed curve represents $\hat{p}_{011}$ (the estimated probability that precipitation occurs if precipitation occurs the day before and does not occur two days ago) for the Banff site.

Figure 4.3: The solid curve represents $\hat{p}_{001}$ (the estimated probability of precipitation occurring if it does not occur during the two previous days) and the dotted curve is $\hat{p}_{101}$ (the estimated probability that precipitation occurs if precipitation does not occur the day before but occurs two days ago) for the Banff site.

Figure 4.4: Banff's estimated mean annual probability of precipitation calculated from historical data.

Figure 4.5: Calgary's estimated mean annual probability of precipitation calculated from historical data.

Figure 4.6: The logit function: $\mathrm{logit}(x) = \log(x/(1-x))$.

Figure 4.7: The logit of the estimated probability of precipitation in Banff for different days of the year.

• $Z_{t-1} = (1, Y^2)$: The probability of PN given the information for the day before yesterday is independent of other previous days, in particular yesterday! This does not seem reasonable.
• $Z_{t-1} = (1, Y^1, Y^2)$: This model includes both $Y^1$ and $Y^2$. One might suspect that it has all the information and therefore is the most general 2nd–order Markov model.
However, note that in the model the transformed conditional probability is a linear combination of the past two states:
$$\mathrm{logit}\{P(Y = 1 \mid Y^1, Y^2)\} = \alpha_0 + \alpha_1 Y^1 + \alpha_2 Y^2,$$
which implies
$$\mathrm{logit}\{P(Y = 1 \mid Y^1 = 0, Y^2 = 0)\} = \alpha_0,$$
$$\mathrm{logit}\{P(Y = 1 \mid Y^1 = 1, Y^2 = 0)\} = \alpha_0 + \alpha_1,$$
$$\mathrm{logit}\{P(Y = 1 \mid Y^1 = 0, Y^2 = 1)\} = \alpha_0 + \alpha_2,$$
and
$$\mathrm{logit}\{P(Y = 1 \mid Y^1 = 1, Y^2 = 1)\} = \alpha_0 + \alpha_1 + \alpha_2.$$
We conclude that
$$\mathrm{logit}\{P(Y = 1 \mid Y^1 = 1, Y^2 = 0)\} - \mathrm{logit}\{P(Y = 1 \mid Y^1 = 0, Y^2 = 0)\} = \mathrm{logit}\{P(Y = 1 \mid Y^1 = 1, Y^2 = 1)\} - \mathrm{logit}\{P(Y = 1 \mid Y^1 = 0, Y^2 = 1)\} = \alpha_1.$$
In other words, the model implies that no matter what value $Y^2$ has, the differences between the conditional probabilities given $Y^1 = 1$ and given $Y^1 = 0$ (on the logit scale) are the same.
• $Z_{t-1} = (1, Y^1Y^2)$: Among other things, this model implies that the conditional probabilities given $(Y^1 = 0, Y^2 = 1)$, $(Y^1 = 1, Y^2 = 0)$ or $(Y^1 = 0, Y^2 = 0)$ are the same.
• $Z_{t-1} = (1, Y^1, Y^1Y^2)$: Among other things, this model implies that the conditional probabilities given either of the pairs $(Y^1 = 0, Y^2 = 0)$ or $(Y^1 = 0, Y^2 = 1)$ are the same.
• $Z_{t-1} = (1, Y^2, Y^1Y^2)$: The interpretation is similar to the previous case.
• $Z_{t-1} = (1, Y^1, Y^2, Y^1Y^2)$: This is the full 2nd–order stationary Markov model with no restrictive assumptions, as shown by the Categorical Expansion Theorem.

The above explanations show that one must be careful about the assumptions made about any proposed model. Including or dropping various covariates can lead to implications that might be unrealistic.

4.4 Comparing the models using BIC

This section uses the methods developed previously to find appropriate models for the 0-1 PN process. We use the PN data for Calgary from 2000 to 2004. We compare several models using the BIC criterion. The partial likelihood is computed and then maximized using the "optim" function in R.

Using "Time Series Following Generalized Linear Models" as discussed by Kedem et al. in [27], for binary time series with the canonical link function we have
$$P(Y_t = 1 \mid Z_{t-1}) = \mathrm{logit}^{-1}(\alpha Z_{t-1}),$$
and
$$P(Y_t = 0 \mid Z_{t-1}) = 1 - \mathrm{logit}^{-1}(\alpha Z_{t-1}).$$
We conclude that the log partial likelihood is equal to
$$\sum_{t=1}^{N} \log P(Y_t \mid Z_{t-1}) = \sum_{1 \le t \le N,\, Y_t = 1} \log\left(\mathrm{logit}^{-1}(\alpha Z_{t-1})\right) + \sum_{1 \le t \le N,\, Y_t = 0} \log\left(1 - \mathrm{logit}^{-1}(\alpha Z_{t-1})\right).$$
To ensure that the maximum picked by "optim" in the R package is close to the actual maximum, several initial values were chosen randomly until stability was achieved.

In order to find an optimal model to describe a binary (0-1) PN process, we can include several factors such as previous values of the process, seasonal terms, previous maximum temperature values and so on. We have done this comparison in several tables. The smallest BIC in each table is shown in boldface.

Table 4.1 shows the constant process 1 and $N^l$, the number of wet days during the $l$ previous days, as predictors. Note that $N^1 = Y^1$.
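Before turning to the tables, the following R sketch shows how a single BIC entry can be produced. It is our own illustration rather than the thesis code: it assumes a 0-1 vector y of daily precipitation occurrence for the station and period of interest (hypothetical here), builds the covariates $Y^1$, $N^l$ and $COS$, maximizes the log partial likelihood with optim, and applies the usual definition BIC = −2 log PL + (number of parameters) log N.

## Minimal sketch (illustrative): one BIC value for a candidate covariate process.
## `y` is assumed to be a 0-1 vector of daily precipitation occurrence.
logit.inv <- function(x) 1 / (1 + exp(-x))
w <- 2 * pi / 366

## lagged process Y^k and the count N^l of wet days among the l previous days
lag.k <- function(y, k) c(rep(NA, k), y[1:(length(y) - k)])
N.l   <- function(y, l) Reduce(`+`, lapply(1:l, function(k) lag.k(y, k)))

bic.model <- function(y, Z) {
  ok <- complete.cases(Z)                 # drop the first few days lost to lagging
  Z <- as.matrix(Z[ok, ]); yy <- y[ok]; N <- length(yy)
  nlpl <- function(a) {                   # negative log partial likelihood
    p <- logit.inv(Z %*% a)
    -sum(yy * log(p) + (1 - yy) * log(1 - p))
  }
  fit <- optim(rep(0, ncol(Z)), nlpl, method = "BFGS")
  c(BIC = 2 * fit$value + ncol(Z) * log(N), fit$par)
}

## Example: the model Z_{t-1} = (1, Y^1, N^5, COS)
t.idx <- seq_along(y)
Z <- data.frame(int = 1, Y1 = lag.k(y, 1), N5 = N.l(y, 5), COS = cos(w * t.idx))
bic.model(y, Z)

Each row of the tables that follow corresponds to one such fit with a different choice of covariate columns.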
The BIC criterionin this case picks the simplest model which includes only the previous day.Hence a 1st–order Markov chain is chosen among these particular lth–orderchains.Model: Zt−1 BIC parameter estimates(1,N1) 2268.1 (−1.035,1.268)(1,N2) 2294.5 (−1.097,0.726)(1,N3) 2293.4 (−1.181,0.559)(1,N4) 2292.7 (−1.244,0.462)(1,N5) 2296.9 (−1.281,0.390)(1,N6) 2305.9 (−1.292,0.331)(1,N7) 2311.3 (−1.308,0.291)(1,N8) 2317.2 (−1.317,0.258)(1,N9) 2322.1 (−1.32,0.232)(1,N10) 2325.6 (−1.34,0.212)(1,N11) 2330.4 (−1.34,0.193)(1,N12) 2335.7 (−1.34,0.177)(1,N13) 2336.3 (−1.36,0.168)(1,N14) 2340.5 (−1.35,0.155)(1,N15) 2342.6 (−1.36,0.146)Table 4.1: BIC values for models including Nl, the number of precipitationdays during the past l days for the Calgary site.Table 4.2 compares models with predictors:1,Y l and Nl, l = 1,2,··· ,30.Since Y 1 = N1 the first row is obviously an over–parameterized model.The smallest BIC corresponds to the model (1,Y 1,N28). Even the model(1,Y 1,N4) shows an improvement over (1,Y 1). Hence by addingthe numberof PN days to the simple model (1,Y 1), an improvement is achieved.1064.4. Comparing the models using BICModel: Zt−1 BIC parameter estimates(1,Y 1,N1) 2275.6 (-1.04, -0.40, 1.67)(1,Y 1,N2) 2270.2 (-1.10, 0.94, 0.255)(1,Y 1,N3) 2258.3 (-1.21, 0.88, 0.279)(1,Y 1,N4) 2250.6 (-1.28, 0.88, 0.254)(1,Y 1,N5) 2247.5 (-1.32, 0.91, 0.221)(1,Y 1,N6) 2248.2 (-1.34, 0.95, 0.187)(1,Y 1,N7) 2247.1 (-1.37, 0.97, 0.167)(1,Y 1,N8) 2247.5 (-1.39, 0.99, 0.149)(1,Y 1,N9) 2247.6 (-1.40, 1.01, 0.136)(1,Y 1,N10) 2247.4 (-1.42, 1.02, 0.126)(1,Y 1,N11) 2248.3 (-1.43, 1.04, 0.115)(1,Y 1,N12) 2249.6 (-1.43, 1.05, 0.105)(1,Y 1,N13) 2248.1 (-1.46, 1.06, 0.102)(1,Y 1,N14) 2249.7 (-1.46, 1.07, 0.0945)(1,Y 1,N15) 2249.5 (-1.47, 1.07, 0.0905)(1,Y 1,N16) 2249.0 (-1.49, 1.08, 0.0872)(1,Y 1,N17) 2245.3 (-1.51, 1.08, 0.0853)(1,Y 1,N18) 2246.8 (-1.53, 1.08, 0.0831)(1,Y 1,N19) 2246.8 (-1.55, 1.08, 0.0820)(1,Y 1,N20) 2245.6 (-1.56, 1.08, 0.0787)(1,Y 1,N21) 2246.0 (-1.56, 1.08, 0.0749)(1,Y 1,N22) 2247.6 (-1.55, 1.09, 0.0703)(1,Y 1,N23) 2245.9 (-1.58, 1.09, 0.0701)(1,Y 1,N24) 2246.0 (-1.58, 1.09, 0.0678)(1,Y 1,N25) 2246.8 (-1.58, 1.10, 0.0647)(1,Y 1,N26) 2246.6 (-1.59, 1.10, 0.0632)(1,Y 1,N27) 2246.2 (-1.60, 1.10, 0.0618)(1,Y 1,N28) 2244.7 (-1.62, 1.10, 0.0615)(1,Y 1,N29) 2245.4 (-1.62, 1.10, 0.0593)(1,Y 1,N30) 2246.2 (-1.622, 1.11, 0.0571)Table 4.2: BIC values for models including Nl, the number of wet daysduring the past l days and Y 1, the precipitation occurrence of the previousday for the Calgary site.Table 4.3 compares models with predictors (1,Nl,COS,SIN). We haveadded (COS,SIN) to capture the seasonality in the precipitation over ayear. (1,N1,COS,SIN) (which is the same as (1,Y 1,COS,SIN)) is thewinner. Note that this model is better than the simpler model (1,Y 1) orthe model (1,Y 1,N28).1074.4. 
Comparing the models using BICModel: Zt−1 BIC parameter estimates(1,N1,COS,SIN) 2222.5 (-1.00, 1.10, -0.588, 0.0999)(1,N2,COS,SIN) 2254.6 (-1.02, 0.592, -0.564, 0.0977)(1,N3,COS,SIN) 2260.1 (-1.07, 0.443, -0.538, 0.0961)(1,N4,COS,SIN) 2264.1 (-1.11, 0.359, -0.518, 0.0959)(1,N5,COS,SIN) 2270.8 (-1.12, 0.295,-0.508, 0.0971)(1,N6,COS,SIN) 2280.5 (-1.11, 0.240, -0.510, 0.0999)(1,N7,COS,SIN) 2286.7 (-1.11, 0.205, -0.508, 0.101)(1,N8,COS,SIN) 2293.0 (-1.09, 0.176, -0.511, 0.103)(1,N9,COS,SIN) 2293.1 (-1.08, 0.153, -0.513, 0.105)(1,N10,COS,SIN) 2302.2 (-1.07, 0.136, -0.516, 0.107)Table 4.3: BIC values for models including Nl, the number of wet daysduring the past l days and seasonal terms for the Calgary site.Table 4.4 includes Y 1, seasonal terms and Nl for l = 1,2,··· ,10 aspredictors. The model with predictors(1,Y 1,N5,COS,SIN),which includes a combination of seasonal terms and number of precipitationdays has the smallest BIC so far. Note that both the seasonal terms and thenumber of precipitation days prior to the day we are looking at, are indica-tors of “weather conditions”. There are natural cycles throughout the yearthat can inform us about the weather conditions of a particular day of theyear. These natural cycles are modeled by the periodic functions COS andSIN. Also by looking at a short period prior to the current day (short–termpast), we might be able to determine the weather conditions. Precipitationmay not follow a very regular seasonal pattern similar to temperature asshown in the exploratory analysis. Which one of these variables (seasonalor short–term past) is more important or necessary might depend on thelocation and other factors.1084.4. Comparing the models using BICModel: Zt−1 BIC parameter estimates(1,Y 1,N1,COS,SIN) 2230.0 (-1.00, -2.31, 3.41, -0.589, 0.0999)(1,Y 1,N2,COS,SIN) 2229.2 (-1.03, 0.977, 0.0997, -0.576, 0.0985)(1,Y 1,N3,COS,SIN) 2224.8 (-1.10, 0.895, 0.156, -0.546, 0.0946)(1,Y 1,N4,COS,SIN) 2222.1 (-1.14, 0.89, 0.147, -0.525, 0.0941)(1,Y 1,N5,COS,SIN) 2221.7 (-1.16, 0.922, 0.124, -0.515, 0.0934)(1,Y 1,N6,COS,SIN) 2223.3 (-1.16, 0.959, 0.0954, -0.517, 0.0946)(1,Y 1,N7,COS,SIN) 2223.7 (-1.17, 0.978, 0.0822, -0.513, 0.0947)(1,Y 1,N8,COS,SIN) 2224.7 (-1.16, 0.997, 0.0682, -0.515, 0.0945)(1,Y 1,N9,COS,SIN) 2225.5 (-1.16, 1.0129, 0.0582, -0.515, 0.0961)(1,Y 1,N10,COS,SIN) 2226.0 (-1.16, 1.026, 0.0502, -0.517, 0.0958)Table 4.4: BIC values for models including Nl, the number of PN daysduring the past l days, Y 1, the precipitation occurrence of the previous dayand seasonal terms for the Calgary site.Table 4.5 compares models with different number of predictors from(1,Y 1) to(1,Y 1,··· ,Y 7).The first model is a 1st–order Markov chain and the last one is a 7th–order chain. The optimal model picked is: (1,Y 1,Y 2,Y 3). Comparing thistable to Table 4.2, we see that (1,Y 1,N3) is superior to (1,Y 1), (1,Y 1,Y 2)and (1,Y 1,Y 2,Y 3). 
Note that (1,Y 1,N3) is equivalent to (1,Y 1,Y 2 +Y 3).Hence, including Y 2 and Y 3 and giving them the same weight is better thannot including them, including one of them or including both of them.Model: Zt−1 BIC parameter estimates(1,Y 1) 2268.1 (-1.034, 1.27)(1,Y 1,Y 2) 2270.2 (-1.11, 1.20, 0.23)(1,Y 1,Y 2,Y 3) 2263.3 (-1.21, 1.19, 0.140, 0.410)(1,Y 1,··· ,Y 4) 2263.9 (-1.28, 1.16, 0.133, 0.334, 0.281)(1,Y 1,··· ,Y 5) 2268.5 (-1.32, 1.15, 0.121, 0.328, 0.232, 0.192)(1,Y 1,··· ,Y 6) 2335.4 (-1.34, 1.15, 0.0837, 0.357, 0.213, 0.135, 0.115)(1,Y 1,··· ,Y 7) 2286.7 (-1.51, 1.33, -0.113, 0.378, 0.418, 0.204, -0.0050, 0.214)Table 4.5: BIC values for Markov models of different order with small num-ber os parameters for the Calgary site.Table 4.6 compares modelswith different Markov ordersplusthe seasonalterms. The model (1,Y 1,COS,SIN) is the winner. Hence, whether weinclude the seasonal terms or not, the model that only depends on theprevious day is the winner.1094.4. Comparing the models using BICModel: Zt−1 BIC parameter estimates(1,COS,SIN,Y 1) 2222.6 (-1.0, -0.5, 0.1, 1.1)(1,COS,SIN,Y 1,Y 2) 2229.1 (-1.0, -0.5, 0.1, 1.0, 0.1)(1,COS,SIN,Y 1,Y 2,Y 3) 2230.4 (-1.1, -0.5, 0.1, 1.0, 0.02, 0.3)(1,COS,SIN,Y 1,··· ,Y 4) 2247.3 (-1.1, -0.5, 0.1, 1.0, 0.03, 0.2, 0.15)(1,COS,SIN,Y 1,··· ,Y 5) 2243.4 (-1.3, -0.4, 0.2, 1.4, -0.4, -0.1, 1.0, -0.15)(1,COS,SIN,Y 1,··· ,Y 6) 2501.6 (-1.2, -1.5, 0.4, 0.2, 0.8, 0.9, 0.9, -0.6, -0.2)(1,COS,SIN,Y 1,··· ,Y 7) 2447.3 (-1.1, -0.2, 0.07, 0.8, -0.02, 0.3, 0.4, -0.07, 0.4, -0.3)Table 4.6: BIC values for Markov models with different order plus seasonalterms for the Calgary site.Table 4.7 studies seasonality more. We consider the possibility that thereare more/less terms of the Fourier series of a periodic function over the year.It turns out that the model with (1,Y 1,COS) is the optimal model so far.Hence, only one term seem to suffice modeling the seasonal nature of theprocess.Model: Zt−1 BIC parameter estimates(1,COS) 2322.7 (-0.556, -0.717)(1,SIN) 2424.3 (-0.523, 0.115)(1,COS,SIN) 2327.3 (-0.568, -0.738, 0.119)(1,Y 1,COS) 2216.9 (-1.00 , 1.10, -0.587)(1,Y 1,SIN) 2273.9 (-1.03, 1.26, 0.0933)(1,Y 1,COS,SIN) 2222.6 (-1.004, 1.102, -0.589, 0.100)(1,Y 1,COS,SIN,COS2) 2229.7 (-1.00, 1.10, -0.586, 0.0998, 0.0247)(1,Y 1,COS,SIN,SIN2) 2230.0 (-1.00, 1.10, -0.590, 0.101, 0.0125)(1,Y 1,COS,SIN,COS2,SIN2) 2237.2 (-1.01, 1.11, -0.575, 0.0978, 0.0236, -0.0101)Table 4.7: BIC values for models including seasonal terms and the occur-rence of precipitation during the previous day for the Calgary site.Table 4.8 compares all stationary 2nd–order Markov models. The small-est BIC corresponds to (1,Y 1).1104.4. Comparing the models using BICModel: Zt−1 BIC parameter estimates(1) 2419.6 (-0.528)(1,Y 1) 2268.0 (-1.04, 1.27)(1,Y 2) 2392.8 (-0.756, 0.590)(1,Y 1,Y 2) 2270.2 (-1.110, 1.197, 0.256)(1,Y 1Y 2) 2335.5 (-0.779, 1.134)(1,Y 1,Y 1Y 2) 2272.7 (-1.040, 1.113, 0.282)(1,Y 2,Y 1Y 2) 2342.3 (-0.757, -0.113, 1.225)(1,Y 1,Y 2,Y 1Y 2) 2277.7 ( -1.103, 1.177, 0.234, 0.048)Table 4.8: BIC values for 2nd–order Markov models for precipitation at theCalgary site.Table 4.9 compares all 2nd–order Markov chains with a seasonal COSterm. 
The model (1,Y 1,COS) is the winner.Model: Zt−1 BIC parameter estimates(1,COS) 2322.7 (-0.567, -0.738)(1,COS,Y 1) 2216.8 (-1.005, -0.587, 1.106)(1,COS,Y 2) 2317.4 (-0.708, -0.679, 0.372)(1,COS,Y 1Y 2) 2223.5 (-0.760, -0.618, 0.905)(1,COS,Y 1,Y 2) 2276.1 (-1.033, -0.575, 1.080, 0.103)(1,COS,Y 1,Y 1Y 2) 2223.9 (-1.004, -0.580, 1.041, 0.120)(1,COS,Y 2,Y 1Y 2) 2280.9 (-0.709, -0.632, -0.244, 1.093)(1,COS,Y 1,Y 2,Y 1Y 2) 2231.0 (-1.028, -0.575, 1.065, 0.085, 0.037)Table 4.9: BIC values for 2nd–order Markov models for precipitation at theCalgary site plus seasonal terms.Table 4.10 also includes the maximum and minimum temperature of theday before, as predictors of some of the models which performed better in theabove tables. We have also included the annual processes A1,··· ,A5 to oneof the models. Finally, we have included the model (1,Y 1,N5,COS). Thismodel has a combination of the seasonal term COS and the short–term pastprocess N5 which did the best when combined with the seasonal terms andY 1 in Table 4.4. It turns out that including MT and mt does not improvethe BIC as well as does the annual terms. However, (1,Y 1,N5,COS) hasthe smallest BIC in all the models, which is a seasonal Markov chain of order5 with only 4 parameters. Also the simpler model,(1,Y 1,COS),has a close BIC to (1,Y 1,N5,COS).1114.5. Changing the location and the time periodModel: Zt−1 BIC parameter estimates(1,COS,Y 1) 2216.8 (-1.005, -0.587, 1.106)(1,Y 1,COS,MT1) 2221.7 (-0.84, 1.0, -0.74, -0.012)(1,Y 1,COS,mt1) 2224.2 (-1.0, 1.0, -0.65, -0.0055)(1,Y 1,COS,MT1,mt1) 2227.4 (-0.65, 0.99, -0.67, -0.025, 0.022)(1,Y 1,COS,A1,··· ,A5) 2241.2 ( 1.1, -0.5, -0.9, -1.2, -1.1, -1.0, -0.7)(1,Y 1,N5,COS,MT1) 2297.3 (-2.13, 0.9, 0.4, 0.6, 0.2, 0.04)(1,Y 1,N5,COS,SIN,MT1,mt1) 2516.8 (1.4, 0.04, 0.2, 0.7, 0.8, -0.2, 0.3)(1,Y 1,N5,COS,MT1,mt1) 2393.9 ( 1.4, 0.7, -0.1, -0.5, 0.5, -0.1, 0.2)(Y 1,N5,COS,MT1,A1,··· ,A5) 2697.1 (1.23, -0.64, -2.0, -0.10, 2.0, 1.2, 2.2, 1.2, 1.8)(Y 1,N5,COS,A1,··· ,A5) 2447.1 (0.1, 0.1, -0.7, -0.39, -0.01, -0.2, -0.9, -1)(1,Y 1,MT1) 2251.5 (-1.2, 1.3, 0.021)(1,Y 1,N5,COS) 2215.8 (-1.1, 0.9, 0.1, -0.5)(1,Y 1,N5,COS,MT1) 2223.8 (-1.2, 0.9, 0.1, -0.4, 0.0)Table 4.10: BIC values for models including several covariates as tempera-ture, seasonal terms and year effect for precipitation at the Calgary site.4.5 Changing the location and the time periodThis section compares various models for a different time period and loca-tion. Table 4.11 compares various models for the 0-1 PN process in Calgarybetween 1990 and 1994 which is a 5–year period. In Table 4.12, we havecompared several models for 0-1 PN process over Medicine Hat site between2000 and 2004.Table 4.11 shows that among the compared models (1,Y 1,COS) hasthe smallest BIC. In particular the BIC for this model is smaller than theBIC for (1,Y 1,N5,COS) which has the smallest BIC for Calgary 2000–2004. However (1,Y 1,COS) was the second optimal model also for Calgary2000–2004 with a close BIC to the optimal. Including the maximum andminimum temperature to the model increases the BIC again.1124.5. 
Changing the location and the time periodModel: Zt−1 BIC parameter estimates(1,Y 1) 2312.7 (-0.931, 1.275)(1,Y 1,Y 2) 2318.8 (-0.967, 1.238, 0.126)(1,Y 1,COS) 2228.8 (-0.858, 1.036, -0.712)(1,Y 1,N5) 2303.3 (-1.168, 1.012, 0.168)(1,Y 1,N10) 2287.9 (-1.581, 1.015, 0.132)(1,Y 1,N15) 2282.7 (-1.486, 1.045, 0.105)(1,Y 1,COS,SIN) 2231.9 (-0.855, 1.026, -0.715 , 0.152)(1,Y 1,N5,COS) 2236.4 (-0.864, 1.032, 0.004, -0.709)(1,Y 1,N5,SIN) 2307.8 (-1.160, 1.011, 0.164, 0.125)(1,Y 1,N5,COS,SIN) 2239.4 (-0.849, 1.031, -0.004, -0.718, 0.152)(1,Y 1,N10,COS) 2236.4 (-0.847, 1.030, -0.002, -0.721, 0.153)(1,Y 1,N10,COS,SIN) 2239.4 (-0.847, 1.030, -0.002, -0.721 , 0.153)(1,Y 1,N5,COS,MT1) 2244.3 (-0.433, 1.046, -0.096, -1.078, -0.021)(1,Y 1,N5,COS,mt1) 2244.1 (-0.910, 1.011, 0.031, -0.584, 0.006)Table 4.11: BIC values for several models for the binary process of precipi-tation in Calgary, 1990–1994Table 4.12 shows that the smallest BIC corresponds to (1,Y 1,COS).However, several models have similar BIC values. Also, including the max-imum and minimum temperature increases the BIC here.Model: Zt−1 BIC parameter estimates(1,Y 1) 2202.9 (-1.138, 1.094)(1,Y 1,Y 2) 2207.9 (-1.183, 1.051, 0.181)(1,Y 1,N5) 2203.6 (-1.275, 0.921, 0.119)(1,Y 1,N10) 2228.9 (-0.858, 1.036, -0.712)(1,Y 1,N15) 2200.5 (-1.420, 0.980, 0.065)(1,Y 1,N20) 2202.5 (-1.421, 1.008, 0.048)(1,Y 1,COS) 2201.2 (-1.134, 1.067, -0.224)(1,Y 1,COS,SIN) 2202.9 (-1.132, 1.052, -0.225, 0.177)(1,Y 1,N5,COS) 2203.9 (-1.252, 0.924, 0.101, -0.201)(1,Y 1,N5,SIN) 2206.6 (-1.263, 0.922, 0.109, 0.158)(1,Y 1,N5,COS,SIN) 2206.6 (-1.239, 0.925, 0.091, -0.204, 0.163)(1,Y 1,N10,COS) 2201.9 (-1.336, 0.958, 0.073, -0.183)(1,Y 1,N10,COS,SIN) 2205.1 (-1.311, 0.958, 0.065, -0.187, 0.151)(1,Y 1,N5,COS,MT1) 2306.5 (-1.455, 2.099, -0.130, 0.041, 0.004)(1,Y 1,N5,COS,mt1) 2211.1 (-1.238, 0.937, 0.087, -0.267, -0.005)(1,Y 1,N15,COS) 2202.7 (-1.363, 0.981, 0.053, -0.175)Table 4.12: BIC values for several models for precipitation occurrence inMedicine Hat, 2000-20041134.5. Changing the location and the time periodIn summary, in all the three cases(1,Y 1,COS),is either optimal or the second to the optimal (using BIC). We have alsotried BIC for Calgary with a long time period of close to 100 years andsurprisingly the same simple model (1,Y 1,COS) was the optimal.114Chapter 5On the definition of“quantile” and its properties5.1 IntroductionThis chapter points out deficiencies in the classical definition (as well as someother widely used definitions) of the median and more generally the quan-tile and the so-called quantile function. Moreover redefining it appropriatelygives us a basis on which we can find necessary and sufficient conditions forthe sample quantiles to converge for arbitrary distribution functions. In thenext chapter, we define a “degree of separation” function to measure thegoodness of the approximation (or estimation). We argue that this func-tion can be viewed as a natural loss function for assessing estimations andapproximations. One characteristic of this loss function is its invariance un-der strictly monotonic transformations of the random variable, in particularre-scaling.In this chapter, we have used the terms data vector, approximation,estimation, exact and true quantiles repeatedly. To clarify what we meanby these terms, we give the following explanations:• Data vector: A vector of real numbers. We do not consider these valuesas random in general. We use the term random vector or randomsample for a vector of random variables. 
We define the quantile fordata vectors, but the same definition applies to a random sample.• Approximation and exact value: Suppose a very large data vectoris given. We can compute the exact mean/median of such a vectorby using all the data and the definition of mean/median. One canapproximate the mean/median using various techniques. Note thatboth approximation and exact terms are used for data vectors of (non-random) numbers.• Estimation and true value: Estimation means finding functions of therandom sample to estimate parameters of the underlying distribution.1155.1. IntroductionThe parameters are called the true values.The sample definition of quantiles varies in different text books. In [24],Hyndman et al. point out many different definitions in statistical packagesfor quantiles of a sample. In [17], Freund et al. point out various defini-tions for quartiles of data and propose a new definition using the concept of“hinge”.The traditional definition of quantiles for a random variable X withdistribution function F,lqX(p) = inf{x|F(x) ≥ p},appears in classic works as [38]. We call this the “left quantile function”. Insome books (e.g. [41]) the quantile is defined asrqX(p) = sup{x|F(x) ≤ p},this is what we call the “right quantile function”. Also in robustness lit-erature people talk about the upper and lower medians which are a veryspecific case of these definitions. However, we do not know of any work thatconsiders both definitions, explore their relation and show that consideringboth has several advantages.A physical motivation is given for the right/left definition of quantiles.It is widely claimed that (e.g. Koenker in [29] or Hao and Naiman in [21])the traditional quantile function is invariant under monotonic transforma-tions. We show that this does not hold even for strictly increasing functions.However, we prove that the traditional quantile function is invariant un-der non-decreasing left continuous transformations. We also show that theright quantile function is invariant under non-decreasing right continuoustransformations. A similar neat result is found for continuous decreasingtransformations using the Quantile Symmetry Theorem also proved in thischapter.Suppose we know that a data point is larger than a known number ofother data points and smaller than another known number of data points.Of interest are the quantiles to which this data point corresponds. Lemma5.2.4 gives a result about this. We will use this lemma later to establishthe precision of our proposed algorithm for approximating quantiles of largedatasets.Quantiles are often used as the inverse of distribution functions. In gen-eral neither the distribution function nor the quantile function are invertible.However Lemma 5.5.1 shows how quantiles can be used to characterize setsof the form {x|F(x) < p}, a case that is equivalent to (−∞,lqF(p)).1165.1. IntroductionLemma 5.7.1 shows the left continuity of the left quantile function andthe right continuity of the right quantile function.Section 5.8 finds necessary and sufficient conditions for the left and rightquantile functions to be equal at p ∈ [0,1]. We also find out that the leftand right quantile functions coincide except for at most a countable numberof values in [0,1]. 
Then we characterize the image of the the left and rightquantile functions and show that the image corresponds to “heavy” points(heavy point is a point that the probability of being in a neighborhoodaround that point is positive).Section 5.9 shows that given any of lq,rq and F uniquely determines theother two and formulas are given in order to find them. We also show thatif one of lq and rq is two-sided continuous then so is the other one. Lemma5.10.1 shows that the strict monotonicity of the distribution function F onits “real domain” {x|0 < F(x) < 1} is equivalent to two-sided continuity oflq/rq. Conversely, strict monotonicity of lq/rq corresponds to continuity ofF.Section 5.12 presents the desirable “Quantile Symmetry Theorem”, a re-sult that could be only obtained by considering both left and right quantiles.This relation can help us prove several other useful results regarding quan-tiles. Also using the quantile symmetry theorem, we find a relation for theequivariance property of quantiles under non-increasing transformations.Section 5.14 studies the limit properties of left and right quantile func-tions. In Theorem 5.14.7, we show that if left and right quantiles are equal,i.e. lqF(p) = rqF(p), then both sample versions lqFn,rqFn are convergent tothe common distribution value. We found an equivalent statement in Ser-fling [43] with a rather similar proof. The condition for convergence thereis said to be lqF(p) being the unique solution of F(x−) < p ≤ F(x) whichcan be shown to be equivalent to lqF(p) = rqF(p). Note how consideringboth left and right quantiles has resulted in a cleaner, more comprehensiblecondition for the limits. In a problem Serfling asks to show with an examplethat this condition cannot be dropped. We show much more by proving thatif lqF(p) negationslash= rqF(p) then both rqFn(p) and rqFn(p) diverge almost surely. Thealmost sure divergence result can be viewed as an extension to a well-knownresult in probability theory which says that if X1,X2,··· an i.i.d sequencefrom a fair coin with -1 denoting tail and 1 denoting head and Zn =summationtextni=1 Xithen P(Zn = 0 i.o.) = 1. The proof in [9] uses the Borel–Cantelli Lemmato get around the problem of dependence of Zn. This is equivalent to say-ing for the fair coin both lqFn(1/2) and rqFn(1/2) diverge almost surely.For the general case, we use the Borel–Cantelli Lemma again. But we alsoneed a lemma (Lemma 5.14.10) which uses the Berry–Esseen Theorem in1175.2. Definition of median and quantiles of data vectors and random samplesits proof to show the deviations of the sum of the random variables canbecome arbitrarily large, a result that is easy to show as done in [9] for thesimple fair coin example. Finally, we show that even though in the case thatlqF(p) negationslash= rqF(p), lqFn,rqFn are divergent; for large ns they will fall in(lqF(p)−ǫ,lqF(p)]∪[rqF(p),rqF(p)+ǫ).In fact we show thatliminfn→∞ lqFn(p) = liminfn→∞ rqFn(p) = lqF(p)andlimsupn→∞lqFn(p) = limsupn→∞rqFn(p) = rqF(p).The proof is done by constructing a new random variable Y from the originalrandom variable X with distribution function FX by shifting back all thevalues greater than rqX(p) to lqX(p). This makes lqY (p) = rqY(p) in thenew random variable. Then we apply the convergence result to Y.5.2 Definition of median and quantiles of datavectors and random samplesThis section presents a way to define quantiles of data vectors and randomsamples. 
We confine our discussion to data vectors since the definition forrandom samples is merely a formalistic extension. Suppose, we are given avery long data vector. The goal is to find the median of this vector. Let usdenote the data vector by x = (x1,··· ,xn). Suppose y = (y1,··· ,yn) is anincreasing sorted vector of elements of x = (x1,··· ,xn). Then usually themedian of x is defined to be y(n+1)/2 if n is odd and yn/2+y(n+2)/22 if n is even.Essentially the median is defined so that half data lies below it and halflies above it. However, when n is even, any value between yn/2 and y(n+2)/2serves this purpose and taking the average of the two values seems arbitrary.Intuitively, the quantile should have the following properties:1. It should be a member of the data vector. In other words if x =(x1,··· ,xn) is the data vector then the quantile should be equal toone of xi, i = 1,··· ,n.2. Equivariance: If we transform the data using an increasing continuoustransformation of R, find the quantile and transform back, we shouldget the same result, had we found the quantile of the original data.1185.2. Definition of median and quantiles of data vectors and random samplesMore formally, if we denote the quantile of a data vector x for p ∈ (0,1)by qx(p) then for any φ : R → R strictly increasing and bijectiveqx(p) = y ⇔ qφ(x)(p) = φ(y).3. Symmetry: The p-th quantile of the data vector x = (x1,··· ,xn)should be the negative of (1 − p)-th quantile of data vector −x =(−x1,··· ,−xn):qx(p) = −q−x(1−p).Particularly, the median of x should be the image of the median of theimage of x with respect to 0.4. The “amount” of data between qx(p1) and qx(p2) should be p2 −p1 ofthe the “data amount” of the whole vector if p1 < p2.5. If we “cut” a sorted data vector up until the p1-th quantile and com-pute the p2-th quantile for the new vector, we should get the p1p2-thquantile of the original vector. For example the median of a sortedvector upto its median should be the first quartile.This chapter develops a definition for quantiles that satisfies the firstthree conditions. We will address the last two conditions in later chaptersand develop a framework in which they are satisfied.Consider the example x = (0,1,1,1,1,1,2,2,2,2,2,10). We see that themedian by the usual definition is 1.5 not apparent in the observed data. Alsoif we take bijective, increasing and continuous transformation φ(x) = x3, wesee that the classic definition does not satisfy the second property.The median and quantiles can be defined both for distributions anddata vectors (and random samples). For a random variable X having adistribution function F, the p-th quantile is traditionally defined asqF(p) = inf{x|F(x) ≥ p}. (5.1)This can be used to define the quantiles of a data vector using the empirical(sample) distribution function Fn,Fn(x) =nsummationdisplayi=11(−∞,xi](x).1195.2. Definition of median and quantiles of data vectors and random samplesWith this definition of the quantile, the equivariance property holds and theresult is a realizable data value. This definition faces another issue however.Consider flipping a fair coin with outcomes: 0,1. Then the distribution ofX is given byFX(x) =0 x < 01/2 0 ≤ x < 11 x ≥ 1Hence by definition 5.1, qF(p) = 0, p ≤ 1/2 and qF(p) = 1, p > 1/2. Thisall seems to be reasonable other than qF(p) = 0, p = 1/2. Based on thesymmetry of the distribution there should not be any advantage for 0 over1 to be the median. For the quantiles of the data vectors the same issueoccurs. 
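As a quick numerical illustration of the coin example above, the following minimal R sketch applies definition 5.1 through base R's quantile() with type = 1, which is documented as the inverse of the empirical distribution function and therefore coincides with definition 5.1 applied to Fn:

quantile(c(0, 1), probs = 0.5, type = 1)   # returns 0: the definition favours 0 over 1 at p = 1/2
quantile(1:6, probs = 0.5, type = 1)       # returns 3; compare the data-vector example that follows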
For example, consider x = (1,2,3,4,5,6) and apply definition 5.1to Fn corresponding to this data vector. We will get 3 as the median but infact 4 should to be as eligible by symmetry.Before to get to our definition of quantile we provide the following mo-tivating examples.Example A student decided to buy a new memory chip for his computer.He needed to choose between the available RAM sizes (1 GB, 2GB etc) inhis favorite store. In a trade-off between price and speed, he decided toget a RAM chip that is at least as large as 2/3 RAMs bought in the storeduring the day before. He could access the information regarding all RAMsbought the day before, in particular their size. He entered the size data intothe R package he had recently downloaded for free. He had heard aboutthe quantiles in his elementary statistics course so he decided to computethe quantile of the data for p = 2/3. When he computed that he got 2.666(GB). He knew a RAM of size 2.666 does not exist and concluded this mustbe a result of an interpolation procedure in R. Since the closest integer to2.666 is 3 he concluded that 3 GB is the size he is looking for. He went backto the store asking for 3 GB RAM and was told they have never sold sucha RAM in that store! He thought there must be an error in the dataset sohe looked the data again1,1,1,1,2,2,2,2,4,4,4,4Surprisingly there was no 3. R had interpolated 2 and 4 to give 2.66 andmislead the student.Example A supervisor asked 2 graduate students to summarize the follow-ing data regarding the intensity of the earthquakes in a specific region:1205.2. Definition of median and quantiles of data vectors and random samplesrow number ML (Richter) A (shaking amplitude)1 4.21094 1.62532 ×1042 4.69852 4.99482 ×1043 4.92185 8.35314 ×1044 5.12098 13.21235 ×1045 5.21478 16.39759 ×1046 5.28943 19.47287 ×1047 5.32558 21.16313 ×1048 5.47828 30.08015 ×1049 5.59103 38.99689 ×10410 5.72736 53.37772 ×104Table 5.1: Earthquakes intensitiesEarthquake intensity is usually measured in ML scale, which is relatedto A by the following formula:ML = log10 A.In the data file handed to the students (Table 5.1), the data is sorted withrespect to ML in increasing order from top to bottom. Hence the data isarranged decreasingly with respect to A from top to bottom.The supervisor asked two graduate students to compute the center of theintensity of the earthquakes using this dataset. One of the students used Aand the usual definition of median and so obtained(16.39759 ×104 + 19.47287 ×104)/2 = 17.93523×104.The second student used the ML and the usual definition of median tofind(5.21478 + 5.28943)/2 = 5.252105.When the supervisor saw the results he figured that the students musthave used different scales. Hence he tried to make the scales the same bytransforming one of the results105.252105 = 17.86920 ×104.To his surprise the results were not quite the same. He was bothered tonotice that the definition of median is not invariant under the change ofscale which is continuous strictly increasing.1215.2. 
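The discrepancy can be reproduced directly from Table 5.1. The brief R sketch below (illustrative only; it recomputes A as 10^ML instead of copying the table's rounded amplitudes) repeats both students' calculations:

ML <- c(4.21094, 4.69852, 4.92185, 5.12098, 5.21478,
        5.28943, 5.32558, 5.47828, 5.59103, 5.72736)   # Richter magnitudes from Table 5.1
A <- 10^ML                                             # corresponding shaking amplitudes
median(A)       # approximately 17.935 x 10^4, the first student's centre
10^median(ML)   # approximately 17.869 x 10^4, the second student's centre mapped back

The two answers differ because averaging the two middle values does not commute with the nonlinear change of scale.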
Definition of median and quantiles of data vectors and random samplesExample A scientist asked two of his assistants to summarize the followingdata regarding the acidity of rain:row number pH aH1 4.7336 18.4672 ×10−62 4.8327 14.6994 ×10−63 4.8492 14.1514 ×10−64 5.0050 9.8855 ×10−65 5.0389 9.1432 ×10−66 5.2487 5.6403 ×10−67 5.2713 5.3543 ×10−68 5.2901 5.1274 ×10−69 5.5731 2.6724 ×10−610 5.6105 2.4519 ×10−6Table 5.2: Rain acidity datapH is defined as the cologarithm of the activity of dissolved hydrogenions (H+).pH = −log10 aH.In the data file handed to the students (Table 5.2) the data is sorted withrespect to pH in increasing order from top to bottom. Hence the data isarranged decreasingly with respect to aH from top to bottom.The scientist asked the two assistant to compute the 20th and 80thpercentile of the data to get an idea of the variability of the acidity. Firstassistant used the pH scale and the traditional definition of the quantileqF(p) = inf{x|F(x) ≥ p},where F is the empirical distribution of the data. He got the following twonumbersqF(0.2) = 4.8327 and qF(0.8) = 5.2901 (5.2)these values are positioned in row 2 and 8 respectively.The second assistant also used the traditional definition of the quantilesand the aH scale to getqF(0.2) = 2.6724×10−6 and qF(0.8) = 14.1514×10−6, (5.3)which correspond to row 9 and 3.1225.2. Definition of median and quantiles of data vectors and random samplesThescientist noticed the assistants used differentscales. Thenhethoughtsince one of the scales is in the opposite order of the other and 0.2 and 0.8are the same distance from 0 and 1 respectively, he must get the other assis-tant’s result by transforming one. So he transformed the second assistant’sresults given in Equation 5.3 (or by simply looking at the correspondingrows, 9 and 3 under pH), to get5.5731 and 4.8492,which are not the same as the first assistants result in Equation 5.2. Henoticed the position of these values are only one off from the previous values(being in row 9 and 3 instead of 8 and 2).Then he tried the same himself for 25th and 75th percentile using bothscalespH : qF(0.25) = 4.8492 and qF(0.75) = 5.2901,which are positioned at 3rd and 8th row.aH : qF(0.25) = 5.1274×10−6 and qF(0.25) = 14.1514×10−6,which are positioned at row 8th and 3rd. This time he was surprised toobserve the symmetry he expected. He wondered when such symmetry existand what is true in general. He conjectured that the asymmetric definitionof the traditional quantile is the reason of this asymmetry. He also thoughtthat the symmetry property is off at most by one position in the dataset.To define the quantile, we perform a thought experiment and use ourintuition to decide how it should be defined. Suppose a data vector x =(x1,··· ,xn) is given. Define the sort operator which permutes the compo-nents of a vector to give a vector with non-decreasing coordinates bysort(x) = (y1,··· ,yn).In statistics yi defined as above is called the i–th order statistics of x andis usually denoted by x(i) or xi:n. [This definition extends to random vec-tors (X1,··· ,Xn) as well.] The concept of quantile should only depend onsort(x). Let z = (z1,··· ,zr) be the non–decreasing subvector of all distinctelements of x. If zi is repeated mi times, we say zi has multiplicity miand therefore summationtextri=1 mi = n. Now imagine, a uniform bar of length 1. Cutthe bar from left to right to r parts of lengths m1n ,··· , mrn proportional to1235.2. 
Definition of median and quantiles of data vectors and random samplesthe multiplicity of the zi. Assign a unique color to every zi, i = 1,··· ,rand color its piece with that color. Then reassemble the stick from left toright in the original order. To define the p-th quantile measure a length pfrom the left hand of the bar (whose total length is one). Determine thereassembled bar’s color at that point. However, this protocol fails at theend points as well as the points where two colors meet. Since each color isan equally eligible choice, we are led to the idea in defining the quantiles ofa two–state solution at these points, giving us the left and right quantiles.But proceeding with our bar analogy, the intersection points and boundarypoints are:0, m1n , m1 +m2n ,··· , m1 +···+mr−1n ,1.By the above discussion, if p is not an intersection/boundary point both leftand right quantiles, which we denote by lqx and rqx respectively should bethe same and equal tolqx(p) = rqx(p) =z1 0 < p < m1nzi m1+···+mi−1n < p < m1+···+minzr m1+···+mr−1n < p < 1For the intersection points, if p = m1+···+mi−1n thenlqx(p) = zi−1 and rqx(p) = zi.For the boundary points we definelqx(0) = −∞,rqx(0) = z1,lqx(1) = zr,rqx(1) = ∞.As a convention, for a sorted vector y of length n, we define y0 = −∞ andyn+1 = ∞.Lemma 5.2.1 Suppose x is a data vector of length n and y = sort(x) =(y1,··· ,yn). Also let y0 = −∞ and yn+1 = ∞. For 0 < p < 1, let [np]denote the integer part of np. Thena) np = [np] ⇒ lqx(p) = y[np], rqx(p) = y[np]+1.b) np > [np] ⇒ lqx(p) = y[np]+1, rqx(p) = y[np]+1.c) y = sort(x) and pi = i/n, i = 0,1,··· ,n, impliesy = (lqx(p1),··· ,lqx(pn)) = (rqx(p0),··· ,rqx(pn−1)).1245.2. Definition of median and quantiles of data vectors and random samplesProofa) Let np = h ∈ N. There are four cases:1. For h = 0 and h = n the result is trivial by the definition of y0 andyn+1.2. 0 < h < m1 ⇒ 0 < p < m1/n and by definition lqx(p) = rqx(p) = z1.But yh = yh+1 = z1.3. There exists 1 < i ≤ r such that m1+···+mi−1 < h < m1+···+mi ⇒m1+···+mi−1n < p <m1+···+min and by definition lqx(p) = rqx(p) = zi.But yh = yh+1 = zi since m1 +···+mi−1 < h < m1 +···+mi.4. h = m1 +···+ mi,i < r ⇒ p = m1+···+min , i < r. By definition sincethis is an intersection point lqx(p) = zi and rqx(p) = zi+1. But zi = yhand zi+1 = yh+1.b) Let h = [np] ⇒ hn < p < h+1n . Since h and h + 1 differ exactly by oneunit, there exists an i such thatm1 +···+mi−1n ≤hn < p <h+ 1n ≤m1 +···+min .Then by definition lqx(p) = rqx(p) = zi. But sincem1 +···+mi−1 < h+ 1 ≤ m1 +···+mi,yh+1 = zi.c) Straightforward consequence of the definition.Supposey′ ∈ {y1,··· ,yn}, for futurereference, we define some additionalnotations for data vectors.Definition The minimal index of y′, m(y′) and the maximal index of y′,M(y′) are defined as below:m(y′) = min{i|yi = y′}, M(y′) = max{i|yi = y′}.It is easy to see that in y = sort(x) = (y1,··· ,yn) all the coordinatesbetween m(y′) and M(y′) are equal to y′. Also note that if y′ = zi thenM(y′) −m(y′) + 1 = mi is the multiplicity of zi. We use the notation mxand Mx whenever we want to emphasize that they depend on the data vectorx.1255.2. Definition of median and quantiles of data vectors and random samplesLemma 5.2.2 Suppose x = (x1,··· ,xn), y = sort(x) and z a non–decreasingvector of all distinct elements of x. 
Thena) m(zi+1) = M(zi)+ 1, i = 0,··· ,r−1.b) Suppose φ is a bijective increasing transformation over R,mφ(x)(φ(zi)) = mx(zi),andMφ(x)(φ(zi)) = Mx(zi),for i = 1,··· ,r.Proof a) is straightforward.b) Note thatmx(y′) = min{i|yi = y′} = min{i|φ(yi) = φ(y′)} = mφ(x)(φ(y′)).A similar argument works for Mx.We also define the position and standardized position of an element of adata vector.Definition Letx = (x1,··· ,xn)bea vector andy = sort(x) = (y1,··· ,y n).Then for y′ ∈ {y1,··· ,yn}, we defineposx(y′) = {mx(y′),mx(y′) + 1,··· ,Mx(y′)},where pos stands for position. Then we define the standardized position ofy′ to besposx(y′) = (mx(y′)−1n ,Mx(y′)n ).In the following lemma we show that for every p ∈ spos(y′) (and only p ∈spos(y′)), we have rq(p) = lq(p) = y′. For example if 1/2 ∈ spos(y′) then y′is the (left and right) median.Lemma 5.2.3 Suppose x = (x1,··· ,xn), y = sort(x) = (y1,··· ,yn) andy′ ∈ {y1,··· ,yn}. Thenp ∈ sposx(y′) ⇔ lqx(p) = rqx(p) = y′.Proof Letz = (z1,··· ,zr) bethereduced vector with multiplicities m1,··· ,mr.Then y′ = mi for some i = 1,··· ,r.1265.2. Definition of median and quantiles of data vectors and random samplescase I: If i = 2,··· ,r, thenm(y′) = m1 +···+mi−1 + 1,andM(y′) = m1 +···+mi.case II: If i = 1, then m(y′) = 1 and M(y′) = m1.In anyof theabove cases forp ∈ (m(y′)−1n , M(y′)n ) andonly p ∈ (m(y′)−1n , M(y′)n )rqx(p) = lqx(p) = zi,by definition.Now we prove a lemma that will become useful later on. It is easy to seethat if u ∈ pos(y′) then(u−1n , un) ⊂ spos(y′).We conclude that∪u∈pos(y′)(u−1n , un) ⊂ spos(y′).In fact spos(y′) can possibly have a few points on the edge of the intervalsnot in ∪u∈pos(y′)(u−1n , un).Lemma 5.2.4 Suppose x is a data vector of length n and y′ is an elementof this vector. Also assumey′ ≥ xi, i ∈ I, y′ ≤ xj, j ∈ J,I ∩J = φ, I,J ⊂ {1,2,··· ,n}.Then there exist a p in (|I|−1n ,1 − |J|n ) that belongs to spos(y′). In otherwords lq(p) = rq(p) = y′.Proof From the assumption, we conclude that pos(y′) includes a numberbetween |I| and n−|J|. Let us call it u0. Hence (u0−1n , u0n ) ⊂ spos(y′). Since|I| ≤ u0 ≤ n−|J|, we conclude that spos(y′) intersects with∪|I|≤u≤n−|J|(u−1n , un) ⊂ (|I|−1n ,1− |J|n ).1275.3. Defining quantiles of a distribution5.3 Defining quantiles of a distributionSo far, we have only defined the quantile for data vectors. Now we turn todefining the quantile for distribution functions.The p-th quantile for a random variable X with distribution function Fas pointed out above is traditionally defined to beq(p) = inf{u|F(u) ≥ p}.We showed by an example above the asymmetry issue to which that defini-tion can lead. We show that the issue arises due to the flatness of F in aninterval. To get around this problem as the case of data vectors, we definethe left and right quantile for the distribution F as follows:lqF(p) = inf{u|F(u) ≥ p},andrqF(p) = inf{u|F(u) > p}.If there are more than one random variables in the discussion, to avoidconfusion, we use the notations lqFX,rqFX. Also when there is no chance ofconfusion, we simply use lq,rq. The reason for this definition should becomeclear soon. First let us apply this definition to the fair coin example. Ifp negationslash= 1/2 then both lqF(p) and rqF(p) will be the same and give us the samevalue. However, lqF(1/2) = 0 and rqF(1/2) = 1. This is exactly what onewould hope for. 
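A direct numerical illustration is obtained by plugging the empirical distribution function into these two definitions. The minimal R sketch below does exactly that (the helper names lq_sample and rq_sample are ours; Lemma 5.7.4 below shows that this agrees with the data-vector definition):

lq_sample <- function(x, p) {                 # inf{ u : Fn(u) >= p }, Fn the empirical cdf
  y <- sort(x); Fn <- ecdf(x); min(y[Fn(y) >= p])
}
rq_sample <- function(x, p) {                 # inf{ u : Fn(u) > p }
  y <- sort(x); Fn <- ecdf(x); min(c(y[Fn(y) > p], Inf))
}
lq_sample(c(0, 1), 0.5); rq_sample(c(0, 1), 0.5)   # 0 and 1: the coin's two-state median
x <- c(0, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 10)
lq_sample(x, 0.5); rq_sample(x, 0.5)               # 1 and 2: both observed values, unlike the classical 1.5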
To see the consequences of this definition, we prove thefollowing lemma:Lemma 5.3.1 (Quantile Properties Lemma) Suppose X is a random vari-able on the probability space (Ω,Σ,P) with distribution function F:a) F(lqF(p)) ≡ P(X ≤ lqF(p)) ≥ p.b) lqF(p) ≤ rqF(p).c) p1 < p2 ⇒ rqF(p1) ≤ lqF(p2). This and (b) imply thatlqF(p1) ≤ rqF(p1) ≤ lqF(p2) ≤ rqF(p2).d) rqF(p) = sup{x|F(x) ≤ p}.1285.3. Defining quantiles of a distributione) P(lqF(p) < X < rqF(p)) = 0. In other words if lqF(p) < rqF(p) then Fis flat in the interval (lqF(p),rqF(p)).f) P(X < rqF(p)) ≤ p.g) If lqF(p) < rqF(p) then F(lqF(p)) = p and hence P(X ≥ rqF(p)) = 1−p.h) lqF(1) > −∞,rqF(0) < ∞ and P(rqF(0) ≤ X ≤ lqF(1)) = 1.i) lqF(p) and rqF(p) are non–decreasing functions of p.j) Suppose F has a jump at x, in other words P(X = x) > 0, which isequivalent to limy→x− F(y) < F(x). Then lqF(F(x)) = x.k) x < lqF(p) ⇒ F(x) < p and x > rqF(p) ⇒ F(x) > p.Proofa) Take a strictly decreasing sequence {xn} in R that tends to lq(p). Forevery xn, F(xn) ≥ p since xn > lq(p). OtherwiseF(xn) < p ⇒ F(y) < p, ∀y ≤ xn.Hence (−∞,xn]∩{y|F(y) ≥ p} = ∅. We conclude thatlq(p) = inf{y|F(y) ≥ p}≥ xn > lq(p),which is a contradiction. Now since F is right continuouslimn→∞F(xn) = F(lq(p)).But F(xn) ≥ p, ∀n ∈ N. Hence limn→∞F(xn) ≥ p.b) Note that {u|F(u) > p}⊂ {u|F(u) ≥ p}.c) Note that {x|F(x) ≥ p2} ⊂ {x|F(x) > p1} if p2 > p1.1295.3. Defining quantiles of a distributiond) Suppose p ∈ [0,1] is given. Let A = {x|F(x) > p} and B = {x|F(x) ≤p}. We want to show that inf A = supB.Consider two cases:1) Suppose inf A < supB. Then pick inf A < y < supB. We get acontradiction as follows:inf A < y ⇒ F(y) > p. Otherwise, since F is increasing F(y) ≤ p ⇒y < x, ∀x ∈ A ⇒ y ≤ inf A.y < supB ⇒ F(y) ≤ p. Otherwise, since F is increasing F(y) > p ⇒y > x, ∀x ∈ B ⇒ y ≥ supB.We conclude F(y) > p and F(y) ≤ p, a contradiction.2) Suppose supB < inf A. Take supB < y < inf A.supB < y ⇒ F(y) > p. Otherwise, F(y) ≤ p ⇒ y ∈ B ⇒ y ≤ supB.y < inf A ⇒ F(y) ≤ p. Otherwise F(y) > p ⇒ y ∈ A ⇒ y ≥ inf A.Once more F(y) > p and F(y) ≤ p which is a contradiction.e) Suppose F is not flat in that interval. ∃v1 < v2 ∈ (lq(p),rq(p)) such thatF(v2) > F(v1).F(v2) > F(v1) ≥ F(lq(p)) ≥ p.This is a contradiction since v2 < rq(p).f) Take an increasing sequence xn ↑ rqF(p), then note that P(X ≤ xn) ≤ psince xn < rqF(p). Let An = {X ≤ xn} and A = {X < rqF(p)} thenlimn→∞An = A, by continuity of the probability (See [9]):P(X < rqF(p)) = P( limn→∞An) = limn→∞P(An) ≤ p.g) By a) F(lqF(p)) = P(X ≤ lqF(p)) ≥ p. Suppose P(X ≤ lqF(p)) > p.This implies that lqF(p) ≥ rqF(p). By b) we get lqF(p) = rqF(p), whichis a contradiction.h) Note thatlqF(0) = inf{x|F(x) ≥ 0} = infR = −∞.Suppose rqF(0) = ∞. Then{x|F(x) > 0} = ∅ ⇒ ∀x ∈ R, F(x) = 0,a contradiction to the properties of a distribution function F.1305.3. Defining quantiles of a distributionAlso note thatrqF(1) = inf{x|F(x) > 1} = inf∅ = ∞.Suppose lqF(1) = −∞. Theninf{x|F(x) ≥ 1} = −∞ ⇒∀x ∈ R, F(x) ≥ 1 ⇒ ∀x ∈ R, F(x) = 1,a contradiction. For the second part note that rqF(0) ≤ lqF(1) by (c).ThenP(rqF(0) ≤ X ≤ lqF(1)) =1−P(lqF(1) < X < rqF(1)) − P(lqF(0) < X < rqF(0)) =1−0−0,by part (e).i) Trivial.j) Suppose P(X = x) > 0 then limy→x− F(y) = P(X < x) < P(X <x) + P(X = x) = F(x). Now assume that limy→x− F(y) < F(x), thenP(X < x) < F(x) ⇒ P(X = x) > 0.To prove that in this case lqF(F(x)) = x, let p = F(x) we want to showlqF(p) = x. Note that F(x) = p gives lqF(p) ≤ x. On other hand forany y < x, we know that F(y) < p, by a) y cannot be lqF(p). 
Hencex = lqF(F(x)).k) First part follows from the definition of lq and the second part from part(d).The following lemma is useful in proving that a specific value is the leftor right quantile for a given p.Lemma 5.3.2 (Quantile value criterion)a) lqF(p) is the only a satisfying (i) and (ii), where(i) F(a) ≥ p,(ii) x < a ⇒ F(x) < p.1315.4. Left and right extreme pointsb) rqF(p) is the only a satisfying (i) and (ii), where(i) x < a ⇒ F(x) ≤ p,(ii) x > a ⇒ F(x) > p.Proofa) Both properties hold for lqF(p) by previous lemma. If both a < b satisfythem, then F(a) ≥ p by (i). But since b satisfies the properties and a < b,by (ii), F(a) < p which is a contradiction.b) Both properties hold for rqF(p) by previous lemma. If both a < b satisfythem, then we can get a contradiction similar to above.5.4 Left and right extreme pointsIn Lemma 5.3.1, we showed these properties about rqX(0) and lqX(1):rqX(0) < ∞, lqX(1) > −∞,rqX(0) ≤ lqX(1),andP(rqX(0) ≤ X ≤ lqX(1)) = 1.The above states that all the mass is between these two values. We willshow in the next lemma that these values are also the minimal values tosatisfy this property. This is the motivation for the following definition.Definition We call rqF(0) the “left extreme” and lqF(1) the “right ex-treme” of the distribution function F.Lemma 5.4.1 (Left and right extreme points property)Suppose X is a random variable with distribution function F.a) The right extreme lqF(1) is the smallest a satisfyingP(X ≤ a) = 1.In other wordsmina {P(X ≤ a) = 1} = lqF(1).1325.5. The quantile functions as inverseb) The left extreme rqF(0) is the biggest a satisfyingP(X ≥ a) = 1.maxa {P(X ≥ a) = 1} = rqF(0).c) Consider the following subset of R2I2 = {(a,b) ∈R2|P(X ∈ [a,b]) = 1}.Then∩(a,b)∈I2[a,b] = [rqX(0),lqX(1)].Proof a) In Lemma 5.3.1, we showed F(lqF(1)) = 1. Also F(a) < 1 fora < lqF(1) by the definition of lqF.b) In Lemma 5.3.1, we showed P(X ≥ rqX(0)) = 1. Suppose a > rqX(0).Then since rqX(p) = inf{x|F(x) > 0},∃c ∈ {x|F(x) > 0},c < a ⇒∃c < a,F(c) > 0 ⇒∃c, P(X < a) ≥ F(c) > 0 ⇒P(X ≥ a) = 1−P(X < a) < 1.c) This is straightforward from a) and b).5.5 The quantile functions as inverseThe following lemma shows that lqX and rqX can be considered as theinverse of the distribution function in some sense.Lemma 5.5.1 (Quantile functions as inverse of the distribution function)a) F(x) < p ⇔ x < lqX(p). (i.e. {x|F(x) < p} = (−∞,lqF(p)).)b) {x|F(x) ≤ p} = (−∞,rqX(p)] or (−∞,rqX(p)).c) If F is continuous at rqX(p) then {x|F(x) ≤ p} = (−∞,rqX(p)].d) {x|F(x) ≥ p} = [lqX(p),∞).e) {x|F(x) > p} = (rqX(p),∞) or [rqX(p),∞).f) If F is continuous then {x|F(x) > p} = (rqX(p),∞).Proof1335.5. The quantile functions as inversea) (⇒) is true because otherwise if x ≥ lqX(p) ⇒ F(x) ≥ F(lqX(p)) ≥ p,which is a contradiction. To show (⇐) note that by the definition oflqX(p), if F(x) ≥ p then x ≤ lqX(p).b) We need to show that (1) (−∞,rqX(p)) ⊂ {x|F(x) ≤ p}and (2){x|F(x) ≤p} ⊂ (−∞,rqX(p)]. For (1), suppose x < rqX(p). We claim F(x) ≤ p.Otherwise if F(x) > p by the definition of rqX(p), rqX(p) ≤ x. For (2),suppose F(x) ≤ p. Then since rqX(p) = sup{x|F(x) ≤ p}, we concludex ≤ rqX(p).c) By Part (b), it suffices to show F(rqX(p)) = p. This is shown in the nextlemma.d) R.H.S ⊂ L.H.S by Lemma 5.3.1 part (a). L.H.S ⊂ R.H.S by the definitionof lq.e) Note that x > rqF(p) then F(x) > p by Lemma 5.3.1 part (k). 
AlsoF(x) > p ⇒ rqF(p) ≤ x by definition of rq.f) This is a consequence of part (e) and next lemma.For the continuous distribution functions, we have the following lemma.Lemma 5.5.2 (Continuous distributions inverse) If F is continuous F(x) =p ⇔ x ∈ [lqX(p),rqX(p)].Proof If x < lqX(p) then we already showed that F(x) < p. Also if x >lqX(p) then rqX(p) = sup{x|F(x) ≤ p} ⇒ F(x) > p. (Because otherwise ifF(x) ≤ p ⇒ rqX(p) ≥ p.) It remains to show that F(lqX(p)) = F(rqX(p)) =p. But by Lemma 5.3.1, we haveF(lqX(p)) ≥ p.Hence it suffices to show that F(rqX(p)) ≤ p. But by Part (f) of Lemma5.3.1 and continuity of FF(rqF(x)) = P(X ≤ rqF(x)) = P(X < rqF(x)) ≤ p.1345.6. Equivariance property of quantile functions5.6 Equivariance property of quantile functionsExample (Counter example for Koenker–Hao claim) Suppose X is dis-tributed uniformly on [0,1]. Then lqX(1/2) = 1/2. Now consider the follow-ing strictly increasing transformationsφ(x) =braceleftBiggx −∞ < x < 1/2x+ 5 x ≥ 1/2 .Let T = φ(X) then the distribution of T is given byP(T ≤ t) =0 t ≤ 0t 0 < t ≤ 1/21/2 1/2 < t ≤ 5 + 1/2t−5 5 + 1/2 < t ≤ 5+ 11 t > 5 + 1.It is clear form above that lqT(1/2) = 1/2 negationslash= φ(lqX(1/2)) = φ(1/2) =5 + 1/2.We start by definingφ≤(y) = {x|φ(x) ≤ y}, φ⋆(y) = supφ≤(y),andφ≥(y) = {x|φ(x) ≥ y}, φ⋆(y) = inf φ≥(y).Then we have the following lemma.Lemma 5.6.1 Suppose φ is non-decreasing.a) If φ is left continuous thenφ(φ⋆(y)) ≤ y.b) If φ is right continuous thenφ(φ⋆(y)) ≥ y.Proof1355.6. Equivariance property of quantile functionsa) Suppose xn ↑ φ⋆(y) a strictly increasing sequence. Then since xn < φ⋆(y),we conclude xn ∈ φ≤(y) ⇒ φ(xn) ≤ y. Hence limn→∞φ(xn) ≤ y. But byleft continuity φ(xn) ↑ φ(φ⋆(y)).b) Suppose xn ↓ φ⋆(y) a strictly decreasing sequence. Then since xn > φ⋆(y),we conclude xn ∈ φ≥(y) ⇒ φ(xn) ≥ y. Hence limn→∞φ(xn) ≥ y. But byright continuity φ(xn) ↓ φ(φ⋆(y)).Theorem 5.6.2 (Quantile Equivariance Theorem) Suppose φ : R → R isnon-decreasing.a) If φ is left continuous thenlqφ(X)(p) = φ(lqX(p)).b) If φ is right continuous thenrqφ(X)(p) = φ(rqX(p)).Proofa) We use Lemma 5.3.2 to prove this. We need to show (i) and (ii) in thatlemma for φ(lqX(p)). First note that (i) holds sinceFφ(X)(φ(lqX(p))) = P(φ(X) ≤ φ(lqX(p))) ≤ P(X ≤ lqX(p)) ≥ p.For (ii) let y < φ(lqX(p)). Then we want to show that Fφ(X)(y) < p. Itis sufficient to show φ⋆(y) < lqX(p). Because thenP(φ(X) ≤ y) ≤ P(X ≤ φ⋆(y)) < p.To prove φ⋆(y) < lqX(p), note that by the previous lemmaφ(φ⋆(y)) ≤ y < φ(lqX(p)).b) We use Lemma 5.3.2 to prove this. We need to show (i) and (ii) in thatlemma for φ(rqX(p)). To show (i) note that if y < φ(rqX(p)),P(φ(X) ≤ y) ≤ P(φ(X) < φ(rqX(p))) ≤ P(X < rqX(p)) ≤ p.1365.7. Continuity of the left and right quantile functionsTo show (ii), suppose y > φ(rqX(p)). We only need to show φ⋆(y) >rqX(p) because thenP(φ(X) ≤ y) ≥ P(X < φ⋆(y)) > p.But by previous lemma φ(φ⋆(y)) ≥ y > φ(rqX(p)). Hence φ⋆(y) >rqX(p).5.7 Continuity of the left and right quantilefunctionsLemma 5.7.1 (Continuity of quantile functions) Suppose F is a distribu-tion function. Thena) lqF is left continuous.b) rqF is right continuous.Proofa) Suppose pn ↑ p be a strictly increasing sequence in [0,1]. Then since lqFis increasing, lqF(pn) is increasing and hence has a limit we call y. We needto show y = lqF(p). We show this in two steps:1. y ≤ lqF(p): Let A = {x|F(x) ≥ p}. Then for any x ∈ A:F(x) ≥ p ⇒ F(x) ≥ pn ⇒ x ≥ lqF(pn) ⇒ x ≥ supn∈NlqF(pn) ⇒ x ≥ y.Hence lqF(p) = inf A ≥ y.2. y ≥ lqF(p): We only need to show that F(y) ≥ p. 
Buty ≥ lqF(pn), ∀n ⇒ F(y) ≥ F(lqF(pn)) ≥ pn, ∀n ⇒ F(y) ≥ p.b) Take a strictly decreasing sequence pn ↓ p, we need to show rqF(pn) →rq(p). The limit of rqF(pn) exists since rq is non–decreasing. Let y =infn∈NrqF(pn). We proceed in two steps:1. rqF(p) ≤ y:rqF(p) ≤ rqF(pn), ∀n ∈ N ⇒ rqF(p) ≤ infn∈NrqF(pn) = y.1375.7. Continuity of the left and right quantile functions2. rqF(p) ≥ y: Since rqF(p) = sup{x|F(x) ≤ p} by Lemma 5.3.1, weonly need to show z < y ⇒ F(z) ≤ p. But if F(z) > p thenF(z) > pn for some n ∈ N⇒ z ≥ rqF(pn) for some n ∈ N.Hence,y > z ≥ rq(pn) for some n ∈ N,which is a contradiction to y = infn∈Nrq(pn).FX is a function that ranges over [0,1]. Once F hits 1 it will remain one.Similarly before F becomes positive it is always zero. This is the motivationfor the following definition.Definition SupposeF is a distribution function. We define the real domainof F to be RD(F) = {x|0 < F(x) < 1}.Lemma 5.7.2 Suppose F is a distribution function. ThenRD(F) = (rq(0),lq(1)) or RD(F) = [rq(0),lq(1)).Proof We proceed in two steps (a),(b).(a) RD(F) ⊂ [rq(0),lq(1)):Note that (a) ⇔ [rq(0),lq(1))c ⊂ RD(F)c, where c stands for taking thecompliment of a set in R. If x ∈ [rq(0),lq(1))c then x < rq(0) or x ≥ lq(1).x < rq(0) then F(x) = 0 by the definition of rq(0).x ≥ lq(1) then F(x) ≥ F(lq(1)) ≥ 1 ⇒ F(x) = 1.(b) (rq(0),lq(1)) ⊂ RD(F):x > rq(0) ⇒ F(x) > 0. (This is because rq(0) = sup{x|F(x) ≤ 0}.)x < lq(1) ⇒ F(x) < 1. (This is because lq(1) = inf{x| F(x) = 1}.)Definition For a random variable X with distribution function F, we definethe L-quantile and R-quantile functions on R:LQF : R → R, LQF = lqF ◦F,RQF : R → R, RQF = rqF ◦F.1385.7. Continuity of the left and right quantile functionsLemma 5.7.3 (Properties of LQ and RQ)a) LQF,RQF are non–decreasing.b)LQF(x) ≤ x ≤ RQF(x).c) LQF,RQF are left continuous and right continuous, respectively.d) lqF(F(x)) = rqF(F(x)) ⇒ LQF(x) = RQF(x) = x.e) We have the following equalities:LQF(v) = inf{u|F(u) = F(v)}, RQF(v) = sup{u|F(u) = F(v)}.f) P(LQF(x) < X < RQF(x)) = 0.Proofa) This result follows from the fact that lqF,rqF and F are non–decreasing.b) LQF(x) = inf{y|F(y) ≥ F(x)}. Since x ∈ {y|F(y) ≥ F(x)}, x ≥LQF(x).RQF(x) = sup{y|F(y) ≤ F(x)}. Since x ∈ {y|F(y) ≤ F(x)}, RQF(x) ≥ x.c) Suppose xn ↓ x is a strictly decreasing sequence, then F(xn) ↓ F(x) sinceF is right continuous. Hence rqF(F(xn)) ↓ rqF(F(x)) since rqF is rightcontinuous by Lemma 5.7.1.To prove LQF is left continuous, let xn ↑ x be a strictly increasing sequenceand let pn = F(xn). Then since{pn} is an increasing and bounded sequence,pn → p′. Also let F(x) = p. We consider two cases:1. p = p′. In this case pn ↑ p is a strictly increasing sequence. Since lqFis left continuous, limn→∞LQF(xn) = limn→∞lqF(pn) = lqF(p) =LQF(x).2. p′ < p. This means F has a jump at x. By Lemma 5.3.1 j), LQF(x) =lqF(F(x)) = x. Let y = limn→∞lqF(F(xn)). We claim y ≥ x.Otherwise since F(x) = p and F has a jump at p, F(y) < p ⇒F(y) < pn, for some n ∈ N. But y = supn∈Nlq(F(xn)). Hencey ≥ lq(F(xn)) and F(y) ≥ F(lq(pn)) ≥ pn > p a contradiction. Thusy = limn→∞lqF(F(xn)) ≥ x.Also note that lqF(pn) ≤ lqF(F(x)) = x, ∀n ⇒ y = supn∈N lqF(pn) ≤lqF(F(x)) = x. We concludey = x. In other wordsy = limn→∞LQF(xn) =LQF(x).d) This result is a straightforward consequence of b).e) This result follows immediately from the definition of these quantiles.1395.7. 
Continuity of the left and right quantile functionsf) P(LQF(x) < X < RQF(x)) = P(lqF(F(x)) < X < rqF(F(x))) = 0, byLemma 5.3.1.Example Suppose the distribution function F depicted in Figure 5.1 isgiven as followsF(x) =2π arctan(x)+15 x ≤ 01/5 0 ≤ x ≤ 1x/5 1 ≤ x < 23/5 2 ≤ x < 32π arctan(x−3)+45 x ≥ 3.Then lqF(0.2) = 0, rqF(0.2) = 1, lqF(0.5) = rqF(0.5) = 2 and lqF(0.55) =rqF(0.55) = 2. We have also plotted lq,rq,LQ,RQ in Figures 5.2 to 5.5.If we are given a data vector, we can compute the sample distributionand then compute the left and right quantile functions. In the sequel, weshow that we get the same definition as we gave for left and right quantilefor a vector.Lemma 5.7.4 Suppose a data vector x is given and Fn is its sample distri-bution. Then lqx(p) = lqFn(p) and rqx(p) = rqFn(p).ProofWe show this for non–intersection points. Similar arguments work forintersection points. If p is not an intersection point, thenm1+···+mi−1n < p <m1+···+min and rqx(p) = lqx(p) = zi. We want to showthat inf{u|Fn(u) ≥ p} is also zi, whereFn(u) =nsummationdisplayi=1I(−∞,xi](u).But it follows that:Fn(zi) = m1 +···+min ;lqFn(p) = inf{u|Fn(u) ≥ p};rqFn(p) = inf{u|Fn(u) > p}.1405.7. Continuity of the left and right quantile functions−2 −1 0 1 2 3 4 50.00.20.40.60.81.0xFFigure 5.1: An example of a distribution function with discontinuities andflat intervals.1415.7. Continuity of the left and right quantile functions0.0 0.2 0.4 0.6 0.8 1.0−2−1012345plq(p)Figure 5.2: The left quantile (lq) function for the distribution function givenin Example 5.7. Notice that this function is left continuous and increasing.0.0 0.2 0.4 0.6 0.8 1.0−2−1012345prq(p)Figure 5.3: The right quantile (rq) function for the distribution functiongiven in Example 5.7. Notice that this function is right continuous andincreasing.1425.7. Continuity of the left and right quantile functions−4 −2 0 2 4 6 8−2−1012345xLQ(x)Figure 5.4: LQ function for Example 5.7. Notice that this function is in-creasing and left continuous.−4 −2 0 2 4 6 8−2−1012345xRQ(x)Figure 5.5: RQ function for Example 5.7, notice that this function is in-creasing and right continuous.1435.8. Equality of left and right quantilesSince Fn is a step function the right hand side of the two above equationscan only be one of −∞,z1,··· ,zr,∞. The first u that makes Fn greaterthan or equal to p is zi, proving the assertion.Lemma 5.7.4 guarantees that our definition of quantile for data vectorsis consistent with the definition for distributions.Lemma 5.3.1 shows that if a distribution function F is flat then rq andlq might differ. To study this further when rq and lq are equal, we definethe concept of heavy and weightless points in the next section.5.8 Equality of left and right quantilesThis section finds necessary and sufficient conditions for the left and rightquantiles to be equal. We start with some definitions.Definition Suppose X is a random variable with the distribution functionF. x ∈ R is called a weightless point of a distribution function F if thereexist a neighborhood (an open interval) around x such that F is flat in thatneighborhood. We call a point heavy if it is not weightless. Denote the setof all heavy points by H.Definition A point x ∈ R is called a super heavy point ifP(X ∈ (x−ǫ,x]) > 0,P(X ∈ [x,x+ǫ)) > 0, ∀ǫ > 0.We denote the set of super heavy points by SH. 
Obviously any super heavypoint is heavy.We can also define right heavy points and left heavy points.Definition A point x ∈ R is called a right heavy point ifP(X ∈ [x,x +ǫ)) > 0, ∀ǫ > 0.We show the set of all right heavy points by RH. A point x ∈R is called aleft heavy point ifP(X ∈ (x−ǫ,x]) > 0, ∀ǫ > 0.We denote the set of all such points by LH. Obviously any heavy pointis either right heavy or left heavy. Also a super heavy point is both rightheavy and left heavy.1445.8. Equality of left and right quantilesLemma 5.8.1 Suppose X is a random variable with distribution functionF. Also suppose that u1 < u2 are heavy points and F is flat on [u1,u2]i.e. F(u1) = F(u2). Then lq(p) = u1 and rq(p) = u2, where p = F(u1) =P(X ≤ u1).Proof1. lq(p) = u1: Since F(u1) = p, lq(p) ≤ u1. Suppose lq(p) < u1. ThenP(lq(p) < X < u2) > 0,since u1 is a heavy point. We can rewrite above asP(lq(p) < X ≤ u1) +P(u1 < X < u2) > 0,the second term is zero by the flatness assumption. HenceP(lq(p) < X ≤ u1) > 0.But thenP(X ≤ lq(p)) = p(X ≤ u1)−P(lq(p) < X ≤ u1) < p,which is a contradiction to Lemma 5.3.1 a).2. rq(p) = u2: From F(u) = p for all u1 ≤ u < u2, we conclude rq(p) ≥u2. To prove the inverse, note that for any u1 < u3 < u2, F(u3) = psince F is flat on [u1,u2]. Since rq(p) = sup{x|F(x) ≤ p} by Lemma5.3.1, rq(p) ≥ u2. Now note that since u2 is heavy, for any u3 > u2,P(u1 < X < u3) > 0 ⇒ F(u3) = F(u1) +P(u1 < X < u3) > p.Hence only values less than or equal to u2 are in {x|F(x) ≤ p}. Weconclude the sup is at most u2. In other words rq(p) ≤ u2.Lemma 5.8.2 Suppose X is a random variable with distribution functionF. Thenv is a weightless point ⇔ v ∈ (LQF(v),RQF(v)).1455.8. Equality of left and right quantilesProof (⇐): This is trivial by Lemma 5.7.3 part (f).(⇒): If v /∈ (LQF(v),RQF(v)) ⇒ LQF(v) = RQF(v) = v by Lemma 5.7.3.RQF(v) = v ⇒ inf{x| F(x) > F(v)} = v ⇒F(x) > F(v), ∀x > v ⇒ P(v < X ≤ x) > 0, ∀x > v ⇒P(v < X < x) > 0,∀x > v,where the last (⇒) is because for any x > v, we can take v < x′ < x andnote that P(v < X < x) ≥ P(x < X ≤ x′) > 0. We conclude v is a rightheavy point which is a contradiction.For a weightless point v, there is an interval (a,b) such that v ∈ (a,b)and F is flat in that interval. It is useful to consider the flat interval aroundv. This is the motivation for the following definition.Definition Suppose X is a random variable with distribution function Fand v is a weightless point of F. Then we define the weightless interval ofv, I(v) byI(v) = ∪a<b,F(a)=F(b)=F(v)(a,b).Lemma 5.8.3 Suppose F is a distribution function and v is a weightlesspoint of this distribution function. ThenI(v) = (LQF(v),RQF(v)).Proof (L.H.S ⊂ R.H.S): x ∈ (a,b) for some a,b where F(a) = F(b) = F(v)then F(x) = F(v). Take x1,x2 such that a < x1 < x < x2 < b thenF(x1) = F(x2) = F(v) ⇒LQF(v) ≤ x1 < x < x2 ≤ RQF(v) ⇒x ∈ (LQF(v),RQF(v)).(R.H.S ⊂ L.H.S): This is trivial since if v is weightless then LQF(v) <RQF(v). Let a = LQF(v) and b = RQF(v) then (a,b) ⊂ I(v) by definitionof I(v).Corollary 5.8.4 For any weightless point v, its weightless interval is indeedan open interval.Lemma 5.8.5 Suppose X is a random variable with a distribution functionF and v,v′ are weightless points then I(v) = I(v′) or I(v)∩I(v′) = ∅.1465.8. Equality of left and right quantilesProof Suppose I(v)∩I(v′) negationslash= ∅. Fix u ∈ I(v)∩I(v′). 
ButF(u) = F(v) ⇒LQF(u) = lqF(F(u)) = lqF(F(v)) = LQF(v)andRQF(u) = rqF(F(u)) = rqF(F(v)) = RQF(v).Hence by the previous lemmaI(u) = (LQF(u),RQF(u)) = (LQF(v),RQF(v)) = I(v).A similar argument shows that I(u) = I(v′) and this completes the proof.Theorem 5.8.6 Suppose X is a random variable with distribution functionF, thena) Let N be the set of all weightless points. Then N is measurable andof probability zero.b) The ranges of rqF and lqF do not intersect N. In other wordsrange(rqF)∪range(lqF) ⊂ H.c) Any heavy point is either lqF(p) or rqF(p) (or both) for some p ∈ [0,1].In other wordsH ⊂ range(rqF)∪range(lqF).More precisely, if x is right heavy then x ∈ range(rqF) and if x is left heavythen x ∈ range(lqF).d) x = lq(p) = rq(p) for some p ∈ [0,1] if and only if x is a super heavypoint. Also H −SH is countable.Proofa) Suppose v is a weightless point and consider I(v) = (LQF(v),RQF(v)).Then by Lemma 5.7.3, all the points in I(v) are weightless. We showed thatI(v) ∩I(v′) negationslash= ∅, then I(v) = I(v′). Hence N can be written as a disjointunion of the form:N = ∪v∈N′I(v),1475.8. Equality of left and right quantilesfor some N′ ⊂ N. Pick a rational number qv ∈ I(v), v ∈ N′ (“the Axiomof choice” from set theory is not needed to pick a rational number froman interval (a,b) because one can take a rational number by comparing theexpansion of a and b in the base 10). ButI(v)∩I(v′) = ∅, v negationslash= v′ ∈ N′ ⇒ qv negationslash= qv′.This shows N′ is countable since the set of rational numbers is countable.Hence, N is a countable union of intervals and is measurable. Moreover,P(N) = P(∪v∈N′I(v)) =summationdisplayv∈N′P(X ∈ I(v)) = 0.b) Suppose z ∈ N. Then there exist a,b such that a < z < b and P(a <X < b) = 0. Take a′,b′ such that a < a′ < z < b′ < b. Suppose z = lqF(p)for some p. Then P(X ≤ z) ≥ p and also P(X ≤ a′) = P(X ≤ z) ≥ p. Thisis a contradiction since z is the left quantile. Similarly, suppose z = rqF(p)for some p. Then since z < b′, F(b′) > p while a′ < z gives F(a′) ≤ p. HenceP(a′ ≤ X ≤ b′) > 0, a contradiction.c) Assume x is right heavy. Then let p = F(x). We claim that rqF(p) = x.Suppose rqF(p) = x′ < x then F(x) = p is a contradiction to rqF(p) =sup{y|F(y) ≤ p}. On the other hand for any x′ > x, pick x < x′′ < x′. Wehave F(x′′) > p since x is right heavy. Since rqF(p) = inf{y|F(y) > p} andF(x′′) > p then x′ > rqF(p). We conclude that rqF(p) = x.Now suppose x is left heavy. Let p = F(x). We claim lqF(p) = x. Firstnote that for any x′ < x, F(x′) < F(x) = p since x is left heavy. HencelqF(p) ≥ x. But F(x) = p and since lqF(p) = inf{y|F(y) ≥ p} we are done.d) The necessary and sufficient conditions follow immediately from c). Toshow that H − SH is countable, we prove LH − SH and RH − SH arecountable. To that end, for any x ∈ LH−SH consider Ix = (LQ(x),RQ(x)).Since x is not super heavy this interval has positive length. Also note thatx < y,x,y ∈ H implies Ix ∩Iy = ∅. To prove this, note that since x,y areleft heavy, LQ(x) = x and LQ(y) = y. We concludeIx = (x,RQ(x))Iy = (y,RQ(y)).If Ix ∩Iy is nonempty then we conclude x < y < RQ(x). Then0 = P(X ∈ (x,RQ(x))) ≤ P(X ∈ (x,y)) > 0.1485.8. Equality of left and right quantiles(P(X ∈ (x,y)) > 0 since y is left heavy.) This is a contradiction and henceIx ∩Iy = ∅. Now pick a rational number qx ∈ Ix. ThenIx ∩Iy = ∅⇒ qx negationslash= qy.Since the set of rational numbers is countable LH − SH is countable. Asimilar argument works for RH −SH.Lemma 5.8.7 Suppose X is a random variable with distribution functionF. 
Then the set A = {p| p ∈ [0,1], lqF(p) negationslash= rqF(p)} is countable.Proof For every p ∈ A let J(p) = (lqF(p),rqF(p)). Then for every x ∈ J(p),F(x) = p. (F(x) ≥ F(lqF(p)) ≥ p. Now if F(x) > p, we get a contradictionto x < lqX(p).) We concludep,p′ ∈ A,p negationslash= p′ ⇒ J(p)∩J(p′) = ∅.The intervals are disjoint, every interval has a positive length and their unionis a subset of [0,1]. Hence there are only countable number of such intervals.We conclude A is countable.The following lemma gives sufficient and necessary conditions for lqX =rqX, ∀p ∈ (0,1).Lemma 5.8.8 lqX(p) = rqX(p), p ∈ (0,1) iff FX is strictly increasing.Proof (⇒)lqX(p) = inf{x|FX(x) ≥ p} =inf{x|x ≥ F−1X (p)} =inf{x|x > F−1X (p)} = rqX(p) .(⇐): If Fx is not strictly increasing then ∃x2 < x1 s.t FX(x1) = FX(x2).Then let p = FX(x1). We also have p = FX(x2). HencelqX(p) = inf{FX(x) ≥ p} ≤ x1,andrqX(p) = sup{FX(x) ≤ p}≥ x2,which is a contradiction.1495.9. Distribution function in terms of the quantile functions5.9 Distribution function in terms of the quantilefunctionsIt is interesting to understand the connections amongst lq, rq and F. Weanswer the following question:Question: Given one of lq,rq or F, are the other two uniquely determined?The answer to this question is affirmative and the following theorem saysmuch more.Theorem 5.9.1 Suppose F is a distribution function. Thena) For p0 ∈ (0,1), lq(p0) = limp→p−0rq(p0). Hence, the function rq uniquelydetermines lq.b) For p0 ∈ (0,1), rq(p0) = limp→p+0lq(p0). Hence lq uniquely determinesrq.c) lq or rq continuous at p0 ∈ (0,1) ⇒ lq(p0) = rq(p0).d) lq(p0) = rq(p0) ⇒ lq and rq are continuous at p0.e) lq is continuous at p ⇔ rq is continuous at p.f) F(x) = inf{p|lq(p) > x}.g) F(x) = inf{p|rq(p) > x}.Proofa) Take a strictly increasing sequence pn ↑ p0 in [0,1]. Thenpn−1 < pn < pn+1 ⇒lq(pn−1) < rq(pn) < lq(pn+1), (5.4)by Lemma 5.3.1, part (c). By the left continuity of lq, lq(pn) → lq(p0).Applying the Sandwich Theorem about the limits from elementary cal-culus to the Equation (5.4), we conclude that rq(pn) → lq(p0).1505.9. Distribution function in terms of the quantile functionsb) Take a strictly decreasing sequence pn ↓ p0 in [0,1]. Thenpn−1 > pn > pn+1 ⇒rq(pn−1) > lq(pn) > rq(pn+1), (5.5)again by Lemma 5.3.1, part (c). By the right continuity of rq, rq(pn) →rq(p0). Applying the Sandwich Theorem for limits to Equation (5.5), weconclude that lq(pn) → rq(p0).c) Suppose lq is continuous at p0. Then limp→p+0lq(p) = lq(p0). But by theprevious parts of this theorem, we also have limp→p+0= rq(p0). Similararguments work if rq is continuous at p0.d) To prove lq is continuous at p0 note thatlimp→p−0lq(p0) = lq(p0) = rq(p0) = limp→p+0lq(p0),where the first equality comes from the left continuity of lq and the lastone comes from (b). Similar arguments work for rq.e) This result follows immediately from the previous two parts.f) Let A = {p|lq(p) > x}. We want to show that F(x) = inf A.To do that we first show that F(x) ≤ inf A.By Lemma 5.7.3,lq(F(x)) ≤ x ⇒ F(x) ≤ a, ∀a ∈ A ⇒ F(x) ≤ inf A.It remains to show that inf A ≤ F(x). Suppose to the contrary thatF(x) < inf A. Then take F(x) < p0 < inf A to getlq(p0) ≤ x,p0 > F(x)⇒ F(lq(p0)) ≤ F(x),p0 > F(x).But by Lemma 5.3.1 part (a), p0 ≤ F(lq(p0)). Hencep0 ≤ F(lq(p0)) ≤ F(x),p0 > F(x),which is a contradiction.1515.10. Two-sided continuity of lq/rqg) Let B = {p|rq(p) > x} and A be as the previous part. Then F(x) =inf A ≤ inf B.It only remains to show that inf B ≤ F(x). 
Otherwise, we can pick p0,F(x) < p0 < inf B so thatrq(p0) ≤ x,p0 > F(x) ⇒p0 ≤ F(rq(p0)) ≤ F(x),p0 > F(x),which is a contradiction.5.10 Two-sided continuity of lq/rqLemma 5.10.1 Suppose F is a distribution function for the random vari-able X and lq,rq are its corresponding left and right quantile functions.Thena) F is continuous ⇔ lq is strictly increasing on (0,1).b) F is strictly increasing on RD(F) = {x|0 < F(x) < 1} = (rq(0),lq(1))or [rq(0),lq(1)) ⇔ lq is continuous on (0,1).Proof a)(⇒): F is continuous iff P(X = x) = 0, ∀x ∈ R. If the R.H.S does nothold then x = lq(p1) = lq(p2), p1 < p2. Then for every y < x, we haveF(y) < p1. HenceP(X < x) = limy→x−P(X ≤ y) ≤ p1 < p2.But F(x) ≥ p2 since lq(p2) = x and we conclude P(X = x) ≥ p2 − p1, acontradiction.(⇐): If F is not continuous then P(X = x) = ǫ > 0 for some x ∈ R. Letp = F(x) then P(X < x) = p−ǫ. Pick p1 < p2 in the interval (p−ǫ,p) thenlq(p1) = lq(p2) = x.b)(⇒): lq is left continuous. Hence if it is not continuous thenlimp→p+0lq(p) = rq(p0) negationslash= lq(p0).Hence F is flat on (lq(p0),rq(p0)) negationslash= ∅, which is a contradiction to F beingincreasing.1525.11. Characterization of left/right quantile functions(⇐): Suppose F is not continuous on RD(F), then there exist a,b ∈ R suchthat F is flat on [a,b]:F(a) = F(b) = p ∈ (0,1).But then lq(p) ≤ a and rq(p) ≥ b. Hence lq(p) negationslash= rq(p), which means lq isnot continuous.Remark. We can replace lq is the above lemma by rq. A similar argumentcan be done for the proof.5.11 Characterization of left/right quantilefunctionsThe characterization of the distribution function is a well–known result inprobability. Here we characterize the left and right quantile functions of adistribution. We start by some simple lemmas which we need in the proof.Lemma 5.11.1 Suppose An ⊂ R,n ∈ N. Theninf∪n∈NAn = infn∈N(inf An)Proofa) inf∪n∈NAn ≥ infn∈N(inf An):a ∈∪n∈NAn ⇒ ∃m ∈ N, a ∈ Am ⇒ ∃m ∈ N, a ≥ inf Am ⇒ a ≥ infn∈N(inf An).Hence, inf∪n∈NAn ≥ infn∈N(inf An).b) inf∪n∈NAn ≤ infn∈N(inf An):inf∪n∈NAn ≤ inf Am, ∀m ∈ N ⇒ inf∪n∈NAn ≤ infn∈N(inf An).Lemma 5.11.2 Suppose h : (0,1) → R is a non–decreasing function. ThenG(x) = inf{p ∈ (0,1)|h(p) > x} is a distribution function.Proof a) We claim G is non–decreasing. Suppose x1 < x2 then let A ={p|h(p) > x1} and B = {p|h(p) > x1}. Then G(x1) = inf A and G(x2) =inf B. But clearly B ⊂ A hence G(x1) ≤ G(x2).1535.11. Characterization of left/right quantile functionsb) limx→∞G(x) = 1: First note that such a limit exist and is bounded by1. (Because the domain of h is (0,1)). Assume limx→∞G(x) = q < 1, takeq < q′ < 1 then take x0 > h(q′). Let A = inf{p|h(p) > x0} such thatG(x0) = inf A. Then(p ∈ A ⇒ h(p) > x0 > h(q′) ⇒ p > q′) ⇒ G(x0) = inf A ≥ q′ > q.We have shown there is an x0 such that G(x0) > q this is a contradiction tolimx→∞G(x) = q since G is non-decreasing.c) Suppose that limx→−∞G(x) = q > 0 then take 0 < q′ < q and x0 < h(q′).Let A = inf{p|h(p) > x0} such that G(x0) = inf A. We haveh(q′) > x0 ⇒ q′ ∈ A ⇒ inf A ≤ q′ ⇒ G(x0) ≤ q′ < qThis contradicts limx→−∞G(x) = q > 0 since G is non–decreasing.d) G is right continuous: limx→x+0G(x) = x0. Suppose xn ↓ x0. In theprevious lemma, let An = {p|h(p) > xn} and A = ∪n∈NAn = {p|h(p) > x0}.ThenG(x0) = inf A = inf∪n∈NAn = infn∈N(inf An) = infn∈NG(xn) = limx→x+0G(x).Theorem 5.11.3 (Quantile function characterization theorem) Suppose afunction h : (0,1) → R is given. 
Then(a) h is a left quantile function for some random variable X iff h is leftcontinuous and non–decreasing.(b) h is a right quantile function for some random variable X iff h is rightcontinuous and non–decreasing.Proof If h is a left quantile function, then h is left continuous and non–decreasing as we showed in previous sections. Also if h is right continuousfunction then h is non–decreasing and right continuous. For the inverse ofboth a) and b) define G as in the above lemma. We will prove that h is lqGin a) and rqG in b).(a) Let A = {x|G(x) ≥ p0}, we want to show h(p0) = inf A.1545.11. Characterization of left/right quantile functions(i) inf A ≤ h(p0): Otherwise if inf A > y > h(p0), then:inf A > y ⇒ inf{x|G(x) ≥ p0} > y ⇒ G(y) < p0 ⇒inf{p ∈ (0,1)|h(p) > y} < p0 ⇒∃p ∈ (0,1), h(p) > y, p < p0 ⇒∃p ∈ (0,1)h(p0) ≥ h(p) > y,which is a contradiction.(ii) inf A ≥ h(p0) :x ∈ A ⇒ G(x) ≥ p0 ⇒ inf{p ∈ (0,1)|h(p) > x}≥ p0.Hence,∀p < p0, h(p) ≤ x ⇒ limp→p−0h(p) ≤ x ⇒ h(p0) ≤ x,by left continuity of h. Hence∀x ∈ A, h(p0) ≤ x ⇒ h(p0) ≤ inf A.(b) Let A = {x|G(x) > p0}, we want to show h(p0) = inf A.(i) inf A ≤ h(p0): Otherwise if inf A > y > h(p0), theninf A > y ⇒ y /∈ A ⇒ G(y) ≤ p0 ⇒inf{p′ ∈ (0,1)|h(p′) > y} ≤ p0 ⇒∀p > p0,inf{p′|h(p′) > y} < p ⇒∀p > p0, ∃p′ ∈ (0,1), h(p′) > y,p′ < p ⇒∀p > p0, ∃p′ ∈ (0,1), h(p) ≥ h(p′) > y ⇒h(p0) ≥ ywhich is a contradiction.(ii) inf A ≥ h(p0) :x ∈ A ⇒ G(x) > p0 ⇒ inf{p ∈ (0,1)|h(p) > x} > p0 ⇒p0 /∈ {p ∈ (0,1)|h(p) > x} ⇒ h(p0) ≤ x.Hence h(p0) ≤ inf A.Now we characterize the quantile functions of data vectors. See Figure5.6 for an example of quantile functions for the vectorx = (−2,−2,2,2,2,2,4,4,4,4).1555.11. Characterization of left/right quantile functions0.0 0.2 0.4 0.6 0.8 1.0−4−2024lq(p)0.0 0.2 0.4 0.6 0.8 1.0−4−2024prq(p)Figure 5.6: For the vector x = (−2,−2,2,2,4,4,4,4) the left (top) and right(bottom) quantile functions are given.1565.12. Quantile symmetriesTheorem 5.11.4 (Data vector quantile function characterization theorem)a) h : (0,1) → R is a left quantile function for a data vector x iff h is a leftcontinuous step function with no steps (jumps) or a finite number of steps(jumps) at some points 0 < a1 < a2 < ··· < ak < 1 where ai = 1nni, forsome n,ni ∈N.b) h : (0,1) → R is a right quantile function for a data vector x iff h isa right continuous step function with no steps (jumps) or finite number ofsteps (jumps) at some points 0 < a1 < a2 < ··· < ak < 1 where ai = 1nnifor some n,ni ∈ N.ProofWe only prove a) and b) is obtained either by repeating a similar argu-ment or using the Quantiles Symmetry Theorem (Theorem 5.12.3), whichwe prove in next sections.a) (⇒) For x = (x1,··· ,xn), it is clear that lqx is a step function withjumps at points proportional to 1/n and we proved the left continuity before.a) (⇐) Theresultis easyto show ifhhasno jumps. Leth′ = limx→+∞h(x)and supposehis given with jumpsata1 < a2 < ··· < ak,a1 = n1(1/n),··· ,ak =nk(1/n). Let b1 = a1,b2 = a2 −a1,··· ,bk = ak −ak−1,bk+1 = 1−ak. Thenbi = 1nmi,i = 1,2,··· ,k+1 with m1 = n1,m2 = n2−n1,··· ,mk = nk−nk−1and finally mk+1 = n −summationtextki=1 mi. Then let x be a data vector with h(ai)repeated mi times. We claim that h = lqx. First note that x is of length n.For 0 < p ≤ a1, we have lqx(p) = h(a1) = h(p). For ai−1 < p ≤ ai,i ≤ k, wehave ni−1n =summationtexti−1j=1 mjn < p ≤summationtextij=1 mjn =nin . 
Hencelqx(p) = h(ai) = h(p), ai−1 < p ≤ ai,i ≤ k.For ak < p < 1, we have nkn =summationtextkj=1 mjn < p < 1,lqx(p) = h′ = h(p), ak < p < 1.5.12 Quantile symmetriesThis section studies the symmetry properties of distribution functions andquantile functions. Symmetry is in the sense that if X is a random vari-able with left/right quantile function, some sort of symmetry between the1575.12. Quantile symmetriesquantile functions of X and −X should exist. We only treat the quantilefunctions for distributions here but the results can readily be applied to datavectors by considering their empirical distribution functions.Here consider different forms of distribution functions. The usual oneis defined to be FcX(x) = P(X ≤ x). But clearly one could have alsoconsidered FoX(x) = P(X < x), GcX(x) = P(X ≥ x) or GoX(x) = P(X > x)to characterize the distribution of a random variable. We call Fc the left–closed distribution function, Fo the left–open distribution function, Gc theright–closed and Go the right–open distribution function. Like the usualdistribution function these functions can be characterized by their limits ininfinity, monotonicity and right continuity.First note thatFc−X(x) = P(−X ≤ x) = P(X ≥ −x) = GcX(−x).Since the left hand side is right continuous, GcX is left continuous. Also notethatFcX(x) +GoX(x) = 1 ⇒ GoX(x) = 1−FcX(x),FoX(x) +GcX(x) = 1 ⇒ FoX(x) = 1−GcX(x).The above equations imply the following:a) Go and Fc are right continuous.b) Fo and Gc are left continuous.c) Go and Gc are non–decreasing.d) limx→∞F(x) = 1 and limx→−∞F(x) = 0 for F = Fo,Fc.e) limx→∞G(x) = 0 and limx→−∞G(x) = 1 for G = Go,Gc.It is easy to see that the above given properties for Fo,Go,Gc character-ize all such functions. The proof can be given directly using the properties ofthe probability measure (such as continuity) or by using arguments similarto the above.Another lemma about the relation of Fc,Fo,Go,Gc is given below.Lemma 5.12.1 Suppose Fo,Fc,Go,Gc are defined as above. Thena) if any of Fc,Fo,Go,Gc are continuous, all of other are continuous too.b) Fc being strictly increasing is equivalent to Fo being strictly increasing.c) if Fc is strictly increasing, Go is strictly decreasing.d) Gc being strictly decreasing is equivalent to Go being strictly increasing.1585.12. Quantile symmetriesProof a) Note that limy→x− Fc(x) = limy→x− Fo(x) and limy→x+ Fc(x) =limy→x+ Fo(x). If these two limits are equal for either Fc or Fo they areequal for the others as well.b) If either Fc or Fo are not strictly increasing then they are constant on[x1,x2],x1 < x2. Take x1 < y1 < y2 < x2. ThenFo(x1) = Fo(x2) ⇒ P(y1 ≤ X ≤ y2) = 0 ⇒ Fc(y1) = Fc(y2).Also we haveFc(x1) = Fc(x2) ⇒ P(y1 ≤ X ≤ y2) = 0 ⇒ Fo(y1) = Fo(y2).c) This is trivial since Go = 1−Fc.d) If Gc is strictly decreasing then Fo is strictly increasing since Gc = 1−Fo.By part b), Fc strictly is increasing. Hence Go = 1−Fc is strictly decreas-ing.The relationship between these distribution functions and the quantilefunctions are interesting and have interesting implications. It turns out thatwe can replace Fc by Fo in some definitions.Lemma 5.12.2 Suppose X is a random variable with open and closed leftdistributions Fo,Fc as well as open and closed right distributions Go,Gc.Thena) lqX(p) = inf{x|FoX(x) ≥ p}. In other words, we can replace Fc by Fo inthe left quantile definition.b) rqX(p) = inf{x|FoX(x) > p}. In other words, we can replace Fc by Fo inthe right quantile definition.Proof a) Let A = {x|FoX(x) ≥ p} and B = {x|FcX(x) ≥ p}. We want toshow that inf A = inf B. 
NowA ⊂ B ⇒ inf A ≥ inf B.Butinf B < inf A ⇒ ∃x0,y0, inf B < x0 < y0 < inf A.Then1595.12. Quantile symmetriesinf B < x0 ⇒ ∃b ∈ B, b < x0 ⇒ ∃b ∈ R, p ≤ P(X ≤ b) ≤ P(X ≤ x0)⇒ P(X ≤ x0) ≥ p ⇒ P(X < y0) ≥ p.On the other handy0 < inf A ⇒ y0 /∈ A ⇒ P(X < y0) < p,which is a contradiction, thus proving a).b) Let A = {x|FoX(x) > p} and B = {x|FcX(x) > p}. We want to showinf A = inf B. Again,A ⊂ B ⇒ inf A ≥ inf B.Butinf B < inf A ⇒ ∃x0,y0, inf B < x0 < y0 < inf A.Theninf B < x0 ⇒ ∃b ∈ B, b < x0 ⇒ ∃b ∈ R, p < P(X ≤ b) ≤ P(X ≤ x0)⇒ P(X ≤ x0) > p ⇒ P(X < y0) > p.On the other hand,y0 < inf A ⇒ y0 /∈ A ⇒ P(X < y0) ≤ p,which is a contradiction.Using the above results, we establish the main theorem of this sectionwhich states the symmetry property of the left and right quantiles.Theorem 5.12.3 (Quantile Symmetry Theorem) Suppose X is a randomvariable and p ∈ [0,1]. ThenlqX(p) = −rq−X(1−p).Remark. We immediately concluderqX(p) = −lq−X(1−p),by replacing X by −X and p by 1−p.1605.12. Quantile symmetriesProofR.H.S = −sup{x|P(−X ≤ x) ≤ 1−p} =inf{−x|P(X ≥ −x) ≤ 1−p} =inf{x|P(X ≥ x) ≤ 1−p} =inf{x|1−P(X ≥ x) ≥ p} =inf{x|1−Gc(x) ≥ p} =inf{x|Fo(x) ≥ p} = lqX(p).Now we show how these symmetries can become useful to derive otherrelationships/definitions for quantiles.Lemma 5.12.4 Suppose X is a random variable with distribution functionF. ThenlqX(p) = sup{x|Fc(x) < p}.ProoflqX(p) = −rq−X(1−p) = −inf{x|Fo−X(x) > 1−p} =−inf{x|1−Gc−X(x) > 1−p} = sup{−x|Gc−X(x) < p} =sup{−x|P(−X ≥ x) < p} = sup{x|P(X ≤ x) < p} =sup{x|Fc(x) < p}.In the previous sections, we showed that both lqX and rqX are equivari-ant under non-decreasing continuous transformations:lqφ(X)(p) = φ(lqX(p)),where φ is non-decreasing left continuous. Alsorqφ(X)(p) = φ(rqX(p)),for φ : R→ R non-decreasing right continuous. However, we did not provideany results for decreasing transformations. Now we are ready to offer a resultfor this case.1615.12. Quantile symmetriesTheorem 5.12.5 (Decreasing transformation equivariance)a) Suppose φ is non-increasing and right continuous on R. Thenlqφ(X)(p) = φ(rqX(1−p)).b) Suppose φ is non-increasing and left continuous on R. Thenrqφ(X)(p) = φ(lqX(1−p)).Proof a) By the Quantile Symmetry Theorem, we havelqφ(X)(p) = −rq−φ(X)(1−p).But −φ is non-decreasing right continuous, hence the above is equivalent to−(−φ(rqX(1−p))) = φ(rqX(1−p)).b) By the Quantile symmetry Theoremrqφ(X)(p) = −lq−φ(X)(1−p) = −−φ(lqX(1−p)) = φ(lqX(p)),since −φ is non-decreasing and left continuous.Lemma 5.12.6 Suppose X is a random variable and Fc,Fo,Gc,Go are thecorresponding distribution functions. Then we have the following inequali-ties:a) Fc(lq(p)) ≥ p. (Hence Fc(rq(p)) ≥ p.)b) Fo(rq(p)) ≤ p. (Hence Fo(lq(p)) ≤ p.)c) Go(lq(p)) ≤ 1−p. (Hence Go(rq(p)) ≤ 1−p.)d) Gc(rq(p)) ≥ 1−p. (Hence Gc(lq(p)) ≥ 1−p.)Proof We already showed a).b) Suppose there Fo(rq(p)) = p+ǫ for some positive ǫ. Then since Fo is leftcontinuouslimx→rq(p)+Fo(x) = p+ǫ.Hence there exist x0 < rq(p) such that F(x0) ≥ Fo(x0) > p+ǫ/2. This is acontradiction to rq(p) being the inf of the set {x|F(x) > p}.c) and d) are straightforward consequence of a) and b) since Fc + Go = 1and Fo +Gc = 1.1625.13. 
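These symmetry and transformation rules are easy to verify numerically on the empirical distribution of a sample. A small R check follows (lq and rq are ad hoc implementations of the left and right sample quantile functions; the sample and the transformation are arbitrary choices for illustration).

```r
lq <- function(x, p) sort(x)[ceiling(length(x) * p)]    # left sample quantile, p in (0,1)
rq <- function(x, p) sort(x)[floor(length(x) * p) + 1]  # right sample quantile, p in (0,1)

set.seed(1)
x <- rpois(200, lambda = 3)            # a discrete sample, so lq and rq may differ
p <- c(0.123, 0.377, 0.512, 0.841)

# Quantile Symmetry Theorem: lqX(p) = -rq(-X)(1 - p)
all(sapply(p, function(q) lq(x, q) == -rq(-x, 1 - q)))          # TRUE

# Decreasing transformation equivariance: for phi non-increasing and
# right continuous, lq(phi(X))(p) = phi(rqX(1 - p)); here phi(t) = -t^3.
phi <- function(t) -t^3
all(sapply(p, function(q) lq(phi(x), q) == phi(rq(x, 1 - q))))  # TRUE
```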
Quantiles from the rightThe quantile functions as the inverse of an open distributionfunctionLemma 5.12.7 Suppose X is a random variable with distribution functionF and open distribution function Fo.a) {x|Fo(x) < p} = (−∞,lqF(p)) or (−∞,lqF(p)].b) {x|Fo(x) ≤ p} = (−∞,rqF(p)].c) If Fo is continuous then {x|Fo(x) < p} = (−∞,lqF(p)].d) {x|Fo(x) > p} = (rqF(p),∞).e) {x|Fo(x) ≥ p} = (lqF(p),∞) or [lqF(p),∞)Proof The proof is very similar to Lemma 5.5.1 and we skip the details.5.13 Quantiles from the rightSo far, we have defined left/right quantiles using the classic distributionfunction Fc. We also showed that in quantile definitions Fc can be replacedby Fo. FcX(x) = P(−∞ < X ≤ x) measures the probability from minusinfinity. When we define left/right quantiles, we seek to find points wherethis probability from minus infinity reaches (passes) a certain value. Onecould also consider GcX(x) = P(x ≤ X < ∞) and define another version ofquantile functions which seek points where the probability from plus infinityreaches or passes a point. This is a motivation to define the “left/rightquantile functions from the right”. By indicating from the right we clarifythat the probability is compute from the right hand side i.e. plus infinity.The previously defined left and right quantile functions should be called“left/right quantile functions from the left”.Definition Suppose X is a random variable with closed right distributionfunction GcX(x) = P(X ≥ x). Then we define the “left quantile functionfrom the right” as followslqfrX(p) = sup{x|GcX(x) > p}.Definition Suppose X is a random variable with closed right distributionfunction GcX(x) = P(X ≥ x). Then we define the right quantile functionfrom the right as follows1635.13. Quantiles from the rightrqfrX(p) = sup{x|GcX(x) ≥ p}.Using the symmetries in the definition of these quantities, we will showthat we have already characterized left/right from the right quantile func-tions. We need the following lemma.Lemma 5.13.1 Suppose X is a random variable with quantile functionslqX,rqX. Thena) rqX(p) = sup{x|Fo(x) ≤ p}.b) lqX(p) = sup{x|Fo(x) < p}Proof a) Let A = {x|Fc(x) ≤ p} and B = {x|Fo(x) ≤ p}. First note thatA ⊂ B ⇒ supA ≤ supB.To show that the sups are indeed equal, notesupA < supB ⇒ ∃x0,y0, supA < x0 < y0 < supB.ThensupA < x0 ⇒ Fc(x0) > p,andy0 < supB ⇒ ∃b ∈ B, y0 < b ⇒ ∃b, Fo(b) ≤ p,y0 < b ⇒ Fo(y0) ≤ p.ButFc(x0) > p,Fo(y0) ≤ p,which is a contradiction.b) Let A = {x|Fc(x) < p} and B = {x|Fo(x) < p}. First note thatA ⊂ B ⇒ supA ≤ supB.To show that the sups are indeed equal, notesupA < supB ⇒ ∃x0,y0, supA < x0 < y0 < supB.ThensupA < x0 ⇒ Fc(x0) ≥ p,andy0 < supB ⇒ ∃b ∈ B, y0 < b ⇒ ∃b, Fo(b) < p,y0 < b ⇒ Fo(y0) < p.1645.14. Limit theoryButFc(x0) ≥ p,Fo(y0) < p,which is a contradiction.Lemma 5.13.2 (Quantile functions from the right)a) lqfrX(p) = rqX(1−p).b) rqrfX(p) = lqX(1−p).Proofa)lqrfX(p) = sup{x|GcX(x) > p} = sup{x|FoX(x) ≤ p} = rqX(1−p).b)rqrfX(p) = sup{x|GcX(x) ≥ p} = sup{x|FoX(x) < 1−p} = lqX(1−p).5.14 Limit theoryTo prove limit results, we need some limit theorems from probability theorythat we include here for completeness and withoutproof. Their proofscan befound in standard probability textbooks and appropriate references are givenbelow. If we are dealing with two samples, X1,··· ,Xn and Y1,··· ,Yn, toavoid confusion we use the notation Fn,X and Fn,Y to denote their empiricaldistribution functions respectively.Definition Suppose X1,X2,··· , is a discrete–time stochastic process. 
LetF(X) be the σ-algebra generated by the process and F(Xn,Xn+1,···) theσ-algebra generated by Xn,Xn+1,···. Any E ∈ F(X) is called a tail eventif E ∈ F(Xn,Xn+1,···) for any n ∈ N.Definition Let {An}n∈N be any collection of sets. Then {An i.o.}, read asAn happens infinitely often is defined by:{An i.o.} = ∩i∈N ∪∞j=i Aj.1655.14. Limit theoryTheorem 5.14.1 (Kolmogorov 0–1 law):E being a tail event implies that P(E) is either 0 or 1.Proof See [9].Theorem 5.14.2 (Glivenko–Cantelli Theorem):Suppose, X1,X2,··· , i.i.d, has the sample distribution function Fn. Thenlimn→∞supx∈R|Fn(x)−F(x)| → 0, a.s..Proof See [7].Here, we extend the Glivenko–Cantelli Theorem to Fo,Go and Gc.Lemma 5.14.3 Suppose X is a random variable and consider the associateddistribution functions FoX,GoX and GcX with corresponding sample distribu-tion functions FoX,n,GoX,n and GcX,n. Thensupx∈R|GoX,n −GoX|→ 0, a.s.,supx∈R|FoX,n −FoX|→ 0, a.s.,andsupx∈R|GcX,n −GcX|→ 0, a.s..Proof Note thatFcX +GoX = 1 ⇒ GoX = 1−FcX,andFcX,n +GoX,n = 1 ⇒ GoX,n = 1−FcX,n.Since Glivenko–Cantelli Theorem holds for FcX it also holds for GoX.To show the result for FoX, note that FoX(x) = Go−X(−x) and FoX,n(x) =Go−X,n(−x). Also to show the result for GcX note that GcX = 1 − FoX andGcX,n = 1−FoX,n.1665.14. Limit theoryTheorem 5.14.4 (Borel–Cantelli lemma):Suppose (Ω,F,P) is a probability space. Then1. An ∈ F and summationtext∞1 P(An) < ∞ ⇒ P(An i.o) = 0.2. An ∈ F independent events with summationtext∞1 P(An) = ∞ ⇒ P(An i.o) = 1,where i.o. stands for infinitely often.Proof See [9].Theorem 5.14.5 (Berry–Esseen bound): Let X1,X2,··· , be i.i.d with E(Xi) =0 < ∞, E(X2i ) = σ and E(|Xi|3) = ρ. If Gn is the distribution ofX1 +···+Xn/σ√nand Φ(x) is the distribution function of a standard normal random variablesthen|Gn(x)−Φ(x)| ≤ 3ρ/σ3√n.Corollary 5.14.6 Let X1,X2,··· , be i.i.d with E(Xi) = µ < ∞, E(|Xi −µ|2) = σ and E(|Xi−µ|3) = ρ. If Gn is the distribution of (X1 +···+Xn−nµ)/σ√n = √n( ¯Xn−µσ ) and Φ(x) is the distribution function of a standardnormal random variable then|Gn(x)−Φ(x)| ≤ 3ρ/σ3√n.Proof This corollary is obtained by applying the theorem to Yi = Xi −µ.Now let An = (X1 +···+Xn −nµ)/σ√n. Then|P(An > x)−(1−Φ(x))| = |P(An ≤ x)−Φ(x)| = |Gn(x)−Φ(x)| < 3ρ/σ3√n.Also|P(x < An ≤ y)−(Φ(y)−Φ(x)))| ≤ |Gn(y)−Φ(y)|+|Gn(x)−Φ(x)| ≤ 6ρ/σ3√n.These inequalities show that for any ǫ > 0 there exist N such that n > N,Φ(z2)−Φ(z1)−ǫ < P(z1 < √n(¯Xn −µσ ) ≤ z2) < Φ(z2)−Φ(z1)+ǫ,for z1 < z2 ∈ R∪{−∞,∞}.It is interesting to ask under what conditions lqFn and rqFn tend to lqFand rqF as n → ∞. Theorem 5.14.7 gives a complete answer to this question.1675.14. Limit theoryTheorem 5.14.7 (Quantile Convergence/Divergence Theorem)a) Suppose rqF(p) = lqF(p) thenrqFn(p) → rqF(p), a.s.,andlqFn(p) → lqF(p), a.s..b) When lqF(p) < rqF(p) then both rqFn(p),lqFn(p) diverge almost surely.c) Suppose lqF(p) < rqF(p). Then for every ǫ > 0 there exists N such thatn > N,lqFn(p),rqFn(p) ∈ (lqF(p)−ǫ,lqF(p)]∪[rqF(p),rqF(p) +ǫ).d)limsupn→∞lqFn(p) = limsupn→∞rqFn(p) = rqF(p), a.s.,andliminfn→∞ lqFn(p) = liminfn→∞ rqFn(p) = rqF(p), a.s..Proofa) Since, lqF(p) = rqF(p), we use qF(p) to denote both. Suppose ǫ > 0 isgiven. ThenF(qF(p)−ǫ) < p ⇒ F(qF(p)−ǫ) = p−δ1, δ1 > 0,andF(qF(p) +ǫ) > p ⇒ F(qF(p) +ǫ) = p+δ2, δ2 > 0.By the Glivenko–Cantelli Theorem,Fn(u) → F(u) a.s.,uniformly over R. We conclude thatFn(qF(p)−ǫ) → F(qF(p)−ǫ) = p−δ1, a.s.,1685.14. Limit theoryandFn(qF(p) +ǫ) → F(qF(p)+ǫ) = p+δ2, a.s..Let ǫ′ = min(δ1,δ2)2 . 
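A small simulation illustrates the dichotomy established in Theorem 5.14.7: when lqF(p) = rqF(p) the sample quantile settles down to the common value, while in the fair-coin case lqF(1/2) = -1 < 1 = rqF(1/2) the sample median keeps switching between the two candidate values. This is an illustrative sketch only; the seed and sample sizes are arbitrary.

```r
lq <- function(x, p) sort(x)[ceiling(length(x) * p)]   # left sample quantile

set.seed(2)
x <- sample(c(-1, 1), 1e5, replace = TRUE)   # fair coin: lqF(1/2) = -1, rqF(1/2) = 1
n <- c(10, 100, 1000, 1e4, 1e5)
sapply(n, function(k) lq(x[1:k], 0.5))
# the values seen at these checkpoints may be -1 or 1; as n grows the sample
# median switches between the two values infinitely often (the divergence case).

y <- rexp(1e5)                               # continuous F: lqF(p) = rqF(p) for every p
sapply(n, function(k) lq(y[1:k], 0.75))      # settles near the limit below (convergence case)
qexp(0.75)                                   # log(4), approximately 1.386
```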
Pick N such that for n > N :p−δ1 −ǫ′ < Fn(qF(p)−ǫ) < p−δ1 +ǫ′,p+δ2 −ǫ′ < Fn(qF(p) +ǫ) < p+δ2 +ǫ′.ThenFn(qF(p)−ǫ) < p−δ1 +ǫ′ < p ⇒lqFn(p) ≥ qF(p)−ǫ and rqFn(p) ≥ qF(p)−ǫ.Alsop < p+δ2 −ǫ′ < Fn(qF(p) +ǫ) ⇒lqFn(p) ≤ qF(p) +ǫ and rqFn(p) ≤ qF(p) +ǫ.Re-arranging these inequalities we get:qF(p)−ǫ ≤ lqFn(p) ≤ qF(p) +ǫ,andqF(p)−ǫ ≤ rqFn(p) ≤ qF(p)+ǫ.b) This needs more development in the sequel and the proof follows.c) This also needs more development in the sequel and the proof follows.d) If lqF(p) = rqF(p) the result follows immediately from (a). Other-wise suppose lqF(p) < rqF(p). Then by (b) lqFn(p) diverges almostsurely. Hence limsuplqFn(p) negationslash= liminf lqFn(p), a.s. . But by (c), ∀ǫ >0, ∃N, n > NlqFn(p) ∈ (lqF(p)−ǫ,lqF(p)]∪[rqF(p),rqF(p)+ǫ).This means that every convergent subsequence of lqFn(p) has either limitlqF(p) or rqF(p), a.s.. Since limsuplqFn(p) negationslash= liminf lqFn(p), a.s., weconclude limsuplqFn(p) = rqF(p) and liminf lqFn(p) = lqF(p), a.s..A similar argument works for rqFn(p).1695.14. Limit theoryTo investigate the case lqF(p) negationslash= rqF(p) more, we start with the simplestexample namely a fair coin. Suppose X1,X2,··· an i.i.d sequence withP(Xi = −1) = P(Xi = 1) = 12 and let Zn =summationtextni=1 Xi. Note thatZn ≤ 0 ⇔ lqFn(1/2) = −1, Zn > 0 ⇔ lqFn(1/2) = 1,andZn < 0 ⇔ rqFn(1/2) = −1, Zn ≥ 0 ⇔ rqFn(1/2) = 1.Hence in order to show that lqFn(1/2) and lqFn(1/2) diverge almost surely,we only need to show that P((Zn < 0 i.o.) ∩ (Zn > 0 i.o.)) = 1. We startwith a theorem from [9].Theorem 5.14.8 Suppose Xi is as above. Then P(Zn = 0 i.o.) = 1.Proof The proof of this theorem in [9] uses the Borel–Cantelli Lemma part2.Theorem 5.14.9 Suppose, X1,X2,··· i.i.d. and P(Xi = −1) = P(Xi =1) = 1/2. Then lqFn(1/2) and rqFn(1/2) diverge almost surely.Proof Suppose, A = {Zn = −1 i.o.} and B = {Zn = 1 i.o.}. It suffices toshow thatP(A∩B) = 1.But ω ∈ A ∩ B ⇒ lqFn(p)(ω) = −1, i.o. and lqFn(p)(ω) = 1, i.o. HencelqFn(p)(ω) diverges.Note that P(A) = P(B) by the symmetry of the distribution. Also it isobvious that both A and B are tail events and so have probability either zeroor one. To prove P(A∩B) = 1, it only suffices to show that P(A∪B) > 0.Because then at least one of A and B has a positive probability, say A.P(A) > 0 ⇒ P(A) = 1 ⇒ P(B) = P(A) = 1 ⇒ P(A∩B) = 1.Now let C = {Zn = 0, i.o.}. Then P(C) = 1 by Theorem 5.14.8. If Zn(ω) =0 then either Zn+1(ω) = 1 or Zn+1(ω) = −1. Hence if Zn(ω) = 0, i.o. thenat least for one of a = 1 or a = −1, Zn(ω) = a, i.o.. We conclude that1705.14. Limit theoryω ∈ A∪B. This shows C ⊂ A∪B ⇒ P(A∪B) = 1.To generalize this theorem, suppose X1,X2,··· , arbitrary i.i.d processand lqF(p) < rqF(p). Define the processYi =braceleftBigg1 Xi ≥ rqF(p)0 Xi ≤ lqF(p).(Note that P(lqX(p) < X < rqX(p)) = 0.) Then the sequence Y1,Y2,··· isi.i.d., P(Yi = 0) = p and P(Yi = 1) = 1−p. Also note thatlqFn,Y (p) diverges a.s. ⇒ lqFn,X(p) diverges a.s.Hence to prove the theorem in general it suffices to prove the theorem forthe Yi process. However, we first prove a lemma that we need in the proof.Lemma 5.14.10 Let Y1,Y2,··· i.i.d with P(Yi = 0) = p = 1− q > 0 andP(Yi = 1) = 1−p = q > 0. Let Sn = summationtextni=1 Yi, 0 < α, k ∈ N. Then thereexists a transformation φ(k) (to N) such thatP(Sφ(k) −φ(k)q < −k) > 1/2−α,P(Sφ(k) −φ(k)q > k) > 1/2−α.Remark. 
For α = 1/4, we getP(Sφ(k) −φ(k)q < −k) > 1/4,P(Sφ(k) −φ(k)q > k) > 1/4.Proof Since the first three moments of Yi are finite (E(Yi) = q,E(|Yi −q|2) = q(1−q) = σ,E(|Yi −q|3) = q3(1−q) + (1−q)3q = ρ), we can applythe Berry-Esseen theorem to √n ¯Yn−µσ . By a corollary of that theorem, forα2 > 0 there exists an N1 such that1−Φ(z)− α2 < P(√n¯Yn −µσ > z) < 1−Φ(z) +α2,andΦ(z)− α2 < P(√n¯Yn −µσ < −z) < Φ(z) +α2,for all z ∈ R and n > N1. Now for the given integer k pick N2 such that1715.14. Limit theory12 −α2 < Φ(kσ√N2) <12 +α2.This is possible because Φ is continuous and Φ(0) = 1/2. Now letφ(k) = max{N1,N2}, z = kσradicalbigφ(k).Then since φ(k) ≥ N1P(radicalbigφ(k)¯Yφ(k) −µσ > z) > 1−Φ(z)−α2 > 1/2−α,andP(radicalbigφ(k)¯Yφ(k) −µσ < −z) > Φ(z)−α2 > 1/2−α.These two inequalities are equivalent toP((Sφ(k) −φ(k)q) < −k) > 1/2−α,andP((Sφ(k) −φ(k)q) > k) > 1/2−α.If we put α = 1/4, we getP((Sφ(k) −φ(k)q) < −k) > 1/4,andP((Sφ(k) −φ(k))q > k) > 1/4.We are now ready to prove Part b) of Theorem 5.14.7.Proof [Theorem 5.14.7, Part b)]For the process {Yi} as defined above, let n1 = 1,mk = nk +φ(nk) andnk+1 = mk +φ(mk). Then defineDk = (Ynk+1 +···+Ymk −(mk −nk)q < −nk),Ek = (Ymk+1 +···+Ynk+1 −(nk+1 −mk)q > mk),CK = Dk ∩Ek.Since {Ck} involve non–overlapping subsequences of Ys, they are indepen-dent events. Also Dk and Ek are independent. Now note that1725.14. Limit theoryYnk+1 +···+Ymk −(mk −nk)q < −nk ⇒Y1 +···+Ymk < −nk + (mk −nk)q +nk ⇒¯Ymk <mk −nkmk q < q ⇒lqFn,Y (p) = rqFn,Y = 0 ⇒{Ck, i.o.} ⊂ {lqFn,Y (p) = rqFn,Y = 0, i.o.}.Similarly,Ymk+1 +···+Ynk+1 −(nk+1 −mk)q > mk⇒ Y1 +···+Ynk+1 > (nk+1 −mk)q +mk⇒ ¯Ynk+1 > mk + (nk+1 −mk)qnk+1> q = 1−p⇒ lqFn,Y (p) = rqFn,Y (p) = 1⇒ {Ck, i.o.} ⊂ {lqFn,Y (p) = rqFn,Y (p) = 1, i.o.}.Let us compute the probability of Ck:P(Ck) =P(Ynk+1 +···+Ymk −(mk −nk)q < −nk)×P(Ymk+1 +···+Ynk+1 −(nk+1 −mk)q > mk) =P(Y1 +···+Yφ(nk) −φ(nk)q < −nk)×P(Y1 +···+Yφ(mk) −φ(mk)q > mk) > 1/4.1/4 = 1/16.We conclude that ∞summationdisplayk=1P(Ck) = ∞.By the Borel–Cantelli Lemma, P(Ck, i.o.) = 1. We conclude thatP(lqFn,Y (p) = rqFn,Y (p) = 0, i.o.) = 1,andP(lqFn,Y (p) = rqFn,Y (p) = 1, i.o.) = 1.1735.14. Limit theoryHence,P({lqFn,Y (p) = rqFn,Y (p) = 0, i.o.}∩{lqFn,Y (p) = rqFn,Y (p) = 1, i.o.}) = 1.Proof (Theorem 5.14.7, part (c))Suppose that rqF(p) = x1 negationslash= lqF(p) = x2 and a is an arbitrary real number.Let h = x2 −x1. We define a new chain Y as follows:Yi =braceleftBiggXi Xi ≤ lqFX(p)Xi −h Xi ≥ rqFX(p).(See Figure 5.7.) Then Y1,Y2,··· is an i.i.d sample. We drop the index ifrom Yi and Xi in the following for simplicity and since the Yi (as well asthe Xi) are identically distributed. We claimlqFY Y(p) = rqFY (p) = lqFX(p).To prove lqFY (p) = lqFX(p), note thatFY (lqFX(p)) = P(Y ≤ lqFX(p)) ≥ P(X ≤ lqFX(p)) ≥ p ⇒ lqFY (p) ≤ lqFX(p).(The first inequality is because Y ≤ X.) Moreover for any y < lqFX(p),FY (y) = FX(y) < p. (Since X,Y < lqFX(p) ⇒ X = Y.) Hence lqFY (p) ≥lqFX(p) and we are done. To show rqFY (p) = lqFX(p), note that rqFY (p) ≥lqFY (p) = lqFX(p). It only remains to show that rqFY (p) ≤ lqFX(p). Supposey > lqFX(p) and let δ = y −lqFX(p) > 0. First note thatP({Y ≤ lqFX(p)+δ}) =P({Y ≤ lqFX(p) +δ and X ≥ rqFX(p)} ∪{Y ≤ lqFX(p) +δ and X ≤ lqFX(p)}) =P({X −h ≤ lqFX(p) +δ and X ≥ rqFX(p)} ∪{X ≤ lqFX(p) +δ and X ≤ lqFX(p)}) =P({rqFX(p) ≤ X ≤ rqFX(p)+δ}∪{X ≤ lqFX(p)}) =P({X ≤ rqFX(p) +δ}).Hence,FY (y) = P(Y ≤ lqFX(p) +δ) = P(X ≤ rqFX(p) +δ) > p ⇒1745.14. 
Limit theoryrqFY (p) ≤ y, ∀y > lqFX(p).We conclude that rqFY (p) ≤ lqFY (p).To complete the proof of part (c) observe that for every ǫ > 0, we maysuppose that lqFn,Y (p) ∈ (qFY (p)−ǫ,qFY (p) +ǫ). ThenlqFn,X(p),rqFn,X(p) ∈ (lqFX(p)−ǫ,rqFX(p)+ǫ). (5.6)This is because from lqFn,Y (p) ∈ (qFY (p)−ǫ,qFY (p) + ǫ), we may concludethatFn,Y (qFY (p)+ǫ) > p ⇒ Fn,X(rqFX(p) +ǫ) > p ⇒lqFn,X(p),rqFn,X(p) < rqFX(p) +ǫ,andFn,Y (qFY (p)−ǫ) < p ⇒ FnX(lqFX(p)−ǫ) < p ⇒lqFn,X(p),rqFn,X(p) > lqFX(p)−ǫ.But by part (a) of Theorem 5.14.7, lqFn,Y (p) → qFY (p) and rqFn,Y (p) →qFY (p). Hence for given ǫ > 0 there exists an integer N such that for anyn > N, lqFn,Y (p) ∈ (qFY (p)−ǫ,qF,Y (p) + ǫ). By (5.6), we have shown thatfor every ǫ > 0 there exists N such that for every n > NqFn,X(p),rqFn,X(p) ∈ (lqFX(p)−ǫ,rqFX(p) +ǫ),sinceP(Xi ∈ (lqFX(p),rqFX(p)) for some i ∈ N) = 0.We can conclude thatP(lqFn,X(p) ∈ (lqFX(p),rqFX(p)) for some i ∈ N) = 0andP(rqFn,X(p) ∈ (lqFX(p),rqFX(p)) for some i ∈ N) = 0.Hence with probability 1qFn,X(p),rqFn,X(p) ∈ (lqFX(p)−ǫ,lqFX(p)]∪[rqFX(p),rqFX(p) +ǫ).1755.14. Limit theory−2 −1 0 1 2 3 4 50.00.20.40.60.81.0xFFigure 5.7: The solid line is the distribution function of {Xi}. Note thatfor the distribution of the Xi and p = 0.5, lqFX(p) = 0,rqFX(p) = 3. Leth = rq(p)−lq(p) = 3. The dotted line is the distribution function of the {Yi}which coincides with that of {Xi} to the left of lqFX(p) and is a backwardshift of 3 units for values greater than rqFX(p). Note that for the {Yi},lqFY (p) = rqFY (p) = 1.1765.15. Summary and discussion5.15 Summary and discussionThis section highlights the results obtained for a two state-definition forquantiles and discuss why these results show such a consideration is useful.Justifications and consequences of using left and rightquantile functions1. The equivariance property (under non-decreasing continuous transfor-mations) of lqX and rqX makes them equivariant under the change ofscale. This is a nice theoretical property. Also from a practical viewit means that if we compute the quantile in one scale it can be easilycalculated in another scale.2. Considering lqX,rqX allowed us to find a symmetry relation on quan-tiles:lqX(p) = −rq−X(1−p).3. We found a nice formula for continuous non-increasing transforma-tions:lqφ(X)(p) = φ(rqX(1−p)).4. We showed that lqFn(p) the traditional sample quantile function andrqFn(p) tend to the distribution version if and only if lqF(p) = rqF(p).Hence finding a sufficient and necessary condition that is easy to for-mulate in terms of lqF and rqF.5. If we start with only the traditional quantile function lqF, then rqF(p)would arise in the limitlimsupn→∞lqFn(p) = rqF(p).6. It is widely claimed that the “median” minimizes the absolute errorE|X −a|. In next chapters, we show thatargminaE|X −a| = [lqX(1/2),rqX(1/2)].We observe both lqX(p) and rqX(p) would arise if we intend to use thisas a way defining quantiles. A generalization from 1/2 to arbitrary pis left for future research.1775.15. Summary and discussion7. We offered a physical motivation using a uniform bar to define quan-tiles for data vectors which resulted in a definition that coincide withlqX,rqX.8. If we only use the traditional quantile function, for p = 0, we getlqX(0) = ∞ in general. However rqX(0) < ∞ is a useful value inthe sense that it is the maximum a satisfying P(X ≥ a) = 1. AlsorqX(1) = −∞ in general. However lqX(1) > −∞ in general and is auseful value since it is the minimum a satisfying P(X ≤ a) = 1.9. 
Middle values of lqX(p),rqX(p) (for example a specific weighted combi-nation of the two) or the whole interval [lqX(p),rqX(p)] are not prefer-able as a definition. This is because we showed that the range of lqXand rqX is exactly the set of heavy points. Points where the probabil-ity of being in any positive radius of them is positive.10. From a practical point of view giving a value that has already occurredas quantile we can expect the same value or a close value happen againin the future. More formally, suppose a random sample X1,··· ,Xnis given and we want to compute the sample qunatile. Then lqFn(p)and rqFn(p) are one of Xis by definition. If we denote XF a futurevalue meaning that XF is identically distributed and independent fromX1,··· ,XnP(XF ∈ (Xi −ǫ,Xi +ǫ)) > 0.A middle value might not satisfy such a property.11. We found out a clean nice way to show in what sense exactly lqX andrqX are close. We showedP(lqX(p) < X < rqX(p)) = 0.For data vectors this means the two values are side by side in thesorted vector.12. We showed that lqX(p) and rqX(p) coincide except for at most a count-able subset of the reals.13. We showed that even though lqX(p) ≤ rqX(p) in general, they are nottoo far apart since for a very small positive value ǫlqX(p) ≤ rqX(p) ≤ lqX(p+ǫ).1785.15. Summary and discussion14. Given one of lqF or rqF, the other one can be obtained by taking thelimitslqF(p0) = limp↑p0rqF(p),andrqF(p0) = limp↓p0lqF(p).15. In order to invert F, lqF,rqF gives us nice expressions for sets such asx|F(x) > p which is equal to (rqX(p),∞) if F is continuous at rqX(p).16. For a continuous distribution function, we have a nice formula for theinverse based on lqF and rqFF−1(p) = [lqX(p),rqX(p)].17. The left (right) quantile function at given probability p can be simplyputas the minimal value that the distribution function reaches (passes)p.In some practices fixing one lq or rq might be sufficient. This is becauselq and rq are close in terms of the probability of the underlying randomvariable. For example in data vectors lq,rq will be at most one element offin terms of their position in the data vector.In most elementary statistics text books and statistical softwares quan-tiles are given as a one-state solution generally a weighted combination ofthe left and right quantiles. In order to teach the right and left quantilefunctions, we suggest using a simple example x = (1,2,3,4) to show thatthere are no values in the middle and the left (2) and right median (3) arenatural to consider. Then one can point out this can be generalized fromp = 1/2 to any p without getting into details. It can also be pointed outthat the left (right) quantile function at given probability p can be simplyput as the minimal value that the distribution function reaches (passes) p.In a more advanced courses perhaps for mathematics, statistics or sciencestudents the teacher might like to show how the quantiles can be definedusing the bar of length 1. Finally the mathematical formulas can be given tostudents with appropriate mathematical background (i.e. Familiar with thedefinition of sup and inf and their existence property for the real numbers).In case an interpolation procedure is to be used, we suggest the interpo-lation procedure to be between lqX(p) and rqX(p). Surprisingly this is notthe case. For example for x = (0,0,0,0,0,1,1,1,1,1) in the R package as1795.15. Summary and discussionthe quantile for p = 0.48, we get 0.32. 
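The computation can be reproduced directly; quantile() below is R's default (type 7, interpolating) estimator, while lq and rq are ad hoc implementations of the left and right sample quantile functions defined in Chapter 5.

```r
lq <- function(x, p) sort(x)[ceiling(length(x) * p)]
rq <- function(x, p) sort(x)[floor(length(x) * p) + 1]

x <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
quantile(x, probs = 0.48)   # 0.32, an interpolated value that occurs nowhere in the data
lq(x, 0.48)                 # 0
rq(x, 0.48)                 # 0
```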
But in the vector we notice that 0shave covered 50 percent of the data and since 0.48 is strictly less than 0.48,we expect 1 to be the quantile. 0.32 in our notation is both greater thanlqx(0.50) and rqx(0.50).180Chapter 6Probability loss function6.1 IntroductionThis chapter develops a “loss function” to assess the goodness of an ap-proximation or an estimator of quantiles of a distribution (or a data vec-tor). Suppose a quantile of a very large data vector, q is approximated byˆq. Several classic losses can be considered. For example: absolute errorL(q, ˆq) = |q − ˆq| or squared error L(q, ˆq) = (q − ˆq)2 which was proposedby Gauss. Quoting from [30]: “Gauss proposed the square of the error as ameasure of loss or inaccuracy. Should someone object to this specification asarbitrary, he writes, he is in complete agreement. He defends his choice byan appeal to mathematical simplicity and convenience.” An obvious prob-lem with this loss is its lack of invariance under re-scaling of of data. Wepropose a loss function that is invariant under strictly monotonic transfor-mations. We also show that the sample version of this loss function tendsuniformly to the distributional version. This loss function can be used alsoto find optimal ways to summarize a data vector and to define a measure ofdistance among random variables as shown in the next chapters.We define the loss of estimating/approximating q by ˆq to be the prob-ability that the random variable falls in between the two values. A limitedversion of this concept only for data vectors can be found in computer sci-ence literature, where ǫ-approximations are used to approximate quantilesof large datasets. (See for example [32].) However, this concept has notbeen introduced as a measure of loss and the definition is limited to datavectors rather than arbitrary distributions.6.2 Degree of separation between data vectorsOur purpose is to find good approximations to the median and other quan-tiles. It is not clear how such approximations should be assessed. We con-tend that such a method should not depend on the scale of the data. Inother words it should be invariant under monotonic transformations. We1816.2. Degree of separation between data vectorsdefine a function δ that measures a natural “degree of separation” betweendata points of a data vector x. For the sake of illustration, consider theexample sort(x) = (1,2,3,3,4,4,4,5,6,6,7). Now suppose, we want to de-fine the degree of separation of 3,4 and 7 in this example. Since 4 comesright after 3, we consider their degree of separation to be zero. There are3 elements between 4 and 7 so it is appealing to measure their degree ofseparation as 3 but since the degree of separation should be relative, we cabalso divide by n = 11, the length of the vector, and get: δ(4,7) = 3/11. Wecan generalize this idea to get a definition for all pairs in R. With the sameexample, suppose we want to compute the degree of separation between 2.5and 4.5 that are not members of the data vector. Then since there are 5elements of the data vector between these two values, we define their degreeof separation as 5/11. More formally, we give the following definition.Definition Suppose z < z′ let ∆x(z,z′) = {i|z < xi < z′}. Then we defineδx(z,z′) = |∆x(z,z′)|n ,and δx(z,z) = 0. 
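A direct implementation of this quantity for the worked example above (the function name dos is ad hoc):

```r
# Fraction of the entries of x that lie strictly between z1 and z2.
dos <- function(x, z1, z2) sum(x > min(z1, z2) & x < max(z1, z2)) / length(x)

x <- c(1, 2, 3, 3, 4, 4, 4, 5, 6, 6, 7)
dos(x, 3, 4)       # 0:    4 comes right after 3 in the sorted vector
dos(x, 4, 7)       # 3/11: the entries 5, 6, 6 lie between 4 and 7
dos(x, 2.5, 4.5)   # 5/11: the entries 3, 3, 4, 4, 4 lie between 2.5 and 4.5

# Invariance under a strictly monotonic transformation:
dos(exp(x), exp(4), exp(7)) == dos(x, 4, 7)   # TRUE
```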
We call δx the “degree of separation” (DOS) or the “prob-ability loss function” associated with x.We then have the following lemma about the properties of δ.Lemma 6.2.1 The degree of separation δx has the following properties:a) δx ≥ 0.b) y < y′ < y′′ ⇒ δx(y,y′′) ≥ δx(y,y′).c) If z < z′ and z,z′ are elements of x, δx(z,z′) = mx(z)−Mx(z′)−1n . [For thedefinition of m(z) and M(z) see Chapter 5.]d) δφ(x)(φ(z),φ(z′)) = δx(z,z′) if φ is a strictly monotonic transformation.e) y = sort(x) and y′ = yi < y′′ = yj ⇒ δx(y′,y′′) ≤ (j −i−1)/n.ProofBoth a) and b) are straightforward. We obtain c) as a straightforwardconsequence of the definition of mx(y′) and Mx(y′). To show (d), supposez < z′ and φ is strictly decreasing. (The strictly increasing case is similar.)Then φ(z′) < φ(z) and hence∆φ(x)(φ(z),φ(z′)) = {i|φ(z′) < φ(xi) < φ(z)} = {i|z < xi < z′} = ∆x(z,z′).1826.3. “Degree of separation” for distributions: the “probability loss function”Finally e) is true because |∆x(y′,y′′)| = |{l|yi < xl < yj}| ≤ j −i−1.All the definitions and results above can be applied to random vectorsX = (X1,··· ,Xn) as well. In that case, lqX(p) and rqX(p) and δX(z,z′) arerandom. To develop our theory, we need to study the asymptotic behaviorof these statistics. We do so in later sections.6.3 “Degree of separation” for distributions: the“probability loss function”We define a degree of separation for distributions which corresponds to thenotion of “degree of separation” defined for data vectors to measure separa-tion between data points.Definition Suppose X has a distribution function F. LetδF(z′,z) = δF(z,z′) = limu→z−F(u)−F(z′) = P(z′ < X < z), z > z′,and δF(z,z) = 0, z ∈ R. We also denote this by δX whenever a randomvariable X with distribution F is specified. We call δX the “degree of sepa-ration” or the “probability loss function” associated with X.The following lemma is a straightforward consequence of the definition.Lemma 6.3.1 Suppose x = (x1,··· ,xn) is a data vector with the empiricaldistribution Fn. ThenδFn(z,z′) = δx(z,z′), z,z′ ∈ R.This lemma implies that to prove a result about the degree of separationof data vectors, it suffices to show the result for the degree of separation ofrandom variables.Theorem 6.3.2 Let X,Y be random variables and FX,FY , their corre-sponding distribution functions.a) Assume Y = φ(X), for a strictly increasing or decreasing function φ :R → R. Then δFX(z,z′) = δFY (φ(z),φ(z′)), z < z′ ∈ R.b) δF(z,z′) ≤ δF(z,z′′), z ≤ z′ ≤ z′′.c) δF(z1,z3) ≤ δF(z1,z2)+δF(z2,z3) +P(X = z2).1836.3. “Degree of separation” for distributions: the “probability loss function”d) Suppose, p ∈ [0,1]. Then δF(lqF(p),rqF(p)) = 0.e) Suppose, p1 < p2 ∈ [0,1]. Then δF(lqF(p1),rqF(p2)) ≤ p2 −p1. This im-mediately implies δF(lqF(p1),lqF(p2)) ≤ p2 −p1 and δF(rqF(p1),lqF(p2)) ≤p2 −p1 by b).Remark. We may restate Part (c), for data vectors: Suppose x has lengthn and z2 is of multiplicity m, (which can be zero). Then the inequality in(c) is equivalent to δx(z1,z3) ≤ δx(z1,z2)+δx(z2,z3) +m/n.Proofa) Note that for a strictly increasing function φ, we haveP(z < X < z′) = P(φ(z) < φ(X) < φ(z′)).Now suppose φ is strictly decreasing. Then z < z′ ⇒ φ(z′) < φ(z). LetY = φ(X). ThenδX(z,z′) = P(z < X < z′) = P(φ(z′) < φ(X) < φ(z)) = δY (φ(z),φ(z′)).b) This is trivial.c) Consider the case z1 < z2 < z3. 
(The other cases are easier to show.) Then

δF(z1,z3) = P(z1 < X < z3) = P(z1 < X < z2) + P(X = z2) + P(z2 < X < z3)
= δF(z1,z2) + δF(z2,z3) + P(X = z2).

d) This result is a straightforward consequence of Lemma 5.3.1 b) and c).
e) This result follows from

δF(lq(p1), rq(p2)) = P(lq(p1) < X < rq(p2)) = P(X < rq(p2)) − P(X ≤ lq(p1)) ≤ p2 − p1,

the last inequality being a consequence of Lemma 5.3.1 a) and d).

Remark. We call part c) of the above theorem the pseudo–triangle inequality.

Here we give two examples of using the probability loss function and its interpretation.

Example We showed above that the triangle inequality does not hold for the probability loss function, and that might lead to the criticism that this definition is not intuitively appealing. By an example, we now show why it makes sense that the triangle inequality should not hold in such a situation. Suppose a few mathematicians are standing in a line:

Euclid, Khwarizmi, Khayyam, Gauss, Von Neumann.

If we were to ask Khwarizmi about his distance from Euclid, he would answer: "0, since I am right beside him." If we asked Khwarizmi again about his distance to Khayyam, he would say "my distance is 0, since I am right beside him." However, if we were to ask Euclid about his distance to Khayyam, he would answer: "One unit (person), since Khwarizmi is in the middle." We observe that this distance does not satisfy the triangle inequality either. In this example the people standing in the middle are the relevant quantity; if we deal with a vector of sorted observations, the observations in the middle play the same role.

Example A student is told that he will receive a scholarship if he ranks first in an exam in his class in either of the subjects mathematics and physics. The teachers of the two courses differ, and each gives a practice exam in his subject. They return the students their marks out of 100. They also publish the lists of all the marks, with the names removed, to give the students a feeling of how they did in the class. Table 6.1 shows the marks in mathematics and physics.

Mathematics   Physics   Physics before scaling
80            90        81.0
65            89        79.2
63            86        74.0
61            85        72.2
54            83        68.9
54            82        67.2
53            79        62.4
50            79        62.4
49            76        57.8
48            75        56.2
47            72        51.8
47            72        51.8
46            69        47.6
44            68        46.2
30            55        30.2

Table 6.1: A class's marks in mathematics and physics. The third column contains the raw physics marks before the physics teacher scaled them.

Reza got 63 in math and 75 in physics. He decided to focus on just one subject, the one giving him the better chance of winning the scholarship. He compared his mark in math with the best student in math: 63 against 80. So he needed

|best mark − Reza's mark| = 80 − 63 = 17

more marks to be as good as the best student. Then he compared his physics mark to the best student in physics and found he needs 90 − 75 = 15 marks to be as good as him. So he thought it better to focus on physics. But then he realized that different teachers use different exams and scoring methods. He had heard that the physics teacher scales the marks upward by the formula

new mark = √(100 × old mark).

So the student calculated the untransformed values and put the result in the third column. Now he noticed that his untransformed mark is 56.2 while the best untransformed mark is 81.0. The difference this time is 24.8, which is a larger difference than before. According to his "decision-making tool", the absolute difference, he should focus on math, since the absolute difference for math was only 17.
But whatif the mathematics teacher had used another transformation to re-scale themarks without him knowing it? This made him see a disadvantage to usingthe absolute value difference. Instead he realized, he can use the number1866.4. Limit theory for the probability loss functionof the students between himself and the best student as a measure of thedifficulty of getting the best mark. He noticed his decision in this case willbe independent of how the teachers re-scaled the marks. In the math casethere is only one and for physics there are 8 students between him and thebest student. Hence he decided that he should focus on math.This example was under the assumption that other students do notchange their study habits or do not have access to the marks. If the otherstudents had access to their marks or were ready to change their study fo-cus, we need to take into account other possible actions of the other studentsand the problem will become game-theoretical in nature, a very interestingproblem on its own right. The solution for that problem we conjecture tobe the same.6.4 Limit theory for the probability loss functionTheorem 6.4.1 Suppose X1,X2,··· , is a sequence of i.i.d random vari-ables with distribution function F. Then as n → ∞,δFn(z,z′) → δF(z,z′), a.s.,uniformly in z,z′ ∈ R. In other wordssupz>z′∈R|δFn(z,z′)−δF(z,z′)| → 0, a.s..Proof If z = z′, the result is trivial. Suppose z > z′. We need to show thatlimu→z−Fn(u)−Fn(z′) →a.s. limu→z−F(u)−F(z′), (6.1)as n → ∞, uniformly in z > z′ ∈ R. Suppose ǫ > 0 is given. By Glivenko-Cantelli Theorem there exist N ∈ N such that for every n > N:|Fn(u)−F(u)| < ǫ2, a.s., ∀u ∈ R.Now for n > N,|( limu→z−Fn(u)−Fn(z′))−( limu→z−F(u)−F(z′))| ≤| limu→z−(Fn(u)−F(u))|+|Fn(z′)−F(z′)| = limu→z−|Fn(u)−F(u)|+|Fn(z′)−F(z′)|.1876.5. The probability loss function for the continuous caseBut since |Fn(u) −F(u)| < ǫ2, limu→z− |Fn(u) −F(u)| ≤ ǫ2. Also |Fn(z′)−F(z′)| < ǫ2. Hence|( limu→z−Fn(u)−Fn(z′))−( limu→z−F(u)−F(z′))| < ǫ.6.5 The probability loss function for thecontinuous caseThis section studies the probability loss function when the distribution func-tion is continuous. The results are given in the following lemmas, which showsome of its desirable properties in the continuous case.Lemma 6.5.1 (Probability loss for continuous distributions) Suppose X isa random variable with distribution function FX. Then δX(lqX(p1),rqX(p2)) =p2 −p1, p2 > p1, ∀p1,p2 ∈ [0,1] iff FX is continuous.ProofIf FX is continuous then for p1 < p2 and by Lemma 5.5.2,δ(lqX(p1),rqX(p2)) = P(lqX(p1) < X < rqX(p2)) =P(X < rqX(p2))−P(X ≤ lqX(p)) = F(rqX(p2))−F(lqX(p2)) = p2 −p1.If F is not continuous then there exists an x0 such that a = PX(X = x0) > 0.Let p1 = P(X < x0)+a/3 and p2 = P(X < x0)+a/2. Clearly lqX(p1) = x0and rqX(p2) = x0. Henceδ(lqX(p1),rqX(p2)) = 0 negationslash= p2 −p1.Lemma 6.5.2 Suppose δ(lqX(p1),rqX(p2)) = δ(rqX(p1),lqX(p2)) = a, p1 <p2.Then alsoa = δ(lqX(p1),lqX(p2))= δ(rqX(p1),lqX(p2))= δ(rqX(p1),rqX(p2)).Moreover, if X is continuous, all the above are equal to p2 −p1.1886.6. The supremum of δXProof The result follows immediately from the fact that all the three quan-tities are greater than or equal to δ(rqX(p1),lqX(p2)) = a and smaller thanor equal to δ(lqX(p1),rqX(p2)) = a. The second part is straightforward us-ing the previous lemma.6.6 The supremum of δXThis section investigates how large the probability loss can become undervarious scenarios. The results are given in the following lemmas.Lemma 6.6.1 Let Dist be the set of all distribution functions. 
ThensupF∈DistδF(lqF(p1),lqF(p2)) = p2 −p1, p2 > p1, p1,p2 ∈ (0,1).Proof This follows from the fact that δF(lqF(p1),lqF(p2)) ≤ p2 − p1 ingeneral, as shown in Lemma 6.3.2 and δF(lqF(p1),lqF(p2)) = p2 − p1 forcontinuous variables.The same is true for data vectors as shown in the following lemma.Lemma 6.6.2 Suppose the supremum in the following is taken over all datavectors, thensupxδx(lqx(p1),lqx(p2)) = p2 −p1, p2 > p1, p1,p2 ∈ (0,1).Proof We know that δx(lqx(p1),lqx(p2)) ≤ p2 − p1. To show that thesupremum attains the upper bound, let xn = (1,··· ,n). Then lqxn(p1) =[np1] or [np1] + 1. Also lqxn(p2) = [np2] or [np2] + 1. Then ∆, the numberof elements of x between lqxn(p1) and lqxn(p2) satisfies:[np2]−[np1]−1 ≤ ∆ ≤ [np2]−[np1] +1 ⇒np2 −1−np1 −1−1 ≤ ∆ ≤ np2 −np1 + 1 ⇒−3/n ≤ δxn(p1,p2)−(p2 −p1) ≤ 1/n.This shows that δxn(p1,p2) tends to p2−p1 uniformly for all p1 < p2 ∈ [0,1].1896.6. The supremum of δXLemma 6.6.3 Suppose p1,p2,··· ,pm ∈ [0,1] and m = 2k. Thensupxmax{δx(lqx(p1),lqx(p2)),δx(lqx(p3),lqx(p4)),··· ,δx(lqx(pm−1),lqx(pm))}= max{|p2 −p1|,··· ,|pm −pm−1|}.Proof The supremum is less than or equal to the left hand side by Lemma5.3.1. Let xn = (1,2,··· ,n). Without loss of generality suppose p1 <p2,p3 < p4,··· ,p2k−1 < p2k. By the properties of quantiles of data vectors:lqxn(pi) = x[npi] = [npi] or lqxn(pi) = x[npi]+1 = [npi]+ 1.Also, lqxn(pi+1) = x[npi+1] = [npi+1] or lqxn(pi+1) = x[npi+1]+1 = [npi+1]+1.Then, δxn(lqxn(pi),lqxn(pi+1)) ≥ 1n([npi+1]−[npi]−1) ≥ 1n(npi+1−npi−2) =(pi+1 −pi)− 2n. Henceδxn(lqxn(pi),lqxn(pi+1)) > |pi+1 −pi|− 2n, i = 1,··· ,m−1.The inequality shows the supremum is greater than= max{|p2 −p1|− 2n,··· ,|pm −pm−1|− 2n},for all n ∈N. Now let n → +∞ to get the conclusion.Lemma 6.6.4 Suppose p1,p2,··· ,pm ∈ [0,1] and a1,a1,··· ,a2m ∈ [0,1].Thensupx[integraldisplay a2a1δx(lqx(p1),lqx(p))dp +integraldisplay a4a3δx(lqx(p2),lqx(p))dp+···+integraldisplay a2ma2m−1δx(lqx(pm),lqx(p))dp]=integraldisplay a2a1|p−p1|dp +integraldisplay a4a3|p−p2|dp +···+integraldisplay a2ma2m−1|p−pm|dp.Proof The proof is similar to the previous lemmas and we skip the details.1906.6. The supremum of δX6.6.1 “c-probability loss” functionsThis section introduces a family of loss functions that are very similar tothe probability loss function but might be more useful in some contexts,particularly when the distribution function is not continuous. A defect ofthe probability loss function is: it can be equal to zero even if a negationslash= b,a,b ∈R. Also we noted that even though it resembles a metric it is not one.For example the triangle inequality does not hold. We introduce the “c-probability loss function” to solve these problems.Definition Suppose X is a random variable, δX its associated probabilityloss function and c ≥ 0. Then letδcX(a,b) = δX(a,b) +c(1−1{0}(a−b)),where 1{0} is the indicator function at zero.Note that the c-probability loss is the sum of two losses. The first,δX(a,b), is the probability of being between the two values (a and b), thesecond, c(1−1{0}(a−b)), is the penalty for a and b not being equal. Onequestion is what value of c should be chosen as the “penalty” of not beingequal to the true value. It turns out that the value of c is not very importantfor many purposes as shown in the following lemma.Lemma 6.6.5 (Properties of the c-probability loss functions)a) δcX(a,b) = c ⇔ a negationslash= b and δX(a,b) = 0.b) δcX(a,b) = 0 or δcX(a,b) ≥ c.c) δcX is invariant under strictly monotonic transformations.d) Let d = supx0∈RP(X = x0). 
Then if c ≥ d, δcX satisfies the triangle inequality.
e) δcX(lqX(p), rqX(p)) ≤ c. (It is either zero or c.)
f) Given δcX for a single c > 0, we can obtain δdX for any other d ≥ 0.

Proof a) and b) are trivial.
c) Both δX and c(1 − 1{0}(a − b)) are invariant under strictly monotonic transformations.
d) We use the pseudo–triangle inequality for the probability loss function. Take z1, z2, z3 ∈ R. We need to show δcX(z1,z3) ≤ δcX(z1,z2) + δcX(z2,z3). If z1 = z3, the result is trivial. Otherwise c(1 − 1{0}(z1 − z3)) = c and

δcX(z1,z3) = δX(z1,z3) + c ≤ δX(z1,z2) + δX(z2,z3) + P(X = z2) + c
≤ δX(z1,z2) + δX(z2,z3) + c(1 − 1{0}(z1 − z2)) + c(1 − 1{0}(z2 − z3)) = δcX(z1,z2) + δcX(z2,z3).

e) Trivial by the properties of lq, rq and δX shown in Lemma 5.3.1.
f) Suppose δcX is given. If δcX(a,b) = 0 then a = b and hence δdX(a,b) = 0. If a ≠ b then δcX(a,b) = δX(a,b) + c. From this we can obtain δX(a,b) = δcX(a,b) − c and hence δdX(a,b) = δcX(a,b) − c + d.

δX(X1,X2) (or δcX(X1,X2)), where X1, X2 are i.i.d. copies of X, can be considered a measure of disparity of the common distribution. The following lemma shows that the expectation of this quantity is the same for every continuous random variable.

Lemma 6.6.6 Suppose X is a continuous random variable. Then

E(δX(X1,X2)) = 1/3,

where X1, X2 are i.i.d. with the same distribution as X. Also

E(δcX(X1,X2)) = 1/3 + c.

Proof We know that FX(X1) and FX(X2) are independent and uniformly distributed on (0,1). Hence

E(δX(X1,X2)) = E(|F(X1) − F(X2)|) = ∫_0^1 ∫_0^1 |p1 − p2| dp1 dp2
= 2 ∫_0^1 ∫_{p2}^1 (p1 − p2) dp1 dp2 = 2 ∫_0^1 (1 − p2)^2/2 dp2 = 1/3.

E(δcX(X1,X2)) = 1/3 + c is obtained by noting that P(X1 = X2) = 0 for continuous random variables.

Chapter 7

Approximating quantiles in large datasets

7.1 Introduction

This chapter develops an algorithm for approximating the quantiles of petascale (a petabyte is one million gigabytes) datasets and uses the "probability loss function" to assess the quality of the approximation. The need for such an approximation does not arise for the sample average, another common data summary. That is because if we break the data into equal partitions and calculate the mean of every partition, the mean of the obtained means equals the overall mean. It is also easy to recover the overall mean from the means of unequal partitions if their lengths are known.

However, computer memories, several gigabytes (GBs) in size, cannot handle large datasets that can be petabytes (PBs) in size. For example, a laptop with 2 GBs of memory, using the well-known R package, could find the median of a data file of about 150 megabytes (MBs) in size, but it crashed for files larger than this. Since large datasets are commonly assembled in blocks, say by day or by district, that need not be a serious limitation, except insofar as the quantiles computed blockwise cannot be used to find the overall quantile. Nor would it help to sub-sample these blocks, unless these (possibly dependent) sub-samples could be combined into a grand sub-sample whose quantile could be computed; that will not usually be possible in practice. The algorithm proposed here is a "worst-case" algorithm in the sense that no matter how the data are arranged, we will reach the desired precision.
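As a small illustration of the contrast drawn above between the sample mean and quantiles, partition means recombine exactly into the overall mean, while partition medians in general do not recombine into the overall median. A quick R sketch with arbitrary simulated data:

```r
set.seed(3)
x <- rgamma(10000, shape = 2)              # a skewed "large" dataset
parts <- split(x, rep(1:100, each = 100))  # 100 equal partitions

mean(sapply(parts, mean)) - mean(x)        # 0 up to rounding error: means aggregate exactly
median(sapply(parts, median)) - median(x)  # typically nonzero: medians do not aggregate
```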
Such a deterministic guarantee is of course not available if we simply sample from the data, because there is then a (perhaps small) probability that the approximation is poor.

We also address the following question:

Question: If we partition the data file into a number of sub-files and compute the medians of these, is the median of the medians a good approximation to the median of the data file?

We first show that the median of the medians does not approximate the exact median well in general, even after imposing conditions on the number of partitions or their length. However, for our proposed algorithm, we show how the partitioning idea can be employed differently to get good approximations. "Coarsening" is introduced to summarize a data vector with the purpose of inferring the quantiles of the original vector from the summaries. Then the "d-coarsening" quantile algorithm is introduced; it works by partitioning the data (or using previously defined partitions) into possibly unequal partitions, summarizing each partition by coarsening, and inferring the quantiles of the original data vector from the summaries. We then establish the deterministic accuracy of the algorithm in Theorem 7.4.1, where accuracy is measured in terms of the probability loss function of the original data vector. This is an extension of the work of Alsabti et al. in [3] to the case of partitions of unequal size. Theorem 7.4.1 still requires the partition sizes to be divisible by the coarsening factor d. In order to extend the results to the case where the partitions are not divisible by d, we investigate how the quantiles of a data vector with missing or contaminated data relate to the quantiles of the original data in Lemma 7.4.3 and Lemma 7.4.4. In Lemma 7.5.1, we also show how much accuracy is lost if the quantiles of a coarsened vector are used in place of the quantiles of the original data vector. Finally, we investigate the performance of the algorithm using both simulations and real climate datasets.

7.2 Previous work

Finding quantiles and using them to summarize data is of great importance in many fields. One example is climate studies, where very large datasets arise; the datasets created by computer climate models can exceed petabytes in size. At NCAR (the National Center for Atmospheric Research in Boulder, Colorado), the climate data (outputs of computer models) are saved on many disks, and to access different parts of these data a robot needs to retrieve disks from a very large storage space. Another case where we confront large datasets is in dealing with data streams, which arise in many different applications such as finance and high-speed networking. For many applications, approximate answers suffice. In computer science, quantiles are important to both database implementers and database users. They can also be used by business intelligence applications to derive summary information from huge datasets.

As pointed out by Manku et al. in [32], a good quantile approximation algorithm should

1. not require prior knowledge of the arrival or value distribution of its inputs.
2. provide explicit and tunable approximation guarantees.
3. compute results in a single pass.
4. produce multiple quantiles at no extra cost.
5. use as little memory as possible.
6. be simple to code and understand.

Finding the quantiles of a data vector and sorting it are closely related problems, since once a vector is sorted any given quantile can be found instantly. A good account of early work on sorting algorithms can be found in [28]. Munero et al.
in [36] showed for P–pass algorithms (algorithms that scanthe data P times) Θ(N/P) storage locations are necessary and sufficient,where N is the length of the dataset. (See Appendix C for the definitionsof complexity functions such as Θ.) It is well–known that the worst-casecomplexity of sorting is nlog2 n + O(1) as shown in [33]. In [39], Patersondiscusses the progress made in the so–called “selection” problem. He letsVk(n) be the worst–case minimum number of pairwise comparisons requiredto find the k–th largest out of n “distinct elements”. In particular M(n) =Vk(n) for k = ⌈n/2⌉. In [8], it is shown that the lower bound for Vk(n) isn+min{k−1,n−k}−1, an achieved upper bound by Blum is 5.43n. Betterupper bounds have been achieved through the years. The best upper boundso far is 2.9423N and the lower bound is (2+α)N where α is of order 2−40.Yao in [49], showed that finding approximate median needs Ω(N) com-parisons in deterministic algorithms. Using sampling this can be reduced toO( 1ǫ2 log(δ−1)) independent of N, where ǫ is the accuracy of the approxima-tion in terms of the “probability loss” in our notation. In [36], Munero et al.showed that O(N1/p) is necessary and sufficient to find an exact φ–quantilein p passes.Often an exact quantile is not needed. A related problem is findingspace–efficient one–pass algorithms to find approximate quantiles. A sum-mary of the work done in this subject and a new method is given in [1]. Twoapproximate quantile algorithms using only a constant amount of memorywere given by Jain [26] Agrawal et al. in [1]. No guarantee for the errorwas given. Alsabti et al. in [3], provide an algorithm and guaranteed error1957.3. The median of the mediansin one pass. This algorithm works by partitioning the data into subsets,summarizing each partition and then finding the final quantiles using thesummarized partitions. The algorithm in this chapter is an extension of thisalgorithm to the case of partitions of unequal length.7.3 The median of the mediansA proposed algorithm to approximate the median of a very large data vectorpartitions the data into subsets of equal length, computes the median foreach partition and then computes the median of the medians. For example,suppose n = lm and break the data to m vectors of size l. One mightconjecture that by picking l or m sufficiently large the median of the medianswould ensure close proximity to the exact median. We show by an examplethat taking l and m very large will not help to get close to the exact median.Let l = 2b + 1 and m = 2a + 1.partition number Partition Median of the partition1 (1,2,··· ,b,b+ 1,10b,··· ,10b) b+ 12 (1,2,··· ,b,b+ 1,10b,··· ,10b) b+ 1. . .. . .. . .a (1,2,··· ,b,b+ 1,10b,··· ,10b) b+ 1a+1 (1,2,··· ,b,b+ 1,10b,··· ,10b) 10ba+2 (10b,10b,··· ,10b) 10b. . .. . .. . .2a+1 (10b,10b,··· ,10b) 10bTable 7.1: The table of dataExampleTable 7.1 shows the dataset partitioned into m = 2a+1 vectors of equallength. Every vector is of length l = 2b + 1. The first a + 1 vectors areidentical and 10b is repeated b times in them. The last a vectors are alsoidentical with all components equal to 10b. The median of the medians turnsout to be b + 1. However, the median of the dataset is 10b. We show thatb+1 is in fact “almost” the first quantile. This is because (b+1) is smaller1967.4. Data coarsening and quantile approximation algorithmthan all 10b’s. There are (a+1)b+a(2b+1) data points equal to 10b. 
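The construction in Table 7.1 is easy to reproduce numerically; for instance, with a = b = 50 (so m = 101 partitions, each of length l = 101), a short R sketch gives:

```r
a <- 50; b <- 50
good <- c(1:(b + 1), rep(10 * b, b))     # partitions 1, ..., a + 1 of Table 7.1
bad  <- rep(10 * b, 2 * b + 1)           # partitions a + 2, ..., 2a + 1: all entries 10b
parts <- c(rep(list(good), a + 1), rep(list(bad), a))

median(sapply(parts, median))            # b + 1 = 51:  the median of the medians
median(unlist(parts))                    # 10b = 500:   the exact median
mean(unlist(parts) < b + 1)              # about 0.25:  b + 1 sits near the first quartile
```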
Henceb +1 is smaller than this fraction of the data points:(a+ 1)b +a(2b + 1)(2a + 1)(2b + 1) =2a + 22a + 1b4b + 2 +a2a + 1 ≈ 1×14 +12 ≈34.With a similar argument, we can show that b + 1 is greater than almost aquarter of the data points (the ones equal to 1,2,··· ,b). Hence b + 1 is“almost” the first quantile.One can prove a rigorous version of the the following statement.The median of the medians is “almost” between the first and the thirdquartile.We only give a heuristic argument for simplicity. To that end, let n = lmand m = 2a + 1 and l = 2b + 1. Let M be the exact median and M′ bethe median of the medians. Order the obtained medians of each partitionand denote them by M1,··· ,Mm. By definition M′ ≥ Mj, j ≤ a andM′ ≤ Mj, j ≥ a+1. Each Mj, j ≤ a is less than or equal to b data pointsin its partition. Hence, we conclude that M′ is less than or equal to ab datapoints. Similarly M′ is greater than or equal to ab data points (which aredisjoint for the data points used before). But abn = ab(2a+1)(2b+1) ≈ 14. Hence,M′ is greater than or equal to 1/4 data points and less than or equal to 1/4data points.7.4 Data coarsening and quantile approximationalgorithmThis section introduces an algorithm to approximate quantiles in very largedata vectors. As we demonstrated in the previous section the median ofmedians algorithm is not necessarily a good approximation to the exactmedian of a data vector even if we have a large number of partitions andlarge length of the partitions. The algorithm is based on the idea of “datacoarsening” which we will discuss shortly. The proposed algorithm can giveus approximations to the exact quantile of known precisions in terms ofdegree of separation. After stating the algorithm, we prove some theoremsthat give us the precision of the algorithm. The results hold for partitionsof non–equal length.1977.4. Data coarsening and quantile approximation algorithmDefinition Suppose a data vector x of length n = n1n2 is given, n1,n2 >1 ∈ N. Also let sort(x) = y = (y1,··· ,yn). Then the n2–coarseningof x, Cn2(x) is defined to be (yn2,y2n2,··· ,y(n1−1)n2). Note that Cn2(x)has length n1 − 1. Let pi = i/n1,i = 1,2,··· ,(n1 − 1). Then Cn2(x) =(lqx(p1),··· ,lqx(pn1−1)).We can immediately generalize the coarsening operator. Supposesort(x) = (y1,··· ,yn),and n2 < n is given. Then by The Quotient–Remainder Theorem fromelementary number theory, there exist n1 ∈ N∪{0} and r < n2 such thatn = n1n2 + r. Define Cn2(x) = (yn2,··· ,yn2(n1−1)). The expression issimilar to before. However, there are n2 +r elements after yn2(n1−1) in thesorted vector y. In this sense this coarsening is not fully symmetric. Weshow that if n2 is small compared to n this lack of symmetry has a smalleffect on the approximation of quantiles.Suppose x is a data vector of length n = summationtextmi=1 li. We introduce thecoarsening algorithm to find approximations to the large data vectors.d–Coarsening quantiles algorithm:1. Partition x into vectors of length l1,··· ,lm. (Or use pre–existingpartitions, e.g. partitions of data saved in various files on the harddisk of a computer.)x1 = (x1,··· ,xl1),x2 = (xl1+1,··· ,xl1+l2),··· ,xm = (xsummationtextm−1j=1 lj+1,··· ,xn)2. Sort each xl, l = 1,2,··· ,m and let yl = sort(xl), l = 1,··· ,m:y1 = (y11,··· ,y1l1),··· ,ym = (ym1 ,··· ,ymlm).3. d–Coarsen every vector:(y1d,··· ,y1(c1−1)d),··· ,(ymd ,··· ,ym(cm−1)d),and for simplicity drop d and use the notation wji = yjid.w1 = (w11,··· ,w1(c1−1)),··· ,wm = (wm1 ,··· ,wm(cm−1)).1987.4. 
Data coarsening and quantile approximation algorithm4. Stack all the above vectors into a single vector and call it w. Findrqw(p) (or lqw(p)) and call it µ. Then µ is our approximation torqx(p) (or lqx(p)).Theorem 7.4.1 Suppose x is of length n = summationtextmi=1 li, m ≥ 2 and li =cid. Let C = summationtextmi=1 ci. Apply the coarsening algorithm to x and find µto approximate rqx(p) (or lqx(p)). Then µ is a (left and right) quantile inthe interval[p−ǫ,p+ǫ],where ǫ = m+1C−m. In other words δx(µ,rqx(p)) ≤ ǫ and δx(µ,lqx(p)) ≤ ǫ.When li = cd, i = 1,··· ,m, ǫ = m+1m−1 1c−1 ≤ 3c−1.We need an elementary lemma in the proof of this theorem.Lemma 7.4.2 (Two interval distance lemma)Suppose two intervals I = [a,b] and J = [c,d] subsets of R are given. Thensup{|p−q|,p ∈ I,q ∈ J} = max{|a−d|,|b−c|}.Proof sup{|p −q|,p ∈ I,q ∈ J} ≥ max{|a − d|,|b −c|} is trivial becausea,b ∈ I and c,d ∈ J.To show the converse note that |p−q| = p−q or q −p, p ∈ I,q ∈ J. Butp−q ≤ b−c,andq −p ≤ d−a.Hence|p−q| ≤ max{b−c,d−a}≤ max{|b−c|,|a−d|}.This completes the proof.Proof of Theorem 7.4.1.Let n′ = summationtextmi=1(ci − 1) = summationtextmi=1 ci − m = C − m and MC = {(i,j)|i =1,2··· ,m,j = 1,··· ,ci−1}, theindexset ofw. Also letc = max{c1,··· ,cm}.Suppose, h−1n′ ≤ p < hn′, h = 1,··· ,n′. Then since µ = rqw(p), thereare disjoint subsets of MC, K and K′ such that |K| = h, |K′| = n′ − h,µ ≥ wij, (i,j) ∈ K and µ ≤ wij, (i,j) ∈ K′. (This is because if we letv = sort(w), rqw(p) = vh since [n′p] = h−1.)1997.4. Data coarsening and quantile approximation algorithmK,K′ are not necessarily unique because of possible repetitions amongthe wit. Hence we impose another condition on K and K′. If (i,t) ∈ Kthen (i,u) /∈ K′, u < t. It is always possible to arrange for this condition.For suppose, (i,t) ∈ K and (i,u) ∈ K′, u < t. Then µ ≥ wti and µ ≤ wiu,hence wit ≤ wui . But since u < t we have wit ≤ wui by the definition of wi.We conclude that wit = wui . Now we can simply exchange (i,t) and (i,u)between K and K′. If we continue this procedure after finite number ofsteps we will get K and K′ with the desired property.Now define•K1 = {(i,1)|(i,1) ∈ K},with |K1| = k1 andI1 = {(i,j)|j ≤ d,(i,1) ∈ K},Then |I1| = k1d. Also note that if (i,j) ∈ I1, µ ≥ wi1 ≥ yij.• LetK2 = {(i,2)|,(i,2) ∈ K},with |K2| = k2 andI2 = {(i,j)|d < j ≤ 2d,(i,2) ∈ K}.Then |I2| = k2d. Also note that if (i,j) ∈ I2, µ ≥ wi2 ≥ yij.• LetKt = {(i,t)|(i,t) ∈ K},with |Kt| = kt andIt = {(i,j)|(t −1)d < j ≤ td,(i,t) ∈ K}.Then |It| = ktd. Also note that if (i,j) ∈ It, µ ≥ wit ≥ yij.• LetKc−1 = {(i,(c−1))|(i,c−1) ∈ K},with |Kc−1| = kc−1 andI(c−1) = {(i,j)|(c −2)d < j ≤ (c−1)d,(i,c−1) ∈ K}.Then |Ic−1| = kc−1d. Also note that if (i,j) ∈ I(c−1), µ ≥ wi(c−1) ≥ yij.2007.4. Data coarsening and quantile approximation algorithmNote that K = ∪c−1t=1Kt, |K| = k1,+···+kc−1. Since the Kt are disjointthe It are also disjoint. Let I = ∪c−1t=1It then |I| = d(k1 +···+kc−1) = d|K|.Also note that (i,j) ∈ I ⇒ µ ≥ yij.Similarly define,•K′1 = {(i,1)|(i,1) ∈ K′},|K′1| = k′1,andI′1 = {(i,j)|d < j ≤ 2d,(i,1) ∈ K′}.Then |I′1| = k′1d. Also note that if (i,j) ∈ I′1, µ ≤ wi1 ≤ yij.• LetK′2 = {(i,2)|(i,2) ∈ K′},|K′2| = k′2,andI′2 = {(i,j)|2d < j ≤ 3d,(i,2) ∈ K′}.Then |I′2| = k′2d. Also note that if (i,j) ∈ I′2, µ ≤ wi2 ≤ yij.• LetK′t = {(i,t)|(i,t) ∈ K′},|K′t| = k′t,andI′t = {(i,j)|td < j ≤ (t+ 1)d,(i,t) ∈ K′}.Then |I′t| = k′td. Also note that if (i,j) ∈ I′t then µ ≤ wit ≤ yij.•K′c−1 = {(i,(c−1))|(i,c−1) ∈ K′},|K′c−1| = k′c−1,andI′c−1 = {(i,j)|j > (c−1)d,(i,c−1) ∈ K′}.Then|I′c−1| = k′c−1d. 
Also note that if (i,j) ∈ I′c−1 ⇒ µ ≤ wi(c−1) ≤ yij.Then |I| = |K|d and |I′| = |K′|d. We claim that I ∩ I′ = ∅. To see thisnote that because of how the second components in It and I′t are defined,it is only possible that It+1 = {(i,j)|td < j ≤ (t + 1)d,(i,t + 1) ∈ K} andI′t = {(i,j)|td < j ≤ (t+1)d,(i,t) ∈ K′} intersect for some t = 1,··· ,c−2.But if they intersect then there exist i,t such that (i,t+1) ∈ K and (i,t) ∈K′ which is against our assumption regarding K and K′. Hence by Lemma5.2.4, µ is a quantile between2017.4. Data coarsening and quantile approximation algorithm[|K|dn , n−|K′|dn ] = [hdsummationtextmi=1 cid, n−(n′ −h)dsummationtextmi=1 cid] = [hC, m+hC ].But we know thatp ∈ [ h−1C −m, hC −m).We are dealing with two interval in one of them µ is a quantile and the othercontains p.We showed in Lemma 7.4.2 if two intervals [a,b] and [c,d] are given, thesup distance between two elements of the two intervals ismax{|a−d|,|b−c|}.Applying this to the above two intervals we get,max{|m+hC − h−1C −m|,| h−1C −m − hC|},which is equal to,max{|mC −m2 −hm+CC(C −m) |,|C −hmC(C −m)|}.But m2 +hm ≤ m2 + (C −m)m = mC. Hence|mC −m2 −hm+CC(C −m) | =mC −m2 −hm+CC(C −m) ≤mC +CC(C −m) =m+ 1C −m.Also| C −hmC(C −m)|≤ C +mCC(C −m) ≤ m+ 1C −m.Hence the max is smaller than ǫ = m+1C−m and we conclude that µ is a quantilefor p′ which is at most as far as ǫ to p.The case li = cd is easily obtained by replacing C = mc and noting thatm+1m−1 ≤ 3 m ≥ 2.In most applications, usually the data partitions are not divisible byd. For example the data might be stored in files of different length withcommon factors. Another situation involves a very large file that is neededto be read in successive stages because of memory limitations. Suppose that2027.4. Data coarsening and quantile approximation algorithmwe need a precision ǫ (in terms of degree of separation) and based on thatwe find an appropriate c and m. Note that n might not be divisible by mc.First we prove two lemmas. These lemmas show what happens to thequantiles if we throw away a small portion of the data vector or add somemore data to it. The first lemma is for a situation that we have thrown awayor ignored a small part of the data. The second lemma is for a situationthat a small part of the data are contaminated or includes outliers. Inboth cases, we show how the quantiles computed in the “imperfect” vectorscorrespond to the quantiles of the original vector. In both case x stands forthe imperfect vector and w is the complete/clean data.Lemma 7.4.3 (Missing data quantile summary lemma)Suppose x = (x1,··· ,xn), sort(x) = (y1,··· ,yn) and y′ = lqx(p),p ∈ [0,1].Consider a vector x⋆ of length n⋆ and let w = stack(x,x⋆). Then y′ =lqw(p′), where p′ ∈ [p−ǫ,p +ǫ] and ǫ = n⋆n+n⋆.Similarly if y′ = rqx(p) and p ∈ [0,1], y′ = rqw(p′), where p′ ∈ [p−ǫ,p+ǫ]and ǫ = n⋆n+n⋆.Proof We prove the result for lqx only and a similar argument works forrqx.Let z = sort(w) then lqz = lqw. For p = 1 the result is easy to see.Otherwise, in ≤ p < i+1n for some i = 0,··· ,n−1. But then y′ = lqx(p) = yi.In the new vector z since we have added n⋆ elements y′ = zj for some j,i ≤ j < i+n⋆. Hence y′ = lqz( jn+n⋆). From np−1 < i ≤ np, we concludenp−1n+n⋆ <in+n⋆ ≤jn+n⋆ <i+n⋆n+n⋆ ≤np+n⋆n+n⋆ .Hence,n⋆(1−p)−1n+n⋆ <jn+n⋆ −p <n⋆(1−p)n+n⋆ ⇒| jn+n⋆ −p| < max{|n⋆(1−p)−1n+n⋆ |,|n⋆(1−p)n+n⋆ |}.But |n⋆(1−p)n+n⋆ | ≤ n⋆n+n⋆ and |n⋆(1−p)−1n+n⋆ | ≤ max{n⋆−1n+n⋆, 1n+n⋆} since p ranges in[0,1]. We conclude that that| jn+n⋆ −p| < n⋆n+n⋆.2037.4. 
Data coarsening and quantile approximation algorithmLemma 7.4.4 (Contaminated data quantile summary lemma)Suppose x = (x1,··· ,xn), sort(x) = (y1,··· ,yn) and y′ = lqx(p),p ∈ [0,1].Consider the vector w = (x1,x2,··· ,xn−n⋆) then y′ = lqw(p′), where p′ ∈[p−ǫ,p +ǫ] and ǫ = n⋆n−n⋆.Similarly if y′ = rqx(p) and p ∈ [0,1], y′ = rqw(p′), where p′ ∈ [p−ǫ,p+ǫ]and ǫ = n⋆n−n⋆.Proof We only show the case for lqx and a similar argument works for rqx.Let z = sort(w). Thenlqz = lqw. If p = 1 the result is easy to see. Otherwise,in ≤ p <i+1n for some i = 0,··· ,n − 1. But then y′ = lqx(p) = yi. Inthe new vector z since we have removed n⋆ elements y′ = zj for some j,i−n⋆ ≤ j ≤ i. Hence y′ = lqz( jn−n⋆). From np−1 < i ≤ np, we concludenp−1−n⋆ < j ≤ np ⇒ np−n⋆ ≤ j ≤ np. Hence−n⋆ +n⋆pn−n⋆ ≤jn−n⋆ −p ≤n⋆pn−n⋆ ⇒| jn−n⋆ −p|≤ n⋆n−n⋆.In the case that the partitions are not divisible by d, we can use the samealgorithm with generalized coarsening. The error will increase obviously andthe next two lemmas say by how much.Lemma 7.4.5 Suppose x has length n = lm + r, 0 ≤ r < l and m = cd.To find lqx(p), apply the algorithm in the previous theorems to a sub–vectorof x of length lm. Then the obtained quantile is a quantile for a number in[p−ǫ,p +ǫ], where ǫ = m+1m−1 1c−1 + rlm+r.Proof The result is a straightforward consequence of the Theorem 7.4.1and the Lemma 7.4.3.Lemma 7.4.6 Suppose x has length n =summationtextmi=1 li and li = cid+ ri, ri < d.Let R =summationtextmi=1 ri. Then apply the algorithm above to x to find lqx(p), usingthe generalized coarsening. The obtained quantile is a quantile for a numberin [p−ǫ,p+ǫ] where ǫ = m+1C−m + RR+Cd.Proof Let l′i = cid. Consider x′ a sub–vector of x consisting of(y11,··· ,y1l′1),(y21,··· ,y2l′2),··· ,(ym1 ,··· ,yml′m).2047.5. The algorithm and computationsThen x′ has length summationtextmi=1 l′i. By Lemma 7.4.3 p-th quantile found by the al-gorithm is a quantile in [p−ǫ1,p+ǫ1], ǫ1 = m+1C−m for x′. x has R =summationtextmi=1 rielements more than x′. Hence the obtained quantile is a quantile for x fora number in [p−ǫ,p +ǫ], ǫ = ǫ1 + RR+Cd.7.5 The algorithm and computationsSuppose a data vector x has length n. To find the quantiles of this vector,we only need to sort x. Since then for any p ∈ (0,1), we can find the first hsuch that p ≥ h/n. Note thatsort(x) = (lqx(1/n),lqx(2/n),··· ,lqx(1)) = (rqx(0),rqx(1/n),··· ,rqx(n−1n )).We only focus on left quantiles here. Similar arguments hold for the rightquantile.Obviously, the longer the vector x, the finer the resulting quantiles are.Now imagine that we are given a very long data vector which cannot evenbe loaded on the computer memory. Firstly, sorting this data is a challengeand secondly, reporting the whole sorted vector is not feasible. Assume thatwe are given the sorted data vector so that we do not need to sort it. Whatwould be an appropriate summary to report as the quantiles? As we notedalso the sorted vector itself although appropriate, maybe of such lengthas to make further computation and file transfer impossible. The naturalalternative would be to coarsen the data vector and report the resultingcoarsened vector. To be more precise, suppose, length(x) = n = n1n2 andy = sort(x) = (y1,··· ,yn). Then we can reporty′ = Cn2(y) = (yn2,··· ,y(n1−1)n2).This corresponds to(lqy′(1/n2),··· ,lqy′(1)).How much will be lost by this coarsening? Suppose, we require the leftquantile corresponding to (h − 1)/n < p ≤ h/n, h = 1,··· ,n. Then xwould give us yh. But since (h−1)/n < p ≤ h/nnp < h ≤ np+ 1.2057.5. 
The algorithm and computationsAlso suppose for some h′ = 1,··· ,n1,(h′ −1)/(n1 −1) < p ≤ (h′)/(n1 −1) ⇒ (h′ −1) < p(n1 −1) ≤ h′⇒ (n1 −1)p ≤ h′ < p(n1 −1) + 1.Then(h−1)(n1 −1)/n < h′ < h(n1 −1)/n + 1,and(h−1)(n1 −1)n2/n < h′n2 < h(n1 −1)n2/n+n2. (7.1)Using the coarsened vector, we would report yh′(n2) as the approximatedquantile for p. The degree of separation between this element and the exactquantile using Equation 7.1 is less than or equal tomax{|h−(h−1)(n1 −1)n2/n|n , |h(n1 −1)n2/n+n2 −h|n }.This equalsmax{|−hn2 −n1n2 +n2n2 |,|−hn2 +nn2n2 |}.But|−hn2 −n1n2 +n2n2 | = n2(n1 +n−1)n2 < n2(n1 +n)n2 = 1n + n2n ,and|−hn2 +nn2n2 | < n2n .Hence the degree of separation is less than 1/n+1/n1. We have proved thefollowing lemma.Lemma 7.5.1 Suppose x is a data vector of the length n = n1n2 and y =sort(x), y′ = Cn2(y). Then if we use the quantiles of y′ in place of x, theaccuracy lost in terms of the probability loss of x (δx) is less than 1/n+1/n1.The algorithm proposes that instead of sorting the whole vector and thencoarsening it, coarsen partitions of the data. The accuracy of the quantilesobtained in this way is given in the theorems of the previous section. Thisallows us to load the data into the memory in stages and avoid programfailure due to the length of the data vector. We are also interested in the2067.5. The algorithm and computationsperformance of the method in terms of speed, and do a simulation studyusing the “R” package (a well–known software for statistical analysis) toassess this. In order to see theoretical results regarding the complexity of thespecial case of the algorithm for equal partitions see [3]. For the simulationstudy, we create a vector, x, of length n = 107. We apply the algorithm form = 1000,c = 20,d = 500. We create this vector in a loop of length 1000.During each iteration of the loop, we generate a random mean for a normaldistribution by first sampling from N(0,100). Then we sample 10,000 pointsfrom a normal distribution with this mean and standard deviation 1. Wecompare two scenarios:1. Start by a NULL vector x and in each iteration add the full generatedvector of length 10000 to x. After the loop has completed its run, sortthe data vector which now has length 107 by the command sort in Rand use this to find the quantiles.2. Start with a NULL vector w. During each iteration after generatingthe random vector, d–coarsen the data by d = 500. (Hence m = 1000,c = 20.) In order to do that computing, first apply the sort commandto the data and then simply d–coarsen the resulting sorted vector.During each iteration, add the coarsened vector to w. After all theiterations, sort w and use it to approximate quantiles.Remark. The first part corresponds to the straightforward quantiles’ cal-culation and the second corresponds to our algorithm. Note that in the realexamples instead of the loop, we could have a list of 1000 data files and stillthis example serves as a way of comparing the straightforward method andour algorithm.Remark. Note that if we wanted to create an even longer vector say oflength 1010 then the first method would not even complete because thecomputer would run out of memory in saving the whole vector x.Remark. The final stage of the algorithm can use the fact that w is builtof ordered vectors to make the algorithm even faster. We will leave that aproblem to be investigated in the future.We have repeated the same procedure for n = 2×107,m = 1000,d = 500and n = 108,m = 1000,d = 500. 
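For concreteness, the coarsening scenario (scenario 2) can be sketched in R roughly as follows. This is an illustrative sketch only: the function d_coarsen and the variable names are ours, and R's type = 1 sample quantile is used as a stand-in for the left quantile function.

    # d-coarsening of one partition: sort, then keep every d-th element,
    # dropping the last block: (y_d, y_2d, ..., y_{(c-1)d}).
    d_coarsen <- function(x, d) {
      y <- sort(x)
      c_blocks <- length(y) %/% d
      y[d * seq_len(c_blocks - 1)]
    }

    set.seed(1)
    m <- 1000; d <- 500; block_len <- 10000   # so c = 20 and n = 10^7
    w <- NULL        # stacked coarsened summaries (scenario 2)
    x_full <- NULL   # full data, kept here only to compare with the exact answer
    for (i in seq_len(m)) {
      mu <- rnorm(1, 0, 100)                  # random block mean, as described above
      xi <- rnorm(block_len, mu, 1)
      w <- c(w, d_coarsen(xi, d))
      x_full <- c(x_full, xi)
    }

    approx_med <- quantile(w, 0.5, type = 1, names = FALSE)
    exact_med  <- quantile(x_full, 0.5, type = 1, names = FALSE)
    # degree of separation in the original vector: fraction of the data
    # strictly between the approximated and the exact median
    dos <- mean(x_full > min(approx_med, exact_med) &
                x_full < max(approx_med, exact_med))

Only w, of length m(c - 1) = 19000, would need to be kept in a real application; x_full is retained in the sketch purely for the comparison.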
The results of the simulation are given in Table 7.2, in which "DOS" stands for the degree of separation between the exact median and the approximated median, and the "DOS bound" is the bound on the degree of separation obtained from the theorems of the previous section. For n = 10^7 and n = 2 x 10^7, substantial time savings accrue from using the algorithm. For a vector of length 10^8, R crashed when we tried to sort the original vector, and only the algorithm could provide results. In all cases the exact and approximated quantiles are close; in fact the DOS is significantly smaller than the DOS bound, because the latter is a worst-case bound. The exact and approximated quantiles for n = 10^7 are plotted in Figure 7.1.

    Length                     n = 10^7       n = 2 x 10^7       n = 10^8
    Exact median value         1.847120       1.857168           NA
    Algorithm median value     1.866882       1.846463           1.846027
    DOS                        0.00012        -6.475 x 10^-5     NA
    DOS bound                  0.05268421     0.02566667         0.005030151
    Time for exact median      186 s          461 s              NA
    Time for the algorithm     6 s            18 s               98 s

Table 7.2: Comparing the exact method with the proposed algorithm in R (run on a laptop with 512 MB of memory and a 1500 MHz processor), with m = 1000 and d = 500. "DOS" stands for the degree of separation in the original vector. "DOS bound" is the theoretical degree of separation obtained from Theorem 7.4.1.

Next, we apply the algorithm to a real dataset consisting of the daily maximum temperature at 25 stations over Alberta during the period 1940–2004. We focus on the 95th percentile. The results are given in Table 7.3. The algorithm finds the percentile more quickly, but the time difference is not as large as in the simulation, because most of the time of both the algorithm and the exact computation is spent reading the files from the hard drive. The DOS bound is about 0.01 (on the 0–1 probability scale), while the true degree of separation is about 0.001. The estimated and exact quantiles are plotted in Figure 7.2. The exact and approximated values match except very close to 0 and very close to 1, where the circles (the exact quantiles) and the +'s (the approximated quantiles) do not completely coincide; even there the difference is at most 0.01 in terms of DOS.

Figure 7.1: Comparing the approximated quantiles to the exact quantiles, n = 10^7; lq(p) is plotted against p. The circles are the exact quantiles and the +'s are the corresponding approximated quantiles.

Figure 7.2: Comparing the approximated quantiles to the exact quantiles for MT (daily maximum temperature) over 25 stations in Alberta, 1940–2004; lq(p) is plotted against p. The circles are the exact quantiles and the +'s the approximated quantiles.

    Exact 95th percentile          27 °C
    Algorithm 95th percentile      26.7 °C
    DOS                            0.001278726
    DOS bound                      0.01052189
    Time for exact computation     8 min 6 s
    Time for the algorithm         7 min 29 s

Table 7.3: Comparing the exact method with the proposed algorithm in R (run on a laptop with 512 MB of memory and a 1500 MHz processor) to compute the quantiles of MT (daily maximum temperature) over 25 stations with data from 1940 to 2004.

Chapter 8

Quantile data summaries

8.1 Introduction

This chapter introduces techniques to summarize data (using quantiles) and to manipulate and combine such summaries. "Weighted data vectors", which are an extension of data vectors, are introduced.
The operators sort and stackare extended to weighted data vectors and the operator comp (compress) isintroduced to compress a data vector as much as possible with no loss ofinformation. In the quantile definition chapter, we expressed a few appeal-ing properties that quantiles should satisfy. We established the equivarianceand symmetry properties and left the following to later:1. The “amount” of data between qx(p1) and qx(p2) should be a p2 −p1, p1 < p2 fraction of the “data amount” of the whole data.2. If we cut a sorted data vector up until the p1-th quantile and computethe p2-th quantile for the new vector, we should get the p1p2-th quan-tile of the original vector. For example the median of a sorted vectorupto its median should be the first quartile of the original vector.A natural definition for the “amount of data” between a,b would bethe number of data points between a,b divided by the length of the wholevector. However, by this definition there is no hope of establishing property(1) knowing that p2−p1 can be irrational. Also for the second property onemight conjecture that if we define the cut operator to be the sorted vectorfrom left to lqx(p1) (or rqx(p1)) then this property holds. However, considerx = (1,2) and a cut of length 0.6. Then we get the same vector x′ = (1,2)after the cut using this definition since lqx(0.6) = 2. Now the 0.7th left (orright) quantile of the cut vector x′ islqx′(0.7) = 2.However,lqx(0.6×0.7) = lqx(0.42) = 1.2128.1. IntroductionIn the following, we define the cut operator for p ∈ (0,1) in a way that itends with lqx(p) but satisfies property (2). The idea can be explained inthe example by considering the vector x = (1,2) as a weighted vector withweights (1/2,1/2) and give 2 less “weight” than 1 after the cut. In summary,this chapter provides a framework to establish these properties, using the“partition” operator and the “cut” operator.When dealing with summarized data the following general question is afundamental one:Question: Suppose x is a data vector which consists of m subvectorsx1,··· ,xm.In other words x = stack(x1,··· ,xm). Assume we do not have access tothe xi but to the wi, their summaries (possibly a result of coarsening of thexi). Then how can we approximate the quantiles of the original data vectorx and assess how good this approximation is?We have already encountered such a problem in Chapter 7, where we an-swered the question in some specific cases. We do not answer the questionin general in this chapter but provide a framework to formalize and answerthese type of questions.In computer science quantiles are sometimes used to summarize largedatasets. A good summary of the work for creating quantile summaries ofdatasets in a single pass is given in [19].In order to make a summary (of length k) of a data vector using thequantiles, one has various choices to pick certain probability indicesp1 ≤ p2 ≤ ··· ≤ pk,and save the corresponding quantiles. Using the probability loss function,we find an optimal way of doing this. Then we consider the problem offindingargminaE(L(X,a)),for various L (loss) functions. It is widely claimed that if L is the absolutevalue function, the argmin is the median of X. We show that the argminis in fact [lqX(1/2),rqX(1/2)]. We also find theargminaE(δX(X,a)).Finally, we find optimal “probability index vectors” to assign quantiles to arandom sample X1,··· ,Xn, which can be used to make a quantile–quantileplot. Some previous techniques to make a q–q plot are discussed in [24].2138.2. 
Generalization to weighted vectors8.2 Generalization to weighted vectorsThis section extends the definitions and ideas developed before (quantiles,probability loss function, sorting, stacking etc.) from ordinary data vectorsto weighted vectors. A weighted vector has two extra components comparedto an ordinary vector: a weight allocation and a data amount. This allows usto summarize information in some cases. For example, consider the vector(1,1,1,1,1,1,1,1,1,2). We observe that 1 is repeated 9 times and 2 onlyone time. We can summarize this by giving the elements (1,2) a weightallocation (0.9,0.1) and a data amount 10 which is the length of the vectorin this case. Weighted vectors also enable us to define the “cut” operator tocut data vectors.Definition We call a triple χ = (x,wχ,nχ) a weighted vector if length(x) =length(wχ) = lx, x = (x1,··· ,xl), wχ = (wχ1,··· ,wχl ),summationtextli=1 wχi = 1 and nχa positive real number. Note that nχ is not necessarily equal to the lengthof x. We call wχ the “weight vector” of χ and nχ the “data amount” of χ.Remark. Note that in order to specify a weight vector w, we do not needto specify the last component since the weights must sum up to one.Examples:1. χ = ((1,2,3),(1/3,1/3,1/3),3). This is equivalent to an ordinaryvector of length 3 in a sense we make clear soon.2. χ = ((1,2,3),(1/3,1/3,1/3),6). Notice this weighted vector has thesame elements as before with a data amount of 6 which is two timesthe previous vector. This vector is equivalent to the ordinary vectorx = (1,1,2,2,3,3).3. χ = ((1,1,2,3),(1/6,1/6,1/3,1/3),3). This is equivalent to vectorgiven in 1. Note that one is repeated two times here. However, thesum of the weights for 1 is 1/6+1/6=1/3 which is the same as thevector defined in 1.4. χ = ((1),(1),1/2). Here we only have 1/2 data amount. i.e. we haveless than one observation! (1/2 of an observation to be precise.)5. χ = ((1,2),(1/2,1/2),√3).2148.2. Generalization to weighted vectorsThe first vector, x, in the definition χ = (x,wχ,nχ), is the vector of possiblevalues, the second one, wχ, is the corresponding weights for elements of xand the third component, nχ, is a measure of how fine the vector is.A vector is called an ordinary vector if the length of x, lx, is equal to nχand wχi = wχj , i,j ∈ 1,··· ,lx. The ordinary vector corresponds to the usualdata vectors. Denote the space of all weighted vectors by Υ. We define someoperations and an equivalence relation on Υ.Definition Suppose χ = (x,wχ,nχ) then comp(χ) = ξ = (y,wξ,nξ), wherey = (y1,··· ,yr) is a non-decreasing vector of all disjoint elements of x,wξi =summationtextxj=yi wχj and nξ = nχ.It is clear that comp (compress operator) is an operator from Υ to Υ. Thenwe define an equivalence relation on Υ.Definition χ ∼ ξ in Υ iff comp(χ) = comp(ξ).Clearly, ∼ is an equivalence relation. Let us define a transformation of aweighted vector.Definition Suppose χ = (x,wχ,nχ) is a weighted vector and φ a trans-formation of R (not necessarily increasing). Then φ(χ) = ζ = (z,wζ =wχ,nζ = nχ), where zi = φ(xi), i = 1,2,··· ,lx.For ordinary vectors x,y, comp(x) = comp(y) iff sort(x) = sort(y). 
Alsocomp leaves the last component of a weighted vector (the data amount)unchanged.Since x and wχ have the same length, we can show an element of Υ bypair consisting of a matrix of dimension 2×lx and a number nχ:χ = (parenleftbigg x1 ··· xlxwχ1 ··· wχlxparenrightbigg,nχ)Given a weighted vector χ = (x,wχ,nχ), we can naturally define a dis-tribution function as follows.Definition Supposeχ = (x,wχ,nχ) is a weighted vector. The the empiricaldistribution of χ is defined asFχ(a) =summationdisplayi, xi≤awχi .2158.2. Generalization to weighted vectorsRemark. If χ is an ordinary vector then Fχ is the usual empirical function.Then we extend the definition of the stack operator to weighted vectors.Definition Suppose χ = (x,wχ,nχ) and ξ = (y,wξ,nξ) are given thenstack : Υ×Υ → Υ,(χ,ξ) mapsto→ ζ = (z,wζ,nχ +nξ),where (z,wζ) in the matrix notation is given byparenleftBiggx1 ··· xlx y1 ··· ylywχ1 nχnχ+nξ ··· wχlx nχnχ+nξ wξ1 nξnχ+nξ ··· wξly nξnχ+nξparenrightBigg.Remark. In the definition, notice how the data amounts are used to adjustthe weights.Remark. For ordinary vectors x,y the stack operator coincide to concate-nating x and y.Lemma 8.2.1 (Stack operator properties)a) The stack operator preserves the equivalence relation defined above, i.e.χ1 ∼ ξ1, χ2 ∼ ξ2, then stack(χ1,χ2) ∼ stack(ξ1,ξ2)b)stack(χ1,stack(χ2,χ3)) ∼ stack(stack(χ1,χ2),χ3)Proof a) Suppose χi = (xi,wχi,nχi),ξi = (yi,wξi,nξi) and χi ∼ ξi fori = 1,2. Letχ = comp(stack(χ1,χ2)), comp(stack(ξ1,ξ2)) = ξ.We need to show χ = ξ. Let χ = (x,wχ,nχ) and ξ = (y,wξ,nξ). Fromχi = ξi for i = 1,2, we conclude nχi = nξi,i = 1,2, which in turn givesnχ = nχ1 +nχ2 = nξ1 +nξ2 = nξ.Also x = y since both x and y are increasingly sorted and every element inx is an element of x1 or x2 which have the same elements as y1 or y2. Nowto show wχi = wξi,i = 1,2,··· ,lx, suppose xi = yi be the correspondingelement in x = y. Assume that the corresponding weight for xi in χ1 is wand in χ2 is w′. Then the corresponding weight in ξ1 and ξ2 must be w and2168.2. Generalization to weighted vectorsw′ respectively by the assumed equivalence relations. Hence wχi and wξi areequal tow. nχ1nχ1 +nχ2 +w′. nχ2nχ1 +nχ2 ,andw. nξ1nξ1 +nξ2 +w′. nξ2nξ1 +nξ2 ,which are equal.b) Letχ = (x,wχ,nχ) = comp[stack(χ1,stack(χ2,χ3))]andχ′ = (x′,wχ′,nχ′) = comp[stack(stack(χ1,χ2),χ3))].We show χ = χ′. Firstly, note thatnχ = nχ1 + (nχ2 +nχ3) = (nχ1 +nχ2) +nχ3 = nχ′.x = x′ is trivial. Fix xi = x′i in x = x′. Suppose its corresponding weightin χj is equal to wj,j = 1,2,3. To show that the corresponding weightswχi and wχ′i are equal, note that the corresponding weight of xi in χ is acombination of its weights in χ1 and stack(χ2,χ3):wχi = w1 nχ1nχ1 + (nχ2 +nχ3)+[w2nχ2nχ2 +nχ3 +w3nχ3nχ2 +nχ3 ]nχ2 +nχ3nχ1 + (nχ2 +nχ3)and the corresponding weight of xi in χ′ is a combination of its weights instack(χ1,χ2) and χ3:wχ′i = [w1 nχ1nχ1 +nχ2 +w2nχ2nχ1 +nχ2 ]nχ1 +nχ2(nχ1 +nχ2)+nχ3 +w3nχ3(nχ1 +nχ2) +nχ3 .But the previous two expressions are equal and the proof is complete.This lemma implies that we can use the notation stack(χ1,··· ,χm).Definition of quantiles and DOS for weighted vectorsNow let us get to the definition of quantiles. We can proceed exactly inthe same way as we did before by having in mind a bar of length one.Or alternatively, we can apply the quantile function definition for usualdistributions to the empirical distribution of a weighted vector Fχ. Thistime, we proceed in a slightly different fashion which is equivalent to these2178.2. Generalization to weighted vectorsmethods. 
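To make these constructions concrete, the following small R sketch represents a weighted vector as a list with components x, w and n (a representation chosen here for illustration, not one prescribed by the text) and implements comp, stack and the empirical distribution Fχ.

    # A weighted vector: values x, weights w summing to 1, data amount n.
    wvec <- function(x, w, n) list(x = x, w = w, n = n)

    # comp: sort the values, merge repeated values and add up their weights;
    # the data amount is left unchanged.
    comp <- function(chi) {
      o  <- order(chi$x)
      x  <- chi$x[o]; w <- chi$w[o]
      ux <- unique(x)
      uw <- vapply(ux, function(v) sum(w[x == v]), numeric(1))
      wvec(ux, uw, chi$n)
    }

    # stack (named stack2 to avoid masking utils::stack): concatenate the values
    # and rescale each weight vector by its share of the combined data amount.
    stack2 <- function(chi, xi) {
      n <- chi$n + xi$n
      wvec(c(chi$x, xi$x), c(chi$w * chi$n / n, xi$w * xi$n / n), n)
    }

    # Empirical distribution F_chi(a): total weight of the values <= a.
    F_chi <- function(chi, a) sum(chi$w[chi$x <= a])

    # Example from the text: nine 1's and one 2 compress to values (1, 2)
    # with weights (0.9, 0.1) and data amount 10.
    chi <- comp(wvec(c(rep(1, 9), 2), rep(0.1, 10), 10))

The left and right quantile functions defined next can be read off F_chi in the usual way.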
Suppose χ = (x,wχ,nχ) is given and ζ = comp(χ) = (z,wζ,nχ).We assume z has length lz. First, we definelqindχ : (0,1] → {1,2,··· ,lz},andrqindχ : [0,1) → {1,2,··· ,lz},the “left quantile index” and “right quantile index” functions and thendefine the left and right quantile functions using the index functions. Ifζ = comp(χ) = (z,wζ,nx) then we definelqχ(p) = zlqindχ(p), p ∈ (0,1], lqχ(p) = −∞, p = 0,andrqχ(p) = zrqindχ(p), p ∈ [0,1), rqχ(p) = ∞, p = 1.Let ζ = comp(χ). lqindχ and rqindχ are defined as follows:• p = 0 then lqindχ(p) not defined and rqindχ(p) = 1.• 0 < p < wζ1 then lqindχ(p) = rqindχ(p) = 1.• p = wζ1 then lqindχ(p) = 1 and rqindχ(p) = 2....• wζ1 +···+wζi−1 < p < wζ1 +···+wζi then lqindχ(p) = rqindχ(p) = i.• p = wζ1 +···+wζi then lqindχ(p) = i,rqindχ(p) = i+ 1....• p = 1 then lqindχ(p) = lz and rqindχ is not defined.Remark. It is easy to see that χ ∼ ξ then lqχ = lqξ,rqχ = rqξ.Remark. For ordinary vectors, this is equivalent to the definition given inthe previous sections.2188.2. Generalization to weighted vectorsRemark. Consider the natural distribution function Fχ corresponding to aweighted vector χ then lqχ = lqFχ and rqχ = rqFχ. Hence, lqχ,rqχ satisfy allthe properties proved for left and right quantile functions of a distributionfunction.Definition We generalize the degree of separation (probability loss func-tion) δχ on the set of weighted vectors as follows:δχ : R×R → R+ ∪{0},δχ(z′,z) = δχ(z,z′) =summationdisplayz<xj<z′wχj , z < z′,and δχ(z,z) = 0.Lemma 8.2.2 (Properties of the probability loss function for weighted vec-tors)a) δχ = δFχ.b) δχ only depends on comp(χ).c) δχ satisfies the pseudo–triangle property.Proof a) and b) are trivial and c) follows from a) and pseudo–triangleproperty for the probability loss functions for distributions.8.2.1 Partition operatorThis section introduces the partition operator to partition data into arbitrar-ily sized partitions. This allows us to address the two remaining propertiesfor quantiles we pointed out in the introduction (in Lemma 8.2.5). The ideabehind the definition of the partition operator can be explained as follows.Suppose a weighted vector χ = (x,wχ,nχ) is given and we want to partitionit to smaller vectors with weights (p1,··· ,pm),summationtextmi=1 pi = 1. Consider a barof length 1 and then color it from left to right using colors corresponding tothe xi with length wχi . Then cut the bar from left to right using the givenweights (p1,··· ,pm). Now each one of the small bars is the partitions weneeded. More formally, we have the following definition:Definition Suppose P = (p1,p2,··· ,pm) is given, such that summationtextmi=1 pi = 1.Then a P-partition of a weighted data vector χ = (x,wχ,nχ) is denoted2198.2. Generalization to weighted vectorsby part(P,χ) = (χ1,··· ,χm) and is a collection of m weighted vectorsχ1 = (x1,wχ1,nχ1 = nχ.p1),··· ,χm = (xm,wχm,nχm = nχ.pm) defined asfollows:1. x1 = (xs1,··· ,xt1), s1 = 1,v1 =summationtext1≤j≤t1 wj ≥ p1, summationtext1≤j<t1 wj < p12. x2 = (xs2,··· ,xt2), v2 = summationtext1≤j≤t2 wj − p1 ≥ p2, summationtext1≤j<t2 wj − p1 <p2, s2 =braceleftBiggt1 + 1 v1 = p1t1 v1 > p1...k. xk = (xsk,··· ,xtk), vk = summationtext1≤j≤tk wj −summationtextk−1j=1 pj ≥ pk, summationtext1≤j<t2 wj −summationtextk−1j=1 pj < pk, sk =braceleftBiggtk−1 + 1 vk−1 = pk−1tk−1 vk−1 > pk−1 ....The corresponding weight vectors and data amounts are defined as:1. wχ1 = 1p1(wχs1,wχs2,··· ,wχt1 −(v1 −p1)),...k. 
wχk =braceleftBigg 1pk(wχsk,wχsk+1,··· ,wχtk −(vk −pk)) vk−1 = pk−11pk(vk−1 −pk−1,wχsk+1,··· ,wχtk −(vk −pk)) vk−1 > pk−1....Lemma 8.2.3 If χ = (x,wχ,nχ) is an ordinary vector and lx = nχ =n1 + ··· + nm. Let P = (n1nχ,··· , nmnχ ) then the P-partition of χ is simplyobtained by starting from the left and partitioning x to vectors of lengthn1,n2,··· ,nm.Proof This is a straightforward conclusion of the definition.Lemma 8.2.4 Suppose χ = (x,wχ,nχ) is partitioned by some P = (p1,··· ,pm)to χ1,··· ,χm thenstack(χ1,··· ,χm) ∼ χ.Proof Let χ′ = stack(χ1,··· ,χm) and suppose χ′ = (x′,wχ′,nχ′). Thenclearly x′ and x have the same distinct elements. (Although it might be the2208.2. Generalization to weighted vectorscase that x′ negationslash= x since some elements of x are repeated more than once inx.) Alsonχ′ =msummationdisplayi=1pinχ = nχ.In order to show that for z an element of the vector x, its correspondingweight is equal in χ and χ′, suppose z is equal to xi1,··· ,xir in x withcorresponding weights wχi1,··· ,wχir. Then the weight corresponding to z in χis equal tosummationtextrk=1 wχik. Now note that any of xik,k = 1,··· ,r, corresponds toone or two elements in stack(χ1,··· ,χm) by the definition of the partitionsoperator. It can be the case that xik only appears in χs or in χs,χs+1 if xikis at the end of the partition χs and at the beginning of the next. In thefirst case when xik only appears in χs, its weight in χs will be 1pswχik andhence its weight contribution in stack(χ1,··· ,χm) will be nχ.psnχ 1pswχik = wχik.In the second case its weight in χs will be 1ps(wχik −(vs −ps)) and in χs+1will be 1ps+1(vs − ps). Hence its weight contribution in stack(χ1,··· ,χm)coming from χs,χs+1 is nχpsnχ 1ps(wχik −(vs−ps))+ nχps+1nχ 1ps+1(vs−ps) = wχik.Summing up all the weights in stack(χ1,··· ,χm), we get the same value ofsummationtextrk=1 wχik.Using the partition operator, we can easily define the cut operator asfollows.Definition Let D = {(a,b)| a,b ∈ (0,1), a < b}. Then cut : Υ×D → Υ isdefined to becut(χ,p1,p2) = χ2,where χ2 is the second component of part(P,comp(χ)) = (χ1,χ2,χ3), theresult of applying a partition operator with weights P = (p1,p2 −p1,1−p2)to comp(χ). We also define left cut and right cuts,lcut,rcut : (0,1) → R,lcut(χ,p) = χ1,rcut(χ,1−p) = χ2,where χ1 and χ2 are the first and second component of the partition of χby P = (p,1−p).Lemma 8.2.5 Suppose χ = (x,wχ,nχ) is a weighted vector and (p1,p2) inD. Then2218.2. Generalization to weighted vectorsa) The amount of data in cut(χ,p1,p2) is nχ(p2 −p1).b) cut(χ,p1,p2) starts with rqχ(p1) and ends with lqχ(p2).c) The vector of lcut(χ,p) ends with lqχ(p).d) The vector of rcut(χ,p) starts with rqχ(1−p).e) Suppose p1,p2 ∈ (0,1) then lcut(lcut(χ,p1),p2) = lcut(χ,p1p2).f) Suppose p1,p2 ∈ (0,1) then rcut(rcut(χ,p1),p2) = rcut(χ,p1p2).Proof a) is trivial. To prove b), consider the definition of the partitionoperator as given in Definition 8.2.1 for arbitrary P = (p′1,··· ,p′m). For thefirst partition, xs1 = x1 = lqχ(p′1) and for xt1, we havesummationdisplay1≤j≤t1wj ≥ p′1, andsummationdisplay1≤j<t1wj < p′1,which concludes lqχ(p′1) = xt1. For the k-th partition,sk =braceleftBiggtk−1 +1 vk−1 = p′k−1tk−1 vk−1 > p′k−1 .If vk−1 = p′k−1, then summationtext1≤j≤tk−1 wj = summationtextk−1i=1 p′i. Hence rqχ(summationtextk−1i=1 p′i) =xtk−1+1 = xsk. For tk, we have summationtext1≤j<tk wj < summationtextki=1 p′i and summationtext1≤j≤tk wj ≤summationtextki=1 p′k. Hence lqχ(summationtextki=1 p′k) = xtk. 
To finish the proof, let m = 3 andp′1 = p1,p′2 = p2−p1,p′3 = 1−p2 and note that cut(χ,p1,p2) corresponds tothe second component of the partition operator of P = (p′1,p′2,p′3) on χ.The proof of c) is similar to b). d) can be either done by a similar directproof or by using the Quantile Symmetry Theorem.To prove e) letχ1 = lcut(χ,p1) = ((x1,··· ,xt1),(w11,··· ,w1t1),nχ.p1)χ1,2 = lcut(lcut(χ,p1),p2) = ((x1,··· ,xt1,2),(w1,21 ,··· ,w1,2t1 ),nχ.p1.p2)χ12 = lcut(χ,p1p2) = ((x1,··· ,xt12),(w121 ,··· ,w12t1 ),nχ.p1.p2)We want to show χ1,2 = χ12. It is clear that their data amount is equal. Byapplying the definition of lcut to the above three equations, we conclude thefollowing: summationdisplay1≤j<twj < p1,summationdisplay1≤j≤twj ≥ p1, (8.1)summationdisplay1≤j<t1,2w1j < p2,summationdisplay1≤j≤t1,2w1j ≥ p2, (8.2)2228.2. Generalization to weighted vectorssummationdisplay1≤j<t12wj < p1p2,summationdisplay1≤j≤t12wj ≥ p1p2. (8.3)If j < t1 then w1j = 1p1wj. Hence from the first equation in 8.2, we conclude1p1summationdisplay1≤j<t1,2wj < p2 ⇒summationdisplay1≤j<t1,2wj < p1p2.Now consider two cases:Case I: t1,2 < t1. In this case, similarly, from the second equation in 8.2, weconclude 1p1summationdisplay1≤j≤t1,2wj ≥ p2 ⇒summationdisplay1≤j≤t1,2wj ≥ p1p2.Case II: t1,2 = t1. In this case note that for j < t1,2 = t1, we still havew1j = 1p1wj and for j = t1,2 = t, we have w1j ≤ 1p1wj. Butsummationdisplay1≤j≤t1,2=tw1j = 1 ⇒summationdisplay1≤j≤t1,2=tw1j ≥ p1 ≥ p1p2.In both cases, we showed that summationtext1≤j≤t1,2 wj ≥ p1p2 and summationtext1≤j≤t1,2=t1 w1j ≥p1 ≥ p1p2. We conclude that t1,2 = t12. In order to show that the weightvectors of χ1,2 and χ12 are the same, note that they have the same length.We only need to show that they match on all the components except for thelast one because the equality of the last one will follow. But if j < t1,2 = t12then w1,2j = 1p2( 1p1wj) and w12j = 1p1p2wj.f) can be done either by a similar argument as e) or using the QuantileSymmetry Theorem.Remark. Part a) and e) address the two remaining properties we wereseeking in the introduction.8.2.2 Quantile data summariesHere, we formally define quantile data summaries. They arise when a largedata vector is summarized by a smaller vector and possibly some otherinformation about the original vector and how the summary is been created.A large vector might have been partitioned into smaller vectors and thesmaller vectors might have been summarized. First we define a probabilityindex vector which is needed to define quantile data summaries.2238.2. Generalization to weighted vectorsDefinition A vector P = (p1,··· ,pk) is called a probability index vector if0 ≤ p1 < ··· < pk ≤ 1.Definition Suppose χ = (x,wχ,nχ), a weighted vector and a probabilityindex vector P = (p1,··· ,pm) is given such that 0 ≤ p1 < p2 < ··· < pm ≤1. Then a P-quantile summary of χ is defined to beqs(P,χ) = (lqχ(p1),··· ,lqχ(pm)).Definition A summary triple is defined to be a triple (qs(P,χ),P,nχ),where qs is the summarized vector as defined above, P is the summaryprobability index vector and nχ is the data amount of the original vector.We also define an ǫ-summary for ǫ < 1/2.Definition Let h = [1/ǫ]. Then the ǫ-summary for χ is defined to be thetriple (qs(ǫ,χ),ǫ,nχ):qs(ǫ,χ) = (lqχ(ǫ),lqχ(2ǫ),··· ,lqχ((h−1)ǫ)).Note that [0,ǫ),[ǫ,2ǫ),··· ,[(h −1)ǫ,1] is a partition of [0,1] to intervals ofthe same length ǫ other than the last one, which can be greater than ǫ.However it is less than 2ǫ. 
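Restricting to ordinary data vectors for brevity, the P-quantile summary and the ǫ-summary can be sketched in R as follows; the helper lq implements the left quantile of an ordinary vector of length n as the ⌈np⌉-th order statistic, and the function names are ours.

    # Left quantile of an ordinary data vector: lq_x(p) is the ceiling(n p)-th
    # order statistic, for p in (0, 1].
    lq <- function(x, p) sort(x)[ceiling(length(x) * p)]

    # P-quantile summary: the left quantiles at the probability indices in P.
    qs <- function(P, x) vapply(P, function(p) lq(x, p), numeric(1))

    # epsilon-summary: left quantiles at epsilon, 2*epsilon, ..., (h - 1)*epsilon,
    # with h = floor(1 / epsilon).
    eps_summary <- function(eps, x) {
      h <- floor(1 / eps)
      qs(eps * seq_len(h - 1), x)
    }

    x <- rnorm(1e5)
    s <- eps_summary(0.03, x)   # 32 left quantiles summarizing 10^5 points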
If ǫ = 1/s for a natural number s, then the 1/ssummary is going to beqs(1/s,χ) = (lqχ(1/s),lqχ(2/s),··· ,lqχ((s−1)/s)).Remark. For an ordinary vector x = (x1,··· ,xn), suppose n = n1n2.Then we defined the n2–coarsening operator to beCn2(x) = (lqx(p1),··· ,lqx(pn1−1),where pi = i/n, i = 1,··· ,n1 −1. This is the same asqs(ǫ,x),for ǫ = 1/n1. Hence the coarsening operator is a special case of creating anǫ-summary.We also define summary lists.Definition Suppose χ = stack(χ1,··· ,χm) and m probability index vec-tors P1,··· ,Pm are given. Then let ξi = qs(χi,Pi). Then the list2248.3. Optimal probability indices for vector data summariesξ =ξ1 P1 nχ1... ... ...ξm Pm nχmis called a quantile summary list of χ. Note that ξ is not a matrix ingeneral since the length of the summary indices might differ.Quantile summary vectors or quantile summary lists are to be used toinfer the original vector χ. They can be used as “inputs” to procedures forapproximating lqχ. The formal definition of a data summary procedure isdefined below.Definition Suppose χ is a weighted vector and input is a quantile summarylist. Then a quantile summary procedure is defined to be a left quantilefunction:proc(input,χ) : [0,1] → R.“proc” tries to approximate the quantiles of the original vector χ using theinput. It is desirable to find procedures that have good accuracy.Example The d-coarsening algorithm can be viewed as an example of theabove framework. There the vector χ is simply an ordinary vector of lengthn which is a concatenation of x1,··· ,xl. The summary list consists of d-coarsening of partitions x1,··· ,xl. In other words x1,··· ,xm which are oflength li = cid are summarized by Pi = (1/ci,··· ,(ci−1)/ci, i = 1,··· ,m)to w1,··· ,wm. Finally the “proc” is simply the left quantile function of theconcatenation of w1,··· ,wm. The accuracy in terms of the probability losswas bounded by ǫ = m+1C−m, C =summationtextmi=1 ci. In other wordssupp∈(0,1)δx(proc(input,x)(p),lqx(p)) ≤ ǫ.8.3 Optimal probability indices for vector datasummariesSuppose a data vector x or a distribution X is given. The data vector xmight be too long to carry around or save in the memory. Similarly thedistribution X might be too complicated or unknown. To make inferencesabout a data vector x or the distribution of X, we might use a summary or2258.3. Optimal probability indices for vector data summariessome other procedure. For example, we might save a vector data summaryinstead of the vector x of length n, where n is very large:qs(P,x) = (lqx(p1),··· ,lqx(pm)), p1 < ··· < pm.The following question motivates our ensuing development:Question: How should P = {p1,··· ,pm} be chosen to provide good approx-imation/prediction to x (or X)?A natural way to approximate x or X is to estimate all the quantiles.(This is equivalent to approximating or estimating the whole data vectorx or the distribution function of X.) We are given an input. In the caseof a data vector it is usually a quantile data summary and in the case ofthe random variable X it might be a random sample. Then a “procedure”can be employed to approximate/estimate the quantiles of x or X. For anygiven p the left quantile lqx(p) or lqX(p) is approximated/estimated by theprocedure using the input. We denote this value by proc(input,x)(p) orproc(input,X)(p). 
Then a loss L can be used to assess the goodness of sucha procedure:L(proc(input,x)(p),lqx(p)).To assess the overall goodness of such a procedure, we can use either thesup loss or the integral loss:supp∈[0,1]L(proc(input,x)(p),lqx(p)),or integraldisplayp∈[0,1]L(proc(input,x)(p),lqx(p))dp.For simplicity, we restrict to data vectors from here. We use the prob-ability loss δx as the most natural choice. We want to minimize this lossn order to find optimal ways to summarize data (create input) and findoptimal procedures.Definition We define the crudity of the procedure proc at p given the inputto becrud(proc(input,x)(p)) = δx(proc(input,x)(p),lqx(p)).Also the “sup crudity” and “integral crudity” are respectively given by2268.3. Optimal probability indices for vector data summariesSC(proc(input,x)) = supp∈[0,1]δx(proc(input,x)(p),lqx(p)),andIC(proc(input,x)) =integraldisplayp∈[0,1]δx(proc(input,x)(p),lqx(p))dp.Using the above framework, we look for good procedures to summarizedata vectors and later distribution functions.A quantile data summary was defined to beqs(P,x) = (lqx(p1),··· ,lqx(pm)), p1 < ··· < pm,for aprobability index vectorP = (p1,··· ,pm). Thereis a naturalprocedureassociated with this input that is a quantile data summary, which we definebelow.Definition Suppose x is a data vector which has been summarized by P =(p1,··· ,pm). Then we define the shortest distance quantile procedure of xassociated with P to beproc(input,x)(p) = lqx(pi), i = argminj{|p−pj|,j = 1,··· ,m}.If there were more than one minimum above, take the smaller value. Wedenote this procedure by shproc(x,P).The shortest distance procedure be specified by the notation “mapsto→” asshown below:1. 0 ≤ p ≤ p1 + p2−p12 mapsto→ lqx(p1).2. p1 + p2−p12 < p ≤ p2 + p3−p22 mapsto→ lqx(p2)....m. pm−1 + pm−pm−12 < p ≤ 1 mapsto→ lqx(pm).The largest loss in the first part of the procedure is the maximum of the twovalues,δx(lqx(0),lqx(p1)),δx(lqx(p1),lqx(p1 + p2 −p12 )). (8.4)For the second part, it is the maximum ofδx(lqx(p1 + p2 −p12 ),lqx(p2)),δx(lqx(p2),lqx(p2 + p3 −p22 )). (8.5)2278.3. Optimal probability indices for vector data summariesFor the m-th part it is the maximum ofδx(lqx(pm−1 + pm −pm−12 ),lqx(pm)),δx(lqx(pm),lqx(1)). (8.6)We use quantile data summaries to save space and memory for operationson very large datasets. Hence, we have a limitation on m. The interestingquestion is what is an optimal index set P of length m to summarize datavectors? In the beginning, we usually do not have any information aboutx so the P should be chosen in way that works well for all possible datavectors. Hence, we settle for eitherargminPsupxSC(shproc(input,x)(p),lqx(p)) =argminPsupxsupp∈[0,1]δx(shproc(input,x)(p),lqx(p)),orargminPsupxIC(shproc(input,x)(p),lqx(p)) =argminPsupxintegraldisplay 10δx(shproc(input,x)(p),lqx(p))dp.We sort out the sup crudity case first. By Lemma 6.6.3, taking the supof the max over all x in Equations 8.4, 8.5 and 8.6, we get the maximum ofthe following quantities:1. p1, p2−p12 .2. p2−p12 , p3−p22 .3. p3−p22 , p4−p32 ....m. pm−pm−12 , 1−pm.Hence,supxsupp∈[0,1]δx(shproc(input,x)(p),lqx(p)) =maxp∈[0,1]{p1, p2 −p12 , p2 −p12 , p3 −p22 , p3 −p22 , p4 −p32 ,··· , pm −pm−12 ,1−pm}.After omitting the repetitions, we need to minimize:max{p1, p2 −p12 , p3 −p22 , p4 −p32 ,··· , pm −pm−12 ,1−pm},2288.3. Optimal probability indices for vector data summariesover all p1 < p2 < ··· < pm ∈ [0,1]. We claim thatp1 = 12m,p2 −p1 = 1/m,p3 −p2 = 1/m,··· ,pm−1 = 1/m,pm = 1− 12m,is the solution. 
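Before giving the argument, the claim can be checked numerically: the quantity being minimized is a function of the gaps between successive indices, and a quick R sketch (the helper name is ours) compares the equally spaced choice against randomly drawn index vectors.

    # The objective: max{p_1, (p_2 - p_1)/2, ..., (p_m - p_{m-1})/2, 1 - p_m}.
    worst_gap <- function(p) max(p[1], diff(p) / 2, 1 - p[length(p)])

    m <- 5
    p_opt <- (2 * seq_len(m) - 1) / (2 * m)   # 1/(2m), 3/(2m), ..., 1 - 1/(2m)
    worst_gap(p_opt)                          # equals 1/(2m) = 0.1

    set.seed(3)
    tries <- replicate(1e4, worst_gap(sort(runif(m))))
    min(tries)                                # never falls below 1/(2m)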
Note that in this case the max is equal to 1/2m. We showthat we cannot do better. Letα1 = p1,α2 = p2 −p12 ,α3 = p3 −p22 ,...αm = pm −pm−12 ,αm+1 = pm.We have α1+2α2+···+2αm+αm+1 = 1. The αi are non-negative, there are1+2(m−2)+1 of them (counting the ones with multiple 2 two times) andthey sum up to 1. If all of them are less than 12m the sum will be less than1. Hence we conclude the maximum is obtained when they are all equal to1/2m.Now let us do the integral crudity case. We claim the solution is thesame. We compute the integral in the following, using 6.6.4 in the secondequality:supxintegraldisplay 10δx(lqx(p),shproc(input,x)(p))dp =supx[integraldisplay p1+p2−p120δx(lqx(p1),lqx(p))dp +integraldisplay p2+p3−p22p1+p2−p12δx(lqx(p2),lqx(p))dp+··· +integraldisplay 1pm−1+pm−pm−12δx(lqx(pm),lqx(p))dp]2298.3. Optimal probability indices for vector data summaries=integraldisplay p1+p2−p120|p−p1|dp+integraldisplay p2+p3−p22p1+p2−p12|p−p2|dp+···+integraldisplay 1pm−1+pm−pm−12|p−pm|dp=integraldisplay p10(p1 −p)dp+integraldisplay p1+p−p12p1(p−p1)dp +integraldisplay p2p1+p2−p12(p2 −p)dp+integraldisplay p2+p3−p22p2(p−p2)dp+···+integraldisplay pmpm−1+pm−pm−12(pm −p)dp+integraldisplay 1pm(p−pm)dp=integraldisplay p10pdp+integraldisplay p2−p120pdp+integraldisplay p2−p120pdp+integraldisplay p3−p220pdp+···+integraldisplay pm−pm−120pdp+integraldisplay 1−pm0pdp= (1/2)α21 +α2 +···+α2m + (1/2)α2m+1 = (1/2)(α21 + 2α2 +···+ 2α2m +α2m+1),where α1 = p1,α2 = p2−p12 ,··· ,αm = pm−pm−12 ,αm+1 = 1−pm.We have the restriction α1 + 2α2 +··· + 2αm + αm+1 −1 = 0 and αi ≥ 0.In order to minimizeα21 + 2α22 +···+ 2α2m +α2m+1,we use Lagrange Multiplier’s Method. Letf(x1,··· ,xm+1) = x21+2x22 ···+2x2m+x2m+1−λ(x1+2x2+···+2xm+xm+1−1).Taking the partial derivatives and putting them equal to zero, we get:∂g∂x1 = 2x1 −λ = 0,∂g∂x2 = 4x2 −2λ = 0,...∂g∂xm = 4xm −2λ = 0,∂g∂xm+1 = 2xm+1 −λ = 0.By summing up the equations we get:2(x1 +2x2 +···+2xm +xm+1)−2λ(m−1)−2λ = 2−2λ(m−1)−2λ = 0.Hence λ = 1m. This gives xi = 12m. Hence p1 = pm = 12m and p2 − p1 =··· = pm −pm−1 = 1m.2308.4. Other loss functions8.4 Other loss functionsIt is well-known thatargminaEX(X −a)2,is the mean, when it exists. This fact is used in classical statistics forestimation of parameters and regression. It is also widely claimed thatargminaEX|X −a|,is “the median”. In particular for data vectors x = (x1,··· ,xn), this willtake the formargmina1nnsummationdisplayi=1|xi −a|.It is not clear what is meant by “the median”? For data vectors does itmean that the classic median (the middle value when there is odd numberof elements and the average of the two middle values otherwise) is the uniquesolution? In general, is the answer unique? What is the connection of thesolution to the left and right quantiles? We provide answers to some of thesequestions in the following theorem.Theorem 8.4.1 Suppose X is a random variable and E|X−a| is finite forsome a ∈ R thenargminaE|X −a| = [lqX(12),rqX(12)].ProofE|X −a| =integraldisplayR|X −a|dP =integraldisplayX>a(X −a)dP +integraldisplayX<a(a−X)dP.We prove the theorem in three steps:1. If a < lqX(1/2) then E|X −a| > E|X −lqX(1/2)|.2. If a > rqX(1/2) then E|X −a| > E|X −rqX(1/2)|.3. If lqX(1/2) ≤ a,b ≤ rqX(1/2) then E|X −a| = E|X −b|.Step 1. Let b = lqX(1/2) and ǫ = b−a > 0. Then2318.4. 
Other loss functionsE|X −b| =integraldisplayX≥b(X −b)dP +integraldisplayX<b(b−X)dP=integraldisplayX≥b(X −a−ǫ)dP +integraldisplayX<b(a +ǫ−X)dP≤integraldisplayX≥b|X −a|−ǫdP +integraldisplayX<b(|X −a|+ǫ)dP= E|X −a|−ǫ(P(X ≥ b)−P(X < b)).But P(X ≥ b) −P(X < b) is non-negative since P(X < lqX(1/2)) ≤ 1/2.Hence E|X −b|≤ E|X −a|. To show that the equality cannot happen takea < a′ < b and let ǫ′ = a′ −a thenE|X −a′| =integraldisplayX≥a′(X −a′)dP +integraldisplayX<a′(a′ −X)dP=integraldisplayX≥a′(X −a−ǫ′)dP +integraldisplayX<a′(a +ǫ′ −X)dP≤integraldisplayX≥a′|X −a|−ǫ′dP +integraldisplayX<a(|X −a|+ǫ′)dP= E|X −a|−ǫ′(P(X ≥ a′)−P(X < a′)).But P(X ≥ a′) − P(X < a′) is positive since P(X < a′) < 1/2 and a′ <lqX(p) ⇒ P(X < a′) < 1/2. Hence E|X − a′| < E|X − a|. But also sincea′ < b, we haveE|X −b| ≤ E|X −a′| < E|X −a|.Step 2. For a > rqX(1/2) = c one can either repeat a similar argumentto that in Step 1 or use the Quantile Symmetry Theorem as we do here.Consider the random variable −X. Thena > rqX(1/2) ⇒ −a < −rqX(1/2) = lq−X(1/2)Now since −a < −c = lq−X(1/2) by applying Step 1 to −X, we getE|−X −(−c)| < E|−X −(−a)| ⇒ E|X −a| < E|X −c|.Step 3. If lqX(1/2) = rqX(1/2) the result is trivial. Otherwise let b =lqX(1/2) < rqX(1/2) = c and a < a′ ∈ [b,c]. By Lemma 5.3.1 if lqX(p) <rqX(p). So P(X ≤ lqX(p)) = p and P(X ≥ rqX(p)) = 1 − p. HenceP(X ≤ b) = P(X ≥ c) = 1/2. Let ǫ = a′ −a. Then2328.4. Other loss functionsE|X −a| =integraldisplayb<X<c|X −a|dP +integraldisplayX≥c(X −a)dP +integraldisplayX≤b+ǫ/2−ǫ/2=integraldisplayX≥c(X −a−ǫ)dP +integraldisplayX≤b(a−X +ǫ)dP=integraldisplayX≥c(X −a′)dP +integraldisplayX≤b(a′ −X)dP= E|X −a′|.Corollary 8.4.2 Suppose FX is continuous and ∃a ∈ R, E|X −a| < ∞.ThenargminaE|X −a| = {a|F(a) = 1/2}.Proof Note that if F is continuous F(a) = p ⇔ a ∈ [lqX(p),rqX(p)], byLemma 5.5.2.Now let us findargminaE(δF(X,a)).We solve the problemfor continuous variables only here and leave the generalcase as an interesting open problem. Our conjecture is that the same resultholds in general.Lemma 8.4.3 Suppose X be a random variable with continuous distribu-tion function F. ThenargminaE(δF(X,a)) = [lqX(1/2),rqX(1/2)].Proof If F is continuous then F(X) ∼ U(0,1). Also δF(X,a) = |F(X) −F(a)|.argminaE(δF(X,a)) = argminaintegraldisplayΩ|F(X)−F(a)|dP.The last expression is minimized if F(a) equals the median of the uniform.We conclude F(a) = 1/2 and the proof is complete.2338.4. Other loss functions8.4.1 Optimal index vectors for assigning quantiles to arandom sampleGiven a sample X1,··· ,Xn, i.i.d ∼ X, we can find the sample order statis-tics X(i),i = 1,··· ,n. Suppose we want to assign these order statisticsto quantiles, lqX(pi),i = 1,··· ,n, of the true distribution of X. In otherwords, what is the optimal index vector P = (p1,··· ,pn) to assign lqX(pi)to X(i). This can be used to make a qq–plot. We define the optimal vectorto be the index vector that minimizes the expected probability lossE[1nnsummationdisplayi=1δX(X(i),lqX(pi))].We only solve the problem for continuous variables and leave the generalcase as an open problem. Under the continuity assumption, we haveE[1nnsummationdisplayi=1δX(X(i),lqX(pi))] = 1nnsummationdisplayi=1E(|FX(X(i))−pi|),which is minimized if and only if the individual terms E(|FX(X(i)) − pi|)are minimized. Since FX is a continuous random variable, FX(X(i)) is alsocontinuous. Hence the minimum is obtained by solving P(FX(X(i)) ≤ x) =1/2 by the corollary of Theorem 8.4.1. By Lemma 5.5.1, this is equivalentto P(X(i) ≤ rqF(x)) = 1/2. 
The distribution of the order statistics, X(i) isgiven byP(X(i) ≤ y) =nsummationdisplayj=iparenleftbiggnjparenrightbiggF(y)j(1−F(y))n−j,as discussed by Casella and Berger in [11]. Hence, the minimum is obtainedby solvingnsummationdisplayj=iparenleftbiggnjparenrightbiggF(rqX(x))j(1−F(rqX(x)))n−j =nsummationdisplayj=iparenleftbiggnjparenrightbiggxj(1−x)n−j = 1/2,which does not have a closed form solution in general. Also note that thesolution does not depend on F. However, the solution always exists andis unique since summationtextnj=iparenleftbignjparenrightbigxj(1−x)n−j is increasing, continuous on (0,1) andranges between 0 and 1. We also prove that the resulting index vectoris symmetric in the sense that pn−i+1 = 1 − pi, i = 1,2,··· ,n. For theproof, consider the random sample (Y1,··· ,Yn) = (−X1,··· ,−Xn). Thenthe sorted vector is (Y(1),··· ,Y(n)) = (−X(n),··· ,−X(1)). Hence Y(i) =2348.4. Other loss functions−X(n−i+1). Suppose p1,··· ,pn is an optimal summary index vector. Thenpi is the solution of the first equation belowargminaE|FY (Y(i))−a| = argminaE|1−FX(Y(i))−a| =argminaE|FX(X(n−i+1))−(1−a)|.But if we let b = 1−a the solution to the last equation is b = 1−a = pn−i+1.We conclude that pi = 1−pn−i+1.As examples, we solve the equation for n = 1,2, where closed formsolutions exist.n = 1. Then X(1) = X1. It is easy to see that the solution is p = 1/2.n = 2. Then we want to solve two equations2summationdisplayj=1parenleftbigg2jparenrightbiggxj(1−x)2−j = 1/2,and2summationdisplayj=2parenleftbigg2jparenrightbiggxj(1−x)2−j = 1/2,which are equivalent to2x(1−x) +x2 = 1/2,andx2 = 1/2,We get p1 = 1√2 and p2 = 1− 1√2.Note that in general for n, the last equation is xn = 1/2. Hence pn =1/ n√2 and p1 = 1−1/ n√2.235Chapter 9Quantile distributiondistance and estimation9.1 IntroductionThis chapter uses the probability loss function as a basis for estimatingunknown parameters of a distribution and defining a distance among distri-bution functions. The “probability loss” and “c-probability loss” functionswere introduced to measure the distance between quantiles. This is not thesame as any other specific loss functions that have been proposed in sta-tistical decision theory [30], where the loss function, L, is the loss of thestatistician in estimating the true parameter vector θ = (θ1,··· ,θk) ∈ Θ,by an estimator ˆθ(X1,··· ,Xn) which is a function of the data (a randomsample X1,··· ,Xn drawn from the distribution parameterized by θ). Theestimator is then chosen in such a way that L(ˆθ(X1,··· ,Xn),θ) becomessmall in some sense. However, it is not possible to use the probability lossfunction in the same manner for parameter estimation. We definedδX(z′,z) = δX(z,z′) = P(z′ < X < z), z′ ≤ z, z,z′ ∈ R.Now it is clear that δX(θ,a) cannot even be evaluated since θ is a k-dimensional vector and k is possibly greater than 1. This chapter presentstwo methods to estimate the parameters of distributions. More theoreticaland applied development is necessary to justify such estimation procedureswhich we leave for future research. The first method derives from consid-ering families of distributions that are identified by their values on certainquantiles and the second method from defining a distance among distribu-tions and then trying to minimize that distance.These methods are designed to give estimates that are equivariant un-der continuous strictly monotonic transformations. 
The distances associatedwith probability measures in this section are based on the distances betweenthe quantiles using the probability loss function and they are invariant undermonotonic transformations. This property does not hold in classical meth-ods. For example the sample mean ¯x, an estimator of the location parameter2369.2. Quantile–specified parameter familiesfor normal distribution is equivariant under linear transformations but notall continuous strictly monotonic transformations.Quantile distance allows us to measure closeness of distributions to eachother. We also definea quantile distance for the tails of the distributions. Weshow that even though two distributions are very close in terms of “overallquantile distance”, they might not be very close in terms of “tail quantiledistance”. This shows that to study extremes (for example extremely hottemperature) if we use a good overall fit, our results might not be reliable.We use this observation in the next chapter in choosing our method ofstudying extreme temperature events.9.2 Quantile–specified parameter familiesThis section considers families of distributions that are identified by theirvalues on certain quantiles. In this case the parameters in the vector θ =(θ1,··· ,θk) are certain quantiles. Then we use the “probability loss” or the“c-probability loss” to characterize the loss and thus yield optimal parameterestimators.Definition A family of random variables {Xθ}θ=(θ1,···,θk)∈Θ, and a proba-bility index vector P = (p1,··· ,pk),0 ≤ p1 < p2 < ··· < pk ≤ 1 are called aleft–quantile–specified family if(θ1,··· ,θk) = (lqXθ(p1),··· ,lqXθ(pk)),and the distribution of Xθ is know given θ. Note that this implies that θ ∈ Θthen θ1 ≤ θ2 ≤ ···θk.We can similarly define:Definition A family of random variables {Xθ}θ=(θ1,···,θk)∈Θ, and a proba-bility index vector P = (p1,··· ,pk),0 ≤ p1 < p2 < ··· < pk ≤ 1 are called aright–quantile–specified family(θ1,··· ,θk) = (rqXθ(p1),··· ,rqXθ(pk)),and the distribution of Xθ is know given θ. Note that this implies that θ ∈ Θthen θ1 ≤ θ2 ≤ ··· ≤ θk.Example Considerthe family{U(0,2a)}a∈R+, of uniformlydistributedran-dom variables on (0,2a),a > 0. Then, we can express this family asthe quantile–specified family {Xθ}θ∈R+ with P = (1/2). The reason is ifXθ ∼ U(0,2a) then θ = lqXθ(1/2) = a.2379.2. Quantile–specified parameter familiesExample Consider the family N = {N(µ,σ2)| − ∞ < µ < +∞,σ2 >0}. Then we claim this is a quantile–specified family. To verify that claimlet P = (1/2,p2) where p2 = P(Z ≤ 1) and Z has the standard normaldistribution. Letµ = lqX(1/2) = θ1,andµ+σ2 = lqX(p2) = θ2.Then we can equivalently represent N by {Xθ}θ=(θ1,θ2)∈Θ, whereΘ = {(θ1,θ2)|θ1 < θ2}.Because (µ,σ2) is in 1:1 correspondence with θ = (θ1,θ2) as defined above,whereP(X ≤ µ+σ2) = P(Z ≤ 1) = p2.Note that this representation is not unique. For example, we can take P =(1/2,p2) with p2 = P(Z ≤ 2). Then the alternate re-parametrization interms of variables isµ = lqX(1/2) = θ1,andµ+ 2σ2 = lqX(p2) = θ2.It should be clear that if the goal is to infer the parameters of the originalfamily, i.e. a in U(0,2a) and (µ,σ2) then it is desirable that the θi are simplefunctions of the original parameters and the original parameters be easilyobtainable fromthe θi. 
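As a small illustration (the helper names are ours), the normal family above can be moved back and forth between (µ, σ) and the quantile parametrization θ = (lqX(1/2), lqX(P(Z ≤ 1))); for X ~ N(µ, σ²) these two quantiles are µ and µ + σ, so the map is linear and easily inverted.

    # Quantile-specified parametrization of the normal family with
    # P = (1/2, pnorm(1)): theta1 is the median, theta2 the pnorm(1)-quantile.
    p <- c(0.5, pnorm(1))

    to_theta   <- function(mu, sigma) qnorm(p, mean = mu, sd = sigma)
    from_theta <- function(theta) c(mu = theta[1], sigma = theta[2] - theta[1])

    theta <- to_theta(10, 2)   # (10, 12)
    from_theta(theta)          # recovers mu = 10, sigma = 2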
Linear combinations seem to be the easiest to handle.We suggest the following framework to estimate the parameters:• Express the original parameterized family Xβ as a quantile specifiedfamily Xθ with P = (p1,··· ,pk).• UseargminDi∈FE[L(θi,Di(input)], i = 1,··· ,kwhere input is the information available to us, usually a random sam-ple,(X1,··· ,Xn),Di is an estimator of θi = lqX(pi) (a function of the random sample),L is a loss function and F is the class of the estimators. The lossfunctions of our interest are L = δXθ and L = δcXθ,c > 0.2389.2. Quantile–specified parameter families• Using the estimated parameters solve for the original parameters, theβi.Note that δcXθ,c > 0 depends on the unknown distribution function Xθ.Many issues in the above framework need to be addressed including: theexistence and uniqueness of the argmin, properties of the estimators andso on which we leave for future research. In next subsections we show theEquivariance property of the method and apply it to a particular class ofestimators using simulations.9.2.1 Equivariance of quantile–specified families estimationHere, we show the equivariance propertyofestimation usingquantile–specifiedfamilies in the following lemmas.Lemma 9.2.1 Suppose {Xθ}θ∈Θ is left–quantile–specified withP = (p1,··· ,pk),and φ is a continuous strictly increasing transformation which induces amap on Rk:Φ : Rk → Rk,(θ1,··· ,θk) mapsto→ (φ(θ1),··· ,φ(θk)).Let Θ′ = Φ(Θ),θ′ = Φ(θ) for θ ∈ Θ and consider the family of distributionsYθ′ = φ(Xθ). Then {Yθ′}θ′∈Θ′ is also a left–quantile–specified family withthe same index vector P = (p1,··· ,pk).Proof Suppose the distribution of Xθ is specified by Fθ. ThenP(Yθ′ ≤ a) = P(φ(Xθ) ≤ a)= Fθ(φ−1(a)) = FΦ−1(θ′)(φ−1(a)).Hence the distribution of Yθ′ is known given θ′. It remains to show that forθ′ ∈ Θ′,(θ′1,··· ,θ′k) = (lqYθ′(p1),··· ,lqYθ′(pk)).But(lqYθ′(p1),··· ,lqYθ′(pk)) =(lqφ(Xθ)(p1),··· ,lqφ(Xθ)(pk)) =(φ(lqXθ(p1)),··· ,φ(lqXθ(pk))) =(φ(θ1),··· ,φ(θk)) =(θ′1,··· ,θ′k) .2399.2. Quantile–specified parameter familiesLemma 9.2.2 Suppose {Xθ}θ∈Θ is left–quantile–specified withP = (p1,··· ,pk),and φ is a continuous strictly decreasing transformation which induces amap on Rk:Φ : Rk → Rk,(θ1,··· ,θk) mapsto→ (φ(θk),··· ,φ(θ1)).Let Θ′ = Φ(Θ),θ′ = Φ(θ) for θ ∈ Θ and consider the family of distributionsYθ′ = φ(Xθ). Then {Yθ′}θ′∈Θ′ is a right–quantile–specified family with theindex vector P = (1−pk,··· ,1−p1).Proof Suppose the distribution of Xθ is specified by Fθ. Then since Fθ theleft closed distribution of Xθ is known, the right closed distribution of Xθ,GcX(Xθ) is also known. ThenP(Yθ′ ≤ a) = P(φ(Xθ) ≤ a) = P(Xθ ≥ φ−1(a))= Gcθ(φ−1(a)) = GcΦ−1(θ′)(φ−1(a)),where Gcθ is the right closed distribution function. Hence the distribution ofYθ′ is known given θ′. It remains to show that for θ′ ∈ Θ′,(θ′1,··· ,θ′k) = (rqYθ′(1−pk),··· ,rqYθ′(1−p1)).But(rqYθ′(1−pk),··· ,rqYθ′(1−p1)) =(rqφ(Xθ)(1−pk),··· ,rqφ(Xθ)(1−p1)) =(φ(lqXθ(pk)),··· ,φ(lqXθ(p1))) =(φ(θk),··· ,φ(θ1)) =(θ′1,··· ,θ′k).For a parameter θ, we want to findargminD∈FE(δX(lqX(p),D))2409.2. Quantile–specified parameter familieswhere F is a family of estimators for θ and D ∈ F is a functionD : Rn → R,where n is the size of the sample and D(X1,··· ,Xn) is the estimator ofθ = lqX(p).Lemma 9.2.3 Suppose a random sample X1,··· ,Xn is given, Xθ is a left–quantile–specified family with θ = lqX(p), φ a strictly monotonic continuoustransformation on R, F is a family of estimators to estimate θ and thefollowing argmin is nonemptyargminD∈FE(δX(lqX(θ),D)),and let F′ = φ(F). 
Thena) if φ is strictly increasingargminD′∈F′E(δφ(X)(lqφ(X)(p),D′)) = φ(argminD∈FE(δX(lqX(p),D)))b) if φ is strictly decreasingargminD′∈F′E(δφ(X)(lqφ(X)(p),D′)) = φ(argminD∈FE(δX(rqX(1−p),D)))Proof We only prove a) and b) is similar.minD′∈F′E(δφ(X)(lqφ(X)(p),D′))= minD∈FE(δφ(X)(φ(lqX)(p),φ(D)))= minD∈FE(δX(lqX(p),D))Note that for a general family of estimators, FargminD∈FE(δX(lqX(p),D))depends on the unknown distribution X by δX. We suggest two possibleways to get around this issue:• Restrict to a family F thatargminD∈FE(δX(lqX(p),D))does not depend on the distribution.2419.2. Quantile–specified parameter families• Use the empirical distribution to approximate the expressionE(δX(lqX(p),D)).We will not explore the second method here and leave it for future re-search. Next subsection shows an important instance of the first method.9.2.2 Continuous distributions with the order statisticsfamily of estimatorsSuppose that the desired distribution X is continuous thenE(δX(lqX(p),D)) = E|FX(lqX(p))−FX(D)| = E|p−FX(D)|.Now suppose a random sample X1,··· ,Xn is given and we want to estimatelqX(p). We restrict to an important family of estimators, order statistics:F = {X1:n,··· ,Xn:n}.Then for i = 1,··· ,n:E|p−FX(Xi:n)|,does not depend on FX. This is because the distribution of FX(Xi:n) doesnot depend on FX. It can be obtained as shown below:Gi(y) = P(FX(Xi:n) ≤ y) = P(Xi:n ≤ lqX(y)) =nsummationdisplayj=iparenleftbiggnjparenrightbiggP(X1,··· ,Xj ≤ lqX(y) and Xj+1,··· ,Xn > lqX(y)) =nsummationdisplayj=iparenleftbiggnjparenrightbiggP(X ≤ lqX(y))jP(X > lqX(y))n−j =nsummationdisplayj=iparenleftbiggnjparenrightbiggyj(1−y)n−j.By taking the derivative of the above expression we can find the densityfunction gi(p) and conclude:E|p−FX(Xi:n)| =integraldisplay 10|p−y|gi(y)dy.For a given p we want to find the i that minimize above which does noton FX. We can approach this problem theoretically to find such an i. Orwe could try to estimate these integral using numerical methods. However,here we use simulation for two examples and leave the general case for futureresearch.2429.3. Probability divergence (distance) measuresExample Consider a family of continuous variables, quantile–specified byP = (1/2,P(Z ≤ 1)) where Z is the standard normal. Suppose a ran-dom sample X1,··· ,Xn is given and we want to estimate lqX(1/2) andlqX(P(Z ≤ 1)) using the family of estimators, order statistics:F = {X1:n,··· ,Xn:n}.We estimate the parameters for n = 25 and n = 20. In order to minimize theloss we can approximate the loss by approximating the integral in Equation9.2.2 or approximatingE|p−FX(Xi:n)|,using an arbitrary continuous distribution such as standard normal to dothe simulations. For a large number M, we create M samples of length nfrom normal and for every sample we find the i that minimize the loss. Thenfor every i, we compute the mean of such losses and find out which has thesmallest mean loss. We do that for M = 1,··· ,1000. The results for n = 25are given in Figure 9.1. We see that for large M the estimator for lqX(1/2)is X13:25 and for lqX(P(Z < 1)) it is X22:25. The results for n = 20 are givenin Figure 9.2. The estimator for lqX(1/2) has changed between X10:20 andX11:21 and it is X18:20 for lqX(P(Z ≤ 1)). 
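The simulation just described can be sketched as follows (illustrative code; it relies on the fact, used above, that the law of FX(X_{i:n}) is the same for every continuous FX, so drawing from the standard normal is only a convenience, and the reported indices will vary slightly with the random seed and the number of replications).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def mean_probability_loss(p, n, M=20000):
    """Monte Carlo approximation of E|p - F(X_{i:n})| for every order statistic i = 1,...,n."""
    u = norm.cdf(np.sort(rng.standard_normal((M, n)), axis=1))  # rows: samples, columns: F(X_{i:n})
    return np.abs(p - u).mean(axis=0)

for n in (20, 25):
    for p in (0.5, norm.cdf(1.0)):
        losses = mean_probability_loss(p, n)
        print(f"n = {n}, p = {p:.4f}: loss minimised at order statistic i = {losses.argmin() + 1}")
```

A simulation of this kind also exhibits the near-tie between neighbouring order statistics reported above for n = 20.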
This shows that the argmin isnot necessarily unique.9.3 Probability divergence (distance) measuresIn probability theory, physics andstatistics several measureshave been intro-duced as the “distance” of two probability measures (or random variables).These measures have several applications, one of which is parameter estima-tion. We list some of these measures in this section. The next section thenintroduces new measures of distance among probability measures using thec-probability loss functions (c ≥ 0).• The Kullback-Leibler (KL) distance: Suppose P,Q are probabilitymeasures and P is absolutely continuous with respect to Q. Thenconsider the Radon-Nikodym derivative of P with respect to Q, dPdQ[See [9]]. Then we define:DKL(P,Q) =integraldisplayΩlog dPdQdP.If P and Q have density functions over R, p(x),q(x) then2439.3. Probability divergence (distance) measures0 200 400 600 800 1000111315P(Z<0)optimal order0 200 400 600 800 10002021222324P(Z<1)simulations numberoptimal orderFigure 9.1: The order statistics family members that estimate lqX(1/2) andlqX(P(Z ≤ 1)) for a random sample of length 25 obtained by generatingsamples of size 1 to 1000 from a standard normal distribution2449.3. Probability divergence (distance) measures0 200 400 600 800 100089101112P(Z<0)optimal order0 200 400 600 800 100015.016.017.018.0P(Z<1)simulations numberoptimal orderFigure 9.2: The order statistics family members that estimate lqX(1/2) andlqX(P(Z ≤ 1)) for a random sample of length 20 obtained by generatingsamples of size 1 to 1000 from a standard normal distribution2459.3. Probability divergence (distance) measuresintegraldisplayRp(x)log(p(x)q(x))dx.The symmetric version of this distance is called Kullback-JeffreysDKJ(P,Q) = DKL(P,Q) +DKL(Q,P).We show that the Kullback-Leibler distance is invariant under bijec-tive differentiable monotonic transformations when the density func-tions exists and are positive everywhere on the real line. Let g bea monotonic, bijective and differentiable (bijective and differentiablewill automatically imply strictly monotonic) transformation and X,Yrandom variables with density functions fX(x) and fY (x), positiveon R. Then the density functions of g(X) and g(Y ) are respectively(g−1)′(x)fX(g−1(x)) and (g−1)′(x)fY (g−1(x)). HenceDKL(φ(X),φ(Y )) =integraltext∞−∞(g−1)′fX(g−1(x))log (g−1)′fX(g−1(x))(g−1)′fY (g−1(x))dx =integraltext∞−∞(g−1)′fX(g−1(x))log (fX(g−1(x))fY (g−1(x)) dx.We use the change of variable x = g(y). Then dx = (g−1)′dy and theproof is complete. For the strictly decreasing case note that the densityfunction of g(X) and g(Y ) are respectively −(g−1)′(x)fX(g−1(x)) and−(g−1)(x)′fY (g−1(x)) and a similar argument works. We leave thegeneral case (wherethe densityfunction doesnot exist or is notpositiveover all the real line) as an open(?) problem.• Let P and Q be two probability distributions over a space Ω suchthat P is absolutely continuous with respect to Q. Then, for a convexfunction f such that f(1) = 0, the f-divergence of Q from P isIf(P,Q) =integraldisplayΩfparenleftbiggdPdQparenrightbiggdQ.Note that the same argument as the one for KL distance shows thatthis distance is invariant for monotonic differentiable bijective trans-formations when the density functions exist and are positive.2469.3. Probability divergence (distance) measures• The Kolmogorov-Smirnov distance: Suppose X,Y are random vari-ables on R with distribution functions FX and FY . 
ThenKS(X,Y ) = supx∈R|FX(x)−FY (x)|.The Gilvenko-Cantelli Theorem states that if X1,··· ,Xn is a randomsample drawn from the distribution Fθ0 and Fn, the empirical distri-bution functionlimn→∞KS(Fθ0,Fn) > ǫ = 0, a.s..Note that the KS metric is invariant under monotonic transforma-tions. Take φ to be strictly monotonic on R. Thensupx∈R|Fφ(X)(x)−Fφ(Y )(x)| =supx∈R|FX(φ−1(x))−FY (φ−1(x))| =supφ−1(x)∈R|FX(φ−1(x))−FY (φ−1(x))| =supx∈R|FX(x)−FY (x)|.Although theKS metric is invariant under strictly monotonic transfor-mations, it is not intuitively very appealing as we show in the followingexample.Example Consider X ∼ U(0,1), Y ∼ U(1/2,3/2) and let Z be dis-tributed as FZ:FZ(z) =0 z < 01/2 0 ≤ z ≤ 1/2z 1/2 < z < 11 z ≥ 1.Then we have KS(X,Y ) = KS(X,Z) = 1/2. But we observe thatFZ matches FX on (1/2,1) while FX and FY differ by 1/2 on (0,1).Another way to see the defect is the quantiles of Z and X match halfof the time but the quantiles of X and Y are off as much as one halfof a unit at all times.2479.4. Quantile distance measuresTo overcome the above problem one might (naively) suggest using anintegral versionIKS(X,Y ) =integraldisplayx∈R|FX(x)−FY (x)|dx.However, this definition is not well-defined. To see that considerFX(x) = 1 − 8/x,x > 8 and FY (x) = 1 − 9/x, x > 9. Then|FX(x) − FY (x)| = 1/x on [8,∞], which does not have finite inte-gral. It is also not invariant under strictly monotonic transformationsfor if φ is strictly monotonic and differentiable,IKS(φ(X),φ(Y )) =integraldisplayx∈R|FX(φ−1(x))−FY (φ−1(x))|dx.In the right hand side of the above equation the factor (φ−1)′, thatwould make the distance invariant under transformations, is missing.• L´evy distance: Suppose (Ω,Σ,Pθ)θ∈Θ be a statistical space, where thePθ are probability measures on Ω with σ-field Σ. Then we defineLev(Fθ1,Fθ2) = inf{ǫ > 0|Fθ1(x−ǫ) < Fθ2(x) < Fθ1(x+ǫ), ∀x ∈ R}.It can be shown that convergence in the L´evy metric implies weakconvergence for distribution function in R [31]. It is shift invariantbut not scale invariant as discussed in [31].9.4 Quantile distance measuresThis section introduces the quantile distance measure to measure the dis-tance among distribution functions on R (or random variables). We beginwith a general definition using the quantiles and then consider interestingparticular cases. The intuition behind all these metrics lies in their capabil-ity to measure the separation in the quantiles of two random variables.Definition Suppose a statistical space (Ω,P,{Xθ}θ∈Θ) and a loss functionL defined over the extended real numbers R∪{−∞,+∞} are given. Alsolet E be a measurable subset of (0,1) and dµE is a measure on E. Then wecan define the following two measures of distance between Xθ1 and Xθ2,2489.4. Quantile distance measuresSQDEL(Xθ1,Xθ2) = supp∈EL(lqXθ1(p),lqXθ2(p)),andIQDEL(Xθ1,Xθ2) =integraldisplayp∈EL(lqXθ1(p),lqXθ2(p))dµE,which we call the sup quantile distance and integral quantile distance re-spectively.Remark. Note that in general SQDEL and IQDEL are neither well-definednor metrics on the space of random variables..Remark. We can also take L(rqXθ1(p),rqXθ2(p)) in the above definitions.Remark. The natural choice for E is (0,1) and the measure µ = L, whereL is the Leb`egues measure on (0,1). However, one might choose anotherE depending on the purpose. For example E = (0.8,1) might be moreappropriate if the purpose is modeling the high extremes.Remark. Interesting choices for L are δXθ1, δcXθ1, δXθ1 + δXθ2 and δcXθ1+δcXθ2. 
Note that in all these cases the quantile distance is defined since thesequantities are bounded respectively by 1,1 +c,2,2 + 2c.The rest of this report focuses on quantile distances obtained from c-probability losses (c ≥ 0). (Note that c = 0 corresponds to the usual prob-ability loss.)9.4.1 Quantile distance invariance under continuous strictlymonotonic transformationsThis subsection show the invariance of quantile distance under strictly mono-tonic tranformations in the following lemmas.Lemma 9.4.1 (Quantile distance invariance under continuous strictly in-creasing transformations)Suppose X,Y are random variables, letIQDEδcX(X,Y ) =integraldisplayEL(lqX(p),lqY (p))dµE,andSQDEδcX(X,Y ) = supp∈EL(lqX(p),lqY (p)),where E ⊂ (0,1), c ≥ 0 and µE is a measure on E. ThenIQDEδcX(X,Y ) = IQDEδcφ(X)(φ(X),φ(Y )),2499.4. Quantile distance measuresandSQDEδcX(X,Y ) = SQDEδcφ(X)(φ(X),φ(Y )),for all φ : R→ R continuous and strictly increasing transformations.Proof The proof attains from noting thatδφ(X)(lqφ(X)(p),lqφ(Y )(p)) =δφ(X)(lqφ(X)(p),lqφ(Y )(p)) +c(1−1{0}(lqφ(X)(p)−lqφ(Y )(p))) =δφ(X)(φ(lqX(p)),φ(lqY (p))) +c(1−1{0}(lqX(p)−lqY (p))) =δX(lqX(p),lqY (p)) +c(1−10(lqX(p)−lqY (p))) =δcX(lqX(p),lqY (p)).Remark. The above lemma is also true for δcX + δcY , which follows imme-diately.Lemma 9.4.2 If E a measurable subset of [0,1] then the two following dis-tance measures are equal:LQDEδX(X,Y ) =integraldisplayEδX(lqX(p),lqY (p))dp,andRQDEδX(X,Y ) =integraldisplayEδX(rqX(p),rqY (p))dp.The following two measures are also equal:LQDEδX+δY (X,Y ) =integraldisplayE(δX +δY )(lqX(p),lqY (p))dp,andRQDEδX+δY (X,Y ) =integraldisplayE(δX +δY )(rqX(p),rqY (p))dp.Proof We prove the first part part of the lemma and the second part isdeduced from the first. We showed in the quantile definition section thatthe set {p|lqX(p) negationslash= rqX(p)} is countable. Hence,{p|lqX(p) negationslash= rqX(p)}∪{p|lqY (p) negationslash= rqY(p)},2509.4. Quantile distance measuresis also countable. In the complement of this setδX(lqX(p),lqY (p)) = δX(rqX(p),rqY (p)).Hence the integral values are the same.Remark. Note that the above theorem also holds for any measure µ onany E ⊂ (0,1) which is continuous with respect to the Leb`egue measure.Because of this lemma we will not worry about the left or right quantile inthe definitions.The following lemma establishes a relationship between LQDδX andLQDδcX.Lemma 9.4.3 Let E be a measurable subset of [0,1] andkE = L{p ∈ E|lqX(p) negationslash= lqY (p)},where L is the Leb`egue measure. LetLQDEδcX(X,Y ) =integraldisplayEδcX(lqX(p),lqY (p))dp,andLQDEδX(X,Y ) =integraldisplayEδX(lqX(p),lqY (p))dp.ThenLQDEδcX(X,Y ) = LQDEδX(X,Y ) +ckE.ProofLQDδcX(X,Y ) =integraldisplayEδcX(lqX(p),lqY (p))dp =integraldisplaylqX(p)=lqY (p),p∈EδcX(lqX(p),lqY (p))dp +integraldisplaylqX(p)negationslash=lqY (p),p∈EδcX(lqX(p),lqY (p))dp =integraldisplaylqX(p)=lqY (p),p∈EδX(lqX(p),lqY (p))dp +integraldisplaylqX(p)negationslash=lqY (p),p∈E[δX(lqX(p),lqY (p)) +c(1−1{0})(lqX(p)−lqY (p))]dp =LQDEδX(X,Y ) +ckE.2519.4. Quantile distance measuresRemark. Note that the same is true for RQDEδcXand RQDEδX. AlsoL{p ∈ E|lqX(p) negationslash= lqY (p)} = L{p,p ∈ E|rqX(p) negationslash= rqY (p)},because lqX,rqX and lqY ,rqY are unequal only on a measure zero set. 
Hencethe constant kE is the same as before andRQDEδcX(X,Y ) = RQDEδX(X,Y ) +ckE.Lemma 9.4.4 Suppose E a measurable subset of [0,1] then the two follow-ing distance measures are equalLQDEδcX(X,Y ) =integraldisplayp∈EδcX(lqX(p),lqY (p))dp,andRQDEδcX(X,Y ) =integraldisplayp∈EδcX(rqX(p),rqY (p))dp.Also these two measures are equalLQDEδcX+δcY(X,Y ) =integraldisplayp∈E(δcX +δcY )(lqX(p),lqY (p))dp,andRQDEδcX+δcY(X,Y ) =integraldisplayp∈E(δcX +δcY )(rqX(p),rqY (p))dp.ProofThis is a straightforward consequence of the previous two lemmas.Remark. Note that the above theorem also holds for any measure µ onany E ⊂ (0,1) which is continuous with respect to the Leb`egue measure.Lemma 9.4.5 (Quantile distance invariance under continuous strictly mono-tonic transformations)Suppose X,Y are random variables and letQDE(X,Y ) = LQDEδX(X,Y ), (9.1)QDEc (X,Y ) = LQDEδcX(X,Y ), (9.2)2529.4. Quantile distance measureswhere, E ⊂ (0,1) symmetric, meaning p ∈ E ⇔ (1−p) ∈ E, and µ is abso-lutely continuous with respect to the Leb`egue measure and symmetric on Ein the sense that if A is measurable then so is 1−A while µ(A) = µ(1−A).Then 9.1 and 9.2 are invariant under continuous strictly monotonic trans-formations, i.e.a) QDE(φ(X),φ(Y )) = LQDEδφ(X)(φ(X),φ(Y )) = QDE(X,Y ) = QDEδX(X,Y ),b) QDEc (φ(X),φ(Y )) = LQDEδcφ(X)(φ(X),φ(Y )) = QDEc (X,Y ) = QDEδcX(X,Y ).Proof For φ continuous and strictly increasing transformations, we haveshown the result in Lemma 9.4.1. Suppose φ is continuous and strictlydecreasing.a) We use lqφ(X)(p) = φ(rqX(1−p)) which we proved above using quantilesymmetries:δφ(X)(lqφ(X)(p),lqφ(Y )(p)) =δφ(X)(φ(rqX(1−p)),φ(rqY (1−p))) =δ−φ(X)(−φ(rqX(1−p)),−φ(rqY (1−p))),where the last equality is because δX(a,b) = δ−X(−a,−b). Now since −φ iscontinuous and increasing, the above is equal toδX(rqX(1−p),rqY (1−p)).We use this result in the following:QDE(X,Y ) =integraldisplayEδX(lqX(p),lqY (p))dµE=integraldisplayEδX(rqX(1−p),rqY (1−p))dµE.Then we do a change of variable p → (1−p) and by symmetry of µ, we findthat the above is equal tointegraldisplayEδX(rqX(p),rqY (p))dµE.But by the previous lemmas and since µ is continuous with respect to theLeb`egue measure, this is equal tointegraldisplayEδX(lqX(p),lqY (p))dµE.2539.4. Quantile distance measuresb) We only consider continuous and strictly decreasing functions φ:LQDEδcφ(X)(φ(X),φ(Y )) =integraldisplayEc(1−1{0}(lqφ(X)(p)−lqφ(Y)(p)))dp +LDQδφ(X)(X,Y ) =ckE +LDQEδφ(X)(φ(X),φ(Y )),where,kE = µ{p ∈ E|lqφ(X)(p) negationslash= lqφ(Y )(p)} =µ{p ∈ E|φ(rqX(1−p)) negationslash= φ(rqY (1−p))} =µ{p ∈ E|rqX(1−p) negationslash= rqY (1−p)} =µ{p ∈ E|rqX(p) negationslash= rqY (p)} =µ{p ∈ E|lqX(p) negationslash= lqY (p)}.We showed in a) thatLDQEδφ(X)(φ(X),φ(Y )) = LDQEδX(X,Y )and because we just showed that kE = µ{p ∈ E,|(lqX(p)) negationslash= (lqY )(p)}, weconcludeDQEδcφ(X)= ckE+LDQEδφ(X)(φ(X),φ(Y )) = ckE+LDQEδX(X,Y ) = LQDEδcX(X,Y ).9.4.2 Quantile distance closeness of empirical distributionand the true distributionThe next theorem shows that the quantile distance between the sampledistribution and the true distribution tends to zero when the sample sizebecomes large.Theorem 9.4.6 Let X1,X2,··· be an i.i.d. random sample drawn from anarbitrary distribution function F. Then(a) SQDδX(F,Fn) = supp∈(0,1)δF(lqFn(p),lqF(p)) → 0., a.s.,2549.4. Quantile distance measuresand(b) IQDδX(F,Fn) =integraldisplayp∈(0,1)δF(lqFn(p),lqF(p)) → 0., a.s..ProofWe only need to prove (a) since (b) is a straightforward consequence of(a). Clearly lqFn(p) = Xi:n for p ∈ ((i − 1)/n,i/n],i = 1,2,··· ,n. 
AlsoFcn(Xi:n) ≥ i/n and Fon(Xi:n) ≤ (i − 1)/n. Pick an N large enough in theGlivenko-Cantelli Theorem such thatn > N ⇒ |Fn(x)−F(x)| < ǫ, and |Fon(x)−Fo(x)| < ǫ,uniformly in x. Consider two cases:Case I: Xi:n < lqF(p). ThenδF(lqFn(p),lqF(p)) = δF(Xi:n,lqF(p)) =Fo(lqF(p))−Fc(Xi:n) ≤ Fo(lqF(p))−Fcn(Xi:n)+ǫ≤ p−i/n+ǫ ≤ ǫ.Case II: Xi:n > lqF(p). ThenδF(lqFn(p),lqF(p)) = δF(Xi:n,lqF(p)) =Fo(Xi:n)−Fc(lqF(p)) ≤ Fon(Xi:n) +ǫ−p≤ (i−1)/n +ǫ−p ≤ ǫ.Since this holds for i = 1,2,··· ,n and (0,1) = ∪i=1,2,···,n(i−1n , in], the supre-mum is also less than ǫ.9.4.3 Quantile distance and KS distance closenessClearly if X ∼ Y, then LQDEL(X,Y ) = 0. In the following theorem we studythe inverse question for L = δcX, c ≥ 0 and E = [0,1]. The KolmogorovSmirnoff distance was defined to beKS(X,Y ) = supx∈R|FX(x)−FY (x)|.We also define the “open Kolmogorov Smirnoff” distance asKSo(X,Y ) = supx∈R|FoX(x)−FoY (x)|.2559.4. Quantile distance measuresLemma 9.4.7 Suppose X,Y are random variables, thenKSo(X,Y ) = KS(X,Y ).To prove the lemma, we show thatKS(X,Y ) ≤ ǫ ⇔ KSo(X,Y ) ≤ ǫ.Suppose KS(X,Y ) ≤ ǫ. If the R.H.S does not hold then there exist x ∈ Rsuch thatFoX(x) > FoY (x) +ǫ.Since FoX is left continuous, we conclude there is a y < x such thatFoX(y) > FoY (x) +ǫ.Hence,FcX(y) ≥ FoX(y) > FoY (x) +ǫ ≥ FcY (y)+ǫ,which is a contradiction.Inversely, suppose KSo(X,Y ) ≤ ǫ. If the L.H.S does not hold then thereexist x ∈ R such thatFcX(x) > FcY (x) +ǫ.Since FcY is right continuous, we conclude there is y > x such thatFcX(x) > FcY (y) +ǫ.Hence,FoX(y) ≥ FcX(x) > FcY (x)+ǫ ≥ FoY (y)+ǫ,which is a contradiction.Lemma 9.4.8 Kolmogorov Smirnoff closeness implies Quantile distance close-ness. More formally if for two random variables X,Y , KS(X,Y ) ≤ ǫ thenSQDδX(X,Y ) = supp∈(0,1)δX(lqX(p),lqY (p)) ≤ ǫ.ProofFor p ∈ (0,1), suppose lqX(p) < lqY (p). Thenδ(lqX(p),lqY (p)) = FoX(lqY (p))−FcX(lqY (p)) ≤2569.4. Quantile distance measuresFo(lqY (p)) +ǫ−p ≤ p+ǫ−p = ǫ.The discussion for lqY (p) < lqX(p) is similar.Remark. By symmetry also KS(X,Y ) ≤ ǫ ⇒ SQDδY (X,Y ) ≤ ǫ.The converse needs the continuity assumption:Lemma 9.4.9 Suppose X,Y are continuous random variables. Then quan-tile distance closeness implies Kolmogorov Smirnoff distance closeness. Moreformally, supposeSQDδX(X,Y ) = supp∈(0,1)δX(lqX(p),lqY (p)) ≤ ǫandSQDδY (X,Y ) = supp∈(0,1)δY (lqX(p),lqY (p)) ≤ ǫ.ThenKS(X,Y ) ≤ ǫ.Proof Suppose the result is not true and there exists x such that|FX(x)−FY (x)| ≥ ǫ.Then let p1 = FX(x) and p2 = FY (x) and without loss of generality assumep2 > p1. Since FY (x) = p2, lqY (p2) ≤ x. But lqX(p2) = y > x. Otherwisep2 ≤ FX(lqX(p2)) = FX(x) = p1 which is a contradiction.δX(lqX(p2),lqY (p2)) = FoX(y)−FY (x) = FX(y)−FY (x) ≥ p2 −p1 > ǫ,which is a contradiction. Note that we have used continuity of X in thesecond equality.Remark. This is not true in general. Consider X with P(X = 0) = 1 andY with P(Y = 1) = 1. Then FX(1/2) −FY (1/2) = 1 and SQDδX(X,Y ) +SQDδY (X,Y ) = 0.In the next theorem we show that if the quantile distance between twovariables are zero and one of them is continuous then they are identicallydistributed.Theorem 9.4.10 Suppose F1,F2 distribution functions, F1 continuous andtheir quantile distance is zero. In other words,supp∈(0,1)δF1(lqF1(p),lqF2(p)) = 0.Then F1 = F2.2579.4. Quantile distance measuresProof Suppose the result does not hold. 
Then we have two cases.Case I: ∃x, p1 = F1(x) < F2(x) = p2.F1(x) = p1 ⇒ lqF1(p2) = y > x,andF2(x) = p2 ⇒ lqF2(p2) = z ≤ x.HenceδF1(lqF1(p2),lqF2(p2)) = F1(y)−F1(z) ≥ F1(y)−F1(x) ≥ p2 −p1.Case II: ∃x, p1 = F1(x) > F2(x) = p2.Take p3 ∈ (p2,p1). ThenF1(x) = p1 ⇒ lqF1(p3) = y ≤ x.However if lqF1(p3) = x, we concludeF1(lqF1(p3)) = F1(x) ⇒ p3 = p1,which is a contradiction. Note that we have used the continuity of F1 inF1(lqF1(p3)) = p3.AlsoF2(x) = p2 ⇒ lqF2(p3) = z > x.HenceδF1(lqF1(p3),lqF2(p3)) = δF1(y,z) = F1(z)−F1(y) ≥ F1(x)−F1(y) ≥ p1−p3.Here we prove an easy lemma regarding the continuity of δ.Lemma 9.4.11 Suppose F is a continuous distribution function. For anyfixed b ∈ R, δF(a,b) is a continuous function in a.Proof Note that δF(a,b) = |F(b)−F(a)| because F is a continuous func-tion.Lemma 9.4.12 Suppose F1,F2 are distribution functions, F1 is continuousandδF1(lqF1(p0),lqF2(p0)) = ∆ > 0,for some p0 ∈ (0,1) then there exist 0 < ǫ < p0 such thatδF1(lqF1(p),lqF2(p)) > ∆/3, p ∈ (p0 −ǫ,p0).2589.4. Quantile distance measuresProof Since F1 is continuousδF1(lqF1(p),lqF2(p)) = |p−F1(lqF2(p))|.Let lqF2(p0) = x1 and F1(x1) = p1. Then |p0 −p1| = ∆.By continuity of F1 there exist ǫ′ > 0 such thatx ∈ (x1 −ǫ′,x1 +ǫ′) ⇒ F1(x) ∈ (p1 − ∆3 ,p1 + ∆3 ).By left continuity of lqF2 for ǫ′ positive, there exists an 0 < ǫ < min(∆/3,p0)such thatp ∈ (p0 −ǫ,p0) ⇒ lqF2(p) ∈ (x1 −ǫ′,x1).Hence for p ∈ (p0−ǫ,p0), we have F1(lqF2(p)) ∈ (p1−∆/3,p1+∆/3). HenceδF1(lqF1(p),lqF2(p)) = |p−F1(lqF2(p))| ≥|p0 −p1|−ǫ− ∆3 ≥ ∆/3.Lemma 9.4.13 Suppose F1,F2 are distribution functions and F1 is contin-uous. Also assumeIDQδF1(F1,F2) =integraldisplay 10δF1(lqF1(p),lqF2(p)) = 0.Then F1 = F2.Proof The assumption implies that δF1(lqF1(p),lqF2(p)) = 0, ∀p ∈ (0,1).For otherwiseif δF1(lqF1(p0),lqF2(p0)) = ∆ > 0, for somep0. By the previouslemma there exist 0 < ǫ < p0 such thatδF1(lqF1(p),lqF2(p)) > ∆/3, p ∈ (p0 −ǫ,p0).This implies thatintegraldisplay 10δF1(lqF1(p),lqF2(p)) ≥ ǫ∆,which is a contradiction. Now we can useLemma 9.4.10 to conclude F1 = F2.2599.4. Quantile distance measures9.4.4 Quantile distance for continuous variablesFrom now on we only consider continuous variables and the probability lossfunction with c = 0, δX. Some results can be generalized to the generaldistributions but we leave that for future research. We use the simplernotations:QDX(X,Xθ) = LQDδX(X,Xθ) =integraldisplay 10δX(lqX(p),lqXθ(p))dp.AlsoQD(X,Xθ) = QDX(X,Xθ) +QDXθ(X,Xθ).Quantile distance in the continuous case can be obtained by:QDX(X,Xθ) =integraldisplay 10δX(lqX(p),lqXθ(p))dp =integraldisplay 10|FX ◦lqX(p)−FX ◦lqXθ(p)|dp =integraldisplay 10|p−FX ◦lqXθ(p)|dp.We can also consider the quantile distance closeness in the tails. Considerthe tails to correspond to probabilities E = (0,0.025) ∪ (0.0975,1). ThenL(E) = 0.05 (L being the L`ebegue measure) and we can defineQDtailX (X,Xθ) =integraldisplayEδX(lqX(p),lqXθ(p))dp/0.05 =integraldisplayE|FX ◦lqX(p)−FX ◦lqXθ(p)|dp/0.05 =integraldisplayE|p−FX ◦lqXθ(p)|dp/0.05.We have divided the integral by 0.05 the length of E to make this measurecomparable to the overall measure over [0,1], which has length 1.Then we compute the quantile distance of the standard normal to someknown distributions. Both the overall quantile distance and the tail quantiledistance are calculated (by approximating the integrals) and the results aregiven in Table 9.1 and 9.2. For the overall quantile distance we observe thatQDX and QDY have almost the same value. 
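The near-equality of QDX and QDY can be checked numerically with a short sketch (not the thesis code) that evaluates the continuous-case formula QDX(X,Y) = ∫₀¹ |p − FX(lqY(p))| dp on a probability grid for a few rows of Table 9.1.

```python
import numpy as np
from scipy.stats import norm, t, cauchy

def qd(X, Y, n_grid=100001):
    """Approximate QD_X(X, Y) = int_0^1 |p - F_X(lq_Y(p))| dp for continuous X and Y."""
    p = np.linspace(1e-6, 1 - 1e-6, n_grid)
    return np.trapz(np.abs(p - X.cdf(Y.ppf(p))), p)

X = norm(0, 1)
for name, Y in [("N(1,1)", norm(1, 1)), ("t(1)", t(1)), ("Cauchy(scale=1)", cauchy())]:
    print(f"{name}: QD_X = {qd(X, Y):.4f}, QD_Y = {qd(Y, X):.4f}")   # values close to Table 9.1
```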
A theoretical result regardingthis observation is desirable and we leave this for future research. This isnot true in general for the tail distance.Then we find the closest Cauchy with scale parameter in (0,4) (and loca-tion parameter=0) to the standard normal. Once using the quantile distanceand once using the tail quantile distance. We find the quantile distance of2609.4. Quantile distance measuresthe standard normal to all Cauchy distributions with scale parameters onthe grid (0.01,0.02,··· ,4.00) (and location parameter=0). The results aregiven in Figures 9.3 and 9.5 respectively. For the overall quantile distancethe optimal Cauchy is the one with scale parameter 0.66 and for the tailquantile distance, the optimal Cauchy is the one with scale parameter 0.12.Figure 9.4 depicts the normal distribution functions compared with a fewCauchy distributions including the optimal and Figure 9.6 depicts the nor-mal distribution in the upper tail with a few Cauchy distributions includingthe optimal in tails with scale parameter 0.12. Figure 9.7 depicts the stan-dard normal distribution compared with the optimal Cauchy for the overallquantile distance and the optimal Cauchy for the tail quantile distance. Weconclude that a fit that is optimally might not be optimal on the tails. Weuse this fact later in choosing our method to model extreme temperatureevents.Distribution QDX(X,Y) QDY (X,Y) QDY = N(1,1) 0.2605080 0.2605080 0.5210159Y = N(0.5,1) 0.138301 0.138301 0.276602Y = N(0,2) 0.1024215 0.1024207 0.2048422Y = t(1) 0.06382985 0.0637436 0.1275734Y = t(10) 0.0078747 0.007872528 0.01574723Y = t(100) 0.000795163 0.0007951621 0.001590325Y = Cauchy(scale = 1) 0.06376941 0.06376579 0.1275352Y = χ2(1) 0.2190132 0.2190249 0.4380381U(−0.5,0.5) 0.1522836 0.1522991 0.3045827U(−1,1) 0.06562216 0.06563009 0.1312522U(−2,2) 0.05612716 0.0561283 0.1122555U(−3,3) 0.1171562 0.1171562 0.2343124Table 9.1: Comparing standard normal with various distributions usingquantile distance, where U denotes the uniform distribution and χ2 theChi-squared distribution.2619.4. Quantile distance measures0 1 2 3 40.050.15QD10 1 2 3 40.050.15QD20 1 2 3 40.10.20.30.4scale parameterQDFigure 9.3: Cauchy distribution’s distance with different scale parameter(and location parameter=0) to the standard normal. In the plots QD1 = QXand QD2 = QDY and QD = QD1+QD2, where X is the standard normaland Y is the Cauchy.2629.4. Quantile distance measures−3 −2 −1 0 1 2 30.00.20.40.60.81.0xF(x)Figure 9.4: The distribution function of standard normal (solid) comparedwith the optimal Cauchy (and location parameter=0) picked by quantiledistance minimization with scale parameter=0.66 (dashed curve), Cauchywith scale parameter=1 (dotted) and Cauchy with scale parameter=0.5 (dotdashed).2639.4. Quantile distance measures0 1 2 3 40.000.100.20QD10 1 2 3 40.000.100.200.30QD20 1 2 3 40.050.150.25scale parameterQDFigure 9.5: Cauchy distribution’s distance with different scale parameter(and location parameter=0) to the standard normal on the tails. In theplots QD1 = QX and QD2 = QDY and QD = QD1+QD2, where X is thestandard normal and Y is the Cauchy.2649.4. Quantile distance measures2.0 2.2 2.4 2.6 2.8 3.00.800.850.900.951.00xF(x)Figure 9.6: The distribution function of standard normal (solid) comparedwith the optimal Cauchy picked by tail quantile distance minimization withscale parameter=0.12 (dashed curve), Cauchy with scale parameter=0.65(dotted) and Cauchy with scale parameter=0.01 (dot dashed).2659.4. 
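The grid search that produced Figures 9.3 and 9.5 can be reproduced approximately with the following sketch (illustrative code; the tail set is E = (0, 0.025) ∪ (0.975, 1) and QD = QDX + QDY, as in the text, and the reported optima are only as accurate as the probability grid and the scale grid allow).

```python
import numpy as np
from scipy.stats import norm, cauchy

def qd_piece(X, Y, p):
    """(delta_X + delta_Y)(lq_X(p), lq_Y(p)) integrated over one probability interval."""
    d_x = np.abs(p - X.cdf(Y.ppf(p)))
    d_y = np.abs(p - Y.cdf(X.ppf(p)))
    return np.trapz(d_x + d_y, p)

p_overall = [np.linspace(1e-5, 1 - 1e-5, 5001)]
p_tail = [np.linspace(1e-5, 0.025, 1001), np.linspace(0.975, 1 - 1e-5, 1001)]

scales = np.arange(0.01, 4.0001, 0.01)
overall = [sum(qd_piece(norm(), cauchy(scale=s), p) for p in p_overall) for s in scales]
tail = [sum(qd_piece(norm(), cauchy(scale=s), p) for p in p_tail) for s in scales]

print("overall-optimal scale:", scales[np.argmin(overall)])   # about 0.66, cf. Figure 9.3
print("tail-optimal scale:   ", scales[np.argmin(tail)])      # about 0.12, cf. Figure 9.5
```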
Quantile distance measures−3 −2 −1 0 1 2 30.00.20.40.60.81.0xF(x)Figure 9.7: Comparingthe standard normaldistribution (solid) with optimalCauchy picked by quantile distance (dashed) and the optimal Cauchy pickedby tail quantile distance minimization (dotted).2669.4. Quantile distance measuresDistribution QDtailX (X,Y) QDtailY (X,Y) QDtail(X,Y)Y = N(1,1) 0.05075276 0.05075276 0.10150552Y = N(0.5,1) 0.01824013 0.01824013 0.03648026Y = N(0,2) 0.01249034 0.11206984 0.12456018Y = t(1) 0.0125000 0.1184949 0.1309949Y = t(10) 0.007631262 0.011192379 0.018823642Y = t(100) 0.0009740074 0.0010122519 0.0019862594Cauchy(scale = 1) 0.0125000 0.1180231 0.1305231Y = χ2(1) 0.25006521 0.06467072 0.31473593U(−0.5,0.5) 0.3004565 0.0125000 0.3129565U(−1,1) 0.1523052 0.0125000 0.1648052U(−2,2) 0.01313629 0.01205279 0.02518908U(−3,3) 0.01083494 0.10054194 0.11137688Table 9.2: Comparing standard normal on the tails with some distributionsusing quantile distance, where U denotes the uniform distribution and χ2the Chi-squared distribution.9.4.5 Equivariance of estimation under monotonictransformations using the quantile distanceSuppose a family of distributions {Xθ}θ∈Θ, Θ ⊂ Rk is given. Also assumeφ is a continuous and strictly monotonic transformation on R. Considerthe family of distributions {Yθ = φ(Xθ)}θ∈Θ. Then the family {Yθ}θ∈Θ isparameterized by the same parameters sinceP(Yθ < a) = P(φ(Xθ) < a) = P(Xθ < φ−1(a)).Then the following lemma shows the equivariance property of quantile dis-tance estimation.Lemma 9.4.14 Suppose a random variable X and a family of distributions{Xθ}θ∈Θ are given,A = argminθ∈Θintegraldisplay 10δX(lqX(p),lqXθ(p))dp,is nonempty and φ is a continuous and strictly monotonic transformation.LetB = argminθ∈Θintegraldisplay 10δφ(X)(lqφ(X)(p),lqφ(Xθ)(p))dp.Then A = B. In other words if Xθ is an optimal estimator of X, then φ(Xθ)is an optimal estimator of φ(X).2679.4. Quantile distance measuresProof This is trivial by invariance properties of quantile distance undercontinuous strictly monotonic transformations.Remark. The above is also true if we use replace the integral quantiledistance by the sup quantile distance.9.4.6 Estimation using quantile distanceHere we only consider estimation using integral quantile distance. In orderto estimate a distribution X using a parameterized family {Xθ}θ∈Θ, one cantry to findargminθ∈Θintegraldisplay 10δX(lqX(p),lqXθ(p))dp.However, the above expression depends on δX an unknown. The availableinformation to us is usually a random sample X1,··· ,Xn.Remark. If we use the empirical distribution instead of the distribution ofX is above, we get:argminθ∈Θintegraldisplay 10δFn(lqFn(p),lqXθ(p))dp.The argmin can be checked again to be equivariant under continuous andstrictly monotonic transformations.Tables 9.3 and 9.4 compare the maximum likelihood estimation to thequantile distance estimation method for a sample of size N = 20 andN = 100 respectively. In each case we generate 50 samples of length Nand estimate the parameters using both methods. Then we assess the per-formance by a few measures: mean absolute error, mean square error, meanprobability loss error and mean quantile distance. In both cases maximumlikelihood has done slightly better in terms of all errors except the quantiledistance error in which case the quantile distance estimation has done signif-icantly better. The histogram for both estimation methods for N = 20 andN = 100 are given in Figures 9.8 and 9.9 respectively. 
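A concrete, purely illustrative version of the quantile distance estimation used in this comparison is sketched below: the empirical distribution Fn replaces FX in the continuous-case formula and the resulting criterion is minimized over (µ, σ) for a normal sample, with the maximum likelihood fit shown alongside. The optimizer and the probability grid are arbitrary choices made here, not those used for the tables.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(7)
sample = np.sort(rng.normal(loc=0.0, scale=1.0, size=100))
n = len(sample)
p_grid = np.linspace(0.001, 0.999, 2000)

def empirical_qd(params):
    """Approximate int_0^1 |p - F_n(lq_{X_theta}(p))| dp, with F_n the empirical cdf."""
    mu, sigma = params
    if sigma <= 0:
        return np.inf
    lq_theta = norm.ppf(p_grid, loc=mu, scale=sigma)
    F_n = np.searchsorted(sample, lq_theta, side="right") / n
    return np.trapz(np.abs(p_grid - F_n), p_grid)

qd_fit = minimize(empirical_qd, x0=[np.median(sample), sample.std()], method="Nelder-Mead").x
ml_fit = np.array([sample.mean(), sample.std()])   # MLE of (mu, sigma) for the normal family
print("quantile distance fit:", qd_fit.round(3))
print("maximum likelihood fit:", ml_fit.round(3))
```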
For both maximumlikelihood and quantile distance estimations for N = 100 the parametershave a symmetric (close to normal) distribution.2689.4. Quantile distance measuresError type QD error s.e. of QD error ML error s.e. ML errorMean probability loss error for 0.077 0.061 0.077 0.055µ = lqN(µ,σ2)(1/2)Mean probability loss for 0.185 0.114 0.176 0.096σ2 +µ = lqN(µ,σ2)(P(Z < 1))Mean abs. error for µ 0.198 0.160 0.196 0.143Mean abs. error for σ 0.159 0.127 0.132 0.085Mean square error µ 0.064 0.089 0.058 0.077Mean square error for σ 0.041 0.065 0.025 0.028Mean QD error 0.035 0.009 0.122 0.073Table 9.3: Assessment of Maximum likelihood estimation and quantile dis-tance estimation using several measures of error for a sample of size 20. Inthe table s.e. stands for the standard error.Error type QD error s.e. of QD error ML error s.e. ML errorMean probability loss for 0.028 0.020 0.027 0.020µ = lqN(µ,σ2)(1/2)Mean probability loss for 0.157 0.046 0.165 0.038σ2 +µ = lqN(µ,σ2)(P(Z < 1))Mean abs. error for µ 0.070 0.051 0.068 0.051Mean abs. error for σ 0.079 0.052 0.061 0.039Mean square error µ 0.007 0.009 0.007 0.009Mean square error for σ 0.009 0.011 0.005 0.005Mean QD error 0.014 0.003 0.045 0.026Table 9.4: Assessment of Maximum likelihood estimation and quantile dis-tance estimation using several measures of error for a sample of size 100. Inthe table s.e. stands for the standard error.2699.4. Quantile distance measuresQD mean estimateDensity−0.6 −0.2 0.0 0.2 0.4 0.60.00.51.01.5ML  mean estimateDensity−0.6 −0.2 0.2 0.60.00.51.01.5QD sd estimateDensity0.4 0.6 0.8 1.0 1.20.00.51.01.52.0ML  sd estimateDensity0.6 0.8 1.0 1.2 1.40.00.51.01.52.02.5Figure 9.8: Histograms for the parameter estimates using quantile distanceand maximum likelihood methods for a sample of size 20.2709.4. Quantile distance measuresQD mean estimateDensity−0.2 −0.1 0.0 0.1 0.201234ML  mean estimateDensity−0.2 −0.1 0.0 0.1 0.201234QD sd estimateDensity0.8 0.9 1.0 1.1 1.201234ML  sd estimateDensity0.85 0.95 1.05 1.1501234Figure 9.9: Histograms for the parameter estimates using quantile distanceand maximum likelihood methods for a sample of size 100.271Chapter 10Binary temperatureprocesses10.1 IntroductionThis chapter uses the theory developed in previous chapters to find appro-priate models for extreme temperature events. We consider both low andhigh temperatures. The temperature is measured in degrees centigrade. Wedefine a day with minimum temperature (mt) less than zero as extremelycold and denote it by e:e(t) =braceleftBigg1 mt(t) ≤ 0 (deg C)0 mt(t) > 0 (deg C) .Taking 0 (deg C) to be the cut–off for low temperature seems reasonablein the absence of any other considerations, since it is the usual definition ofa frost. In agriculture, where most plants contain a lot of water this can beconsidered as an important cut–off. No seemingly natural cut-off like thatfor minimum temperature exists for extremely high temperature. To defineextreme events, we ask the following questions:1. Should the definition of an extreme event depend on the purpose ofour model?2. Should it depend on the time of the year and location?3. What should be the cut–off (threshold) to define an extreme event?4. Should we use a certain quantile as the cut-off? In that case whichquantile should be used?We provide some answers in the following:1. The answer to the first question is clearly affirmative. For example, ahigh temperature day for agriculture purposes is different from energy27210.1. Introductionproviding purposes. 
Even for the farmer, different crops may havedifferent tolerances to hot or cold weather.2. The answer of the second question depends on the model’s purpose.We might want to vary the definition over time and space for somepurposes.3. We do not know of any such natural cut–off for high temperatures likethat for low temperature.4. Quantiles have long beenused to determinethe extreme events. Choos-ing the level of the quantile depends on the purpose. Some extreme–value modelers pick the quantile high enough to insure the validity ofthe assumptions underlying their models as Embrechts et al. discussin [16]. For example, a well–known result asserts that P(X − u <v|X > u) follows a known distribution (extreme value distributionse.g. Pareto) when u is large. [See [16].] We do not favor such methodsof choosing the threshold. The threshold should be picked primarilyto reflect our needs in the real problem rather than satisfy the assump-tions of the models. If the models do not satisfy the conditions, weshould find others rather than move the threshold up.Based on the above discussion with the statistician’s knowledge alone, onecannot define the extreme events. Ralph Wright (personal communication)in AAFRD (Agriculture and Rural Development in Alberta, Canada) raisessimilar points. In particular he said the following about the droughts:“Drought is really defined by the impact that the moisture deficit has ona specific use or uses. Its definition can vary both with time of year and fromplace–to–place. Drought can be short–term or long–term. For example, onemonth of hot dry weather can significantly reduce crop yields, despite thefact that normal amounts of precipitation have been received over the pastyear. On the other hand, crops may do fine in dry weather conditions ifprecipitation has been received in a timely manner and temperatures havebeen favorable. However under the same conditions, a dam operator in thesame area may have severe shortages in the reservoir and declare droughtlike conditions (e.g. with low winter snow–fall and poor spring run–off). Youwill need to define your drought based on whom or what is being impactedby the water shortage.”Since we do not have any standard definition of an extremely hot day, weuse the data. In our example, to define a binary process of (hot)/(not hot)for temperature, we pick the global spatial/temporal 95th percentile using27310.1. Introductionthe data from 25 stations over Alberta that had daily maximum temperature(MT) data from 1940 to 2004. The 95th percentile was computed using thequantile algorithm developed in previous chapters and turned out to be 26.7.The exact value was also found and turned out to be q = 27 (deg C). ThenWe define the binary process of extremely hot temperature as:E(t) =braceleftBigg1 MT(t) ≥ q0 MT(t) < q ,where q = 27 (deg C) here.In order to study extreme events (e.g. for MT) three approaches cometo mind:1. Model the whole daily MT process and use that to infer about theextremes. For MT, we have shown that a Gaussian distribution fitsthe daily values fairly well. However, in the tails, usually of paramountconcern, the fit does not do well as shown in the qq–plots in Chapter 2.Another difficulty with this approach is picking a covariance functionto model the covariance over time. Also in Chapter 9, we showedthat even though two distributions are very close in terms of overallquantile distance, they might not be very close in terms of tail quantiledistance (Figure 9.7). 
This shows in order to study extremes (forexample extremely hot temperature) if we use a good overall fit, ourresults might not be reliable.2. Use a specified threshold and model the values exceeding the threshold.This approach has several drawbacks. Firstly we cannot answer thequestion of how often or in what periods of the year the extremeshappen. Thisis because we model the actual extreme values and ignorethe non–extreme values. Secondly, strong assumption of independenceis needed for this method. Thirdly we need to pick the threshold highenough to make the model reasonable as mentioned before. This mightnot be an optimal threshold from a practical point of view.3. Based on a real problem, use a threshold to define a new binary processof (extreme)/(not extreme) values and then model that binary process.This is the method we use and it does not have the issues mentionedin 1 and 2 because the threshold is not taken to satisfy some statisticalproperty and we make few assumptions about the binary chain.27410.2. rth–order Markov models for extreme minimum temperatures10.2 rth–order Markov models for extrememinimum temperaturesThis section looks for appropriate models for the binary process e(t) ofcold/not cold temperature days. This is a binary process and the Cate-gorical Expansion Theorem (Theorem 3.5.6) gives the form of all such rth–order Markov chains. Here we also consider other covariates such as theminimum temperature of the previous day and two days ago as well as sea-sonal covariates (deterministic). The next subsection uses graphical toolsand exploratory techniques to investigate the properties the model shouldhave. Then we use the BIC criterion and compare several proposed models.We use partial likelihood techniques to estimate parameters as proposed byKedem et al. in [27].10.2.1 Exploratory analysis for binary extreme minimumtemperaturesHere we perform an exploratory analysis of the binary process e(t) using twostations for this purpose, Banff and Medicine Hat which have data from 1895to 2006. The transition probabilities are computed from the historical dataconsidering years as independent observations. The results are summarizeda follows:• Figures 10.1 and 10.2 plot the probability of a freezing day over thecourse of a year for the Banff and Medicine Hat stations, respectively.A regular seasonal pattern is seen. Medicine Hat seems to have a muchlonger frost–free period.• Figures 10.3 and 10.4 plot the estimated transition probabilities, ˆp01and ˆp11 for the Banff and Medicine Hat stations. If the chain were a0th–order Markov chain then these two curves would overlap. This isnot the case and Markov chain at least of 1st–order seems necessary.In the ˆp01 curve for both Banff and Medicine Hat, high fluctuationsare seen at the beginning and end of the year which corresponds to thecold season. This is not surprising because there are very few pairs inthe data with a freezing day followed by a non–freezing day in a coldseason in Alberta.• In Figure 10.4, ˆp11 is missing for a period over the summer. This isbecause no freezing day is observed over this period in the summerand hence ˆp11 could not be estimated.27510.2. rth–order Markov models for extreme minimum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbability of a freezing dayFigure 10.1: The estimated probability of a freezing day for the Banff sitefor different days of a year computed using the historical data.• Figures 10.5 and 10.6 give the plots for the 2nd–order transition prob-abilities. 
They overlap substantially and hence a 2nd–order Markovchain does not seem to be necessary.10.2.2 Model selection for extreme minimum temperatureThis section finds models for the extreme minimum temperature processe(t). Here Zt−1 denotes the covariate process. We investigate the followingpredictors:• ek(t) ≡ e(t−k). Was it an extremely cold day k days ago?• mtk(t) ≡ mt(t−k), the actual minimum temperature k days ago.• Nk, the number of freezing days during the k previous days.• SIN, COS, SIN2 and COS2 which are abbreviations for sin(ωt),cos(ωt), sin(2ωt) and cos(2ωt), respectively (with ω = 2π366).27610.2. rth–order Markov models for extreme minimum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbability of a freezing dayFigure 10.2: The estimated probability of a freezing day for the MedicineHat site for different days of a year computed using the historical data.27710.2. rth–order Markov models for extreme minimum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.3: The estimated 1st–order transition probabilities for the 0-1process of extreme minimum temperatures for the Banff site. The dottedline represents the estimated probability of “e(t) = 1 if e(t −1) = 1” ( ˆp11)and the dashed, “e(t) = 1 if e(t−1) = 0” ( ˆp01).27810.2. rth–order Markov models for extreme minimum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.4: The estimated 1st–order transition probabilities for the 0-1process of extreme minimum temperatures for the Medicine Hat site. Thedotted line represents the estimated probability of “e(t) = 1 if e(t−1) = 1”( ˆp11) and the dashed, “e(t) = 1 if e(t−1) = 0” ( ˆp01).27910.2. rth–order Markov models for extreme minimum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.5: The estimated 2nd–order transition probabilities for the 0-1process of extreme minimum temperature for the Banff site with ˆp111 (solid)compared with ˆp011 (dotted) both calculated from the historical data.28010.2. rth–order Markov models for extreme minimum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.6: The estimated 2nd–order transition probabilities for the 0-1 process of extreme minimum temperatures for the Banff site with ˆp001(solid) compared with ˆp101 (dotted) calculated from the historical data.28110.2. rth–order Markov models for extreme minimum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.7: The estimated 2nd–order transition probabilities for the 0-1process of extreme minimum temperatures for the Medicine Hat site withˆp111 (solid) compared with ˆp011 (dotted) calculated from the historical data.28210.2. rth–order Markov models for extreme minimum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.8: The estimated 2nd–order transition probabilities for the 0-1process of extreme minimum temperatures for the Medicine Hat site withˆp001 (solid) compared with ˆp101 (dotted) calculated from the historical data.28310.2. rth–order Markov models for extreme minimum temperaturesTable 10.1 compares models with a constant and Nk as the covariateprocess. 
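For models of this form the partial likelihood is the product of the conditional Bernoulli likelihoods, so each fit can be obtained from an ordinary logistic regression of e(t) on the covariate vector Zt−1. The sketch below (hypothetical file and column names; the resulting numbers depend on the station record used, so they will not reproduce the tables exactly) builds the covariates N11, e1, COS and SIN and computes the BIC for two of the candidate models.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

daily = pd.read_csv("medicine_hat_daily.csv")      # assumed columns: year, doy, mt
daily["e"] = (daily["mt"] <= 0.0).astype(int)      # frost indicator e(t)

omega = 2 * np.pi / 366
daily["COS"] = np.cos(omega * daily["doy"])
daily["SIN"] = np.sin(omega * daily["doy"])
daily["e1"] = daily.groupby("year")["e"].shift(1)  # e(t-1), kept within the same year
daily["N11"] = daily.groupby("year")["e"].transform(
    lambda s: s.shift(1).rolling(11).sum())        # number of frost days among the 11 previous days

def bic_for(df, covariates):
    """Fit the logistic model for e(t) on (1, covariates) and return its BIC and coefficients."""
    d = df.dropna(subset=covariates + ["e"])
    X = sm.add_constant(d[covariates])
    fit = sm.Logit(d["e"], X).fit(disp=0)          # partial-likelihood fit of the binary chain
    return -2.0 * fit.llf + X.shape[1] * np.log(len(d)), fit.params

for covs in (["N11"], ["e1", "COS", "SIN"]):
    bic, coef = bic_for(daily, covs)
    print(covs, round(bic, 1), coef.round(3).to_dict())
```

Applying the same function across all the candidate covariate sets gives comparisons of the kind shown in the tables that follow.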
The optimal model picked by the BIC criterion is the model withthe covariates Zt−1 = (1,N11).Model: Zt−1 BIC parameter estimates(1,N1) 1251.7 (-2.144, 4.260)(1,N2) 1166.5 (-2.501, 2.490)(1,N3) 1142.9 (-2.653, 1.755)(1,N4) 1121.6 (-2.773, 1.371)(1,N5) 1111.2 (-2.852, 1.125)(1,N6) 1093.1 (-2.932, 0.961)(1,N7) 1087.4 (-2.977, 0.835)(1,N8) 1081.7 (-3.015, 0.739)(1,N9) 1077.1 (-3.047, 0.663)(1,N10) 1066.5 (-3.089, 0.605)(1,N11) 1056.4 (-3.130, 0.557)(1,N12) 1059.5 (-3.135, 0.511)(1,N13) 1062.3 (-3.140, 0.472)(1,N14) 1072.8 (-3.126, 0.437)(1,N15) 1080.9 (-3.118, 0.406)(1,N16) 1091.9 (-3.102, 0.379)(1,N17) 1104.2 (-3.083, 0.354)(1,N18) 1112.1 (-3.075, 0.334)(1,N19) 1118.6 (-3.068, 0.315)(1,N20) 1126.5 (-3.058, 0.299)Table 10.1: BIC values for models including Nk for the extreme minimumtemperature process e(t) at the Medicine Hat site.28410.3. rth–order Markov models for extreme maximum temperaturesModel: Zt−1 BIC parameter estimates(1) 2539.9 (-0.0251)(1,e1) 1251.7 (-2.144, 4.260)(1,e2) 1473.6 (-1.856, 3.683)(1,e1,e2) 1157.7 (-2.501, 3.085, 1.896)(1,e1,e2,e1e2) 1162.4 (-2.586, 3.389, 2.190, -0.593)(1,mt1) 963.7 (0.109, -0.400)(1,mt1,mt2) 954.0 (0.091, -0.329, -0.082)(1,COS,SIN) 984.0 (-0.070, 4.292, 1.324)(1,COS,SIN,COS2,SIN2) 984.2 (-0.502, 4.505, 1.399, -0.464, -0.493)(1,COS,SIN,COS2) 986.7 (-0.258, 4.359, 1.335, -0.353)(1,COS,SIN,SIN2) 984.4 (-0.217, 4.365, 1.360, -0.402)(1,mt1,mt2,mt3) 940.7 (0.062, -0.319, -0.009, -0.094)(1,mt1,mt2,mt1mt2) 943.4 (0.211, -0.339, -0.084, -0.0091)(1,e1,COS,SIN) 901.5 (-1.008, 1.840, 3.325, 1.013)(1,mt1,COS,SIN) 855.3 (-0.074, -0.234, 2.394, 0.746)(1,mt1,mt2,COS,SIN) 861.9 (-0.076, -0.247, 0.023, 2.504, 0.785)Table 10.2: BIC values for several models for the extreme minimum tem-perature e(t) at the Medicine Hat site.Table 10.2 compares several models some of which include seasonal termsand continuous variables. The optimal model is (1,mt1,COS,SIN), whichhas the temperature of the previous day and seasonal terms. The model(1,e1,COS,SIN) has a larger BIC but is preferable to all models otherthan (1,mt1,COS,SIN) and (1,mt1,mt2,COS,SIN). Note that it is notpossible to compute the probability of events in the long-term future using(1,mt1,COS,SIN), since we do not know mt except for perhaps the presenttime. Hence the optimal applicable model seems to be (1,e1,COS,SIN).10.3 rth–order Markov models for extrememaximum temperaturesThis section finds appropriate models for the binary process of extremelyhot temperature E(t) as defined above. To define a hot day, we use the 95thpercentile of data from 25 stations over Alberta that had daily MT datafrom 1940 to 2004. The 95th percentile turns out to be q = 27 (deg C).Once we used the fast algorithm developed in Chapter 7 to pick the quantileand once we used an exact method; the algorithm gave us the approximatevalue q = 26.7, which is very close to the exact value. (See Table ?? formore details on the computation.)28510.3. rth–order Markov models for extreme maximum temperatures10.3.1 Exploratory analysis for extreme maximumtemperaturesThis section uses explanatory data analysis techniques to study the binaryprocess E(t). Again we use two stations for this purpose, the Banff andMedicine Hat sites that have data from 1895 to 2006. The transition proba-bilities are computed using the historical data considering years as indepen-dent observations. 
The results are summarized as follows:• Figures 10.9 and 10.10 plot the probabilities of a hot day over thecourse of a year for the Banff and Medicine Hat stations respectively.A regular seasonal pattern is seen. Medicine Hat seems to have a muchlonger period of hot days.• Figures 10.11 and 10.12 plot the estimated transition probabilities,ˆp01 and ˆp11 for Banff and Medicine Hat. If the chain were a 0th–orderMarkov chain then these two curves would overlap. This is not thecase so Markov chain of at least 1st–order seems necessary. In the ˆp01curve for both Banff and Medicine Hat, large fluctuations are seen inthe middle of the year, which corresponds to the warm season. Thisis not surprising because there are very few pairs in the data with ahot day followed by a not–hot day in the warm season in Alberta.• In Figure 10.12, ˆp11 is missing for a period over the cold season. Thisis because no hot day is observed during this period in the cold seasonand hence ˆp11 could not be estimated.• Figures 10.13 and 10.14 give the plots for the 2nd–order transitionprobabilities. They overlap heavily and hence a 2nd–order Markovchain does not seem to be necessary.10.3.2 Model selection for extreme maximum temperatureHere, we use the following abbreviations:• Ek(t) = E(t−k). Was it an extreme day k days ago?• MTk(t) = MT(t−k), the actual maximum temperature k days ago.• Nk, COS, SIN, COS, SIN2 and COS2 as previous sections.28610.3. rth–order Markov models for extreme maximum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbability of a hot dayFigure 10.9: The estimated probability of a hot day (maximum temperature≥ 27 (deg C)) for different days of the year for the Banff site calculated fromthe historical data.28710.3. rth–order Markov models for extreme maximum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbability of a hot dayFigure 10.10: The estimated probability of a hot day (maximum tempera-ture ≥ 27 (deg C)) for different days of the year for the Medicine Hat sitecalculated from the historical data.28810.3. rth–order Markov models for extreme maximum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.11: The estimated 1st–order transition probabilities for the binaryprocess of extremely hot temperatures for the Banff site. The dotted linerepresent the estimated probability of “E(t) = 1 if E(t−1) = 1” ( ˆp11) andthe dashed, “E(t) = 1 if E(t−1) = 0” ( ˆp01).28910.3. rth–order Markov models for extreme maximum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.12: The estimated 1st–order transition probabilities for the binaryprocess of extremely hot temperatures for the Medicine Hat site. The dottedline represents the estimated probability of “E(t) = 1 if E(t−1) = 1” ( ˆp11)and the dashed, “E(t) = 1 if E(t−1) = 0” ( ˆp01).29010.3. rth–order Markov models for extreme maximum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.13: The estimated 2nd–order transition probabilities for the bi-nary process of extremely hot temperatures for the Banff site with ˆp111(solid) compared with ˆp011 (dotted) calculated from the historical data.29110.3. rth–order Markov models for extreme maximum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.14: The estimated 2nd–order transition probabilities for the bi-nary process of extremely hot temperatures for the Banff site with ˆp001(solid) compared with ˆp101 (dotted) calculated from the historical data.29210.3. 
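The day-of-year transition probabilities shown in these figures can be estimated with a short sketch of the following form (hypothetical data frame layout; years are treated as independent replicates, and days on which a conditioning pattern never occurs are simply left missing, which is why some curves are absent over part of the year).

```python
import pandas as pd

def transition_estimates(df):
    """df: one station's record with columns year, doy and the 0/1 indicator column E."""
    df = df.sort_values(["year", "doy"]).copy()
    df["E1"] = df.groupby("year")["E"].shift(1)    # E(t-1)
    df["E2"] = df.groupby("year")["E"].shift(2)    # E(t-2)
    out = pd.DataFrame(index=sorted(df["doy"].unique()))
    patterns = {"p01": df["E1"] == 0,
                "p11": df["E1"] == 1,
                "p011": (df["E2"] == 0) & (df["E1"] == 1),
                "p111": (df["E2"] == 1) & (df["E1"] == 1)}
    for label, mask in patterns.items():
        out[label] = df[mask].groupby("doy")["E"].mean()   # relative frequency of E(t) = 1
    return out

# est = transition_estimates(medicine_hat_daily)
# est[["p011", "p111"]].plot()   # near-overlap is what suggests a 2nd-order chain is unnecessary
```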
rth–order Markov models for extreme maximum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.15: The estimated 2nd–order transition probabilities for the bi-nary process of extremely hot temperatures for the Medicine Hat site withˆp111 (solid) compared with ˆp011 (dotted), calculated from the historical data.29310.3. rth–order Markov models for extreme maximum temperatures0 100 200 3000.00.20.40.60.81.0Day of the yearProbabilityFigure 10.16: The estimated 2nd–order transition probabilities for the bi-nary process of extremely hot temperatures for the Medicine Hat site withˆp001 (solid) compared with ˆp101 (dotted) calculated from the historical data.29410.3. rth–order Markov models for extreme maximum temperaturesTable 10.3 compares several models containing Nk. The optimal modelturns out to be (1,N11) which is the same as the result for the extrememinimum temperature process e(t).Model: Zt−1 BIC parameter estimates(1,N1) 955.7 (-2.95, 3.82)(1,N2) 965.9 (-3.00, 2.16)(1,N3) 942.5 (-3.11, 1.60)(1,N4) 921.8 (-3.20, 1.29)(1,N5) 926.8 (-3.23, 1.05)(1,N6) 931.6 (-3.24, 0.89)(1,N7) 932.5 (-3.26, 0.78)(1,N8) 939.0 (-3.26, 0.69)(1,N9 931.6 (-3.29, 0.63)(1,N10) 925.9 (-3.31, 0.57)(1,N11) 911.7 (-3.35, 0.49)(1,N12) 917.5 (-3.34, 0.46)(1,N13) 922.8 (-3.33, 0.42)(1,N14) 926.0 (-3.32, 0.39)(1,N15) 932.1 (-3.31, 0.37)(1,N16) 941.7 (-3.29, 0.34)(1,N17) 951.5 (-3.28, 0.31)(1,N18) 955.3 (-3.27, 0.29)(1,N19) 960.6 (-3.26, 0.28)(1,N20) 968.3 (-3.25, 0.26)(1,N21) 975.3 (-3.23, 0.25)(1,N22) 981.8 (-3.22, 0.24)(1,N23) 986.0 (-3.22, 0.23)(1,N24) 991.6 (-3.21, 0.22)(1,N25) 997.0 (-3.21, 0.21)(1,N26) 1002.8 (-3.20, 0.20)(1,N27) 1009.5 (-3.19, 0.19)(1,N28) 1014.4 (-3.18, 0.19)Table 10.3: BIC values for models includingNk for the extremely hotprocessE(t).Table 10.4 compares several models. We observe that major reductionsare seen if we use MTk instead of Ek. The optimal model turns out tobe (1,MT1,COS,SIN) which is combination of seasonal terms and thetemperature of the day before.29510.4. Probability of a frost–free period for Medicine HatModel: Zt−1 BIC parameter estimates(1) 1520.3 (-1.774)(1,E1) 955.8 (-2.95, 3.82)(1,E2) 1170.5 (-2.581, 2.924)(1,E1,E2) 941.3 (-3.034, 3.179, 1.099)(1,E1,E2,E1E2) 929.0 (-3.202, 3.895, 2.137, -1.877)(1,MT1) 683.8 (-10.040, 0.362)(1,MT1,MT2) 689.1 (-10.135, 0.333, 0.034)(1,COS,SIN) 830.8 (-5.484, -5.616, -2.452)(1,COS,SIN,COS2,SIN2) 837.5 (-4.343, -4.255, -0.993, 0.113, 1.016)(1,COS,SIN,COS2) 837.9 (-5.850, -6.231, -2.406, -0.292)(1,COS,SIN,SIN2) 830.0 (-4.481, -4.492, -0.978, 1.011)(1,MT1,MT2,MT3) 669.2 (-10.885, 0.338, -0.061, 0.120)(1,MT1,MT2,MT1MT2) 681.9 (-21.003, 0.763, 0.452, -0.0162)(1,E1,COS,SIN) 731.3 (-4.963, 2.005, -4.096, -1.685)(1,MT1,COS,SIN) 649.9 (-10.281, 0.283, -2.829, -1.079)(1,MT1,MT2,COS,SIN) 657.3 (-10.109, 0.294, -0.011, -2.609,-1.072)Table 10.4: BIC values for several models for the extremely hot processE(t).10.4 Probability of a frost–free period forMedicine HatThis section shows how the approach developed above can be used in appli-cations. We use the developed methodology to compute two probabilities:• π1 : The probability of no frosts in the first week of October at theMedicine Hat site.• π2 : The probability of at least 5 days without frost in the first weekof October at the Medicine Hat site.The first day of October is the 275th day of the year in a leap year andthe 274th day of the year in a non–leap year. 
We compute the probabilitiesfor the week between 274th day and 281th day which corresponds to the firstweek of October in a non–leap year. We prefer this option to computing theprobability for the actual first week of October, since this corresponds betterto the natural cycles. Of course with a little modification one could computethe probability for the first week of October, for example by introducing aprobability of 1/4 for being in a leap year.Figure 10.17 plots the probability of a frost for each day of years since1985. Only years with more than 355 days of data are considered. The29610.4. Probability of a frost–free period for Medicine Hat1900 1920 1940 1960 1980 20000.00.20.40.60.81.0YearProbabilityFigure 10.17: Medicine Hat’s estimated mean annual probability of frostcalculated from the historical data.29710.4. Probability of a frost–free period for Medicine Hatfigure shows that the probability of a frost is fairly consistent over theyears, so we assume a constant probability of frost for all years. Table10.5 compares models with various Nk. The optimal model is (1,N11).Table 10.6 includes two seasonal terms as well as Nk. The optimum thistime (1,N1,COS,SIN), showing that in the presence of seasonal terms, theshort–term past modeled by Nk is not necessary.Model: Zt−1 BIC(1,N1) 5072.2(1,N2) 4634.8(1,N3) 4465.9(1,N4) 4407.4(1,N5) 4366.0(1,N6) 4357.4(1,N7) 4356.2(1,N8) 4342.6(1,N9) 4330.5(1,N10) 4329.1(1,N11) 4328.4(1,N12) 4332.4(1,N13) 4330.8(1,N14) 4345.1(1,N15) 4362.9(1,N16) 4385.7(1,N17) 4407.1(1,N18) 4420.1(1,N19) 4440.1(1,N20) 4463.7Table 10.5: BIC values for models including Nk for the extremely coldprocess e(t) at the Medicine Hat site.29810.4. Probability of a frost–free period for Medicine HatModel: Zt−1 BIC(1,N1,COS,SIN) 3601.3(1,N2,COS,SIN) 3654.8(1,N3,COS,SIN) 3693.9(1,N4,COS,SIN) 3735.2(1,N5,COS,SIN) 3763.1(1,N6,COS,SIN) 3791.0(1,N7,COS,SIN) 3813.5(1,N8,COS,SIN) 3826.2(1,N9,COS,SIN) 3834.9(1,N10,COS,SIN) 3843.6(1,N11,COS,SIN) 3849.8(1,N12,COS,SIN) 3855.5(1,N13,COS,SIN) 3857.4(1,N14,COS,SIN) 3862.9(1,N15,COS,SIN) 3868.1(1,N16,COS,SIN) 3873.7(1,N17,COS,SIN) 3877.9(1,N18,COS,SIN) 3878.6(1,N19,COS,SIN) 3880.5(1,N20,COS,SIN) 3882.8Table 10.6: BIC values for several models including Nk and seasonal termsfor the extremely cold process e(t) at the Medicine Hat site.Model: Zt−1 BIC parameter estimates(1) 10122.4 (-0.0858)(1,e1) 5072.2 (-2.13, 4.18)(1,e1,e2) 4598.2 (-2.530, 2.977, 2.00)(1,e1,e2,e1e2) 4582.8 (-2.65, 3.41, 2.43, -0.855)(1,COS,SIN) 3916.870 (-0.3, 4.301, 1.139)(1,COS,SIN,COS2,SIN2) 3865.6 (-0.746, 4.643, 1.253 -0.550 -0.504)(1,e1,COS,SIN) 3601.3 (-1.116, 1.760, 3.332, 0.856)(1,e1,COS,SIN,COS2,SIN2) 3566.7 (-1.49, 1.71, 3.65, 0.96, -0.48, -0.42)(1,e1,e2,COS,SIN) 3601.6 (-1.22, 1.66, 0.33, 3.19, 0.810)(1,e1,e2,COS,SIN,COS2,SIN2 3571.7 (-1.8, 1.7, 4.4, 1.3, -0.78, -0.74, 0.2, 0.4),COS3,SIN3)(1,mt1,COS,SIN,COS2,SIN2) 3356.4 (-0.66, -0.22, 2.85, 0.73, -0.56, -0.42)Table 10.7: BIC values for several models for the extremely cold process e(t)at the Medicine Hat site.29910.4. Probability of a frost–free period for Medicine HatCovariate Theoretical sd Experimental sd1 0.090 0.093e1 0.097 0.100COS 0.125 0.139SIN 0.060 0.059COS2 0.089 0.094SIN2 0.081 0.077Table 10.8: Theoretical and simulation estimated standard deviations forextremely cold process e(t) at the Medicine Hat site.Table 10.5 compares various models. 
Table 10.7 compares various models. The winner is (1,mt1,COS,SIN,COS2,SIN2). However, it is not possible to compute the desired probabilities using this model, since we do not know mt1 (except perhaps at the start of the chain). Among all the other models, the optimal is (1,e1,COS,SIN,COS2,SIN2), which we use to compute the probabilities.

We compute the standard deviations of the parameter estimates twice: once using simulations, by generating chains from the fitted model with the above covariates, and once by computing the partial information matrix, GN. The results are given in Table 10.8. The variance–covariance matrix calculated using partial likelihood theory is given below:

   0.0082  -0.0043  -0.0038  -0.0011   0.0050   0.0030
  -0.0043   0.0094  -0.0042  -0.0013   0.0002   0.0003
  -0.0038  -0.0042   0.0158   0.0038  -0.0052  -0.0037
  -0.0011  -0.0013   0.0038   0.0037  -0.0011  -0.0017
   0.0050   0.0002  -0.0052  -0.0011   0.0079   0.0015
   0.0030   0.0003  -0.0037  -0.0017   0.0015   0.0066

We also find the variance–covariance matrix using simulations. To do that we generate 50 chains over time using the estimated parameters. The variance–covariance matrix obtained from the simulations is:

   0.0087  -0.0035  -0.0054  -0.0012   0.0047   0.0021
  -0.0035   0.0101  -0.0058  -0.0009   0.0026   0.0012
  -0.0054  -0.0058   0.0194   0.0032  -0.0086  -0.0032
  -0.0012  -0.0009   0.0032   0.0035  -0.0011  -0.0018
   0.0047   0.0026  -0.0086  -0.0011   0.0089   0.0016
   0.0021   0.0012  -0.0032  -0.0018   0.0016   0.0059

We see that the simulated variance–covariance matrix is close to the one from partial likelihood theory, with all entries having the same sign. We also look at the distribution of the estimators using the 50 samples. Figure 10.18 shows that the parameter estimates approximately follow a normal distribution.

Figure 10.18: Normal curves fitted to the distributions of the 50 samples of the estimated parameters (one panel per coefficient, beta1 through beta6).

To estimate the desired probabilities, we generate 10000 samples of the parameter vector from a multivariate normal distribution, using the estimated parameters as the mean and the variance–covariance matrix above. To fix ideas, suppose we want to compute the probability of no frost between (and including) the 274th day and the 280th day of the year. For every vector of parameters, we compute the probability of observing (0,0,0,0,0,0,0), once given that it was below zero on the 273rd day and once given that it was above zero. In other words, we compute

  P(e(274) = 1, ..., e(281) = 1 | e(273) = 1)  and  P(e(274) = 1, ..., e(281) = 1 | e(273) = 0).

We also use the historical data to estimate p0 = P(e(273) = 1). Then the desired probability is

  P(e(274) = 1, ..., e(281) = 1) = p0 P(e(274) = 1, ..., e(281) = 1 | e(273) = 1)
                                   + (1 - p0) P(e(274) = 1, ..., e(281) = 1 | e(273) = 0).

Then, in order to get a 95% confidence interval, we use (q(0.025), q(1 - 0.025)), where q is the (left) quantile function of the vector of computed probabilities.

Using the historical data, we obtain p0 = P(e(273) = 1) = 0.2432432. Then for every parameter vector generated from the multivariate normal with this mean and variance–covariance matrix we can estimate the two probabilities π1 and π2.
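The Monte Carlo confidence-interval computation just described can be summarized in a few lines. The sketch below is an illustration under stated assumptions, not the thesis code: it takes the logistic model with covariates (1, e1, COS, SIN, COS2, SIN2), the point estimates and the partial-likelihood covariance matrix reported above, draws parameter vectors from the multivariate normal, computes the probability of the frost-free week for each draw by chaining the one-day conditional probabilities and mixing over the state of day 273, and takes the 0.025 and 0.975 quantiles. The numpy routines and the random seed are assumptions for illustration.

import numpy as np

def week_prob(beta, days, e_start):
    # P(e(t) = 1 for every t in `days`, given e on the previous day equals e_start),
    # under the logistic model with covariates (1, e_{t-1}, COS, SIN, COS2, SIN2).
    prob, e_prev = 1.0, e_start
    for t in days:
        x = np.array([1.0, e_prev,
                      np.cos(2*np.pi*t/365), np.sin(2*np.pi*t/365),
                      np.cos(4*np.pi*t/365), np.sin(4*np.pi*t/365)])
        prob *= 1.0 / (1.0 + np.exp(-x @ beta))
        e_prev = 1.0   # we condition on e(t) = 1 when stepping to the next day
    return prob

beta_hat = np.array([-1.49, 1.71, 3.65, 0.96, -0.48, -0.42])   # estimates (Table 10.7)
Sigma = np.array([                                             # partial-likelihood covariance above
    [ 0.0082, -0.0043, -0.0038, -0.0011,  0.0050,  0.0030],
    [-0.0043,  0.0094, -0.0042, -0.0013,  0.0002,  0.0003],
    [-0.0038, -0.0042,  0.0158,  0.0038, -0.0052, -0.0037],
    [-0.0011, -0.0013,  0.0038,  0.0037, -0.0011, -0.0017],
    [ 0.0050,  0.0002, -0.0052, -0.0011,  0.0079,  0.0015],
    [ 0.0030,  0.0003, -0.0037, -0.0017,  0.0015,  0.0066]])
p0 = 0.2432432                     # P(e(273) = 1), estimated from the historical data
days = range(274, 282)             # days 274, ..., 281

rng = np.random.default_rng(2)
draws = rng.multivariate_normal(beta_hat, Sigma, size=10000)
probs = np.array([p0 * week_prob(b, days, 1.0) + (1 - p0) * week_prob(b, days, 0.0)
                  for b in draws])
print(np.quantile(probs, [0.025, 0.975]))   # approximate 95% interval for the week probability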
We sample 10000 times from the multivariate normal, compute 10000 probabilities, and take the 0.025 and 0.975 (left) quantiles to get the following confidence intervals for π1 and π2, respectively:

  (0.28, 0.40)  and  (0.74, 0.85).

If we use the simulated variance–covariance matrix instead, we get the following confidence intervals for π1 and π2:

  (0.28, 0.40)  and  (0.75, 0.85),

which are very similar to the aforementioned intervals.

10.5 Possible applications of the models

To understand the potential applications of these models and results I contacted Dr. Nathaniel Newlands from AAFC (Agriculture and Agri-Food Canada). He gave the following insightful comments.

"Forecasted (probability of precipitation) is a leading indicator used by crop insurance companies. Probabilities of this kind (agroclimate) are typically most useful in early growing season by farmers in deciding planting dates and deciding on irrigation scheduling and ordering fertilizer and other kinds of inputs. Frost probability in latter growing season is critically important in deciding when to harvest crops before they have a higher potential for weather damage. So, essentially at the start and end of growing season, frost, precipitation (sometimes as a water stress index) and temp extremes are all informative for farmers and other decision makers in ag industry.

I would generally say that a broader set of probabilities like these are of special interest to the government side as they look for improving and/or developing new models, web portals and other tools to aid a wide array of the decision makers in the agricultural industry with their business decisions. Farmers (depending on what region of Canada they are in) are used to dealing with reoccurring weather and now climate change events, so often their viewpoint and decision needs are far more regionally specific than government which tries to balance regional with national needs and levels of risk to changing agroclimate.

The crop insurance industry is probably the most specific user of such information. For example, they base their insurance quotes for the event of precipitation on some specific times of the year."

Chapter 11
Conclusions and future research

11.1 Introduction

This chapter summarizes the work and draws conclusions from the statistical analysis and the theory developed in the previous chapters. We also point out a few topics for future research as a continuation of the work done in this thesis.

11.2 Summary

This thesis has presented statistical techniques we have developed to model precipitation and temperature over time. The dataset we use is the historical weather data published by Environment Canada [10]. Python code was written to extract the data from the binary format; the Python module is available at [23]. (See the appendices for more information regarding the dataset, the Python module and other resources.) We then performed an exploratory analysis of the data; see the conclusions section of Chapter 2 for details. In order to model the 0-1 precipitation process over time, rth-order Markov chains are a natural choice. We found a representation theorem for such chains using the conditional probabilities and used it to pick appropriate models for precipitation and dichotomized temperatures in the subsequent chapters. In order to dichotomize a continuous process (temperature), one can use quantiles as thresholds. The climate data are often very large, so computing quantiles exactly can be infeasible due to memory or space limitations. We propose an algorithm that uses smaller partitions of the data in order to approximate the quantiles and provides a measure of the goodness of such approximations.
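As a rough illustration of this partition-and-summarize idea (this is not the algorithm or the precision guarantee developed in the thesis), the following sketch keeps a small set of order statistics from each partition and approximates a quantile of the full dataset from those summaries alone; the placeholder data, the number of partitions and the summary sizes are assumptions introduced for illustration.

import numpy as np

def summarize(x, k):
    # Keep about k evenly spaced order statistics from a partition (including min and max).
    xs = np.sort(x)
    idx = np.unique(np.linspace(0, len(xs) - 1, k).round().astype(int))
    return xs[idx]

def approx_quantile(summaries, sizes, p):
    # Crude approximation of the p-th quantile of the concatenated data: weight each
    # saved point by the share of its partition it represents, then invert the weighted ECDF.
    pts, wts = [], []
    for s, n in zip(summaries, sizes):
        pts.extend(s)
        wts.extend([n / len(s)] * len(s))
    pts, wts = np.array(pts), np.array(wts)
    order = np.argsort(pts)
    cum = np.cumsum(wts[order]) / wts.sum()
    return pts[order][np.searchsorted(cum, p)]

rng = np.random.default_rng(3)
x = rng.gamma(2.0, 10.0, size=1_000_000)          # placeholder "large" dataset
parts = np.array_split(x, 20)                     # partitions (possibly of unequal size)
summaries = [summarize(part, 200) for part in parts]
sizes = [len(part) for part in parts]

print(approx_quantile(summaries, sizes, 0.95))    # approximation from the summaries only
print(np.quantile(x, 0.95))                       # exact value, for comparison

The approach developed in the thesis additionally quantifies the accuracy of such approximations in the probability-loss sense, which this sketch does not attempt.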
Thinking about the quantiles led us to extend the traditional definition of "quantiles" to the "left–" and "right–quantiles", and we showed through various theorems that this definition is more intuitively appealing and practically useful. For example, a symmetry relation holds with the new definition, which we used in various applications. In order to assess the goodness of approximating quantiles, we introduced the "probability loss function", which we showed is invariant under monotonic transformations. We used this loss function in various applications, such as picking optimal probability index vectors to summarize data vectors, or assigning quantiles to a random sample in order to make a quantile-quantile plot. We then used this loss function to define a distance between random variables and showed that this distance is also invariant under monotonic transformations. We also pointed out how the probability loss function and the distance defined by it can be used to estimate parameters of a distribution. Chapter 10 uses the above methods to find appropriate models for extremely high and low temperatures. For example, we show how these models can be used to build confidence intervals for the probability of a frost-free period.

11.3 Future research

In this section, we suggest a few lines of research that are continuations of this thesis work.

11.3.1 rth-order Markov chains

Chapter 3 developed a consistency theorem for the conditional probabilities of a discrete–time categorical stochastic process and a representation theorem for rth-order Markov chains. We expressed the conditional probabilities of such chains as linear combinations of monomials of past times and used partial likelihood to estimate the parameters in the binary case. We propose the following extensions to this work:

• Find a similar consistency theorem for general (not only categorical) discrete–time processes and a representation theorem for rth-order Markov chains.

• We used partial likelihood only to estimate the parameters in the binary case; an extension is needed for chains with a larger number of states.

• We pointed out in Chapter 3 that we can add other covariates to the linear terms to get non-stationary chains. We can also add spatial components to build spatial-temporal models. However, estimating the parameters in this case needs an extension of the theory due to the possible dependence over space.

• A Bayesian method could be deployed to estimate the parameters of these models.

11.3.2 Approximating quantiles and data summaries

We provided a general framework for summarizing data, combining summaries and making inferences about the original data. We propose the following research topics:

• Suppose a data vector x is given which is partitioned into x1, ..., xm of lengths n1, ..., nm. We are allowed to read the partitions separately and save k1, ..., km data points from these partitions.

  1. What information regarding x1, ..., xm (of lengths k1, ..., km) should be saved to optimally approximate lqx(p) for a fixed p?

  2. What information regarding x1, ..., xm (of lengths k1, ..., km) should be saved to optimally approximate lqx(p) for all p ∈ E ⊂ [0,1]?

  3. Suppose pre-defined summaries of x1, ..., xm are given which are not necessarily optimal. How can we optimally infer about lqx(p) for a fixed p, or lqx(p) for all p ∈ E ⊂ [0,1]?
  4. Suppose a fixed memory space is given. Find an optimal (fastest) algorithm which gives approximations of accuracy ε (in the probability loss sense).

• Suppose a random sample X1, ..., Xn is given. We can build distribution-free confidence intervals for quantiles of the underlying distribution (see [15]). Now suppose we have created a summary of this random sample in a certain way. Build confidence intervals based on these summaries.

11.3.3 Parameter estimation using probability loss and quantile distances

Chapter 9 developed a framework to estimate parameters of distributions. We also introduced the quantile distances in order to measure the distance between random variables and showed their invariance under monotonic transformations. We propose the following extensions:

• Given a random sample X1, ..., Xn, what is the best estimate of lqX(p) using the probability loss function? What are the properties of that estimator? Is it consistent?

• What are the suprema of LQD_{δX}(X,Y) and LQD_{δX+δY}(X,Y) over the space of all random variables?

• What is the relation between LQD_{δX}(X,Y) and LQD_{δY}(X,Y)?

• Do LQD1(X,Y) = LQD_{δX}(X,Y) or LQD(X,Y) = LQD_{δX+δY}(X,Y) satisfy the triangle inequality?

• Chapter 9 was a theoretical chapter. Substantial simulation studies and analysis of real data are needed to support the theory and generate new ideas.

Bibliography

[1] R. Agrawal and A. Swami. A one-pass space-efficient algorithm for finding quantiles. In Proc. 7th Intl. Conf. Management of Data (COMAD-95), 1995.
[2] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, pages 716–723, 1974.
[3] K. Alsabti, S. Ranka, and V. Singh. A one-pass algorithm for accurately estimating quantiles for disk-resident data. In VLDB '97: Proceedings of the 23rd International Conference on Very Large Data Bases, pages 346–355, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.
[4] T. W. Anderson and L. A. Goodman. Statistical inference about Markov chains. Ann. Math. Statist., pages 89–110, 1957.
[5] M. S. Bartlett. The frequency goodness of fit test for probability chains. Proc. Cambridge Philos. Soc., pages 86–95, 1951.
[6] J. Besag. Spatial interactions and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B, pages 192–225, 1974.
[7] P. Billingsley. Probability and Measure. John Wiley and Sons, 1985.
[8] R. W. Blum and J. W. John. Time bounds for selection. J. Comput. Sys. Sci., 7:448–461, 1973.
[9] L. Breiman. Probability. SIAM, 1992.
[10] Environment Canada. The climate CDs. http://www.weatheroffice.ec.gc.ca, 2007.
[11] G. Casella and R. L. Berger. Statistical Inference. Duxbury, 2001.
[12] E. H. Chin. Modeling daily precipitation process with Markov chain. Water Resources Research, (6):949–956, 1977.
[13] W. K. Ching, E. S. Fung, and K. M. Ng. Higher-order Markov chain models for categorical data sequences. Naval Research Logistics, pages 557–574, 2004.
[14] N. Cressie and L. Subash. New models for Markov random fields. Journal of Applied Probability, pages 877–884, 1992.
[15] H. A. David and H. N. Nagaraja. Order Statistics (3rd edition). Wiley, 2003.
[16] P. Embrechts, C. Klüppelberg, and T. Mikosch. Modelling Extremal Events for Insurance and Finance. Springer, 2001.
[17] J. E. Freund and B. M. Perles. A new look at quartiles of ungrouped data. The American Statistician, pages 200–203, 1987.
[18] K. R. Gabriel and J. Neumann. A Markov chain model for daily rainfall occurrence at Tel Aviv. Quart. J. Roy. Met. Soc., pages 90–95, 1962.
[19] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In SIGMOD, pages 58–66, 2001.
[20] E. J. Hannan. The estimation of the order of an ARMA process. Ann. Statist., pages 1071–1081, 1980.
[21] L. Hao and D. Q. Naiman. Quantile Regression. Quantitative Applications in the Social Sciences Series. SAGE Publications, 2007.
[22] D. M. A. Haughton. On the choice of a model to fit data from an exponential family. The Annals of Statistics, (1):342–355, 1988.
[23] R. Hosseini. Python module for Canadian climate data. http://bayes.stat.ubc.ca/∼reza/python, 2009.
[24] R. Hyndman and Y. Fan. Sample quantiles in statistical packages. The American Statistician, 1996.
[25] R. J. Hyndman and Y. Fan. Sample quantiles in statistical packages. The American Statistician, pages 361–365, 1996.
[26] R. Jain and I. Chlamtac. The P2 algorithm for dynamic calculation of quantiles and histograms without storing observations. Commun. ACM, 28(10):1076–1085, 1985.
[27] B. Kedem and K. Fokianos. Regression Models for Time Series Analysis. Wiley Series in Probability and Statistics, 2002.
[28] D. E. Knuth. Sorting and Searching, volume 3. Addison-Wesley, 1973.
[29] R. Koenker. Quantile Regression. Cambridge University Press, 2005.
[30] E. L. Lehmann and G. Casella. The Theory of Point Estimation. Springer-Verlag, 1998.
[31] L. P. Llorente. Statistical Inference Based on Divergence Measures. CRC Press, 2006.
[32] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. pages 426–435, 1998.
[33] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In SIGMOD, pages 251–262, 1999.
[34] E. Mekis and W. D. Hogg. Rehabilitation and analysis of Canadian daily precipitation time series. Atmosphere-Ocean, pages 53–85, 1999.
[35] S. E. Moon, S. Ryo, and J. Kwon. International Journal of Climatology, pages 1009–116, 1993.
[36] J. I. Munro and M. S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12:253–258, 1980.
[37] B. Øksendal. Stochastic Differential Equations: An Introduction with Applications. Springer, 2003.
[38] E. Parzen. Nonparametric statistical data modeling. Journal of the American Statistical Association, 74:105–121, 1979.
[39] M. Paterson. Progress in selection. pages 368–379, 1997.
[40] A. E. Raftery. A model for higher order Markov chains. J. R. Statist. Soc. B, (3):528–539, 1985.
[41] T. Rychlik. Projecting Statistical Functionals. Springer, 2001.
[42] G. Schwarz. Estimating the dimension of a model. Ann. Statist., pages 461–464, 1978.
[43] R. J. Serfling. Approximation Theorems of Mathematical Statistics. Wiley, 1980.
[44] R. Shibata. Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika, pages 117–126, 1976.
[45] H. Tong. Determination of the order of a Markov chain by Akaike's information criterion. J. Appl. Prob., pages 488–497, 1975.
[46] H. Tong and P. Gates. On Markov chain modelling to some weather data. Journal of Applied Meteorology, pages 1145–1151, 1976.
[47] L. A. Vincent, X. Zhang, B. R. Bonsal, and W. D. Hogg. Homogenization of daily temperatures over Canada. Journal of Climate, pages 1322–1334, 2002.
[48] W. Wong. Theory of partial likelihood. The Annals of Statistics, (1):88–123, 1986.
[49] F. F. Yao. On lower bounds for selection problems. Technical report, Cambridge, MA, USA, 1974.
Appendix A
Climate review

A.1 Organizations and resources

• WMO: The World Meteorological Organization (WMO) is a specialized agency of the United Nations. It is the UN system's authoritative voice on the state and behavior of the Earth's atmosphere, its interaction with the oceans, the climate it produces and the resulting distribution of water resources.

• Environment Canada: Environment Canada's mandate is to preserve and enhance the quality of the natural environment; conserve Canada's renewable resources; conserve and protect Canada's water resources; forecast weather and environmental change; enforce rules relating to boundary waters; and coordinate environmental policies and programs for the federal government.

• The Meteorological Service of Canada: The Meteorological Service of Canada is Canada's source for meteorological information. The Service monitors water quantities, provides information and conducts research on climate, atmospheric science, air quality, ice and other environmental issues, making it an important source of expertise in these areas.

• Natural Resources Canada

• Agriculture and Agri-Food Canada: Agriculture and Agri-Food Canada (AAFC) provides information, research and technology, and policies and programs to achieve security of the food system, health of the environment and innovation for growth. AAFC, along with its portfolio partners, reports to Parliament and Canadians through the Minister of Agriculture and Agri-Food and Minister for the Canadian Wheat Board.

• Alberta Agriculture, Food and Rural Development

• Statistics Canada

• AMS: The American Meteorological Society promotes the development and dissemination of information and education on the atmospheric and related oceanic and hydrologic sciences and the advancement of their professional applications. Founded in 1919, AMS has a membership of more than 11,000 professionals, professors, students, and weather enthusiasts. AMS publishes nine atmospheric and related oceanic and hydrologic journals (in print and online), sponsors more than 12 conferences annually, and offers numerous programs and services.

• GeoBase is a federal, provincial and territorial government initiative that is overseen by the Canadian Council on Geomatics (CCOG). It is undertaken to ensure the provision of, and access to, a common, up-to-date and maintained base of quality geospatial data for all of Canada. Through the GeoBase portal, users with an interest in the field of geomatics have access to quality geospatial information at no cost and with unrestricted use.

A.2 Definitions and climate variables

• Atmosphere: Gaseous envelope which surrounds the Earth. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Troposphere: Lower part of the terrestrial atmosphere, extending from the surface up to a height varying from about 9 km at the poles to about 17 km at the equator, in which the temperature decreases fairly uniformly with height. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Meteorology: Study of the atmosphere and its phenomena. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Climatology: Study of the mean physical state of the atmosphere together with its statistical variations in both space and time as reflected in the weather behavior over a period of many years. Definition source: International Meteorological Vocabulary, WMO - No. 182
• Hydrology: (1) Science that deals with the waters above and below the land surfaces of the Earth, their occurrence, circulation and distribution, both in time and space, their biological, chemical and physical properties, their reaction with their environment, including their relation to living beings. (2) Science that deals with the processes governing the depletion and replenishment of the water resources of the land areas, and treats the various phases of the hydrological cycle. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Basic topography: General geometrical configuration of the distribution of geopotential height on an isobaric surface or on a thickness chart, or of atmospheric pressure on a constant-height chart (e.g., mean sea-level surface chart). Definition source: International Meteorological Vocabulary, WMO - No. 182

• Weather: State of the atmosphere at a particular time, as defined by the various meteorological elements. Term source: International Meteorological Vocabulary, WMO - No. 182

• Climate: Synthesis of weather conditions in a given area, characterized by long-term statistics (mean values, variances, probabilities of extreme values, etc.) of the meteorological elements in that area. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Paleoclimate: Climate of a prehistoric period whose main characteristics may be inferred, for example, from geological and paleobiological (fossil) evidence. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Climate change: (1) In the most general sense, the term "climate change" encompasses all forms of climatic inconstancy (i.e., any differences between long-term statistics of the meteorological elements calculated for different periods but relating to the same area) regardless of their statistical nature or physical causes. Climate changes may result from such factors as changes in solar emission, long-period changes in the Earth's orbital elements (eccentricity, obliquity of the ecliptic, precession of the equinoxes), natural internal processes of the climate system, or anthropogenic forcing (e.g. increasing atmospheric concentrations of carbon dioxide and other greenhouse gases). (2) The term "climate change" is often used in a more restricted sense, to denote a significant change (i.e., a change having important economic, environmental and social effects) in the mean values of a meteorological element (in particular temperature or amount of precipitation) in the course of a certain period of time, where the means are taken over periods of the order of a decade or longer. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Climate model: Representation of the climate system based on the mathematical equations governing the behavior of the various components of the system, including treatments of key physical processes and interactions, cast in a form suitable for numerical approximation (generally now making use of electronic computers). Definition source: International Meteorological Vocabulary, WMO - No. 182

• Precipitation: Hydrometeor consisting of a fall of an ensemble of particles. The forms of precipitation are: rain, drizzle, snow, snow grains, snow pellets, diamond dust, hail and ice pellets. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Rainfall: Amount of precipitation which is measured by means of a rain gauge.
Definition source: International Meteorological Vocabulary, WMO - No. 182

• Atmospheric pressure: Pressure (force per unit area) exerted by the atmosphere on any surface by virtue of its weight; it is equivalent to the weight of a vertical column of air extending above a surface of unit area to the outer limit of the atmosphere. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Humidity: Water vapor content of the air. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Climatic season: A long spell of weather which characterizes part of the year and which occurs with some approach to regularity, especially in low latitudes. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Growing season: Season during which meteorological conditions are favorable to the growth of plants. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Dry season: Period of the year characterized by the (almost) complete absence of rainfall. The term is mainly used for low latitude regions. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Rainy season: In the lower latitudes, an annually recurring period of high rainfalls preceded and followed by relatively dry periods. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Flood: (1) The overflowing by water of the normal confines of a stream or other body of water, or the accumulation of water by drainage over areas which are not normally submerged. (2) Controlled spreading of water over a particular region. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Drought: (1) Prolonged absence or marked deficiency of precipitation. (2) Period of abnormally dry weather sufficiently prolonged for the lack of precipitation to cause a serious hydrological imbalance. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Drought index: An index which is related to some of the cumulative effects of a prolonged and abnormal moisture deficiency. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Climate system: System consisting of the atmosphere, the hydrosphere (comprising the liquid water distributed on and beneath the Earth's surface, as well as the cryosphere, i.e. the snow and ice on and beneath the surface), the surface lithosphere (comprising the rock, soil and sediment of the Earth's surface), and the biosphere (comprising Earth's plant and animal life and man), which, under the effects of the solar radiation received by the Earth, determines the climate of the Earth. Although climate essentially relates to the varying states of the atmosphere only, the other parts of the climate system also have a significant role in forming climate, through their interactions with the atmosphere. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Wind: Air motion relative to the Earth's surface. Unless otherwise specified, only the horizontal component is considered. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Statistical model: (1) Mathematical model which has been derived from the statistical analysis of relevant meteorological variables. (2)
Numerical model, usually of the general circulation, which predicts certain statistical properties of the atmosphere rather than the full three-dimensional, time-dependent distribution of each variable. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Statistical forecast: Objective forecast based on a statistical examination of the past behavior of the atmosphere, using regression formulae, probabilities, etc. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Probability forecast: Objective forecast based on a statistical examination of the past behavior of the atmosphere, using regression formulae, probabilities, etc. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Circulation model: Simplified representation of atmospheric flow used to study its principal characteristics. Definition source: International Meteorological Vocabulary, WMO - No. 182

• El Niño: An anomalous warming of ocean water off the west coast of South America, usually accompanied by heavy rainfall in the coastal region of Peru and Chile. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Hurricane: (1) Name given to a warm core tropical cyclone with maximum surface wind of 118 km/h (64 knots, 74 mph) or greater (hurricane force wind) in the North Atlantic, the Caribbean and the Gulf of Mexico, and in the Eastern North Pacific Ocean. (2) A tropical cyclone with hurricane force winds in the South Pacific and South-East Indian Ocean. Definition source: International Meteorological Vocabulary, WMO - No. 182

• Greenhouse effect: Warming of the lower layers of the atmosphere due to its different absorption properties for long- and short-wave radiation. Definition source: International Meteorological Vocabulary, WMO - No. 182

A.3 Climatology

A.3.1 General circulations

The forces that create the variety of land forms on the Earth can be categorized into two types:

• Inside forces: volcanoes, earthquakes, etc.

• Outside forces: forces that are conveyed by the atmosphere to the Earth's surface. The Sun is the most important factor in causing such forces in different forms.

Although the first type is of great importance and is not totally independent of the second type, here we only focus on the second type. Weather is defined to be the day-to-day variation in the state of the atmosphere. In order to understand the weather, we need to understand how such forces interact and the factors that cause such variations.

The climate system is composed of three parts:

• a radiative energy flow system
• a circulation system
• the water cycle

We explain these in the following. The Sun is the most important source of energy driving the climate system. The atmosphere reflects about 31 percent of the incoming energy back to space. It also absorbs (through ozone, water vapor and carbon dioxide) about 23 percent of the energy from the Sun before it reaches the Earth's surface. Finally, the Earth's surface absorbs about 46 percent. The Earth's surface radiates some of this energy back at longer wavelengths, which in turn is absorbed by the atmosphere; in fact the atmosphere absorbs long wavelengths more effectively. The presence of greenhouse gases (ozone, water vapor and carbon dioxide) in the atmosphere can cause the greenhouse effect by absorbing more of the energy at long wavelengths.
Also, some of the heat from the Earth goes back to the atmosphere indirectly through evaporated water.

Near the Equator the solar radiation reaches the Earth's surface at a steeper angle and with a shorter path through the atmosphere compared to the poles. This explains why it is warmer at the Equator than at the poles. Atmospheric circulations are created as a natural response to the difference in temperature between the Equator and the poles. However, other factors also have an effect: the Earth's rotation, the force of gravity, the temperature of the ocean and land, and the presence of topographical features such as mountains, plants, ice and so on.

A.3.2 Topography of Canada

A listing of the main features comprises the Western Cordillera, the Prairies, the Great Lakes, the Canadian Shield, the Gulf of St. Lawrence and the Arctic Islands. We only review the Prairies, which are the lands most suitable for farming.

The Prairies extend eastward from the Rocky Mountains, sloping down towards the great Canadian Shield. The elevations range from 1500 m in the west to about 250 m in Manitoba. The slope however is not even but is broken by steps, the Manitoba Escarpment and the Missouri Coteau. Minor hill rows tend to run parallel to these; the Cypress Hills however are an exception. A chain of large lakes in Manitoba marks the extent of a giant inland lake during glacial times. The rivers run from the Rockies toward the northeast, some into the Arctic Ocean, others into Hudson Bay. They are often cut deeply into the flat or slightly rolling, generally featureless plain.

A.4 Some interesting facts about Canadian geography and weather

• Total Area of Canada: The total area of Canada is 9,984,670 square kilometers. Of this, 9,093,507 square kilometers is land and 891,163 square kilometers is fresh water. Canada's area is the second largest in the world (after Russia, which has a total area of 17,075,000 square kilometers). On Canadian territory, the longest distance north to south (on land) is 4,634 kilometers, from Cape Columbia on Ellesmere Island, Nunavut, to Middle Island in Lake Erie, Ontario. The longest distance east to west is 5,514 kilometers, from Cape Spear, Newfoundland and Labrador, to the Yukon Territory-Alaska boundary.

• Boundary: The total length of the Canada-United States boundary is 8,890 kilometers.

• Landmass and Freshwater: Approximately 40% of Canada's landmass and freshwater is north of 60 degrees North latitude. Between them, the Northwest Territories and Nunavut contain 9.2% of the world's total freshwater. The area of Canada north of the treeline is 2,728,800 square kilometers, or 27.4% of the total area of the country.

• The Great Lakes: The Great Lakes (Superior, Michigan, Huron, Erie and Ontario) are the largest group of freshwater lakes in the world. They have a total surface area of 245,000 square kilometers, of which about one third is in Canada. Lake Michigan is entirely within the USA.

• Coastline: Canada has the world's longest coastline: 202,080 kilometers.

• Hailstorm: At the time it happened, the most expensive natural catastrophe in terms of property damage was a violent hailstorm that struck Calgary on September 7th, 1991.
Insurance companies paid about $400 million to repair over 65,000 cars, 60,000 homes and businesses, and a number of aircraft.

• Tornado: The Regina Tornado of June 30th, 1912, rated as F4 (winds of 330 to 416 kilometres per hour), was the most severe tornado so far known in Canada. It killed 28 people, injured hundreds and demolished much of the downtown area.

• Most Severe Flood: The most severe flood in Canadian history occurred on October 14th to 15th, 1954, when Hurricane Hazel brought 214 millimeters of rain to the Toronto region in just 72 hours.

• Manitoulin Island: The world's largest island in a freshwater lake is Manitoulin Island in Lake Huron, at 2,765 square kilometers.

• Mount Logan: The highest mountain in Canada is Mount Logan, Yukon Territory, at 5,959 meters.

• Medicine Hat: Medicine Hat is the driest city, with 271 days without measurable precipitation. [Source: Phillips, D. 1990. The Climate of Canada. Catalogue No. En56-1/1990E. Ottawa: Minister of Supply and Services of Canada.]

Appendix B
Extracting Canadian Climate Data from Environment Canada dataset

B.1 Introduction

In this document, some instructions are given for using the climate data provided by Environment Canada [10]. The data we are using are contained in a file which can be downloaded from the Environment Canada website:

http://www.weatheroffice.ec.gc.ca

"The National Climate Data and Information Archive, operated and maintained by Environment Canada, contains official climate and weather observations for Canada" (quoting from the website).

Environment Canada has published a series of climate data CDs: 1993, 1996, 2002 and 2007. The newest version is the 2007 CD. The Environment Canada website also includes other useful information, such as a glossary of terms used in the climate literature and some information about the files. In particular, the glossary includes the definition of precipitation:

Precipitation: The sum of the total rainfall and the water equivalent of the total snowfall observed during the day.

On the 2007 CD, data are stored in a binary format in several files. The CD includes two software tools for using the data, "cdcd" and "cdex", along with their manuals. "cdcd" is for viewing the data and "cdex" is for extracting the data. "cdex" can only extract the data for one climate station at a time, in certain formats which are not necessarily convenient to use in R (a well known statistical software package) or other statistical software. In these formats the longitude, latitude and elevation are missing. Hence, to get the data in our desired form, we need to read the binary files using another program. Bernhard Reiter has written code in Python to get the data, which is available online at

http://www.intevation.de/∼bernhard/archiv/uwm/canadian climate cdformat/

However, this code fails to get the data for a large proportion of the stations. We have modified the code to get the data for all stations. The modified code [23] is available at

http://bayes.stat.ubc.ca/∼reza/python

After getting the data, we need to write the data in our desired formats. We have also included many new functions in Python for different extraction purposes.

There are 7802 stations from all over Canada. The available variables are:

1. maximum temperature
2. minimum temperature
3. one-day rainfall
4. one-day snowfall
5. one-day precipitation
6. snow depth on the ground

These data are available both daily and monthly.
For each station the data are available for different intervals of time.

The data are saved in 8 directories on the CD, labeled 1, 2, ..., 8. They correspond to different regions of Canada:

1 --> British Columbia
2 --> Yukon, Nunavut and Northwest Territories
3 --> Alberta
4 --> Saskatchewan
5 --> Manitoba
6 --> Ontario
7 --> Quebec
8 --> Nova Scotia, Newfoundland and Labrador

Each directory contains a number of data files and index files. For example, directory 3, which corresponds to Alberta, contains the following files:

DATA.301, DATA.302, ..., DATA.308

and

INDEX.301, INDEX.302, ..., INDEX.308.

Each DATA file corresponds to the data of a region in Alberta, and the corresponding INDEX file contains the information about the stations in the given region. Figure B.1 shows the locations of the available stations over Canada.

Figure B.1: Canada site locations. (x-axis: longitude W; y-axis: latitude N.)

B.2 Using Python to extract data

In the following, we illustrate getting the data using the Python module "Reza_canadian_data.py". After opening the Python interface, we import some necessary packages and tell Python where the data are stored. Using sys.path.append, specify the directory where Reza_canadian_data.py is stored, as shown below. Also, define Topdirectory to be where the data are stored.

>>> import sys
>>> sys.path.append("D:\School\Research\Climate\Python_code")
>>> Topdirectory="D:\Data"
>>> from Reza_canadian_data import *
>>> stations=get_station_list(Topdirectory)

Once you have done that, you can call get_station_list from Reza_canadian_data to get the list of the stations available on the CD. Let us see how many stations we have access to:

>>> len(stations)
7802

Let us pick a station, say the one at position 2436 in the list, and find out its id and index record.

>>> s=stations[2436]
>>> s.stationnumber
'3025480'
>>> s.index_record
('5480', 'RED DEER A ', 'YQF', 5211, 11354, 905, 1938, 1938, 1938, 1938, 1938, 1938, 1955, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 9904)
>>> len(s.index_record)
21

The stationnumber attribute gives back the id of the given station on the CD. The stations in the same district start with the same numbers; for example, the stations in Alberta all start with 30, and so Red Deer is in Alberta. You can use cdcd to see the list of the stations and id numbers to figure out which ids correspond to which districts.

The index_record attribute gives the information available for the given station. There are many values and it is not obvious what they mean. As you can see, the index record has 21 components. Here is the explanation of each component:

1. The last four digits of the id
2. Station name
3. The three-character airport identifier that some stations have (e.g., "YWG" for Winnipeg); if none exists for this station then the field is left blank
4. Latitude
5. Longitude
6. Elevation
7. The first available year for max temperature
8. The first available year for min temperature
9. The first available year for mean temperature
10. The first available year for rainfall
11. The first available year for snowfall
12. The first available year for snow depth
13. The first available year for precipitation
14. The last available year for max temperature
15. The last available year for min temperature
16. The last available year for mean temperature
17. The last available year for rainfall
18. The last available year for snowfall
19. The last available year for precipitation
20. The last available ye