UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

An R package for monitoring test under density ratio model and its applications Hu, Boyi 2018

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


24-ubc_2018_september_hu_boyi.pdf [ 307.68kB ]
JSON: 24-1.0371169.json
JSON-LD: 24-1.0371169-ld.json
RDF/XML (Pretty): 24-1.0371169-rdf.xml
RDF/JSON: 24-1.0371169-rdf.json
Turtle: 24-1.0371169-turtle.txt
N-Triples: 24-1.0371169-rdf-ntriples.txt
Original Record: 24-1.0371169-source.json
Full Text

Full Text

An R package for monitoring test under density ratiomodel and its applicationsbyBoyi HuB.Sc., University of Science and Technology of China, 2013M.Sc., Lakehead University, 2016A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFMaster of ScienceinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Statistics)The University of British Columbia(Vancouver)September 2018c© Boyi Hu, 2018The following individuals certify that they have read, and recommend to the Fac-ulty of Graduate and Postdoctoral Studies for acceptance, a thesis/dissertation en-titled:An R package for monitoring test under density ratio model and its applicationssubmitted by Boyi Hu in partial fulfillment of the requirements forthe degree of Master of Sciencein StatisticsExamining Committee:Jiahua ChenSupervisorNatalia NoldeSupervisory Committee MemberSupervisory Committee MemberAdditional ExamineriiAbstractQuantiles and their functions are important population characteristics in many ap-plications. In forestry, lower quantiles of the modulus of rapture and other mechan-ical properties of the wood products are important quality indices. It is importantto ensure that the wood products in the market over the years meet the establishedindustrial standards. Two well-known risk measures in finance and hydrology,value at risk (VaR) and median shortfall (MS), are quantiles of their correspond-ing marginal distributions. Developing effective statistical inference methods andtools on quantiles of interest is an important task in both theory and applications.When samples from multiple similar natured populations are available, Chen et al.[2016] proposed to use a density ratio model (DRM) to characterize potential latentstructures in these populations. The DRM enables us to fully utilized the infor-mation contained in the data from connected populations. They further proposeda composite empirical likelihood (CEL) to avoid a parametric model assumptionthat is subject to model-mis-specification risk and to accommodate clustered datastructure. A cluster-based bootstrap procedure was also investigated for varianceestimation, construction of confidence interval and test of various hypotheses.This thesis contains complementary developments to Chen et al. [2016]. First,a user-friendly R package is developed to make their methods easy-to-use for prac-titioners. We also include some diagnostic tools to allow users to investigate thegoodness of the fit of the density ratio model. Second, we use simulation to com-pare the performance DRM-CEL-based test and the famous Wilcoxin rank test forclustered data. Third, we study the performance of DRM-CEL-based inferencewhen the data set contains observations with different cluster sizes. The simulationresults show that DRM-CEL method works well in common situations.iiiLay SummaryQuantiles and their functions are important population characteristics in many ap-plications. In this thesis, we focus on lower quantiles of lumber strength distribu-tion, which are usually used as quality indices for wood products. It is importantto ensure that the wood products in the market over the years meet the establishedindustrial standards. Therefore, monitoring some certain lower quantile of the dis-tribution of strength of wood products across years is of big concern. We first re-view some existing quantile monitoring methods. Based on one of these methods,we developed an user-friendly R package to make the method easy-to-use for prac-titioners. Further, with the help of this package, we investigate the performance ofthis monitoring method in a more general situation, which is not considered in theoriginal paper.ivTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiLay Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 The density ratio model . . . . . . . . . . . . . . . . . . . . . . . 31.2 Empirical likelihood under density ratio model . . . . . . . . . . 51.3 DRM-CEL for clustered data . . . . . . . . . . . . . . . . . . . . 61.4 Bootstrap and the monitoring test . . . . . . . . . . . . . . . . . . 81.5 Contribution of this thesis . . . . . . . . . . . . . . . . . . . . . . 102 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.1 Targeted statistical problems . . . . . . . . . . . . . . . . . . . . 132.2 Typical data structure . . . . . . . . . . . . . . . . . . . . . . . . 143 Numerical tools for DRM-CEL Monitoring Test . . . . . . . . . . . 17v3.1 Model Fitting: drm fit(· · · ) . . . . . . . . . . . . . . . . . . . . . 193.2 Quantile Estimates: quan est(· · · ) . . . . . . . . . . . . . . . . . . 233.3 Visualization Tool: plot cdf(· · · ) . . . . . . . . . . . . . . . . . . 243.4 Diagnostic Tools: plot kde(· · · ) and plot qq(· · · ) . . . . . . . . . . 263.5 Monitoring Test: monitor test(· · · ) . . . . . . . . . . . . . . . . . 283.6 Data Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 304 Monitoring tests for Clustered Data . . . . . . . . . . . . . . . . . . 374.1 Classical Wilcoxon Rank Sum Test . . . . . . . . . . . . . . . . . 384.2 Modified Wilcoxon Rank Sum Test . . . . . . . . . . . . . . . . . 394.3 Wilcoxon rank sum test for monitoring purpose . . . . . . . . . . 424.4 The DRM CEL monitoring test with unequal sized clusters . . . . 454.5 Performance of DRM CEL Monitor Test when Model is Misspecified 535 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60viList of TablesTable 4.1 Rejection Rates (%) of DRM CEL Monitoring Test and Modi-fied Wilcoxon Test at Nominal Level 95% . . . . . . . . . . . 45Table 4.2 Coverage probabilities (%) of two-sided 95% bootstrap confi-dence intervals under Normal models (type one) . . . . . . . . 49Table 4.3 Coverage probabilities (%) of two-sided 95% bootstrap confi-dence intervals under Gamma models (type one) . . . . . . . . 50Table 4.4 Coverage probabilities (%) of two-sided 95% bootstrap confi-dence intervals under Normal models (type two) . . . . . . . . 51Table 4.5 Coverage probabilities (%) of two-sided 95% bootstrap confi-dence intervals under Gamma models (type two) . . . . . . . . 52Table 4.6 Coverage probabilities (%) of two-sided 95% bootstrap confi-dence intervals under Normal models . . . . . . . . . . . . . . 55Table 4.7 Coverage probabilities (%) of two-sided 95% bootstrap confi-dence intervals under Gamma models . . . . . . . . . . . . . . 56viiList of FiguresFigure 3.1 Q-Q Plot of 2007 . . . . . . . . . . . . . . . . . . . . . . . . 32Figure 3.2 Q-Q Plot of 2010 . . . . . . . . . . . . . . . . . . . . . . . . 32Figure 3.3 Q-Q Plot of 2011 . . . . . . . . . . . . . . . . . . . . . . . . 33Figure 3.4 Kernel Density Estimate Plot of 2007 . . . . . . . . . . . . . 33Figure 3.5 Kernel Density Estimate Plot of 2010 . . . . . . . . . . . . . 35Figure 3.6 Kernel Density Estimate Plot of 2011 . . . . . . . . . . . . . 35Figure 3.7 Estimated Distributions under DRM . . . . . . . . . . . . . . 36viiiAcknowledgmentsForemost I would like to express my deepest gratitude to my supervisor, Profes-sor Jiahua Chen. His inspiration, guidance, encouragement and insight help methrough my master career at UBC. I learn a lot from him. I also want to thankProfessor Natalia Nolde for being the second reviewer of my thesis.I would like to give my sincere appreciation and gratitude to my parents for theirsupport and encouragement. This thesis is dedicated to them.ixPrefaceThis dissertation is an unpublished work by the author, Boyi Hu, under the super-vision of Professor Jiahua Chen.xChapter 1IntroductionPopulation quantiles and their functions are important measures in many applica-tions. For example, in forestry, it is important to maintain the value of the fifthlower percentile of lumber strength distribution. Such engineering properties arevital to the strength of wood structures. See American Society for Testing andMaterials (ASTM) Standard D1990 International [2007]. Two well-known riskmeasures in finance and hydrology, value at risk (VaR) Pflug [2000] and medianshortfall (MS) Bertsimas et al. [2004], are quantiles of the corresponding distribu-tions. They are generally used to evaluate the risk tolerance of financial institutionsand hydrological structures. Monitoring the levels of these quantiles is of great in-terest in these applications.This thesis is motivated by a research project on forestry at the University ofBritish Columbia through funding “The NSERC CRD FPInnovations grant” and bythe work of Chen et al. [2016]. The Canadian forestry industry is a major contrib-utor to the Canadian economy. Canada has 42 percent of its land acreage coveredby forests. The country contains 10 percent of the world’s forested land and theforests are made up mostly of spruce, poplar and pine. To meet the society’s long-term demand for forest products and the near-term economic benefit, responsibleforest management is required to ensure that forests are legally harvested and man-aged. Data collection and statistical analysis are important parts of the effectivemanagement.The properties of wood products may change over time due to some long term1trends such as the climate change, and some catastrophic short term impacts suchas extreme forest fire series and mountain pine beetles. All of these factors mightaffect the mechanical strength of wood materials. Therefore, people from forestryare very serious at monitoring the change of quality of wood products. To mea-sure the mechanical property of wood, modulus of elasticity (MOE) and modulusof rupture (MOR) Bier and Collins [1985] are two important and commonly usedquality indices Barrett and Kellogg [1991], Bendtsen and Senft [2007], Boone andChudnoff [1972], Smith et al. [1991]. MOE and MOR are measurements of lum-ber’s elasticity and toughness respectively. Lower quantiles of the population dis-tribution of these two indices are used to measure the reliability of wood products.The 5% quantile is usually a common choice. They are important design values forwood structures.Imagine populations made of lumber produced by a number of mills over years.The lumber samples may be collected as follows. Each year, a number of mills arerandomly chosen and then several lots of wood pieces are randomly sampled fromthem. From each lot sampled, several pieces of wood are selected to form a samplefrom this mill. They form the data set to be analyzed to decide whether the woodproduct meets the industrial standard and to provide other information useful towood industry and engineers. Clearly, the observations obtained are clustered insuch applications.Many standard statistical methods are developed for data in which the ob-served values may be regarded independent of each other. This includes the fa-mous Wilcoxon rank sum test, Kolmogorov goodness-of-fit test, and the methodof Anderson [1979]. The data collected from wood industry and likely in otherapplications are, however, often clustered. This can be seen from the exploratorydata analysis of real data and based on some background information Verrill et al.[2015]. Because the wood pieces sampled from the same lot/mill are likely fromthe same tree, or from trees grown in the same region and so on, they are morelikely to have have similar strengths.When observations are correlated but analyzed as if they are independent, theuncertainty in statistical inference is generally underestimated. This leads to under-coverage of the confidence intervals and inflated type I error of significance tests.To investigate the effect of clustered data to type I errors of some standard proce-2dures, Verrill et al. [2015] conducted extensive simulation studies of eight statisticaltests proposed by the United States Department of Agriculture. Their investigationlargely confirms the damaging effect of clustered data if the clustering structureis not properly looked after. In addition, although the parameter of interest in thetargeted application is lower quantiles, these methods are generally designed fromother population statistics. Hence, even if they work well for their original purpose,additional developments are in great needs.The forestry data and data from some applications have two important char-acteristics. The first is that we often have samples from multiple populations thatshare latent structures. The second is that the data can be clustered. The paperof Anderson [1979] proposes a semi-parametric density ratio model which takesthe shared latent structure into consideration; it further develops an empirical like-lihood approach for estimation, confidence interval and hypothesis test regardingpopulation quantiles. For clustered data in forestry and likely in some other ap-plications, the joint distributions are exchangeable. This leads to non-parametriccluster effect assumption in Chen et al. [2016] and the proposed composite empiri-cal likelihood. To avoid inflated type I error in monitoring tests, Chen et al. [2016]further propose to use cluster-based bootstrapping method for variance estimation.Asymptotically valid confidence intervals and hypothesis tests are subsequentlyobtained.This thesis contains complementary work to these two papers as well as newdevelopments to the research and application area. We first give a brief review ofthe materials in these two papers in the next section.1.1 The density ratio modelIn common problems, one may have a set of random observations from a popula-tion with distribution F . We may estimate F by its empirical distributionFn(x) = n−1n∑i=11(xi ≤ x)3where 1(·) is the indicator function. We have used x1, . . . ,xn as observed valueswith sample size n. For any p ∈ (0,1), the sample quantile is defined to beξˆp = F−1n (p) = inf{x : Fn(x)≥ p}.Generally speaking, the sample quantile ξˆp is a good nonparametric estimator Ser-fling [2009].When F is known to be a member of parametric distribution family, say, F(x)=F(x;θ). One may first estimate θ with an efficient method and subsequently esti-mate population quantile ξp by F−1(x; θˆ). This is also a valid approach.As pointed out in Anderson [1979], when we have data from multiple con-nected populations, the empirical quantiles fail to utilize this extra information.The efficiency of the parametric quantile estimator F−1(x; θˆ) cannot be further im-proved, but it can suffer from model mis-specification. If one wishes to avoid riskyparametric model assumption and make use of the information in multiple samples,a density ratio model is an attractive choice.Let G1,G2, . . . ,Gm be distribution functions of m connected populations fromwhich some observations are available. In forestry, they may stand for populationsfrom m calendar years. Let g1,g2, . . . ,gm be their density functions with respect tosome σ -finite measure. The DRM postulates thatgk(x)g1(x)= exp{θ τkq(x)} (1.1)for some known function q(x) of dimension d and corresponding unknown param-eter vector θ k, where θ τk is the transpose of θ k. The first entry of q(x) is generallyrequired to be constant one so that the first element of θ k is a normalization param-eter. The DRM is semi-parametric and it allows users to choose or specify the basisfunction q(x) to reflect the nature of the populations in applications. When the ba-sis function is properly chosen, the information from multiple samples is pooledtogether to permit more efficient statistical inference. The pooled information mayimprove the efficiency of each quantile estimate. Unless otherwise specified, wewill regard these densities as with respect to Lesbesgue measure.Take two risk measures, VaR and MS, in finance and hydrology as examples.4They are quantiles of the populations under investigation. Due to the lack of ob-servations in extreme tails of the distribution, non-parametric inferences are usu-ally not meaningful so that some parametric assumptions are usually specified.The generalized extreme value (GEV) distribution family is often assumed for dataanalysis. If we assume the shape parameters ξ of the corresponding distributionsare identical, we can verify that these distributions are special cases of DRM withproper basis functions. Under the assumption, it is advantageous to make inferenceunder the semi-parametric DRM.We now move to the review of empirical likelihood method under the densityratio model.1.2 Empirical likelihood under density ratio modelConsider the simple case where the observations in multiple samples are denotedas xk, j : k = 1,2, . . . ,m and j = 1,2, . . . ,nk. Suppose they are independent and thedistribution of Xk, j is given by Gk satisfying the DRM (1.1). In the spirit of Owen[2001], Chen and Chen [2000] restrict the form of G1 toG1(x) =∑k, jpk, j1(xk, j ≤ x),where pk, j is the likelihood contribution of observation xk, j. If not specified, therange over summation or product will be all possible values of the indices. Underthe DRM assumption, we may write Gr asGr(x) =∑k, jpk, j exp{θ τrq(xk, j)}1(xk, j ≤ x)for r = 1,2, . . . ,m, with θ 1 = 0. Since Gr’s are distribution functions, we musthave∑k, jpk, j exp{θ τrq(xx, j)}= 1 (1.2)for all r = 1,2, . . . ,m.5The empirical likelihood (EL) function is given byL(G1, . . . ,Gm) =∏k, jdGk(xk, j) = {∏k, jpk, j}exp{∑kθ τk∑jq(xk, j)}with pk, j and θ k satisfying (1.2).By standard method of Lagrange multipliers Nakayama et al. [1975], the fittedvalues of pk, j are given bypˆk, j =1n∑mr=1ρr exp{θˆτrq(xk, j)}where ρr = nr/∑mk=1 nk. The fitted cumulative distribution functions are given byGˆk(x) =∑k, jpˆk, j exp{θˆ τkq(xk, j)}1(xk, j ≤ x).Chen and Liu [2013] prove that the DRM-EL quantile estimators derived fromthe above Gˆk(x) are more efficient than the empirical quantiles. They also inves-tigate other properties such as the asymptotic normality, confidence intervals andthe hypothesis test problems. We will not go over these details.1.3 DRM-CEL for clustered dataAs explained earlier that the data are often clustered in applications. Ignoring thecluster structure may lead to inflated type I errors for hypothesis tests and under-coverage problem for the confidence intervals. Hence, the methods developed inChen and Liu [2013] must be modified. For this purpose, Chen et al. [2016] furtherdeveloped a DRM based composite likelihood approach. Consider the situationwhere each Xk, j,l is a d-dimensional random vector, where k is the population index,j is the cluster index and l is the data index within each cluster. Let Fk(y1, . . . ,yd) bethe joint distribution function of Xk, j,l . In many applications, it is quite reasonableto assume that this distribution is exchangeable. For example, when d = 4, theexchangeability of Fk meansFk(y1,y2,y3,y4) = Fk(y1,y3,y2,y4) = Fk(y4,y3,y1,y2) = · · · .6This implies that the marginal distributions of each entry are same. For this reason,the population distribution of any entry of Xk, j,l is given byGk(x) = Fk(x,∞,∞,∞) = Fk(∞,x,∞,∞) = Fk(∞,∞,x,∞) = Fk(∞,∞,∞,x).If so, the DRM-EL methods seem to work, at least artificially. We need not imposeadditional restrictions on Fk.We can motivate the direct use of previous DRM-EL methods through the no-tion of composite likelihood of Lindsay [1988].Consider a random vectorY , with probability density function f (y;θ) for someparameter vector θ ∈Θ. Denote by {A1, ...,AK} a set of events with associated like-lihoods Lk(θ ;y) ∝ f (y ∈ Ak;θ ). Following Lindsay [1988] a composite likelihoodis the weighted productLc(θ ;y) =K∏k=1Lk(θ ;y)wk ,where wk are nonnegative weights to be chosen. One should not assume any prop-erties of the authentic likelihood functions for Lc(θ ). One may choose wk to reflectthe some knowledge on the relationship between these events and to improve theinference efficiency Varin et al. [2011]. In Chen and Liu [2013], they consider thecomposite likelihood of each cluster. Specifically, for each clusterxk, j = (xk, j,1,xk, j,2, . . . ,xk, j,d)τ ,Chen and Liu [2013] suggest to have Ai as the event {Xk, j,i= xk, j,i} for i= 1,2, . . . ,d.With A1,A2, . . . ,Ad being symmetric (the exchangeability of the joint distributionFk), it is natural to have w1 = w2 = · · · = wd and define a composite likelihood tobeLc(G1, . . . ,Gm) =∏k, j{d∏l=1dGk(xk, j,l)}.defined on the space of G1, . . . ,Gm satisfyingGr(x) = ∑k, j,lpk, j,l exp{θ τrq(xk, j,l)}1(xk, j,l ≤ x)7with θ 1 = 0 and that for r = 1,2, . . . ,m,∑k, j,lpk, j,l exp{θ τrq(xk, j,l)}= 1.Following the same algebra in Section 1.2, the composite likelihood is maximizedwhenpˆk, j,l =1nd∑mr=1ρr exp{θˆτrq(xk, j,l)}where θˆ r is the maximum CEL estimator. Subsequently, the maximum CEL esti-mator of Gr(x) is given byGˆr(x) = ∑k, j,lpˆk, j,l exp{θˆ τrq(xk, j,l)}1(xk, j,l ≤ x).The DRM-CEL based quantile estimators are given by ξˆr,α = Gˆ−1r (α).Chen et al. [2016] prove that the DRM-CEL based quantile estimators remainconsistent as the total sample sizem∑k=1nkd increases under some conditions andare still asymptotically normal. However, their asymptotic variances are usuallylarger than those obtained when the data are not clustered. For this reason, theresults of Chen et al. [2016] on monitoring tests cannot be directly used. To addressthis problem, they propose to use resampling methods to avoid variance under-estimation. This is the topic of the next section.1.4 Bootstrap and the monitoring testAs pointed out from the beginning, one problem of interest in forestry is to monitorthe change of lower quantiles. This task may be regarded as a hypothesis test prob-lem. We wish to know whether the population quantile of one year is significantlylower than that of the previous year, or lower than some industrial standard.Because of this, we wish to test hypothesis in the form ofH0 : ξ1,α ≤ ξ2,α8versus one-sided alternativeH1 : ξ1,α > ξ2,αwhere ξ1,α ≤ ξ2,α are α-quantiles of the first two populations for some selectedquantile level α . We consider this problem when there may also be samples fromG3,G4 and so on. We also want to know whether the existing method works or notwhen data are clustered.The results in Chen et al. [2016] imply that the difference of two DRM-CELquantile estimators, ξˆk1,α− ξˆk2,α , where k1 and k2 are population indices, is asymp-totically normal with zero mean and finite variance. Therefore, we may use Waldmethod for the purpose of hypothesis test if we have an easy to use estimate ofthe asymptotic variance. It turns out that the asymptotic variance has a complexexpression and depends on specific cluster structures. Hence, a straight estimationof the asymptotic variance is not a convenient option.To overcome this difficult, Chen et al. [2016] propose a cluster-based bootstrapmethod to estimate the distribution of ξˆk1,α− ξˆk2,α . In this scheme, the nk bootstrapclusters are sampled from the observed clusters without replacement from the kthsample. A bootstrap sample is therefore formed as{x∗k, j; j = 1,2, . . . ,nk,k = 1,2, . . . ,m}.The DRM-CEL is then applied to this bootstrap sample to obtain ξˆ ∗k1,α− ξˆ ∗k2,α . Thisprocess is repeated a large number B times. The empirical distribution formed bythese B values of ξˆ ∗k1,α − ξˆ ∗k2,α is utilized to give an estimate of the distribution ofξˆk1,α − ξˆk2,α .Clearly, confidence intervals for ξˆk1,α − ξˆk2,α can then be constructed accord-ingly. From the relationship between the confidence interval and the hypothesistest, we further obtain a procedure for monitoring test.In Chen et al. [2016], the validity of this bootstrap procedure is summarized asthe following theorem.Theorem 1 Let ξk1,α ,ξk2,α be two population quantiles of Gk1 and Gk2 . Let ϕ(x,y)be a differentiable function. Let (ξˆk1,α , ξˆk2,α) and ξ ∗k1,α ,ξ∗k2,α be DRM-CEL quantileestimators based on the original sample and bootstrap sample. Then under some9conditions and as the total sample size n→ ∞,supx∣∣Pr∗(√n{ϕ(ξ ∗k1,α ,ξ ∗k2,α)−ϕ(ξˆk1,α , ξˆk2,α)} ≤ x)−Pr(√n{ϕ(ξk1,α ,ξk2,α)−ϕ(ξk1,α ,ξk2,α)} ≤ x)∣∣= op(1)where Pr∗ denotes the conditional probability given data corresponding to boot-strapping distribution.Note that a null hypothesis in the form of ξ1 = ξ2 can be expressed as ϕ(ξ1,ξ2)=0 with ϕ(x,y) = x− y. We test the hypothesisH0 : ϕ(ξk1,α ,ξk2,α) = 0by observing whether an appropriate levelled confidence interval of ϕ(ξk1,α ,ξk2,α)contains 0. We will give details of the bootstrap procedure later.1.5 Contribution of this thesisTo apply the methods developed by Chen and Liu [2013] and Chen et al. [2016] onrealistic problems, a user-friendly R package is needed. Several relevant R func-tions have been built by the authors to carry out the simulation studies in theirpapers, but these R functions are not easy to use by users from industry. Based onthese existing R functions, we further build a user-friendly R package called “dr-mmt” to implement these DRM based quantile estimation methods. The functionsin this package are designed for application requirements and easy to access.In this thesis, we first present the R functions in the package “drmmt”. Withthese R functions, we use simulation studies to compare the performances of dif-ferent methods for quantile estimations of clustered data, and investigate the per-formance of the method developed by Chen et al. [2016] in a more general case.In Chapter 2, we first present the data collection procedure for lumber example,followed by an introduction to the required data format of our R functions for anydata set. Although the method developed by Chen et al. [2016] is motivated bylumber project, their method actually can be applied to many other areas as longas the corresponding model assumptions are satisfied. For our R functions, as long10as the format of data set is complete, they can be applied on any data sets from anyarea. Then we present how to input a real data set into our R functions. In Chap-ter 3, we first review the statistical model and corresponding parameter estimationmethod Chen et al. [2016]. We present the statistical questions coming from thetargeted applications and the solutions of these questions. Then we present the Rfunctions in our package “drmmt”. For each R function, we illustrate its inputs andoutputs and demonstrate it by a code example. We explain the algorithms behindthese function and introduce the R functions being used from other R packages. Atthe end of this chapter, we use a real data example to demonstrate how to apply thefunctions from this package to a realistic problem.In Chapter 4, we first review the classical Wilcoxon rank sum test Lehmann[2006] and introduce a cluster-based Wilcoxon rank sum test developed by Rosneret al. [2003]. The classical Wilcoxon rank sum test is not appropriate for our tar-geted application because it is designed for independent dataset and it is not builtto detect the change in population quantiles. Even if the cluster-based Wilcoxonrank sum test has a good performance in the clustered dataset, its hypothesis isstill not testing the change in quantiles. Admittedly, under some extra assumption,the Wilcoxon rank sum tests can be used to detect the change in medians, but themedian is not a useful index in the targeted applications. In some common cases,the alternative hypothesis of this cluster-based Wilcoxon rank sum has the samedirection with the alternative hypothesis of the targeted applications, but they arenot always identical. It may bring potential risks for the statistical inference and inthe following section, we use simulation studies to illustrate this. Compared withthe Wilcoxon-type hypothesis tests, the DRM-CEL monitoring test of Chen et al.[2016] can be used to test the hypothesis related to more general population quan-tiles, not only the median. In the paper, one of the assumptions under which themonitoring that is developed, is that the clusters of samples have a same numberof observations. However, in realistic applications, a data set with equal clustersizes cannot always be guaranteed. If this is the case, whether we still can use thismethod to monitor the quantiles is of our interest. For example, the cluster sizes ofthe real data set we use in Chapter 4 are not equal, but we still use the DRM-CELmethod to analyze it. We loosen the restriction of cluster sizes of populations beingequal in Chen et al. [2016] and use simulation studies to investigate the influence11of unequal cluster sizes on the DRM-CEL method. The simulation results supportthat as long as the maximum cluster size over all populations is finite and the num-ber of clusters of each population is large enough, this DRM-CEL monitoring teststill works. I believe that the mathematical techniques behind this conclusion arevery similar to that given in Chen et al. [2016], Serfling [2009]. The theory behindthis conclusion is one of my further research topics. At the end of this Chapter,we use simulation studies to investigate the performance of long term monitoringtest when model is misspecified in the cases of cluster sizes being not equal. Asummary is given at the end of this thesis.12Chapter 2Problem DescriptionAs mentioned in the introduction, the wood quality is vital to all the other relatedareas. This quality can be depicted by some certain lower quantile of the distribu-tion of lumber mechanical strength. Such a quantile is called the quality index ofa population in this context. Before a statistical analysis is conducted, real worldproblems need to be defined clearly first and then to be transfered into statisticalproblems. Answers to these statistical problems are sought based on data collectedon representative samples from representative populations.2.1 Targeted statistical problemsWe identify a few typical statistical problems to be addressed as follows.• Does the quality index of a population meet the published standard?• Do two populations have the same quality index value?• Has the quality index of a population decreased?We design our numerical tools to handle data collected in the following schemes.Consider the situation where lumbers are produced in a population of mills.• A set of mills are selected with probability proportional to their volume ofproductions.13• For each sampled mills, a pre-specified number of lumbers are randomlyselected.2.2 Typical data structureThe chosen lots from the same year make a population of representative samplesof the lumber strength for that certain year. Wood pieces from the same mill havesimilar solid strength which has been verified by real data. Such a feature seemsreasonable. Wood pieces from the same mill are very likely from the same tree,or from trees grown in the same region. They are then very likely to have similarstrength. Moreover, the wood strength is also affected by the devices that theyare produced from. Wood pieces produced from same devices might have somesimilarity and this affect their strength. Both of them implies a correlation in woodstrength among the wood pieces sampled from the same mill. Therefore, woodpieces from the same mill should be treated as a cluster. The following examplepresents the information can be contained in a lumber strength data set.region mill lot piece mor1 1 1 1 2.512501 1 1 2 7.929801 1 1 3 5.66030... ... ... ... ...1 1 2 11 3.766501 1 2 12 2.645501 1 2 13 3.685301 2 1 1 6.850731 2 1 2 7.517071 2 1 3 4.71568... ... ... ... ...1 2 2 11 5.888081 2 2 12 4.914891 2 2 13 4.896441 2 2 14 6.41309... ... ... ... ...14In this data set, “region” indicates the place where lumber is collected. “mill”indicates the wood manufactories. “lot” is the index of lots for each manufactory.“mor” is the abbreviation of “modulus of rupture”, which is a material property,defined as the stress in a material just before it yields in a flexure test. We canobserve from this example data set that 13 pieces of wood from the first mill forma cluster and 14 pieces of wood from the second mill form another cluster.As mentioned above, to monitor the quality of lumber across years, we needto monitor the values of lumber strength distribution quantiles across years. Inother words, if we regard the year when lumber strength observations measuredas population index, our target is to compare some lower quantiles of differentpopulation distributions and then to figure out if the lumber quality become worseor not.Focusing on this quantile comparison problem, Chen et al. [2016] use the den-sity ratio model (DRM) to link the lumber strength distributions of different yearsand use the composite empirical likelihood method to estimate parameters. A hy-pothesis of the difference of two population quantiles is considered for the purposeof detecting the quantile change among different lumber strength distributions.Such a hypothesis is called monitoring test in the context because its motivationis to monitor lumber quality across years. A cluster-based bootstrap procedureis proposed to make statistical inference on this hypothesis and to construct con-fidence intervals of desired quantiles. In the paper, their simulation results sup-port that such confidence intervals have satisfactory precise coverage probabilities.This monitoring test controls type I error rates tightly with good power. For con-venience, let DRM-CEL denote the method from Chen et al. [2016]. More detailsof this long term monitoring test are presented in the next chapter.Although this DRM-CEL-based method is motivated by the project of moni-toring lumber quality across years, it is not limited to this specific area. Such amethod can be applied to any data set for the purpose of monitoring the quantilechange as long as the model assumptions are satisfied. Let xk, j,l denote the lth ob-servation in the jth cluster from the kth sample. Usually a qualified data set shouldcontain the information of population indices, cluster indices and observed valuesas follows.15population cluster observed values1 1 x_{1,1,1}1 1 x_{1,1,2}... ... ...1 2 x_{1,2,1}1 2 x_{1,2,2}... ... ...2 1 x_{2,1,1}2 1 x_{2,1,2}... ... ...In Chen et al. [2016], a solid theoretical foundation has been established fortheir DRM-CEL method and its performance has been evaluated by the correspond-ing simulation study. There are two remaining tasks on this method.The first one is to fill in the gap between the theoretical results and the corre-sponding applications. There are already some R program as the realization of thisDRM-CEL-based method but the program is developed for the research purposeand is not easy to access for users without much statistical background. For theconvenience of users from industry, a user-friendly R package that can be appliedeasily is needed. This package should contain the functions that can be used toobtain the desired quantile estimates, to visualize the results and to diagnose theresults.The second task is to extend their DRM-CEL quantile estimator and theircluster-based bootstrap method to a more general case. Chen et al. [2016] onlyconsider the scenario that clusters of all populations share a same cluster size. Inthis thesis, we will use simulation study to investigate the performance of theirmethod in the situation without the constraint of equal cluster size.16Chapter 3Numerical tools for DRM-CELMonitoring TestChen and Chen [2000], Chen and Liu [2013] and Chen et al. [2016] have devel-oped an extensive set of statistical tools for data analysis regarding quantiles andquantile functions given multiple samples. They have also subscribed many nu-merical recipes to carry out the data analysis. At the same time, they have yet tomaterialize all numerical recipes. Mostly, their developments enable them to in-vestigate the properties of proposed inference methods. In applications, users notonly must carry out the data analysis suggested by the authors, but also want to dosome exploratory analysis. As an important part of this thesis, we develop somenumerical tools for exploratory data analysis as well as tools for more elaboratestatistical inference.As mentioned in Chapter 2, three pieces of information are required for a dataset, population indices, cluster indices and observed values. We consider the situ-ation where the samples are organized into a list with the kth entry being a matrixor a data frame consisting of observations from the kth sample. Specifically, forclustered data, each sample needs to be saved in a matrix with two columns or adata frame with two columns (one column for cluster indices and one column forobservations). For the data frame format, users can specify variable names for thecluster indices and the observed values, and the order of two columns. For the ma-trix format, since its columns can not be named, by default, the first column is for17cluster indices and the second column is for the observed values. For independentdata, each sample is just saved in a vector of observed values. Here we use thesame data set in Chapter 2 as an example. The variable “mill” is treated as thepopulation index and “lot” is treated as the cluster index.region mill lot piece mor1 1 1 1 2.512501 1 1 2 7.929801 1 1 3 5.66030... ... ... ... ...1 1 2 11 3.766501 1 2 12 2.645501 1 2 13 3.685301 2 1 1 6.850731 2 1 2 7.517071 2 1 3 4.71568... ... ... ... ...1 2 2 11 5.888081 2 2 12 4.914891 2 2 13 4.896441 2 2 14 6.41309... ... ... ... ...If users want to use a data frame format, two possible ways of the first entry ofthe list are as follows.lot mor1 2.512501 7.929801 5.66030... ...2 3.766502 2.645502 3.6853018ormor lot2.51250 17.92980 15.66030 1... ...3.76650 22.64550 23.68530 2If users want to use a matrix format, there is only one way for the first entry ofthe list, which is as follows.1 2.512501 7.929801 5.66030... ...2 3.766502 2.645502 3.68530In the next section, we introduce the R functions in our package and we presenttheir inputs, outputs with examples and the algorithms behind them.3.1 Model Fitting: drm fit(· · · )The first numerical task of the DRM-CEL approach is to find the fitted values ofθ r: r = 1,2, . . . ,m. One must look forGr(x) = ∑k, j,lpk, j,l exp{θ τrq(xk, j,l)}1(xk, j,l ≤ x)satisfying θ 1 = 0 and for r = 1,2, . . . ,m,∑k, j,lpk, j,l exp{θ τrq(xk, j,l)}= 119that maximizeLc(G1, . . . ,Gm) =∏k, j{d∏l=1dGk(xk, j,l)}.Luckily, the numerical solution to the above problem in θ is the same as findthe maximum point of the following dual likelihood function:`n(θ ) =−∑k, j,llog[ m∑r=1ρr exp{θ τrq(xk, j,l)}]+∑k, j,lθ τkq(xk, j,l) (3.1)Once its maximum point θˆ is obtained, we havepˆk, j,l =1nd∑mr=1ρr exp{θˆτrq(xk, j,l)}.Subsequently, the maximum CEL estimator of Gr(x) is given byGˆr(x) = ∑k, j,lpˆk, j,l exp{θˆ τrq(xk, j,l)}1(xk, j,l ≤ x).See Chen and Liu [2013], Chen et al. [2016] Keziou and Leoni-Aubin [2008], andQin and Zhang [1997].It is interesting to find the dual likelihood function coincides with the likeli-hood function under multinomial logistic regression model Czepiel [2002]. Thereis a well written R-function “multinom(· · · )” from a R package called “nnet” forfitting a multinomial logistic regression model by finding the maximum likelihoodestimates of parameters Ripley et al. [2016]. We will adopt this function for ourpurpose and provide some explanations on the connections between the dual like-lihood and the likelihood under the multinomial logistic regression model.Let there be N experiment units divided into m groups. All responses takevalues in categories {1,2, . . . ,m}. For each unit, the probability that its responsefalls into category r ispir = Pr{Y = r}.Let ∆u,r = 1 if the response of the uth unit is in category r and ∆u,r = 0 otherwise.Let ∆u and δu be corresponding random vector and observed response vector. We20may then writePr{∆u = δu}= piδu,11 piδu,22 · · ·piδu,mm .Now, assume that pir’s are not the same for all units but satisfy a logistic regres-sion model: the log-odds of a unit falls into category r is a linear function of somecovariate zlog(pirpi1)= zτβ rfor r= 2, . . . ,m, where β r is a regression coefficient vector. We may let β 1 = 0 forconvenience.Given data of N units, assume the independence between units and the multi-nomial logistic regression, we get the log likelihood function`N(β ) =∑u[m∑r=1δu,r{zτuβ r}− log( m∑r=1exp{zτuβ r})].Clearly, `N(β ) is algebraically similar to the dual empirical likelihood given in(3.1). Hence, “multinom(· · · )” can be used for DRM-CEL based data analysis. Touse this function for DRM-CEL, we need to match the corresponding componentsto items in the logistic likelihood function.Let the observations in multiple sample problem be {xk, j,l}, where xk, j,l is thelth observation in cluster j of population k. The total number of observations isgiven by N = d∑k nk. Let q(x) be the user specified basis function of length T +1and (β 1,β 2, . . . ,β m) be regression coefficient matrix of size (T +1)×m.By regarding the lth unit in cluster j from population k as a unit whose responsefalls into category k with zk, j,l = q(xk, j,l) being its covariate zk, j,l . Using u forindex (k, j, l), we can see δu,k′ = 1 when k = k′ and δu,k′ = 0 otherwise. Under themultinomial logistic regression model, we get the log likelihood function`N(β ) =∑u[m∑r=1δu,r{β τrq(xu)}− log( m∑r=1exp{β τrq(xu)})].Note that δu,r = 0 unless u = (r, j, l) for some j, l. The above expression, when21written in terms of (k, j, l), becomes`N(β ) = ∑k, j,l[{β τkq(xk, j,l)}− log( m∑r=1exp{β τrq(xk, j,l)})].Meanwhile, the dual composite empirical likelihood would have been`d(θ) = ∑k, j,lθ τk q(yk, j,l)−∑k, j,llog( m∑r=1ρr exp{θ τr q(yk, j,l)})(3.2)where ρr = nr/n. Letβ r = θ r+(logρr,0, . . . ,0)τ ,we find`N(β ) = `d(θ )+m∑k=1nk logρk.Hence, if θˆ and βˆ are the maximum points of `d(θ ) and `N(β ) respectively, wemust haveθˆ k = βˆ k− (logρr,0, . . . ,0)τ .This relationship enables us to use “multinom(· · · )” to obtain βˆ of a multinomiallogistic regression and then θˆ of DRM.With the help of “multinom(· · · )”, I build an R function called “drm fit(· · · )”for model fitting.Input:• z: An optional character indicating the name of data values.• cluster: An optional character indicating the name of clusters.• data: A list of m samples. If data are clustered, a matrix format is requiredeach sample. If variable names are not provided by the user, by default, thefirst column of the matrix is cluster identifier; the second column is for theobserved values. If data are independent, each sample is saved in a numericalvector.• basis: A vector of characters defining the basis function q(x) required inDRM.22– Examples:basis= c(1, ‘x’, ‘xˆ2’)– The basis function can be user-defined functions. Example:ff = function(x) {x + sin(x)}basis=c(1,‘ff(x)’)Output:• distr est: The first output is a matrix with its rth column made of likeli-hood contributions of xk, j,l for the rth distribution function Gˆr(·) after xk, j,lis sorted. The estimated distributions Gˆr(·) can be obtained from this matrix.• theta hat: The second output is a matrix with columns θˆ 1, θˆ 2, . . . , θˆ m.• data: The third output is a matrix with the second column being the sortedobservations, and the first column is its population identifier.This output is designed to be used as input of other R functions in the pack-age.3.2 Quantile Estimates: quan est(· · · )After DRM-CEL estimate of the population distribution Gˆr is obtained, the corre-sponding quantile estimates are readily computed. We have written an R functionquan est(· · · ) for this purpose. Some numerical details are as follows.For population distribution Gr, we obtain information on the support of Gˆr,which is the original data set, and corresponding probabilities (or likelihood con-tributions) on the support.For any α ∈ (0,1), the DRM-CEL quantile of the population r is defined to beξˆr(α) = inf{x : Gˆr(x)≥ α}.The above definition does not ask us to solve equation Gˆr(x) = α because Gˆr(x) isnot continuous.23To simplify the computation, I use linear interpolation to obtain a continuousdistribution function Gˆ′r(x). After which, I solve Gˆ′r(x) = α and report its solutionas ξˆr(α).I use an R function wtd.quantile(· · · ) from R package “Hmisc” Harrell andDupont [2008] with argument: type=“ i/(n+1)”. This function uses linear interpo-lation.My R function for computing quantiles is called quan est(· · · ). Its input andoutput are as follows:Input:• fit: an output of drm fit(· · · ).– Example:fit = drm_fit(...)• quan: A numeric vector indicating quantile levels to be estimated. It can bea single value or several values.– Two examples:quan = 0.05quan = c(0.05, 0.10, 0.15)Output:• If the input of quan is a single value, α , then the output is a vector containingGˆ−1r (α) for r = 1, . . . ,m.If the input quan is a vector, α1,α2, . . . ,αt , then the output is a t×m matrixwith the ith row being a vector containing Gˆ−1r (αi) for i= 1, . . . , t.3.3 Visualization Tool: plot cdf(· · · )After DRM-CEL estimate of the population distribution Gˆr is obtained, a user islikely interested to have a visual impression of fitted distributions. We have writtenan R function plot cdf(· · · ) to make this task easy. It depicts the following function24for any r = 1,2, . . . ,m:Gˆr(x) = ∑k, j,lpˆk, j,l exp{θˆ τrq(xk, j,l)}1(xk, j,l ≤ x).Some numerical details are as follows.For population distribution Gr, we obtain information on the support of Gˆr,which is the original data set, and (or likelihood contributions) on the support.Gr(·) under the assumption is a continuous distribution while Gˆr(·) is a discretedistribution function. I first compute values ofGˆr(x) = ∑k, j,lpˆk, j,l exp{θˆ τrq(xk, j,l)}1(xk, j,l ≤ x).at y= xk, j,l for all k, j and l.For a user-specified set of r values, this R function plots curves connecting thepoints (xk, j,l, Gˆr(xk, j,l)) over all k, j and l by a default R function plot(· · · ) withargumentplot(..., type = ‘l’).This argument makes the function plot(·) plot the points first and then connect thesepoints with line segments.Its inputs and outputs are given as follows:Input:• fit: An output of drm fit(· · · ).– Example:fit = drm_fit(...)• index: A numeric vector indicating populations to be plotted. Single entry isalso permitted.• x name: An optional character for the name of the x-axis of the plot. If notspecified, the x-axis of the plot will not have a name.Output: The plot of Gˆr(x) of selected populations.253.4 Diagnostic Tools: plot kde(· · · ) and plot qq(· · · )When a DRM with user specified basis function q(x) is fitted, a user may wonderwhether the data and the fitted model actually match each other. While formalprocedures can be developed, one simpler solution is to create a density estimatebased on the DRM-CEL fit and compare it with the histogram of the original data;another is to have a quantile-quantile plot of the DRM-CEL estimate Gˆr againstthe empirical distribution from the rth population.We use plot kde(· · · ) to produce a kernel density estimate and plot it with thehistogram of the corresponding sample. If the model fits data well, the kerneldensity estimate should match the histogram well.We use plot qq(· · · ) to generate a plot between DRM-CEL-quantiles and sam-ple quantiles of a chosen population. If the model fits data well, these points shouldapproximate a line with slope 1.Here is the specifications for plot kde(· · · ).Input:• fit: A list that is the output of drm fit(· · · ).– Example:fit = drm_fit(...)• index: identity of the population to be plotted.• n : the number of grids required for the plot. The default value is 200.• x name: An optional character indicating the x-axis name of the plot. If notspecified, the x-axis will not have a name.• main: An optional character indicating the main title of the plot. If notspecified, the plot will not have a title.Output: The output of plot kde(· · · ) is a figure of kernel density estimation ofGˆk(x) together with a histogram of the corresponding sample.26Some details are as follows: we use hist(..., breaks = ’FD’, ...)with ‘Freedman-Diaconis rule’. It selects the size of the bins according toBin size = 2IQR(x)m1/3,where IQR(x) is the interquartile range of the data and m is the number of observa-tions in the sample x.We follow Chen and Liu [2013] to produce a kernel density estimate. Let K(·)be a commonly used kernel function. For some bandwidth b > 0, let Kb(x) =(1/b)K(x/b). Then a kernel estimate of gr(x) is given bygr(x) =∫Kb(x− y)dGˆr(y).In plot kde(· · · ), we set K(x) to the standard normal density function. We choosethe bandwidth b according to the rule of thumb of Silverman [2018].We select n points uniformly in the range of data at which the kernel densityestimates (KDE) are calculated. We obtain the density estimate curve by con-necting each pair of consecutive points. We overlay the density estimate over thehistogram.Now we give some details for the other diagnostic tool, plot qq(· · · ), which canbe used to produce a quantile-quantile plot stated as the above.Input:• fit: an output of drm fit(· · · ).Example: fit = drm_fit(...)• index: identity of the population to be plotted.• n : the number of quantiles used to plot. The default value is 200.• main: An optional character indicating the main title of the plot. If notspecified, the plot will not have a title.Output: The output of plot qq(· · · ) is a quantile-quantile plot between Gˆr(x) andthe sample quantiles of the rth sample. The n quantile levels, α1, ..., αn, are chosenevenly between 1% and 99%.27The linear interpolation is used to obtain quantiles of Gˆr(x) and empirical quan-tiles through R function wtd.quantile(· · · ) from the R package “Hmisc’ Harrell andDupont [2008]. We use “plot(· · · )” with argument “type = p”, “col = blue” and“pch = 3” to plot these two vectors of quantiles of length n. We add a straight redline starting from the origin with slope 1.3.5 Monitoring Test: monitor test(· · · )We have a function to carry out monitoring test for the following two types ofhypotheses:H0 : ξk1,α ≤ ξk2,α vs H1 : ξk1,α > ξk2,αH0 : ξk1,α = ξk2,α vs H1 : ξk1,α 6= ξk2,αfor any α ∈ (0,1).This R function, monitor test(· · · ), is designed to test the above hypothesesby the bootstrap method Efron [1992], Hall [1986, 1988]. The output of this Rfunction includes p-values of hypothesis tests of interest and one-sided/ two-sidedconfidence intervals of ξk2,α − ξk1,α . The following shows the inputs and outputsof this function.Input:• z: An optional character indicating the name of data values.• cluster: An optional character indicating the name of clusters.• data: A list containing the observations of variables. The format for data setof this function is the same as the R function drm fit(· · · ).• basis: A vector of characters indicating the basis function in DRM, that is,q(y).Example: basis= c(1, ‘x’)Example: ff = function(x) { x + sin(x)}; basis=c(1,‘ff(x)’)• quan: the levels of quantiles α in the above hypotheses, which can be asingle value or a vector of desired values.28Example: quan = c(0.05, 0.15)• Comp: identities of two populations.• alternative: either “two.sided”, “greater” or “less”.Suppose “Comp = c(i, j)”. The alternative hypotheses aretwo.sided : ξi,α 6= ξ j,α ;greater : ξi,α ≥ ξ j,α ;less : ξi,α ≤ ξ j,α .• B: number of bootstrap repetitions. The default value is 999.• ci level: nominal level of bootstrapping confidence intervals.Output:• $p value: p-values of desired hypothesis tests.• $CI 1: confidence limits of quantiles ξα1,i, ..., ξαL,i.• $CI 2: confidence limits of quantiles ξα1, j, ..., ξα1, j.• $CI diff: confidence limits of ξαl ,i−ξαl , j for l = 1, ...,L– If “alternative = ‘two.sided’ ”, confidence limits are two-sided.– If “alternative = ‘greater’ ”, then upper confidence limits.– If “alternative = ‘less’ ”, then lower confidence limits.A detailed algorithm of this R function is given as follows.• We sample nk clusters IIDfrom the kth sample for k = 1, ...,m, where nk isthe number of clusters in the kth sample. This forms a bootstrap sample.• Call a function similar to quan est(· · · ) to obtain {ξˆ ∗k2,α − ξˆ ∗k1,α}(b) for thebootstrap sample.29• Repeat the above to obtain, {ξˆ ∗k2,α − ξˆ ∗k1,α}(b), b= 1, ...,B.• For one-sided alternative, we compute the p-value as the percentage of timeswhen{ξˆ ∗k2,α − ξˆ ∗k1,α}(b) > 0in B bootstrap samples.• For two-sided alternative, we also compute the percentage of times when{ξˆ ∗k2,α − ξˆ ∗k1,α}(b) < 0and the twice of the lower percentage is the p-value.• The solution to the confidence interval is similar.3.6 Data ExampleIn this section, I apply these R functions to a real data set. The dataset containssamples from three populations which will be referred to as 2007, 2010 and 2011.There are 98, 282 and 445 modulus of rupture (MOR) measurements in the threesamples, respectively.I save the MOR measurements from three years into “lmbdata” following theformat mentioned before. The 1st, 2nd and 3rd entries of “lmbdata” are MORdata of 2007, 2010 and 2011 respectively. All of the measurements are treated asindependent observations.Several different choices for basis function q(x) are considered. We first fitthe model with “drm fit(· · · )” with different basis functions q(x) and then use“plot qq(· · · )” and “plot kde(· · · )” to generate diagnostic plots. Based on thesediagnostic plots, we finally use q(x) = (1, logx) in the DRM. The choice of basisfunction has a significant influence on model fitting. Fokianos and Kaimi [2006]quantified the effect of choosing an incorrect linear form of q(x). In general, thepoint estimate is adversely affect when model is misspecified. For this reason,Fokianos [2007] consider a model selection approach for DRM. Fokianos [2007]suggest to select q(x) as a linear combination of a rich class of functions. The most30appropriate q(x) is then determined by selecting a sub-vector of the current q(x).The classical model selection approaches can be applied here. With the followingcode, we obtain Figures 3.1, 3.2, 3.3, 3.4, 3.5 and 3.6.model_fit = drm_fit(z = ‘obs’, cluster = ‘id’,data = lmbdata, basis = c(1, ‘log(x)’))plot_qq(z = ’obs’, cluster = ’id’, data = lmbdata,distr = model_fit, index = 1)plot_qq(z = ’obs’, cluster = ’id’, data = lmbdata,distr = model_fit, index = 2)plot_qq(z = ’obs’, cluster = ’id’, data = lmbdata,distr = model_fit, index = 3)plot_kde(z = ’obs’, cluster = ’id’, data = lmbdata,distr = model_fit, n = 1000, index = 1,main = ’KDE’)plot_kde(z = ’obs’, cluster = ’id’, data = lmbdata,distr = model_fit, n = 1000, index = 2,main = ’KDE’)plot_kde(z = ’obs’, cluster = ’id’, data = lmbdata,distr = model_fit, n = 1000, index = 3,main = ’KDE’)We compute 5%, 10% and 15% quantiles of three populations with “quan est(· · · )”.qq = quan_est(distr = model_fit,quan=c(0.05, 0.1, 0.15))print(qq)The 5%, 10% and 15% quantile estimates of 2007 are 4.55, 4.97 and 5.26. The5%, 10% and 15% quantile estimates of 2007 are 4.55, 4.99 and 5.28. The 5%,10% and 15% quantile estimates of 2007 are 3.51, 4.02 and 4.48.We now test for the following hypotheses:H0 : ξ1,0.05 ≤ ξ2,0.05 vs H1 : ξ1,0.05 > ξ2,0.05;H ′0 : ξ2,0.05 ≤ ξ3,0.05 vs H ′1 : ξ2,0.05 > ξ3,0.05.These can be accomplished with the following commands:314 5 6 7 8 9456789Q−Q PlotDRMEMPFigure 3.1: Q-Q Plot of 20074 5 6 7 8 9 10456789Q−Q PlotDRMEMPFigure 3.2: Q-Q Plot of 2010p_value = monitor_test(z = ‘obs’, cluster = ‘id’,data = lmbdata, quan = 0.05,basis = c(1, ‘log(x)’), Comp = c(1,2),B = 999, ci_level = 0.95,alternative = ‘greater’)$p_value.322 4 6 8 10246810Q−Q PlotDRMEMPFigure 3.3: Q-Q Plot of 2011KDEMORDensity4 6 8 3.4: Kernel Density Estimate Plot of 2007p_value = monitor_test(z = ‘obs’, cluster = ‘id’,data = lmbdata, quan = 0.05,basis = c(1, ‘log(x)’), Comp = c(2,3),B = 999, ci_level = 0.95,33alternative = ‘greater’)$p_value.P-values of the above hypothesis tests are 0.47 and 0.01. The bootstrappingconfidence intervals of ξ1,0.05 − ξ2,0.05, and ξ2,0.05 − ξ3,0.05 are (−0.34,∞) and(0.76,∞). Since zero falls in the first interval but not the second one, we do nothave sufficient evident to reject H0 but we reject H ′0 at the significance level 5%,which implies that the lumber quality of 2011 very probably become worse com-pared with that of 2010.To visualize the estimated distributions, we use the following code to generatea plot of three estimated distributions under DRM.plot_cdf(distr = model$distr_est, index = c(1:3),x_name = ’MOR’)In Figure 3.7, the black line is the estimated distribution of 2007, the red line isthe estimated distribution of 2010 and the green line is the estimated distribution of2011. From the plot, we notice that the lumber quality of 2010 and 2011 are veryprobably better than that of 2007.34KDEMORDensity4 6 8 3.5: Kernel Density Estimate Plot of 2010KDEMORDensity4 6 8 3.6: Kernel Density Estimate Plot of 2011352 4 6 8 10 CDFsMORProbability123Figure 3.7: Estimated Distributions under DRM36Chapter 4Monitoring tests for ClusteredDataIn Chen et al. [2016], the problem of interest is to efficiently and accurately monitorthe change of population quantiles, which are often used as quality indices forwood products. In that paper, the performances of several conventional statisticaltests for monitor purpose are discussed and compared. These conventional tests areoften designed for assumed i.i.d. samples. When the observations are independent,the nonparametric Wilcoxon rank sum test has promised properties. However whenthe data are clustered, Wilcoxon rank sum test and most other tests all have inflatedtype I errors.If the monitoring targets are lower quantiles, then the Wilcoxon rank sum test isnot an appropriate choice in the first place. However, it has always been a familiartool for users. If the data distribution is not too far from normal, then the changein quantile often leads to other changes for which Wilcoxon rank sum test candetect effectively. Hence, if the Wilcoxon rank sum test can be modified to workfor clustered data, it may draw attention to many practitioners. For this purpose,we investigate the performance of a modified large-sample Wilcoxon rank sum testproposed by Rosner et al. [2003] and it incorporates the clustering effects. Rosneret al. [2003] have shown the type I error of this cluster-based Wilcoxon test closeto the nominal level when the sample size is not very small. It is of interest tocompare it with the DRM-CEL approach of Chen et al. [2016].37To distinguish the “old” Wilcoxon test and the “new” Wilcoxon test, we use“the classical Wilcoxon test” for the test designed for i.i.d. data and use “the mod-ified Wilcoxon test” to for the test designed for clustered data proposed in Rosneret al. [2003].The classical Wilcoxon test is often presented as an alternative to a t-test whenthe data are not normally distributed. Whereas a t-test is a test of population means,the Wilcoxon test is commonly regarded as a test of population medians. This isnot strictly true, and treating it as such can lead to inadequate analysis of data.The Wilcoxon test is a test of both location and shape. In the case where the onlydistributional difference is a shift in location, this can indeed be described as adifference in medians.In this chapter, we first review some details of the classical Wilcoxon rank sumtest. In section 4.2, we introduce the modified Wilcoxon rank test for the clustereddata proposed by Rosner et al. [2003].4.1 Classical Wilcoxon Rank Sum TestThe Wilcoxon rank sum test is a nonparametric test for the hypothesis that it isequally likely that an observation from one population will be less than or greaterthan an observation from a second population. Suppose that we have two samplesof sizes m and n, respectively. Wilcoxon test is conducted under the assumptionthat the observations from both samples are independent of each other.Wilcoxon test works for the null hypothesisH0 : Pr{X > Y}= Pr{Y > X}.The alternative hypothesis H1 could be a two-sided: Pr{X > Y} 6= Pr{Y > X}; ora one-sided such as Pr{X > Y}< Pr{Y > X}.We denote the combined sample by Z. The sample size of Z will be denotedby N = m+n. DefineW =m∑i=1Rank(Xi).where Rank(Xi) is the rank of i-th observation in the X sample among the combined38Z sample of m+n observations. The expectation and variance of W are given byE(W ) = m(N+1)/2 (4.1)Var(W ) = [mn/{N(N−1)}]N∑i=1(Rank(Zi)− 1+N2 )2. (4.2)The Wilcoxon rank sum statistic is defined to beT = {W −E(W )}/{Var(W )}1/2,The test statistic T is asymptotically standard normal. Wilcoxon rank sum testuses the asymptotic distribution to determine the critical regions Lehmann [2006].The expression 4.2 works well in the situation of independent data. For clustereddata set, the observed values from a cluster tend to have similar ranks among thecombined sample and therefore, their ranks also form a cluster as well. In thissituation, 4.2 may overestimate the variance ofW and this leads to a inflated type Ierror when using Wilcoxon rank sum test on clustered data set.4.2 Modified Wilcoxon Rank Sum TestThe test statistic of the modified Wilcoxon rank sum test is derived in two steps.Consider a simple case that data are balanced so that there are equal number ofobservations in each cluster and the sub-units in clusters are exchangeable. Let mand n denote the numbers of clusters in X and Y respectively, and let the clustersize be g.Let us keep Z for the combined data set and use Zi j for jth sub-unit of the ithcluster. Let δi = 1 if the ith cluster is in sample X and δi = 0 if it is in sample Y .Because the observations in a cluster are assumed exchangeable, we still use X andY for single random variables from two populations. The null hypothesis remainsto beH0 : Pr{X > Y}= Pr{Y > X}.The alternative hypothesis H1 again could be a two-sided: Pr{X > Y} 6= Pr{Y >X}; or a one-sided such as Pr{X > Y}< Pr{Y > X}.Let Ri j be the rank of jth sub-unit in the ith cluster among all Ng sub-units.39DefineWc =N∑i=1δiRi+where Ri+ = ∑gj=1Ri j. In other words, Wc is the same as the classical Wilcoxonrank sum test statistic after regarding sub-units as independent observations frompopulations X and Y .The expectation and variance of Wc are given byE(Wc) = mg(Ng+1)/2 (4.3)Var(Wc) = [mn/{N(N−1)}]N∑i=1[Ri+−g(1+Ng)/2]2. (4.4)The modified Wilcoxon rank sum test statistics Zc is defined to beTc = {Wc−E(Wc)}/{Var(Wc)}1/2When N is large, Tc is asymptotically standard normal.When cluster sizes are not the same, assume the sub-units in the same clusterare exchangeable. This implies that all units from the same population have thesame marginal distribution.Let (mg,ng) = number of clusters of size g in population X and Y respectively.The cluster size of the ith cluster is denoted as gi. Let gmax = maxi{gi}, then1≤ gi ≤ gmax}. Denote Ng =mg+ng and N =gmax∑g=1Ng. Similarly, let Ri j be the rankof jth sub-unit in the ith cluster among allgmax∑g=1gNg sub-units, and Ri+ = ∑gij=1Ri j.DefineWc =N∑i=1δiRi+,where δi is same as the above. Let ∆ig = 1 if the size of the ith cluster is g and40∆ig = 0 if its size is not g. The expectation and variance of Wc are given byE(Wc) =gmax∑g=1mg[N∑i=1∆igRi+]/Ng (4.5)Var(Wc) =gmax∑g=1[ngmg/{Ng(Ng−1)}]Ng∑i=1(∆igRi+− [N∑i=1∆igRi+]/Ng)2. (4.6)In this situation, they find thatTc = {Wc−E(Wc)}/{Var(Wc)}1/2is asymptotically standard normal.Compared with 4.2, 4.4 and 4.6 take the effect of clusters into account whichimprove the accuracy of normal approximation, as supported by simulations inRosner et al. [2003]. Specifically, the simulation in Rosner et al. [2003] shows thatwhen the sample size is above 80, the normal approximation leads to tests withaccurate sizes.414.3 Wilcoxon rank sum test for monitoring purposeBased on the simulation results, Rosner et al. [2003] find the type I error of the clas-sical Wilcoxon rank sum test is seriously inflated and the modified Wilcoxon testhas an accurate type I error and a good power. Recall that the statistical problemof interest is to test the hypothesisH0 : ξ1,α ≤ ξk,α against Ha : ξ1,α > ξk,αwhere index k refers to the kth population in the multi-sample setting.Strictly speaking, both Wilcoxon tests are not developed for the comparison oftwo quantiles. Wilcoxon test works for the null hypothesisH0 : Pr{X > Y}= Pr{Y > X}.The alternative hypothesis H1 could be a two-sided: Pr{X > Y} 6= Pr{Y > X}; ora one-sided such as Pr{X > Y} < Pr{Y > X}. However, this null hypothesis iscommonly regarded as a test of population medians,H0 : Xm = Ym,where Xm and Ym are medians of X population and Y population, respectively. Thispremise is based on a misunderstanding of the null hypothesis as pointed out byKruskal and Wallis [1952]. The practical value of this is hard to see, and thus thenull hypothesis of Wilcoxon test sometimes is presented as “the two populationshave equal medians”. The actual null hypothesis can be expressed as the lattermedian hypothesis, but only under the additional assumption that the shapes of thedistributions are identical in each group. In other words, the interpretation of thetest as comparing the medians of the distributions requires the location-shift-onlyalternative to be the case. For most of cases,Pr{X > Y}= Pr{X < Y}< Xm = Ym.Even if, we assume the location-shift-only alternative, the Wilcoxon-type tests42still not appropriate for the hypothesis of our interestH0 : ξ1,α ≤ ξk,α against Ha : ξ1,α > ξk,αWhat we would like to show here is that the modified Wilcoxon test can rejectH0 for a wrong reason: an observation from one population has a higher probabilityof being larger than the observation from another population but their αth quantilesare equal. We demonstrate this point with a targeted simulation experiment.Specifically, we carry out a simulation study based on a normal random effectmodel. We simulate data with two samples from the following normal randomeffect model:yk jl = µk+ γk j+ εk jlfor k = 1,2, j = 1, . . . ,40 and l = 1, . . . ,d, where γk j is the random effect and εk jlis the error term. The random effects and error terms are normally distributedindependent of each other. Based on this model, we generate two samples eachwith 40 clusters.In Scenario 1, we setµ1 = 0.8224, γ2, j ∼ N(0,0.5), ε2, j,l ∼ N(0,1.75);µ2 = 0, γ1, j ∼ N(0,0.5), ε1, j,l ∼ N(0,0.5).We put the cluster size at g= 10. The marginal distribution of y1, j,l isN(0.8224,2.25)and the marginal distribution of y2, j,l isN(0,1). The 5% quantile ofN(0.8224,2.25)is equal to the 5% quantile of N(0,1), and Pr(y1,· > y2,·)> Pr(y1,· < y2,·).In Scenario 2, we setµ1 = 1.6449, γ2, j ∼ N(0,0.5), ε2, j,l ∼ N(0,3.5);µ2 = 0, γ1, j ∼ N(0,0.5), ε1, j,l ∼ N(0,0.5).We put the cluster size at g = 5. The marginal distributions are N(1.6449,4) andN(0,1). The 5% quantiles of N(1.6449,2) and N(0,1) are equal, and Pr(y1,· >y2,·)> Pr(y1,· < y2,·).43In Scenario 3, we setµ1 = 0, γ1, j ∼ N(0,0.5), ε1, j,l ∼ N(0,0.5);µ2 = 0, γ2, j ∼ N(0,0.5), ε2, j,l ∼ N(0,1.75).We put the cluster size at g= 5. The marginal distributions areN(0,1) andN(0,2.25).Their 5% quantiles are not equal, and Pr(y1,· > y2,·) = Pr(y1,· < y2,·).In Scenario 4, we setµ1 = 0, γ1, j ∼ N(0,0.5), ε1, j,l ∼ N(0,0.5);µ2 = 0, γ2, j ∼ N(0,0.5), ε2, j,l ∼ N(0,3.5).We put the cluster size at g= 5. The marginal distributions are N(0,1) and N(0,4).The 5% quantiles are not equal, and Pr(y1,· > y2,·) = Pr(y1,· < y2,·). .Let two pairs of hypotheses beH0 : ξ1,α ≤ ξ2,α against Ha : ξ1,α > ξ2,α .andH ′0 : Pr(y1,· > y2,·)≤ Pr(y1,· < y2,·) against H ′a : Pr(y1,· > y2,·)> Pr(y1,· < y2,·).In Scenarios 1 and 2, H0 is true but H ′0 is not true, and in Scenarios 3 and 4, H′0 istrue but H0 is not true. The simulated rejection rates are given in Table 4.1From Table 4.1, in Scenarios 1 and 2, where H0 is true, the DRM CEL methodhas accurate type I errors when the nominal level of hypothesis test is 5%. How-ever, in these two scenarios, the modified Wilcoxon test rejects H0 wrongly with re-jection rates 94.4% and 100% among 1000 repetitions. In Scenarios 3 and 4, whereH0 is not true, the DRM CEL monitoring test rejects H0 with rejection rates 22.7%and 60.7% among 1000 simulations. However, in these two scenarios, the modifiedWilcoxon test only rejects the hypothesis with rates 6.3% and 7.2%. Therefore, ifthe problem of interest is to compare some quantile differences, the DRM CELmonitoring test is more appropriate and reasonable than the modified Wilcoxon44Table 4.1: Rejection Rates (%) of DRM CEL Monitoring Test and ModifiedWilcoxon Test at Nominal Level 95%Setting Method Rejection Rate (×100)Scenario 1DRM CEL 5.1Modified Wilcoxon 94.4Scenario 2DRM CEL 5.7Modified Wilcoxon 100Scenario 3DRM CEL 22.7Modified Wilcoxon 6.3Scenario 4DRM CEL 60.7Modified Wilcoxon 7.2test.4.4 The DRM CEL monitoring test with unequal sizedclustersChen et al. [2016] develop a DRM CEL monitoring test under the assumption thatall clusters have the same size. From my observation, this requirement is not nec-essary in applications. The R-package we developed earlier can be easily appliedto clustered data with unequal sized clusters. We are therefore interested to knowif the tests still have accurate sizes, namely, whether their type I errors are closeto nominal values. In addition, it is of interest to compare the performance of theDRM CEL monitoring test and the cluster-based Wilcoxon rank sum test.We first use simulation to investigate the coverage precision of the confidenceintervals constructed by the cluster-based bootstrap for quantiles and quantiles dif-ferences. The simulation study sheds light on whether unequal cluster sizes impactthe performance of bootstrap-based confidence intervals. We also like to know howlarge the sample must be before the similar asymptotic results become applicable.Two data generating models are used in our simulation study. They are givenas follows. For convenience, we call them “normal model” and “gamma model” inthe following context.45• Normal Random Effect Modelyk, j,l = µk+ γk, j+ εk, j,lfor k = 1,2,3,4, j = 1, ...,nk and l = 1, ...,dk, j, where nk is the number ofclusters of kth populations and dk, j is the cluster size of the jth cluster of thekth population. γk, j is the term of random effect and εk, j,l is the error term.The random effects and error terms are normally distributed and independentof each other.• Gamma Random Effect Model Nadarajah and Gupta [2006]Given the population index k and cluster index j, let U1,...,Udk, j be dk, j i.i.d.random variables with beta distribution having shape parameters a and b(positive constant). Further letWk, j be i.i.d. gamma-distributed random vari-ables with shape parameter a+b and rate parameter β .That is,Wk, ji.i.d∼ Gamma(a+b,β );U1, . . . ,Udk, ji.i.d∼ Beta(a,b).DefineYτk, j =Wk, j× (U1, . . . ,Udk, j).The distribution of Yτ is the multivariate gamma with corr(Yi,Yj) = a/(a+b). The marginal distribution of Yi is Gamma with shape parameter a andrate parameter β .We apply the DRM CEL monitoring test via the R function “monitor test(· · · )”to datasets generated from the above models. We summarize the two-sided cover-age probabilities of two chosen quantiles and their difference.In the simulation study, we consider the problem with four populations. Twotypes of data generating procedures are used. Let vector (n1,n2,n3,n4) denote thenumbers of clusters of the four samples. The values of (n1,n2,n3,n4) are pre-fixedbefore the simulations. For the first type of data generating procedure, in eachrepetition, we simulate data as follows.46• We first set a range for cluster sizes (l, l+1, ..., l+ p);• We randomly assign a value from (l, l+ 1, ..., l+ p) to each cluster of eachsample;• Given the values of (n1,n2,n3,n4) and the randomized cluster sizes from(l, l+1, ..., l+ p), we repeat the following simulation 1000 times:– In each repetition, two data sets are simulated from the above two mod-els respectively and we apply the R function “monitor test(· · · )” onthem to obtain the results of our interest.The second type of data generating procedure is as follows.• We first set a range for cluster sizes (l, l+1, . . . , l+ p);• Based on the values of (n1,n2,n3,n4) and (l, l+ 1, . . . , l+ p), we repeat thefollowing simulation 1000 times:– In each repetition, we randomly assign a value from (l, l+1, . . . , l+ p)to each cluster of each sample;– Given randomized cluster sizes, two data sets are simulated from theabove two models respectively;– We apply the R function “monitor test(· · · )” on them to obtain the re-sults of our interest.In the first type of data generating procedure, cluster sizes are randomly de-cided before the repetition is started. The same cluster sizes are used over the 1000repetitions. While in the second type of data generating procedure, the cluster sizesare randomly assigned for every repetition. The cluster sizes are different over the1000 repetitions. We want to investigate the influence of the two different types ofunequal cluster sizes on the DRM CEL cluster-based bootstrap method.For both data generating procedure, several sets of the sample sizes n1, n2,n3 and n4 are chosen in the simulation. Given (n1,n2,n3,n4), we use differentranges of cluster sizes and parameter values. A wider range of cluster sizes meansthere is more difference between clusters and different parameter values in the datagenerating models implies different correlation strengths within clusters.47Specifically, we simulate data with two settings: (n1,n2,n3,n4)= (38,45,60,60)and (n1,n2,n3,n4) = (63,75,100,100). For given (n1,n2,n3,n4), the range of clus-ter sizes is chosen as (5,6,7,8) to (3, . . . ,10).We calculate the ratios of the true values of target quantiles and their differencefalling in the bootstrap intervals, which give the coverage probabilities. Tables 4.2and 4.3 present coverage probabilities based on the first type of simulated data fromthe normal and gamma models, respectively. Tables 4.4 and 4.5 present coverageprobabilities based on the second type of simulated data from two models.Based on the simulation results, we find the coverage probabilities are closerto the nominal 95% when the sample sizes are larger under the normal model. Theprecision is also higher when the error variances are smaller. Although there aremany occasions where the simulated coverage probability is as low as 91%, a largeproportion of them are near 94% or higher, but never too much higher than 95%.Hence, we conclude that the DRM-CEL and bootstrap based confidence intervalshave satisfactorily precise coverage probabilities even when data contain unequalsized clusters.48Table 4.2: Coverage probabilities (%) of two-sided 95% bootstrap confidenceintervals under Normal models (type one)Cluster Sizes Target α = 0.05 α = 0.10 α = 0.05 α = 0.10(n1,n2,n3,n4) = (38,45,60,60)(σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4) (σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4)=(2.25, 2.25, 1.00, 1.00) =(1.44, 1.44, 1.00, 1.00)5 - 8 ξ1,α 92.2 91.4 92.8 91.6ξ2,α 91.0 93.3 93.6 93.1∆ξ1,2,α 92.8 93.8 94.6 93.34 - 9 ξ1,α 91.2 92.0 92.7 93.1ξ2,α 91.7 92.9 93.5 94.0∆ξ1,2,α 93.6 94.9 94.7 94.83 - 10 ξ1,α 91.7 91.5 91.9 91.5ξ2,α 90.7 92.3 92.0 92.5∆ξ1,2,α 93.0 94.0 94.3 93.7(n1,n2,n3,n4) = (50,60,80,80)(σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4) (σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4)=(2.25, 2.25, 1.00, 1.00) =(1.44, 1.44, 1.00, 1.00)5 - 8 ξ1,α 91.9 91.4 93.9 91.6ξ2,α 93.4 93.3 94.6 92.9∆ξ1,2,α 93.4 93.8 95.1 95.14 - 9 ξ1,α 91.6 92.0 92.9 92.8ξ2,α 93.2 92.9 94.1 93.1∆ξ1,2,α 95.0 94.9 95.8 95.93 - 10 ξ1,α 91.6 91.5 92.2 92.5ξ2,α 91.4 92.3 92.7 92.8∆ξ1,2,α 94.0 94.0 94.4 94.1(n1,n2,n3,n4) = (63,75,100,100)(σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4) (σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4)=(2.25, 2.25, 1.00, 1.00) =(1.44, 1.44, 1.00, 1.00)5 - 8 ξ1,α 93.5 92.4 94.0 93.3ξ2,α 94.5 91.5 94.3 93.3∆ξ1,2,α 94.4 94.5 94.6 94.54 - 9 ξ1,α 91.9 92.5 93.7 93.0ξ2,α 93.2 92.6 93.8 92.3∆ξ1,2,α 95.3 94.9 94.7 94.63 - 10 ξ1,α 92.4 91.6 93.0 92.9ξ2,α 93.1 92.3 93.6 93.2∆ξ1,2,α 94.5 93.9 94.3 94.449Table 4.3: Coverage probabilities (%) of two-sided 95% bootstrap confidenceintervals under Gamma models (type one)Cluster Sizes Target α = 0.05 α = 0.10 α = 0.05 α = 0.10(n1,n2,n3,n4) = (38,45,60,60)b= 14 b= 365 - 8 ξ1,α 91.4 91.5 93.5 93.2ξ2,α 90.7 92.1 93.2 94.0∆ξ1,2,α 93.2 93.6 94.1 93.84 - 9 ξ1,α 91.8 93.7 93.2 94.6ξ2,α 92.5 94.5 93.2 95.2∆ξ1,2,α 93.0 92.9 94.2 94.63 - 10 ξ1,α 93.4 92.5 93.7 94.5ξ2,α 93.8 93.0 94.5 94.7∆ξ1,2,α 93.1 93.2 95.0 95.3(n1,n2,n3,n4) = (50,60,80,80)b= 14 b= 365 - 8 ξ1,α 92.4 93.8 93.3 92.4ξ2,α 93.2 93.7 93.2 93.3∆ξ1,2,α 93.9 94.3 94.4 94.34 - 9 ξ1,α 95.1 94.8 94.7 94.2ξ2,α 94.6 94.3 94.7 94.9∆ξ1,2,α 95.2 94.7 95.0 95.13 - 10 ξ1,α 93.6 93.4 93.3 94.1ξ2,α 94.1 92.4 94.4 94.5∆ξ1,2,α 93.8 94.0 94.6 95.2(n1,n2,n3,n4) = (63,75,100,100)b= 14 b= 365 - 8 ξ1,α 92.1 92.4 93.0 92.7ξ2,α 92.5 93.0 92.6 92.3∆ξ1,2,α 94.7 94.5 93.7 93.44 - 9 ξ1,α 93.2 94.6 94.2 94.4ξ2,α 94.0 94.8 95.4 94.8∆ξ1,2,α 94.6 94.3 94.9 94.93 - 10 ξ1,α 93.5 92.9 94.5 93.8ξ2,α 93.9 93.5 94.4 94.8∆ξ1,2,α 94.5 94.5 95.2 95.250Table 4.4: Coverage probabilities (%) of two-sided 95% bootstrap confidenceintervals under Normal models (type two)Cluster Sizes Target α = 0.05 α = 0.10 α = 0.05 α = 0.10(n1,n2,n3,n4) = (38,45,60,60)(σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4) (σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4)=(2.25, 2.25, 1.00, 1.00) =(1.44, 1.44, 1.00, 1.00)5 - 8 ξ1,α 93.3 92.7 93.5 93.0ξ2,α 92.5 93.4 93.7 92.5∆ξ1,2,α 93.2 92.9 94.5 94.24 - 9 ξ1,α 91.3 92.0 94.3 94.1ξ2,α 91.5 92.4 94.9 93.0∆ξ1,2,α 94.0 93.6 94.1 93.83 - 10 ξ1,α 92.3 93.2 93.1 92.7ξ2,α 91.9 92.3 93.8 91.8∆ξ1,2,α 95.2 95.2 94.8 94.4(n1,n2,n3,n4) = (50,60,80,80)(σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4) (σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4)=(2.25, 2.25, 1.00, 1.00) =(1.44, 1.44, 1.00, 1.00)5 - 8 ξ1,α 91.9 93.4 94.4 92.8ξ2,α 92.0 93.3 94.4 92.7∆ξ1,2,α 94.6 94.1 93.6 93.64 - 9 ξ1,α 91.9 93.4 93.9 94.3ξ2,α 93.1 92.3 93.7 94.3∆ξ1,2,α 94.7 94.3 94.9 95.43 - 10 ξ1,α 91.4 92.9 94.2 94.0ξ2,α 93.8 93.7 93.9 93.0∆ξ1,2,α 94.7 95.2 94.9 95.2(n1,n2,n3,n4) = (63,75,100,100)(σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4) (σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4)=(2.25, 2.25, 1.00, 1.00) =(1.44, 1.44, 1.00, 1.00)5 - 8 ξ1,α 93.1 94.5 93.1 93.2ξ2,α 92.8 93.1 93.6 92.7∆ξ1,2,α 95.3 95.4 92.3 92.24 - 9 ξ1,α 93.3 94.5 94.0 93.5ξ2,α 94.3 94.3 94.0 94.2∆ξ1,2,α 94.9 94.7 94.4 93.83 - 10 ξ1,α 93.0 93.7 93.7 92.4ξ2,α 92.4 92.3 92.8 93.2∆ξ1,2,α 93.6 93.5 93.5 93.951Table 4.5: Coverage probabilities (%) of two-sided 95% bootstrap confidenceintervals under Gamma models (type two)Cluster Sizes Target α = 0.05 α = 0.10 α = 0.05 α = 0.10(n1,n2,n3,n4) = (38,45,60,60)b= 14 b= 365 - 8 ξ1,α 92.4 92.0 91.7 92.9ξ2,α 92.1 92.5 91.7 92.7∆ξ1,2,α 94.6 94.8 94.1 94.64 - 9 ξ1,α 93.2 93.7 93.9 93.8ξ2,α 93.1 94.6 93.5 94.1∆ξ1,2,α 94.1 93.9 93.9 94.13 - 10 ξ1,α 93.1 92.8 93.9 95.3ξ2,α 93.0 92.4 93.8 94.2∆ξ1,2,α 93.4 93.5 93.9 94.2(n1,n2,n3,n4) = (50,60,80,80)b= 14 b= 365 - 8 ξ1,α 91.7 93.4 93.1 93.6ξ2,α 92.7 94.2 93.5 94.0∆ξ1,2,α 93.9 94.2 93.7 93.44 - 9 ξ1,α 94.6 93.0 92.6 93.5ξ2,α 94.3 92.9 92.6 94.0∆ξ1,2,α 93.6 94.1 94.2 94.13 - 10 ξ1,α 93.3 93.0 93.3 94.9ξ2,α 93.1 93.6 93.1 94.5∆ξ1,2,α 94.1 93.9 94.2 94.8(n1,n2,n3,n4) = (63,75,100,100)b= 14 b= 365 - 8 ξ1,α 92.0 93.7 94.4 94.7ξ2,α 92.0 94.3 94.4 93.8∆ξ1,2,α 94.8 94.8 93.1 93.54 - 9 ξ1,α 93.5 94.0 93.4 93.2ξ2,α 93.7 94.0 94.4 94.7∆ξ1,2,α 94.0 94.0 95.6 95.73 - 10 ξ1,α 93.4 93.0 93.9 93.4ξ2,α 93.2 94.4 94.5 95.0∆ξ1,2,α 94.5 94.1 95.1 94.9524.5 Performance of DRM CEL Monitor Test whenModel is MisspecifiedIn applications, the true distribution of observations is usually unknown. Any sta-tistical model is just an approximation to the true distribution of data. Usually,instead of choosing a known basis function q(x), one may select q(x) as a richclass of functions and apply some classical model selection approaches to deter-mine a subset of the class of functions as the basis function of DRM. The quantile-quantile plot and density plot are two ways to judge the goodness-of-fit. Supposethat density ratio model is being applied to a data set. Without the full informationof the true distribution of data, the basis function q(x) usually can not be chosen“correctly”. Consequently, the influence of using an “incorrect” basis function forDRM is of big concern. For this reason, Fokianos and Kaimi [2006] study the im-pact of using an “incorrect” basis function on the estimation of θ , the parametersof DRM. Such an impact is serious. In Chen and Liu [2013], their simulation re-sults based on independent data show that the quantile estimation are not as badlyaffected by the misspecified model as the parameter estimation. In this section, westudy the same problem with clustered data. We would like to use simulation studyto investigate the performance of DRM CEL quantile estimation when an “incor-rect” basis function is used, that is, the model is misspecified. Our purpose is toevaluate the impact of misspecified model on the DRM CEL quantile estimations.We have used simulation study to investigate the performance of cluster-basedWilcoxon test for quantile comparison. Even though in Rosner et al. [2003], thecluster-based Wilcoxon test has controlled type I error and good performance in thecase of relative small sample size, there is a significant potential risk of using it inquantile comparison. Therefore, we do not consider this method as a comparisonsubject anymore.We simulate random samples from the same Normal and Gamma models asbefore. As a trade-off between a well fitted model and a parsimonious model, wechoose q(x) = (1,x, log(1+ |x|), |x|3/2). Two Scenarios are considered for bothmodels. In scenario 1, all clusters have a same size. In scenario 2, cluster sizes arenot equal.For each model set up, we generate 1000 samples. For each sample generated,53999 bootstrap repetitions are used to build two-sided 95% bootstrapping confi-dence intervals for multiple quantiles of all samples and their difference as well.We report the coverage probabilities of these confidence intervals based on 1000repeitions. The simulation results are summarized in Tables 4.6 and 4.7.54Table 4.6: Coverage probabilities (%) of two-sided 95% bootstrap confidenceintervals under Normal modelsCluster Sizes Target α = 0.05 α = 0.10 α = 0.05 α = 0.10(n1,n2,n3,n4) = (38,45,60,60)(σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4) (σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4)=(2.25, 2.25, 1.00, 1.00) =(1.44, 1.44, 1.00, 1.00)5 - 8 ξ1,α 91.8 90.8 92.8 91.6ξ2,α 93.5 93.1 93.6 93.1∆ξ1,2,α 93.6 92.2 94.6 93.34 - 9 ξ1,α 91.6 90.1 92.7 93.1ξ2,α 92.9 91.6 93.5 94.0∆ξ1,2,α 93.4 93.8 94.7 94.83 - 10 ξ1,α 90.6 92.0 91.9 91.5ξ2,α 91.8 93.8 92.0 92.5∆ξ1,2,α 93.6 93.4 94.3 93.7(n1,n2,n3,n4) = (50,60,80,80)(σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4) (σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4)=(2.25, 2.25, 1.00, 1.00) =(1.44, 1.44, 1.00, 1.00)5 - 8 ξ1,α 92.1 91.8 93.9 91.6ξ2,α 93.6 92.9 94.6 92.9∆ξ1,2,α 95.3 94.8 95.1 95.14 - 9 ξ1,α 92.4 91.4 92.9 92.8ξ2,α 94.4 92.5 94.1 93.1∆ξ1,2,α 92.7 94.1 95.8 95.93 - 10 ξ1,α 92.8 91.8 92.2 92.5ξ2,α 93.1 92.3 92.7 92.8∆ξ1,2,α 93.8 94.0 94.4 94.1(n1,n2,n3,n4) = (63,75,100,100)(σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4) (σ2γ,1,σ2γ,2,σ2γ,3,σ2γ,4)=(2.25, 2.25, 1.00, 1.00) =(1.44, 1.44, 1.00, 1.00)5 - 8 ξ1,α 92.3 93.3 94.0 93.3ξ2,α 92.7 93.7 94.3 93.3∆ξ1,2,α 93.8 93.0 94.6 94.54 - 9 ξ1,α 91.8 92.8 93.7 93.0ξ2,α 92.8 93.0 93.8 92.3∆ξ1,2,α 92.4 93.3 94.7 94.63 - 10 ξ1,α 92.5 92.4 93.0 92.9ξ2,α 93.6 94.1 93.6 93.2∆ξ1,2,α 94.0 94.6 94.3 94.455Table 4.7: Coverage probabilities (%) of two-sided 95% bootstrap confidenceintervals under Gamma modelsCluster Sizes Target α = 0.05 α = 0.10 α = 0.05 α = 0.10(n1,n2,n3,n4) = (38,45,60,60)b= 36 b= 145 - 8 ξ1,α 91.9 90.5 92.1 90.5ξ2,α 92.9 92.0 93.2 91.4∆ξ1,2,α 92.9 93.2 94.1 94.94 - 9 ξ1,α 93.1 90.6 91.2 91.0ξ2,α 93.2 92.2 91.6 91.4∆ξ1,2,α 94.3 93.6 94.3 94.43 - 10 ξ1,α 91.1 92.3 90.1 91.5ξ2,α 92.7 92.9 91.2 91.9∆ξ1,2,α 94.2 94.2 91.7 91.8(n1,n2,n3,n4) = (50,60,80,80)b= 36 b= 145 - 8 ξ1,α 93.3 91.1 90.2 92.3ξ2,α 93.2 92.5 91.1 93.4∆ξ1,2,α 93.3 93.3 93.4 93.04 - 9 ξ1,α 93.7 91.8 91.4 91.7ξ2,α 93.3 92.1 93.1 92.6∆ξ1,2,α 94.6 94.9 94.8 94.33 - 10 ξ1,α 92.6 92.0 91.9 91.6ξ2,α 93.3 92.4 92.0 92.6∆ξ1,2,α 93.8 93.9 94.6 94.7(n1,n2,n3,n4) = (63,75,100,100)b= 36 b= 145 - 8 ξ1,α 92.9 93.0 91.0 92.6ξ2,α 93.8 93.2 93.0 93.8∆ξ1,2,α 92.7 93.2 93.7 93.44 - 9 ξ1,α 94.2 94.0 91.0 90.6ξ2,α 94.8 94.1 91.9 91.6∆ξ1,2,α 94.7 94.8 92.5 92.93 - 10 ξ1,α 91.7 92.8 92.5 92.6ξ2,α 92.8 92.9 92.6 93.4∆ξ1,2,α 93.7 93.7 94.0 94.3From Table 4.6 and 4.7, we notice that the impact of misspecified model on the56coverage probabilities of 5% and 10% quantiles of populations 1 and 2. Specifi-cally, the coverage probabilities tend to be lower than the nominal level 95%. Butthe simulated coverage probabilities of quantile differences do not suffer too muchfrom the misspecified model. This might be because the quantile estimates of bothpopulations tend to be lower or larger than the true values simultaneously so thattheir differences are still close to the true values. Based on these simulation results,we can conclude that the choice of basis function indeed has some influence on theaccuracy of quantile estimation. But as long as we choose an appropriate basisfunction such that the model fits data well, the interval estimation of populationquantile is still in a reasonable range. Moreover, the interval estimation of quantiledifference is still very reliable.57Chapter 5SummaryThe main interest of this thesis lies in the demonstration of our R package forDRM CEL quantile estimation and how to use the package to do further research.With the help of this package, we compare the performance of DRM CEL quantileestimation with cluster-based Wilcoxon rank sum test in different situations. Thecluster-based Wilcoxon test as a nonparametric method, needs less assumptions onthe underlying population distributions. But for the purpose of monitoring popu-lation quantiles change, this Wilcoxon test is not testing the hypothesis of interestand DRM CEL is more appropriate here. Taking advantage of the package, weuse simulation study to investigate the performance of DRM CEL quantile in thecase that cluster sizes are not equal, which is not considered in Chen et al. [2016].We imitate the same cluster-based bootstrap procedure from that paper to conducthypothesis tests and to build confidence intervals. Simulation results supports thatthe conclusions in Chen et al. [2016] still hold in the general situations.As future research, adding the multi-parameter one-sided monitoring test byZhu and Chen [2017] into our current R package is of interest. This multi-parameterone-sided monitoring test is also motivated by monitoring multiple quality indicesin forestry products. In the literature, they find a novel way to derive a likelihoodratio test (LRT) type statistic for monitoring test of interest. Compared with theclassical LRT, the new test retains good control of the type I error and is markedlymore powerful. We would like to organize this new test into our package. Besides,the simulation results support our guess that the DRM CEL monitoring test should58work in the general case. We would like to investigate the theory behind it.59BibliographyJ.A. Anderson. Multivariate logistic compounds. Biometrika, 66(1):17–26, 1979.→ pages 2, 3, 4J.D. Barrett and R.M. Kellogg. Bending strength and stiffness of second growthdouglas-fir dimension lumber. Forest products journal (USA), 1991. → pages 2B.A. Bendtsen and J. Senft. Mechanical and anatomical properties in individualgrowth rings of plantation-grown eastern cottonwood and loblolly pine. Woodand Fiber Science, 18(1):23–38, 2007. → pages 2D. Bertsimas, G.J. Lauprete, and A. Samarov. Shortfall as a risk measure: proper-ties, optimization and applications. Journal of Economic Dynamics and control,28(7):1353–1381, 2004. → pages 1H. Bier and M.J. Collins. Bending Properties of 100 x 50 mm Structural Timberfrom a 28-year-old Stand of New Zealand Radiata Pine. New Zealand ForestService, 1985. → pages 2R. S. Boone and M. Chudnoff. Compression wood formation and other characteris-tics of plantation-grown pinus caribaea. Forest Service Research Paper, Instituteof Tropical Forestry, Puerto Rico, (ITF-13), 1972. → pages 2H. Chen and J. Chen. Bahadur representations of the empirical likelihood quantileprocesses. Journal of Nonparametric Statistics, 12(5):645–660, 2000. → pages5, 17J. Chen and Y. Liu. Quantile and quantile-function estimations under density ratiomodel. The Annals of Statistics, 41(3):1669–1692, 2013. → pages 6, 7, 10, 17,20, 27, 53J. Chen, P. Li, Y. Liu, and J.V. Zidek. Monitoring test under nonparametric randomeffects model. arXiv preprint arXiv:1610.05809, 2016. → pages iii, 1, 3, 6, 8,9, 10, 11, 12, 15, 16, 17, 20, 37, 45, 5860S.A. Czepiel. Maximum likelihood estimation of logistic regression models: the-ory and implementation. Available at czep. net/stat/mlelr. pdf, 2002. → pages20B. Efron. Bootstrap Methods: Another Look at the Jackknife. Springer, 1992. →pages 28K. Fokianos. Density ratio model selection. Journal of Statistical Computation andSimulation, 77(9):805–819, 2007. → pages 30K. Fokianos and I. Kaimi. On the effect of misspecifying the density ratio model.Annals of the Institute of Statistical Mathematics, 58(3):475–497, 2006. →pages 30, 53P. Hall. On the bootstrap and confidence intervals. The Annals of Statistics, pages1431–1452, 1986. → pages 28P. Hall. Theoretical comparison of bootstrap confidence intervals. The Annals ofStatistics, pages 927–953, 1988. → pages 28F.E. Harrell and C. Dupont. Hmisc: harrell miscellaneous. R Package Version, 3(2), 2008. → pages 24, 28ASTM International. Standard practice for establishing allowable properties forvisually-graded dimension lumber from in-grade tests of full-size specimens.astm d1990-07, 2007. → pages 1A. Keziou and S. Leoni-Aubin. On empirical likelihood for semiparametric two-sample density ratio models. Journal of Statistical Planning and Inference, 138(4):915–928, 2008. → pages 20W.H. Kruskal and W.A. Wallis. Use of ranks in one-criterion variance analysis.Journal of the American Statistical Association, 47(260):583–621, 1952. →pages 42E.L. Lehmann. Nonparametrics: Statistical Methods Based on Ranks. Springer,2006. → pages 11, 39B.G. Lindsay. Composite likelihood methods. Contemporary Mathematics, 80(1):221–239, 1988. → pages 7S. Nadarajah and A.K. Gupta. Some bivariate gamma distributions. AppliedMathematics Letters, 19(8):767–774, 2006. → pages 4661H. Nakayama, H. Sayama, and Y. Sawaragi. A generalized lagrangian function andmultiplier method. Journal of Optimization Theory and Applications, 17(3-4):211–227, 1975. → pages 6A.B. Owen. Empirical Likelihood. Chapman and Hall/CRC, 2001. → pages 5G.C. Pflug. Some emarks on the value-at-risk and the conditional value-at-risk.Probabilistic Constrained Optimization, pages 272–281, 2000. → pages 1J. Qin and B. Zhang. A goodness-of-fit test for logistic regression models based oncase-control data. Biometrika, 84(3):609–618, 1997. → pages 20B. Ripley, W. Venables, and M.B. Ripley. Package ‘nnet’. R package version,pages 7–3, 2016. → pages 20B. Rosner, R.J. Glynn, and M.L. Lee. Incorporation of clustering effects for thewilcoxon rank sum test: a large-sample approach. Biometrics, 59(4):1089–1098,2003. → pages 11, 37, 38, 41, 42, 53R.J. Serfling. Approximation Theorems of Mathematical Statistics, volume 162.John Wiley Sons, 2009. → pages 4, 12B.W. Silverman. Density Estimation for Statistics and Data Analysis. Routledge,2018. → pages 27I. Smith, S. Alkan, and Y.H. Chui. Variation of dynamic properties and staticbending strength of a plantation-grown red pine. Journal of the Institute of WoodScience, 12(4):221–224, 1991. → pages 2C. Varin, N. Reid, and D. Firth. An overview of composite likelihood methods.Statistica Sinica, pages 5–42, 2011. → pages 7S. Verrill, D.E. Kretschmann, and J.W. Evans. Simulations of strength prop-erty monitoring tests. Unpublished manuscript. Forest Products Laboratory,Madison, Wisconsin. Available at http://www1. fpl. fs. fed. us/monit. pdf, 2015.→ pages 2, 3G. Zhu and J. Chen. Multi-parameter one-sided monitoring tests. Technometrics,(just-accepted), 2017. → pages 5862


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items