IDENTIFICATION OF RISK GROUPS: STUDY OF INFANT MORTALITY IN SRI LANKA
by
LISA KAN
B.Sc., Simon Fraser University, 1986
A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
in
THE FACULTY OF GRADUATE STUDIES
The Department of Statistics
We accept this thesis as conforming to the required standard
THE UNIVERSITY OF BRITISH COLUMBIA
September 1988
© Lisa Kan, 1988

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.
The University of British Columbia, Vancouver, Canada

ABSTRACT
Multivariate statistical methods, including recent computing-intensive techniques, are explained and applied in a medical sociology context to study infant death in relation to socioeconomic risk factors of households in Sri Lankan villages. The data analyzed were collected by a team of social scientists who interviewed households in Sri Lanka during 1980-81. Researchers would like to identify characteristics (risk factors) distinguishing those households at relatively high or low risk of experiencing an infant death. Furthermore, they would like to model temporal and structural relationships among important risk factors. Similar statistical issues and analyses are relevant to many sociological and epidemiological studies. Results from such studies may be useful to health promotion or preventive medicine program planning.
With respect to an outcome such as infant death, risk groups and discriminating factors or variables can be identified using a variety of statistical discriminant methods, including Fisher's parametric (normal) linear discriminant, logistic linear discrimination, and recursive partitioning (CART). The usefulness of a particular discriminant methodology may depend on distributional properties of the data (whether the variables are dichotomous, ordinal, normal, etc.), and also on the context and objectives of the analysis.
There are at least three conceptual approaches to statistical studies of risk factors. An epidemiological perspective uses the notion of relative risk. A second approach, generally referred to as classification or discriminant analysis, is to predict a dichotomous outcome, or class membership. A third approach is to estimate the probability of each outcome, or of belonging to each class. These three approaches are discussed and compared; and appropriate methods are applied to the Sri Lankan household data.
Path analysis is a standard method used to investigate causal relationships among variables in the social sciences. However, the normal multiple regression assumptions under which this method is developed are very restrictive. In this thesis, limitations of path analysis are explored, and alternative loglinear techniques are considered.

TABLE OF CONTENTS
Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1. Introduction
2. A Study of Infant Mortality in Sri Lanka
2.1 Infant Mortality in Medical Sociology
2.2 The Sri Lankan Household Data
3. Discriminant Applications to Identify Risk Groups
3.1 Basic Approaches
3.2 Optimality Criteria for Discriminants
3.2.1 Relative Risk
3.2.2 Decision Theoretic Bayes Rules
3.3 Sample Space Partitions Corresponding to Bayes Rules
3.3.1 Linear Discriminants for Normal Distributions
3.3.2 Logistic Linear Discriminants
3.3.3 Classification Trees: Recursive Partitioning
3.4 Construction of Discriminants from Sample Data
3.4.1 Logistic Discriminant
3.4.2 CART Discriminant: Growing a Classification Tree
3.4.3 CART Discriminant: Pruning a Classification Tree
3.4.3.1 Test Sample Estimates of Risk
3.4.3.2 Cross-Validation Estimates of Risk
4. Path Analysis
4.1 Structural Modelling with Quantitative Data
4.1.1 Path Models
4.1.2 Estimation and Interpretation of Path Coefficients
4.2 Structural Modelling with Qualitative Data
4.2.1 Loglinear and Logit Models
4.2.2 Path Models
4.2.3 Estimation of Path Coefficients
4.2.4 Goodness-of-Fit for Path Models
5. Statistical Analyses on the Sri Lankan Household Data
5.1 Identification of Infant Mortality Risk Groups
5.1.1 Logistic Discrimination
5.1.2 Discrimination Using CART
5.1.3 Discussion
5.2 Causal Modelling
5.2.1 Structural Modelling with Quantitative Data
5.2.2 Structural Modelling with Qualitative Data
5.2.3 Discussion
6. Remarks and Recommendations on Statistical Methods Used to Identify Risk Groups
Bibliography
Appendix I Partitioning the Sample Space Using Logistic Discrimination (Younger Women)
Appendix II Modified Path Analysis - Model Selection (Younger Women)
Appendix III Modified Path Analysis - Model Selection (Older Women)

LIST OF TABLES
Table I Variables used in the Sri Lankan household study
Table II Households used in the analysis
Table III Estimated direct and indirect effects for path model (4.3)
Table IV Various loglinear models for three-dimensional tables
Table V Results of forward stepwise logistic regression
Table VI Comparison of sample space partitioning between logistic discrimination and CART
Table VII Estimated logistic regression equations for younger women
Table VIII Estimated direct and indirect effects on infant death
Table IX Variables used in modified path analysis
Table X Goodness-of-fit statistics for loglinear models (younger women)
Table XI Goodness-of-fit statistics for loglinear models (older women)

LIST OF FIGURES
Figure 1 Conceptual model of medical sociological approach to research on infant mortality
Figure 2 Examples of relative risk functions for known probability densities
Figure 3 An example of a binary tree
Figure 4 An example of a path diagram
Figure 5 An example of a path diagram with path coefficients
Figure 6 An example of a colored path diagram
Figure 7 A path model with estimated path coefficients
Figure 8 A path model with dichotomous variables
Figure 9 CART results for the younger women
Figure 10 CART results for the older women
Figure 11 Path model specifying temporal relationships among selected variables
Figure 12 Path analysis
results for the younger women
Figure 13 Path analysis results for the older women
Figure 14 Path diagram showing causal links implied by selected logit models for younger women
Figure 15 Path diagram showing causal links implied by selected logit models for older women

ACKNOWLEDGEMENTS
I would like to thank Dr. Nancy E. Waxler-Morrison for providing the data and the stimulus for my research. I am grateful to Dr. Ned Glick for his guidance, suggestions, and patience in producing this thesis. Dr. A. John Petkau's helpful comments are also greatly appreciated. Finally, I thank my husband, Scott, for his continuous encouragement and support. Without his belief in me, it might have taken me longer to get here.

1. Introduction
A study of infant mortality in Sri Lanka was conducted by a team of social scientists during 1980-81 (before the current civil war) to identify households and socioeconomic conditions in which there was a high risk of experiencing an infant death. Further, relationships among risk factors would also be of interest to future planning of any preventive health programs in developing countries. Similar applications of multivariate analyses are widely used to identify risk groups in epidemiology, urban planning, economics, business, etc. This thesis explores and applies various statistical methods for assessing risk groups, and relationships among risk factors.
Risk groups and discriminating factors can be identified by a variety of statistical discriminant and modelling methods. The most often used criterion for determining the goodness of a discriminant rule has been the rate of misclassification. However, the importance of misclassification rate varies depending on the purpose of discrimination. In medical diagnosis, the objective is to pinpoint as accurately as possible the cause of symptoms.
Since it is not desirable to subject a healthy individual to possibly detrimental treatments, such as chemotherapy, nor to leave an infection untreated because of misdiagnosis, discriminant rules with low misclassification rates are preferred. In medical screening, say early breast cancer detection, examinations are performed on apparently healthy volunteers from the general population, for the purpose of separating them into groups with high and low probabilities for breast cancer (Sackett and Holland 1975). The idea of a screening discriminant is to use a few inexpensive measurements to capture all those with the disorder in a high risk group, so that more complicated, and often more expensive, examinations need be performed only on this smaller group of individuals. Thus, factors considered to be good screening factors may not be acceptable diagnosis factors.
In epidemiology and medical sociology, the main objective is to discover the context in which a disorder may occur. For example, homosexual men were identified as the first high risk group in studies of AIDS, although homosexuality per se is not the cause of disease; and clearly, by using sexual orientation as a discriminant rule, the misclassification rate would be high. In our Sri Lankan household study, the risk of infant death is being examined from the socioeconomic and political perspective. Health planning involves not only the understanding of biomedical causes of infant death, but also the social context in which infant death may occur. Although discriminant rules constructed using socioeconomic and political variables may not have low misclassification rates, the socioeconomic and political conditions under which a family is most likely to experience an infant death can still be identified.
Thus, the goal is to find discriminating variables and discriminant rules that partition the households into distinguishable groups with respect to the risk of infant mortality. In this thesis, two other criteria for determining the goodness of a discriminant rule are investigated, and discriminant methods that are appropriate for the Sri Lankan household data set are applied.
A second objective of the Sri Lankan household study is to test a theoretical model that places infant mortality at the center of an expanding series of social contexts. Infant deaths may be affected by proximate factors such as inadequate nutrition or poor sanitation creating conditions for tetanus or diarrhea. These proximate factors may be influenced by the education level of the mother, and the economic status of the family, which in turn, may be linked to ethnic group membership. Path analysis is the standard method used to analyze such models in the social sciences. However, the assumptions under which this method is developed are highly restrictive. Thus, the use of path analysis is limited. In this thesis, limitations of the methodology are explored, and alternative techniques are considered.

2. A Study of Infant Mortality in Sri Lanka
2.1 Infant Mortality in Medical Sociology
In medical sociology, infant mortality is viewed as a consequence of biosocial interactions. The key idea behind the biomedical model of disease is that etiology is biologically specific. Hence, medical research is primarily focused on disease agents and host-agent interactions. On the other hand, social science research on infant mortality has been traditionally concentrated on the association between socioeconomic status and the level and pattern of mortality in the population. The specific medical causes of death are generally not addressed by social scientists.
Medical sociology attempts to bridge these two approaches to the study of infant mortality. Mosley and Chen (1984) proposed a framework based on the premise that "all social and economic determinants of child mortality necessarily operate through a common set of biological mechanisms, or proximate determinants, to exert an impact on mortality". This framework can be summarized by the following illustration.

[Figure 1: diagram linking socioeconomic factors, biomedical factors, and infant mortality]
Figure 1 Conceptual model of medical sociological approach to research on infant mortality

Primary causes of infant death in developing countries are well understood from the medical perspective. One of the factors that contributes to high infant mortality rates is risk of infection. Patel (1980) noted the common use of dung as a healing agent prior to 1940 in Sri Lanka. As documented by the Registrar of Ceylon Medical College in 1906, tetanus, a common cause of infant death, often resulted from infection to the navel after separation of the umbilical cord in childbirth. This source of infection can easily be eliminated by abolishing such practice. Another source of infection is the contaminated water supply caused by lack of proper sanitation facilities. This source of infection may be eliminated by construction of sanitary latrines. In general, most infant deaths are preventable with current understanding of disease transmission and existing health technology.
Although most infant deaths are preventable with the available technology, the social context in which infant death occurs may block the use of such technology. The Sri Lankan government has created a subsidy program for the construction of latrines. However, many families are too poor to take advantage of such subsidies. Another example involves the use of hospitals for childbirths. Waxler et al. (1985) suggest that childbirth may not be considered serious enough to require a doctor's care.
Thus, hospitals for maternity care are sometimes not used, even though these hospitals, which are essentially free, are within short distances. Therefore, in order to design an effective package of health policies to promote infant survival, the biomedical and the social context of the problem must be examined concurrently (Mosley 1984).
Two recent developments in sociological research have also altered the approach to infant mortality studies, as pointed out by Waxler et al. (1985). McKeown (1976) has argued that changes in health status across time are probably better predicted by changes in sanitation and available food supplies, than by health care or narrowly defined medical variables that are often considered. Secondly, infant mortality has been used, by development economists and others, as a central indicator of the state of development, or quality of life, of populations in developing countries (Morris 1979). These developments have called for expanded models that place infant mortality in a larger social context.
The proximate causes of infant death may be inadequate nutrition (Puffer and Serrano 1973) or poor sanitation and water supply that create conditions for tetanus or diarrhea (Patel 1980, and Smucker et al. 1980). However, these proximate causes may be related to the maternal education level (Caldwell and McDonald 1982, Simmons and Bernstein 1982, and Chowdhury 1982), economic status of the family (Grosse and Perry 1982, and Waxler et al. 1985), and access to health services (World Bank 1975), which in turn, may be related to ethnic group membership (Waxler et al. 1985). In the Sri Lankan household study, relationships between infant mortality and various biomedical and socioeconomical factors are examined.

2.2 The Sri Lankan Household Data
As described in Waxler et al.
(1985), the 22 districts of Sri Lanka were divided into three clusters having different patterns of quality of life based on results of a previous study (Morrison and Waxler 1984). Four villages representative of a typical district from each of the three clusters were selected. For each village, a random sample of 40 households was drawn from the population list. A household was substituted only if the sampled house was empty, or if both male and female head of household were absent in several calls over a period of weeks. Approximately 30 substitutions were made in the sample of 480 households. The researchers who devised this sampling scheme regard the sampled households as being representative of the Sri Lankan village population.
A long systematic set of open questions was used for interviewing both the male and the female head of household. The questions elicited information on health, housing, nutrition, employment, education, etc. The female head of the household, in addition, reported on the number of live births in her lifetime, and the number of her children who died before reaching age one. Information on the cause of death (or symptoms at death) was also obtained for each infant that died. The variable of primary interest in our analysis is a dichotomous response indicating whether or not the female head of household has experienced at least one infant death. All explanatory variables used in the study are listed in Table I.
391 households (82% of the total sample) have complete information on the variables of interest. Table II shows that 92% of the total sample satisfied the initial inclusion criterion: a female head of household with known child-bearing history, and known number of infant deaths must be present in the household. Further, the table shows that 12% of these households had missing information (where 11% have one missing variable and 1% have two missing variables).
Most missing values appear in the variables concerning family income, and among older female heads of households; otherwise, there was no noticeable pattern when the distribution of households with missing information was examined for each variable.
Several populations may require separate analysis in this study. Women with more childbirths are more likely to have experienced at least one infant death. Thus, the Sri Lankan village population is separable with respect to the dichotomous response on infant death by the number of childbirths. Furthermore, several explanatory variables may have different relevance to women of different age groups. For example, the use of health services for childbirth is restricted by availability, which may vary across time. The impact of ethnicity may also vary for the different generations. Thus, analysis should be performed separately for the various age groups. However, the available sample size restricts the number of allowed strata. Since older women also tend to have more childbirths, the sample is divided into two groups based on the woman's age (<44 and 44+). Most women in the latter age group are postmenopausal; thus, women in this age group have similar numbers of childbirths. In contrast, the number of childbirths varies for women in the younger age group. Since there is a one-to-one correspondence between household and female head of the household, the terms household and woman will be used interchangeably to refer to a unit of observation throughout this thesis.
In our analysis of this Sri Lankan household survey, the two data sets corresponding to those women of age <44 (250 cases), and those of age 44+ (141 cases) are treated as simple random samples.

Table I Variables used in the Sri Lankan household study
(each row gives the explanation of a variable, followed by its codes)
Infant death indicator: 1 at least one; 2 none
No. of languages spoken at home: 1 one; 2 two or more
Current usage of health services - where was the last child born?
1 hospital; 2 home with midwife; 3 home without midwife
Nutrition - no. of protein foods consumed in the past week, from four most common types listed: 0 none; 1 one type; ...; 4 four types
Sanitation: codes 1 to 5 (none; communal latrine; own / open-pit type; own / water-sealed toilet)
Economic status - no. of household items owned, from five listed: 0 none; 1 one; ...; 5 five
No. of hrs a day female head of household worked outside the home: 0 none; 1 one - three; 2 four; ...; 7 nine; 8 ten or more
No. of household members currently employed: 0 none; 1 some; 2 all
Primary source of income: 1 salary; 2 land/business/boat; 3 piece rate; 4 food stamps etc.
No. of bustrips taken in the last week: 0 none; 1 one; ...; 7 seven; 8 eight or more
Ethnicity: 1 Sinhalese; 2 others
Years of schooling for female head of household: 0 none; 1 one; ...; 11 eleven; 12 twelve or more
Education level of female head relative to that of male head: 1 lower; 2 same; 3 higher
AGE - Age of female head of household: as reported

Table II Households used in the analysis
Total number of households sampled: 480
  no female head of household: 12
  no child birth or no information on child birth: 25
  no information on infant deaths: 1
  number of invalid households: 38
Total number of valid households: 442
  missing information on one variable: 48
  missing information on two variables: 3
  number of excluded households: 51
Total number of households included in analysis: 391
  number of women with age <44: 250
  number of women with age 44+: 141

3. Discriminant Applications to Identify Risk Groups
3.1 Basic Approaches
In the Sri Lankan household study, we are interested in deriving discriminant rules that partition the households into distinguishable groups with respect to the risk of experiencing infant death. There are at least three basic approaches to this problem.
An epidemiological perspective uses the notion of relative risk. If a population $t$ can be divided into two disjoint subpopulations, say $t_1$ and $t_2$, then relative
say * risk of a particular phenomenon is defined to be the occurrence probability in relative to the occurrence probability in I . 2 For example, we would like observable variables to define some groups t and * such that the probability of infant death i s higher for households 2 in t 2 relative to the probability for households in t . In general, a variable which can partition the population so that one subset has high relative risk is considered an important rish A second approach i s to predict /actor. a dichotomous outcome based on some collected information; for example, classify a family as likely or unlikely to experience an infant death based on the sanitation f a c i l i t y , nutrition, etc. available to the family. This approach i s generally referred to as discriminant analysis or classification, and as pattern recognition in engineering. The idea i s to select discriminating variables and to derive discriminant rules that minimize the expected cost of misclassification. This w i l l be referred to hereafter as the c l a s s i f i c a t i o n approach. 13 A third approach is to estimate or of belonging to each class, the probability given some of each outcome collected information; for example, estimate the probability of infant death given the educational level of the mother. Regression c I ass Trees Using the terminology in C l a s s i f i c a t i o n , and {CART) by Breiman et al. (1984), this approach is called probab i l i t y estimation. The methods used in this approach search for variables and rules that minimize a squared error loss function to be defined later (Section 3.2.2). Obviously, these three approaches are related. For instance, class probability estimation for an observation [e.g. for a family) suggests a discriminant that assigns the observation to whichever class has the maximum probability; and relative risk can be estimated for the resulting discriminant partition. 
The similarities and differences between these perspectives can be described in terms of various conditional probabilities, and in the more general context of decision theory. Some statistical techniques and software may be adapted to more than one of these approaches.
We will first consider the roles of these approaches in characterizing a good discriminant. The underlying principles of discrimination will be discussed in the context where various conditional probabilities are known. However, in practice these conditional probabilities are often unknown, and need to be estimated from the sample data. The last section describes how these estimates may be obtained.

3.2 Optimality Criteria for Discriminants
3.2.1 Relative Risk
Relative risk is generally considered in a context relating the presence or absence of a specific disease to exposure levels for some possible risk factor(s) (Schlesselman 1982). The concept of relative risk is simplest when exposure level is dichotomous (presence or absence of a factor). A high relative risk (of disease) among those exposed suggests that the factor may be a cause of disease (Breslow and Day 1980, Schlesselman 1982, Hennekens and Buring 1987).
Let $X$ be a random variable that indicates the level of exposure to a specific risk factor. Suppose there are only two levels.

Definition 3.1 Relative risk is defined as
$$RR = \frac{P(\text{disease} \mid X = 2)}{P(\text{disease} \mid X = 1)}.$$
When $RR > 1$, the probability of disease in the population with $X = 2$ is higher than the probability of disease in the population with $X = 1$. The reverse relationship is implied when $RR < 1$.

Historically, relative risk was used primarily for dichotomous variables. But suppose the random variable $X$ is continuous on the real line, or positive half-line, etc. Then by considering $X$ as a risk factor, we are interested in partitioning the real line into two regions, distinguishable with respect to the risk of disease.
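As a rough illustration of Definition 3.1, the sketch below computes a relative risk for a dichotomous exposure and then applies the same ratio to a continuous $X$ by cutting the real line at a point `c` and treating $\{X > c\}$ as the exposed group. The normal densities, the prevalence of 0.5, and the function names are all assumptions made for this example; they are not taken from the Sri Lankan data.

```python
from statistics import NormalDist


def relative_risk(p_disease_level2, p_disease_level1):
    """Definition 3.1: RR = P(disease | X = 2) / P(disease | X = 1)."""
    return p_disease_level2 / p_disease_level1


def relative_risk_at_cut(c, diseased, healthy, prevalence):
    """Relative risk of {X > c} versus {X <= c} for known densities.

    `diseased` and `healthy` are the (assumed) distributions of X in the
    disease-present and disease-absent populations; `prevalence` is the
    (assumed) overall probability of disease.
    """
    tail_diseased = 1.0 - diseased.cdf(c)   # P(X > c | disease)
    tail_healthy = 1.0 - healthy.cdf(c)     # P(X > c | no disease)
    p_above = prevalence * tail_diseased + (1.0 - prevalence) * tail_healthy
    p_below = 1.0 - p_above
    # Bayes theorem gives the disease probability within each region.
    p_disease_above = prevalence * tail_diseased / p_above
    p_disease_below = prevalence * diseased.cdf(c) / p_below
    return relative_risk(p_disease_above, p_disease_below)


# Hypothetical setting: X ~ N(1,1) if diseased, N(0,1) otherwise.
diseased = NormalDist(1.0, 1.0)
healthy = NormalDist(0.0, 1.0)
rr_mid = relative_risk_at_cut(0.5, diseased, healthy, prevalence=0.5)
rr_low = relative_risk_at_cut(-3.0, diseased, healthy, prevalence=0.5)
print(rr_mid, rr_low)
```

In this hypothetical setting the cut far in the lower tail gives a much larger relative risk than the cut at the midpoint, even though almost every observation then falls in the "exposed" region.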
15 into two regions, Is i t reasonable to use relative risk as a partitioning Suppose the disease-present and the disease-absent populations densities of X denoted respectively by p(x|disease) and p ( x | n © smooth unimodal densities, the ideal have disease), If p(x|disease) is which, in practice, may be estimated from sample data. right-shifted with respect to p(x|n© disease), criterion? then, at least for most partition is in the half-lines, { X < c} and { X > c}, for some c on the real line. form of Thus, by Bayes theorem, for any c e R, the corresponding relative risk is RR(c) = W ° " \ X P(disease|X > c) < c) { P ( X > c\disease) P(X The two > c) P(X P{X < c) < c\disease) 3 2 ) examples illustrated in Figure 2 show that for densities with monotone likelihood ratio, RR(c) may increase to infinity as c decreases; but the discriminants corresponding to such extreme c are of no practical value. Thus, choosing c to maximize RR{c) partitioning. is not a useful criterion for Furthermore, because RR{c) may not be a monotone function, relative risk values do not provide information on how well separated are the two populations, disease-absent and disease-present. For example, a relative risk value of about 2 can arise from different partitions of the real line in either of the two situations Since relative risk does not indicate disease-present and disease-absent illustrated in Figure 2. the magnitude of shift between the densities, relative risk is not necessarily informative about the practical discriminating nature of a risk factor that is continuous rather than dichotomous. 16 Figure 2 Examples of relative risk function for known probability densities Density Plots: N(0,1) vs. N(1,1) p(x) i i i i I i i i i I i i i i I i i i i I i i i i I i i i i -1.0 0.0 1.0 2.0 3.0 4.0 5.0 x Relative Risk Function for N(0,1) vs. 
[Panel: Density plots, N(0,1) (not diseased) vs. N(2,1) (diseased); p(x) against x]
[Panel: Relative risk function RR(c) for N(0,1) vs. N(2,1)]

These properties indicate that relative risk may not be a meaningful criterion for selecting discriminating variables. Even though relative risk associated with a particular discriminant may be of interest, relative risk per se is not usually an appropriate criterion for construction of a discriminant.

3.2.2 Decision Theoretic Bayes Rules
Although the formal objective differs for classification and class probability estimation, both approaches use discriminant methods that can be described in a general framework of decision theory as presented in Classification and Regression Trees (CART) by Breiman et al. (1984). In the following, discussion will be restricted to the two-class problem, which is appropriate for the Sri Lankan household study. Generalization to more than two classes can easily be made.
Let $\mathcal{X}$ be the sample space of possible measurement vectors, and let $\mathcal{C} = \{1, 2\}$ denote the set of possible classes. Further, let $X \in \mathcal{X}$ be a random variable whose distribution is denoted by $P(dx)$, and let $Y \in \mathcal{C}$ denote the class membership.

Definition 3.2 Suppose $\mathcal{A}$ is the set of possible actions. A decision rule $d$ is an $\mathcal{A}$-valued function on $\mathcal{X}$:
$$d : \mathcal{X} \to \mathcal{A}.$$

Definition 3.3 A loss function $L$ is a real-valued function on $\mathcal{C} \times \mathcal{A}$:
$$L : \mathcal{C} \times \mathcal{A} \to \mathbb{R}.$$
Thus $L(y, a)$ is the loss when $Y = y$ and $a \in \mathcal{A}$ is the action taken.

Definition 3.4 The risk $R(d)$ is the expected loss when the decision rule $d$ is used. That is,
$$R(d) = E[\, L(Y, d(X)) \,].$$
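Definitions 3.2 to 3.4 can be made concrete with a small discrete example. The sketch below evaluates $R(d) = E[L(Y, d(X))]$ by summing over a joint distribution of $(X, Y)$. The joint probabilities, the two rules, and the 0-1 loss are hypothetical choices made only for illustration.

```python
def risk(joint, rule, loss):
    """R(d) = sum over (x, y) of P(X = x, Y = y) * L(y, d(x))."""
    return sum(prob * loss(y, rule(x)) for (x, y), prob in joint.items())


def zero_one_loss(y, action):
    """Loss 1 for a wrong class assignment, 0 for a correct one."""
    return 0.0 if action == y else 1.0


# Hypothetical joint distribution of a binary measurement x and class y.
joint = {
    (0, 1): 0.4,
    (0, 2): 0.1,
    (1, 1): 0.1,
    (1, 2): 0.4,
}


def rule_by_x(x):
    """Assign class 2 when x = 1 and class 1 when x = 0."""
    return 2 if x == 1 else 1


def rule_always_1(x):
    """Ignore x and always assign class 1."""
    return 1


risk_by_x = risk(joint, rule_by_x, zero_one_loss)
risk_always_1 = risk(joint, rule_always_1, zero_one_loss)
print(risk_by_x, risk_always_1)
```

Under these assumed probabilities the rule that uses the measurement has the smaller risk, because it errs only on the two off-diagonal cells of the joint distribution.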
In the classification approach, we are interested in predicting the class membership of an object with measurement vector $X = x$. Thus, we want to construct decision rules that assign class membership in $\mathcal{C}$ to every measurement vector $x \in \mathcal{X}$, and so, let the action space $\mathcal{A}_c$ be $\mathcal{C}$. Furthermore, any decision rule $d$ is equivalent to the partition of sample space $\mathcal{X}$ into two regions, $t_1$ and $t_2$, such that an object with measurement vector $x \in t_j$ is classified as class $j$, for $j = 1, 2$. These rules will be called classification rules. The loss function, $L_c(y, a)$, in this situation is the cost of classifying a class $y$ object as a class $a$ object, denoted by $C(a \mid y)$. Suppose $C(a \mid y)$ is positive when $a \neq y$ and is 0 otherwise. Then the risk or expected cost of using decision rule $d$ is given by
$$R_c(d) = C(1 \mid 2)\, P(Y = 2, X \in t_1) + C(2 \mid 1)\, P(Y = 1, X \in t_2). \qquad (3.3)$$
Let the probability that an observation comes from class $j$ be $\pi_j$ for $j = 1, 2$. In epidemiological terms, these a priori probabilities are prevalences of the two classes. Further, let the conditional probability density of $X$, given an object from class $j$, be denoted by $p(x \mid j)$ for $j = 1, 2$. Then the risk in (3.3) can be re-expressed as
$$R_c(d) = C(1 \mid 2)\, \pi_2 \int_{t_1} p(x \mid 2)\, dx + C(2 \mid 1)\, \pi_1 \int_{t_2} p(x \mid 1)\, dx. \qquad (3.4)$$
In the class probability estimation approach, we are interested in obtaining an estimate of the probability that an object with measurement vector $X = x$ belongs to class $j$. That is, we are interested in estimating $p(j \mid x) = P(Y = j \mid X = x)$, $j = 1, 2$. Thus, we want to construct rules of the type $d(x) = (d(1 \mid x), d(2 \mid x))$ with $d(j \mid x) \geq 0$ for $j = 1, 2$, and $\sum_j d(j \mid x) = 1$, for every $x \in \mathcal{X}$. Such rules will be called class probability estimators. Hence, the action space $\mathcal{A}_p$ consists of all pairs of nonnegative numbers that sum to 1.
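Returning to the classification risk (3.4), the sketch below evaluates the expected cost of a threshold rule on the real line, with $t_1 = \{x < c\}$ and $t_2 = \{x \geq c\}$. The two normal class densities, the equal priors, and the unit costs are assumptions chosen for the example, so that the integrals in (3.4) reduce to normal distribution functions.

```python
from statistics import NormalDist


def classification_risk(c, class1, class2, prior1, cost_1_given_2, cost_2_given_1):
    """Expected cost (3.4) of the rule: class 1 if x < c, class 2 otherwise."""
    prior2 = 1.0 - prior1
    # Integral of p(x|2) over t_1 = {x < c}: class 2 objects called class 1.
    miss_2_as_1 = class2.cdf(c)
    # Integral of p(x|1) over t_2 = {x >= c}: class 1 objects called class 2.
    miss_1_as_2 = 1.0 - class1.cdf(c)
    return (cost_1_given_2 * prior2 * miss_2_as_1
            + cost_2_given_1 * prior1 * miss_1_as_2)


# Hypothetical class densities: N(0,1) for class 1 and N(1,1) for class 2.
class1 = NormalDist(0.0, 1.0)
class2 = NormalDist(1.0, 1.0)
risks = {
    c: classification_risk(c, class1, class2, prior1=0.5,
                           cost_1_given_2=1.0, cost_2_given_1=1.0)
    for c in (0.0, 0.5, 1.0)
}
print(risks)
```

Among the three cut points tried, the midpoint between the two assumed means gives the smallest expected cost, which is what one would expect with equal priors, equal costs, and equal variances.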
Let the loss function $L_P(y,a)$ for $a = (a_1,a_2) \in \mathcal{A}_P$ be defined by

$$L_P(y,a) = \sum_j \left( a_j - \delta_j(y) \right)^2, \tag{3.5}$$

where $\delta_j(y)$ for $j = 1,2$ is the Kronecker delta (1 if $y = j$ and 0 otherwise). Then the risk of a decision rule $d$ is given by

$$R_P(d) = E[\,L_P(Y,d(X))\,] = E\Big[\sum_j \left( d(j|X) - \delta_j(Y) \right)^2\Big]. \tag{3.6}$$

But given $X = x$, $\delta_j(Y)$ is a Bernoulli random variable with success probability $p(j|x)$, for $j = 1,2$. Thus, $E[\,\delta_j(Y) \mid X = x\,] = p(j|x)$ and

$$E[\,(\delta_j(Y) - p(j|x))^2 \mid X = x\,] = \mathrm{Var}[\,\delta_j(Y) \mid X = x\,] = p(j|x)\,[1 - p(j|x)]. \tag{3.7}$$

Hence, for any $a \in \mathcal{A}_P$,

$$E[\,L_P(Y,a) \mid X = x\,] = \sum_j E[\,(\delta_j(Y) - p(j|x))^2 \mid X = x\,] + \sum_j (p(j|x) - a_j)^2$$
$$= \sum_j p(j|x)\,[1 - p(j|x)] + \sum_j (p(j|x) - a_j)^2 = 2p(1|x)p(2|x) + \sum_j (p(j|x) - a_j)^2,$$

from (3.7). Therefore, for class probability estimation, the risk of a rule $d$ is given by

$$R_P(d) = 2E[\,p(1|X)p(2|X)\,] + \sum_j E[\,(p(j|X) - d(j|X))^2\,], \tag{3.8}$$

where the first term does not depend on the rule.

Definition 3.5 A Bayes rule is a decision rule $d_B$ that minimizes the risk function $R(d)$.

In the classification approach, a Bayes rule $d_B$ that minimizes the expected cost as expressed in (3.4) is obtained by choosing

$$t_1 = \left\{ x \in \mathcal{X} : \frac{p(x|1)}{p(x|2)} > \frac{C(1|2)\,\pi_2}{C(2|1)\,\pi_1} \right\}, \quad \text{and} \quad t_2 = \left\{ x \in \mathcal{X} : \frac{p(x|1)}{p(x|2)} < \frac{C(1|2)\,\pi_2}{C(2|1)\,\pi_1} \right\}, \tag{3.9}$$

as shown in Anderson (1984), with the Bayes risk as given in (3.4) with the above regions $t_1$ and $t_2$.

In the class probability estimation approach, the unique Bayes rule is given by $d_B(x) = (p(1|x), p(2|x))$ for $x \in \mathcal{X}$, with risk

$$R_P(d_B) = 2E[\,p(1|X)p(2|X)\,] = 2\int p(1|x)\,p(2|x)\,P(dx), \tag{3.10}$$

which can be seen easily from (3.8).

Bayes rule and Bayes risk can also be defined for a partition of the sample space $\mathcal{X}$.

Definition 3.6 The partition function $\tau$ associated with the partition $\mathcal{T}$ is defined as $\tau : \mathcal{X} \to \mathcal{T}$ such that $\tau(x) = t$ if and only if $x \in t$, for all $x \in \mathcal{X}$ and $t \in \mathcal{T}$.

A decision rule $d$ is said to correspond to the partition $\mathcal{T}$ if it is constant on each subset of $\mathcal{T}$.
That is, $d$ corresponds to the partition $\mathcal{T}$ when there exists some $\mathcal{A}$-valued function $\omega$ on $\mathcal{T}$ such that, for every $t \in \mathcal{T}$, $\omega(t) = d(x)$ for every $x \in t$. Then a decision rule $d_{\mathcal{T}}$ corresponding to the partition $\mathcal{T}$ is explicitly given by $d_{\mathcal{T}}(x) = \omega(\tau(x))$, and the associated risk is given by

$$R(d_{\mathcal{T}}) = \sum_{t \in \mathcal{T}} E[\,L(Y,\omega(t)) \mid X \in t\,]\,P(t), \tag{3.11}$$

where $P(t) = P(X \in t)$. Thus $d_{\mathcal{T}}$ is a Bayes rule corresponding to the partition $\mathcal{T}$ if and only if $d_{\mathcal{T}}(x) = \omega(\tau(x))$ such that for each $t \in \mathcal{T}$, $a = \omega(t)$ minimizes $E[\,L(Y,a) \mid X \in t\,]$. For convenience, let $\omega(t)$ be a value that minimizes $E[\,L(Y,a) \mid X \in t\,]$ over $a \in \mathcal{A}$, for $t \in \mathcal{T}$. Furthermore, for $t \in \mathcal{T}$, let $r(t) = E[\,L(Y,\omega(t)) \mid X \in t\,]$. Then the Bayes risk corresponding to the partition $\mathcal{T}$ can be written as

$$R(\mathcal{T}) = \sum_{t \in \mathcal{T}} r(t)\,P(t). \tag{3.12}$$

In the classification approach to discrimination, a Bayes rule $d_{\mathcal{T}}$ corresponding to the partition $\mathcal{T}$ is obtained by setting $d_{\mathcal{T}}(x) = \omega_C(\tau(x))$ for $x \in \mathcal{X}$, where for $t \in \mathcal{T}$, $\omega_C(t)$ is a value $i \in \{1,2\}$ that minimizes

$$E[\,L_C(Y,i) \mid X \in t\,] = C(i|1)\,p(1|t) + C(i|2)\,p(2|t),$$

where $p(j|t) = P(Y = j \mid X \in t)$, $j = 1,2$. Thus, the minimum conditional expected cost of misclassification on subset $t \in \mathcal{T}$ is given by

$$r_C(t) = \min[\,C(2|1)\,p(1|t),\; C(1|2)\,p(2|t)\,]. \tag{3.13}$$

Then the Bayes risk for partition $\mathcal{T}$ can be written as

$$R_C(\mathcal{T}) = \sum_{t \in \mathcal{T}} r_C(t)\,P(t). \tag{3.14}$$

In the class probability estimation approach, the unique Bayes rule $d_{\mathcal{T}}$ corresponding to partition $\mathcal{T}$ is obtained by setting $d_{\mathcal{T}}(x) = \omega_P(\tau(x))$ for $x \in \mathcal{X}$, where $\omega_P(t)$ is the pair of nonnegative values $a = (a_1,a_2)$ that minimizes

$$E[\,L_P(Y,a) \mid X \in t\,] = \sum_j E[\,(\delta_j(Y) - p(j|t))^2 \mid X \in t\,] + \sum_j (p(j|t) - a_j)^2 = \sum_j p(j|t)\,[1 - p(j|t)] + \sum_j (p(j|t) - a_j)^2,$$

since $\delta_j(Y)$ given $X \in t$ is a Bernoulli random variable with success probability $p(j|t) = P(Y = j \mid X \in t)$ for $j = 1,2$. Hence $\omega_P(t) = (p(1|t), p(2|t))$.
Thus for $t \in \mathcal{T}$, the minimum conditional expected loss is given by

$$r_P(t) = 2p(1|t)\,p(2|t). \tag{3.15}$$

The Bayes risk for partition $\mathcal{T}$ can then be written as

$$R_P(\mathcal{T}) = \sum_{t \in \mathcal{T}} r_P(t)\,P(t). \tag{3.16}$$

Suppose the sample space $\mathcal{X}$ is to be divided into two regions using the class probability estimation approach. How do these two regions compare with those selected by the classification approach? For any two-region partition $\mathcal{T} = \{t_1,t_2\}$,

$$R_P(\mathcal{T}) = \sum_{t \in \mathcal{T}} r_P(t)\,P(t) = 2p(1|t_1)\,p(2|t_1)\,P(t_1) + 2p(1|t_2)\,p(2|t_2)\,P(t_2). \tag{3.17}$$

Suppose $\pi_1$, $\pi_2$, $p(x|1)$ and $p(x|2)$ are known as in the classification approach. Then (3.17) can be re-expressed as

$$R_P(\mathcal{T}) = 2p(1|t_1)\,P(X \in t_1 \mid 2)\,\pi_2 + 2p(2|t_2)\,P(X \in t_2 \mid 1)\,\pi_1 = 2p(1|t_1)\,\pi_2 \int_{t_1} p(x|2)\,dx + 2p(2|t_2)\,\pi_1 \int_{t_2} p(x|1)\,dx. \tag{3.18}$$

But this is the same as the expected cost (3.4) of a classification rule if $2p(1|t_1) = C(1|2)$ and $2p(2|t_2) = C(2|1)$.

Let $\mathcal{T}^* = \{t_1^*, t_2^*\}$ be the partition with minimum risk $R_P(\cdot)$ among all two-region partitions; that is, let $\mathcal{T}^*$ be the best two-region partition using the class probability estimation approach. Suppose the cost ratio is given by

$$\frac{C(1|2)}{C(2|1)} = \frac{p(1|t_1^*)}{p(2|t_2^*)}.$$

Then from (3.9), a Bayes rule that minimizes the expected cost in (3.4) is determined by the partition $\mathcal{T}^*$. Therefore, by varying the cost ratio, the best two-region partition determined by the class probability estimation approach can be obtained from the classification approach.

3.3 Sample Space Partitions Corresponding to Bayes Rules

In the following sections, some of the commonly used methods for discriminant analysis are presented. The most widely used method assumes multivariate normality for the observations from both classes. In this case, a Bayes rule is obtained by choosing a linear partition that minimizes the risk function. The logistic discrimination procedure also provides a linear partition for use with both normal and certain non-normal populations.
Methods based on nonparametric density estimation algorithms, such as kernel and nearest neighbor methods, are also available, but will not be covered in this thesis. Instead, the method of classification trees is explored. A recent report produced by a panel on Discriminant Analysis and Clustering (DAC report), which was created under the Committee on Applied and Theoretical Statistics, National Research Council (1988), provides a helpful summary of all these methods. In the following, we present three of these methods from the decision theoretic perspective. In addition, we examine the classification trees method in much greater detail.

3.3.1 Linear Discriminants for Normal Distributions

In the classification problem, by assuming the two class-conditional distributions are known multivariate normal with equal covariance matrices, namely $N(\mu_1,\Sigma)$ and $N(\mu_2,\Sigma)$, Wald (1944) showed a Bayes rule is obtained by choosing the linear partition given by

$$t_1 = \{\, x \in \mathcal{X} : x^T \Sigma^{-1}(\mu_1 - \mu_2) > k_1 \,\}, \quad \text{and} \quad t_2 = \{\, x \in \mathcal{X} : x^T \Sigma^{-1}(\mu_1 - \mu_2) < k_1 \,\}, \tag{3.19}$$

where the point $k_1$ is a function of $\pi_1$, $\pi_2$, $C(1|2)$, $C(2|1)$, $\mu_1$, $\mu_2$ and $\Sigma$; see Anderson (1984), Hand (1981), Dillon and Goldstein (1984), and others. The linear projection given by $x^T \Sigma^{-1}(\mu_1 - \mu_2)$ is sometimes called the normal linear discriminant function.

However, in most applications, the mean vectors and the covariance matrices are unknown. Suppose there is a sample of size $N_1$ from class 1 and a sample of size $N_2$ from class 2. Let $\mu_j$ be estimated by the usual mean $\bar{x}_j$ of the sample from the class $j$ population for $j = 1,2$, and let $\Sigma$ be estimated by the pooled sample covariance $S$ defined by

$$S = \frac{(N_1 - 1)\,S_1 + (N_2 - 1)\,S_2}{N_1 + N_2 - 2},$$

where $S_1$ and $S_2$ are the corresponding sample covariance matrices.
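As a small numerical illustration of the pooled estimate above, the following sketch computes the two sample means, the pooled covariance, and the plug-in version of the linear projection obtained by substituting these estimates for the unknown normal parameters. It is restricted to two measurement variables so that no matrix library is needed; the two tiny samples and all function names are hypothetical.

```python
def mean_vector(rows):
    """Componentwise sample mean of a list of equal-length tuples."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(len(rows[0]))]

def covariance(rows):
    """Sample covariance matrix with divisor (n - 1)."""
    n, m = len(rows), mean_vector(rows)
    dim = len(m)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(dim)] for a in range(dim)]

def pooled_covariance(sample1, sample2):
    """Pooled covariance S = ((N1-1) S1 + (N2-1) S2) / (N1 + N2 - 2)."""
    n1, n2 = len(sample1), len(sample2)
    s1, s2 = covariance(sample1), covariance(sample2)
    dim = len(s1)
    return [[((n1 - 1) * s1[a][b] + (n2 - 1) * s2[a][b]) / (n1 + n2 - 2)
             for b in range(dim)] for a in range(dim)]

def solve_2x2(a, b):
    """Solve the 2 x 2 linear system a w = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * b[0] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

# Hypothetical samples with two measurement variables each.
class1 = [(2, 3), (3, 3), (4, 5), (3, 5)]
class2 = [(0, 1), (1, 0), (2, 1), (1, 2)]

S = pooled_covariance(class1, class2)
m1, m2 = mean_vector(class1), mean_vector(class2)
diff = [m1[0] - m2[0], m1[1] - m2[1]]
w = solve_2x2(S, diff)            # coefficient vector S^{-1} (mean1 - mean2)
print(S, w)
```

The vector `w` gives the coefficients of the estimated linear projection; an object is then assigned to a class by comparing its projected value with a cut-point that depends on the priors and costs.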
Then the Bayes decision regions are estimated by

$$\hat{t}_1 = \{\, x \in \mathcal{X} : x^T S^{-1}(\bar{x}_1 - \bar{x}_2) > k_2 \,\}, \quad \text{and} \quad \hat{t}_2 = \{\, x \in \mathcal{X} : x^T S^{-1}(\bar{x}_1 - \bar{x}_2) < k_2 \,\}, \tag{3.20}$$

where the point $k_2$ is a function of $\pi_1$, $\pi_2$, $C(1|2)$, $C(2|1)$, $\bar{x}_1$, $\bar{x}_2$ and $S$. The linear projection given by $x^T S^{-1}(\bar{x}_1 - \bar{x}_2)$ is the linear discriminant function suggested by Fisher (1936).

3.3.2 Logistic Linear Discriminants

In the classification problem, logistic discrimination also provides a linear partition of the sample space for use with normal and certain non-normal populations; see Lachenbruch (1975), Hand (1981), Dillon and Goldstein (1984), DAC report (1988), and others. Suppose that the two class population densities can be expressed as

$$p(x|j) = \exp(\alpha_j + x^T \beta_j), \quad \text{for } j = 1,2. \tag{3.21}$$

Then by invoking Bayes theorem,

$$p(1|x) = \frac{p(x|1)\,\pi_1}{p(x|1)\,\pi_1 + p(x|2)\,\pi_2} = \frac{\exp(\eta_0 + x^T \eta)}{1 + \exp(\eta_0 + x^T \eta)}, \tag{3.22}$$

where $\eta_0 = \log(\pi_1/\pi_2) + (\alpha_1 - \alpha_2)$ and $\eta = \beta_1 - \beta_2$. This is called a multivariate logistic function, which can be re-expressed as

$$\log\left[ \frac{p(1|x)}{1 - p(1|x)} \right] = \eta_0 + x^T \eta. \tag{3.23}$$

Thus the probability of belonging to a class given a measurement vector $X = x$ can be estimated by modeling the logit of $p(1|x)$ as a linear function of $x$. Furthermore, by substituting (3.22) into (3.9), the best decision region in the classification setting is given by the partition

$$t_1 = \{\, x \in \mathcal{X} : x^T \eta > k_3 \,\}, \quad \text{and} \quad t_2 = \{\, x \in \mathcal{X} : x^T \eta < k_3 \,\}, \tag{3.24}$$

where the point $k_3$ is a function of $\alpha_1$, $\alpha_2$, $\pi_1$, $\pi_2$, $C(1|2)$ and $C(2|1)$.

So far the logarithm of each class-conditional probability function is assumed to be adequately modeled by a linear function. A slightly more general approach assumes the difference between the logarithms of the class-conditional probability functions is linear. This is equivalent to the approach adopted by Anderson (1972) which assumes the logit of $p(1|x)$ is linear as expressed in (3.23). The equivalence relationship can easily be seen by examining expression (3.22).
Clearly, the model expressed in (3.22) is exact when the class conditional probability density functions are multivariate normal with identical variance-covariance matrices. Thus, for known normal densities $p(x|1)$ and $p(x|2)$, the logistic regression coefficients are functions of normal parameters, and the Bayes decision regions given in (3.24) correspond to Wald's linear partition in (3.19). However, if the underlying class conditional probability densities are multivariate normal with unknown parameters, then the logistic discrimination procedure cannot be expected to classify as well as does the linear discriminant function (Efron 1975, and Press and Wilson 1978).

3.3.3 Classification Trees: Recursive Partitioning

The technique of classification trees for discriminant analysis was initially developed by Morgan and Sonquist (1963), and Morgan and Messenger (1973) under the name automatic interaction detection (AID). This technique has been pursued and refined by several people. Recent development, under the name classification and regression trees (CART), is described in detail in the book by Breiman et al. (1984). The primary difference between AID and CART is in the tree construction.

The technique of CART creates a binary tree-structured discriminant by repeatedly splitting subsets of sample space $\mathcal{X}$ into two descendant sets, starting with $\mathcal{X}$ itself. An example is illustrated in Figure 3, where $t_1 = \mathcal{X}$, $t_2$ and $t_3$ are disjoint subsets of $t_1$ with $t_2 \cup t_3 = t_1$, and $t_4$ and $t_5$ are disjoint subsets of $t_2$ with $t_4 \cup t_5 = t_2$.

Figure 3 An example binary tree (root $t_1$ with descendants $t_2$ and $t_3$; $t_2$ with descendants $t_4$ and $t_5$)

Those subsets with no descendant sets are called terminal subsets. In the above example, $t_3$, $t_4$ and $t_5$ are the terminal subsets. Thus the technique of CART constructs discriminant rules that partition the sample space as specified by the terminal subsets.
That is, $\{t_3,t_4,t_5\}$ forms a partition of the sample space that corresponds to some decision rule. The tree is constructed based on a set of binary questions of the form { Is $x \in t$? } for some subset $t$ of $\mathcal{X}$. Let the measurement vector $X$ be $M$ dimensional, $X = (X_1,\ldots,X_M)$, with a mixture of ordered and categorical types.¹ Then the allowable set of splits is defined as follows:

a. Each split depends on the value of a single variable.
b. For each ordered variable $X_m$, the questions are of the form { Is $x_m \leq c$? }, for all $c$ in the range of $X_m$.
c. For each categorical variable $X_m$, the questions are of the form { Is $x_m \in S$? }, for all subsets $S$ of possible $X_m$-values.

¹ As defined in Breiman et al. (1984), a variable is ordered if its measured values are real numbers; and a variable is categorical if it takes on values from a finite set with no natural ordering. Thus an ordered variable can be a continuous or an ordinal variable.

Let $\mathcal{T}$ be a fixed partition and let $t \in \mathcal{T}$ be a fixed subset of $\mathcal{X}$ in $\mathcal{T}$. Consider a split $s$ of $t$ into two disjoint subsets $t_L$ and $t_R$. Let $\mathcal{T}'$ be the modification of $\mathcal{T}$ after applying split $s$ to $t$. Then, writing $R(t) = r(t)P(t)$ for the contribution of subset $t$ to the risk, the risk reduction $\Delta R(s,t) = R(\mathcal{T}) - R(\mathcal{T}')$ due to the split $s$ is given by

$$\Delta R(s,t) = R(t) - [\,R(t_L) + R(t_R)\,] = P(t)\,[\,r(t) - P_L\,r(t_L) - P_R\,r(t_R)\,], \tag{3.25}$$

where $P_L = P[\,X \in t_L \mid X \in t\,]$ and $P_R = P[\,X \in t_R \mid X \in t\,]$. The relative risk reduction due to the split is then given by

$$\Delta R(s|t) = \Delta R(s,t)\,/\,P(t) = r(t) - P_L\,r(t_L) - P_R\,r(t_R). \tag{3.26}$$

Thus, the risk reduction partition is achieved by choosing the split $s$ that maximizes the relative risk reduction.

In the class probability estimation approach,

$$p(j|t) = P_L\,p(j|t_L) + P_R\,p(j|t_R), \quad j = 1,2.$$

Thus by substituting the above into $r_P(t)$ in (3.15), $\Delta R_P$ can be shown to be

$$\Delta R_P(s|t) = 2P_L P_R\,[\,p(1|t_L) - p(1|t_R)\,]^2. \tag{3.27}$$
Hence the relative risk reduction in (3.27) is maximized if the difference between class probabilities in the two resulting subsets is maximized. Suppose class 1 corresponds to the class of households with infant death. Then the class probability estimation approach seeks splits that maximize the difference in probability of infant death between the two resulting groups. Furthermore, because of the multiplicative factor $P_L P_R$, the criterion also favors those splits which divide the set $t$ more evenly into two subsets.

Note that relative risk in epidemiology, as defined in Definition 3.1, involves a ratio rather than a difference:

$$\frac{p(1|t_L)}{p(1|t_R)}.$$

Thus a desirable split should have a very high or very low relative risk value. In any case, there is no way of ensuring even splits. Therefore, as discussed in Section 3.2.1, using relative risk as a partitioning criterion may not provide splits of practical value.

Risk reduction is not a good criterion for choosing a split in the classification approach, for two reasons. First, Breiman et al. (1984: pp. 95-96) showed that for any split of $t$ into $t_L$ and $t_R$, $R_C(t) \geq R_C(t_L) + R_C(t_R)$, with equality if $j^*(t) = j^*(t_L) = j^*(t_R)$, where $j^*(u)$ minimizes $C(j|1)\,p(1|u) + C(j|2)\,p(2|u)$, for subset $u$ of $\mathcal{X}$. Thus, it is conceivable that every allowable split of $t$ may produce a partition for which $\Delta R_C(s,t)$ is zero. In situations where the population is predominated by a single class, the risk reduction partition criterion may result in no splits. The second defect is caused by the fact that risk reduction (in the classification approach) is a one-step optimization process that does not account for the future splits. In some situations, the best current choice of split may not provide the best overall improvement in strategic position. For further discussion of these considerations, see Breiman et al. (1984: pp. 94-98).
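The first defect can be illustrated with a small numerical sketch. Assume a subset in which class 1 (households with infant death) is the minority, unit misclassification costs, and a split into two equally likely halves; all numbers and function names below are hypothetical. The split separates the class 1 probabilities quite well, yet the classification risk reduction from (3.13) and (3.26) is zero because both halves are still assigned to class 2, whereas the class probability estimation risk (3.15) does register an improvement.

```python
def r_class(p1, cost_2_given_1=1.0, cost_1_given_2=1.0):
    """Within-subset misclassification cost r_C(t) of (3.13)."""
    return min(cost_2_given_1 * p1, cost_1_given_2 * (1.0 - p1))

def r_prob(p1):
    """Within-subset risk r_P(t) = 2 p(1|t) p(2|t) of (3.15)."""
    return 2.0 * p1 * (1.0 - p1)

def relative_reduction(r, p_left, p1_left, p1_right):
    """Relative risk reduction (3.26) for a split with P_L = p_left."""
    p_right = 1.0 - p_left
    p1_parent = p_left * p1_left + p_right * p1_right
    return r(p1_parent) - p_left * r(p1_left) - p_right * r(p1_right)

# Hypothetical split: P_L = P_R = 0.5, p(1|t_L) = 0.35, p(1|t_R) = 0.05.
gain_class = relative_reduction(r_class, 0.5, 0.35, 0.05)
gain_prob = relative_reduction(r_prob, 0.5, 0.35, 0.05)
closed_form = 2.0 * 0.5 * 0.5 * (0.35 - 0.05) ** 2   # right side of (3.27)
print(gain_class, gain_prob, closed_form)
```

Up to floating-point rounding, the first value is zero while the second agrees with the closed form in (3.27).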
Two splitting criteria for the classification approach have been implemented in the CART software: the Gini criterion and the Twoing criterion. In the two-class problem, these criteria can be shown to coincide (Breiman et al. 1984: pp. 104-108). Thus, in this thesis only the Gini criterion is considered. Let $\mathcal{T}$ be any partition of sample space $\mathcal{X}$. For $t \in \mathcal{T}$, instead of $r(t)$ consider an impurity function $i(t)$ defined by

$$i(t) = 2p(1|t)\,p(2|t), \tag{3.28}$$

called the Gini diversity index. Then, the partition impurity $I(\mathcal{T})$ is defined by

$$I(\mathcal{T}) = \sum_{t \in \mathcal{T}} i(t)\,P(t). \tag{3.29}$$

Thus the impurity reduction due to the split $s$ is $\Delta I(s,t) = I(\mathcal{T}) - I(\mathcal{T}')$, where $\mathcal{T}$ and $\mathcal{T}'$ are as defined in $\Delta R(s,t)$ earlier; and the relative impurity reduction due to the split $s$ is given by

$$\Delta I(s|t) = \Delta I(s,t)\,/\,P(t) = 2P_L P_R\,[\,p(1|t_L) - p(1|t_R)\,]^2. \tag{3.30}$$

But this is precisely the risk reduction criterion used in the class probability estimation approach as expressed in (3.27). Thus, the impurity reduction partition in the classification approach using the Gini diversity index is the same as the risk reduction partition in the class probability estimation approach. Therefore, the sample space $\mathcal{X}$ is partitioned in the same manner by both approaches when CART is used.

3.4 Construction of Discriminants from Sample Data

Since the measurement variables available in the Sri Lankan household study are mainly ordinal, not continuous, partitioning of the sample space by assuming normal populations may not be appropriate. Thus, only the latter two techniques, logistic linear discrimination and CART, are discussed in this section.

In practice, classification or discrimination problems begin with a sample of correctly classified objects, each with a set of measurements, $x$. The classification approach uses the sample to derive rules that partition the sample space into disjoint regions with each region purely or predominantly inhabited by members of a single class.
The partitioning of a population into classification regions is similar to, but not quite the same as, the partitioning of a population into groups distinguishable with respect to high and low risk of belonging to a specific class. In principle, class is clearly defined while the terms high risk and low risk are relative. Both the logistic discrimination and the CART technique (for class probability estimation) estimate the class probabilities for each possible measurement vector $x$ in the sample space $\mathcal{X}$. The high and low risk groups are then defined by choosing a probability threshold.

3.4.1 Logistic Discriminant

Let $\{ (X_n,Y_n) : n = 1,\ldots,N \}$ be a random sample of size $N$ from the joint distribution of $(X,Y)$, where $X$ is an $\mathcal{X}$-valued random variable and $Y$ is a $\mathcal{C}$-valued random variable that denotes the class membership of the observation. Logistic discrimination assumes that

$$\log\left[ \frac{P(Y = 1 \mid x)}{P(Y = 2 \mid x)} \right] = \eta_0 + x^T \eta$$

for $x \in \mathcal{X}$. Thus, for $x \in \mathcal{X}$,

$$P(Y = 1 \mid x) = \frac{\exp(\eta_0 + x^T \eta)}{1 + \exp(\eta_0 + x^T \eta)}. \tag{3.31}$$

Therefore, the parameters $\eta_0$ and $\eta$ can be estimated by maximizing the likelihood function,

$$\prod_{n=1}^{N} P(Y = y_n \mid x_n).$$

All logistic discriminant analyses performed for the Sri Lankan household study are accomplished by using a logistic regression program, PLR, of BMDP Statistical Software.

3.4.2 CART Discriminant: Growing a Classification Tree

Let $\{ (X_n,Y_n) : n = 1,\ldots,N \}$ be a random sample of size $N$ from the joint distribution of $(X,Y)$, where $X$ is an $\mathcal{X}$-valued random variable and $Y$ is a $\mathcal{C}$-valued random variable that denotes the class membership of the observation. In both the classification and the class probability estimation approach, there are two situations to consider: one when the prior probabilities $\pi_1$ and $\pi_2$ are known, and another when the prior probabilities are unknown.

Consider first the situation where the prior probabilities are known. Let $N_j$ be the number of observations with $y_n = j$, $j = 1,2$. Suppose $\mathcal{T}$ is a partition of the sample space $\mathcal{X}$.
Given a subset $t$ of the partition $\mathcal{T}$, let $N_j(t)$ be the number of observations with $x_n \in t$ and $y_n = j$, for $j = 1,2$. Then estimate $P(t) = P(X \in t)$ by

$$\hat{P}(t) = \sum_j \pi_j\,\frac{N_j(t)}{N_j}. \tag{3.32}$$

Suppose $\hat{P}(t) > 0$ for all $t \in \mathcal{T}$. Then for $j = 1,2$, estimate $p(j|t) = P(Y = j \mid X \in t)$ by

$$\hat{p}(j|t) = \frac{\pi_j\,N_j(t)\,/\,N_j}{\hat{P}(t)}. \tag{3.33}$$

In practical applications, however, the prior probabilities are often unknown. Then for any $t \in \mathcal{T}$, let $N(t)$ be the number of observations with $x_n \in t$, and estimate $P(t) = P(X \in t)$ by the proportion of observations in $t$,

$$\hat{P}(t) = \frac{N(t)}{N}. \tag{3.34}$$

Suppose $\hat{P}(t) > 0$ for all $t \in \mathcal{T}$. Then estimate $p(j|t)$ by the proportion of observations belonging to class $j$ in the subset $t$,

$$\hat{p}(j|t) = \frac{N_j(t)}{N(t)}. \tag{3.35}$$

For any $t \in \mathcal{T}$, let $p(j|t)$ be estimated by the appropriate $\hat{p}(j|t)$, $j = 1,2$. In the classification approach, let $\hat{\omega}_C(t)$ be the smallest $i \in \{1,2\}$ that minimizes $C(i|1)\,\hat{p}(1|t) + C(i|2)\,\hat{p}(2|t)$, and estimate $r_C(t)$ in (3.13) by

$$\hat{r}_C(t) = \min[\,C(2|1)\,\hat{p}(1|t),\; C(1|2)\,\hat{p}(2|t)\,]. \tag{3.36}$$

In the class probability estimation approach, let $\hat{\omega}_P(t)$ denote the vector $(\hat{p}(1|t), \hat{p}(2|t))$, and estimate $r_P(t)$ in (3.15) by

$$\hat{r}_P(t) = 2\hat{p}(1|t)\,\hat{p}(2|t). \tag{3.37}$$

Using the appropriate $\hat{P}(t)$ and $\hat{r}(t)$, the Bayes risk associated with the partition $\mathcal{T}$ is then estimated by

$$\hat{R}(\mathcal{T}) = \sum_{t \in \mathcal{T}} \hat{r}(t)\,\hat{P}(t). \tag{3.38}$$

Recall from Section 3.3.3 the desirable splitting criterion for either the classification or the class probability estimation approach (see (3.27) or (3.30)). Let $\mathcal{T}$ be a partition of sample space $\mathcal{X}$ with $\hat{P}(t) > 0$ for every $t \in \mathcal{T}$. Consider a split $s$ of $t \in \mathcal{T}$ into $t_L$ and $t_R$, where $\hat{P}(t_L) > 0$ and $\hat{P}(t_R) > 0$. Let

$$\hat{P}_L = \frac{\hat{P}(t_L)}{\hat{P}(t)} \quad \text{and} \quad \hat{P}_R = \frac{\hat{P}(t_R)}{\hat{P}(t)}.$$
Then, the empirical splitting rule for either approach is to choose an allowable split $s$ of $t$ that maximizes

$$2\hat{P}_L \hat{P}_R\,[\,\hat{p}(1|t_L) - \hat{p}(1|t_R)\,]^2. \tag{3.39}$$

This partitioning procedure will continue to split until each subset of the current partition contains either observations of the same class, or observations with identical measurement vector $x$. Discriminant rules obtained in this manner are artificial and highly data dependent. Furthermore, it is conceivable that this splitting procedure may continue until each terminal set contains only one observation. In the following, the construction of a parsimonious partition suggested by Breiman et al. (1984) is summarized.

3.4.3 CART Discriminant: Pruning a Classification Tree

The stop-splitting criterion initially consists of setting a threshold and deciding not to split further if the decrease in the estimated impurity for the classification approach, or the decrease in the estimated risk for the class probability estimation approach, is less than the threshold. This may lead to two problems. If the threshold is set too low, then there are too many subsets in the resulting partition. If the threshold is set too high, good splits may be lost. That is, a subset $t$ may not produce a split with a large enough decrease, but its descendants $t_L$ and $t_R$ may be able to do so. Breiman et al. (1984) suggest the following alternative.

The basic procedure can be summarized in three steps which are more easily described by tree terminologies. Recall the construction of binary tree-structured discriminants. Since each node on a tree corresponds to some set on the sample space $\mathcal{X}$, the terms node and set will be used interchangeably henceforth. So far, the terminal nodes of a given tree, which constitute a partition of the sample space, are the only tree terminology introduced.
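Before turning to the tree terminology, the empirical splitting rule of Section 3.4.2 can be sketched for a single ordered variable. The sketch below assumes unknown prior probabilities, so that the proportions in (3.34) and (3.35) are used, and it searches all cut-points c of the question { Is x ≤ c? }. The data set, in which x is a hypothetical ordinal score and y = 1 marks a household with an infant death, and the function name are illustrative only.

```python
def best_split(cases):
    """Search cut-points c of one ordered variable for the split
    {x <= c} versus {x > c} maximizing 2 * P_L * P_R * (p_L - p_R)**2,
    using the proportions in (3.34) and (3.35) (priors unknown).

    cases: list of (x, y) pairs with y in {1, 2}.
    Returns (cut, criterion), or None when no split is possible.
    """
    n = len(cases)
    best = None
    for c in sorted({x for x, _ in cases})[:-1]:    # largest value cannot split
        left = [y for x, y in cases if x <= c]
        right = [y for x, y in cases if x > c]
        p_left = len(left) / n                       # estimate of P_L
        p_right = len(right) / n                     # estimate of P_R
        p1_left = left.count(1) / len(left)          # estimate of p(1|t_L)
        p1_right = right.count(1) / len(right)       # estimate of p(1|t_R)
        value = 2.0 * p_left * p_right * (p1_left - p1_right) ** 2
        if best is None or value > best[1]:
            best = (c, value)
    return best

# Hypothetical subset t with ten cases.
cases = [(0, 1), (0, 1), (0, 2), (1, 1), (1, 2),
         (1, 2), (2, 2), (2, 2), (3, 2), (3, 2)]
cut, criterion = best_split(cases)
print(cut, criterion)
```

In this toy subset the cut at x ≤ 1 is preferred over the cut at x ≤ 0, even though the latter gives a larger difference in class 1 proportions, because the former divides the cases more evenly.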
Definition 3.7 The root node of a given binary tree is the node with no ancestor; that is, the set on the tree which is not a subset of any other sets on the tree.

Let a binary tree be denoted by $T$. Any node on the tree is denoted by $t \in T$, and the set of terminal nodes is denoted by $\tilde{T}$.

Definition 3.8 A branch $T_t$ of $T$ with root node $t \in T$ consists of the node $t$ and all descendants of $t$ in $T$.

Definition 3.9 Pruning a branch $T_t$ from a tree $T$ involves cutting off $T_t$ just below the node $t$. The resulting tree is denoted by $T - T_t$.

Definition 3.10 $T'$ is a pruned subtree of $T$ if $T'$ is obtained by successively pruning off the branches of $T$.

The alternative to the stop-splitting procedure has three basic steps. The sample space $\mathcal{X}$ is first partitioned into an overly large sized binary tree; that is, the sample space is partitioned into fine sets. This tree is then pruned upward until only the root node is left. By using a more appropriate estimate of the risk, the right sized tree from among the pruned subtrees is selected. The most obvious criterion for selecting a right sized tree is to choose the pruned subtree with minimum estimated risk. This criterion may also be adjusted to compensate for estimation errors. However, these criteria may not always select a sensible tree. In most practical applications, the pruned subtrees and their corresponding risk estimates are inspected; and by using external information about the variables and by noting the context of the problem, the right sized tree is selected.

The first step is to grow a large tree $T_0$ by continuing the splitting procedure until all the terminal nodes are either pure, or contain only identical measurement vectors.
Let $T_1$ be the smallest pruned subtree of $T_0$ with $\hat{R}(T_1) = \hat{R}(T_0)$. Note that the pruning criterion may differ for the classification and the class probability estimation approach. The estimated risk

$$\hat{R}(T) = \sum_{t \in \tilde{T}} \hat{r}(t)\,\hat{P}(t)$$

is defined differently for the two cases: $\hat{r}(t)$ is the estimated within-node misclassification cost in the classification approach, while $\hat{r}(t)$ is the estimated within-node Gini diversity index in the class probability estimation approach.

Now for any branch $T_t$ of $T_1$, define $\hat{R}(T_t)$ by

$$\hat{R}(T_t) = \sum_{t' \in \tilde{T}_t} \hat{R}(t'),$$

where $\hat{R}(t') = \hat{r}(t')\,\hat{P}(t')$ and $\tilde{T}_t$ is the set of terminal nodes of $T_t$. Breiman et al. (1984: pp. 287-288) showed that for any nonterminal node $t$ of $T_1$, $\hat{R}(t) > \hat{R}(T_t)$.

Definition 3.11 Let $\lambda \geq 0$ be a real number called the complexity parameter, and define the cost-complexity measure $\hat{R}_\lambda(T)$ as

$$\hat{R}_\lambda(T) = \hat{R}(T) + \lambda\,|\tilde{T}|,$$

where $|\tilde{T}|$ is the number of terminal nodes in the tree $T$.

The complexity parameter $\lambda$ may be thought of as the penalty on each terminal node of a tree. Thus the cost-complexity measure takes into account the risk associated with a tree, as well as the complexity of the tree. Consider any nonterminal node $t$ of $T_1$. As long as $\hat{R}_\lambda(T_t) < \hat{R}_\lambda(\{t\})$, the tree with the branch $T_t$ intact is preferred over the pruned subtree without the branch $T_t$. However, at some critical value of $\lambda$, the two cost-complexities become equal. Then the smaller tree with the branch $T_t$ pruned off is preferred over $T_1$.

Definition 3.12 Consider a nontrivial tree $T$. Define a function $\ell(t;T)$ for $t \in T$ by

$$\ell(t;T) = \begin{cases} \dfrac{\hat{R}(t) - \hat{R}(T_t)}{|\tilde{T}_t| - 1}, & t \notin \tilde{T}, \\[2ex] +\infty, & t \in \tilde{T}. \end{cases}$$

Then define the weakest link $\bar{t}$ in $T$ as the node satisfying

$$\ell(\bar{t};T) = \min_{t \in T} \ell(t;T).$$

Let $\lambda_2 = \ell(\bar{t}_1;T_1)$, where $\bar{t}_1$ is the weakest link in $T_1$. Then the node $\bar{t}_1$ is the weakest link in the sense that as the complexity parameter $\lambda$ increases, it is the first node $t$ for which $\hat{R}_\lambda(\{t\})$ equals $\hat{R}_\lambda(T_t)$, where $T_t$ is a branch of $T_1$ with root node $t$. Thus, when the complexity parameter is $\lambda_2$, the pruned subtree $T_2$, obtained by pruning away the branch $T_{\bar{t}_1}$ from $T_1$, is preferred over $T_1$.
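The function in Definition 3.12 can be evaluated directly for a small tree. The following sketch assumes a tree shaped like the example in Figure 3 and attaches a hypothetical node risk (the product of within-node risk and node probability) to each node; the dictionary layout, the risk values and the function names are illustrative assumptions, not output from the study.

```python
# Hypothetical tree shaped like Figure 3: node -> (node risk, children).
tree = {
    "t1": (0.30, ("t2", "t3")),
    "t2": (0.16, ("t4", "t5")),
    "t3": (0.08, ()),
    "t4": (0.05, ()),
    "t5": (0.07, ()),
}

def terminal_nodes(tree, t):
    """Terminal nodes of the branch rooted at node t."""
    _, children = tree[t]
    if not children:
        return [t]
    return [leaf for child in children for leaf in terminal_nodes(tree, child)]

def link_strength(tree, t):
    """Function of Definition 3.12: risk increase per terminal node removed
    when the branch below t is pruned; +infinity for terminal nodes."""
    leaves = terminal_nodes(tree, t)
    if leaves == [t]:
        return float("inf")
    branch_risk = sum(tree[leaf][0] for leaf in leaves)
    return (tree[t][0] - branch_risk) / (len(leaves) - 1)

def weakest_link(tree):
    """Node minimizing the function of Definition 3.12."""
    return min(tree, key=lambda t: link_strength(tree, t))

node = weakest_link(tree)
print(node, link_strength(tree, node))
```

Here the branch below t2 is pruned first: collapsing it raises the risk by about 0.04 per terminal node removed, compared with about 0.05 for collapsing the whole tree to its root.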
Now define recursively, for $k = 2,3,\ldots$, as long as $T_k$ is not just a terminal node,

$$\lambda_{k+1} = \ell(\bar{t}_k;T_k) \quad \text{and} \quad T_{k+1} = T_k - T_{\bar{t}_k},$$

where $\bar{t}_k$ is the weakest link in $T_k$. Continuing pruning in this manner, a decreasing sequence of subtrees is obtained: $T_1, T_2, \ldots, T_K$, where $T_K$ is the root node, which is common to all subtrees. Furthermore, a corresponding increasing sequence of complexity parameters is also obtained (Breiman et al. 1984: p. 286).

The next step is to select one of these pruned subtrees as the right sized tree. If $\hat{R}(T_k)$ is used to estimate the risk associated with $T_k$, the largest tree will always have the minimum estimated risk. Furthermore, this estimate is biased. Thus a more accurate estimate of $R(T_k)$ is needed. Two methods of estimation are discussed by Breiman et al. (1984): use of an independent test sample and cross-validation.

As noted earlier, the sequence of subtrees, $T_1,\ldots,T_K$, may differ for the classification and the class probability estimation approach. Since the class probability estimation approach seems more appropriate for the discrimination objectives of the Sri Lankan household study, the description of the estimation methods will be restricted to the class probability estimation approach. Extension to the classification approach can be made similarly.

3.4.3.1 Test Sample Estimates of Risk

The sample is divided randomly into two sets, where one set is used to construct the decision rules, and the other is used to estimate the risk associated with each rule constructed. These two sets are generally called the training sample and the test sample respectively. Let $\mathcal{L}$ denote the random sample $\{ (X_n,Y_n) : n = 1,\ldots,N \}$. A sample of fixed size $N^{(2)}$ is randomly selected from $\mathcal{L}$ to form the test sample $\mathcal{L}^{(2)}$. The remainder $\mathcal{L}^{(1)} = \mathcal{L} - \mathcal{L}^{(2)}$ constitutes the training sample, which is used to construct the decreasing sequence of pruned subtrees, $T_1,\ldots,T_K$.
For each pruned subtree $T_k$, let $\hat{p}_k(j|x)$ estimate the probability of belonging to class $j$ given measurement vector $x$, $j = 1,2$, by applying $T_k$ to the cases in the test sample. Then for $j = 1,2$, define

$$R^{ts}(T_k \mid j) = \frac{1}{N_j^{(2)}} \sum_{n \in \eta_j^{(2)}} \sum_i \left[\, \hat{p}_k(i|x_n) - \delta_i(y_n) \,\right]^2, \tag{3.40}$$

where $\eta_j^{(2)} = \{\, n : (X_n,Y_n) \in \mathcal{L}^{(2)} \text{ and } Y_n = j \,\}$, $N_j^{(2)}$ is the number of cases in $\eta_j^{(2)}$, and $\delta_i(y_n)$ is the Kronecker delta (1 if $y_n = i$ and 0 otherwise). The test sample estimate of the Bayes risk associated with the tree $T_k$ is then given by

$$R^{ts}(T_k) = \sum_j R^{ts}(T_k \mid j)\,\pi_j. \tag{3.41}$$

If the prior probabilities are unknown, estimate $\pi_j$ by $N_j^{(2)} / N^{(2)}$, $j = 1,2$. The standard error estimate for $R^{ts}(T_k)$, denoted by $SE(R^{ts}(T_k))$, may be obtained by standard statistical methods as described in Breiman et al. (1984).

A large sample is needed for this method. In particular, a large number of cases is required in the training sample so that the rules constructed are somewhat reliable.

3.4.3.2 Cross-Validation Estimates of Risk

When the data set is large, test sample estimation is a reasonable approach. However, when the number of cases is only a few hundred as in the Sri Lankan household study, test sample estimation can be inefficient in its use of available data. Thus, cross-validation is preferred.

In $V$-fold cross-validation, the original sample $\mathcal{L}$ is randomly divided into $V$ subsets of similar sizes, $\mathcal{L}_v$, $v = 1,\ldots,V$. Then the $v$-th training sample is defined as $\mathcal{L}^{(v)} = \mathcal{L} - \mathcal{L}_v$, for $v = 1,\ldots,V$. By using the entire sample $\mathcal{L}$, the decreasing sequence of pruned subtrees, $T_1,\ldots,T_K$, with corresponding complexity parameters, $\lambda_1,\ldots,\lambda_K$, is constructed. Then for each $k = 1,\ldots,K$, let $\lambda'_k = \sqrt{\lambda_k \lambda_{k+1}}$ denote the geometric mean of $\lambda_k$ and $\lambda_{k+1}$, with $\lambda_{K+1} = \infty$. Now for each sample $\mathcal{L}^{(v)}$, $v = 1,\ldots,V$, construct $T^{(v)}(\lambda'_k)$, the optimally pruned subtree with respect to the complexity parameter $\lambda'_k$, $k = 1,\ldots,K$. Then for each tree $T^{(v)}(\lambda'_k)$, let $\hat{p}_k^{(v)}(j|x)$ estimate the probability of belonging to class $j$ given measurement vector $x$, $j = 1,2$, by applying $T^{(v)}(\lambda'_k)$ to the cases in $\mathcal{L}_v$.
With these estimates evaluated on the held-out cases in $\mathcal{L}_v$, define, for $j = 1,2$,

$$R^{cv}(T_k \mid j) = \frac{1}{N_j} \sum_{v=1}^{V} \sum_{n \in \eta_j^{(v)}} \sum_i \left[\, \hat{p}_k^{(v)}(i|x_n) - \delta_i(y_n) \,\right]^2, \tag{3.42}$$

where $\eta_j^{(v)} = \{\, n : (X_n,Y_n) \in \mathcal{L}_v \text{ and } Y_n = j \,\}$, and $\delta_i(y_n)$ is the Kronecker delta (1 if $y_n = i$ and 0 otherwise). The cross-validated estimate of the Bayes risk associated with the tree $T_k$ is then given by

$$R^{cv}(T_k) = \sum_j R^{cv}(T_k \mid j)\,\pi_j. \tag{3.43}$$

If the prior probabilities are unknown, estimate $\pi_j$ by $N_j / N$, $j = 1,2$. The standard error estimate for $R^{cv}(T_k)$, denoted by $SE(R^{cv}(T_k))$, may be obtained by heuristic arguments as described in Breiman et al. (1984).

The right sized tree may be defined as the pruned subtree with minimum estimated risk, or as recommended by Breiman et al. (1984), the tree selected by the 1 SE rule: instead of the tree $T_{k_0}$ with minimum estimated risk, the smallest tree $T_{k^*}$ satisfying

$$R^{ts}(T_{k^*}) \leq R^{ts}(T_{k_0}) + SE(R^{ts}(T_{k_0})) \quad \text{or} \quad R^{cv}(T_{k^*}) \leq R^{cv}(T_{k_0}) + SE(R^{cv}(T_{k_0})),$$

whichever is appropriate, is selected. This rule was created to take into account the instability of minimum estimated risk, and to select the simplest tree whose estimated risk is comparable to the minimum estimated risk. Note that $T_{k^*}$ is a pruned subtree of $T_{k_0}$.

4. Path Analysis

Path analysis investigates causal patterns in a set of variables, in contrast to the focus of discriminant analysis on patterns among individuals or cases. This statistical methodology, which was introduced by a geneticist, Sewall Wright, in the 1920's, has been popularized in the sociological literature (see Duncan 1966, Land 1969, Blalock 1970 and others). Path analysis utilizes a visual representation, called a path diagram, which consists of arrows leading from one variable to another, to illustrate the cause-and-effect relationships among the variables.
of the method does not specify the direction of relations between the variables, but does provide quantitative assessments of the relationships v i a what are called coefficients. Thus, this i s not a method for discovering path causal relatioships among the variables, but rather a method for assessing whether or not a specified set of relationships among the variables i s compatible with the observations. Hence, directions of causality between variables are specified by using non-statistical information In practice, the natural temporal ordering or substantive theory. of the variables usually indicates the direction of causality between the variables. The method of path analysis was i n i t i a l l y developed for quantitative data, where a path diagram i s based on a sequence of linear regression models. However, most sociological data quantitative. causal instead of Thus, assumptions under which path analysis was developed are generally not satisfied. studying are qualitative Goodman (1972, 1973a,b) proposed a method for relationships among discrete 48 variables, where a path diagram is based on one or more loglinear or logit models. However, causal models thus constructed have limitations, and are not directly analogous to causal models with continuous variables (Fienberg 1980, Rosenthal 1980). Various problems in causal modelling with quantitative or qualitative data have been explored Lauritzen 1983, recently (Wermuth 1980 Kiveri, Speed and Carlin 1984, and and 1987, Wermuth others). In and this thesis, only the basic approach which lead to the more recent developments for qualitative data is examined. 4.1 S t r u c t u r a l M o d e l l i n g w i t h Q u a n t i t a t i v e Data 4.1.1 Path Models A path model can be represented by a path diagram. Suppose we are interested in the relationship between infant mortality (X ), a dichotomous o variable, and two explanatory variables, say age (X^) and education (X^) of the mother. 
We suspect that both age and education influence infant mortality directly. Further, we rule out the possibility that education affects age, but will postulate that age affects the level of education attained. Then this model can be represented pictorially as in Figure 4.

[Path diagram: Age ($X_1$) to Level of education ($X_2$); Age ($X_1$) to Infant death ($X_0$); Level of education ($X_2$) to Infant death ($X_0$)]

Figure 4 An example of a path diagram

The directed arrow, leading from one variable to another, indicates that the first variable has direct influence on the second. A path is formed by moving along the arrows. In our example, $X_1 \to X_2$, $X_1 \to X_0$, $X_2 \to X_0$, and $X_1 \to X_2 \to X_0$ are the possible paths. If a path diagram contains a path that traces back onto itself, then the diagram is said to have a feedback loop. Any path model represented by a diagram with no feedback loop is called a recursive system. All path models considered hereafter are recursive.

The method of path analysis assumes that all relationships are linear. Thus for the above example,

$$X_2 = \beta_{21} X_1, \qquad X_0 = \beta_{01} X_1 + \beta_{02} X_2. \tag{4.1}$$

But in practice this is not exact; there are unmeasured sources of variation. Thus, the above system of equations is more appropriately expressed as

$$X_2 = \beta_{21} X_1 + \varepsilon_2, \qquad X_0 = \beta_{01} X_1 + \beta_{02} X_2 + \varepsilon_0, \tag{4.2}$$

where the error terms, $\varepsilon_2$ and $\varepsilon_0$, have mean 0 and are uncorrelated with the other variables in the corresponding equations. Without loss of generality, assume hereafter that all variables are standardized to mean 0 and unit variance. Conventionally, coefficients in the equations with standardized variables are referred to as path coefficients, and are denoted by $p_{ij}$, where the subscripts represent the direct effect of standardized variable $X'_j$ on standardized variable $X'_i$. Thus, our path model can be re-expressed as

$$X'_2 = p_{21} X'_1 + p_{22}\, e_2, \qquad X'_0 = p_{01} X'_1 + p_{02} X'_2 + p_{00}\, e_0, \tag{4.3}$$

where coefficients such as $p_{22}$ and $p_{00}$ are generally referred to as the residual path coefficients. The path diagram is then modified as follows.

[Path diagram: Age ($X'_1$) to Level of education ($X'_2$), labelled $p_{21}$; Age ($X'_1$) to Infant death ($X'_0$), labelled $p_{01}$; Level of education ($X'_2$) to Infant death ($X'_0$), labelled $p_{02}$; residual paths $e_2$ to $X'_2$, labelled $p_{22}$, and $e_0$ to $X'_0$, labelled $p_{00}$]

Figure 5 An example of a path diagram with path coefficients

Since a path model can be represented by a sequence of linear submodels, the corresponding path diagram can be modified to better reflect this key concept by the use of colors. For instance, the earlier example can be represented by a path diagram with colored arcs as in Figure 6. The modified path diagram is visually more attractive, in the sense that vital information can be extracted more easily. Suppose we want to know which variables have direct effect on a specific variable in a more complicated path model. Instead of staring at a maze of arcs, we can focus on a particular color and obtain the desired information. This feature is especially useful in specifying the system of linear equations that represents a path model.

[Path diagram of Figure 5 redrawn with arcs colored by submodel: the arcs into Level of education ($X'_2$) in one color, and the arcs into Infant death ($X'_0$) in another]

Figure 6 An example of colored path diagram

The basic assumptions underlying the application of path analysis for quantitative data are summarized as follows:

i. Causal (or temporal) ordering of the variables in the model is assumed as specified. Validity of the model cannot be evaluated from the data; external criteria or substantive theory must provide justification for the model proposed.

ii. Relationships among the variables are linear and additive.

iii. Error terms are not correlated with variables preceding them in the submodel, nor with each other.

iv. The variables are measured on an interval scale (at least), with the exception of dichotomous variables, which can be included as interval-scaled by assigning numerical scores to the two categories.

4.1.2 Estimation and Interpretation of Path Coefficients

Path coefficients may be estimated in two ways.
The first method, decomposing correlation coefficients, was employed by Wright (1934, 1960) in the development of path analysis. The second method consists of applying ordinary least squares regression to each submodel in the system. The latter method of estimation automatically provides estimates of the precision of the coefficients, and a framework in which hypotheses concerning the coefficients may be tested. Although the regression method is generally preferred, the method of decomposing correlation coefficients offers a more fundamental understanding of the relationships among the variables considered. In the following, these two estimation methods are illustrated in the context of the earlier example using a random sample of size $N$.

Since the variables are standardized, the sample correlation coefficient between $X'_i$ and $X'_j$ can be expressed as

$$r_{ij} = \frac{1}{N} \sum x'_i x'_j .$$

Let the sample correlation coefficient be zero if the two variables are assumed to be uncorrelated. Then in path model (4.3), the sample correlations between each error term and the explanatory variables in its equation are zero.

Let $\hat{p}_{ij}$ denote the estimate of path coefficient $p_{ij}$. Then path model (4.3) implies that

$$r_{21} = \frac{1}{N} \sum x'_2 x'_1 = \frac{1}{N} \sum \left( \hat{p}_{21} x'_1 + \hat{p}_{22} e_2 \right) x'_1 = \hat{p}_{21}, \tag{4.4}$$

since $\frac{1}{N} \sum x'^{\,2}_1 = 1$ and $\frac{1}{N} \sum x'_1 e_2 = 0$. Similarly,

$$r_{01} = \hat{p}_{01} + \hat{p}_{02}\, r_{21}, \quad \text{and} \quad r_{02} = \hat{p}_{02} + \hat{p}_{01}\, r_{12}. \tag{4.5}$$

In general, Wright (1934) showed that

$$r_{ij} = \sum_{s} \hat{p}_{is}\, r_{sj}, \tag{4.6}$$

where $s$ runs over all variables with direct effect on $X'_i$. Therefore, estimates of the path coefficients can be obtained by solving for the $\hat{p}_{ij}$'s in the decomposition of correlation coefficients. In our example,

$$\hat{p}_{21} = r_{21}, \qquad \hat{p}_{01} = \frac{r_{01} - r_{02}\, r_{21}}{1 - r_{21}^2}, \quad \text{and} \quad \hat{p}_{02} = \frac{r_{02} - r_{01}\, r_{12}}{1 - r_{12}^2}. \tag{4.7}$$

Now the residual path coefficients can be obtained by noting

$$r_{22} = \frac{1}{N} \sum x'^{\,2}_2 = \frac{1}{N} \sum \left( \hat{p}_{21} x'_1 + \hat{p}_{22} e_2 \right)^2 = \hat{p}_{21}^2 + \hat{p}_{22}^2 = 1, \quad \text{and}$$

$$r_{00} = \frac{1}{N} \sum x'^{\,2}_0 = \hat{p}_{01}^2 + \hat{p}_{02}^2 + \hat{p}_{00}^2 + 2\, \hat{p}_{01} \hat{p}_{02} \hat{p}_{21} = 1.$$

Thus,

$$\hat{p}_{22} = \left( 1 - \hat{p}_{21}^2 \right)^{1/2}, \qquad \hat{p}_{00} = \left( 1 - \hat{p}_{01}^2 - \hat{p}_{02}^2 - 2\, \hat{p}_{01} \hat{p}_{02} \hat{p}_{21} \right)^{1/2}. \tag{4.8}$$

For a simple path model as in our example, this method of estimation seems straightforward. However, for a more complicated model, this method can be very tedious.

Since a path model is essentially a sequence of linear submodels, path coefficients can be estimated by applying the method of ordinary least squares regression to each submodel. Thus for path model (4.3), the ordinary least squares estimate of $p_{21}$ is

$$\hat{p}_{21} = \frac{\sum x'_1 x'_2}{\sum x'^{\,2}_1} = r_{21},$$

since $x'_1$ and $x'_2$ are standardized; and the normal equations for the second linear relationship are as expressed in (4.5). It can be shown easily that the residual path coefficient equals $\sqrt{1 - R^2}$, where $R^2$ is the coefficient of multiple determination between the dependent variable in question and those variables with direct influence on it. Thus for model (4.3),

$$\hat{p}_{22}^2 = 1 - R^2_{2 \cdot 1} = 1 - \frac{1}{N} \sum \left( \hat{p}_{21} x'_1 \right)^2 = 1 - \hat{p}_{21}^2,$$

$$\hat{p}_{00}^2 = 1 - R^2_{0 \cdot 12} = 1 - \frac{1}{N} \sum \left( \hat{p}_{01} x'_1 + \hat{p}_{02} x'_2 \right)^2 = 1 - \hat{p}_{01}^2 - \hat{p}_{02}^2 - 2\, \hat{p}_{01} \hat{p}_{02} \hat{p}_{21},$$

where $R^2_{2 \cdot 1}$ is the coefficient of multiple determination between dependent variable $X'_2$ and independent variable $X'_1$, and $R^2_{0 \cdot 12}$ is the coefficient of multiple determination between dependent variable $X'_0$ and independent variables $X'_1$ and $X'_2$. Therefore, estimates of the path coefficients agree for both methods. Proof of the general result can be found in Land (1973).
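The agreement between the two estimation methods can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic data set generated only for illustration (the variable names `age`, `edu`, `death` and the generating coefficients are hypothetical, not the Sri Lankan data). It computes the path coefficients of model (4.3) once from the correlation decomposition (4.7) and (4.8), and once by least squares on each submodel.

```python
import numpy as np

def standardize(a):
    """Center to mean 0 and scale to unit variance (divisor N, matching r_ij = (1/N) sum x'_i x'_j)."""
    return (a - a.mean()) / a.std()

def path_by_correlations(x1, x2, x0):
    """Path coefficients of model (4.3) from the correlation decomposition, (4.7) and (4.8)."""
    r21 = np.mean(x2 * x1)
    r01 = np.mean(x0 * x1)
    r02 = np.mean(x0 * x2)
    p21 = r21
    p01 = (r01 - r02 * r21) / (1 - r21 ** 2)
    p02 = (r02 - r01 * r21) / (1 - r21 ** 2)
    p22 = np.sqrt(1 - p21 ** 2)
    p00 = np.sqrt(1 - p01 ** 2 - p02 ** 2 - 2 * p01 * p02 * p21)
    return {"p21": p21, "p01": p01, "p02": p02, "p22": p22, "p00": p00}

def path_by_ols(x1, x2, x0):
    """Path coefficients from least squares fits of each submodel.
    No intercept is needed because the variables are standardized."""
    p21 = np.linalg.lstsq(x1[:, None], x2, rcond=None)[0][0]
    X = np.column_stack([x1, x2])
    p01, p02 = np.linalg.lstsq(X, x0, rcond=None)[0]
    # For standardized data, R^2 is the mean square of the fitted values,
    # and the residual path coefficient is sqrt(1 - R^2).
    r2_edu = np.mean((p21 * x1) ** 2)
    r2_death = np.mean((X @ np.array([p01, p02])) ** 2)
    return {"p21": p21, "p01": p01, "p02": p02,
            "p22": np.sqrt(1 - r2_edu), "p00": np.sqrt(1 - r2_death)}

# Synthetic illustration only: these are NOT the Sri Lankan household data.
rng = np.random.default_rng(0)
n = 500
age = rng.normal(size=n)
edu = -0.2 * age + rng.normal(size=n)
death = 0.1 * age - 0.2 * edu + rng.normal(size=n)
x1, x2, x0 = standardize(age), standardize(edu), standardize(death)

by_corr = path_by_correlations(x1, x2, x0)
by_ols = path_by_ols(x1, x2, x0)
for name in by_corr:
    print(f"{name}: decomposition={by_corr[name]:.4f}  OLS={by_ols[name]:.4f}")
```

The two columns printed should coincide up to rounding error, since the normal equations of each submodel are exactly the decomposition equations (4.4) and (4.5). The sketch does not produce standard errors; those come from the regression framework.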
By treating the data from the Sri Lankan household study as a simple random sample, the path coefficients for our example path model (4.3) are estimated (see Figure 7).

[Path diagram of Figure 5 with estimated coefficients: Age to Level of education, -0.16; Age to Infant death, 0.13; Level of education to Infant death, -0.15; residual path coefficients 0.99 (Level of education) and 0.98 (Infant death)]

Figure 7 A path model with estimated path coefficients

All path coefficients are significantly nonzero at the 5% level. But, as shown by the residual path coefficients, or equivalently the coefficients of multiple determination, linear models do not fit the data well. For further analysis, one may try transforming the variables.

Wright developed the method of path analysis as a means of studying the direct and indirect effects of variables. Direct effect refers to the effect of an independent variable on a dependent variable directly without any mediating variables. Indirect effect pertains to the effect of an independent variable on a dependent variable through a third variable which affects the dependent variable either directly or indirectly. In our example, $X'_1$ has an indirect effect on $X'_0$ through $X'_2$, which has a direct effect on $X'_0$. In another model, $X'_2$ may not have a direct effect on $X'_0$, but may have an indirect effect through another variable, say $X'_3$, that has a direct effect on $X'_0$.

The observed correlation between two variables can be expressed as a sum of three components. The direct and indirect effects of one variable on the other account for two of the components. The third component of the correlation coefficient is attributable to the antecedent variables common to the two variables under consideration. This component is referred to as the spurious component. The decomposition of the correlation coefficient shown in (4.6) may be re-expressed as follows:

$$r_{ij} = \underbrace{\hat{p}_{ij}}_{\text{direct effect}} + \underbrace{\sum_{s} \hat{p}_{is}\, r_{sj}}_{\text{indirect effects}} + \underbrace{\sum_{o} \hat{p}_{io}\, r_{oj}}_{\text{spurious component}},$$

where both $X'_s$ and $X'_o$ have direct influence on $X'_i$, with $s$ running over all variables $X'_s$ which are influenced by $X'_j$, and $o$ running over all variables $X'_o$ which influence $X'_j$. That is, $s$ runs over all variables that have a direct path to $X'_i$ and can be reached by following the arrows from $X'_j$, and $o$ runs over all variables that have a direct path to $X'_i$ and can reach $X'_j$ by following the arrows. The sum of direct and indirect effects is called the total effect. For our path model (4.3),

$$r_{21} = \hat{p}_{21} \ \text{(direct effect)},$$
$$r_{01} = \hat{p}_{01} \ \text{(direct effect)} + \hat{p}_{02}\, r_{21} \ \text{(indirect effect)},$$
$$r_{02} = \hat{p}_{02} \ \text{(direct effect)} + \hat{p}_{01}\, r_{12} \ \text{(spurious component)}.$$

Using data from the Sri Lankan household study, the estimated direct and indirect effects are shown in the following table.

Effect                        Direct    Indirect
Age on education              -0.16     -
Age on infant death            0.13     0.02
Education on infant death     -0.15     -

Table III Estimated direct and indirect effects for path model (4.3)

Thus, the effect of age on infant death is mainly direct. Therefore, decomposition of a correlation coefficient provides a way of separating the direct effect on the dependent variable from the indirect effect which manifests itself through the correlations with other explanatory variables.

4.2 Structural Modelling with Qualitative Data

4.2.1 Loglinear and Logit Models

Goodman (1972, 1973a,b) proposed using loglinear and logit models to study the causal patterns in a set of discrete variables. Commonly used terminologies and notations for the analysis of categorical variables are reviewed in the context of three-dimensional contingency tables. A more complete presentation of this methodology can be found in Fienberg (1980), Haberman (1978), Bishop, Fienberg and Holland (1975), and others.

Consider three variables, $A$, $B$ and $C$, with $I$, $J$ and $K$ categories respectively. Suppose a random sample of size $N$ has been collected. Let $m_{ijk}$ denote the expected number of observations with $(A,B,C) = (i,j,k)$ for $i = 1,\ldots,I$, $j = 1,\ldots,J$ and $k = 1,\ldots,K$.
Then the general loglinear model is given by

$$\log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)} + u_{123(ijk)}, \tag{4.9}$$

where

$$\sum_{i} u_{1(i)} = \sum_{j} u_{2(j)} = \sum_{k} u_{3(k)} = 0,$$
$$\sum_{i} u_{12(ij)} = \sum_{j} u_{12(ij)} = \sum_{i} u_{13(ik)} = \sum_{k} u_{13(ik)} = \sum_{j} u_{23(jk)} = \sum_{k} u_{23(jk)} = 0,$$
$$\sum_{i} u_{123(ijk)} = \sum_{j} u_{123(ijk)} = \sum_{k} u_{123(ijk)} = 0.$$

This general loglinear model does not impose any restriction on the expected cell counts $\{m_{ijk}\}$, and is denoted by [ABC]. By setting some of the u-terms to zero, special cases of the model can be obtained:

Model            u-terms set to zero
[AB][AC][BC]     u_123(ijk)
[AB][AC]         u_123(ijk), u_23(jk)
[AB][BC]         u_123(ijk), u_13(ik)
[AC][BC]         u_123(ijk), u_12(ij)
[AB][C]          u_123(ijk), u_13(ik), u_23(jk)
[AC][B]          u_123(ijk), u_12(ij), u_23(jk)
[BC][A]          u_123(ijk), u_12(ij), u_13(ik)
[A][B][C]        u_123(ijk), u_12(ij), u_13(ik), u_23(jk)

Table IV Various loglinear models for three-dimensional tables

Model [AB][AC][BC] assumes that each two-variable interaction is unaffected by the value of the third variable. Models [AB][AC], [AB][BC], and [AC][BC] are obtained by assuming conditional independence of two variables given the third. For example, model [AB][AC] assumes that variables B and C are independent given variable A. Models [AB][C], [AC][B], and [BC][A] are obtained by assuming one variable is jointly independent of the other two. For example, model [AB][C] assumes that variable C is jointly independent of variables A and B. Lastly, model [A][B][C] assumes that the three variables are mutually independent.

The method proposed by Goodman is restricted to a hierarchical set of models in which higher-ordered terms may appear only if the related lower-ordered terms are present. An example of a nested hierarchy of models is given below:

$$[A][B][C] \subset [AB][C] \subset [AB][AC] \subset [AB][AC][BC] \subset [ABC],$$

where $\subset$ means "is a special case of".

Effects of categorical predictors, say A and B, on a dichotomous response, say C, can also be assessed by a logit model:

$$\text{logit}^{C|AB}_{ij} = \log \frac{m_{ij1}}{m_{ij2}} = w + w_{1(i)} + w_{2(j)} + w_{12(ij)}, \tag{4.10}$$

for $i = 1,\ldots,I$ and $j = 1,\ldots,J$, where

$$\sum_{i} w_{1(i)} = \sum_{j} w_{2(j)} = \sum_{i} w_{12(ij)} = \sum_{j} w_{12(ij)} = 0.$$

Note that this logit model can be obtained from the general loglinear model by making the following identifications:

$$w = 2u_{3(1)}, \quad w_{1(i)} = 2u_{13(i1)}, \quad w_{2(j)} = 2u_{23(j1)}, \quad w_{12(ij)} = 2u_{123(ij1)}.$$

Special cases of this logit model can again be obtained by setting some of the w-terms to zero.

Logit models for categorical predictors are special cases of logistic response models introduced in Section 3.3.2. Let $p_{ijk}$ denote the probability that $(A,B,C) = (i,j,k)$, for $i = 1,\ldots,I$, $j = 1,\ldots,J$, and $k = 1,2$. Then, (4.10) can be rewritten as

$$\log \frac{p_{ij1}}{p_{ij2}} = w + w_{1(i)} + w_{2(j)} + w_{12(ij)}, \tag{4.11}$$

with the same restrictions on the w-terms. Suppose $I = J = 2$. Let $X_A$ and $X_B$ be dummy variables defined as

$$X_A = \begin{cases} 1 & \text{if } A = 1, \\ -1 & \text{if } A = 2, \end{cases} \qquad X_B = \begin{cases} 1 & \text{if } B = 1, \\ -1 & \text{if } B = 2, \end{cases}$$

and let $X_{AB} = X_A X_B$. Further, let $p(k \mid X)$ denote the probability of $C = k$ given $X_A$ and $X_B$, i.e. let $p(k \mid X) = p_{ijk} / (p_{ij1} + p_{ij2})$. Then (4.11) can be rewritten as

$$\log \left[ \frac{p(1 \mid X)}{p(2 \mid X)} \right] = w + w_{1(1)} X_A + w_{2(1)} X_B + w_{12(11)} X_A X_B. \tag{4.12}$$

Thus, logit models are special cases of logistic response models where the predictors need not necessarily be categorical. Extension to predictors with more than two categories can be made similarly by defining the appropriate dummy variables.

4.2.2 Path Models

As in Section 4.1, suppose we are interested in the relationship between infant death (C) and two explanatory variables, say age (A) and education (B) of the mother. But now assume that each variable has only two levels. The relationship between variables A and B can then be expressed by the logit model

$$\text{logit}^{B|A}_{i} = w^{B|A} + w^{B|A}_{1(i)}, \tag{4.13}$$

where $\sum_{i} w^{B|A}_{1(i)} = 0$. Now build a logit model with C (infant death) as the response variable, and A and B as the explanatory variables. The three unsaturated loglinear models corresponding to such a logit model are

1. [AB][AC][BC]
2. [AB][AC]
3. [AB][BC].

The best model among those providing acceptable fit is chosen using external information, or substantive theory. The fit of a recursive system of logit models can be assessed by two approaches, which are presented in a later section.

Suppose model 1 is the best model. Then the path model can be represented by the following diagram with path coefficients given by the w-terms.

[Path diagram: Age (A) to Level of education (B), labelled $w^{B|A}_{1(i)}$; Age (A) to Infant death (C), labelled $w^{C|AB}_{1(i)}$; Level of education (B) to Infant death (C), labelled $w^{C|AB}_{2(j)}$]

Figure 8 A path model with dichotomous variables

Several drawbacks of this method proposed by Goodman (1972, 1973a,b) are illuminated by the above example. Although Goodman does assign numerical values to arrows in the diagram, these values do not have the same interpretation as in path analysis for continuous variables. There is no calculus of path coefficients; so there is no way of evaluating the indirect effect of a variable. Further, variables with multiple categories have multiple coefficients associated with a given arrow in the path diagram. Thus, interpretation of the model may be complicated. Since a sparse contingency table will pose problems in estimation of the u-terms, and thus the w-terms, the number of categories for each variable, and the number of variables considered must be restricted. In view of these obstacles, we will limit ourselves to variables with two categories, and consider only a small number of variables.
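The identification between logit terms and loglinear terms can be verified on a small table. The following is a minimal sketch, assuming NumPy and a hypothetical 2x2 table of counts for A (age) by B (education); the function name `u_terms_2x2` and the counts are illustrative only. It computes the u-terms of the saturated loglinear model directly from the log counts, and compares the logit of B given A = i in (4.13) with twice the corresponding u-terms.

```python
import numpy as np

def u_terms_2x2(m):
    """u-terms of the saturated loglinear model for a 2x2 table of positive counts m[i, j]."""
    logm = np.log(np.asarray(m, dtype=float))
    u = logm.mean()                                  # overall mean of the log counts
    u1 = logm.mean(axis=1) - u                       # u_1(i), sums to zero over i
    u2 = logm.mean(axis=0) - u                       # u_2(j), sums to zero over j
    u12 = logm - u - u1[:, None] - u2[None, :]       # u_12(ij), sums to zero over i and over j
    return u, u1, u2, u12

# Hypothetical counts: rows are the two levels of A, columns the two levels of B.
m = np.array([[40.0, 60.0],
              [25.0, 75.0]])
u, u1, u2, u12 = u_terms_2x2(m)

# Logit of B given A = i, computed directly from the table.
logit_direct = np.log(m[:, 0] / m[:, 1])

# The same logit from the identification w = 2 u_2(1) and w_1(i) = 2 u_12(i1).
w = 2 * u2[0]
w1 = 2 * u12[:, 0]
logit_from_u = w + w1

print("direct:      ", logit_direct)
print("from u-terms:", logit_from_u)
```

Because the table here is saturated, the u-terms are exact functions of the observed counts. For an unsaturated model the u-terms would instead be computed from fitted counts, as described in Section 4.2.3.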
4.2.3 Estimation of Path Coefficients

The path coefficients are estimated by the maximum likelihood method, which will be illustrated using a two-dimensional table. The method can easily be extended to higher dimensional tables. Our Sri Lankan household data set is assumed to be a fixed sample, in which each member is cross-classified according to its values for the variables under consideration. Since a multinomial sampling model is assumed for the Sri Lankan household study, the estimation procedure will be developed based on such models. Estimation procedures are similar for other commonly encountered sampling models, such as product-multinomial and Poisson (see Bishop, Fienberg and Holland 1975, and Fienberg 1980).

Consider a random sample of $N$ subjects, where $(A_h, B_h)$ for subject $h$ is observed, $h = 1,\ldots,N$. Let $p_{ij}$ denote the probability that $(A,B) = (i,j)$, and let $Z_{ij}$ be the number of subjects with $A = i$ and $B = j$, for $i,j = 1,2$. Then, under the multinomial sampling model, the expected number of subjects with $A = i$ and $B = j$ is given by

$$m_{ij} = E(Z_{ij}) = N p_{ij}. \tag{4.14}$$

The general loglinear model for a two-dimensional table is

$$\log m_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(ij)} \tag{4.15}$$

for $i,j = 1,2$, where

$$\sum_{i=1}^{2} u_{1(i)} = \sum_{j=1}^{2} u_{2(j)} = \sum_{i=1}^{2} u_{12(ij)} = \sum_{j=1}^{2} u_{12(ij)} = 0.$$

Alternatively, the matrix representation of this model is

$$\log \begin{pmatrix} m_{11} \\ m_{12} \\ m_{21} \\ m_{22} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix} \begin{pmatrix} u \\ u_{1(1)} \\ u_{2(1)} \\ u_{12(11)} \end{pmatrix},$$

or $\log m = W\beta$. The likelihood function is given by

$$L(\beta) \propto \prod_{i,j} p_{ij}^{z_{ij}} \propto \prod_{i,j} m_{ij}^{z_{ij}},$$

where $z_{ij}$ are the observed cell counts. Thus the maximum likelihood equations are given by

$$\frac{\partial}{\partial \beta} \log L(\beta) = W^{T}(z - \hat{m}) = 0, \tag{4.16}$$

where $z = (z_{11}, z_{12}, z_{21}, z_{22})^{T}$ and $\hat{m}$ is the maximum likelihood estimate of $m$. Further, the observed Fisher information matrix is given by

$$\mathcal{I}_{\beta} = -\frac{\partial^2}{\partial \beta\, \partial \beta^{T}} \log L(\beta) = W^{T} M W, \tag{4.17}$$

where

$$M = \begin{pmatrix} m_{11} & 0 & 0 & 0 \\ 0 & m_{12} & 0 & 0 \\ 0 & 0 & m_{21} & 0 \\ 0 & 0 & 0 & m_{22} \end{pmatrix}.$$

Hence, the maximum likelihood estimates of $\beta$ can be obtained by the Newton-Raphson iterative procedure:

$$\beta^{(l+1)} = \beta^{(l)} + \left[ W^{T} M^{(l)} W \right]^{-1} W^{T} \left( z - m^{(l)} \right), \qquad l = 0,1,\ldots, \tag{4.18}$$

where $\beta^{(l)}$ is the estimate of $\beta$ at the $l$-th stage, $m^{(l)} = \exp(W\beta^{(l)})$, and $M^{(l)}$ is the diagonal matrix corresponding to $m^{(l)}$. Since the choice of the initial estimate $\beta^{(0)}$ will affect the rate of convergence, the initial estimate should be chosen carefully. In general, the weighted least squares estimate of $\beta$, with weights based on the observed cell counts $z_{ij}$, will provide a satisfactory initial estimate.

The u-terms can also be estimated by using various other methods (see Bishop, Fienberg and Holland 1975). However, only the Newton-Raphson iterative procedure provides a readily available estimate of the precision of $\hat{\beta}$. The maximum likelihood estimator $\hat{\beta}$ is asymptotically normally distributed with mean $\beta$ and variance matrix $\mathcal{I}_{\beta}^{-1}$, where $\mathcal{I}_{\beta}$ is the Fisher information matrix. In practical applications, the observed information matrix, which is available upon convergence in the Newton-Raphson procedure, is often used in place of $\mathcal{I}_{\beta}$. Therefore, statistical inference for the u-terms (in vector $\beta$) is possible.

Although the above iterative procedure is described for the saturated loglinear model in the case of two-dimensional tables, extension to other loglinear models simply involves modifying the $m$-vector, the $W$-matrix, and others accordingly. Thus, estimates of the u-terms can be obtained similarly. Since path coefficients (w-terms) are twice the appropriate u-terms, they can be estimated from the estimates of the u-terms.

4.2.4 Goodness-of-Fit for Path Models

A path model is specified by a recursive system of models. The fit of a system of logit models can be assessed by directly checking the fit of each component model, or by computing a set of estimated expected cell counts for the combined system.
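The Newton-Raphson procedure of Section 4.2.3 can be sketched in a few lines. The following is a minimal illustration, assuming NumPy, a hypothetical 2x2 table of counts, and a starting value taken as the weighted least squares fit of log z on W with weights z (that weighting is an assumption of this sketch). The function name `fit_loglinear` is illustrative and is not taken from any package. The same routine fits the saturated model and, by dropping the last column of W, the independence model [A][B].

```python
import numpy as np

def fit_loglinear(W, z, tol=1e-10, max_iter=50):
    """Newton-Raphson fit of log m = W beta to observed cell counts z.

    Returns (beta, fitted counts m, information matrix W^T M W).
    """
    W = np.asarray(W, dtype=float)
    z = np.asarray(z, dtype=float)
    # Starting value: weighted least squares of log z on W, weights z (assumption of this sketch).
    Z = np.diag(z)
    beta = np.linalg.solve(W.T @ Z @ W, W.T @ Z @ np.log(z))
    for _ in range(max_iter):
        m = np.exp(W @ beta)
        info = W.T @ (m[:, None] * W)                # W^T M W with M = diag(m)
        step = np.linalg.solve(info, W.T @ (z - m))  # [W^T M W]^{-1} W^T (z - m)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    m = np.exp(W @ beta)
    info = W.T @ (m[:, None] * W)
    return beta, m, info

# Design matrix for cells ordered (11, 12, 21, 22); columns: u, u_1(1), u_2(1), u_12(11).
W_sat = np.array([[1, 1, 1, 1],
                  [1, 1, -1, -1],
                  [1, -1, 1, -1],
                  [1, -1, -1, 1]])
z = np.array([40.0, 60.0, 25.0, 75.0])               # hypothetical counts

beta_sat, m_sat, info_sat = fit_loglinear(W_sat, z)          # saturated model
beta_ind, m_ind, info_ind = fit_loglinear(W_sat[:, :3], z)   # independence model [A][B]

# Approximate standard errors from the inverse of the information matrix at convergence.
se_sat = np.sqrt(np.diag(np.linalg.inv(info_sat)))

print("saturated fit:   ", m_sat)
print("independence fit:", m_ind)
print("w-term (twice u_12(11)):", 2 * beta_sat[3], "+/-", 2 * se_sat[3])
```

For the saturated model the fitted counts reproduce the observed counts, and for the independence model they equal the products of the row and column totals divided by N, which gives a simple check on the iteration.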
Once the expected cell counts are estimated, the fit of the model can be assessed by either the Pearson chi-square statistic $X^2$ or the likelihood-ratio statistic $G^2$:

$$X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}}, \qquad G^2 = 2 \sum \text{observed} \cdot \log \frac{\text{observed}}{\text{expected}}, \tag{4.19}$$

where the summation in both cases is over all cells in the table. If the fitted model is correct and the total sample size is large enough, both $X^2$ and $G^2$ are approximately $\chi^2$ distributed with degrees of freedom given by

$$\text{d.f.} = \text{\# of cells} - \text{\# of parameters}. \tag{4.20}$$

In the context of causal modelling, Goodman uses the likelihood-ratio test statistic $G^2$ to evaluate the fit of a model.

Improvement in the fit of a model by adding or deleting some interaction terms can also be assessed by chi-square statistics. Consider two models, model I and II, where model II is a special case of model I. That is, model II is obtained from model I by setting some of the u-terms to zero. Then the likelihood-ratio test statistic,

$$\Delta G^2 = G^2(\text{II}) - G^2(\text{I}) = 2 \sum \text{observed} \cdot \log \frac{\text{expected}_{\text{I}}}{\text{expected}_{\text{II}}}, \tag{4.21}$$

with d.f. = d.f.(II) - d.f.(I), can be used to test whether the difference between the expected cell counts for the two models is simply due to random variation, given that the true expected cell counts satisfy model I. For instance, in our example, the effect of adding the relationship between A (age) and C (infant death) to the model [AB][BC] can be evaluated by using the test statistic

$$\Delta G^2 = G^2([AB][BC]) - G^2([AB][AC][BC])$$

with 1 degree of freedom.

Goodness-of-fit of a path model can also be assessed by using the expected cell counts of the combined system of logit or loglinear models. The computation of these combined estimates is best illustrated by an example. Suppose we have three variables with the following causal ordering:

$$A \text{ precedes } B \text{ precedes } C, \tag{4.22}$$

as shown in Figure 8. Then the estimated expected cell counts for a system consisting of the pair of unrestricted logit models implied by (4.22) are given by

$$\hat{m}_{ijk} = \frac{\hat{m}^{B|A}_{ij}\, \hat{m}^{C|AB}_{ijk}}{z_{ij+}} = \frac{\hat{m}^{B|A}_{ij}\, \hat{m}^{C|AB}_{ijk}}{\hat{m}^{C|AB}_{ij+}}, \tag{4.23}$$

where $z_{ij+}$ is the number of observations with $(A,B) = (i,j)$, and $\{\hat{m}^{B|A}_{ij}\}$ and $\{\hat{m}^{C|AB}_{ijk}\}$ are the estimated expected cell counts for the logit models with variables B and C as the response variables respectively. Since the latter model involves conditioning on the marginal totals $z_{ij+}$, which can be seen from the maximum likelihood equations, the second equality in (4.23) is obtained. Thus, the likelihood-ratio test statistic is given by

$$G^2 = 2 \sum_{i,j,k} z_{ijk} \log \frac{z_{ijk}}{\hat{m}_{ijk}} = 2 \sum_{i,j} z_{ij+} \log \frac{z_{ij+}}{\hat{m}^{B|A}_{ij}} + 2 \sum_{i,j,k} z_{ijk} \log \frac{z_{ijk}}{\hat{m}^{C|AB}_{ijk}} = G^2_{B|A} + G^2_{C|AB}, \tag{4.24}$$

where $G^2_{B|A}$ is the likelihood-ratio test statistic for the logit model specified on the 2x2 table obtained by collapsing over variable C, and $G^2_{C|AB}$ is the likelihood-ratio test statistic for the logit model specified on the complete 2x2x2 table. Thus, the overall likelihood-ratio test statistic has degrees of freedom given by the sum of the degrees of freedom corresponding to the two component $G^2$'s. A more detailed discussion of this approach can be found in Goodman (1973b) and Fienberg (1980).

5. Results of Statistical Analyses on the Sri Lankan Household Data

The Sri Lankan infant mortality data set was first analyzed by discriminant methods to identify risk factors and to characterize households with high risk of infant mortality. Methods for path analysis were then applied to the identified risk factors, in order to assess the relationships among them, and their relationship to infant death.

5.1 Identification of Infant Mortality Risk Groups

The main objective of this analysis is to identify risk factors that discriminate between households with relatively high and low infant mortality. By using the terminologies and notations introduced in Section 3, the problem can be formalized as follows. For each household sampled in the Sri Lankan household study, let $Y$ be a dichotomous variable indicating whether or not an infant death has occurred, and let $X$ be a vector of explanatory variables. Then, $Y$ specifies the class to which the household belongs. The explanatory variables are listed as X-variables in Table I, which includes information on nutrition, sanitation, childbirth environment, education of the mother, economic status, ethnicity of family, etc. Then, the sample space $\mathcal{X}$ consists of all the possible combinations of the x-values. Using decision theoretic criteria, estimates of infant death probability at each x-value partition the sample space $\mathcal{X}$ into two regions corresponding to relatively high and low risk groups. Two discriminant methods are advocated in Section 3: logistic discrimination and class probability estimation by CART. For each of these methods, the analysis was performed separately for those women of age less than 44 (N = 250) and those of age greater than or equal to 44 (N = 141).

5.1.1 Logistic Discrimination

A forward stepwise procedure, implemented in the logistic regression program PLR of BMDP, was used to select explanatory or predictive variables that may adequately model the logit of infant death probability, as described in Section 3. The results of this analysis are shown in Table V.

Consider the results for younger women (Table Vb). About 25% of these women with age less than 44 have experienced at least one infant death. Maximum likelihood estimates of the regression coefficients in the most parsimonious model indicate that probability of infant death seems to be greater for those who gave birth at home, and for those whose families have lower economic status.
By setting some threshold value $p_0$, the Sri Lankan village households can be partitioned into two risk groups, with the higher risk group composed of households with estimated infant death probability greater than the threshold value. Using the maximum likelihood estimation results, the sample space can be partitioned as follows: the region of high risk corresponds to families with

1. last child born in hospital, and economic status < -4.762 ( logit $p_0$ + 1.134 ), or

2. last child born at home with a midwife, and economic status < -4.762 ( logit $p_0$ + 0.305 ), or

3. last child born at home without a midwife, and economic status < -4.762 ( logit $p_0$ + 0.352 ).

Details on formulation of the above partition are shown in Appendix I. Although this partition of the sample space can be interpreted easily, this may not always be the case when more variables are in the final model.

Next, consider the results for older women (Table Vb). Maximum likelihood estimates of the regression coefficients in the most parsimonious model indicate that probability of infant death for the non-Sinhalese families may be twice as high as that for the Sinhalese families. Thus, for the older women, the relatively high and low risk groups may be defined by ethnic group membership.

Table V Results of forward stepwise logistic regression

a. Model selection

Study group         Model                     -2 log λ    d.f.    p-value
Women of age<44     constant
                    constant, X_5              8.509      1       0.004
                    constant, X_5, X_2         7.003      2       0.030
Women of age 44+    constant
                    constant, X_10            11.665      1       0.001

where λ = (maximum likelihood under previous model) / (maximum likelihood under current model), $X_2$ is the environment of childbirth, $X_5$ is the economic status, and $X_{10}$ is the ethnicity. Note that $X_5$ is treated as a continuous variable, while $X_2$ and $X_{10}$ are treated as categorical variables represented by dummy variables as defined in Table Vb.
74 Maximum likelihood estimates of the coefficients in the final model Maximum likelihood estimate Study group Variable coefficient s.e. Women of age<44 constant -0.597 0.209 X -0.210 0.098 X 0.292 0.224 X 0.245 0.238 constant -0.683 0.187 0.622 0.187 5 2<2> 2<9> Women of age44 + X 10<2> where X_ i s the economic status, 5 i f the last child was born at home with a midwife, 2<2> -1 O i f the last child was born in hospital, otherwise, i f the last child was born at home without a midwife, X 2(3) X 10<2> -1 o i f the last child was born in hospital, otherwise, and i f the household i s non-Sinhalese, i f the household i s Sinhalese. 75 5.1.2 D i s c r i m i n a t i o n u s i n g CART The probability of infant death at each point i n the sample space was estimated using the CART software described in Section 3, using the 10-fold cross-validation procedure. As i n the previous section, younger and older women were analyzed separately. For the younger women, the pruned subtree with the minimum cross-validated estimate of risk i s shown i n Figure 9. If the same criterion is used for the older women, then a t r i v i a l tree with one terminal node would be selected. Thus, the next largerbe obtained tree which can by growing a tree with an appropriate complexity parameter using the entire sample, is considered (see Figure 10). For younger women, the binary tree (Figure 9) has three terminal groups corresponding to low risk, and one terminal group corresponding to high. risk. Women who gave birth in the hospitals, or whose families have high economic status appear to have a relatively low risk of experiencing at least one infant death. For those women who gave birth at home, and whose families have low economic status, families whose major source of income i s from piece-rate work or hourly labor seem to be at a much lower risk than those families whose income i s from other sources. 
For those households in poverty, piece-rate work or hourly labor may provide a steadier source of income. Thus, women who give birth at home, live in poverty, and whose families have no steady income, are at the highest risk of experiencing at least one infant death.

For older women, the binary tree (Figure 10) suggests that Sinhalese families may have been at a lower risk than the non-Sinhalese families. The estimated probability of infant death indicates that the risk of infant death may be twice as high in non-Sinhalese families as in Sinhalese families.

Figure 9. CART results for the younger women

All households: 63 class 1, 187 class 2 (25%)
  Where was the last child born?
    in hospital: 24 class 1, 115 class 2 (17%)
    at home: 39 class 1, 72 class 2 (35%)
      Economic status
        3-5: 5 class 1, 24 class 2 (17%)
        0-2: 34 class 1, 48 class 2 (41%)
          Primary source of income
            piece rate: 10 class 1, 31 class 2 (24%)
            others: 24 class 1, 17 class 2 (59%)

Class 1: households with infant death experiences; class 2: households with no infant death experience. The proportion of class 1 households is reported in brackets.

Figure 10. CART results for the older women

All households: 48 class 1, 93 class 2 (34%)
  Ethnicity
    Sinhalese: 16 class 1, 59 class 2 (21%)
    others: 32 class 1, 34 class 2 (48%)

Class 1: households with infant death experiences; class 2: households with no infant death experience. The proportion of class 1 households is reported in brackets.

5.1.3 Discussion

Explanatory variables considered important by the logistic discrimination method were also considered important by the CART method. However, the partition of the sample space into regions of relatively high and low risk may be different for the two methods. Logistic discrimination forces a linear partition, whereas the CART partition is piecewise linear. For younger women, economic status of the family is considered an important risk factor by both methods.
But in the CART result, the partition uses this variable only for those women giving birth at home. Suppose the threshold value, $p_0$, in Section 5.1.1 equals 0.17, as in the CART result. Then the logistic discrimination method partitions the sample space into high and low risk regions as follows: the region of high risk corresponds to families with

1. last child born in hospital, and economic status < 3, or
2. last child born at home with a midwife, or
3. last child born at home without a midwife.

Thus, women who gave birth at home are in the high risk group, and so are women who gave birth in the hospital but whose family is poor. But this contradicts the CART result (Figure 9), where all women giving birth in hospital are in the low risk group. Consider the 3x2 contingency table formed by cross-tabulating the environment of childbirth and the economic status dichotomy created by grouping the categories 0-2 and 3-5, as shown in Table VI. The table shows that the partition provided by the CART method seems more coherent than the partition provided by logistic discrimination.

The logistic discrimination method assumes that the relationship between the logit of infant death probability (logit $p$) and economic status ($X_5$), for each environment of childbirth ($X_2$), can be modelled by parallel straight lines (Table VII). This assumption seems reasonable for the latter two childbirth conditions, but not for all three conditions. By imposing this parallelism on the results, the more appropriate partitioning of the sample space is overlooked. However, if interactions between the two explanatory variables were allowed, logistic discrimination might have obtained the appropriate partitioning. In general, logistic discrimination may require fitting many different models with various interaction terms before a partitioning comparable to that found by the CART method is discovered.

Discrepancies between results for the two age groups may be explained by several factors.
Health services may have been more readily available at the time of child bearing for the younger women. The younger generation may also be less inhibited by health technologies, and thus use the services more frequently. Ethnicity may have been more relevant to everything (including infant mortality) when the older women were child bearing. Ethnicity may still be pertinent to economic status and usage of health services in the younger generation, but the effect of ethnicity on infant mortality may have lessened. Lastly, economic status at time of study may be strongly related to economic status at time of child bearing for the younger women, but perhaps not for the older women.

Table VI. Comparison of sample space partitioning by logistic discrimination and by CART

The following table is constructed based on women of age less than 44.

Rows: Where was the last child born? ($X_2$): in hospital; at home with midwife; at home without midwife.
Columns: Economic status, ownership of household items ($X_5$): 0-2; 3-5.
[Cell entries are not recoverable from the source.]

The high risk group identified by logistic discrimination is the group of households in the highlighted region given by the union of the first column and the last two rows. The high risk group identified by CART is the group of households in the highlighted region given by the intersection of the first column and the last two rows.

Table VII. Estimated logistic regression equations for younger women

Where was the last child born? ($X_2$)    Estimated logistic regression equation
In hospital                               logit $p$ = -1.134 - 0.210 $X_5$
At home with midwife                      logit $p$ = -0.305 - 0.210 $X_5$
At home without midwife                   logit $p$ = -0.352 - 0.210 $X_5$

5.2 Causal Modelling

Discriminant analysis performed earlier indicates that economic status, environment of childbirth, and ethnic group membership may be associated with infant mortality. To understand how these variables work together to affect infant mortality, a path model is constructed based on the natural temporal ordering of the variables (Figure 11).

Figure 11. A path model specifying temporal relationships among selected variables

5.2.1 Structural Modelling with Quantitative Data

The following analysis is performed using the REG procedure in the SAS statistical software, treating all four variables as continuous. Results of path analysis for the two age groups are shown in Figures 12 and 13 respectively. The estimated direct and indirect effects of explanatory variables on infant mortality are summarized in Table VIII for the two age groups.

Comparing the path models shown in Figures 12 and 13 suggests that the relationships among the variables may differ for the two age groups. The effect of ethnic group membership on childbirth environment seems stronger for the younger women. Economic status and childbirth environment appear to affect infant mortality for the younger women, whereas only ethnicity appears to have a substantial effect on infant mortality for the older women.

Consider the estimated direct and indirect effects of explanatory variables on infant mortality for the younger women (Table VIII). Although ethnicity has virtually no direct effect on infant mortality, it does seem to act through the other two variables, economic status and environment of childbirth, to affect infant mortality. Thus minority group status may adversely affect economic status, and may obstruct access to a better childbirth environment, which in turn increases the risk of infant death.

Estimated direct and indirect effects of explanatory variables on infant mortality for the older women (Table VIII) indicate that neither economic status nor childbirth environment has a strong direct or indirect effect on infant mortality.
Therefore, minority group status seems to be the only factor, among the three considered, that increases the risk of infant death.

For both path models (Figures 12 and 13), the path coefficients corresponding to the unobserved sources of variation are high. Thus, the linear models considered by path analysis do not seem to fit the data well. Since the occurrence of infant death is a relatively rare event, and the variables investigated are not immediate biological causes of infant death, a linear model is not likely to fit the data well. However, this type of model still provides some useful information on the relationships among the variables.

Figure 12. Path analysis results for the younger women
[Path diagram linking ethnicity, economic status, environment of childbirth ($X_2$) and infant death; a marker signifies a statistically nonzero path coefficient at the 10% level (excluding residual path coefficients).]

Figure 13. Path analysis results for the older women
[Path diagram linking ethnicity, economic status, environment of childbirth ($X_2$) and infant death ($y$); a marker signifies a statistically nonzero path coefficient at the 10% level (excluding residual path coefficients).]

Table VIII. Estimated direct and indirect effects on infant death

                                                            Effect on infant mortality
Study group   Variable (source)                             Direct    Indirect
Age <44       Ethnicity                                      0.00     -0.12
              Economic status                                0.13      0.05
              Use of health services for childbirth         -0.16
Age 44+       Ethnicity                                     -0.25     -0.04
              Economic status                                0.10     -0.01
              Use of health services for childbirth          0.02

5.2.2 Structural Modelling with Qualitative Data

The preceding section applied a statistical analysis that was originally derived for continuous variables, but most of the variables in this study are ordered categorical. In this section, the relationships between the variables are analyzed using the method for categorical variables proposed by Goodman, which was described in Section 4.2.
Due to limitations of the method as discussed in Section 4.2.2, the variables considered are recoded into two categories (Table IX). Let A to D be the recoded variables for ethnicity, economic status, environment of childbirth, and infant death respectively. Then the following causal ordering of the variables is assumed: A precedes B, which precedes C, which precedes D.

Programs written in the language implemented in the statistical software package called S were used for the causal analysis. Path diagrams depicting the connections implied by the best logit or loglinear models for women of the two age groups are shown in Figures 14 and 15. Details on the model selection are given in Appendices II and III respectively for the two groups of women.

The path diagram for the younger women (Figure 14) indicates that: (1) minority group status may adversely affect economic status, and may obstruct access to a better childbirth environment; (2) poverty may have blocked access to a better childbirth environment; and (3) poverty and childbirth environment may be linked to infant mortality. Although minority group status does not seem to have a direct effect on infant mortality, it does seem to have an indirect effect through economic status and environment of childbirth.

The path diagram for the older women (Figure 15) indicates that: (1) minority group status may have negative effects on both economic status and infant mortality; (2) poverty may have blocked access to a better childbirth environment; but (3) neither economic status nor childbirth environment has any significant effect on infant mortality. Therefore, for older women, no variables in addition to ethnicity (among those considered) can significantly improve discrimination between high and low risk groups.
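The effect parameters quoted for logit models of this kind are deviations from an overall mean logit. As a minimal sketch, assuming the usual sum-to-zero (effect coding) constraint and a single dichotomous explanatory variable, the constant is the average of the two group logits and the effect is half their difference. The counts below are those shown for the older women in Figure 10, and the function name is illustrative. The sketch only illustrates the parametrization; it does not refit the models selected in Appendix III.

```python
import math


def effect_coded_logit(events_1, nonevents_1, events_2, nonevents_2):
    """Return (constant, effect) such that logit_1 = constant + effect
    and logit_2 = constant - effect, from a 2x2 table of counts."""
    logit_1 = math.log(events_1 / nonevents_1)
    logit_2 = math.log(events_2 / nonevents_2)
    constant = (logit_1 + logit_2) / 2.0
    effect = (logit_1 - logit_2) / 2.0
    return constant, effect


# Older women (Figure 10): households with / without an infant death.
# Group 1: non-Sinhalese (32, 34); group 2: Sinhalese (16, 59).
constant, effect = effect_coded_logit(32, 34, 16, 59)
print(f"constant = {constant:.3f}, effect = {effect:.3f}")

# Back-transform to the estimated probability of infant death in each group.
for label, sign in (("non-Sinhalese", 1), ("Sinhalese", -1)):
    p = 1.0 / (1.0 + math.exp(-(constant + sign * effect)))
    print(f"{label}: p = {p:.2f}")
```

The resulting constant (about -0.683) and effect (about 0.622) agree with the Table Vb estimates for the older women, and the back-transformed probabilities match the proportions reported in Figure 10 (48% and 21%).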
Table IX. Variables used in modified path analysis

Variable   Variable in original data set                        Codes
A          $X_{10}$  Ethnicity                                  1 = Sinhalese, 2 = non-Sinhalese
B          $X_5$     Economic status                            1 = 0-1, 2 = 2+
C          $X_2$     Use of health services for childbirth      1 = in hospital, 2 = at home
D          $y$       Infant death                               1 = at least one, 2 = none

Figure 14. Path diagram showing causal links implied by selected logit models for the younger women

Figure 15. Path diagram showing causal links implied by selected logit models for the older women
[Path diagram. Estimated effects on the links shown: Ethnicity (A) to Economic status (B), -0.73; Economic status (B) to Environment of childbirth (C), -0.41; Ethnicity (A) to Infant death (D), -0.59. The remaining links are marked as non-significant relationships.]

5.2.3 Discussion

Causal interpretations of the path diagrams constructed by the quantitative and qualitative approaches are similar. For the younger women, both path diagrams (Figures 12 and 14) show that minority group status seems to result in poverty, and seems to obstruct access to a better childbirth environment, which in turn leads to infant deaths. For the older women, both path diagrams (Figures 13 and 15) indicate that minority group status per se appears to be the only factor that has any effect on infant mortality. Discrepancies between results for the two age groups may be explained as in Section 5.1.3.

None of the linear regression models in Figures 12 and 13 fits the data particularly well, as shown by the path coefficients corresponding to the unobserved sources of variation. On the other hand, the loglinear or logit models considered in Figures 14 and 15 provide a reasonable fit to the data sets. However, the method for qualitative data does not provide the quantitative assessments of indirect effects that the method for quantitative data provides.

6. Remarks and Recommendations on Statistical Methods Used to Identify Risk Groups

An objective of the Sri Lankan household survey was to identify a small number of risk factors that distinguish groups of women having relatively high or low probability of experiencing at least one infant death. This study examined socioeconomic factors (not medical causes) that are relevant to resource allocation priorities, and to cultural obstacles in the planning of health services and health promotion programs. Structural or temporal relationships among the risk factors are also of interest to the researchers.

Statistical discrimination methods were used to select significant risk factors, and to identify the high risk group (or groups) in the Sri Lankan households. Although both logistic discrimination and CART are computing-intensive, the logistic discrimination method requires fewer computing resources, and has more readily available software packages. Otherwise, the CART technique is preferable, since it provides more informative and more easily interpretable results. Furthermore, the CART technique does not require any distributional assumptions.

After a small set of risk factors had been identified by discriminant analysis, the structural or temporal relationships among selected risk factors and infant mortality were investigated using path analysis. The classical method of path analysis using linear regression models has often been applied to social science data that are ordinal or categorical in nature, where a modified method using logistic quantal response models would be more appropriate. When the classical method is applied inappropriately, the resulting path model usually does not fit the data well, as indicated by high residual path coefficients. Although the modified method does provide a better fit, it is highly computing-intensive, and is restrictive in the number of variables allowed in the proposed path model.
In practice, social scientists would use path models with more variables than the models considered here. Variables that were not selected by the discrimination methods might still be of interest to the researchers when considering infant mortality in a larger socioeconomic and political context.

The approach used in this thesis, and recommended for similar studies to identify risk groups, applies discriminant analysis (preferably CART) as an exploratory tool, and then uses path analysis (preferably logistic quantal response modelling) to confirm the significance of relationships among variables.

In our Sri Lankan household study, discriminant analysis identified economic status and environment of childbirth as significant risk factors for the younger women. In contrast, ethnic group membership is the only risk factor identified for the older women. Younger women who gave birth at home, and whose families have low economic status, appear to be at a high risk of experiencing at least one infant death, whereas younger women who gave birth in the hospital, or whose families have high economic status, seem to be at a substantially lower risk. For the older women, non-Sinhalese families appear to have a higher risk of experiencing at least one infant death than the Sinhalese families.

Results of path analysis on infant mortality using the three identified risk factors suggest that the changing role of ethnicity may partially explain the discrepancies between the results for the two age groups. While ethnic group membership may be relevant to many things, including infant mortality, for the older generation, its influence on infant mortality seems to have lessened for the younger generation. The discrepancies between results for the two age groups may also be explained by other factors. Health services may not have been as readily available at time of child bearing for the older women as for the younger women.
The use of better childbirth environment by the younger women may also be explained by the changing attitude of families toward the seriousness of childbirth. Finally, the economic status at the time of study may be strongly related to the economic status at time of child bearing for the younger women, but may not be so for the older women.

In order to plan an effective health program to promote infant survival, one must understand the socioeconomic conditions in which infant death is likely to occur, as well as the biomedical causes of infant death. Our analysis suggests that most of the high risk households will be too poor to take advantage of the government's subsidy program for the construction of sanitary latrines. Although Sri Lanka has a well-organized network of essentially free health services that extend into rural areas, access to and usage of better childbirth environment can still be improved. Health planning entails more than designing a program that treats or prevents a health disorder; it must also ensure health care delivery to those in need.

BIBLIOGRAPHY

Anderson, J.A. (1972). Separate sample logistic discrimination. Biometrika, 59:19-35.

Anderson, T.W. (1984). An Introduction to Multivariate Statistical Analysis, 2nd ed. New York: John Wiley & Sons.

Bishop, Y.M.M., Fienberg, S.E., and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge: The MIT Press.

Blalock, H.M., ed. (1970). Causal Models in the Social Sciences. Chicago: Aldine.

Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and Regression Trees. Belmont: Wadsworth & Brooks.

Breslow, N.E. and Day, N.E. (1980). Statistical Methods in Cancer Research, Vol. 1: The Analysis of Case-Control Studies. Lyon: International Agency for Research on Cancer.

Caldwell, J. and McDonald, P. (1982). Influence of maternal education on infant and child mortality: levels and causes. Health Policies and Education, 2:251-267.
Chowdhury, A. (1982). Education and infant survival in rural Bangladesh. Health Policies and Education, 2:369-374.

Cox, D.R. (1970). The Analysis of Binary Data. London: Methuen.

Dillon, W.R., and Goldstein, M. (1984). Multivariate Analysis. New York: John Wiley & Sons.

Duncan, O.D. (1966). Path analysis: sociological examples. The American Journal of Sociology, 72:1-16.

Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70:891-898.

Fienberg, S.E. (1980). The Analysis of Cross-Classified Categorical Data. Cambridge: The MIT Press.

Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188.

Goodman, L.A. (1972). A general model for the analysis of surveys. The American Journal of Sociology, 77:1035-1086.

Goodman, L.A. (1973a). Causal analysis of data from panel studies and other kinds of surveys. The American Journal of Sociology, 78:1135-1191.

Goodman, L.A. (1973b). The analysis of multidimensional contingency tables when some variables are posterior to others: a modified path analysis approach. Biometrika, 60:179-192.

Grosse, R. and Perry, B. (1982). Correlates of life expectancy in less developed countries. Health Policies and Education, 2:275-304.

Haberman, S.J. (1978). Analysis of Qualitative Data. Volume 1: Introductory Topics. New York: Academic Press.

Hand, D.J. (1981). Discrimination and Classification. New York: John Wiley & Sons.

Heise, D.R., ed. (1975). Sociological Methodology 1976. San Francisco: Jossey-Bass.

Kendall, M.G. and O'Muircheartaigh, C.A. (1977). Path analysis and model building. World Fertility Survey, Technical Bulletin No. 414.

Kiiveri, H., Speed, T.P. and Carlin, J.B. (1984). Recursive causal models. Journal of the Australian Mathematical Society A, 36:30-52.

Lachenbruch, P.A. (1975). Discriminant Analysis. New York: Hafner Press.

Land, K.C. (1969). Principles of path analysis. Sociological Methodology 1969, ed. E. Borgatta. San Francisco: Jossey-Bass.

Land, K.C. (1973). Identification, parameter estimation and hypothesis testing in recursive sociological models. Structural Equation Models in the Social Sciences, eds. A.S. Goldberger and O.D. Duncan. New York: Seminar Press.

Leik, R. (1975). Causal models with nominal and ordinal data: retrospective. Sociological Methodology 1976, ed. D.R. Heise. San Francisco: Jossey-Bass.

McKeown, T. (1976). The Modern Rise of Population. New York: Academic Press.

Mishler, E.G., Amarasingham, L.R., Hauser, S.T., Liem, R., Osherson, S.D., and Waxler, N.E. (1981). Social Contexts of Health, Illness, and Patient Care. Cambridge: Cambridge University Press.

Morgan, J.N. and Sonquist, J.A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58:415-435.

Morgan, J.N. and Messenger, R.C. (1973). THAID: a sequential search program for the analysis of nominal scale dependent variables. Ann Arbor: Institute for Social Research, University of Michigan.

Morris, M.D. (1979). Measuring the Condition of the World's Poor. New York: Pergamon Press.

Morrison, B. and Waxler, N.E. (1984). Three patterns of basic needs within Sri Lanka: 1971-1973. Unpublished paper.

Mosley, W.H. (1984). Child survival: research and policy. Child Survival. Population and Development Review, a supplement to volume 10. New York: The Population Council, Inc.

Mosley, W.H., and Chen, L. (1984). An analytical framework for the study of child survival in developing countries. Child Survival. Population and Development Review, a supplement to volume 10. New York: The Population Council, Inc.

Mueller, J.H., Schuessler, K.F., and Costner, H.L. (1977). Statistical Reasoning in Sociology. Boston: Houghton Mifflin.

Panel on Discriminant Analysis, Classification, and Clustering (1988). Discriminant Analysis and Clustering. Washington, D.C.: National Academy Press.

Patel, M. (1980). Effects of the health service and environmental factors on infant mortality: the case of Sri Lanka. Journal of Epidemiology and Community Health, 34:76-82.

Press, S.J. and Wilson, S. (1978). Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association, 73:699-705.

Puffer, R.R. and Serrano, C.V. (1973). Patterns of Mortality in Childhood. Scientific Publication No. 262. Washington, D.C.: Pan American Health Organization.

Rosenthal, H. (1980). The limitation of log-linear analysis. Contemporary Sociology, 9:207-212.

Sackett, D.L. and Holland, W.W. (1975). Controversy in the detection of disease. The Lancet, 2:357-359.

SAS Institute, Inc. (1985). SAS® User's Guide: Statistics, Version 5 Edition. Cary, NC: SAS Institute Inc.

Schlesselman, J.J. (1982). Case-Control Studies: Design, Conduct, Analysis. New York: Oxford University Press.

Simmons, G. and Bernstein, S. (1982). The educational status of parents, and infant and child mortality in rural North India. Health Policies and Education, 2:349-367.

Smucker, C., Simmons, G., Bernstein, S., and Misra, B. (1980). Neo-natal mortality in South Asia: the special role of tetanus. Population Studies, 34:321-335.

Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of two groups. Annals of Mathematical Statistics, 15:145-163.

Waxler, N.E., Morrison, B.M., Sirisena, W.M., and Pinnaduwage, S. (1985). Infant mortality in Sri Lankan households: a causal model. Social Science and Medicine, 20:381-392.

Wermuth, N. (1980). Linear recursive equations, covariance selection, and path analysis. Journal of the American Statistical Association, 75:963-997.

Wermuth, N. (1987). Parametric collapsibility and the lack of moderating effects in contingency tables with a dichotomous response variable. Journal of the Royal Statistical Society, 49:353-364.

Wermuth, N., and Lauritzen, S.L. (1983). Graphical and recursive models for contingency tables. Biometrika, 70:537-552.

Winship, C. and Mare, R.D. (1983). Structural equations and path analysis for discrete data. The American Journal of Sociology, 89:54-110.

World Bank (1975). Health Sector Policy Paper. Washington, D.C.: World Bank.

Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics, 5:161-215.

Wright, S. (1960). Path coefficients and path regression: alternative or complementary concepts? Biometrics, 14:189-202.

Van de Geer, J.P. (1971). Introduction to Multivariate Analysis for the Social Sciences. San Francisco: W.H. Freeman and Company.

Appendix I
Partitioning the Sample Space Using Logistic Discrimination (Younger Women)

Let $p_0$ be some threshold value chosen, so that the high risk group is composed of households with estimated probability of experiencing at least one infant death greater than $p_0$. Then, using maximum likelihood estimates of the regression coefficients (Table Vb), the high risk households have explanatory variables satisfying the following inequality:

$$-0.597 - 0.210\,X_5 + 0.292\,X_{2(2)} + 0.245\,X_{2(3)} > \operatorname{logit} p_0, \qquad \text{(A.1)}$$

where $X_5$ denotes the economic status, and $X_{2(2)}$ and $X_{2(3)}$ are dummy variables representing the categorical variable $X_2$ as defined below:

$X_{2(2)}$ = 1 if the last child was born at home with a midwife, -1 if the last child was born in hospital, 0 otherwise;

$X_{2(3)}$ = 1 if the last child was born at home without a midwife, -1 if the last child was born in hospital, 0 otherwise.

Alternatively, the partition region can be described by examining each childbirth environment in (A.1): the region of high risk corresponds to families with

1. last child born in hospital, and economic status < -4.762 (logit $p_0$ + 1.134), or
2. last child born at home with a midwife, and economic status < -4.762 (logit $p_0$ + 0.305), or
3. last child born at home without a midwife, and economic status < -4.762 (logit $p_0$ + 0.352).

Appendix II
Modified Path Analysis - Model Selection (Younger Women)

Using the method proposed by Goodman as described in Section 4.2, the relationship between variables A and B is investigated through the logit model

$$\operatorname{logit}^{B|A}_{i} = w^{B|A} + w^{B|A}_{1(i)}, \qquad \text{(A.2)}$$

with estimated effect parameter $w^{B|A}_{1(1)} = -0.63$.

By examining results of fitting the three unsaturated loglinear models corresponding to the logit model with C as the response variable, and A and B as the explanatory variables (models M1 to M3 in Table X), we see that models [AB][AC][BC] (M1) and [AB][AC] (M2) provide reasonable fits for the data. That is, their goodness-of-fit statistics (either $X^2$ or $G^2$) are not statistically significant. However, $G^2(\text{M2}) - G^2(\text{M1}) = 3.087$ with 1 degree of freedom is significant at the 10% level, suggesting the relation between variables B and C may be important. Thus, the larger model, M1, is preferred. The corresponding logit model is

$$\operatorname{logit}^{C|AB}_{ij} = w^{C|AB} + w^{C|AB}_{1(i)} + w^{C|AB}_{2(j)}, \qquad \text{(A.3)}$$

with estimated effect parameters $w^{C|AB}_{1(1)} = 0.82$ and $w^{C|AB}_{2(1)} = -0.24$.

Now examine the effects of A on D, B on D, and C on D as suggested by the assumed causal ordering. The results of fitting the seven unsaturated loglinear models corresponding to the logit model with D as the response variable, and A, B and C as the explanatory variables (M4 to M10 in Table X), show that all except model [ABC][AD] (M8) fit the data well. Since model M7 is a special case of model M4, and $G^2(\text{M7}) - G^2(\text{M4}) = 0.216$ with 1 degree of freedom is not statistically significant at the 5% level, the smaller model, M7, is preferred. For models M9 and M10, two special cases of model M7, $G^2(\text{M9}) - G^2(\text{M7}) = 6.729$ and $G^2(\text{M10}) - G^2(\text{M7}) = 4.886$, each with 1 degree of freedom; both are statistically significant at the 5% level.
Thus, further reduction from model M7 is not desirable. The logit model corresponding to M7 is

$$\operatorname{logit}^{D|ABC}_{ijk} = w^{D|ABC} + w^{D|ABC}_{2(j)} + w^{D|ABC}_{3(k)}, \qquad \text{(A.4)}$$

with estimated effect parameters $w^{D|ABC}_{2(1)} = 0.33$ and $w^{D|ABC}_{3(1)} = -0.38$.

The results are summarized by the path diagram in Figure 14.

Table X. Goodness-of-fit statistics for loglinear models (younger women)

Model                        d.f.   $X^2$     $G^2$
M1   [AB][AC][BC]            1       0.215     0.215
M2   [AB][AC]                2       3.325     3.302
M3   [AB][BC]                2      32.029    32.887
M4   [ABC][AD][BD][CD]       4       2.830     2.896
M5   [ABC][AD][BD]           5       8.766     8.063
M6   [ABC][AD][CD]           5       6.629     7.051
M7   [ABC][BD][CD]           5       3.012     3.112
M8   [ABC][AD]               6      13.740    13.252
M9   [ABC][BD]               6       9.781     9.841
M10  [ABC][CD]               6       7.716     7.998

where $X^2$ is the Pearson chi-square statistic, and $G^2$ is the likelihood ratio statistic.

Appendix III
Modified Path Analysis - Model Selection (Older Women)

Using the method proposed by Goodman as described in Section 4.2, the relationship between variables A and B is investigated through the logit model

$$\operatorname{logit}^{B|A}_{i} = w^{B|A} + w^{B|A}_{1(i)}, \qquad \text{(A.5)}$$

with estimated effect parameter $w^{B|A}_{1(1)} = -0.73$.

By examining results of fitting the three unsaturated loglinear models corresponding to the logit model with C as the response variable, and A and B as the explanatory variables (models M1 to M3 in Table XI), we see that models [AB][AC][BC] (M1) and [AB][BC] (M3) provide reasonable fits for the data. That is, their goodness-of-fit statistics (either $X^2$ or $G^2$) are not statistically significant. Since M3 is a special case of M1, and $G^2(\text{M3}) - G^2(\text{M1}) = 0.609$ with 1 degree of freedom is not statistically significant at the 5% level, the smaller model, M3, is preferred. The corresponding logit model is

$$\operatorname{logit}^{C|AB}_{ij} = w^{C|AB} + w^{C|AB}_{2(j)}, \qquad \text{(A.6)}$$

with estimated effect parameter $w^{C|AB}_{2(1)} = -0.41$.
By examining results of fitting the seven unsaturated loglinear models corresponding to the logit model with D as the response variable, and A, B and C as the explanatory variables (M4 to M10 in Table XI), we see that [ABC][AD] (M8) is the smallest model that fits the data well. Since adding more interaction terms to model M8 does not significantly improve the fit, the most parsimonious model is M8. Thus, the corresponding logit model is given by

$$\operatorname{logit}^{D|ABC}_{ijk} = w^{D|ABC} + w^{D|ABC}_{1(i)}, \qquad \text{(A.7)}$$

with estimated effect parameter $w^{D|ABC}_{1(1)} = -0.59$.

The results are summarized by the path diagram in Figure 15.

Table XI. Goodness-of-fit statistics for loglinear models (older women)

Model                        d.f.   $X^2$     $G^2$
M1   [AB][AC][BC]            1       0.106     0.105
M2   [AB][AC]                2       4.023     4.058
M3   [AB][BC]                2       0.724     0.715
M4   [ABC][AD][BD][CD]       4       1.776     1.682
M5   [ABC][AD][BD]           5       2.505     2.514
M6   [ABC][AD][CD]           5       1.771     1.714
M7   [ABC][BD][CD]           5      12.187    12.130
M8   [ABC][AD]               6       2.503     2.515
M9   [ABC][BD]               6      13.335    13.304
M10  [ABC][CD]               6      12.805    12.917

where $X^2$ is the Pearson chi-square statistic, and $G^2$ is the likelihood ratio statistic.
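The nested-model comparisons in Appendices II and III can be checked from the tabulated likelihood ratio statistics. The following is a minimal Python sketch, assuming that the difference in $G^2$ between two nested loglinear models is referred to a chi-square distribution with degrees of freedom equal to the difference in d.f. Every comparison used here has 1 degree of freedom, so the tail probability is computed with the standard library as erfc(sqrt(x/2)). The dictionary and function names are illustrative.

```python
import math

# Likelihood ratio statistics G^2 and degrees of freedom from Tables X and XI.
YOUNGER = {"M1": (0.215, 1), "M2": (3.302, 2), "M4": (2.896, 4),
           "M7": (3.112, 5), "M9": (9.841, 6), "M10": (7.998, 6)}
OLDER = {"M1": (0.105, 1), "M3": (0.715, 2)}


def chi2_sf_1df(x):
    """Upper tail probability of a chi-square variate with 1 d.f."""
    return math.erfc(math.sqrt(x / 2.0))


def compare(table, smaller, larger):
    """G^2 difference and p-value for dropping one term from `larger`."""
    g2_small, df_small = table[smaller]
    g2_large, df_large = table[larger]
    if df_small - df_large != 1:
        raise ValueError("this sketch only handles 1 d.f. comparisons")
    diff = g2_small - g2_large
    return diff, chi2_sf_1df(diff)


if __name__ == "__main__":
    younger_pairs = [("M2", "M1"), ("M7", "M4"), ("M9", "M7"), ("M10", "M7")]
    older_pairs = [("M3", "M1")]
    for name, table, pairs in (("younger", YOUNGER, younger_pairs),
                               ("older", OLDER, older_pairs)):
        for smaller, larger in pairs:
            diff, p = compare(table, smaller, larger)
            print(f"{name}: G2({smaller}) - G2({larger}) = {diff:.3f}, p = {p:.3f}")
```

For the younger women the differences come out as 3.087, 0.216, 6.729 and 4.886, as quoted in Appendix II, with the first significant at the 10% but not the 5% level. For the older women the tabulated values give 0.610, which agrees with the quoted 0.609 up to rounding of the table entries.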
Title | Identification of risk groups : study of infant mortality in Sri Lanka |
Creator | Kan, Lisa |
Publisher | University of British Columbia |
Date Issued | 1988 |
Description | Multivariate statistical methods, including recent computing-intensive techniques, are explained and applied in a medical sociology context to study infant death in relation to socioeconomic risk factors of households in Sri Lankan villages. The data analyzed were collected by a team of social scientists who interviewed households in Sri Lanka during 1980-81. Researchers would like to identify characteristics (risk factors) distinguishing those households at relatively high or low risk of experiencing an infant death. Furthermore, they would like to model temporal and structural relationships among important risk factors. Similar statistical issues and analyses are relevant to many sociological and epidemiological studies. Results from such studies may be useful to health promotion or preventive medicine program planning. With respect to an outcome such as infant death, risk groups and discriminating factors or variables can be identified using a variety of statistical discriminant methods, including Fisher's parametric (normal) linear discriminant, logistic linear discrimination, and recursive partitioning (CART). The usefulness of a particular discriminant methodology may depend on distributional properties of the data (whether the variables are dichotomous, ordinal, normal, etc.) and also on the context and objectives of the analysis. There are at least three conceptual approaches to statistical studies of risk factors. An epidemiological perspective uses the notion of relative risk. A second approach, generally referred to as classification or discriminant analysis, is to predict a dichotomous outcome, or class membership. A third approach is to estimate the probability of each outcome, or of belonging to each class. These three approaches are discussed and compared, and appropriate methods are applied to the Sri Lankan household data.
Path analysis is a standard method used to investigate causal relationships among variables in the social sciences. However, the normal multiple regression assumptions under which this method is developed are very restrictive. In this thesis, limitations of path analysis are explored, and alternative loglinear techniques are considered. |
Subject | Newborn infants -- Sri Lanka -- Mortality; Infants -- Mortality; Sri Lanka -- Statistics, Vital |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2010-08-30 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0097696 |
URI | http://hdl.handle.net/2429/27971 |
Degree | Master of Science - MSc |
Program | Statistics |
Affiliation | Science, Faculty of; Statistics, Department of |
Degree Grantor | University of British Columbia |
Campus | UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |