IDENTIFICATION OF RISK GROUPS: STUDY OF INFANT MORTALITY IN SRI LANKA

by

LISA KAN

B.Sc., Simon Fraser University, 1986

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE STUDIES

The Department of Statistics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

September 1988

© Lisa Kan, 1988

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

The University of British Columbia
Vancouver, Canada

ABSTRACT

Multivariate statistical methods, including recent computing-intensive techniques, are explained and applied in a medical sociology context to study infant death in relation to socioeconomic risk factors of households in Sri Lankan villages. The data analyzed were collected by a team of social scientists who interviewed households in Sri Lanka during 1980-81. Researchers would like to identify characteristics (risk factors) distinguishing those households at relatively high or low risk of experiencing an infant death. Furthermore, they would like to model temporal and structural relationships among important risk factors. Similar statistical issues and analyses are relevant to many sociological and epidemiological studies. Results from such studies may be useful in planning health promotion or preventive medicine programs.
With respect to an outcome such as infant death, risk groups and discriminating factors or variables can be identified using a variety of statistical discriminant methods, including Fisher's parametric (normal) linear discriminant, logistic linear discrimination, and recursive partitioning (CART). The usefulness of a particular discriminant methodology may depend on distributional properties of the data (whether the variables are dichotomous, ordinal, normal, etc.) and also on the context and objectives of the analysis.

There are at least three conceptual approaches to statistical studies of risk factors. An epidemiological perspective uses the notion of relative risk. A second approach, generally referred to as classification or discriminant analysis, is to predict a dichotomous outcome, or class membership. A third approach is to estimate the probability of each outcome, or of belonging to each class. These three approaches are discussed and compared, and appropriate methods are applied to the Sri Lankan household data.

Path analysis is a standard method used to investigate causal relationships among variables in the social sciences. However, the normal multiple regression assumptions under which this method is developed are very restrictive. In this thesis, limitations of path analysis are explored, and alternative loglinear techniques are considered.

TABLE OF CONTENTS

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
1. Introduction
2. A Study of Infant Mortality in Sri Lanka
   2.1 Infant Mortality in Medical Sociology
   2.2 The Sri Lankan Household Data
3. Discriminant Applications to Identify Risk Groups
   3.1 Basic Approaches
   3.2 Optimality Criteria for Discriminants
       3.2.1 Relative Risk
       3.2.2 Decision Theoretic Bayes Rules
   3.3 Sample Space Partitions Corresponding to Bayes Rules
       3.3.1 Linear Discriminants for Normal Distributions
       3.3.2 Logistic Linear Discriminants
       3.3.3 Classification Trees: Recursive Partitioning
   3.4 Construction of Discriminants from Sample Data
       3.4.1 Logistic Discriminant
       3.4.2 CART Discriminant: Growing a Classification Tree
       3.4.3 CART Discriminant: Pruning a Classification Tree
             3.4.3.1 Test Sample Estimates of Risk
             3.4.3.2 Cross-Validation Estimates of Risk
4. Path Analysis
   4.1 Structural Modelling with Quantitative Data
       4.1.1 Path Models
       4.1.2 Estimation and Interpretation of Path Coefficients
   4.2 Structural Modelling with Qualitative Data
       4.2.1 Loglinear and Logit Models
       4.2.2 Path Models
       4.2.3 Estimation of Path Coefficients
       4.2.4 Goodness-of-Fit for Path Models
5. Statistical Analyses on the Sri Lankan Household Data
   5.1 Identification of Infant Mortality Risk Groups
       5.1.1 Logistic Discrimination
       5.1.2 Discrimination Using CART
       5.1.3 Discussion
   5.2 Causal Modelling
       5.2.1 Structural Modelling with Quantitative Data
       5.2.2 Structural Modelling with Qualitative Data
       5.2.3 Discussion
6. Remarks and Recommendations on Statistical Methods Used to Identify Risk Groups
Bibliography
Appendix I    Partitioning the Sample Space Using Logistic Discrimination (Younger Women)
Appendix II   Modified Path Analysis - Model Selection (Younger Women)
Appendix III  Modified Path Analysis - Model Selection (Older Women)

LIST OF TABLES

Table I     Variables used in the Sri Lankan household study
Table II    Households used in the analysis
Table III   Estimated direct and indirect effects for path model (4.3)
Table IV    Various loglinear models for three-dimensional tables
Table V     Results of forward stepwise logistic regression
Table VI    Comparison of sample space partitioning between logistic discrimination and CART
Table VII   Estimated logistic regression equations for younger women
Table VIII  Estimated direct and indirect effects on infant death
Table IX    Variables used in modified path analysis
Table X     Goodness-of-fit statistics for loglinear models (younger women)
Table XI    Goodness-of-fit statistics for loglinear models (older women)

LIST OF FIGURES

Figure 1   Conceptual model of medical sociological approach to research on infant mortality
Figure 2   Examples of relative risk functions for known probability densities
Figure 3   An example of a binary tree
Figure 4   An example of a path diagram
Figure 5   An example of a path diagram with path coefficients
Figure 6   An example of a colored path diagram
Figure 7   A path model with estimated path coefficients
Figure 8   A path model with dichotomous variables
Figure 9   CART results for the younger women
Figure 10  CART results for the older women
Figure 11  Path model specifying temporal relationships among selected variables
Figure 12  Path analysis results for the younger women
Figure 13  Path analysis results for the older women
Figure 14  Path diagram showing causal links implied by selected logit models for younger women
Figure 15  Path diagram showing causal links implied by selected logit models for older women

ACKNOWLEDGEMENTS

I would like to thank Dr. Nancy E. Waxler-Morrison for providing the data and the stimulus for my research. I am grateful to Dr. Ned Glick for his guidance, suggestions, and patience in producing this thesis. Dr. A. John Petkau's helpful comments are also greatly appreciated. Finally, I thank my husband, Scott, for his continuous encouragement and support. Without his belief in me, it might have taken me longer to get here.

1. Introduction

A study of infant mortality in Sri Lanka was conducted by a team of social scientists during 1980-81 (before the current civil war) to identify households and socioeconomic conditions in which there was a high risk of experiencing an infant death. Further, relationships among risk factors would also be of interest for future planning of preventive health programs in developing countries. Similar applications of multivariate analyses are widely used to identify risk groups in epidemiology, urban planning, economics, business, etc. This thesis explores and applies various statistical methods for assessing risk groups and relationships among risk factors.

Risk groups and discriminating factors can be identified by a variety of statistical discriminant and modelling methods. The most often used criterion for determining the goodness of a discriminant rule has been the rate of misclassification. However, the importance of misclassification rate varies depending on the purpose of discrimination. In medical diagnosis, the objective is to pinpoint as accurately as possible the cause of symptoms. Since it is not desirable to subject a healthy individual to possibly detrimental treatments, such as chemotherapy, nor to leave an infection untreated because of misdiagnosis, discriminant rules with low misclassification rates are preferred. In medical screening, say early breast cancer detection, examinations are performed on apparently healthy volunteers from the general population, for the purpose of separating them into groups with high and low probabilities for breast cancer (Sackett and Holland 1975). The idea of a screening discriminant is to use a few inexpensive measurements to capture all those with the disorder in a high risk group, so that more complicated, and often more expensive, examinations need be performed only on this smaller group of individuals.
Thus, factors considered to be good screening factors may not be acceptable diagnosis factors. In epidemiology and medical sociology, the main objective is to discover the context in which a disorder may occur. For example, homosexual men were identified as the first high risk group in studies of AIDS, although homosexuality per se is not the cause of disease; and clearly, by using sexual orientation as a discriminant rule, the misclassification rate would be high.

In our Sri Lankan household study, the risk of infant death is being examined from the socioeconomic and political perspective. Health planning involves not only the understanding of biomedical causes of infant death, but also the social context in which infant death may occur. Although discriminant rules constructed using socioeconomic and political variables may not have low misclassification rates, the socioeconomic and political conditions under which a family is most likely to experience an infant death can still be identified. Thus, the goal is to find discriminating variables and discriminant rules that partition the households into distinguishable groups with respect to the risk of infant mortality. In this thesis, two other criteria for determining the goodness of a discriminant rule are investigated, and discriminant methods that are appropriate for the Sri Lankan household data set are applied.

A second objective of the Sri Lankan household study is to test a theoretical model that places infant mortality at the center of an expanding series of social contexts. Infant deaths may be affected by proximate factors such as inadequate nutrition or poor sanitation creating conditions for tetanus or diarrhea. These proximate factors may be influenced by the education level of the mother and the economic status of the family, which in turn may be linked to ethnic group membership. Path analysis is the standard method used to analyze such models in the social sciences.
However, the assumptions under which this method is developed are highly restrictive. Thus, the use of path analysis is limited. In this thesis, limitations of the methodology are explored, and alternative techniques are considered.

2. A Study of Infant Mortality in Sri Lanka

2.1 Infant Mortality in Medical Sociology

In medical sociology, infant mortality is viewed as a consequence of biosocial interactions. The key idea behind the biomedical model of disease is that etiology is biologically specific. Hence, medical research is primarily focused on disease agents and host-agent interactions. On the other hand, social science research on infant mortality has traditionally concentrated on the association between socioeconomic status and the level and pattern of mortality in the population. The specific medical causes of death are generally not addressed by social scientists.

Medical sociology attempts to bridge these two approaches to the study of infant mortality. Mosley and Chen (1984) proposed a framework based on the premise that "all social and economic determinants of child mortality necessarily operate through a common set of biological mechanisms, or proximate determinants, to exert an impact on mortality". This framework can be summarized by the following illustration.

    socioeconomic factors  ->  biomedical factors  ->  infant mortality

Figure 1  Conceptual model of medical sociological approach to research on infant mortality

Primary causes of infant death in developing countries are well understood from the medical perspective. One of the factors that contributes to high infant mortality rates is risk of infection. Patel (1980) noted the common use of dung as a healing agent prior to 1940 in Sri Lanka. As documented by the Registrar of Ceylon Medical College in 1906, tetanus, a common cause of infant death, often resulted from infection of the navel after separation of the umbilical cord in childbirth.
This source of infection can easily be eliminated by abolishing such practice. Another source of infection is the contaminated water supply caused by lack of proper sanitation facilities. This source of infection may be eliminated by construction of sanitary latrines. In general, most infant deaths are preventable with current understanding of disease transmission and existing health technology.

Although most infant deaths are preventable with the available technology, the social context in which infant death occurs may block the use of such technology. The Sri Lankan government has created a subsidy program for the construction of latrines. However, many families are too poor to take advantage of such subsidies. Another example involves the use of hospitals for childbirths. Waxler et al. (1985) suggest that childbirth may not be considered serious enough to require a doctor's care. Thus, hospitals for maternity care are sometimes not used, even though these hospitals, which are essentially free, are within short distances. Therefore, in order to design an effective package of health policies to promote infant survival, the biomedical and the social context of the problem must be examined concurrently (Mosley 1984).

Two recent developments in sociological research have also altered the approach to infant mortality studies, as pointed out by Waxler et al. (1985). First, McKeown (1976) has argued that changes in health status across time are probably better predicted by changes in sanitation and available food supplies than by health care or the narrowly defined medical variables that are often considered. Secondly, infant mortality has been used, by development economists and others, as a central indicator of the state of development, or quality of life, of populations in developing countries (Morris 1979). These developments have called for expanded models that place infant mortality in a larger social context.
The proximate causes of infant death may be inadequate nutrition (Puffer and Serrano 1973) or poor sanitation and water supply that create conditions for tetanus or diarrhea (Patel 1980, and Smucker et al. 1980). However, these proximate causes may be related to the maternal education level (Caldwell and McDonald 1982, Simmons and Bernstein 1982, and Chowdhury 1982), economic status of the family (Grosse and Perry 1982, and Waxler et al. 1985), and access to health services (World Bank 1975), which in turn may be related to ethnic group membership (Waxler et al. 1985). In the Sri Lankan household study, relationships between infant mortality and various biomedical and socioeconomic factors are examined.

2.2 The Sri Lankan Household Data

As described in Waxler et al. (1985), the 22 districts of Sri Lanka were divided into three clusters having different patterns of quality of life, based on results of a previous study (Morrison and Waxler 1984). Four villages representative of a typical district from each of the three clusters were selected. For each village, a random sample of 40 households was drawn from the population list. A household was substituted only if the sampled house was empty, or if both male and female head of household were absent in several calls over a period of weeks. Approximately 30 substitutions were made in the sample of 480 households. The researchers who devised this sampling scheme regard the sampled households as being representative of the Sri Lankan village population.

A long systematic set of open questions was used for interviewing both the male and the female head of household. The questions elicited information on health, housing, nutrition, employment, education, etc. The female head of the household, in addition, reported on the number of live births in her lifetime, and the number of her children who died before reaching age one.
Information on the cause of death (or symptoms at death) was also obtained for each infant that died. The variable of primary interest in our analysis is a dichotomous response indicating whether or not the female head of household has experienced at least one infant death. All explanatory variables used in the study are listed in Table I.

Of the households sampled, 391 (82% of the total sample) have complete information on the variables of interest. Table II shows that 92% of the total sample satisfied the initial inclusion criterion: a female head of household with known child-bearing history and known number of infant deaths must be present in the household. Further, the table shows that 12% of these households had missing information (where 11% have at most one missing variable and 1% have two missing variables). Most missing values appear in the variables concerning family income, and among older female heads of household; otherwise, there was no noticeable pattern when the distribution of households with missing information was examined for each variable.

Several populations may require separate analysis in this study. Women with more childbirths are more likely to have experienced at least one infant death. Thus, the Sri Lankan village population is separable with respect to the dichotomous response on infant death by the number of childbirths. Furthermore, several explanatory variables may have different relevance to women of different age groups. For example, the use of health services for childbirth is restricted by availability, which may vary across time. The impact of ethnicity may also vary for the different generations. Thus, analysis should be performed separately for the various age groups. However, the available sample size restricts the number of allowed strata. Since older women also tend to have more childbirths, the sample is divided into two groups based on the woman's age (<44 and 44+).
Most women in the latter age group are postmenopausal; thus, women in this age group have similar numbers of childbirths. In contrast, the number of childbirths varies for women in the younger age group. Since there is a one-to-one correspondence between household and female head of the household, the terms household and woman will be used interchangeably to refer to a unit of observation throughout this thesis.

In our analysis of this Sri Lankan household survey, the two data sets corresponding to those women of age <44 (250 cases) and those of age 44+ (141 cases) are treated as simple random samples.

Table I  Variables used in the Sri Lankan household study

Name  Explanation                                        Codes
      Infant death indicator                             1 = at least one; 2 = none
      No. of languages spoken at home                    1 = one; 2 = two or more
      Current usage of health services - where was       1 = hospital; 2 = home with midwife;
        the last child born?                               3 = home without midwife
      Nutrition - no. of protein foods consumed in       0 = none; 1 = one type; ...; 4 = four types
        the past week, from four most common types
        listed
      Sanitation                                         1 = none; 2 = communal latrine;
                                                           3 = own / open-pit type;
                                                           4 = own / water-sealed; 5 = toilet
      Economic status - no. of household items           0 = none; 1 = one; ...; 5 = five
        owned, from five listed
      No. of hrs a day female head of household          0 = none; 1 = one - three; 2 = four; ...;
        worked outside the home                            7 = nine; 8 = ten or more
      No. of household members currently employed        0 = none; 1 = some; 2 = all
      Primary source of income                           1 = salary; 2 = land/business/boat;
                                                           3 = piece rate; 4 = food stamps etc.
      No. of bus trips taken in the last week            0 = none; 1 = one; ...; 7 = seven;
                                                           8 = eight or more
      Ethnicity                                          1 = Sinhalese; 2 = others
      Years of schooling for female head of household    0 = none; 1 = one; ...; 11 = eleven;
                                                           12 = twelve or more
      Education level of female head relative to         1 = lower; 2 = same; 3 = higher
        that of male head
AGE   Age of female head of household as reported

Table II  Households used in the analysis

Total number of households sampled                          480
    no female head of household                              12
    no child birth or no information on child birth          25
    no information on infant deaths                           1
    number of invalid households                             38
Total number of valid households                            442
    missing information on one variable                      48
    missing information on two variables                      3
    number of excluded households                            51
Total number of households included in analysis             391
    number of women with age <44                            250
    number of women with age 44+                            141

3. Discriminant Applications to Identify Risk Groups

3.1 Basic Approaches

In the Sri Lankan household study, we are interested in deriving discriminant rules that partition the households into distinguishable groups with respect to the risk of experiencing infant death. There are at least three basic approaches to this problem.

An epidemiological perspective uses the notion of relative risk. If a population $t$ can be divided into two disjoint subpopulations, say $t_1$ and $t_2$, then the relative risk of a particular phenomenon is defined to be the occurrence probability in $t_2$ relative to the occurrence probability in $t_1$. For example, we would like observable variables to define some groups $t_1$ and $t_2$ such that the probability of infant death is higher for households in $t_2$ relative to the probability for households in $t_1$. In general, a variable which can partition the population so that one subset has high relative risk is considered an important risk factor.
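As a minimal sketch of the relative risk just described, the following Python function divides the occurrence proportion in the comparison group $t_2$ by that in the reference group $t_1$. The function name `relative_risk` and the counts used in the example are hypothetical illustrations, not values from the Sri Lankan data.

```python
def relative_risk(events_ref, total_ref, events_cmp, total_cmp):
    """Estimate the relative risk of group t2 (comparison) versus t1 (reference).

    Each probability is estimated by a simple sample proportion:
    (number of units with the event) / (number of units in the group).
    """
    if total_ref <= 0 or total_cmp <= 0:
        raise ValueError("group sizes must be positive")
    p_ref = events_ref / total_ref   # estimated occurrence probability in t1
    p_cmp = events_cmp / total_cmp   # estimated occurrence probability in t2
    if p_ref == 0:
        raise ZeroDivisionError("relative risk is undefined when the reference risk is zero")
    return p_cmp / p_ref


# Hypothetical counts, chosen only to illustrate the calculation.
rr = relative_risk(events_ref=20, total_ref=100, events_cmp=40, total_cmp=100)
print(f"estimated relative risk: {rr:.2f}")
```

A value above 1 indicates that the event is estimated to be more probable in $t_2$ than in $t_1$; the estimate says nothing by itself about how well the two groups are separated.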
A second approach is to predict a dichotomous outcome based on some collected information; for example, classify a family as likely or unlikely to experience an infant death based on the sanitation facility, nutrition, etc. available to the family. This approach is generally referred to as discriminant analysis or classification, and as pattern recognition in engineering. The idea is to select discriminating variables and to derive discriminant rules that minimize the expected cost of misclassification. This will be referred to hereafter as the classification approach.

A third approach is to estimate the probability of each outcome or of belonging to each class, given some collected information; for example, estimate the probability of infant death given the educational level of the mother. Using the terminology in Classification and Regression Trees (CART) by Breiman et al. (1984), this approach is called class probability estimation. The methods used in this approach search for variables and rules that minimize a squared error loss function to be defined later (Section 3.2.2).

Obviously, these three approaches are related. For instance, class probability estimation for an observation (e.g. for a family) suggests a discriminant that assigns the observation to whichever class has the maximum probability; and relative risk can be estimated for the resulting discriminant partition. The similarities and differences between these perspectives can be described in terms of various conditional probabilities, and in the more general context of decision theory. Some statistical techniques and software may be adapted to more than one of these approaches.

We will first consider the roles of these approaches in characterizing a good discriminant. The underlying principles of discrimination will be discussed in the context where the various conditional probabilities are known.
However, in practice these conditional probabilities are often unknown, and need to be estimated from the sample data. The last section of this chapter (Section 3.4) describes how these estimates may be obtained.

3.2 Optimality Criteria for Discriminants

3.2.1 Relative Risk

Relative risk is generally considered in a context relating the presence or absence of a specific disease to exposure levels for some possible risk factor(s) (Schlesselman 1982). The concept of relative risk is simplest when exposure level is dichotomous (presence or absence of a factor). A high relative risk (of disease) among those exposed suggests that the factor may be a cause of disease (Breslow and Day 1980, Schlesselman 1982, Hennekens and Buring 1987).

Let $X$ be a random variable that indicates the level of exposure to a specific risk factor. Suppose there are only two levels.

Definition 3.1  Relative risk is defined as

$$RR = \frac{P(\text{disease} \mid X = 2)}{P(\text{disease} \mid X = 1)}.$$

When $RR > 1$, the probability of disease in the population with $X = 2$ is higher than the probability of disease in the population with $X = 1$. The reverse relationship is implied when $RR < 1$.

Historically, relative risk was used primarily for dichotomous variables. But suppose the random variable $X$ is continuous on the real line, or positive half-line, etc. Then by considering $X$ as a risk factor, we are interested in partitioning the real line into two regions, distinguishable with respect to the risk of disease.

Is it reasonable to use relative risk as a partitioning criterion? Suppose the disease-present and the disease-absent populations have densities of $X$ denoted respectively by $p(x \mid \text{disease})$ and $p(x \mid \text{no disease})$, which, in practice, may be estimated from sample data. If $p(x \mid \text{disease})$ is right-shifted with respect to $p(x \mid \text{no disease})$, then, at least for most smooth unimodal densities, the ideal partition is in the form of half-lines, $\{X < c\}$ and $\{X > c\}$, for some $c$ on the real line.
Thus, by Bayes theorem, for any $c \in \mathbb{R}$, the corresponding relative risk is

$$RR(c) = \frac{P(\text{disease} \mid X > c)}{P(\text{disease} \mid X < c)} = \frac{P(X > c \mid \text{disease})\, P(X < c)}{P(X > c)\, P(X < c \mid \text{disease})}. \tag{3.2}$$

The two examples illustrated in Figure 2 show that for densities with monotone likelihood ratio, $RR(c)$ may increase to infinity as $c$ decreases; but the discriminants corresponding to such extreme $c$ are of no practical value. Thus, choosing $c$ to maximize $RR(c)$ is not a useful criterion for partitioning.

Furthermore, because $RR(c)$ may not be a monotone function, relative risk values do not provide information on how well separated the two populations, disease-absent and disease-present, are. For example, a relative risk value of about 2 can arise from different partitions of the real line in either of the two situations illustrated in Figure 2. Since relative risk does not indicate the magnitude of shift between the disease-present and disease-absent densities, relative risk is not necessarily informative about the practical discriminating nature of a risk factor that is continuous rather than dichotomous.

Figure 2  Examples of relative risk function for known probability densities
[Four panels: density plots $p(x)$ against $x$ for N(0,1) vs. N(1,1) and for N(0,1) vs. N(2,1), with the curves labelled "not diseased" and "diseased"; and the corresponding relative risk functions $RR(c)$ for N(0,1) vs. N(1,1) and for N(0,1) vs. N(2,1).]

These properties indicate that relative risk may not be a meaningful criterion for selecting discriminating variables.
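The behaviour of $RR(c)$ in equation (3.2) can be explored numerically. The sketch below evaluates the right-hand side of (3.2) for two unit-variance normal densities, as in the N(0,1) vs. N(1,1) example. The disease prevalence of 0.5 is an assumption made here purely for illustration, since the value underlying Figure 2 is not stated in this section; the function names are likewise hypothetical.

```python
import math


def normal_cdf(x, mean=0.0, sd=1.0):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))


def rr_cutoff(c, prevalence, mean_absent=0.0, mean_present=1.0, sd=1.0):
    """Relative risk RR(c) of equation (3.2) for the half-line partition at c.

    prevalence is P(disease); it is an assumed input, not a value from the thesis.
    """
    lower_present = normal_cdf(c, mean_present, sd)   # P(X < c | disease)
    lower_absent = normal_cdf(c, mean_absent, sd)     # P(X < c | no disease)
    lower_all = prevalence * lower_present + (1.0 - prevalence) * lower_absent  # P(X < c)
    upper_present = 1.0 - lower_present               # P(X > c | disease)
    upper_all = 1.0 - lower_all                       # P(X > c)
    return (upper_present * lower_all) / (upper_all * lower_present)


# Tabulate RR(c) over a range of cutoffs, with an assumed prevalence of 0.5.
for c in (-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0):
    print(f"c = {c:5.1f}   RR(c) = {rr_cutoff(c, prevalence=0.5):.3f}")
```

Under these assumptions the tabulated values grow without bound as $c$ decreases and are not monotone for larger $c$, which is consistent with the two properties of $RR(c)$ described above.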
Even though relative risk associated with a particular discriminant may be of interest, relative risk per se is not usually an appropriate criterion for construction of a discriminant.

3.2.2 Decision Theoretic Bayes Rules

Although the formal objective differs for classification and class probability estimation, both approaches use discriminant methods that can be described in a general framework of decision theory as presented in Classification and Regression Trees (CART) by Breiman et al. (1984). In the following, discussion will be restricted to the two-class problem, which is appropriate for the Sri Lankan household study. Generalization to more than two classes can easily be made.

Let $\mathcal{X}$ be the sample space of possible measurement vectors, and let $\mathcal{C} = \{1, 2\}$ denote the set of possible classes. Further, let $X \in \mathcal{X}$ be a random variable whose distribution is denoted by $P(dx)$, and let $Y \in \mathcal{C}$ denote the class membership. Suppose $\mathcal{A}$ is the set of possible actions.

Definition 3.2  A decision rule $d$ is an $\mathcal{A}$-valued function on $\mathcal{X}$:

$$d : \mathcal{X} \to \mathcal{A}.$$

Definition 3.3  A loss function $L$ is a real-valued function on $\mathcal{C} \times \mathcal{A}$:

$$L : \mathcal{C} \times \mathcal{A} \to \mathbb{R}.$$

Thus $L(y, a)$ is the loss when $Y = y$ and $a \in \mathcal{A}$ is the action taken.

Definition 3.4  The risk $R(d)$ is the expected loss when the decision rule $d$ is used. That is,

$$R(d) = E[\, L(Y, d(X)) \,].$$

In the classification approach, we are interested in predicting the class membership of an object with measurement vector $X = x$. Thus, we want to construct decision rules that assign class membership in $\mathcal{C}$ to every measurement vector $x \in \mathcal{X}$, and so, let the action space $\mathcal{A}_c$ be $\mathcal{C}$. Furthermore, any decision rule $d$ is equivalent to the partition of sample space $\mathcal{X}$ into two regions, $t_1$ and $t_2$, such that an object with measurement vector $x \in t_j$ is classified as class $j$, for $j = 1, 2$. These rules will be called classification rules.
The loss function, $L_c(y, a)$, in this situation is the cost of classifying a class $y$ object as a class $a$ object, denoted by $C(a \mid y)$. Suppose $C(a \mid y)$ is positive when $a \neq y$ and is 0 otherwise. Then the risk or expected cost of using decision rule $d$ is given by

$$R_c(d) = C(1 \mid 2)\, P(Y = 2, X \in t_1) + C(2 \mid 1)\, P(Y = 1, X \in t_2). \tag{3.3}$$

Let the probability that an observation comes from class $j$ be $\pi_j$, for $j = 1, 2$. In epidemiological terms, these a priori probabilities are prevalences of the two classes. Further, let the conditional probability density of $X$, given an object from class $j$, be denoted by $p(x \mid j)$ for $j = 1, 2$. Then the risk in (3.3) can be re-expressed as

$$R_c(d) = C(1 \mid 2)\, \pi_2 \left[ \int_{t_1} p(x \mid 2)\, dx \right] + C(2 \mid 1)\, \pi_1 \left[ \int_{t_2} p(x \mid 1)\, dx \right]. \tag{3.4}$$

In the class probability estimation approach, we are interested in obtaining an estimate of the probability that an object with measurement vector $X = x$ belongs to class $j$. That is, we are interested in estimating $p(j \mid x) = P(Y = j \mid X = x)$, $j = 1, 2$. Thus, we want to construct rules of the type $d(x) = (d(1 \mid x), d(2 \mid x))$ with $d(j \mid x) \geq 0$ for $j = 1, 2$, and $\sum_j d(j \mid x) = 1$, for every $x \in \mathcal{X}$. Such rules will be called class probability estimators. Hence, the action space $\mathcal{A}_p$ consists of all pairs of nonnegative numbers that sum to 1. Let the loss function $L_p(y, a)$ for $a = (\alpha_1, \alpha_2) \in \mathcal{A}_p$ be defined by

$$L_p(y, a) = \sum_j \left( \delta_j(y) - \alpha_j \right)^2, \tag{3.5}$$

where $\delta_j(y)$ is the Kronecker delta (1 if $y = j$ and 0 otherwise), for $j = 1, 2$. Then the risk of a decision rule $d$ is given by

$$R_p(d) = E[\, L_p(Y, d(X)) \,] = E\Big[ \sum_j \big( d(j \mid X) - \delta_j(Y) \big)^2 \Big]. \tag{3.6}$$

But given $X = x$, $\delta_j(Y)$ is a Bernoulli random variable with success probability $p(j \mid x)$, for $j = 1, 2$. Thus, $E[\, \delta_j(Y) \mid X = x \,] = p(j \mid x)$ and

$$E\big[ (\delta_j(Y) - p(j \mid x))^2 \mid X = x \big] = \mathrm{Var}[\, \delta_j(Y) \mid X = x \,] = p(j \mid x)\,[1 - p(j \mid x)]. \tag{3.7}$$

Hence, for any $a \in \mathcal{A}_p$,

$$
\begin{aligned}
E[\, L_p(Y, a) \mid X = x \,]
&= \sum_j E\big[ (\delta_j(Y) - p(j \mid x) + p(j \mid x) - \alpha_j)^2 \mid X = x \big] \\
&= \sum_j p(j \mid x)\,[1 - p(j \mid x)] + \sum_j (p(j \mid x) - \alpha_j)^2 \\
&= 2\, p(1 \mid x)\, p(2 \mid x) + \sum_j (p(j \mid x) - \alpha_j)^2,
\end{aligned}
$$

from (3.7). The cross terms vanish because $E[\, \delta_j(Y) - p(j \mid x) \mid X = x \,] = 0$, and the last step uses $p(1 \mid x) + p(2 \mid x) = 1$.
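The decomposition of the conditional expected squared error loss can be checked numerically. The sketch below computes $E[L_p(Y, a) \mid X = x]$ directly from the two possible outcomes of $Y$ and compares it with $2p(1 \mid x)p(2 \mid x) + \sum_j (p(j \mid x) - \alpha_j)^2$. The function names and the probability values are hypothetical choices for illustration only.

```python
def expected_loss(p1, a):
    """Direct conditional expectation of the squared error loss.

    p1 is p(1 | x); a = (alpha_1, alpha_2) is the action (a pair summing to 1).
    Class labels 1 and 2 are stored at tuple indices 0 and 1.
    """
    p = (p1, 1.0 - p1)
    total = 0.0
    for y in (0, 1):  # index of the true class
        # Loss when the true class is y: sum over j of (delta_j(y) - alpha_j)^2.
        loss = sum(((1.0 if j == y else 0.0) - a[j]) ** 2 for j in (0, 1))
        total += p[y] * loss
    return total


def decomposed_loss(p1, a):
    """Closed form: 2 p(1|x) p(2|x) + sum_j (p(j|x) - alpha_j)^2."""
    p2 = 1.0 - p1
    return 2.0 * p1 * p2 + (p1 - a[0]) ** 2 + (p2 - a[1]) ** 2


# Hypothetical values: p(1|x) = 0.2 and a few candidate actions.
for action in ((0.2, 0.8), (0.3, 0.7), (0.5, 0.5)):
    print(action, expected_loss(0.2, action), decomposed_loss(0.2, action))
```

In this sketch the two computations agree for every action, and the expected loss is smallest when the action equals the true conditional class probabilities.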
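Therefore, for class probability estimation, the risk of a rule $d$ is given by

$$R_p(d) = 2\, E[\, p(1 \mid \mathbf{X})\, p(2 \mid \mathbf{X}) \,] + E\Big[ \sum_j \left( p(j \mid \mathbf{X}) - d(j \mid \mathbf{X}) \right)^2 \Big], \tag{3.8}$$

where the first term does not depend on the rule.

Definition 3.5 A Bayes rule is a decision rule $d_B$ that minimizes the risk function $R(d)$.

In the classification approach, a Bayes rule $d_B$ that minimizes the expected cost as expressed in (3.4) is obtained by choosing

$$\begin{aligned}
t_1 &= \left\{ \mathbf{x} \in \mathcal{X} : \frac{p(\mathbf{x} \mid 1)}{p(\mathbf{x} \mid 2)} \geq \frac{C(1 \mid 2)\, \pi_2}{C(2 \mid 1)\, \pi_1} \right\}, \text{ and} \\
t_2 &= \left\{ \mathbf{x} \in \mathcal{X} : \frac{p(\mathbf{x} \mid 1)}{p(\mathbf{x} \mid 2)} < \frac{C(1 \mid 2)\, \pi_2}{C(2 \mid 1)\, \pi_1} \right\},
\end{aligned} \tag{3.9}$$

as shown in Anderson (1984), with the Bayes risk as given in (3.4) with the above regions $t_1$ and $t_2$.

In the class probability estimation approach, the unique Bayes rule is given by $d_B(\mathbf{x}) = (\, p(1 \mid \mathbf{x}),\, p(2 \mid \mathbf{x}) \,)$ for $\mathbf{x} \in \mathcal{X}$, with risk

$$R_p(d_B) = 2\, E[\, p(1 \mid \mathbf{X})\, p(2 \mid \mathbf{X}) \,] = 2 \int p(1 \mid \mathbf{x})\, p(2 \mid \mathbf{x})\, P(d\mathbf{x}), \tag{3.10}$$

which can be seen easily from (3.8).

Bayes rule and Bayes risk can also be defined for a partition of the sample space $\mathcal{X}$.

Definition 3.6 The partition function $\tau$ associated with the partition $\mathcal{T}$ is defined as $\tau : \mathcal{X} \to \mathcal{T}$ such that $\tau(\mathbf{x}) = t$ if and only if $\mathbf{x} \in t$, for all $\mathbf{x} \in \mathcal{X}$ and $t \in \mathcal{T}$.

A decision rule $d$ is said to correspond to the partition $\mathcal{T}$ if it is constant on each subset of $\mathcal{T}$. That is, there exists some $\mathcal{A}$-valued function $\omega$ on $\mathcal{T}$ such that $\omega(t) = d(\mathbf{x})$ for every $\mathbf{x} \in t$ and every $t \in \mathcal{T}$. Then a decision rule $d_{\mathcal{T}}$ corresponding to the partition $\mathcal{T}$ is explicitly given by $d_{\mathcal{T}}(\mathbf{x}) = \omega(\tau(\mathbf{x}))$, and the associated risk is given by

$$R(d_{\mathcal{T}}) = \sum_{t \in \mathcal{T}} E[\, L(Y, \omega(t)) \mid \mathbf{X} \in t \,]\, P(t), \tag{3.11}$$

where $P(t) = P(\mathbf{X} \in t)$. Thus $d_{\mathcal{T}}$ is a Bayes rule corresponding to the partition $\mathcal{T}$ if and only if $d_{\mathcal{T}}(\mathbf{x}) = \omega(\tau(\mathbf{x}))$ such that for each $t \in \mathcal{T}$, $a = \omega(t)$ minimizes $E[\, L(Y, a) \mid \mathbf{X} \in t \,]$. For convenience, let $\omega(t)$ be a value that minimizes $E[\, L(Y, a) \mid \mathbf{X} \in t \,]$ over $a \in \mathcal{A}$, for $t \in \mathcal{T}$. Furthermore, for $t \in \mathcal{T}$, let $r(t) = E[\, L(Y, \omega(t)) \mid \mathbf{X} \in t \,]$. Then the Bayes risk corresponding to the partition $\mathcal{T}$ can be written as

$$R(\mathcal{T}) = \sum_{t \in \mathcal{T}} r(t)\, P(t). \tag{3.12}$$

In the classification approach to discrimination, a Bayes rule $d_B$ corresponding to the partition $\mathcal{T}$ is obtained by setting $d_B(\mathbf{x}) = \omega_c(\tau(\mathbf{x}))$ for $\mathbf{x} \in \mathcal{X}$, where, for $t \in \mathcal{T}$, $\omega_c(t)$ is a value $i \in \{1, 2\}$ that minimizes

$$E[\, L_c(Y, i) \mid \mathbf{X} \in t \,] = C(i \mid 1)\, p(1 \mid t) + C(i \mid 2)\, p(2 \mid t),$$

where $p(j \mid t) = P(Y = j \mid \mathbf{X} \in t)$, $j = 1, 2$. Thus, the minimum conditional expected cost of misclassification on subset $t \in \mathcal{T}$ is given by

$$r_c(t) = \min [\, C(2 \mid 1)\, p(1 \mid t),\; C(1 \mid 2)\, p(2 \mid t) \,]. \tag{3.13}$$

Then the Bayes risk for partition $\mathcal{T}$ can be written as

$$R_c(\mathcal{T}) = \sum_{t \in \mathcal{T}} r_c(t)\, P(t). \tag{3.14}$$

In the class probability estimation approach, the unique Bayes rule $d_B$ corresponding to partition $\mathcal{T}$ is obtained by setting $d_B(\mathbf{x}) = \omega_p(\tau(\mathbf{x}))$ for $\mathbf{x} \in \mathcal{X}$, where $\omega_p(t)$ is the pair of nonnegative values $\mathbf{a} = (a_1, a_2)$ that minimizes

$$\begin{aligned}
E[\, L_p(Y, \mathbf{a}) \mid \mathbf{X} \in t \,]
&= \sum_j E[\, (\delta_j(Y) - p(j \mid t) + p(j \mid t) - a_j)^2 \mid \mathbf{X} \in t \,] \\
&= \sum_j E[\, (\delta_j(Y) - p(j \mid t))^2 \mid \mathbf{X} \in t \,] + \sum_j (p(j \mid t) - a_j)^2 \\
&= \sum_j p(j \mid t)\,[1 - p(j \mid t)] + \sum_j (p(j \mid t) - a_j)^2,
\end{aligned}$$

since $\delta_j(Y)$ given $\mathbf{X} \in t$ is a Bernoulli random variable with success probability $p(j \mid t) = P(Y = j \mid \mathbf{X} \in t)$ for $j = 1, 2$. Thus for $t \in \mathcal{T}$, $\omega_p(t) = (\, p(1 \mid t),\, p(2 \mid t) \,)$, and the minimum conditional expected loss is given by

$$r_p(t) = 2\, p(1 \mid t)\, p(2 \mid t). \tag{3.15}$$

The Bayes risk for partition $\mathcal{T}$ can then be written as

$$R_p(\mathcal{T}) = \sum_{t \in \mathcal{T}} r_p(t)\, P(t). \tag{3.16}$$

Suppose the sample space $\mathcal{X}$ is to be divided into two regions using the class probability estimation approach. How do these two regions compare with those selected by the classification approach? For any two-region partition $\mathcal{T} = \{t_1, t_2\}$,

$$R_p(\mathcal{T}) = \sum_{t \in \mathcal{T}} r_p(t)\, P(t) = 2\, p(1 \mid t_1)\, p(2 \mid t_1)\, P(t_1) + 2\, p(1 \mid t_2)\, p(2 \mid t_2)\, P(t_2). \tag{3.17}$$

Suppose $\pi_1$, $\pi_2$, $p(\mathbf{x} \mid 1)$ and $p(\mathbf{x} \mid 2)$ are known as in the classification approach. Then (3.17) can be re-expressed as

$$\begin{aligned}
R_p(\mathcal{T}) &= 2\, p(1 \mid t_1)\, P(\mathbf{X} \in t_1 \mid 2)\, \pi_2 + 2\, p(2 \mid t_2)\, P(\mathbf{X} \in t_2 \mid 1)\, \pi_1 \\
&= 2\, p(1 \mid t_1)\, \pi_2 \left[ \int_{t_1} p(\mathbf{x} \mid 2)\, d\mathbf{x} \right] + 2\, p(2 \mid t_2)\, \pi_1 \left[ \int_{t_2} p(\mathbf{x} \mid 1)\, d\mathbf{x} \right].
\end{aligned} \tag{3.18}$$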
But this is the same as the expected cost (3.4) of a classification rule if $2\, p(1 \mid t_1) = C(1 \mid 2)$ and $2\, p(2 \mid t_2) = C(2 \mid 1)$. Let $\mathcal{T}^* = \{t_1^*, t_2^*\}$ be the partition with minimum risk $R_p(\cdot)$ among all two-region partitions; that is, let $\mathcal{T}^*$ be the best two-region partition using the class probability estimation approach. Suppose the cost ratio is given by

$$\frac{C(1 \mid 2)}{C(2 \mid 1)} = \frac{p(1 \mid t_1^*)}{p(2 \mid t_2^*)}.$$

Then from (3.9), a Bayes rule that minimizes the expected cost in (3.4) is determined by the partition $\mathcal{T}^*$. Therefore, by varying the cost ratio, the best two-region partition determined by the class probability estimation approach can be obtained from the classification approach.

3.3 Sample Space Partitions Corresponding to Bayes Rules

In the following sections, some of the commonly used methods for discriminant analysis are presented. The most widely used method assumes multivariate normality for the observations from both classes. In this case, a Bayes rule is obtained by choosing a linear partition that minimizes the risk function. The logistic discrimination procedure also provides a linear partition for use with both normal and certain non-normal populations. Methods based on nonparametric density estimation algorithms, such as kernel and nearest neighbor methods, are also available, but will not be covered in this thesis. Instead, the method of classification trees is explored. A recent report produced by a panel on Discriminant Analysis and Clustering (DAC report), which was created under the Committee on Applied and Theoretical Statistics, National Research Council (1988), provides a helpful summary of all these methods. In the following, we present three of these methods from the decision theoretic perspective. In addition, we examine the classification trees method in much greater detail.

3.3.1 Linear Discriminants for Normal Distributions

In the classification problem, by assuming the two class-conditional distributions are known multivariate normal with equal covariance matrices, namely $N(\boldsymbol{\mu}_1, \Sigma)$ and $N(\boldsymbol{\mu}_2, \Sigma)$, Wald (1944) showed that a Bayes rule is obtained by choosing the linear partition given by

$$\begin{aligned}
t_1 &= \{\, \mathbf{x} \in \mathcal{X} : \mathbf{x}^T \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \geq k_1 \,\}, \text{ and} \\
t_2 &= \{\, \mathbf{x} \in \mathcal{X} : \mathbf{x}^T \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) < k_1 \,\},
\end{aligned} \tag{3.19}$$

where the point $k_1$ is a function of $\pi_1$, $\pi_2$, $C(1 \mid 2)$, $C(2 \mid 1)$, $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$ and $\Sigma$; see Anderson (1984), Hand (1981), Dillon and Goldstein (1984), and others. The linear projection given by $\mathbf{x}^T \Sigma^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ is sometimes called the normal linear discriminant function.

However, in most applications, the mean vectors and the covariance matrices are unknown. Suppose there is a sample of size $N_1$ from class 1 and a sample of size $N_2$ from class 2. Let $\boldsymbol{\mu}_j$ be estimated by the usual mean $\bar{\mathbf{x}}_j$ of the sample from the class $j$ population for $j = 1, 2$, and let $\Sigma$ be estimated by the pooled sample covariance $S$ defined by

$$S = \frac{(N_1 - 1)\, S_1 + (N_2 - 1)\, S_2}{N_1 + N_2 - 2},$$

where $S_1$ and $S_2$ are the corresponding sample covariance matrices. Then the Bayes decision regions are estimated by

$$\begin{aligned}
\hat{t}_1 &= \{\, \mathbf{x} \in \mathcal{X} : \mathbf{x}^T S^{-1} (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) \geq k_2 \,\}, \text{ and} \\
\hat{t}_2 &= \{\, \mathbf{x} \in \mathcal{X} : \mathbf{x}^T S^{-1} (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2) < k_2 \,\},
\end{aligned} \tag{3.20}$$

where the point $k_2$ is a function of $\pi_1$, $\pi_2$, $C(1 \mid 2)$, $C(2 \mid 1)$, $\bar{\mathbf{x}}_1$, $\bar{\mathbf{x}}_2$ and $S$. The linear projection given by $\mathbf{x}^T S^{-1} (\bar{\mathbf{x}}_1 - \bar{\mathbf{x}}_2)$ is the Fisher linear discriminant function suggested by Fisher (1936).

3.3.2 Logistic Linear Discriminants

In the classification problem, logistic discrimination also provides a linear partition of the sample space for use with normal and certain non-normal populations; see Lachenbruch (1975), Hand (1981), Dillon and Goldstein (1984), DAC report (1988), and others. Suppose that the two class population densities can be expressed as

$$p(\mathbf{x} \mid j) = \exp(\alpha_j + \mathbf{x}^T \boldsymbol{\beta}_j), \quad \text{for } j = 1, 2. \tag{3.21}$$

Then by invoking Bayes theorem,

$$p(1 \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid 1)\, \pi_1}{p(\mathbf{x} \mid 1)\, \pi_1 + p(\mathbf{x} \mid 2)\, \pi_2} = \frac{\exp(\eta_0 + \mathbf{x}^T \boldsymbol{\eta})}{1 + \exp(\eta_0 + \mathbf{x}^T \boldsymbol{\eta})}, \tag{3.22}$$

where $\eta_0 = \log(\pi_1 / \pi_2) + (\alpha_1 - \alpha_2)$ and $\boldsymbol{\eta} = \boldsymbol{\beta}_1 - \boldsymbol{\beta}_2$.
This is called a multivariate logistic function, which can be re-expressed as

$$\log \left[ \frac{p(1 \mid \mathbf{x})}{1 - p(1 \mid \mathbf{x})} \right] = \eta_0 + \mathbf{x}^T \boldsymbol{\eta}. \tag{3.23}$$

Thus the probability of belonging to a class given a measurement vector $\mathbf{X} = \mathbf{x}$ can be estimated by modeling the logit of $p(1 \mid \mathbf{x})$ as a linear function of $\mathbf{x}$. Furthermore, by substituting (3.22) into (3.9), the best decision region in the classification setting is given by the partition

$$\begin{aligned}
t_1 &= \{\, \mathbf{x} \in \mathcal{X} : \mathbf{x}^T \boldsymbol{\eta} \geq k_3 \,\}, \text{ and} \\
t_2 &= \{\, \mathbf{x} \in \mathcal{X} : \mathbf{x}^T \boldsymbol{\eta} < k_3 \,\},
\end{aligned} \tag{3.24}$$

where the point $k_3$ is a function of $\alpha_1$, $\alpha_2$, $\pi_1$, $\pi_2$, $C(1 \mid 2)$ and $C(2 \mid 1)$.

So far the logarithm of each class-conditional probability function is assumed to be adequately modeled by a linear function. A slightly more general approach assumes the difference between the logarithms of the class-conditional probability functions is linear. This is equivalent to the approach adopted by Anderson (1972), which assumes the logit of $p(1 \mid \mathbf{x})$ is linear as expressed in (3.23). The equivalence relationship can easily be seen by examining expression (3.22).

Clearly, the model expressed in (3.22) is exact when the class conditional probability density functions are multivariate normal with identical variance-covariance matrices. Thus, for known normal $p(\mathbf{x} \mid 1)$ and $p(\mathbf{x} \mid 2)$, the logistic regression coefficients are functions of the normal parameters, and the Bayes decision regions given in (3.24) correspond to Wald's linear partition in (3.19). However, if the underlying class conditional probability densities are multivariate normal with unknown parameters, then the logistic discrimination procedure cannot be expected to classify as well as does the linear discriminant function (Efron 1975, and Press and Wilson 1978).
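The claim that the logistic form (3.22) is exact for normal classes with a common covariance can be illustrated in one dimension. The sketch below is only an illustration under stated assumptions: the function names and the parameter values are hypothetical, and a single measurement variable with common standard deviation `sigma` stands in for the multivariate case. For that case the logit of $p(1 \mid x)$ is $\eta_0 + \eta x$ with $\eta = (\mu_1 - \mu_2)/\sigma^2$ and $\eta_0 = \log(\pi_1/\pi_2) - (\mu_1^2 - \mu_2^2)/(2\sigma^2)$, because the quadratic terms in $x$ cancel in the density ratio.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def posterior_by_bayes(x, mu1, mu2, sigma, pi1):
    """p(1|x) computed directly from Bayes theorem."""
    num = pi1 * normal_pdf(x, mu1, sigma)
    den = num + (1.0 - pi1) * normal_pdf(x, mu2, sigma)
    return num / den


def posterior_by_logistic(x, mu1, mu2, sigma, pi1):
    """p(1|x) computed from the logistic form with coefficients
    expressed as functions of the normal parameters."""
    eta = (mu1 - mu2) / sigma ** 2
    eta0 = math.log(pi1 / (1.0 - pi1)) - (mu1 ** 2 - mu2 ** 2) / (2.0 * sigma ** 2)
    return 1.0 / (1.0 + math.exp(-(eta0 + eta * x)))


if __name__ == "__main__":
    # Hypothetical parameters, chosen only for illustration.
    mu1, mu2, sigma, pi1 = 1.0, 3.0, 1.5, 0.2
    for x in (-2.0, 0.0, 1.5, 4.0):
        print(x,
              posterior_by_bayes(x, mu1, mu2, sigma, pi1),
              posterior_by_logistic(x, mu1, mu2, sigma, pi1))
```

The two posterior probabilities coincide at every point, so thresholding either one yields the same linear partition.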
3.3.3 Classification Trees: Recursive Partitioning

The technique of classification trees for discriminant analysis was initially developed by Morgan and Sonquist (1963), and Morgan and Messenger (1973) under the name automatic interaction detection (AID). This technique has been pursued and refined by several people. Recent development, under the name classification and regression trees (CART), is described in detail in the book by Breiman et al. (1984). The primary differences between AID and CART are in the tree construction.

The technique of CART creates a binary tree-structured discriminant by repeatedly splitting subsets of sample space $\mathcal{X}$ into two descendant sets, starting with $\mathcal{X}$ itself. An example is illustrated in Figure 3, where $t_1 = \mathcal{X}$, $t_2$ and $t_3$ are disjoint subsets of $t_1$ with $t_2 \cup t_3 = t_1$, and $t_4$ and $t_5$ are disjoint subsets of $t_2$ with $t_4 \cup t_5 = t_2$.

[Figure 3 An example binary tree: root $t_1$ splits into $t_2$ and $t_3$; $t_2$ splits into $t_4$ and $t_5$.]

Those subsets with no descendant sets are called terminal subsets. In the above example, $t_3$, $t_4$ and $t_5$ are the terminal subsets. Thus the technique of CART constructs discriminant rules that partition the sample space as specified by the terminal subsets. That is, $\{t_3, t_4, t_5\}$ forms a partition of the sample space that corresponds to some decision rule.

The tree is constructed based on a set of binary questions of the form {Is $\mathbf{x} \in t$?} for some subset $t$ of $\mathcal{X}$. Let the measurement vector $\mathbf{X}$ be $M$-dimensional, $\mathbf{X} = (X_1, \ldots, X_M)^T$, with a mixture of ordered and categorical types.¹ Then the allowable set of splits is defined as follows:

a. Each split depends on the value of a single variable.
b. For each ordered variable $X_m$, the questions are of the form {Is $x_m \leq c$?}, for all $c$ in the range of $X_m$.
c. For each categorical variable $X_m$, the questions are of the form {Is $x_m \in S$?}, for all subsets $S$ of possible $X_m$-values.

Let $\mathcal{T}$ be a fixed partition and let $t \in \mathcal{T}$ be a fixed subset of $\mathcal{X}$ in $\mathcal{T}$.
Consider a split $s$ of $t$ into two disjoint subsets $t_L$ and $t_R$. Let $\mathcal{T}^*$ be the modification of $\mathcal{T}$ after applying split $s$ to $t$. Then the risk reduction $\Delta R(s, t) = R(\mathcal{T}) - R(\mathcal{T}^*)$ due to the split $s$ is given by

$$\Delta R(s, t) = R(t) - [\, R(t_L) + R(t_R) \,] = P(t)\, [\, r(t) - P_L\, r(t_L) - P_R\, r(t_R) \,], \tag{3.25}$$

where $R(t) = r(t) P(t)$, $P_L = P[\mathbf{X} \in t_L \mid \mathbf{X} \in t]$ and $P_R = P[\mathbf{X} \in t_R \mid \mathbf{X} \in t]$. The relative risk reduction due to the split is then given by

$$\Delta R(s \mid t) = \Delta R(s, t) / P(t) = r(t) - P_L\, r(t_L) - P_R\, r(t_R). \tag{3.26}$$

Thus, the risk reduction partition is achieved by choosing the split $s$ that maximizes the relative risk reduction.

¹ As defined in Breiman et al. (1984), a variable is ordered if its measured values are real numbers; and a variable is categorical if it takes on values from a finite set with no natural ordering. Thus an ordered variable can be a continuous or an ordinal variable.

In the class probability estimation approach,

$$p(j \mid t) = P_L\, p(j \mid t_L) + P_R\, p(j \mid t_R), \quad j = 1, 2.$$

Thus by substituting the above into $r_p(t)$ in (3.15), $\Delta R_p$ can be shown to be

$$\Delta R_p(s \mid t) = 2\, P_L P_R\, [\, p(1 \mid t_L) - p(1 \mid t_R) \,]^2. \tag{3.27}$$

Hence the relative risk reduction is maximized if the difference between class probabilities in the two resulting subsets is maximized. Suppose class 1 corresponds to the class of households with infant death. Then the class probability estimation approach seeks splits that maximize the difference in probability of infant death between the two resulting groups. Furthermore, because of the multiplicative factor $P_L P_R$, the criterion also favors those splits which divide the set $t$ more evenly into two subsets.

Note that relative risk in epidemiology, as defined in Definition 3.1, involves a ratio rather than a difference:

$$\frac{p(1 \mid t_L)}{p(1 \mid t_R)}.$$

Thus a desirable split should have a very high or very low relative risk value. In any case, there is no way of ensuring even splits. Therefore, as discussed in Section 3.2.1, using relative risk as a partitioning criterion may not provide splits of practical value.
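The identity (3.27) and the contrast with the relative risk ratio can be illustrated with a short sketch. Everything in it is hypothetical: the function names and the two candidate splits are made up for illustration and are not taken from the Sri Lankan data.

```python
def gini(p1):
    """r_p(t) = 2 p(1|t) p(2|t) for a two-class node."""
    return 2.0 * p1 * (1.0 - p1)


def relative_risk_reduction(p_left, p1_left, p1_right):
    """r_p(t) - P_L r_p(t_L) - P_R r_p(t_R), computed from the definition."""
    p_right = 1.0 - p_left
    p1_parent = p_left * p1_left + p_right * p1_right
    return gini(p1_parent) - p_left * gini(p1_left) - p_right * gini(p1_right)


def closed_form_reduction(p_left, p1_left, p1_right):
    """2 P_L P_R [p(1|t_L) - p(1|t_R)]^2, the closed form in (3.27)."""
    return 2.0 * p_left * (1.0 - p_left) * (p1_left - p1_right) ** 2


def relative_risk_ratio(p1_left, p1_right):
    """Epidemiological relative risk of the left group versus the right group."""
    return p1_left / p1_right


if __name__ == "__main__":
    # Two hypothetical splits of the same node: (P_L, p(1|t_L), p(1|t_R)).
    even_split = (0.50, 0.30, 0.10)
    uneven_split = (0.02, 0.50, 0.10)
    for name, (p_left, p1_l, p1_r) in (("even", even_split), ("uneven", uneven_split)):
        print(name,
              relative_risk_reduction(p_left, p1_l, p1_r),
              closed_form_reduction(p_left, p1_l, p1_r),
              relative_risk_ratio(p1_l, p1_r))
```

In this made-up example the uneven split has the larger relative risk ratio but isolates only a small fraction of the node, whereas the even split gives the larger risk reduction, which is the behaviour described above.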
Risk reduction is not a good criterion for choosing a split in the classification approach, for two reasons. First, Breiman et al. (1984: pp. 95-96) showed that for any split of $t$ into $t_L$ and $t_R$,

$$R_c(t) \geq R_c(t_L) + R_c(t_R),$$

with equality if $j^*(t) = j^*(t_L) = j^*(t_R)$, where $j^*(u)$ minimizes $C(j \mid 1)\, p(1 \mid u) + C(j \mid 2)\, p(2 \mid u)$ over $j \in \{1, 2\}$, for a subset $u$ of $\mathcal{X}$. Thus, it is conceivable that every allowable split of $t$ may produce a partition for which $\Delta R_c(s, t)$ is zero. In situations where the population is dominated by a single class, the risk reduction criterion may result in no splits. The second defect is that the risk reduction partition (in the classification approach) is a one-step optimization process that does not account for future splits. In some situations, the best current choice of split may not provide the best overall improvement in strategic position. For further discussion of these considerations, see Breiman et al. (1984: pp. 94-98).

Two splitting criteria for the classification approach have been implemented in the CART software: the Gini criterion and the Twoing criterion. In the two-class problem, these criteria can be shown to coincide (Breiman et al. 1984: pp. 104-108). Thus, in this thesis only the Gini criterion is considered.

Let $\mathcal{T}$ be any partition of sample space $\mathcal{X}$. For $t \in \mathcal{T}$, instead of $r(t)$ consider an impurity function $i(t)$ defined by

$$i(t) = 2\, p(1 \mid t)\, p(2 \mid t), \tag{3.28}$$

called the Gini diversity index. Then, the partition impurity $I(\mathcal{T})$ is defined by

$$I(\mathcal{T}) = \sum_{t \in \mathcal{T}} i(t)\, P(t). \tag{3.29}$$

Thus the impurity reduction due to the split $s$ is $\Delta I(s, t) = I(\mathcal{T}) - I(\mathcal{T}^*)$, where $\mathcal{T}$ and $\mathcal{T}^*$ are as defined for $\Delta R(s, t)$ earlier; and the relative impurity reduction due to the split $s$ is given by

$$\Delta I(s \mid t) = \Delta I(s, t) / P(t) = 2\, P_L P_R\, [\, p(1 \mid t_L) - p(1 \mid t_R) \,]^2. \tag{3.30}$$

But this is precisely the risk reduction criterion used in the class probability estimation approach as expressed in (3.27).
Thus, the impurity reduction partition using the Gini diversity index in the classification approach is the same as the risk reduction partition in the class probability estimation approach. Therefore, the sample space $\mathcal{X}$ is partitioned in the same manner by both approaches when CART is used.

3.4 Construction of Discriminants from Sample Data

Since the measurement variables available in the Sri Lankan household study are mainly ordinal, not continuous, partitioning of the sample space by assuming normal populations may not be appropriate. Thus, only the latter two techniques, logistic linear discrimination and CART, are discussed in this section.

In practice, classification or discrimination problems begin with a sample of correctly classified objects, each with a set of measurements, $\mathbf{x}$. The classification approach uses the sample to derive rules that partition the sample space into disjoint regions, with each region purely or predominantly inhabited by members of a single class. The partitioning of a population into classification regions is similar to, but not quite the same as, the partitioning of a population into groups distinguishable with respect to high and low risk of belonging to a specific class. In principle, class is clearly defined while the terms high risk and low risk are relative. Both the logistic discrimination and the CART technique (for class probability estimation) estimate the class probabilities for each possible measurement vector $\mathbf{x}$ in the sample space $\mathcal{X}$. The high and low risk groups are then defined by choosing a probability threshold.

3.4.1 Logistic Discriminant

Let $\{ (\mathbf{X}_n, Y_n) : n = 1, \ldots, N \}$ be a random sample of size $N$ from the joint distribution of $(\mathbf{X}, Y)$, where $\mathbf{X}$ is an $\mathcal{X}$-valued random variable and $Y$ is a $\mathcal{C}$-valued random variable that denotes the class membership of the observation. Logistic discrimination assumes that

$$\log \left[ \frac{P(Y = 1 \mid \mathbf{x})}{1 - P(Y = 1 \mid \mathbf{x})} \right] = \eta_0 + \mathbf{x}^T \boldsymbol{\eta} \quad \text{for } \mathbf{x} \in \mathcal{X}.$$
Thus, for $\mathbf{x} \in \mathcal{X}$,

$$P(Y = 1 \mid \mathbf{x}) = \frac{\exp(\eta_0 + \mathbf{x}^T \boldsymbol{\eta})}{1 + \exp(\eta_0 + \mathbf{x}^T \boldsymbol{\eta})}. \tag{3.31}$$

Therefore, the parameters $\eta_0$ and $\boldsymbol{\eta}$ can be estimated by maximizing the likelihood function,

$$\prod_{n=1}^{N} P(Y = y_n \mid \mathbf{x}_n).$$

All logistic discriminant analyses performed for the Sri Lankan household study are accomplished by using a logistic regression program, PLR, of BMDP Statistical Software.

3.4.2 CART Discriminant: Growing a Classification Tree

Let $\{ (\mathbf{X}_n, Y_n) : n = 1, \ldots, N \}$ be a random sample of size $N$ from the joint distribution of $(\mathbf{X}, Y)$, where $\mathbf{X}$ is an $\mathcal{X}$-valued random variable and $Y$ is a $\mathcal{C}$-valued random variable that denotes the class membership of the observation. In both the classification and the class probability estimation approach, there are two situations to consider: one when the prior probabilities $\pi_1$ and $\pi_2$ are known, and another when the prior probabilities are unknown.

Consider first the situation where the prior probabilities are known. Let $N_j$ be the number of observations with $y_n = j$, $j = 1, 2$. Suppose $\mathcal{T}$ is a partition of the sample space $\mathcal{X}$. Given $t \in \mathcal{T}$, let $N_j(t)$ be the number of observations with $\mathbf{x}_n \in t$ and $y_n = j$, for $j = 1, 2$. Then estimate $P(t) = P(\mathbf{X} \in t)$ by

$$\hat{P}(t) = \sum_j \pi_j\, \frac{N_j(t)}{N_j}. \tag{3.32}$$

Suppose $\hat{P}(t) > 0$ for all $t \in \mathcal{T}$. Then for $j = 1, 2$, estimate $p(j \mid t) = P(Y = j \mid \mathbf{X} \in t)$ by

$$\hat{p}(j \mid t) = \frac{\pi_j\, N_j(t) / N_j}{\hat{P}(t)}. \tag{3.33}$$

In practical applications, however, the prior probabilities are often unknown. Then for any $t \in \mathcal{T}$, let $N(t)$ be the number of observations with $\mathbf{x}_n \in t$, and estimate $P(t) = P(\mathbf{X} \in t)$ by the proportion of observations in $t$,

$$\hat{P}(t) = \frac{N(t)}{N}. \tag{3.34}$$

Suppose $\hat{P}(t) > 0$ for all $t \in \mathcal{T}$. Then estimate $p(j \mid t)$ by the proportion of observations belonging to class $j$ in the subset $t$,

$$\hat{p}(j \mid t) = \frac{N_j(t)}{N(t)}. \tag{3.35}$$

For any $t \in \mathcal{T}$, let $p(j \mid t)$ be estimated by the appropriate $\hat{p}(j \mid t)$, $j = 1, 2$. In the classification approach, let $\hat{\omega}_c(t)$ be the smallest $i \in \{1, 2\}$ that minimizes $C(i \mid 1)\, \hat{p}(1 \mid t) + C(i \mid 2)\, \hat{p}(2 \mid t)$, and estimate $r_c(t)$ in (3.13) by

$$\hat{r}_c(t) = \min [\, C(2 \mid 1)\, \hat{p}(1 \mid t),\; C(1 \mid 2)\, \hat{p}(2 \mid t) \,]. \tag{3.36}$$

In the class probability estimation approach, let $\hat{\omega}_p(t)$ denote the vector $(\hat{p}(1 \mid t), \hat{p}(2 \mid t))$, and estimate $r_p(t)$ in (3.15) by

$$\hat{r}_p(t) = 2\, \hat{p}(1 \mid t)\, \hat{p}(2 \mid t). \tag{3.37}$$

Using the appropriate $\hat{P}(t)$ and $\hat{r}(t)$, the Bayes risk associated with the partition $\mathcal{T}$ is then estimated by

$$\hat{R}(\mathcal{T}) = \sum_{t \in \mathcal{T}} \hat{r}(t)\, \hat{P}(t). \tag{3.38}$$

Recall from Section 3.3.3 the desirable splitting criterion for either the classification or the class probability estimation approach (see (3.27) or (3.30)). Let $\mathcal{T}$ be a partition of sample space $\mathcal{X}$ with $\hat{P}(t) > 0$ for every $t \in \mathcal{T}$. Consider a split $s$ of $t \in \mathcal{T}$ into $t_L$ and $t_R$, where $\hat{P}(t_L) > 0$ and $\hat{P}(t_R) > 0$. Let

$$\hat{P}_L = \frac{\hat{P}(t_L)}{\hat{P}(t)} \quad \text{and} \quad \hat{P}_R = \frac{\hat{P}(t_R)}{\hat{P}(t)}.$$

Then, the empirical splitting rule for either approach is to choose an allowable split $s$ of $t$ that maximizes

$$\hat{P}_L \hat{P}_R\, [\, \hat{p}(1 \mid t_L) - \hat{p}(1 \mid t_R) \,]^2. \tag{3.39}$$

This partitioning procedure will continue to split until each subset of the current partition contains either observations of the same class, or observations with identical measurement vector $\mathbf{x}$. Discriminant rules obtained in this manner are artificial and highly data dependent. Furthermore, it is conceivable that this splitting procedure may continue until each terminal set contains only one observation. In the following, the construction of a parsimonious partition suggested by Breiman et al. (1984) is summarized.

3.4.3 CART Discriminant: Pruning a Classification Tree

The stop-splitting criterion initially consists of setting a threshold and deciding not to split further if the decrease in the estimated impurity for the classification approach, or the decrease in the estimated risk for the class probability estimation approach, is less than the threshold. This may lead to two problems. If the threshold is set too low, then there are too many subsets in the resulting partition. If the threshold is set too high, good splits may be lost.
That is, a subset $t$ may not produce a split with a large enough decrease, but its descendants $t_L$ and $t_R$ may be able to do so.

Breiman et al. (1984) suggest the following alternative. The basic procedure can be summarized in three steps which are more easily described in tree terminology. Recall the construction of binary tree-structured discriminants. Since each node on a tree corresponds to some set on the sample space $\mathcal{X}$, the terms node and set will be used interchangeably henceforth. So far, the terminal nodes of a given tree, which constitute a partition of the sample space, are the only tree terminology introduced.

Definition 3.7 The root node of a given binary tree is the node with no ancestor; that is, the set on the tree which is not a subset of any other set on the tree.

Let a binary tree be denoted by $T$. Any node on the tree $T$ is denoted by $t \in T$, and the set of terminal nodes is denoted by $\tilde{T}$.

Definition 3.8 A branch $T_t$ of $T$ with root node $t \in T$ consists of the node $t$ and all descendants of $t$ in $T$.

Definition 3.9 Pruning a branch $T_t$ from a tree $T$ involves cutting off $T_t$ just below the node $t$. The resulting tree is denoted by $T - T_t$.

Definition 3.10 $T'$ is a pruned subtree of $T$ if $T'$ is obtained by successively pruning off the branches of $T$.

The alternative to the stop-splitting procedure has three basic steps. The sample space $\mathcal{X}$ is first partitioned into an overly large binary tree; that is, the sample space is partitioned into fine sets. This tree is then pruned upward until only the root node is left. By using a more appropriate estimate of the risk, the right sized tree is selected from among the pruned subtrees. The most obvious criterion for selecting a right sized tree is to choose the pruned subtree with minimum estimated risk. This criterion may also be adjusted to compensate for estimation errors. However, these criteria may not always select a sensible tree.
In most practical applications, the pruned subtrees and their corresponding risk estimates are inspected; and by using external information about the variables and by noting the context of the problem, the right sized tree is selected.

The first step is to grow a large tree $T_0$ by continuing the splitting procedure until all the terminal nodes are either pure, or contain only identical measurement vectors. Let $T_1$ be the smallest pruned subtree of $T_0$ with $\hat{R}(T_1) = \hat{R}(T_0)$.

Note that the pruning criterion may differ for the classification and the class probability estimation approach. The estimated risk $\hat{R}(T) = \sum_{t \in \tilde{T}} \hat{r}(t)\, \hat{P}(t)$ is defined differently for the two cases: $\hat{r}(t)$ is the estimated within-node misclassification cost in the classification approach, while $\hat{r}(t)$ is the estimated within-node Gini diversity index in the class probability estimation approach.

Now for any branch $T_t$ of $T_1$, define $\hat{R}(T_t)$ by

$$\hat{R}(T_t) = \sum_{t' \in \tilde{T}_t} \hat{R}(t'),$$

where $\tilde{T}_t$ is the set of terminal nodes of $T_t$. Breiman et al. (1984: pp. 287-288) showed that for any nonterminal node $t$ of $T_1$, $\hat{R}(t) > \hat{R}(T_t)$.

Definition 3.11 Let $\lambda \geq 0$ be a real number called the complexity parameter and define the cost-complexity measure $\hat{R}_\lambda(T)$ as

$$\hat{R}_\lambda(T) = \hat{R}(T) + \lambda\, |\tilde{T}|,$$

where $|\tilde{T}|$ is the number of terminal nodes in the tree $T$.

The complexity parameter $\lambda$ may be thought of as the penalty on each terminal node of a tree. Thus the cost-complexity measure takes into account the risk associated with a tree, as well as the complexity of the tree.

Consider any nonterminal node $t$ of $T_1$. As long as $\hat{R}_\lambda(T_t) < \hat{R}_\lambda(\{t\})$, the tree with the branch $T_t$ intact is preferred over the pruned subtree without the branch $T_t$. However, at some critical value of $\lambda$, the two cost-complexities become equal. Then the smaller tree with the branch $T_t$ pruned off is preferred over $T_1$.

Definition 3.12 Consider a nontrivial tree $T$. Define a function $g(t; T)$ for $t \in T$ by

$$g(t; T) = \begin{cases} \dfrac{\hat{R}(t) - \hat{R}(T_t)}{|\tilde{T}_t| - 1}, & t \notin \tilde{T}, \\[2ex] +\infty, & t \in \tilde{T}. \end{cases}$$

Then define the weakest link $t^*$ in $T$ as the node satisfying

$$g(t^*; T) = \min_{t \in T} g(t; T).$$
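The function $g(t; T)$ of Definition 3.12 can be sketched in a few lines. This is only an illustrative sketch, not the CART software: the `Node` class, the helper names and the node risks are hypothetical, and the example tree has the shape of Figure 3 (root $t_1$ with children $t_2$ and $t_3$, and $t_2$ with children $t_4$ and $t_5$).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    name: str
    risk: float                     # node risk R(t) = r(t) P(t)
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_terminal(self) -> bool:
        return self.left is None and self.right is None


def branch_summary(node: Node) -> Tuple[float, int]:
    """Return (R(T_t), number of terminal nodes) for the branch rooted at node."""
    if node.is_terminal():
        return node.risk, 1
    left_risk, left_leaves = branch_summary(node.left)
    right_risk, right_leaves = branch_summary(node.right)
    return left_risk + right_risk, left_leaves + right_leaves


def link_strengths(node: Node) -> List[Tuple[str, float]]:
    """g(t; T) for every nonterminal node (terminal nodes have g = +infinity)."""
    if node.is_terminal():
        return []
    branch_risk, n_leaves = branch_summary(node)
    g = (node.risk - branch_risk) / (n_leaves - 1)
    return [(node.name, g)] + link_strengths(node.left) + link_strengths(node.right)


def weakest_link(root: Node) -> Tuple[str, float]:
    """Nonterminal node with the smallest g; assumes the tree is nontrivial."""
    return min(link_strengths(root), key=lambda pair: pair[1])


# Hypothetical node risks, chosen so that R(t) exceeds R(T_t) at each split.
example_tree = Node("t1", 0.40,
                    left=Node("t2", 0.22,
                              left=Node("t4", 0.11),
                              right=Node("t5", 0.10)),
                    right=Node("t3", 0.10))

if __name__ == "__main__":
    print(link_strengths(example_tree))
    print(weakest_link(example_tree))
```

With these made-up risks the branch below $t_2$ buys the least risk reduction per extra terminal node, so $t_2$ is the weakest link and its branch would be the first to be pruned as the complexity parameter increases.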
Let $\lambda_2 = g(t_1^*; T_1)$. Then the node $t_1^*$ is the weakest link in the sense that as the complexity parameter $\lambda$ increases, it is the first node $t$ for which $\hat{R}_\lambda(\{t\})$ equals $\hat{R}_\lambda(T_t)$, where $T_t$ is a branch of $T_1$ with root node $t$. Thus, when the complexity parameter is $\lambda_2$, the pruned subtree $T_2$, obtained by pruning away the branch $T_{t_1^*}$ from $T_1$, is preferred over $T_1$. Now define recursively

$$\lambda_{k+1} = g(t_k^*; T_k), \qquad T_{k+1} = T_k - T_{t_k^*},$$

for $k = 2, 3, \ldots$, as long as $T_k$ is not just a terminal node, where $t_k^*$ is the weakest link in $T_k$. Continuing pruning in this manner, a decreasing sequence of subtrees is obtained: $T_1, T_2, \ldots, T_K$, where $T_K$ is the root node on all subtrees. Furthermore, a corresponding increasing sequence of complexity parameters is also obtained (Breiman et al. 1984: p. 286).

The next step is to select one of these pruned subtrees as the right sized tree. If $\hat{R}(T_k)$ is used to estimate the risk associated with $T_k$, the largest tree will always have the minimum estimated risk. Furthermore, this estimate is biased. Thus a more accurate estimate of $R(T_k)$ is needed. Two methods of estimation are discussed by Breiman et al. (1984): use of an independent test sample and cross-validation.

As noted earlier, the sequence of subtrees, $T_1, \ldots, T_K$, may differ for the classification and the class probability estimation approach. Since the class probability estimation approach seems more appropriate for the discrimination objectives of the Sri Lankan household study, the description of the estimation methods will be restricted to the class probability estimation approach. Extension to the classification approach can be made similarly.

3.4.3.1 Test Sample Estimates of Risk

The sample is divided randomly into two sets, where one set is used to construct the decision rules, and the other is used to estimate the risk associated with each rule constructed. These two sets are generally called the training sample and the test sample respectively.

Let $\mathcal{L}$ denote the random sample $\{ (\mathbf{X}_n, Y_n) : n = 1, \ldots, N \}$.
A sample of fixed size $N^{(2)}$ is randomly selected from $\mathcal{L}$ to form the test sample $\mathcal{L}^{(2)}$. The remainder $\mathcal{L}^{(1)} = \mathcal{L} - \mathcal{L}^{(2)}$ constitutes the training sample, which is used to construct the decreasing sequence of pruned subtrees, $T_1, \ldots, T_K$.

For each pruned subtree $T_k$, let $\hat{p}_k(j \mid \mathbf{x})$ estimate the probability of belonging to class $j$ given measurement vector $\mathbf{x}$, $j = 1, 2$, by applying $T_k$ to the cases in the test sample. Then for $j = 1, 2$, define

$$R_j^{ts}(T_k) = \frac{1}{N_j^{(2)}} \sum_{n \in \eta_j^{(2)}} \sum_i \left[ \hat{p}_k(i \mid \mathbf{x}_n) - \delta_i(y_n) \right]^2, \tag{3.40}$$

where $\eta_j^{(2)} = \{ n : (\mathbf{X}_n, Y_n) \in \mathcal{L}^{(2)} \text{ and } Y_n = j \}$, $N_j^{(2)}$ is the number of cases in $\eta_j^{(2)}$, and $\delta_i(y_n)$ is the Kronecker delta (1 if $y_n = i$ and 0 otherwise). The test sample estimate of the Bayes risk associated with the tree $T_k$ is then given by

$$R^{ts}(T_k) = \sum_j R_j^{ts}(T_k)\, \pi_j. \tag{3.41}$$

If the prior probabilities are unknown, estimate $\pi_j$ by $N_j^{(2)} / N^{(2)}$, $j = 1, 2$. The standard error estimate for $R^{ts}(T_k)$, denoted by $SE(R^{ts}(T_k))$, may be obtained by standard statistical methods as described in Breiman et al. (1984).

A large sample is needed for this method. In particular, a large number of cases is required in the training sample so that the rules constructed are somewhat reliable.

3.4.3.2 Cross-Validation Estimates of Risk

When the data set is large, test sample estimation is a reasonable approach. However, when the number of cases is only a few hundred as in the Sri Lankan household study, test sample estimation can be inefficient in its use of available data. Thus, cross-validation is preferred.

In $V$-fold cross-validation, the original sample $\mathcal{L}$ is randomly divided into $V$ subsets of similar sizes, $\mathcal{L}_v$, $v = 1, \ldots, V$. Then the $v$-th sample is defined as $\mathcal{L}^{(v)} = \mathcal{L} - \mathcal{L}_v$, for $v = 1, \ldots, V$.

By using the entire sample $\mathcal{L}$, the decreasing sequence of pruned subtrees, $T_1, \ldots, T_K$, with corresponding complexity parameters, $\lambda_1, \ldots, \lambda_K$, is constructed.
Then for each $k = 1, \ldots, K$, let $\lambda_k'$ denote the geometric mean of $\lambda_k$ and $\lambda_{k+1}$, with $\lambda_K' = \infty$.

Now for each sample $\mathcal{L}^{(v)}$, $v = 1, \ldots, V$, construct $T^{(v)}(\lambda_k')$, the optimally pruned subtree with respect to the complexity parameter $\lambda_k'$, $k = 1, \ldots, K$. Then for each tree $T^{(v)}(\lambda_k')$, let $\hat{p}_k^{(v)}(j \mid \mathbf{x})$ estimate the probability of belonging to class $j$ given measurement vector $\mathbf{x}$, $j = 1, 2$, by applying $T^{(v)}(\lambda_k')$ to the cases in $\mathcal{L}_v$. Then for $j = 1, 2$, define

$$R_j^{cv}(T_k) = \frac{1}{N_j} \sum_{v=1}^{V} \sum_{n \in \eta_j^{(v)}} \sum_i \left[ \hat{p}_k^{(v)}(i \mid \mathbf{x}_n) - \delta_i(y_n) \right]^2, \tag{3.42}$$

where $\eta_j^{(v)} = \{ n : (\mathbf{X}_n, Y_n) \in \mathcal{L}_v \text{ and } Y_n = j \}$, and $\delta_i(y_n)$ is the Kronecker delta (1 if $y_n = i$ and 0 otherwise). The cross-validated estimate of the Bayes risk associated with the tree $T_k$ is then given by

$$R^{cv}(T_k) = \sum_j R_j^{cv}(T_k)\, \pi_j. \tag{3.43}$$

If the prior probabilities are unknown, estimate $\pi_j$ by $N_j / N$, $j = 1, 2$. The standard error estimate for $R^{cv}(T_k)$, denoted by $SE(R^{cv}(T_k))$, may be obtained by heuristic arguments as described in Breiman et al. (1984).

The right sized tree may be defined as the pruned subtree with minimum estimated risk or, as recommended by Breiman et al. (1984), as the tree selected by the 1 SE rule. Let $T_{k_0}$ denote the pruned subtree with minimum estimated risk. Instead of $T_{k_0}$, the 1 SE rule selects the smallest tree $T_k$ satisfying

$$R^{ts}(T_k) \leq R^{ts}(T_{k_0}) + SE(R^{ts}(T_{k_0})) \quad \text{or} \quad R^{cv}(T_k) \leq R^{cv}(T_{k_0}) + SE(R^{cv}(T_{k_0})),$$

whichever is appropriate. This rule was created to take into account the instability of the minimum estimated risk, and to select the simplest tree whose estimated risk is comparable to the minimum estimated risk. Note that the selected tree is a pruned subtree of $T_{k_0}$.

4. Path Analysis

Path analysis investigates causal patterns in a set of variables, in contrast to the focus of discriminant analysis on patterns among individuals or cases. This statistical methodology, which was introduced by a geneticist, Sewall Wright, in the 1920's, has been popularized in the sociological literature (see Duncan 1966, Land 1969, Blalock 1970 and others).
Path analysis utilizes a visual representation, called a path diagram, which consists of arrows leading from one variable to another, to illustrate the cause-and-effect relationships among the variables. The statistical part of the method does not specify the direction of cause-and-effect relations between the variables, but does provide quantitative assessments of the relationships via what are called path coefficients. Thus, this is not a method for discovering causal relationships among the variables, but rather a method for assessing whether or not a specified set of relationships among the variables is compatible with the observations. Hence, directions of causality between variables are specified by using non-statistical information or substantive theory. In practice, the natural temporal ordering of the variables usually indicates the direction of causality between the variables.

The method of path analysis was initially developed for quantitative data, where a path diagram is based on a sequence of linear regression models. However, most sociological data are qualitative instead of quantitative. Thus, the assumptions under which path analysis was developed are generally not satisfied. Goodman (1972, 1973a,b) proposed a method for studying causal relationships among discrete variables, where a path diagram is based on one or more loglinear or logit models. However, causal models thus constructed have limitations, and are not directly analogous to causal models with continuous variables (Fienberg 1980, Rosenthal 1980).

Various problems in causal modelling with quantitative or qualitative data have been explored recently (Wermuth 1980 and 1987, Wermuth and Lauritzen 1983, Kiiveri, Speed and Carlin 1984, and others). In this thesis, only the basic approach which led to the more recent developments for qualitative data is examined.

4.1 Structural Modelling with Quantitative Data

4.1.1 Path Models

A path model can be represented by a path diagram.
Suppose we are interested in the relationship between infant mortality ($X_0$), a dichotomous variable, and two explanatory variables, say age ($X_1$) and education ($X_2$) of the mother. We suspect that both age and education influence infant mortality directly. Further, we rule out the possibility that education affects age, but will postulate that age affects the level of education attained. Then this model can be represented pictorially as in Figure 4.

[Figure 4: An example of a path diagram. Nodes: Age ($X_1$), Level of education ($X_2$), Infant death ($X_0$); arrows $X_1 \to X_2$, $X_1 \to X_0$, $X_2 \to X_0$.]

The directed arrow, leading from one variable to another, indicates that the first variable has direct influence on the second. A path is formed by moving along the arrows. In our example, $X_1 \to X_2$, $X_1 \to X_0$, $X_2 \to X_0$, and $X_1 \to X_2 \to X_0$ are the possible paths. If a path diagram contains a path that traces back onto itself, then the diagram is said to have a feedback loop. Any path model represented by a diagram with no feedback loop is called a recursive system. All path models considered hereafter are recursive.

The method of path analysis assumes that all relationships are linear. Thus for the above example,

\[ X_2 = p_{21} X_1 , \qquad X_0 = p_{01} X_1 + p_{02} X_2 . \tag{4.1} \]

But in practice this is not exact; there are unmeasured sources of variation. Thus, the above system of equations is more appropriately expressed as

\[ X_2 = p_{21} X_1 + \varepsilon_2 , \qquad X_0 = p_{01} X_1 + p_{02} X_2 + \varepsilon_0 , \tag{4.2} \]

where the error terms, $\varepsilon_2$ and $\varepsilon_0$, have mean 0 and are uncorrelated with the other variables in the corresponding equations.

Without loss of generality, assume hereafter that all variables are standardized to mean 0 and unit variance. Conventionally, coefficients in the equations with standardized variables are referred to as path coefficients, and are denoted by $p_{ij}$, where the subscripts represent the direct effect of standardized variable $X'_j$ on standardized variable $X'_i$.
Thus, our path model can be re-expressed as

\[ X'_2 = p_{21} X'_1 + p_{22}\, e_2 , \qquad X'_0 = p_{01} X'_1 + p_{02} X'_2 + p_{00}\, e_0 , \tag{4.3} \]

where coefficients such as $p_{22}$ and $p_{00}$ are generally referred to as residual path coefficients. The path diagram is then modified as follows.

[Figure 5: An example of a path diagram with path coefficients. Arrows Age ($X'_1$) to Level of education ($X'_2$), Age ($X'_1$) to Infant death ($X'_0$), and Level of education ($X'_2$) to Infant death ($X'_0$), labelled with the path coefficients $p_{21}$, $p_{01}$, $p_{02}$, together with the residual paths.]

Since a path model can be represented by a sequence of linear submodels, the corresponding path diagram can be modified to better reflect this key concept by the use of colors. For instance, the earlier example can be represented by a path diagram with colored arcs as in Figure 6. The modified path diagram is visually more attractive, in the sense that vital information can be extracted more easily. Suppose we want to know which variables have direct effect on a specific variable in a more complicated path model. Instead of staring at a maze of arcs, we can focus on a particular color and obtain the desired information. This feature is especially useful in specifying the system of linear equations that represents a path model.

[Figure 6: An example of a colored path diagram for Age ($X'_1$), Level of education ($X'_2$) and Infant death ($X'_0$), with error terms $e_2$ and $e_0$.]

The basic assumptions underlying the application of path analysis for quantitative data are summarized as follows:

i. Causal (or temporal) ordering of the variables in the model is assumed as specified. Validity of the model cannot be evaluated from the data; external criteria or substantive theory must provide justification for the model proposed.
ii. Relationships among the variables are linear and additive.
iii. Error terms are not correlated with variables preceding them in the submodel, nor with each other.
iv. The variables are measured on an interval scale (at least), with the exception of dichotomous variables, which can be included as interval-scaled by assigning numerical scores to the two categories.
4.1.2 Estimation and Interpretation of Path Coefficients

Path coefficients may be estimated in two ways. The first method, decomposing correlation coefficients, was employed by Wright (1934, 1960) in the development of path analysis. The second method consists of applying ordinary least squares regression to each submodel in the system. The latter method of estimation automatically provides estimates of the precision of the coefficients, and a framework in which hypotheses concerning the coefficients may be tested. Although the regression method is generally preferred, the method of decomposing correlation coefficients offers a more fundamental understanding of the relationships among the variables considered. In the following, these two estimation methods are illustrated in the context of the earlier example using a random sample of size $N$.

Since the variables are standardized, the sample correlation coefficient between $X'_i$ and $X'_j$ can be expressed as

\[ r_{ij} = \frac{1}{N} \sum_{n=1}^{N} x'_{in}\, x'_{jn} . \]

Let the sample correlation coefficient be zero if the two variables are assumed to be uncorrelated. Then in path model (4.3), the sample correlations between each error term and the explanatory variables in the same equation are zero.

Let $\hat{p}_{ij}$ denote the estimate of path coefficient $p_{ij}$. Then path model (4.3) implies that

\[ r_{21} = \frac{1}{N} \sum_{n} x'_{2n} x'_{1n} = \frac{1}{N} \sum_{n} x'_{1n} \left( \hat{p}_{21} x'_{1n} + \hat{p}_{22} e_{2n} \right) = \hat{p}_{21} , \tag{4.4} \]

since $\frac{1}{N}\sum_n x'^{2}_{1n} = 1$ and $\sum_n x'_{1n} e_{2n} = 0$. Similarly,

\[ r_{01} = \hat{p}_{01} + \hat{p}_{02}\, r_{21} , \quad \text{and} \quad r_{02} = \hat{p}_{02} + \hat{p}_{01}\, r_{12} . \tag{4.5} \]

In general, Wright (1934) showed that

\[ r_{ij} = \sum_{s} \hat{p}_{is}\, r_{sj} , \tag{4.6} \]

where $s$ runs over all variables with direct effect on $X'_i$. Therefore, estimates of the path coefficients can be obtained by solving for the $\hat{p}_{ij}$'s in the decomposition of correlation coefficients.
In our example,

\[ \hat{p}_{21} = r_{21} , \qquad \hat{p}_{01} = \frac{r_{01} - r_{02}\, r_{21}}{1 - r_{21}^2} , \quad \text{and} \quad \hat{p}_{02} = \frac{r_{02} - r_{01}\, r_{12}}{1 - r_{12}^2} . \tag{4.7} \]

Now the residual path coefficients can be obtained by noting

\[ r_{22} = \frac{1}{N} \sum_{n} x'^{2}_{2n} = \frac{1}{N} \sum_{n} \left( \hat{p}_{21} x'_{1n} + \hat{p}_{22} e_{2n} \right)^2 = \hat{p}_{21}^2 + \hat{p}_{22}^2 , \quad \text{and} \]

\[ r_{00} = \frac{1}{N} \sum_{n} x'^{2}_{0n} = \hat{p}_{01}^2 + \hat{p}_{02}^2 + \hat{p}_{00}^2 + 2\, \hat{p}_{01}\, \hat{p}_{02}\, \hat{p}_{21} . \]

Thus, since $r_{22} = r_{00} = 1$,

\[ \hat{p}_{22} = \left( 1 - \hat{p}_{21}^2 \right)^{1/2} , \qquad \hat{p}_{00} = \left[ 1 - \hat{p}_{01}^2 - \hat{p}_{02}^2 - 2\, \hat{p}_{01}\, \hat{p}_{02}\, \hat{p}_{21} \right]^{1/2} . \tag{4.8} \]

For a simple path model as in our example, this method of estimation seems straightforward. However, for a more complicated model, this method can be very tedious.

Since a path model is essentially a sequence of linear submodels, path coefficients can be estimated by applying the method of ordinary least squares regression to each submodel. Thus for path model (4.3), the ordinary least squares estimate of $p_{21}$ is

\[ \hat{p}_{21} = \frac{\sum_n x'_{1n} x'_{2n}}{\sum_n x'^{2}_{1n}} = r_{21} , \]

since $X'_1$ and $X'_2$ are standardized; and the normal equations for the second linear relationship are as expressed in (4.5). It can be shown easily that

\[ \hat{p}_{ii}^2 = 1 - R^2 , \]

where $R^2$ is the coefficient of multiple determination between the dependent variable in question and those variables with direct influence on it. Thus for model (4.3),

\[ \hat{p}_{22}^2 = 1 - R^2_{2 \cdot 1} = 1 - \frac{1}{N} \sum_n \left( \hat{p}_{21} x'_{1n} \right)^2 = 1 - \hat{p}_{21}^2 , \]

\[ \hat{p}_{00}^2 = 1 - R^2_{0 \cdot 12} = 1 - \frac{1}{N} \sum_n \left( \hat{p}_{01} x'_{1n} + \hat{p}_{02} x'_{2n} \right)^2 = 1 - \hat{p}_{01}^2 - \hat{p}_{02}^2 - 2\, \hat{p}_{01}\, \hat{p}_{02}\, \hat{p}_{21} , \]

where $R^2_{2 \cdot 1}$ is the coefficient of multiple determination between dependent variable $X'_2$ and independent variable $X'_1$, and $R^2_{0 \cdot 12}$ is the coefficient of multiple determination between dependent variable $X'_0$ and independent variables $X'_1$ and $X'_2$. Therefore, estimates of the path coefficients agree for both methods. Proof of the general result can be found in Land (1973).
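The agreement between the two estimation methods can be checked numerically. The following is a minimal sketch, assuming NumPy and a synthetic data set (the variables `age`, `edu` and `death` are simulated here for illustration and are not the Sri Lankan data); the function names are hypothetical.

```python
import numpy as np

def standardize(v):
    """Scale to mean 0 and unit variance (divisor N, as in the text)."""
    return (v - v.mean()) / v.std()

def path_by_correlations(x1, x2, x0):
    """Path coefficients via (4.7) and residual coefficients via (4.8)."""
    r21 = np.mean(x2 * x1)
    r01 = np.mean(x0 * x1)
    r02 = np.mean(x0 * x2)
    p21 = r21
    p01 = (r01 - r02 * r21) / (1 - r21 ** 2)
    p02 = (r02 - r01 * r21) / (1 - r21 ** 2)
    p22 = np.sqrt(1 - p21 ** 2)
    p00 = np.sqrt(1 - p01 ** 2 - p02 ** 2 - 2 * p01 * p02 * p21)
    return np.array([p21, p01, p02, p22, p00])

def path_by_ols(x1, x2, x0):
    """Same quantities from least squares fits of the two submodels."""
    p21 = np.linalg.lstsq(x1[:, None], x2, rcond=None)[0][0]
    X = np.column_stack([x1, x2])
    p01, p02 = np.linalg.lstsq(X, x0, rcond=None)[0]
    # With standardized variables, sqrt(1 - R^2) is the root mean squared residual.
    p22 = np.sqrt(np.mean((x2 - p21 * x1) ** 2))
    p00 = np.sqrt(np.mean((x0 - X @ np.array([p01, p02])) ** 2))
    return np.array([p21, p01, p02, p22, p00])

# Synthetic illustration only: age -> education, both -> outcome.
rng = np.random.default_rng(0)
age = rng.normal(size=500)
edu = -0.2 * age + rng.normal(size=500)
death = 0.1 * age - 0.15 * edu + rng.normal(size=500)
x1, x2, x0 = (standardize(v) for v in (age, edu, death))

corr_est = path_by_correlations(x1, x2, x0)
ols_est = path_by_ols(x1, x2, x0)
print(np.round(corr_est, 3))
print(np.round(ols_est, 3))
```

The two printed vectors list $(\hat{p}_{21}, \hat{p}_{01}, \hat{p}_{02}, \hat{p}_{22}, \hat{p}_{00})$ and should coincide up to rounding error, which is the content of the result attributed to Land (1973).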
By treating the data from the Sri Lankan household study as a simple random sample, the path coefficients for our example path model (4.3) are estimated (see Figure 7).

[Figure 7: A path model with estimated path coefficients. Age ($X'_1$) to Level of education ($X'_2$): -0.16; Age ($X'_1$) to Infant death ($X'_0$): 0.13; Level of education ($X'_2$) to Infant death ($X'_0$): -0.15; residual path coefficient for Infant death: 0.98.]

All path coefficients are significantly nonzero at the 5% level. But, as shown by the residual path coefficients, or equivalently the coefficients of multiple determination, linear models do not fit the data well. For further analysis, one may try transforming the variables.

Wright developed the method of path analysis as a means of studying the direct and indirect effects of variables. Direct effect refers to the effect of an independent variable on a dependent variable directly, without any mediating variables. Indirect effect pertains to the effect of an independent variable on a dependent variable through a third variable which affects the dependent variable either directly or indirectly. In our example, $X'_1$ has an indirect effect on $X'_0$ through $X'_2$, which has a direct effect on $X'_0$. In another model, $X'_1$ may not have a direct effect on $X'_0$, but may have an indirect effect through another variable, say $X'_2$, that has a direct effect on $X'_0$.

The observed correlation between two variables can be expressed as a sum of three components. The direct and indirect effects of one variable on the other account for two of the components. The third component of the correlation coefficient is attributable to the antecedent variables common to the two variables under consideration. This component is referred to as the spurious component. The decomposition of the correlation coefficient shown in (4.6) may be re-expressed as follows:

\[ r_{ij} = \underbrace{p_{ij}}_{\text{direct effect}} + \underbrace{\sum_{s} p_{is}\, r_{sj}}_{\text{indirect effects}} + \underbrace{\sum_{o} p_{io}\, r_{oj}}_{\text{spurious component}} , \]

where both $X'_s$ and $X'_o$ have direct influence on $X'_i$, with $s$ running over all variables $X'_s$ which are influenced by $X'_j$, and $o$ running over all variables $X'_o$ which influence $X'_j$: that is, $s$ runs over all variables that have a direct path to $X'_i$ and can be reached by following the arrows from $X'_j$, and $o$ runs over all variables that have a direct path to $X'_i$ and can reach $X'_j$ by following the arrows. The sum of direct and indirect effects is called the total effect. For our path model (4.3),

Correlation    Direct effect    Indirect effect    Spurious component
r_21           p_21             none               none
r_01           p_01             p_02 r_21          none
r_02           p_02             none               p_01 r_12

Using data from the Sri Lankan household study, the estimated direct and indirect effects are shown in the following table.

Table III Estimated direct and indirect effects for path model (4.3)

Effect                       Direct    Indirect
Age on education             -0.16     none
Age on infant death           0.13     0.02
Education on infant death    -0.15     none

Thus, the effect of age on infant death is mainly direct. Therefore, decomposition of a correlation coefficient provides a way of separating the direct effect on the dependent variable from the indirect effect which manifests itself through the correlations with other explanatory variables.

4.2 Structural Modelling with Qualitative Data

4.2.1 Loglinear and Logit Models

Goodman (1972, 1973a,b) proposed using loglinear and logit models to study the causal patterns in a set of discrete variables. Commonly used terminology and notation for the analysis of categorical variables are reviewed in the context of three-dimensional contingency tables. A more complete presentation of this methodology can be found in Fienberg (1980), Haberman (1978), Bishop, Fienberg and Holland (1975), and others.

Consider three variables, $A$, $B$ and $C$, with $I$, $J$ and $K$ categories respectively. Suppose a random sample of size $N$ has been collected. Let $m_{ijk}$ denote the expected number of observations with $(A,B,C) = (i,j,k)$ for $i = 1, \ldots, I$, $j = 1, \ldots, J$ and $k = 1, \ldots, K$. Then the general loglinear model is given by

\[ \log m_{ijk} = u + u_{1(i)} + u_{2(j)} + u_{3(k)} + u_{12(ij)} + u_{13(ik)} + u_{23(jk)} + u_{123(ijk)} , \tag{4.9} \]

where

\[ \sum_{i=1}^{I} u_{1(i)} = \sum_{j=1}^{J} u_{2(j)} = \sum_{k=1}^{K} u_{3(k)} = 0 , \]

\[ \sum_{i=1}^{I} u_{12(ij)} = \sum_{j=1}^{J} u_{12(ij)} = \sum_{i=1}^{I} u_{13(ik)} = \sum_{k=1}^{K} u_{13(ik)} = \sum_{j=1}^{J} u_{23(jk)} = \sum_{k=1}^{K} u_{23(jk)} = 0 , \]

\[ \sum_{i=1}^{I} u_{123(ijk)} = \sum_{j=1}^{J} u_{123(ijk)} = \sum_{k=1}^{K} u_{123(ijk)} = 0 . \]

This general loglinear model does not impose any restriction on the expected cell counts $\{m_{ijk}\}$, and is denoted by [ABC]. By setting some of the u-terms to zero, special cases of the model can be obtained:

Table IV Various loglinear models for three-dimensional tables

Model            u-terms set to zero
[AB][AC][BC]     u_123(ijk)
[AB][AC]         u_123(ijk), u_23(jk)
[AB][BC]         u_123(ijk), u_13(ik)
[AC][BC]         u_123(ijk), u_12(ij)
[AB][C]          u_123(ijk), u_13(ik), u_23(jk)
[AC][B]          u_123(ijk), u_12(ij), u_23(jk)
[BC][A]          u_123(ijk), u_12(ij), u_13(ik)
[A][B][C]        u_123(ijk), u_12(ij), u_13(ik), u_23(jk)

Model [AB][AC][BC] assumes that each two-variable interaction is unaffected by the value of the third variable. Models [AB][AC], [AB][BC], and [AC][BC] are obtained by assuming conditional independence of two variables given the third. For example, model [AB][AC] assumes that variables B and C are independent given variable A. Models [AB][C], [AC][B], and [BC][A] are obtained by assuming one variable is jointly independent of the other two. For example, model [AB][C] assumes that variable C is jointly independent of variables A and B. Lastly, model [A][B][C] assumes that the three variables are mutually independent.

The method proposed by Goodman is restricted to a hierarchical set of models in which higher-order terms may appear only if the related lower-order terms are present. An example of a nested hierarchy of models is given below:

\[ [A][B][C] \subset [AB][C] \subset [AB][AC] \subset [AB][AC][BC] \subset [ABC] , \]

where $\subset$ means "is a special case of".
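Several of the models in Table IV have well-known closed-form fitted values built from marginal totals, which gives a quick way to see what each independence assumption implies. The sketch below assumes NumPy and a small hypothetical $2 \times 2 \times 2$ table `z` (not the thesis data); the closed forms used are the standard ones for direct (non-iterative) models, and are not quoted from this section.

```python
import numpy as np

# Hypothetical counts z[i, j, k] for (A, B, C); for illustration only.
z = np.array([[[20., 10.], [15., 5.]],
              [[12., 18.], [8., 12.]]])
N = z.sum()

zA = z.sum(axis=(1, 2))   # z_{i++}
zB = z.sum(axis=(0, 2))   # z_{+j+}
zC = z.sum(axis=(0, 1))   # z_{++k}
zAB = z.sum(axis=2)       # z_{ij+}
zAC = z.sum(axis=1)       # z_{i+k}

# [A][B][C]: mutual independence.
m_A_B_C = np.einsum('i,j,k->ijk', zA, zB, zC) / N ** 2
# [AB][C]: C jointly independent of (A, B).
m_AB_C = np.einsum('ij,k->ijk', zAB, zC) / N
# [AB][AC]: B and C conditionally independent given A.
m_AB_AC = np.einsum('ij,ik->ijk', zAB, zAC) / zA[:, None, None]

for name, m in [('[A][B][C]', m_A_B_C), ('[AB][C]', m_AB_C), ('[AB][AC]', m_AB_AC)]:
    print(name)
    print(np.round(m, 2))
```

Each fitted table reproduces exactly the margins named in its bracket notation; model [AB][AC][BC] has no closed form of this kind and requires an iterative fit.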
Effects of categorical predictors, say A and B, on a dichotomous response, say C, can also be assessed by a logit model:

\[ \mathrm{logit}_{ij}^{C|AB} = \log \frac{m_{ij1}}{m_{ij2}} = w + w_{1(i)} + w_{2(j)} + w_{12(ij)} \tag{4.10} \]

for $i = 1, \ldots, I$ and $j = 1, \ldots, J$, where

\[ \sum_{i=1}^{I} w_{1(i)} = \sum_{j=1}^{J} w_{2(j)} = \sum_{i=1}^{I} w_{12(ij)} = \sum_{j=1}^{J} w_{12(ij)} = 0 . \]

Note that this logit model can be obtained from the general loglinear model by making the following identifications:

\[ w = 2 u_{3(1)} , \quad w_{1(i)} = 2 u_{13(i1)} , \quad w_{2(j)} = 2 u_{23(j1)} , \quad w_{12(ij)} = 2 u_{123(ij1)} . \]

Special cases of this logit model can again be obtained by setting some of the w-terms to zero.

Logit models for categorical predictors are special cases of the logistic response models introduced in Section 3.3.2. Let $p_{ijk}$ denote the probability that $(A,B,C) = (i,j,k)$, for $i = 1, \ldots, I$, $j = 1, \ldots, J$, and $k = 1, 2$. Then (4.10) can be rewritten as

\[ \log \frac{p_{ij1}}{p_{ij2}} = w + w_{1(i)} + w_{2(j)} + w_{12(ij)} \tag{4.11} \]

with the same restrictions on the w-terms. Suppose $I = J = 2$. Let $X_A$ and $X_B$ be dummy variables defined as

\[ X_A = \begin{cases} 1 & \text{if } A = 1, \\ -1 & \text{if } A = 2, \end{cases} \qquad X_B = \begin{cases} 1 & \text{if } B = 1, \\ -1 & \text{if } B = 2, \end{cases} \]

and let $X_{AB} = X_A X_B$. Further, let $p(k \mid X)$ denote the probability of $C = k$ given $X_A$ and $X_B$, i.e. let $p(k \mid X) = p_{ijk} / (p_{ij1} + p_{ij2})$, so that $p(1 \mid X)/p(2 \mid X) = p_{ij1}/p_{ij2}$. Then (4.11) can be rewritten as

\[ \log \frac{p(1 \mid X)}{p(2 \mid X)} = w + w_{1(1)} X_A + w_{2(1)} X_B + w_{12(11)} X_{AB} . \tag{4.12} \]

Thus, logit models are special cases of logistic response models, where the predictors need not necessarily be categorical. Extension to predictors with more than two categories can be made similarly by defining the appropriate dummy variables.

4.2.2 Path Models

As in Section 4.1, suppose we are interested in the relationship between infant death (C) and two explanatory variables, say age (A) and education (B) of the mother. But now assume that each variable has only two levels. The relationship between variables A and B can then be expressed by the logit model

\[ \mathrm{logit}_{i}^{B|A} = w^{B|A} + w_{1(i)}^{B|A} , \tag{4.13} \]

where $\sum_i w_{1(i)}^{B|A} = 0$. Now build a logit model with C (infant death) as the response variable, and A and B as the explanatory variables. The three unsaturated loglinear models corresponding to such a logit model are

1. [AB][AC][BC]
2. [AB][AC]
3. [AB][BC].

The best model among those providing acceptable fit is chosen using external information, or substantive theory. The fit of a recursive system of logit models can be assessed by two approaches, which are presented in a later section. Suppose model 1 is the best model. Then the path model can be represented by the following diagram, with path coefficients given by the w-terms.

[Figure 8: A path model with dichotomous variables. Arrows Age (A) to Level of education (B), Age (A) to Infant death (C), and Level of education (B) to Infant death (C), labelled with the corresponding w-terms.]

Several drawbacks of this method proposed by Goodman (1972, 1973a,b) are illuminated by the above example. Although Goodman does assign numerical values to arrows in the diagram, these values do not have the same interpretation as in path analysis for continuous variables. There is no calculus of path coefficients, so there is no way of evaluating the indirect effect of a variable. Further, variables with multiple categories have multiple coefficients associated with a given arrow in the path diagram. Thus, interpretation of the model may be complicated. Since a sparse contingency table will pose problems in estimation of the u-terms, and thus the w-terms, the number of categories for each variable and the number of variables considered must be restricted. In view of these obstacles, we will limit ourselves to variables with two categories, and consider only a small number of variables.

4.2.3 Estimation of Path Coefficients

The path coefficients are estimated by the maximum likelihood method, which will be illustrated using a two-dimensional table. The method can easily be extended to higher dimensional tables.
Our Sri Lankan household data set is assumed to be a fixed sample, in which each member is cross-classified according to its values for the variables under consideration. Since a multinomial sampling model is assumed for the Sri Lankan household study, the estimation procedure will be developed based on such models. Estimation procedures are similar for other commonly encountered sampling models, such as product-multinomial and Poisson (see Bishop, Fienberg and Holland 1975, and Fienberg 1980).

Consider a random sample of $N$ subjects, where $(A_h, B_h)$ for subject $h$ is observed, $h = 1, \ldots, N$. Let $p_{ij}$ denote the probability that $(A,B) = (i,j)$, and let $Z_{ij}$ be the number of subjects with $A = i$ and $B = j$, for $i, j = 1, 2$. Then, under the multinomial sampling model, the expected number of subjects with $A = i$ and $B = j$ is given by

\[ m_{ij} = E(Z_{ij}) = N p_{ij} . \tag{4.14} \]

The general loglinear model for a two-dimensional table is

\[ \log m_{ij} = u + u_{1(i)} + u_{2(j)} + u_{12(ij)} \tag{4.15} \]

for $i, j = 1, 2$, where

\[ \sum_{i=1}^{2} u_{1(i)} = \sum_{j=1}^{2} u_{2(j)} = \sum_{i=1}^{2} u_{12(ij)} = \sum_{j=1}^{2} u_{12(ij)} = 0 . \]

Alternatively, the matrix representation of this model is

\[ \log \begin{pmatrix} m_{11} \\ m_{12} \\ m_{21} \\ m_{22} \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & 1 & -1 & -1 \\ 1 & -1 & 1 & -1 \\ 1 & -1 & -1 & 1 \end{pmatrix} \begin{pmatrix} u \\ u_{1(1)} \\ u_{2(1)} \\ u_{12(11)} \end{pmatrix} , \quad \text{or} \quad \log m = W \beta . \]

The likelihood function is given by

\[ L(\beta) \propto \prod_{i,j} p_{ij}^{\,z_{ij}} \propto \prod_{i,j} m_{ij}^{\,z_{ij}} , \]

where the $z_{ij}$ are the observed cell counts. Thus the maximum likelihood equations are given by

\[ \frac{\partial}{\partial \beta} \log L(\beta) = W' (z - \hat{m}) = 0 , \tag{4.16} \]

where $z = (z_{11}, z_{12}, z_{21}, z_{22})'$ and $\hat{m}$ is the maximum likelihood estimate of $m$. Further, the observed Fisher information matrix is given by

\[ \mathcal{J}_o = -\frac{\partial^2}{\partial \beta\, \partial \beta'} \log L(\beta) = W' M W , \tag{4.17} \]

where

\[ M = \begin{pmatrix} m_{11} & 0 & 0 & 0 \\ 0 & m_{12} & 0 & 0 \\ 0 & 0 & m_{21} & 0 \\ 0 & 0 & 0 & m_{22} \end{pmatrix} . \]

Hence, the maximum likelihood estimates of $\beta$ can be obtained by the Newton-Raphson iterative procedure:

\[ \beta^{(l+1)} = \beta^{(l)} + \left[ W' M^{(l)} W \right]^{-1} W' \left( z - m^{(l)} \right) , \quad l = 0, 1, \ldots , \]

where $\beta^{(l)}$ is the estimate of $\beta$ at the $l$-th stage, $m^{(l)} = \exp(W \beta^{(l)})$, and $M^{(l)}$ is the diagonal matrix corresponding to $m^{(l)}$.
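The iteration above can be sketched in a few lines. This is a minimal illustration, assuming NumPy and a hypothetical $2 \times 2$ table with all counts positive; the function name `fit_loglinear` is invented for this example, and the starting value is an unweighted least squares fit to the log counts, used here as a simplification.

```python
import numpy as np

# Design matrix W for cells ordered (11, 12, 21, 22), as in the matrix form above.
W = np.array([[1, 1, 1, 1],
              [1, 1, -1, -1],
              [1, -1, 1, -1],
              [1, -1, -1, 1]], dtype=float)

def fit_loglinear(z, W, tol=1e-10, max_iter=50):
    """Newton-Raphson for log m = W beta; returns (beta, m, observed information)."""
    z = np.asarray(z, dtype=float)
    # Starting value: unweighted least squares on log counts (assumes z > 0).
    beta = np.linalg.lstsq(W, np.log(z), rcond=None)[0]
    for _ in range(max_iter):
        m = np.exp(W @ beta)
        info = W.T @ (m[:, None] * W)           # W' M W
        step = np.linalg.solve(info, W.T @ (z - m))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    m = np.exp(W @ beta)
    info = W.T @ (m[:, None] * W)
    return beta, m, info

z = np.array([30., 20., 10., 40.])              # hypothetical counts, not thesis data

# Saturated model: fitted counts reproduce the observed counts.
beta_sat, m_sat, info_sat = fit_loglinear(z, W)
# Independence model: drop the u_12 column.
beta_ind, m_ind, info_ind = fit_loglinear(z, W[:, :3])

se_sat = np.sqrt(np.diag(np.linalg.inv(info_sat)))
print(np.round(beta_sat, 4), np.round(se_sat, 4))
print(np.round(m_ind, 3))
```

Dropping columns of `W` is one way to realise the remark that other loglinear models only require modifying the design matrix; the inverse of the returned information matrix provides the approximate variances of the estimated u-terms.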
Since the choice of initial estimate $\beta^{(0)}$ will affect the rate of convergence, the initial estimate should be chosen carefully. In general, the weighted least squares estimate of $\beta$ with weights $1/z_{ij}$ will provide a satisfactory initial estimate.

The u-terms can also be estimated by using various other methods (see Bishop, Fienberg and Holland 1975). However, only the Newton-Raphson iterative procedure provides a readily available estimate of the precision of $\hat{\beta}$. The maximum likelihood estimator $\hat{\beta}$ is asymptotically normally distributed with mean $\beta$ and variance $\mathcal{J}^{-1}$, where $\mathcal{J}$ is the Fisher information matrix. In practical applications, the observed information matrix $\mathcal{J}_o$, which is available upon convergence in the Newton-Raphson procedure, is often used in place of $\mathcal{J}$. Therefore, statistical inference for the u-terms (in vector $\beta$) is possible.

Although the above iterative procedure is described for the saturated loglinear model in the case of two-dimensional tables, extension to other loglinear models simply involves modifying the $m$-vector, the $W$-matrix, and others accordingly. Thus, estimates of the u-terms can be obtained similarly. Since path coefficients (w-terms) are twice the appropriate u-terms, they can be estimated from the estimates of the u-terms.

4.2.4 Goodness-of-Fit for Path Models

A path model is specified by a recursive system of models. The fit of a system of logit models can be assessed by directly checking the fit of each component model, or by computing a set of estimated expected cell counts for the combined system.

Once the expected cell counts are estimated, the fit of the model can be assessed by either the Pearson chi-square statistic $X^2$ or the likelihood-ratio statistic $G^2$:

\[ X^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} , \qquad G^2 = 2 \sum \text{observed} \cdot \log \frac{\text{observed}}{\text{expected}} , \]

where the summation in both cases is over all cells in the table.
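As a small worked illustration of the two statistics, the sketch below evaluates $X^2$ and $G^2$ for a hypothetical $2 \times 2$ table against the fitted counts of the independence model. It assumes NumPy; the counts and the function names are illustrative only.

```python
import numpy as np

def pearson_x2(observed, expected):
    """Pearson chi-square statistic summed over all cells."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    return float(np.sum((observed - expected) ** 2 / expected))

def likelihood_ratio_g2(observed, expected):
    """Likelihood-ratio statistic; cells with zero observed count contribute 0."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    mask = observed > 0
    return float(2 * np.sum(observed[mask] * np.log(observed[mask] / expected[mask])))

# Hypothetical observed counts and independence-model fitted counts
# (row totals 50, 50; column totals 40, 60; N = 100).
observed = np.array([[30., 20.], [10., 40.]])
expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()

x2 = pearson_x2(observed, expected)
g2 = likelihood_ratio_g2(observed, expected)
print(round(x2, 3), round(g2, 3))
```

For this table the independence model leaves one degree of freedom, so both values would be referred to a chi-square distribution with 1 d.f.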
If the fitted model is correct and the total sample size is large enough, both $X^2$ and $G^2$ are approximately $\chi^2$ distributed with degrees of freedom given by

\[ \text{d.f.} = \text{\# of cells} - \text{\# of parameters} . \tag{4.20} \]

In the context of causal modelling, Goodman uses the likelihood-ratio test statistic $G^2$ to evaluate the fit of a model.

Improvement in the fit of a model by adding or deleting some interaction terms can also be assessed by chi-square statistics. Consider two models, model I and model II, where model II is a special case of model I. That is, model II is obtained from model I by setting some of the u-terms to zero. Then the likelihood-ratio test statistic,

\[ \Delta G^2 = G^2(\mathrm{II}) - G^2(\mathrm{I}) = 2 \sum \text{observed} \cdot \log \frac{\text{expected}_{\mathrm{I}}}{\text{expected}_{\mathrm{II}}} \tag{4.21} \]

with d.f. = d.f.(II) - d.f.(I), can be used to test whether the difference between the expected cell counts for the two models is simply due to random variation, given that the true expected cell counts satisfy model I. For instance, in our example, the effect of adding the relationship between A (age) and C (infant death) to the model [AB][BC] can be evaluated by using the test statistic

\[ \Delta G^2 = G^2([AB][BC]) - G^2([AB][AC][BC]) \]

with 1 degree of freedom.

Goodness-of-fit of a path model can also be assessed by using the expected cell counts of the combined system of logit or loglinear models. The computation of these combined estimates is best illustrated by an example. Suppose we have three variables with the following causal ordering:

\[ A \text{ precedes } B \text{ precedes } C , \tag{4.22} \]

as shown in Figure 8. Then the estimated expected cell counts for a system consisting of the pair of unrestricted logit models implied by (4.22) are given by

\[ \hat{m}_{ijk} = \frac{\hat{m}_{ij}^{B|A}\; \hat{m}_{ijk}^{C|AB}}{\hat{m}_{ij+}^{C|AB}} = \frac{\hat{m}_{ij}^{B|A}\; \hat{m}_{ijk}^{C|AB}}{z_{ij+}} , \tag{4.23} \]

where $z_{ij+}$ is the number of observations with $(A,B) = (i,j)$, and $\{\hat{m}_{ij}^{B|A}\}$ and $\{\hat{m}_{ijk}^{C|AB}\}$ are the estimated expected cell counts for the logit models with variables B and C as the response variables respectively.
Since the latter model involves conditioning on the marginal totals $\{z_{ij+}\}$, which can be seen from the maximum likelihood equations, the second equality in (4.23) is obtained. Thus, the likelihood-ratio test statistic is given by

\[ G^2 = 2 \sum_{i,j,k} z_{ijk} \log \frac{z_{ijk}}{\hat{m}_{ijk}} = 2 \sum_{i,j,k} z_{ijk} \log \frac{z_{ij+}\, z_{ijk}}{\hat{m}_{ij}^{B|A}\, \hat{m}_{ijk}^{C|AB}} = G^2_{B|A} + G^2_{C|AB} , \tag{4.24} \]

where $G^2_{B|A}$ is the likelihood-ratio test statistic for the logit model specified on the $2 \times 2$ table obtained by collapsing over variable C, and $G^2_{C|AB}$ is the likelihood-ratio test statistic for the logit model specified on the complete $2 \times 2 \times 2$ table. Thus, the overall likelihood-ratio test statistic has degrees of freedom given by the sum of the degrees of freedom corresponding to the two component $G^2$'s. A more detailed discussion of this approach can be found in Goodman (1973b) and Fienberg (1980).

5. Results of Statistical Analyses on the Sri Lankan Household Data

The Sri Lankan infant mortality data set was first analyzed by discriminant methods to identify risk factors and to characterize households with high risk of infant mortality. Methods for path analysis were then applied to the identified risk factors, in order to assess the relationships among them, and their relationship to infant death.

5.1 Identification of Infant Mortality Risk Groups

The main objective of this analysis is to identify risk factors that discriminate between households with relatively high and low infant mortality. By using the terminology and notation introduced in Section 3, the problem can be formalized as follows. For each household sampled in the Sri Lankan household study, let $Y$ be a dichotomous variable indicating whether or not an infant death has occurred, and let $X$ be a vector of explanatory variables. Then, $Y$ specifies the class to which the household belongs.
The explanatory variables are listed as X-variables in Table I, which includes information on nutrition, sanitation, education of the mother, economic status, childbirth environment, ethnicity of the family, etc. Then, the sample space $\mathcal{X}$ consists of all possible combinations of the x-values. Using decision theoretic criteria, estimates of infant death probability at each x-value partition the sample space $\mathcal{X}$ into two regions corresponding to relatively high and low risk groups.

Two discriminant methods are advocated in Section 3: logistic discrimination and class probability estimation by CART. For each of these methods, the analysis was performed separately for those women of age less than 44 (N = 250) and those of age greater than or equal to 44 (N = 141).

5.1.1 Logistic Discrimination

A forward stepwise procedure, implemented in the logistic regression program PLR of BMDP, was used to select explanatory or predictive variables that may adequately model the logit of infant death probability, as described in Section 3. The results of this analysis are shown in Table V.

Consider the results for younger women (Table Vb). About 25% of these women with age less than 44 have experienced at least one infant death. Maximum likelihood estimates of the regression coefficients in the most parsimonious model indicate that the probability of infant death seems to be greater for those who gave birth at home, and for those whose families have lower economic status. By setting some threshold value $p_0$, the Sri Lankan village households can be partitioned into two risk groups, with the higher risk group composed of households with estimated infant death probability greater than the threshold value. Using the maximum likelihood estimation results, the sample space can be partitioned as follows: the region of high risk corresponds to families with

1. last child born in hospital, and economic status < -4.762 (logit $p_0$ + 1.134), or
2. last child born at home with a midwife, and economic status < -4.762 (logit $p_0$ + 0.305), or
3. last child born at home without a midwife, and economic status < -4.762 (logit $p_0$ + 0.352).

Details on formulation of the above partition are shown in Appendix I. Although this partition of the sample space can be interpreted easily, this may not always be the case when more variables are in the final model.

Next, consider the results for older women (Table Vb). Maximum likelihood estimates of the regression coefficients in the most parsimonious model indicate that the probability of infant death for the non-Sinhalese families may be twice as high as that for the Sinhalese families. Thus, for the older women, the relatively high and low risk groups may be defined by ethnic group membership.

Table V Results of forward stepwise logistic regression

a. Model selection

Study group         Model                   -2 log λ    d.f.    p-value
Women of age <44    constant
                    constant, X_5            8.509       1       0.004
                    constant, X_5, X_2       7.003       2       0.030
Women of age 44+    constant
                    constant, X_10          11.665       1       0.001

where λ = (maximum likelihood under previous model) / (maximum likelihood under current model),
X_2 is the environment of childbirth,
X_5 is the economic status, and
X_10 is the ethnicity.

Note that X_5 is treated as a continuous variable, while X_2 and X_10 are treated as categorical variables represented by dummy variables as defined in part b.

b. Maximum likelihood estimates of the coefficients in the final model

Study group         Variable     Coefficient    s.e.
Women of age <44    constant     -0.597         0.209
                    X_5          -0.210         0.098
                    X_2(2)        0.292         0.224
                    X_2(3)        0.245         0.238
Women of age 44+    constant     -0.683         0.187
                    X_10(2)       0.622         0.187

where X_5 is the economic status,
X_2(2) = 1 if the last child was born at home with a midwife, -1 if the last child was born in hospital, and 0 otherwise,
X_2(3) = 1 if the last child was born at home without a midwife, -1 if the last child was born in hospital, and 0 otherwise, and
X_10(2) = 1 if the household is non-Sinhalese, and -1 if the household is Sinhalese.

5.1.2 Discrimination using CART

The probability of infant death at each point in the sample space was estimated using the CART software described in Section 3, using the 10-fold cross-validation procedure. As in the previous section, younger and older women were analyzed separately. For the younger women, the pruned subtree with the minimum cross-validated estimate of risk is shown in Figure 9. If the same criterion is used for the older women, then a trivial tree with one terminal node would be selected. Thus, the next larger tree, which can be obtained by growing a tree with an appropriate complexity parameter using the entire sample, is considered (see Figure 10).

For younger women, the binary tree (Figure 9) has three terminal groups corresponding to low risk, and one terminal group corresponding to high risk. Women who gave birth in the hospitals, or whose families have high economic status, appear to have a relatively low risk of experiencing at least one infant death. For those women who gave birth at home, and whose families have low economic status, families whose major source of income is from piece-rate work or hourly labor seem to be at a much lower risk than those families whose income is from other sources. For those households in poverty, piece-rate work or hourly labor may provide a steadier source of income.
Thus, women who give birth at home, live in poverty, and whose families have no steady income, are at the highest risk of experiencing at least one infant death.

For older women, the binary tree (Figure 10) suggests that Sinhalese families may have been at a lower risk than the non-Sinhalese families. The estimated probability of infant death indicates the risk of infant death may be twice as high in non-Sinhalese families as in Sinhalese families.

Figure 9 CART results for the younger women

All younger women: 63 class 1, 187 class 2 (25%)
  Where was the last child born?
  - in hospital: 24 class 1, 115 class 2 (17%)
  - at home: 39 class 1, 72 class 2 (35%)
      Economic status
      - 3-5: 5 class 1, 24 class 2 (17%)
      - 0-2: 34 class 1, 48 class 2 (41%)
          Primary source of income
          - piece rate: 10 class 1, 31 class 2 (24%)
          - others: 24 class 1, 17 class 2 (59%)

class 1: households with infant death experiences; class 2: households with no infant death experience. Proportions of class 1 households are reported in the brackets.

Figure 10 CART results for the older women

All older women: 48 class 1, 93 class 2 (34%)
  Ethnicity
  - Sinhalese: 16 class 1, 59 class 2 (21%)
  - others: 32 class 1, 34 class 2 (48%)

class 1: households with infant death experiences; class 2: households with no infant death experience. Proportions of class 1 households are reported in the brackets.

5.1.3 Discussion

Explanatory variables considered important by the logistic discrimination method were also considered important by the CART method. However, the partition of the sample space into regions of relatively high and low risk may be different for the two methods. Logistic discrimination forces a linear partition, whereas the CART partition is piecewise linear.

For younger women, economic status of the family is considered an important risk factor by both methods. But in the CART result, the partition uses this variable only for those women giving birth at home. Suppose the threshold value, $p_0$, in Section 5.1.1 equals 0.17 as in the CART result.
Then the logistic discrimination method partitions the sample space into high and low risk regions as follows: the region of high risk corresponds to families with

1. last child born in hospital, and economic status < 3, or
2. last child born at home with a midwife, or
3. last child born at home without a midwife.

Thus, women who gave birth at home are in the high risk group, and so are women who gave birth in hospital but whose family is poor. But this contradicts the CART result (Figure 9), where all women giving birth in hospital are in the low risk group.

Consider the 3x2 contingency table formed by cross-tabulating the environment of childbirth and the economic status dichotomy created by grouping the categories 0-2 and 3-5, as shown in Table VI. The table shows that the partition provided by the CART method seems more coherent than the partition provided by logistic discrimination.

The logistic discrimination method assumes that the relationship between the logit of infant death probability (logit p) and economic status (X_5), for each environment of childbirth (X_2), can be modelled by parallel straight lines (Table VII). This assumption seems reasonable for the latter two childbirth conditions, but not for all three conditions. By imposing this parallelism on the results, the more appropriate partitioning of the sample space is overlooked. However, if interactions between the two explanatory variables were allowed, logistic discrimination might have obtained the appropriate partitioning. In general, logistic discrimination may require fitting many different models with various interaction terms before a partitioning comparable to that found by the CART method is discovered.

Discrepancies between results for the two age groups may be explained by several factors. Health services may have been more readily available at the time of child bearing for the younger women.
The younger generation may also be less inhibited by health technologies, and thus may utilize the services more frequently. Ethnicity may have been more relevant to everything (including infant mortality) when the older women were child bearing. Ethnicity may still be pertinent to economic status and usage of health services in the younger generation, but the effect of ethnicity on infant mortality may have lessened. Lastly, economic status at the time of study may be strongly related to economic status at the time of child bearing for the younger women, but perhaps not for the older women.

Table VI Comparison of sample space partitioning by logistic discrimination and by CART

The following table is constructed based on women of age less than 44.

                                           Economic status - ownership of household items (X_5)
Where was the last child born? (X_2)       0 - 2           3 - 5
In hospital                                0.13            [illegible]
At home with midwife                       0.41            [illegible]
At home without midwife                    [illegible]     0.13

The high risk group identified by logistic discrimination is the group of households in the highlighted region given by the union of the first column and the last two rows. The high risk group identified by CART is the group of households in the highlighted region given by the intersection of the first column and the last two rows.

Table VII Estimated logistic regression equations for younger women

Where was the last child born? (X_2)       Estimated logistic regression equation
In hospital                                logit p = -1.134 - 0.210 X_5
At home with midwife                       logit p = -0.305 - 0.210 X_5
At home without midwife                    logit p = -0.352 - 0.210 X_5

5.2 Causal Modelling

The discriminant analysis performed earlier indicates that economic status, environment of childbirth, and ethnic group membership may be associated with infant mortality. To understand how these variables work together to affect infant mortality, a path model is constructed based on the natural temporal ordering of the variables (Figure 11).
Figure 11 A path model specifying temporal relationships among selected variables

5.2.1 Structural Modelling with Quantitative Data

The following analysis is performed using the REG procedure in the SAS statistical software, treating all four variables as continuous. Results of path analysis for the two age groups are shown in Figures 12 and 13 respectively. The estimated direct and indirect effects of explanatory variables on infant mortality are summarized in Table VIII for the two age groups.

Comparing the path models shown in Figures 12 and 13 suggests that the relationships among the variables may differ for the two age groups. The effect of ethnic group membership on childbirth environment seems stronger for the younger women. Economic status and childbirth environment appear to affect infant mortality for the younger women, whereas only ethnicity appears to have a substantial effect on infant mortality for the older women.

Consider the estimated direct and indirect effects of explanatory variables on infant mortality for the younger women (Table VIII). Although ethnicity has virtually no direct effect on infant mortality, it does seem to act through the other two variables, economic status and environment of childbirth, to affect infant mortality. Thus minority group status may adversely affect economic status, and may obstruct access to a better childbirth environment, which in turn increases the risk of infant death.

Estimated direct and indirect effects of explanatory variables on infant mortality for the older women (Table VIII) indicate that neither economic status nor childbirth environment has a strong direct or indirect effect on infant mortality. Therefore, minority group status seems to be the only factor, among the three considered, that increases the risk of infant death.

For both path models (Figures 12 and 13), the path coefficients corresponding to the unobserved sources of variation are high.
Thus, the linear models considered by path analysis do not seem to fit the data well. Since the occurrence of infant death is a relatively rare event, and the variables investigated are not immediate biological causes of infant death, a linear model is not likely to fit the data well. However, this type of model still provides some useful information on the relationships among the variables.

Figure 12 Path analysis results for the younger women

[Path diagram relating economic status, environment of childbirth (X_2), and the other variables; only the residual path coefficients 0.93 and 0.90 are legible.]

where • signifies statistically nonzero path coefficient at the 10% level (excluding residual path coefficients).

Figure 13 Path analysis results for the older women

[Path diagram relating ethnicity (X_10), economic status, environment of childbirth (X_2), and infant death (y). Legible path coefficients: -0.44, -0.19, 0.16, 0.10, -0.25, -0.02; legible residual path coefficients: 0.90, 0.95, 0.96.]

where • signifies statistically nonzero path coefficient at the 10% level (excluding residual path coefficients).

Table VIII Estimated direct and indirect effects on infant death

                                                            Effect on infant mortality
Study group   Variable (source)                             Direct     Indirect
Age <44       Ethnicity                                      0.00      -0.12
              Economic status                                0.13       0.05
              Use of health services for childbirth         -0.16
Age 44+       Ethnicity                                     -0.25      -0.04
              Economic status                                0.10      -0.01
              Use of health services for childbirth          0.02

5.2.2 Structural Modelling with Qualitative Data

The preceding section applied a statistical analysis that was originally derived for continuous variables, but most of the variables in this study are ordered categorical. In this section, the relationships between the variables are analyzed using the method for categorical variables proposed by Goodman, which was described in Section 4.2.

Due to limitations of the method as discussed in Section 4.2.2, the variables considered are recoded into two categories (Table IX). Let A, B, C, and D be the recoded variables for ethnicity, economic status, environment of childbirth, and infant death respectively.
Then the following causal ordering of the variables is assumed: A precedes B, B precedes C, and C precedes D.

Programs written in the language of the S statistical software package were used for the analysis. Path diagrams depicting the causal connections implied by the best logit or loglinear models for women of the two age groups are shown in Figures 14 and 15. Details on the model selection are given in Appendices II and III, respectively, for the two groups of women.

The path diagram for the younger women (Figure 14) indicates that: (1) minority group status may adversely affect economic status, and may obstruct access to a better childbirth environment; (2) poverty may have blocked access to a better childbirth environment; (3) lastly, poverty and childbirth environment may be linked to infant mortality. Although minority group status does not seem to have a direct effect on infant mortality, it does seem to have an indirect effect through economic status and environment of childbirth.

The path diagram for the older women (Figure 15) indicates that: (1) minority group status may have negative effects on both economic status and infant mortality; (2) poverty may have blocked access to a better childbirth environment; but (3) neither economic status nor childbirth environment has any significant effect on infant mortality. Therefore, for older women, no variables in addition to ethnicity (among those considered) can significantly improve discrimination between high and low risk groups.
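As a rough numerical companion to the conclusion for the older women, the class counts reported in Figure 10 can be turned into group risks and a risk ratio. The sketch below uses only those counts; the function and variable names are illustrative assumptions and are not part of the original analysis.

```python
# Counts from Figure 10 (older women): class 1 = households with at least one
# infant death, class 2 = households with no infant death experience.
figure10_counts = {
    "Sinhalese": (16, 59),
    "non-Sinhalese": (32, 34),
}


def risk(class1, class2):
    """Proportion of class 1 households in a group."""
    return class1 / (class1 + class2)


risks = {group: risk(*counts) for group, counts in figure10_counts.items()}

# Ratio of the two group risks (non-Sinhalese relative to Sinhalese).
relative_risk = risks["non-Sinhalese"] / risks["Sinhalese"]

for group, p in risks.items():
    print(f"{group}: estimated risk = {p:.2f}")
print(f"relative risk (non-Sinhalese vs Sinhalese) = {relative_risk:.2f}")
```

The two proportions agree with the 21% and 48% shown in Figure 10, and their ratio is consistent with the earlier remark that the risk may be about twice as high in non-Sinhalese families.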
Table IX Variables used in modified path analysis

Variable   Variable in original data set                     Codes
A          X_10  Ethnicity                                   1 = Sinhalese, 2 = non-Sinhalese
B          X_5   Economic status                             1 = 0-1, 2 = 2+
C          X_2   Use of health services for childbirth       1 = in hospital, 2 = at home
D          y     Infant death                                1 = at least one, 2 = none

Figure 14 Path diagram showing causal links implied by selected logit models for the younger women

Figure 15 Path diagram showing causal links implied by selected logit models for the older women

Ethnicity (A) to Economic status (B): -0.73
Economic status (B) to Environment of childbirth (C): -0.41
Ethnicity (A) to Infant death (D): -0.59
The remaining links (A to C, B to D, and C to D) are marked in the diagram as non-significant relationships.

5.2.3 Discussion

Causal interpretations of the path diagrams constructed by the quantitative and qualitative approaches are similar. For the younger women, both path diagrams (Figures 12 and 14) show that minority group status seems to result in poverty, and seems to obstruct access to a better childbirth environment, which in turn leads to infant deaths. For the older women, both path diagrams (Figures 13 and 15) indicate that minority group status per se appears to be the only factor that has any effect on infant mortality. Discrepancies between results for the two age groups may be explained as in Section 5.1.3.

None of the linear regression models in Figures 12 and 13 fit the data particularly well, as shown by the path coefficients corresponding to the unobserved sources of variation. On the other hand, the loglinear or logit models considered in Figures 14 and 15 provide a reasonable fit to the data sets. However, the method for qualitative data does not provide quantitative assessments of indirect effects, as the method for quantitative data does.

6. Remarks and Recommendations on Statistical Methods Used to Identify Risk Groups

An objective of the Sri Lankan household survey was to identify a small number of risk factors that distinguish groups of women having relatively high or low probability of experiencing at least one infant death. This study examined socioeconomic factors (not medical causes) that are relevant to resource allocation priorities, and to cultural obstacles in the planning of health services and health promotion programs. Structural or temporal relationships among the risk factors are also of interest to the researchers.

Statistical discrimination methods were used to select significant risk factors, and to identify the high risk group (or groups) in the Sri Lankan households. Although both logistic discrimination and CART are computing-intensive, the logistic discrimination method requires fewer computing resources, and has more readily available software packages. Otherwise, the CART technique is preferable, since it provides more informative and more easily interpretable results. Furthermore, the CART technique does not require any distributional assumptions.

After a small set of risk factors had been identified by discriminant analysis, the structural or temporal relationships among selected risk factors and infant mortality were investigated using path analysis. The classical method of path analysis using linear regression models has often been applied to social science data that are ordinal or categorical in nature, where a modified method using logistic quantal response models would be more appropriate. When the classical method is applied inappropriately, the resulting path model usually does not fit the data well, as indicated by high residual path coefficients. Although the modified method does provide a better fit, it is highly computing-intensive, and is restrictive in the number of variables allowed in the proposed path model.
In practice, social scientists would use path models with more variables than the models considered here. Variables that were not selected by the discrimination methods might still be of interest to the researchers when considering infant mortality in a larger socioeconomic and political context.

The approach used in this thesis, and recommended for similar studies to identify risk groups, applies discriminant analysis (preferably CART) as an exploratory tool, and then uses path analysis (preferably logistic quantal response modelling) to confirm the significance of relationships among variables.

In our Sri Lankan household study, discriminant analysis identified economic status and environment of childbirth as significant risk factors for the younger women. In contrast, ethnic group membership is the only risk factor identified for the older women. Younger women who gave birth at home, and whose families have low economic status, appear to be at a high risk of experiencing at least one infant death, whereas younger women who gave birth in hospital, or whose families have high economic status, seem to be at a substantially lower risk. For the older women, non-Sinhalese families appear to have a higher risk of experiencing at least one infant death than Sinhalese families.

Results of path analysis on infant mortality using the three identified risk factors suggest that the changing role of ethnicity may partially explain the discrepancies between the previous results for the two age groups. While ethnic group membership may be relevant to many things, including infant mortality, for the older generation, its influence on infant mortality seems to have lessened for the younger generation.

The discrepancies between results for the two age groups may also be explained by other factors. Health services may not have been as readily available at the time of child bearing for the older women as for the younger women.
The use of a better childbirth environment by the younger women may also be explained by families' changing attitudes toward the seriousness of childbirth. Finally, the economic status at the time of study may be strongly related to the economic status at the time of child bearing for the younger women, but may not be so for the older women.

In order to plan an effective health program to promote infant survival, one must understand the socioeconomic conditions in which infant death is likely to occur, as well as the biomedical causes of infant death. Our analysis suggests that most of the high risk households will be too poor to take advantage of the government's subsidy program for the construction of sanitary latrines. Although Sri Lanka has a well-organized network of essentially free health services that extends into rural areas, access to and usage of a better childbirth environment can still be improved. Health planning entails more than designing a program that treats or prevents a health disorder; it must also ensure health care delivery to those in need.

BIBLIOGRAPHY

Anderson, J.A. (1972). Separate sample logistic discrimination. Biometrika, 59:19-35.

Anderson, T.W. (1984). An Introduction to Multivariate Statistical Analysis, 2nd ed. New York: John Wiley & Sons.

Bishop, Y.M.M., Fienberg, S.E., and Holland, P.W. (1975). Discrete Multivariate Analysis: Theory and Practice. Cambridge: The MIT Press.

Blalock, H.M., ed. (1970). Causal Models in the Social Sciences. Chicago: Aldine.

Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. (1984). Classification and Regression Trees. Belmont: Wadsworth & Brooks.

Breslow, N.E. and Day, N.E. (1980). The Analysis of Case-Control Studies. Statistical Methods in Cancer Research, Vol. 1. Lyon: International Agency for Research on Cancer.

Caldwell, J. and McDonald, P. (1982). Influence of maternal education on infant and child mortality: levels and causes. Health Policies and Education, 2:251-267.
Chowdhury, A. (1982). Education and infant survival in rural Bangladesh. Health Policies and Education, 2:369-374.

Cox, D.R. (1970). The Analysis of Binary Data. London: Methuen.

Dillon, W.R. and Goldstein, M. (1984). Multivariate Analysis. New York: John Wiley & Sons.

Duncan, O.D. (1966). Path analysis: sociological examples. The American Journal of Sociology, 72:1-16.

Efron, B. (1975). The efficiency of logistic regression compared to normal discriminant analysis. Journal of the American Statistical Association, 70:891-898.

Fienberg, S.E. (1980). The Analysis of Cross-Classified Categorical Data. Cambridge: The MIT Press.

Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188.

Goodman, L.A. (1972). A general model for the analysis of surveys. The American Journal of Sociology, 77:1035-1086.

Goodman, L.A. (1973a). Causal analysis of data from panel studies and other kinds of surveys. The American Journal of Sociology, 78:1135-1191.

Goodman, L.A. (1973b). The analysis of multidimensional contingency tables when some variables are posterior to others: a modified path analysis approach. Biometrika, 60:179-192.

Grosse, R. and Perry, B. (1982). Correlates of life expectancy in less developed countries. Health Policies and Education, 2:275-304.

Haberman, S.J. (1978). Analysis of Qualitative Data. Volume 1: Introductory Topics. New York: Academic Press.

Hand, D.J. (1981). Discrimination and Classification. New York: John Wiley & Sons.

Heise, D.R., ed. (1975). Sociological Methodology 1976. San Francisco: Jossey-Bass.

Kendall, M.G. and O'Muircheartaigh, C.A. (1977). Path analysis and model building. World Fertility Survey, Technical Bulletin No. 414.

Kiiveri, H., Speed, T.P., and Carlin, J.B. (1984). Recursive causal models. Journal of the Australian Mathematical Society A, 36:30-52.

Lachenbruch, P.A. (1975). Discriminant Analysis. New York: Hafner Press.

Land, K.C. (1969). Principles of path analysis.
Sociological Methodology 1969, ed. E. Borgatta. San Francisco: Jossey-Bass.

Land, K.C. (1973). Identification, parameter estimation and hypothesis testing in recursive sociological models. Structural Equation Models in the Social Sciences, eds. A.S. Goldberger and O.D. Duncan. New York: Seminar Press.

Leik, R. (1975). Causal models with nominal and ordinal data: retrospective. Sociological Methodology 1976, ed. D.R. Heise. San Francisco: Jossey-Bass.

McKeown, T. (1976). The Modern Rise of Population. New York: Academic Press.

Mishler, E.G., Amarasingham, L.R., Hauser, S.T., Liem, R., Osherson, S.D., and Waxler, N.E. (1981). Social Contexts of Health, Illness, and Patient Care. Cambridge: Cambridge University Press.

Morgan, J.N. and Sonquist, J.A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association, 58:415-435.

Morgan, J.N. and Messenger, R.C. (1973). THAID: a sequential search program for the analysis of nominal scale dependent variables. Ann Arbor: Institute for Social Research, University of Michigan.

Morris, M.D. (1979). Measuring the Condition of the World's Poor. New York: Pergamon Press.

Morrison, B. and Waxler, N.E. (1984). Three patterns of basic needs within Sri Lanka: 1971-1973. Unpublished paper.

Mosley, W.H. (1984). Child survival: research and policy. Child Survival. Population and Development Review, a supplement to volume 10. New York: The Population Council, Inc.

Mosley, W.H. and Chen, L. (1984). An analytical framework for the study of child survival in developing countries. Child Survival. Population and Development Review, a supplement to volume 10. New York: The Population Council, Inc.

Mueller, J.H., Schuessler, K.F., and Costner, H.L. (1977). Statistical Reasoning in Sociology. Boston: Houghton Mifflin.

Panel on Discriminant Analysis, Classification, and Clustering (1988). Discriminant Analysis and Clustering. Washington, D.C.: National Academy Press.

Patel, M. (1980).
Effects of the health service and environmental factors on infant mortality: the case of Sri Lanka. Journal of Epidemiology and Community Health, 34:76-82.

Press, S.J. and Wilson, S. (1978). Choosing between logistic regression and discriminant analysis. Journal of the American Statistical Association, 73:699-705.

Puffer, R.R. and Serrano, C.V. (1973). Patterns of Mortality in Childhood. Scientific Publication No. 262. Washington, D.C.: Pan American Health Organization.

Rosenthal, H. (1980). The limitation of log-linear analysis. Contemporary Sociology, 9:207-212.

Sackett, D.L. and Holland, W.W. (1975). Controversy in the detection of disease. The Lancet, 2:357-359.

SAS Institute, Inc. (1985). SAS User's Guide: Statistics, Version 5 Edition. Cary, NC: SAS Institute Inc.

Schlesselman, J.J. (1982). Case-Control Studies: Design, Conduct, Analysis. New York: Oxford University Press.

Simmons, G. and Bernstein, S. (1982). The educational status of parents, and infant and child mortality in rural North India. Health Policies and Education, 2:349-367.

Smucker, C., Simmons, G., Bernstein, S., and Misra, B. (1980). Neo-natal mortality in South Asia: the special role of tetanus. Population Studies, 34:321-335.

Wald, A. (1944). On a statistical problem arising in the classification of an individual into one of two groups. Annals of Mathematical Statistics, 15:145-163.

Waxler, N.E., Morrison, B.M., Sirisena, W.M., and Pinnaduwage, S. (1985). Infant mortality in Sri Lankan households: a causal model. Social Science and Medicine, 20:381-392.

Wermuth, N. (1980). Linear recursive equations, covariance selection, and path analysis. Journal of the American Statistical Association, 75:963-997.

Wermuth, N. (1987). Parametric collapsibility and the lack of moderating effects in contingency tables with a dichotomous response variable. Journal of the Royal Statistical Society, 49:353-364.

Wermuth, N. and Lauritzen, S.L. (1983).
Graphical and recursive models for contingency tables. Biometrika, 70:537-552.

Winship, C. and Mare, R.D. (1983). Structural equations and path analysis for discrete data. The American Journal of Sociology, 89:54-110.

World Bank (1975). Health Sector Policy Paper. Washington, D.C.: World Bank.

Wright, S. (1934). The method of path coefficients. Annals of Mathematical Statistics, 5:161-215.

Wright, S. (1960). Path coefficients and path regression: alternative or complementary concepts? Biometrics, 14:189-202.

Van de Geer, J.P. (1971). Introduction to Multivariate Analysis for the Social Sciences. San Francisco: W.H. Freeman and Company.

Appendix I Partitioning the Sample Space Using Logistic Discrimination (Younger Women)

Let p_0 be some chosen threshold value, so that the high risk group is composed of households with estimated probability of experiencing at least one infant death greater than p_0. Then, using maximum likelihood estimates of the regression coefficients (Table Vb), the high risk households have explanatory variables satisfying the following inequality:

\[
-0.597 - 0.210\, X_5 + 0.292\, X_{2(2)} + 0.245\, X_{2(3)} > \operatorname{logit} p_0 , \tag{A.1}
\]

where X_5 denotes the economic status, and X_{2(2)} and X_{2(3)} are dummy variables representing the categorical variable X_2 as defined below:

\[
X_{2(2)} = \begin{cases} 1 & \text{if the last child was born at home with a midwife,} \\ -1 & \text{if the last child was born in hospital,} \\ 0 & \text{otherwise,} \end{cases}
\qquad
X_{2(3)} = \begin{cases} 1 & \text{if the last child was born at home without a midwife,} \\ -1 & \text{if the last child was born in hospital,} \\ 0 & \text{otherwise.} \end{cases}
\]

Alternatively, the partition region can be described by examining each childbirth environment in (A.1): the region of high risk corresponds to families with

1. last child born in hospital, and economic status < -4.762 (logit p_0 + 1.134), or
2. last child born at home with a midwife, and economic status < -4.762 (logit p_0 + 0.305), or
3. last child born at home without a midwife, and economic status < -4.762 (logit p_0 + 0.352).

Appendix II Modified Path Analysis - Model Selection (Younger Women)

Using the method proposed by Goodman as described in Section 4.2, the relationship between variables A and B is investigated through the logit model

\[
\operatorname{logit}_{i}^{B|A} = w^{B|A} + w_{1(i)}^{B|A} , \tag{A.2}
\]

with estimated effect parameter \hat{w}_{1(1)}^{B|A} = -0.63.

By examining results of fitting the three unsaturated loglinear models corresponding to the logit model with C as the response variable, and A and B as the explanatory variables (models M1 - M3 in Table X), we see that models [AB][AC][BC] (M1) and [AB][AC] (M2) provide reasonable fits for the data. That is, their goodness-of-fit statistics (either X^2 or G^2) are not statistically significant. However, G^2(M2) - G^2(M1) = 3.087 with 1 degree of freedom is significant at the 10% level, suggesting the relation between variables B and C may be important. Thus, the larger model, M1, is preferred. The corresponding logit model is

\[
\operatorname{logit}_{ij}^{C|AB} = w^{C|AB} + w_{1(i)}^{C|AB} + w_{2(j)}^{C|AB} , \tag{A.3}
\]

with estimated effect parameters \hat{w}_{1(1)}^{C|AB} = 0.82 and \hat{w}_{2(1)}^{C|AB} = -0.24.

Now examine the effects of A on D, B on D, and C on D as suggested by the assumed causal ordering. The results of fitting the seven unsaturated loglinear models corresponding to the logit model with D as the response variable, and A, B and C as the explanatory variables (M4 - M10 in Table X), show that all except model [ABC][AD] (M8) fit the data well. Since model M7 is a special case of model M4, and G^2(M7) - G^2(M4) = 0.216 with 1 degree of freedom is not statistically significant at the 5% level, the smaller model, M7, is preferred. For models M9 and M10, two special cases of model M7, G^2(M9) - G^2(M7) = 6.729 and G^2(M10) - G^2(M7) = 4.886, each with 1 degree of freedom; both are statistically significant at the 5% level. Thus, further reduction from model M7 is not desirable.
The logit model corresponding to M7 is

\[
\operatorname{logit}_{ijk}^{D|ABC} = w^{D|ABC} + w_{2(j)}^{D|ABC} + w_{3(k)}^{D|ABC} , \tag{A.4}
\]

with estimated effect parameters \hat{w}_{2(1)}^{D|ABC} = 0.33 and \hat{w}_{3(1)}^{D|ABC} = -0.38.

The results are summarized by the path diagram in Figure 14.

Table X Goodness-of-fit statistics for loglinear models (younger women)

Model                         d.f.    X^2       G^2
M1   [AB][AC][BC]             1       0.215     0.215
M2   [AB][AC]                 2       3.325     3.302
M3   [AB][BC]                 2       32.029    32.887
M4   [ABC][AD][BD][CD]        4       2.830     2.896
M5   [ABC][AD][BD]            5       8.766     8.063
M6   [ABC][AD][CD]            5       6.629     7.051
M7   [ABC][BD][CD]            5       3.012     3.112
M8   [ABC][AD]                6       13.740    13.252
M9   [ABC][BD]                6       9.781     9.841
M10  [ABC][CD]                6       7.716     7.998

where X^2 is the Pearson chi-square statistic, and G^2 is the likelihood ratio statistic.

Appendix III Modified Path Analysis - Model Selection (Older Women)

Using the method proposed by Goodman as described in Section 4.2, the relationship between variables A and B is investigated through the logit model

\[
\operatorname{logit}_{i}^{B|A} = w^{B|A} + w_{1(i)}^{B|A} , \tag{A.5}
\]

with estimated effect parameter \hat{w}_{1(1)}^{B|A} = -0.73.

By examining results of fitting the three unsaturated loglinear models corresponding to the logit model with C as the response variable, and A and B as the explanatory variables (models M1 - M3 in Table XI), we see that models [AB][AC][BC] (M1) and [AB][BC] (M3) provide reasonable fits for the data. That is, their goodness-of-fit statistics (either X^2 or G^2) are not statistically significant. Since M3 is a special case of M1, and G^2(M3) - G^2(M1) = 0.609 with 1 degree of freedom is not statistically significant at the 5% level, the smaller model, M3, is preferred. The corresponding logit model is

\[
\operatorname{logit}_{ij}^{C|AB} = w^{C|AB} + w_{2(j)}^{C|AB} , \tag{A.6}
\]

with estimated effect parameter \hat{w}_{2(1)}^{C|AB} = -0.41.
By examining results of fitting the seven unsaturated loglinear models corresponding to the logit model with D as the response variable, and A, B and C as the explanatory variables (M4 - M10 in Table XI), we see that [ABC][AD] (M8) is the smallest model that fits the data well. Since adding more interaction terms to model M8 does not significantly improve the fit, the most parsimonious model is M8. Thus, the corresponding logit model is given by

\[
\operatorname{logit}_{ijk}^{D|ABC} = w^{D|ABC} + w_{1(i)}^{D|ABC} , \tag{A.7}
\]

with estimated effect parameter \hat{w}_{1(1)}^{D|ABC} = -0.59.

The results are summarized by the path diagram in Figure 15.

Table XI Goodness-of-fit statistics for loglinear models (older women)

Model                         d.f.    X^2       G^2
M1   [AB][AC][BC]             1       0.106     0.105
M2   [AB][AC]                 2       4.023     4.058
M3   [AB][BC]                 2       0.724     0.715
M4   [ABC][AD][BD][CD]        4       1.776     1.682
M5   [ABC][AD][BD]            5       2.505     2.514
M6   [ABC][AD][CD]            5       1.771     1.714
M7   [ABC][BD][CD]            5       12.187    12.130
M8   [ABC][AD]                6       2.503     2.515
M9   [ABC][BD]                6       13.335    13.304
M10  [ABC][CD]                6       12.805    12.917

where X^2 is the Pearson chi-square statistic, and G^2 is the likelihood ratio statistic.
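The nested-model comparisons quoted in Appendices II and III can be checked from the tabulated G^2 values. The following is a minimal sketch, not the S programs used in the thesis: it takes the G^2 statistics and degrees of freedom from Tables X and XI, forms the differences, and converts each one to a p-value. The helper name `compare` and the dictionary layout are illustrative assumptions, and the closed-form tail probability used here is valid only for 1 degree of freedom, which covers every comparison made in the text.

```python
import math

# (G^2, d.f.) pairs copied from Table X (younger) and Table XI (older).
G2 = {
    "younger": {"M1": (0.215, 1), "M2": (3.302, 2), "M4": (2.896, 4),
                "M7": (3.112, 5), "M9": (9.841, 6), "M10": (7.998, 6)},
    "older": {"M1": (0.105, 1), "M3": (0.715, 2)},
}


def compare(group, smaller, larger):
    """Likelihood ratio comparison of a nested (smaller) model with a larger one.

    Returns (G^2 difference, d.f. difference, p-value). The p-value uses
    P(chi-square_1 > x) = erfc(sqrt(x / 2)), which holds for 1 d.f. only.
    """
    g_small, df_small = G2[group][smaller]
    g_large, df_large = G2[group][larger]
    diff = g_small - g_large
    df = df_small - df_large
    if df != 1:
        raise ValueError("this sketch only handles 1 d.f. comparisons")
    return diff, df, math.erfc(math.sqrt(diff / 2))


comparisons = [
    ("younger", "M2", "M1"),
    ("younger", "M7", "M4"),
    ("younger", "M9", "M7"),
    ("younger", "M10", "M7"),
    ("older", "M3", "M1"),
]

for group, smaller, larger in comparisons:
    diff, df, p = compare(group, smaller, larger)
    print(f"{group}: G2({smaller}) - G2({larger}) = {diff:.3f}, "
          f"d.f. = {df}, p = {p:.3f}")
```

For the older women, the difference computed from the rounded table entries is 0.610 rather than the 0.609 quoted in the text, presumably because the tabulated values are themselves rounded.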
Title | Identification of risk groups : study of infant mortality in Sri Lanka |
Creator |
Kan, Lisa |
Publisher | University of British Columbia |
Date Issued | 1988 |
Subject | Newborn infants -- Sri Lanka -- Mortality; Infants -- Mortality; Sri Lanka -- Statistics, Vital |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2010-08-30 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0097696 |
URI | http://hdl.handle.net/2429/27971 |
Degree | Master of Science - MSc |
Program | Statistics |
Affiliation | Science, Faculty of; Statistics, Department of |
Degree Grantor | University of British Columbia |
Campus | UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |