DOES CULTURE AFFECT THE LOCATION OF DEATH AMONG CANCER PATIENTS IN BRITISH COLUMBIA?: A PILOT STUDY UTILIZING STANDARD AND NOVEL STATISTICAL METHODS IN END-OF-LIFE RESEARCH

by

MICHAEL DAVID REGIER

B.AR. Hons., Canadian Bible College, 1997
B.Sc., The University College of the Fraser Valley, 2003

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES (Statistics)

THE UNIVERSITY OF BRITISH COLUMBIA
August 2005
© Michael David Regier, 2005

Abstract

Objective
Understanding the clinical, cultural, social, demographic and economic landscape that affects end-of-life health service utilization for British Columbian cancer patients is novel work. The location of death was taken as the first use of health services to investigate. The goals of this thesis are to understand the trends in the place of death from 1997 to 2003 for adult British Columbians who died of cancer, to identify the factors predictive of dying out of hospital, to develop new indicators of culture, and to evaluate the utility of data mining for end-of-life research.

Methods
The subjects were all adults, aged 18 years and older, who died of cancer in British Columbia from 1997 to 2003, as identified from death certificates. Data from the British Columbia Cancer Registry and British Columbia Vital Statistics were linked. The Postal Code Conversion File program was then used to associate each subject with a dissemination area, which in turn was used to impute ecological measures of income, ethnicity, language and religion to each patient. The culture indicators were developed using principal component analysis and k-means clustering.
To assess the utility of linear discriminant analysis, logistic regression, neural networks, classification trees, and nearest neighbours for end-of-life research, a subset comprising all adults aged 20 years and older who died from 1999 to 2003 was drawn from the data set described above. The classifiers were assessed using the receiver operating characteristic (ROC) curve, the area under the curve (AUC), the misclassification curve, the hit curve, and cross-validation.

Conclusions
After all other significant factors had been taken into consideration, people who were more likely to die out of hospital were female, lived in neighbourhoods in a higher income quintile, lived in either the Interior or Northern Health Authorities, lived in communities with fewer than 500,000 people, died from breast, colorectal, pancreatic or prostate cancer, had longer survival times, and were older. People associated with the derived Aboriginal Presence, British Isles Presence, and the Americas and Europe ethnic origins clusters, and people associated with the Chinese Presence and the Punjabi Presence mother tongue clusters, were less likely to die out of hospital.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Abbreviations
Acknowledgements
Chapter 1: Introduction
  Introduction and Background
  Health Outcomes in End-of-Life Research
  Use of Administrative Databases in End-of-Life Research
  Data Mining in End-of-Life Research
  Cancer, End-Of-Life Research, and Data Mining
  Appropriateness of Data Mining for End-of-Life Research
  Ensuring Model Generalization
  Purpose
  Context
  Thesis Structure
  References
Chapter 2: Manuscript 1 - Assessing the Necessity and Appropriateness of Mixed-Level Ecological Studies in End-of-Life Research
  Introduction
  Census Data as an Ecological Measure
  Balancing the Two Fallacies
  Understanding the Variables, Study, and Design
  Types of variables
  Prescriptive versus Descriptive Construction of Culture indicators
  Types of Studies
  Reasonability of an Ecological Study
  Current Situation
  New Horizons
  Prescriptive Approaches
  Specification
  Specification and Quantification
  Descriptive Approaches
  Classification Trees to Define Cut-off Values
  Cluster Algorithms for Indicator Development
  Conclusion
  References
Chapter 3: Manuscript 2 - Trends in the Place of Death of British Columbia Cancer Patients, 1997-2003
  Background
  Methods
  Data Sources
  Study Population
  Study Variables
  Analysis
  Results
  Discussion
  Limitations
  Conclusion
  Appendix
  References
Chapter 4: Manuscript 3 - Developing a Multivariate Defined Indicator of Culture: A Novel Synthesis of Unsupervised Learning Techniques
  Background
  Methodology
  Data Source and Population
  Study Variables
  Data Transformation
  Statistical Tools - An Overview
  Principal Component Analysis (PCA)
  K-Means
  Results
  Ethnic Origins
  Mother Tongue
  Religion
  Indicator Validation
  Discussion
  Limitations
  Conclusions
  Appendix
  References
Chapter 5: Manuscript 4 - Identifying Socioeconomic, Cultural, Demographic, and Clinical Predictors for the Location of Death of Cancer Patients in British Columbia (1999-2003): A Data Mining Investigation
  Introduction
  Methodology
  Administrative Databases
  Subjects
  Variables
  Data Sources
  Response
  Predictors
  Data Quality
  Data Mining Methodologies
  Linear Discriminant Analysis
  Logistic Regression
  Neural Networks
  Classification Trees
  K Nearest Neighbours
  Advantages and Disadvantages
  Evaluating the Performance of Data Mining Methodologies
  Performance Criteria
  Misclassification, Loss Function and the Misclassification Curve
  Receiver Operating Characteristic Curve and Area Under the Curve
  Hit Curve (Lift Curve)
  Estimating the Performance Criterion
  Out-of-Sample Prediction
  Estimation by reusing the same data
  Using data to prevent overly optimistic estimates and over-fitting
  Bias-Variance Trade-Off
  Cross-Validation
  Cross-Validation and model refinement
  Two-layer Cross-Validation
  Fitting the classifiers with cross-validation
  Analysis
  Results
  Classifier Comparison Results
  Receiver Operating Characteristic Curve Comparison
  Area Under the Receiver Operating Characteristic Curve
  Hit Curve
  Misclassification Rate and Empirically Derived Cost of Misclassification
  Health Results
  Logistic Regression: Full and Parsimonious Models
  Classification Tree
  Discussion
  Classifier Performance
  Health Related Insights
  Limitations
  Conclusion
  Classification Techniques
  Population Health
  General Conclusion
  Appendices
  Appendix A: The DCI/DCO process in British Columbia
  Appendix B: ICD Groupings
  Appendix C: British Columbia Health Authorities
  Appendix D: Data Quality
  Appendix E: Misclassification Curves
  Appendix F: Classification Tree
  References
Chapter 6: Conclusion
  Conclusion
  Strengths
  Limitations
  Significance of this Work
  Future Work and Recommendations
  References

List of Tables

Table 1-1: The variables used in this thesis and the source for each variable
Table 2-1: Comparison between Susser's variables and Macintyre-Duncan spatially based variables
Table 2-2: Table for developing a new nomenclature by which to specify predictor and response variables
Table 3-1: Characteristics of adults in British Columbia who died of cancer from 1997 to 2003, by place of death
Table 3-2: Crude and adjusted odds of dying out of hospital, by predictors, for British Columbia, 1997-2003
Table 3-3: Nova Scotia Groupings for ICD codes. The British Columbia groups were identical to these groups
Table 4-1: Conceptual grouping of ethnic origins
Table 4-2: Conceptual groupings of mother tongue
Table 4-3: Conceptual groupings of religion
Table 4-4: PCA proportion of variance and cumulative proportion of variance for the conceptually grouped ethnic origins census variables
Table 4-5: Statistical descriptions of the population proportions for the six ethnic origin groups expressed by only the minimum value, the median and the maximum value
Table 4-6: Percentage of the six ethnic origins clusters in 22 British Columbian municipalities
Table 4-7: Percentage of the five mother tongue clusters in 22 British Columbian municipalities
Table 4-8: Percentage of the six religion clusters in 22 British Columbian municipalities
Table 4-9: Percentage of clusters for mother tongue groups given the ethnic origin group
Table 4-10: Percentage of clusters for religion groups given the ethnic origin group
Table 4-11: Percentage of clusters for religion groups given the mother tongue group
Table 4-12: Statistical descriptions of the population proportions for the five mother tongue groups expressed by only the minimum value, the median and the maximum value
Table 4-13: Statistical descriptions of the population proportions for the six five mother tongue groups expressed by only the minimum value, the median and the maximum value
Table 5-1: Advantages and disadvantages specific to each classifier
Table 5-2: Misclassification table
Table 5-3: 0-1-c loss function; the truth is the real state (what was observed) and the decision is the decision made by the classifier. "H" indicates death in a hospital. "O" indicates death out of a hospital
Table 5-4: Tabular representation of the hit curve where "O" is out of hospital death and "H" is hospital death
Table 5-5: 2^3 design for determining the best set of parameters for a neural network
Table 5-6: Characteristics of adults in BC who died of cancer from 1999 to 2003, by place of death
Table 5-7: Characteristics for age and survival as continuous variables
Table 5-8: Cross-validated Area Under the Curve and cross-validated estimate of the AUC standard error
Table 5-9: Minimum misclassification rates and empirically derived costs for each classifier. The misclassification rate for classifying all patients as dying in a hospital is given at the end of the table
Table 5-10: Adjusted odds of dying out of the hospital, by predictors, for British Columbia for both the full model and the parsimonious model
Table 5-11: Tumour groups and associated ICD-9, ICD-10, and ICD-O codes

List of Figures

Figure 1-1: Relationship between the databases and census information used in this thesis. The British Columbia Vital Statistics and the British Columbia Cancer Registry are linked and the study subjects are obtained.
To this research database, dissemination area information is attributed to each patient
Figure 3-1: Proportion of death out of hospital and in hospital among British Columbian adults who died of cancer, 1997-2003
Figure 3-2: Percentage of patients who died out of hospital over the study period 1997-2003 by British Columbia Health Region
Figure 4-1: Structure of the 2001 Canada Census Ethnic Origins characteristic
Figure 4-2: Scree plot for the ethnic origins census PCA
Figure 4-3: Total within-cluster sum of squares (SS) versus the number of clusters
Figure 4-4: Scatter plots of the first five principal components with the centres for the six ethnic origins clusters: Heterogeneous (1), Aboriginal Presence (2), Chinese Presence (3), British Isles Presence (4), East Indian Presence (5) and the Americas and Europe (6)
Figure 4-5: Scatter plots of the first three principal components with the centres for the five mother tongue clusters: Punjabi Presence (1), Dominant English (2), Heterogeneous with Dominant English (3), Chinese Presence (4), and Strong English Presence (5)
Figure 4-6: Scatter plots of the first four principal components with the centres for the six religion clusters
Figure 5-1: Graph for study set variable completeness
Figure 5-2: Example of a one layer architecture for a neural network with three input variables. The input variables are x1, x2, and x3 with a constant input x0. The transformation functions (perceptrons) for the hidden layer are h1 and h2. The output layer has a transformation function f. A skip allows x0 to skip over the hidden layer and be one of the weighted inputs to the output layer
Figure 5-3: Example of how the classification tree algorithm partitions the explanatory variable space. There are two explanatory variables, x1 and x2. The binary splits are represented by a, b, c, and d
Figure 5-4: The classification tree for Figure 5-3. If the condition is true, go down the left branch.
If the condition is false, go down the right branch. "H" indicates death in a hospital. "O" indicates death out of a hospital. The terminal nodes are considered pure. There are only members of one class in each terminal node
Figure 5-5: The ROC curve graphically represents the relationship between the sensitivity and the specificity. Line (A) represents a classifier that performs no better than flipping a coin. The associated AUC would be 0.5. Line (B) represents a classifier that performs well and has the ability to separate the two classes. The AUC would be near 1.0
Figure 5-6: The ideal hit curve (A). The curve rises at a 45 degree angle to the number of selected patients axis until it reaches the total number of patients who died out of hospital (n1). After this the curve is horizontal. The curve (B) is what is commonly seen with hit curves. The curve rises more slowly and needs more than n1 patients selected to find the n1 patients who died out of the hospital
Figure 5-7: The partitioning of data into three sets: Training set for model construction, validation set for model selection, and testing set for error estimation (generalization)
Figure 5-8: An example of cross-validation illustrated by 10-fold cross-validation
Figure 5-9: An example of two-layer cross-validation illustrated by 10-fold two layer cross-validation
Figure 5-10: Proportion of death out of hospital and in hospital among adults in British Columbia who died of cancer, 1999-2003
Figure 5-11: Receiver Operating Characteristic Curves for linear discriminant analysis, logistic regression, neural networks, classification trees, and nearest neighbours. Neural network has the best ROC curve and nearest neighbours has the worst. Linear discriminant analysis, logistic regression, and classification trees are similar.
Figure 5-12: Linear discriminant analysis hit curve
Figure 5-13: Logistic regression hit curve for the full model
Figure 5-14: Logistic regression hit curve for the parsimonious model
Figure 5-15: Neural network hit curve
Figure 5-16: Classification tree hit curve
Figure 5-17: Nearest neighbour hit curve
Figure 5-18: British Columbia Health Authorities [66]
Figure 5-19: Agreement between the diagnosis of cancer and the cancer cause of death
Figure 5-20: Misclassification curve for linear discriminant analysis model
Figure 5-21: Misclassification curve for the full logistic regression model
Figure 5-22: Misclassification curve for the parsimonious logistic regression model
Figure 5-23: Misclassification curve for neural networks model
Figure 5-24: Misclassification curve for classification trees model
Figure 5-25: Misclassification curve for the nearest neighbours model

List of Abbreviations

Abbreviation  Description
Arg           Argument
AUC           Area Under the Curve
BC            British Columbia
BCCA          British Columbia Cancer Agency
BCCR          British Columbia Cancer Registry
CI            Confidence Interval
CIHR          Canadian Institutes of Health Research
CT            Classification Tree
CV            Cross-Validation
DA            Dissemination Area
DCI           Death Certificate Identified
DCO           Death Certificate Only
Err           Error
ICD-10        International Classification of Diseases: Tenth Revision
ICD-9         International Classification of Diseases: Ninth Revision
ICD-O         International Classification of Diseases for Oncology
IPPE          Income Per Person Equivalent
KNN           K Nearest Neighbours
LDA           Linear Discriminant Analysis
MSE           Mean Squared Error
NAACCR        North American Association of Central Cancer Registries
NET           New Emerging Team
NN            Neural Networks
NS            Nova Scotia
OR            Odds Ratio
PCA           Principal Component Analysis
PCCF          Postal Code Conversion File
PHSA          Provincial Health Services Authority
ROC           Receiver Operating Characteristic
SDEC          Socio-demographic and economic characteristics
SE            Standard Error
SES           Socioeconomic Status
VS            British Columbia Vital Statistics

Acknowledgements

I would like to thank the Sociobehavioural Research Center at the British Columbia Cancer Agency for supporting me in the completion of this thesis. Its professional, moral and financial support has been an asset in doing the work that was set before me. I would like to acknowledge both my supervisors, Dr. Ruben Zamar and Dr. Maria Cristina Barroetavena, for their support. I would like to thank Dr. R. Zamar, whose help, stimulating suggestions and encouragement have been invaluable as I have been working on this thesis. I would like to thank Dr. M.C. Barroetavena, whose mentorship in the area of epidemiology, stimulating suggestions and encouragement have been invaluable to this investigation. I am indebted to both Dr. R. Zamar and Dr. M.C. Barroetavena. Finally, I would like to thank my wife Christine, who read through many drafts, spent many weekends at home as I worked, and encouraged me throughout the past two years.

Chapter 1: Introduction

Introduction and Background

I was fourteen when I attended my first funeral. It was for my grandfather. My family did not talk about his illness. They did not talk about death. One day the phone rang. I answered. He had died in the hospital. I told my parents when they arrived at home. They left immediately. I was left alone.

Death is inevitable. Some people see it as a point of transition from one form of existence to another. For others it is the terminus. Some are surrounded by family and friends. Others die alone. Some people desire to involve their family and friends as they move towards it. Others seek a solitary path. Regardless of the choices that are made, death is the one thing all people must face. Research may never be able to pull back the veils of mystery that surround death, but it will help researchers understand the choices people make at the end of their life.
Questions that are of interest include:
• whether culture plays a role in the way people understand death;
• whether income affects the quality of life as one prepares for death;
• whether the place of residence determines the resources that will be utilized during the end-of-life period.
Answers to these and other questions will assist health care providers in enhancing the quality of end-of-life care for people.

This thesis focuses upon the theme of end-of-life health service utilization. As this is novel work for British Columbia, the point of origin was to investigate the location of death of cancer patients in British Columbia. By establishing firm research foundations, further investigations can occur that will help epidemiologists and social scientists to understand the clinical, cultural and economic landscape that affects end-of-life health service utilization. By identifying the patterns of health service utilization among vulnerable populations (those marginalized from the health care system), interventions can be developed and implemented through programs and legislation that will ensure equal access to and use of health services at the end of life for all British Columbians.

Health Outcomes in End-of-Life Research

Death can occur in a variety of places, yet for many it occurs in a hospital, a care facility or the home. Current research indicates that, if given a choice, up to 88% (Connecticut, United States) [5] of patients dying of cancer would choose to die at home, yet only a smaller proportion realizes this preference [1-6]. If congruence between the desired place of death and the actual place of death is to occur, end-of-life services must be established that would enable people to die where they choose.
For the purposes of this research, the location of death, defined as "In Hospital" and "Out of Hospital", is investigated in order to provide a foundation of research upon which further investigations into the location of death of cancer patients can build.

In Canada, it is estimated that 69,500 deaths due to cancer will occur in 2005 [7]. For women and men, lung cancer will be the leading cause of death [1]. Approximately 8,700 deaths due to cancer will occur in BC in 2005 [1]. Of the top three cancer causes of death for men and women, the majority of deaths correspond to people aged 50 years and older [7].

In Nova Scotia (NS), Canada, out-of-hospital deaths were 25.4% over the period from 1992 to 1997. This difference may have as much to do with how health services are delivered as with individual choice. Out-of-hospital deaths ranged from 19.8% in 1992 to 30.2% in 1997 [10]. In Ontario, out-of-hospital deaths have ranged from 31% in 1980 to 21.5% in 1994 to 34% in 1997 [11]. In Manitoba, 57.4% of cancer patients died in the hospital during the fiscal year of 2000/01. In Vancouver, approximately 15% died in non-institutional settings (home or other locations) from 1990 to 1992, yet in 1993 this number increased to just under 30% [12].

It is important to understand not only where people die but also the predictors of the location of death, as these point to the underlying factors that may influence where a person dies. In the United States, much emphasis is placed upon conventional measures of socioeconomic status (SES), while the sociodemographic aspect is generally limited to race, age, gender and marital status [6,14-22]. The consideration of cultural context extends mostly to region of residence, which is summarised in a rural versus urban indicator [16-17]. Similar themes emerge when considering international research into the cultural determinants of the utilization of health services and the location of death for persons dying of cancer [23-28].
Research upon populations from Toronto and Nova Scotia presents further challenges, as it suggests there is no SES connection to these health outcomes [10,29-32]. Roos [30] and Veugelers imply that the lack of association between health outcomes and SES is due to Canada's universal health care system. If this is true, then the conventional measures of SES (e.g. income, education, and occupation) may not be discriminants of health outcomes. Health Canada claims that "universal coverage for medically necessary health care services [is] provided on the basis of need, rather than the ability to pay" [33]; thus economic measures of SES may not capture any systematic marginalization of peoples from health services.

In investigating the indicators used to understand the non-clinical factors that are associated with the use of health services and the place of death within cancer research, it became apparent that international comparisons must be made within the context of the health system in place in each of the countries being considered. This suggests that indicators that are relevant in the United States and other countries with different health care systems may not have the same influence in Canada. Since there is no consensus on the role that SES measures play in understanding health outcomes, perhaps there is another mechanism which creates vulnerable populations within Canada: culture.

British Columbia is a microcosm of the Canadian mosaic. Many different ethnic groups, such as the Chinese, South-Asian, Dutch, English, and German, are found within the provincial borders. The Department of Canadian Heritage states that "Multiculturalism ensures that all citizens can keep their identities, can take pride in their ancestry and have a sense of belonging" [36]. Furthermore, people are "free to choose for themselves, without penalty, whether they want to identify with their specific group or not" [36].
By incorporating the spirit of multiculturalism, this research has posited a novel approach for defining dissemination area derived indicators of culture. By utilizing the multivariate aspect of the census categories, a descriptive approach to the construction of cultural indicators has been developed and constitutes one of the four manuscripts of this thesis. The motivation for this development is given in the first manuscript, where the difference between the current prescriptive approach [10] and the proposed descriptive approach is discussed. The current methodology suggests that culture fits into a predetermined structure or identity, whereas the new approach suggests that culture is defined by a complex web of relationships. Grouping dissemination areas based upon these webs of interconnectivity allows for a shift from modern to post-modern understandings of culture. In doing so, this research will be poised to capture the complex interplay of cultures that constitutes the mosaic of Canadian society.

Use of Administrative Databases in End-of-Life Research

Administrative databases represent an efficient means by which a sketch of a population's health can be obtained. They can be accessed with relative ease, the data is typically available within a reasonable time period, and the associated costs are manageable. A disadvantage of using an administrative database is that the information desired from the database does not naturally extend from the underlying motivation for collecting the information. As well, the data sets obtained are often very large in terms of observations, variables or both.

For the purposes of this research, an end-of-life database was constructed. The British Columbia Cancer Registry (BCCR) and British Columbia Vital Statistics (VS) databases were linked to add information on place of death for each study subject (Figure 1-1).

Figure 1-1: Relationship between the databases and census information used in this thesis.
The British Columbia Vital Statistics and the British Columbia Cancer Registry are linked and the study subjects are obtained. To this research database, dissemination area information is attributed to each patient.

The BC Cancer Registry, in existence since 1969 and maintained by the BC Cancer Agency (BCCA) since 1980, collects and generates cancer statistics on the BC population [34]. This database is used for cancer surveillance and research; it provides information on the magnitude of the cancer problem within BC, assists in the development and monitoring of programs for reduced mortality and morbidity, and provides information for future planning [34]. Because the registry is population based, the BCCA has access to a source of information that avoids the bias due to non-representative participation. This translates into better quality data than that obtained from non-population-based sources [34]. The BC Cancer Registry has attained an overall silver certification status from the North American Association of Central Cancer Registries (NAACCR). The registry has reached gold standard achievement for specific items such as timeliness and death certificate only cases [35].

The original database contained just over 60,000 patients who resided in and died in BC, and over 100 fields. From this database, the study sets were obtained. These were smaller in terms of the number of patients and the number of fields, as the study cohort was further defined by age and cause of death, yet they were still sizable. After ascertaining the completeness and utility of the variables for this investigation, the variables to be used in this research were selected (Table 1-1). A small number of variables were used, but the number of patients was still very large: over 30,000 for the smallest data set.

Table 1-1: The variables used in this thesis and the source for each variable.
| Study Variables | Units of Measure | Description | Source† |
| Sex | M=Male, F=Female, X=Missing | Indicates the patient's gender | BCCR |
| Age at death | Years | Death date - Birth date | BCCR |
| Year of death | Calendar year | Obtained from death date | BCCR |
| Region of Residence | 1=Interior, 2=Fraser, 3=Vancouver Coastal, 4=Vancouver Island, 5=Northern, 6=Provincial Health Authority | Provincial Health Authority excluded as it does not represent a geographic region of BC | VS |
| Primary Cause of Death | ICD-9 and ICD-10 groupings | The patient's primary or underlying cause of death as determined by the BC Vital Statistics Agency or another province/country's vital statistics department. | BCCR |
| Survival time from diagnosis | Days | Death date - Diagnosis date | BCCR |
| Place of death | H=Hospital, O=Out of Hospital | The outcome variable for the place of death was obtained from VS. The location of death was derived from a field which indicated hospital death. If a person died in hospital, this was identified with the hospital code. If a person died out of the hospital, there was no code. Place of death was dichotomized as in hospital or out of hospital. | VS |
| Rural vs. Urban | R/U | The postal code of the usual place of residence at death is classified according to the Canada Post designation, where the second digit indicates rural or urban. If the number is 1-9, the mail is destined to an urban delivery area. The number 0 indicates a rural delivery area. | VS |
| Income Quintile | 1-5 | The neighbourhood income per person equivalent (IPPE) is a household size-adjusted measure of household income, based upon the census information at the enumeration/dissemination level. This measure is used to construct income quintiles. | CC |
| Mother Tongue Measure | | Derived from the census mother tongue single response category | CC |
| Ethnic Origin Measure | | Derived from the census ethnic origin total response category | CC |
| Religious Measure | | Derived from the religion census category. 2001 census only | CC |
† BC Cancer Registry: BCCR; Vital Statistics: VS; Canada Census: CC

Dissemination-level, census-based information was associated with each of the study sets using the Statistics Canada Postal Code Conversion File (PCCF) program. The census information was linked in SAS using the unique identifier of each dissemination area.

Although there is much information contained in these data sources, the data has been collected for purposes other than the understanding of the location of death for cancer patients. Research based upon a secondary use of administrative or corporate databases bears the characteristics of observational studies. From a classical statistical perspective, observational studies are the most problematic type of study as there is, in general, no reference to any experimental design. It is this lack of reference which puts observational studies in such low regard, as "design [is] at the base of statistics", which has strong mathematical support [36]. This lack of design is most commonly seen in non-randomized data; thus observational, non-randomized studies tend to carry lower evidential weight than randomized, experimental ones where the researcher has control over the interventions under study [37]. It is because of this diminished evidential weight that a well-planned and credible method for analyzing data sets of this nature is needed. Data mining has emerged in response to the desire to make sense of the vast amounts of data housed in administrative databases and will be discussed in the following section [38].

Data Mining in End-of-Life Research

The two branches of data mining are unsupervised and supervised learning. With both, the main objective is to learn about the data and discover the patterns within it. Unsupervised learning is used when only the features of the data are known and there is no expressed outcome variable.
Having only the predictor variables, the central objective is to understand how the data is organized based upon the known features [38-39]. Principal component analysis is an example. Supervised learning has an outcome variable which guides the process [38-39]. Here, the goal is to understand how the data is organized with respect to both the predictor and the outcome variables. A familiar example of this type of learning is logistic regression.

Over the past decade, data mining techniques have become more common within medical research. As awareness of these techniques has grown, they have been applied to the study of health related conditions such as medication error and adverse drug events [41], Binswanger's disease [43], anophthalmia and microphthalmia [44], cystic fibrosis [45], myocardial infarction [47-48,86], pregnancy outcomes [48-49,59], stroke [50,60], cardiac ischemia [51], meningococcal disease [52-53,61], carpal tunnel syndrome [54], survival and mortality [6,55,86-87,89,90], cancer [9,40,56,62-84,94-95], schistosomiasis [51], pneumococcal disease, critical care [85,88,90], and asthma [86].

A survey of these applications reveals two primary motivations for bringing data mining techniques into medical research: diagnosis, and the comparison of how different techniques perform in order to find a better tool for diagnosis. The first concerns the diagnosis of a disease given a set of patient descriptors. Here, the primary focus is upon the ability to diagnose (or predict) an outcome. The second motivation is to determine the best classifier for a data set when the objective is to predict an outcome. Here, the aim is to find the best classifier in order to obtain more reliable diagnoses. Although diagnosis is a goal, it is subordinate to finding the best predictor. Taking the lead from the wider medical literature, a reasonable group of classification techniques has been selected.
Linear discriminant analysis, logistic regression, neural networks, classification trees, and nearest neighbours have been chosen as they are common choices within the literature. The final manuscript seeks to compare the performance of these five classifiers and determine if they are appropriate for end-of-life cancer research.

Cancer, End-Of-Life Research, and Data Mining

The articles reviewed come from a search for data mining techniques within cancer research and epidemiology, focusing upon research of the last 15 years. Research which used data mining techniques other than logistic regression was sought. Even with these constraints, logistic regression was prevalent [9, 42, 53, 60, 65, 69, 72, 78, 85, 89, 91-95]. Outside of the use of logistic regression, a majority of the articles related to cancer diagnosis have utilized neural networks [63-64, 66, 68, 71, 73-74, 82-83] with cluster [55, 58, 85]. When comparing techniques, neural networks are present in almost all the papers reviewed [40, 65, 69, 75, 78-80]. In these papers, neural networks are compared against linear discriminant analysis [40, 75, 77, 79], logistic regression [65, 69, 78], classification trees [69], and nearest neighbours [79].

A diversity of views on the use of data mining techniques was present. When the research focus was diagnosis, the majority of the reviewed articles were supportive of the use of data mining techniques [63, 66, 68, 71, 73-74, 82-83]. Some simply used the techniques with no secondary comparison [68, 74], while others sought to extend the techniques [71, 73, 82-83]. Baxt [64] was the only article to exhibit any caution about the techniques, and viewed the body of current research using neural networks within a medical context as inconclusive as to their usefulness.
When a comparison of techniques was the focus, a majority of the reviewed papers supported the data mining technique, citing that the chosen technique outperformed a more standard one, such as logistic regression. The remaining articles viewed these as potentially beneficial techniques, but were cautious about how they should be applied and about the methodology surrounding their assessment [40, 79-80]. Although the majority of reviewed articles were enthusiastic about the techniques, the sober judgement of the few was compelling. Without a clear methodology which incorporates appropriate validation and assessment of comparable techniques, it is questionable if a preferred modelling technique can be objectively chosen. Schwarzer et al. conclude their survey of the uses of neural networks within cancer research by stating that, "In our opinion, there is no evidence that artificial neural networks have provided real progress in the field of diagnosis and prognosis in oncology" [80]. This sentiment is echoed by Sboner et al. [79] and Dybowski et al. [40].

When narrowing the focus to end-of-life research, the range of techniques used also narrows. Logistic regression features in a majority of the reviewed articles [6, 9-10, 16, 24, 85, 87, 89-90, 95]. Of these, there is one instance of neural networks [89], and one instance of classification trees [85]. Ottenbacher compares logistic regression and neural networks and views neural networks as an aid to clinical prognosis, but still prefers logistic regression [89]. Abu-Hanna [85] does not compare classification trees and logistic regression; rather, a two-stage modelling procedure is used in which classification trees identify disjoint groups of patients and logistic regression is then used as a local model for these groups.
The implicit view is that the more novel data mining techniques should not supplant logistic regression, but rather function as an added tool to better understand the underlying structure of the data.

By reviewing the medical literature for uses of data mining in cancer and end-of-life research, some trends are apparent.
• Logistic regression is being used as a base model against which the newer or less familiar techniques are being compared.
• With logistic regression as the standard of performance, it is often compared with one other classifier. Some studies employ a larger battery of classifiers for the comparison.
• Neural networks and classification trees are commonly used.
• Many practitioners present a favourable view of data mining techniques, while few offer insightful comments about the misuse and poor comparisons of these techniques.

Appropriateness of Data Mining for End-of-Life Research

In order to predict the location of death for a cancer patient, we need a tool that will help classify the patient into one of the predetermined categories. Beyond this classification, we want to determine the set of predictor variables that are most useful in correctly classifying the patient.

It is readily apparent why logistic regression has been widely used within this research. The outcome of the model is a probability that the person will die in a certain location, together with a set of useful explanatory variables. The results from logistic regression are easily put into an epidemiological framework through the use of the odds ratio.

The logistic regression models employed in location-of-death end-of-life research assume a linear model. No interaction terms are included in the models, nor are any higher order terms. Although the linear assumption provides simplicity of explanation, it may be an overly simplistic, and therefore inaccurate, representation of the underlying sociological phenomena [80].
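To make the linearity assumption concrete, the main-effects model and one possible extension can be written out. This is an illustrative sketch only: the symbols p (the probability of dying in a given location), x_j (the predictors), and the coefficients are notation introduced here, not taken from the reviewed articles, and the particular interaction and squared term shown are arbitrary examples.

```latex
\begin{align}
\log\frac{p}{1-p} &= \beta_0 + \sum_{j=1}^{k} \beta_j x_j
  && \text{(main effects only)} \\
\mathrm{OR}_j &= e^{\beta_j}
  && \text{(odds ratio for a one-unit increase in } x_j\text{)} \\
\log\frac{p}{1-p} &= \beta_0 + \sum_{j=1}^{k} \beta_j x_j
  + \beta_{12}\, x_1 x_2 + \gamma_1 x_1^2
  && \text{(with an interaction and a higher order term)}
\end{align}
```

In the first form each odds ratio has a single interpretation that holds regardless of the other predictors. In the third form the effect of x_1 depends on both x_2 and x_1 itself, so a single odds ratio for x_1 no longer summarizes it; this is the trade between flexibility and simplicity of explanation described above.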
If a researcher chooses to relinquish the preference for a linear model, then models which provide a high degree of flexibility may be required. This does not eclipse the use of logistic regression. The need for more complex models suggests that either more complex logistic regression models or other classification techniques should be used. As the current research is observational and uses large administrative databases to predict a health outcome, data mining techniques would be appropriate.

If a researcher prefers the logistic model, other data mining techniques should not be excluded from the model development process. They can be used to gain insights which could then be applied to the logistic model in terms of interactions or higher powers of predictors. A classifier could be used to find an interesting subset of the data, and logistic regression then used to understand that subset [85]. Furthermore, setting logistic regression in a classification context will allow researchers to gain an understanding of how it performs when compared with other techniques. This may be valuable in assessing the epidemiological utility of the model itself.

At this stage of the research, data mining is appropriate for understanding the location of death of cancer patients. Understanding the cultural and clinical predictors for where cancer patients die is predicated upon correctly classifying the place-of-death for British Columbian cancer patients. Once a classifier has been fit and assessed, the explanatory variables used for the classification can be identified and explored. By expanding the range of perspectives from which classification and exposition occur, a more complete understanding may emerge. By participating in the comparison of classification techniques, location-of-death research for cancer patients will emulate the current research occurring in many areas of cancer research.
Within this exploration, caution must be exercised, as novelty must not eclipse sound methodology.

Ensuring Model Generalization

One aspect of fitting classifiers that bears discussion is the assessment of how a classifier performs. When fitting a classifier, the method of assessment must be determined. In this thesis four methods will be used: misclassification error, the receiver operating characteristic curve, the area under the receiver operating characteristic curve, and the hit curve. These will give an indication of how the fitted classifier will perform when an independent data set is used. The measure of performance associated with an independent set of data is an indication of how well the model will generalize.

A naive approach is to use all the data to both fit and assess the model. When the classifier is fit to the data, it will adapt to some of the random noise in the data. When the same data is then used to assess the performance, the performance measure will give an overly optimistic impression of the fitted classifier's ability to generalize. A classifier can be tailored to fit the data if overly complex models, with interactions and higher order terms, are used without a good means of assessing their performance. This will result in over-fitting. Over-fit models often give an overly optimistic impression of performance, but this generally happens when the same data is used for both the fitting of the classifier and its assessment.

To overcome this problem, the data could be split into a training set used to fit the classifier and a test set used to assess the performance of the fitted classifier. In practice, this represents an inefficient use of data. Data is limited and typically costly, in terms of time and money, to obtain. One way to overcome this is a more sophisticated reuse of the data. Cross-validation is one such method. It is a non-analytic method used to approximate the test step and give an estimate of the performance of the classifier, and will be explained in chapter 5.
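The contrast between the naive assessment and cross-validation can be sketched in a few lines. The example below is a hedged illustration, not the procedure used in this thesis: the data are synthetic, the classifier is a one-nearest-neighbour rule chosen because its optimism is easy to see, and all function names, the five folds and the 0.5 threshold are assumptions made for the sketch. It reports two of the measures named above, misclassification error and the area under the ROC curve.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nn_score(train_x, train_y, x, neighbours=1):
    """Share of the nearest training points whose outcome is 1."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    nearest = order[:neighbours]
    return sum(train_y[i] for i in nearest) / len(nearest)


def misclassification(scores, labels, threshold=0.5):
    """Proportion of cases whose thresholded score disagrees with the label."""
    wrong = sum((s >= threshold) != (y == 1) for s, y in zip(scores, labels))
    return wrong / len(labels)


def auc(scores, labels):
    """Chance that a random positive case outscores a random negative one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_scores(xs, ys, k=5):
    """Score every case with a classifier fitted without that case's fold."""
    scores = [0.0] * len(xs)
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        for i in fold:
            scores[i] = nn_score(train_x, train_y, xs[i])
    return scores


# Synthetic two-predictor data; outcome 1 stands in for "died out of hospital".
rng = random.Random(1)
xs = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
xs += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(100)]
ys = [0] * 100 + [1] * 100

# Naive assessment: fit and assess on the same data.
apparent = [nn_score(xs, ys, x) for x in xs]
apparent_error = misclassification(apparent, ys)

# Cross-validated assessment: each case is scored by a model that never saw it.
cv = cross_validated_scores(xs, ys)
cv_error = misclassification(cv, ys)
cv_auc = auc(cv, ys)

print(f"apparent error: {apparent_error:.3f}")
print(f"5-fold CV error: {cv_error:.3f}, CV AUC: {cv_auc:.3f}")
```

Because each point is its own nearest neighbour, the apparent error of this classifier is zero by construction, which is exactly the overly optimistic impression described above; the cross-validated error is the more honest estimate of how the rule would generalize.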
Purpose

The investigation into end-of-life health service utilization for cancer patients is novel work in British Columbia. Beyond serving as a platform from which future investigations can proceed, this thesis serves several epidemiological and statistical purposes.
• To understand the implications for research of utilizing ecological variables.
• To fulfil one of the objectives of a Canadian Institutes of Health Research (CIHR) grant entitled "Palliative Care in cross-cultural context: A NET for equitable and quality cancer care for culturally diverse populations" through:
  o collaborative work with a Nova Scotia research team, and
  o seminal work to understand the place-of-death for British Columbian cancer patients.
• To extend the Nova Scotia work through the development of a novel approach for the construction of cultural indicators.
• To bring the topic of data mining into a new area of investigation by exploring the utility of data mining for end-of-life research.
• To identify areas where future epidemiological and statistical work can be done.
• To identify areas where departure from the Nova Scotia methodology must occur for British Columbian based research (e.g. characterizing predictors of place of death that apply to BC in particular).

Through these varied purposes an initial understanding of how health services are utilized at the end of life will be formed. As much of this research is part of a collaborative investigation, many of the variables will be based upon methodology established by the Nova Scotia research team. This thesis builds upon the work which has been started in Nova Scotia by exploring several novel applications of statistical methodology. The identification of predictors for the location of death will use both the Nova Scotia methods and data mining techniques, the latter being novel in this field of investigation.
Although the place-of-death is a gross measure of how end-of-life health services are used, it serves as a first step towards characterizing where British Columbians die and what the predictors are for those different locations. This information will allow patterns of service use among culturally diverse groups to be uncovered. As these populations are identified, remedial actions can be taken.

Context

In July 2004, the Canadian Institutes of Health Research (CIHR) approved the grant application entitled "Palliative Care in cross-cultural context: A NET for equitable and quality cancer care for culturally diverse populations". This five-year grant allows researchers in British Columbia, Nova Scotia, and Saskatchewan to focus upon three themes: access to health services, caregivers, and complementary and alternative therapies. The access theme seeks to understand the cultural constructs that influence the use of health services by cancer patients at the end of life. Not only does this involve the construction and interpretation of predictive models, but it is also concerned with the construction of cultural indicators to be used in the models and with the quality of the data used for the modelling process [96].

This thesis is part of the access theme as it relates to the development of cultural indicators and the determination of the cultural predictors for the location-of-death for British Columbians. The use of ecological variables for end-of-life research is considered. The cultural predictors of ethnicity, language and religion are developed utilizing novel methodology and supplement the SES dissemination-area-level measurement of income. The utility of these predictors for end-of-life research is determined through an investigation of the usefulness of data mining techniques for end-of-life research. As this is novel work for British Columbia, it establishes a base upon which future access work can be built.
Thesis Structure

This thesis has six chapters, four of which represent manuscripts to be submitted to journals. The first chapter is an overall introduction to the manuscripts. Brief descriptions of the remaining chapters follow.

Chapter 2 - Manuscript 1: Assessing the Necessity and Appropriateness of Mixed-Level Ecological Studies in End-of-Life Research.

The aim of this paper is to assess the necessity and appropriateness of ecological research in end-of-life studies. The two fallacies, ecological and individualistic, are presented. From this, integral, contextual and conceptual ecological variables are considered and a new nomenclature for variable specification is proposed. The proposed nomenclature indicates both the type of ecological variable and the type of spatial relationship that occurs between groups.

As the ecological variables to be used in the research are derived from the Canadian Census, the creation of ecological variables is considered. A prescriptive approach, which searches for a predetermined cultural aspect, such as French as the mother tongue, is the primary method used for developing dissemination-level census-based ecological variables. A descriptive approach is proposed as an alternative. With this approach, unsupervised learning techniques are used to group the dissemination areas in a meaningful, data-driven way. The groups are described based upon the features of the group. Group identity is discovered rather than imposed.

The reasonability of an ecological study is considered. It is determined that if a researcher is interested in exploring the underlying cultural constructs, then an ecological study is both appropriate and necessary. It is concluded that although Canadian investigation into the cultural indicators for the place-of-death for cancer patients is appropriate and necessary, overcoming this ecological critique may not be simple.
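The descriptive approach can be illustrated with a small clustering sketch. k-means is used here only as one example of an unsupervised technique, and the sketch omits any prior dimension reduction; the six areas, their two census proportions, the variable names and the starting centroids are all invented for illustration and are not taken from the thesis data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centroids, max_iter=100):
    """Lloyd's algorithm from given starting centroids.

    Returns (labels, centroids), where labels[i] is the cluster of points[i].
    """
    centroids = [tuple(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each area joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable, so stop
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    return labels, centroids


# Hypothetical dissemination areas: (share with mother tongue A, share with B).
areas = [
    (0.90, 0.05), (0.85, 0.10), (0.80, 0.10),
    (0.10, 0.80), (0.15, 0.75), (0.05, 0.90),
]

labels, centroids = k_means(areas, centroids=[areas[0], areas[3]])

# Describe each group by its centroid rather than by a label chosen in advance.
for j, centre in enumerate(centroids):
    size = labels.count(j)
    print(f"cluster {j}: {size} areas, mean shares = "
          f"({centre[0]:.2f}, {centre[1]:.2f})")
```

The prescriptive approach would instead start from a rule such as "share with mother tongue A above some cut-off"; in the sketch the groups emerge from the data and are only named afterwards by inspecting their centroids.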
Chapter 3 - Manuscript 2: Trends in the Place of Death of British Columbia Cancer Patients, 1997-2003.

As no prior work has been done in British Columbia to understand where cancer patients die, this study serves as a point of origin for such an investigation. As part of a collaborative effort with Nova Scotia, this study is a reproduction of the seminal study by Johnston et al. [10]. The goal was to understand the trends in the place of death from 1997 to 2003 for adult British Columbians who died of cancer. A corollary to this was to understand which factors were predictive of dying out of hospital. The variables investigated were the year of death, age, sex, region of residence, residence, income, tumour group and survival time.

Chapter 4 - Manuscript 3: Developing a Multivariate Defined Indicator of Culture: A Novel Synthesis of Unsupervised Learning Techniques.

Despite the variety of socioeconomic measures, few measures of culture exist. In Canada, researchers have employed a method by which binary cultural variables are constructed from univariate explorations of census-based data. Although this represents a first step in understanding the role of culture, an indicator of culture that is based upon the multivariate nature of the data will allow the subtle nuances of a cultural construct to be retained.

Chapter 5 - Manuscript 4: Identifying Socioeconomic, Demographic and Clinical Predictors for the Location of Death of Cancer Patients in British Columbia: A Data Mining Investigation.

In Canada, the evidence for socio-economic measures as predictors of health outcomes is mixed. Although a health gradient exists with respect to income, the general absence of income as a predictor for health service utilization is confusing. As Canada has a universal health system, it is thought that culture may play an important role in how health services are used.
Within this context, logistic regression has been the preferred modelling technique. Given that the research uses administrative health service databases which are large in terms of the number of patients and variables, a data mining perspective is proposed as an alternative analytic tool. This paper utilizes a data mining approach to compare five data mining techniques. It compares the performance of the techniques and draws health-related insights from them.

Chapter 6 - Concluding Chapter

This chapter is a summary across all the manuscripts. A brief discussion of the limitations, strengths, contributions and future recommendations follows.

References

1. Thomas, C., Morris, S.M., Clark, D. 2004. Place of Death: Preferences among Cancer Patients and their Carers. Social Science & Medicine 58: 2431-2444.
2. Carroll, D.S. 1998. An audit of place of death of cancer patients in a semi-rural Scottish practice. Palliative Medicine 12: 51-53.
3. Tang, S.T., McCorkle, R. 2003. Determinants of Congruence between the Preferred and Actual Place of Death for Terminally Ill Cancer Patients. Journal of Palliative Care 19(4): 230-237.
4. Gyllenhammar, E., Thoren-Todoulos, E., Strang, P., Strom, G., Eriksson, E., Kinch, M. 2003. Predictive Factors for Home Deaths among Cancer Patients in Swedish Palliative Home Care. Support Care Cancer 11: 560-567.
5. Tang, S.T. 2003. When Death Is Imminent: Where Terminally Ill Patients With Cancer Prefer to Die and Why. Cancer Nursing 26(3): 245-251.
6. Tang, S.T., McCorkle, R. 2001. Determinants of Place of Death for Terminally Ill Cancer Patients. Cancer Investigation 19(2): 165-180.
7. National Cancer Institute of Canada. Canadian Cancer Statistics 2005. Toronto (ON): The Institute; 2005.
8. Burge, F., Lawson, B., Johnston, G. 2003. Family Physician Continuity of Care and Emergency Department Use in End-of-Life Cancer Care. Medical Care 41(8): 992-1001.
9. Burge, F., Lawson, B., Johnston, G., Cummings, I.
2003. Primary care continuity and location of death for those with cancer. Journal of Palliative Medicine 6(6): 911-918.
10. Burge, F., Lawson, B., Johnston, G. 2003. Trends in the place of death of cancer patients, 1992-1997. CMAJ 168(3): 265-270.
11. Heyland, D., Lavery, J.V., Tranmer, J.E., Shortt, S.E.D. 2000. The Final Days: An Analysis of the Dying Experience in Ontario. Ann R Coll Physicians Surgeons Can. 33(6): 356-361.
12. Cardiff, K., Hsu, D., Kuhl, D. 1998. Utilization of Palliative Care Services in Vancouver: 1990-1993. Center for Health Services and Policy Research 98:13.
13. Menec, V., Lix, L., Steinbach, C., Ekuma, O., Sirski, M., Dahl, M., Soodeen, R. 2004. Patterns of health care use and cost at the end of life. Manitoba Centre for Health Policy.
14. Bruera, E., Sweeney, C., Russell, N., Willey, J.S., Palmer, J.L. 2003. Place of death of Houston Area Residents with Cancer over a Two-Year Period. Journal of Pain and Symptom Management 26(1): 637-643.
15. Feudtner, C., Silveira, M.J., Christakis, D.A. 2002. Where do Children with Complex Chronic Conditions Die? Patterns in Washington State, 1980-1998. Pediatrics 109(4): 656-660.
16. Grande, G.E., Addington-Hall, J.M., Todd, C.J. 1998. Place of Death and Access to Home Care Services: Are Certain Patient Groups at a Disadvantage? Social Science & Medicine 47(5): 565-579.
17. Klassen, A.C., Curriero, F.C., Hong, J.H., Williams, C., Kulldorff, M., Meissner, H.I., Alberg, A.A., Ensminger, M. 2004. The Role of Area-Level Influences on Prostate Cancer Grade and Stage at Diagnosis. Preventive Medicine 39: 441-448.
18. Krieger, N., Chen, J.T., Waterman, P.D., Rehkopf, D.H., Subramanian, S.V. 2003. Race/Ethnicity, Gender and Monitoring Socioeconomic Gradients in Health: A Comparison of Area-Based Socioeconomic Measures - The Public Health Disparities Geocoding Project. Public Health Matters 93(10): 1655-1671.
19. Krieger, N., Williams, D.R., Moss, N.E. 1997.
Measuring Social Class in US Public Health Research: Concepts, Methodologies, and Guidelines. Annual Review of Public Health 18: 341-378.
20. Merkin, S.S., Stevenson, L., Powe, N. 2002. Geographic Socioeconomic Status, Race and Advanced-Stage Breast Cancer in New York City. American Journal of Public Health 92(1): 64-70.
21. Shi, L., Macinko, J., Starfield, B., Politzer, R., Wulu, J., Xu, J. 2005. Primary Care, Social Inequalities, and All-Cause, Heart Disease, and Cancer Mortality in US Counties, 1990. American Journal of Public Health 95(4): 674-680.
22. Tang, S.T. 2003. Determinants of Hospice Home Care Use Among Terminally Ill Cancer Patients. Nursing Research 52(4): 217-225.
23. Ahlner-Elmqvist, M., Jordhoy, M.S., Jannert, M., Fayers, P., Kaasa, S. 2004. Place of Death: Hospital-Based Advanced Home Care versus Conventional Care. Palliative Medicine 18: 585-593.
24. Costantini, M., Balzi, D., Garronec, E., Parodi, S., Vercelli, M., Bruzzi, P. 2000. Geographic variations of place of death among Italian communities suggests an inappropriate hospital use in terminal phase of cancer disease. Public Health 114: 15-20.
25. Fukui, S., Fukui, N., Kawagoe, H. 2004. Predictors of Place of Death for Japanese Patients with Advanced-Stage Malignant Disease in Home Care Settings: A Nationwide Survey. Cancer 101(2): 421-429.
26. Grundy, E., Mayer, D., Young, H., Sloggett, A. 2004. Living Arrangements and Place of Death of Older People with Cancer in England and Wales: A Record Linkage Study. British Journal of Cancer 91: 907-912.
27. Higginson, I.J., Thompson, M. 2003. Children and Young People Who Die from Cancer: Epidemiology and Place of Death in England (1995-9). British Medical Journal 327: 478-479.
28. Van den Eynden, B., Hermann, I., Schrijvers, D., Van Royen, P., Maes, R., Vermeulen, L., Herweyers, K., Smits, W., Verhoeven, A., Clara, R., Denekens, J. 2000. Factors Determining the Place of Palliative Care and Death of Cancer Patients.
Support Care Cancer 8: 59-64.
29. Gorey, K.M., Holowaty, E.J., Fehringer, G., Laukkanen, E., Richter, N.L., Meyer, C.M. 2000. An International Comparison of Cancer Survival: Metropolitan Toronto, Ontario, and Honolulu, Hawaii. American Journal of Public Health 90(12): 1866-1872.
30. Roos, L.L., Magoon, J., Gupta, S., Chateau, D., Veugelers, P.J. 2004. Socioeconomic Determinants of Mortality in Two Canadian Provinces: Multilevel Modelling and Neighbourhood Context. Social Science & Medicine 59: 1435-1447.
31. Veugelers, P.J., Yip, A.M., Kephart, G. 2001. Proximate and Contextual Socioeconomic Determinants of Mortality: Multilevel Approaches in a Setting with Universal Health Care Coverage. American Journal of Epidemiology 154(8): 725-732.
32. Robert, S.A., Strombom, I., Trentham-Dietz, A., Hampton, J.M., McElroy, J.A., Newcomb, P.A., Remington, P.L. 2004. Socioeconomic Risk Factors for Breast Cancer: Distinguishing Individual- and Community-Level Effects. Epidemiology 15(4): 442-450.
33. Health Canada. http://www.hc-sc.gc.ca/english/care/index.html, June 6, 2005.
34. British Columbia Cancer Registry: Health Professionals Info - BC Cancer Statistics. 2005, August 3. http://www.bccancer.bc.ca/HPI/CancerStatistics/default.htm#bccanreg
35. British Columbia Cancer Registry: Definitions and Technical Notes - Data Quality Indicators. 2005, August 3. http://www.bccancer.bc.ca/HPI/CancerStatistics/Defns/DataQuality/default.htm
36. Mackinnon, M.J., Glick, N. 1999. Data mining and knowledge discovery in databases - an overview. Australian & New Zealand Journal of Statistics 41(3): 255-275.
37. Altman, D.G. 2000. Statistics in medical journals: some recent trends. Statistics in Medicine 19: 3275-3289.
38. Hastie, T., Tibshirani, R., Friedman, J. 2001. The Elements of Statistical Learning. New York: Springer-Verlag.
39. Johnson, R.A., Wichern, D.W. 2002. Applied Multivariate Statistical Analysis. Fifth Edition. Upper Saddle River, New Jersey: Prentice-Hall.
40. Dybowski, R., Gant, V. 1995. Artificial neural networks in pathology and medical laboratories. The Lancet 346: 1203-1207.
41. Rudman, W.J., Brown, C.A., Hewitt, C.R., Carpenter, W.O., Campbell, B., Tubb, T., Nobel, S.L. 2002. The use of data mining tools in identifying medication error near misses and adverse drug events. Top Health Inform Management 23(2): 94-103.
42. Tu, J.V. 1996. Advantages and disadvantages of using artificial neural networks versus logistic regression for predicting medical outcomes. Journal of Clinical Epidemiology 49(11): 1225-1231.
43. Auer, D.P., Benno, P., Christoff, G., Elbel, G., Gasser, T., Dichgans, M. 2001. Differential lesion patterns in CADASIL and sporadic subcortical arteriosclerotic encephalopathy: MR imaging study with statistical parametric group comparison. Radiology 218: 443-451.
44. Dolk, H., Busby, A., Armstrong, B.G., Walls, P.H. 1998. Geographical variation in anophthalmia and microphthalmia in England, 1988-94. British Medical Journal 317: 905-910.
45. Brown-Ewing, L.J., Finklestein, S.M., Budd, J.R., Rich, S., Kujawa, S.J., Wielinski, C.L., Warwick, W.J. 1988. A rule for early detection of chronic changes in cystic fibrosis patient status. Journal of Clinical Epidemiology 41(9): 915-922.
46. Ennis, M., Hinton, G., Naylor, D., Revow, M., Tibshirani, R. 1998. A comparison of statistical learning methods on the GUSTO database. Statistics in Medicine 17: 2501-2508.
47. Gilpin, E., Olshen, R., Henning, H., Ross, J. 1983. Risk prediction after myocardial infarction. Cardiology 70: 73-84.
48. Goodwin, L.K., Iannacchione, M.A., Hammond, W.E., Crockett, P., Maher, S., Schlitz, K. 2001. Data mining methods find demographic predictors of preterm birth. Nursing Research 50(6): 340-345.
49. Goodwin, L.K., Iannacchione, M.A. 2002. Data mining methods for improving birth outcomes prediction. Outcomes Management 6(2): 80-85.
50. Gottrup, C., Thompson, K.
, Locht, P., Wu, O., Sorensen, A.G., Koroshetz, W.J., Ostergaard, L. 2005. Applying instance-based techniques to prediction of final outcome in acute stroke. Artificial Intelligence in Medicine 33(3): 223-236.
51. Long, W.J., Griffith, J.L., Selker, H.P., D'Agostino, R.B. 1993. A comparison of logistic regression to decision-tree induction in a medical domain. Computers and Biomedical Research 26: 74-97.
52. Thwaites, G.E., Chau, T.T.H., Stepniewska, K., Phu, N.H., Chuong, L.V., Sinn, D.X., White, N.J., Parry, C.M., Farrar, J.J. 2002. Diagnosis of adult tuberculous meningitis by use of clinical and laboratory features. The Lancet 360: 1287-1292.
53. Werneck, G.L., de Carvalho, D., Barroso, D.E., Cook, E.F., Walker, A.M. 1999. Classification trees and logistic regression applied to prognostic studies: a comparison using meningococcal disease as an example. Journal of Tropical Paediatrics 45: 248-251.
54. Wong, S.M., Griffith, J.F., Hui, A.C.F., Lo, S.K., Fu, M., Wong, K.S. 2004. Carpal tunnel syndrome: diagnostic usefulness of sonography. Radiology 232: 93-99.
55. Carmelli, D., Halpern, J., Swan, G.E., Dame, A., McElroy, M., Gelb, A.B., Rosenman, R.H. 1991. 27-year mortality in the western collaborative group study: construction of risk groups by recursive partitioning. Journal of Clinical Epidemiology 44(12): 1341-1351.
56. Friedman, J.H., Meulman, J.J. 2003. Multiple additive regression trees with application in epidemiology. Statistics in Medicine 22: 1365-1381.
57. Hammad, T.A., Abel-Wahab, M.F., Declaris, N., El-Sahly, A., El-Kady, N., Strickland, G.T. 1996. Comparative evaluation of the use of artificial neural networks for modelling the epidemiology of schistosomiasis mansoni. Transactions of the Royal Society of Tropical Medicine and Hygiene 90: 372-376.
58. Huang, S.S., Finkelstein, J.A., Rifas-Shiman, S.L., Kleinman, K., Platt, R. 2004.
Community-level predictors of pneumococcal carriage and resistance in young children. American Journal of Epidemiology 159(7): 645-654.
59. Luo, Z., Kierans, W.J., Wilkins, R., Liston, R.M., Mohamed, J., Kramer, M.S. 2004. Disparities in birth outcomes by neighbourhood income: temporal trends in rural and urban areas, British Columbia. Epidemiology 15(6): 679-686.
60. Nelson, L.M., Bloch, D.A., Longstreth, W.T., Shi, H. 1998. Recursive partitioning for the identification of disease risk subgroups: a case-control study of subarachnoid hemorrhage. Journal of Clinical Epidemiology 51(3): 199-209.
61. Nguyen, T., Malley, R., Inkelis, S.H., Kuppermann, N. 2002. Comparison of prediction models for adverse outcome in pediatric meningococcal disease using artificial neural network and logistic regression analyses. Journal of Clinical Epidemiology 55: 687-695.
62. Baker, J.A., Kornguth, P.J., Lo, J.Y., Williford, M.E., Floyd, C.E. 1995. Breast cancer: prediction with artificial neural network based on BI-RADS Standardized Lexicon. Radiology 196: 817-822.
63. Baker, J.A., Kornguth, P.J., Lo, J.Y., Floyd, C.E. 1996. Artificial neural network: improving the quality of breast biopsy recommendations. Radiology 198: 131-135.
64. Baxt, W.G. 1995. Application of artificial neural networks to clinical medicine. The Lancet 346: 1135-1138.
65. Biagiotti, R., Desii, C., Vanzi, E., Gacci, G. 1999. Predicting ovarian malignancy: application of artificial neural networks to transvaginal and color Doppler flow US. Radiology 210: 399-403.
66. Buller, D., Buller, A., Innocent, P.R., Pawlak, W. 1996. Determining and classifying the region of interest in ultrasonic images of the breast using neural networks. Artificial Intelligence in Medicine 8: 53-66.
67. Burke, H.B., Goodman, P.H., Rosen, D.B., Henson, D.E., Weinstein, J.N., Harrell, F.E., Marks, J.R., Winchester, D.P., Bostwick, D.G. 1997. Artificial neural networks improve accuracy of cancer survival prediction.
Cancer 79(4): 857-862.
68. Chen, D., Chang, R., Huang, Y. 1999. Computer-aided diagnosis applied to US of solid breast nodules by using neural networks. Radiology 213: 407-412.
69. Delen, D., Walker, G., Kadam, A. 2005. Predicting breast cancer survivability: a comparison of three data mining methods. Artificial Intelligence in Medicine 34(2): 113-127.
70. Huo, Z., Giger, M.L., Olopade, O.I., Wolverton, D.E., Weber, B.L., Metz, C.E., Zhong, W., Cummings, S.A. 2002. Computerized analysis of digitized mammograms of BRCA1 and BRCA2 gene mutation carriers. Radiology 225: 519-526.
71. Fogel, D.B., Wasson, E.C., Boughton, E.M. 1995. Evolving neural networks for detecting breast cancer. Cancer Letters 96: 49-53.
72. Janes, H., Pepe, M., Kooperberg, C., Newcomb, P. 2005. Identifying target populations for screening or not screening using logistic regression. Statistics in Medicine 24(9): 1321-1338.
73. Jerez-Aragones, J.M., Gomez-Ruiz, J.A., Ramos-Jimenez, G., Munoz-Perez, J., Alba-Conejo, E. 2003. A combined neural network and decision trees model for prognosis of breast cancer relapse. Artificial Intelligence in Medicine 27: 45-63.
74. Lo, J.Y., Baker, J.A., Kornguth, P.J., Iglehart, J.D., Floyd, C.E. 1997. Predicting breast cancer invasion with artificial neural networks on the basis of mammographic features. Radiology 203: 159-163.
75. Markey, M.K., Lo, J.Y., Floyd, C.E. 2002. Differences between computer-aided diagnosis of breast masses and that of calcifications. Radiology 223: 489-493.
76. Nakamura, K., Yoshida, H., Englemann, R., MacMahon, H., Katsuragawa, S., Ishida, T., Ashizawa, K., Doi, K. 2000. Computerized analysis of the likelihood of malignancy in solitary pulmonary nodules with use of artificial neural networks. Radiology 214: 823-830.
77. Nattkemper, T.W., Arnich, B., Lichte, O., Timm, W., Degenhard, A., Pointon, L., Hayes, C., Leach, M.O. 2005.
Evaluation of radiological features for breast tumour classification in clinical screening with machine learning methods. Artificial Intelligence in Medicine 34(2): 129-139. 78. Ronco, A . L . , 1999. Use of artificial neural networks in modeling associations of discriminant factors: towards an intelligent selective breast cancer screening. Artificial Intelligence in Medicine 16:299-309. 79. Sboner, A. , Eccher, C , Blanzieri, E., Bauer, P., Cristofolini, M . , Zumiani, G., Forti, S., 2003. A multiple classifier system for early melanoma diagnosis. Artificial Intelligence in Medicine 27: 29-44. 80. Scwarzer, G., Vach, W., Schumacher, M . , 2000. On the misuses of artificial neural networks for prognostic and diagnostic classification in oncology. Statistics in Medicine 19: 541-561. 20 81. Selaru, F .M. , Yin , J., Olaru, A. , Mori, Y . , Xu , Y . , Epstein, S.H., Sato, F., Deacu, E., Wang, S., Sterian, A. , Fulton, A. , Abraham, J.M., Shibata, D., Baquet, C , Stass, S.A., Meltzer, S.J., 2004. An unsupervised approach to identify molecular phenotypic components influencing breast cancer features. Cancer Research 64: 1584-1588. 82. Setiono, R., 1996. Extracting rules from pruned neural networks for breast cancer diagnosis. Artificial Intelligence in Medicine 8: 37-51. 83. Setiono, R., 2000. Generating concise and accurate classification rules for breast cancer diagnosis. Artificial Intelligence in Medicine 18: 205-219. 84. West, D., West, V . , 2000. Model selection for a medical diagnosis decision support system: a breast cancer detection case. Artificial Intelligence in Medicine 20: 183-204. 85. Abu-Hanna, A. , de Keizer, H. , 2003. Integrating classification trees with local logistic regression in intensive care prognosis. Artificial Intelligence in Medicine 29: 5-23. 86. Altman, D.G., 2000. What do we mean by validating a prognostic model? Statistics in Medicine 19:453-473. 87. Johnston, G.M. , Boyd, C.J., Joseph, P., Maclntyre, M . , 2001. 
Variation in delivery of palliative radiotherapy to persons dying of cancer in Nova Scotia, 1994 to 1998. Journal of Clinical Oncology 19(14): 3323-3331. 88. Kong, L. , Milbandt, E.B., Weissfeld, L .A. , 2004. Advances in statistical methodology and their application in critical care. Current Opinion in Critical Care 10: 391-394. 89. Ottenbacher, K.J . , Smith, P .M. , Illig, S.B., Linn, R.T., Fiedler, R.C., Granger, C.V., 2001. Comparison of logistic regression and neural networks to predict rehospitalization in patients with stroke. Journal of Clinical Epidemiology 54: 1159-1165. 90. Wilkins, R., Berthelot, J., Ng, E., 2002. Trends in mortality by neighbourhood income in urban Canada from 1971 to 1996. Supplement to Health Reports, 13: 1-27. 91. Terrin, N . , Schmid, C.H., Griffith, J.L., D'Agostino, R.B., Sr, Selker, H.P., 2003. External Validity of predictive models: a comparison of logistic regression, classification trees, and neural networks. Journal of Clinical Epidemiology. 56:721-729. 21 92. Menec, V. , Lix, L., Steinbach, C , Ekuma, O., Sirshi, M . , Dahl, M . , Soodeen, R., 2004. Patterns or health care use and cost at the end of life. Manitoba Centre for Health Policy. 93. Perez, C.E., 2002. Health status and health behaviour among immigrants. Supplement to Health Reports 13: 89-98 94. Ng, E., Wilkins, R., Fung, M.F.K. , Berthelot, M . , 2004. Cervical cancer mortality by neighbourhood income in urban Canada from 1971 to 1996. Canadian Medical Association Journal 170(10): 1545-1549. 95. Johnston, G.M. , Boyd, C.J., Joseph, P., Maclntyre, M . , 2001. Variation in delivery of palliative radiotherapy to persons dying of cancer in Nova Scotia, 1994 to 1998. Journal of Clinical Oncology 19(14): 3323-3332. 96. CIHR: NET "Palliative Care in a cross-cultural context: A NET for equitable and quality cancer care for culturally diverse populations" Doll R, Kazanjian A , Barroetavena M C , Johnston G., Leiss Anne, Fyles G. 2004. 
Introduction

In a society that creates and stores vast amounts of data every day, it seems peculiar that many researchers face the problem of insufficient data for their research. For many, the omission of socioeconomic and demographic information from a data set presents a serious challenge, especially when this information is believed to affect the outcome being studied. Within the context of cancer research, much information is collected about the disease, yet little is collected about the patient and the patient's social context. In many clinical situations these factors are of little interest, but they are of interest when studying how cancer patients use health resources at the end of their life.

Collecting the desired information about a patient's socioeconomic position can be a lengthy and expensive process, if it can be done at all. Given these challenges, many investigators have turned their attention to census-based information. Although this is a pragmatic solution, it must be asked whether the choice of census information is necessary and appropriate for the desired research. The aim of this paper is to assess the necessity and appropriateness of ecological research in end-of-life studies.

Census Data as an Ecological Measure

Current population-based research within Canada on the place of death of cancer patients strives to describe the socioeconomic factors that determine the location of death for a cancer patient [12,16,20]. By understanding these factors, researchers will be able to address the desires of those who are in the final stages of life and identify potential areas where end-of-life services are not connected with the surrounding people and culture [17,18,19,32]. To achieve these goals, investigators are associating aggregate demographic and socioeconomic information obtained from census data with cancer registry data.
In Canada, end-of-life research pertaining to the place of death uses dissemination area (DA) level data from Statistics Canada as a reasonable substitute for individual socioeconomic information [12,20]. The dissemination area is a small aggregation of the population for which information is made public; it is a compact set of blocks with a target population of 500 people [12]. As this is the smallest area for which information is available, it has served as the geographic region from which a characteristic is imputed to each person in that locale. The use of such data implies that the study will be an ecological one. The researcher must therefore clearly understand not only the reasons for using such a design and its limitations, but also the role the data takes within that design.

Balancing the Two Fallacies

A researcher involved in an ecological study can encounter two types of fallacy: the ecological fallacy and the individualistic fallacy. In many situations the risk of committing either fallacy is minimal, but in place of death research the investigator risks committing both. A tension results from balancing the use of aggregate data, such as census level data, against the desire for individual level data.

The ecological fallacy is the inference of phenomena from the group to the individual level. Simply, it is an individual-level conclusion based upon aggregate data. Three subsequent fallacies are related to the ecological one: that individual level models are better specified, that ecological correlations are always substitutes for individual ones, and that aggregate level variables do not cause disease [3].

In contrast to this concern is the individualistic fallacy, which is the notion that individual level data is sufficient for epidemiological studies [3,6,7,12]. This suggests that an individual-level analysis is able to capture the group dynamics of a health related phenomenon [8].
Simply, this is an aggregate-level conclusion based upon a generalization from individual-level data.

A tension emerges for any researcher who seeks to understand population health. In end-of-life research pertaining to the place of death of cancer patients, this tension is exacerbated by having predictors which are both individual-level and group-level. Inference in either direction may lead to one of the two fallacies. For cancer research into the place of death, this problem is further compounded by the inability to obtain individual-level demographic and socioeconomic data. The data that is used is derived from the census and is only available at an aggregate level. If the researcher conceives of the investigation in terms of individual-level data and uses census data because it is easily obtained, then participation in these fallacies may be inevitable, as the individual-level specified model has conceptual precedence over any group-level analysis. In this context, the census data serves as a proxy for the unattainable individual level socioeconomic data.

As suggested, the use of census data as a proxy causes the researcher to participate in this tension. The desire is for individual level data, but group level data is used as a sufficient substitute. The group level data is the second choice, resulting in the subtle implication that it is an inferior type of data. If this conceptual bias is not made explicit, then a lucid interpretation may be eclipsed. Not only will the investigator risk fallacy, but the reader will also encounter a similar risk. For example, if the reader believes that a particular study is only credible with individual level data, the reader will suspect any results of ecological studies, even if the investigator has shown that the constructs of interest are best served by the group-level variables.
By making explicit the conceptual structure of the variables and the study, and the tenability of the investigation, the researcher gives the reader a tool by which unrealized notions can be made explicit and aptly addressed. This denotative process will help sieve out unwarranted criticism, thus enabling a constructive dialogue to emerge. By clearly specifying the type of variable, the construct for which an ecological measure is being used, the type of study and the reason for the design, the investigator and reader will have the tools to avoid fallacy and clearly present the strengths and weaknesses of the investigation.

Understanding the Variables, Study, and Design

Types of variables

Many researchers will be familiar with the four levels of measurement which can be used to describe individual level data: nominal, ordinal, interval, and ratio. For group level data, several descriptions have been suggested. Susser proposes two added dimensions: integral and contextual (Table 2-1) [8,4,10].

An integral variable "affects all or virtually all members of a group". This is a non-derived variable and is a measure of some pre-existing condition pertaining to the group. An example may be the number of hospitals in a well defined geographic region or the presence of a breast screening campaign. This measure applies uniformly to the entire group, thus implying no within-group variation. If within-group variation does occur, it is expected to be small. This implies that this type of variable will have larger variation between groups than within groups.

Table 2-1: Comparison between Susser's variables and Macintyre-Duncan spatially based variables

Integral
  Susser [8]: A variable which affects all or virtually all members of a group. Expect to have high variability between groups and low variability within groups, e.g. number of hospitals in a region, number of doctors in a region.
  Macintyre-Duncan [11]: (no corresponding variable)
Contextual
  Susser [8]: A variable derived from a measured attribute of individuals within each group, e.g. mean, median, proportion.
  Macintyre-Duncan [11]: Aggregate effect of social, cultural, and environmental characteristics. Similar types of people will have different experiences in different neighbourhoods.
Compositional
  Susser [8]: (no corresponding variable)
  Macintyre-Duncan [11]: Aggregate of all individual characteristics in a neighbourhood. Similar types of people will have similar experiences no matter where they live.

A contextual variable "characterizes the group and not the individual" (Table 2-1) [8]. This measure is derived from individual level measurement. This characterization of the group cannot be applied uniformly to all members of the group, as within-group variation will exist. Larger internal variation than for the integral variable is implied. This measure can be associated with a distribution, which will allow for a dynamic characterization of the group. For a contextual variable, there may be no a priori assumption about how distributions will vary across groups.

Two additional types of variables have been suggested: compositional and contextual [11]. These variables convey a spatial aspect of the data which is relevant to much of the research that utilizes ecological variables. The compositional variable is defined as the "aggregate of all individual characteristics in a neighbourhood, that is, similar types of persons will have similar illness experiences no matter where they live". The compositional variable can be integral, in that the variable does not vary with geography, or it can be contextual, in that it is a derived measure which does not vary with geography.

In contrast to the compositional variable is the unfortunately named contextual variable. This is "the aggregate effect of social, cultural and environmental characteristics of the neighbourhood", or otherwise stated, similar types of people having illness experiences dependent upon where they live [10,11].
As with the compositional variable, the contextual variable can be either integral or contextual in Susser's terms.

For research into the cultural constructs that predicate health outcomes, a nomenclature can be established that describes not only the type of variables but also their geographic aspects. Four possible labels emerge: integral-compositional, integral-contextual, contextual-compositional and contextual-contextual (Table 2-2). The structure is illustrated as follows.

The integral-compositional variable is one where the groups have a measure which applies uniformly to the group and which varies with respect to location. An example of this would be the number of doctors in a dissemination area. A dissemination area would have a single count, and it could be conjectured that all people in that dissemination area would have access to these doctors. As medical professionals tend to be concentrated in certain locations, it is reasonable to posit that this measure would not be uniform over all dissemination areas. These measures would have little to no variation within the geographically defined group (dissemination area), but would have higher variation between dissemination areas. This suggests that the group is homogeneous with respect to the variable of interest and that similar people in different groups have similar health outcomes (e.g. access to health services).

Table 2-2: Table for developing a new nomenclature by which to specify predictor and response variables

Susser Integral
  Macintyre-Duncan Contextual: Groups have uniform measure that does not vary with location
  Macintyre-Duncan Compositional: Groups have uniform measure which varies with location
Susser Contextual
  Macintyre-Duncan Contextual: Groups have derived measure (or associated distribution), which does not vary with location
  Macintyre-Duncan Compositional: Groups have derived measure (or associated distribution), which does vary with location

The contextual-compositional variable has a derived measure or associated distribution, which does vary with location.
This suggests that there would be variation within groups and between groups. It suggests that the group is heterogeneous (has an associated probability distribution) with respect to the variable of interest and that similar people in different groups will have different health outcomes. The income for a dissemination area would be an example of this. A dissemination area in the Point Grey community of Vancouver will have an associated income distribution, as would a dissemination area in the Downtown East Side area. The income distributions of a wealthy area and an economically frustrated one will be different. As well, variation between the two areas is expected.

By using the nomenclature derived, specification of group-level variables can occur. This specification will assist the researcher in understanding the construction of the variable and any underlying constructs that may be associated with it. It will also assist the reader in coming to understand how the researcher has understood and used the variables. Through this, both the researcher and the reader will move towards a common interpretation.

Prescriptive versus Descriptive Construction of Culture Indicators

Understanding the group level data will assist the researcher towards a clear understanding of the underlying social construct that is being measured. There will be cases where the independent variable is clearly contextual, but due to methodological limitations, the distributional property of the variable cannot be utilized.

Consider the situation for the culture indicator of religion. Within any well defined geographic region, it is reasonable to posit that not all people will be characterised by the same religious belief. The religious practice of a small region can be best represented by a probability distribution, specifically a multinomial distribution. If this were incorporated into the model, then the model would allow all religions in the area to influence the outcome.
This would be a contextual-compositional variable, as the proportions for each area would have locative variations.

Suppose that there are three major religions, A, B, and C, being practiced, each of which has a strong presence. There are other religions, but their presence is not predominant. Due to methodological constraints, five categories are defined. If any religion has at least 50% of the population as adherents, then the region is described as either A, B, or C. If a region exists where at least two religions each have at least 30% of the people describing themselves as adherents, then the region is described as being "Religiously Mixed" (M). If a region does not satisfy any of these, then it is called "Indeterminate" (I).

The new religion variable has five categories. Each region is no longer associated with the appropriate probability distribution for religions in that region, but instead with a descriptor of some predetermined structure imposed by the researcher. For categories A, B, and C, the region is described by the dominant religious influence. All subtlety of measurement is lost. For example, a region with 55% of its population adhering to religion A and a region with 95% of its population adhering to religion A are now equivalent with respect to their designation. Arguments can be made for the equivalence of the two regions, but these would be based upon the notion that all regions with a majority of adherents to a specific religion are equivalent, which negates other influences. In terms of the specification, this would be an integral-contextual variable. All people in a dissemination area would be characterised as coming from a type of community for which there is no individual antecedent.

Two concerns emerge with prescribing the structure of a culture indicator. The first, as described above, is that all subtlety is lost by the imposition of such cut-points.
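
The five-category rule described above can be sketched in a few lines. This is an illustrative sketch only: the function name, the dictionary input, and the handling of an exact 50/50 tie (the religion listed first wins) are assumptions introduced here, not part of the study.

```python
def classify_region(shares, majority=0.50, mixed=0.30):
    """Assign a prescriptive religion label to one region.

    shares: dict mapping each major religion (e.g. "A", "B", "C") to the
    proportion of the region's population describing themselves as adherents.
    The function name and input format are hypothetical.
    """
    # Rule 1: a major religion with at least `majority` of the population.
    top = max(shares, key=shares.get)
    if shares[top] >= majority:
        return top
    # Rule 2: at least two religions each with at least `mixed` of the population.
    if sum(1 for share in shares.values() if share >= mixed) >= 2:
        return "M"  # "Religiously Mixed"
    # Rule 3: everything else.
    return "I"  # "Indeterminate"


# Two very different regions receive the same designation.
moderate_majority = classify_region({"A": 0.55, "B": 0.25, "C": 0.10})
strong_majority = classify_region({"A": 0.95, "B": 0.03, "C": 0.01})
```

Both calls return the label for religion A, which is the loss of subtlety noted above: the cut-point discards the difference between a 55% and a 95% majority.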
By investigating a small set of viable cut-points premised upon a univariate exploration of the data, the researcher will negate the subtle influences that are produced by the interplay of cultural forces within a small well defined geographic region. The second is the role of the researcher in defining the culture indicators. The data is explored based upon a predetermined understanding of how culture is to be defined. By prescribing an operational definition of culture upon the data, the researcher assumes a position of privileged cognitive access with respect to defining a cultural construct.

With the current approach, the type of measurement is prescribed. By imposing structure upon the data, it is more difficult to ascertain the intent of the researcher. Both proxies and cultural constructs could be obtained through such prescriptive means. This means that the variables could be integral or contextual in Susser's terms. Without specification, the reader has no indication of the intent of the researcher and is left in an interpretive quandary.

By shifting the development of culture indicators from a prescriptive stance to a descriptive one, clarity can be obtained. Developing the culture indicators as a result of a multivariate exploration of the data will result in groups that are naturally defined through the data itself. By allowing the data to drive the construction of the groups, the researcher will be creating an indicator of an underlying cultural construct rather than a proxy.

Univariate constructions of culture indicators suggest that a single aspect is important independent of all other aspects associated with that cultural construct. For example, the use of French as mother tongue suggests that this aspect of the mother tongue census characteristic is the defining aspect, apart from all other languages that could occur [16].
It suggests the search for French speaking people rather than the search for a full understanding of the mother tongue languages that are influential in a region. The multivariate descriptive approach allows all the languages to influence the eventual description of the underlying cultural construct. Researchers do not seek out speakers of a particular language with this approach. They allow all the languages to work together to obtain groups. In the case of end-of-life research, the result will be groups of dissemination areas. These groups will be based upon similarities across all languages, not just one particular language or language group (e.g. official languages versus non-official languages). The best descriptor of the group is the associated distribution of languages, but evocative names can be assigned to assist communication.

By constructing indicators of culture by this means, contextual-compositional and contextual-contextual variables would be constructed using census based data. On a conceptual level, the researcher relinquishes any position of privilege to the data and allows the data to determine the underlying structure of the cultural construct.

The point at which the culture indicator is identified may appear trivial, but it has an effect upon the nature of the construct being considered. As suggested earlier, prescriptive approaches will leave the nature of the indicator in an ambiguous state if it is not explicitly stated by the researcher. When utilising descriptive approaches, a clear indicator for the underlying cultural construct will be developed. The indicator is premised upon a multivariate distribution, thus it will be more natural to develop a culture indicator for the underlying cultural construct than a simple proxy. The indicator would be a succinct representation of a complex underlying multivariate description of a cultural construct.
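
A descriptive construction of this kind can be illustrated with a small clustering sketch. The abstract names principal component analysis and k-means as the basis for the culture indicators; the code below shows only a bare k-means step (Lloyd's algorithm) and omits any principal component step. The mother-tongue proportions, the three language columns, and the choice of starting centres are hypothetical values chosen for illustration, not data from the study.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Plain Lloyd's algorithm; `centroids` supplies the starting centres."""
    centroids = [list(c) for c in centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each area joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old centre if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical mother-tongue proportions (English, Chinese, Punjabi)
# for six dissemination areas.
areas = [
    (0.85, 0.10, 0.05),
    (0.80, 0.15, 0.05),
    (0.30, 0.60, 0.10),
    (0.25, 0.65, 0.10),
    (0.35, 0.10, 0.55),
    (0.30, 0.05, 0.65),
]
labels, centroids = kmeans(areas, centroids=[areas[0], areas[2], areas[4]])
```

Each resulting group is described by its centroid, that is, by a full distribution over the languages rather than by the presence or absence of one language. Any name attached to a group afterwards is a communication aid, as the text notes.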
By using prescriptive rather than descriptive approaches to developing the culture indicators, a trade-off occurs. If the categories are pre-specified, majority-minority relationships could evolve, resulting in unexpected social perceptions. The descriptive approach can also be used in a negative manner, as names must be attached to each of the derived distributions to ease discussion, but this is best left to a discussion concerning ethical research. Although, from the investigative perspective, this is a semantic issue, it must not be neglected.

Through the specification of the type of variables that are being used, the use of prescriptive or descriptive approaches, and the identification of the true construct being measured, researchers will be assisted in the development, presentation, interpretation and application of the research while recognizing its limitations.

Types of Studies

As with the variables, when ecological variables are used, an extra dimension must be considered which goes beyond understanding the modelling procedures used for more routine epidemiological studies. This aspect incorporates the ways in which the levels of variables can be combined. Following Susser's notation, let x and y represent the individual level independent and dependent variables respectively, and let X and Y represent the group level independent and dependent variables respectively [8]. An unmixed study is defined to be one where the relationship is defined on the same level, that is, from x to y or from X to Y [8]. A mixed study is defined as one where the inference crosses from one level to another, from x to Y or from X to y [8]. In the Canadian context of studying the place of death of cancer patients, the relationship is from {x, X} to y, where {x, X} is the set of individual and ecological independent variables. For completeness, a study of the form {x, X} to Y is possible.
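
To make these mappings concrete, they can be written as regression models. The following is a sketch under stated assumptions: the index i for individuals and j for areas, the logistic link for a binary individual outcome, the linear form for the group outcome, and the coefficient symbols are all introduced here for illustration and are not taken from Susser or from the models fitted in this work.

```latex
\begin{align*}
\text{unmixed } (x \to y):\quad
  & \operatorname{logit} P(y_{ij} = 1) = \beta_0 + \beta_1 x_{ij} \\
\text{unmixed } (X \to Y):\quad
  & Y_j = \gamma_0 + \gamma_1 X_j + \varepsilon_j \\
\text{mixed } (X \to y):\quad
  & \operatorname{logit} P(y_{ij} = 1) = \beta_0 + \beta_2 X_j \\
\text{both levels } (\{x, X\} \to y):\quad
  & \operatorname{logit} P(y_{ij} = 1) = \beta_0 + \beta_1 x_{ij} + \beta_2 X_j
\end{align*}
```

In this notation every individual i living in area j shares the same value X_j, which is how a census characteristic of a dissemination area enters a model whose response is recorded for each patient.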
With such mappings specified, it is apparent that any guideline for interpretation based solely on Susser's supposition is not immediately transferable to this situation unless the individual level predictor variables or the ecological variables do not contribute to the parsimonious model. These special cases reduce to either an unmixed relationship, if the group level variables are not significant, or a mixed relationship, if the individual variables are not present in the final model.

When the model does not reduce to one of Susser's relationships, the interpretation must contain aspects of both the individual and the group. If the set {x, X} is found to be predictive of y, then the investigator would have information that individuals with characteristic x from areas defined by X have result y. When this structural relationship occurs, lucid interpretation may be precluded. Individual level interpretations will force the researcher to commit the ecological fallacy, yet group level interpretations will result in the individualistic fallacy. When this is encountered, the common interpretation is of the form: individuals with characteristic x from an area defined as X have result y [9,13,10,14].

Morgenstern opposes the common interpretation with his concept of cross-level bias [5]. Cross-level bias is the sum of the aggregation bias due to groupings and the specification bias due to the confounding effect of the group [5]. It is suggested that any study where the parsimonious model maps {x, X} to y contains cross-level bias. Mathematically, Geronimus and Susser verify this notion and identify the structure of the bias [8,2]. Morgenstern and Susser then claim that any study which has cross-level bias participates in the ecological fallacy [8,22].

From the perspective of the individual, the cross-level bias arises from the aggregation effects of the variables, the misspecification of the underlying group construct, or both [8].
From the perspective of the group, the cross-level bias results from the disregard for group effects or from confounding [8]. Even though the components of cross-level bias can cancel each other, this is rare. Morgenstern illustrates that there will be no cross-level bias when the coefficients for the group-level variables are equal to zero and the response is at the individual level [5]. In that case the parsimonious model is a mapping from {x} to {y}.

Susser describes cross-level bias as resulting either from the "atomistic fallacy", where individual observations ignore group effects, or from specification bias, where confounding occurs due to differential distribution of individual characteristics among groups [8]. The atomistic fallacy resulting from aggregation bias implies that the between-group relationship differs from the between-individual relationship, both within groups and ignoring groups [8]. The specification bias results from the within-group relationship differing from the between-group relationship and from the individual relationship ignoring groups [8].

It may seem that the researcher is trapped, but Greenland gives balanced insight into the problem: "it is important to remember that the possibility of bias does not demonstrate the presence of bias, and that a conflict between ecological and individual-level estimates does not by itself demonstrate that the ecological estimates are more biased" [26]. Unfortunately this does not suggest a more acceptable interpretation for the researcher who is constrained to an {x, X} to y mixed-level ecological study. Yet given the complexity that multiple levels introduce, the current interpretation attempts to be sensitive to both individual level variables and group level variables.

It is at this stage that much discussion over the methodology has occurred. It is within the context of this discussion that alternate measures are suggested in order to compensate for the bias which occurs with ecological studies [5,22,23].
As well, the concept of confounding between the individual level variable and the ecological variable is addressed. In order to handle these two issues, several alternate models have been suggested, ranging from modifications of standard regression techniques that identify how the bias affects the estimates to proposed variations of random effects models [7,22,23,25-27].

Reasonability of an Ecological Study

By identifying the types of variables, the construction of indicators and the type of study, it will be apparent if an investigator is utilizing an ecological design. If an ecological study results, then the relevance of the study needs to be assessed. A researcher may be obligated to use an ecological variable because of the research question, concerns about group level actions, or because of the inability to obtain individual level data [15]. If the study is apt, then the use of ecological variables is logically appropriate [15].

Within the context of understanding the place of death for cancer patients, if the ecological variables are included in the study because they represent a relevant construct, then the study is obligate and apt. If they are included in the study because the individual level data is unattainable, then the study is obligate, not apt and convenient [15]. It is not apt because the proxy may not be measuring the same construct. If through the specification of the variable the proxy is found to be measuring the same thing as the original variable, then the study is apt and convenient.

The situation where the study is obligate, not apt and convenient could fundamentally alter the investigation. If through the specification of the variable the proxy is found to be measuring a different construct, then an exchange of convenience for logical appropriateness has occurred. Although this study may have the same model as in the other two situations and employ the same analysis,
they are not equivalent studies because the underlying constructs are not equivalent. Two studies may be addressing similar issues, say the cultural predictors for the location of death of Canadian cancer patients, but the apt study is predicated upon potentially different constructs than the not apt one. It is through the specification of the variables that the equivalence of the studies will be best understood.

Current Situation

Current population-based research within Canada on the place of death of cancer patients suffers because there has been no explicit specification of the study variables, type and reasonableness. Stating the variables, the method by which they were constructed and the modelling procedure does not address what constructs are actually being measured by ecological variables.

An excellent example of this is the current use of income quintiles as a proxy for socioeconomic status. Income fluctuates over both short and long periods, and as a result may not be a sufficient indicator of household purchasing power, personal wealth, the ability to respond to a health crisis, or the resources available in the surrounding area [6]. If two individuals with the same income are compared, nothing is revealed about how much of that income is designated to savings, mortgage, food, or lifestyle activities, which translates into the ability of the individual to weather a health crisis. Inference is made based upon this measure, but the knowledge gained is limited.

Current findings serve to further compound the issue. There exists evidence that small differences in income are related to larger changes in health status [6]. This evokes a second problem with the current use of ecological proxies. Given that small changes in income have been found to be associated with larger changes in health status, the income quintile may be too gross a measure to elicit any significance regardless of how the model is specified.
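
The coarseness of a quintile can be seen in a short sketch. The incomes below are hypothetical, and the rank-based assignment is one simple way of forming quintiles; it is an assumption made for illustration, not the method used to build the neighbourhood income quintiles in the cited research.

```python
def quintile_labels(incomes):
    """Rank-based quintiles: 1 = lowest fifth, 5 = highest fifth."""
    order = sorted(range(len(incomes)), key=lambda i: incomes[i])
    labels = [0] * len(incomes)
    for rank, i in enumerate(order):
        labels[i] = rank * 5 // len(incomes) + 1
    return labels


# Hypothetical median household incomes for ten dissemination areas.
incomes = [21000, 24000, 38000, 39500, 52000, 53000, 61000, 75000, 90000, 140000]
labels = quintile_labels(incomes)
```

Areas that differ by a few thousand dollars, and areas that differ by fifty thousand dollars, can share one label, so any association between small income differences and health status that lies within a quintile is invisible to a model using the quintile alone.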
As this has not been addressed, the lack of significance of income in current research may be due to the construction of the variable as opposed to its true significance. If this is the case, then the model is highly sensitive to how the underlying constructs are defined, which would cast a shadow over the legitimacy of the research itself.

Without specifying the underlying social construct that is being measured by such a proxy, the use of such a variable will invite discussions about bias, confounding and misspecification. Such discussions must occur as there has been no clear presentation of what is being measured and the utility of that measurement. Even when specification has occurred, bias and confounding may occur, but in this context the discussion of such technical aspects will be premised upon a clear understanding of the measures employed in the analysis [12-21, 26].

A secondary issue that emerges in the current Canadian research is the construction of the ecological variables. As the studies have not specified the intent of the culture indicators, there is no reason to assume they are operating as anything other than a proxy for some individual cultural attribute. As the studies are mixed-level ecological ones, interpretive fallacies are avoided through careful semantics and mathematical limitations are noted. Unfortunately this does not negate the inherent problems with using indicators predicated upon a univariate exploration of the data. Specification would illuminate research intent and areas of concern within an ecological study, but would not address the subtle problems surrounding the development of the indicator.

Even though the current research attempts to advance knowledge on the cultural forces influencing patients' decisions at the end of life, it is limited by the current method for specifying indicators of culture. Without specification, research intent can only be inferred.
This inference is obtained from the body of related publications, model specification, the culture indicators used and the methodology associated with the development of the indicators. With a shift to descriptive indicators, the intent of the researcher would be made more explicit. In attempting to indicate the contextual aspects of culture, the researcher will be using apt ecological variables, thus validating the ecological approach to the research.

New Horizons

Faced with the spectre of frustration, there are avenues researchers can take to begin addressing the issues at hand. The following collection of potential strategies illustrates a variety of methodologies and directions that could be taken.

Prescriptive Approaches

Specification

The simplest and most immediate solution is to have the variables completely specified. This will enable other researchers to produce comparable research in different geographic areas and in different time periods. Unfortunately this leaves the research open to discussion about the construct being measured and its validity, as well as the usual issues of bias and confounding. It also requires that the researcher accept all the limitations, both conceptual and mathematical.

Specification and Quantification

This is the same as the previous situation except that the bias is quantified. Geronimus, Bound and Neidert use the example of linear regression to illustrate a method for quantifying the bias due to using an ecological variable [22]. Although the bias can be assessed, it is contingent upon knowledge of how the individual level measure is related to its ecological counterpart. This approach parallels Susser's quantification of the bias [8]. Morgenstern provides a methodology for quantifying the bias inherent in ecological studies by comparing the risk ratio and correlation coefficient for the grouped data and the individual level data using exposure status, disease status and group identity [5].
The immediate challenge with this approach is the need to develop an estimate for the bias when individual level data are not available or when the data cannot be manipulated into the structure used by Morgenstern.

Descriptive Approaches

Classification Trees to Define Cut-off Values

This approach assumes a clearly specified set of variables and addresses the issue of constructing the cut points when transforming independent contextual variables into independent integral variables. The central issue is the effect of the choice of cut point. This data driven method uses classification trees to define potential cut points based upon the organization of the information with respect to some specified outcome [28-30]. Given a set of possible cut points for the data, the researcher then applies area specific knowledge to select, from the data driven candidates, clinically reasonable cut-off values. The advantage of this method is that any unspecified perception based researcher bias may be avoided due to the presence of a data driven set of cut-off values.

Cluster Algorithms for Indicator Development

The multivariate distribution of the cultural characteristic can be used to organize regions based upon a measure of similarity [31]. Unsupervised learning methods, such as hierarchical or non-hierarchical clustering algorithms, could be used. This is a data driven approach that would group geographic regions without a pre-specified outcome. The resultant groups could then be statistically described and given a label to facilitate communication. Each patient could be assigned to a group based upon their postal code; thus an indicator of culture, premised upon a multivariate distribution, would result.

Conclusion

How does a researcher even begin to address the issues of end-of-life studies in Canada using a mixture of ecological and individual level predictor variables, and which approaches should be used? A reasonable approach can be taken from literary theory.
The hermeneutical spiral is the method of refining a textual interpretation through a continual reflexive critique of the current interpretation. Transferring this to the end-of-life research context, the hermeneutical process would be the continual critique of previous research and methodologies. Within this paradigm, a point of origin for this self-referential critique has been established with the current, yet limited, body of work. This article represents a reflexive critique of the research: it sits within the framework of end-of-life research, yet attempts to extract learning from the current body of work to improve the next round of research. By understanding the limitations of the current methodology for the development of culture indicators, new approaches can be derived which may allow the researcher to avoid the two fallacies. By moving end-of-life cancer research in Canada beyond the traditional forms of indicating culture, new methodologies and understandings may emerge.

References

1. Puderer, H. 2001. Introducing the dissemination area for the 2001 census: an update. Statistics Canada, Geography Working Paper Series No. 2000-4.
2. Statistics Canada. Standard Geographical Classification (SGC) 2001. Available: www.statcan.ca/english/Subjects/Standard/2001/2001-sgc-intro.html.
3. Schwartz, S. 1994. The fallacy of the ecological fallacy: the potential misuse of a concept and its consequences. American Journal of Public Health 84(5): 819-824.
4. Roux, A.V. 2004. The study of group-level factors in epidemiology: rethinking variables, study designs and analytic approaches. Epidemiologic Reviews 26: 104-111.
5. Morgenstern, H. 1982. Uses of Ecological Analysis in Epidemiological Research. American Journal of Public Health 72(12): 1336-1344.
6. Krieger, N., Williams, D.R., Moss, N.E. 1997. Measuring social class in US public health research: concepts, methodologies, and guidelines. Annu. Rev. Public Health 18: 341-378.
7. Krieger, N.
1992. Overcoming the absence of socioeconomic data in medical records: validation and application of a census-based methodology. American Journal of Public Health 82(5): 703-710.
8. Susser, M. 1994. The logic in ecological: I. The logic of analysis. American Journal of Public Health 84(5): 825-829.
9. Johnston, G.M., Boyd, C.J., MacIsaac, M.A., Rhodes, J.W., Grimshaw, R.N. 2003. Effectiveness of letters to Cape Breton women who have not had a recent Pap smear. Chronic Diseases in Canada 24(2/3).
10. Hou, F. and Chen, J. 2003. Neighbourhood low income, income inequality and health in Toronto. Statistics Canada Health Reports 14(2).
11. Malstrom, M., Sundquist, J., Johansson, S.E. 1999. Neighbourhood environment and self-rated health status: a multilevel analysis. American Journal of Public Health 89(8): 1181-1186.
12. Wilkins, R., Berthelot, J., Ng, E. 2002. Trends in mortality by neighbourhood income in urban Canada from 1971 to 1996. Supplement to Health Reports 13: 1-27.
13. Luo, Z., Kierans, W.J., Wilkins, R., Liston, R.M., Mohamed, J., Kramer, M.S. 2004. Disparities in birth outcomes by neighbourhood income: temporal trends in rural and urban areas, British Columbia. Epidemiology 15(6): 679-686.
14. Johnston, G.M., Boyd, C.J., Joseph, P., MacIntyre, M. 2001. Variation in delivery of palliative radiotherapy to persons dying of cancer in Nova Scotia, 1994 to 1998. Journal of Clinical Oncology 19(14): 3323-3331.
15. Susser, M. 1994. The logic in ecological: II. The logic of design. American Journal of Public Health 84(5): 830-835.
16. Burge, F., Lawson, B., Johnston, G. 2003. Trends in the place of death of cancer patients, 1992-1997. Canadian Medical Association Journal 168(3): 265-270.
17. Tang, S.T., McCorkle, R. 2001. Determinants of place of death for terminal cancer patients. Cancer Investigation 19(2): 165-180.
18. Costantini, M., Balzi, D., Garronec, E., Parodi, S., Vercelli, M., Bruzzi, P. 2000.
Geographic variations of place of death among Italian communities suggests an inappropriate hospital use in terminal phase of cancer disease. Public Health 114: 15-20.
19. Grande, G.E., Addington-Hall, J.M., Todd, C.J. 1998. Place of death and access to home care services: Are certain patient groups at a disadvantage? Social Science and Medicine 47(5): 565-579.
20. Menec, V., Lix, L., Steinbach, C., Ekuma, O., Sirshi, M., Dahl, M., Soodeen, R. 2004. Patterns of health care use and cost at the end of life. Manitoba Centre for Health Policy.
21. Mustard, C.A., Derksen, S., Berthelot, J., Wolfson, M. 1999. Assessing ecological proxies for household income: a comparison of household and neighbourhood level income measures in the study of population health status. Health & Place 5: 157-171.
22. Geronimus, A.T., Bound, J., Neidert, L.J. 1996. On the validity of using census geocode characteristics to proxy individual socioeconomic characteristics. Journal of the American Statistical Association 91(434): 529-537.
23. Freedman, D.A. 1999. Ecological inference and the ecological fallacy. International Encyclopedia of the Social & Behavioural Sciences, Technical Report No. 549.
24. Deonandan, R., Campbell, K., Ostbye, T., Tummon, I., Robertson, J. 2000. A comparison of methods for measuring socio-economic status by occupation or postal code area. Chronic Diseases in Canada 21(3): 114-118.
25. Jones, K., Duncan, C. 1995. Individuals and their ecologies: analysing the geography of chronic illness within a multilevel modelling framework. Health & Place 1(1): 27-40.
26. Greenland, S. 2001. Ecological versus individual-level sources of bias in ecologic estimates of contextual health effects. International Journal of Epidemiology 30: 1343-1350.
27. Koopman, J.S., Longini, I.M. 1994. The ecological effects of individual exposures and nonlinear disease dynamics in populations. American Journal of Public Health 84(5): 836-842.
28.
Carmelli, D., Halpern, J., Swan, G.E., Dame, A., McElroy, M., Gelb, A.B., Rosenman, R.H. 1991. 27-year mortality in the western collaborative group study: construction of risk groups by recursive partitioning. Journal of Clinical Epidemiology 44(12): 1341-1351.
29. Huang, S.S., Finkelstein, J.A., Rifas-Shiman, S.L., Kleinman, K., and Platt, R. 2004. Community-level predictors of pneumococcal carriage and resistance in young children. American Journal of Epidemiology 159(7): 645-654.
30. Allore, H., Tinetti, M.E., Araujo, K.L.B., Hardy, S., Peduzzi, P. 2005. A case study found that a regression tree outperformed multiple linear regression in predicting the relationship between impairments and Social and Productive Activities scores. Journal of Clinical Epidemiology 58: 154-161.
31. Johnson, R.A., Wichern, D.W. 2002. Applied Multivariate Statistical Analysis, 5th ed., Prentice Hall, New Jersey.
32. Robert, S.A., Strombom, I., Trentham-Dietz, A., Hampton, J.M., McElroy, J.A., Newcomb, P.A., Remington, P.L. 2004. Socioeconomic risk factors for Breast Cancer: Distinguishing Individual- and Community-Level Effects. Epidemiology 15(4): 442-450.

Background

Current research indicates that, if given a choice, up to 88% of patients dying of cancer would choose to die at home, yet only a smaller proportion realizes this preference [1-5]. International estimates of the proportion of cancer patients who die at home range from 23% in Antwerp, Belgium to 35% in Houston, United States [6, 7]. In Nova Scotia (NS), Canada, out of hospital deaths were 25.4% for the period from 1992 to 1997 when measured in terms of calendar years and 31.6% for the fiscal period from 1992 to 1997 [8, 9]. Out of hospital deaths ranged from 19.8% in 1992 to 30.2% in 1997 [10]. Out of hospital deaths in Ontario have ranged from 31% in 1980 to 21.5% in 1994 to 34% in 1997 [11]. In Manitoba, 57.4% of cancer patients died in hospital during the fiscal year of 2000/01.
Vancouver has experienced a range of out-of-institution cancer deaths. From 1990 to 1992, approximately 15% died in non-institutional settings (home or other locations), but in 1993 this number increased to just under 30% [12].

When the patient characteristics of those who are more likely to die out of hospital are considered, congruencies across the studies can be seen. These commonalities include old age, long term relationship, ethnicity, health care provider, education and the geographic region of residence [10-15]. As well, sex was found to relate to the location of death. Burge (2003) [10] and Costantini (2000) [13] found that women were more likely to die out of hospital.

In July 2004, the Canadian Institutes of Health Research (CIHR) approved the grant application entitled "Palliative Care in cross-cultural context: A NET for equitable and quality cancer care for ethnically diverse populations". This five year grant allows researchers in British Columbia, Nova Scotia, and Saskatchewan to focus upon three themes: access to health services, caregivers, and complementary and alternative therapies. The Access theme seeks to understand the cultural constructs that influence the use of health services by cancer patients at the end of life. Not only does this involve the construction and interpretation of predictive models, but it is also concerned with the construction of culture indicators to be used in the models and with the quality of the data used for the modelling process [33].

This study is the first to analyze the trend in the place of death for BC cancer patients (1997 to 2003) and identify predictors of end-of-life care. The objective was to understand which factors were predictive of dying out of hospital. As this study is the first step towards a comparative study between BC and NS, the factors taken were those used in the original NS study [10].
Methods

Data Sources

For this retrospective population-based study, a secondary data set was created from the linkage of two administrative databases: the British Columbia Vital Statistics (VS) and the British Columbia Cancer Registry (BCCR). The Statistics Canada Postal Code Conversion File (PCCF) programs 3J and 4D were used along with the 1996 and 2001 Canadian censuses to obtain the quintiles for the neighbourhood income per person equivalent (IPPE). The response, location of death, was obtained from the VS database. The predictors were obtained from the British Columbia Cancer Agency (BCCA) database. The cause of death was taken from the BCCA database, which in turn obtained this information from BC Vital Statistics.

Study Population

The subjects were all adults, aged 18 years and older, who died of cancer in British Columbia from 1997 to 2003, as identified from death certificates. The death certificate information was obtained from BC Vital Statistics. Death Certificate Only (DCO) patients were excluded from the data set. These were patients who had a cancer diagnosis at the time of death but for whom no prior information relating to a diagnosis of cancer was found on follow-up. The age was derived from the BCCR database as the difference between the date of death and the date of birth. The postal code for the usual place of residence was obtained from VS. If this information was unavailable, the postal code for the place of residence was obtained from the BCCR. Vital Statistics provided the location of death. Deaths due to cancer, primary or secondary, were identified using the BCCA database.
Study Variables

From this data set, the following were obtained: year of death, sex, age, neighbourhood IPPE, diagnosis (coded according to the International Classification of Diseases, ninth and tenth revisions [16, 17]), cause of death (coded according to the International Classification of Diseases for Oncology, third revision [18]), postal code of usual residence, health region associated with the postal code of usual residence, survival time, and location of death (dichotomized as in hospital or out of hospital) as recorded on the death certificate.

The categories for age and survival were obtained from the Nova Scotia research. The age groups were constructed to reflect the skewed age distribution of people who died from cancer in Nova Scotia [10]. The patient's age was taken as the difference between the date of death as indicated on the death certificate and the date of birth as reported in the BCCR. For survival, the categories were constructed to capture the varying lengths of survival [10]. The survival time was taken as the difference between the date of death on the death certificate and the date of diagnosis as reported in the BCCR.

The neighbourhood income per person equivalent is a household size-adjusted measure of income based upon the dissemination area level information [19]. In the PCCF program, the quintiles were defined so as to reflect the relative nature of the measure of income across regions, to minimize the effect of large differences in housing costs upon the overall household welfare, and to have equal proportions of the population in each income quintile [19]. This measure was selected for two reasons. The first is the relative ease with which it could be obtained; thus it was an opportunistic choice. The second is the adjustments used in the measure. As the final intent is to compare across regions, it is desirable to have a measure of income that adjusts for regional variation.
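The derivation of age and survival time described above can be sketched in a few lines. This is a minimal illustration, not the code used in the study. The band boundaries are read from Table 3-1; the function names, the use of completed years for age, and the placement of a survival of exactly 60 days in the first band are assumptions, since the text does not settle them.

```python
from datetime import date

# Age bands (years) as printed in Table 3-1; the final band is open-ended.
AGE_BANDS = [(18, 44, "18-44"), (45, 64, "45-64"), (65, 74, "65-74"), (75, 84, "75-84")]


def age_in_years(birth: date, death: date) -> int:
    """Completed years between date of birth and date of death (assumed convention)."""
    years = death.year - birth.year
    if (death.month, death.day) < (birth.month, birth.day):
        years -= 1  # birthday not yet reached in the year of death
    return years


def age_group(age: int) -> str:
    """Map an age in years to the age bands of Table 3-1."""
    if age < 18:
        raise ValueError("study population is restricted to adults aged 18 and older")
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return "85+"


def survival_days(diagnosis: date, death: date) -> int:
    """Days between date of diagnosis (BCCR) and date of death (death certificate)."""
    return (death - diagnosis).days


def survival_group(days: int) -> str:
    """Map survival in days to the bands of Table 3-1.

    Assumption: the table prints "<60" and "61-120", so exactly 60 days is
    placed in the first band here.
    """
    if days < 0:
        raise ValueError("date of death precedes date of diagnosis")
    if days <= 60:
        return "<60"
    if days <= 120:
        return "61-120"
    return "121+"
```

For example, a hypothetical patient born on 15 June 1930 who died on 14 June 2001 has a completed age of 70 years and falls in the 65-74 band.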
The International Classification of Diseases (ICD) codes were grouped as in the Nova Scotia research (Appendix). As statistical power was a concern of the Nova Scotia team, cancer types were aggregated to produce more information in each group. As inter-provincial comparisons are intended, this structure was adopted.

The region of death was taken to be the health authority associated with the patient's usual residence. In BC, the five health authorities are:

• Fraser,
• Vancouver Coastal,
• Vancouver Island,
• Interior, and
• Northern.

A provincial health authority exists, but was not used as it is not geographically defined. The Provincial Health Services Authority's (PHSA) primary role is to ensure access to a network of specialized health care services [27]. It operates agencies such as BC Children's Hospital, the BC Transplant Society, and Riverview Hospital. It is responsible for specialized services such as chest surgery and cardiac services [27].

An indicator of rural or urban residence was constructed using the postal code of usual residence, obtained from the death certificate, employed in the PCCF process. This follows the methodology of the Nova Scotia research [10]. If the second character of the postal code was a 0, then a rural delivery point was indicated [19]. If no address was given on the death certificate, the address from the BCCR was used.

BC has two postal codes which were reused. The distance between the old and new areas is considerable enough to pose geocoding problems. As the data set spanned the vintage date for these postal codes, 1 April 1999, a supplemental program was used which handles this peculiarity. The identified postal codes were verified for accuracy and the appropriate re-coding was performed when the reported postal code and the current postal code did not agree.

Analysis

As an inter-provincial comparison is an eventual goal, the groupings for age, survival, and cause of death were the same as those used by the Nova Scotia research [10].
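The rural delivery rule used for the residence indicator in the Study Variables section can be sketched as a one-line check. This is an illustration only, not the PCCF program itself: it assumes the rule refers to the second character of the six-character postal code (its first digit), and the function name and example postal codes are hypothetical.

```python
def is_rural(postal_code: str) -> bool:
    """Flag a rural delivery point: second character of the postal code is 0.

    Assumes the usual letter-digit-letter digit-letter-digit layout, with or
    without the separating space.
    """
    code = postal_code.replace(" ", "").upper()
    if len(code) != 6:
        raise ValueError(f"unexpected postal code length: {postal_code!r}")
    return code[1] == "0"


# Illustrative (hypothetical) inputs.
print(is_rural("V0N 1B0"))  # second character is 0, so flagged rural
print(is_rural("V6T 1Z4"))  # second character is 6, so flagged urban
```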
Cross tabulations were used to understand the relationship between the predictors and the location of death. Only those with a significant association were selected for entry into the multivariable logistic regression. Logistic regression was used to understand how the selected set of predictors was related to the location of death. The predictors were year of death, sex, age, neighbourhood IPPE, aggregate tumour group (lung as reference group), region of residence (BC Health Authority), and survival group. Residence (rural versus urban) was excluded from the regression as no association was found.

Multivariable logistic regression was conducted utilizing backwards stepwise elimination. Stepwise elimination that allowed backwards and forward steps was used as a comparative technique. An exclusion p-value of 0.05 was used to find a parsimonious model for the prediction of out-of-hospital death. All the variables that were used in the full model were retained.

Results

A total of 52,295 adults died of cancer in British Columbia in the period from 1997 to 2003. Of these, 2,366 patients were excluded due to missing information on neighbourhood IPPE and region of death. Of the remaining 49,929 patients, 33,409 (66.91%) died in hospital (Table 3-1). Not all the characteristics were found to be significantly associated with the place of death. The residence indicator (urban versus rural) had insufficient evidence for an association (p=.24), thus it never entered the multivariable logistic regression model.

Table 3-1: Characteristics of adults in British Columbia who died of cancer from 1997 to 2003, by place of death.
British Columbia†

Characteristic            In Hospital  Percent in Hospital  Out of Hospital  Percent out of Hospital
Year
  1997                    4964         74.09%               1736             25.91%
  1998                    5052         74.54%               1726             25.46%
  1999                    4232         61.83%               2613             38.17%
  2000                    4518         66.40%               2286             33.60%
  2001                    4568         64.72%               2490             35.28%
  2002                    4898         64.49%               2697             35.51%
  2003                    5177         63.53%               2972             36.47%
Sex
  Male                    18079        68.19%               8434             31.81%
  Female                  15330        65.61%               8034             34.39%
Age, yr
  18-44                   1275         70.36%               537              29.64%
  45-64                   8283         68.47%               3815             31.53%
  65-74                   9641         69.27%               4278             30.73%
  75-84                   10297        66.54%               5179             33.46%
  ≥85                     3913         59.07%               2711             40.93%
Neighbourhood IPPE
  Low                     8353         68.34%               3870             31.66%
  Lower Middle            6990         67.89%               3306             32.11%
  Middle                  6520         66.95%               3218             33.05%
  Upper Middle            5992         66.24%               3054             33.76%
  Upper                   5554         64.37%               3074             35.63%
Aggregate Tumour Group
  Breast                  2492         62.99%               1464             37.01%
  Colorectal              3019         63.85%               1709             36.15%
  Lung                    8931         67.97%               4209             32.03%
  Prostate                2050         61.71%               1272             38.29%
  Other                   16917        68.26%               7866             31.74%
Region of Death
  Interior                6379         64.18%               3561             35.82%
  Fraser                  11307        73.09%               4163             26.91%
  Vancouver Coastal       7455         67.88%               3528             32.12%
  Vancouver Island        6688         60.74%               4323             39.26%
  Northern                1580         62.57%               945              37.43%
Residence
  Urban                   29077        67.01%               14316            32.99%
  Rural                   4332         66.28%               2204             33.72%
Length of Survival, days
  <60                     8367         76.90%               2514             23.10%
  61-120                  3391         66.19%               1732             33.81%
  ≥121                    21651        63.82%               12274            36.18%

† All characteristics, except Residence, were associated with place of death (p<.01)

Over the study period the general trend was towards an increase in the proportion of deaths occurring out of hospital, from 25.9% in 1997 to 36.5% in 2003 (Figure 3-1). Logistic regression confirmed this trend. Compared with 1997, the adjusted odds ratio (OR) for 2003 was 1.6 (95% confidence interval [CI] 1.5-1.7) (Table 3-2). Figure 3-2 shows this same trend occurring over the regional health authorities. From 1997 to 1999 there is a clear increase in the proportion of out of hospital deaths for all the health authorities.
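As a check on how the two tables relate, the crude odds ratios in Table 3-2 can be reproduced from the counts in Table 3-1. The sketch below is only a worked arithmetic example with hypothetical function names. It does not reproduce the adjusted odds ratios, which come from the multivariable logistic regression and cannot be recovered from the marginal counts.

```python
def percent_out_of_hospital(in_hospital: int, out_of_hospital: int) -> float:
    """Percentage of deaths in a group that occurred out of hospital."""
    return 100 * out_of_hospital / (in_hospital + out_of_hospital)


def crude_odds_ratio(in_group: int, out_group: int, in_ref: int, out_ref: int) -> float:
    """Odds of dying out of hospital in a group, relative to the reference group."""
    odds_group = out_group / in_group
    odds_ref = out_ref / in_ref
    return odds_group / odds_ref


# Counts copied from Table 3-1 as (in hospital, out of hospital).
year_1997 = (4964, 1736)
year_2003 = (5177, 2972)
male = (18079, 8434)
female = (15330, 8034)

# 2003 versus 1997 (reference): about 1.64, reported as 1.6 in Table 3-2.
or_2003 = crude_odds_ratio(*year_2003, *year_1997)

# Female versus male (reference): about 1.12, reported as 1.1 in Table 3-2.
or_female = crude_odds_ratio(*female, *male)

print(round(or_2003, 1), round(or_female, 1))
print(round(percent_out_of_hospital(*year_2003), 2))
```

The same two functions apply to any other row of Table 3-1 once a reference group is chosen.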
The same anomalous spike is found in 1999 for all health authorities, and then from 1999 to 2003 the trends are less obvious. Despite the unclear trends, there is an increasing trend for the Northern Health Authority.

Figure 3-1: Proportion of death out of hospital and in hospital among British Columbian adults who died of cancer, 1997-2003. (Horizontal axis: year of death, 1997 to 2003; vertical axis: percentage of deaths, 0% to 80%.)

Table 3-2: Crude and adjusted odds of dying out of hospital, by predictors, for British Columbia, 1997-2003

British Columbia - Backwards Elimination

Predictor                 Crude OR  Adjusted OR†  95% CI Lower  95% CI Upper
Year
  1997*                   1.0       1.0
  1998                    1.0       1.0           0.9           1.1
  1999                    1.8       1.8           1.7           2.0
  2000                    1.4       1.5           1.3           1.6
  2001                    1.6       1.6           1.5           1.7
  2002                    1.6       1.6           1.5           1.7
  2003                    1.6       1.6           1.5           1.7
Sex
  Male*                   1.0       1.0
  Female                  1.1       1.1           1.1           1.2
Age, yr
  18-44*                  1.0       1.0
  45-64                   1.1       1.1           1.0           1.2
  65-74                   1.1       1.1           1.0           1.2
  75-84                   1.2       1.2           1.1           1.4
  ≥85                     1.6       1.7           1.6           2.0
Neighbourhood IPPE
  Low*                    1.0       1.0
  Lower Middle            1.0       1.0           1.0           1.1
  Middle                  1.1       1.1           1.0           1.1
  Upper Middle            1.1       1.1           1.1           1.2
  Upper                   1.2       1.2           1.1           1.3
Aggregate Tumour Group
  Breast                  1.2       1.0           0.9           1.1
  Colorectal              1.2       1.1           1.0           1.1
  Lung*                   1.0       1.0
  Prostate                1.3       1.0           0.9           1.1
  Other                   1.0       0.9           0.9           1.0
Region of Death
  Interior                1.2       1.2           1.2           1.3
  Fraser                  0.8       0.8           0.7           0.8
  Vancouver Coastal*      1.0       1.0
  Vancouver Island        1.4       1.4           1.3           1.5
  Northern                1.3       1.4           1.2           1.5
Length of Survival, days
  <60*                    1.0       1.0
  61-120                  1.7       1.8           1.7           1.9
  ≥121                    1.9       2.0           1.9           2.1

Note: OR = Odds Ratio, CI = Confidence Interval
† Adjusted for year of death, sex, age, neighbourhood IPPE, aggregate tumour group, region of death and length of survival
* Reference group

Figure 3-2: Percentage of patients who died out of hospital over the study period 1997-2003, by British Columbia Health Region. (Panel title: Deaths out of Hospital: BC Health Authority; series: Interior, Fraser, Vancouver Coastal, Vancouver Island, Northern; horizontal axis: 1997 to 2003.)

The adjusted
results for the multivariable logistic regression are given in Table 3-2. After accounting for all the other significant factors in the model, patients who died out of hospital were more likely to be female, over 45 years of age, from a neighbourhood in the upper three neighbourhood IPPE quintiles, living outside of the Fraser and Vancouver Coastal Health Authorities, and to have survived longer than 60 days after diagnosis.

The adjusted OR for females was 1.1 (95% CI 1.1-1.2), yet 53.2% of all the deaths were men. In relation to age, the adjusted OR was 1.1 (95% CI 1.0-1.2) for both the 45-64 age group and the 65-74 age group compared with those who were under 45 years of age. For the 75-84 age group the OR was 1.2 (95% CI 1.1-1.4), and the 85+ age group had an OR of 1.7 (95% CI 1.6-2.0), compared with those who were under 45 years of age. Compared to the lowest income quintile, the highest had an OR of 1.2 (95% CI 1.1-1.3). The middle income quintile had an OR of 1.1 (95% CI 1.0-1.1) and the upper-middle income quintile had an OR of 1.1 (95% CI 1.1-1.2).

When the region of death, as defined by the BC Health Authorities, was considered, it was found that the non-Fraser Valley health authorities were more likely to have out of hospital deaths. When compared to the Vancouver Coastal Health Authority, the Interior Health Authority had an OR of 1.2 (95% CI 1.2-1.3), the Vancouver Island Health Authority had an OR of 1.4 (95% CI 1.3-1.5) and the Northern Health Authority had an OR of 1.4 (95% CI 1.2-1.5). When compared with those who survived less than 60 days, those who survived longer than 120 days had an OR of 2.0 (95% CI 1.9-2.1).

The method of obtaining a parsimonious model did not produce any differences in the final model. As such, the results for the stepwise method were not reported.

Discussion

We found that the proportion of out of hospital deaths among adults with cancer increased over the study period. The proportion rose from 25.9% in 1997 to 36.5% in 2003.
It was found that women, patients over 45 years of age, patients in neighbourhoods in the Middle to Upper IPPE quintiles, patients with colorectal cancer, patients who lived outside the Fraser Valley, and those who had survival times greater than 60 days were more likely to die outside of a hospital.

The abrupt change in the location of death in 1999, as seen in Figure 3-1, was unexpected. This may reflect the implementation of a commitment to home related health care delivery systems [29]. This puzzling event warrants further investigation into the health policies being implemented in the 1998/1999 time period.

It was found that women are more likely than men to die out of hospital. This is congruent with Nova Scotia and Italian results [10, 13]. Even so, Tang (2001) [5] reveals that there is conflicting evidence for the relevance of sex as a predictor for the location of death. As well, there are studies in which women were found to be less likely to die out of hospital [5].

British Columbia and Nova Scotia have complementary results with respect to the role age plays in determining the location of death. In BC, it was found that patients 45 years of age and older were more likely to die out of hospital. In Nova Scotia, it was found that patients 75 years and older were more likely to die out of hospital. These findings concur with those of Costantini et al. (2000) [13] and Bruera et al. (2002) [14]. On the other hand, Tang (2001) [5] finds that the relationship between age and the location of death is as yet unsettled.

BC has a median age of 38.4 years [21]. Nova Scotia, which also reported that age was a significant predictor for the location of death, has a median age of 38.8 years [22]. Both of these are higher than the Canadian level of 37.6 years of age [20].
On the other hand, Manitoba had a median age of 36.8 years, yet it has the second highest proportion of seniors aged 65 and over (14%), exceeded only by Saskatchewan (15%) [31]. Manitoba saw the largest increase in median age over the last decade. This was due to the increase in the numbers of elderly people.

BC has seen an increase in the numbers of older people in the province between the 1996 and 2001 censuses. There was a 49% increase in the 45 to 64 age group, a 27% increase in the 70 to 79 age group, and a 54% increase in the 80 and over group [21]. The Fraser Valley and Vancouver Island regions saw their median age increase from 36.0 years in 1996 to 38.1 years in 2001 [21]. BC has 13 of the top 25 municipalities with the oldest populations. The oldest BC municipality was Qualicum Beach (median age: 58.1 years) followed by White Rock (median age: 50.9 years) [21]. Victoria is the second oldest Canadian metropolitan area, with a median age of 41.0 years [21]. Vancouver, on the other hand, with a median age of 37.4 years, is close to the Canadian figure. With such demographic shifts, age may begin to play an important role in the observed health outcome and in the decisions made concerning end-of-life issues.

Income was found to be predictive of the location of death. In Nova Scotia and Toronto the opposite was found [1, 23]. The difference between the studies is the measure of income. The measure used in this investigation incorporated regional variations; it is unclear if this was done in the Nova Scotia and Toronto studies. This suggests that the means by which income is assessed may contribute to the role it plays in predicting health outcomes. This may be especially true for countries that have universal health care systems. Following this avenue of thought, Roos (2004) [24] and Veugelers (2001) [25] imply that the lack of association may be due to Canada's universal health care system.
Unlike the Nova Scotia results, the aggregate tumour group was found to be predictive of the location of death [10]. Patients with colorectal cancer were more likely to die out of hospital than patients with lung cancer. With lung cancer as the reference, there was no significant difference in the likelihood of dying out of hospital for breast cancer, prostate cancer, or all other cancers.

Patients who reside in the Fraser Health Authority are less likely to die out of hospital than those in the Vancouver Coastal Health Authority. The Fraser Health Authority has the largest share of the BC population (36.4%) and is the most urbanised [26]. It has the second lowest number of acute care days (503 days/1000 pop), with Vancouver Coastal having the lowest (496 days/1000 pop) [26]. Patients who reside in the Northern, Interior, and Vancouver Island Health Authorities are more likely to die out of hospital than those in Vancouver Coastal. The Northern Health Authority has the highest number of acute care days (734 days/1000 pop) [26]. The Interior has 602 days/1000 pop and the Vancouver Island Health Authority has 596 days/1000 pop [26].

In BC, the regions with higher odds of dying out of hospital are those where the length of stay in acute care facilities is longest. The regions where the odds of dying out of hospital are lower than in Vancouver Coastal have short lengths of stay in acute care facilities. For British Columbians, these results may be related to access to health services if people are required to travel substantial distances for them. For patients outside the Fraser and Vancouver Coastal Health Authorities, hospitalization may occur when other local options have been exhausted.
This unexpected relationship poses a challenge for BC researchers, as it may be intimately connected to the utilization, geographic distribution and administration of health services as well as the influence of other non-institutional support systems.

The use of administrative databases allows a rough sketch of population health to be obtained. Although they represent an efficient means of obtaining health-related data, they are limited because their primary intent is administration, not research. This limits the number of usable fields and constrains the researcher to predetermined constructs and coding. Fortunately, the knowledge gained outweighs the limitations. Our data were taken from the administrative databases used at the British Columbia Cancer Agency and British Columbia Vital Statistics. In the future, it is hoped that more administrative databases pertaining to palliative care, hospital separations and hospice care will be added to the existing database. It is believed that this information will provide a better understanding of how palliative care and primary care services are used by cancer patients at the end of their life.

Limitations

Beyond the previously mentioned limitations of administrative databases, other limitations exist. The VS and BCCA databases do not collect socioeconomic and demographic information. Because of this, the income measure used in this study was obtained from dissemination area census data. The use of such data exposes the model to the ecological fallacy. Also, researchers are limited to the categories that have been specified by the data providers. Even with these and the aforementioned limitations, administrative databases represent an efficient means, in terms of cost and time, by which population-based health data can be obtained.
This allows for a rough sketch of population health to be made with relative ease.

The response, death out of hospital, may be too vague. These deaths are not hospital deaths, but they may occur in an institution such as a nursing home or a palliative care facility. Although technically not hospitals, such locations represent a mixed environment in which medical services and place of usual residence are intermingled. What is needed is clearer information on the location of death so that pseudo-residential locations can be identified.

The use of the postal code as an indicator of rural versus urban residence is questionable. A zero in the second position of the postal code indicates delivery from a rural post office. This indicates the point of service, but not the rural or urban character of the area. Some rural areas are urbanised and suburbanised, yet they have a zero in the second position, so it may not be the best way to define the indicator [19,30]. As well, rural route services and suburban route services are provided from urban post offices [19]. For example, the entire province of New Brunswick is considered urban as no postal code has a zero in the second position. A definition must be established in order to use this type of indicator with precision and clarity. A measure based upon population size, or on population density and distance to a regional centre, may serve as a starting point for such a definition.

The problem of re-used postal codes presents a limitation in that only those patients who had a recorded address could undergo postal code verification. When a postal code was verified as identifying an institution, all instances of that postal code were taken as correct. Unfortunately, patients with an old postal code that was not associated with an institution could not be followed up if they did not have a recorded address. Also, the address was taken from the BCCR and may not represent the place of usual residence.
Although a limitation, this was restricted to 207 patients in the data set.

This research only describes where people have died. As administrative databases were used, no comment can be made on the congruence between the location of death and the patient's preferred place of death. The reasons for dying in a certain location are likewise hidden.

As this paper is a first step in an inter-provincial comparison, the categories for age and survival from the Nova Scotia research were used. As these categorizations may not be appropriate for the BC population, this research is limited by the conceptual framework implemented for the population in Nova Scotia. A common framework must be established in order to facilitate lucid comparisons.

Conclusion

A shift has occurred in the place of death for BC cancer patients between 1997 and 2003. As the population of BC continues to age, understanding the nature of these choices will be an essential component in meeting the needs of those who are completing their life. It is recommended that future research include data from hospital separations, continuing care and the BC Medical Services Plan to improve our understanding of how health services are utilized at the end of life. Based upon the insights gained through the integration of these databases into the current research, qualitative approaches should be developed to capture how and why people make certain decisions about service utilization at the end of their life. These insights will be necessary to develop effective health policy and service delivery programs for all.

Appendix

ICD codes used to generate disease groups for CR and VS data for up to 2002 for End-of-Life with cancer research*

Table 3-3: Nova Scotia groupings for ICD codes. The British Columbia groups were identical to these groups.
Disease Group | VS deaths, ICD-9 (1979-2000) | VS deaths, ICD-10 (2001-) | CR cases, ICD-8 (-1978) | CR cases, ICD-O (1979-1991) | CR cases, ICD-O-2 (1992-2000) / ICD-O-3 (2001-)
Head and Neck | 140-149,160,161 | C00-C14,C30-C32 | 140-149,160,161 | 140-149,160,161 | C00-C14,C30-C32
Gastrointestinal | 150-159 | C15-C26,C48 | 150-159 | 150-159 | C15-C26,C48
Thoracic | 162-165 | C33,C34,C37-C39,C45 | 162-163 | 162-165 | C33,C34,C37-C39
Breast | 174,175 | C50 | 174 | 174,175 | C50
Gynecologic | 179-184 | C51-C58 | 180-184 | 179-184 | C51-C58
Genitourinary | 185-189 | C60-C68 | 185-189 | 185-189 | C60-C68
CNS | 190-192 | C69-C72 | 190-192 | 190-192 | C69-C72
Hematologic | 200-208 | C81-C85,C88,C90-C95 | 200-207 | 169,196 | C42,C77
Other | 170-172,193-195 | C40,C41,C43,C44,C46,C47,C49,C73-C76,C96 | 170,172,173,193-195 | 170,171,173,193-195 | C40,C41,C44,C47,C49,C73-C76
Unknown Primary | 196-199 | C77-C80 | 196-199 | 199 | C80

* This table was constructed by the Nova Scotia research team [10].

References

1. Carroll, D.S. 1998. An audit of place of death of cancer patients in a semi-rural Scottish practice. Palliative Medicine 12: 51-53.
2. Tang, S.T., McCorkle, R. 2003. Determinants of Congruence between the Preferred and Actual Place of Death for Terminally Ill Cancer Patients. Journal of Palliative Care 19(4): 230-237.
3. Gyllenhammar, E., Thoren-Todoulos, E., Strang, P., Strom, G., Eriksson, E., Kinch, M. 2003. Predictive Factors for Home Deaths among Cancer Patients in Swedish Palliative Home Care. Support Care Cancer 11: 560-567.
4. Tang, S.T. 2003. When Death Is Imminent: Where Terminally Ill Patients With Cancer Prefer to Die and Why. Cancer Nursing 26(3): 245-251.
5. Tang, S.T., McCorkle, R. 2001. Determinants of Place of Death for Terminally Ill Cancer Patients. Cancer Investigation 19(2): 165-180.
6. Schrijvers, D., Joosens, E., Vandebroek, J., Verhoeven, A. 1998. The Place of Death of Cancer Patients in Antwerp. Palliative Medicine 12: 133-34.
7. Bruera, E., Sweeney, C., Russell, N., Willey, J.S., Palmer, J.L. 2003.
Place of Death of Houston Area Residents with Cancer Over a Two-Year Period. Journal of Pain and Symptom Management 26(1): 637-642.
8. Burge, F., Lawson, B., Johnston, G. 2003. Family Physician Continuity of Care and Emergency Department Use in End-of-Life Cancer Care. Medical Care 41(8): 992-1001.
9. Burge, F., Lawson, B., Johnston, G., Cummings, I. 2003. Primary care continuity and location of death for those with cancer. Journal of Palliative Medicine 6(6): 911-918.
10. Burge, F., Lawson, B., Johnston, G. 2003. Trends in the place of death of cancer patients, 1992-1997. Canadian Medical Association Journal 168(3): 265-270.
11. Heyland, D., Lavery, J.V., Tranmer, J.E., Shortt, S.E.D. 2000. The Final Days: An Analysis of the Dying Experience in Ontario. Ann R Coll Physicians Surgeons Can 33(6): 356-361.
12. Cardiff, K., Hsu, D., Kuhl, D. 1998. Utilization of Palliative Care Services in Vancouver: 1990-1993. Center for Health Services and Policy Research 98:13.
13. Costantini, M., Balzi, D., Garronec, E., Parodi, S., Vercelli, M., Bruzzi, P. 2000. Geographic variations of place of death among Italian communities suggests an inappropriate hospital use in terminal phase of cancer disease. Public Health 114: 15-20.
14. Bruera, E., Russell, N., Sweeney, C., Fisch, M., Palmer, J.L. 2002. Place of Death and Its Predictors for Local Patients Registered at a Comprehensive Cancer Center. Journal of Clinical Oncology 20(8): 2127-2133.
15. Tang, S.T. 2002. Influencing Factors of Place of Death among Home Care Patients With Cancer in Taiwan. Cancer Nursing 25(2): 158-166.
16. International Classification of Diseases: Ninth Revision. Geneva: World Health Organization. 1990.
17. International Classification of Diseases: Tenth Revision. Geneva: World Health Organization. 1992.
18. International Classification of Disease for Oncology: Third Revision. Geneva: World Health Organization. 2000.
19. Wilkins, R.
PCCF+ Version 4D User's Guide: Automated Geographic Coding Based on the Statistics Canada Postal Code Conversion Files Including Postal Codes to December 2003. Statistics Canada, Health Analysis and Measurement Group, July 2004.
20. Profile of the Canadian Population by Age and Sex: Canada Ages. Ottawa: Statistics Canada; 2005 June 16. Cat no. 96F0030XIE2001002. Available: http://www12.statcan.ca/english/census01/Products/Analytic/companion/age/provs.cfm
21. British Columbia: One of the Oldest Provinces. Ottawa: Statistics Canada; 2005 June 16. Cat no. 96F0030XIE2001002. Available: http://www12.statcan.ca/english/census01/Products/Analytic/companion/age/bc.cfm
22. Nova Scotia: Lowest Ratio of Men to Women in Canada. Ottawa: Statistics Canada; 2005 June 16. Cat no. 96F0030XIE2001002. Available: http://www12.statcan.ca/english/census01/Products/Analytic/companion/age/ns.cfm
23. Gorey, K.M., Holowaty, E.J., Fehringer, G., Laukkanen, E., Richter, N.L., Meyer, C.M. 2000. An International Comparison of Cancer Survival: Metropolitan Toronto, Ontario, and Honolulu, Hawaii. American Journal of Public Health 90(12): 1866-1872.
24. Roos, L.L., Magoon, J., Gupta, S., Chateau, D., Veugelers, P.J. 2004. Socioeconomic Determinants of Mortality in Two Canadian Provinces: Multilevel Modelling and Neighbourhood Context. Social Science and Medicine 59: 1435-1447.
25. Veugelers, P.J., Yip, A.M., Kephart, G. 2001. Proximate and Contextual Socioeconomic Determinants of Mortality: Multilevel Approaches in a Setting with Universal Health Care Coverage. American Journal of Epidemiology 154(8): 725-732.
26. BC Health Atlas: Second Edition. BC Ministry of Health Services: Centre for Health Services and Policy Research. 2005 June 16. Available: http://www.chspr.ubc.ca/Research/healthatlas.htm
27. Provincial Health Services Authority. 2005 July 25. Available: http://www.phsa.ca/default.htm
28. Menec, V., Lix, L., Steinbach, C., Ekuma, O., Sirshi, M., Dahl, M.
, Soodeen, R. 2004. Patterns of health care use and cost at the end of life. Manitoba Centre for Health Policy.
29. Allen, D.E., Stajduhar, K.I., Reid, R.C. 2005. The uses of provincial administrative health databases for research on palliative care: Insights from British Columbia, Canada. BMC Palliative Care 4:2.
30. Martin, S. Canada Post [Personal communication 2004 April 20].
31. East-West Split in Aging Patterns. Ottawa: Statistics Canada; 2002 July 16. Cat no. 96F0030XIE2001002. Available: http://www12.statcan.ca/english/census01/Products/Analytic/companion/age/provs.cfm
32. Manitoba: Record Increase in Median Age. Ottawa: Statistics Canada; 2005 June 16. Cat no. 96F0030XIE2001002. Available: http://www12.statcan.ca/english/census01/Products/Analytic/companion/age/mb.cfm
33. CIHR: NET "Palliative Care in a cross-cultural context: A NET for equitable and quality cancer care for culturally diverse populations". Doll, R., Kazanjian, A., Barroetavena, M.C., Johnston, G., Leiss, A., Fyles, G. 2004.

Background

Understanding the underlying cultural constructs which are associated with a cancer patient's utilization of health services and the location of death can be critical to identifying vulnerable populations. Research in Manitoba, Nova Scotia and Vancouver suggests that the place of residence is associated with health-related outcomes such as service utilization or location of death [1-6]. This definition of the place of residence (e.g. rural versus urban) is too gross a measure to identify a coherent underlying social construct that gives insight other than proximal discrimination. Research upon populations from Toronto and Nova Scotia presents further frustrations as it suggests there is no socioeconomic status (SES) connection to these health outcomes [7-10]. Roos [8] and Veugelers [9] imply that the lack of association between health outcomes and SES is due to Canada's universal health care system.
If this is true, then the conventional measures of SES (e.g. income, education, and occupation) may not be discriminants of health outcomes. Health Canada claims that "universal coverage for medically necessary health care services [is] provided on the basis of need, rather than the ability to pay" [11]; thus economic measures of SES may not capture any systematic marginalization of peoples from health services.

This understanding will have repercussions upon the body of work from which Canadian researchers can draw inference. Consider the current work in the United States which investigates cancer-related health outcomes. Within the known literature, much emphasis is placed upon conventional measures of SES while the sociodemographic aspect is limited to race, age, gender and marital status [12-21]. The consideration of cultural context extends only to region of residence, which is summarised in a rural versus urban indicator [14-15]. Similar themes emerge when considering international research into the cultural determinants of the utilization of health services and the location of death for persons dying of cancer [22,28].

By investigating the indicators used to understand the non-clinical factors that are associated with the use of health services and the place of death within cancer research, it became apparent that international comparisons must be made within the context of the health system in place in each of the countries being considered. This suggests that indicators that are relevant in the United States may not have the same influence in Canada; thus the use of the SES measures may be the source of the mixed results for Canadian research. In order to overcome this limitation and seek understanding from the international community, Canadian research should draw from sources that share similar health service ideologies, such as Sweden, Italy and Belgium.
The challenge is that research upon these populations has focused upon the conventional measures of SES [22-23, 26, 28].

Although leadership in developing a more complete understanding of how people dying of cancer use end-of-life services based upon sociodemographic and economic characteristics (SDEC) is absent within the known international body of research for universal health care systems, Canadian research has not remained inactive. Research in Nova Scotia has utilised Canada Census data to create culture indicators other than SES measures for small geographic regions called enumeration areas (EA). Within the context of understanding the utility of letters concerning Pap smear information and the use of cervical cancer screening facilities, census-based ecological indicators were created for ethnicity and language. Within that investigation, it was found that areas where French was the dominant mother tongue had a decreased risk of an abnormal smear [32]. Unfortunately, the use of census-derived SDEC for understanding the social context and effects upon health service utilization in Canada does not extend much beyond this research.

Although representing a departure from the exclusive use of SES measures in cancer research to indicate the SDEC context, the methodology employed in Nova Scotia serves only to highlight a predetermined aspect of a cultural characteristic. The researchers chose French as mother tongue over others due to the large French community. An EA was considered to be Francophone if 50% or more of the people in the EA had French as their mother tongue. Although this appears to be a reasonable means by which to describe a small geographic region in order to infer a cultural context, it is inadequate in two respects: reliance upon a univariate description of the regions and the choice of cut-points.

The first is in the choice of a univariate decision versus a multivariate one.
The current approach deconstructs the concept of culture into a series of discrete, well-defined partitions such as French, Black and Aboriginal identities. Each is a binary indicator that is constructed without reference to any other cultural feature. In the construction of these indicators, only a small amount of the information is used. Although useful for an initial investigation in order to build understanding of the underlying cultural constructs of a region, they do not represent the underlying multivariate nature of a cultural identity. The complex interplay of the cultural characteristics is eclipsed by this approach.

The second is the choice of cut-points. The cut-point of 50% for French as the mother tongue was chosen to define a Francophone community. This may appear sensible as it would identify EAs with a majority of people with French as their mother tongue. Although this presents a clear cut-point, it is not easily defended against ambiguous cases. Consider two EAs. One has 52% French as mother tongue and the other has 48%. Is it reasonable to suggest that they should be classified differently without any other evidence of their dissimilarity? The researcher-determined cut-point of 50% suggests that these are very different EAs. Yet they may share similar underlying social constructs due to the similarity in mother tongues and their proximity to one another. An analogous situation arises when the second EA has a proportion that is distinctly different from the first. If the second had 90% French as mother tongue then it may not be appropriate to group the two because of the differences, yet a 50% cut suggests that they are identical in terms of the underlying construct being measured. Choosing cut-points from a univariate exploration of the data obscures the cultural nuances of a geographically constrained construct, resulting in a myopic view of culture.
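The cut-point argument above can be made concrete with a small sketch. The three EAs and their names are hypothetical; only the percentages (52%, 48%, 90%) and the 50% cut-point come from the text, and the function is an illustration rather than the Nova Scotia team's implementation.

```python
def francophone_by_cutpoint(pct_french, cutpoint=50.0):
    """Binary indicator: True when the share of French mother tongue meets the cut-point."""
    return pct_french >= cutpoint

# Hypothetical EAs carrying the percentages used in the text.
eas = {"EA_1": 52.0, "EA_2": 48.0, "EA_3": 90.0}
labels = {name: francophone_by_cutpoint(pct) for name, pct in eas.items()}

# Difference in percentage points between pairs of EAs.
gap_1_2 = abs(eas["EA_1"] - eas["EA_2"])  # similar EAs, classified differently
gap_1_3 = abs(eas["EA_1"] - eas["EA_3"])  # dissimilar EAs, classified identically

print(labels)
print(gap_1_2, gap_1_3)
```

Under these assumed values, the two EAs that differ by 4 percentage points receive different labels, while the two that differ by 38 points receive the same one.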
Although providing "universal, comprehensive coverage for medically necessary hospital and physician services" [11] is the federal mandate, SES measures may not successfully assess whether or not this egalitarian representation of health services in Canada is equitably applied amongst all peoples. In order to understand the non-economic mechanisms of health service marginalization, SES measures must be supplemented with non-economic measures of sociodemographic and economic characteristics. Although cancer research upon Canadian populations has begun this endeavour, much development is needed in order to capture the multivariate nature of the underlying cultural constructs which may be serving as non-economic forces of marginalization in health service utilization. This research presents a preliminary step in the development of such an indicator. This multi-stage approach uses the empirical multivariate distribution for a chosen census characteristic in order to construct an indicator for the underlying cultural context. The objective of this study is to explore a novel approach for the development of census-based culture indicators.

Methodology

Data Source and Population

The data for this investigation came from the 2001 Canadian Census as reported by Statistics Canada for dissemination areas (DA). The dissemination area is the smallest geographic unit for which data are released by Statistics Canada. It is defined as a standard geographic area designed for data output, predicated upon temporal stability, reduced area suppression, uniformity, intuitive boundaries and a compact shape [33]. The dissemination areas were constructed to respect the Census Sub-Division and Census Tract boundaries and to be comparable to the 1996 census enumeration areas [33]. Within the construction of the DAs, uniformity of numerical size and geographic shape was sought. 75% of the DAs had populations within the 400 to 700 respondents range.
10% had populations less than 400, with approximately 2% experiencing data suppression. Data suppression occurs when a DA has fewer than 40 respondents [33]. 15% had populations over 700 [33]. Approximately 90% of the DAs retained a compact (square) shape, with 10% exhibiting horseshoe, strip, cookie-bite and donut shapes [33]. These pertained to cul-de-sacs, DAs that border upon geographic boundaries, such as rivers, or municipal infrastructure boundaries, such as roads, and high-density regions such as apartment blocks [33]. The observational unit was the DA. In BC there are 6669 DAs. 97 of the DAs experienced data suppression and were removed from the exploration.

Study Variables

The census categories chosen for the investigation were ethnic origins (total responses), mother tongue (single response only) and religion. Ethnic origins has 186 sub-categories. Mother tongue has 80 sub-categories. Religion has 35 categories. The structure of the census data is such that the sub-categories may be further broken down. For example, the total population by mother tongue (category) can be broken down into single responses, multiple responses, and total responses (sub-categories) (Figure 4-1). Total responses can be further broken down into English, French and non-official languages. The non-official languages can be further broken down into specific languages, for example Chinese, Arabic, and Polish. For the purposes of this investigation, the variables were taken to be the most specific level reported for the category (e.g. a specific language, ethnic group, or religion).
[Figure 4-1: Structure of the 2001 Canada Census Ethnic Origins characteristic. Tree diagram with levels Category and Sub-categories; node labels recoverable from the source: Ethnic Origin; Single Response, Multiple Responses, Total Responses; English, French, Non-Official Languages; Chinese, Punjabi, Greek.]

As there were a large number of variables, they were first grouped according to some internal logic such as geography and ethnic groups for ethnic origins (Table 4-1), geography and language families for mother tongue (Table 4-2), and theological positions, historical evolution, and religious practices for religion (Table 4-3).

Table 4-1: Conceptual grouping of ethnic origins

Conceptually Grouped Variables | Original Census Variables - Ethnic Origins (Total Response)
French | Quebecois, Acadian, French
Canadian | Canadian
British Isles | English, Scottish, Irish, Welsh, British not indicated elsewhere
North-West European | German, Dutch, Norwegian, Swedish, Danish, Austrian, Belgian, Finnish, Swiss, Icelandic
Southern Europe and Mediterranean | Italian, Portuguese, Greek, Spanish
Eastern European | Ukrainian, Polish, Russian, Hungary-Magyar, Romanian, Croatian, Czech, Yugoslav not indicated elsewhere, Slovak, Serbian, Armenian
East Asia | Filipino, Vietnamese, Korean, Japanese, South Asian not indicated elsewhere
USA | American - USA
Caribbean | Jamaican, Haitian, West Indian, Trinidadian-Tobagonian
Latin America | Latin America not indicated elsewhere, Guyanese
Middle East | Lebanese, Iranian, Arab not indicated elsewhere, Egyptian
African | African-Black not indicated elsewhere
Indian Sub-continent | Pakistani, Sri Lankan, Punjabi
Chinese | Chinese
Aboriginal | North American Indian, Metis, Inuit
East Indian | East Indian
Jewish | Jewish
Black | Black

Table 4-2: Conceptual groupings of mother tongue

Conceptually Grouped Variables | Original Census Variables - Mother Tongue (Single Response)
Chinese | Cantonese, Mandarin, Hakka, Chinese not otherwise specified
Aboriginal | Cree, Inuktitut-Eskimo, Ojibway, Montagnais-Naskapi, Micmac, South Slave, Dogrib, Chipewyan,
Tlingit, Kutchin-Gwich'in (Loucheux), Dakota Sioux, Blackfoot
Southern European | Italian, Greek, Maltese
Eastern European | Polish, Ukrainian, Hungarian, Croatian, Russian, Czech, Romanian, Slovak, Macedonian, Serbian, Slovenian, Bulgarian, Serbo-Croatian
North-West European | Dutch, Finnish, Danish, Norwegian, Swedish, Flemish, Gaelic Languages
Baltic | Estonian, Lithuanian, Latvian-Lettish
Middle East | Arabic, Persian-Farsi, Armenian, Hebrew, Turkish, Kurdish, Pashto
Indian Sub-Continent | Gujarati, Hindi, Tamil, Urdu, Bengali
East Asia | Tagalog-Filipino, Vietnamese, Korean, Japanese, Khmer-Cambodian, Lao, Malay-Bahasa, Malayalam, Thai
Spanish-Portuguese | Spanish, Portuguese
English | English
French | French
German | German
Punjabi | Punjabi
Other | Creoles, Yiddish, Other languages

Table 4-3: Conceptual groupings of religion

Conceptually Grouped Variables | Original Census Variables for Religion
Catholic | Roman Catholic, Ukrainian Catholic
Orthodox | Greek Orthodox, Orthodox not included elsewhere, Ukrainian Orthodox, Serbian Orthodox
Protestant | United Church, Anglican, Christian not included elsewhere, Baptist, Lutheran, Protestant not included elsewhere, Presbyterian, Pentecostal, Salvation Army, Christian Reformed Church, Evangelical Missionary Church, Christian and Missionary Alliance, Adventist, Non-denominational, Hutterite, Methodist, Brethren in Christ, Mennonite
None | No religion
Muslim | Muslim
Sikh | Sikh
Jewish | Jewish
Other | Buddhist, Hindu, Jehovah's Witnesses, Church of Jesus Christ of Latter-day Saints, Aboriginal spirituality, Pagan

The measure for each conceptual group was the sum of the measures for its constituent variables.
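The summation of original census variables into conceptual groups can be sketched as follows. This is a minimal illustration: the counts for the single dissemination area are invented, the grouping is only an excerpt of Table 4-3, and the function name `group_totals` is hypothetical rather than taken from the study's SAS or R code.

```python
# Hypothetical counts for one dissemination area (illustrative values only).
da_counts = {
    "Roman Catholic": 120,
    "Ukrainian Catholic": 5,
    "Greek Orthodox": 8,
    "Muslim": 12,
    "Sikh": 30,
    "No religion": 200,
}

# Excerpt of the religion grouping in Table 4-3.
groups = {
    "Catholic": ["Roman Catholic", "Ukrainian Catholic"],
    "Orthodox": ["Greek Orthodox", "Ukrainian Orthodox"],
    "Muslim": ["Muslim"],
    "Sikh": ["Sikh"],
    "None": ["No religion"],
}

def group_totals(counts, grouping):
    """Sum the original census variables belonging to each conceptual group.

    Variables absent from `counts` are treated as zero.
    """
    return {
        group: sum(counts.get(variable, 0) for variable in members)
        for group, members in grouping.items()
    }

totals = group_totals(da_counts, groups)
print(totals)
```

Treating a missing variable as zero is an assumption made here for convenience; suppressed or unreported cells may call for different handling.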
In several situations, the underlying constructs prevented a clear allocation to a group. For example, the Church of Jesus Christ of Latter-day Saints shares themes with Protestant churches, yet theological distinctions between the two preclude grouping the Mormon Church with churches that subscribe to the basic tenets of faith upheld by the other churches in the Protestant group. In these situations, either a separate conceptual variable was created or the variable was included in the "Other" category. In this case, the Mormon variable was allocated to the "Other" group (Table 4-3).

Data Transformation

The prevalence of a cultural attribute within a DA was expressed as the percentage of individuals in the DA who had that particular attribute. This transformation, applied to the conceptually grouped variables, provides a measure of cultural density. The new culture indicators are premised upon the prevalence of the attribute rather than the number of people who share the attribute.

Statistical Tools - An Overview

For each census category, a set of variables is obtained. The data did not contain a variable which classified the dissemination areas or any information to organize the raw data. Given this situation, statistical methods are needed to discover a useful structure. In order to discover the underlying structure of the data, a series of descriptive and data mining techniques were employed. Univariate descriptions and bivariate scatter plots were used to establish an initial conceptual framework for the data. This schema served as a guide for the subsequent data mining exploration.

Data mining can be divided into two general areas: supervised and unsupervised learning. Supervised learning trains a model on a data set with an expressed outcome (e.g. logistic regression).
Unsupervised learning is an exploration of the structure of the data with the intent of generating data-related hypotheses (statistical or otherwise). The development of the multivariate-defined culture indicator was predicated upon unsupervised learning techniques. These techniques were chosen as there was no explicit response or outcome variable.

As previously indicated, three cultural constructs were explored: ethnicity, mother tongue and religion. Each cultural characteristic had many sub-categories, thus each census category represented a high-dimensional space. For example, the ethnic origins census characteristic had 61 ethnicities, each representing one dimension (variable or axis) in the ethnic origins space. A succinct expression of these cultural characteristics was desired, thus the number of variables (dimensions) was reduced. The conceptual grouping served as the first step of reduction. For example, the conceptual grouping of ethnic origins reduced the space from 61 dimensions to 18, one dimension for each conceptual group.

Principal component analysis was applied to the conceptually reduced data for several reasons. The conceptual groups represented one way of organizing the data, based upon known similarities between various groups. With principal component analysis, the data could be further organized based upon the multivariate structure of these groups. By using principal components for this second stage of organization, further data reduction could be obtained. This allows the data to be compressed into a smaller set of variables while still retaining much of the original variability [35]. As well, this organization of the data helps to reduce the noise in the data [37]. As the data are in a lower-dimensional variable space, they are easier to visualize. As there is less noise in the data, better results can be obtained by using the principal components as input variables for subsequent analysis.
With the data adequately compressed, the DAs were grouped based upon similarities as expressed by the principal component variables. K-means was used to group the DAs, and the number of groups was chosen through an exploration informed by the knowledge gained from the univariate and bivariate explorations. Once the number of groups was determined, using the criteria of within-group sum of squares and the ease of interpreting the groups, the groups were statistically described and given an evocative label.

Finally, the reasonableness of the groups was explored. British Columbian municipalities were selected to assess how well the indicator described the cultural composition of each municipality. Agreement was sought between the cultural groups indicated within a city and the current understanding of the cultural composition of that city. As well, the groups were tested for internal consistency through a series of chi-squared tests.

SAS was used to generate the data sets and to explore group validity. R was used for the univariate, bivariate, and unsupervised learning procedures.

Principal Component Analysis (PCA)

A brief overview of PCA will be given. For a more complete discussion see Venables and Ripley [34] or Johnson and Wichern [35]. PCA is a data-driven method by which data in a high dimensional variable space is transformed into a lower dimensional variable space. This is done by creating linear combinations of the original variables that have maximal variance. The specific linear combinations are determined by the structure of the variance-covariance matrix. For a given data set, let $\Sigma$ represent the variance-covariance matrix. As $\Sigma$ is symmetric and positive semi-definite, spectral decomposition splits $\Sigma$ into two portions: a diagonal matrix of eigenvalues, $\Lambda$, and an orthogonal matrix of eigenvectors, $E$, thus $\Sigma = E \Lambda E^{T}$. The columns of $E$ are the loadings. These are the coefficients for the linear combinations of the original variables.
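As an illustration of the spectral decomposition just described, the following is a minimal NumPy sketch. It is not the R code used in this study; the function name `pca` and the synthetic data are assumptions made purely for the example.

```python
import numpy as np


def pca(X):
    """PCA by spectral decomposition of the sample covariance matrix.

    X has one row per dissemination area and one column per conceptual
    group. Returns the eigenvalues (largest first), the loadings (as
    columns) and the principal component scores.
    """
    centred = X - X.mean(axis=0)
    sigma = np.cov(centred, rowvar=False)
    eigenvalues, loadings = np.linalg.eigh(sigma)  # eigh returns ascending order
    order = np.argsort(eigenvalues)[::-1]          # re-order, largest first
    eigenvalues, loadings = eigenvalues[order], loadings[:, order]
    scores = centred @ loadings                    # linear combinations of the variables
    return eigenvalues, loadings, scores


rng = np.random.default_rng(0)
# Synthetic stand-in for prevalence data: 200 areas, 4 conceptual groups.
X = rng.random((200, 4)) * np.array([0.6, 0.3, 0.1, 0.05])

eigenvalues, loadings, scores = pca(X)
proportion = eigenvalues / eigenvalues.sum()
cumulative = np.cumsum(proportion)
print(np.round(proportion, 3))
print(np.round(cumulative, 3))
```

`np.linalg.eigh` is used because the covariance matrix is symmetric. The variance of each column of `scores` equals the corresponding eigenvalue, which is why the eigenvalues can be read as the variation captured by each component.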
The size of each loading indicates the weight given to each of the original variables. The diagonal entries of $\Lambda$ measure the variation associated with each of the columns of $E$.

As the primary objective for the use of PCA in this study was variable reduction, the set of principal components which explained much of the variation in the data was determined. Three measures, based upon the eigenvalue for the $i$th component, $\lambda_i$, were used to identify the desired number of components: the proportion of variance explained by the component, the cumulative proportion of variance explained, and the scree plot. With $p$ components in total, the $i$th component explains

$$\frac{\lambda_i}{\sum_{k=1}^{p} \lambda_k}$$

of the variation. The variation explained by the first $i$ components is

$$\frac{\sum_{k=1}^{i} \lambda_k}{\sum_{k=1}^{p} \lambda_k}.$$

The scree plot is a useful aid for determining the number of principal components that are desired. With the components ordered from the one explaining the most variation in the data to the one explaining the least, the scree plot is a plot of the variation in the data, $\lambda_i$, versus the index, $i$. An elbow in the curve is desirable (Figure 4-2). The principal components chosen are those that precede the elbow, which are the components that capture much of the variation in the data.

Figure 4-2: Scree plot for the ethnic origins census PCA

K-Means

K-means is a clustering algorithm which chooses a predetermined number of group or cluster centres so as to minimise the within-cluster sum of squares. This iterative process begins with the random selection of k initial centres. The distance from each observational unit (DA) to each centre was found. Each point was associated with the centre to which it was closest, resulting in k clusters. New centres (means) were calculated for each cluster. This comprised the first iteration. The second iteration began with the removal of the associations for each cluster. The distance between each DA and each of the k new centres was determined. Each DA was then associated with the closest of the k centres, resulting in k new clusters, which in general differed from the previous k clusters. New centres were calculated for each of the clusters. This comprised the second iteration. The process continued until the k centres became stable.

By selecting the minimum distance as the measure by which each observational unit is associated with a centre, the algorithm sought to minimize the within-cluster sum of squares,

$$\sum_{i \in C_j} \lVert x_i - \bar{x}_j \rVert^2 \tag{0.1}$$

where $x_i$ is the $i$th DA, $\bar{x}_j$ is the centre of the $j$th cluster, and $C_j$ is the $j$th cluster. By minimizing the within-cluster sum of squares, the clusters exhibit a desired level of homogeneity within each cluster. Cluster homogeneity can be further increased or decreased by increasing or decreasing the number of clusters that are initially chosen. For the purposes of this exploration, the Euclidean distance, as implied in (0.1), was used.

In order to assess the quality of the clustering, two criteria were considered. The first was the total within-cluster sum of squares.
This is the sum of (0.1) over all clusters and is defined as

$$\sum_{j=1}^{k} \sum_{i \in C_j} \lVert x_i - \bar{x}_j \rVert^2 \tag{0.2}$$

where, as in (0.1), $x_i$ is the $i$th DA, $\bar{x}_j$ is the centre of the $j$th cluster $C_j$, and $k$ is the number of clusters. The second was the ability to interpret the results reasonably. A balance was desired between a low total within-cluster sum of squares and the ability to give an evocative label to the clusters.

Results

In order to build further understanding of the development process, a detailed explanation will be given for the ethnic origins census variable. Only the final results will be given for the mother tongue and religion census characteristics. This detailed look into the construction of the variables will serve to highlight the innovations in the development of culture indicators and their complexity.

Ethnic Origins

As seen in Table 4-1, the original census sub-categories for the category of ethnic origins were grouped into meaningful geographical regions. PCA was performed upon this conceptually grouped set of ethnic origins census variables. The results of this exploration served as the basis for grouping the dissemination areas. As mentioned earlier, the scree plot, the proportion of variance and the cumulative proportion of variance were used to determine the number of principal components used to group the dissemination areas.

The scree plot (Figure 4-2) suggests that between five and eight principal components are reasonable. There is no clear elbow in the curve, although the curve clearly levels off after the eighth principal component. Since the goal is to reduce the dimensionality of the problem, only a few principal components are desired. From Table 4-4, it can be seen that there are large differences between the first three components. The differences in the proportion of variance explained begin to decline rapidly after the fifth or sixth principal component. As well, over 91% of the variation in the data is explained by the first five components.
Either five or six components appear to be reasonable and, as the objective was to minimize the dimensionality of the problem, five components were taken. The loadings for these five components represent the coefficients of the linear combinations of the conceptually grouped census variables. Based upon these linear combinations, new values were calculated for the dissemination areas.

Table 4-4: PCA proportion of variance and cumulative proportion of variance for the conceptually grouped ethnic origins census variables

| Principal Component | Proportion of Variance | Cumulative Proportion |
|---|---|---|
| Component 1 | 0.412 | 0.412 |
| Component 2 | 0.293 | 0.705 |
| Component 3 | 0.103 | 0.808 |
| Component 4 | 0.067 | 0.875 |
| Component 5 | 0.039 | 0.914 |
| Component 6 | 0.030 | 0.944 |
| Component 7 | 0.019 | 0.963 |
| Component 8 | 0.014 | 0.977 |
| Component 9 | 0.011 | 0.988 |
| Component 10 | 0.004 | 0.992 |

K-means was used to group the dissemination areas in the five-dimensional principal component space. The univariate analysis suggested that Canadian, British Isles, Chinese, Aboriginal, and East Indian communities may exist. The plot of the total within-cluster sum of squares versus the number of clusters, for 2 through 25 clusters, begins to level off around the eighth cluster (Figure 4-3). Substantial reductions in the total within-cluster sum of squares are seen between two and three clusters and between three and four clusters.

Figure 4-3: Total within-cluster sum of squares (SS) versus the number of clusters

The in-depth exploration with respect to the within-cluster sum of squares and the ability to interpret the clusters was restricted to investigations utilizing four through seven clusters. Six was selected as the best number of clusters because, beyond six clusters, the gains in the reduction of the total within-cluster sum of squares were minimal and the clusters became considerably more difficult to interpret.
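The iterative procedure from the K-Means section, together with the sweep over candidate numbers of clusters used to produce a plot such as Figure 4-3, can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the R procedure used in this study: the helper name `kmeans`, the synthetic two-dimensional data and the use of ten random starts per k are all choices made for the example.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm with Euclidean distance.

    Returns the cluster centres, the label of each row of X and the
    total within-cluster sum of squares.
    """
    rng = np.random.default_rng(seed)
    # Random selection of k observations as the initial centres.
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared distance from every point to every centre.
        d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # New centre = mean of the assigned points (kept as-is if empty).
        new_centres = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centres[j]
            for j in range(k)
        ])
        if np.allclose(new_centres, centres):
            break  # centres have become stable
        centres = new_centres
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    total_ss = d2[np.arange(len(X)), labels].sum()
    return centres, labels, total_ss


rng = np.random.default_rng(1)
# Synthetic data: three groups of 50 points around made-up centres.
blob_centres = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + 0.3 * rng.standard_normal((50, 2)) for c in blob_centres])

# Total within-cluster SS for each candidate k, keeping the best of
# several random starts because a single start can stall in a poor
# local optimum.
total_ss = {
    k: min(kmeans(X, k, seed=s)[2] for s in range(10))
    for k in range(1, 7)
}
for k, ss in total_ss.items():
    print(k, round(float(ss), 1))
```

Plotting `total_ss` against k gives the kind of curve on which a levelling-off point is sought. As in the text, the final choice still has to be balanced against how easily the resulting clusters can be described.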
Six clusters yielded a total within-cluster sum of squares of 99.5, whereas five clusters had 118.7 and seven clusters had 90.7.

Although the clusters were defined in the principal component space, it was possible to identify the groups in the original conceptual variable space. The multivariate nature of the groups was summarized by the univariate measures for each of the eighteen conceptual groups (Table 4-5). Based upon these descriptions, an evocative label was given to each of the six groups.

The "Heterogeneous" group has pockets of the various ethnic origins, but no one ethnic origin appears to have a strong overall representation within this group. The "Aboriginal Presence" cluster has a strong Aboriginal presence, as the proportion of Aboriginals in these dissemination areas ranges from 44% to 100%. One exceptional feature of this group is the noticeable absence of East Indian, Indian Sub-Continent and Caribbean origins. Pockets of European, Asian and Canadian origins can be seen. The dissemination areas in the "Chinese Presence" cluster have a moderate to strong Chinese ethnic origin presence, with proportions ranging from just over 28% to just under 87%. The presence of other ethnic origins is more diverse than in the Aboriginal Presence cluster. The "British Isles Presence" cluster has a strong presence of ethnic origins from the British Isles, ranging from 37% to just under 88%. As with the previous clusters, there are visible pockets of ethnic origins other than British Isles, with Canadian origins exhibiting the strongest minority presence. The "East Indian Presence" group has dissemination areas with East Indian ethnic origins ranging from about 17% to 100%. The "Americas and Europe" cluster has dissemination areas with a strong Canadian, British Isles, and North-Western Europe presence. This cluster also contains the dissemination areas with higher proportions of Latin American origins.
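Table 4-5: Statistical descriptions of the population proportions for the six ethnic origin groups expressed by the minimum value, the median, the maximum value and the mean

| Conceptual Group | Statistic | Heterogeneous (1) | Aboriginal Presence (2) | Chinese Presence (3) | British Isles Presence (4) | East Indian Presence (5) | Americas and Europe (6) |
|---|---|---|---|---|---|---|---|
| Canadian | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Canadian | Median | 0.100 | 0.000 | 0.060 | 0.138 | 0.079 | 0.155 |
| Canadian | Max | 0.267 | 0.400 | 0.247 | 0.550 | 0.265 | 0.765 |
| Canadian | Mean | 0.105 | 0.025 | 0.065 | 0.140 | 0.083 | 0.161 |
| French | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| French | Median | 0.031 | 0.000 | 0.013 | 0.049 | 0.022 | 0.054 |
| French | Max | 0.122 | 0.222 | 0.103 | 0.294 | 0.116 | 0.286 |
| French | Mean | 0.034 | 0.021 | 0.016 | 0.051 | 0.026 | 0.056 |
| British Isles | Min | 0.057 | 0.000 | 0.000 | 0.373 | 0.000 | 0.000 |
| British Isles | Median | 0.262 | 0.068 | 0.131 | 0.475 | 0.160 | 0.362 |
| British Isles | Max | 0.484 | 0.375 | 0.375 | 0.875 | 0.333 | 0.486 |
| British Isles | Mean | 0.263 | 0.080 | 0.134 | 0.486 | 0.155 | 0.352 |
| North-western Europe | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| North-western Europe | Median | 0.092 | 0.000 | 0.043 | 0.147 | 0.074 | 0.189 |
| North-western Europe | Max | 0.278 | 0.333 | 0.186 | 0.375 | 0.299 | 0.790 |
| North-western Europe | Mean | 0.096 | 0.033 | 0.048 | 0.148 | 0.081 | 0.197 |
| Southern Europe and Mediterranean | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Southern Europe and Mediterranean | Median | 0.039 | 0.000 | 0.033 | 0.021 | 0.017 | 0.023 |
| Southern Europe and Mediterranean | Max | 0.325 | 0.200 | 0.303 | 0.196 | 0.179 | 0.232 |
| Southern Europe and Mediterranean | Mean | 0.051 | 0.007 | 0.049 | 0.026 | 0.025 | 0.030 |
| Eastern European | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Eastern European | Median | 0.065 | 0.000 | 0.033 | 0.063 | 0.037 | 0.081 |
| Eastern European | Max | 0.382 | 0.182 | 0.159 | 0.385 | 0.207 | 0.400 |
| Eastern European | Mean | 0.073 | 0.008 | 0.038 | 0.067 | 0.043 | 0.087 |
| East Asian | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| East Asian | Median | 0.060 | 0.000 | 0.077 | 0.003 | 0.049 | 0.007 |
| East Asian | Max | 0.452 | 0.118 | 0.349 | 0.176 | 0.395 | 0.277 |
| East Asian | Mean | 0.079 | 0.003 | 0.088 | 0.012 | 0.066 | 0.019 |
| United States | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| United States | Median | 0.000 | 0.000 | 0.000 | 0.007 | 0.000 | 0.006 |
| United States | Max | 0.073 | 0.071 | 0.083 | 0.104 | 0.047 | 0.212 |
| United States | Mean | 0.006 | 0.002 | 0.004 | 0.009 | 0.003 | 0.010 |
| Caribbean | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Caribbean | Median | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Caribbean | Max | 0.065 | 0.000 | 0.129 | 0.094 | 0.040 | 0.105 |
| Caribbean | Mean | 0.003 | 0.000 | 0.002 | 0.001 | 0.003 | 0.002 |
| Latin America | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Latin America | Median | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Latin America | Max | 0.100 | 0.016 | 0.082 | 0.050 | 0.064 | 0.113 |
| Latin America | Mean | 0.002 | 0.000 | 0.001 | 0.001 | 0.002 | 0.001 |
| Middle Eastern | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Middle Eastern | Median | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Middle Eastern | Max | 0.425 | 0.067 | 0.114 | 0.239 | 0.087 | 0.356 |
| Middle Eastern | Mean | 0.011 | 0.000 | 0.004 | 0.004 | 0.004 | 0.005 |
| African | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| African | Median | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| African | Max | 0.091 | 0.133 | 0.085 | 0.081 | 0.031 | 0.088 |
| African | Mean | 0.003 | 0.002 | 0.001 | 0.001 | 0.001 | 0.002 |
| Indian Sub-continent | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Indian Sub-continent | Median | 0.000 | 0.000 | 0.000 | 0.000 | 0.034 | 0.000 |
| Indian Sub-continent | Max | 0.163 | 0.000 | 0.145 | 0.075 | 0.328 | 0.129 |
| Indian Sub-continent | Mean | 0.006 | 0.000 | 0.006 | 0.001 | 0.051 | 0.003 |
| Chinese | Min | 0.019 | 0.000 | 0.292 | 0.000 | 0.000 | 0.000 |
| Chinese | Median | 0.191 | 0.000 | 0.469 | 0.000 | 0.015 | 0.000 |
| Chinese | Max | 0.405 | 0.154 | 0.868 | 0.199 | 0.329 | 0.148 |
| Chinese | Mean | 0.198 | 0.005 | 0.484 | 0.017 | 0.044 | 0.016 |
| Aboriginal | Min | 0.000 | 0.444 | 0.000 | 0.000 | 0.000 | 0.000 |
| Aboriginal | Median | 0.011 | 0.830 | 0.000 | 0.015 | 0.002 | 0.025 |
| Aboriginal | Max | 0.215 | 1.000 | 0.233 | 0.290 | 0.210 | 0.458 |
| Aboriginal | Mean | 0.018 | 0.814 | 0.010 | 0.022 | 0.015 | 0.038 |
| East Indian | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.167 | 0.000 |
| East Indian | Median | 0.023 | 0.000 | 0.022 | 0.000 | 0.358 | 0.000 |
| East Indian | Max | 0.243 | 0.000 | 0.295 | 0.206 | 1.000 | 0.235 |
| East Indian | Mean | 0.042 | 0.000 | 0.043 | 0.008 | 0.397 | 0.019 |
| Jewish | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Jewish | Median | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Jewish | Max | 0.118 | 0.047 | 0.121 | 0.153 | 0.107 | 0.099 |
| Jewish | Mean | 0.010 | 0.000 | 0.008 | 0.005 | 0.002 | 0.003 |
| Black | Min | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Black | Median | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| Black | Max | 0.044 | 0.048 | 0.031 | 0.060 | 0.061 | 0.065 |
| Black | Mean | 0.001 | 0.000 | 0.001 | 0.000 | 0.000 | 0.000 |
| Cluster Size | | 808 | 237 | 410 | 2468 | 206 | 2443 |

From the statistical descriptions of the clusters, it would be expected that the Aboriginal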
Presence, Chinese Presence and East Indian Presence clusters would exhibit a degree of separation from the other groups when plotted. In Figure 4-4, the Heterogeneous, British Isles Presence and Americas and Europe groups are close together in each plot. The Aboriginal Presence cluster exhibits clear separation in plots 4-4a, e, f, and g. It shows minor separation in plots 4-4b, c, and d. This level of separation is reflected in the low proportions of non-Aboriginal conceptual groups. The Chinese Presence cluster is far from the other clusters in plot 4-4a and has minor separation in plots 4-4b and 4-4e. This cluster is close to the Heterogeneous, British Isles Presence and Americas and Europe clusters in the remaining plots. This is consistent with the higher proportions of the Western European and Canadian conceptual ethnic origin categories in this cluster. The East Indian Presence cluster exhibits high and moderate levels of separation in plots 4-4b, c, e, f, h, and i.

Figure 4-4: Scatter plots of the first five principal components with the centres for the six ethnic origins clusters: Heterogeneous (1), Aboriginal Presence (2), Chinese Presence (3), British Isles Presence (4), East Indian Presence (5) and the Americas and Europe (6)

The culture indicator was initially assessed by investigating the proportion of the six ethnic origin groups in a census sub-division (Table 4-6). Surrey and Abbotsford have the highest proportion of East Indian Presence dissemination areas, 23.4% and 12.8% respectively. Richmond, Vancouver, and Burnaby have the highest proportion of Chinese Presence dissemination areas, 29.4%, 30.4%, and 11% respectively.
Victoria, West Vancouver, and Nelson have the highest proportion of British Isles Presence dissemination areas, 84.7%, 79.2%, and 72.7% respectively. These regional attributes agree with the general profile of the various cities.

Table 4-6: Percentage of the six ethnic origins clusters in 22 British Columbian municipalities (proportion within census subdivision)

| Census Subdivision | Heterogeneous | Aboriginal Presence | Chinese Presence | British Isles Presence | East Indian Presence | Americas and Europe |
|---|---|---|---|---|---|---|
| Abbotsford | 0.0% | 0.0% | 0.0% | 12.8% | 12.8% | 74.4% |
| Burnaby | 67.6% | 0.0% | 11.0% | 6.3% | 2.2% | 12.9% |
| Chilliwack | 0.0% | 0.0% | 0.0% | 21.4% | 0.0% | 78.6% |
| Coquitlam | 35.2% | 0.0% | 3.4% | 25.7% | 0.0% | 35.8% |
| Cranbrook | 0.0% | 0.0% | 0.0% | 25.0% | 0.0% | 75.0% |
| Delta | 4.8% | 0.0% | 0.6% | 47.9% | 8.4% | 38.3% |
| Fort St. John | 0.0% | 0.0% | 0.0% | 0.0% | 0.0% | 100.0% |
| Kamloops | 0.8% | 0.0% | 0.0% | 48.5% | 0.0% | 50.8% |
| Kamloops 1 | 0.0% | 40.0% | 0.0% | 40.0% | 0.0% | 20.0% |
| Kelowna | 0.0% | 0.0% | 0.0% | 21.3% | 0.0% | 78.8% |
| Langley | 0.0% | 0.0% | 0.0% | 44.0% | 0.0% | 56.0% |
| Maple Ridge | 0.0% | 0.0% | 0.0% | 55.3% | 0.0% | 44.7% |
| Nelson | 0.0% | 0.0% | 0.0% | 72.7% | 0.0% | 27.3% |
| New Westminster | 4.3% | 0.0% | 0.0% | 34.8% | 3.3% | 57.6% |
| North Vancouver | 4.2% | 0.5% | 0.0% | 66.5% | 0.0% | 28.8% |
| Prince George | 0.0% | 0.0% | 0.0% | 23.1% | 0.0% | 76.9% |
| Prince Rupert | 0.0% | 0.0% | 0.0% | 6.3% | 0.0% | 93.8% |
| Richmond | 54.5% | 0.0% | 29.4% | 7.2% | 1.3% | 7.7% |
| Surrey | 9.3% | 0.0% | 0.0% | 22.8% | 23.4% | 44.5% |
| Vancouver | 30.5% | 0.0% | 30.4% | 22.0% | 3.4% | 13.7% |
| Victoria | 0.7% | 0.0% | 0.0% | 84.7% | 0.0% | 14.6% |
| West Vancouver | 13.0% | 0.0% | 1.3% | 79.2% | 0.0% | 6.5% |

Mother Tongue

The same steps were taken for the conceptually grouped mother tongue variables (Table 4-2). The first three principal components, which explained 91% of the variation in the data, were taken. In this reduced variable space, various numbers of clusters were explored for a low total within-cluster sum of squares and ease of interpretation. Five clusters were chosen, with a total within-cluster sum of squares of 71.54.
The five centres were labelled as:
• Punjabi Presence,
• Dominant English,
• Heterogeneous with Dominant English,
• Chinese Presence, and
• Strong English Presence.

The distributions for each group are given in the Appendix (Table 4-12). When the five centres are visualised in the principal component space, it can be seen that there is good separation between the clusters in most of the pair-wise plots (Figure 4-5). In Figure 4-5, a high degree of separation exists between cluster 1 and the other four. Cluster 4 has good separation in all three plots as well. Clusters 2, 3, and 5 have moderate separation in plots 4-5a and b. In plot 4-5c, clusters 5 and 3 are very close together, with cluster 2 exhibiting only moderate separation from them.

Figure 4-5: Scatter plots (panels a, b and c) of the first three principal components with the centres for the five mother tongue clusters: Punjabi Presence (1), Dominant English (2), Heterogeneous with Dominant English (3), Chinese Presence (4), and Strong English Presence (5)

With such close proximity of clusters 3 and 5 in plot 4-5c, it is expected that these clusters will share similar features. The labels suggest the commonality of English as mother tongue between cluster 3, "Heterogeneous with Dominant English", and cluster 5, "Strong English Presence". In cluster 3, the mother tongue profile is well mixed with many different language pockets emerging, but there is a noticeable presence of English, which ranges from 12% to 68%. As well, there is a minor Chinese presence throughout these dissemination areas. Ranging from 0% to 44%, this minor presence rises quickly from 0% to 16% by the first quartile and reaches 31% by the third quartile. Cluster 5 has a strong English presence, ranging from 48% to 81% within the dissemination areas. By the first quartile, this presence becomes a majority (65%).
As with the third cluster, there are some strong non-English representations. These are fewer and not as large; for example, the Chinese presence ranges from 0% to 28% with the third quartile at 8%. Cluster 2, "Dominant English", has the strongest presence of English as the mother tongue, ranging from 78% to 100%, which accounts for the slight separation of this cluster from clusters 3 and 5. A number of languages have maximums between 15% and 22%, such as Aboriginal, German, French, North-Western European, Indian sub-continental, East Asian, Spanish-Portuguese, Punjabi, and Eastern European, yet none of the third quartiles exceeds 2%. From this, the proximity of clusters 3 and 5 to cluster 2 is reasonable.

Clusters 1 and 4 should represent distinct mother tongue groups. Cluster 1 has been identified as the "Punjabi Presence" cluster. Here the presence of this language ranges from 16% to 90%. Although it does not have such a distinct presence across all dissemination areas in this cluster, the first quartile is 32%, thus in 75% of the dissemination areas in this cluster close to a third or more of the people had Punjabi as their mother tongue. There is also an English presence in this cluster, ranging from 0% to 61%. The first quartile is 24%, thus in 75% of the dissemination areas roughly a quarter or more of the people have English as their mother tongue. Cluster 4 has been identified as the "Chinese Presence" cluster. Here the presence of Chinese languages ranges from 26% to 87%. In all dissemination areas in this cluster, at least a quarter of all the people have a Chinese language as their mother tongue. The first quartile is 42%, thus this minority quickly rises to be a strong presence. Again English does have a presence, but it is not as strong as in the Punjabi cluster, ranging from 4% to 49% with the third quartile being 34%.

When considering the distribution of these clusters within census sub-divisions, trends similar to those for ethnic origins emerge (Table 4-7).
Surrey and Abbotsford have the largest percentage of dissemination areas in the Punjabi Presence cluster. Richmond, Vancouver and Burnaby have the highest proportion of Chinese Presence dissemination areas. Nelson, Cranbrook and Kamloops 1 have all dissemination areas identified with the Dominant English cluster.

Table 4-7: Percentage of the five mother tongue clusters in 22 British Columbian municipalities (proportion within census subdivision)

| Census Subdivision | Punjabi Presence | Dominant English | Heterogeneous with Dominant English | Chinese Presence | Strong English Presence |
|---|---|---|---|---|---|
| Abbotsford | 15.1% | 48.3% | 1.2% | 0.0% | 35.5% |
| Burnaby | 0.9% | 1.6% | 66.0% | 12.3% | 19.2% |
| Chilliwack | 0.0% | 95.7% | 0.0% | 0.0% | 4.3% |
| Coquitlam | 0.0% | 17.3% | 29.6% | 3.4% | 49.7% |
| Cranbrook | 0.0% | 100.0% | 0.0% | 0.0% | 0.0% |
| Delta | 11.4% | 47.9% | 3.6% | 1.8% | 35.3% |
| Fort St. John | 0.0% | 100.0% | 0.0% | 0.0% | 0.0% |
| Kamloops | 0.0% | 89.4% | 0.8% | 0.0% | 9.8% |
| Kamloops 1 | 0.0% | 100.0% | 0.0% | 0.0% | 0.0% |
| Kelowna | 0.0% | 78.1% | 0.0% | 0.0% | 21.9% |
| Langley | 0.0% | 86.9% | 0.0% | 0.0% | 13.1% |
| Maple Ridge | 0.0% | 89.3% | 0.0% | 0.0% | 10.7% |
| Nelson | 0.0% | 100.0% | 0.0% | 0.0% | 0.0% |
| New Westminster | 4.3% | 33.7% | 3.3% | 0.0% | 58.7% |
| North Vancouver | 0.0% | 33.0% | 5.1% | 0.0% | 61.9% |
| Prince George | 0.7% | 85.1% | 0.0% | 0.0% | 14.2% |
| Prince Rupert | 0.0% | 75.0% | 0.0% | 0.0% | 25.0% |
| Richmond | 0.9% | 3.8% | 45.1% | 36.2% | 14.0% |
| Surrey | 23.7% | 24.7% | 7.6% | 0.0% | 44.0% |
| Vancouver | 3.1% | 8.9% | 28.5% | 32.1% | 27.5% |
| Victoria | 0.0% | 77.4% | 0.7% | 0.0% | 21.9% |
| West Vancouver | 0.0% | 50.6% | 15.6% | 1.3% | 32.5% |

An interesting feature revealed by this exploration is the contrast between Abbotsford, Chilliwack and Langley. Abbotsford is situated between Chilliwack and Langley on the Trans-Canada Highway; both are approximately 30 kilometres from Abbotsford. Interestingly, Langley and Chilliwack have more in common with each other than with Abbotsford with respect to the distribution of mother tongue languages. Both Chilliwack and Langley have a high proportion of dissemination areas identified with the Dominant English cluster, 95.7% and 86.9% respectively.
Abbotsford has only 48.3%. The balance of the dissemination areas for Chilliwack and Langley falls in the Strong English Presence cluster. Abbotsford, unlike these two, has 35.5% in that cluster and has 15.1% of its dissemination areas identified with the Punjabi Presence cluster. From a mother tongue perspective, Chilliwack and Langley have more in common with cities such as Nelson, Cranbrook and Prince George than with Abbotsford. These regional attributes agree with the general profile of the various cities.

Religion

From the PCA, four principal components, which explained just under 97% of the variation in the data, were taken. Six clusters were chosen as they had a low total within-cluster sum of squares (145.3) and could be easily described. Figure 4-6 reveals that two clusters, 2 and 4, have good separation from the other four clusters. Clusters 1, 3, 5, and 6 have reasonable separation in most scatter plots, but common themes are expected between them.

Figure 4-6: Scatter plots of the first four principal components with the centres for the six religion clusters

Clusters 1, 3, 5, and 6 are generally close together. This proximity is due to the subtle differences with respect to the presence of Protestants and those who have no religious affiliation. Clusters 1 and 6 are both heterogeneous with respect to religious affiliation. Cluster 1 has a moderate presence of Protestants (25% to 58%) and people with no religious affiliation (25% to 62%). Protestants in cluster 6 range from 0% to 45% and those with no religious affiliation range from 0% to 47%. Also, the Catholic presence in this cluster is similar to that of the Protestants and those with no religious affiliation, ranging from 0% to 45% with the first quintile just over 23%.
These similarities can be seen in the close proximity of clusters 1 and 6 in Figures 4-6a, 4-6b, and 4-6c. The differences between the two clusters lie mainly in the remaining religious identities.

Cluster 2 is labelled "Catholic Presence" (Appendix, Table 4-13). The proportion of Catholics, both Roman and Ukrainian, ranges from 42% to 100%. The first quartile is just under 52%, thus 75% of all the dissemination areas in this cluster have a majority of Catholics. As usual, there are other religions represented in this cluster, but none exhibits the consistent presence shown by those who identify with the Catholic religion.

Cluster 3 has been labelled "Protestant Presence" (Appendix, Table 4-13). The proportion of Protestants ranges from just under 37% to 100% with a median of approximately 53%. Catholic and no religious affiliation groups are present but do not exceed 45% as their maximum. They have medians less than or equal to 25%. The other religious groups have low representations in this cluster.

Cluster 4 is labelled "Sikh Presence" (Appendix, Table 4-13). The proportion of Sikhs ranges from just under 18% to just over 92%. The first quartile is 28.6% with a median of 36.5%. This indicates that there is a moderate to strong Sikh presence in these dissemination areas. There is a Protestant presence within this cluster, which ranges from 0% to 44% with a third quintile of 24%. Again, other religions are represented in this cluster to varying degrees.

Cluster 5 has been called the "No Religion Presence" cluster (Appendix, Table 4-13). Here, dissemination areas range from just over 33% to 100% in the proportion of people identifying no religious affiliation. The first quartile reaches 47%. The Protestant presence does not exceed 37% and the Catholic presence does not exceed 44%. Again, there are pockets of other religious affiliations, but none exhibits the densities of the no religion, Protestant, or Catholic groups.
When considering the distribution of the religion clusters within census sub-divisions, familiar results occur (Table 4-8). Surrey and Abbotsford have the highest proportion of dissemination areas defined as having a Sikh Presence. Richmond, Vancouver, and Burnaby, which had high proportions of "Chinese" ethnic origins and "Chinese" mother tongue, have varying results. Richmond has a high proportion of dissemination areas associated with the Heterogeneous cluster (45.1%) and a moderately high proportion associated with the No Religion Presence cluster (30.2%). Vancouver has the highest proportion of dissemination areas associated with the No Religion Presence cluster (47.0%) and a moderately high proportion in the Heterogeneous cluster (31.6%). Burnaby has a high proportion of dissemination areas associated with the Heterogeneous cluster (57.9%) and a moderate proportion with the No Religion Presence cluster (21.1%).

Table 4-8: Percentage of the six religion clusters in 22 British Columbian municipalities (proportion within census subdivision)

| Census Subdivision | Heterogeneous with Protestant and No Religion Presence | Catholic Presence | Protestant Presence | Sikh Presence | No Religion Presence | Heterogeneous |
|---|---|---|---|---|---|---|
| Abbotsford | 12.8% | 0.0% | 65.1% | 17.4% | 0.0% | 4.7% |
| Burnaby | 14.2% | 0.9% | 3.8% | 2.2% | 21.1% | 57.9% |
| Chilliwack | 31.4% | 0.0% | 64.3% | 0.0% | 1.4% | 2.9% |
| Coquitlam | 19.6% | 1.7% | 10.6% | 0.0% | 12.3% | 55.9% |
| Cranbrook | 35.0% | 0.0% | 25.0% | 0.0% | 0.0% | 40.0% |
| Delta | 28.1% | 0.0% | 28.7% | 15.0% | 6.0% | 22.2% |
| Fort St. John | 40.0% | 0.0% | 20.0% | 0.0% | 13.3% | 26.7% |
| Kamloops | 43.9% | 0.0% | 17.4% | 1.5% | 10.6% | 26.5% |
| Kamloops 1 | 0.0% | 0.0% | 20.0% | 0.0% | 40.0% | 40.0% |
| Kelowna | 30.0% | 0.0% | 46.3% | 0.0% | 6.9% | 16.9% |
| Langley | 42.3% | 0.0% | 37.5% | 0.6% | 7.1% | 12.5% |
| Maple Ridge | 38.8% | 0.0% | 23.3% | 0.0% | 19.4% | 18.4% |
| Nelson | 88.9% | 0.0% | 0.0% | 0.0% | 11.1% | 0.0% |
| New Westminster | 23.9% | 0.0% | 8.7% | 6.5% | 8.7% | 52.2% |
| North Vancouver | 34.9% | 0.5% | 17.7% | 0.0% | 8.4% | 38.6% |
| Prince George | 30.6% | 0.7% | 11.2% | 0.7% | 18.7% | 38.1% |
| Prince Rupert | 25.0% | 0.0% | 31.3% | 0.0% | 0.0% | 43.8% |
| Richmond | 19.6% | 0.0% | 3.4% | 1.7% | 30.2% | 45.1% |
| Surrey | 16.7% | 0.6% | 20.1% | 28.3% | 4.6% | 29.8% |
| Vancouver | 13.3% | 1.2% | 3.3% | 3.7% | 47.0% | 31.6% |
| Victoria | 46.0% | 0.0% | 14.6% | 0.0% | 28.5% | 10.9% |
| West Vancouver | 35.1% | 0.0% | 48.1% | 0.0% | 1.3% | 15.6% |

Abbotsford and Chilliwack have the highest proportion of Protestant Presence clusters from this list of municipalities, 65.1% and 64.3% respectively. These communities differ in that Abbotsford has a Sikh presence of 17.4% whereas Chilliwack has none. Chilliwack, on the other hand, has 31.4% of its dissemination areas associated with the Heterogeneous with Protestant and No Religion Presence cluster, whereas Abbotsford has only 12.8%. These regional attributes agree with the general profile of the various cities.

Indicator Validation

The prevalence of the various clusters within census sub-divisions was used as a first step in assessing the reasonableness of the culture indicators described above. The next step was to check the consistency of the indicators by cross-tabulation. The two-way tables (not presented here) for
• ethnic origins and mother tongue,
• ethnic origins and religion, and
• mother tongue and religion
were constructed and assessed for independence. A chi-squared test for independence found that the ethnic origins clusters and mother tongue clusters were dependent (p-value ≪ 0.001). Similarly, the ethnic origins clusters and religion clusters were found to be dependent (p-value ≪ 0.001), as were the mother tongue clusters and the religion clusters (p-value ≪ 0.001). These associations are highlighted by their conditional multinomial distributions.

Table 4-9 gives the distribution of mother tongue clusters given the ethnic origins cluster. If the ethnic origin was from the Chinese Presence cluster, 92% of these dissemination areas were also members of the mother tongue Chinese Presence cluster. For the East Indian Presence ethnic origin cluster, 89.3% of the dissemination areas were from the Punjabi Presence mother tongue cluster. The Heterogeneous ethnic origins cluster had 77.4% of its dissemination areas associated with the Heterogeneous with Dominant English mother tongue cluster. The British Isles Presence ethnic origin cluster has 79.2% of its dissemination areas associated with the Dominant English cluster and 19.8% associated with the Strong English Presence cluster.

Table 4-9: Percentage of clusters for mother tongue groups given the ethnic origin group (mother tongue | ethnic origins)

| Ethnic origins cluster | Punjabi Presence | Dominant English | Heterogeneous with Dominant English | Chinese Presence | Strong English Presence |
|---|---|---|---|---|---|
| Heterogeneous | 0.9% | 0.0% | 77.4% | 8.7% | 13.1% |
| Aboriginal Presence | 0.0% | 80.2% | 2.5% | 0.0% | 17.3% |
| Chinese Presence | 0.0% | 0.0% | 8.0% | 92.0% | 0.0% |
| British Isles Presence | 0.0% | 79.2% | 1.1% | 0.0% | 19.8% |
| East Indian Presence | 89.3% | 0.5% | 3.4% | 1.5% | 5.3% |
| Americas and Europe | 0.9% | 62.0% | 2.3% | 0.0% | 34.8% |

The table of religion clusters given the ethnic origins cluster is less well defined than the previous one, but still exhibits the associations desired of a well defined indicator (Table 4-10). The East Indian Presence ethnic origins cluster has 94.2% of its dissemination areas associated with the Sikh Presence cluster. The Chinese Presence ethnic origin cluster has 64.9% of its dissemination areas associated with the No Religion Presence cluster.
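The cross-tabulation, the chi-squared test of independence and the conditional (row) distributions used in this validation step can be sketched as follows. The counts in `observed` are hypothetical and do not come from the study data, and the calculation is written out with NumPy purely for illustration; the study itself used SAS for this step.

```python
import numpy as np

# Hypothetical counts of dissemination areas cross-classified by two
# cluster indicators (rows: indicator A, columns: indicator B).
observed = np.array([
    [180, 15, 5],
    [20, 150, 30],
    [10, 25, 165],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
n = observed.sum()

# Expected counts under independence and Pearson's chi-squared statistic.
expected = row_totals @ col_totals / n
chi2 = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)

# Conditional distribution of B given A: each row sums to 100%.
conditional = 100 * observed / row_totals

print(f"chi2 = {chi2:.1f} on {dof} degrees of freedom")
print(np.round(conditional, 1))
```

A large statistic relative to the chi-squared distribution with `dof` degrees of freedom indicates dependence between the two indicators, and the rows of `conditional` have the same form as the rows of Tables 4-9 to 4-11. If SciPy is available, `scipy.stats.chi2_contingency` returns the p-value as well.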
The Aboriginal Presence cluster has 48.9% of its dissemination areas associated with Catholic Presence clusters and almost 20% associated with Protestant Presence clusters.

Table 4-10: Percentage of clusters for religion groups given the ethnic origin group (Religion | Origins)
Religion clusters (columns): (1) Heterogeneous with Protestant and No Religion Presence; (2) Catholic Presence; (3) Protestant Presence; (4) Sikh Presence; (5) No Religion Presence; (6) Heterogeneous

Ethnic origin cluster       (1)     (2)     (3)     (4)     (5)     (6)
Heterogeneous              13.2%    1.7%    3.0%    2.2%   27.7%   52.1%
Aboriginal Presence         7.6%   48.9%   19.8%    0.0%   15.2%    8.4%
Chinese Presence            6.6%    0.7%    0.5%    1.2%   64.9%   26.1%
British Isles Presence     42.9%    0.2%   27.7%    0.1%   14.4%   14.8%
East Indian Presence        0.0%    0.0%    0.0%   94.2%    1.0%    4.9%
Americas and Europe        34.0%    0.6%   23.0%    2.1%   14.3%   26.0%

When considering the religious cluster given the mother tongue cluster, patterns emerge that emulate those in the previous tables (Table 4-11). Of the dissemination areas in the Punjabi Presence mother tongue cluster, 95.8% were associated with the Sikh Presence cluster. Of those in the Chinese Presence mother tongue cluster, 63.6% were associated with the No Religion Presence cluster.

Table 4-11: Percentage of clusters for religion groups given the mother tongue group (Religion | MT)
Religion clusters (columns): (1) Heterogeneous with Protestant and No Religion Presence; (2) Catholic Presence; (3) Protestant Presence; (4) Sikh Presence; (5) No Religion Presence; (6) Heterogeneous

Mother tongue cluster                   (1)     (2)     (3)     (4)     (5)     (6)
Punjabi Presence                        0.5%    0.0%    0.0%   95.8%    1.4%    2.3%
Dominant English                       42.7%    2.7%   27.3%    0.0%   14.7%   12.6%
Heterogeneous with Dominant English    13.3%    2.0%    4.2%    1.2%   26.6%   52.7%
Chinese Presence                        7.1%    1.1%    0.4%    2.0%   63.6%   25.8%
Strong English Presence                23.1%    2.1%   19.1%    3.1%   13.7%   38.9%

Discussion

This approach to developing an indicator for culture represents a departure from the current approach employed in Canadian research.
By limiting the pre-processing of the data to the conceptual grouping of particular cultural features, much of the original data structure is retained. Principal component analysis (PCA) summarises these data succinctly, for the express purpose of grouping the dissemination areas according to the multivariate structure that PCA reveals. As a result, the classification of dissemination areas is based upon all the available data.

By utilizing the multivariate distribution of the census data, this approach was able to identify cultural groups that are known to exist in British Columbia, such as the Indo-Canadian and Chinese cultural groups. As well, this approach was able to determine clusters that reflected the known geographic distributions of these cultural groups. Through the discovery of these groups, information was generated that can be mined for further insights and research.

The labelling of the groups with a meaningful title is another departure from the current methodology. Currently, the label precedes the formation of the groups. This means that the concept is imposed upon the groups a priori, before they are constructed. The new approach resists any labelling until the end of the process, so labels are assigned based upon the composition of the groups. The point at which the label is assigned reflects the investigator's understanding of culture. The current methodology suggests that culture fits into a predetermined structure or identity, whereas this new approach suggests that culture is defined by a complex web of relationships. By grouping dissemination areas according to these webs of interconnectivity, a shift from modern to post-modern understandings of culture can emerge.

The Department of Canadian Heritage states that "Multiculturalism ensures that all citizens can keep their identities, can take pride in their ancestry and have a sense of belonging" [36].
Furthermore, the Department of Canadian Heritage states that people are "free to choose for themselves, without penalty, whether they want to identify with their specific group or not". This cultural pluralism suggests that the development of culture indicators must be premised upon the underlying multivariate distributions of the desired cultural constructs. In doing so, Canadian research will be poised to capture the complex interplay of cultures that constitutes the mosaic of Canadian society.

This new tool will be useful in understanding various regional structures. This exploration suggests that certain regions should not be considered homogeneous with respect to their cultural constructs. It was shown that there exist distinct differences within the Upper Fraser Valley, with Abbotsford exhibiting different cultural characteristics than Chilliwack or Langley. These understandings can be used to specify better the services that may be needed in a region.

Limitations

Although this indicator does provide a weight of evidence for the validity of the constructs being identified, which is essential for evidentiary support for the development of policy, it does suffer from several limitations. As the indicators are developed from the underlying multivariate distributions of a census-based cultural construct, they are limited to the measures determined by Statistics Canada. The interpretation of these indicators should be done in dialogue with sociologists and ethno-geographers in order to prevent disparaging ethnic profiling within the research community; interdisciplinary collaborative teams would therefore be beneficial for the continued development of multivariate culture indicators. As well, the census cultural constructs should be combined in evocative ways to group dissemination areas upon more complex interplays of culture.
This would entail combining several cultural features, such as ethnic origins, mother tongue and religion, into one data set and then grouping the dissemination areas based upon a more complete and complex representation of culture. Finally, a comprehensive assessment would need to be developed.

Conclusions

In order to identify Canadian populations that are systematically marginalized from health services due to non-economic sociodemographic constructs, further work in developing multivariate-based culture indicators must be done. This novel work represents a first step towards a new understanding of how to quantify culture in population-based research which uses census-based measures to describe culture constructs.

Appendix

Table 4-12: Statistical descriptions of the population proportions for the five mother tongue groups, expressed by the minimum value, the median, the maximum value and the mean
Mother tongue clusters (columns): (1) Punjabi Presence; (2) Dominant English; (3) Heterogeneous with Dominant English; (4) Chinese Presence; (5) Strong English Presence

Conceptual Group       Statistic   (1)      (2)      (3)      (4)      (5)
Chinese                Min         0.000    0.000    0.000    0.262    0.000
                       Median      0.000    0.000    0.241    0.500    0.0321
                       Max         0.343    0.147    0.443    0.8692   0.282
                       Mean        0.036    0.007    0.233    0.513    0.051
Aboriginal             Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.000    0.000    0.000    0.000    0.000
                       Max         0.017    0.200    0.071    0.017    0.500
                       Mean        0.000    0.001    0.001    0.000    0.001
Southern European      Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.000    0.000    0.000    0.000    0.000
                       Max         0.054    0.159    0.222    0.238    0.194
                       Mean        0.004    0.005    0.018    0.020    0.011
Eastern European       Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.000    0.000    0.024    0.000    0.022
                       Max         0.150    0.171    0.355    0.184    0.397
                       Mean        0.013    0.012    0.420    0.017    0.035
North-West European    Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.000    0.000    0.000    0.000    0.000
                       Max         0.103    0.200    0.319    0.056    0.261
                       Mean        0.005    0.013    0.008    0.003    0.015
Baltic                 Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.000    0.000    0.000    0.000    0.000
                       Max         0.035    0.061    0.089    0.080    0.073
                       Mean        0.000    0.000    0.001    0.001    0.001
Middle East            Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.000    0.000    0.000    0.000    0.000
                       Max         0.101    0.089    0.454    0.157    0.301
                       Mean        0.006    0.002    0.022    0.008    0.015
Indian Sub-continent   Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.046    0.000    0.000    0.000    0.000
                       Max         0.328    0.188    0.257    0.257    0.254
                       Mean        0.063    0.001    0.019    0.018    0.011
East Asia              Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.026    0.000    0.062    0.000    0.022
                       Max         0.311    0.182    0.500    0.379    0.310
                       Mean        0.045    0.007    0.079    0.072    0.038
Spanish-Portuguese     Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.000    0.000    0.014    0.000    0.000
                       Max         0.173    0.182    0.270    0.172    0.283
                       Mean        0.016    0.004    0.023    0.018    0.017
English                Min         0.000    0.784    0.128    0.040    0.478
                       Median      0.366    0.896    0.491    0.284    0.712
                       Max         0.615    1.000    0.687    0.494    0.811
                       Mean        0.356    0.896    0.484    0.280    0.703
French                 Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.000    0.009    0.000    0.000    0.014
                       Max         0.177    0.212    0.154    0.109    0.214
                       Mean        0.008    0.014    0.012    0.005    0.019
German                 Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.000    0.018    0.000    0.000    0.020
                       Max         0.200    0.200    0.826    0.067    0.452
                       Mean        0.013    0.023    0.018    0.007    0.029
Punjabi                Min         0.161    0.000    0.000    0.000    0.000
                       Median      0.3836   0.000    0.000    0.000    0.000
                       Max         0.9097   0.174    0.235    0.316    0.297
                       Mean        0.423    0.004    0.022    0.026    0.034
Other                  Min         0.000    0.000    0.000    0.000    0.000
                       Median      0.000    0.000    0.000    0.000    0.000
                       Max         0.159    0.200    0.778    0.104    0.500
                       Mean        0.011    0.007    0.018    0.010    0.017
Cluster Size                       214      3660     753      450      1495

Table 4-13: Statistical descriptions of the population proportions for the six religion groups, expressed by the minimum value, the median, the maximum value and the mean
Religion clusters (columns): (1) Heterogeneous with Protestant and No Religion Presence; (2) Catholic Presence; (3) Protestant Presence; (4) Sikh Presence; (5) No Religion Presence; (6) Heterogeneous

Conceptual Group   Statistic   (1)      (2)      (3)      (4)      (5)      (6)
Catholic           Min         0.000    0.4167   0.000    0.000    0.000    0.000
                   Median      0.139    0.6340   0.146    0.108    0.158    0.259
                   Max         0.258    1.000    0.441    0.381    0.433    0.4868
                   Mean        0.138    0.657    0.146    0.119    0.159    0.263
Orthodox           Min         0.000    0.000    0.000    0.000    0.000    0.000
                   Median      0.000    0.000    0.000    0.000    0.000    0.000
                   Max         0.179    0.118    0.190    0.125    0.244    0.235
                   Mean        0.006    0.004    0.006    0.006    0.009    0.014
Protestant         Min         0.253    0.000    0.367    0.000    0.000    0.000
                   Median      0.396    0.076    0.533    0.167    0.207    0.287
                   Max         0.575    0.429    1.000    0.438    0.366    0.450
                   Mean        0.396    0.088    0.552    0.176    0.196    0.280
None               Min         0.252    0.000    0.000    0.000    0.337    0.000
                   Median      0.406    0.197    0.247    0.1706   0.5234   0.321
                   Max         0.615    0.500    0.421    0.387    1.000    0.471
                   Mean        0.408    0.200    0.247    0.168    0.537    0.314
Muslim             Min         0.000    0.000    0.000    0.000    0.000    0.000
                   Median      0.000    0.000    0.000    0.018    0.000    0.000
                   Max         0.264    0.209    0.308    0.409    0.217    0.600
                   Mean        0.008    0.004    0.007    0.340    0.011    0.026
Sikh               Min         0.000    0.000    0.000    0.179    0.000    0.000
                   Median      0.000    0.000    0.000    0.365    0.000    0.000
                   Max         0.253    0.175    0.301    0.929    0.342    0.254
                   Mean        0.009    0.004    0.012    0.405    0.012    0.029
Jewish             Min         0.000    0.000    0.000    0.000    0.000    0.000
                   Median      0.000    0.000    0.000    0.000    0.000    0.000
                   Max         0.185    0.057    0.170    0.136    0.200    0.321
                   Mean        0.005    0.001    0.004    0.002    0.008    0.007
Other              Min         0.000    0.000    0.000    0.000    0.000    0.000
                   Median      0.021    0.000    0.015    0.074    0.0482   0.047
                   Max         0.303    0.375    0.320    0.398    0.435    0.667
                   Mean        0.030    0.042    0.025    0.085    0.068    0.066
Cluster Size                   2040     152      1317     270      1233     1560

References

1. Burge, F., Lawson, B., Johnston, G. 2003. Trends in the place of death of cancer patients, 1992-1997. Canadian Medical Association Journal 168(3): 265-270.
2. Burge, F., Lawson, B., Johnston, G., Cummings, I. 2003. Primary care continuity and location of death for those with cancer. Journal of Palliative Medicine 6(6): 911-918.
3. Cardiff, K., Hsu, D., Kuhl, D. 1998. Utilization of Palliative Care Services in Vancouver: 1990-1993. Centre for Health Services and Policy Research 98:13.
4. Johnston, G.M., Boyd, C.J., Joseph, P., MacIntyre, M. 2001. Variation in delivery of palliative radiotherapy to persons dying of cancer in Nova Scotia, 1994 to 1998. Journal of Clinical Oncology 19(14): 3323-3332.
5. Johnston, G.M., Gibbons, L., Burge, F.I., Dewar, R.A., Cummings, I., Levy, I.G. 1998. Identifying potential need for cancer palliation in Nova Scotia. Canadian Medical Association Journal 158(13): 1691-1698.
6. Menec, V., Lix, L., Steinbach, C., Ekuma, O., Sirshi, M., Dahl, M., Soodeen, R. Patterns of Health Care use and Cost at the End of Life. Manitoba Centre for Health Policy, February 2004.
7. Gorey, K.M., Holowaty, E.J., Fehringer, G., Laukkanen, E., Richter, N.L., Meyer, C.M. 2000. An International Comparison of Cancer Survival: Metropolitan Toronto, Ontario, and Honolulu, Hawaii. American Journal of Public Health 90(12): 1866-1872.
8. Roos, L.L., Magoon, J., Gupta, S., Chateau, D., Veugelers, P.J. 2004. Socioeconomic Determinants of Mortality in Two Canadian Provinces: Multilevel Modelling and Neighbourhood Context. Social Science and Medicine 59: 1435-1447.
9. Veugelers, P.J., Yip, A.M., Kephart, G. 2001. Proximate and Contextual Socioeconomic Determinants of Mortality: Multilevel Approaches in a setting with Universal Health Care Coverage. American Journal of Epidemiology 154(8): 725-732.
10. Robert, S.A., Strombom, I., Trentham-Dietz, A., Hampton, J.M., McElroy, J.A., Newcomb, P.A., Remington, P.L. 2004. Socioeconomic Risk Factors for Breast Cancer: Distinguishing Individual- and Community-Level Effects. Epidemiology 15(4): 442-450.
11. Health Canada. June 6, 2005. Available: http://www.hc-sc.gc.ca/english/care/index.html
12. Bruera, E., Sweeney, C., Russell, N., Willey, J.S., Palmer, J.L. 2003. Place of death of Houston Area Residents with Cancer over a Two-Year Period. Journal of Pain and Symptom Management 26(1): 637-643.
13. Feudtner, C., Silveira, M.J., Christakis, D.A. 2002. Where do Children with Complex Chronic Conditions Die? Patterns in Washington State, 1980-1998. Pediatrics 109(4): 656-660.
14. Grande, G.E., Addington-Hall, J.M., Todd, C.J. 1998.
Place of Death and Access to Home Care Services: Are Certain Patient Groups at a Disadvantage? Social Science and Medicine 47(5): 565-579.
15. Klassen, A.C., Curriero, F.C., Hong, J.H., Williams, C., Kulldorff, M., Meissner, H.I., Alberg, A.A., Ensminger, M. 2004. The Role of Area-Level Influences on Prostate Cancer Grade and Stage at Diagnosis. Preventive Medicine 39: 441-448.
16. Krieger, N., Chen, J.T., Waterman, P.D., Rehkopf, D.H., Subramanian, S.V. 2003. Race/Ethnicity, Gender and Monitoring Socioeconomic Gradients in Health: A Comparison of Area-Based Socioeconomic Measures - The Public Health Disparities Geocoding Project. Public Health Matters 93(10): 1655-1671.
17. Krieger, N., Williams, D.R., Moss, N.E. 1997. Measuring Social Class in US Public Health Research: Concepts, Methodologies, and Guidelines. Annual Review of Public Health 18: 341-378.
18. Merkin, S.S., Stevenson, L., Powe, N. 2002. Geographic Socioeconomic Status, Race and Advanced-Stage Breast Cancer in New York City. American Journal of Public Health 92(1): 64-70.
19. Shi, L., Macinko, J., Starfield, B., Politzer, R., Wulu, J., Xu, J. 2005. Primary Care, Social Inequalities, and All-Cause, Heart Disease, and Cancer Mortality in US Counties, 1990. American Journal of Public Health 95(4): 674-680.
20. Tang, S.T., McCorkle, R. 2001. Determinants of Place of Death for Terminal Cancer Patients. Cancer Investigations 19(2): 165-180.
21. Tang, S.T. 2003. Determinants of Hospice Home Care Use among Terminally Ill Cancer Patients. Nursing Research 52(4): 217-225.
22. Ahlner-Elmqvist, M., Jordhoy, M.S., Jannert, M., Fayers, P., Kaasa, S. 2004. Place of Death: Hospital-Based Advanced Home Care versus Conventional Care. Palliative Medicine 18: 585-593.
23. Costantini, M., Balzi, D., Garronec, E., Parodi, S., Vercelli, M., Bruzzi, P. 2000. Geographic variations of place of death among Italian communities suggests an inappropriate hospital use in terminal phase of cancer disease. Public Health 114: 15-20.
24. Fukui, S., Fukui, N., Kawagoe, H. 2004. Predictors of Place of Death for Japanese Patients with Advanced-Stage Malignant Disease in Home Care Settings: A Nationwide Survey. Cancer 101(2): 421-429.
25. Grundy, E., Mayer, D., Young, H., Sloggett, A. 2004. Living Arrangements and Place of Death of Older People with Cancer in England and Wales: A Record Linkage Study. British Journal of Cancer 91: 907-912.
26. Gyllenhammer, E., Thoren-Todoulos, E., Strang, P., Strom, G., Erikson, E., Kinch, M. 2003. Predictive factors for Home Deaths among Cancer Patients in Swedish Palliative Home Care. Support Cancer Care 11: 560-567.
27. Higginson, I.J., Thompson, M. 2003. Children and Young People Who Die from Cancer: Epidemiology and Place of Death in England (1995-9). British Medical Journal 327: 478-479.
28. Van den Eynden, B., Hermann, I., Schrijvers, D., Van Royen, P., Maes, R., Vermeulen, L., Herweyers, K., Smits, W., Verhoeven, A., Clara, R., Denekens, J. 2000. Factors Determining the Place of Palliative Care and Death of Cancer Patients. Support Cancer Care 8: 59-64.
29. Government of Sweden: Health care, Health, Social Issues/Insurance. June 6, 2005. Available: http://www.sweden.gov.se/sb/dV2950
30. Donatini, A., Rico, A., D'Ambrosio, M.G., Scalzo, L., Orzella, L., Cicchetti, A., Profili, S. 2001. Health Care Systems in Transition: Italy. European Observatory on Health Care Systems 3(4).
31. World Health Organization Regional Office for Europe. June 6, 2005. Available: http://www.euro.who.int/eprise/main/who/progs/chhbel/system/20050307_l
32. Johnston, G.M., Boyd, C.J., MacIsaac, M.A., Rhodes, J.W., Grimshaw, R.N. 2003. Effectiveness of Letters to Cape Breton Women who have not had a Recent Pap Smear. Chronic Disease in Canada 24(2/3).
33. Puderer, H. Introducing the Dissemination Area for the 2001 Census - An Update. Statistics Canada: Geography Working Paper Series. No. 2000-4.
34. Venables, W.N., Ripley, B.D. 2002.
Modern Applied Statistics with S. Fourth Edition. New York: Springer-Verlag.
35. Johnson, R.A., Wichern, D.W. 2002. Applied Multivariate Statistical Analysis. Fifth Edition. New Jersey: Prentice Hall.
36. Department of Canadian Heritage (Multiculturalism). Canadian Multiculturalism: An Inclusive Citizenship. Available: http://www.canadianheritage.gc.ca/progs/multi/inclusive_e.cfm
37. Gauch, H.G. 1982. Noise Reduction by Eigenvector Ordinations. Ecology 63(6).

Introduction

Current research into the location of death of cancer patients in Canada suggests that demographic and clinical attributes may be useful in predicting where a cancer patient will die. This notion of the utility of socioeconomic, demographic and clinical assessment for predicting place of death for cancer patients is supported by non-Canadian research [3-11]. In order to come to this conclusion, univariate, multivariate, non-parametric, and regression techniques have been employed.

In determining the place of death of cancer patients, it is desirable to obtain a set of socioeconomic, demographic and clinical predictors that are useful in determining the location of end-of-life care for cancer patients. A corollary to this is to determine which modelling technique allows for the best predictive accuracy. Interestingly, of all the classification techniques, the research reveals that logistic regression is the method of choice for this area of investigation [1-12]. As well, there is a noticeable absence of location-of-death studies which compare the utility of different classifiers.

Unfortunately, preference does not determine the usefulness and appropriateness of a technique. As logistic regression is a familiar modelling technique and has been implemented in a variety of statistical packages, it is not surprising that this classifier is frequently chosen. Even though it is a popular technique, it is only one of several techniques that can be used for classification.
Classifiers are often encountered within the field of data mining. The field of data mining can be broken into two sections: unsupervised and supervised learning. If there is a known outcome, such as the location of death, then supervised learning techniques (classifiers) are used. If there is no expressed outcome in the data set, then unsupervised learning techniques are used. Logistic regression is one type of supervised learning technique, along with linear discriminant analysis, neural networks, recursive partitioning, and nearest neighbours, among others. Recognizing that logistic regression is one possible way to determine the predictors for the location of death, the exploration of data mining techniques is an appropriate endeavour for this area of investigation.

Cancer research is rapidly exploring the usefulness of these models [13-36]. With applications in radiology, survival, diagnosis and prediction, these models are being used in two general ways. A portion of the research simply uses the classifiers as a statistical tool [13-24], while the other portion is concerned with the usefulness and appropriateness of these classifiers within cancer research [25-36]. Within this activity of application and assessment, place of death research is silent.

In July 2004, the Canadian Institutes of Health Research (CIHR) approved the grant application entitled "Palliative Care in cross-cultural context: A NET for equitable and quality cancer care for ethnically diverse populations". This five-year grant allows researchers in British Columbia, Nova Scotia, and Saskatchewan to focus upon three themes: access to health services, caregivers, and complementary and alternative therapies. The Access theme seeks to understand the cultural constructs that influence the use of health services by cancer patients at the end-of-life.
Not only does this involve the construction and interpretation of predictive models, but it is also concerned with the construction of culture indicators to be used in the models and with the quality of the data used for the modelling process.

In order to gain a voice within the emerging research, two approaches can be taken. The first is to merely infer the utility of data mining based upon the current body of research. The second is to apply the insights gained from the current body of research and engage in area-specific research comparing the usefulness of the techniques. Spurious inference of the usefulness of data mining for predicting the location of death for cancer patients is thus avoided. Following the second approach, the objectives of this research are to compare the performance of five classifiers, to comment upon the clinical utility of the classifiers, and to understand which factors were predictive of dying out of hospital.

Methodology

Administrative Databases

For this retrospective population-based study, a secondary data set was created from the linkage of two administrative databases: the British Columbia Vital Statistics Agency (VS) and the British Columbia Cancer Registry (BCCR). The Statistics Canada Postal Code Conversion File program version 4D (PCCF+4D) was used to associate subjects, at the dissemination area level, with economic and demographic information derived from the 2001 census. Linkage of the BCCR and VS databases was performed by the VS. The census information was merged to the VS and BCCR database using SAS. The merging of the census information and the study database was based upon the dissemination area allocated to each person by the PCCF program.

The BC Cancer Registry, in existence since 1969 and maintained by the BC Cancer Agency (BCCA) since 1980, collects and generates cancer statistics on the BC population [67].
This database is used for cancer surveillance and research: it provides information on the magnitude of the cancer problem within BC, assists in the development and monitoring of programs for reduced mortality and morbidity, and provides information for future planning [61]. By providing a registry, the BCCA has access to a source of information which avoids bias due to non-representative participation. This translates into better quality data than that used in non-population-based sources [67]. The BC Cancer Registry has attained an overall silver certification status from the North American Association of Central Cancer Registries (NAACCR) [68]. The registry has reached gold standard achievement for specific items such as timeliness and death certificate only cases [68].

Although there is much information contained in these data sources, the data have been collected for purposes other than the understanding of the location of death for cancer patients, thus making this investigation an observational study. From a classical statistical perspective, observational studies are the most problematic type of study as there is, in general, no reference to any experimental design. As a result, it has been suggested that observational, non-randomized studies carry lower evidential weight than randomized ones [38]. It is because of this that a well-planned and credible method for analyzing data sets of this nature is needed.

Data mining has emerged in response to the desire to make sense of the vast amounts of data housed in administrative and corporate databases [40]. These databases represent an efficient means, in terms of cost and time, to sketch a rough picture for population-based studies. These sketches are useful for hypothesis generation. Often, the information desired from the databases does not naturally extend from the underlying motivation for collecting the information.
For the current research, the BCCR database was not developed for research into the location of death [67]. It is large in terms of the number of patients represented and in the number of potential variables. The use of this database and the VS database allows for the unique opportunity to use data mining techniques to generate understanding of population-based end-of-life health outcomes. These attributes of the BCCR database indicate that the current research is an observational one and would be subject to the aforementioned critiques. Fortunately, data mining addresses such situations and allows researchers credibly to investigate such databases.

Subjects

The subjects were all adults, aged 20 years and older, who died of cancer in British Columbia from 1999 to 2003, as identified from the death certificates. Death Certificate Only (DCO) patients were excluded from the formation of the data set. These were patients who had a cancer diagnosis at the time of death, but for whom no prior information relating to a diagnosis of cancer was found after following up the diagnosis at the time of death (Appendix A). The postal code for the usual place of residence was obtained from British Columbia Vital Statistics. If this information was unavailable, then the postal code for the place of residence was obtained from the BCCR. Vital Statistics provided the location of death.

Variables

Data Sources

The response, location of death, was obtained from the VS database. The predictors were obtained from the BCCA database. The cause of death was taken from the BCCA. It should be noted that the BCCA database obtained this information from BC Vital Statistics.

Response

The response variable for the place of death was obtained from VS. The location of death was derived from a field which indicated hospital death. If a person died in hospital, this was identified with the hospital code. If a person died out of the hospital, there was no code.
Place of death was dichotomized as in hospital or out of hospital.

Predictors

As this research is situated within a collaborative investigation between Nova Scotia (NS) and British Columbia (BC), the predictors chosen for this stage represented a modest expansion upon the published set of predictors [1]. To the set of predictors,
• Year of Death,
• Sex,
• Age (years),
• Income,
• Aggregated Tumour Group,
• Region, and
• Survival Time (days)
the following were added:
• City Size,
• Ethnic Origin Context,
• Mother Tongue Context, and
• Religious Context.

The year of death and sex were given by the BCCR. The age was derived from the BCCR database as the difference between the date of death and the date of birth and was not categorized as in the NS research [1]. The income per person equivalent was obtained from the Postal Code Conversion File (PCCF) program output and expressed as quintiles [51].

The aggregate tumour group was derived from the ICD-O. Cancer causes of death were initially grouped according to the Canadian Cancer Society groupings (Appendix B) [56]. These groups were further aggregated as many of the groups had small numbers of patients. Canadian Cancer Society groups which had fewer than 1500 patients from the study group were grouped into the "Other" category. The final aggregate tumour groups were:
• breast,
• colorectal,
• haematological,
• lung,
• pancreas,
• prostate,
• unknown primary, and
• other.

The region was the BC Health Authority of usual residence as derived from the VS report. The survival time was derived from the difference between the date of death and the diagnosis date and was not categorized as in the NS research [1]. The city size was obtained from the PCCF program.

The cultural indicators were derived using unsupervised learning techniques 3 Z . The Ethnic Origin context has the categories:
• Heterogeneous,
• Aboriginal Presence,
• Chinese Presence,
• British Isles Presence,
• East Indian Presence, and
• Americas and Europe.
The Mother Tongue context indicator is broken down into:
• Punjabi Presence,
• Dominant English,
• Heterogeneous with Dominant English,
• Chinese Presence, and
• Strong English Presence.

The Religious context indicator is broken into:
• Heterogeneous with Protestant and No Religion Presence,
• Catholic Presence,
• Protestant Presence,
• Sikh Presence,
• No Religion Presence, and
• Heterogeneous.

The postal code definition of rural/urban as defined by the Nova Scotia research was not used [1]. The zero in the second position of the postal code indicates delivery from a rural post office. This indicates the point of service, but not the rural or urban feature of the area. Some rural areas are urbanised and suburbanised, yet they have a zero in the second position, thus it may not be the best way to define the indicator [51, 10]. As well, rural route services and sub-urban route services are provided from urban post offices [51]. For example, the entire province of New Brunswick is considered urban as no postal code has a zero in the second position [70]. A measure based upon population size, or upon population density and distance to a regional centre, may serve as a starting point for such a definition. The re-definition of rural and urban needs further work.

The process of redefining rural and urban was not within the scope of this investigation. Instead, the city size was used. As the PCCF program uses the postal codes of usual residence, a city size could be associated with each patient. The five groups were:
• greater than 1,250,000 people,
• 500,000 - 1,249,999 people,
• 100,000 - 499,999 people,
• 10,000 - 99,999 people, and
• less than 10,000 people.

The PCCF support material suggested a definition of rural as those locations where the population was less than 10,000 people [51]. The city size was chosen over the coarser population definition of rural/urban.
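The five city-size groups listed above can be sketched as a simple lookup. This is an illustration only: the study took city size directly from the PCCF program output, and the function names, the use of a raw population count as input, and the placement of a population of exactly 1,250,000 in the largest group are all assumptions made here.

```python
from bisect import bisect_right

# Lower bounds of the second to fifth groups, in ascending order.
# Assumption: a population of exactly 1,250,000 falls in the largest group;
# the source text does not say where this boundary value belongs.
CITY_SIZE_BOUNDS = [10_000, 100_000, 500_000, 1_250_000]
CITY_SIZE_LABELS = [
    "less than 10,000 people",
    "10,000 - 99,999 people",
    "100,000 - 499,999 people",
    "500,000 - 1,249,999 people",
    "greater than 1,250,000 people",
]


def city_size_group(population: int) -> str:
    """Return the city-size group label for a population count (hypothetical helper)."""
    if population < 0:
        raise ValueError("population must be non-negative")
    # bisect_right counts how many lower bounds are <= population,
    # which is exactly the index of the matching label.
    return CITY_SIZE_LABELS[bisect_right(CITY_SIZE_BOUNDS, population)]


def is_rural(population: int) -> bool:
    """Coarser rural/urban split suggested by the PCCF support material."""
    return population < 10_000


if __name__ == "__main__":
    for pop in (4_200, 85_000, 320_000, 900_000, 2_000_000):
        print(pop, "->", city_size_group(pop), "| rural:", is_rural(pop))
```

The five-level grouping retains the rural threshold of 10,000 people as its lowest boundary, so the coarser rural/urban indicator can always be recovered from it, whereas the reverse is not possible.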
As this research is in collaboration with a research team in Nova Scotia, the categories used for age and survival were based upon work previously done by the Nova Scotia members of the research team [1].

Data Quality

Assessing the quality of the data is an important step when using administrative databases. Due to database constraints, the data quality assessment has been limited to variable completeness and the concordance of the diagnosis and the cancer cause of death [39]. Under the above definition, the study data set has a high degree of variable completeness. Only three variables are not complete and each of them is well over 99% complete (Figure 5-1). The agreement between cancer diagnosis and cancer cause of death suggests a high degree of data quality (Appendix D).

[Figure 5-1: Graph for study set variable completeness. Chart title: "Variable Completeness For Data Mining Data Set". Variables shown: Age, Aggregate Tumour Group, Ethnic Origins Context, Mother Tongue Context, Religious Context, Residence, Sex, Survival, Year of Death, Community Size, Income, Region. Completeness axis: 95.0% to 100.0%.]

Data Mining Methodologies

By investigating the health services that are used at the end of a patient's life, we are using information from a variety of predictor variables to allocate patients into one health service category. For this research, the location of death is being used as the health outcome. A patient either used the hospital services at death or did not. The outcome categories have been determined by the research question and we are able to determine them in the data set; thus it is natural to use supervised learning techniques to investigate the relationship, by building models, between the explanatory variables and the binary outcome for the location of death. As previous investigations into the location of death in Canada have used logistic regression, researchers have already used a supervised learning, data mining technique.
We will introduce linear discriminant analysis (LDA), neural networks (NN), classification trees (CT), and k-nearest neighbours (KNN) to end-of-life research and compare the performance of these classifiers to that of logistic regression. As the outcome variable is binary (death in a hospital versus death out of a hospital), the classification methodology will be presented within the two-class framework. For purposes of notation, "H" will be used to designate deaths in a hospital and "O" will be used for deaths out of a hospital. The outcome to be modelled will be out of hospital deaths unless otherwise expressed.

Linear Discriminant Analysis

Linear discriminant analysis is one of the oldest classification techniques. Introduced by Fisher in 1936 [71], it is a classification technique that is used to distinguish between two or more groups based upon a set of characteristics (predictor variables). With two classes, linear discriminant analysis (LDA) seeks to find a linear boundary between the two groups. The linear boundary is found by obtaining the function which best separates the two groups. It does this by using a discriminant function,

$$DF = \beta_1 X_1 + \cdots + \beta_p X_p = \beta^T X \qquad (0.3)$$

where there are $p$ predictor variables, $\beta_i$ is the $i$-th coefficient, and $X_i$ is the $i$-th predictor variable. The coefficients or weights are estimated to give the best separation of the two groups.

In order to determine the coefficients that yield the best separation, two assumptions will be made. The first is that each group has a different mean. This suggests that the means of $X_1, \ldots, X_p$ are different for patients who died in a hospital and for those who died out of a hospital. The second assumption is that the variance-covariance structure is the same for both groups [41,71]. With this setup, the assumptions about the data, conditioned upon each class, are:

$$X_1, \ldots, X_p \mid \text{class } H \sim (\mu^{(H)}, \Sigma), \quad \text{and} \quad X_1, \ldots, X_p \mid \text{class } O \sim (\mu^{(O)}, \Sigma),$$

where $\mu^{(H)}$ and $\mu^{(O)}$ are the $p$-dimensional mean vectors for hospital deaths and deaths out of a hospital respectively, and $\Sigma$ is the $p \times p$ common covariance matrix. Interestingly, no distributional assumption has been made. LDA can be developed with a multivariate normal assumption, but this approach will not be presented in this paper. Hastie et al. [40] do discuss the distributional approach.

The statistical properties of the DF distributions, conditioned upon class $k$, where $k \in \{H, O\}$, can be determined. The expectation is

$$E(DF \mid \text{Class } k) = E(\beta^T X \mid \text{Class } k) = \beta^T E(X \mid \text{Class } k) = \beta^T \mu^{(k)}.$$

The variance of the discriminant function is

$$Var(DF) = Var(\beta^T X) = \beta^T Var(X) \beta = \beta^T \Sigma \beta.$$

As the goal is to find the best separation of the two groups, we want a DF that makes a good distinction between the two groups. This difference can be measured by the separation between the two means, $\beta^T \mu^{(H)} - \beta^T \mu^{(O)}$. An estimate of this distance is

$$\beta^T \bar{X}^{(H)} - \beta^T \bar{X}^{(O)} = \beta^T (\bar{X}^{(H)} - \bar{X}^{(O)}),$$

where $\bar{X}^{(k)}$ is the random vector of which $\bar{x}^{(k)}$ is the realisation [71]. If the means of the two groups are far apart, then the classification technique can adequately distinguish the two groups. This estimator has a variance of

$$Var(\beta^T (\bar{X}^{(H)} - \bar{X}^{(O)})) = \beta^T Var(\bar{X}^{(H)} - \bar{X}^{(O)}) \beta.$$

With independent samples of sizes $n_H$ and $n_O$, $Var(\bar{X}^{(H)} - \bar{X}^{(O)}) = (1/n_H + 1/n_O)\Sigma$. As $\Sigma$ is the common variance-covariance matrix of the two classes, it can be estimated by pooling the sample variance-covariance matrices [41,71], $S^{(H)}$ and $S^{(O)}$:

$$S = \frac{(n_H - 1) S^{(H)} + (n_O - 1) S^{(O)}}{n_H + n_O - 2}.$$

With an estimate of the difference between the means of the two groups and an estimate of the variation in the estimate of the difference between the two means, it is possible to construct a statistic that resembles the two-sample t test statistic with a pooled sample variance [71],

$$t' = \frac{\beta^T (\bar{x}^{(H)} - \bar{x}^{(O)})}{\sqrt{\left(\frac{1}{n_H} + \frac{1}{n_O}\right) \beta^T S \beta}}. \qquad (0.4)$$

The notation $t'$ has been used to denote that this is not a t statistic. No distributional assumption has been made, so it is only similar to the t statistic. The utility of $t'$ is that a very large value indicates good separation.
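As an illustration of the separation statistic, the following minimal sketch computes the pooled variance-covariance matrix and $t'$ for two candidate coefficient vectors. The data values and all function names are hypothetical, chosen only so that the two classes differ mainly in the first predictor; this is not the thesis analysis.

```python
# Toy sketch of the separation statistic t' for a two-class discriminant.
# All data values are made up for illustration.

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def sample_covariance(rows):
    """Sample variance-covariance matrix with divisor n - 1."""
    n, p = len(rows), len(rows[0])
    m = mean_vector(rows)
    return [[sum((r[a] - m[a]) * (r[b] - m[b]) for r in rows) / (n - 1)
             for b in range(p)] for a in range(p)]

def pooled_covariance(rows_h, rows_o):
    """Pool the two class covariance matrices, weighting by n - 1."""
    n_h, n_o = len(rows_h), len(rows_o)
    s_h, s_o = sample_covariance(rows_h), sample_covariance(rows_o)
    p = len(s_h)
    return [[((n_h - 1) * s_h[a][b] + (n_o - 1) * s_o[a][b]) / (n_h + n_o - 2)
             for b in range(p)] for a in range(p)]

def t_prime(beta, rows_h, rows_o):
    """Separation statistic for the direction beta."""
    n_h, n_o = len(rows_h), len(rows_o)
    m_h, m_o = mean_vector(rows_h), mean_vector(rows_o)
    s = pooled_covariance(rows_h, rows_o)
    p = len(beta)
    numerator = sum(beta[j] * (m_h[j] - m_o[j]) for j in range(p))
    quad = sum(beta[a] * s[a][b] * beta[b] for a in range(p) for b in range(p))
    return numerator / ((1 / n_h + 1 / n_o) * quad) ** 0.5

hospital = [[4.0, 1.0], [5.0, 2.0], [6.0, 1.5], [5.5, 2.5]]
out_of_hospital = [[1.0, 1.2], [2.0, 2.2], [1.5, 1.4], [2.5, 2.4]]

t_along_x1 = t_prime([1.0, 0.0], hospital, out_of_hospital)
t_along_x2 = t_prime([0.0, 1.0], hospital, out_of_hospital)
print(abs(t_along_x1), abs(t_along_x2))
```

In this toy data the direction along the first predictor gives the larger absolute value, which is the sense in which one set of coefficients separates the groups better than another. Rescaling the coefficient vector leaves the statistic unchanged, so only the direction matters.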
A very large value of $t'$ is similar to having a large t statistic for the two-sample t test in the context of hypothesis testing. The coefficients of DF are chosen so that the absolute value of $t'$ is maximized [71]. The square of $t'$, $t'^2$, is used as it is easier to work with. The maximum of $t'^2$ is attained at

$$\beta \propto S^{-1} (\bar{x}^{(H)} - \bar{x}^{(O)}). \qquad (0.5)$$

Fisher's discriminant function is, then [41],

$$\hat{\beta}^T x = (\bar{x}^{(H)} - \bar{x}^{(O)})^T S^{-1} x. \qquad (0.6)$$

From this, a new observation $x$ is classified based upon its proximity to the class means. Writing $w = S^{-1}(\bar{x}^{(H)} - \bar{x}^{(O)})$ for the estimated coefficient vector, the new patient is classified as dying out of a hospital, O, if

$$|w^T x - w^T \bar{x}^{(O)}| < |w^T x - w^T \bar{x}^{(H)}|, \qquad (0.7)$$

otherwise the patient is classified as dying in a hospital.

Logistic Regression

As logistic regression is a familiar epidemiological modelling technique, only a brief overview will be given. Logistic regression is a special case of a generalized linear model. The initial formulation of generalized linear models was by Nelder and Wedderburn in 1972 [43].

In logistic regression, we model the probability of belonging to a certain class. As we are considering only the two-class situation, the formulation of logistic regression is greatly simplified. We need only model the probability of belonging to one of the classes. As deaths out of a hospital are of interest, we will model the conditional probability of dying out of a hospital given the explanatory variables $x$, Pr(Class=O | x). This probability has immediate appeal as it makes it easy to decide how to classify a new observation. If we wanted to know the odds of dying out of hospital when compared to dying in hospital, given the explanatory variables, then the odds in favour of dying out of a hospital would be

$$odds = \frac{\Pr(\text{Class}=O \mid x)}{\Pr(\text{Class}=H \mid x)} = \frac{\Pr(\text{Class}=O \mid x)}{1 - \Pr(\text{Class}=O \mid x)},$$

where $odds \in [0, \infty)$. Since the odds take any value between zero and infinity, the logarithm of the odds will take any value in the real numbers, $\log(odds) \in (-\infty, \infty)$.
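The ranges of the odds and of the logarithm of the odds can be checked numerically. The short sketch below uses hypothetical function names and arbitrary probabilities; the last function is simply the algebraic inverse of the log-odds transformation.

```python
import math

def odds(p_out):
    """Odds in favour of dying out of hospital, for 0 <= p_out < 1."""
    return p_out / (1.0 - p_out)

def log_odds(p_out):
    """Logarithm of the odds, defined for 0 < p_out < 1."""
    return math.log(odds(p_out))

def probability_from_log_odds(value):
    """Algebraic inverse of log_odds: maps any real number back into (0, 1)."""
    return math.exp(value) / (1.0 + math.exp(value))

for p in (0.1, 0.5, 0.9):
    print(p, odds(p), log_odds(p), probability_from_log_odds(log_odds(p)))
```

Probabilities below one half give odds below one and a negative logarithm, probabilities above one half give odds above one and a positive logarithm, and equal probabilities give odds of one and a logarithm of zero.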
As the $\log(odds)$ takes any real value, it is reasonable to model the $\log(odds)$ with a linear function of the predictor variables [43]. For more information about the technical reasons for using the $\log(odds)$ with the linear model see Dobson [43] or Venables et al. [42]. The $\log(odds)$ is commonly referred to as the logit function. If $\pi_O = \Pr(\text{Class}=O \mid x)$ and $\beta^T$ is the row vector of coefficients [43], then a linear model for the logit function is

$$\log\left(\frac{\pi_O}{1 - \pi_O}\right) = \beta^T x. \qquad (1.1)$$

The model is linear in terms of the unknown coefficients. Here we have a linear model for the logarithm of the odds of dying out of a hospital when compared to dying in a hospital. What is wanted is $\pi_O = \Pr(\text{Class}=O \mid x)$. Solving (1.1) for $\pi_O$ gives

$$\pi_O = \frac{e^{\beta^T x}}{1 + e^{\beta^T x}}. \qquad (1.2)$$

In this form, $\pi_O$ will take on values between zero and one, $\pi_O \in [0, 1]$. If the probability of dying in a hospital is desired, the complement is taken.

Neural Networks

The human brain and neural networks are conceptual analogues. Information goes into a neuron. Information leaves the neuron and subsequently enters another neuron. Each neuron has inputs from other neurons and outputs to other neurons [71]. The human brain is extremely complicated, so neural networks attempt to emulate this process through a very simple structure. This idea, having been around since the 1940s, was popularized when Rosenblatt introduced his idea of a perceptron, the mathematical version of the neuron [44-46]. Widespread interest in neural networks arose in the 1980s with improvements in the training algorithm [47].

For neural networks (NN), the fundamental idea is to create linear combinations of the explanatory variables as inputs to each of the nodes. The outputs of these nodes are, in turn, inputs to other nodes. This continues until the final node(s) are reached. At the final node(s), the model is a complex non-linear function of the original explanatory variables [40].
As with logistic regression, we will model the probability of dying out of a hospital given the explanatory variables $x$, Pr(Class=O | x). To use neural networks, two particulars must be addressed.

The first is the relationship between the nodes (perceptrons). By establishing this relationship, the architecture for the network is determined. The most frequently used model is a single-layer neural network, which has only one hidden layer of nodes between the inputs and the output node (Figure 5-2). With the binary response being used in this research only a single output node is needed (Figure 5-2). The input variables are inputs to the hidden layer nodes. Each hidden layer node forms a linear combination of the input variables and passes it through its transformation function. The outputs from the hidden layer become inputs for the output node. If a skip is allowed, the constant input term is allowed to be an input to the output node.

Figure 5-2: Example of a one layer architecture for a neural network with three input variables (input layer, hidden layer, output layer). The input variables are $x_1$, $x_2$, and $x_3$ with a constant input $x_0$. The transformation functions (perceptrons) for the hidden layer are $h_1$ and $h_2$. The output layer has a transformation function $f$. A skip allows $x_0$ to skip over the hidden layer and be one of the weighted inputs to the output layer.

The second item to determine is the function that is used in the nodes. For classification, the softmax function,

$$h(z) = \frac{e^z}{1 + e^z}, \qquad (2.1)$$

is used [40,42]. This function is similar to (1.2), where $z = \beta^T x$. Here $z$ is a linear combination of the inputs. Recall that the inputs can be either the original explanatory variables or the outputs from another node. For Figure 5-2, the hidden node $h_1$ would be

$$h_1(\beta_1^T x) = \frac{e^{\beta_1^T x}}{1 + e^{\beta_1^T x}},$$

where $\beta_1$ is the vector of coefficients for hidden node $h_1$. The hidden node $h_2$ would have an analogous formulation.
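To make the flow through the architecture of Figure 5-2 concrete, the following minimal sketch passes one patient through two hidden nodes and a single output node. It assumes that the output node applies the same function as the hidden nodes to a weighted sum of the hidden outputs, with no skip connection. The weights and input values are arbitrary illustrations, not fitted values.

```python
import math

def logistic(z):
    """The node function e^z / (1 + e^z), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, hidden_weights, output_weights):
    """Probability from a one-hidden-layer network with no skip connection.

    x              : explanatory variables, without the constant input
    hidden_weights : one weight list per hidden node; the first weight
                     multiplies the constant input x0 = 1
    output_weights : one weight per hidden node
    """
    inputs = [1.0] + list(x)
    hidden = [logistic(sum(w * v for w, v in zip(weights, inputs)))
              for weights in hidden_weights]
    return logistic(sum(w * h for w, h in zip(output_weights, hidden)))

# Hypothetical weights for three inputs and two hidden nodes.
hidden_weights = [[0.1, 0.4, -0.2, 0.3],
                  [-0.3, 0.2, 0.5, -0.1]]
output_weights = [1.5, -2.0]

probability_out = forward([1.0, 0.0, 2.0], hidden_weights, output_weights)
print(probability_out)
```

Because every node output passes through the same bounded function, the final value always lies strictly between zero and one and can be read as a probability.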
For the output node, with no skip, it would be

$$g_O(\gamma^T h) = \pi_O = \frac{e^{\gamma^T h}}{1 + e^{\gamma^T h}},$$

where $h$ is a column vector of the outputs from the hidden nodes and $\gamma^T$ is the vector of coefficients for the output node. This process is similar to fitting a logistic regression at each of the hidden layer nodes and then using the outputs from these nodes as inputs for a logistic regression at the output node. For reasons analogous to those expressed with logistic regression, the probability of dying out of a hospital given the explanatory variables $x$, Pr(Class=O | x), is modelled.

It is interesting to observe that if there is no hidden layer and the softmax function is used, then the neural network reduces to logistic regression. Logistic regression is a special case of neural networks.

For completeness, several technical issues will be briefly presented. Further information can be found in Hastie et al. [40] and Venables et al. [42]. The single-layer architecture in Figure 5-2 with the softmax function represents a feed-forward neural network, as the inputs feed forward through the network [42,71]. The values for the weights are those which minimize a measure of error, such as least squares or deviance [40,42,71]. In order to find these values, back propagation is used, which is a generic approach to minimizing the chosen measure of error by gradient descent [40,71]. When the model has been fit, classification occurs according to some predetermined cut-off value. If $\pi_O = \Pr(\text{Class}=O \mid x) > c$, for $c \in [0, 1]$, then the neural network classifies the new patient as dying out of a hospital. Otherwise the patient is classified as dying in a hospital. By adding the product of a weight-decay parameter, $\lambda$, and the sum of squared weights, $w$, to the error function, $E$, smoothness can be ensured [40]:

$$\text{Fitting criterion} = E + \lambda \sum_i w_i^2.$$

Classification Trees

The idea of tree-based methods can be traced back to at least Morgan and Sonquist in 1963 [71]. Breiman et al.
popularized tree-based methods in the mid-1980s [71]. Classification trees divide the explanatory variable space into a set of disjoint hyper-rectangles. The hyper-rectangles are created so that they are as homogeneous as possible with respect to the outcome variable. When considering the location of death, we desire hyper-rectangles that are homogeneous with respect to deaths out of a hospital or deaths in a hospital. The most common implementation is to make the hyper-rectangles as homogeneous as possible in the response variable through binary splits of the explanatory variables.

For simplicity, consider the case where there are only two explanatory variables, $x_1$ and $x_2$. Figure 5-3 represents the explanatory variable space and each of the patients is identified by letter in that space. "H" refers to those patients that died in a hospital and "O" to those that died out of hospital.

Figure 5-3: Example of how the classification tree algorithm partitions the explanatory variable space. There are two explanatory variables, $x_1$ and $x_2$. The binary splits are represented by a, b, c, and d.

In the two-dimensional explanatory variable space, the goal is to create rectangles that are pure in one of the two outcomes. The classification algorithm will search $x_1$ and $x_2$ for the best place to split the explanatory variable into two parts (binary split). In Figure 5-3, the first binary split occurred along $x_2$ at point a. The decision rule is given for the root node (Figure 5-4). The upper portion of the variable space only has Hs, thus all patients in this portion of the explanatory variable space died in a hospital. In Figure 5-4 this is represented by the terminal node labelled H. The lower portion is a mix of Hs and Os, so further splits will be needed to reach our goal of pure nodes.

Figure 5-4: The classification tree for Figure 5-3.
If the condition is true, go down the left branch. If the condition is false, go down the right branch. "H" indicates death in a hospital. "O" indicates death out of a hospital. The terminal nodes are considered pure: there are only members of one class in each terminal node.

The second binary split was made along $x_1$ at point c. This split the lower rectangle in Figure 5-3 into two parts. The left hand side has only patients that died out of a hospital. The terminal node is labelled as O in Figure 5-4. The right hand side is a mix of Hs and Os. Further splits will be needed to reach our goal of pure nodes. This process continues until the nodes are pure or some predetermined stopping criterion is reached.

In reality, of course, pure nodes are rare. For each region, the probability of belonging to a particular class is the proportion of that class in the region. Translating this to the tree graph (Figure 5-4), the probability of belonging to a particular class in a terminal node is the proportion of patients of that class associated with the terminal node. For us, we can determine the probability of dying out of a hospital given the explanatory variables $x$, Pr(Class=O | x), by the proportion of patients who died out of a hospital for the terminal node associated with $x$.

To make this precise, if the explanatory variables $x$ put the patient in the $j$-th node, then the estimated value of Pr(Class=O | x) is

$$\hat{p}_{jO} = \frac{1}{N_j} \sum_{x_i \in R_j} I(\text{Class}_i = O), \qquad (3.1)$$

where $x_i \in R_j$ denotes all patients in the $j$-th node (or region) $R_j$ and $N_j$ is the number of observations in that node (or region). This is the proportion of class O in the $j$-th node (or region) and this proportion is uniform over the entire region [40].

For completeness, the four methods used to determine the optimal split will be briefly presented. For further information see Hastie et al. [40] or Venables et al. [42].
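Before turning to the split criteria, the node estimate in (3.1) can be illustrated with a small sketch. The split points, region labels and patient records below are hypothetical; the partition only imitates the first two splits described for Figure 5-3 (a split on $x_2$, then a split on $x_1$ in the lower region).

```python
from collections import defaultdict

A_SPLIT = 5.0  # hypothetical position of the first split, on x2
C_SPLIT = 3.0  # hypothetical position of the second split, on x1

def terminal_node(x1, x2):
    """Assign a patient to one of three regions of the variable space."""
    if x2 >= A_SPLIT:
        return "upper"
    if x1 < C_SPLIT:
        return "lower-left"
    return "lower-right"

def node_probabilities(patients):
    """Proportion of out-of-hospital deaths ('O') within each region."""
    counts = defaultdict(lambda: [0, 0])  # region -> [number of O, total]
    for x1, x2, outcome in patients:
        node = terminal_node(x1, x2)
        counts[node][0] += outcome == "O"
        counts[node][1] += 1
    return {node: n_out / total for node, (n_out, total) in counts.items()}

patients = [
    (1.0, 6.0, "H"), (4.0, 7.0, "H"), (2.0, 8.0, "H"),
    (1.0, 2.0, "O"), (2.0, 1.0, "O"),
    (4.0, 2.0, "O"), (5.0, 3.0, "H"), (6.0, 1.0, "O"), (4.5, 4.0, "H"),
]
probs = node_probabilities(patients)
print(probs)
```

Two of the three regions in this toy example are pure, while the third is mixed and would be the candidate for a further split. Every patient falling in a region receives the same estimated probability.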
The simplest of the four methods is the misclassification rate. The split is made for the smallest misclassification rate [42]. This is generally not used. Secondly, using the idea of likelihood, a deviance can be defined as

$$D = -2 \sum_i \sum_k n_{ik} \log p_{ik}, \qquad (3.2)$$

where $n_{ik}$ is the number of observations in the $i$-th node and the $k$-th class, and $p_{ik}$ is the proportion of class $k$ in the $i$-th node. The split is made where the reduction in deviance is maximized [42]. The final two methods use the idea of impurity in the node distributions. Entropy,

$$-\sum_k p_{ik} \log p_{ik},$$

and the Gini index,

$$1 - \sum_k p_{ik}^2,$$

select the split that reduces the average impurity [42]. In addition, a cost function can be utilized to penalize overly complex trees [40].

K Nearest Neighbours

K nearest neighbours (KNN) utilizes the idea that similar things are often found together. The approach emerged out of the electrical engineering and information systems fields of study. It can be traced back to Cover and Hart (1967) [71].

KNN is premised upon the idea that the outcome of a patient will be similar to that of patients who have similar explanatory variables. If the explanatory variables of a patient put it in a cluster where the majority of patients have died out of a hospital, then we would expect that the new patient would have an outcome like that of the majority.

With KNN, decisions are based upon the majority vote among the k nearest neighbours [40,48]. This vox populi approach to classification depends upon the definition of a neighbourhood. The neighbourhood is defined by a measure of distance. This can be a measure such as Euclidean distance (straight line between two points). The probability of dying out of a hospital given the explanatory variables $x$, Pr(Class=O | x), is taken as the proportion of neighbours that are from class O.

For completeness, a general metric for the distance between two observations, $x$ and $x'$, is defined as

$$d(x, x') = \sqrt{(x - x')^T A^{-1} (x - x')}. \qquad (4.1)$$

This allows for a variety of measures to be used.
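The majority-vote idea can be sketched in a few lines. The sketch assumes a quadratic-form metric, $\sqrt{(x - x')^T A^{-1} (x - x')}$, with the inverse matrix passed in directly; the function names, training records and choice of k are hypothetical.

```python
def quadratic_distance(x, y, a_inverse):
    """sqrt((x - y)^T A^{-1} (x - y)) for a supplied inverse matrix."""
    diff = [xi - yi for xi, yi in zip(x, y)]
    p = len(diff)
    total = sum(diff[i] * a_inverse[i][j] * diff[j]
                for i in range(p) for j in range(p))
    return total ** 0.5

def knn_probability_out(new_x, training, k, a_inverse):
    """Proportion of the k nearest training patients with outcome 'O'."""
    ranked = sorted(training,
                    key=lambda row: quadratic_distance(new_x, row[0], a_inverse))
    return sum(1 for _, outcome in ranked[:k] if outcome == "O") / k

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix: plain Euclidean distance

training = [
    ([1.0, 1.0], "O"), ([1.2, 0.8], "O"), ([0.9, 1.3], "O"),
    ([3.0, 3.0], "H"), ([3.2, 2.9], "H"), ([1.1, 1.6], "H"),
]
new_patient = [1.0, 1.1]

p_k3 = knn_probability_out(new_patient, training, 3, IDENTITY)
p_k5 = knn_probability_out(new_patient, training, 5, IDENTITY)
predicted = "O" if p_k3 > 0.5 else "H"
print(p_k3, p_k5, predicted)
```

The estimated probability changes with the number of neighbours, which is one reason that number is usually tuned on the data rather than fixed in advance.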
For instance, if A is the identity matrix, then the distance is the p-dimensional Euclidean distance; if it is the covariance matrix, then the Mahalanobis distance is used. Also, the number of neighbours is generally determined experimentally.

Advantages and Disadvantages

Every technique has both advantages and disadvantages. For supervised learning techniques, there are some common advantages.
• They can be highly adaptable, allowing for complex relationships such as interactions and higher order terms.
• They are useful when analysing large data sets.
• They are useful for classification.

The most common disadvantages shared by all classifiers are:
• overly optimistic assessments of performance,
• over-fitting, and
• under-fitting.

These disadvantages will be addressed in a subsequent section. Each particular classifier has its own set of advantages and disadvantages. For the remainder of this section, each of the five classifiers previously discussed will have its advantages and disadvantages presented. Table 5-1 is a summary of the advantages and disadvantages.
Table 5-1: Advantages and disadvantages specific to each classifier

Classifier: Linear Discriminant Analysis
Advantages:
• Adaptable for non-linear patterns (Quadratic Discriminant Analysis)
• Distributional and non-distributional approaches
Disadvantages:
• Requires assumption of equal covariance structures
• Parametric approach requires the assumption of a multivariate normal distribution

Classifier: Logistic Regression
Advantages:
• Can have interactions and higher power terms in the model
• Widely available in statistical packages
• Significance testing is possible
• Odds-ratio and relative risk are feasible
Disadvantages:
• Limited by model specification

Classifier: Neural Networks
Advantages:
• Can find complex patterns
• Flexible architecture
• Many parameters available to enhance the performance of the classifier
• Not limited to the softmax function
Disadvantages:
• Significance testing not available
• Black-box
• May not find global minimum
• Large computation resources required

Classifier: Classification Trees
Advantages:
• Can find complex patterns
• Results transfer easily to clinical setting
• Interactions can be implied
Disadvantages:
• Significance testing not available
• Does not optimize over the entire tree

Classifier: Nearest Neighbours
Advantages:
• Can find complex patterns
• Memory-based
• Can identify pockets of activity
• Alternate measures of distance
Disadvantages:
• Significance testing not available
• Black box

Linear discriminant analysis has both parametric and non-parametric methods. Only the non-parametric approach was previously presented. Having a parametric method for the development of LDA does allow for statistical inference, but a multivariate normal distribution needs to be assumed. LDA does assume a linear relationship, but a non-linear relationship can be explored by relaxing the assumption of equal variance-covariance matrices [40].

Logistic regression has many appealing properties which include:
• wide-spread implementation in statistical software packages,
• the use of relative risk and the odds ratio, and
• hypothesis testing and inferential procedures.
Although logistic regression is a linear modelling technique, interactions and higher order terms can be included. Logistic regression does not automatically seek out these complex relationships, so it is limited to the model specified by the researcher.

As with logistic regression, neural networks have many appealing properties. They can find complex relationships, have a flexible architecture, and have many parameters that help fine tune performance (e.g. the decay parameter). As well, they are not restricted to a single type of function for the nodes [47,71]. There are some drawbacks to this flexible classifier. It is a black box procedure in that it is challenging to extract information about the utility of the explanatory variables, but research into the extraction of important variables is emerging [23]. A minimum will be found for the measure of error used, but there is no guarantee that the method has found the global minimum or the best solution [40].

Classification trees, in partitioning the variable space into hyper-rectangles, yield results in the form of classification rules. The obvious advantage of this approach is that it is easily transferred into a clinical setting, as classification trees can be easily explained. These rules can be presented as written rules or pictorially as a decision tree. As well, interaction effects are suggested through a particular branch of the tree. The partitioning of the explanatory variable space into hyper-rectangles means that the classifier will perform poorly if the class probabilities have gradual transitions rather than abrupt ones [71]. In this situation, linear discriminant analysis, logistic regression, or neural networks may outperform classification trees. Also, there is no attempt to optimize over the entire tree [42].

Nearest neighbours is a memory-based method. It can produce demarcations for very complicated boundaries as well as identify pockets of peculiarity which may be overlooked by other methods [71].
Another benefit of the procedure is that the notion of distance can be changed, thus the researcher is not limited to the Euclidean measure alone. Nearest neighbours is a black box approach and there is not an efficient means to determine which explanatory variables contribute meaningful information. Also, as the variable space increases in its dimensions, the space itself becomes more sparse, which may degrade the performance [52].

As the data under analysis is not simulated or simple in its construction, it is difficult to make a priori decisions about the underlying structure of the data. Given that the structure of the data will be, in general, unknown, the perspectives given by each classifier will help to uncover the underlying story in the data. Each classifier provides insights into the data that may not be revealed by other classifiers, the strengths of one compensating for the weaknesses of another. For example, neural networks can be used to see if complex relationships between the explanatory variables perform better than simple linear relationships, and classification trees can give insights into the types of interactions that should be explored. By using the strengths of each classifier, a more complete picture of the data can be obtained.

Evaluating the Performance of Data Mining Methodologies

Performance Criteria

Recall that each classifier will produce a conditional probability of dying out of a hospital, $\pi_O = \Pr(\text{Class}=O \mid x)$. This probability will be used to determine the class to which a patient belongs based upon a threshold value, or it will be used to prioritize patients. The subsequent methods for assessing the performance of a classifier use the conditional probability of dying out of a hospital in one of these two ways.

When a model is fit to a data set, a natural thing to wonder about is its performance. The following techniques can be applied to any classification method.
In assessing the performance of a classifier, two main issues emerge: the criterion for measuring the performance and the means by which the criterion is estimated [71]. Four criteria will be considered: the misclassification rate, the receiver operating characteristic curve, the area under the receiver operating characteristic curve, and the hit curve. Cross-validation will be considered for the estimation of the performance criterion.

Misclassification, Loss Function and the Misclassification Curve

The misclassification rate is the proportion of patients that have been misclassified. Frequently this is the first measure considered when assessing the performance of a classifier. It is easy to communicate and easy to calculate. For the two-class problem of predicting the location of death for cancer patients, a general misclassification table is given by Table 5-2. The overall misclassification rate is

$$Miss = \frac{B + C}{A + B + C + D}. \qquad (5.1)$$

This formulation of the misclassification rate gives equal weight to all errors. This implies that a patient is classified as dying out of a hospital if Pr(Class=O | x) > 0.5.

Table 5-2: Misclassification table

                                 Prediction ($\hat{Y}$)
                                 Out of Hospital    In Hospital
Observed (Y)  Out of Hospital          A                 B
Observed (Y)  In Hospital              C                 D

In reality, one type of misclassification may be worse than another. For example, it may be worse to classify a patient who died in a hospital as having died out of a hospital than to classify a patient who died out of a hospital as having died in a hospital. One method of accounting for differences in the costs of misclassifying a patient is the 0-1-c loss function. In the two-class situation the loss function allows the researcher to specify the loss associated with a certain type of misclassification. Recall that the outcome (response) $Y$ is H if a patient died in a hospital and O if a patient died out of a hospital. The loss function is given as

$$L(Y, \hat{Y}) = \begin{cases} 0 & \text{if } Y = H \text{ and } \hat{Y} = H \\ 0 & \text{if } Y = O \text{ and } \hat{Y} = O \\ 1 & \text{if } Y = H \text{ and } \hat{Y} = O \\ c & \text{if } Y = O \text{ and } \hat{Y} = H \end{cases} \qquad (5.2)$$

where $\hat{Y}$ is the predicted class. Table 5-3 presents this in tabular form.

Table 5-3: 0-1-c loss function; the truth is the real state (what was observed) and the decision is the decision made by the classifier. "H" indicates death in a hospital. "O" indicates death out of a hospital.

                Decision ($\hat{Y}$)
                H        O
Truth    H      0        1
Truth    O      c        0

Since deaths out of a hospital are of primary interest in this research, we want to minimize the risk of misclassifying patients who died out of a hospital. The risk of misclassification given that the patient did die out of the hospital is

$$R(Y_O, \hat{Y}) = c P(\hat{Y} = H) + 0 P(\hat{Y} = O), \qquad (5.3)$$

where $Y_O$ is an actual out of hospital death. The risk of misclassification given that the patient did die in the hospital is

$$R(Y_H, \hat{Y}) = 0 P(\hat{Y} = H) + 1 P(\hat{Y} = O), \qquad (5.4)$$

where $Y_H$ is an actual in hospital death. To minimize the risk of misclassifying the patients who died out of a hospital, we want (5.3) to be less than (5.4),

$$R(Y_O, \hat{Y}) < R(Y_H, \hat{Y}), \quad \text{that is,} \quad c P(\hat{Y} = H) < P(\hat{Y} = O).$$

If $c > 1$, then the cost of misclassifying an observation from class O as being a member of class H is greater than the cost of misclassifying a member of class H as belonging to class O. The reverse is true if $c < 1$.

Taking $P(\hat{Y} = O) = \pi_O$ and $P(\hat{Y} = H) = 1 - \pi_O$, the inequality becomes $\pi_O > c/(1 + c)$. This leads to a classification rule with a cut point $k \in [0, 1]$,

$$\hat{Y}(k) = \begin{cases} O & \text{if } \pi_O > k \\ H & \text{otherwise.} \end{cases}$$

This function is the predicted value for the patient (Table 5-2). The misclassification error is a function of $Y$, the true outcome, and $\hat{Y}(k)$, the predicted outcome. The misclassification error, Miss, becomes

$$Miss_k(Y, \hat{Y}(k)) = \frac{B_k + C_k}{A_k + B_k + C_k + D_k}.$$

A misclassification table can be constructed for each value of $k$, thus a misclassification rate can be obtained for each value of $k$. A set of misclassification errors results,

$$MISS = \{ Miss_k(Y, \hat{Y}(k)) \mid k \in [0, 1] \}. \qquad (5.7)$$

The value of $k$ chosen is the one with the smallest misclassification rate,

$$\hat{k} = \arg\min(MISS) = \arg\min_k Miss_k(Y, \hat{Y}(k)).$$

With the optimal cut point determined, the value of $c$ can be found,

$$c = \frac{\hat{k}}{1 - \hat{k}}.$$

The value of $c$ is the empirically derived cost that yields the minimum misclassification rate for the classifier. The value of $c$ will offer an insight into the relative costs of misclassification for the particular data set, as mentioned earlier. A plot of $Miss_k$ by $k$ yields a misclassification curve which graphically shows the minimum misclassification rate and the optimal cut point.

Receiver Operating Characteristic Curve and Area Under the Curve

Building upon the information obtained by constructing a misclassification table, the sensitivity and the specificity can be obtained. In this study, the sensitivity (true positive fraction) is the probability of predicting a death out of a hospital given the true state was death out of hospital,

$$Sensitivity = \frac{A}{A + B}. \qquad (6.1)$$

The specificity (true negative fraction) is the probability of predicting death in a hospital given the true state was death in a hospital,

$$Specificity = \frac{D}{C + D}. \qquad (6.2)$$

As with the misclassification rate, the sensitivity and specificity are subject to the cut point $k$. Different values of $k$ will result in different sensitivity and specificity values. A plot, similar to the misclassification plot, can be constructed for these two measures. This plot is known as the receiver operating characteristic (ROC) curve. The ROC curve is a plot of sensitivity versus 1-specificity (false positive fraction) for all possible cut points, $k$. Emulating the process by which (5.7) was obtained, (6.1) and (6.2) are derived for each $k$. This results in the pairs $(1 - specificity_k, sensitivity_k)$.

A summary measure of the ROC curve is the area under the ROC curve, commonly known as just the area under the curve (AUC). A straight line ROC curve from (0, 0) to (1, 1) will have an AUC of 0.5 and represents a classifier that is no better than random allocation to the classes (Figure 5-5) [53]. This suggests that the classes are not well separated by the classifier.
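The quantities defined so far in this section can be computed directly from predicted probabilities and observed outcomes. The sketch below builds the cell counts of Table 5-2 for a grid of cut points, then derives the misclassification curve, the ROC points and a trapezoidal AUC. The probabilities, outcomes and grid are invented for illustration, and the trapezoidal rule is one common way of computing the area rather than a method stated in the text.

```python
def confusion_counts(probs, truths, k):
    """Return (A, B, C, D), laid out as in Table 5-2, for cut point k."""
    a = b = c = d = 0
    for p, y in zip(probs, truths):
        predicted_out = p > k
        if y == "O" and predicted_out:
            a += 1      # observed O, predicted O
        elif y == "O":
            b += 1      # observed O, predicted H
        elif predicted_out:
            c += 1      # observed H, predicted O
        else:
            d += 1      # observed H, predicted H
    return a, b, c, d

def misclassification_rate(probs, truths, k):
    a, b, c, d = confusion_counts(probs, truths, k)
    return (b + c) / (a + b + c + d)

def roc_points(probs, truths, cut_points):
    """Sorted (1 - specificity, sensitivity) pairs, one per cut point."""
    points = []
    for k in cut_points:
        a, b, c, d = confusion_counts(probs, truths, k)
        points.append((1 - d / (c + d), a / (a + b)))
    return sorted(points)

def auc(points):
    """Trapezoidal area under ROC points sorted by false positive fraction."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

probs = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
truths = ["O", "O", "H", "O", "O", "H", "H", "H"]
cut_points = [i / 20 for i in range(21)]  # 0.00, 0.05, ..., 1.00

rates = [misclassification_rate(probs, truths, k) for k in cut_points]
best_k = min(cut_points, key=lambda k: misclassification_rate(probs, truths, k))
implied_c = best_k / (1 - best_k)
area = auc(roc_points(probs, truths, cut_points))
print(min(rates), best_k, implied_c, area)
```

Plotting `rates` against `cut_points` would give the misclassification curve, and plotting the pairs returned by `roc_points` would give the ROC curve. When several cut points share the minimum rate, `min` returns the first of them, which is an arbitrary tie-breaking choice.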
A ROC curve that rises very quickly to a sensitivity of 1.0 wil l have an A U C close to 1.0. This suggests that the classes are well separated by the classifier (Figure 5-5) 5 3. 1- Specificity (FPF) Figure 5-5: The ROC curve graphically represents the relationship between the sensitivity and the specificity. Line (A) represents a classifier that performs no better than flipping a coin. The associated AUC would be 0.5. Line (B) represents a classifier that performs well and has the ability to separate the two classes. The AUC would be near 1.0. 113 The ROC curve is the graphical representation of the A U C . If the ROC is a straight line from the bottom left corner to the top right corner, then the classifier does no better than a random allocation of patients to each group (death in a hospital, death out ofa hospital). The associated A U C will be close to 0.5. If the ROC rises very quickly towards the top left corner and then levels off as it approaches the top right corner, then the classifier is able to do a very good job distinguishing between the two groups. The A U C will be close to 1.0. Hit Curve (Lift Curve) The above measures of performance assess how well the classifier classifies patients. Another means by which performance can be assessed is to determine how quickly it finds the patients of interest. In our situation, we are interested in finding those people who died out of a hospital. The classifier assigns to each patient a probability of dying out of a hospital, TCO= Pr (Class=0 | x). These probabilities are sorted from largest to smallest. The patients with the highest probabilities are chosen first and a hit occurs when the chosen patient actually died out of a hospital (Table 5-4) 55. A good classifier will quickly find the patients of interest. A n effective way of presenting this is with a hit curve. Table 5-4: Tabular representation of the hit curve where "O" is out of hospital death and "H" is hospital death. 
Rank  Study ID  $\pi_O = \Pr(\text{Class} = O \mid x)$  y  Expected number of hits  Hit/Miss
1     19991985  0.98                                    O  0.98                     Hit
2     20004563  0.96                                    O  1.94                     Hit
3     20016466  0.91                                    O  2.85                     Hit
4     20007651  0.82                                    H  3.67                     Miss
5     19970293  0.81                                    O  4.48                     Hit

More formally, if S patients are chosen and $S_h$ of them are hits, the coordinate pair $(S, S_h)$ results. For each value of S a coordinate pair is obtained. A plot can be constructed with the number of patients selected, S, along the horizontal axis and the number of hits, $S_h$, along the vertical axis. A classifier which quickly finds the cases of interest will have a hit curve that rises quickly (Figure 5-6).

Figure 5-6: The ideal hit curve (A). The curve rises at a 45 degree angle to the number of selected patients axis until it reaches the total number of patients who died out of hospital ($n_1$). After this the curve is horizontal. The curve (B) is what is commonly seen with hit curves. The curve rises more slowly and needs more than $n_1$ patients selected to find the $n_1$ patients who died out of the hospital.

In reality, the ideal curve (A) in Figure 5-6 is rarely seen, so the performance is compared to the expected hit curve. The expected hit curve is obtained by plotting the expected number of hits against the number of patients selected. The expected number of hits can be taken as the sum of the probabilities for the first S ordered patients (Table 5-4)⁵⁵. It is desirable to see the hit curve perform at least as well as expected.

Estimating the Performance Criterion

Out-of-Sample Prediction

Four criteria for classifier performance have been presented. These four criteria will be used to assess the performance of the classifiers, but we still need to obtain an estimate for each of them. What we need is an honest estimate of how the fitted classification model performs⁷¹. This needs to reflect how the fitted model will perform if new patients are classified in the future⁷¹.
This is the problem of assessing how a fitted classifier performs with out-of-sample prediction. When estimating the performance of a fitted classification model, three approaches can be taken. The first uses the same data for both the construction and the assessment of the model. The second uses an independent set of data to assess the fitted model. The third entails a sophisticated reuse of data, so that the same data is not used both for fitting the classifier and for estimating its performance.

Estimation by reusing the same data

A naive approach to estimating the performance is to use the same data to fit the model and to assess the performance of the fitted classifier. Two related problems emerge when data is used in this way: overly optimistic estimation and over-fitting. Any classifier will attempt to fit the data as well as it possibly can. So logistic regression with only linear explanatory terms will fit the data as well as the model will allow. The model will to some extent adjust to the random noise in the data⁷¹. If the same data used to fit the classifier is also used to assess it, then the estimate of performance will be too good, because the model has adapted to some degree to the data. The estimate will be an overly optimistic evaluation of how the fitted classifier will perform on an independent data set.

The overly optimistic estimate of how a classifier performs can be made worse if the model is too complex. A model can be made to adapt to the data. For example, if interactions and higher powers of explanatory variables were included in the specification of a logistic regression function, then the logistic regression function would fit the data even better. In this situation, the classifier has been tailored to the data set and therefore to the noise in the data set. This complex model could perform worse than a simpler model on an independent data set. This is known as over-fitting.
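The optimism of an estimate based on reused data can be seen in a small simulation. The sketch below is only an illustration under stated assumptions: it swaps in a 1-nearest-neighbour classifier (a deliberately flexible model, not the logistic regression example used in the text), and the simulated patients, the 20% label noise, the seed and all function names are hypothetical.

```python
import random


def simulate_patients(n, rng):
    """Hypothetical data: one feature x in [0, 1]; the true class is 1 when
    x > 0.5, but 20% of the labels are flipped at random (the noise)."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else 0
        if rng.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data


def predict_1nn(train, x):
    """Label of the training patient whose feature is closest to x."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def misclassification_rate(train, data):
    """Fraction of patients in `data` misclassified by 1-NN fitted on `train`."""
    errors = sum(1 for x, y in data if predict_1nn(train, x) != y)
    return errors / len(data)


rng = random.Random(2005)  # arbitrary seed, for reproducibility only
train = simulate_patients(200, rng)
independent = simulate_patients(200, rng)

# Same data used to fit and to assess: each patient is its own nearest
# neighbour, so the classifier reproduces the noise perfectly.
reused_rate = misclassification_rate(train, train)
# Independent data: the noise learned from the training set does not carry over.
independent_rate = misclassification_rate(train, independent)
print(reused_rate, independent_rate)
```

Because each training patient is its own nearest neighbour, the rate computed on the reused data is zero even though a fifth of the labels are noise, while the rate on the independent data is not.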
In order to overcome this naive use of data, a more sophisticated view of how to use the data is needed.

Using data to prevent overly optimistic estimates and over-fitting

If the data is used in such a way that overly optimistic estimates of performance are avoided, then a significant step has also been taken towards addressing the problem of over-fitting. The failing of the previous approach to performance assessment was that the data was reused, so the estimate would not give a good picture of how the fitted classifier would perform on an independent data set. A simple solution is to split the data into two parts⁴⁰. The first part is used to fit the classifier and is called the "training set". The second part is used to assess the performance of the fitted classifier and is called the "test set".

In most cases, more work needs to be done than simply fitting the model and assessing it. Often, the fitted model goes through a process of refinement. For logistic regression, this would be the process by which a parsimonious model is found. For nearest neighbours, it would be the process by which the best number of neighbours is selected. In these cases, either the training data set or the test data set would have to be reused to refine the model. This reuse of data would result in the same two problems that we are trying to avoid.

In order to avoid these problems when fine-tuning the fit of the model, we could split the data into three parts (Figure 5-7)⁴⁰. The "training set" would be used to fit the classifier. The "validation set" would be used to estimate the performance of the fitted model for model selection. When fine-tuning the fitted model, the validation set would serve as the independent data set for assessing the ability of the model to generalize. The "test set" would be held back until all the fine-tuning of the fitted model was complete. This data has not been used for fitting the model or for selecting the best version of that model.
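The three-way split can be sketched for the nearest-neighbours case mentioned above, where the number of neighbours is the quantity being fine-tuned. This is a minimal illustration, not the procedure used in this research: the simulated patients, the 50/25/25 split proportions, the candidate values of k, the seed and all function names are assumptions.

```python
import random


def simulate_patients(n, rng):
    """Hypothetical data: one feature x in [0, 1]; the true class is 1 when
    x > 0.5, but 20% of the labels are flipped at random (the noise)."""
    data = []
    for _ in range(n):
        x = rng.random()
        y = 1 if x > 0.5 else 0
        if rng.random() < 0.2:
            y = 1 - y
        data.append((x, y))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training patients closest to x (k odd)."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0


def miss_rate(train, data, k):
    """Misclassification rate on `data` of k-NN fitted on `train`."""
    return sum(1 for x, y in data if knn_predict(train, x, k) != y) / len(data)


rng = random.Random(2005)  # arbitrary seed, for reproducibility only
patients = simulate_patients(400, rng)
rng.shuffle(patients)

# Assumed proportions: 50% training, 25% validation, 25% test.
train, valid, test = patients[:200], patients[200:300], patients[300:]

# Model selection: the validation set chooses the number of neighbours.
candidates = [1, 3, 5, 7, 9, 11, 13, 15]
valid_rates = {k: miss_rate(train, valid, k) for k in candidates}
best_k = min(candidates, key=lambda k: valid_rates[k])

# Error estimation: the test set is touched once, after all tuning is done.
test_rate = miss_rate(train, test, best_k)
print(best_k, valid_rates[best_k], test_rate)
```

The validation rate of the selected k is itself optimistic, because k was chosen to minimise it; only the test rate comes from data that played no part in fitting or selection.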
The test set will yield an honest estimate of the fitted classifier's ability to generalize to an independent data set. In doing this, we avoid the problems of over-fitting and of overly optimistic estimates of a classifier's performance.

Figure 5-7: The partitioning of data into three sets (Training, Validation, Testing): training set for model construction, validation set for model selection, and testing set for error estimation (generalization).

Reality rarely allows for such a partitioning of the data. Data must be used efficiently, as it may be difficult to obtain and expensive in terms of time, money or both. To overcome this limitation, sophisticated methods of data reuse have emerged. The solution utilized in this research was to reuse the data efficiently through cross-validation.

Bias-Variance Trade-Off

Before cross-validation is discussed, two other aspects of fitting a model will be addressed: bias and variance. These two measures are related to the complexity of the fitted model and thus to the problem of over-fitting.

When fitting a classifier, there are four items of interest⁴⁰. There is the observed outcome, $y$. There is the true model for the data, $f(x)$. There is the fitted classifier used to approximate the true model, $\hat{f}(x)$. Finally, there is the expected value of the fitted classifier, $E\hat{f}(x)$, or simply the mean of the fitted classifier.

It is believed that the data follows some true function, $f(x)$. The training data, $T$, is chosen and can be defined as the set of all pairs of predictor vectors and outcomes for the training set, $(x_i, y_i)$, $i = 1, \ldots, N$, where the outcome follows the true model with some added noise, $y_i = f(x_i) + \varepsilon_i$. Mathematically, the training set is defined as $T = \{(x_i, y_i) \mid i = 1, \ldots, N \text{ and } y_i = f(x_i) + \varepsilon_i\}$⁴⁰. Furthermore, we assume that the noise has mean zero, $E(\varepsilon) = 0$, and constant variance, $\mathrm{Var}(\varepsilon) = \sigma_\varepsilon^2$. A classifier is fit to the data.
The resulting fitted model, $\hat{f}(x)$, is an approximation of the true model, $f(x)$. As before, we want to assess the performance of the classifier. For simplicity, squared error loss is used⁴⁰. This is the squared difference between the observed value, $y_i$, and the fitted value, $\hat{f}(x_i)$. The expected prediction error (PErr) at $x_i$, based upon the squared error, is $\mathrm{PErr}(x_i) = E\left[(y_i - \hat{f}(x_i))^2 \mid X = x_i\right] = E\left[(f(x_i) + \varepsilon_i - \hat{f}(x_i))^2\right] = \sigma_\varepsilon^2 + \mathrm{MSE}(\hat{f}(x_i)) =$