A Cross Sectional Analysis of Air Pollution’s Impact on Chronic Respiratory Disease in Ontario

by

Christopher Alan Duddek

B.Sc. University of Manitoba, 1989

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE
in
THE FACULTY OF GRADUATE STUDIES
Department of Statistics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
August 1994
© Christopher Alan Duddek, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Statistics
The University of British Columbia
Vancouver, Canada

Abstract

Does ambient air pollution in Canada pose a threat to respiratory health? For a study initiated by Health Canada, we combined micro-level data from the 1990 Ontario Health Survey with data from an environmental air monitoring system to obtain a quantitative answer to the question.

In contrast to studies designed to collect special purpose data, the Ontario Health Survey was not designed to address respiratory health issues. In spite of this, the cross sectional database was rich enough for modelling. We used asthma and emphysema as the response variables in assessing the impact of four pollutants estimated for summer and winter.

Two analyses were conducted for each response variable, one incorporating survey design information, the other ignoring it. Age, income, smoker type and sex were significantly related to asthma at the α = 5% level of significance in both analyses.
None of the pollutant covariates figured in the model.

Using the classical χ² test for nested models as the criterion, the emphysema model achieved a better fit than the asthma model. Smoker type and age, in particular, were strongly related to emphysema; income and number of cigarettes smoked were significantly but less strongly related; summer NO2 was marginally significant, depending on which of the two analyses was considered.

Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
Poem Shard
1 Introduction
2 Literature Review
  2.1 Experimental Studies
  2.2 Estimating Cumulative Ambient Air Pollution Exposure
  2.3 Cross Sectional Studies
3 Issues in Cross Sectional Analyses
  3.1 Definition of Epidemiology
  3.2 The Disease Process
  3.3 Trends in Disease Patterns Since 1900
  3.4 Types of Epidemiological Studies
  3.5 Pros and Cons of the Cross Sectional Analysis
4 Description of the Ontario Health Survey
  4.1 Objectives
  4.2 Data Collection Method
  4.3 Target Population
  4.4 Pretesting
  4.5 Questionnaire Content
  4.6 Survey Methodology
  4.7 Nonresponse Rates
  4.8 Comment
5 Initial Data Analysis
  5.1 OHS Covariates
  5.2 Pollution Covariates
  5.3 Asthma versus Covariates
  5.4 Emphysema versus Covariates
6 Methodology
  6.1 Finite vs. Infinite Population Inference
  6.2 Generalized Linear Models Theory
  6.3 Modelling Binary Data
7 Asthma Analysis
  7.1 Arcsine Analysis
  7.2 Logistic Modeling
8 Emphysema Analysis
  8.1 Covariate Selection Excluding Pollution
  8.2 Evaluating the Effect of Pollution
  8.3 Comparing Weighted and Unweighted Analyses
9 Discussion
  9.1 Model Assumptions
  9.2 Exposure Measurement Problems
  9.3 Future Directions
References
Appendix: Figures

List of Tables

1 OHS questionnaire content
2 Flowchart of the derivation of the study data
3 List of study covariables
4 How the OHS work exposure questions form a covariate for this study
5 Types of pollution assessment studies
6 Spatially interpolated ambient air pollution six-year averages (µg/m3)
7 Definition of symbols used for the modelling of binary data
8 The best arcsine models fitted for asthma
9 Most significant terms in the unweighted logit asthma model
10 Most significant terms in the weighted logit asthma model
11 One term emphysema models using a fifth of the data
12 Terms in a stepwise fitting strategy for emphysema using a fifth of the data
13 Goodness of fit for each of the terms in the full emphysema model
14 Loadings for the first three principal components
15 Goodness of fit test for the pollution terms
16 Stepwise terms for emphysema using a fifth of the data and ignoring weights
17 Terms in the full unweighted emphysema model
18 Goodness of fit test for the pollution terms in the unweighted analysis

List of Figures

1 Canadian life expectancy at birth
2 Geographic coverage of the Ontario Health Survey
3 Flowchart summary of the OHS data collection procedure
4 OHS pretest questionnaire response rates
5 Nonresponse rates for the work exposure questions
6 Nonresponse rates for the smoking questions
7 Nonresponse rates for the well being score by age
8 Prevalence of the two response variables
9 The demographic covariates
10 The socioeconomic covariates
11 The lifestyle covariates
12 The health covariates
13 Typical graph of a station’s measurement of one pollutant
14 Asthma prevalence by covariates (I)
15 Asthma prevalence by covariates (II)
16 Emphysema prevalence by covariates (I)
17 Emphysema prevalence by covariates (II)
18 Empirical density of survey weights
19 Marginal distribution of age using and ignoring survey weights
20 Immigration background of the 1990 residents of Ontario
21 North American air quality trends
22 Ordered summary of study covariate nonresponse
23 NO2 readings for twelve stations ordered by increasing station mean (µg/m3)
24 O3 readings for twenty out of twenty one stations (µg/m3)
25 SO2 readings for all twenty stations (µg/m3)
26 SO4 readings for all ten stations (µg/m3)
27 Strongest scatterplot relationships between pollution estimates
28 Distribution of the 37 estimated PHU means for the four pollutants
29 Comparison of pollution measurements to estimates
30 Estimated NO2 six year average
31 Estimated O3 six year average
32 Estimated SO2 six year average
33 Estimated SO4 six year average
34 Asthma arcsine model diagnostics for the demographic grouping
35 Asthma arcsine model diagnostics for the socioeconomic grouping
36 Asthma arcsine model diagnostics for the lifestyle grouping

Acknowledgement

This thesis was made possible by Jim Zidek and Rick Burnett: both were instrumental in bringing the Health Canada project to the University of British Columbia. I appreciate their unending labour in sorting through the red tape associated with such a project.

To get me thinking about my thesis, the biostatistics subgroup met regularly in the summer of 1993. I would like to acknowledge the regulars for their help: Victor Espinosa-Balderas, Nhu Le, John Petkau, Rick White, Hubert Wong, Weimin Sun and Jim Zidek.

At Health Canada, I was lucky to have a good ally in Robert Tkalec. When I asked questions or complained about incomplete documents he was quick to the whip.

Weimin Sun deserves special recognition for his modification and implementation of the spatial interpolation methodology developed by Jim and Nhu. When I needed data he would work until the next morning’s sunrise. His dedication to the task was remarkable.

On the social side, I would like to recognize those who made my UBC days exciting. Xiaochun Li was always up for movies and dinner when I was stuck or tired; Nita Deerpalsing was stellar as my perpetual audience, both for the thesis and in general; and Tim Fijal was continually breaking me up with his Hippocrates’ saying that ‘wholemeal bread clears out the gut’.

Finally, I would like to thank Jim Zidek for his never-ending encouragement.
Without it I may never have succeeded in finishing the thesis.

To my mom and dad
for producing me.

The yellow fog that rubs its back upon the window-panes,
The yellow smoke that rubs its muzzle on the window-panes
Licked its tongue into the corners of the evening,
Lingered upon the pools that stand in drains,
Let fall upon its back the soot that falls from chimneys,
Slipped by the terrace, made a sudden leap,
And seeing that it was a soft October night,
Curled once about the house, and fell asleep.

And should I then presume?
And how should I begin?

(From The Love Song of J. Alfred Prufrock by T.S. Eliot)

1 Introduction

Resource exploitation was a dominant ideal of European imperialism. Dreams about the ‘New World’ were predicated on animal fur, vast stands of East Coast forest and unlimited stocks of Grand Banks cod (Seiler, 1993, 303).

Industrialization, urbanization, technology and the rise of bureaucracy, however, radically altered the landscape. Our ability to deleteriously affect our environment gradually led to the emergence of agencies now associated with the welfare state. Health Canada, Agriculture Canada, the Environmental Protection Agency and the Food and Drug Administration are a few of the North American institutions which attest to our faith in resource management.

The air we breathe has come to be seen as one of those resources. Two approaches have been taken to ensure and improve air quality: emission controls and development of ambient air quality standards. The automobile gives a good example of the first. Detroit manufacturers continuously update their line of cars, thereby creating opportunities to incorporate emission-reducing engineering improvements. Since preventative measures are viewed as the least costly of pollution controls, regulatory agencies in the United States have goaded the car industry into meeting higher auto emission standards. By the mid 1990s Canada will have realigned its standards so that they fall in line with America’s.
Because of this vigilance, North American standards are higher than those of many European countries (Hoberg, 1993, 108-109).

Ambient air quality standards provide another benchmark. The need for them became apparent after the London fog episode of 1952. The maximal 24-hour concentration of sulfur dioxide, ten times the current allowable concentration, resulted in an estimated 4000 excess deaths (Griffith, 1989, 112). The ambient air quality standard serves as a benchmark and is supposed to be set in accordance with the current state of scientific evidence. With general acceptance of these principles, ambient air quality standards have become a cornerstone of public health policy.

The concept of the Threshold Limit Value or TLV is a key to understanding the history of air quality guidelines in the United States. Since the 1940s TLVs have been set by the TLV Committee of the American Conference of Governmental Industrial Hygienists. The TLV Committee officially defines TLVs as levels at which “nearly all workers may repeatedly be exposed day after day without adverse health effects.” The evidence, however, suggests that the TLV Committee has always been sensitive to the impact its decisions would have on industry. It thus chose levels thought to be achievable by major players in industry (Rappaport, 1993).

The role of the TLV should become clear after sketching air pollution regulation in the U.S. from 1970 to the present. The inadequate regulation of hazardous air pollutants was implicitly recognized by the Clean Air Act of 1970. It gave the Environmental Protection Agency (EPA) the authority to impose emission standards that guaranteed “an ample margin of safety.” The EPA immediately became sensitized to the potential for economic dislocation if stringent emission limits were set, and sidestepped enforcement of the law by declining to list and regulate airborne toxicants.
When political pressure over specific substances grew too great to ignore, the EPA established emission limits based on economic rather than health considerations (Robinson and Paxman, 1992).

In the early 80s, commencing with the inauguration of Ronald Reagan, the EPA sought to delegate responsibility for airborne pollutant regulation to the states. The states then developed Ambient Air Level guidelines which were based on TLVs multiplied by an appropriate safety factor. But is the TLV a good starting point for air quality guidelines? There are reasons to be wary of the TLV.

First and foremost, the TLV is tainted by the TLV Committee’s decision process:

    TLVs for particular substances were heavily influenced by corporations with direct financial interests in the substances being evaluated. . . . TLVs often represent the exposure levels actually prevalent in major firms rather than levels at which no adverse health effects are reported (Robinson and Paxman, 1992, p. 392).

Whose voices are not heard by the Committee? Where economic interests have had a stake, collective health has partially been compromised to minimize the negative fiscal impact of higher emission standards on the offending industries. As long as the hazard is not deadly in the short term and affects only sensitive populations, profits have in the past held the upper hand in public health policy.

A second problem with the TLV derives from its use. States use the TLV by multiplying its value by a constant. But how is the constant determined? Since the TLV is based on the 40 hour workweek, multiplying by 4.2 would account for the 168 hour week experienced by residents living in the affected area. Still, the underlying guideline is designed for working populations and not necessarily for a population that includes more sensitive segments such as the very young and very old.
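
The workweek adjustment is simple arithmetic, and a short sketch shows how quickly a choice of constant can move a guideline. The function name, the example TLV and the extra safety factors below are hypothetical and chosen only for illustration; the text does not say how any particular state sets its constant, and treating the adjustment as a division of the TLV by the ratio of exposure hours is an assumption.

```python
# Illustrative only: converts an occupational TLV into a continuous-exposure
# guideline. Dividing by the hours ratio is an assumed reading of the text.
WORKWEEK_HOURS = 40
FULL_WEEK_HOURS = 7 * 24  # 168 hours in a week


def ambient_guideline(tlv, extra_safety_factor=1.0):
    """Scale a TLV by the 168/40 hours ratio and an optional safety factor."""
    hours_ratio = FULL_WEEK_HOURS / WORKWEEK_HOURS  # 168 / 40 = 4.2
    return tlv / (hours_ratio * extra_safety_factor)


if __name__ == "__main__":
    hypothetical_tlv = 4.2  # made-up value, arbitrary concentration units
    for factor in (1, 10, 100):  # hypothetical state-specific safety factors
        print(factor, ambient_guideline(hypothetical_tlv, factor))
```

In this sketch, safety factors of 1, 10 and 100 applied to the same TLV give guidelines spanning two orders of magnitude, before any disagreement about the TLV itself.
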
The resulting state-defined Acceptable Ambient Air Level guideline for acrylonitrile, for example, varies by a ratio of over a thousand across regulating states. The variability alone casts suspicion on the process.

The Clean Air Act Amendment of 1990 repositions the EPA as the national standard bearer in air pollution policy. The Act requires the EPA to develop national emission standards for 189 toxic substances over the next decade. In contrast to the “ample margin of safety” quoted in the Clean Air Act of 1970, the Amendment requires that standards be based on “maximum achievable control technology.” This reorientation is intended to accelerate the pace of standard setting by tying it to technological advancement. The TLV, as a result, will play a lesser role in health policy (Robinson and Paxman, 1992).

Gibson (1989) documents a similar tale of corporate influence over public air pollution policy in Ontario. The Inco smelter in Sudbury has long been an infamous source of SO2 emissions. Little headway in reductions has been made, however, because Ontario Environment Ministry officials knew “that the environmental benefits would be less immediate and perceptible than the costs of abatement and that the beneficiaries would be more dispersed and less well connected politically than the recipients of abatement orders” (Gibson, 1989, 250).

Ignoring the decision making realities of the political realm, the question remains: how seriously is human health affected by air pollution? A call for quantification, in the form of environmental health impact studies, is a common response (Britton, 1992), with many forms having been attempted. I give a flavour of recent work in the literature review. A taxonomy of epidemiological studies is provided in the chapter covering cross sectional designs.

In a bizarre twist, regulatory procedures often provide data for the studies.
Once an air quality standard is determined, pollutant monitoring stations are erected and data collected. Thus, ironically, the available data on pollutant exposure are driven by regulation rather than by their utility for inquiries into public health (Matanoski et al., 1992).

In this study we evaluate the risks of air pollution to chronic lung function. The 1990 Ontario Health Survey provides us with a cross sectional view of the health status and socioeconomic level of Ontario’s population. The air pollution data arise from an air monitoring network of 37 stations, tracking four pollutants, from 1983 to 1989. The study’s methodology and analysis are given in greater detail in the appropriate chapters.

The objective of this study is ambitious. The study of acute effects, as measured by longitudinal hospital admissions, is more common than the study of chronic effects. The difficulty with a study of chronic effects is the measurement of exposure. Long lag times, differences in pollutant mixtures over time, movement of individuals in the study population and error in self reported medical conditions are but a few of the obstacles to valid inference.

On the bright side, the Ontario Health Survey data are comprehensive and contain information on a large sample. The six years of pollution data employ recently developed methodology: where necessary the multivariate air monitoring station readings are spatially interpolated (Brown, Le and Zidek, 1993).

To make informed policy decisions on the effect of pollution on respiratory health we need to employ pre-existing data sources wherever possible. This study provides that kind of opportunity.

2 Literature Review

The literature on environmental health impact assessments is voluminous. In this chapter I cover some of the more recent papers in this area. I have ordered the material to follow a natural flow. Since experimental studies can more or less stand on their own I discuss them first. Observational studies tend to be more complex.
I have apportioned a section to the difficult area of exposure measurement and one to cross sectional studies of similar intent.

2.1 Experimental Studies

Although questions about a pollutant’s effect on human populations can be ethically difficult to address in an experimental study, some have been carried out. Whenever a profusion of possible predictors with many levels appears, experimental design, R.A. Fisher’s territory, tempts good researchers. In this section I describe recent experimental studies.

Hackney et al. (1992) evaluate the acute effects of nitrogen dioxide (NO2) on older adults with chronic obstructive pulmonary disease. They note that although animal toxicological studies show that high doses of NO2 pose a respiratory health risk, the epidemiological literature is inconclusive. They increase the power of their study by focusing on a sensitive segment of the population. This strategy has two benefits. First, the sensitive subpopulation is of interest in its own right since, almost by definition, its members are most affected by pollutant levels. Second, information garnered from sensitive subpopulations may have implications for the larger population. We could use the metaphor of individuals as instruments: sensitive individuals might be like us except that they react more strongly to the same stimulus.

Hackney’s Los Angeles study combines laboratory and field work. In the laboratory researchers examined the lung dysfunction of subjects exposed to 0.3 ppm NO2 for four hours interlaced with four bouts of exercise. In the field they evaluated the exposure measurements from personal exposure monitoring devices worn by the 26 volunteers for two-week periods. NO2 readings are traditionally highest in LA during fall and winter. The study interval was no exception, with an average of 125 µg/m3 as compared to the annual average of 90 µg/m3.

Two conclusions are drawn.
First, a comparison of station and personal exposure measurements shows that NO2 exposure is strongly influenced by outdoor pollution, even though up to 90% of a subject’s time is spent indoors. Second, the sensitive population’s short term NO2 exposure results in little short term clinical exacerbation of respiratory disease.

McDonnell et al. (1991) study the effect of a 6.6 hour ozone (O3) exposure on 38 healthy humans. The response measure, forced expiratory volume in one second (FEV1), declines significantly under exposure to 0.08 ppm O3 relative to clean air.

McDonnell, Muller, Bromburg and Shy (1993) improve on the previous study by examining more design points, increasing the range to 0.0 to 0.4 ppm O3 and increasing the sample size. This time they divide the sample into an exploratory sample of 96 subjects and a confirmatory sample of 194 subjects. The first sample is used to specify a model, after fitting many candidate models and retaining predictors that show statistical significance. The second sample validates the model and protects against declaring predictors significant when they appear in the model by chance alone. The study concludes that O3 explains 31% of the response variance while age explains 4%.

Experimental studies necessarily examine acute effects. From the three I reviewed, ozone seems to be more potent than nitrogen dioxide. Though experimental studies are useful for corroborating evidence produced by observational studies, the artificial context of a laboratory chamber may not correspond to pollution effects occurring in daily life. Moreover, they cannot assess chronic health effects. Observational studies address these issues head on. The first step in an observational study is to estimate the extent of exposure.

2.2 Estimating Cumulative Ambient Air Pollution Exposure

In observational studies we must deal with the problem of quantifying internal dose of the agent over time.
Difficulties like the impossibility of obtaining internal dose data, working with imperfect proxies and refining pollution exposure modeling strategies characterize the quest for an adequate solution.

How can we use ambient air pollution measurements as a surrogate for internal dose? Typically the data come from fixed site air monitoring stations, stations for which data have already been collected over long periods of time. The objective is to link station data to human populations. Commonly the data are first interpolated spatially to other sites where no measuring equipment exists. Exposure is then modelled from the living patterns reported by individuals in a survey.

Abbey, Moore, Petersen and Beeson (1991) address the interpolation question. They validated a simple method of interpolation by fixed site monitoring station deletion: after a station was deleted, interpolated values were calculated and compared against the measured values.

Ozone (O3) and total suspended particulate (TSP) were measured for at least three years by 126 and 142 stations, respectively. Ozone was measured for one minute every hour and TSP for 24 hours every sixth day.

The statistics of interest were exceedance frequency and mean concentration. The exceedance frequency was compared to US regulatory policy levels, which are often stated in terms of maximum allowable values.

Their interpolation method obtained estimates for all ZIP codes in California. A maze of rules was put together to get interpolated values. First, a measuring station was valid if at least three years of pollution data existed. Second, for a given ZIP centroid, a station was valid if it resided within a radius of 50 km. Third, a series of three zones, based on concentric rings, was defined, zone A being the closest, zone B in between and zone C the furthest. The only stations used in the interpolation were those falling within the closest ring.
Up to three stations were used in a given zone.

The zone definitions differed for each pollutant. TSP was assumed less homogeneous over space and the radius boundaries for the zones were tighter than for ozone. In the study area about 60% of the population lived within 10 km of a TSP monitoring site (zone B or better). About 90% lived within 16 km of an ozone monitoring site (zone A or better).

The study concludes that the interpolation methods worked well in this particular case. The correlations of 0.78 for TSP and 0.87 for ozone suggest two things. First, different pollutants need to be treated differently: even with tighter TSP zones and a greater number of stations, the TSP estimates were less accurate than those for ozone. Second, if persons can be located at ZIP code centroids for significant time periods, the interpolation method may produce estimates which are good surrogates for internal dose.

Seixas, Robins and Becker (1993) propose to model human exposure using the occupational history of individuals and hundreds of thousands of occupational ambient air pollution measurements. They motivate their research by contending that simple exposure estimates do not adequately capture the complexity of the disease process. The implicit assumptions that dose is a linear function of concentration and independent of time go against evidence of nonlinear toxicological behaviour of many substances causing adverse chronic health effects. In addition, the use of a simple statistic for exposure contributes to errors-in-variables bias.

Their modeling solution estimates the exponents of pollutant concentration and of the time between measurement and reported outcome. In their application 300,000 dust exposure observations from underground coal mines were used to estimate model parameters. Each of the 1200 respondents provided work histories sufficient to obtain estimates of individual exposure.
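
To make the idea of exponents on concentration and time concrete, here is a minimal sketch of one possible exposure index. The functional form, the function name and the sample history are assumptions made purely for illustration; they are not taken from Seixas, Robins and Becker, whose model is not reproduced in this review.

```python
# Hypothetical exposure index: sum of concentration**a * elapsed_time**b.
# The form is an assumption for illustration, not the published model.
def exposure_index(measurements, outcome_time, conc_exponent=1.0, time_exponent=0.0):
    """Combine (time, concentration) pairs into a single exposure value."""
    total = 0.0
    for measured_at, concentration in measurements:
        elapsed = outcome_time - measured_at
        if elapsed <= 0:
            continue  # skip measurements taken at or after the outcome
        total += concentration ** conc_exponent * elapsed ** time_exponent
    return total


if __name__ == "__main__":
    # Made-up work history: (year of measurement, dust concentration)
    history = [(1970, 2.0), (1980, 3.0), (1990, 1.0)]
    print(exposure_index(history, 1991))                      # simple cumulative sum
    print(exposure_index(history, 1991, conc_exponent=2.0))   # nonlinear in concentration
    print(exposure_index(history, 1991, time_exponent=-1.0))  # recent exposure weighs more
```

With a concentration exponent of 1 and a time exponent of 0 the index reduces to the simple cumulative exposure that the authors argue is too crude; other exponent values would let the data indicate whether intensity or timing matters more.
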
Despite the efforts made, alternative, simpler models achieved competitive predictive power.

In conclusion, for fixed site air station monitoring data, the best epidemiological studies can do to approximate individual internal dose over time is to use some combination of interpolation and human exposure modeling. When stations are close enough, e.g. within 10 km for TSP or 16 km for ozone, a crude interpolation method will do well. For greater distances the quality of the interpolation will decrease, though the decrease is pollutant specific and the degree of quality deterioration remains to be further quantified.

The subtlety of human exposure modeling promises to challenge researchers striving for the ideal. Models, however, will continue to be data dependent: the availability of work histories, migration patterns and activity diaries will determine the quality of inference.

More research is needed to answer the difficult question of how internal dose relates to exposure measures. If progress is made, we will also be able to evaluate the soundness of the modeling approach. Until then, modeling will offer hope of improving the power of epidemiological studies.

2.3 Cross Sectional Studies

I will characterize five recent cross sectional studies. Their variety of approaches is striking. While some of the studies simply analyse differences between study and control groups, others use standard regression techniques. The sample sizes range from 600 to 3900 respondents. The findings also show a range. Causes of respiratory symptoms range from grain farming to blowing alkali salts.

Abbey, Moore, Petersen and Beeson (1993) used Seventh-day Adventist nonsmokers to check on TSP, ozone and sulfur dioxide.
Their logistic regression analysis found significant relationships between ambient concentrations of TSP and ozone and several respiratory disease outcomes.

Senthilselvan, Chen and Dosman (1993) examine the relationship between grain farming and respiratory illness in Humboldt, Saskatchewan. They split their study population into four cohorts, use a survey to obtain binary responses and compare prevalences between cohorts. Asthma was found to be significantly related to grain farming and sex; wheezing was related to grain farming and smoking.

Gomez, Parker, Dosman and McDuffie (1992) considered the effect of alkali dust on a Southern Saskatchewan population living near Old Wives Lake. When prevalences were compared between the control and study groups, chronic wheeze, eye irritation and nasal irritation were found to be elevated in the study group. Unlike the previous two studies, the researchers used forced vital capacity (FVC) and forced expiratory volume in one second (FEV1.0) in their analysis.

Xiping, Dockery and Wang (1991) measured the FVC and FEV1.0 from a sample of Beijing residents. Besides underlining the importance of coal heating as a source of respiratory problems, they discovered a relationship between outdoor SO2 and both FVC and FEV1.0 in subjects who had not used coal stove heating.

Ozkaynak and Thurston (1987) use U.S. mortality statistics to evaluate the effect of ambient air pollutants. Their regression model suggests that SO4 concentration is a significant factor in mortality prediction.

The published cross sectional studies provide some evidence that respiratory health is adversely affected by irritants in the air. This study will add to the continually growing collection.

3 Issues in Cross Sectional Analyses

Before analysing the data I will describe epidemiology in broad terms, and then focus on the virtues and problems of cross sectional analyses.

As a science, epidemiology has had a number of high impact successes.
I document the case of AIDS as a way of illustrating the role and importance of epidemiologists. Historically, the etiology of infectious disease has been easier to discover than that of noninfectious disease. Noninfectious disease is more of a problem in the industrialized world, however. With an aging population it would not be far-fetched to suggest that the quality and length of life will in part be determined by our ability to understand and control noninfectious disease.

Noninfectious disease etiology is a slippery concept. With multiple causes and long developmental intervals, epidemiologists are forced to consider sophisticated analytic techniques. As a result, the literature is imbued with a variety of study methodologies. I provide a short taxonomy of epidemiological studies as an introduction to this area. A more detailed look at the cross sectional study ends the chapter.

3.1 Definition of Epidemiology

Epidëmos is Greek for prevalence. Hippocrates wrote several books concerning disease prevalence in the fourth century BC. In one, he distinguishes between endemic diseases, which prevail continuously at relatively low levels, and epidemic diseases, which occur at higher than expected frequencies. When his writings focus on the health of populations rather than individuals Hippocrates is donning the epidemiologist’s hat.

Epidemiology, as practiced today, “is the study of the distribution of states of human health and of determinants of deviations from health in human populations” (Valanis, 1992). This report is an epidemiological study by virtue of its concern with distribution of chronic respiratory illness over a geographic area and its relationship to airborne pollutants.

3.2 The Disease Process

Disease is a complex process that can be separated into stages of prepathogenesis and pathogenesis. Prepathogenesis refers to the initial bodily changes that may or may not lead to pathogenesis.
The individual’s exposure to one or more agents comprises the first stage of prepathogenesis. If the individual is susceptible he or she is unable to adapt to introduction of the agent; with successful adaptation the disease does not develop any further.

Pathogenesis occurs when the disease has successfully established itself within the host. During early prepathogenesis, events take place that are clinically difficult to detect. This period is often referred to as the presymptomatic or preclinical stage. At later stages, however, clinical symptoms show up. This is the point where the person is commonly said to ‘have’ the disease. Identification and classification of disease is complicated by partial symptoms, misdiagnosis and long lag times between early prepathogenesis and development of clinical symptoms.

Three components are distinguishable in the disease process (Valanis, 1992, Chapter 2). The first, denoted as the host, is the site of the disease. Factors relating to the host’s susceptibility, like lack of sleep, malnutrition, aging and immunity, may be useful in understanding the transition from healthiness to pathogenesis and the rate of that transition. An explicit description of the host, e.g. the human subject, is a good first step in describing disease.

The agent, initiator of the disease process, is the second component of the trinity. The tubercle bacillus, without which tuberculosis would be unknown, is an example of a parasitic agent. Risk factor assessments try to describe and quantify the hazards agents pose. Most infectious diseases arise from a single agent; noninfectious diseases are usually the result of multiple agents.

The environment is the third component and relates host to agent. The environment is the sum of external conditions and influences affecting the life of living things. By this definition, physical, biological and socioeconomic descriptors are encompassed.
The environment can1280706050Year3.3 Trends in Disease Patterns Since 19001970FemalesMales1930 1950 1990Figure 1: Canadian life expectancy at birth.effect host susceptibility and viability of the agent. Cholera, for example, spreads best whenpeople live in crowded conditions with poor sanitation. In this study wind patterns andtemperature may have a significant effect on an individual’s exposure to pollutants.Thus, disease is a rather complicated process. The focus of epidemiology, as opposed topractitioners of clinical medicine, say, is on the triangular relationship between host, agentand environment. To uncover the etiology or causes of disease is the goal of risk assessmentstudies.Life expectancy in Canada, as shown in Figure 1 (Sources: Statistics Canada, 1983, andInstitute for Health Care, 1990), has increased steadily since the early part of the twentiethcentury. The primary reasons for the increase lie in improved sanitation, better diet andour increased ability to control infectious disease. They are interrelated, of course, sincea better diet reduces susceptibility and improved sanitation diminishes the opportunity forthe spread of virulent bacteria. The strides made in medicine’s attack on infectious disease,however, has been nothing less than staggering. In terms of mortality about 45% of deaths in131900 were attributable to infectious disease. The corresponding figure in 1987 was about 5%(Valanis, 1992, p. 28).Epidemiology has had some of its greatest successes with infectious diseases. Take, forexample, the recent role played by the Centers for Disease Control (CDC) in Atlanta inidentifying aquired immunodeficiency syndrome. CDC happened to be the sole supplier ofpentamidine, an experimental drug used in chemotherapy and radiotherapy to treat cancerpatients. When physicians in Los Angeles and New York City greatly increased demand forpentamidine, epidemiologists at CDC took note. 
In 1981 CDC’s Morbidity and Mortality Weekly Report contained two articles about a new disorder termed acquired immunodeficiency syndrome (AIDS).

The new disease had the markings of an infectious disease: AIDS spread among people who had been in contact with one another. Working on the assumption that AIDS was caused by an unidentified virus, investigators carried out a series of risk assessment studies which found that the vast majority of AIDS patients were either drug users or homosexuals. The finding had major policy implications. Public education programs commenced, health care resources were reallocated and AIDS research emerged. In short, the new disease along with associated risk factors was identified early enough to allow public institutions to reformulate policy and adjust to new realities.

The assumption of AIDS’ infectious nature turned out to be right. The human immunodeficiency virus (HIV) was discovered by a French team in 1983. This completed the cycle from initial recognition to a reasonably complete etiology of the disease (Purtilo and Purtilo, 1989). Significantly, these investigations focused on a disease that had a single cause, developed relatively quickly and resulted in a sharp increase in the number of observable cases.

Unfortunately for epidemiology’s rising star, noninfectious diseases remain the leading causes of death. They tend to be harder to identify, take longer to develop, have multiple causes and come with long lag times between introduction of the agents and development of definitive clinical symptoms. In the wake of these developments, epidemiology must turn to more sophisticated methods and measurements to get at the complex etiology of noninfectious disease.

3.4 Types of Epidemiological Studies

In this section I contextualize the cross sectional study by describing its placement within a multitude of evaluative methods.
Much of my insight derives from Valanis’ book (1992, chapter 3).

The study of disease etiology generally proceeds in an orderly manner from generating hypotheses to determining the causal mechanisms underlying the observed phenomenon. Three types of studies, sequentially related and characterized by increased effort, higher costs and greater investigative controls, are used in epidemiology: descriptive, analytic and experimental.

The descriptive study relies on easily accessible data, often mortality rates, which are analysed in a simple manner. Analyses may consist of breaking the population down by certain characteristics and then comparing mortality rate differentials. Descriptive studies are useful for uncovering unusual events or problems with data quality and can be thought of as equivalent to initial data analysis.

The analytic study relates disease to agents while attempting to control for potentially confounding covariates. At least three types of investigation fall within this category: cross sectional, case control and cohort studies. The analytic study generates hypotheses, leads to experimental studies, and tests hypotheses in an attempt to explain phenomena arising in descriptive epidemiology.

The experimental or designed study distinguishes itself from observational studies, i.e. descriptive or analytic, in terms of control. In an experimental study the researcher selects factors of potential importance and their levels of application, and randomly assigns experimental units to the prespecified treatments. After controlling to whatever extent possible for external conditions, the researcher observes the outcomes to determine the importance of the selected factors.
With an observational study, the researcher has no control over factor levels or the assignment of experimental units.

The experimental study seeks to confirm or disavow certain cause and effect relationships. The investigator incorporates experimental randomization to avoid the pitfalls of systematic bias introduced by human intentionality. Under realistic conditions, only people who have the disease of interest are included in experimental studies. For one group, the experimenter removes the suspected agent from the environment; the other group acts as a control. The two groups are then followed over time to see if significantly different changes occur as a result of the treatment.

Since this study is analytic rather than experimental, each of the three analytic subtypes, namely cross sectional, case control and cohort, will be described in greater detail. The cross sectional study is like a snapshot: population characteristics are conceptually sited at a single time point. The resulting measurement of disease prevalence explains why some authors use the term ‘prevalence study.’

The case control and cohort studies differ from the cross sectional in that they follow individuals over time. Case control studies begin with two distinct populations: people with the disease and people without. Disease relationships are determined by examining both groups’ exposure to a variety of agents and uncovering the greatest differences between the two. In other words, case control studies work from disease to uncover exposure status.

As opposed to case control studies, cohort studies start from exposure status. The best example is the Framingham study begun in 1949 and continued to the present. A random sample of individuals was drawn from the population of Framingham, Massachusetts. Physicians determined that approximately 5000 of the sampled individuals were free of coronary heart disease, thus making them suitable for the study.
Every year since, the subjects have gone through a physical examination to (a) assess the health status of their hearts and (b) measure exposure risks. The hope is that the risk factors of those who develop heart disease can be identified. The cohort study can either be historical or prospective in nature.

The case of smoking and lung cancer, as documented from a historical perspective by Clemmesen (1993), is a good example of the potential and pitfalls of descriptive and analytic studies. At the turn of the century, incorrect diagnoses prevented the scientific community from identifying smoking as a problem. Much of the evidence against tobacco was anecdotal. With improvements in identification, some studies uncovered a relationship between smoking and lung cancer. Criticism aimed at the studies, however, mostly concerned with potential confounding factors and poor data quality, undermined their impact.

The development of cancer registries in Mecklenburg and New York in the early 1940s acknowledged concerns over data deficiencies. Five studies, published in 1950, produced common findings on the smoking habits of patients with lung cancer. This confirmation of earlier suspicions eventually led to the first International Symposium on Lung Cancer Endemiology at Louvain in 1952. In 1959, about 100 years after cigarettes had first been manufactured in the U.S., the U.S. Public Health Service officially pronounced its “deep concern” over the increase in age adjusted incidence of lung cancer deaths from 4 per 100,000 in 1930 to 31 in 1956. The long time interval needed to identify the association of lung cancer with cigarette smoking points to the importance of illness classification methods, quality data and confirmatory studies.

3.5 Pros and Cons of the Cross Sectional Analysis

Cross sectional analyses provide information about some aspects of the underlying disease process; other aspects of the process remain hidden.
The information primarily comes from a statistical model fitted to the cross sectional data. In this section I will describe what kinds of insight can be gained from the model. I first look at the advantages of the cross sectional study.

The model is intimately related to the data. If the data is a sample from some human population then the inferences about the model are applicable to the sampled population. In other words the study and target population are similar if not the same. Observational studies are better than experimental studies in this respect. In experimental studies the question of how the study population relates to the target population usually looms unanswered.

In a related way, the subject’s exposure level in an observational study is realistic, though hard to measure. With controlled experiments the levels at which pollution concentrations are set often bear no relation to levels experienced by the population of real interest. The design for LD50 experiments, where the objective is to get an estimate of the dose required to kill 50% of the population, is an example.

On a practical level, observational studies give information which cannot be duplicated by experimental studies. Whenever irreparable damage to the observational subject is possible, mice, not men, will be sacrificed. Even the use of data from the Nazi hypothermia trials, which were ethically indefensible, is controversial. Observational studies, on the other hand, are nonintrusive.

Another practical consideration is cost. Cross sectional studies often attempt to get information about previous events through a questionnaire. This is one way of evaluating long term exposures of human populations to a variety of potentially toxic chemical compounds. By contrast, designed experiments attempt to manipulate factors so that the effects can be observed. Designed experiments are very expensive when extended over many years and come with the danger that the question under study will lose relevance over time.
In this sense, cross sectional studies are efficient.

For cross sectional studies in particular, a model allows the investigator to examine the relationship between response variables and predictors. The model under consideration can vary considerably; explanatory variables that are nominal, ordinal and continuous with nonconforming scales can be handled at the same time. If the standard errors of model coefficients are also estimated, we can assess the importance of the predictors relative to one another. In fact the estimation of coefficients and their standard errors is the foundation upon which the house of cross sectional analyses is built. Cross sectional studies, then, can provide useful information using standard statistical methodology.

Now I proceed to indicate a few areas in which cross sectional analyses are weak. Having information at one point in time rather than at several is one of them. First and foremost, cause cannot be ascertained since event sequence is unknown. For instance, if fitness level is found to be negatively related to a persistent cough, is the individual’s inactivity partially the cause of the cough or is the cough a precursor to lessened activity? Without knowledge of specific events ordered in time, it is difficult to make conclusive statements.

Related to this aspect of cross sectional studies is the type of disease statistic adopted. Epidemiologists make a distinction between incidence, the number of cases of a disease in a prescribed time interval, and prevalence, the proportion of people with the disease. Cross sectional studies measure prevalence, a concern since bias is associated with prevalence: people die or move in response to the occurrence of a disease.

Exposure is one of the key measures in air pollution risk assessment studies. Exposure occurs when the host comes in contact with an agent in the environment. Depending on the exposure data, cross sectional studies can either be ecological or relational.
Ecological studies use station pollution data which is assumed to apply to the individuals who live nearby. Relational studies use exposure information collected for every individual selected in the study.

The ecological fallacy arises if one assumes an estimate based on an average applied to individuals is equivalent to an estimate based on individual measurements. If, for example, the distribution of a pollutant is heterogeneous then the absorbed dose among individuals will probably not be the same. Another possibility is that individuals attain different exposure levels due to daily commutes from one area to the next. Thus, a false relationship between pollutants and disease can be observed because the averaged exposure data does not adequately reflect the relation between subjects’ absorbed doses in different geographic areas. The ecological fallacy can be seen as arising from measurement error. The sensitivity of regression coefficients to various levels of spatial aggregation has not been studied comprehensively (Evans et al., 1984).

Another drawback of the cross sectional design arises from the lack of control. Alas, people are not randomly distributed experimental units. They are intentional agents often making decisions related to the area of scientific interest. An example is provided by asthmatics who, aware of their own hypersensitivities, exercise self selection in terms of where they live and where they work (Lebowitz, 1991). Individuals who make informed decisions of that sort should no longer be considered, strictly speaking, observations from a stochastic process that assumes independence.

The context of these studies is important; they ought to be thought about as one part of an extensive, ongoing research effort. By itself, certainly, a cross sectional study cannot prove a causal biological relationship. Bates (1992) suggests we look for coherence in complex phenomena in building a scientific case.
Epidemiological evidence should be corroborated, for example, by toxicological studies and results from molecular genetics.

4 Description of the Ontario Health Survey

In Canada, 10% of the 1991 gross domestic product was spent on health care (Evans, 1993, 32). Concurrently, little is known about the health status of the general population. This creates difficulties for provincial governments which must provide health care to individuals at local levels. What data sources are available for these agencies? Are they adequate to meet the need of the increasingly difficult task of optimizing the distribution of federally allocated funds?

The information we do have comes from administrative sources like hospital admission data. Although this data sheds light on who has received medical attention it remains silent about those who do not feed into the system. Further, profile information such as smoking history and socioeconomic status is limited or nonexistent from these sources. The need for better health data is clear.

In 1978-79, Statistics Canada conducted the Canada Health Survey, the first comprehensive health survey taken in Ontario. Successive health related surveys include the Canada Fitness Survey (1981), the Canada Health and Disability Survey (1983/84) and the Health and Activity Limitation Survey (1986/87). For provincial planning purposes, however, the available data had been inadequate. The sample sizes were not large enough to permit inferences below the provincial level. The Ontario Ministry of Health recognized the value of funding a survey which would allow for estimates at the level of District Health Council or Public Health Unit and in 1987 proposed a health survey for the population of Ontario. Four years later the proposal became reality: the survey was carried out and a rich source of new information about the health status of the population of Ontario was created. The database obtained from the 1991 Ontario Health Survey (OHS) provides the information for our study.
The large sample size, wide geographic coverage and detailed respondent information will enhance the quality of the study. This chapter gives an overview of the survey.

4.1 Objectives

The survey set out to provide baseline statistical data on the health of the Ontario population at the Public Health Unit level. The objectives are to:

• measure the health status of the population;
• collect risk factor data for the major causes of morbidity and mortality;
• collect data related to socioeconomic and demographic variations in health;
• measure awareness of high risk behavior;
• measure utilization of health services;
• collect descriptive data for health units; and
• collect data comparable to that in the Canada and Québec Health Surveys.

The long length of the resulting questionnaire reflects a vigorous attempt to achieve all the objectives. The high response burden was recognized from the start and evaluated during pretesting.

The high level of geographical coverage is worth highlighting. As Figure 2 indicates, the Province of Ontario can be divided into thirty seven Public Health Units (PHUs) or districts. The PHU is similar to the Census Division, the difference being marginal disagreements in boundaries. Some PHUs, for example, are aggregates of two Census Divisions. During analysis, the PHU will play a vital role in linking geographical pollution data to individuals in the sample. The large size of the PHU ensures that people living within one can reasonably be expected to be bound to the area in terms of daily movement. We hope there are enough PHUs to enable a good description of pollution differentiation between areas of the province.

4.2 Data Collection Method

We will often refer to the data collection method and so describe it now in some detail. Figure 3 depicts a flowchart summary.
I will comment on some of the pros and cons associated with each part of the process.

Figure 3: Flowchart summary of the OHS data collection procedure. [Identify Household in Sample → Conduct Personal Interview with all Household Members → Leave Self Completed Questionnaire → Handle Nonrespondents with Follow up Telephone Call]

Once a household is identified in the sample, a household record form is created. Newly constructed buildings or subdivisions, outdated maps and dangerous neighborhoods can make identification difficult.

As soon as the household record form is available, the personal interview of one household member is possible. That member must be knowledgeable since he or she reports on all of the other members of the household. Obstacles to successful completion of this phase include inability to contact anyone and errors associated with inaccurate determination of household members. There are at least three benefits from using a personal interview. First, the survey taker gets a chance to introduce the survey without being ignored. Second, empirical observation proves that a higher response rate is generally achieved by a personal interview over self enumeration alone. Third, telephone follow up will be easier since the caller will be making a ‘warm’ call.

A self completed questionnaire is left for each household member aged 12 and up. If that member does not return the questionnaire before a specified time, two telephone calls are made.

The combination of personal interview, self enumeration and telephone follow up represents a compromise between total survey cost and response rate.

4.3 Target Population

The target population for the interviewer completed portion of the survey is all residents of private dwellings in Ontario during the survey period (January to December of 1990). The 1991 estimate for the population of Ontario is 8.1 million people.
As in many surveys conducted by governmental agencies, residents of Indian reservations, inmates of institutions, foreign service personnel and residents of remote areas were excluded.

The target population for the self completed portion of the survey is similar except that the population includes only people aged twelve and up.

4.4 Pretesting

The Ontario Ministry of Health hired Statistics Canada to conduct a pilot survey for the OHS. The four objectives were to:

• identify weaknesses of the content, wording and structure of the questionnaire;
• evaluate the efficacy of the training procedures and field operations;
• quantify the effect of questionnaire length on response rates; and
• assess the use of an incentive to boost response rates.

In May 1989, a total of 800 dwellings in Peterborough County and the Municipality of Hamilton-Wentworth were surveyed. All households went through the same interviewer portion; four versions of the self completed questionnaire were equally divided among dwellings. Two follow up telephone calls were made to jog the memories of individuals who had not yet returned their questionnaires.

Figure 4: OHS pretest questionnaire response rates. [Bar chart of response rate (%) by questionnaire version; the bar labels that survive are 72%, 69% and 61%.]

Let {Basic} be the core of the questionnaire (the questions take about 15 minutes to fill out); {Food Frequency Schedule}, the set of questions pertaining to the amount and frequency of different kinds of food the respondent has consumed; {Linkage Information}, the set of questions asking for additional identifying information (middle names, maiden names, previously used surnames and birthplace).
The four versions of the self completed questionnaire can then be summarized as follows:

Version A: {Basic}
Version B: {Basic} + {Food Frequency Schedule}
Version C: {Basic} + {Food Frequency Schedule} + {Linkage Information}
Version D: {Basic} + {Food Frequency Schedule} + {Linkage Information}, with an incentive

The incentive in Version D was three $500 prizes drawn at random from those respondents who replied promptly. Two of the objectives were met by including these four versions of the questionnaire.

The response rate for the personal interviewer completed portion was 87%. The overall response rate for the self completed portion was 69%, somewhat under the 75% response rate the Ontario Ministry of Health was aiming for. The main conclusions of the pretest follow from the response rates for each version of the questionnaire, shown in Figure 4. First, adding the food frequency schedule reduced the response rate marginally. Second, requesting the extra identifying information seemed to have a significant adverse effect on the rate. Third, the incentive worked dramatically. The male response rate was 10% lower than that for females. The incentive remarkably increased response rates for males between the ages of 16 and 60. Finally, the use of telephone follow up proved worthwhile since, before calling started, the response rate was below 50%. Although Statistics Canada recommended version A or version D on the basis of the rate set out by the Ontario Ministry of Health, version B was eventually chosen as the best of all candidates.

4.5 Questionnaire Content

The questions and format of the OHS came from various sources. Previous survey questionnaires like the Canada Health Survey, Québec Health Survey, General Social Survey and Health Promotion Survey were used as models. Potential users of the data, i.e. units within the Ontario Ministry of Health such as the Public Health Branch, Public Health Units and District Health Councils, were given an opportunity to develop survey content.
Finally, organizations like Statistics Canada and Santé Québec were consulted along the way.

The final form of the questionnaire breaks down into three separate parts: the household record form, the interviewer questionnaire and the self administered questionnaire. The household record form keeps track of the dwelling and the identities of the household members. The content of the interviewer and self administered questionnaires is summarized in Table 1.

The important sections for our study are highlighted with an asterisk in Table 1. For the interviewer completed portion the chronic health problems section contains the two response variables considered in our study, chronic cough and asthma. The personal interview achieves the highest response rate among survey delivery techniques and favorably affects the quality of the response variable data. The other variables, on the other hand, come from the self completed part of the survey.

Sociodemographic data often varies with health outcome. We will therefore want to include a selection of sociodemographic variables from the list. The information derived from questions on the OHS ranges from country of birth to education, income and housing. Comparability of sociodemographic data to other sources, e.g. Canada Census, provides a possible data integrity check.

Information on the multifaceted phenomenon of smoking is extremely important for any study of respiratory health.
The section devoted to smoking is comprehensive, identifying smoker type, the number of cigarettes smoked daily and the ages at which smoking began and ended, where appropriate. As a bonus, questions aimed at the extent of second hand smoke were also asked.

• Household Record Form
• Interviewer Completed Questionnaire
    Contacts with Health Professionals
    Disability within the last Two Weeks
    Use of Medication
    Medical Insurance
    Accidents and Injuries
    Health Status
    Restriction of Activities
    Chronic Health Problems*
    Health Problem Probes
    Socio-economic Information*
• Self Completed Questionnaire
    Your Health
    Medicine and Drugs
    Smoking*
    Alcohol*
    Your Family*
    Dental Health
    Your Life in General*
    Driving and Safety
    Women’s Health
    Sexual Health
    Occupational Health*
    Physical Activities*
    Nutrition

Table 1: OHS questionnaire content.

The Short Michigan Alcohol Screening Test (SMAST) score is included in this study even though, on the surface, it may be of peripheral interest. The score identifies drinkers and alcoholics with a view towards reliability. With the stigma attached to alcoholism, the respondent may be sensitive about questions related to drinking. Research has shown that the SMAST is minimally affected by denial tendencies (Selzer, 1975).

A series of twelve questions under the heading ‘Your Family’ were weighted and summed to obtain a general family functioning score. It is supposed to reliably measure family functioning. The success or failure of interpersonal relationships within the immediate family might be thought of as another demographic characteristic of the individual. If there is some truth to the relationship between mental and physical health, covariates such as family functioning ought to be included in the study. In a similar vein, questions from the section ‘Your Life in General’ seek to measure social support outside of immediate family.
An analogue to the family functioning score, the general well being score, was constructed.

Health outcomes may in part be determined by conditions in the workplace. I take a number of questions delving into workplace exposure to hazardous materials from this OHS section.

Lastly, physical activity has a direct impact on human health. Effects generally include the reduction of premature morbidity and an enhancement of emotional well being. The OHS asked about the type and frequency of physical activity that took place in the last month.

In conclusion, the OHS questionnaire obviously strives to be comprehensive. The sheer number of questions, totaling over one thousand, attests to the fact. With the availability of such data, this project has as good a chance of uncovering a relationship as could reasonably be expected from an ecological study.

4.6 Survey Methodology

Survey methodologists implement sampling designs that meet given specifications under known constraints. In terms of a 95% confidence interval, the OHS design objective is to enable PHU proportions as low as 3% to be estimated within 50% of the estimate. For an estimated PHU proportion of 3%, the 95% confidence interval would be, in terms of percentages, $(3 - \tfrac{1}{2}\cdot 3,\; 3 + \tfrac{1}{2}\cdot 3) = (1.5, 4.5)$. Design constraints include budget and available sampling frames.

The chosen sampling frame, a frame of enumeration areas, comes from the 1986 Census. The enumeration area (EA) is the smallest area for which population counts can automatically be retrieved. Each EA is situated in a PHU and classified as either urban or rural. Specifically, urban EAs represent the urban core and fringe of census metropolitan areas or census agglomerations.

A multistage stratified cluster sample is a good description of the OHS survey type. PHU and the urban/rural bifurcation stratify the population.
The purpose of stratification is to group dwellings into homogeneous units with respect to measurable characteristics of interest. The estimates of population characteristics are for the most part more precise when using a stratified sample over a simple random sample. The primary sampling unit is the EA. In the first stage the survey takers sampled an average of 46 EAs within a PHU. They then constructed a list of dwellings for each of the selected EAs. The list became the sampling frame for the second stage of the sample. Clusters of dwellings, about fifteen from urban strata and twenty from rural strata, were sampled at the second stage, resulting in the desired sample size of approximately 760 dwellings per PHU.

Reliable estimates of proportions greater than 3% had to be achieved at the PHU level. How was this criterion used to arrive at the sample size? Let $\hat{p}$ be an estimate of a proportion using weights determined by the survey design. Most designs for surveys conducted at a provincial or national level produce estimates with less precision than those that could be obtained by taking a simple random sample. The design effect (deff) of a proportion, in this case estimated to be two (Ministry of Health, 1992a, p. 29), represents the factor by which the variance of an estimated proportion is inflated. Thus, letting $\mathrm{Var}(\hat{p})$ be the variance of $\hat{p}$ under simple random sampling, $\mathrm{Var}_{s}(\hat{p})$ its variance under the survey design, and ignoring the finite population correction factor,

$$\mathrm{Var}_{s}(\hat{p}) = \mathrm{deff}(\hat{p})\,\mathrm{Var}(\hat{p}) = 2\,\mathrm{Var}(\hat{p}) = \frac{2p(1-p)}{n}.$$

The coefficient of variation is a scale free ratio of an estimate’s precision to the expected value of the estimate. The OHS criterion for proportions greater than 3% was a coefficient of variation less than 25%:

$$\mathrm{C.V.}(\hat{p}) = \frac{\sqrt{2p(1-p)/n}}{p} < 0.25. \tag{1}$$

Since, by (1),

$$n > \frac{2(1-p)}{(0.25)^{2}\,p} \quad \text{or} \quad n > \frac{32(1-p)}{p}$$

and

$$\max_{p \in [0.03,\,1]} \frac{1-p}{p} = \frac{1-0.03}{0.03} \approx 32,$$

the approximate sample size needed to fulfill the reliability criterion is $n = 32 \cdot 32 = 1024$. For 46 PHUs the sample size translates to about 48,000 individuals.

Note that the sample size was further increased to account for expected nonresponse. The actual sample size resulted in 49,200 individuals responding out of 61,300 surveyed. The 49,200 individuals represent 35,500 households. To recapitulate, the large sample size exists to meet design specifications.

This section was included to give the reader a flavour of the methodological intricacies lurking behind the Ontario Health Survey data. The complex survey design induces unequal probabilities of selection for the survey population; the weights associated with each survey respondent reflect selection probabilities adjusted for nonresponse and age-sex population totals at the PHU level. This should caution any analyst to consider carefully what types of inference are supportable by an analysis of the data.

4.7 Nonresponse Rates

Characterizing response rates for the OHS is not trivial. From the outset, the personal interview and self enumeration introduce at least two response rates: the OHS had a response rate of 88% for the former and 77% for the latter. Item response rates further fog the issue. I will document some of the difficulties in coming to terms with item nonresponse.

The topic of response rates is conventionally rephrased in terms of nonresponse rates. I will adhere to that scheme. Nonresponse can be divided into unit and item nonresponse.

Unit nonresponse occurs when the survey taker does not receive any information from the respondent. For the OHS, unit nonresponse happens if nobody in the household goes through with the personal interview.
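As an aside on the reliability criterion of Section 4.6, the sample size bound can be checked numerically. The following is a minimal sketch, not part of the OHS documentation: the function names are hypothetical, and it assumes only what the derivation states, namely a design effect of two, a maximum coefficient of variation of 0.25, and no finite population correction.

```python
import math


def cv_of_proportion(p, n, deff=2.0):
    """Coefficient of variation of an estimated proportion p from n units,
    with the simple random sampling variance p(1 - p)/n inflated by the
    design effect `deff` (finite population correction ignored)."""
    return math.sqrt(deff * p * (1.0 - p) / n) / p


def min_sample_size(p, cv_max=0.25, deff=2.0):
    """Real-valued bound on n obtained by solving
    cv_of_proportion(p, n, deff) < cv_max for n."""
    return deff * (1.0 - p) / (cv_max ** 2 * p)


if __name__ == "__main__":
    p = 0.03                       # smallest proportion of interest
    n_bound = min_sample_size(p)   # bound without rounding (1 - p)/p
    n_text = 32 * 32               # bound after rounding (1 - p)/p to 32
    print(f"bound without rounding: {n_bound:.1f}")
    print(f"bound after rounding:   {n_text}")
    print(f"C.V. at n = {n_text}: {cv_of_proportion(p, n_text):.4f}")
```

Because the derivation rounds (1 − p)/p down to 32, the figure of 1024 is an approximation: the unrounded bound at p = 0.03 is about 1035, so the coefficient of variation at n = 1024 sits marginally above 0.25.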
Unit nonresponse is handled operationally by modifying the probabilities of selection for the sampled units that do respond and, essentially, ignoring the nonrespondents.

Item nonresponse occurs when the survey taker procures only partial information about the respondent. Reasons for this type of nonresponse include respondent reluctance to answer sensitive questions and mistakes made during the transcription of data from the actual survey form to an electromagnetic file. For the OHS, item nonresponse is complicated by the fact that the survey is conducted using both personal interview and self enumeration. Thus, item nonresponse arises in the following scenarios:

Household Personal Interview    Family Member Self Enumeration
item nonresponse                unit nonresponse
item nonresponse                item nonresponse
item nonresponse                complete
complete                        unit nonresponse
complete                        item nonresponse

There are two common means of handling item nonresponse. The first is imputation, a technique that fills in missing values. The two imputed OHS variables are age and sex. Missing values were imputed for the OHS by generating random values proportionally consistent with known PHU age-sex proportions. With imputed data the user cannot determine the rates of item nonresponse.

The second way of dealing with item nonresponse is to tell the user directly by allowing for a “not stated” category. This strategy allows for calculation of item nonresponse rates and the uncovering of nonresponse patterns. For example, the nonresponse for the eight questions on work exposure is given in Figure 5. Though the nonresponse rate hovers around 8% for each of the questions, for the most part the respondents either answer all of the questions or none.

Smoking nonresponse, shown in Figure 6, is an example of a more subtle pattern.
The first question determines if the respondent currently smokes. Current smokers are asked the next four questions while everybody else answers the six after that. The graph shows a higher item response rate for smokers but this merely reflects that a smaller proportion of the population can be classified as current smokers. Since the two groups are mutually exclusive, the union nonresponse rate is approximately the sum of the current smoker and not current smoker nonresponse rates.

[Figure 5: Nonresponse rates for the work exposure questions. Horizontal axis: question number. Annotations: union 9.5%; the same people are responsible for most of the nonresponse.]

[Figure 6: Nonresponse rates for the smoking questions. Horizontal axis: question number. Annotations: union 17%; current smokers were asked questions two to five, the others were asked questions six to eleven.]

[Figure 7: Nonresponse rates for the well being score by age. Horizontal axis: age.]

As a last example, consider the nonresponse pattern, shown in Figure 7, for a collection of questions concerning personal well being. Older respondents seem to be more sensitive to questions concerning their well being. The implication is that nonresponse is generally not random even though it is convenient to assume so for purposes of an analysis.

In conclusion, the Ontario Health Survey nonresponse is significant enough to warrant attention during analysis. What rates of nonresponse are there for the study variables? How will item nonresponse be dealt with for discrete and continuous variables? These types of questions will be dealt with as they arise.

4.8 Comment

The OHS comes with all the strengths and weaknesses of large scale survey data. The drawbacks include missing data and the survey weighting structure induced by complex survey methodology. Also, despite the wide subject coverage of the survey, the OHS provides no estimates of an individual’s exposure to potentially harmful pollutants.
We are left with the difficulty of estimating and linking pollution data from another source because of this absence. Admittedly, this criticism is somewhat unfair given the survey’s objectives.

These drawbacks notwithstanding, I am delighted to have access to data backed by impressive resources, human and otherwise. One may think of the panels of experts who determined questionnaire content; those involved with the pretest; the survey methodologists; and the many who played a part in the field operations, from the training staff, interviewers and field supervisors to those completing the cycle with data capture and imputation. Untold hours went into the production of what is for me a starting point: the microdata file!

5 Initial Data Analysis

This section will give the reader an overview of the data used in the succeeding analysis. I draw data from two sources: the 1991 Ontario Health Survey (OHS) and six years of atmospheric environmental monitoring.

The OHS target population (minus immigrants who have lived in Ontario for less than ten years) comprises the study population. We exclude recent immigrants because our outdoor air pollution estimates would not adequately represent their true exposure history.

I am forced to consider individuals as the unit of analysis because the OHS public data file, restricted for reasons of confidentiality, does not identify their household. A conflict immediately arises since the analysis should be in step with the survey design, meaning that households rather than individuals should be the unit of analysis. One implication of employing standard estimation techniques is that standard errors will be underestimated if no adjustments are made.

In its complete form the OHS data is unmanageable. There are over 1000 variables, many of which are of no use to this study. The first task is to cull the data.
I give a qualitative and graphical description of the reduced set of variables along with associated nonresponse rates.

The pollution data I start with derive from an involved inferential process. The original data come from thirty seven atmosphere monitoring stations. Each station potentially measures up to four pollutants, namely nitrogen dioxide (NO2), ozone (O3), sulfur dioxide (SO2) and sulfates (SO4). The station data is interpolated for each public health unit (PHU) by Weimin Sun, to whom I am indebted.

I convert the monthly averages, given over the six year period 1983-89, into single summer and winter averages. Thus, each PHU has a six year average which I assume adequately represents personal lifetime pollution exposure. For a flow diagram of the way the study data is derived see Table 2.

1991 OHS Data (Public Health Unit is the finest geographic partition)
    -> OHS Study Variables by Public Health Unit
    -> Study Data
1983-89 Daily Atmospheric Monitoring Station Data from Scattered Ontario Stations
    -> Monthly Pollution Estimates by Public Health Unit
    -> Study Data

Table 2: Flowchart of the derivation of the study data.

To close, I graph each of the covariates against the two response variables. This will provide intuition about at least the one term models we fit later.

5.1 OHS Covariates

The OHS data contains information on innumerable aspects of the population of Ontario. Since these data derive from measuring over 1000 variables, a subset that will best relate pollution to respiratory illness must be selected. In this section I will describe what covariates were selected, report their associated nonresponse rates and illustrate their marginal distributions graphically.

This study focuses on asthma and emphysema as the response variables. The two questions from the interviewer portion of the questionnaire were:

“Do you have asthma?”; and
“Do you have emphysema or chronic bronchitis or persistent cough?”.

The questions assume an ongoing chronic condition by the way in which they are asked.
A small fraction of questionnaire respondents (0.9%) did not respond to the two questions specifically. This fraction will be ignored from here on. Figure 8 illustrates the prevalence of asthma and emphysema. The low rates observed point to the need for a large sample: such studies could be straitjacketed by the lack of statistical power resulting from small samples.

[Figure 8: Prevalence of the two response variables. Panels: Asthma; Emphysema. Categories: Yes, No.]

To select the predictive covariates, I tried to obtain a satisfactory coverage of the following set of individual descriptors: demographic, lifestyle, health, socioeconomic, and pollution exposure measures. The categories are rather arbitrary, though I chose them to make the presentation of results more comprehensible. Table 3 exhibits the set of variables I selected in each group. The grouping adhered to in the table will generally not make any difference to the outcome of reported results; the arcsine analysis is the only exception.

Most of the study variables have been massaged in a variety of ways. The producers of the OHS datafile made the preliminary alterations. To make the file easier to use, they combined subsets of the original questions to get derived variables. Household type, for example, which classifies the respondent according to a description of the relationship between the family members of the household, was derived from the Household Record Form for each family in each household.

Household income is another example. The variable classifies the respondent into one of three household income groups: low income; not low income but less than $50,000; and income $50,000 or more. A question that arises is how exactly low income is defined.
The cost of living, household size and household income are determinants that should be taken into account. The rule adopted by the OHS is based on poverty lines and low income cutoffs developed by the National Council of Welfare and Statistics Canada. Place of residence (urban or rural), income and household size determine low income classification (Ministry of Health, 1990, p. 11).

Grouping        Covariate                               Type
Demographic     Rural or Urban Stratum                  Nominal
                Sex                                     Nominal
                Age                                     Continuous
                First Generation Immigrant              Nominal
Socioeconomic   Family Type                             Nominal
                Family Functioning Score                Continuous
                Blue Collar Work                        Nominal
                Work Exposure                           Ordinal
                Post-Secondary Education                Ordinal
                Income                                  Ordinal
Lifestyle       Smoker Type                             Ordinal
                Current Smoker Type                     Nominal
                Duration Smoked                         Continuous
                Number of Cigarettes Smoked             Integer
                Number of Current Household Smokers     Integer
                Alcohol Problem                         Ordinal
Health          Body Mass Index                         Continuous
                Energy Expenditure                      Continuous
                Well Being Score                        Ordinal
                Allergy                                 Ordinal

Table 3: List of study covariables.

I introduced the second set of modifications, which are typified by the example in Table 4. In effect I used existing OHS variables to derive a new variable. I chose this route to reduce the number of explanatory covariates under consideration. Clearly the choice was arbitrary to some degree.

The last modification of the original OHS data is related to item nonresponse. I will delay my exposition of nonresponse until after I have described the explanatory covariates in more detail.

Questions: In your job or business have you, in the past twelve months, worked with
  1. dust from wood, grain, hay or straw?
  2. dust from silica, granite or rock dust?
  3. glass fiber dust or asbestos?
  4. dust or fumes from lead, cadmium, nickel, chromium or mercury?
  5. fumes from solvents, paints or gasoline?
  6. resins or isocyanates?
  7. pesticides?
  8. coal tar or pitch?

Responses: Never; Occasionally; Often; Always; Don’t Know; Not Applicable; Not Stated.

Work Exposure = Yes if ‘Often’ or ‘Always’ at least once,
                No otherwise.

Table 4: How the OHS work exposure questions form a covariate for this study.

The demographic covariates describe certain unalterable features of the respondent. I chose stratum, sex, age and an immigrant indicator as the demographic covariates. Their marginal distributions are shown in Figure 9. As with the other estimates in this chapter I used survey weights, as prescribed by OHS documentation (Ministry of Health, 1990(c), p. 3), to produce the estimates.

The four demographic covariates look reasonable. The Canada Year Book (Statistics Canada, 1991, p. 73) tells us that 83% of the Ontario population resides in an urban setting. The OHS weighted estimate is 86%, as expected. The division between the sexes is about fifty-fifty and the age distribution shows a bulge for the baby boomers. The immigrant indicator shows what portion of the study population are immigrants who arrived before 1980.

Age may be an important explanatory variable. The probability of death increases with age, ranging from 3% for Canadians between the ages of one and 24 to 71% for those 65 years and up (Future Health, p. 102). Any disease related to chronic exposure over a long period of time might be expected to be related to age. For ailments of the respiratory tract, older people appear to be more vulnerable to inhaled particles and gases (Brain, 1989).

[Figure 9: The demographic covariates.]

Some studies consider ethnicity as a demographic characteristic. When ethnicity proves to be a good predictor, however, it likely reflects class membership, as in the case of aboriginals. As a group, they are affected by chronic cough or emphysema at almost double the Ontario average (4.5 versus 2.4%). This phenomenon is explained by the well documented condition of poverty in which many natives live.
The OHS data reinforce this view with an observed moderately high negative correlation between chronic cough and standard socioeconomic variables such as education and income. The higher prevalence of illness therefore reflects structural inequalities within society rather than race (Steinberg, 1984). Since ethnicity in Canadian society, excepting aboriginals, does not generally imply class membership, this study relies on the socioeconomic indicators to capture those inequalities.

[Figure 9 panels: Stratum (Urban, Rural); Sex (Male, Female); Age (years); Immigrant (Yes, No).]

Socioeconomic variables, shown in Figure 10, are important indicators of longevity (see references in Evans, Tosteson and Kinney, 1984). Presumably they are also good predictors of respiratory health. The socioeconomic covariates represent information on the family unit, work, education and income. All of these variables are categorical, including the seemingly continuous family functioning score. The score is a weighted score of responses from a series of twelve questions from the self completed questionnaire. The actual cutoff is supposed to represent the best division distinguishing families seeking clinical help from those in the general population. The definition of the income categories depends on poverty lines and low income cutoffs developed by the National Council of Welfare and Statistics Canada. The formula adjusts for household size, area of residence (urban/rural) and household income (Ontario Ministry of Health, 1992a, p. 10, 21-22).

Lifestyle covariates, shown in Figure 11, may turn out to be the most important set of explanatory variables due to the impact of smoking on lung function. The best of them is probably duration smoked, as it is continuous and more reliable than the other continuous covariate, the number of cigarettes smoked daily.
The measurement of the number of cigarettes smoked daily illustrates the tendency of people to ‘think in fives’ when asked for a simple answer to a complex habit. Second hand smoke exposure was captured in the personal interview when the respondent was asked if anybody in the household smoked. Finally, a drinking problem index, developed from a shortened form of the Standard Michigan Alcohol Screening Test (SMAST), is included as one of the lifestyle indicators.

The general health of a person may have an effect on specific pathologies such as asthma or emphysema. Figure 12 illustrates the distributions for the health covariates. Body mass index (kg/m2) and energy expenditure (kcal/kg/day) are continuous and reflect individuals’ participation in physical exercise. Well being, for reasons similar to the family functioning score, is ordinal. The family functioning variable divides families into functional and dysfunctional.

[Figure 10: The socioeconomic covariates. Panels: Family Type; Family Functioning Score (composite score; Healthy 78%, Dysfunctional 22%); Blue Collar Work (NA, Blue, White); Work Exposure (Yes, No); Completed Education (Primary, Secondary, Post Secondary); Income (Low, Mid, High).]

[Figure 11: The lifestyle covariates. Panels: Smoker Type (Never, Former, Occasional, Daily); Current Smoker Type (Daily, Occasional, NA); Duration Smoked (years); Cigarettes Smoked (number per day); No. of Household Smokers (0, 1, 2, 3+); Alcohol Problem (No, Maybe, Yes).]

[Figure 12: The health covariates.]

Most of the OHS covariates display some degree of nonresponse though for the graphs given so far, the nonresponding portion of the sample was ignored. I will now take some space to characterize the nonresponse for the study data.

Nonresponse can be divided into unit and item nonresponse.
In the context of the OHS, unit nonresponse was either at the household level, where no personal interview took place, or at the household member level, where an individual’s self completed questionnaire was not obtained. Item nonresponse occurred when a respondent provided partial information. Here, item nonresponse manifests itself as a respondent who partially omitted information on the self completed part of the survey.

[Figure 12 panels: Body Mass Index (kg/m2); Energy Expenditure (kcal/kg/day; Inactive 68%, Moderate 16%, Active 14%); Well Being Score (composite score); Allergy (Yes, No).]

The OHS had a unit response rate of 88% for the personal interview and 77% for the self completed questionnaire. For the rest of the study I will consider only the 77% of respondents responding to the self completed questionnaire.

Figure 22 in the Appendix summarizes item nonresponse. The range is rather striking. Hovering around 15% are well being, energy expenditure and household income. For the next set of variables, from the number of cigarettes smoked to smoker type, item nonresponse stands at about 10%. Blue collar work, allergy, education and immigration show negligible nonresponse.

There are at least three plausible explanations for high nonresponse. First, the respondents’ sensitivity to the question has an effect on rates. The question on household income is a sterling example. Many Canadians may well feel uneasy about providing the information. For some the anxiety is culturally related; for others revealing income may be embarrassing; yet others may feel the government could use the survey as a device to nab tax evaders. Well being is another example: observed nonresponse increases as age increases.

Second, the length of the questionnaire will have an effect on both unit and item nonresponse. In the OHS pilot study, a short version of the survey resulted in higher overall response levels.
There is reason to believe that item nonresponse is also affected by questionnaire length. The fact that some household members skipped whole parts of the survey is suggestive.

Third, some variables are composite indicators and nonresponse can then result from a missing answer for only one question. The well being score is a weighted total of one positive and one negative statement for each of seven categories: energy, control of emotions, state of morale, interest in life, perceived stress, perceived health status and satisfaction about relationships. If any one of the fourteen questions went unanswered the well being score was coded as “not stated”. This phenomenon may explain the nonresponse for the two other composite variables, family score and energy expenditure.

Some of the variables seem to have achieved 100% response rates. In the case of the geographic indicators Public Health Unit (PHU) and stratum, data exists for all respondents because the variables were used in survey stratification. In other cases the response rate is artificial. An imputation method can, for example, fill in data where data is missing. Age and sex were imputed using a random assignment mechanism based on census information on a stratum’s population breakdown into age and sex categories.

Exposure Assessment     Effects Assessment
Level                   Hazard Identification
Distribution            Type of Effect
Number of People        Dose Response
Target Dose             Risk Characterization

Table 5: Types of pollution assessment studies.

From this look at item nonresponse we can agree that some of the variables are more reliable than others. Looking at the set of study variables together, only 55% of the respondents have given complete information. At the modelling stage provisions must be made to deal with the item nonresponse.

In summary the 1991 Ontario Health Survey is the source of the two response variables and twenty explanatory covariates.
The majority of covariates are categorical rather than continuous, reflecting the difficulties of obtaining good continuous measurements from a large scale survey. The set of explanatory variables covers a range large enough to build decent models. The one ominous omission is the pollution data to which I now turn.

5.2 Pollution Covariates

The measurement of internal dose of a pollutant is very important in pollution effects studies. In most cases, however, the internal dose is unknown. The theme of this section is how the available exposure data is related to an individual’s internal dose.

Consider the two broad classes of pollutant assessments shown in Table 5. The most commonly available data measures environmental releases or concentrations in specific media; exposure measurements such as the number of people exposed and their absorbed dose are relatively rare. Exposure data is therefore generally estimated by making assumptions that will allow pollution concentration data to be linked with individuals falling in a certain geographical area. For most pollution assessment studies the data is clearly imperfect. This study falls under the effects assessment heading as a risk characterization study. How good the needed assumptions are is uncertain, but at least one study, where NO2 exposure measured by personal monitoring devices was compared to station measurement, suggests station measurements are adequate for airborne pollutants (Hackney et al., 1992).

The contact between a person and a pollutant in an environmental medium is termed exposure. Exposure is completely described by the route by which the pollutant enters the body; the concentration of the incoming pollutant; the duration of the exposure; and the frequency of exposure. Most pollutant data measure pollutant concentration for one medium covering some geographic area (Sexton et al., 1992).

I make the following assumptions about the pollution data:

• The weather station sites have well calibrated measuring instruments. That is, at a given concentration of SO4, say, all stations would give closely corresponding readings.
• A six year average of site pollution is a good indicator of longer term averages.
• Average pollution levels do not fluctuate radically from decade to decade.
• Spatially, ambient aerosol pollution is homogeneously distributed within Public Health Units (PHU).
• Ambient aerosol pollution levels are proportional to an individual’s absorbed internal dose.
• Alternative sources and environments for the pollutant under study are negligible.
• Individuals within a PHU, the approximate equivalent of a Census Division, are spatially stable; migration from one PHU to the next is minimal.

[Figure 13: Typical graph of a station’s measurement of one pollutant. Horizontal axis: year (1983 to 1989). Plotted: observed monthly values and their mean, with an outlier marked.]

These assumptions make the available data appropriate for our analysis. The data come from 37 atmosphere monitoring stations unevenly scattered across Ontario. Not all stations measure all pollutants of interest.

The station data is used to predict pollution levels for the PHU centroids for which there are no stations. Brown, Nhu and Zidek (1993) have developed a methodology for spatially interpolated predictive distributions. The modification and implementation of their method, described in Duddek et al. (1994), produced the predicted monthly averages used in my analysis. In this section I will heuristically explain the steps taken to get the pollution estimates. The first step begins with the original measurements from which the predicted means are eventually estimated.

Nitrogen dioxide (NO2), ozone (O3), sulfur dioxide (SO2) and sulfate (SO4) are the four pollutants considered in this study. The daily measurements, taken over the six year period 1983-89, have been converted into monthly averages. Figure 13 is an example of a measurement of one pollutant at one station and indicates how the graphs are to be understood.
The more complete sets of graphs are given in the appendix by Figures 23 to 26.

Station measurements of NO2 are given in Figure 23. The plots of twelve of the thirteen stations are ordered by ascending station averages. By looking at the first and last stations one gets a sense of the mean range, in this case from 25 to 45 µg/m3 NO2. Low outliers appear in graphs eight and ten. Their existence is partly explained by many missing daily measurements. If at least one daily measurement is available for the month then the monthly mean is calculated. Nothing is done about measurement or transcription error since no information on data quality is available. If no measurements exist for the month a monthly mean is imputed. On the whole, the effect of outliers on the six year mean is minimal; the observed difference between the simple average and the winsorized mean eliminating one observation on each extreme is less than five percent.

Ozone measurements are given in Figure 24. The ozone data is better than the NO2 data in two respects. First, there are almost twice as many stations measuring ozone as NO2. Other conditions being equal we have more information available for ozone. Second, in contrast with NO2, a temporal pattern with peaks in summer and lows in winter is evident. This allows me to more easily identify poor stations by observing which deviate from the trend. The data from the station in the fourth row and the third column, for example, looks somewhat suspect.

Sulfur dioxide and sulfate measurements are next. Twice as many stations measure SO2 as SO4. No seasonal pattern appears in either. SO2 has the highest between station variability among pollutants.

Within each PHU we choose a centroid for which a monthly estimate will be produced. The estimate is supposed to be a good proxy for the internal dose experienced by people living in that PHU. Our interest lies in estimates of six year averages, broken down by summer and winter.
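As a minimal sketch of the seasonal averaging just described, the following Python collapses a set of monthly means into one summer and one winter average. The assignment of months to seasons (June through September as summer) is an assumption made only for illustration: this chapter works with a four month summer and an eight month winter but does not list the months. The function and variable names are hypothetical, and the example values are made up.

```python
from statistics import mean

# Assumption for illustration only: summer = June-September (4 months),
# winter = the remaining 8 months. The text does not list the months.
SUMMER_MONTHS = frozenset({6, 7, 8, 9})


def seasonal_averages(monthly):
    """Collapse {(year, month): monthly mean} into (summer, winter) averages."""
    summer = [v for (_, m), v in monthly.items() if m in SUMMER_MONTHS]
    winter = [v for (_, m), v in monthly.items() if m not in SUMMER_MONTHS]
    if not summer or not winter:
        raise ValueError("need at least one summer and one winter month")
    return mean(summer), mean(winter)


# Made-up monthly values for two years: 50 in summer months, 30 otherwise.
example = {
    (year, month): (50.0 if month in SUMMER_MONTHS else 30.0)
    for year in (1983, 1984)
    for month in range(1, 13)
}
summer_avg, winter_avg = seasonal_averages(example)
```

Because each season is averaged over its own months, the eight winter months do not dominate the summer figure, which is what a single twelve month average would allow.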
By dividing the estimates into two types we had hoped to capture subtle spatial differences that may exist between seasons. The final pollutant estimates are shown in Table 6.

                                  Summer                      Winter
Public Health Unit                  O3   NO2   SO2   SO4        O3   NO2   SO2   SO4
Eastern Ontario                   45.9  26.6  22.5  4.04      30.3  31.3  22.8  4.30
Ottawa-Carleton                   46.0  25.5  22.8  4.33      30.0  31.5  22.1  4.42
Leeds, Grenville and Lanark       46.1  25.5  19.7  4.75      29.5  30.8  22.1  4.55
Kingston, Frontenac, Lennox       46.3  25.6  18.0  5.24      28.5  30.5  21.2  4.51
Hastings and Prince Edward        45.9  25.6  17.6  5.26      28.4  30.9  20.7  4.54
Haliburton, Kawartha, Pine Ridge  46.3  28.2  16.9  4.88      29.1  33.1  18.8  4.78
Peterborough                      45.8  26.0  17.0  5.23      28.3  31.5  19.7  4.68
Durham                            48.1  30.7  16.6  5.89      27.9  32.8  19.2  4.92
York                              48.0  33.3  15.4  4.97      29.0  34.7  18.3  4.85
Toronto                           49.8  37.4  18.6  5.99      27.6  34.6  20.2  4.91
Peel                              48.7  36.1  16.0  4.89      28.7  35.7  18.7  4.73
Wellington, Dufferin and Guelph   49.0  37.8  16.2  4.37      29.6  37.4  19.0  4.66
Halton                            50.4  38.4  17.8  5.48      28.0  35.5  20.0  4.76
Hamilton-Wentworth                52.5  36.1  18.4  6.17      28.8  33.9  20.3  5.10
Niagara                           56.4  24.1  18.0  8.73      30.9  27.7  20.3  6.10
Haldimand-Norfolk                 57.5  25.5  18.3  8.75      31.1  28.7  20.1  6.25
Brant                             52.6  35.7  18.0  6.01      29.0  34.2  20.2  5.05
Waterloo                          49.4  39.7  16.5  4.67      28.4  37.7  19.8  4.47
Perth                             49.5  40.1  18.7  4.33      29.0  39.0  21.5  4.40
Oxford                            52.7  36.1  18.3  5.76      29.1  35.3  20.5  4.91
Elgin-St. Thomas                  56.5  30.1  20.4  7.60      30.3  32.7  20.9  5.57
Kent-Chatham                      57.8  29.5  22.1  8.46      30.2  35.2  22.5  5.48
Windsor-Essex                     58.9  28.4  22.9  9.23      30.1  36.8  24.1  5.51
Sarnia-Lambton                    55.3  37.0  29.1  5.89      30.0  39.0  25.3  5.11
Middlesex-London                  53.8  36.6  21.9  5.73      29.4  37.1  22.3  4.86
Huron                             49.4  40.5  20.3  4.10      29.8  39.6  22.8  4.52
Bruce-Grey-Owen Sound             47.5  37.1  16.1  3.63      31.7  37.7  19.0  4.66
Simcoe                            47.2  33.1  15.0  3.93      31.0  36.1  17.8  4.54
Muskoka-Parry Sound               44.7  31.8  17.7  3.27      28.5  36.0  23.2  2.90
Renfrew                           44.9  28.3  19.5  3.79      29.8  34.4  20.3  3.57
North Bay                         42.8  31.4  20.0  2.80      27.9  34.2  21.9  2.56
Sudbury                           38.6  33.3  33.3  1.96      26.4  35.7  30.1  2.15
Timiskaming                       43.5  32.3  25.1  2.63      28.0  34.0  23.6  2.47
Porcupine                         47.5  32.2  21.9  3.19      28.9  33.5  22.5  2.61
Algoma                            39.0  34.2  30.3  2.00      26.6  36.3  28.6  2.16
Thunder Bay                       43.7  33.9  24.7  2.49      27.0  35.3  25.3  2.11
Northwestern                      49.0  32.2  20.8  3.39      29.7  33.0  21.5  2.78

Table 6: Spatially interpolated ambient air pollution six-year averages (µg/m3).

We make two checks on the validity of the estimated data. First, for each pollutant we produce boxplots for estimated and actual values (Figure 29 in the Appendix). The range of observed values may be larger than that of the estimated values, in particular for nitrogen dioxide and sulfur dioxide, because some stations had high measurement variability. The spatial interpolation method corrects for that variability by emphasizing stable stations over highly variable ones.

The second check comes in the form of a display of estimates by PHU in Ontario. Figures 30 to 33 exhibit the distribution of pollution estimates for both summer and winter. Dark areas indicate higher pollution levels than lighter ones. The gradual nature of regional differences and higher estimates for the more heavily industrialized Great Lake PHUs give us confidence in the method.

Now that we have estimates for both summer and winter we must address whether it is worthwhile to keep two variables for one pollutant. If summer and winter pollution estimates are highly correlated then a combined pollution estimate should suffice. The strongest relationships between all summer and winter combinations for all four pollution variables are shown in Figure 27.

The linearity of the top three graphs stands out at once. Since the three pollutants, nitrogen dioxide, sulfur dioxide and sulfates, have similar ranges, data averaged over the year will probably do just as well as either of the two measures. Ozone is rather quirky, with values impressively larger in the four month summer than in the eight month winter.
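The question of whether one combined value can stand in for the summer and winter estimates can be explored with a short sketch. The Python below computes the Pearson correlation between summer and winter values and a month-weighted annual average, using five sulfate (SO4) rows copied from Table 6. The 4/12 and 8/12 weights follow the four month summer and eight month winter; the function names and the weighting scheme itself are assumptions for illustration, not the procedure used in this study.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def annual_average(summer, winter, summer_months=4, winter_months=8):
    """Month-weighted annual value (the weighting is an illustrative assumption)."""
    total = summer_months + winter_months
    return (summer_months * summer + winter_months * winter) / total


# Sulfate (SO4) estimates for five PHUs from Table 6: (summer, winter).
so4 = {
    "Toronto": (5.99, 4.91),
    "Niagara": (8.73, 6.10),
    "Windsor-Essex": (9.23, 5.51),
    "Sudbury": (1.96, 2.15),
    "Thunder Bay": (2.49, 2.11),
}
summer = [s for s, _ in so4.values()]
winter = [w for _, w in so4.values()]
r = pearson(summer, winter)
toronto_annual = annual_average(*so4["Toronto"])
```

A correlation close to one across all PHUs would support replacing the two seasonal values by a single annual figure; five rows are used here only to keep the example short.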
Since people are generally outside more in the summer, the summer ozone value will be the ozone representative value in the analysis.

The last five graphs show the strongest relationships between pollutants of differing types. Significantly, ozone shows up in each of the plots. Summer ozone is positively correlated with summer sulfur dioxide and, strangely enough, winter sulfates. On the bottom row, winter ozone shows a hint of a relationship with summer sulfur dioxide, winter sulfur dioxide and winter sulfates. From the limited data, nothing interesting shows up for nitrogen dioxide.

That concludes our description of the process from original station measurements to six year PHU estimates. We must now link the pollution data to the Ontario Health Survey data and proceed to the analysis.

5.3 Asthma versus Covariates

Once the pollution data is merged with the OHS data we are poised to start with the preliminary stages of the analysis. We can make a first assessment of the relationship between asthma and each of the explanatory covariates by looking at their plots (Figures 14 and 15).

The bars in the plots exhibit two characteristics. The first is the whitespace in the middle of the bar, marking the weighted estimate of the proportion. The second is the length of the bar, primarily indicating the number of observations used to calculate the proportion. The bar length, however, is not an explicit 95% confidence interval. A nominal 95% confidence interval would generally be the weighted proportion plus or minus two times the binomial standard error. With survey data we have the design effect (deff) mentioned in the description of OHS methodology. Since the OHS documentation suggests that the design effect for an estimate of the mean is around two, I have constructed the following intervals:

\[ \hat{p} \pm 2\sqrt{\mathrm{deff}(\hat{p})\,\frac{\hat{p}(1-\hat{p})}{n}} \;=\; \hat{p} \pm 2\sqrt{\frac{2\hat{p}(1-\hat{p})}{n}}. \]

Of course, by displaying so many plots we are implicitly making multiple comparisons.
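To make the interval construction concrete, here is a small sketch in Python. It widens the usual binomial interval by the square root of the design effect, taking deff = 2 as suggested by the OHS documentation. The function name, the multiplier of two standard errors, the clipping to [0, 1] and the example numbers are assumptions for illustration rather than the exact computation behind the plots.

```python
from math import sqrt


def deff_interval(p_hat, n, deff=2.0, z=2.0):
    """Nominal interval p_hat +/- z * sqrt(deff * p_hat * (1 - p_hat) / n).

    deff=2 follows the OHS documentation cited in the text; z=2 is the
    usual rough 95% multiplier. Both defaults are illustrative assumptions.
    """
    if not 0.0 <= p_hat <= 1.0 or n <= 0:
        raise ValueError("p_hat must be in [0, 1] and n must be positive")
    half_width = z * sqrt(deff * p_hat * (1.0 - p_hat) / n)
    # Clip to [0, 1] so the interval remains a valid range of proportions.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Hypothetical example: 5% weighted prevalence from 1,000 respondents.
low, high = deff_interval(0.05, 1000)
# The same estimate treated as a simple random sample (deff = 1).
srs_low, srs_high = deff_interval(0.05, 1000, deff=1.0)
```

With deff = 2 the interval is wider than its simple random sampling counterpart by a factor of the square root of two, which is the adjustment the design effect is meant to capture.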
I have not attempted to construct intervals that maintain an overall 95% confidence level but rather have plotted them in the spirit of an exploratory analysis.

Of the demographic variables, age is the most interesting. Asthma has a higher prevalence among the youngest members of the population. This suggests that some proportion of asthmatics are relabeled later on in life.

In the socioeconomic sphere household type, education and income show mild correlation with asthma. The household type ‘D’ represents single parent households, a category known to have a disproportionate number of poor, single women. The fact that household type, education and income concur strengthens the argument that socioeconomic factors are related to health outcomes.

[Figure 14: Asthma prevalence by covariates (I). Panels: Demographic Covariates (Stratum, Sex, Age in years, Immigrant); Socioeconomic Covariates (Household Type, Family Functioning, Blue Collar, Work Exposure, Income); Socioeconomic and Lifestyle Covariates (Education, Smoker Type, Current Smoker).]

[Figure 15: Asthma prevalence by covariates (II). Panels: Socioeconomic and Lifestyle Covariates (Cigarettes Smoked, Household Smokers, Duration Smoked in years, Alcohol Problem); Health Covariates (Body Mass Index, Energy Expenditure, Well Being Score, Allergy); Pollution Covariates (Nitrogen Dioxide, Ozone, Sulfur Dioxide, Sulfates).]

Work exposure merits some attention: observe the lower prevalence of asthma for the population exposed to dust and fumes in an occupational setting. The most likely explanation is that asthmatics try harder to avoid work that would exacerbate their condition.
This is a perfect example of the caution that must be heeded in making inferences with observational data.

The smoking variables display no interesting relationships but two of the health variables do, i.e. the well being score somewhat and the allergy indicator strongly. The presence of well being and allergy poses a problem to the analyst wishing to use them to explain asthma or emphysema. Are they reasonable covariates if they in fact are another measure of respiratory ailments? On the other hand they could be thought of as a way of significantly reducing the variation in the model and thereby increasing the power of hypothesis tests. The position I will take is to avoid bringing them into the model due to the problem of interpretation. It would hardly be right to say that a feeling of being ill brings on asthma.

Finally, the lack of pollution trends is revealing. Each estimated proportion is based on an average of over one thousand sampled individuals living within a PHU. The graphs raise questions about how any meaningful relationship between asthma and pollution could be discovered.

In summary, allergy shows the strongest relationship though its use as an explanatory covariate is questionable. Age, education and income show slight downward trends. In all, there are no compelling signs to indicate that we can fit a good model for asthma as the response variable.

5.4 Emphysema versus Covariates

In the same spirit Figures 16 and 17 show the prevalence of one or more of bronchitis, chronic cough or emphysema against each of the covariates. It is immediately obvious that the estimated prevalence rates exhibit greater differences here than for asthma.

As age increases so does the prevalence of emphysema. Perhaps this finding is not too surprising since the lungs lose some of their elasticity as time passes. None of the other demographic variables is helpful.

Of the socioeconomic variables education and income reiterate the oft observed relation between socioeconomic status and health.
The poorer one is, the more likely one is to be afflicted with medical conditions less prevalent in the upper socioeconomic order.

The smoking variables are strong explanatory covariates. Smoker type and current smoker status show us what we would expect given current knowledge: smoking is detrimental to the functioning of the respiratory system. Duration smoked and, to a lesser extent, the number of cigarettes smoked show distinct upward trends. The effect seems most pronounced for the long time smokers, i.e. the upper third of smokers. Functionally, a quadratic equation looks like it would fit best. Second hand smoke, partially measured through the number of household smokers, shows a slight upward trend as does high alcohol consumption. Undoubtedly any modeling of emphysema must incorporate at least one smoking covariate.

The measures of health status display some of the correlation we expect to see. Energy expenditure, well being score and allergy have gentle downward slopes. Correlation between the health covariates and smoking status could wipe out any of the effect we might attribute to the variables (also see the reservations stated in the previous section).

The pollution covariates again yield a negative result. The bars look as if they are randomly strewn over the range of estimated pollution values. The only hope in relating any of the pollution covariates to emphysema will be to model out as much as possible of the variation inherent in the estimates and to condition on other covariates.
That is the focus of the next two chapters.

Unlike the case of asthma, we anticipate that the modeling of emphysema will at least result in models with descriptive power; age, duration smoked and income are related marginally to emphysema. Whether pollution will show significance, however, remains to be seen.

[Figure 16: Emphysema prevalence by covariates (I). Panels: Demographic Covariates (Stratum, Sex, Age in years, Immigrant); Socioeconomic Covariates (Household Type, Family Functioning, Blue Collar, Work Exposure); Socioeconomic and Lifestyle Covariates (Education, Income, Smoker Type, Current Smoker).]

[Figure 17: Emphysema prevalence by covariates (II). Panels: Socioeconomic and Lifestyle Covariates (Cigarettes Smoked, Household Smokers, Alcohol Problem, Duration Smoked in years); Health Covariates (Energy Expenditure, Well Being Score, Allergy, Body Mass Index); Pollution Covariates (Nitrogen Dioxide, Ozone, Sulfur Dioxide, Sulfates).]

6 Methodology

The broad objective of epidemiology is to get at the etiology of disease. Cross sectional data will at best allow for a determination of association between the disease and related factors, causal or otherwise. The specific goal of this study is to identify and assess a number of possibly important explanatory factors contributing to chronic respiratory illness.

Unfortunately the relationship between chronic disease and etiologic agents is muddied by unsure diagnoses, changing environments, multiple causality, changing behavioral habits, the existence of other maladies, measurement error and lack of quality data, to give but a partial list.
Each component adds haze to the picture; in statistical jargon, the random component may overshadow the systematic component of the model. In other words, uncertainty threatens to obscure the underlying relationship and hence our understanding of that relationship. Statistical methods address issues of uncertainty and are therefore appropriate in this context.

Besides incorporation of error, the method used ought to be able to examine a number of factors at once. In the epidemiological literature, synergy refers to the phenomenon of two factors which produce a much greater effect in conjunction than if considered separately. Regression methods take care of these concerns. As a bonus, regression provides the possibility of incorporating spatial dependence and other such subtleties.

The Ontario Health Survey sampling design induces a probability structure that is important to address. Should I incorporate inclusion probabilities into the regression model? How can that be done? I begin the chapter by going over the salient differences between finite and infinite population inference.

Classic regression methods assume a continuous response vector. I use responses to the Ontario Health Survey that indicate the presence or absence of chronic respiratory illness. Therefore, before analysing the data, I ought to describe generalized linear models, the extension of classical methods that accommodates binary and binomial response vectors. In this chapter I also describe the model selection strategy and model checking procedures.

6.1 Finite vs. Infinite Population Inference

The Ontario Health Survey data arise from a sample survey of a human population. The sample survey has a complex design which induces inclusion probabilities associated with each element in the sample. Should I incorporate the inclusion probabilities in the analysis?
Before the question can be answered I must go over the basics of survey sampling theory.

Finite survey sampling theory gives a way to make inferences from a finite sample, representable as $s = \{1, \ldots, n_s\}$, to a finite universe $U = \{1, \ldots, N\}$. Generally inferential statements require knowledge of the probability structure resulting from the survey design. The setup differs from classical statistical theory in that classical theory assumes an infinite population.

With a finite population the set of all possible samples associated with the design, $S = \{s_1, \ldots, s_M\}$, is also finite. The probability that a particular sample is chosen, $p(s)$, defines the sample design. Two important probabilities at the element level are

$$\pi_k = P(k\text{th element sampled}) \quad \text{and} \quad \pi_{kj} = P(k\text{th and } j\text{th elements sampled}).$$

If both $\pi_k$ and $\pi_{kj}$ can be calculated then the design is said to be measurable. A measurable design allows for estimation of finite population totals and the variance of those estimates. Some papers in the literature deal with estimating finite population regression parameters and their variances. Binder (1983) describes finite parameter estimation in the generalized linear models context. What happens, however, when one wants to consider parameters, such as regression parameters, where the finite population parameters are of little interest?

Classic regression theory posits a linear model relating the continuous response variable to explanatory covariates:

$$y_k = x_k'\beta + \epsilon_k, \qquad k = 1, 2, \ldots, m.$$

Further, $\epsilon_1, \epsilon_2, \ldots, \epsilon_m$ are independently and identically distributed as $N(0, \sigma^2)$. To mesh classic regression theory with survey sampling theory, finite survey sampling theorists make use of a construct called the ‘superpopulation.’

[Figure 18: Empirical density of survey weights (density against survey weight).]

The finite population is essentially a partial realization of the superpopulation. I can estimate superpopulation parameters in one of two ways: with or without taking the survey design $p(s)$ into account.
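The two routes can be illustrated with a minimal sketch, assuming ordinary least squares for the route that ignores the design and least squares weighted by inverse inclusion probabilities for the route that incorporates it. The simulated data, the inclusion probabilities `pi_k` and the helper `fit_least_squares` are hypothetical and serve only to show the mechanics; they are not the OHS data, nor necessarily the estimators used later.

```python
import numpy as np

def fit_least_squares(X, y, weights=None):
    """Solve the normal equations (X'WX) beta = X'Wy; W = I when weights is None."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones(len(y)) if weights is None else np.asarray(weights, dtype=float)
    XtW = X.T * w  # multiplies the k-th column of X' by w_k, i.e. forms X'W
    return np.linalg.solve(XtW @ X, XtW @ y)

# Hypothetical sample from a straight-line superpopulation model with normal errors.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0.0, 10.0, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(0.0, 1.0, n)

# Hypothetical inclusion probabilities; each weight is the inverse probability.
pi_k = rng.uniform(0.05, 0.5, n)

beta_unweighted = fit_least_squares(X, y)
beta_weighted = fit_least_squares(X, y, weights=1.0 / pi_k)
# Equal weights of any size reproduce the unweighted fit (a self-weighting design).
beta_equal = fit_least_squares(X, y, weights=np.full(n, 3.0))

print("ignoring the design:     ", beta_unweighted)
print("incorporating the design:", beta_weighted)
```

When the model holds for every element, as in this simulation, both estimates target the same coefficients; they differ only through sampling variability.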
Here, $\hat{\beta}_{p(s)}$ will denote an estimator of $\beta$ which incorporates the inclusion probabilities and $\hat{\beta}$ will refer to an estimator which does not.

Unless the survey weights are more or less the same, a regression analysis incorporating survey weights may differ from an analysis which assumes a self-weighting design. The survey weight attached to every survey respondent reflects selection probabilities adjusted for nonresponse and age-sex population totals at the PHU level (Ontario Ministry of Health, 1992a). Examination of the OHS data shows that the PHUs with the most extremely weighted respondents are urban. Figure 18 shows the skewness of the weight distribution, where the weights have been rescaled so their expected value is one. Because of the skewness, the question of what to do about survey weights remains.

An ongoing debate is over which of the two estimators is better to use. Särndal (1992) handles the question in the following way. First, he notes $\hat{\beta}$ is the best linear unbiased estimator (BLUE). That is, for any conformable constant vector $c$,

$$E_\xi\{[c'(\hat{\beta} - \beta)]^2 \mid s, X\} \le E_\xi\{[c'(\hat{\beta}_{other} - \beta)]^2 \mid s, X\},$$

where $E_\xi$ denotes expectation under the superpopulation model. Under any sampling design $p(\cdot)$,

$$E_p E_\xi\{[c'(\hat{\beta} - \beta)]^2 \mid s, X\} \le E_p E_\xi\{[c'(\hat{\beta}_{other} - \beta)]^2 \mid s, X\}.$$

By the BLUE criterion, $\hat{\beta}$ is better than $\hat{\beta}_{p(s)}$. He argues, however, that $\hat{\beta}_{p(s)}$ may be preferable on the basis that $\hat{\beta}_{p(s)}$ is design consistent whereas $\hat{\beta}$ is not. Pfeffermann (1993) adds that weights protect against nonignorable sampling designs and misspecification of the model. Either way, a methodology which allows for possible prior weights is desirable.

The OHS documentation suggests the following:

The sample weights placed on the individual microdata tape records must be used when producing estimates from the survey data, including ordinary statistical tables. Otherwise, the estimates derived cannot be considered to be representative of the survey population, and will not correspond to those produced by the Ministry of Health or other users of the data.
Users are particularly cautioned about releasing unweighted tables or performing any analysis on unweighted data (Ontario Ministry of Health, 1992b, p. 3).

One way around the controversy is to try the analysis both ways (Fay, 1984). For some estimates, such as the marginal distribution of age shown in Figure 19, there is little practical difference between including and excluding weights.

[Figure 19: Marginal distribution of age using and ignoring survey weights. Legend: no survey weights; includes survey weights.]

A nagging difficulty with the superpopulation approach is the assumption of independence between elements. In a realistic application such as the Ontario Health Survey, the units are spatially correlated. Part of the correlation is induced by the clustering and stratification used in the sampling design. Adjustments of the classical $\chi^2$ and likelihood ratio tests are necessary to protect against invalid test statistics (Kumar and Rao, 1984). The assumption of independence normally results in the underestimation of parameter variance.

In conclusion, the modelling of survey data introduces a twist into the analysis. A superpopulation model must be invoked which results in two estimation possibilities: incorporate or ignore inclusion probabilities induced by the survey sampling design.

6.2 Generalized Linear Models Theory

Generalized linear models owe their popularity to the revolution in computer technology. Without a mechanism for solving nonlinear equations using an iterative numerical algorithm, the researcher faces the task of working everything out by hand. Today, with relatively low computational costs, the availability of software like GLIM, SAS and S, and the high speed of computers, the use of the generalized linear model has become quite feasible.
Appropriate to such auspicious beginnings, theoreticians have been focusing intensely on generalized linear model problems since the late 1970s; today much of the theory is standardized in “classics” like Generalized Linear Models (McCullagh and Nelder, Second Edition, 1989).

Generalized linear models are an extension of classical linear models. The most important change is that the response variable $Y$ no longer needs to be continuous and normally distributed. As long as $Y$ can adequately be described by a member of the exponential family, generalized linear models theory will apply. The Poisson, binomial, gamma and exponential distributions are examples suggesting the range of possibility for applications.

There are three components to a generalized linear model (GLM):

1. The random component, denoted by the response vector $Y$, is defined by an assumed distributional form. Each element of the response vector belongs to the same exponential family with $E(Y) = \mu$.

2. The systematic component is a linear function of the explanatory variables $x_j$ where $j = 1, 2, \ldots, p$. The coefficient vector $\beta$ is of great scientific interest since it is used as the criterion for interpreting the importance of covariates as predictors. The systematic component is summarized by $\eta = \sum_{j} x_j\beta_j = X\beta$.

3. The link function $g(\cdot)$ relates the random and systematic components. Simply put, $\eta = g(\mu)$.

Independence between elements of the response vector is assumed.

The assumed distribution of $Y$ is supposed to adequately model the observed vector $y$. As mentioned, the exponential family includes most well known probability distributions. Properly defined, the exponential family covers all probability density functions and probability mass functions of the form

$$f_Y(y; \theta, \phi) = \exp\left\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right\} \tag{2}$$

for given $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$. By convention, $\theta$ denotes the canonical parameter. When $a(\phi) = \phi/w$, $\phi$ is called the dispersion parameter, where $w$ is a known prior weight associated with the observation $y$.
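To make form (2) concrete, the following sketch, which is illustrative only, writes the Poisson mass function in exponential family form with $\theta = \log\mu$, $b(\theta) = e^\theta$, $a(\phi) = 1$ and $c(y, \phi) = -\log y!$, and checks numerically that it agrees with the usual expression. The choice of the Poisson case and all function names are assumptions made for the example.

```python
import math

def poisson_pmf(y, mu):
    """Usual form of the Poisson probability mass function."""
    return math.exp(-mu) * mu**y / math.factorial(y)

def exponential_family_pmf(y, theta, b, a_phi, c):
    """Evaluate exp{[y*theta - b(theta)] / a(phi) + c(y, phi)} as in form (2)."""
    return math.exp((y * theta - b(theta)) / a_phi + c(y))

def poisson_via_family(y, mu):
    """Poisson case: theta = log(mu), b = exp, a(phi) = 1, c(y) = -log(y!)."""
    theta = math.log(mu)
    return exponential_family_pmf(
        y, theta, b=math.exp, a_phi=1.0, c=lambda v: -math.lgamma(v + 1)
    )

mu = 2.5  # hypothetical mean
for y in range(6):
    print(y, poisson_pmf(y, mu), poisson_via_family(y, mu))
```

The two columns printed for each count should agree up to floating point rounding.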
Until the dispersion parameter comes up explicitly again, I will assume, for the purposes of my argument, that $\phi$ is known. Questions of estimation then pertain only to the parameter $\theta$.

How can $\theta$ be estimated once a distributional assumption about $Y$ is made? If $f_Y(y; \theta, \phi)$ is regarded as a function of $\theta$ alone then one can maximize the likelihood function $L(\theta) = f(\theta; y, \phi)$ to find an estimator by the maximum likelihood approach. As long as the maxima do not occur at the endpoints of the parameter space, the log likelihood

$$l = \log L(\theta) = \log f(\theta; y, \phi) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \tag{3}$$

will have the same local maxima as $L(\theta)$. The solution for $\theta$ is then the root of $\partial l/\partial\theta = 0$.

The log likelihood for $Y$, $l = \log L(\theta)$, takes the form of a summation over the sample. There are two important properties of the log likelihood:

$$E_\theta\left(\frac{\partial l}{\partial\theta}\right) = 0 \tag{4}$$

and

$$E_\theta\left(\frac{\partial^2 l}{\partial\theta^2}\right) + \mathrm{var}_\theta\left(\frac{\partial l}{\partial\theta}\right) = 0. \tag{5}$$

The statistic $\partial l/\partial\theta$ is known as the score. Fisher's information

$$I(\theta) = \mathrm{var}_\theta\left(\frac{\partial l}{\partial\theta}\right) = -E_\theta\left(\frac{\partial^2 l}{\partial\theta^2}\right)$$

is derived from (5). Note that, in the exponential family, the reciprocal of Fisher's information is largest when the variance is smallest. An analogous inversion takes place when computing standard errors for the estimated coefficient vector $\beta$. Thus, the mathematical expression for ‘information’ meshes nicely with intuition: decreasing uncertainty, i.e. increasing certainty, implies the notion of increasing confidence about the information latent in the data. From (3),

$$E_Y\left(\frac{\partial l}{\partial\theta}\right) = E_Y\left(\frac{Y - b'(\theta)}{a(\phi)}\right) = 0 \;\Rightarrow\; E(Y) = b'(\theta),$$

$$E_Y\left[\left(\frac{\partial l}{\partial\theta}\right)^2\right] = E_Y\left\{\frac{[Y - b'(\theta)]^2}{a^2(\phi)}\right\} = \frac{\mathrm{var}(Y)}{a^2(\phi)},$$

$$E_Y\left(\frac{\partial^2 l}{\partial\theta^2}\right) = E_Y\left(\frac{-b''(\theta)}{a(\phi)}\right) = \frac{-b''(\theta)}{a(\phi)}$$

and

$$E_Y\left(\frac{\partial^2 l}{\partial\theta^2}\right) + E_Y\left[\left(\frac{\partial l}{\partial\theta}\right)^2\right] = 0 \;\Rightarrow\; \mathrm{var}(Y) = b''(\theta)\,a(\phi).$$

A couple of remarks are in order. Assume $a(\phi)$ is of the form $a(\phi) = \phi/w$ where $w$ is a known prior weight. First, $E(Y)$ does not depend on $a(\phi)$. Second, the variance of $Y$ depends both on the dispersion parameter and the prior weight. One possibility for the weights is the survey inclusion probabilities associated with the components of $y$.

So far the discussion has shied away from the coefficient vector $\beta$. Observe, however, that

$$\mu = E(Y) = b'(\theta) = g^{-1}(\eta) = g^{-1}(X\beta). \tag{6}$$

If the maximum likelihood estimate (MLE) of $\theta$ can be computed then in principle a similar approach should work for finding an estimate of $\beta$. Indeed, an application of the principle leads to the Newton Raphson method (Collett, 1992, Appendix B). Let $u(\beta)_{q \times 1}$ be the vector of scores where the $j$th element is $\partial l(\beta)/\partial\beta_j$. The objective is to satisfy the maximum likelihood criterion

$$u(\hat{\beta}) = \left.\frac{\partial l(\beta)}{\partial\beta}\right|_{\beta = \hat{\beta}} = 0. \tag{7}$$

Unfortunately no closed form solution is possible when $g(\mu)$ is nonlinear since $g'(\mu)$ will not reduce to a constant. In other words, the MLE for $\mu$, a function of $\beta$, has no analytic solution. A numerical solution can be obtained, however, by linearization. The first order Taylor linearization of $u$ about the vector $\beta_0$ is

$$u(\hat{\beta}) \approx u(\beta_0) + H(\beta_0)(\hat{\beta} - \beta_0) \tag{8}$$

where $H(\beta_0)_{q \times q}$ is the Hessian matrix with the $(i, j)$th element given by $\partial^2 l(\beta)/\partial\beta_i\partial\beta_j$ evaluated at $\beta_0$. From (7) and (8), $u(\beta_0) + H(\beta_0)(\hat{\beta} - \beta_0) \approx 0$ implies

$$\hat{\beta} \approx \beta_0 - H^{-1}(\beta_0)\,u(\beta_0).$$

In more standard notation,

$$\hat{\beta}_{r+1} = \hat{\beta}_r - H^{-1}(\hat{\beta}_r)\,u(\hat{\beta}_r) \tag{9}$$

where $\hat{\beta}_r$ is now used to estimate $\hat{\beta}_{r+1}$. The last equation is more obviously in the form of the iterative Newton Raphson procedure for obtaining maximum likelihood estimates of $\beta$.

An alternative to Newton Raphson is Fisher's method of scoring. Here the negative Hessian $-H(\beta)$ in (9) is replaced with the information matrix $I(\beta)$. For the $(i, j)$th element,

$$-\frac{\partial^2 l(\beta)}{\partial\beta_i\partial\beta_j} \quad \text{is replaced with} \quad -E\left[\frac{\partial^2 l(\beta)}{\partial\beta_i\partial\beta_j}\right],$$

so that the update becomes $\hat{\beta}_{r+1} = \hat{\beta}_r + I^{-1}(\hat{\beta}_r)\,u(\hat{\beta}_r)$.

An advantage of calculating $I(\beta)$ is that the inverse is the asymptotic covariance matrix of the MLE for $\beta$. Although both methods converge to the same $\hat{\beta}$, $I^{-1}(\beta)$ will not necessarily be identical to $-H^{-1}(\beta)$ for all distributions of the exponential family nor for all link functions.

In conclusion, standardized methodology exists for non-normal data. The likelihood function plays a central role, making coefficient estimation fairly straightforward. The method can handle subtleties like unequal weighting.
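The iterative scheme in (9) and its Fisher scoring variant can be sketched for a logistic regression with prior weights. This is a minimal illustration under stated assumptions rather than the routine used in the analysis: the information matrix is taken to be $X'\mathrm{diag}\{w_i\pi_i(1-\pi_i)\}X$, the negative Hessian of the weighted log likelihood, which for the logit link does not depend on $y$, so that Newton Raphson and Fisher scoring coincide. The function name `fisher_scoring_logit` and the simulated data are hypothetical.

```python
import numpy as np

def fisher_scoring_logit(X, y, w=None, tol=1e-10, max_iter=50):
    """Fit a logistic regression with prior weights by Fisher scoring.

    Returns the coefficient estimate and the inverse information matrix.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.ones(len(y)) if w is None else np.asarray(w, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (w * (y - pi))               # u(beta)
        info = (X.T * (w * pi * (1.0 - pi))) @ X   # I(beta), equal to -H(beta) here
        step = np.linalg.solve(info, score)
        beta = beta + step                         # beta_{r+1} = beta_r + I^{-1} u
        if np.max(np.abs(step)) < tol:
            break
    # Recompute the information at the final estimate for the covariance matrix.
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    info = (X.T * (w * pi * (1.0 - pi))) @ X
    return beta, np.linalg.inv(info)

# Hypothetical data: one covariate, coefficients (-1.0, 0.8) used for simulation.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
prob = 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * x)))
y = (rng.uniform(size=n) < prob).astype(float)

beta_hat, cov_hat = fisher_scoring_logit(X, y)
print("estimate:       ", beta_hat)
print("standard errors:", np.sqrt(np.diag(cov_hat)))
```

For an intercept-only model the iteration converges to the logit of the (weighted) sample proportion, which gives a simple check on the implementation.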
As important, the methodology has been implemented in major software packages.

6.3 Modelling Binary Data

Now that I have sketched the important components of generalized linear model theory I will focus on the models specifically related to the OHS data. The micro level data give the presence or absence of chronic respiratory disease for every sampled individual. Two approaches to analysis suggest themselves. One is to carry out a binary variable analysis while the other is to aggregate the data first and then work with the resulting binomial observations. In this section I will delve into the details of the likelihood equation, link function and parameter estimation for the binary case. The notation I will use is given in Table 7.

Symbol                  Description
$n$                     Sample size
$\pi$                   Probability of disease occurrence
$\boldsymbol{\pi}$      Probability vector
$\hat{\pi}$             Maximum likelihood estimate of $\pi$
$Y$                     Random variable distributed as $B(1, \pi)$
$\mathbf{Y}$            Random vector of binary observations
$y$                     A realization of the random variable $Y$
$w$                     A prior weight attached to the random variable $Y$
$\mathbf{y}$            A realization of the random vector $\mathbf{Y}$
$\beta$                 Coefficient vector $(\beta_1, \beta_2, \ldots)$
$\hat{\beta}$           Maximum likelihood estimate of $\beta$

Table 7: Definition of symbols used for the modelling of binary data.

A binary variable can take on one of two values:

$$y = \begin{cases} 1 & \text{when chronic respiratory illness exists,} \\ 0 & \text{otherwise.} \end{cases}$$

The probability $\pi = P(Y = 1)$ is known as the Bernoulli probability of success and results in the probability mass function

$$f_Y(y; \pi) = \pi^y(1 - \pi)^{1-y} = \left(\frac{\pi}{1 - \pi}\right)^y(1 - \pi).$$

When a weight $w_i$ is attached to each $y_i$ the joint probability mass function appears as

$$f_Y(y; \pi) = \prod_i f_Y(w_i, y_i; \pi_i) \tag{10}$$

$$= \exp\left\{\sum_i w_i\left[y_i\log\left(\frac{\pi_i}{1 - \pi_i}\right) + \log(1 - \pi_i)\right]\right\} \tag{11}$$

where $\sum_i w_i = 1$. The distribution in (11) is a member of the exponential family given in (2) upon setting $\theta_i = \log[\pi_i(1 - \pi_i)^{-1}]$. More precisely,

$$f(y; \theta) = \exp\left\{\sum_i w_i\left[y_i\theta_i - \log(1 + e^{\theta_i})\right]\right\} \tag{12}$$

because $\log(1 + e^{\theta}) = \log[1 + \pi(1 - \pi)^{-1}] = -\log(1 - \pi)$.

The algorithm for estimating $\beta$ reduces to an iterative weighted least squares algorithm. I am going to give a detailed exposition for this case.
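The equivalence of (11) and (12) can be checked numerically. The sketch below, with hypothetical responses, probabilities and weights, evaluates the logarithm of each expression and confirms that the two agree once $\theta_i = \log[\pi_i/(1 - \pi_i)]$; the function names are assumptions made for the example.

```python
import numpy as np

def loglik_direct(y, pi, w):
    """Logarithm of (11): sum_i w_i [y_i log(pi_i / (1 - pi_i)) + log(1 - pi_i)]."""
    return np.sum(w * (y * np.log(pi / (1.0 - pi)) + np.log(1.0 - pi)))

def loglik_canonical(y, theta, w):
    """Logarithm of (12): sum_i w_i [y_i theta_i - log(1 + exp(theta_i))]."""
    return np.sum(w * (y * theta - np.log1p(np.exp(theta))))

# Hypothetical binary responses, success probabilities and prior weights.
y = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
pi = np.array([0.7, 0.2, 0.4, 0.9, 0.1])
w = np.array([0.10, 0.30, 0.20, 0.25, 0.15])  # chosen to sum to one

theta = np.log(pi / (1.0 - pi))  # canonical parameter
print(loglik_direct(y, pi, w), loglik_canonical(y, theta, w))
```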
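Similar arguments can be found in McCullagh and Nelder (1989) and Collett (1991). The outline is simple: only the score function $u(\hat{\beta}_r)$ and Fisher's information matrix $I(\hat{\beta}_r)$ are needed to derive $\hat{\beta}_{r+1}$.

With binary data, the log likelihood is

$$l = \sum_i w_i\left[y_i\theta_i - \log(1 + e^{\theta_i})\right].$$

Using the chain rule to derive the score function for $\beta_j$,

$$\frac{\partial l(\beta)}{\partial\beta_j} = \sum_i \frac{\partial l_i}{\partial\theta_i}\frac{\partial\theta_i}{\partial\pi_i}\frac{\partial\pi_i}{\partial\eta_i}\frac{\partial\eta_i}{\partial\beta_j} = \sum_i w_i(y_i - \pi_i)\cdot\frac{1}{\pi_i(1 - \pi_i)}\cdot\frac{1}{g'(\pi_i)}\cdot x_{ij} = \sum_i \frac{w_i(y_i - \pi_i)}{\pi_i(1 - \pi_i)\,g'(\pi_i)}\,x_{ij}.$$

For Fisher's information first note that $E(y_i - \pi_i)(y_{i'} - \pi_{i'}) = 0$ for all $i \ne i'$ by the assumed independence between observations. When $i = i'$, $E(y_i - \pi_i)^2 = \pi_i(1 - \pi_i)$. Thus,

$$-E\left[\frac{\partial^2 l(\beta)}{\partial\beta_j\partial\beta_k}\right] = E\left[\frac{\partial l}{\partial\beta_j}\frac{\partial l}{\partial\beta_k}\right] = E\left[\sum_i \frac{w_i(y_i - \pi_i)}{\pi_i(1 - \pi_i)\,g'(\pi_i)}\,x_{ij}\sum_{i'} \frac{w_{i'}(y_{i'} - \pi_{i'})}{\pi_{i'}(1 - \pi_{i'})\,g'(\pi_{i'})}\,x_{i'k}\right] = \sum_i \frac{w_i^2}{\pi_i(1 - \pi_i)[g'(\pi_i)]^2}\,x_{ij}x_{ik}.$$

Grouping terms helps reduce the estimation procedure to iterative weighted least squares. Let the multiplier of $x_{ij}x_{ik}$,

$$W_i(\pi_i) = \frac{w_i^2}{\pi_i(1 - \pi_i)[g'(\pi_i)]^2},$$

determine the elements of the diagonal $n \times n$ matrix $W$. Clearly $I(\beta) = X'WX$. The score function is simplified by setting $\tilde{y}_i(\pi_i) = [g'(\pi_i)(y_i - \pi_i)]/w_i$. Then $u(\beta) = X'W\tilde{y}(\pi)$.

Finally, everything can be combined. Since the MLE $\hat{\pi}$ depends on the estimated value of $\beta$ and vice versa, all terms dependent on $\pi$ and $\beta$ get a subscript $r$:

$$\begin{aligned}
\hat{\beta}_{r+1} &= \hat{\beta}_r + I^{-1}(\hat{\beta}_r)\,u(\hat{\beta}_r) \\
&= \hat{\beta}_r + (X'W_rX)^{-1}X'W_r\tilde{y}_r(\hat{\pi}_r) \\
&= (X'W_rX)^{-1}(X'W_rX)\hat{\beta}_r + (X'W_rX)^{-1}X'W_r\tilde{y}_r(\hat{\pi}_r) \\
&= (X'W_rX)^{-1}X'W_r\left[X\hat{\beta}_r + \tilde{y}_r(\hat{\pi}_r)\right].
\end{aligned}$$

This is exactly the form of iterative weighted least squares regression on $[X\hat{\beta}_r + \tilde{y}_r(\hat{\pi}_r)]$.

To this point I have made no mention of the specific form of the link function. The logit, probit and complementary log-log link functions are the three most commonly used for binary and binomial data. Recall that

$$E(Y) = b'(\theta) = \frac{1}{1 + e^{-\theta}} = \pi.$$

The advantage of employing one of the three links lies in their mapping of $(0, 1)$ onto $\mathbb{R}$. A link without this property could mean, for a given $\eta_0$, that $g^{-1}(\eta_0) < 0$ or $g^{-1}(\eta_0) > 1$ even though $0 \le \pi \le 1$. Since $\eta = g(\pi) = \mathrm{logit}(\pi) = \theta$, the logit link is also known as the canonical link. The probit link is defined as $g(\pi) = \Phi^{-1}(\pi)$ where $\Phi^{-1}(\cdot)$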
is the inverse cumulative distribution function for the standard normal distribution and the complementary log-log function is given as $g(\pi) = \log\{-\log(1 - \pi)\}$. I will generally use the logit link for analysis due to its interpretability as the log odds of success.

7 Asthma Analysis

In this chapter we approach model building by dividing the covariates into groups and performing separate arcsine analyses. Besides allowing for standard diagnostic checking, each analysis provides us ultimately with a subset of covariates useful in constructing a ‘final’ model. This is an alternative to the more commonly used stepwise regression technique.

The arcsine analysis does not allow for the incorporation of survey weights. We therefore opt to use generalized linear models when we add the pollution terms.

7.1 Arcsine Analysis

From Figures 14 and 15 we note the lack of differences for all of the covariates except allergy. Allergy, however, is a questionable covariate since we would hesitate to say allergy causes asthma. Without at least one significant term in a model, asthma will remain unexplained by all covariates including the pollution terms.

With the study variables split into five coherent groupings, an arcsine analysis is possible. This approach transforms the response variable so that it is approximately normally distributed. Then a “classical” regression analysis is possible. The advantage of this approach comes from being able to apply standard diagnostic procedures for model checking. Survey weights will be ignored in this part of the analysis.

If $Y \sim B(n, \pi)$ is a binomial random variable then the normal distribution $N(n\pi, n\pi[1 - \pi])$ approximates $Y$ well if the sample size is large enough. Normal regression theory requires, however, that the response variable has a constant variance, i.e. $Y \sim N(E(Y), \sigma^2)$, rather than one dependent on $\pi$.
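The dependence of the variance on $\pi$, and the way the arcsine square root transformation removes it, can be checked with a short calculation. The sketch below is illustrative only: it computes exact variances from the binomial mass function for a hypothetical group size of $n = 200$ and three hypothetical values of $\pi$, and compares the variance of the transformed proportion with the standard large-sample value $1/(4n)$.

```python
import math

def binomial_pmf(k, n, pi):
    """P(Y = k) for Y ~ Binomial(n, pi)."""
    return math.comb(n, k) * pi**k * (1 - pi) ** (n - k)

def variance_of(transform, n, pi):
    """Exact variance of transform(Y / n) for Y ~ Binomial(n, pi)."""
    probs = [binomial_pmf(k, n, pi) for k in range(n + 1)]
    values = [transform(k / n) for k in range(n + 1)]
    mean = sum(p * v for p, v in zip(probs, values))
    return sum(p * (v - mean) ** 2 for p, v in zip(probs, values))

def arcsine_sqrt(p):
    """Arcsine square root transformation of a proportion."""
    return math.asin(math.sqrt(p))

n = 200  # hypothetical group size
for pi in (0.1, 0.3, 0.5):
    raw = variance_of(lambda p: p, n, pi)    # equals pi * (1 - pi) / n
    stabilized = variance_of(arcsine_sqrt, n, pi)
    print(f"pi={pi}: var(p)={raw:.6f}, var(arcsine)={stabilized:.6f}, 1/(4n)={1 / (4 * n):.6f}")
```

The variance of the raw proportion changes with $\pi$, whereas the variance of the transformed proportion stays close to $1/(4n)$ for each value considered.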
The arcsine transformation removes the dependence of the variance on the proportion.

The arcsine transformation derives from the so-called “delta method” of classical statistics. Let $\psi(\cdot)$ be a transformation and $p = Y/n$. The first order Taylor series expansion about $\pi$ is

$$\psi(p) \approx \psi(\pi) + \psi'(\pi)(p - \pi).$$

Then approximately,

$$E[\psi(p)] = \psi(\pi) \quad \text{and} \quad \mathrm{Var}[\psi(p)] = E[\psi'(\pi)(p - \pi)]^2 = [\psi'(\pi)]^2\,\mathrm{Var}(p).$$

Let $\psi(\pi) = \sin^{-1}\sqrt{\pi}$. Since $\psi'(\pi) = 1/\left(2\sqrt{\pi(1 - \pi)}\right)$,

$$\mathrm{Var}[\psi(p)] = \frac{1}{4\pi(1 - \pi)}\cdot\frac{\pi(1 - \pi)}{n} = \frac{1}{4n}.$$

Thus, $Y^* \sim N(\sin^{-1}\sqrt{\pi}, 1/4n)$ where $y^* = \sin^{-1}(\sqrt{p})$. The linear regression analysis becomes a weighted regression analysis.

Our criterion for the arcsine analysis is the standard t-test for each of the covariates. An unusually low p-value suggests the covariate could be important in the final model. The best models for each of the groupings are given in Table 8. Note that although many of the covariates passed this test for admission to the final model, none of the models does a very good job at fitting the data. At best, the models explain about 20% of the observed variation. This leaves another 80% unexplained!

Group           Model                              df    r2
Demographic     sex + age + age2                   116   0.15
Socioeconomic   dust exposure + income             46    0.18
Lifestyle       smoker type + drinker              142   0.18
Health          well being + family functioning    508   0.04

Table 8: The best arcsine models fitted for asthma.

The tables do not tell the whole story, however. For each of the models, we produced standard diagnostic plots (see Figures 34 to 36 in the Appendix). The first noticeable feature is how the sample spreads unevenly in the n-way table. The leverage for most models looks reasonable when compared to the ‘2p/n’ line. In general, if a leverage point lies greatly above the line the model becomes suspect.

Term           $\chi^2$   df   $P(\chi^2 > x)$
Age + Age2     52.4       2    0.000
Income         42.7       2    0.000
Smoker Type    21.2       4    0.000
Sex            12.9       1    0.000
NO2            7.3        1    0.007
O3             1.9        1    0.169

Table 9: Most significant terms in the unweighted logit asthma model.

The real test of a model, however, is provided by the residuals.
The quantile-quantile plot of normality shows how well the assumption of normality is met. All of the models fare somewhat poorly in this regard. Caution must be exercised in assessing the model.

The socioeconomic covariates of well being and family functioning score do not, on reflection, seem to offer much in terms of interpretability of the model. Thus the final model was built from all covariates but those two. The full model consists therefore of ‘age’, ‘sex’, ‘dust exposure’, ‘income’, ‘drinker’ and ‘smoker type’.

7.2 Logistic Modeling

For fitting the final models, generalized linear model theory is used. The criterion for covariate inclusion in the model is a goodness of fit test. From theory, models can be compared on the basis of differences in model deviance. Mathematically, if $L_c$ is the estimated likelihood of the current model and $L_f$ the estimated likelihood for the full model, the deviance is conventionally given as

$$D = -2\log(L_c/L_f).$$

Nested models can be compared by focusing on the difference in deviances. The difference is asymptotically distributed as $\chi^2$ with degrees of freedom equal to the difference in degrees of freedom between the nested models.

The overall fit for the full model is $\chi^2 = 120$ on 12 degrees of freedom. Once the full model is fitted, forward and backward updating of the model allows us to test for each term. The unweighted analysis produces Table 9.

Term            $\chi^2$   df   $P(\chi^2 > x)$
Age + Age2      71.8       2    0.000
Income          25.1       2    0.000
Smoker Type     21.6       4    0.000
Sex             10.4       1    0.001
Drinker         12.7       2    0.002
Work Exposure   4.6        1    0.032
NO2             1.7        1    0.196

Table 10: Most significant terms in the weighted logit asthma model.

Age is the most significant term in the model. The fitted values over the domain of ages covered in the sample show a decreasing trend for asthma with a range of about 3%. The next terms are income, smoker type and sex.
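The tail probabilities attached to deviance differences can be reproduced with a few lines of code. The sketch below is an illustration, not the software used for the analysis: it uses the closed forms of the $\chi^2$ upper tail that exist for one degree of freedom and for even degrees of freedom (a general routine such as `scipy.stats.chi2.sf` would normally be used instead), and the function name `chi2_sf` is an assumption. It is applied to two statistics taken from Tables 9 and 10.

```python
import math

def chi2_sf(x, df):
    """Upper tail P(chi-square_df > x) for df = 1 or even df, via closed forms."""
    if df == 1:
        # A chi-square on 1 df is a squared standard normal variable.
        return math.erfc(math.sqrt(x / 2.0))
    if df % 2 == 0:
        # For even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        half = x / 2.0
        return math.exp(-half) * sum(half**j / math.factorial(j) for j in range(df // 2))
    raise ValueError("only df = 1 or even df are handled in this sketch")

# Deviance differences and degrees of freedom taken from Tables 9 and 10.
terms = [
    ("NO2, unweighted model", 7.3, 1),
    ("Work Exposure, weighted model", 4.6, 1),
]
for label, statistic, df in terms:
    print(f"{label}: chi2 = {statistic}, df = {df}, p = {chi2_sf(statistic, df):.3f}")
```

The printed values should be close to the 0.007 and 0.032 reported in the tables; small discrepancies elsewhere in the tables can arise because the tabulated statistics are rounded to one decimal place.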
The model suggests that having lower income, being a former smoker and being female are associated with higher asthma prevalence.

The model also suggests that NO2 is associated with asthma prevalence. The problem is that its estimated coefficient suggests that NO2 is negatively related to asthma. That is to say, the more NO2 in the air, the fewer asthmatics you would expect to observe; the range of the downward trend in fitted values is about 1% over the interval of NO2 readings observed in Ontario. Two comments are necessary. First, the models have 50,000 observations from which to fit a maximum likelihood estimate and a spurious result is possible. Second, we can check the model estimates by comparing them to models which have incorporated the survey weights.

Logistic regression modeling allows for the inclusion of a weighting structure, i.e. survey weights representing inclusion probabilities. The same analysis was run with the weights, producing Table 10. It is comforting to see that the first four significant terms in the unweighted model show up in the weighted model in the same order. In the weighted case, however, drinking and work exposure eclipse NO2 as important model variables. In fact NO2 no longer appears to be significant. The changes reinforce our suspicions about NO2 as a truly significant covariate.

In summary, none of the pollution covariates appears to be significantly related to asthma prevalence. NO2 gave the strongest signal of the pollution measures but had a questionable negative estimated coefficient and was not robust enough to show consistency between weighted and unweighted logistic analyses.

8 Emphysema Analysis

In this chapter we model the cross sectional association of respiratory morbidity and air pollution, using the 1990 Ontario Health Survey data. For convenience, we call the binary response variable “emphysema.” However, that term will refer collectively to any of emphysema, chronic cough and/or chronic bronchitis.
The etiology of chronic respiratory disease, especially as regards the effect of ambient air pollutants, is currently unknown and hence hotly debated. We hope our analysis will contribute usefully to that debate.

In the first step of our analysis, we select appropriate covariates, to increase the model’s sensitivity to the effect of pollution while avoiding collinearity and extreme data spread. We use monthly pollutant data derived from the multivariate spatial interpolation methodology described in the last section. In our first approach, we calculate the average summer and winter six-year pollution levels for all four pollutants by Public Health Unit (PHU). The resulting values, unlike those from our second approach described next, resemble the measurements taken at the air monitoring stations; the analysis will lend itself to easy interpretation. In our second, conceptually more complex approach, we use principal components to construct a pollution index which favours summer pollution levels and represents a variety of pollutants. We obtain the weighting scheme implicitly by using averages of the eight-month winter and the four-month summer. This leads to eight estimates per Public Health Unit (PHU). A principal components analysis creates the index which extracts the maximal amount of information in the eight estimates.

In our last step we evaluate the significance of the pollution variables. We base our evaluation on the stepwise addition of pollution variables to the covariate model built in the first step. Using the same steps, we go through with an analysis ignoring weights and then offer a comparison to the analysis incorporating weights.

8.1 Covariate Selection Excluding Pollution

We first construct the ‘best’ model excluding the pollution terms. From previous studies we know that several factors relate to chronic bronchitis. The disease is more prevalent among older people than younger people, among males than females and among urbanites than rural dwellers.
Social class seems important. In Britain unskilled labourers are five times as likely to have chronic bronchitis as professionals. Smoking and family history also rate as possible determinants (Fry, 1985, Chapter 6).

Beginning with the seemingly most relevant twenty, we must select covariates from the many offered by the Ontario Health Survey (OHS) to represent broad population traits. But which covariates are best and how many should be included?

Achieving an aesthetic result and avoiding the problems associated with sparse data and collinearity demand parsimony. In particular, a model with too many terms will spread the data over a table of unduly high dimension, since a relatively small proportion of the population is afflicted with emphysema.

Collinearity arises from the association of prospective covariates. Perfect association between two of them means the second adds no information not in the first. In ordinary least squares regression, collinearity leaves the coefficient estimators unbiased but of reduced precision. Robinson and Jewell (1991) prove that in logistic regression as well, when two predictive covariates exhibit collinearity, one being correlated with the response, a loss of precision occurs. Unfortunately we cannot avoid the difficulties presented by collinearity since covariates in an observational study cannot be controlled. Instead we have tried to side-step these problems by excluding highly correlated pairs of covariates.

To reduce our computational burden, we began building our model one factor at a time using one fifth of the survey respondents, or about 9,500 records. We incorporated the microdata survey weights into the binary logistic models.
This ensured that the estimating equations used to construct estimates of the model coefficients were unbiased for the finite population.

Term                                   X²   df   P(X² > x)
Age                                    77    1   0.000
Household Type                         47    3   0.000
Education Level                        41    2   0.000
Income                                 50    2   0.000
Smoker Type                           103    4   0.000
Current Smoker Type                    74    3   0.000
Duration Smoked                       150    2   0.000
Number of Cigarettes Smoked           103    2   0.000
Energy Expenditure                     63    3   0.000
Well Being Status                      56    4   0.000
Number of Current Household Smokers    23    1   0.000
Allergy                                20    1   0.000
Alcohol Screening Test Category        24    4   0.000
Blue Collar Work                       12    2   0.002
Family Functioning Status               8    2   0.017
Body Mass Index                         8    2   0.020
Sex                                     3    1   0.100
Immigrant Status                        2    1   0.214
Work Exposure                           1    1   0.333
Stratum                                 0    1   0.906

Table 11: One term emphysema models using a fifth of the data.

Table 11 summarizes the results. Except for age, the demographic covariates are inconspicuous in that they fall at the bottom of the list. The last four one term models fail the α = 0.05 criterion; sixteen of the prospective covariates reduce the standard deviance enough to improve the fit of the model significantly. Clearly a judicious strategy for finding a good model is needed.

The need to distinguish between missing and zero values complicated our stepwise selection procedure. When a variable like ‘cigarettes smoked’ had missing values we had to include an extra dummy variable in the model. In addition, three covariates, ‘well being’, ‘allergy’ and ‘family functioning status’, were excluded a priori since we judged them to blur the distinction between dependent and independent variables.

Table 12 summarizes the results of applying our stepwise procedure. The table shows that adding the smoking covariate in the second step significantly decreases the importance of the other smoking covariates in the subsequent steps.

We stopped after step six.
The covariates selected for our model, before considering pollution, are: ‘age’; ‘smoker type’; ‘education’; ‘cigarettes smoked’; ‘energy expenditure’; and ‘family type’.

Step  Stepwise added Terms     X²   df   P(X² > x)
1     Age                      77    1   0.000
2     Smoker Type             105    4   0.000
      Current Smoker Type      83    3   0.000
      Cigarettes Smoked        56    1   0.000
      Household Smokers        44    1   0.000
      Education                34    2   0.000
3     Education                26    2   0.000
      Income                   23    2   0.000
      Energy Expenditure       25    3   0.000
      Cigarettes Smoked        21    2   0.000
      Immigrant                 5    1   0.026
4     Cigarettes Smoked        20    2   0.000
      Energy Expenditure       21    3   0.000
      Family Type              20    3   0.000
      Income                   15    2   0.001
      Drinker Category         14    4   0.007
      Sex                       6    1   0.018
5     Energy Expenditure       21    3   0.000
      Family Type              21    3   0.000
      Income                   17    2   0.000
      Sex                      10    1   0.001
      Drinker Category         14    4   0.009
6     Family Type              20    3   0.000
      Income                   14    2   0.001
      Sex                       8    1   0.004
      Drinker Category         12    4   0.021
      Duration Smoked           7    2   0.025
7     Duration Smoked           8    2   0.017
      Sex                       5    1   0.027
      Income                    7    2   0.031
      Drinker Category         11    4   0.032
      Immigrant                 3    1   0.104

Table 12: Terms in a stepwise fitting strategy for emphysema using a fifth of the data.

Term                   X²      df   P(X² > x)
Smoker Type           159.2     4   0.000
Age                    97.0     1   0.000
Education              85.2     2   0.000
Family Type            98.0     4   0.000
Cigarettes Smoked      24.0     2   0.000
Energy Expenditure     10.1     3   0.018

Table 13: Goodness of fit for each of the terms in the full emphysema model.

8.2 Evaluating the Effect of Pollution

In this subsection we add the pollution variables to the six term model. To check the significance of each of the ‘factors’ using all of the data, we drop each term and note the increase in the deviance. Table 13 shows the improved sensitivity of the model with five times as much data. Except for ‘energy expenditure’, all factors are more significant than they were with the reduced dataset. Note that the categorical covariates show a greater improvement than their continuous counterparts.

We considered the pollution covariates in two ways. For the first we computed summer and winter averages; these estimates were easily added to the model.
In the second approach we used principal components to reduce the number of pollution covariates. The first three principal components based on the eight averages explained 80% of the variation.

The rotations for the three principal components are given in Table 14 and should be considered indices of long term pollution exposure. A close look at the loadings allows for a certain degree of interpretation. The first contrasts O3 and SO4 against SO2 and emphasizes summer averages; the second is a recasting of NO2 with most of the weight given to the summer; and the third is a weighted combination of SO2 and summer O3.

                    Component
Pollutant       #1      #2      #3
Summer NO2    -0.03   -0.90    0.01
Winter NO2    -0.09   -0.43    0.09
Summer O3      0.70   -0.06    0.60
Winter O3      0.13    0.05    0.03
Summer SO2    -0.51    0.07    0.67
Winter SO2    -0.35    0.05    0.38
Summer SO4     0.26    0.06    0.18
Winter SO4     0.17    0.00    0.04

Table 14: Loadings for the first three principal components.

We evaluated the pollution covariates by comparing nested models containing the pollution term to the full model. The resulting decrease in scaled deviance is shown in Table 15. Only ‘summer NO2’ significantly improves model fit at the nominal α = 0.05 level. The coefficient estimate is 0.013 with a standard error of 0.006. Considering the number of tests we performed and the strong assumptions used in the modeling (such as independence between respondents), the result is suggestive rather than an irrefutable fact.

Term                       X²    df   P(X² > x)
Summer NO2                 4.3    1   0.039
Pollution Component #2     3.2    1   0.073
Winter O3                  2.8    1   0.093
Summer O3                  1.1    1   0.299
Summer SO4                 0.8    1   0.380
Pollution Component #1     0.8    1   0.386
Pollution Component #3     0.5    1   0.493
Winter SO4                 0.4    1   0.554
Summer SO2                 0.0    1   0.920
Winter NO2                 0.0    1   1.000
Winter SO2                 0.0    1   1.000

Table 15: Goodness of fit test for the pollution terms.

8.3 Comparing Weighted and Unweighted Analyses

Since we used weights with the logistic model, we could naturally ask what the results of an analysis would look like if we ignored weights.
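As a minimal sketch of what incorporating or ignoring weights means in practice, the Python fragment below fits a logistic model with a single covariate by Newton-Raphson, once without weights and once with them. The function name `fit_logistic`, the toy records and the weights are all hypothetical and are not taken from the Ontario Health Survey; the models in this chapter contain several covariates, so this is an illustration of the idea only.

```python
import math

def fit_logistic(x, y, w=None, iters=50):
    """Newton-Raphson fit of logit P(y = 1) = b0 + b1 * x.

    With w = None every record counts once; with survey weights w the
    weighted (pseudo) log-likelihood is maximised instead.
    """
    n = len(x)
    w = [1.0] * n if w is None else w
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = 0.0               # weighted score vector
        h00 = h01 = h11 = 0.0       # weighted information matrix
        for xi, yi, wi in zip(x, y, w):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = wi * (yi - p)
            v = wi * p * (1.0 - p)
            g0 += r
            g1 += r * xi
            h00 += v
            h01 += v * xi
            h11 += v * xi * xi
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det   # Newton step for the intercept
        d1 = (h00 * g1 - h01 * g0) / det   # Newton step for the slope
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

# Hypothetical toy records (not OHS data): x is a binary exposure,
# y the binary response and w a made-up survey weight.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 0, 0, 1, 1]
w = [10, 10, 10, 10, 5, 5, 20, 20]

b_unweighted = fit_logistic(x, y)
b_weighted = fit_logistic(x, y, w)
print("unweighted slope:", round(b_unweighted[1], 3))
print("weighted slope:  ", round(b_weighted[1], 3))
```

With a single binary covariate the fitted slope is the log odds ratio, so the two fits differ exactly when the weights change the odds within the exposure groups.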
If the results are similar then we can feel confident that model misspecification is minimal.

The same stepwise procedure used previously but without survey weights produces Table 16. As before, ‘age’ and ‘smoker type’ give the strongest signals to the model. ‘Cigarettes smoked’ is again the third term to enter the model. ‘Income’, however, rather than ‘energy expenditure’ is the fourth term and, using a similar criterion to the weighted model, the fifth term doesn’t even make it into the model. Table 17 shows the goodness of fit test for the terms in the model before adding pollution covariates. Each of the four terms is highly significant.

Table 18 illustrates the addition of pollution covariates. Upon comparing Table 18 with Table 15, we notice that summer NO2 appears in both as the first pollutant. Moreover the estimated coefficient values are similar, the unweighted estimate being 0.012 with a standard error of 0.006.

In both weighted and unweighted models ‘age’ and ‘smoker type’ are strongly associated with emphysema. Although summer NO2 shows up as borderline significant in both cases, significantly stronger covariates such as ‘education’ and ‘income’ appear in one model but not the other.
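The borderline status of summer NO2 can be checked directly from the figures quoted in Tables 15 and 18. The sketch below recomputes the upper tail probabilities of the 1 df X² statistics and the two-sided Wald p-values implied by the reported estimates and standard errors. The function names are hypothetical, and the Bonferroni threshold for the eleven pollution terms screened is an illustrative adjustment assumed for this sketch, not a step taken in the analysis.

```python
import math

def chi2_sf_1df(x):
    """Upper tail probability P(X^2 > x) for a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

def wald_p(estimate, se):
    """Two-sided normal p-value for the Wald statistic z = estimate / se."""
    z = estimate / se
    return math.erfc(abs(z) / math.sqrt(2.0))

# Summer NO2 values as reported in the text and in Tables 15 and 18.
results = {
    "weighted":   {"x2": 4.3, "coef": 0.013, "se": 0.006},
    "unweighted": {"x2": 3.8, "coef": 0.012, "se": 0.006},
}

n_tests = 11                        # pollution terms listed in each table
bonferroni_alpha = 0.05 / n_tests   # illustrative adjustment (assumption)

for label, r in results.items():
    p_lr = chi2_sf_1df(r["x2"])
    p_wald = wald_p(r["coef"], r["se"])
    print(label,
          "deviance p =", round(p_lr, 3),
          "Wald p =", round(p_wald, 3),
          "passes Bonferroni:", p_lr < bonferroni_alpha)
```

Under this adjustment neither analysis would flag summer NO2, which is consistent with treating the association as weak.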
We again conclude with a weak belief in the association between summer NO2 and emphysema.

Step  Stepwise added Terms     X²     df   P(X² > x)
1     Age                      91.3    1   0.000
      Family Type              46.1    3   0.000
      Income                   40.7    2   0.000
      Smoker Type              76.1    4   0.000
      Current Smoker Type      50.5    3   0.000
      Duration Smoked         145      2   0.000
2     Smoker Type              77.8    4   0.000
      Current Smoker Type      60.8    3   0.000
      Cigarettes Smoked        73.7    2   0.000
      Household Smokers        52.8    1   0.000
      Duration Smoked          19.8    1   0.000
3     Cigarettes Smoked        24.3    2   0.000
      Allergy                  20.4    1   0.000
      Income                   15.8    2   0.000
      Duration Smoked           9.4    2   0.009
      Drinker Category         13.2    4   0.010
4     Income                   15.7    2   0.000
      Family Type              10.9    3   0.012
      Household Smokers         5.3    1   0.022
      Drinker Category         10.8    4   0.029
      Education                 6.8    2   0.034
5     Household Smokers         6.3    1   0.012
      Drinker Category          9.1    4   0.058
      Duration Smoked           5.7    2   0.059
      Drinker Indicator         4.9    2   0.086
      Family Type               5.8    3   0.122

Table 16: Stepwise terms for emphysema using a fifth of the data and ignoring weights.

Term                  X²     df   P(X² > x)
Age                  297      1   0.000
Smoker Type          123      4   0.000
Income               118      2   0.000
Cigarettes Smoked     46.3    2   0.000

Table 17: Terms in the full unweighted emphysema model.

Term                       X²    df   P(X² > x)
Summer NO2                 3.8    1   0.051
Pollution Component #2     3.7    1   0.053
Winter NO2                 3.4    1   0.067
Summer SO4                 1.9    1   0.173
Pollution Component #1     1.1    1   0.286
Winter O3                  1.1    1   0.296
Winter SO4                 1.1    1   0.303
Summer O3                  1.0    1   0.325
Summer SO2                 0.2    1   0.699
Pollution Component #3     0.2    1   0.699
Winter SO2                 0.1    1   0.806

Table 18: Goodness of fit test for the pollution terms in the unweighted analysis.

9 Discussion

In this study we explored the association between airborne pollutants and chronic respiratory health. Our findings at most suggest that a weak association exists. In this chapter we critically consider the model assumptions and exposure measurement problems.

9.1 Model Assumptions

If the pollution coefficients showed strong significance we would inevitably get caught up in a debate over the reported standard errors.
Measurement error and the clustering used in survey sampling are two reasons why the standard errors are, perhaps even grossly, underestimated.

We normally underestimate standard errors because we implicitly assume the absence of measurement error. We have reason to believe, however, that measurement error does exist in our study variables, the extent of which is unknown. Breaking down the layers of error helps in gaining an understanding of the phenomenon.

The first problem arises from self reporting. When asked whether they have a chronic cough, respondents introduce subjective assessments of their own condition. We all know of psychosomatic individuals and their converse, obdurate, self denying stoics. More precisely worded questions concerning the nature of the symptoms may minimize interpretive leeway. In Abbey, Petersen, Mills and Beeson (1993) symptom assessment questions incorporated specific time durations. “Did you have symptoms of cough and/or sputum production on most days, for at least three months per year, for two years or more?” is one example of the wording. The Ontario Health Survey, on the other hand, asked if the respondent had “emphysema or chronic bronchitis or persistent cough?”.

Besides interpretive difficulties associated with a questionnaire, disease diagnosis can be difficult. Emphysema provides a sterling example: a definitive answer avails itself only after autopsy. The two main diagnostic instruments, providing physiological and radiological measurements, are not failsafe.

Physiological measurements include carbon monoxide diffusion capacity, slope of the volume pressure curve and lung recoil at specific lung volumes. At best, they have been shown to be good at detecting severe cases; mild disease is poorly detected. Overall, no physiological measure or combination of measurements can reliably serve for disease determination.

Radiography provides an opportunity to look inside the body without using the scalpel.
Standard films do not do away with the subjective judgement of the medical practitioner, however. As with physiological measurements, chest radiographs are good for diagnosing severe cases but can only diagnose about half the moderately severe cases. Computed tomography offers an improved image over standard radiography and, as a result, better diagnosis potential. High resolution computed tomography, in particular, appears to correctly resolve 90% of all cases (Snider, 1992, p. 1342).

A question arises, of course: if a trained expert is unable to diagnose emphysema reliably then how much caution should we take in considering a self reported condition? We can enumerate some of the germane factors: the patient’s willingness to go to the doctor, the date of the last checkup, the severity of the condition, the equipment available to the physician, the physician’s willingness to use the equipment and professional judgement all determine to some degree the binary response recorded by the 1990 Ontario Health Survey.

Besides measurement error, clustering usually inflates the real error. In the models we considered we assume independence between observations. The Ontario Health Survey, however, uses cluster sampling in which blocks of households, ostensibly more similar to one another the closer they are, are surveyed at the same time. We ignored this phenomenon because the micro-data we had did not allow us to identify the clusters.

In short we used liberal assumptions and underestimated the standard errors associated with coefficient estimates. Converted to p-values, the results which we have shown seem more significant than they are.
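A rough sense of the understatement can be had from Kish's approximate design effect, deff = 1 + (m - 1)ρ, for clusters of equal size m with intra-cluster correlation ρ. The sketch below applies it to the summer NO2 estimate and standard error reported in Section 8.2. Both m and ρ are hypothetical values chosen for illustration, since the micro-data did not identify the clusters, and applying this formula to a regression coefficient is itself only a heuristic.

```python
import math

def design_effect(cluster_size, icc):
    """Kish's approximate design effect for equal-sized clusters."""
    return 1.0 + (cluster_size - 1.0) * icc

def two_sided_p(z):
    """Two-sided normal p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

coef, se = 0.013, 0.006   # summer NO2, weighted analysis (Section 8.2)

# HYPOTHETICAL inputs: neither quantity could be estimated from the
# micro-data used in this thesis.
cluster_size = 10
icc = 0.05

deff = design_effect(cluster_size, icc)
se_adj = se * math.sqrt(deff)   # standard error inflated by sqrt(deff)

p_naive = two_sided_p(coef / se)
p_adj = two_sided_p(coef / se_adj)
print("design effect:", round(deff, 2))
print("naive p:", round(p_naive, 3), "adjusted p:", round(p_adj, 3))
```

Even this modest assumed clustering is enough to move the summer NO2 term from nominally significant to not significant at the 5% level.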
The reader is therefore encouraged to be cautious about accepting statements of significance at face value.

[Figure 20: Immigration background of the 1990 residents of Ontario. Series: Total, Europe, Asia; x-axis: Year, 1900 to 1990, with a 1980 cutoff marked.]

9.2 Exposure Measurement Problems

The most difficult task for an epidemiological study of this type is to derive good estimates of long term exposure. Our pollution data may not have been adequate to detect the hypothesized relationship between ambient air pollution and pulmonary disease. We know, for instance, that the study population is anything but fixed. In our study, however, we assumed that people were more or less situated within a Public Health Unit for a time span long enough to make the pollution estimates relevant. Figure 20, estimated from the Ontario Health Survey data itself, shows that about 25% of the sample immigrated from another country! Other sources report that in the fifteen year span from 1971 to 1986 almost half of all Canadians changed their place of residence every five years (Statistics Canada, 1991, p. 72). Information on the migratory history of the individuals in the sample would have been helpful.

[Figure 21: North American air quality trends. Series: Canada, U.S.A.]

Another difficulty arises from incomplete exposure estimate coverage. In our case we have pollution measurements from 1983 to 1989. We then assume that this time period is similar to earlier intervals. The North American experience, however, leaves that assumption in doubt. Hoberg (1993) argues that pollution controls have made a difference to observable pollution values. Figure 21 shows changes in Canadian and American pollution levels from the mid 1970s to the late 1980s. Though the picture is complex we can generally see a decline in ambient air pollution readings.
Changes in the spatial distribution of pollution will also muddy the interpretation of the resulting analysis.

Finally, ambient air pollution could be less important than accumulated levels achieved indoors. Dales et al. (1991a) and Dales, Burnett and Zwanenburg (1991b) identified the importance of indoor molds on the respiratory function of children and adults, respectively. The models I considered for this thesis ignored the potential of household pollutants because of the absence of this kind of data.

[Figure 21 plots, x-axis 1975 to 1990. Source: Hoberg (1993), p. 118.]

9.3 Future Directions

If exposure estimates are often the weak link in epidemiological studies then future efforts must concentrate on improving them. The use of personal monitoring devices offers one such possibility. Silverman et al. (1992) offer a successful application of portable multipollutant samplers. Due to expense, however, only thirty six subjects participated in the study.

If personal monitoring is too costly and fixed station data too crude, a greater emphasis on sophisticated modelling offers an intermediary solution. The Office of Air Quality Planning and Standards, a branch of the U.S. Environmental Protection Agency, simulated people’s movements through zones of varying air quality to approximate exposure patterns. The National Ambient Air Quality Standards Exposure Model (NEM) uses ambient air pollution station measurements, activity diaries and population data as its major data sources. Since the early stages of development in 1979, the model has shown promise (Johnson, Capel, Paul and Wijnberg, 1992).

We presume the relationship between airborne pollutants and chronic respiratory disease, if any, is subtle.
This study has demonstrated the difficulty of exploring such a relationship through the use of a general purpose survey.

References

[1] ABBEY, David, John MOORE, Floyd PETERSEN and Larry BEESON (1991). “Estimating cumulative ambient concentrations of air pollutants: description and precision of methods used for an epidemiological study.” Archives of Environmental Health. 46, 5, 281-287.
[2] ABBEY, David, Floyd PETERSEN, Paul MILLS and Lawrence BEESON (1993). “Long term ambient concentrations of total suspended particulates, ozone, and sulfur dioxide and respiratory symptoms in a nonsmoking population.” Archives of Environmental Health. 48, 1, 33-46.
[3] BATES, David (1992). “Health indices of the adverse effects of air pollution: the question of coherence.” Environmental Research. 59, 336-349.
[4] BINDER, David (1983). “On the variances of asymptotically normal estimators from complex surveys.” International Statistical Review. 51, 279-292.
[5] BRAIN, Joseph D. (1989). “The susceptible individual: an overview.” Susceptibility to Inhaled Pollutants, ASTM STP 1024. Mark Utell and Robert Frank, Editors, 3-5.
[6] BRITTON, J. (1992). “Pollution and respiratory morbidity: how much do we accept?” Thorax. 47, 5, 391-392.
[7] BROWN, P.J., Nhu D. LE and J.V. ZIDEK (1993). “Multivariate Spatial Interpolation and Exposure to Air Pollutants.” Canadian Journal of Statistics. To appear.
[8] BURKE, T., H. ANDERSON, N. BEACH, D. COLOME, M. FIRESTONE, F. HUACHMAN, T. MILLER, D. WAGENER, L. ZEISE and L. TRAN (1992). “Role of exposure databases in risk management.” Archives of Environmental Health. 47, 6, 421-429.
[9] CLEMMESEN, Johannes (1993). “Lung cancer from smoking: delays and attitudes, 1912-1965.” American Journal of Industrial Medicine. 23, 941-953.
[10] COLLETT, Dave (1991). Modelling Binary Data. London: Chapman and Hall.
[11] DALES, Robert, Harry ZWANENBURG, Richard BURNETT and Claire FRANKLIN (1991a). “Respiratory health effects of home dampness and molds among Canadian children.” American Journal of Epidemiology. 134, 2, 196-203.
[12] DALES, Robert, Richard BURNETT and Harry ZWANENBURG (1991b). “Adverse health effects among adults exposed to home dampness and molds.” American Review of Respiratory Disease. 143, 505-509.
[13] DOBSON, Annette J. (1990). An Introduction to Generalized Linear Models. London: Chapman and Hall.
[14] DUDDEK, Chris, Nhu LE, Weimin SUN, Richard WHITE, Hubert WONG and James V. ZIDEK (1994). “Assessing the impact of ambient air pollution on hospital admissions using interpolated exposure estimates in both space and time.” Final report to Health Canada under DSS contract H4078-3-C059/01-SS.
[15] EVANS, J.S., T. TOSTESON and P.L. KINNEY (1984). “Cross sectional mortality studies and air pollution risk assessment.” Environment International. 10, 55-83.
[16] EVANS, Robert (1993). “Less is more: Contrasting styles in health care.” In Canada and the United States: Differences that Count. Edited by David Thomas. Peterborough, Ont: Broadview Press. 21-41.
[17] FAY, Robert E. (1984). “Application of linear and log-linear models to data from complex surveys.” Survey Methodology. 10, 1, 82-96.
[18] FRY, John (1985). Common Diseases: Their Nature, Incidence and Care. Fourth Edition. Lancaster: MTP Press Limited.
[19] GIBSON, Robert (1990). “Out of control and beyond understanding: Acid rain as a political dilemma.” In Managing Leviathan: Environmental Politics and the Administrative State. Edited by Robert Paehlke and Douglas Torgerson. Peterborough, Ont: Broadview Press. 243-282.
[20] GOMEZ, Stephen, Robert PARKER, James DOSMAN and Helen McDUFFIE (1992). “Respiratory health effects of alkali dust in residents near desiccated Old Wives Lake.” Archives of Environmental Health. 47, 5, 364-369.
[21] HACKNEY, Jack, William LINN, Edward AVOL, Deborah SHAMOO, Karen ANDERSON, Joseph SOLOMON, David LITTLE and Ru-Chuan PENG (1992). “Exposures of older adults with chronic respiratory illness to nitrogen dioxide.” American Review of Respiratory Disease. 146, 1480-1486.
[22] HOBERG, George (1993). “Comparing Canadian performance in environmental policy.” In Canada and the United States: Differences that Count. Edited by David Thomas. Peterborough, Ont: Broadview Press. 101-124.
[23] INSTITUTE FOR HEALTH CARE FACILITIES OF THE FUTURE (1990). Future Health: A View of the Regional Trends. Ottawa.
[24] JOHNSON, Ted, Jim CAPEL, Roy PAUL and Luke WIJNBERG (1992). “Estimation of carbon monoxide exposures and associated carboxyhemoglobin levels in Denver residents using a probabilistic version of NEM.” Final report to U.S. Environmental Protection Agency, Office of Air Quality Planning and Standards, contract number 68-D0-0062.
[25] KUMAR, S. and J.N.K. RAO (1984). “Logistic regression analysis of labour force survey data.” Survey Methodology. 10, 1, 62-76.
[26] LEBOWITZ, Michael (1991). “Populations at risk: addressing health effects due to complex mixtures with a focus on respiratory effects.” Environmental Health Perspectives. 95, 35-38.
[27] MATANOSKI, G., S. SELEVAN, G. AKLAND, R. BORNSCHEIN, D. DOCKERY, L. EDMONDS, A. GREIFE, M. MEHLMAN, G. SHAW and E. ELLIOTT (1992). “Role of Exposure Databases in Epidemiology.” Archives of Environmental Health. 47, 6, 439-446.
[28] McCULLAGH, P. and J.A. NELDER (1989). Generalized Linear Models. Cambridge: Chapman and Hall.
[29] McDONNELL, William, Howard KEHRL, Said ABDUL-SALAAM, Philip IVES, Lawrence FOLINSBEE, Robert DEVLIN, John O’NEIL and Donald HORSTMAN (1991). “Respiratory response of humans exposed to low levels of ozone for 6.6 hours.” Archives of Environmental Health. 46, 3, 145-150.
[30] McDONNELL, William, Keith MULLER, Philip BROMBURG and Carl SHY (1993). “Predictors of individual differences in acute response to ozone exposure.” American Review of Respiratory Disease. 147, 818-825.
[31] ONTARIO MINISTRY OF HEALTH (1992a). “Ontario Health Survey 1990.” User’s guide volume 1.
[32] ONTARIO MINISTRY OF HEALTH (1992b). “Ontario Health Survey 1990.” User’s guide volume 2.
[33] OZKAYNAK, Haluk and George THURSTON (1987). “Associations between 1980 U.S. mortality rates and alternative measures of airborne particle concentration.” Risk Analysis. 7, 4, 449-461.
[34] PFEFFERMANN, Danny (1993). “The role of sampling weights when modeling survey data.” International Statistical Review. 61, 2, 317-338.
[35] PURTILO, D. and R. PURTILO (1989). Survey of Human Diseases. Second Edition. Boston: Little Brown and Company.
[36] RAPPAPORT, S.M. (1993). “Threshold limit values, permissible exposure limits and feasibility: the bases for exposure limits in the United States.” American Journal of Industrial Medicine. 23, 683-694.
[37] ROBINSON, L. and N. JEWELL (1991). “Some surprising results about covariate adjustment in logistic regression models.” International Statistical Review. 59, 227-240.
[38] ROBINSON, J. and D. PAXMAN (1992). “The role of threshold limit values in U.S. air pollution policy.” American Journal of Industrial Medicine. 21, 383-396.
[39] SARNDAL, Carl-Erik, Bengt SWENSSON and Jan WRETMAN (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
[40] SELZER, M. and L. VAN ROOIJEN (1975). “A self administered Short Michigan Alcohol Screening Test (SMAST).” Journal of Studies on Alcohol. 36, 1, 117-126.
[41] SEIXAS, Noah, Thomas ROBINS and Mark BECKER (1993). “A novel approach to the characterization of cumulative exposure for the study of chronic occupational disease.” American Journal of Epidemiology. 137, 4, 463-471.
[42] SEILER, Tamara (1993). “Melting Pot and Mosaic: Images and Realities.” In Canada and the United States: Differences that Count. Edited by David Thomas. Peterborough, Ont: Broadview Press. 303-325.
[43] SENTHILSELVAN, A., Yue CHEN and James DOSMAN (1993). “Predictors of asthma and wheezing in adults: grain farming, sex and smoking.” American Review of Respiratory Disease. 148, 667-670.
[44] SEXTON, K., D. WAGENER and J. LYBARGER (1992). “Estimating human exposures to environmental pollutants: availability and utility of existing databases.” Archives of Environmental Health. 47, 6, 398-407.
[45] SILVERMAN, F., H.R. HOSEIN, P. COREY, S. HOLTON and S.M. TARLO (1992). “Effects of particulate matter exposure and medication use on asthmatics.” Archives of Environmental Health. 47, 1, 51-56.
[46] SNIDER, Gordon (1992). “Emphysema: the first two centuries and beyond.” American Review of Respiratory Disease. 146, 1334-1344.
[47] STATISTICS CANADA (1983). Historical Statistics of Canada. Second Edition. Edited by F. H. Leacy and M. C. Urquhart. Ottawa: Supply and Services Canada.
[48] STATISTICS CANADA (1991). Canada Year Book. Ottawa: Supply and Services Canada.
[49] STEINBERG, Stephen (1981). The Ethnic Myth. Boston: Beacon Press.
[50] VALANIS, Barbara (1992). Epidemiology in Nursing and Health Care. Second Edition. Toronto: Prentice Hall.
[51] XIPING, Xu, Douglas DOCKERY and Lihua WANG (1991). “Effects of air pollution on adult pulmonary function.” Archives of Environmental Health. 46, 4, 198-206.

Appendix: Figures

I decided to place some of the figures in this appendix in the hope of avoiding unnecessary distraction in the text.
The figures are ordered into the following logical groupings:

OHS Nonresponse
Figure 22: Ordered summary of covariate nonresponse.

Pollution Measurements and Estimates
Figure 23: NO2 station measurements;
Figure 24: O3 station measurements;
Figure 25: SO2 station measurements;
Figure 26: SO4 station measurements;
Figure 27: Strongest relationships between pollutants;
Figure 28: Estimated six year estimates for all pollutants;
Figure 29: Comparison of pollution measurements to estimates;
Figure 30: NO2 PHU estimates;
Figure 31: O3 PHU estimates;
Figure 32: SO2 PHU estimates; and
Figure 33: SO4 PHU estimates.

Asthma Arcsine Analysis Diagnostics
Figure 34: Demographic model;
Figure 35: Socioeconomic model;
Figure 36: Lifestyle model.

[Figure 22: Ordered summary of study covariate nonresponse. Covariates shown: Family Score, Body Mass, Alcohol Problem, Current Smoker, Smoker Type, Blue Collar, Allergy, Education, Immigration, Well Being, Exercise, Income, No. of Cigs, Duration Smoked, Work Exposure.]

[Figure 23: NO2 readings for twelve stations ordered by increasing station mean (µg/m3). x-axis: Year, 1983 to 1989.]

[Figure 24: O3 readings for twenty out of twenty one stations (µg/m3). x-axis: Year, 1983 to 1989.]

[Figure 25: SO2 readings for all twenty stations (µg/m3). x-axis: Year, 1983 to 1989.]

[Figure 26: SO4 readings for all ten stations (µg/m3). x-axis: Year, 1983 to 1989.]

[Figure 27: Strongest scatterplot relationships between pollution estimates. Axes include Summer NO2, Summer SO2, Summer SO4, Summer O3 and Winter O3.]

[Figure 28: Distribution of the 37 estimated PHU means for the four pollutants. Panels: Nitrogen Dioxide, Ozone, Sulfur Dioxide, Sulfates; x-axis: Estimated Means.]

[Figure 29: Comparison of pollution measurements to estimates (µg/m3).]

[Figure 30: Estimated NO2 six year average. Panels: Summer, Winter.]

[Figure 31: Estimated O3 six year average. Panels: Summer, Winter.]

[Figure 32: Estimated SO2 six year average. Panels: Summer, Winter.]

[Figure 33: Estimated SO4 six year average. Panels: Summer, Winter.]

[Figure 34: Asthma arcsine model diagnostics for the demographic grouping. Panel axes: Index, Standard Normal Quantiles, Predicted, Stratum (Rural/Urban), Sex (Female/Male), Immigrant (Imm/Non), Age (Young to Old).]

[Figure 35: Asthma arcsine model diagnostics for the socioeconomic grouping. Panel axes: Standard Normal Quantiles, Predicted, Blue Collar (NA/Blue/White), Income (NA/Low/Mid/High), Work Exposure (No/Yes), Post-Secondary (No/Yes).]

[Figure 36: Asthma arcsine model diagnostics for the lifestyle grouping. Panel axes: Standard Normal Quantiles, Predicted, Smoker Type, Alcohol (Non/Drinker), # Cigarettes (Few to Many), Duration Smoked (Years).]