UBC Theses and Dissertations

A cross sectional analysis of air pollution’s impact on chronic respiratory disease in Ontario. Duddek, Christopher Alan. 1994-02-26

A Cross Sectional Analysis of Air Pollution’s Impact on Chronic Respiratory Disease in Ontario

by

Christopher Alan Duddek

B.Sc. University of Manitoba, 1989

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE STUDIES

Department of Statistics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

August 1994

© Christopher Alan Duddek, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

The University of British Columbia
Vancouver, Canada

Abstract

Does ambient air pollution in Canada pose a threat to respiratory health? For a study initiated by Health Canada, we combined micro-level data from the 1990 Ontario Health Survey with data from an environmental air monitoring system to obtain a quantitative answer to the question.

In contrast to studies designed to collect special purpose data, the Ontario Health Survey was not designed to address respiratory health issues. In spite of this, the cross sectional database was rich enough for modelling. We used asthma and emphysema as the response variables in assessing the impact of four pollutants estimated for summer and winter.

Two analyses were conducted for each response variable, one incorporating survey design information, the other ignoring it. Age, income, smoker type and sex were significantly related to asthma at the α = 5% significance level in both analyses.
None of the pollutant covariates figured in the model.

Using the classical χ² test for nested models as the criterion, the emphysema model achieved a better fit than the asthma model. Smoker type and age, in particular, were strongly related to emphysema; income and number of cigarettes smoked were significantly but less strongly related; summer NO2 was marginally significant, depending on which of the two analyses was considered.

Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication
Poem Shard
1 Introduction
2 Literature Review
2.1 Experimental Studies
2.2 Estimating Cumulative Ambient Air Pollution Exposure
2.3 Cross Sectional Studies
3 Issues in Cross Sectional Analyses
3.1 Definition of Epidemiology
3.2 The Disease Process
3.3 Trends in Disease Patterns Since 1900
3.4 Types of Epidemiological Studies
3.5 Pros and Cons of the Cross Sectional Analysis
4 Description of the Ontario Health Survey
4.1 Objectives
4.2 Data Collection Method
4.3 Target Population
4.4 Pretesting
4.5 Questionnaire Content
4.6 Survey Methodology
4.7 Nonresponse Rates
4.8 Comment
5 Initial Data Analysis
5.1 OHS Covariates
5.2 Pollution Covariates
5.3 Asthma versus Covariates
5.4 Emphysema versus Covariates
6 Methodology
6.1 Finite vs. Infinite Population Inference
6.2 Generalized Linear Models Theory
6.3 Modelling Binary Data
7 Asthma Analysis
7.1 Arcsine Analysis
7.2 Logistic Modeling
8 Emphysema Analysis
8.1 Covariate Selection Excluding Pollution
8.2 Evaluating the Effect of Pollution
8.3 Comparing Weighted and Unweighted Analyses
9 Discussion
9.1 Model Assumptions
9.2 Exposure Measurement Problems
9.3 Future Directions
References
Appendix: Figures

List of Tables

1 OHS questionnaire content
2 Flowchart of the derivation of the study data
3 List of study covariables
4 How the OHS work exposure questions form a covariate for this study
5 Types of pollution assessment studies
6 Spatially interpolated ambient air pollution six-year averages (µg/m³)
7 Definition of symbols used for the modelling of binary data
8 The best arcsine models fitted for asthma
9 Most significant terms in the unweighted logit asthma model
10 Most significant terms in the weighted logit asthma model
11 One term emphysema models using a fifth of the data
12 Terms in a stepwise fitting strategy for emphysema using a fifth of the data
13 Goodness of fit for each of the terms in the full emphysema model
14 Loadings for the first three principal components
15 Goodness of fit test for the pollution terms
16 Stepwise terms for emphysema using a fifth of the data and ignoring weights
17 Terms in the full unweighted emphysema model
18 Goodness of fit test for the pollution terms in the unweighted analysis

List of Figures

1 Canadian life expectancy at birth
2 Geographic coverage of the Ontario Health Survey
3 Flowchart summary of the OHS data collection procedure
4 OHS pretest questionnaire response rates
5 Nonresponse rates for the work exposure questions
6 Nonresponse rates for the smoking questions
7 Nonresponse rates for the well being score by age
8 Prevalence of the two response variables
9 The demographic covariates
10 The socioeconomic covariates
11 The lifestyle covariates
12 The health covariates
13 Typical graph of a station’s measurement of one pollutant
14 Asthma prevalence by covariates (I)
15 Asthma prevalence by covariates (II)
16 Emphysema prevalence by covariates (I)
17 Emphysema prevalence by covariates (II)
18 Empirical density of survey weights
19 Marginal distribution of age using and ignoring survey weights
20 Immigration background of the 1990 residents of Ontario
21 North American air quality trends
22 Ordered summary of study covariate nonresponse
23 NO2 readings for twelve stations ordered by increasing station mean (µg/m³)
24 O3 readings for twenty out of twenty one stations (µg/m³)
25 SO2 readings for all twenty stations (µg/m³)
26 SO4 readings for all ten stations (µg/m³)
27 Strongest scatterplot relationships between pollution estimates
28 Distribution of the 37 estimated PHU means for the four pollutants
29 Comparison of pollution measurements to estimates
30 Estimated NO2 six year average
31 Estimated O3 six year average
32 Estimated SO2 six year average
33 Estimated SO4 six year average
34 Asthma arcsine model diagnostics for the demographic grouping
35 Asthma arcsine model diagnostics for the socioeconomic grouping
36 Asthma arcsine model diagnostics for the lifestyle grouping

Acknowledgement

This thesis was made possible by Jim Zidek and Rick Burnett: both were instrumental in bringing the Health Canada project to the University of British Columbia. I appreciate their unending labour in sorting through the red tape associated with such a project.

To get me thinking about my thesis, the biostatistics subgroup met regularly in the summer of 1993. I would like to acknowledge the regulars for their help: Victor Espinosa-Balderas, Nhu Le, John Petkau, Rick White, Hubert Wong, Weimin Sun and Jim Zidek.

At Health Canada, I was lucky to have a good ally in Robert Tkalec.
When I asked questions or complained about incomplete documents he was quick to the whip.

Weimin Sun deserves special recognition for his modification and implementation of the spatial interpolation methodology developed by Jim and Nhu. When I needed data he would work until the next morning’s sunrise. His dedication to the task was remarkable.

On the social side, I would like to recognize those who made my UBC days exciting. Xiaochun Li was always up for movies and dinner when I was stuck or tired; Nita Deerpalsing was stellar as my perpetual audience, both for the thesis and in general; and Tim Fijal was continually breaking me up with his Hippocrates’ saying that ‘wholemeal bread clears out the gut’.

Finally, I would like to thank Jim Zidek for his neverending encouragement. Without it I may never have succeeded in finishing the thesis.

To my mom and dad
for producing me.

The yellow fog that rubs its back upon the window-panes,
The yellow smoke that rubs its muzzle on the window-panes
Licked its tongue into the corners of the evening,
Lingered upon the pools that stand in drains,
Let fall upon its back the soot that falls from chimneys,
Slipped by the terrace, made a sudden leap,
And seeing that it was a soft October night,
Curled once about the house, and fell asleep.

And should I then presume?
And how should I begin?

(From The Love Song of J. Alfred Prufrock by T.S. Eliot)

1 Introduction

Resource exploitation was a dominant ideal of European imperialism. Dreams about the ‘New World’ were predicated on animal fur, vast stands of East Coast forest and unlimited stocks of Grand Banks cod (Seiler, 1993, 303).

Industrialization, urbanization, technology and the rise of bureaucracy, however, radically altered the landscape. Our ability to deleteriously affect our environment gradually led to the emergence of agencies now associated with the welfare state.
Health Canada, Agriculture Canada, the Environmental Protection Agency and the Food and Drug Administration are a few of the North American institutions which attest to our faith in resource management.

The air we breathe has come to be seen as one of those resources. Two approaches have been taken to ensure and improve air quality: emission controls and development of ambient air quality standards. The automobile gives a good example of the first. Detroit manufacturers continuously update their line of cars, thereby achieving opportunities to incorporate emission reducing engineering improvements. Since preventative measures are viewed as the least costly of pollution controls, regulatory agencies in the United States have goaded the car industry into meeting higher auto emission standards. By the mid 1990s Canada will have realigned standards so that they fall in line with America’s. Because of this vigilance, North American standards are higher than those of many European countries (Hoberg, 1993, 108-109).

Ambient air quality standards provide another benchmark. The need for them became apparent after the London fog episode of 1952. The maximal 24-hour concentration of sulfur dioxide, ten times the current allowable concentration, resulted in an estimated 4000 excess deaths (Griffith, 1989, 112). The ambient air quality standard serves as a benchmark and is supposed to be determined in spirit in accordance with the current state of scientific evidence. With general acceptance of these principles, ambient air quality standards have become a cornerstone of public health policy.

The concept of the Threshold Limit Value or TLV is a key to understanding the history of air quality guidelines in the States.
Since the 1940s TLVs have been set by the TLV Committee of the American Conference of Governmental Industrial Hygienists. The TLV Committee officially defines the TLV to be levels at which “nearly all workers may repeatedly be exposed day after day without adverse health effects.” The evidence, however, suggests that the TLV Committee has always been sensitive to the impact their decisions would have on industry. They thus chose levels thought to be achievable by major players in industry (Rappaport, 1993).

The role of the TLV should become clear after sketching air pollution regulation in the U.S. from 1970 to present. The inadequate regulation of hazardous air pollutants was implicitly recognized by the Clean Air Act of 1970. It gave the Environmental Protection Agency (EPA) the authority to impose emission standards that guaranteed “an ample margin of safety.” The EPA immediately became sensitized to the potential for economic dislocation if stringent emission limits were set, and it bypassed enforcing the law by declining to list and regulate airborne toxicants. When political pressure over specific substances arose that was too great to ignore, the EPA established emission limits based on economic rather than health considerations (Robinson and Paxman, 1992).

In the early 80s, commencing with the inauguration of Ronald Reagan, the EPA sought to delegate responsibility for airborne pollutant regulation to the states. The states then developed Ambient Air Level guidelines which were based on TLVs multiplied by an appropriate safety factor. But is the TLV a good starting point for air quality guidelines? There are reasons to be wary of the TLV.

First and foremost, the TLV is tainted by the TLV Committee’s decision process:

    TLVs for particular substances were heavily influenced by corporations with direct financial interests in the substances being evaluated. . . . TLVs often represent the exposure levels actually prevalent in major firms rather than levels at which no adverse health effects are reported (Robinson and Paxman, 1992, p. 392).

Whose voices are not heard by the Committee? Where economic interests have had a stake, collective health has partially been compromised to minimize the negative fiscal impact of higher emission standards on the offending industries. As long as the hazard is not deadly in the short term and affects only sensitive populations, profits have in the past held the upper hand in public health policy.

A second problem with the TLV derives from its use. States use the TLV by multiplying its value by a constant. But how is the constant determined? Since the TLV is based on the 40 hour workweek, multiplying by 4.2 would account for the 168 hour week experienced by residents living in the affected area. Still, the guideline is supposed to be designed for working populations and not necessarily for a population encompassing the more sensitive segments like the very young and very old. The resulting state defined Acceptable Ambient Air Level guideline for acrylonitrile, for example, varies by a ratio of over a thousand among regulating states. The variability alone casts suspicion on the process.

The Clean Air Act Amendment of 1990 repositions the EPA as the national standard bearer in air pollution policy. The Act requires the EPA to develop national emission standards for 189 toxic substances in the next decade. In contrast to the “ample margin of safety” quoted in the Clean Air Act of 1970, the Amendment requires that standards be set on “maximum achievable control technology.” This reorientation is proposed to accelerate the pace of standard setting by basing it on technological advancement. The TLV, as a result, will play a lesser role in health policy (Robinson and Paxman, 1992).

Gibson (1989) documents a similar tale of corporate influence over public air pollution policy in Ontario.
The Inco smelter in Sudbury has long been an infamous source of SO2 emissions. Little headway in reductions has been made, however, because Ontario Environment Ministry officials knew “that the environmental benefits would be less immediate and perceptible than the costs of abatement and that the beneficiaries would be more dispersed and less well connected politically than the recipients of abatement orders” (Gibson, 1989, 250).

Ignoring the decision making realities of the political realm, the question remains: how seriously is human health affected by air pollution? A call for quantification, in the form of environmental health impact studies, is a common response (Britton, 1992), with many forms having been attempted. I give a flavour of recent work in the literature review. A taxonomy of epidemiological studies is provided in the chapter covering cross sectional designs.

In a bizarre twist, regulatory procedures often provide data for the studies. Once an air quality standard is determined, pollutant monitoring stations are erected and data collected. Thus, ironically, the available data on pollutant exposure is driven by regulation rather than by its utility for inquiries into public health (Matanoski et al., 1992).

In this study we evaluate the risks of air pollution to chronic lung function. The 1990 Ontario Health Survey provides us with a cross sectional view of the health status and socioeconomic level of Ontario’s population. The air pollution data arises from an air monitoring network of 37 stations, tracking four pollutants, from 1983 to 1989. The study’s methodology and analysis are given in greater detail in the appropriate chapters.

The objective of this study is ambitious. The study of acute effects, as measured by longitudinal hospital admissions, is more common than the study of chronic effects. The difficulty with a study of chronic effects is the measurement of exposure.
Long lag times, differences in pollutant mixtures over time, movement of individuals in the study population and error in self reported medical conditions are but a few of the obstacles to valid inference.

On the bright side, the Ontario Health Survey data is comprehensive and contains information on a large sample. The six years of pollution data employ recently developed methodology: where necessary the multivariate air monitoring station readings are spatially interpolated (Brown, Le and Zidek, 1993).

To make informed policy decisions on the effect of pollution on respiratory health we need to employ pre-existing data sources wherever possible. This study provides that kind of opportunity.

2 Literature Review

The literature on environmental health impact assessments is voluminous. In this chapter I cover some of the more recent papers in this area. I have ordered the material to follow a natural flow. Since experimental studies can more or less stand on their own I discuss them first. Observational studies tend to be more complex. I have apportioned a section to the difficult area of exposure measurement and one to cross sectional studies of a similar intent.

2.1 Experimental Studies

Although questions about a pollutant’s effect on human populations can be ethically difficult to address in an experimental study, some have been carried out. Whenever an effusion of possible predictors with many levels appears, experimental design, R.A. Fisher’s territory, tempts good researchers. In this section I describe recent experimental studies.

Hackney et al. (1992) evaluate the acute effects of nitrogen dioxide (NO2) on older adults with chronic obstructive pulmonary disease. They note that although animal toxicological studies prove high doses of NO2 pose a respiratory health risk, the epidemiological literature is inconclusive. They increase the power of their study by focusing on a sensitive segment of the population. This strategy has two benefits.
First, the sensitive subpopulation is of interest in its own right since, almost by definition, its members are most affected by pollutant levels. Second, information garnered from sensitive subpopulations may have implications for the larger population. We could use the metaphor of individuals as instruments: sensitive individuals might be like us except that they react more strongly to the same stimulus.

Hackney’s Los Angeles study combines laboratory and field work. In the laboratory researchers examined the lung dysfunction of subjects exposed to 0.3 ppm NO2 for four hours interlaced with four bouts of exercise. In the field they evaluated the exposure measurements from personal exposure monitoring devices worn by the 26 volunteers for two week periods. NO2 readings are traditionally highest in LA during fall and winter. The study interval was no exception with an average of 125 µg/m³ as compared to the annual average of 90 µg/m³.

Two conclusions are drawn. First, from a comparison of station to personal exposure measurements, NO2 exposure is strongly influenced by outdoor pollution, even though up to 90% of the subject’s time is spent indoors. Second, the sensitive population’s short term NO2 exposure results in little short term clinical exacerbation of respiratory disease.

McDonnell et al. (1991) study the effect of a 6.6 hour ozone (O3) exposure on 38 healthy humans. The response measure, forced expiratory volume in one second (FEV1), shows significant decline upon comparing clean air exposure to 0.08 ppm O3.

McDonnell, Muller, Bromburg and Shy (1993) improve on the previous study by examining more design points, increasing the range to 0.0 to 0.4 ppm O3 and increasing the sample size. This time they divide the sample into an exploratory sample of 96 subjects and a confirmatory sample of 194 subjects. The first sample is used to specify a model, after fitting many candidate models, and to identify predictors showing statistical significance.
The second sample validates the model and protects against declaring predictors significant when they spuriously appear in the model by chance alone. The study concludes that O3 explains 31% of the response variance while age explains 4%.

Experimental studies necessarily examine acute effects. From the three I reviewed, ozone seems to be more potent than nitrogen dioxide. Though experimental studies are useful for corroborating evidence produced by observational studies, the artificial context of a laboratory chamber may not correspond to pollution effects occurring in daily life. Moreover, such studies cannot assess chronic health effects. Observational studies address these issues head on. The first step in an observational study is to estimate the extent of exposure.

2.2 Estimating Cumulative Ambient Air Pollution Exposure

In observational studies we must deal with the problem of quantifying internal dose of the agent over time. Difficulties like the impossibility of obtaining internal dose data, working with imperfect proxies and refining pollution exposure modeling strategies characterize the quest for an adequate solution.

How can we use ambient air pollution measurements as a surrogate for internal dose? Typically the data come from fixed site air monitoring stations, stations for which data has already been collected over long periods of time. The objective is to link station data to human populations. Commonly the data are first interpolated spatially to other sites where no measuring equipment exists. Then the exposure based on living patterns reported by individuals in a survey is modelled.

Abbey, Moore, Petersen and Beeson (1991) address the interpolation question. They validated a simple method of interpolation by fixed site monitoring station deletion: after a station was deleted, interpolated values were calculated and compared against the measured values. Ozone (O3) and total suspended particulate (TSP) were measured for at least three years by 126 and 142 stations, respectively.
Ozone was measured for one minute every hour and TSP for 24 hours every sixth day.

The statistics of interest were exceedance frequency and mean concentration. The exceedance frequency was compared to US regulatory policy levels which are often stated in terms of maximum allowable values.

Their interpolation method obtained estimates for all ZIP codes in California. A maze of rules was put together to get interpolated values. First, a measuring station was valid if at least three years of pollution data existed. Second, for a given ZIP centroid, a station was valid if it resided within a radius of 50 km. Third, a series of three zones, based on concentric rings, was defined, zone A being the closest, zone B in between and zone C the furthest. The only stations used in the interpolation were those falling within the closest ring. Up to three stations were used in a given zone.

The zone definitions differed for each pollutant. TSP was assumed less homogeneous over space and the radius boundaries for the zones were tighter than for ozone. In the study area about 60% of the population lived within 10 km of a TSP monitoring site (zone B or better). About 90% lived within 16 km of an ozone monitoring site (zone A or better).

The study concludes that the interpolation methods worked well in their particular case. First, the correlations of 0.78 for TSP and 0.87 for ozone suggest the importance of treating different pollutants differently since, even with tighter controls of TSP zones and a greater number of stations, the TSP estimates were less accurate than those for ozone. Second, if persons can be situated at ZIP code centroids for significant time periods, the interpolation method may produce estimates which are good surrogates for internal dose.

Seixas, Robins and Becker (1993) propose to model human exposure using the occupational history of individuals and hundreds of thousands of occupational ambient air pollution measurements.
They motivate their research by contending that simple exposure estimates do not adequately capture the complexity of the disease process. The implicit assumptions that dose is a linear function of concentration and independent of time go against evidence of nonlinear toxicological behaviour of many substances causing adverse chronic health effects. In addition, the use of a simple statistic for exposure contributes to errors-in-variables bias.

Their modeling solution estimates the exponents of pollutant concentration and time between measurement and reported outcome. In their application 300,000 dust exposure observations from the shafts of underground coal mines were used to estimate model parameters. Each of the 1200 respondents provided work histories sufficient to obtain estimates of individual exposure. Despite the efforts made, alternative, simpler models achieved competitive predictive performance.

In conclusion, for fixed site air station monitoring data, the best epidemiological studies can do to approximate individual internal dose over time is to use some combination of interpolation and human exposure modeling. When stations are close enough, e.g. within 10 km for TSP or 16 km for ozone, a crude interpolation method will do well. For greater distances the quality of the interpolation will decrease, though the decrease is pollutant specific and the degree of quality deterioration remains to be further quantified.

The subtlety of human exposure modeling promises to challenge researchers striving for the ideal. Models, however, will continue to be data dependent: the availability of work histories, migration patterns and activity diaries will determine the quality of inference.

More research is needed to answer the difficult question of how internal dose relates to exposure measures. If progress is made, we will also be able to evaluate the soundness of the modeling approach.
Until then, modeling will offer hope of improving the power of epidemiological studies.

2.3 Cross Sectional Studies

I will characterize five recent cross sectional studies. Their variety of approaches is striking. While some of the studies simply analyse differences between study and control groups, others use standard regression techniques. The sample sizes range from 600 to 3900 respondents. The findings also show a range. Causes of respiratory symptoms go from grain farming to blowing alkali salts.

Abbey, Moore, Petersen and Beeson (1993) used Seventh-day Adventist nonsmokers to check on TSP, ozone and sulfur dioxide. Their logistic regression analysis found significant relationships between ambient concentrations of TSP and ozone and several respiratory disease outcomes.

Senthilselvan, Chen and Dosman (1993) examine the relationship between grain farming and respiratory illness in Humboldt, Saskatchewan. They split their study population into four cohorts, use a survey to obtain binary responses and compare prevalences between cohorts. Asthma was found to be significantly related to grain farming and sex; wheezing was related to grain farming and smoking.

Gomez, Parker, Dosman and McDuffie (1992) considered the effect of alkali dust on a Southern Saskatchewan population living near Old Wives Lake. A comparison of prevalences between the control and study groups showed increases in chronic wheeze, eye irritation and nasal irritation. Unlike the previous two studies, the researchers used forced vital capacity (FVC) and forced expiratory volume in one second (FEV1.0) in their analysis.

Xiping, Dockery and Wang (1991) measured the FVC and FEV1.0 from a sample of Beijing residents. Besides underlining the importance of coal heating as a source of respiratory problems they discovered a relationship between outdoor SO2 and FVC and FEV1.0 in subjects who had not used stove coal heating.

Ozkaynak and Thurston (1987) use U.S. mortality statistics to evaluate the effect of ambient air pollutants.
Their regression model suggests that SO4 concentration is a significant factor in mortality prediction.

The published cross sectional studies provide some evidence that respiratory health is adversely affected by irritants in the air. This study will add to the continually growing collection.

3 Issues in Cross Sectional Analyses

Before analysing the data I will describe, in broad terms, epidemiology, and then focus on the virtues and problems of cross sectional analyses.

As a science, epidemiology has had a number of high impact successes. I document the case of AIDS as a way of illustrating the role and importance of epidemiologists. Historically, the etiology of infectious disease has been easier to discover than that of noninfectious disease. Noninfectious disease is more of a problem in the industrialized world, however. With an aging population it would not be far-fetched to suggest that the quality and length of life will in part be determined by our ability to understand and control noninfectious disease.

Noninfectious disease etiology is a slippery concept. Because such diseases have multiple causes and long developmental intervals, epidemiologists are forced to consider sophisticated analytic techniques. As a result, the literature is imbued with a variety of study methodologies. I provide a short taxonomy of epidemiological studies as an introduction to this area. A more detailed look at the cross sectional study ends the chapter.

3.1 Definition of Epidemiology

Epidēmos is Greek for prevalence. Hippocrates wrote several books concerning disease prevalence in the fourth century BC. In one, he distinguishes between endemic diseases, which prevail continuously at relatively low levels, and epidemic diseases, which occur at higher than expected frequencies.
When his writings focus on the health of populations rather than individuals Hippocrates is donning the epidemiologist’s hat.

Epidemiology, as practiced today, “is the study of the distribution of states of human health and of determinants of deviations from health in human populations” (Valanis, 1992). This report is an epidemiological study by virtue of its concern with the distribution of chronic respiratory illness over a geographic area and its relationship to airborne pollutants.

3.2 The Disease Process

Disease is a complex process that can be separated into stages of prepathogenesis and pathogenesis. Prepathogenesis refers to the initial bodily changes that may or may not lead to pathogenesis. The individual’s exposure to one or more agents comprises the first stage of prepathogenesis. If the individual is susceptible he or she is unable to adapt to introduction of the agent; with successful adaptation the disease does not develop any further.

Pathogenesis occurs when the disease has successfully established itself within the host. During early prepathogenesis, events take place that are clinically difficult to detect. This period is often referred to as the presymptomatic or preclinical stage. At later stages, however, clinical symptoms show up. This is the point where the person is commonly said to ‘have’ the disease. Identification and classification of disease is complicated by partial symptoms, misdiagnosis and long lag times between early prepathogenesis and development of clinical symptoms.

Three components are distinguishable in the disease process (Valanis, 1992, Chapter 2). The first, denoted as the host, is the site of the disease. Factors relating to the host’s susceptibility, like lack of sleep, malnutrition, aging and immunity, may be useful in understanding the transition from healthiness to pathogenesis and the rate of that transition. Explicit description of the host, e.g.
the human subject, is a good first step in describing disease.

The agent, initiator of the disease process, is the second component of the trinity. The tubercle bacillus, without which tuberculosis would be unknown, is an example of a parasitic agent. Risk factor assessments try to describe and quantify the hazards agents pose. Most infectious diseases arise from a single agent; noninfectious diseases are usually the result of multiple agents.

The environment is the third component and relates host to agent. The environment is the sum of external conditions and influences affecting the life of living things. By this definition, physical, biological and socioeconomic descriptors are encompassed. The environment can affect host susceptibility and viability of the agent. Cholera, for example, spreads best when people live in crowded conditions with poor sanitation. In this study wind patterns and temperature may have a significant effect on an individual’s exposure to pollutants.

Thus, disease is a rather complicated process. The focus of epidemiology, as opposed to that of clinical medicine, say, is on the triangular relationship between host, agent and environment. To uncover the etiology or causes of disease is the goal of risk assessment studies.

3.3 Trends in Disease Patterns Since 1900

Figure 1: Canadian life expectancy at birth. (Plot of life expectancy against year, 1930 to 1990, with separate curves for females and males.)

Life expectancy in Canada, as shown in Figure 1 (Sources: Statistics Canada, 1983, and Institute for Health Care, 1990), has increased steadily since the early part of the twentieth century. The primary reasons for the increase lie in improved sanitation, better diet and our increased ability to control infectious disease. They are interrelated, of course, since a better diet reduces susceptibility and improved sanitation diminishes the opportunity for the spread of virulent bacteria. The strides made in medicine’s attack on infectious disease, however, have been nothing less than staggering.
In terms of mortality about 45% of deaths in 1900 were attributable to infectious disease. The corresponding figure in 1987 was about 5% (Valanis, 1992, p. 28).

Epidemiology has had some of its greatest successes with infectious diseases. Take, for example, the recent role played by the Centers for Disease Control (CDC) in Atlanta in identifying acquired immunodeficiency syndrome. CDC happened to be the sole supplier of pentamidine, an experimental drug used to treat a rare pneumonia seen mainly in patients whose immune systems had been suppressed, for instance by cancer chemotherapy or radiotherapy. When physicians in Los Angeles and New York City greatly increased demand for pentamidine, epidemiologists at CDC took note. In 1981 CDC’s Morbidity and Mortality Weekly Report contained two articles about a new disorder termed acquired immunodeficiency syndrome (AIDS).

The new disease had the markings of an infectious disease: AIDS spread among people who had been in contact with one another. Assuming that AIDS was caused by an unidentified virus, a series of risk assessment studies uncovered the connection that the vast majority of AIDS patients were either drug users or homosexuals. The finding had major policy implications. Public education programs commenced, health care resources were reallocated and AIDS research emerged. In short, the new disease along with associated risk factors was identified early enough to allow public institutions to reformulate policy and adjust to new realities.

The assumption of AIDS’ infectious nature turned out to be right. The human immunodeficiency virus (HIV) was discovered by a French team in 1983. They completed the cycle from initial recognition to a reasonably complete etiology of the disease (Purtilo and Purtilo, 1989). Significantly, these investigations focused on a disease that had a single cause, developed relatively quickly and resulted in a sharp increase in the number of observable cases.

Unfortunately for epidemiology’s rising star, noninfectious diseases remain the leading causes of death.
They tend to be harder to identify, take longer to develop, have multiple causes and come with long lag times between introduction of the agents and development of definitive clinical symptoms. In the wake of these developments, epidemiology must turn to more sophisticated methods and measurements to get at the complex etiology of noninfectious disease.

3.4 Types of Epidemiological Studies

In this section I contextualize the cross sectional study by describing its placement within a multitude of evaluative methods. Much of my insight derives from Valanis’ book (1992, chapter 3).

The study of disease etiology generally proceeds in an orderly manner from generating hypotheses to determining the causal mechanisms underlying the observed phenomenon. Three types of studies, sequentially related and characterized by increased effort, higher costs and greater investigative controls, are used in epidemiology: descriptive, analytic and experimental.

The descriptive study relies on easily accessible data, often mortality rates, which is analysed in a simple manner. Analyses may consist of breaking the population down by certain characteristics and then comparing mortality rate differentials. Descriptive studies are useful for uncovering unusual events or problems with data quality and can be thought of as equivalent to initial data analysis.

The analytic study relates disease to agents while attempting to control for potentially confounding covariates. At least three types of investigation fall within this category: cross sectional, case control and cohort studies. The analytic study generates hypotheses, leads to experimental studies, and tests hypotheses in an attempt to explain phenomena arising in descriptive epidemiology.

The experimental or designed study distinguishes itself from observational studies, i.e. descriptive or analytic, in terms of control.
In an experimental study the researcher selects factors of potential importance and their levels of application, and randomly assigns experimental units to the prespecified treatments. After controlling to whatever extent possible for external conditions, the researcher observes the outcomes to determine the importance of the selected factors. With an observational study, the researcher has no control over factor levels or the assignment of experimental units.

The experimental study seeks to confirm or refute certain cause and effect relationships. The investigator incorporates experimental randomization to avoid the pitfalls of systematic bias introduced by human intentionality. Under realistic conditions, only people who have the disease of interest are included in experimental studies. For one group, the experimenter removes the suspected agent from the environment; the other group acts as a control. The two groups are then followed over time to see if significantly different changes occur as a result of the treatment.

Since this study is analytic rather than experimental, each of the three analytic subtypes, namely cross sectional, case control and cohort, will be described in greater detail. The cross sectional study is like a snapshot: population characteristics are conceptually sited at a single time point. The resulting measurement of disease prevalence explains why some authors use the term ‘prevalence study.’

The case control and cohort studies differ from the cross sectional in that they follow individuals over time. Case control studies begin with two distinct populations: people with the disease and people without. Disease relationships are determined by examining both groups’ exposure to a variety of agents and uncovering the greatest differences between the two. In other words, case control studies work from disease to uncover exposure status.

As opposed to case control studies, cohort studies start from exposure status.
The best example is the Framingham study begun in 1949 and continued to the present. A random sample of individuals was drawn from the population of Framingham, Massachusetts. Physicians determined that approximately 5000 of the sampled individuals were free of coronary heart disease, thus making them suitable for the study. Every year since, the subjects have gone through a physical examination to (a) assess the health status of their hearts and (b) measure exposure risks. The hope is that the risk factors of those who develop heart disease can be identified. The cohort study can either be historical or prospective in nature.

The case of smoking and lung cancer, as documented from a historical perspective by Clemmesen (1993), is a good example of the potential and pitfalls of descriptive and analytic studies. At the turn of the century, incorrect diagnoses prevented the scientific community from identifying smoking as a problem. Much of the evidence against tobacco was anecdotal. With improvements in identification, some studies uncovered a relationship between smoking and lung cancer. Criticism aimed at the studies, however, mostly concerned with potential confounding factors and poor data quality, undermined their impact.

The development of cancer registries in Mecklenburg and New York in the early 1940s acknowledged concerns over data deficiencies. Five studies, published in 1950, produced common findings on the smoking habits of patients with lung cancer. This confirmation of earlier suspicions eventually led to the first International Symposium on Lung Cancer Endemiology at Louvain in 1952. In 1959, about 100 years after cigarettes had first been manufactured in the U.S., the U.S. Public Health Service officially pronounced its “deep concern” over the increase in age adjusted incidence of lung cancer deaths from 4 per 100,000 in 1930 to 31 in 1956.
The long time interval needed to identify the association of lung cancer with cigarette smoking points to the importance of illness classification methods, quality data and confirmatory studies.

3.5 Pros and Cons of the Cross Sectional Analysis

Cross sectional analyses provide information about some aspects of the underlying disease process; other aspects of the process remain hidden. The information primarily comes from a statistical model fitted to the cross sectional data. In this section I will describe what kinds of insight can be gained from the model. I first look at the advantages of the cross sectional study.

The model is intimately related to the data. If the data are a sample from some human population then the inferences about the model are applicable to the sampled population. In other words the study and target populations are similar if not the same. Observational studies are better than experimental studies in this respect. In experimental studies the question of how the study population relates to the target population usually looms unanswered.

In a related way, the subject’s exposure level in an observational study is realistic, though hard to measure. With controlled experiments the levels at which pollution concentrations are set often bear no relation to levels experienced by the population of real interest. The design for LD50 experiments, where the objective is to get an estimate of the dose required to kill 50% of the population, is an example.

On a practical level, observational studies give information which cannot be duplicated by experimental studies. Whenever irreparable damage to the observational subject is possible, mice, not men, will be sacrificed. Even the use of data from the Nazi hypothermia trials, which were ethically indefensible, is controversial. Observational studies, on the other hand, are nonintrusive.

Another practical consideration is cost. Cross sectional studies often attempt to get information about previous events through a questionnaire.
This is one way of evaluating long term exposures of human populations to a variety of potentially toxic chemical compounds. By contrast, designed experiments attempt to manipulate factors so that the effects can be observed. Designed experiments are very expensive when extended over many years and come with the danger that the question under study will lose relevance over time. In this sense, cross sectional studies are efficient.

For cross sectional studies in particular, a model allows the investigator to examine the relationship between response variables and predictors. The model under consideration can vary considerably; explanatory variables that are nominal, ordinal and continuous with nonconforming scales can be handled at the same time. If the standard errors of model coefficients are also estimated, we can assess the importance of the predictors relative to one another. In fact the estimation of coefficients and their standard errors is the foundation upon which the house of cross sectional analyses is built. Cross sectional studies, then, can provide useful information using standard statistical methodology.

Now I proceed to indicate a few areas in which cross sectional analyses are weak. Having information at one point in time rather than at several is one of them. First and foremost, cause cannot be ascertained since event sequence is unknown. For instance, if fitness level is found to be negatively related to a persistent cough, is the individual’s inactivity partially the cause of the cough or is the cough a precursor to lessened activity? Without knowledge of specific events ordered in time, it is difficult to make conclusive statements.

Related to this aspect of cross sectional studies is the type of disease statistic adopted. Epidemiologists make a distinction between incidence, the number of new cases of a disease in a prescribed time interval, and prevalence, the proportion of people with the disease at a given time.
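
As a rough illustration of the distinction, the following Python sketch computes both statistics from a handful of invented records. The records, the function names and the simplifying assumption that nobody recovers, dies or moves away are all hypothetical; none of it is taken from the OHS.

```python
from datetime import date

# Hypothetical records: (person id, date of disease onset, or None if never diseased).
# These five people are invented purely for illustration.
records = [
    ("a", date(1988, 3, 1)),
    ("b", None),
    ("c", date(1990, 6, 15)),
    ("d", None),
    ("e", date(1990, 11, 2)),
]


def incidence(records, start, end):
    """Number of new cases whose onset falls in the interval [start, end]."""
    return sum(
        1 for _, onset in records
        if onset is not None and start <= onset <= end
    )


def prevalence(records, on):
    """Proportion of people who have the disease on the given date.

    Assumes a chronic condition: once onset has occurred the person
    remains a case, and nobody leaves the population.
    """
    cases = sum(1 for _, onset in records if onset is not None and onset <= on)
    return cases / len(records)


new_cases_1990 = incidence(records, date(1990, 1, 1), date(1990, 12, 31))
prevalence_end_1990 = prevalence(records, date(1990, 12, 31))
print(new_cases_1990, prevalence_end_1990)
```

With these five invented records the sketch counts two new cases during 1990 and a prevalence of 0.6 at the end of that year. A single cross sectional snapshot can supply only the second number.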
Cross sectional studies measure prevalence, a concern since bias is associated with prevalence: people die or move in response to the occurrence of a disease.

Exposure is one of the key measures in air pollution risk assessment studies. Exposure occurs when the host comes in contact with an agent in the environment. Depending on the exposure data, cross sectional studies can either be ecological or relational. Ecological studies use station pollution data which are assumed to apply to the individuals who live nearby. Relational studies use exposure information collected for every individual selected in the study.

The ecological fallacy arises if one assumes an estimate based on an average applied to individuals is equivalent to an estimate based on individual measurements. If, for example, the distribution of a pollutant is heterogeneous then the absorbed dose among individuals will probably not be the same. Another possibility is that individuals attain different exposure levels due to daily commutes from one area to the next. Thus, a false relationship between pollutants and disease can be observed because the averaged exposure data do not adequately reflect the relation between subjects’ absorbed doses in different geographic areas. The ecological fallacy can be seen as arising from measurement error. The sensitivity of regression coefficients to various levels of spatial aggregation has not been studied comprehensively (Evans et al., 1984).

Another drawback of the cross sectional design arises from the lack of control. Alas, people are not randomly distributed experimental units. They are intentional agents often making decisions related to the area of scientific interest. An example is provided by asthmatics who, aware of their own hypersensitivities, exercise self selection in terms of where they live and where they work (Lebowitz, 1991).
Individuals who make informed decisions of that sort should no longer be considered, strictly speaking, observations from a stochastic process that assumes independence.

The context of these studies is important; they ought to be thought about as one part of an extensive, ongoing research effort. By itself, certainly, a cross sectional study cannot prove a causal biological relationship. Bates (1992) suggests we look for coherence in complex phenomena in building a scientific case. Epidemiological evidence should be corroborated, for example, by toxicological studies and results from molecular genetics.

4 Description of the Ontario Health Survey

In Canada, 10% of the 1991 gross domestic product was spent on health care (Evans, 1993, 32). Concurrently, little is known about the health status of the general population. This creates difficulties for provincial governments which must provide health care to individuals at local levels. What data sources are available for these agencies? Are they adequate to meet the need of the increasingly difficult task of optimizing the distribution of federally allocated funds?

The information we do have comes from administrative sources like hospital admission data. Although this data sheds light on who has received medical attention it remains silent about those who do not feed into the system. Further, profile information such as smoking history and socioeconomic status is limited or nonexistent from these sources. The need for better health data is clear.

In 1978-79, Statistics Canada conducted the Canada Health Survey, the first comprehensive health survey taken in Ontario. Successive health related surveys include the Canada Fitness Survey (1981), the Canada Health and Disability Survey (1983/84) and the Health and Activity Limitation Survey (1986/87). For provincial planning purposes, however, the available data had been inadequate. The sample sizes were not large enough to permit inferences below the provincial level.
The Ontario Ministry of Health recognized the value of funding a survey which would allow for estimates at the level of District Health Council or Public Health Unit and in 1987 proposed a health survey for the population of Ontario. Four years later the proposal became reality: the survey was carried out and a rich source of new information about the health status of the population of Ontario was created. The database obtained from the 1991 Ontario Health Survey (OHS) provides the information for our study. The large sample size, wide geographic coverage and detailed respondent information will enhance the quality of the study. This chapter gives an overview of the survey.

4.1 Objectives

The survey set out to provide baseline statistical data on the health of the Ontario population at the Public Health Unit level. The objectives are to:

• measure the health status of the population;
• collect risk factor data for the major causes of morbidity and mortality;
• collect data related to socioeconomic and demographic variations in health;
• measure awareness of high risk behavior;
• measure utilization of health services;
• collect descriptive data for health units; and
• collect data comparable to that in the Canada and Québec Health Surveys.

The long length of the resulting questionnaire reflects a vigorous attempt to achieve all the objectives. The high response burden was recognized from the start and evaluated during pretesting.

The high level of geographical coverage is worth highlighting. As Figure 2 indicates, the Province of Ontario can be divided into thirty seven Public Health Units (PHUs) or districts. The PHU is similar to the Census Division, the difference being marginal disagreements in boundaries. Some PHUs, for example, are aggregates of two Census Divisions. During analysis, the PHU will play a vital role in linking geographical pollution data to individuals in the sample.
The large size of the PHU ensures that people living within one can reasonably be expected to be bound to the area in terms of daily movement. We hope there are enough PHUs to enable a good description of pollution differentiation between areas of the province.

4.2 Data Collection Method

We will often refer to the data collection method and so describe it now in some detail. Figure 3 depicts a flowchart summary. I will comment on some of the pros and cons associated with each part of the process.

Figure 3: Flowchart summary of the OHS data collection procedure.
    1. Identify Household in Sample
    2. Conduct Personal Interview with all Household Members
    3. Leave Self Completed Questionnaire
    4. Handle Nonrespondents with Follow up Telephone Call

Once a household is identified in sample, a household record form is created. Newly constructed buildings or subdivisions, outdated maps and dangerous neighborhoods can make identification difficult.

As soon as the household record form is available, the personal interview of one household member is possible. That member must be knowledgeable since he or she reports on all of the other members of the household. Obstacles to successful completion of this phase include inability to contact anyone and errors associated with inaccurate determination of household members.

There are at least three benefits from using a personal interview. First, the survey taker gets a chance to introduce the survey without being ignored. Second, empirical observation shows that a higher response rate is generally achieved by a personal interview than by self enumeration alone. Third, telephone follow up will be easier since the caller will be making a ‘warm’ call.

A self completed questionnaire is left for each household member aged 12 and up.
If that member does not return the questionnaire before a specified time, two telephone calls are made.

The combination of personal interview, self enumeration and telephone follow up represents a compromise between total survey cost and response rate.

4.3 Target Population

The target population for the interviewer completed portion of the survey is all residents of private dwellings in Ontario during the survey period (January to December of 1990). The 1991 estimate for the population of Ontario is 8.1 million people. As in many surveys conducted by governmental agencies, residents of Indian reservations, inmates of institutions, foreign service personnel and residents of remote areas were excluded.

The target population for the self completed portion of the survey is similar except that the population includes only people aged twelve and up.

4.4 Pretesting

The Ontario Ministry of Health hired Statistics Canada to conduct a pilot survey for the OHS. The four objectives were to:

• identify weaknesses of the content, wording and structure of the questionnaire;
• evaluate the efficacy of the training procedures and field operations;
• quantify the effect of questionnaire length on response rates; and
• assess the use of an incentive to boost response rates.

In May 1989, a total of 800 dwellings in Peterborough County and the Municipality of Hamilton-Wentworth were surveyed.
All households went through the same interviewer portion; four versions of the self completed questionnaire were equally divided among dwellings. Two follow up telephone calls were made to jog the memories of individuals who had not yet returned their questionnaires.

Let {Basic} be the core of the questionnaire (the questions take about 15 minutes to fill out); {Food Frequency Schedule}, the set of questions pertaining to the amount and frequency of different kinds of food the respondent has consumed; and {Linkage Information}, the set of questions asking for additional identifying information (middle names, maiden names, previously used surnames and birthplace). The four versions of the self completed questionnaire can then be summarized as follows:

Version A: {Basic}
Version B: {Basic} + {Food Frequency Schedule}
Version C: {Basic} + {Food Frequency Schedule} + {Linkage Information}
Version D: {Basic} + {Food Frequency Schedule} + {Linkage Information} with an incentive.

The incentive in Version D was three $500 prizes drawn at random from those respondents who replied promptly. Two of the objectives were met by including these four versions of the questionnaire.

Figure 4: OHS pretest questionnaire response rates. (Bar chart of response rate, in percent, by questionnaire version.)

The response rate for the personal interviewer completed portion was 87%. The overall response rate for the self completed portion was 69%, somewhat under the 75% response rate the Ontario Ministry of Health was aiming for. The main conclusions of the pretest follow from the response rates for each version of the questionnaire, shown in Figure 4. First, adding the food frequency schedule reduced the response rate marginally. Second, requesting the extra identifying information seemed to have a significant adverse effect on the rate. Third, the incentive worked dramatically. The male response rate was 10% lower than that for females. The incentive remarkably increased response rates for males between the ages of 16 and 60.
Finally, the use of telephone follow up proved worthwhile since, before calling started, the response rate was below 50%. Although Statistics Canada recommended version A or version D on the basis of the rate set out by the Ontario Ministry of Health, version B was eventually chosen as the best of all candidates.

4.5 Questionnaire Content

The questions and format of the OHS came from various sources. Previous survey questionnaires like the Canada Health Survey, Québec Health Survey, General Social Survey and Health Promotion Survey were used as models. Potential users of the data, i.e. units within the Ontario Ministry of Health such as the Public Health Branch, Public Health Units and District Health Councils, were given an opportunity to develop survey content. Finally, organizations like Statistics Canada and Santé Québec were consulted along the way.

The final form of the questionnaire breaks down into three separate parts: the household record form, the interviewer questionnaire and the self administered questionnaire. The household record form keeps track of the dwelling and the identities of the household members. The content of the interviewer and self administered questionnaires is summarized in Table 1.

The important sections for our study are highlighted with an asterisk in Table 1. For the interviewer completed portion the chronic health problems section contains the two response variables considered in our study, chronic cough and asthma. The personal interview achieves the highest response rate among survey delivery techniques and favorably affects the quality of the response variable data. The other variables, on the other hand, come from the self completed part of the survey.

Sociodemographic data often varies with health outcome. We will therefore want to include a selection of sociodemographic variables from the list. The information derived from questions on the OHS ranges from country of birth to education, income and housing.
Comparability of sociodemographic data to other sources, e.g. the Canada Census, provides a possible data integrity check.

Information on the multifaceted phenomenon of smoking is extremely important for any study of respiratory health. The section devoted to smoking is comprehensive, identifying smoker type, the number of cigarettes smoked daily and the ages at which smoking began and ended, where appropriate. As a bonus, questions aimed at the extent of second hand smoke were also asked.

Table 1: OHS questionnaire content.

• Household Record Form
• Interviewer Completed Questionnaire
    Contacts with Health Professionals
    Disability within the last Two Weeks
    Use of Medication
    Medical Insurance
    Accidents and Injuries
    Health Status
    Restriction of Activities
    Chronic Health Problems*
    Health Problem Probes
    Socio-economic Information*
• Self Completed Questionnaire
    Your Health
    Medicine and Drugs
    Smoking*
    Alcohol*
    Your Family*
    Dental Health
    Your Life in General*
    Driving and Safety
    Women’s Health
    Sexual Health
    Occupational Health*
    Physical Activities*
    Nutrition

The Short Michigan Alcohol Screening Test (SMAST) score is included in this study even though, on the surface, it may be of peripheral interest. The score identifies drinkers and alcoholics with a view towards reliability. With the stigma attached to alcoholism, the respondent may be sensitive about questions related to drinking. Research has shown that the SMAST is minimally affected by denial tendencies (Selzer, 1975).

A series of twelve questions under the heading ‘Your Family’ were weighted and summed to obtain a general family functioning score. It is supposed to reliably measure family functioning. The success or failure of interpersonal relationships within the immediate family might be thought of as another demographic characteristic of the individual. If there is some truth to the relationship between mental and physical health, covariates such as family functioning ought to be included in the study.
In a similar vein, questions from the section ‘Your Life in General’ seek to measure social support outside of immediate family. An analogue to the family functioning score, the general well being score, was constructed.

Health outcomes may in part be determined by conditions in the workplace. I take a number of questions delving into workplace exposure to hazardous materials from the ‘Occupational Health’ section of the OHS.

Lastly, physical activity has a direct impact on human health. Effects generally include the reduction of premature morbidity and an enhancement of emotional well being. The OHS asked about the type and frequency of physical activity that took place in the last month.

In conclusion, the OHS questionnaire obviously strives to be comprehensive. The sheer number of questions, totaling over one thousand, attests to the fact. With the availability of such data, this project has as good a chance of uncovering a relationship as could reasonably be expected from an ecological study.

4.6 Survey Methodology

Survey methodologists implement sampling designs that meet given specifications under known constraints. In terms of a 95% confidence interval, the OHS design objective is to enable PHU proportions as low as 3% to be estimated within 50% of the estimate. For an estimated PHU proportion of 3%, the 95% confidence interval would be, in terms of percentages, (3 - [1/2]3, 3 + [1/2]3) = (1.5, 4.5). Design constraints include budget and available sampling frames.

The chosen sampling frame, a frame of enumeration areas, comes from the 1986 Census. The enumeration area (EA) is the smallest area for which population counts can automatically be retrieved. Each EA is situated in a PHU and classified as either urban or rural. Specifically, urban EAs represent the urban core and fringe of census metropolitan areas or census agglomerations.

A multistage stratified cluster sample is a good description of the OHS survey type. PHU and the urban/rural bifurcation stratify the population.
The purpose of stratification is to group dwellings into homogeneous units with respect to measurable characteristics of interest. The estimates of population characteristics are for the most part more precise when using a stratified sample over a simple random sample. The primary sampling unit is the EA. In the first stage the survey takers sampled an average of 46 EAs within a PHU. They then constructed a list of dwellings for each of the selected EAs. The list became the sampling frame for the second stage of the sample. Clusters of dwellings, about fifteen from urban strata and twenty from rural strata, were sampled at the second stage, resulting in the desired sample size of approximately 760 dwellings per PHU.

Reliable estimates of proportions greater than 3% had to be achieved at the PHU level. How was this criterion used to arrive at the sample size? Let $\hat{p}$ be an estimate of a proportion $p$ using weights determined by the survey design. Most designs for surveys conducted at a provincial or national level produce estimates with less precision than those that could be obtained by taking a simple random sample. The design effect (deff) of a proportion, in this case estimated to be two (Ministry of Health, 1992a, p. 29), represents the factor by which the variance of an estimated proportion is inflated. Thus, letting $\mathrm{Var}_{srs}(\hat{p})$ be the variance of $\hat{p}$ under simple random sampling from a sample of size $n$ and ignoring the finite population correction factor,

$$\mathrm{Var}(\hat{p}) = \mathrm{deff}(\hat{p})\,\mathrm{Var}_{srs}(\hat{p}) = 2\,\mathrm{Var}_{srs}(\hat{p}) = \frac{2p(1-p)}{n}.$$

The coefficient of variation is a scale free ratio of an estimate’s precision to the expected value of the estimate. The OHS criterion for proportions greater than 3% was a coefficient of variation less than 25%:

$$\mathrm{C.V.}(\hat{p}) = \frac{\sqrt{2p(1-p)/n}}{p} < 0.25. \qquad (1)$$

Since, by (1),

$$\frac{2(1-p)}{np} < \frac{1}{16} \quad \text{or} \quad n > \frac{32(1-p)}{p}$$

and

$$\max_{p \in [0.03,\,1]} \frac{1-p}{p} = \frac{1-0.03}{0.03} \approx 32,$$

the approximate sample size needed to fulfill the reliability criterion is n = 32 × 32 = 1024. For 46 PHUs the sample size translates to about 48,000 individuals.

Note that the sample size was further increased to account for expected nonresponse. The actual sample size resulted in 49,200 individuals responding out of 61,300 surveyed. The 49,200 individuals represent 35,500 households. To recapitulate, the large sample size exists to meet design specifications.

This section was included to give the reader a flavour of the methodological intricacies lurking behind the Ontario Health Survey data. The complex survey design induces unequal probabilities of selection for the survey population; the weights associated with each survey respondent reflect selection probabilities adjusted for nonresponse and age-sex population totals at the PHU level. This should caution any analyst to consider carefully what types of inference are supportable by an analysis of the data.

4.7 Nonresponse Rates

Characterizing response rates for the OHS is not trivial. From the outset, the personal interview and self enumeration introduce at least two response rates: the OHS had a response rate of 88% for the former and 77% for the latter. Item response rates further fog the issue. I will document some of the difficulties in coming to terms with item nonresponse.

The topic of response rates is conventionally rephrased in terms of nonresponse rates. I will adhere to that scheme. Nonresponse can be divided into unit and item nonresponse.

Unit nonresponse occurs when the survey taker does not receive any information from the respondent. For the OHS, unit nonresponse happens if nobody in the household goes through with the personal interview. Unit nonresponse is handled operationally by modifying the probabilities of selection for the units selected in sample that do respond and, essentially, ignoring the nonrespondents.

Item nonresponse occurs when the survey taker procures only partial information about the respondent.
Reasons for this type of nonresponse include respondent reluctance to answer sensitive questions and mistakes made during the transcription of data from the actual survey form to an electromagnetic file. For the OHS, item nonresponse is complicated by the fact that the survey is conducted using both personal interview and self enumeration. Thus, item nonresponse arises in the following scenarios:

Household Personal Interview    Family Member Self Enumeration
item nonresponse                unit nonresponse
item nonresponse                item nonresponse
item nonresponse                complete
complete                        unit nonresponse
complete                        item nonresponse

There are two common means of handling item nonresponse. The first is imputation, a technique that fills in missing values. The two imputed OHS variables are age and sex. Missing values were imputed for the OHS by generating random values proportionally consistent with known PHU age-sex proportions. With imputed data the user cannot determine the rates of item nonresponse.

The second way of dealing with item nonresponse is to tell the user directly by allowing for a “not stated” category. This strategy allows for calculation of item nonresponse rates and the uncovering of nonresponse patterns. For example, the nonresponse for the eight questions on work exposure is given in Figure 5. Though the nonresponse rate hovers around 8% for each of the questions, for the most part the respondents either answer all of the questions or none.

Smoking nonresponse, shown in Figure 6, is an example of a more subtle pattern.
The first question determines if the respondent currently smokes. Current smokers are asked the next four questions while everybody else answers the six after that. The graph shows a higher item response rate for smokers but this merely reflects that a smaller proportion of the population can be classified as current smokers. Since the two groups are mutually exclusive, the nonresponse rate for their union is approximately the sum of the current smoker and not current smoker nonresponse rates.

Figure 5: Nonresponse rates by question number for the work exposure questions. The same people are responsible for most of the nonresponse.

Figure 6: Nonresponse rates by question number for the smoking questions. Current smokers were asked questions two to five. The others were asked questions six to eleven.

As a last example, consider the nonresponse pattern, shown in Figure 7, for a collection of questions concerning personal well being. Older respondents seem to be more sensitive to questions concerning their well being. The implication is that nonresponse is generally not random even though it is convenient to assume so for purposes of an analysis.

Figure 7: Nonresponse rates for the well being score by age.

In conclusion, the Ontario Health Survey nonresponse is significant enough to warrant attention during analysis. What rates of nonresponse are there for the study variables? How will item nonresponse be dealt with for discrete and continuous variables? These types of questions will be dealt with as they arise.

4.8 Comment

The OHS comes with all the strengths and weaknesses of large scale survey data. The drawbacks include missing data and the survey weighting structure induced by complex survey methodology.
Also, despite the wide subject coverage of the survey, the OHS provides no estimates of an individual's exposure to potentially harmful pollutants. Because of this absence we are left with the difficulty of estimating and linking pollution data from another source. Admittedly, this criticism is somewhat unfair given the survey's objectives.

These drawbacks notwithstanding, I am delighted to have access to data backed by impressive resources, human and otherwise. One may think of the panels of experts who determined questionnaire content; those involved with the pretest; the survey methodologists; and the many who played a part in the field operations, from the training staff, interviewers and field supervisors to those completing the cycle with data capture and imputation. Untold hours went into the production of what is for me a starting point: the microdata file!

5 Initial Data Analysis

This section will give the reader an overview of the data used in the succeeding analysis. I draw data from two sources: the 1991 Ontario Health Survey (OHS) and six years of atmospheric environmental monitoring.

The OHS target population (minus immigrants who have lived in Ontario for less than ten years) comprises the study population. We exclude recent immigrants because our outdoor air pollution estimates would not adequately represent their true exposure history.

I am forced to consider individuals as the unit of analysis because the OHS public data file, restricted for reasons of confidentiality, does not identify their household. A conflict immediately arises since the analysis should be in synchronicity with the survey design, meaning that households rather than individuals should be the unit of analysis. One implication of employing standard estimation techniques is that standard errors will be underestimated if no adjustments are made.

In its complete form the OHS data is unmanageable. There are over 1000 variables of which many are of no use to this study. The first task is to cull the data.
I give a qualitative and graphical description of the reduced set of variables along with associated nonresponse rates.

The pollution data I start with derive from an involved inferential process. The original data come from thirty seven atmosphere monitoring stations. Each station potentially measures up to four pollutants, namely nitrogen dioxide (NO2), ozone (O3), sulfur dioxide (SO2) and sulfates (SO4). The station data is interpolated for each public health unit (PHU) by Weimin Sun to whom I am indebted.

I convert the monthly averages, given over the six year period 1983-89, into single summer and winter averages. Thus, each PHU has a six year average which I assume adequately represents personal lifetime pollution exposure. For a flow diagram of the way the study data is derived see Table 2.

1991 OHS Data                       1983-89 Daily Atmospheric
(Public Health Unit is the          Monitoring Station Data from
finest geographic partition)        Scattered Ontario Stations
            |                                   |
            v                                   v
OHS Study Variables                 Monthly Pollution Estimates
by Public Health Unit               by Public Health Unit
            |                                   |
            +------------> Study Data <---------+

Table 2: Flowchart of the derivation of the study data.

To close, I graph each of the covariates against the two response variables. This will provide intuition about at least the one term models we fit later.

5.1 OHS Covariates

The OHS data contains information on innumerable aspects of the population of Ontario. Since these data derive from measuring over 1000 variables, a subset that will best relate pollution to respiratory illness must be selected. In this section I will describe what covariates were selected, report their associated nonresponse rates and illustrate their marginal distributions graphically.

This study focuses on asthma and emphysema as the response variables. The two questions from the interviewer portion of the questionnaire were:

“Do you have asthma?”; and
“Do you have emphysema or chronic bronchitis or persistent cough?”.

The questions assume an ongoing chronic condition by the way in which they are asked.
A small fraction of questionnaire respondents (0.9%) did not respond to the two questions specifically. This fraction will be ignored from here on. Figure 8 illustrates the prevalence of asthma and emphysema. The low rates observed point to the need for a large sample: such studies could be straight-jacketed by the lack of statistical power resulting from small samples.

Figure 8: Prevalence of the two response variables. (Panels: Asthma, Emphysema; percent answering Yes or No.)

To select the predictive covariates, I tried to obtain a satisfactory coverage of the following set of individual descriptors: demographic, lifestyle, health, socioeconomic, and pollution exposure measures. The categories are rather arbitrary, though I chose them to make the presentation of results more comprehensible. Table 3 exhibits the set of variables I selected in each group. The grouping adhered to in the table will generally not make any difference to the outcome of reported results; the arcsine analysis is the only exception.

Most of the study variables have been massaged in a variety of ways. The producers of the OHS data file made the preliminary alterations. To make the file easier to use, they combined subsets of the original questions to get derived variables. Household type, for example, classifying the respondent according to a description of the relationship between the family members of the household, was derived from the Household Record Form for each family in each household.

Household income is another example. The variable classifies the respondent into one of three household income groups: low income; not low income but less than $50,000; and income $50,000 or more. A question that arises is how exactly low income is defined.
The cost of living, household size and household income are determinants that should be taken into account. The rule adopted by the OHS is based on poverty lines and low income cutoffs developed by the National Council of Welfare and Statistics Canada. Place of residence (urban or rural), income and household size determine low income classification (Ministry of Health, 1990, p.11).

Grouping        Covariate                              Type
Demographic     Rural or Urban Stratum                 Nominal
                Sex                                    Nominal
                Age                                    Continuous
                First Generation Immigrant             Nominal
Socioeconomic   Family Type                            Nominal
                Family Functioning Score               Continuous
                Blue Collar Work                       Nominal
                Work Exposure                          Ordinal
                Post-Secondary Education               Ordinal
                Income                                 Ordinal
Lifestyle       Smoker Type                            Ordinal
                Current Smoker Type                    Nominal
                Duration Smoked                        Continuous
                Number of Cigarettes Smoked            Integer
                Number of Current Household Smokers    Integer
                Alcohol Problem                        Ordinal
Health          Body Mass Index                        Continuous
                Energy Expenditure                     Continuous
                Well Being Score                       Ordinal
                Allergy                                Ordinal

Table 3: List of study covariables.

I introduced the second set of modifications which are typified by the example in Table 4. In effect I used existing OHS variables to derive a new variable. I chose this route to reduce the number of explanatory covariates under consideration. Clearly the choice was arbitrary to some degree.

The last modification of the original OHS data is related to item nonresponse. I will delay my exposition of nonresponse until after I have described the explanatory covariates in more detail.

Questions: In your job or business have you, in the past twelve months, worked with
1. dust from wood, grain, hay or straw?
2. dust from silica, granite or rock dust?
3. glass fiber dust or asbestos?
4. dust or fumes from lead, cadmium, nickel, chromium or mercury?
5. fumes from solvents, paints or gasoline?
6. resins or isocyanates?
7. pesticides?
8. coal tar or pitch?

Responses: Never; Occasionally; Often; Always; Don't Know; Not Applicable; Not Stated.

Work Exposure = Yes if ‘Often’ or ‘Always’ at least once,
                No otherwise.

Table 4: How the OHS work exposure questions form a covariate for this study.

The demographic covariates describe certain unalterable features of the respondent. I chose stratum, sex, age and an immigrant indicator as the demographic covariates. Their marginal distributions are shown in Figure 9. As with the other estimates in this chapter I used survey weights, as prescribed by OHS documentation (Ministry of Health, 1990(c), p.3), to produce the estimates.

Figure 9: The demographic covariates. (Panels: Stratum, Sex, Age in years, Immigrant.)

The four demographic covariates look reasonable. The Canada Year Book (Statistics Canada, 1991, p.73) tells us that 83% of the Ontario population resides in an urban setting. The OHS weighted estimate is 86%, as expected. The division between the sexes is about fifty-fifty and the age distribution shows a bulge for the baby boomers. The immigrant indicator shows what portion of the study population are immigrants who arrived before 1980.

Age may be an important explanatory variable. The probability of death increases with age, ranging from 3% for Canadians between the ages of one to 24 to 71% from 65 years and up (Future Health, p.102). Any disease related to chronic exposure over a long period of time might be expected to be related to age. For ailments of the respiratory tract, older people appear to be more vulnerable to inhaled particles and gases (Brain, 1989).

Some studies consider ethnicity as a demographic characteristic. When ethnicity proves to be a good predictor, however, it likely reflects class membership, as in the case of aboriginals. As a group, chronic cough or emphysema affects them at almost double the Ontario average (4.5 versus 2.4%). This phenomenon is explained by the well documented condition of poverty in which many natives live.
The OHS data reinforce this view with an observed moderately high negative correlation between chronic cough and standard socioeconomic variables such as education and income. The higher prevalence of illness therefore reflects structural inequalities within society rather than race (Steinberg, 1984). Since ethnicity in Canadian society, exempting aboriginals, does not generally imply class membership, this study relies on the socioeconomic indicators to capture those inequalities.

Socioeconomic variables, shown in Figure 10, are important indicators of longevity (see references in Evans, Tosteson and Kinney, 1984). Presumably they are also good predictors of respiratory health. The socioeconomic covariates represent information on the family unit, work, education and income. All of these variables are categorical, including the seemingly continuous family functioning score. The score, however, is a weighted score of responses from a series of twelve questions from the self completed questionnaire. The actual cutoff is supposed to represent the best division distinguishing families seeking clinical help from those in the general population. The income categories definition depends on poverty lines and low income cutoffs developed by the National Council of Welfare and Statistics Canada. The formula adjusts for household size, area of residence (urban/rural) and household income (Ontario Ministry of Health, 1992a, p.10, 21-22).

Lifestyle covariates, shown in Figure 11, may turn out to be the most important set of explanatory variables due to the impact of smoking on lung function. The best of them is probably duration smoked as it is continuous and more reliable than the other continuous covariate, the number of cigarettes smoked daily. The measurement of the number of cigarettes smoked daily illustrates the tendency of people to ‘think in fives’ when asked for a simple answer to a complex habit.
Second hand smoke exposure was captured in the personal interview when the respondent was asked if anybody in the household smoked. Finally, a drinking problem index, developed from a shortened form of the Standard Michigan Alcohol Screening Test (SMAST), is included as one of the lifestyle indicators.

The general health of a person may have an effect on specific pathologies such as asthma or emphysema. Figure 12 illustrates the distributions for the health covariates. Body mass index (kg/m²) and exercise expenditure (kcal/kg/day) are continuous and reflect individuals' participation in physical exercise. Well being, for reasons similar to the family functioning score, is ordinal. The family functioning variable divides families into functional and dysfunctional.

Figure 10: The socioeconomic covariates. (Panels: Family Type; Family Functioning Score, healthy 78% and dysfunctional 22%; Blue Collar Work; Work Exposure; Completed Education; Income.)

Figure 11: The lifestyle covariates. (Panels: Smoker Type; Current Smoker Type; Duration Smoked in years; Cigarettes Smoked, number per day; No. of Household Smokers; Alcohol Problem.)

Figure 12: The health covariates. (Panels: Body Mass Index in kg/m²; Energy Expenditure in kcal/kg/day, inactive 68%, moderate 16%, active 14%; Well Being Score; Allergy.)

Most of the OHS covariates display some degree of nonresponse though for the graphs given so far, the nonresponding portion of the sample was ignored. I will now take some space to characterize the nonresponse for the study data.

Nonresponse can be divided into unit and item nonresponse.
In the context of the OHS, unit nonresponse was either at the household level, where no personal interview took place, or at the household member level, where an individual's self completed questionnaire was not obtained. Item nonresponse occurred when a respondent provided partial information. Here, item nonresponse manifests itself as a respondent who partially omitted information on the self completed part of the survey.

The OHS had a unit response rate of 88% for the personal interview and 77% for the self completed questionnaire. For the rest of the study I will consider only the 77% of respondents responding to the self completed questionnaire.

Figure 22 in the Appendix summarizes item nonresponse. The range is rather striking. Hovering around 15% are well being, energy expenditure and household income. For the next set of variables, from the number of cigarettes smoked to smoker type, item nonresponse stands at about 10%. Blue collar work, allergy, education and immigration show negligible nonresponse.

There are at least three plausible explanations for high nonresponse. First, the respondents' sensitivity to the question has an effect on rates. The question on household income is a sterling example. Many Canadians may well feel uneasy about providing the information. For some the anxiety is culturally related; for others revealing income may be embarrassing; yet others may feel the government could use the survey as a device to nab tax evaders. Well being is another example: observed nonresponse increases as age increases.

Second, the length of the questionnaire will have an effect on both unit and item nonresponse. In the OHS pilot study, a short version of the survey resulted in higher overall response levels.
There is reason to believe that item nonresponse is also affected by questionnaire length. The fact that some household members skipped whole parts of the survey is suggestive.

Third, some variables are composite indicators and nonresponse can then result from a missing answer for only one question. The well being score is a weighted total of one positive and one negative statement covering seven categories: energy, control of emotions, state of morale, interest in life, perceived stress, perceived health status and satisfaction about relationships. If any one of fourteen questions went unanswered the well being score was coded as “not stated”. This phenomenon may explain the nonresponse for the two other composite variables, family score and energy expenditure.

Some of the variables seem to have achieved 100% response rates. In the case of the geographic indicators Public Health Unit (PHU) and stratum, data exists for all respondents because the variables were used in survey stratification. In other cases the response rate is artificial. An imputation method can, for example, fill in data where data is missing. Age and sex were imputed using a random assignment mechanism based on census information on a stratum's population breakdown into age and sex categories.

Exposure Assessment    Effects Assessment
Level                  Hazard Identification
Distribution           Type of Effect
Number of People       Dose Response
Target Dose            Risk Characterization

Table 5: Types of pollution assessment studies.

From this look at item nonresponse we can agree that some of the variables are more reliable than others. Looking at the set of study variables together, only 55% of the respondents have given complete information. At the modelling stage provisions must be made to deal with the item nonresponse.

In summary the 1991 Ontario Health Survey is the source of the two response variables and twenty explanatory covariates.
The majority of covariates are categorical rather than continuous, reflecting the difficulties of obtaining good continuous measurements from a large scale survey. The set of explanatory variables covers a range large enough to build decent models. The one ominous omission is the pollution data to which I now turn.

5.2 Pollution Covariates

The measurement of internal dose of a pollutant is very important in pollution effects studies. In most cases, however, the internal dose is unknown. The theme of this section is how the available exposure data is related to an individual's internal dose.

Consider the two broad classes of pollutant assessments shown in Table 5. The most commonly available data measures environmental releases or concentrations in specific media; exposure measurements such as the number of people exposed and their absorbed dose are relatively rare. Exposure data is therefore generally estimated by making assumptions that will allow pollution concentration data to be linked with individuals falling in a certain geographical area. For most pollution assessment studies the data is clearly imperfect. This study falls under the effects assessment heading as a risk characterization study. How good the assumptions needed are is uncertain but at least one study, where NO2 exposure measured by personal monitoring devices was compared to station measurement, suggests station measurements are adequate for airborne pollutants (Hackney et al., 1992).

The contact between a person and a pollutant in an environmental medium is termed exposure. Exposure is completely described by the route by which the pollutant enters the body; the concentration of the incoming pollutant; the duration of the exposure; and the frequency of exposure. Most pollutant data measure pollutant concentration for one medium covering some geographic area (Sexton et al., 1992).

I make the following assumptions about the pollution data:

• The weather station sites have well calibrated measuring instruments. That is, at a given concentration of SO4, say, all stations would give a reading closely corresponding with other stations.
• A six year average of site pollution is a good indicator of longer term averages.
• Average pollution levels do not fluctuate radically from decade to decade.
• Spatially, ambient aerosol pollution is homogeneously distributed within Public Health Units (PHU).
• Ambient aerosol pollution levels are proportional to an individual's absorbed internal dose.
• Alternative sources and environments for the pollutant under study are negligible.
• Individuals within a PHU, the approximate equivalent of a Census Division, are spatially stable; migration from one PHU to the next is minimal.

These assumptions make the available data appropriate for our analysis. The data come from 37 atmosphere monitoring stations unevenly scattered across Ontario. Not all stations measure all pollutants of interest.

The station data is used to predict pollution levels for the PHU centroids for which there are no stations. Brown, Nhu and Zidek (1993) have developed a methodology for spatially interpolated predictive distributions. The modification and implementation of their method, described in Duddek et al. (1994), produced the predicted monthly averages used in my analysis. In this section I will heuristically explain the steps taken to get the pollution estimates.

The first step begins with the original measurements from which the predicted means are eventually estimated. Nitrogen dioxide (NO2), ozone (O3), sulfur dioxide (SO2) and sulfate (SO4) are the four pollutants considered in this study. The daily measurements, taken over the six year period 1983-89, have been converted into monthly averages. Figure 13 is an example of a measurement of one pollutant at one station and indicates how the graphs are to be understood.

Figure 13: Typical graph of a station's measurement of one pollutant. (Observed monthly values by year, 1983 to 1989, with the station mean and an outlier marked.)
The more complete sets of graphs are given in the appendix by Figures 23 to 26.

Station measurements of NO2 are given in Figure 23. The plots of twelve of the thirteen stations are ordered by ascending station averages. By looking at the first and last stations one gets a sense of the mean range, in this case from 25 to 45 µg/m³ NO2. Low outliers appear in graphs eight and ten. Their existence is partly explained by many missing daily measurements. If at least one daily measurement is available for the month then the monthly mean is calculated. Nothing is done about measurement or transcription error since no information on data quality is available. If no measurements exist for the month a monthly mean is imputed. On the whole, the effect of outliers on the six year mean is minimal; the observed difference between the simple average and winsorized mean eliminating one observation on each extreme is less than five percent.

Ozone measurements are given in Figure 24. The ozone data is better than the NO2 data in two respects. First, there are almost twice as many stations measuring ozone as NO2. Other conditions being equal we have more information available for ozone. Second, in contrast with NO2, a temporal pattern with peaks in summer and lows in winter is evident. This allows me to more easily identify poor stations by observing which deviate from the trend. The data from the station in the fourth row and the third column, for example, looks somewhat suspect.

Sulfur dioxide and sulfate measurements are next. Twice as many stations measure SO2 as SO4. No seasonal pattern appears in either. SO2 has the highest between station variability among pollutants.

Within each PHU we choose a centroid for which a monthly estimate will be produced. The estimate is supposed to be a good proxy for the internal dose experienced by people living in that PHU. Our interest lies in estimates of six year averages, broken down by summer and winter.
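As a minimal sketch of the seasonal averaging described here, the Python fragment below collapses monthly means into one summer and one winter average. The month split is an assumption: the thesis refers later only to a four month summer and an eight month winter, so the choice of June to September, the function name `seasonal_averages` and the sample values are all hypothetical.

```python
from statistics import mean

# Assumption: summer = June to September. The thesis gives only the
# season lengths (four and eight months), not the months themselves.
SUMMER_MONTHS = frozenset({6, 7, 8, 9})


def seasonal_averages(monthly):
    """Collapse monthly means into (summer average, winter average).

    monthly maps (year, month) to the monthly mean concentration
    estimated for one PHU centroid.
    """
    summer = [v for (_, month), v in monthly.items() if month in SUMMER_MONTHS]
    winter = [v for (_, month), v in monthly.items() if month not in SUMMER_MONTHS]
    return mean(summer), mean(winter)


# Hypothetical monthly ozone values for a single centroid over two years.
monthly = {
    (year, month): (50.0 if month in SUMMER_MONTHS else 30.0)
    for year in (1983, 1984)
    for month in range(1, 13)
}
summer_avg, winter_avg = seasonal_averages(monthly)
```

Averaging the monthly values directly gives every month equal weight, which matches the description of single summer and winter averages but ignores how many daily readings stand behind each month.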
By dividing the estimates into two types we had hoped to capture subtle spatial differences that may exist between seasons. The final pollutant estimates are shown in Table 6.

                                          Summer                      Winter
Public Health Unit                  O3    NO2   SO2   SO4      O3    NO2   SO2   SO4
Eastern Ontario                     45.9  26.6  22.5  4.04     30.3  31.3  22.8  4.30
Ottawa-Carleton                     46.0  25.5  22.8  4.33     30.0  31.5  22.1  4.42
Leeds, Grenville and Lanark         46.1  25.5  19.7  4.75     29.5  30.8  22.1  4.55
Kingston, Frontenac, Lennox         46.3  25.6  18.0  5.24     28.5  30.5  21.2  4.51
Hastings and Prince Edward          45.9  25.6  17.6  5.26     28.4  30.9  20.7  4.54
Haliburton, Kawartha, Pine Ridge    46.3  28.2  16.9  4.88     29.1  33.1  18.8  4.78
Peterborough                        45.8  26.0  17.0  5.23     28.3  31.5  19.7  4.68
Durham                              48.1  30.7  16.6  5.89     27.9  32.8  19.2  4.92
York                                48.0  33.3  15.4  4.97     29.0  34.7  18.3  4.85
Toronto                             49.8  37.4  18.6  5.99     27.6  34.6  20.2  4.91
Peel                                48.7  36.1  16.0  4.89     28.7  35.7  18.7  4.73
Wellington, Dufferin and Guelph     49.0  37.8  16.2  4.37     29.6  37.4  19.0  4.66
Halton                              50.4  38.4  17.8  5.48     28.0  35.5  20.0  4.76
Hamilton-Wentworth                  52.5  36.1  18.4  6.17     28.8  33.9  20.3  5.10
Niagara                             56.4  24.1  18.0  8.73     30.9  27.7  20.3  6.10
Haldimand-Norfolk                   57.5  25.5  18.3  8.75     31.1  28.7  20.1  6.25
Brant                               52.6  35.7  18.0  6.01     29.0  34.2  20.2  5.05
Waterloo                            49.4  39.7  16.5  4.67     28.4  37.7  19.8  4.47
Perth                               49.5  40.1  18.7  4.33     29.0  39.0  21.5  4.40
Oxford                              52.7  36.1  18.3  5.76     29.1  35.3  20.5  4.91
Elgin-St.Thomas                     56.5  30.1  20.4  7.60     30.3  32.7  20.9  5.57
Kent-Chatham                        57.8  29.5  22.1  8.46     30.2  35.2  22.5  5.48
Windsor-Essex                       58.9  28.4  22.9  9.23     30.1  36.8  24.1  5.51
Sarnia-Lambton                      55.3  37.0  29.1  5.89     30.0  39.0  25.3  5.11
Middlesex-London                    53.8  36.6  21.9  5.73     29.4  37.1  22.3  4.86
Huron                               49.4  40.5  20.3  4.10     29.8  39.6  22.8  4.52
Bruce-Grey-Owen Sound               47.5  37.1  16.1  3.63     31.7  37.7  19.0  4.66
Simcoe                              47.2  33.1  15.0  3.93     31.0  36.1  17.8  4.54
Muskoka-Parry Sound                 44.7  31.8  17.7  3.27     28.5  36.0  23.2  2.90
Renfrew                             44.9  28.3  19.5  3.79     29.8  34.4  20.3  3.57
North Bay                           42.8  31.4  20.0  2.80     27.9  34.2  21.9  2.56
Sudbury                             38.6  33.3  33.3  1.96     26.4  35.7  30.1  2.15
Timiskaming                         43.5  32.3  25.1  2.63     28.0  34.0  23.6  2.47
Porcupine                           47.5  32.2  21.9  3.19     28.9  33.5  22.5  2.61
Algoma                              39.0  34.2  30.3  2.00     26.6  36.3  28.6  2.16
Thunder Bay                         43.7  33.9  24.7  2.49     27.0  35.3  25.3  2.11
Northwestern                        49.0  32.2  20.8  3.39     29.7  33.0  21.5  2.78

Table 6: Spatially interpolated ambient air pollution six-year averages (µg/m³).

We make two checks on the validity of the estimated data. First, for each pollutant we produce boxplots for estimated and actual values (Figure 29 in the Appendix). The range of observed values may be larger than the estimated values, in particular for nitrogen dioxide and sulfur dioxide, because some stations had high measurement variability. The spatial interpolation method corrects for that variability by emphasizing stable stations over highly variable ones.

The second check comes in the form of a display of estimates by PHU in Ontario. Figures 30 to 33 exhibit the distribution of pollution estimates for both summer and winter. Dark areas indicate higher pollution levels than lighter ones. The gradual nature of regional differences and higher estimates for the more heavily industrialized Great Lake PHUs give us confidence in the method.

Now that we have estimates for both summer and winter we must address whether it is worthwhile to keep two variables for one pollutant. If summer and winter pollution estimates are highly correlated then a combined pollution estimate should suffice. The strongest relationships between all summer and winter combinations for all four pollution variables are shown in Figure 27.

The linearity of the top three graphs stands out at once. Since the three pollutants, nitrogen dioxide, sulfur dioxide and sulfates, have similar ranges, data averaged over the year will probably do just as well as either of the two measures. Ozone is rather quirky, with values impressively larger in the four month summer than in the eight month winter.
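One way to check how closely summer and winter estimates track each other is a Pearson correlation across PHUs. The sketch below does this for sulfates using eight rows copied from Table 6; the choice of subset and the helper `pearson` are made here for illustration and are not part of the original analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# (summer SO4, winter SO4) in micrograms per cubic metre, from Table 6.
# Only a subset of the 37 PHUs is used, to keep the example short.
sulfate = {
    "Eastern Ontario": (4.04, 4.30),
    "Toronto": (5.99, 4.91),
    "Niagara": (8.73, 6.10),
    "Windsor-Essex": (9.23, 5.51),
    "Huron": (4.10, 4.52),
    "North Bay": (2.80, 2.56),
    "Sudbury": (1.96, 2.15),
    "Thunder Bay": (2.49, 2.11),
}

summer = [s for s, _ in sulfate.values()]
winter = [w for _, w in sulfate.values()]
r = pearson(summer, winter)
print(f"summer vs winter sulfate correlation: {r:.2f}")
```

A coefficient near one on the full table would support replacing the two seasonal sulfate variables with a single annual average.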
Since people are generally outside more in the summer, the summer ozone value will be the ozone representative value in the analysis.

The last five graphs show the strongest relationships between pollutants of differing types. Significantly, ozone shows up in each of the plots. Summer ozone is positively correlated with summer sulfur dioxide and, strangely enough, winter sulfates. On the bottom row, winter ozone shows a hint of a relationship with summer sulfur dioxide, winter sulfur dioxide and winter sulfates. From the limited data, nothing interesting shows up for nitrogen dioxide.

That concludes our description of the process from original station measurements to six year PHU estimates. We must now link the pollution data to the Ontario Health Survey data and proceed to the analysis.

5.3 Asthma versus Covariates

Once the pollution data is merged with the OHS data we are poised to start with the preliminary stages of the analysis. We can make a first assessment of the relationship between asthma and each of the explanatory covariates by looking at their plots (Figures 14 and 15).

The bars in the plots exhibit two characteristics. The first is the white space in the middle of the bar, the weighted estimate of the proportion. The second is the length of the bar, primarily indicating the number of observations used to calculate the proportion. The bar length, however, is not an explicit 95% confidence interval. A nominal 95% confidence interval would generally be the weighted proportion plus or minus two times the binomial standard error. With survey data we have the design effect (deff) mentioned in the description of OHS methodology. Since the OHS documentation suggests that the design effect for an estimate of the mean is around two I have constructed the following intervals, where $\hat{p}$ is the weighted proportion and $n$ the number of observations:

\[ \hat{p} \pm 2\sqrt{\mathrm{deff}(\hat{p})\,\frac{\hat{p}(1-\hat{p})}{n}} = \hat{p} \pm 2\sqrt{\frac{2\hat{p}(1-\hat{p})}{n}}. \]

Of course, by displaying so many plots we are implicitly making multiple comparisons.
I have not attempted to construct intervals that maintain an overall 95% confidence level but rather have plotted them in the spirit of an exploratory analysis.

Of the demographic variables, age is the most interesting. Asthma has a higher prevalence for the youngest members of the population. This suggests that some proportion of asthmatics are relabeled later on in life.

In the socioeconomic sphere household type, education and income show mild correlation with asthma. The household type ‘D’ represents single parent households, a category known to have a disproportionate number of poor, single women. The fact that household type, education and income concur strengthens the argument that socioeconomic factors are related to health outcomes.

Figure 14: Asthma prevalence by covariates (I). (Demographic panels: Stratum, Sex, Age in years, Immigrant. Socioeconomic and lifestyle panels: Household Type, Family Functioning, Blue Collar, Work Exposure, Income, Education, Smoker Type, Current Smoker.)

Figure 15: Asthma prevalence by covariates (II). (Socioeconomic and lifestyle panels: Duration Smoked in years, Cigarettes Smoked, Household Smokers, Alcohol Problem. Health panels: Body Mass Index, Energy Expenditure, Well Being Score, Allergy. Pollution panels: Nitrogen Dioxide, Ozone, Sulfur Dioxide, Sulfates.)

Work exposure merits some attention: observe the lower prevalence of asthma for the population exposed to dust and fumes in an occupational setting.
The most likely cause is that asthmatics try harder to avoid work that would exacerbate their condition. This is a perfect example of the caution that must be heeded in making inferences with observational data.

The smoking variables display no interesting relationships but two of the health variables do, i.e. the well being score somewhat and the allergy indicator strongly. The presence of well being and allergy poses a problem to the analyst wishing to use them to explain asthma or emphysema. Are they reasonable covariates if they are in fact another measure of respiratory ailments? On the other hand they could be thought of as a way of significantly reducing the variation in the model and thereby increasing the power of hypothesis tests. The position I will take is to avoid bringing them into the model due to the problem of interpretation. It would hardly be right to say that a feeling of being ill brings on asthma.

Finally, the lack of pollution trends is revealing. Each estimated proportion is based on an average of over one thousand sampled individuals living within a PHU. The graphs raise questions about how any meaningful relationship between asthma and pollution could be discovered.

In summary, allergy shows the strongest relationship though its use as an explanatory covariate is questionable. Age, education and income show slight downward trends. In all there are no compelling signs to indicate that we can fit a good model for asthma as the response variable.

5.4 Emphysema versus Covariates

In the same spirit Figures 16 and 17 show the prevalence of one or more of bronchitis, chronic cough or emphysema against each of the covariates. It is immediately obvious that the estimated prevalence rates exhibit greater differences here than for asthma.

As age increases so does the prevalence of emphysema. Perhaps this finding is not too surprising since the lungs lose some of their elasticity as time passes. 
None of the other demographic variables are helpful.

Of the socioeconomic variables, education and income reiterate the oft observed relation between socioeconomic status and health. The poorer one is, the more likely one will be afflicted with medical conditions less prevalent in the upper socioeconomic order.

The smoking variables are strong explanatory covariates. Smoker type and current smoker status show us what we would expect given current knowledge: smoking is detrimental to the functioning of the respiratory system. Duration smoked and, to a lesser extent, the number of cigarettes smoked show distinct upward trends. The effect seems most pronounced for the long time smokers, i.e. the upper third of smokers. Functionally, a quadratic equation looks like it would fit best. Second hand smoke, partially measured through the number of household smokers, shows a slight upward trend as does high alcohol consumption. Undoubtedly any modeling of emphysema must incorporate at least one smoking covariate.

The measures of health status display some of the correlation we expect to see. Energy expenditure, well being score and allergy have gentle downward slopes. Correlation between the health covariates and smoking status could wipe out any of the effect we might attribute to the variables (also see the reservations stated in the previous section).

The results for the pollution covariates are again negative. The bars look as if they are randomly strewn over the range of estimated pollution values. The only hope in relating any of the pollution covariates to emphysema will be to model out as much of the variation inherent in the estimates as possible and condition on other covariates. 
That is the focus of the next two chapters.

Unlike the situation for asthma, we anticipate that the modeling of emphysema will at least result in models with descriptive power; age, duration smoked and income are related marginally to emphysema. Whether pollution will show significance, however, remains to be seen.

[Figure 16: Emphysema prevalence by covariates (I). Panel titles: Demographic Covariates; Socioeconomic Covariates; Socioeconomic and Lifestyle Covariates. Variables shown: stratum, sex, age (years), immigrant, household type, family functioning, blue collar, work exposure, education, income, smoker type, current smoker.]

[Figure 17: Emphysema prevalence by covariates (II). Panel titles: Socioeconomic and Lifestyle Covariates; Health Covariates; Pollution Covariates. Variables shown: duration smoked (years), cigarettes smoked, household smokers, alcohol problem, body mass index, energy expenditure, well being score, allergy, nitrogen dioxide, ozone, sulfur dioxide, sulfates.]

6 Methodology

The broad objective of epidemiology is to get at the etiology of disease. Cross sectional data will at best allow for a determination of association between the disease and related factors, causal or otherwise. The specific goal of this study is to identify and assess a number of possibly important explanatory factors contributing to chronic respiratory illness.

Unfortunately the relationship between chronic disease and etiologic agents is muddied by unsure diagnoses, changing environments, multiple causality, changing behavioral habits, the existence of other maladies, measurement error and lack of quality data, to give but a partial list. 
Each component adds haze to the picture; in statistical jargon, the random component may overshadow the systematic component of the model. In other words, uncertainty threatens to obscure the underlying relationship and hence our understanding of that relationship. Statistical methods address issues of uncertainty and are therefore appropriate in this context.

Besides incorporating error, the method used ought to be able to examine a number of factors at once. In the epidemiological literature, synergy refers to the phenomenon of two factors which produce a much greater effect in conjunction than if considered separately. Regression methods take care of these concerns. As a bonus, regression provides the possibility of incorporating spatial dependence and other such subtleties.

The Ontario Health Survey sampling design induces a probability structure that is important to address. Should I incorporate inclusion probabilities into the regression model? How can that be done? I begin the chapter by going over the salient differences between finite and infinite population inference.

Classic regression methods assume a continuous response vector. I use responses to the Ontario Health Survey that indicate the presence or absence of chronic respiratory illness. Therefore, before analysing the data, I ought to describe generalized linear models, the extension of classical methods that accommodates binary and binomial response vectors. In this chapter I also describe the model selection strategy and model checking procedures.

6.1 Finite vs. Infinite Population Inference

The Ontario Health Survey data arise from a sample survey of a human population. The sample survey has a complex design which induces inclusion probabilities associated with each element in the sample. Should I incorporate the inclusion probabilities in the analysis? 
Before the question can be answered I must go over the basics of survey sampling theory.

Finite survey sampling theory gives a way to make inferences from a finite sample, representable as $s = \{1, \ldots, n_s\}$, to a finite universe $U = \{1, \ldots, N\}$. Generally inferential statements require knowledge of the probability structure resulting from the survey design. The setup differs from classical statistical theory in that classical theory assumes an infinite population.

With a finite population the set of all possible samples associated with the design, $S = \{s_1, \ldots, s_M\}$, is also finite. The probability that a particular sample is chosen, $p(s)$, defines the sample design. Two important probabilities at the element level are

\[ \pi_k = P(k\text{th element sampled}) \quad\text{and}\quad \pi_{kl} = P(k\text{th and } l\text{th elements sampled}). \]

If both $\pi_k$ and $\pi_{kl}$ can be calculated then the design is said to be measurable. A measurable design allows for estimation of finite population totals and the variance of those estimates. Some papers in the literature deal with estimating finite population regression parameters and their variances. Binder (1983) describes finite population parameter estimation in the generalized linear models context. What happens, however, when one wants to consider parameters, such as regression parameters, where the finite population parameters are of little interest?

Classic regression theory posits a linear model relating the continuous response variable to explanatory covariates:

\[ y_k = x_k'\beta + \varepsilon_k, \qquad k = 1, 2, \ldots, m. \]

Further, $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m$ are independently and identically distributed as $N(0, \sigma^2)$. To mesh classic regression theory with survey sampling theory, finite survey sampling theorists make use of a construct called the ‘superpopulation.’

[Figure 18: Empirical density of survey weights.]

The finite population is essentially a partial realization of the superpopulation. I can estimate superpopulation parameters in one of two ways: with or without taking the survey design $p(s)$ into account. 
Here, $\hat\beta_{p(s)}$ will denote an estimator of $\beta$ which incorporates the inclusion probabilities and $\hat\beta$ will refer to an estimator which does not.

Unless the survey weights are more or less the same, a regression analysis incorporating survey weights may differ from an analysis which assumes a self weighted design. The survey weight attached to every survey respondent reflects selection probabilities adjusted for nonresponse and age-sex population totals at the PHU level (Ontario Ministry of Health, 1992a). Examination of the OHS data shows that the PHUs with the most extremely weighted respondents are urban. Figure 18 shows the skewness of the weight distribution, where the weights have been modified so their expected value is one. Because of the skewness, the question of what to do about survey weights remains.

An ongoing debate is over which of the two estimators is better to use. Särndal (1992) handles the question in the following way. First, he notes that $\hat\beta$ is the best linear unbiased estimator (BLUE). That is, for any conformable constant vector $c$,

\[ E_\xi\big[\{c'(\hat\beta - \beta)\}^2 \mid s, X\big] \le E_\xi\big[\{c'(\hat\beta_{\mathrm{other}} - \beta)\}^2 \mid s, X\big], \]

where $E_\xi$ denotes expectation under the superpopulation model. Under any sampling design $p(\cdot)$, with $E_p$ denoting expectation over the design,

\[ E_\xi E_p\big[\{c'(\hat\beta - \beta)\}^2 \mid s, X\big] \le E_\xi E_p\big[\{c'(\hat\beta_{\mathrm{other}} - \beta)\}^2 \mid s, X\big]. \]

By the BLUE criterion, $\hat\beta$ is better than $\hat\beta_{p(s)}$. He argues, however, that $\hat\beta_{p(s)}$ may be preferable on the basis that $\hat\beta_{p(s)}$ is design consistent whereas $\hat\beta$ is not. Pfeffermann (1993) adds that weights protect against nonignorable sampling designs and misspecification of the model. Either way, a methodology which allows for possible prior weights is desirable.

The OHS documentation suggests the following:

The sample weights placed on the individual microdata tape records must be used when producing estimates from the survey data, including ordinary statistical tables. Otherwise, the estimates derived cannot be considered to be representative of the survey population, and will not correspond to those produced by the Ministry of Health or other users of the data. 
Users are particularly cautioned about releasing unweighted tables or performing any analysis on unweighted data (Ontario Ministry of Health, 1992b, p. 3).

One way around the controversy is to try the analysis both ways (Fay, 1984). For the marginal distribution of age shown in Figure 19, for example, there are estimates for which there is little practical difference in including or excluding weights.

A nagging difficulty with the superpopulation approach is the assumption of independence between elements. In a realistic application such as the Ontario Health Survey, the units are spatially correlated. Part of the correlation is induced by the clustering and stratification used in the sampling design. Adjustments of the classical $\chi^2$ and likelihood ratio tests are necessary to protect against invalid test statistics (Kumar and Rao, 1984). The assumption of independence normally results in the underestimation of parameter variance.

[Figure 19: Marginal distribution of age using and ignoring survey weights. Legend: no survey weights; includes survey weights. Horizontal axis: age.]

In conclusion, the modelling of survey data introduces a twist into the analysis. A superpopulation model must be invoked which results in two estimation possibilities: incorporate or ignore inclusion probabilities induced by the survey sampling design.

6.2 Generalized Linear Models Theory

Generalized linear models owe their popularity to the revolution in computer technology. Without a mechanism for solving nonlinear equations using an iterative numerical algorithm, the researcher faces the task of working everything out by hand. Today, with relatively low computational costs, the availability of software like GLIM, SAS and S, and the high speed of computers, the use of generalized linear models has become quite feasible. 
Appropriate to such auspicious beginnings, theoreticians have been focusing intensely on generalized linear model problems since the late 1970s; today much of the theory is standardized in “classics” like Generalized Linear Models (McCullagh and Nelder, Second Edition, 1989).

Generalized linear models are an extension of classical linear models. The most important change is that the response variable $Y$ no longer needs to be continuous and normally distributed. As long as $Y$ can adequately be described by a member of the exponential family, generalized linear models theory will apply. The Poisson, binomial, gamma and exponential distributions are examples suggesting the range of possibility for applications.

There are three components to a generalized linear model (GLM):

1. The random component, denoted by the response vector $Y$, is defined by an assumed distributional form. Each element of the response vector belongs to the same exponential family with $E(Y) = \mu$.

2. The systematic component is a linear function of the explanatory variables $x_j$ where $j = 1, 2, \ldots, p$. The coefficient vector $\beta$ is of great scientific interest since it is used as the criterion for interpreting the importance of covariates as predictors. The systematic component is summarized by $\eta = \sum_j x_j\beta_j = X\beta$.

3. The link function $g(\cdot)$ relates the random and systematic components. Simply put, $\eta = g(\mu)$.

Independence between elements of the response vector is assumed.

The assumed distribution of $Y$ is supposed to adequately model the observed vector $y$. As mentioned, the exponential family includes most well known probability distributions. Properly defined, the exponential family covers all probability density functions and probability mass functions of the form

\[ f_Y(y; \theta, \phi) = \exp\left\{ \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \right\} \tag{2} \]

for given $a(\cdot)$, $b(\cdot)$ and $c(\cdot)$. By convention, $\theta$ denotes the canonical parameter. When $a(\phi) = \phi/w$, $\phi$ is called the dispersion parameter where $w$ is a prior known weight associated with the observation $y$. 
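As an illustration of the form (2), and not part of the original derivation, a Poisson response with mean $\mu$ can be written in this way. The identification of $\theta$, $b(\cdot)$, $a(\cdot)$ and $c(\cdot)$ below is the standard textbook one; the dispersion parameter is fixed at one.

```latex
\begin{align*}
f_Y(y;\theta) &= \frac{\mu^{y}e^{-\mu}}{y!}
  = \exp\{\, y\log\mu - \mu - \log y! \,\},\\
\theta &= \log\mu, \qquad b(\theta) = e^{\theta}, \qquad
a(\phi) = 1, \qquad c(y,\phi) = -\log y!.
\end{align*}
```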
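Until the dispersion parameter comes up explicitly again, I will assume, for the purposes of my argument, that $\phi$ is known. Questions of estimation then pertain only to the parameter $\theta$.

How can $\theta$ be estimated once a distributional assumption about $Y$ is made? If $f_Y(y; \theta, \phi)$ is regarded as a function of $\theta$ alone then one can maximize the likelihood function

\[ L(\theta) = f(\theta; y, \phi) \]

to find an estimator by the maximum likelihood approach. As long as the maxima do not occur at the endpoints of the parameter space, the log likelihood

\[ l = \log L(\theta) = \log f(\theta; y, \phi) = \frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi) \tag{3} \]

will have the same local maxima as $L(\theta)$. The solution $\hat\theta$ then satisfies $\partial l/\partial\theta = 0$. The log likelihood for the vector $Y$, $l = \log L(\theta)$, takes the form of a summation over the sample.

There are two important properties of the log likelihood:

\[ E_\theta\left(\frac{\partial l}{\partial\theta}\right) = 0 \tag{4} \]

and

\[ E_\theta\left(\frac{\partial^2 l}{\partial\theta^2}\right) + \mathrm{var}_\theta\left(\frac{\partial l}{\partial\theta}\right) = 0. \tag{5} \]

The statistic $\partial l/\partial\theta$ is known as the score. Fisher's information,

\[ I(\theta) = \mathrm{var}_\theta\left(\frac{\partial l}{\partial\theta}\right) = -E_\theta\left(\frac{\partial^2 l}{\partial\theta^2}\right), \]

is derived from (5). Note that the largest value achieved by taking the reciprocal of Fisher's information in the exponential family occurs when the variance of $Y$ is smallest. An analogous inversion takes place when computing standard errors for the estimated coefficient vector $\hat\beta$. Thus, the mathematical expression for ‘information’ meshes nicely with intuition: decreasing uncertainty, i.e. increasing certainty, implies the notion of increasing confidence about the information latent in the data. From (3),

\[ E_Y\left(\frac{\partial l}{\partial\theta}\right) = E_Y\left(\frac{Y - b'(\theta)}{a(\phi)}\right) = 0 \;\Longrightarrow\; E(Y) = b'(\theta), \]

\[ E\left[\left(\frac{\partial l}{\partial\theta}\right)^2\right] = E\left[\left\{\frac{Y - b'(\theta)}{a(\phi)}\right\}^2\right] = \frac{\mathrm{var}(Y)}{a^2(\phi)}, \]

\[ E\left(\frac{\partial^2 l}{\partial\theta^2}\right) = E\left(\frac{-b''(\theta)}{a(\phi)}\right) = \frac{-b''(\theta)}{a(\phi)}, \]

and

\[ E_Y\left(\frac{\partial^2 l}{\partial\theta^2}\right) + E_Y\left[\left(\frac{\partial l}{\partial\theta}\right)^2\right] = 0 \;\Longrightarrow\; \mathrm{var}(Y) = b''(\theta)\,a(\phi). \]

A couple of remarks are in order. Assume $a(\phi)$ is of the form $a(\phi) = \phi/w$ where $w$ is a known prior weight. First, $E(Y)$ does not depend on $a(\phi)$. Second, the variance of $Y$ depends both on the dispersion parameter and the prior weight. One possibility for the weights is the set of survey weights, which reflect the inclusion probabilities associated with each component of $y$.

So far the discussion has shied away from the coefficient vector $\beta$. 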
Observe, however, that

\[ \mu = E(Y) = b'(\theta) = g^{-1}(\eta) = g^{-1}(x^T\beta). \tag{6} \]

If the maximum likelihood estimate (MLE) of $\theta$ can be computed then in principle a similar approach should work for finding an estimate of $\beta$. Indeed, an application of the principle leads to the Newton Raphson method (Collett, 1992, Appendix B). Let $u(\beta)_{q \times 1}$ be the vector of scores where the $j$th element is $\partial l(\beta)/\partial\beta_j$. The objective is to satisfy the maximum likelihood criterion

\[ u(\hat\beta) = \left.\frac{\partial l(\beta)}{\partial\beta}\right|_{\beta = \hat\beta} = 0. \tag{7} \]

Unfortunately no closed form solution is possible when $g(\mu)$ is nonlinear since $g'(\mu)$ will not reduce to a constant. In other words, the MLE for $\mu$, a function of $\beta$, has no analytic solution. A numerical solution can be obtained, however, by linearization. The first order Taylor linearization of $u$ about the vector $\beta_0$ is

\[ u(\hat\beta) \approx u(\beta_0) + H(\beta_0)(\hat\beta - \beta_0) \tag{8} \]

where $H(\beta_0)_{q \times q}$ is the Hessian matrix with the $(i,j)$th element given by $\partial^2 l(\beta)/\partial\beta_i\partial\beta_j$. From (7) and (8), $u(\beta_0) + H(\beta_0)(\hat\beta - \beta_0) \approx 0$ implies

\[ \hat\beta \approx \beta_0 - H^{-1}(\beta_0)\,u(\beta_0). \]

In more standard notation,

\[ \hat\beta_{r+1} = \hat\beta_r - H^{-1}(\hat\beta_r)\,u(\hat\beta_r) \tag{9} \]

where $\hat\beta_r$ is now used to estimate $\hat\beta_{r+1}$. The last equation is more obviously in the form of the iterative Newton Raphson procedure for obtaining maximum likelihood estimates of $\beta$.

An alternative to Newton Raphson is Fisher's method of scoring. Here the Hessian $H(\beta)$ in (9) is replaced with its expectation, which is the negative of the information matrix $I(\beta)$: for the $(i,j)$th element,

\[ \frac{\partial^2 l(\beta)}{\partial\beta_i\partial\beta_j} \quad\text{is replaced with}\quad E\left[\frac{\partial^2 l(\beta)}{\partial\beta_i\partial\beta_j}\right] = -I(\beta)_{ij}, \]

so that the update becomes $\hat\beta_{r+1} = \hat\beta_r + I^{-1}(\hat\beta_r)\,u(\hat\beta_r)$. An advantage of calculating $I(\beta)$ is that the inverse is the asymptotic covariance matrix of the MLE for $\beta$. Although both methods converge to the same $\hat\beta$, $I^{-1}(\beta)$ will not necessarily be identical to $-H^{-1}(\beta)$ for all distributions of the exponential family nor for all link functions.

In conclusion, standardized methodology exists for non-normal data. The likelihood function plays a central role, making coefficient estimation fairly straightforward. The method can handle subtleties like unequal weighting. 
Just as important, the methodology has been implemented in major software packages.

6.3 Modelling Binary Data

Now that I have sketched the important components of generalized linear model theory I will focus on the models specifically related to the OHS data. The micro level data give the presence or absence of chronic respiratory disease for every sampled individual. Two approaches to analysis suggest themselves. One is to go through with a binary variable analysis while the other is to analyse by first aggregating the data and then working with the resulting binomial observations. In this section I will delve into the details of the likelihood equation, link function and parameter estimation for the binary case. The notation I will use is given in Table 7.

Symbol                  Description
$n$                     Sample size
$\pi$                   Probability of disease occurrence
$\boldsymbol{\pi}$      Probability vector
$\hat\pi$               Maximum likelihood estimate of $\pi$
$Y$                     Random variable distributed as $B(1, \pi)$
$\mathbf{Y}$            Random vector of binary observations
$y$                     A realization of the random variable $Y$
$w$                     A prior weight attached to the random variable $Y$
$\mathbf{y}$            A realization of the random vector $\mathbf{Y}$
$\beta$                 Coefficient vector $(\beta_1, \beta_2, \ldots)$
$\hat\beta$             Maximum likelihood estimate of $\beta$

Table 7: Definition of symbols used for the modelling of binary data.

A binary variable can take on one of two values:

\[ y = \begin{cases} 1 & \text{when chronic respiratory illness exists,} \\ 0 & \text{otherwise.} \end{cases} \]

The probability $\pi = P(Y = 1)$ is known as the Bernoulli probability of success and results in the probability mass function

\[ f_Y(y; \pi) = \pi^y (1 - \pi)^{1-y} = \left(\frac{\pi}{1-\pi}\right)^{y} (1 - \pi). \]

When a weight $w_i$ is attached to each $y_i$ the joint probability mass function appears as

\[ f_{\mathbf{Y}}(\mathbf{y}; \boldsymbol{\pi}) = \prod_i f_Y(w_i, y_i; \pi_i) \tag{10} \]

\[ = \exp\left\{ \sum_i w_i \left[ y_i \log\frac{\pi_i}{1-\pi_i} + \log(1 - \pi_i) \right] \right\} \tag{11} \]

where $\sum_i w_i = 1$. The distribution in (11) is a member of the exponential family given in (2) upon setting $\theta = \log[\pi(1-\pi)^{-1}]$. More precisely,

\[ f(\mathbf{y}; \theta) = \exp\left\{ \sum_i w_i \left[ y_i\theta_i - \log(1 + e^{\theta_i}) \right] \right\} \tag{12} \]

because $\log(1 + e^{\theta}) = \log[1 + \pi(1-\pi)^{-1}] = -\log(1 - \pi)$.

The algorithm for estimating $\beta$ reduces to an iterative weighted least squares algorithm. I am going to give a detailed exposition for this case. 
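Similar arguments can be found in McCullagh and Nelder (1989) and Collett (1991). The outline is simple: only the score function $u(\hat\beta_r)$ and Fisher's information matrix $I(\hat\beta_r)$ are needed to derive $\hat\beta_{r+1}$.

With binary data, the log likelihood is

\[ l = \sum_i w_i \left[ y_i\theta_i - \log(1 + e^{\theta_i}) \right]. \]

Using the chain rule to derive the score function for $\beta_j$,

\[ \frac{\partial l(\beta)}{\partial\beta_j} = \sum_i \frac{\partial l_i}{\partial\theta_i}\,\frac{\partial\theta_i}{\partial\pi_i}\,\frac{\partial\pi_i}{\partial\eta_i}\,\frac{\partial\eta_i}{\partial\beta_j} = \sum_i \frac{w_i (y_i - \pi_i)}{\pi_i(1 - \pi_i)\,g'(\pi_i)}\,x_{ij}. \]

For Fisher's information first note that $E(y_i - \pi_i)(y_{i'} - \pi_{i'}) = 0 \;\forall\, i \ne i'$ by the assumed independence between observations. When $i = i'$, $E(y_i - \pi_i)^2 = \pi_i(1 - \pi_i)$. Thus,

\[ -E\left[\frac{\partial^2 l(\beta)}{\partial\beta_j\partial\beta_k}\right] = E\left[\frac{\partial l}{\partial\beta_j}\,\frac{\partial l}{\partial\beta_k}\right] = E_Y\left[ \sum_i \frac{w_i (y_i - \pi_i)\,x_{ij}}{\pi_i(1 - \pi_i)\,g'(\pi_i)} \sum_{i'} \frac{w_{i'} (y_{i'} - \pi_{i'})\,x_{i'k}}{\pi_{i'}(1 - \pi_{i'})\,g'(\pi_{i'})} \right] = \sum_i \frac{w_i^2}{\pi_i(1 - \pi_i)\,[g'(\pi_i)]^2}\,x_{ij}x_{ik}. \]

Grouping terms helps reduce the estimation procedure to iterative weighted least squares. Let the multiplier of $x_{ij}x_{ik}$,

\[ W(\pi_i) = \frac{w_i^2}{\pi_i(1 - \pi_i)\,[g'(\pi_i)]^2}, \]

determine the elements of the diagonal $n \times n$ matrix $W$. Clearly $I(\beta) = X'WX$. The score function is simplified by setting $\tilde{y}_i(\pi_i) = [g'(\pi_i)(y_i - \pi_i)]/w_i$. Then $u(\beta) = X'W\tilde{y}(\pi)$.

Finally, everything can be combined. Since the MLE $\hat\pi$ depends on the estimated value of $\beta$ and vice versa, all terms dependent on $\pi$ and $\beta$ get a subscript $r$:

\[ \begin{aligned} \hat\beta_{r+1} &= \hat\beta_r + I^{-1}(\hat\beta_r)\,u(\hat\beta_r) \\ &= \hat\beta_r + (X'W_rX)^{-1}X'W_r\,\tilde{y}_r(\hat\pi_r) \\ &= (X'W_rX)^{-1}(X'W_rX)\hat\beta_r + (X'W_rX)^{-1}X'W_r\,\tilde{y}_r(\hat\pi_r) \\ &= (X'W_rX)^{-1}X'W_r\,[X\hat\beta_r + \tilde{y}_r(\hat\pi_r)]. \end{aligned} \]

This is exactly the form of iterative weighted least squares regression on $[X\hat\beta_r + \tilde{y}_r(\hat\pi_r)]$.

To this point I have made no mention of the specific form of the link function. The logit, probit and complementary log-log link functions are the three most commonly used for binary and binomial data. Recall that

\[ E(Y) = b'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = \pi. \]

The advantage of employing one of the three links lies in the range of their inverses: each maps the real line onto $(0, 1)$. An unrestricted range could mean, for a given $\eta_0$, that $g^{-1}(\eta_0) < 0$ or $g^{-1}(\eta_0) > 1$ even though $0 \le \pi \le 1$. Since $\eta = g(\pi) = \mathrm{logit}(\pi) = \theta$, the logit link is also known as the canonical link. The probit link is defined as $g(\pi) = \Phi^{-1}(\pi)$ where $\Phi^{-1}(\cdot)$ is the inverse cumulative distribution function for the standard normal distribution, and the complementary log-log function is given as $g(\pi) = \log\{-\log(1 - \pi)\}$. 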
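The iteration described above can be made concrete with a short sketch. The Python below is illustrative only and is not the software used for the analyses in this thesis. It assumes the logit link, for which $g'(\pi) = 1/[\pi(1-\pi)]$, and it uses the textbook Fisher scoring step with the information taken as the expected negative Hessian of the weighted log likelihood, so that the prior weight enters the weight matrix linearly. The function names `solve` and `irls_logit` are hypothetical.

```python
import math


def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(b)
    M = [list(row) + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def irls_logit(X, y, w, n_iter=25, tol=1e-10):
    """Weighted logistic regression by Fisher scoring (logit link assumed).

    X: list of covariate rows, y: 0/1 responses, w: prior weights.
    Returns the estimated coefficient vector as a list.
    """
    n, p = len(y), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        eta = [sum(xj * bj for xj, bj in zip(row, beta)) for row in X]
        pi = [1.0 / (1.0 + math.exp(-e)) for e in eta]
        # Score vector: u_j = sum_i w_i (y_i - pi_i) x_ij
        u = [sum(w[i] * (y[i] - pi[i]) * X[i][j] for i in range(n))
             for j in range(p)]
        # Information matrix: I_jk = sum_i w_i pi_i (1 - pi_i) x_ij x_ik
        info = [[sum(w[i] * pi[i] * (1.0 - pi[i]) * X[i][j] * X[i][k]
                     for i in range(n)) for k in range(p)] for j in range(p)]
        step = solve(info, u)  # I^{-1} u
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta
```

With an intercept only, the fitted probability equals the weighted mean of the responses, which gives a quick check of any implementation of the iteration.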
I will generally use the logit link for analysis due to its interpretability as the log odds of success.

7 Asthma Analysis

In this chapter we approach model building by dividing the covariates into groups and performing separate arcsine analyses. Besides allowing for standard diagnostic checking, each analysis provides us ultimately with a subset of covariates useful in constructing a ‘final’ model. This is an alternative to the more commonly used stepwise regression technique.

The arcsine analysis does not allow for the incorporation of survey weights. We therefore opt to use generalized linear models when we add the pollution terms.

7.1 Arcsine Analysis

From Figures 14 and 15 we note the lack of differences for all of the covariates except allergy. Allergy, however, is a questionable covariate since we would hesitate to say allergy causes asthma. Without at least one significant term in a model, asthma will remain unexplained by all covariates including the pollution terms.

With the study variables split into five coherent groupings, an arcsine analysis is possible. This approach transforms the response variable so that it is approximately normally distributed. Then a “classical” regression analysis is possible. The advantage of this approach comes from being able to apply standard diagnostic procedures for model checking. Survey weights will be ignored in this part of the analysis.

If $Y \sim B(n, \pi)$ is a binomial random variable then the normal distribution $N(n\pi, n\pi[1 - \pi])$ approximates $Y$ well if the sample size is large enough. Normal regression theory requires, however, that the response variable has a constant variance, i.e. $Y \sim N(E(Y), \sigma^2)$, rather than one dependent on $\pi$. The arcsine transformation removes the dependence of the variance on the proportion.

The arcsine transformation derives from the so-called “delta method” of classical statistics. Let $\psi(\cdot)$ be a transformation and $p = Y/n$. 
The first order Taylor series expansion about $\pi$ is

\[ \psi(p) \approx \psi(\pi) + \psi'(\pi)(p - \pi). \]

Then approximately,

\[ E[\psi(p)] = \psi(\pi) \quad\text{and}\quad \mathrm{Var}[\psi(p)] = E[\psi'(\pi)(p - \pi)]^2 = [\psi'(\pi)]^2\,\mathrm{Var}(p). \]

Let $\psi(\pi) = \sin^{-1}\sqrt{\pi}$. Since $\psi'(\pi) = \big(2\sqrt{\pi(1 - \pi)}\big)^{-1}$,

\[ \mathrm{Var}[\psi(p)] = \frac{1}{4\pi(1 - \pi)} \cdot \frac{\pi(1 - \pi)}{n} = \frac{1}{4n}. \]

Thus, $Y^* \sim N(\sin^{-1}\sqrt{\pi},\, 1/4n)$ where $y^* = \sin^{-1}(\sqrt{p})$. Because the variance $1/(4n)$ still depends on the cell size $n$, the linear regression analysis becomes a weighted regression analysis.

Group            Model                              df     r²
Demographic      sex + age + age²                   116    0.15
Socioeconomic    dust exposure + income             46     0.18
Lifestyle        smoker type + drinker              142    0.18
Health           well being + family functioning    508    0.04

Table 8: The best arcsine models fitted for asthma.

Our criterion for the arcsine analysis is the standard t-test for each of the covariates. An unusually low p-value suggests the covariate could be important in the final model. The best models for each of the groupings are given in Table 8. Note that although many of the covariates passed this test for admission to the final model, none of the models does a very good job of fitting the data. At best, the models explain about 20% of the observed variation. This leaves another 80% unexplained!

The tables do not tell the whole story, however. For each of the models, we produced standard diagnostic plots (see Figures 34 to 36 in the Appendix). The first noticeable feature is how the sample spreads unevenly in the n-way table. The leverage for most models looks reasonable when compared to the ‘2p/n’ line. In general, if a leverage point lies greatly above the line the model becomes suspect.

Term          X*      df    P(X* > x)
Age + Age²    52.4    2     0.000
Income        42.7    2     0.000
Smoker Type   21.2    4     0.000
Sex           12.9    1     0.000
NO2           7.3     1     0.007
O3            1.9     1     0.169

Table 9: Most significant terms in the unweighted logit asthma model.

The real test of a model, however, is provided by the residuals. The quantile-quantile plot of normality shows how well the assumption of normality is met. All of the models fare somewhat poorly in this regard. 
Caution must be exercised in assessing the model.

The covariates well being and family functioning score do not, upon second reflection, seem to offer much in terms of interpretability of the model. Thus the final model was built from all covariates but those two. The full model consists therefore of ‘age’, ‘sex’, ‘dust exposure’, ‘income’, ‘drinker’ and ‘smoker type’.

7.2 Logistic Modeling

For fitting the final models, generalized linear model theory is used. The criterion for covariate inclusion in the model is a goodness of fit test. From theory, models can be compared on the basis of differences in model deviance. Mathematically, if $L_c$ is the estimated likelihood of the current model and $L_f$ the estimated likelihood for the full model, the deviance is conventionally given as

\[ D = -2\log(L_c/L_f). \]

Nested models can be compared by focusing on the difference in deviances. The difference is asymptotically distributed as $\chi^2$ with the degrees of freedom being the difference of degrees of freedom between the nested models.

The overall fit for the full model is X = 120 on 12 degrees of freedom. Once the full model is fitted, forward and backward updating of the model allows us to test for each term. The unweighted analysis produces Table 9.

Term            X*      df    P(X* > x)
Age + Age²      71.8    2     0.000
Income          25.1    2     0.000
Smoker Type     21.6    4     0.000
Sex             10.4    1     0.001
Drinker         12.7    2     0.002
Work Exposure   4.6     1     0.032
NO2             1.7     1     0.196

Table 10: Most significant terms in the weighted logit asthma model.

Age is the most significant term in the model. The fitted values over the domain of ages covered in the sample show a decreasing trend for asthma with a range of about 3%. The next terms are income, smoker type and sex. The model suggests that having lower income, having been a former smoker and being female are associated with higher asthma prevalence.

The model also suggests that NO2 is associated with asthma prevalence. The problem is that its estimated coefficient suggests that NO2 is negatively related to asthma. 
That is to say, the more NO2 in the air, the fewer asthmatics you would expect to observe; the range of the downward trend in fitted values is about 1% over the interval of NO2 readings observed in Ontario. Two comments are necessary. First, the models have 50,000 observations from which to fit a maximum likelihood estimate and a spurious result is possible. Second, we can check the model estimates by comparing them to models which have incorporated the survey weights.

Logistic regression modeling allows for the inclusion of a weighting structure, i.e. survey weights representing inclusion probabilities. The same analysis was run with the weights, producing Table 10. It is comforting to see that the first four significant terms in the unweighted model show up in the weighted model in the same order. In the weighted case, however, drinking and work exposure eclipse NO2 as important model variables. In fact NO2 no longer appears to be significant. The changes reinforce our suspicions about NO2 as a truly significant covariate.

In summary, none of the pollution covariates appear to be significantly related to asthma prevalence. NO2 gave the strongest signal of the pollution measures but had a questionable negative estimated coefficient and was not robust enough to show consistency between weighted and unweighted logistic analyses.

8 Emphysema Analysis

In this chapter we model the cross sectional association of respiratory morbidity and air pollution, using the 1990 Ontario Health Survey data. For convenience, we call the binary response variable “emphysema.” However, that term will refer collectively to any of emphysema, chronic cough and/or chronic bronchitis. The etiology of chronic respiratory disease, especially as regards the effect of ambient air pollutants, is currently unknown and hence hotly debated. 
We hope our analysis will contribute usefully to that debate.

In the first step of our analysis, we select appropriate covariates, to increase the model’s sensitivity to the effect of pollution while avoiding collinearity and extreme data spread. We use monthly pollutant data derived from the multivariate spatial interpolation methodology described in the last section. In our first approach, we calculate the average summer and winter six year pollution levels for all four pollutants by Public Health Unit (PHU). The resulting values, unlike those from our second approach described next, resemble the measurements taken at the air monitoring stations; the analysis will lend itself to easy interpretation. In our second, conceptually more complex approach, we use principal components to construct a pollution index which favours summer pollution levels and represents a variety of pollutants. We obtain the weighting scheme implicitly by using averages of the eight month winter and the four month summer. This leads to eight estimates per Public Health Unit (PHU). A principal components analysis creates the index which extracts the maximal amount of information in the eight estimates.

In our last step we evaluate the significance of the pollution variables. We base our evaluation on the stepwise addition of pollution variables to the covariate model built in the first step. Using the same steps, we go through with an analysis ignoring weights and then offer a comparison to the analysis incorporating weights.

8.1 Covariate Selection Excluding Pollution

We first construct the ‘best’ model excluding the pollution terms. From previous studies we know that several factors relate to chronic bronchitis. The disease is more prevalent among older people than younger people, among males than females and among urbanites than rural dwellers. Social class seems important. In Britain unskilled labourers are five times as likely to have chronic bronchitis as professionals. 
Smoking and family history also rate as possible determinants (Fry, 1985, Chapter 6).

Beginning with the seemingly most relevant twenty, we must select covariates from the many offered by the Ontario Health Survey (OHS) to represent broad population traits. But which covariates are best and how many should be included?

Achieving an aesthetic result and avoiding the problems associated with sparse data and collinearity demand parsimony. In particular, a model with too many terms will spread the data over a table of unduly high dimension, since a relatively small proportion of the population is afflicted with emphysema.

Collinearity arises from the association of prospective covariates. Perfect association between two covariates means the second adds no information not in the first. In ordinary least squares regression, collinearity leaves the coefficient estimators unbiased but of reduced precision. Robinson and Jewell (1991) prove that in logistic regression as well, when two predictive covariates exhibit collinearity, one being correlated with the response, a loss of precision occurs. Unfortunately we cannot avoid the difficulties presented by collinearity since covariates in an observational study cannot be controlled. Instead we have tried to side-step these problems by excluding highly correlated pairs of covariates.

To reduce our computational burden, we began building our model one factor at a time using one fifth of the survey data respondents or about 9,500 records. We incorporated the microdata survey weights into the binary logistic models. 
This ensured unbiasedness for the finite population of the estimating equations used to construct estimates of model coefficients.

Term                                   X*   df   P(X* > x)
Age                                    77    1   0.000
Household Type                         47    3   0.000
Education Level                        41    2   0.000
Income                                 50    2   0.000
Smoker Type                           103    4   0.000
Current Smoker Type                    74    3   0.000
Duration Smoked                       150    2   0.000
Number of Cigarettes Smoked           103    2   0.000
Energy Expenditure                     63    3   0.000
Well Being Status                      56    4   0.000
Number of Current Household Smokers    23    1   0.000
Allergy                                20    1   0.000
Alcohol Screening Test Category        24    4   0.000
Blue Collar Work                       12    2   0.002
Family Functioning Status               8    2   0.017
Body Mass Index                         8    2   0.020
Sex                                     3    1   0.100
Immigrant Status                        2    1   0.214
Work Exposure                           1    1   0.333
Stratum                                 0    1   0.906

Table 11: One term emphysema models using a fifth of the data.

Table 11 summarizes the results. Except for age, the demographic covariates are inconspicuous in that they fall at the bottom of the list. The last four one term models fail the α = 0.05 criterion; the other sixteen prospective covariates each reduce the standard deviance enough to improve the fit of the model significantly. Clearly a judicious strategy for finding a good model is needed.

The need to distinguish between missing and zero values complicated our stepwise selection procedure. When a variable like ‘cigarettes smoked’ had missing values we had to include an extra dummy variable in the model. In addition, three covariates, ‘well being’, ‘allergy’ and ‘family functioning status’, were excluded a priori since we judged them to blur the distinction between dependent and independent variables.

Table 12 summarizes the results of applying our stepwise procedure. The table shows that adding the smoking covariate in the second step significantly decreases the importance of the other smoking covariates in the subsequent steps. We stopped after step six.
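The P(X* > x) columns in these tables can be checked by hand. The sketch below assumes that each X* is referred to a chi-square distribution with the listed degrees of freedom, which is our reading of the tables rather than a documented property of the fitting software. The function name chi2_upper_tail is hypothetical, and only the Python standard library is used.

```python
import math


def chi2_upper_tail(x, df):
    """Return P(X > x) for a chi-square variable with integer df >= 1."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: two-sided standard normal tail at sqrt(x), plus a finite
    # correction series sum_{r=1}^{(df-1)/2} x^(r - 1/2) / (1*3*...*(2r-1)).
    root = math.sqrt(x)
    normal_tail = math.erfc(root / math.sqrt(2.0))
    density = math.exp(-half) / math.sqrt(2.0 * math.pi)
    term, total = root, 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return normal_tail + 2.0 * density * total


if __name__ == "__main__":
    # Rows copied from Table 11: (term, X*, df).
    rows = [
        ("Blue Collar Work", 12, 2),
        ("Family Functioning Status", 8, 2),
        ("Sex", 3, 1),
    ]
    for term, x_star, df in rows:
        print(f"{term:28s} {chi2_upper_tail(x_star, df):.3f}")
```

Because the tabulated X* values are rounded, the recomputed probabilities can differ slightly from the printed ones; the ‘Blue Collar Work’ row (X* = 12 on 2 degrees of freedom) gives exp(-6), which rounds to the tabulated 0.002.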
The covariates selected for our model, before considering pollution, are: ‘age’; ‘smoker type’; ‘education’; ‘cigarettes smoked’; ‘energy expenditure’; and ‘family type’.

Step  Stepwise added Terms    X*   df   P(X* > x)
1     Age                     77    1   0.000
      Smoker Type            105    4   0.000
      Current Smoker Type     83    3   0.000
2     Cigarettes Smoked       56    1   0.000
      Household Smokers       44    1   0.000
      Education               34    2   0.000
      Education               26    2   0.000
      Income                  23    2   0.000
3     Energy Expenditure      25    3   0.000
      Cigarettes Smoked       21    2   0.000
      Immigrant                5    1   0.026
      Cigarettes Smoked       20    2   0.000
      Energy Expenditure      21    3   0.000
4     Family Type             20    3   0.000
      Income                  15    2   0.001
      Drinker Category        14    4   0.007
      Sex                      6    1   0.018
      Energy Expenditure      21    3   0.000
      Family Type             21    3   0.000
5     Income                  17    2   0.000
      Sex                     10    1   0.001
      Drinker Category        14    4   0.009
      Family Type             20    3   0.000
      Income                  14    2   0.001
6     Sex                      8    1   0.004
      Drinker Category        12    4   0.021
      Duration Smoked          7    2   0.025
      Duration Smoked          8    2   0.017
      Sex                      5    1   0.027
7     Income                   7    2   0.031
      Drinker Category        11    4   0.032
      Immigrant                3    1   0.104

Table 12: Terms in a stepwise fitting strategy for emphysema using a fifth of the data.

8.2 Evaluating the Effect of Pollution

In this subsection we add the pollution variables to the six term model. To check the significance of each of the ‘factors’ using all of the data, we drop each term and note the increase in the deviance. Table 13 shows the improved sensitivity of the model with four times as much data. Except for ‘energy expenditure’, all factors are more significant than they were with the reduced dataset. Note that the categorical covariates show a greater improvement than their continuous counterparts.

Term                   X*     df   P(X* > x)
Smoker Type           159.2    4   0.000
Age                    97.0    1   0.000
Education              85.2    2   0.000
Family Type            98.0    4   0.000
Cigarettes Smoked      24.0    2   0.000
Energy Expenditure     10.1    3   0.018

Table 13: Goodness of fit for each of the terms in the full emphysema model.

We considered the pollution covariates in two ways. For the first we computed summer and winter averages; these estimates were easily added to the model.
In the second approach we used principal components to reduce the number of pollution covariates. The first three principal components based on the eight averages explained 80% of the variation.

The rotations for the three principal components are given in Table 14 and should be considered indices of long term pollution exposure. A close look at the loadings allows for a certain degree of interpretation. The first contrasts O3 and SO4 against SO2 and emphasizes summer averages; the second is a recasting of NO2 with most of the weight given to the summer; and the third is a weighted combination of SO2 and summer O3.

                     Component
Pollutant         #1      #2      #3
Summer NO2      -0.03   -0.90    0.01
Winter NO2      -0.09   -0.43    0.09
Summer O3        0.70   -0.06    0.60
Winter O3        0.13    0.05    0.03
Summer SO2      -0.51    0.07    0.67
Winter SO2      -0.35    0.05    0.38
Summer SO4       0.26    0.06    0.18
Winter SO4       0.17    0.00    0.04

Table 14: Loadings for the first three principal components.

We evaluated the pollution covariates by comparing each model containing a pollution term with the full model nested within it. The resulting decrease in scaled deviance is shown in Table 15. Only ‘summer NO2’ significantly improves model fit at the nominal α = 0.05 level. The coefficient estimate is 0.013 with a standard error of 0.006. Considering the number of tests we performed and the strong assumptions used in the modelling (such as independence between respondents), the result is suggestive rather than conclusive.

Term                       X*    df   P(X* > x)
Summer NO2                 4.3    1   0.039
Pollution Component #2     3.2    1   0.073
Winter O3                  2.8    1   0.093
Summer O3                  1.1    1   0.299
Summer SO4                 0.8    1   0.380
Pollution Component #1     0.8    1   0.386
Pollution Component #3     0.5    1   0.493
Winter SO4                 0.4    1   0.554
Summer SO2                 0.0    1   0.920
Winter NO2                 0.0    1   1.000
Winter SO2                 0.0    1   1.000

Table 15: Goodness of fit test for the pollution terms.

8.3 Comparing Weighted and Unweighted Analyses

Since we used weights with the logistic model, we could naturally ask what the results of an analysis would look like if we ignored weights.
If results are similar then we can feel confident that model misspecification is minimal.

The same stepwise procedure used previously but without survey weights produces Table 16. As before, ‘age’ and ‘smoker type’ give the strongest signals to the model. ‘Cigarettes smoked’ is again the third term to enter the model. ‘Income’, however, rather than ‘energy expenditure’ is the fourth term and, using a similar criterion to the weighted model, the fifth term does not make it into the model. Table 17 shows the goodness of fit test for the terms in the model before adding pollution covariates. Each of the four terms is highly significant.

Table 18 illustrates the addition of pollution covariates. Upon comparing Table 18 with Table 15, we notice that summer NO2 appears in both as the first pollutant. Moreover the estimated coefficient values are similar: the unweighted analysis gives an estimate of 0.012 with a standard error of 0.006.

In both weighted and unweighted models ‘age’ and ‘smoker type’ are strongly associated with emphysema. Although summer NO2 shows up as borderline significant in both cases, significantly stronger covariates such as ‘education’ and ‘income’ appear in one model but not the other.
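The similarity of the two summer NO2 coefficients can be made concrete with a Wald calculation. This is a sketch under a normal approximation, which is an assumption on our part and a different test from the deviance comparisons in Tables 15 and 18. The helper name wald_summary and the ten unit increment used for the odds ratio are hypothetical choices for illustration.

```python
import math

Z_975 = 1.959963984540054  # 97.5th percentile of the standard normal


def wald_summary(estimate, std_error, increment=1.0):
    """Wald z, two-sided p-value, 95% interval and odds ratio for a logit coefficient."""
    z = estimate / std_error
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    lower = estimate - Z_975 * std_error
    upper = estimate + Z_975 * std_error
    return {
        "z": z,
        "p_value": p_value,
        "ci": (lower, upper),
        # Odds ratio for a rise of `increment` units in the pollutant estimate.
        "odds_ratio": math.exp(increment * estimate),
        "odds_ratio_ci": (math.exp(increment * lower), math.exp(increment * upper)),
    }


if __name__ == "__main__":
    # Estimates and standard errors as reported in the text.
    weighted = wald_summary(0.013, 0.006, increment=10.0)
    unweighted = wald_summary(0.012, 0.006, increment=10.0)
    for label, result in (("weighted", weighted), ("unweighted", unweighted)):
        print(label, round(result["z"], 2), round(result["p_value"], 3),
              round(result["odds_ratio"], 2))
```

Both intervals sit just above zero, which is consistent with describing the association as borderline in either analysis.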
We again conclude with a weak belief in the association between summer NO2 and emphysema.

Step  Stepwise added Terms    X*     df   P(X* > x)
      Age                     91.3    1   0.000
      Family Type             46.1    3   0.000
      Income                  40.7    2   0.000
      Smoker Type             76.1    4   0.000
      Current Smoker Type     50.5    3   0.000
      Duration Smoked         145     2   0.000
      Smoker Type             77.8    4   0.000
      Current Smoker Type     60.8    3   0.000
2     Cigarettes Smoked       73.7    2   0.000
      Household Smokers       52.8    1   0.000
      Duration Smoked         19.8    1   0.000
      Cigarettes Smoked       24.3    2   0.000
      Allergy                 20.4    1   0.000
3     Income                  15.8    2   0.000
      Duration Smoked          9.4    2   0.009
      Drinker Category        13.2    4   0.010
      Income                  15.7    2   0.000
      Family Type             10.9    3   0.012
4     Household Smokers        5.3    1   0.022
      Drinker Category        10.8    4   0.029
      Education                6.8    2   0.034
      Household Smokers        6.3    1   0.012
      Drinker Category         9.1    4   0.058
5     Duration Smoked          5.7    2   0.059
      Drinker Indicator        4.9    2   0.086
      Family Type              5.8    3   0.122

Table 16: Stepwise terms for emphysema using a fifth of the data and ignoring weights.

Term                  X*     df   P(X* > x)
Age                  297      1   0.000
Smoker Type          123      4   0.000
Income               118      2   0.000
Cigarettes Smoked     46.3    2   0.000

Table 17: Terms in the full unweighted emphysema model.

Term                       X*    df   P(X* > x)
Summer NO2                 3.8    1   0.051
Pollution Component #2     3.7    1   0.053
Winter NO2                 3.4    1   0.067
Summer SO4                 1.9    1   0.173
Pollution Component #1     1.1    1   0.286
Winter O3                  1.1    1   0.296
Winter SO4                 1.1    1   0.303
Summer O3                  1.0    1   0.325
Summer SO2                 0.2    1   0.699
Pollution Component #3     0.2    1   0.699
Winter SO2                 0.1    1   0.806

Table 18: Goodness of fit test for the pollution terms in the unweighted analysis.

9 Discussion

In this study we explored the association between airborne pollutants and chronic respiratory health. Our findings at most suggest that a weak association exists. In this chapter we critically consider the model assumptions and exposure measurement problems.

9.1 Model Assumptions

If the pollution coefficients showed strong significance we would inevitably get caught up in a debate over the reported standard errors.
Measurement error and the clustering used in survey sampling are two reasons why the standard errors are, perhaps even grossly, underestimated.

We normally underestimate standard errors because we implicitly assume the absence of measurement error. We have reason to believe, however, that measurement error does exist in our study variables, to an extent that is unknown. Breaking down the layers of error helps in gaining an understanding of the phenomenon.

The first problem arises from self reporting. When asked whether they have a chronic cough, respondents introduce subjective assessments of their own condition. We all know of psychosomatic individuals and their converse, obdurate, self denying stoics. More precisely worded questions concerning the nature of the symptoms may minimize interpretive leeway. In Abbey, Petersen, Mills and Beeson (1993) symptom assessment questions incorporated specific time durations. “Did you have symptoms of cough and/or sputum production on most days, for at least three months per year, for two years or more?” is one example of the wording. The Ontario Health Survey, on the other hand, asked if the respondent had “emphysema or chronic bronchitis or persistent cough?”.

Besides interpretive difficulties associated with a questionnaire, disease diagnosis can be difficult. Emphysema provides a sterling example: a definitive answer avails itself only after autopsy. The two main diagnostic instruments, providing physiological and radiological measurements, are not failsafe.

Physiological measurements include carbon monoxide diffusion capacity, slope of the volume pressure curve and lung recoil at specific lung volumes. At best, they have been shown to be good at detecting severe cases; mild disease is poorly detected. Overall, no physiological measure or combination of measurements can reliably serve for disease determination.

Radiography provides an opportunity to look inside the body without using the scalpel.
Standard films do not do away with the subjective judgement of the medical practitioner, however. As with physiological measurements, chest radiographs are good for diagnosing severe cases but can only diagnose about half the moderately severe cases. Computed tomography offers an improved image over standard radiography and, as a result, better diagnosis potential. High resolution computed tomography, in particular, appears to correctly resolve 90% of all cases (Snider, 1992, p. 1342).

A question arises, of course: if a trained expert is unable to diagnose emphysema reliably then how much caution should we take in considering a self reported condition? We can enumerate some of the germane factors: the patient’s willingness to go to the doctor, the date of the last checkup, the severity of the condition, the equipment available to the physician, the physician’s willingness to use the equipment and professional judgement all determine to some degree the binary response recorded by the 1990 Ontario Health Survey.

Besides measurement error, clustering usually inflates the real error. In the models we considered we assume independence between observations. The Ontario Health Survey, however, uses cluster sampling in which blocks of households, ostensibly more similar to one another the closer they are, are surveyed at the same time. We ignored this phenomenon because the micro-data we had did not allow us to identify the clusters.

In short we used liberal assumptions and underestimated the standard errors associated with coefficient estimates. Once converted to p-values, the results which we have shown seem more significant than they are.
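To give a feel for how clustering could change these conclusions, the sketch below inflates a naive standard error by Kish's approximate design effect, 1 + (m - 1)ρ. The cluster size m and intra-cluster correlation ρ used here are hypothetical, since the micro-data do not identify clusters, and the approximation is strictly intended for means under equal cluster sizes rather than for logistic coefficients. The function names are our own.

```python
import math


def design_effect(cluster_size, icc):
    """Kish's approximate design effect for equal-sized clusters."""
    return 1.0 + (cluster_size - 1.0) * icc


def adjusted_wald(estimate, naive_se, deff):
    """Inflate a naive standard error by sqrt(deff) and recompute the Wald test."""
    se = naive_se * math.sqrt(deff)
    z = estimate / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail
    return se, z, p_value


if __name__ == "__main__":
    # Weighted summer NO2 coefficient and standard error from the text.
    estimate, naive_se = 0.013, 0.006
    # Hypothetical clustering: 10 respondents per cluster, correlation 0.05.
    deff = design_effect(cluster_size=10, icc=0.05)
    for label, d in (("naive", 1.0), ("clustered", deff)):
        se, z, p = adjusted_wald(estimate, naive_se, d)
        print(f"{label:10s} se={se:.4f} z={z:.2f} p={p:.3f}")
```

Under these illustrative values the two-sided p-value moves from below 0.05 to above it, which shows how fragile a borderline result is to even modest clustering.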
The reader is therefore encouraged to be cautious about accepting statements of significance at face value.

9.2 Exposure Measurement Problems

The most difficult task for an epidemiological study of this type is to derive good estimates of long term exposure. Our pollution data may not have been adequate to detect the hypothesized relationship between ambient air pollution and pulmonary disease. We know, for instance, that the study population is anything but fixed. In our study, however, we assumed that people were more or less situated within a Public Health Unit for a time span long enough to make the pollution estimates relevant. Figure 20, estimated from the Ontario Health Survey data itself, shows that about 25% of the sample immigrated from another country! Other sources report that in the fifteen year span from 1971 to 1986 almost half of all Canadians changed their place of residence every five years (Statistics Canada, 1991, p. 72). Information on the migratory history of the individuals in the sample would have been helpful.

Figure 20: Immigration background of the 1990 residents of Ontario. (Series: Total, Europe, Asia; horizontal axis: Year, 1900 to 1990, with a 1980 cutoff marked.)

Another difficulty arises from incomplete exposure estimate coverage. In our case we have pollution measurements from 1983 to 1989. We then assume that this time period is similar to earlier intervals. The North American experience, however, leaves that assumption in doubt. Hoberg (1993) argues that pollution controls have made a difference to observable pollution values. Figure 21 shows changes in Canadian and American pollution levels from the mid 1970s to the late 1980s. Though the picture is complex we can generally see a decline in ambient air pollution readings. Changes in the spatial distribution of pollution will also muddy the interpretation of the resulting analysis.

Figure 21: North American air quality trends. (Panels: Canada, U.S.A.)

Finally, ambient air pollution could be less important than accumulated levels achieved indoors.
Dales et al. (1991a) and Dales, Burnett and Zwanenburg (1991b) identified the importance of indoor molds on the respiratory function of children and adults, respectively. The models I considered for this thesis ignored the potential of household pollutants because of the absence of this kind of data.

Source for Figure 21: Hoberg (1993), p. 118. (Horizontal axes: 1975 to 1990.)

9.3 Future Directions

If exposure estimates are often the weak link in epidemiological studies then future efforts must concentrate on improving the alternatives. The use of personal monitoring devices offers one such possibility. Silverman et al. (1992) offer a successful application of portable multipollutant samplers. Due to expense, however, only thirty six subjects participated in the study.

If personal monitoring is too costly and fixed station data too crude, a greater emphasis on sophisticated modelling offers an intermediary solution. The Office of Air Quality Planning and Standards, a branch of the U.S. Environmental Protection Agency, simulated people’s movements through zones of varying air quality to approximate exposure patterns. The National Ambient Air Quality Standards Exposure Model (NEM) uses ambient air pollution station measurements, activity diaries and population data as its major data sources. Since the early stages of development in 1979, the model has shown promise (Johnson, Capel, Paul and Wijnberg, 1992).

We presume the relationship between airborne pollutants and chronic respiratory disease, if any, is subtle. This study has demonstrated the difficulty of exploring such a relationship through the use of a general purpose survey.

References

[1] ABBEY, David, John MOORE, Floyd PETERSEN and Larry BEESON (1991). “Estimating cumulative ambient concentrations of air pollutants: description and precision of methods used for an epidemiological study.” Archives of Environmental Health. 46, 5, 281-287.
[2] ABBEY, David, Floyd PETERSEN, Paul MILLS and Lawrence BEESON (1993).
“Long-term ambient concentrations of total suspended particulates, ozone, and sulfur dioxide and respiratory symptoms in a nonsmoking population.” Archives of Environmental Health. 48, 1, 33-46.
[3] BATES, David (1992). “Health indices of the adverse effects of air pollution: the question of coherence.” Environmental Research. 59, 336-349.
[4] BINDER, David (1983). “On the variances of asymptotically normal estimators from complex surveys.” International Statistical Review. 51, 279-292.
[5] BRAIN, Joseph D. (1989). “The susceptible individual: an overview.” Susceptibility to Inhaled Pollutants, ASTM STP 1024. Mark Utell and Robert Frank, Editors, 3-5.
[6] BRITTON, J. (1992). “Pollution and respiratory morbidity: how much do we accept?” Thorax. 47, 5, 391-392.
[7] BROWN, P.J., LE, Nhu D. and ZIDEK, J.V. (1993). “Multivariate Spatial Interpolation and Exposure to Air Pollutants.” Canadian Journal of Statistics. To appear.
[8] BURKE, T., H. ANDERSON, N. BEACH, D. COLOME, M. FIRESTONE, F. HUACHMAN, T. MILLER, D. WAGENER, L. ZEISE, and L. TRAN (1992). “Role of exposure databases in risk management.” Archives of Environmental Health. 47, 6, 421-429.
[9] CLEMMESEN, Johannes (1993). “Lung cancer from smoking: delays and attitudes, 1912-1965.” American Journal of Industrial Medicine. 23, 941-953.
[10] COLLETT, Dave (1991). Modelling Binary Data. London: Chapman and Hall.
[11] DALES, Robert, Harry ZWANENBURG, Richard BURNETT and Claire FRANKLIN (1991a). “Respiratory health effects of home dampness and molds among Canadian children.” American Journal of Epidemiology. 134, 2, 196-203.
[12] DALES, Robert, Richard BURNETT and Harry ZWANENBURG (1991b). “Adverse health effects among adults exposed to home dampness and molds.” American Review of Respiratory Disease. 143, 505-509.
[13] DOBSON, Annette J. (1990). An Introduction to Generalized Linear Models. London: Chapman and Hall.
[14] DUDDEK, Chris, Nhu LE, Weimin SUN, Richard WHITE, Hubert WONG and James V. ZIDEK (1994).
“Assessing the impact of ambient air pollution on hospital admissions using interpolated exposure estimates in both space and time.” Final report to Health Canada under DSS contract H4078-3-C059/01-SS.
[15] EVANS, J.S., T. TOSTESON and P.L. KINNEY (1984). “Cross sectional mortality studies and air pollution risk assessment.” Environment International. 10, 55-83.
[16] EVANS, Robert (1993). “Less is more: Contrasting styles in health care.” In Canada and the United States: Differences that Count. Edited by David Thomas. Peterborough, Ont: Broadview Press. 21-41.
[17] FAY, Robert E. (1984). “Application of linear and log-linear models to data from complex surveys.” Survey Methodology. 10, 1, 82-96.
[18] FRY, John (1985). Common Diseases: Their Nature, Incidence and Care. Fourth Edition. Lancaster: MTP Press Limited.
[19] GIBSON, Robert (1990). “Out of control and beyond understanding: Acid rain as a political dilemma.” In Managing Leviathan: Environmental Politics and the Administrative State. Edited by Robert Paehlke and Douglas Torgerson. Peterborough, Ont: Broadview Press. 243-282.
[20] GOMEZ, Stephen, Robert PARKER, James DOSMAN and Helen McDUFFIE (1992). “Respiratory health effects of alkali dust in residents near desiccated Old Wives Lake.” Archives of Environmental Health. 47, 5, 364-369.
[21] HACKNEY, Jack, William LINN, Edward AVOL, Deborah SHAMOO, Karen ANDERSON, Joseph SOLOMON, David LITTLE and Ru-Chuan PENG (1992). “Exposures of older adults with chronic respiratory illness to nitrogen dioxide.” American Review of Respiratory Disease. 146, 1480-1486.
[22] HOBERG, George (1993). “Comparing Canadian performance in environmental policy.” In Canada and the United States: Differences that Count. Edited by David Thomas. Peterborough, Ont: Broadview Press. 101-124.
[23] INSTITUTE FOR HEALTH CARE FACILITIES OF THE FUTURE (1990). Future Health: A View of the Regional Trends.
Ottawa.
[24] JOHNSON, Ted, Jim CAPEL, Roy PAUL and Luke WIJNBERG (1992). “Estimation of carbon monoxide exposures and associated carboxyhemoglobin levels in Denver residents using a probabilistic version of NEM.” Final report to U.S. Environmental Protection Agency, Office of Air Quality Planning and Standards, contract number 68-D0-0062.
[25] KUMAR, S. and J.N.K. RAO (1984). “Logistic regression analysis of labour force survey data.” Survey Methodology. 10, 1, 62-76.
[26] LEBOWITZ, Michael (1991). “Populations at risk: addressing health effects due to complex mixtures with a focus on respiratory effects.” Environmental Health Perspectives. 95, 35-38.
[27] MATANOSKI, G., S. SELEVAN, G. AKLAND, R. BORNSCHEIN, D. DOCKERY, L. EDMONDS, A. GREIFE, M. MEHLMAN, G. SHAW and E. ELLIOTT (1992). “Role of Exposure Databases in Epidemiology.” Archives of Environmental Health. 47, 6, 439-446.
[28] McCULLAGH, P. and J.A. NELDER (1989). Generalized Linear Models. Cambridge: Chapman and Hall.
[29] McDONNELL, William, Howard KEHRL, Said ABDUL-SALAAM, Philip IVES, Lawrence FOLINSBEE, Robert DEVLIN, John O’NEIL and Donald HORSTMAN (1991). “Respiratory response of humans exposed to low levels of ozone for 6.6 hours.” Archives of Environmental Health. 46, 3, 145-150.
[30] McDONNELL, William, Keith MULLER, Philip BROMBURG and Carl SHY (1993). “Predictors of individual differences in acute response to ozone exposure.” American Review of Respiratory Disease. 147, 818-825.
[31] ONTARIO MINISTRY OF HEALTH (1992a). “Ontario Health Survey 1990.” User’s guide volume 1.
[32] ONTARIO MINISTRY OF HEALTH (1992b). “Ontario Health Survey 1990.” User’s guide volume 2.
[33] OZKAYNAK, Haluk and George THURSTON (1987). “Associations between 1980 U.S. mortality rates and alternative measures of airborne particle concentration.” Risk Analysis. 7, 4, 449-461.
[34] PFEFFERMANN, Danny (1993). “The role of sampling weights when modeling survey data.” International Statistical Review. 61, 2, 317-338.
[35] PURTILO, D. and R. PURTILO (1989).
Survey of Human Diseases. Second Edition. Boston: Little Brown and Company.
[36] RAPPAPORT, S.M. (1993). “Threshold limit values, permissible exposure limits and feasibility: the bases for exposure limits in the United States.” American Journal of Industrial Medicine. 23, 683-694.
[37] ROBINSON, L. and N. JEWELL (1991). “Some surprising results about covariate adjustment in logistic regression models.” International Statistical Review. 59, 227-240.
[38] ROBINSON, J. and D. PAXMAN (1992). “The role of threshold limit values in U.S. air pollution policy.” American Journal of Industrial Medicine. 21, 383-396.
[39] SARNDAL, Carl-Erik, Bengt SWENSSON and Jan WRETMAN (1992). Model Assisted Survey Sampling. New York: Springer-Verlag.
[40] SELZER, M. and L. VAN ROOIJEN (1975). “A self administered Short Michigan Alcohol Screening Test (SMAST).” Journal of Studies on Alcohol. 36, 1, 117-126.
[41] SEIXAS, Noah, Thomas ROBINS and Mark BECKER (1993). “A novel approach to the characterization of cumulative exposure for the study of chronic occupational disease.” American Journal of Epidemiology. 137, 4, 463-471.
[42] SEILER, Tamara (1993). “Melting Pot and Mosaic: Images and Realities.” In Canada and the United States: Differences that Count. Edited by David Thomas. Peterborough, Ont: Broadview Press. 303-325.
[43] SENTHILSELVAN, A., Yue CHEN and James DOSMAN (1993). “Predictors of asthma and wheezing in adults: grain farming, sex and smoking.” American Review of Respiratory Disease. 148, 667-670.
[44] SEXTON, K., D. WAGENER and J. LYBARGER (1992). “Estimating human exposures to environmental pollutants: availability and utility of existing databases.” Archives of Environmental Health. 47, 6, 398-407.
[45] SILVERMAN, F., H.R. HOSEIN, P. COREY, S. HOLTON and S.M. TARLO (1992). “Effects of particulate matter exposure and medication use on asthmatics.” Archives of Environmental Health. 47, 1, 51-56.
[46] SNIDER, Gordon (1992). “Emphysema: the first two centuries and beyond.” American Review of Respiratory Disease.
146, 1334-1344.
[47] STATISTICS CANADA (1983). Historical Statistics of Canada. Second Edition. Edited by F. H. Leacy and M. C. Urquhart. Ottawa: Supply and Services Canada.
[48] STATISTICS CANADA (1991). Canada Year Book. Ottawa: Supply and Services Canada.
[49] STEINBERG, Stephen (1981). The Ethnic Myth. Boston: Beacon Press.
[50] VALANIS, Barbara (1992). Epidemiology in Nursing and Health Care. Second Edition. Toronto: Prentice Hall.
[51] XIPING, Xu, Douglas DOCKERY and Lihua WANG (1991). “Effects of air pollution on adult pulmonary function.” Archives of Environmental Health. 46, 4, 198-206.

Appendix: Figures

I decided to place some of the figures in this appendix in the hope of avoiding unnecessary distraction in the text. The figures are ordered into the following logical groupings:

OHS Nonresponse
Figure 22: Ordered summary of covariate nonresponse.

Pollution Measurements and Estimates
Figure 23: NO2 station measurements;
Figure 24: O3 station measurements;
Figure 25: SO2 station measurements;
Figure 26: SO4 station measurements;
Figure 27: Strongest relationships between pollutants;
Figure 28: Estimated six year estimates for all pollutants;
Figure 29: Comparison of pollution measurements to estimates;
Figure 30: NO2 PHU estimates;
Figure 31: O3 PHU estimates;
Figure 32: SO2 PHU estimates; and
Figure 33: SO4 PHU estimates.

Asthma Arcsine Analysis Diagnostics
Figure 34: Demographic model;
Figure 35: Socioeconomic model;
Figure 36: Lifestyle model.

Figure 22: Ordered summary of study covariate nonresponse. (Covariates shown: Family Score, Body Mass, Alcohol Problem, Current Smoker, Smoker Type, Blue Collar, Allergy, Education, Immigration, Well Being, Exercise, Income, No. of Cigs, Duration Smoked, Work Exposure.)

Figure 23: NO2 readings for twelve stations ordered by increasing station mean (µg/m3). (Horizontal axis: Year, 1983 to 1989.)

Figure 24: O3 readings for twenty out of twenty one stations (µg/m3). (Horizontal axis: Year, 1983 to 1989.)

Figure 25: SO2 readings for all twenty stations (µg/m3).

Figure 26: SO4 readings for all ten stations (µg/m3).

Figure 27: Strongest scatterplot relationships between pollution estimates. (Axes pair seasonal estimates, including Summer NO2, Summer SO2, Summer O3, Summer SO4 and Winter O3.)

Figure 28: Distribution of the 37 estimated PHU means for the four pollutants. (Panels: Nitrogen Dioxide, Ozone, Sulfur Dioxide, Sulfates; horizontal axis: Estimated Means.)

Figure 29: Comparison of pollution measurements to estimates (µg/m3).

Figure 30: Estimated NO2 six year average. (Panels: Summer, Winter.)

Figure 31: Estimated O3 six year average. (Panels: Summer, Winter.)

Figure 32: Estimated SO2 six year average. (Panels: Summer, Winter.)

Figure 33: Estimated SO4 six year average.

Figure 34: Asthma arcsine model diagnostics for the demographic grouping. (Panels: Index; Standard Normal Quantiles; Predicted; Stratum (Rural/Urban); Sex (Female/Male); Immigrant (Imm/Non); Age (Young to Old).)

Figure 35: Asthma arcsine model diagnostics for the socioeconomic grouping. (Panels: Standard Normal Quantiles; Predicted; Work Exposure (No/Yes); Post-Secondary (No/Yes); Blue Collar (NA/Blue/White); Income (NA/Low/Mid/High).)

Figure 36: Asthma arcsine model diagnostics for the lifestyle grouping. (Panels: Standard Normal Quantiles; Predicted; Smoker Type; Alcohol (Non/Drinker); # Cigarettes (Few to Many); Duration Smoked (Years).)

