UBC Community, Partners, and Alumni Publications

Outcomes management and resource allocation : how should quality of life be measured? Hadorn, David C., 1952- Jul 31, 1993


Full Text

OUTCOMES MANAGEMENT AND RESOURCE ALLOCATION: HOW SHOULD QUALITY OF LIFE BE MEASURED?

David C. Hadorn, M.D., M.A.

HPRU 93:70 JULY, 1993

HEALTH POLICY RESEARCH UNIT
CENTRE FOR HEALTH SERVICES AND POLICY RESEARCH
#429 - 2194 HEALTH SCIENCES MALL
UNIVERSITY OF BRITISH COLUMBIA
VANCOUVER, B.C. CANADA V6T 1Z3

The Centre for Health Services and Policy Research was established by the Board of Governors of the University of British Columbia in December 1990. It was officially opened in July 1991. The Centre's primary objective is to co-ordinate, facilitate, and undertake multidisciplinary research in the areas of health policy, health services research, population health, and health human resources. It brings together researchers in a variety of disciplines who are committed to a multidisciplinary approach to research, and to promoting wide dissemination and discussion of research results, in these areas. The Centre aims to contribute to the improvement of population health by being responsive to the research needs of those responsible for health policy. To this end, it provides a research resource for graduate students; develops and facilitates access to health and health care databases; sponsors seminars, workshops, conferences and policy consultations; and distributes Discussion papers, Research Reports and publication reprints resulting from the research programs of Centre faculty.

The Centre's Health Policy Research Unit Discussion Paper series provides a vehicle for the circulation of preliminary (pre-publication) work of Centre Faculty and associates. It is intended to promote discussion and to elicit comments and suggestions that might be incorporated within the work prior to publication. While the Centre prints and distributes these papers for this purpose,
the views in the papers are those of the author(s).

A complete list of available Health Policy Research Unit Discussion Papers and Reprints, along with an address to which requests for copies should be sent, appears at the back of each paper.

TABLE OF CONTENTS

Abstract
List of Figures

Part A: Calibration of a Brief Questionnaire and a Search for Preference Subgroups
I. Background and Significance
   What's the Problem?
   Need for a Standard, Brief, Generic Outcome Questionnaire
   Questionnaire Design and Calibration
   Preference Subgroups and the Problem of Discrimination
II. Methods
   Sample
   Preference-Measurement Instrument
   Statistical Analysis
III. Results
   Effect of Subjects' Quality of Life on Ratings
   Demographic Differences
      Age Differences
      Gender Differences
   Effect on Ratings of Experience with Illness
   Preference Structure
IV. Discussion
   Preferences and Outcomes Management

Part B: Questionnaire Validation
I. Introduction
II. Methods
   Subjects
   Validity of Questionnaire
III. Results
   Questionnaire Validity
   Mortality Analyses
   Changes in HRQOL Over Time
IV. Discussion
   Use of Global HRQOL Measure
   Induction Problems

Part C: Problems with Interpretation
I. Introduction
II. Strengthening Causal Inferences
   Standardizing the Concept of HRQOL
   Definitions for Quality of Life Survey
   Induction Problems
   Need for a Central HRQOL Registry
III. Identifying "Type of Patients"
   Statistical Approaches
   Clinical versus Statistical Significance
   The Problem of Discrimination
   Length of Life versus Quality of Life
IV. Conclusion

References

ABSTRACT

The relentless increase in health care expenses, coupled with persistent concerns about the quality and appropriateness of medical services, has brought increasing pressure to bear on researchers to develop more efficient strategies for determining "what works" in medicine. Several years ago, Paul Ellwood recommended that society arrange to regularly collect information concerning the health-related quality of life (HRQOL) of patients with medical problems and conditions. Coupled with demographic, clinical, and treatment data, this HRQOL outcome information would permit researchers to determine which services provide significant benefit to which types of patients.

For many reasons, Ellwood's vision of large-scale "outcomes management" programs has not come to pass. Probably the most significant impediment has been the absence of a very brief, generic HRQOL survey instrument which is calibrated according to empirically derived values and preferences. This discussion paper describes an effort to create and test such a questionnaire.

In Part A we describe the process of calibrating a new, four-item HRQOL questionnaire based on preferences elicited (in various ways) from about 600 people from different walks of life.
During this process we observed few systematic differences in preferences across demographic lines; moreover, people with medical conditions or disabilities did not rate HRQOL-problem health states much differently than people without those conditions or disabilities.

Part B describes our clinical testing of the mail-in questionnaire in a cohort of 400 cancer patients. The questionnaire was administered three times to each patient over a six-month period. We found that the questionnaire was able to capture patients' "true" HRQOL, as determined by subsequent careful, standardized (and blinded) telephone interviews. We conclude that use of our four-item questionnaire -- or of its global item alone -- can provide valid information concerning patients' HRQOL.

Part C discusses several key issues pertaining to the interpretation of observational HRQOL outcome data. To strengthen causal inferences drawn from these data, we recommend that the definition of HRQOL be restricted and standardized, and that the purpose of collecting outcome data be carefully explained as part of a public education program. Reporting one's HRQOL outcomes should come to be seen as a civic duty, like voting. Large sample sizes will be needed to control for potential confounding factors that could complicate (or preclude) inferences concerning treatment effectiveness. Finally, we discuss the central problem of outcome research: identifying "types of patients" who either do or do not derive significant benefit from specified treatments and procedures.

We conclude that very brief, even single-item, questionnaires can be appropriate vehicles for obtaining valid data concerning patients' HRQOL outcomes, and that valid inferences can be drawn from these data concerning the effectiveness of medical interventions. Large-scale outcomes management studies can provide an efficient and powerful strategy for determining "what works" in medicine (or, more precisely, "what works for whom").
By identifying when services have or have not been shown to provide significant net health benefit, society can ensure that its health care resources are allocated wisely.

LIST OF FIGURES

Figure 1 The Quality of Life and Health Questionnaire
Figure 2 Direct Ratings for Sixteen Health States
Figure 3 Subgroup Ratings of Two Health States
Figure 4 Preference Structure
Figure 5 Standardized Telephone Interview
Figure 6 Characteristics of Patient Sample
Figure 7 Correlation Matrix of Patients' and Interviewers' Responses to Different HRQOL Dimensions
Figure 8 Correlation Matrix for Overall HRQOL Measures
Figure 9 Differences in Mortality and Lost-to-Follow-up Rates
Figure 10 Behavior Over Time of HRQOL Parameters
Figure 11 Distribution of HRQOL Change Scores

Part A
Calibration of a Brief Questionnaire and a Search for Preference Subgroups

I. Background and Significance

Five years have passed since Paul Ellwood first introduced the term "outcomes management" to the health policy debate [1]. Ellwood envisioned the creation of a "permanent national medical data base that uses a common set of definitions for measuring quality of life." Patients would regularly supply information concerning their health-related quality of life as part of a large-scale quasi-experiment designed to determine the outcomes of care and the effectiveness of medical and surgical services. Responses to the outcome questionnaires would be linked to medical records containing information about patients' conditions and treatments. This activity, Ellwood believed, would provide a "central nervous system" for a health care system increasingly characterized by "uninformed patients, skeptical payers, frustrated physicians, and besieged health care executives."

Ellwood's vision helped launch what is now commonly referred to as the "outcomes movement." In an editorial published a few months after Ellwood's article, Arnold Relman heralded the dawn of the "third revolution in health care, the Era of Assessment and Accountability" [2].
Relman endorsed the idea of "linking medical management decisions to new, systematic information about outcomes," saying this process would "improve the quality and effectiveness of health care and provide a much firmer base for future economic decisions." This apparent allusion to resource allocation decision-making was echoed more explicitly a couple of years later by John Wennberg [3], who asserted that the use of outcome information might forestall the need for society to ration truly effective services by "sort[ing] out what works in medicine."

This bright promise seems to have dimmed somewhat of late. Indeed, despite sustained intensive efforts by many talented researchers and policy analysts, we are not much closer to knowing how, exactly, society might make use of outcomes information to set priorities within the health care system.

What's the Problem?

The basic idea sounds simple enough. First, determine the health outcomes associated with different treatments. Next, determine how people feel about those outcomes. Finally, give priority to treatments that produce more-preferred outcomes. Such an outcome-based approach to resource allocation, although potentially vulnerable to charges of discrimination [4, 5], is eminently reasonable, fair, and, indeed, necessary.

It is anything but simple, however. To date, only one instance of an outcome- and preference-based effort to set priorities has occurred: the trailblazing (and highly controversial) work of the Oregon Health Services Commission [6, 7, 8]. Unfortunately, although the Oregon project provided a wealth of experience on one possible approach to estimating and dovetailing outcomes with preferences, the result of that project -- a priority list containing 688 condition-treatment pairs -- is of questionable utility.
Because of the wide range of procedures and indications contained within each "line item" on the list, substantial additional specification will be needed before the list can be applied to actual patients.

The Oregon experience notwithstanding, the real-world use of outcomes to set health care priorities remains a seemingly distant goal. Moreover, systematic outcomes management, as envisioned by Ellwood, is essentially no closer to implementation than when it was first proposed. Why is this? There are many reasons, including (1) lack of a standard electronic medical record for entering and retrieving the necessary data; (2) physician resistance to the use of formal patient outcome questionnaires [10, 11]; and (3) reluctance to base policy on causal inferences drawn from non-experimental data [12]. The first two of these problems can be overcome with greater federal leadership; the third will require, in addition, careful methodological consideration [13].

None of the above factors is the most fundamental obstacle to creation of a large-scale system of outcomes management, however. A more basic problem is the lack of a standard, very brief, generic quality-of-life outcome questionnaire that is calibrated according to empirically derived public preferences. Creation of such a questionnaire constitutes the essential first step toward development of an effective system of outcomes management.

Need for a Standard, Brief, Generic Outcome Questionnaire

Clinical studies of particular conditions and treatments often use as endpoints condition- and treatment-specific outcomes (e.g., shortness of breath or nausea caused by chemotherapy). Such outcome measures are not suitable for purposes of resource allocation, however, as they do not permit comparison of different treatments and procedures [14, 15, 16].

How important, for example, is a coronary bypass operation for a specified clinical condition vs. chemotherapy for a particular type of cancer? If money is tight,
which service should be funded if both cannot be? Answers to this sort of question will depend on the comparability of outcome data, which is possible only through the use of generic measures. [17] (p.n5)

James Bush and his colleagues were among the first to recognize the importance of generic measures (e.g., pain or physical suffering) for resource allocation decision-making. Generic measures formed (and continue to form) an integral part of the quality-adjusted life year concept introduced by these investigators over 20 years ago [18, 19]. More recently, RAND researchers developed and used generic measures in the Medical Outcome Study in order "to compare patients who have different conditions by providing a common yardstick" [20] against which to measure outcomes.

Recognition of these advantages has led to the development of a host of generic outcome questionnaires [21], the most commonly used of which today is known as the SF-36 [22]. Based on the outcome measures used in the Medical Outcome Study [20], the SF-36 (for Short Form--36 items) is now in widespread use throughout the United States, including many of the Patient Outcome Research Teams sponsored by the federal Agency for Health Care Policy and Research [23]. The SF-36 contains items related to several aspects of health-related quality of life (HRQOL), including pain, extent of limits in daily activities, mental health, and energy level. Researchers plan to use the input obtained from the SF-36 to assess the HRQOL outcomes associated with various treatments.

Questionnaire Design and Calibration

In assessing the importance of the HRQOL outcomes reported by patients in a system of outcomes management, scoring of the questionnaires used to detect and measure these outcomes must reflect the actual values or preferences of the public. Unfortunately, like many other generic questionnaires, the SF-36 is constructed without explicit reference to the relative priority people place across and within different dimensions of HRQOL. For example,
For example,the SF-36 contains fourteen items pertaining to physical functioning and performance of dailyactivities, but only two items related to pain. Unless patients value functioning seven times asmuch as relief or avoidance of pain (an empirical question addressed in this study), the resultsfrom the SF-36 will be biased toward functioning and away from pain if aggregate scores arecalculated.Similarly, scoring within items on the SF-36 is essentially arbitrary. For example, allfour of the following transltlons are assigned one point:31. A change from "gqod" to "fair" health2. A change from "limited a little" to "limited a lot" in bathing and dressing one's self3. A change from "feeling calm and peaceful "most of the time" to "a good bit of thetime"4. A change from "accomplishing as much as one would like" to accomplishing lessthan one would like (because of physical or emotional problems).It is possible , however, that most people might consider transition #4 to be substantially worsethan #3, or vice versa. Similar instances of unequal weighting of preferences might be foundfor many or most of the other Items.The lack of empirical preference weights for the items contained in the SF-36 (andmost other HRQOL questionnaires) is a significant shortcoming, especially If these instrumentsare to be used for the socially sensitive purpose of setting health care priorities. By contrast,three other well-known generic instruments are calibrated according to empirically derivedpreferences: the Quality of Well-Being Scale (QWB),24 the Nottingham Health Profile(NHP)2526, and the Sickness Impact Profile (SIP).27Unfortunately, all of these questionnaires are probably too long to be used in large-scalepatient outcome studies, Under a system of outcomes management, patients must continueto complete the health status surveys regularly after they have contacted the health caresystem , perhaps every six to twelve months following initial contact. 
This means, in turn, that the surveys will need to be mailed to patients' homes at regular intervals for self-administration and returned to a central receiving center. Very high response rates will be required if these outcome data are to command enough respect to guide priority-setting. From the perspective of cost and ease of response, a coded postal-card version of the questionnaire would probably be ideal for this purpose; perhaps three or four items would be the maximum for such a format. (See Part C for further discussion of survey logistics.)

The Rosser-Kind Index is the closest extant example of such an instrument [28]. This questionnaire has been used primarily for resource planning purposes in the United Kingdom and Europe, but has seen little use in North America. The Rosser-Kind Index consists of two items: one concerning pain (none, mild, moderate, and severe), the other on physical functioning (eight levels, ranging from no limits to unconscious). We believe that these generic HRQOL dimensions represent a reasonable distillation of the range of symptoms and disabilities caused by medical conditions, diseases, and treatments. (See Reference [17] for further discussion of this issue.)

Based on our experiences in previous studies [29, 30], we modified the Rosser-Kind Index to produce a four-item questionnaire (Figure 1) for use in outcomes management and resource allocation. In Part A of this discussion paper we report the results of the questionnaire calibration phase of this project.
In Part B we describe our experience using the questionnaire in a cohort of 400 cancer patients.

Preference Subgroups and the Problem of Discrimination

During the process of calibrating our test questionnaire -- dubbed the Quality of Life and Health Questionnaire -- we also conducted a search for coherent preference subgroups. An important unresolved issue in the field of resource allocation is whether people's preferences differ significantly based on demographic characteristics or, particularly, on whether they have experienced (or are experiencing) medical conditions or disabilities. Concern over such differences was the stated reason for the initial denial of the waiver needed by the State of Oregon to implement its much-discussed effort to set health care priorities in its Medicaid program [31].

What is the evidence for differing preferences? There are some data suggesting that preferences for specific health states may vary based on experience with those states. People who have experienced specific states of poor health may rate those states higher [32, 33, 34], lower [35], or the same [36, 37] as people without comparable experience. In contrast to these inconsistent findings, people's preferences for generic outcomes are remarkably consistent, irrespective of demographic or clinical characteristics [38, 39, 40, 41, 42, 43]. Indeed, this finding is perhaps the most consistent result within the entire field of preference measurement. One might speculate that, in addition to their advantages in the outcome-assessment arena (as discussed above), generic outcomes may be easier for people to comprehend (and to rate) in terms of preferences. This could be true because, although most people have not experienced any given specific health state, almost everyone has experienced pain and at least temporary limits on daily activities.
This experience might facilitate the assignment of preference values to generic health states.

The evidence concerning the uniformity of preferences for generic outcomes has at least two shortcomings: (1) the small numbers of people enrolled in these studies (typically fewer than 200) and (2) the lack of systematic searches for coherent preference subgroups. In the present study, we obtained preferences from a diverse sample of 599 people, and made a systematic effort to detect subgroups of people with significantly different preferences.

II. METHODS

Sample

Subjects were 618 individuals recruited throughout the State of New Jersey from a wide variety of settings, including church and civic organizations and support groups for patients with chronic illnesses or conditions. A special effort was made to include subjects who represented a wide cross-section of demographic and clinical variables, including people with physical disabilities. The questionnaire was administered in small groups of about 10-12 people each. A trained facilitator provided preliminary instructions and assistance as necessary. The completion rate was high overall; out of 618 surveys administered, 599 (96.9%) were completed.

Respondents were 61.4% female. The majority were Caucasian (70.6%) and African-American (18.3%), with the remainder (11.1%) mainly of Hispanic and Asian ethnicity. Ages ranged from 14 to 91, with a mean of 48.9, a standard deviation of 20, and 90% of cases between the ages of 20 and 75.

The sample was somewhat skewed in the direction of higher education and above-average income. For example, 21.4% of respondents reported an advanced degree, and 37.1% reported yearly incomes of above $50,000.
Lower-income and less well educated subjects were, however, reasonably represented; for example, 8.6% of subjects reported not having completed high school and 20.3% reported yearly incomes of less than $15,000.

Preference-Measurement Instrument

Based on the results of two pilot studies [29, 30], we developed a special instrument to calibrate the four-item questionnaire shown in Figure 1. This latter questionnaire was administered to a cohort of cancer patients, as described in Part B of this discussion paper. The longer questionnaire used during this part of the study was designed to assign weights to the 16 health states defined by the four levels each of physical suffering and activity limitations. For reasons discussed more fully elsewhere [17], we analyzed separately the level of emotions and of overall quality of life reported by patients, as a sort of "baseline adjustment" in patients' response level. (See Part B.)

Each participant received a 32-page health preference survey that consisted of several sections. The first section of the questionnaire asked subjects to rate the 16 health scenarios on a scale of 0 = worst possible quality of life to 10 = best possible quality of life. Scenarios were constructed by pairing each of four levels of physical suffering (none, mild, moderate, severe) with each of four levels of activity limitations (none, mild, moderate, severe). Accompanying each scenario was a cartoon figure that represented the combination of physical suffering and activity impairment. (For illustrations and a more extensive discussion of the cartoons and rating formats and their validation, see Ref. [30].) The scenarios were presented individually in random order. We refer herein to these sixteen ratings as the direct ratings.

The second section contained pairwise comparisons of health scenarios. The scenarios compared were taken from the 16 scenarios above.
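The construction of the sixteen scenarios, and their presentation in random order, can be sketched in a few lines of Python. This is only an illustration of the crossing described above, not the procedure actually used to assemble the survey booklet; the dictionary keys, the function names, and the `seed` argument are all hypothetical.

```python
import itertools
import random

# The four levels named in the text, shared by both dimensions.
LEVELS = ["none", "mild", "moderate", "severe"]


def build_scenarios():
    """Cross four suffering levels with four activity-limitation levels."""
    return [
        {"physical_suffering": suffering, "activity_limitation": limitation}
        for suffering, limitation in itertools.product(LEVELS, LEVELS)
    ]


def presentation_order(scenarios, seed=None):
    """Return a shuffled copy of the scenarios.

    The seed is a hypothetical convenience for reproducibility; the text
    says only that scenarios were presented in random order.
    """
    rng = random.Random(seed)
    shuffled = list(scenarios)
    rng.shuffle(shuffled)
    return shuffled


scenarios = build_scenarios()
order = presentation_order(scenarios, seed=0)
print(len(scenarios), "scenarios; first presented:", order[0])
```

Shuffling a copy leaves the canonical ordering intact, which is convenient when ratings later need to be matched back to a fixed list of health states.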
In this section, a pair of scenarios was presented and the subject asked to rate the extent to which one was preferred to the other. For example, the first item in this section asked subjects to compare condition 1 = (no physical suffering + mildly limited activities) with condition 2 = (mild physical suffering + no activity limitations). Note how this choice entails a trade-off between physical suffering and activity limitation; all the pairwise comparisons had this characteristic.

We constructed 13 comparisons so that each of the 16 original scenarios, except the (no suffering + no limitations) and (severe suffering + severe limitations) scenarios, was compared with at least one other scenario. The two extreme scenarios were excluded because in pilot studies the former was always strongly preferred to all other conditions and all other conditions were strongly preferred to the latter. Ratings were made on a scale of -5 = strongly prefer condition "1" to +5 = strongly prefer condition "2", with 0 = no preference. These are the paired comparison ratings [30, 44].

Section 3 of the questionnaire investigated how subjects valued tradeoffs between increased life expectancy brought about by medical treatment and concomitant reduced quality of life. The format was similar to the paired comparison items, except that each scenario included a specific life expectancy. For example, the first item in this section asked subjects to compare treatment outcome 1 = (no physical suffering + mildly limited activities + 12 month life expectancy) with treatment outcome 2 = (mild physical suffering + mildly limited activities + 13 month life expectancy). The section contained 12 items with this format. Ratings were made on the same scale of -5 to +5 as with the paired comparison ratings. We refer to these ratings as the time-tradeoff ratings.

Section 4 of the questionnaire examined how subjects believed their emotional status and outlook on life affect their quality of life.
We do not report results from this section here.

Section 5 contained four items that attempted to ascertain subjects' current quality of life (QOL). The first item asked subjects to rate their level of physical suffering (none, mild, moderate, or severe) during the previous four weeks. The second asked them to rate their activity limitations (none, mild, moderate, or severe) over the same period. The third item asked subjects to rate their outlook on life (good, fair, somewhat poor, or very poor) over the previous four weeks. The last item asked subjects to report their current overall quality of life on a scale of 0 = worst possible to 10 = best possible. Note that this section is similar to the actual quality-of-life survey tested in a cohort of cancer patients (Part B). Other sections asked about the presence or absence of specific health conditions and demographic data, including the presence of disabling conditions.

Statistical Analysis

As described above, subjects assigned direct ratings (on a 0 to 10 scale) to each of the sixteen health states formed by crossing four levels of physical suffering with four levels of activity limitation. The means of these direct ratings constituted the basic calibration of the Quality of Life and Health Questionnaire (Figure 1). To determine whether one health state was preferred over another, we assessed differences in ratings across pairs of health states using paired Student's t-tests, with statistical significance set at p < .01. We considered differences in ratings between pairs of states to be clinically significant if one state was rated at least one-half point higher than the other state on our eleven-point scale.

As a validity check on the direct ratings, we calculated scale values for the same sixteen health states using the paired-comparison ratings [30]. For this analysis we used the PAIR program for graded paired-comparisons.
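The comparison of two health states described above (a paired t-test plus the half-point criterion for clinical significance) can be illustrated with a minimal sketch. The ratings below are invented for five hypothetical subjects, the function names are ours, and the code stops at the t statistic: converting it to a p value requires a t-distribution table or a statistics package, neither of which is assumed here.

```python
import math
import statistics

CLINICAL_THRESHOLD = 0.5  # one-half point on the 0-10 scale, as stated in the text


def paired_t_statistic(ratings_a, ratings_b):
    """Return (mean difference, t statistic, degrees of freedom).

    Each subject contributes one rating per health state, so the test is
    computed on the within-subject differences. Identical differences for
    every subject (zero variance) are not handled in this sketch.
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("paired ratings must have the same length")
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, t, n - 1


def clinically_significant(mean_diff):
    """Apply the half-point criterion to a mean rating difference."""
    return abs(mean_diff) >= CLINICAL_THRESHOLD


# Invented direct ratings of two health states by the same five subjects.
state_a = [7, 8, 6, 7, 9]
state_b = [6, 6, 5, 7, 7]

mean_diff, t, df = paired_t_statistic(state_a, state_b)
print(f"mean difference = {mean_diff:.2f}, t = {t:.2f}, df = {df}")
print("clinically significant:", clinically_significant(mean_diff))
```

Keeping the statistical and clinical criteria as separate checks mirrors the text: with a sample of several hundred subjects, a difference can clear p < .01 while falling short of the half-point threshold.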
We then calculated the Pearson product-moment correlations between the scale values obtained from direct ratings and those derived from the paired-comparisons. The time-tradeoff ratings were not compared with the direct ratings or paired comparisons, but were used only in testing for the existence of coherent preference subgroups.

In searching for differences in preferences across subgroups, we began with a multivariate statistical test of association (forward stepwise regression) between the demographic or clinical variable and each item set (the direct ratings, the paired comparison ratings, and the time-tradeoff ratings). If a significant association between a demographic variable and an item set emerged, we then examined the association of the demographic variable with each item in the item set individually. All statistical analyses were conducted both with and without statistical adjustment (via partial correlations) for potential confounding variables. Statistical significance was defined as p < .05, although in many cases p values were much lower.

FIGURE 1
THE QUALITY OF LIFE AND HEALTH QUESTIONNAIRE

Instructions: Please check the box next to the word ([x]) that best describes how you have been feeling over the past week.

PHYSICAL SUFFERING: Headaches, chest or back pain, arthritis, nausea or vomiting, shortness of breath, dizziness, itching, etc.

NONE     [ ] Physical suffering is rarely or never a problem.
MILD     [ ] Somewhat bothersome problem but generally goes away by itself.
MODERATE [ ] More troubling problem with suffering.
SEVERE   [ ] Extremely disturbing problem with suffering.

EMOTIONS/OUTLOOK ON LIFE: Feeling happy or sad, peaceful or nervous, and how much you look forward to getting up in the morning. How much of a problem:

NONE     [ ] Emotions and outlook on life are rarely or never a problem.
MILD     [ ] Somewhat bothersome problem with feeling downhearted and blue.
MODERATE [ ] More troubling problem with feeling depressed or nervous.
SEVERE   [ ] Extremely disturbing problem with feeling depressed or nervous.

DAILY ACTIVITIES: Working or favorite pastimes, doing things with friends and family, and basic self-care activities -- such as: bathing, getting dressed, eating, and going to the bathroom. How much of a problem:

NONE     [ ] Daily activities are rarely or never a problem.
MILD     [ ] Somewhat bothersome problem with being limited in activities.
MODERATE [ ] More troubling problem with having to reduce activities.
SEVERE   [ ] Extremely disturbing problem with having to reduce activities.

OVERALL, HOW WOULD YOU RATE YOUR QUALITY OF LIFE? (Circle one number)

0   1   2   3   4   5   6   7   8   9   10

0 = Worst Possible Quality of Life; 5 = Half-way Between Worst and Best; 10 = Best Possible Quality of Life

III. RESULTS

The mean ratings and standard deviations for the sixteen basic health states are depicted in Figure 2. As a rule, differences in ratings of 0.3 units or greater were statistically significant (p < .01). The correlation between the direct rating values and the scale values derived from the paired-comparisons was |r| = .94.

Effect of Subjects' Quality of Life on Ratings

The first factor we considered in our search for differences in preference patterns was subjects' perceived level of their own quality of life. For example, persons who are very healthy might rate an average health state as undesirable, whereas subjects who have a poor current health-related quality of life might rate the same state as desirable.
As noted earlier, we used four items to measure subjects' current HRQOL: current physical suffering, current activity limitations, current emotional outlook, and current overall HRQOL.

Multivariate statistical analysis showed significant associations between the set of four current HRQOL indices and all three preference rating sets. Association was strongest with the direct ratings. This makes sense, because the comparative nature of items in the other two sets should help reduce the confounding effect of current overall health status.

The strongest association was between the single-item overall current HRQOL rating and the direct rating items; direct ratings for each of the 16 scenarios had significant individual correlations with this variable (average |r| = .15). The mean health state rating of subjects whose current HRQOL was 9-10 (N = 316) was 5.4 (SD = 2.1); ratings for subjects with HRQOL < 8 (N = 144) averaged 4.7 (SD = 2.0). This difference is greater than the one-half rating point which we considered a priori to be clinically significant.

Current emotional outlook correlated significantly with ratings on nine of the 16 direct rating scenarios. By comparison, overall current HRQOL correlated significantly with only two of the 13 paired comparison items and three of the 12 time-tradeoff items, and emotional outlook was not significantly correlated with any item of the paired comparison or time-tradeoff sets.

The results demonstrate that subjects' current quality of life and emotional outlook (but not their level of perceived physical suffering or limits on daily activities) may affect their preference ratings for various health states. This finding indicates that emotional outlook and perceived overall quality of life act similarly to "internally calibrate" subjects' ratings of health states. As noted, this effect was most pronounced with use of the direct rating task.
In the analyses reported below, when a current HRQOL item was significantly associated with a demographic variable, the former was treated as a confounding variable in analyses involving the latter.

Demographic Differences

The demographic variables considered were ethnicity, age, and gender. In considering ethnicity we examined only African-American and Caucasian subjects, other groups being insufficiently represented for statistical inference. We decided not to create a "non-Caucasian" category, which we believed would be too heterogeneous for our present purposes. Ethnicity was significantly associated with age and current emotional outlook, and associated at a near-significant level (p = .057) with gender; after controlling for these possible confounds with partial correlations, ethnicity remained significantly associated with five direct-rating items. Caucasians rated these items lower (less desirable) than African-Americans. The mean difference across the five items was .54 (Caucasian mean = 2.91, African-American mean = 3.45). Each of these items includes either severe physical suffering or severely limited activities, or both.

Ethnicity was significantly associated with only two paired comparison items; this association remained significant after controlling for the covariates. The first of these two items compared (no physical suffering + severely limited activities) with (mild physical suffering + moderately limited activities). The second compared (no physical suffering + moderately limited activities) with (mild physical suffering + no activity limitations). For both comparisons, Caucasians showed a greater preference for the second scenario. The results suggest that the Caucasian subjects were more willing to accept the increment from none to mild physical suffering in exchange for fewer activity limitations.

Age Differences

We next looked at whether preferences vary by age.
Multivariate analyses revealed significant overall association between respondent age and (i) the direct ratings, (ii) the paired comparison ratings and (iii) the time-tradeoff ratings. Age was significantly associated with ethnicity (African-American vs. Caucasian), sex, and current activity level; the association of age with (i), (ii), and (iii) remained significant after controlling for these possible confounds.

Five direct rating items were significantly correlated with age; four of the correlations remained significant after controlling for the covariates. For each of the four items, older subjects tended to give higher preference ratings. Three of the items involved moderate activity impairment. The results suggest that older subjects are generally less averse to moderate physical suffering and moderate activity impairment. This is not surprising, and may reflect a generally lower comparison level for baseline functioning.

Three paired comparison items were significantly correlated with age; two remained significantly correlated after controlling for the covariates. Older subjects showed greater willingness to accept the change from no to mild physical suffering or mild to moderate physical suffering in return for the change from severe to moderate activity limitations. This may be related to the previous result, i.e., that older subjects rate moderate activity limitation better overall.

One time-tradeoff item significantly correlated with age; the effect was significant controlling for covariates. The item suggests older subjects may be more willing to accept the change from no to mild activity limitation in exchange for a slight increase in life span.

Gender Differences

The final demographic variable evaluated in this study was gender. Using the direct rating task, respondent gender was significantly associated with (i) the direct rating items, (ii) the paired comparison items and (iii) the time-tradeoff items.
After controlling for age, ethnicity and current activity limitations, however, only (ii) remained significant; the association of gender with (i) was near significant (p = .057).

Despite the significant multivariate association, only two direct rating items had significant associations with gender that remained after adjusting for the covariates. Females rated the condition (no physical suffering + no activity limitations) better (9.7 vs. 9.2) and the condition (severe physical suffering + severe activity limitations) worse (1.5 vs. 1.9) than males. A possible explanation is that males are less likely to give extreme ratings -- either high or low.

As before, we observed fewer differences across gender with the paired-comparison task compared to the direct rating task. Indeed, no individual item among the paired-comparison ratings was significantly associated with gender when the effect of the covariates was considered (even without considering the covariates, only one paired-comparison item was significantly associated with gender).

As a final check for systematic effects of demographic factors on ratings, we calculated for each subject an index [alpha] = P / (P + A), where P is the variance accounted for by physical suffering and A is the variance accounted for by activity impairment. The index provides a rough estimate of the importance of physical suffering compared to activity limitation in determining a subject's preference ratings. For the entire population, [alpha] had a mean of .51 and a median of .52, showing that, overall, physical suffering and activity impairment were of near equal importance. The [alpha] index was not significantly associated with age, ethnicity, or gender.

Examination of the paired-comparison data confirmed this result.
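A minimal sketch of how the [alpha] index might be computed for one subject is given here. The text does not say how P and A were estimated; this sketch assumes they are the main-effect sums of squares from the subject's 4 x 4 table of direct ratings, and the function name `alpha_index` is hypothetical.

```python
from statistics import mean

def alpha_index(ratings):
    """Return P / (P + A) for one subject's 4 x 4 table of direct ratings.

    ratings[i][j] is the rating for physical-suffering level i and
    activity-limitation level j. ASSUMPTION: P and A are taken to be the
    main-effect sums of squares of the two dimensions; the paper does not
    specify the estimator.
    """
    n_rows, n_cols = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]        # one per suffering level
    col_means = [mean(col) for col in zip(*ratings)]  # one per activity level
    p = n_cols * sum((m - grand) ** 2 for m in row_means)
    a = n_rows * sum((m - grand) ** 2 for m in col_means)
    if p + a == 0:
        return None  # subject gave identical ratings on both dimensions
    return p / (p + a)

# A subject whose ratings fall equally with each dimension weights them equally.
balanced = [[10 - i - j for j in range(4)] for i in range(4)]
print(alpha_index(balanced))
```

Under this assumption a value near .5 indicates that the two dimensions contribute about equally to the subject's ratings, while values near 1 or 0 indicate that suffering or activity limitation dominates.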
We calculated the mean of these ratings, reverse-scoring when necessary so that a mean above zero reflects overall preference for less suffering (even at the expense of more activity limitations) and a mean below zero reflects overall preference for fewer limits on activities (even at the expense of more suffering). Mean rating for the 13 paired-comparison scenarios on the 0-10 point scale was 0.12, again confirming near-equivalent importance placed on the two dimensions. Again, no significant differences in mean rating were observed across demographic lines in a multiple linear regression analysis. Note that this rough equality in value between the two dimensions means that the number of items in HRQOL questionnaires should be roughly balanced between these two dimensions.

Effect on Ratings of Experience with Illness

Continuing our search for systematic differences in preference patterns, we next evaluated the effects of having specific medical or disabling conditions on health state preferences.

Specific items in the questionnaire asked about the presence or absence of 17 health conditions, which we later grouped into three broad categories: (1) major medical illnesses, (2) chronic symptomatic conditions, and (3) disabling conditions. The items in the major medical illness category (with endorsement rates in parentheses) were: cancer (1.7%), diabetes (1.2%), heart disease (6.3%), and other major medical problem (12.8%). The chronic symptomatic condition items were: allergies (9.1%), angina (5.8%), arthritis (33.3%), asthma (10.2%), back problems (29.3%), chronic bowel inflammation (4.6%), and ulcer (5.2%).
The disabling conditions were: stroke and other neurological disorders (4.7%), incontinence (9.0%), confinement to wheelchair (2.9%), trouble with vision (3.5%), cognitive impairment (20.8%), and other disability or handicap (12.5%).

Multivariate analyses showed no significant association between any of the three broad diagnostic categorizations and either the direct rating, the paired comparison, or the time-tradeoff rating sets. Additional analyses of possible associations between the 17 specific health conditions and the item sets showed several significant associations, but the total number of associations was within that expected by chance alone.

As a final test for preference differences, we identified several subgroups based on demographic and clinical factors, including [number of subjects in each group shown in brackets]:
(1) age < 40 [206],
(2) age > 65 [171],
(3) African-American [107],
(4) Caucasian [414],
(5) male [228],
(6) female [358],
(7) >50 years old, Caucasian, and female [105],
(8) <50, African-American, and male [21], and
Subjects who reported having in the preceding four weeks:
(9) no illness or conditions [183],
(10) one or more major illnesses only [22],
(11) one or more symptomatic illnesses only [148],
(12) one or more disabling conditions only [37],
(13) one or more conditions in two major diagnostic groups [206],
(14) one or more conditions in all three major diagnostic groups [57],
(15) quality of life of 9 or 10 [315],
(16) quality of life of 6 or less [86],
(17) no physical suffering [212],
(18) moderate or severe suffering [99],
(19) no activity limits [370],
(20) moderate or severe limits on activities [65],
(21) good outlook on life [425],
(22) poor or very poor outlook on life [28].

Relatively few cases were observed in which a subgroup's mean rating was different from the average for the entire sample to a clinically significant extent (>= 0.5 rating points). More importantly, perhaps, no cases were observed in which a subgroup had a significant preference for one
state over a second state (either statistically, i.e., p < .05 on a paired t-test, or clinically, i.e., 0.5 rating point) when the entire sample significantly preferred the latter state. Nor were any cases found where a subgroup had a significant preference between a pair of states when the entire sample was indifferent between those two states. Several cases were observed, however, in which certain subgroups did not significantly prefer one state over another despite the entire sample having a significant preference between those states. (This phenomenon may be partly explained by the smaller sample sizes within subgroups and resulting lower statistical power.) For example, the entire sample and most subgroups significantly preferred the state (no suffering + mild limits) over (mild suffering + no limits). Six subgroups were indifferent between these two states, however: subjects with (1) quality of life rated as 6 or less, (2) moderate or severely limited activities, (3) poor or very poor outlook on life, (4) one or more major illnesses, (5) one or more disabling conditions, (6) one or more illnesses or conditions in all three major diagnostic groups. This example typifies a more general finding: the subgroups that were indifferent between states for which the entire sample had a significant preference were often more impaired in terms of conditions, illnesses, or reported quality of life than the average subject.

Figure 3 shows the ratings assigned to two representative health states by several of the patient subgroups listed above.

Preference Structure

The above analysis concludes our search for preference subgroups. In a final set of analyses, we examined the structure of the mean preference ratings derived from both direct rating and paired-comparison tasks. This analysis follows the analyses in our second pilot study.

Figure 4 shows mean ratings on the 4 x 4 = 16 direct rating items across all subjects.
As noted above, the scale values derived from the paired-comparison analysis correlated highly with the direct ratings (|r| = .94).

Figure 3
Subgroup Ratings of Two Health States
[Plot not reproduced. Health states shown: Mild Suffering & Mild Limits; Severe Suffering & Moderate Limits. Horizontal axis: Sample Number.]
Sample Number (N):
1. Entire Sample (596)
2. <40 Years (206)
3. >65 Years (172)
4. African-American (108)
5. Caucasian (414)
6. Men (228)
7. Women (359)
8. No Illness (183)
9. Major Illness Only (22)
10. Symptomatic Illness Only (148)
11. Disabled Only (36)
12. 2 Categories (207)
13. 3 Categories (57)
14. HRQOL 9-10 (316)
15. HRQOL <8 (145)

Figure 4
Preference Structure
[Plot not reproduced. One line per Degree of Suffering; horizontal axis: Limits on Activities (None, Mild, Moderate, Severe); vertical axis: rating, 0 to 10.]

Figure 4 has several noteworthy features. First, the bottom line (severe physical suffering) has a steeper slope (reflecting increasingly negative life quality) going from moderate to severe activity limitation than from no to mild or from mild to moderate activity limitations; this effect also occurs in the lines for moderate and mild physical suffering. An analogous effect is shown by the comparatively large gap between the bottom line (severe physical suffering) and the other three lines. For both physical suffering and activity limitation, then, the none, mild, and moderate categories are roughly evenly spaced, with the severe condition more distinct.

Second, the slope of the line denoting moderate physical suffering increases slightly from the activity impairment level categories of none to mild; this appears to suggest that, given moderate physical suffering, subjects are essentially indifferent to mild versus no activity limitations. This finding is an example of the phenomenon we call "paradoxical indifference."[30] We discuss this finding further below.

A third feature of the figure is the lack of parallelism between the lines.
Parallel lines would suggest a simple additive model for integrating negative life quality on the two dimensions -- that is, overall negative life quality for a scenario would be a simple sum of the negative life quality associated with the level of suffering and the negative life quality associated with the level of activity limitation. Instead, the lines have a "fan-shaped" structure such that each successive line's slope increases as one goes from bottom (severe suffering) to top (no suffering). This may indicate an asymptotic effect in preference formulation or reporting: if negative life quality is already very poor, additional negative factors may have relatively little effect. Whether this reflects true preferences or is an artifact of the measurement procedures is not clear.

IV. Discussion

The most important finding emerging from our study is the clear absence of systematic differences in comparative preferences for health states across demographic or clinical lines. Very few differences in paired comparison and time-tradeoff ratings were observed. Even with direct ratings, the substantial majority of health states were valued comparably by all subgroups. Furthermore, no significant differences in comparative preference direction emerged when directly rated states were examined in a pair-wise manner. Thus, concerns about systematic preference differences appear to be unfounded.

Our study has several limitations. First, we did not use random sampling to find our subjects, which raises the possibility of bias and calls into question the generalizability of our results. However, we did make a concerted effort to reach populations not typically captured by preference exercises. Accordingly, we reached a relatively high number of low-income, minority, ill and disabled individuals. Second, our subjects all came from one state, New Jersey, raising similar questions about generalizability.
We would be surprised, however, if our subjects were systematically different from similarly situated people in other states. Third, despite enrolling almost 600 subjects, many subgroups contain far fewer subjects, thus limiting the statistical power to detect differences in preference patterns.

Preferences and Outcomes Management

The overall question addressed in this discussion paper is whether and how health outcome information can be used to determine "what works" in medicine -- and to design a fair and affordable resource allocation policy. Note that unwarranted attention to the few areas of preference differences could produce highly problematic conclusions. For example, if Caucasians were believed to be more averse to severe suffering and severe activity limits than African-Americans (a conclusion supported by the results of the direct rating task, but not the paired-comparison analysis), health care services that reduced severe suffering or limits in activities might be deemed less desirable for African-Americans. This result is, of course, unacceptable from a social and ethical perspective.

Another troublesome aspect of our findings, discussed briefly above, is the phenomenon of paradoxical indifference. Paradoxical indifference (PI) occurs when two health states are not rated as significantly different (p > .05) despite the fact that one state dominates the other (e.g., mild suffering plus mild limits vs. mild suffering plus moderate limits). When this phenomenon originally appeared in our second pilot study we hoped it would "go away" when larger numbers of subjects were enrolled. This hope was partially realized, in that only one instance of statistically significant PI was noted using the direct rating task (compared with five instances in the pilot study). Specifically, we found that subjects were indifferent (using the direct rating task) between no and mild activity limitations when suffering was moderate.
In fact, subjects actually preferred to have mild activity limitations to a statistically significant extent (p = .02). Mean rating for this putatively more-impaired state was 6.00, SD = 1.94, versus 5.83, SD = 1.70 (N = 591) for moderate suffering + no limits.

The observed preference for (or indifference between) mild activity limitations (versus no limits) when suffering is stipulated to be moderate or severe might be explained simply by noting that most people probably would expect or desire to "take it easy for awhile" when they are experiencing significant pain or other form of physical suffering. This finding cannot be explained by the presence of the cartoon figures, because the same preference pattern was observed in our prior study when no figures were used. We discuss at more length in our previous paper the difficulties entailed by paradoxical preferences for resource allocation planning.

Perhaps the most important finding concerning the structure of preferences is the extra weight placed on avoidance of severe suffering or severe limits on daily activities. Thus, services that offer relief of (or improvement in) severe suffering or activity limitations should receive highest priority in systems of health care resource allocation. This result may seem self-evident, but our results lend what might be an important element of empirical support to such a policy.

In Part B of this discussion paper we describe the results of administering our brief four-item questionnaire to a cohort of some 400 cancer patients. The changes over time in HRQOL reported by those patients are scored according to the preferences derived in this part of the study. We model how observed HRQOL outcomes might be used in a system of outcomes management for purposes of resource allocation.

Part B
Questionnaire Validation

I. Introduction

In Part A of this discussion paper we described how a brief, generic questionnaire might be used to measure patients' health-related quality of life (HRQOL) over time on a large-scale, national basis. Such a system of "outcomes management" could be used to ascertain the effectiveness of health care services. This information, in turn, could improve clinical decision-making and help society ensure universal access to treatments and procedures that "work," while curtailing public spending for those that do not.

Determining "what works" depends on public values concerning the outcomes of care. In Part A we reported our experience with calibrating a generic four-item questionnaire -- the Quality of Life and Health Questionnaire (QLHQ) -- according to the preferences of about 600 individuals from various walks of life. We did not detect any systematic differences in preferences based on demographic or clinical factors.

In Part B we describe our experience with the QLHQ in a cohort of 400 cancer patients. The main purpose of the present study was to test the validity of this brief questionnaire under conditions amenable to its use in a system of outcomes management. Therefore, we used a mail-in format designed to maximize response rate, as discussed in Part A. Patients completed the QLHQ at the time of initial enrollment and again three and six months later.

A major problem with all quality of life research is the lack of an objective "gold standard" against which to measure the validity of patient self-reports. In a sense, patients' statements about how they feel about the quality of their own lives could be considered the gold standard itself. After all, can a patient think he has a good quality of life, and be wrong? Can he have a good quality of life without knowing it? Perhaps not, but patients may nonetheless get confused about meanings, may answer in ways they believe will please the interviewer or physician, or may not wish to share their true feelings.
In addition, it is well known that patients' evaluations of their health depend as much or more on psychological or emotional factors than on objective indicators of health or function.[46] For these reasons, it is necessary to make some effort to assess the extent to which patients' self-reports reflect their "true" health-related quality of life. Only then can one hope to assess how well questionnaires capture this truth.

We describe our approach to this problem below.

II. METHODS

Subjects

Subjects were 400 patients with newly diagnosed advanced-stage cancer of various types. Subjects were recruited from cancer clinics at the University of Colorado Health Sciences Center and the Veterans Administration Hospital, both in Denver. We selected these patients for two reasons: (1) we expected that HRQOL for patients with advanced cancer might change over the succeeding several months due to the effects of the cancer and its treatments, and (2) we believed that this cohort of patients, consisting of many elderly, low-income, and quite ill patients, would provide a reasonable test of the feasibility of periodic, patient-completed HRQOL questionnaires.

The study protocol was approved by the University of Colorado Health Sciences Center human subjects committee, and each participating patient gave informed consent. Patients agreeing to participate in the study completed the QLHQ at the time of initial enrollment. Subsequently, project staff mailed additional copies of the QLHQ to all patients three and six months after initial enrollment. Patients were asked to indicate which of the described four levels within each of the three dimensions of HRQOL best described their status over the preceding week.
In addition, patients rated their overall quality of life on an 11-point scale, as shown in Figure 1.

We collected a core set of baseline data on each patient, including: (1) age, (2) gender, (3) type of cancer (e.g., lung, breast), (4) stage of cancer, and (5) presence or absence of heart failure, diabetes, and history of stroke. Shortly after initial enrollment the medical record was obtained to determine what types of treatment, if any, were employed against the cancer.

Assessing Questionnaire Validity

The major issue addressed during this part of the study was the validity of the patient-completed QLHQ as an indicator of patients' "true" HRQOL. As noted earlier, there is no "objective gold standard" for measuring patients' "true" HRQOL. We therefore compared the results of the QLHQ with results of a standardized telephone interview conducted by one of two research assistants. This telephone interview was first conducted two to three days after initial enrollment and again immediately after receipt of the patient's completed questionnaire by mail three and six months thereafter. The interviewer did not know the results of the QLHQ at the time of the interview. If necessary, several attempts were made to contact patients via telephone over a period of at least one week, at different times of the day. Patients were considered lost to follow-up if there was no response to a follow-up letter or to a minimum of five telephone calls.

Telephone interviewers completed a standardized form designed to address the three HRQOL dimensions contained in the QLHQ. Figure 5 lists the major questions asked by the interviewers, who selected the level most closely matching the patients' responses. Detailed follow-up questions were asked as needed to clarify patients' responses. The first fifteen patient interviews were conducted by one interviewer (alternating as to which one), with the other interviewer listening over a telephone extension. (This was done with the patients' knowledge and consent.)
Each interviewer completed a response form independently; responses were compared after the interview was completed, and any differences discussed. Responses were nearly identical after the first ten patients, with no more than one answer differing (by one level) on the entire response form. We believe that the ratings of these interviewers, who together conducted over 1,000 standardized telephone interviews, come as close to a true "gold standard" of patient HRQOL as is likely to be identified.

Interviewers' responses to the questions in Figure 5 were transformed into summary scores reflecting patients' status along each of the three basic dimensions of HRQOL. Within each dimension, each "a" level answer was assigned zero points, each "b" level answer was assigned one point, each "c" level answer was assigned two points, and each "d" level answer was assigned three points. The sum of the total number of points scored within each dimension was used as an overall index for that dimension. The degree of correlation between each summary score and the corresponding patient response on the QLHQ was measured using Pearson's product-moment correlations. Overall HRQOL scores obtained from patients and interviewers were also compared in this way.

Correlations among measures were evaluated using the multitrait-multimethod (MTMM) analysis described by Campbell and Fiske.[30, 47] In this setting, the "traits" to be measured were patients' physical suffering, limits on daily activities, and emotional outlook. Each of these three traits was measured at the time of initial enrollment and three and six months later, producing a total of nine separate and distinguishable "sub-traits." Thus, for purposes of this analysis we treat physical suffering at three months as a trait distinct from physical suffering at six months. This construction is based on the assumption that a valid questionnaire should be able to detect change in physical suffering (and in other dimensions of HRQOL) over time.
This characteristic has been termed "responsiveness," but probably is best considered simply as "validity-over-time."[49]

FIGURE 5
STANDARDIZED TELEPHONE INTERVIEW

Physical Suffering
1. How much pain have you experienced in the last week?
   a) None
   b) A little
   c) Quite a bit
   d) Severe pain
2. How much nausea/vomiting have you had in the last week?
   a) None
   b) A little
   c) Quite a bit
   d) Severe nausea/vomiting
3. Does pain or other symptoms keep you awake at night?
   a) Never or rarely
   b) Sometimes
   c) Often
   d) Very often or almost every night

Daily Activities
1. How often are you able to do things you enjoy or that are important to you?
   a) Very often or almost every day
   b) Often
   c) Sometimes
   d) Never or rarely
2. How much help do you need getting around?
   a) No help
   b) A little help
   c) Moderate amount of help
   d) A lot of help
3. How much assistance do you need taking care of daily needs such as getting dressed or eating?
   a) No help
   b) A little help
   c) Moderate amount of help
   d) A lot of help

Emotions and Outlook on Life
1. How often this week have you felt depressed and upset?
   a) None
   b) A little
   c) Quite a bit
   d) Very often/most of the time
2. How much of a problem have you had this past week with sleeping at night because of nervousness or anxiety?
   a) None
   b) A little
   c) Quite a bit
   d) Very often/major problem
3. How happy are you feeling these days?
   a) Very
   b) Somewhat
   c) A little
   d) Not at all

Interviewer's global estimate of patient's quality of life (0-10 scale, 0 = worst possible quality of life, 10 = best possible quality of life). (Circle one number)
0 1 2 3 4 5 6 7 8 9 10

Each of the nine traits was measured using two methods: the original QLHQ patient self-reports and the aggregate trait scores obtained through the follow-up telephone interviews, as described above.
In this test of questionnaire validity, correlations between two methods' measures of a given trait (i.e., monotrait, heteromethod values) should be higher than (1) the correlations between these two methods' measures of separate traits (heterotrait, heteromethod) and (2) a single method's measure of different traits (heterotrait, monomethod).

We also conducted a separate MTMM analysis of the validity of the global 0-10 HRQOL item. Here, the traits to be measured were overall HRQOL at zero, three, and six months. The different methods of measurement were (1) patients' self-reported HRQOL on the 0-10 item, (2) the sum of patients' scores on physical suffering, limits on activities, and emotional outlook, (3) the interviewers' 0-10 estimate of the patient's HRQOL, and (4) the overall sum of the interviewer's aggregate index scores on suffering, limits, and outlook. We hypothesized that the correlation coefficients among measures taken at a single time point should be higher than correlations among measures taken at different times.

MTMM analysis is generally considered to provide evidence of construct validity, i.e., the index measure should correlate more highly with measures hypothesized to assess the same construct than with measures hypothesized to assess different constructs. In this case, however, because the second measurement method was the follow-up telephone interview -- which we took as an indication of "true" HRQOL -- we believe that the MTMM analyses can also reasonably be considered to reflect concurrent or criterion validity.

Our second approach to assessing the validity of the self-reported quality of life measures was to compare the results of these measures with subsequent observed mortality rates. It is well-established that global measures of health are highly correlated with mortality risk.[51, 52, 53, 54] For example, Mossey and Shapiro found that people whose self-rated health was poor experienced two-year mortality rates about three times higher than those who rated their health as excellent. We compared patients' baseline status on each HRQOL dimension with mortality rates observed six months after enrollment. We hypothesized that patients reporting poorer initial HRQOL would experience higher mortality rates.

To assess the degree of association between baseline HRQOL and mortality, we calculated chi-square values on n x 2 contingency tables on mortality and each HRQOL dimension (including overall quality of life). Significance levels were set at p < .05. We then compared the mortality rates of patients who reported no problem versus some degree of problem with (1) suffering, (2) limits on activities, and (3) outlook at baseline -- as well as patients above versus below various cutoff points in initial overall quality of life and health state weights. To obtain this latter parameter we placed patients into one of 16 health states (ranging from "no suffering and no limits" to "severe suffering and severely limited") by combining their responses to the physical suffering and daily activity items, as described in Part A of this discussion paper. We assigned weights to these states based on the results of the direct-rating task described in Part A. Differences in mortality rates between the various pairs of comparison groups were assessed using simple chi-square analyses, with significance set at p < 0.05.
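The health-state assignment and the simple chi-square comparison described in this section can be sketched as follows. This is an illustration under stated assumptions, not the study's actual code: the function names are hypothetical, the weights must be supplied by the caller (the Part A direct-rating means are not reproduced here), and the chi-square shown is the uncorrected Pearson statistic for a 2 x 2 table only.

```python
LEVELS = ("none", "mild", "moderate", "severe")

def health_state(suffering, limits):
    """Return the (suffering, limits) pair, one of 4 x 4 = 16 health states."""
    if suffering not in LEVELS or limits not in LEVELS:
        raise ValueError("levels must be one of %r" % (LEVELS,))
    return (suffering, limits)

def state_weight(suffering, limits, weights):
    """Look up a preference weight for a patient's health state.

    `weights` is a caller-supplied dict keyed by (suffering, limits);
    the values used in the study are NOT included in this sketch.
    """
    return weights[health_state(suffering, limits)]

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (1 df, no continuity correction) for the table
        [[a, b],    e.g. died / survived among patients above a cutoff
         [c, d]]         died / survived among patients below the cutoff
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom

# Critical value of chi-square with 1 degree of freedom at p = .05.
CRITICAL_05_1DF = 3.841

# Made-up counts, for illustration only.
stat = chi_square_2x2(10, 20, 20, 10)
print(stat > CRITICAL_05_1DF)
```

The n x 2 tables mentioned in the text would need the general form of the statistic with n - 1 degrees of freedom; the 2 x 2 case is shown only because it matches the "no problem versus some degree of problem" comparisons.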
Finally, we conducted multivariate analyses (i.e., logistic regression) to assess the effect of baseline HRQOL dimensions on mortality after controlling for other significant predictor variables, including demographic, clinical, and treatment variables. Significance was again set at p < .05 in these analyses.

Our final set of analyses assessed the extent to which reported changes in HRQOL over time (i.e., quality-of-life outcomes) could be accounted for (or predicted) by baseline factors, including demographic and clinical characteristics and patients' self-reported HRQOL. We examined the effect of age, gender, hospital (VA vs. University), stage of cancer, treatment selected, and self-reported HRQOL on changes in quality of life over time. For these analyses we created two sets of HRQOL-outcome variables. First, we created change scores for each dimension of HRQOL by subtracting the baseline rating on a given dimension (including overall quality of life) from the rating obtained on that dimension six and twelve months later. Second, we created a separate set of change scores based on the difference in preference weights of the health states reported at baseline and at six and twelve months.

Change scores on individual HRQOL ratings and on health-state weights were treated as continuous outcome variables. We first assessed the univariate association of these scores and each candidate predictor variable (i.e., demographic and clinical characteristics, treatments, and baseline HRQOL ratings) using a simple chi-square test; variables found to be significantly associated with change scores were entered into multiple linear regressions in order to determine the extent to which they accounted for observed changes in quality of life after controlling for other significant predictor variables. Because of the large number of statistical tests performed in these analyses, we considered an association significant if its p-value was less than or equal to 0.01.

III. Results

Figure 6 summarizes the characteristics of our patient sample. Six patients died or dropped out prior to the first telephone interview, leaving 394 patients in our sample.

FIGURE 6
CHARACTERISTICS OF PATIENT SAMPLE

N = 394                          Type of cancer
Female               66%           Lung               14%
Age (Median = 60)                  Breast             10%
  18-35               9%           Gastrointestinal   11%
  36-50              22%           Genitourinary      23%
  51-65              35%           Lymphoma           10%
  66-80              31%           Leukemia            6%
  Over 80             3%           Other              25%
Diabetes              6%         Stage II              6%
Heart failure        10%         Stage III            47%
History of stroke     2%         Stage IV             47%

Treatment received:
  Chemotherapy       47%
  Surgery            20%
  Radiation           7%
  Hormones            5%
  Immunotherapy       3%
  None               21%

Note: Some patients received more than one treatment modality. Cancer was classified according to standard staging protocols, with Stage III generally involving regional spread (including lymphatic system) and Stage IV associated with distant metastases.

Questionnaire Validity

We first examined the extent to which patients' self-reported HRQOL ratings correlated with ratings obtained on the corresponding dimensions during the subsequent telephone interviews. As described above, we used the results of these interviews to create summary scores for each dimension of HRQOL. These scores were compared to patients' responses using a multitrait-multimethod analysis. Figure 7 is a correlation matrix among these various measures. Underlined values show the validity diagonals, which represent the convergences of the two separate measurement methods on each of the nine traits (e.g., activity limits at three months). (These are the "monotrait, heteromethod" correlations.) The mean correlation on the validity diagonals was 52.9 (SD 12.7); off-diagonal correlations averaged 32.7 (correlation coefficients multiplied by 100). This difference was highly significant (p = .001 on a Student's t-test). Also, note that the on-diagonal values were usually the highest values in their respective rows and columns. This is the expected pattern for valid measures.
Thus, we conclude that patients' responses to the separate HRQOL dimensions contained in the QLHQ were validated by the separate telephone interviews. That is, patients' responses to, for example, the suffering item "really did" reflect suffering, rather than emotional status, activities, or some other construct.

Figure 8 shows the results of a separate MTMM analysis on the global HRQOL measure. Again, underlined values represent monotrait, heteromethod convergence on validity diagonals. On-diagonal correlation coefficients averaged 64.2 (SD 10.0); the mean off-diagonal correlation was 41.3 (SD 9.5). This difference was also statistically significant (p = .0001 on a Student's t-test). Thus, the global 0-10 HRQOL item manifested evidence of excellent validity over time. Again, on-diagonal values were usually the highest values in their respective rows and columns.

Mortality Analyses

We next examined the extent to which initial self-reported HRQOL scores corresponded with mortality risk. By six months after enrollment, 21 of our original 394 patients were lost to follow-up. We believe that most of these patients probably died, but we were unable to confirm this suspicion. Of the remaining 373 patients, 59 (16%) were confirmed to have died within six months. If all 21 patients lost to follow-up are assumed to have died, the overall six-month mortality rate would have been 20%. As a sensitivity analysis of sorts, we used both the proportion of patients confirmed dead at six months and the total proportion of patients who were either confirmed dead or lost to follow-up at six months in our mortality analyses.

As expected, baseline patient self-reported HRQOL was significantly associated with six-month mortality risk.
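The on-diagonal versus off-diagonal comparison reported above for the global HRQOL measure can be sketched in a few lines. This is an illustrative reading, not the authors' procedure: it assumes that the correlations in the Figure 8 matrix have been transcribed correctly from the printed output, and that a "validity diagonal" entry is any pair of different measures taken at the same time point. The names `NAMES`, `LOWER`, and `split_correlations` are hypothetical.

```python
from statistics import mean

# Variables in Figure 8; the trailing digit is the month of measurement.
NAMES = ["qol0", "qol3", "qol6", "totsum0", "totsum3", "totsum6",
         "pqol0", "pqol3", "pqol6", "ptotsum0", "ptotsum3", "ptotsum6"]

# Lower triangle of the Figure 8 matrix (unit diagonal omitted).
# Row i holds the correlations of NAMES[i] with NAMES[0..i-1].
LOWER = [
    [],
    [0.4709],
    [0.3741, 0.5663],
    [0.6618, 0.4162, 0.2435],
    [0.2598, 0.7000, 0.4547, 0.4021],
    [0.3552, 0.4502, 0.7156, 0.3866, 0.5284],
    [0.4753, 0.3746, 0.2796, 0.4892, 0.3223, 0.3289],
    [0.3385, 0.6473, 0.5116, 0.3908, 0.6292, 0.5269, 0.4691],
    [0.3450, 0.5064, 0.7128, 0.2882, 0.4609, 0.6782, 0.3432, 0.6168],
    [0.4124, 0.4228, 0.3120, 0.4828, 0.4285, 0.3781, 0.6857, 0.4300,
     0.2966],
    [0.3424, 0.6654, 0.4611, 0.4616, 0.7234, 0.5614, 0.3762, 0.7244,
     0.5112, 0.5008],
    [0.2826, 0.4464, 0.6303, 0.2950, 0.4961, 0.7595, 0.2863, 0.5388,
     0.7624, 0.3767, 0.5944],
]


def split_correlations(names, lower):
    """Split correlations into same-time (monotrait) and different-time pairs."""
    same_time, different_time = [], []
    for i, row in enumerate(lower):
        for j, r in enumerate(row):
            if names[i][-1] == names[j][-1]:
                same_time.append(r)       # monotrait, heteromethod
            else:
                different_time.append(r)  # heterotrait
    return same_time, different_time


on_diag, off_diag = split_correlations(NAMES, LOWER)
print(f"on-diagonal mean  = {100 * mean(on_diag):.1f} (n = {len(on_diag)})")
print(f"off-diagonal mean = {100 * mean(off_diag):.1f} (n = {len(off_diag)})")
```

Under this reading the two means can be compared directly with the 64.2 and 41.3 reported in the text. The sketch stops short of the t-test because the source does not say which form of the test was used.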
FIGURE 7
CORRELATION MATRIX OF PATIENTS' AND INTERVIEWERS' RESPONSES TO DIFFERENT HRQOL DIMENSIONS
(obs=273)

          |   suff0   activ0    emot0    suff3   activ3    emot3    suff6
----------+---------------------------------------------------------------
    suff0 |  1.0000
   activ0 |  0.5240   1.0000
    emot0 |  0.4009   0.3659   1.0000
    suff3 |  0.2745   0.2409   0.1725   1.0000
   activ3 |  0.2660   0.3245   0.2764   0.4606   1.0000
    emot3 |  0.2080   0.2448   0.3910   0.3939   0.5591   1.0000
    suff6 |  0.3408   0.2567   0.2860   0.4498   0.3723   0.3826   1.0000
   activ6 |  0.2788   0.2808   0.2888   0.3350   0.4348   0.3232   0.6402
    emot6 |  0.1589   0.2252   0.3216   0.2602   0.3191   0.5022   0.5941
 psufsum0 |  0.4062*  0.3267   0.2414   0.2798   0.2629   0.2765   0.2897
 pactsum0 |  0.2592   0.2859*  0.2148   0.1339   0.2573   0.2006   0.2211
 pemosum0 |  0.2034   0.2135   0.4225*  0.2432   0.3131   0.3822   0.2877
 psufsum3 |  0.2912   0.2879   0.3210   0.5189*  0.4124   0.3961   0.4257
 pactsum3 |  0.2391   0.3572   0.2404   0.4199   0.5967*  0.4007   0.3660
 pemosum3 |  0.2664   0.2483   0.4030   0.3101   0.4694   0.6127*  0.3568
 psufsum6 |  0.2467   0.1570   0.2752   0.3967   0.3327   0.3558   0.6290*
 pactsum6 |  0.1841   0.1522   0.1812   0.3337   0.3633   0.2610   0.5187
 pemosum6 |  0.1807   0.1681   0.3317   0.2787   0.3340   0.4487   0.5814

          |  activ6    emot6 psufsum0 pactsum0 pemosum0 psufsum3 pactsum3
----------+---------------------------------------------------------------
   activ6 |  1.0000
    emot6 |  0.5183   1.0000
 psufsum0 |  0.2327   0.1837   1.0000
 pactsum0 |  0.3007   0.1970   0.3037   1.0000
 pemosum0 |  0.2487   0.3071   0.4247   0.3089   1.0000
 psufsum3 |  0.3477   0.3015   0.3605   0.1596   0.3043   1.0000
 pactsum3 |  0.5029   0.3114   0.2758   0.3422   0.2798   0.3687   1.0000
 pemosum3 |  0.3001   0.3916   0.3097   0.1508   0.4688   0.4797   0.4231
 psufsum6 |  0.4523   0.4177   0.3875   0.1395   0.2769   0.5360   0.2928
 pactsum6 |  0.6281*  0.4157   0.2324   0.2276   0.1951   0.3170   0.5309
 pemosum6 |  0.5059   0.6481*  0.2431   0.1073   0.2958   0.3450   0.2867

          | pemosum3 psufsum6 pactsum6 pemosum6
----------+------------------------------------
 pemosum3 |  1.0000
 psufsum6 |  0.3824   1.0000
 pactsum6 |  0.2587   0.5075   1.0000
 pemosum6 |  0.5098   0.5411   0.4932   1.0000

Legend.
From QLHQ:
  suffx = patient's reported degree of physical suffering at time x
  emotx = patient's reported emotional outlook at time x
  activx = patient's reported degree of activity limitation at time x
  qolx = patient's reported overall HRQOL at time x
From telephone interview:
  psufsumx = interviewer's summary score on suffering at time x
  pemosumx = interviewer's summary score on emotional outlook at time x
  pactsumx = interviewer's summary score on activity limitation at time x
Values marked with an asterisk (underlined in the original) represent monotrait, heteromethod convergences (see text).

FIGURE 8
CORRELATION MATRIX FOR OVERALL HRQOL MEASURES
(obs=267)

          |    qol0     qol3     qol6  totsum0  totsum3  totsum6    pqol0
----------+---------------------------------------------------------------
     qol0 |  1.0000
     qol3 |  0.4709   1.0000
     qol6 |  0.3741   0.5663   1.0000
  totsum0 |  0.6618*  0.4162   0.2435   1.0000
  totsum3 |  0.2598   0.7000*  0.4547   0.4021   1.0000
  totsum6 |  0.3552   0.4502   0.7156*  0.3866   0.5284   1.0000
    pqol0 |  0.4753*  0.3746   0.2796   0.4892*  0.3223   0.3289   1.0000
    pqol3 |  0.3385   0.6473*  0.5116   0.3908   0.6292*  0.5269   0.4691
    pqol6 |  0.3450   0.5064   0.7128*  0.2882   0.4609   0.6782*  0.3432
 ptotsum0 |  0.4124*  0.4228   0.3120   0.4828*  0.4285   0.3781   0.6857*
 ptotsum3 |  0.3424   0.6654*  0.4611   0.4616   0.7234*  0.5614   0.3762
 ptotsum6 |  0.2826   0.4464   0.6303*  0.2950   0.4961   0.7595*  0.2863

          |   pqol3    pqol6 ptotsum0 ptotsum3 ptotsum6
----------+---------------------------------------------
    pqol3 |  1.0000
    pqol6 |  0.6168   1.0000
 ptotsum0 |  0.4300   0.2966   1.0000
 ptotsum3 |  0.7244*  0.5112   0.5008   1.0000
 ptotsum6 |  0.5388   0.7624*  0.3767   0.5944   1.0000

Legend.
From QLHQ:
  qolx = patient's reported overall HRQOL at time x
  totsumx = sum of patient's scores on suffering, limits, and emotions at time x
From telephone interviews:
  pqolx = interviewer's estimate of patient's overall HRQOL at time x
  ptotsumx = sum of interviewer's scores on suffering, limits, and emotions at time x
Values marked with an asterisk (underlined in the original) represent monotrait, heteromethod convergences (see text).

Chi-square values for each of the three 4 x 2 contingency tables created using the three basic dimensions of HRQOL (i.e., physical suffering, limits on activities, and outlook on life) versus mortality were significant at p < .001. Chi-square for the 11 x 2 contingency table on overall baseline HRQOL versus mortality was significant at p = .004.

Figure 9 summarizes the mortality rates of different patient subgroups. In general, patients with higher baseline HRQOL, higher-weighted health states, or lesser degrees of impairment on the three basic HRQOL dimensions experienced six-month mortality rates of about 10% and combined dead/lost-to-follow-up rates of about 14%. Patients with lower baseline HRQOL, lower-weighted health states, or more impairment on the three basic HRQOL dimensions had mortality rates of about 18% and combined dead/lost-to-follow-up rates of about 24%. As shown in Figure 9, most of these differences were statistically significant.

The observed significant effects on mortality rate of baseline HRQOL parameters persisted in multivariate analyses; the only other variables found to be significantly associated with six-month mortality were stage of cancer and presence of lung cancer. Treatment variables did not predict mortality.

Changes in HRQOL Over Time

We next examined trends in patient-reported HRQOL over time.
Figure 10 depicts the pattern of results observed for the HRQOL measures, including (1) pain and physical suffering, (2) emotions/outlook on life, (3) limitations in daily activities, (4) overall quality of life, (5) health state weights (i.e., weighted combination of suffering and limits, as described above), and (6) interviewer's global estimate of patients' HRQOL. Scores were transformed to parallel the overall quality-of-life measures, i.e., a value of 10 indicated the most desirable situation (e.g., no suffering or limits on activities) and a value of 0 the least desirable situation (e.g., severe suffering or limits). It can be seen that the measures changed relatively little over the six-month period.

For our analyses of changes in quality of life we used the change score derived by subtracting baseline overall quality-of-life ratings or health-state weights from the corresponding six-month ratings and weights, so that a negative score indicates a decrease in quality of life. Figure 11 depicts the distribution of six-month quality-of-life change scores. Using multiple linear regression analysis we determined that the most powerful predictor of change in quality of life or weighted health states was initial quality of life and health state, respectively. Interestingly, higher baseline ratings (meaning better initial HRQOL) were significantly associated with decreases in quality of life over time.
For example, the 146 patients who were known to be alive at six months and who rated their initial overall quality of life as "8" or higher experienced an average decrease in quality of life of 0.9 rating units. By contrast, the 168 patients alive at six months who rated their initial quality of life less than "8" experienced an average improvement in quality of life of 0.7 rating points. This difference in change scores was significant at the p = .0001 level. Similar patterns were observed for changes in weighted health states.

FIGURE 9
Differences in Mortality and Lost-to-Follow-up Rates

Values on                       Number             Number Dead or
Baseline Parameters             Dead               Lost to Follow-up

Overall HRQOL
  >=8                           17/163 (.10)a      23/169 (.14)b
  <8                            41/210 (.20)       57/225 (.25)
Extent of Suffering
  None                          10/105 (.10)c      13/108 (.12)d
  Mild, Moderate, Severe        48/268 (.18)       67/286 (.23)
Extent of Limits on Activities
  None                          13/141 (.09)d      20/148 (.14)d
  Mild, Moderate, Severe        45/232 (.19)       60/246
Extent of Problems with Outlook on Life
  None                          15/137 (.11)a      19/141 (.14)d
  Mild, Moderate, Severe        43/235 (.18)       61/252 (.24)
Health State Weight
  >=7                           14/156 (.09)'      22/163 (.13)'
  <7                            44/217 (.20)       58/231 (.25)

Differences significant at:

The inverse association of initial HRQOL and HRQOL outcome persisted in multivariate analyses. Gender was the only other variable found to have a significant effect on overall quality-of-life outcomes. Men experienced an average decrease in overall quality of life of 0.14 rating points, whereas women experienced an average improvement of 0.57 rating points. This difference was significant at p = .001. With regard to six-month changes in health-state weights, both gender and stage of cancer were significantly associated with changes in weights in multivariate analyses (in addition to baseline health-state weight). Men experienced an average improvement in health-state weight at six months of 0.10; women improved by 0.96. This difference was significant at p < .001.

IV. Discussion

We observed high correlations between self-reported HRQOL on the Quality of Life and Health Questionnaire (QLHQ) and subsequent standardized, in-depth telephone interviews. Results of the MTMM analyses indicate that both the separate dimensions of HRQOL (suffering, activity limits and emotions) and the global HRQOL measure provide reasonable reflections of the actual underlying constructs, both at a single time point and over time. In addition, we observed the expected pattern between initial self-reported HRQOL and subsequent mortality. These findings demonstrate that the QLHQ, when completed by patients in a mail survey format, is a valid measure of patients' actual HRQOL.

We achieved high compliance rates in our study; in fact, all but two patients who did not die and who were not lost to follow-up completed all three QLHQ questionnaires at baseline and at three and six months. Although several patients required reminder telephone calls, in general we found that patients were extremely enthusiastic about providing us with information about their HRQOL. Indeed, patients would often "pour their hearts out" to us about their experiences. Although this high level of compliance would not be expected in the setting of large-scale outcome management programs, we believe that the brief format and easily understood questions would facilitate high compliance rates even in this latter setting.

The finding of significant inverse associations between baseline HRQOL and quality-of-life outcome is probably due in large part to regression to the mean. A related "ceiling" effect on HRQOL is also at work; patients who rate their initial HRQOL very high have little or no room for improvement, but lots of room for worsening. Patients who rate their initial HRQOL lower may have as much or more room to improve than to deteriorate.
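Regression to the mean and the ceiling effect can be illustrated with a small simulation. The sketch below is hypothetical throughout: it assumes each patient has an unchanging "true" HRQOL, that each questionnaire reading adds independent random noise, and that readings are clipped to the 0-10 scale. The distribution parameters and the function name `simulate_change_scores` are invented for illustration and are not estimates from the study.

```python
import random
from statistics import mean


def clip(score):
    """Keep a rating inside the 0-10 scale (the floor and ceiling)."""
    return min(10.0, max(0.0, score))


def simulate_change_scores(n=5000, noise_sd=1.0, seed=0):
    """Mean change score by baseline group when true HRQOL never changes."""
    rng = random.Random(seed)
    high_baseline, low_baseline = [], []
    for _ in range(n):
        true_hrqol = rng.gauss(7.0, 1.5)  # assumed population of true scores
        baseline = clip(true_hrqol + rng.gauss(0.0, noise_sd))
        followup = clip(true_hrqol + rng.gauss(0.0, noise_sd))
        change = followup - baseline      # later rating minus baseline rating
        if baseline >= 8:
            high_baseline.append(change)
        else:
            low_baseline.append(change)
    return mean(high_baseline), mean(low_baseline)


high_mean, low_mean = simulate_change_scores()
print(f"mean change, baseline >= 8: {high_mean:+.2f}")
print(f"mean change, baseline <  8: {low_mean:+.2f}")
```

Even though no simulated patient actually gets better or worse, the high-baseline group shows an average decline and the low-baseline group an average improvement. This is the same direction as the pattern reported above, which is why baseline HRQOL belongs among the covariates in any outcome model.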
The significance of this inverse association for purposes of outcomes management is uncertain, although clearly initial quality-of-life ratings or weights must be entered into multivariate analyses of outcomes to control for this phenomenon.

In our study we did not detect any treatment effects on either mortality or HRQOL outcomes after correction for gender and type and stage of cancer. This is not surprising, because our sample contained relatively few patients with a given type or stage of cancer (see Figure 6), and even fewer patients with a given type of cancer who received a particular treatment. Much larger studies, such as are envisioned by Ellwood in systems of outcomes management, would potentially contain thousands of patients with a given type and stage of cancer. With such large numbers of patients, we would often expect to find significant differences in health outcomes across alternate courses of treatment for patients with a given clinical condition. Indeed, such findings form the very basis for the concept of outcomes measurement and management.

In our analyses we used change scores to depict the difference between initial and subsequent HRQOL. Alternatively, the subsequent scores themselves (e.g., six-month overall HRQOL) could be used as the dependent variables in multivariate outcome analyses. Inclusion of initial HRQOL scores in either case should help to minimize the biasing effect of initial HRQOL on outcome inferences.

Use of a Global HRQOL Measure

For reasons summarized above, we believe that the QLHQ can provide a useful and valid basis for regularly surveying large numbers of patients concerning their HRQOL. The generic states depicted in the QLHQ adequately represent, we believe, the range of symptoms and limitations produced by medical conditions and diseases -- as well as the improvements (or worsenings) of symptoms and limitations produced by medical and surgical treatments and procedures.
Indeed, when very large samples are obtained, outcome analyses might reasonably be focused on changes in the single, global HRQOL measure contained in the QLHQ. Use of a single measure would almost certainly improve compliance rates over even a four-item questionnaire.

Use of a global, generic HRQOL item for purposes of outcomes management raises several questions. First, what about the problem of questionnaire calibration discussed in Part A of this article? In a sense, single-item global measures might be considered "internally calibrated," especially when they are anchored, as ours was, by terms such as "best possible quality of life" and "worst possible quality of life." For example, a patient who selected a 7 as most closely representing his current HRQOL would be saying, in effect, that he considers his HRQOL about 70% as desirable as the best possible quality of life. This element of desirability closely parallels the notion of preferences and values.

Beyond this conceptual similarity between global measures and preference-weighted health states, we observed in this study empirical support for the comparability of these two types of measures. Specifically, our global HRQOL measure closely paralleled the health state weights produced by assigning the empirically derived scale values (per Part A of this article) to the various combinations of physical suffering and activity limits reported by patients. This similarity was apparent both in the multivariate and mortality analyses reported in the Results section and in the observed changes in patients' HRQOL over time, as depicted in Figure 10.
Global patient-reported HRQOL fell in the middle of the seven measures depicted in that Figure.

Other investigators have advocated the use of single-item measures for capturing patients' HRQOL over time, including Gough et al., who recommended that patients be asked a single question: "Overall, how would you rate your quality of life today?"55 These investigators note that such global indicators correlate well with multi-dimensional constructs and that more complex conceptualizations of quality of life require an increase in the complexity of patient outcome questionnaires. Moreover, many multi-item questionnaires are validated by the extent to which the aggregate score on all items correlates with a single "overall quality of life" item. Why not, then, use the single item in the first place?

An additional argument in favor of single-item questionnaires was mentioned above, namely the fact that responses to single measures correlate significantly with mortality risk.50-54 We observed similar correlations in the present study. This universal finding is further evidence that single, global measures can provide valid depictions of patients' overall state of health.

Figure 10
Behavior Over Time of HRQOL Parameters*
[Line plot. Vertical axis: score, with ticks at 6.5, 7, 7.5, and 8. Horizontal axis: months (0, 3, 6). Series: Emotions, Interview-estimated HRQOL, Patient-reported HRQOL, Weighted Health State, Activities, Suffering.]
* Transformed where necessary so that 0 = worst situation and 10 = best situation (see text).

Alvan Feinstein identified additional advantages of single-item questionnaires, noting that longer, multiple-item questionnaires "are seldom satisfactory for individual patients, because the main focus of clinical concern or management may be lost, obscured, or not included among all the other information that surrounds it."56 A single-item, "global index," on the other hand, may be the best way of letting the patient decide and indicate the rating for whatever he or she chooses to emphasize as the important features of life.
Furthermore, if the indexes have an adequate number of categories, they may be particularly desirable for letting patients use their own criteria to determine changes in their status.56(p.MS53)

Finally, good to excellent reliability and validity have been observed for several single-item HRQOL questionnaires,58 including the Delighted-Terrible scale,59 the Faces scale, and the Ladder Scale.

Induction Problems

Even when data are aggregated into massive data sets, complete with global HRQOL measures, the results of retrospective outcome analyses may be difficult to accept at face value. It is widely appreciated that all non-experimental (i.e., non-randomized) study designs are vulnerable to a variety of biases, particularly with respect to the presence of unmeasured confounders. For this reason, the medical records used to supply data to systems of outcomes management should contain as much information as possible about comorbidities, severity of illness, and other variables related to outcomes in order to permit statistical adjustment of observed outcomes and to strengthen causal inferences.

Even under the best of circumstances, however, it will be difficult to forge what Ellwood called the "link [between] the actions and observations of thousands of health professionals with [the outcomes of] millions of patients."1 Ellwood acknowledges that "major questions surround attempts to measure the impact of medical care on quality of life," including the potential for bias, but he argues that outcomes management "would generate information about the results of the natural, seemingly random variations in practice style." This overstates the methodological case for what would be, in effect, a massive case-control study. In Part C we discuss further the problem of interpreting observational outcome data.
In summary, we have demonstrated that very brief mail questionnaires, perhaps consisting of a single, global measure of HRQOL, can serve as a valid basis for monitoring the HRQOL outcomes associated with diseases, conditions, treatments, and procedures. We also determined that additional items concerning physical suffering, functional status, and emotional outlook can obtain valid information concerning these respective dimensions. Whether the additional information gained is worth the potential decrease in compliance is unknown at this time.

We echo Ellwood's call for the routine collection of HRQOL data and urge policy-makers to make creation of large data bases a high priority.

Part C
Problems with Interpretation

I. Introduction

In Parts A and B of this discussion we reported that a brief mail survey, consisting of only four questions and calibrated according to empirically derived preferences, was able to provide valid information about the health-related quality-of-life (HRQOL) outcomes of patients with a serious medical condition. Information of this type, collected regularly and systematically, could become the basis for a system of "outcomes management," in which efforts were made to determine "what works" in medicine. Such a system would provide an efficient and valuable alternative (or supplement) to randomized controlled trials and to more traditional types of non-experimental studies. In this final Part, we discuss some of the philosophical and practical issues that would arise during creation and use of a system of outcomes management.

In Part B we concluded that a single, global HRQOL item would probably suffice for purposes of measuring outcomes in a large system of outcomes management. We observed excellent construct and criterion validity of such a global item in a cohort of cancer patients, with "objective" interviewers' careful assessments of patients' HRQOL serving as the "gold standard" of validity.
Our 0-10 global HRQOL measure tracked very closely with patients' responses to items on physical suffering, activity limits, and emotional outlook -- and with empirically derived preference weights for combinations of problems with suffering and activity limits (as described in Part A). These separate HRQOL items also manifested excellent construct validity. Finally, we found that determinants and behavior of the overall HRQOL measure were very similar to those of the preference-weighted health states reported by patients.

In constructing the ideal survey instrument for large systems of outcomes management, one could use a single, global item or include a few additional items, as we did. Responses to all items could be "blended" together to form an overall estimate of HRQOL. Although this approach might increase reliability of the HRQOL estimate somewhat, we saw no evidence that the accuracy, or validity, of this estimate provided by the global item would be enhanced by the presence of additional items. Moreover, increasing the number of items would likely result in lower response rates. The optimal number of items remains an open question for now, but it seems clear in any case that a few items would be sufficient. This is fortunate, in [...] discussed in previous Parts and again below.

Even granting that single, global items can produce valid measurements of overall HRQOL, it still may seem difficult to believe that such measures could provide the basis for conclusions about the effectiveness of particular procedures and treatments. In this Part we discuss two major issues with respect to the problem of interpreting change in global HRQOL outcome: (1) how to strengthen causal inferences drawn from observational HRQOL data, and (2) how to identify, using statistical procedures performed using global HRQOL data, different "types of patients" who either do or do not receive significant benefit from some treatment or procedure.
We conclude that global HRQOL data, if properly obtained and interpreted, can indeed provide a legitimate basis for large-scale outcome studies of the type envisioned by Ellwood some five years ago.

II. Strengthening Causal Inferences

In Part A of this discussion paper we discussed the advantages of generic outcome measures over disease- or treatment-specific ones for purposes of outcomes management. These advantages include universality across different types of treatments and the patients' ability to convert generic measures into their personal (and often idiosyncratic) outcomes of interest. These theoretical advantages, important as they may be, would probably be insufficient, in and of themselves, to overcome skepticism concerning the relevance of (what many would see as) a "fuzzy" outcome variable -- global HRQOL.

Standardizing the Concept of HRQOL

One major step toward "de-fuzzifying" HRQOL is to formally limit and standardize, as part of a substantial public education program (see below), what is meant, exactly, by HRQOL in this setting. Respondents must understand that the question(s) being asked concern(s) health related quality of life, and that they should specifically exclude from consideration factors beyond the purview of health and health care (e.g., neighborhood crime rate, marital relationships, pollution). What is beyond the purview of health care is, of course, open to discussion; for example, job satisfaction and marital harmony (or lack thereof) might be affected by one's health, or access to health care. Usually, however, such issues should be excluded.

Indeed, for purposes of outcomes management, and as discussed in Part A of this article, it is probably best to restrict health domains in this context to physical suffering and limits on activities.
Emotional outlook should be included in HRQOL ratings only to the extent that any emotional problems stem from physical symptoms and activity limits. Otherwise (among other problems), the use of HRQOL as an endpoint would likely prove too "soft" for many physicians and policy-makers. (These remarks assume that physical conditions and treatments are being evaluated. Psychiatric or emotional problems and treatments would, of course, be addressed using different descriptors, a discussion of which is beyond the scope of this article.)

To assist in further standardizing responses to the HRQOL survey, respondents should be asked to rate their average overall HRQOL during the time since the previous survey, e.g., over the preceding six to twelve months -- or, in the case of initial contact with the provider, over the previous month or so. Additional standardization might also be realized by explicitly defining the constructs to be included in the HRQOL ratings (e.g., suffering and activity limits), and by providing definitions for each level of the 11-point scale, perhaps something like this:

Definitions for Quality of Life Survey

Pain and physical symptoms means headache, backache, chest pain, arthritis or any other form of physical pain; also nausea, dizziness, shortness of breath, or any other form of physical suffering. Activity limits include problems with working, housework, hobbies, social relationships, self-care activities (including eating and dressing).

Over the past six months or a year, I have had:

HRQOL
rating   Description

10.  Absolutely no pain or physical symptoms and absolutely no limits on activities
9.   Very occasional pain or symptoms of a very minor nature (never need treatment) and/or very minor limits on activities (never restricted in important things)
8.   Occasional pain or symptoms, usually of a minor nature (rarely need treatment) and/or minor limits on activities (rarely restricted in important things)
7.   Pain or symptoms sometimes troublesome, commonly needs treatment and/or sometimes dissatisfied with extent of limits on activities (sometimes restricted in important things)

And so on. Due to space limitations on the postal card survey, these definitions might best be published in newspapers and information brochures in conjunction with publicizing the outcomes management system, rather than being printed on the survey post-card itself.

Along these lines, a high-profile educational and promotional effort (perhaps sponsored by the government and assisted by the media and health care providers) should be launched to educate the public, both about the nature of HRQOL and, perhaps even more importantly, about the purpose and significance of the outcomes management system. The public should be informed that the survey results will be used to determine "what works" in medicine, and to assist future patients and payers in making decisions about what treatments to undergo or to pay for. Respondents are being asked to contribute their experience to this body of knowledge. This is a weighty responsibility! Indeed, with proper education and promotion about the health outcomes management system and its implications, people could come to view reporting of their HRQOL as a civic duty -- akin to voting in elections. The process of reporting HRQOL could even be incorporated into popular culture ("Hi, how are you?"
"Oh, about a 7, I guess.")

In view of this avowed purpose, respondents would be asked to place greater emphasis on symptoms and activity limits that relate to their health care problems and to services for those problems, and to downplay the effects of unrelated or ephemeral health problems. Patients would realize that they were being asked, in effect, "Is the treatment or procedure that you have undergone something you would recommend to others with your same problem?" and "Should taxpayers pay for that treatment or procedure for patients who can't afford it?" Patients who have not undergone any treatments or procedures for their medical problem would realize that they are being used as a control group to assess health outcomes.

It is hoped that this sort of information will encourage people to participate in the program. Even with the best of efforts, however, many people will fail to respond to the HRQOL questionnaires, in effect refusing to "vote" their HRQOL. This less-than-perfect response will create the potential for a response bias in the data, e.g., the possibility that less-ill (or more-ill) patients will respond disproportionately more (or less) often. It is not clear which way such a bias would work, however, and in any case this sort of democratic effort (one patient-one vote per intervention) must operate based on the votes cast. Those who choose not to vote will be effectively disenfranchised. Of course, every effort should be made to encourage and to facilitate responding. Indeed, this is the very rationale for using the brief postal-card format to solicit HRQOL outcome information.

Induction Problems

Even if global HRQOL is agreed to be a sufficiently "hard" endpoint for large-scale outcome studies, problems remain in ascribing observed changes (or lack of change) in HRQOL to specific treatments and procedures. 
In Part A we noted that use of global HRQOL measures to derive inferences about treatment outcomes would constitute, in effect, a series of large-scale case-control studies. Actually, though, an even greater causal leap of faith seems to be required than for most case-control studies, which generally focus on specific diseases or conditions, and which evaluate specific factors that possess a fairly tightly posited causal link with those diseases or conditions (for example, the relationship between smoking and lung cancer).

Unlike most case-control studies, in which the outcome is usually a specific condition (e.g., cancer), the outcome of interest in outcomes management is simply the change (if any) in HRQOL over time. Even restricting the definition of HRQOL, as discussed above, such a change (or lack of change) could occur for a large number of reasons, only one of which is the treatment or procedure of interest. For example, patients may sprain their ankle, come down with the flu, or develop an unrelated disease or condition.

Even more troublesome, perhaps, reported changes in quality-of-life might be used to assess the effects of different treatments on the same patient over the same time period. For example, a person who received a cataract operation for decreased vision might report that he changed from a HRQOL of 7 before surgery to a 9 after surgery. When this outcome is used in a study of cataract surgery, the reported improvement in HRQOL would be attributed to the cataract surgery (although the weight assigned to this outcome would be determined by the extent of comorbidities and other factors). If the same patient received a hip replacement operation during the same time period, the very same improvement in HRQOL might be ascribed to the hip surgery during subsequent evaluations of this procedure's effectiveness.

An extensive discussion of the fascinating epistemological problem of inductive inferences [63] is beyond the scope of the present article. 
Suffice it to say that we must take refuge in large numbers and simply acknowledge the probabilistic nature of the activity which rising costs and concerns about quality have compelled us to undertake. Ascribing observed outcomes (i.e., effects) to antecedent events (i.e., causes) is always a risky business, particularly in an uncontrolled (e.g., retrospective) environment. Nonetheless, the analyses and inferences envisioned for Ellwood-style outcomes management are fundamentally no different than the analyses and inferences regularly relied upon by medical science and policy-makers.

As noted, large numbers of patients are needed in order to strengthen causal inferences emerging from outcomes-management data, and to reliably detect the "signal" of treatment effectiveness amidst the inevitable noise. Each arm of outcomes-management case-control studies should probably contain on the order of thousands or tens of thousands of patients. The effect of increasing sample size is directly analogous to increasing the power of a lens in microscopy or astronomy: just as a more powerful lens enables the viewer to resolve a previously indistinct body into its constituent parts, so increasing sample size enables the observer to "magnify" the effects of any given, distinct component of HRQOL.

For example, the change in one hypothetical patient's HRQOL over six months or a year might be divided into the following components:

1. Effect of lung cancer plus chemotherapy
2. Effect of angina plus medical treatment
3. Effect of sprained ankle
4. Effect of arthritis
5. Effect of headache

At the same time, another patient's HRQOL outcome might consist of the following components:

1. Effect of lung cancer without chemotherapy
2. Effect of inflammatory bowel disease
3. 
Effect of angina plus coronary artery angioplasty

(Note that lung cancer-plus-chemotherapy, and other condition-treatment pairs, should be listed as one item for the first patient; listing, e.g., lung cancer and chemotherapy separately would imply that symptoms deriving from the cancer per se are equivalent in both treated and untreated patients -- an unlikely proposition.) If these two patients were included as part of a very large sample of patients with lung cancer, the confounding factors (i.e., all factors except chemotherapy) should be distributed more or less evenly between patients who undergo, versus those who forego, chemotherapy. Again, the "signal" of chemotherapy is magnified and discerned from the "noise" of confounding factors. This effect is related to the well-known statistical tenet that increasing sample size increases the precision of estimates regarding some feature of the population from which the sample is drawn.

From a statistical perspective even the process of ascribing a single HRQOL change (or lack of change) to more than one intervention is perfectly legitimate. On average, the signal representing the HRQOL outcome caused by a particular intervention should be detectable through the noise of all confounding factors -- including other procedures. After all, many more patients receive hip replacements without contemporaneous cataract surgery (and vice versa) than with it, and the effect of the relatively few patients who receive both is small within the total patient sample. Moreover, both cataract and hip surgery would contribute to any change in HRQOL over the study period, and might produce either additive or counteracting effects on HRQOL. 
The specific contribution of each procedure can be ascertained by grouping the patient first with a large number of other cataract patients (to be compared with similar patients who did not receive cataract surgery), then (later) with other hip-replacement patients (to be compared with similar patients who did not receive hip replacements).

One problem increased sample size cannot correct is systematic bias in the data, such as might occur if patients who receive a particular intervention tended to have, say, substantially better social support than those who do not receive that intervention. In this case, if the degree of social support is itself responsible for any favorable outcomes, the analysis will incorrectly ascribe that outcome to the intervention. Analysts and policymakers must be alert for this sort of bias, and should attempt to measure and correct for biasing factors whenever possible. However, the problem of bias in large-scale outcome studies should not be overstated. It seems unlikely that many cases of meaningful bias (in the sense that the bias completely invalidates inferences regarding effectiveness) would persist in sample sizes of several thousands or tens of thousands of patients. Moreover, the problem is no different than that faced in the thousands of non-experimental studies already being performed every year. Indeed, the generic problem of bias in observational medical studies would be substantially reduced if researchers had access to the sort of very large data bases envisioned for outcome management programs.

Need for a Central HRQOL Registry

In his original vision of outcomes management [1], Ellwood suggested that HRQOL information be obtained from patients (only) during encounters with their health care practitioners. Although the initial enrollment of patients is probably best performed during such encounters, subsequent polling should be effected at regular intervals (in order to standardize observed outcomes). 
It would be next to impossible for practitioners to schedule return visits according to a standard schedule; moreover, patient and practitioner compliance with these data collection efforts would likely be sporadic, at best. For these reasons, follow-up information about HRQOL outcomes should probably be collected via standardized mail surveys, as has been discussed before in these articles.

We propose that a special centralized registry be set up to collect health outcome information on all patients within a certain geographic area (e.g., a State). Providers would regularly send to this registry lists of patients newly diagnosed with one or more of a defined list of serious medical conditions, including cancer, heart disease, arthritis, and the like. Conditions would be selected based on prevalence, severity, cost, and on the existence of substantial variations in how the condition is treated.

Providers would transmit (ideally electronically) the names, mailing addresses, identification numbers, and initial, baseline HRQOL survey results to the central registry, which would then take responsibility for regular mailing of follow-up post-card questionnaires, perhaps every six months or a year. The return questionnaires would be coded (no names) to ensure confidentiality; in addition, strict measures would be adopted by the system to protect patient confidentiality. Registry personnel would take steps to enhance compliance (e.g., reminder postcards or phone calls).

A major function of the registry would be to facilitate linkage of the outcome data with other data bases. This linkage is, of course, what makes outcome data useful -- by permitting inferences to be drawn concerning the effectiveness (or lack of effectiveness) of medical and surgical treatments and procedures. The technology necessary for successfully linking data bases is well-developed. What is lacking, however, is the sort of electronic medical record needed to make full use of the outcome data. 
Administrative data (e.g., discharge diagnoses, age, gender) can take us a ways down the outcome management road, but, as will be discussed below, there are limitations to this approach.

Perhaps the first successful application of outcomes management will occur in large managed care organizations that have committed to electronic medical records. For example, the Harvard Community Health Plan and the Mayo Clinic provider network are both installing sophisticated electronic medical record systems which, when complete, will cover over 500,000 patients each. A recent article in The New York Times reporting on these data systems recognized their potential role in outcomes management:

    Justifying the costs [of converting to electronic medical records] is a tall order.... But the savings could be enormous. The systems could feed information into nation-wide data bases, which could then be studied to determine which treatments work. This invaluable evidence would enable doctors to eliminate unnecessary exams and surgery. [55]

These early efforts, it must be hoped, will serve as harbingers of much bigger things to come. For full-scale implementation and adequate inferential power, even one million patients is probably insufficient, given that each intervention will likely need thousands of patients in each arm. Ideally patients in an entire state or region would be included in the system.

Unfortunately, the fragmented payer system characteristic of the United States precludes the sort of large scale population-based patient data base necessary for real outcomes management. Indeed, no population-based data base of the size necessary exists in the United States. For this reason, provinces in Canada would be the ideal laboratories for large scale outcomes management. British Columbia and Manitoba have province-wide data bases on all hospital and physician visits and procedures. 
Alternatively, a State that succeeds in implementing a single payer system -- rumored to be an option under the coming Clinton health reform package -- could attempt such a program. It remains to be seen whether the so-called Health Insurance Purchasing Cooperatives, characteristic of "managed competition" schemes, would be able to assemble the necessary data bases. Clearly, federal leadership (and, perhaps, financing) will be needed if a substantial portion of the United States population is ever to be included within a system of outcomes management.

III. Identifying "Types of Patients"

We now turn to the second major issue pertaining to the interpretation of large-scale observational HRQOL data: identifying patients who either do or do not receive significant benefit from specified treatments and procedures. In a meaningful sense, identifying categories of patients based on differential responses to treatment represents the major task of outcomes analysis and management. Toward this end, it is necessary to classify patients into different "types," based on combinations of clinical and demographic variables. For example, it might be determined that, in order to receive significant benefit from a certain treatment, patients must (1) be over age 65, with (2) mean blood pressure over 120 and (3) serum urea nitrogen under 60. This constellation of indicators denotes a "type of patient."

The predictor variables used to define types of patients are demographic and clinical factors that have been shown to correspond statistically with increased or decreased degrees of benefit, or of some outcome (e.g., mortality). Selection of predictor variables should be guided by studies that have analyzed the relationship of patient outcomes with possible predictors.

Statistical Approaches

There are two basic approaches to this problem: regression techniques and recursive partitioning. 
Multivariate regression techniques determine the extent of correspondence between each candidate variable and the outcome of interest (e.g., mortality, improved quality of life), holding the effects of other variables constant. These analyses produce equations that provide quantitative estimates of the likelihood of the outcome given any particular set of values on the selected variables. Each variable is assigned a specific numerical weight based on its observed contribution to the outcome; to determine expected outcomes, these weights are multiplied by the values of the variables and the resulting products added to produce the outcome estimate.

Using regression techniques, therefore, "types of patients" would be identified by reference to the probability that they would experience the outcome of interest with treatment. Examples might include "patients who, with treatment, have an expected HRQOL improvement of at least (or less than) 0.5 rating unit compared to no treatment," or "patients who, with treatment, have at least a one-year life-expectancy."

In recursive partitioning [67] the patient sample is initially divided into two subgroups that are as internally homogeneous as possible with respect to the outcome of interest. This process is repeated recursively within each resulting subgroup until certain stopping rules are invoked. The product of this effort is identification of two or more subgroups, or "types of patients," who differ maximally from each other with respect to the outcome. For example, imagine a sample of cancer patients with an overall mortality rate of 25%, and about whom we know the age, gender, and stage of cancer. Recursive partitioning may determine that, for example, the best "split" on this group is at age 65; perhaps 50% of patients older than 65 die, compared to only 10% of patients 65 or younger. If the program chose this split, no other split, whether on age, gender, or stage, could produce greater within-subgroup homogeneity. 
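
The first split in the cancer example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text does not say how homogeneity is measured, so the sketch uses Gini impurity (one common choice in tree-building programs), it searches over age only, and the 80-patient sample is made up so that it reproduces the 25%, 50%, and 10% figures quoted above. The names `gini`, `best_age_split`, and `cohort` are hypothetical.

```python
from typing import List, Tuple

Patient = Tuple[int, bool]  # (age in years, died during follow-up)


def gini(outcomes: List[bool]) -> float:
    """Gini impurity 2p(1-p) of a subgroup; 0 means fully homogeneous."""
    if not outcomes:
        return 0.0
    p = sum(outcomes) / len(outcomes)
    return 2.0 * p * (1.0 - p)


def best_age_split(patients: List[Patient]) -> Tuple[int, float]:
    """Return (cutoff, weighted impurity) for the best 'age <= cutoff' split."""
    ages = sorted({age for age, _ in patients})
    best_cutoff, best_score = ages[0], float("inf")
    for cutoff in ages[:-1]:  # the largest age would leave one side empty
        younger = [died for age, died in patients if age <= cutoff]
        older = [died for age, died in patients if age > cutoff]
        # Size-weighted average impurity of the two candidate subgroups.
        score = (len(younger) * gini(younger) + len(older) * gini(older)) / len(patients)
        if score < best_score:
            best_cutoff, best_score = cutoff, score
    return best_cutoff, best_score


def cohort(age: int, n: int, deaths: int) -> List[Patient]:
    """Build n made-up patients of one age, the first `deaths` of whom died."""
    return [(age, i < deaths) for i in range(n)]


# Made-up sample: 80 patients, 20 deaths (25% overall mortality);
# 10% mortality at ages 45 and 65, 50% mortality at ages 70 and 80.
patients = cohort(45, 20, 2) + cohort(65, 30, 3) + cohort(70, 10, 5) + cohort(80, 20, 10)
cutoff, score = best_age_split(patients)
print(cutoff, round(score, 3))  # prints: 65 0.3
```

A full recursive-partitioning program would repeat the same search within each resulting subgroup, over every candidate variable (gender and stage as well as age), until its stopping rules apply.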
A second split might then occur in one, both, or neither subgroup; in our hypothetical case, perhaps the older group would be split between those with less than Stage III cancer versus those with Stage III or greater. Either one of these latter subgroups could be split again, either on age or gender. The resulting patient taxonomy ("types of patients") might include (1 and 2) men (and women) over age 65 with Stage III or greater cancer, (3) patients over age 65 with less than Stage III cancer, and (4) patients aged 65 or younger. The proportion of people in each subgroup with the outcome of interest is known from the HRQOL data base; this proportion becomes the probability that future patients in those subgroups will experience the outcome.

Although regression techniques are generally somewhat more accurate than recursive partitioning [69], providers would certainly resist what they would likely perceive as an overly quantitative approach to patient classification. The weighted sum (regression) approach is just too reminiscent of "cookbook medicine," i.e., too close to a recipe calling for, say, "3 eggs, 1 cup milk, and 1/2 teaspoon salt." The recursive partitioning approach, on the other hand, produces more-traditional, holistic descriptions of different types of patients based on specified combinations of predictor variables. For this reason, recursive partitioning may be the procedure of choice for identifying patient subgroups who either do or do not derive significant benefit from a particular treatment or procedure.

Much more needs to be said about how, exactly, different types of patients would be identified using the above-described techniques. We are presently developing explicit models of how such a process might work, and plan to report our results in the near future.

Clinical versus Statistical Significance

The statistical significance of observed differences in HRQOL between different types of patients can be calculated according to established techniques. 
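
As one illustration of such a calculation, the sketch below applies a two-sample z-test to the difference in mean HRQOL rating between two equally sized groups. The test itself is standard, but its use here is an assumption, as are all the numbers: a 0.1-point difference in mean rating and a standard deviation of 2 rating points are invented for the example, and the function name `two_sided_p` is hypothetical. The loop shows how the p-value for the same fixed difference depends on the number of patients per arm.

```python
import math


def two_sided_p(mean_diff: float, sd: float, n_per_arm: int) -> float:
    """Two-sided p-value of a two-sample z-test (equal SD, equal arm sizes)."""
    # Standard error of the difference between two independent means.
    se = sd * math.sqrt(2.0 / n_per_arm)
    z = abs(mean_diff) / se
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# Same made-up difference (0.1 rating point, SD 2) at increasing sample sizes.
for n in (30, 300, 3000, 30000):
    print(n, two_sided_p(0.1, 2.0, n))
```

With these invented inputs the difference is far from significant at 30 patients per arm and highly significant at 30,000 per arm, although its size has not changed.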
Traditional statistical tests must be interpreted with caution, however, because almost any difference across very large numbers of patients will be statistically significant at usual cutoffs (e.g., p < .05 or .01). As noted recently by Diamond and Denton,

    the statistics of hypothesis testing (P values and the associated concept of "statistical significance") were originally designed for sample size around 30, not 30,000. Nevertheless, investigators are now suggesting some questions of clinical interest ... will require assessment trials involving as many as 140,000 patients. With trials of such size, a clinically inconsequential difference in outcome is readily elevated to the lofty level of statistical significance. [70 (p. 457)]

With very large sample sizes it might be best simply to dispense with the concept of statistical significance and focus exclusively on clinically meaningful differences in outcomes across differently treated groups. In Part B we suggested that this threshold might be set at one-half rating point on our 0-10 overall HRQOL scale. This difference represents one-twentieth of the distance between best and worst possible HRQOL -- the same fraction commonly accepted as the boundary of statistical significance (i.e., p <= .05). The choice of a threshold for significant clinical difference is a value judgment and should ultimately be set after public discussion and using due process democratic procedures.

The Problem of Discrimination

Although clinical variables (e.g., blood pressure) are always appropriate variables for defining patient subgroups, demographic variables (e.g., gender, ethnicity) are potentially problematic if the results of effectiveness analyses are to be used for purposes of reimbursement policy or resource allocation. 
Conclusions to the effect that a treatment "worked" for one demographic group but not another, although perhaps scientifically valid, would be very difficult to accept from a social and political perspective.

    For example, what if a given service is found to provide significant benefit (either on life-expectancy or quality of life) for whites only (or for non-whites only), or for men but not for women (or vice versa)? Would a policy whereby men were covered for, say, chemotherapy for lung cancer, but women were not, be supportable -- even if based on good evidence? ... The appearance of discrimination that would inevitably exist if some but not others were covered for putatively beneficial treatment would be extremely difficult to accept. [5]

Society will need to come to grips with this problem, because it is certain that differences in effectiveness will be observed across demographic lines.

The use of secondary diagnoses and comorbidities could give rise to similar problems. For example, patients who have diabetes may derive significantly less benefit from cardiac transplantation than do patients without diabetes. Although it might be appropriate to counsel diabetic patients about this finding, a formal policy that denied coverage for heart transplants to patients with diabetes might seem discriminatory. If based on good evidence, however, such a policy would probably not be in conflict with the Americans with Disabilities Act or related statutes.

These latter considerations illustrate a problem with the use of administrative data bases that contain only demographic and diagnostic information as potential predictor variables. Lacking clinical data of the type preferred for construction of patient taxonomies, administrative data bases can only provide limited degrees of useful insight into the question of "who benefits" from a particular treatment. However, administrative data can provide information concerning differences in outcomes based on which provider is used. 
This type of data is becoming of increasing significance in this era of "quality assessment" and "outcome report cards." [73]

There is another important methodological consideration regarding the identification of patient subgroups. It is well known that retrospective searches for subgroups inevitably "succeed" in finding some types of patients who appear to respond to a given intervention even when the entire sample, taken as a whole, does not respond. Often such findings are due strictly to chance, the result of "data dredging." Indeed, it is statistically certain that obviously nonsensical subgroups will be found to benefit from treatment, perhaps patients who are Scorpios and have last names beginning with "N."

In general, patient subgroups who are determined to benefit from treatment should pass two tests before the results are accepted and acted upon. First, the effect should persist across different States or other large subsets of the population, i.e., it must be reproducible. Second, the finding must make clinical sense; there must be some more-or-less accepted clinical theory that can account for the finding. In borderline cases, the findings should be tested in prospective studies, perhaps using longer generic HRQOL questionnaires (such as the SF-36) and disease- or treatment-specific measures. Some procedural mechanism, perhaps a standing Commission, will be needed to determine which findings should be taken at face value, which rejected, and which subjected to prospective study.

Length of Life versus Quality of Life

As a final consideration in developing taxonomies of "types of patients," it must be noted that in order to properly evaluate treatment effectiveness it is necessary to consider not only HRQOL outcomes, but also the impacts of treatment (if any) on life-expectancy. 
Some treatments, such as an appendectomy for appendicitis, are valued not for any improvements in quality of life, but rather for their substantial effects on life-expectancy (e.g., in the case of appendicitis, from perhaps two weeks untreated to normal life-expectancy given appropriate treatment). Other treatments have significant impacts on quality of life, but little or no effect on life-expectancy -- such as medication or surgery for arthritis, or prostatectomy for benign obstruction. Still other treatments involve trade-offs between quality and quantity of life, where a longer life-expectancy comes at the expense of various symptoms resulting in a possible decrease in quality of life. Chemotherapy for some forms of cancer might fall into this category.

The basic model for evaluating treatment effectiveness can be summed up in algebraic terms:

\[
\text{Effectiveness} = \left(\text{AvHRQOL}_{tx} \times \text{NYEL}_{tx}\right) - \left(\text{AvHRQOL}_{notx} \times \text{NYEL}_{notx}\right)
\]

AvHRQOL_{tx} = Average HRQOL with treatment (scaled between 0-1)
NYEL_{tx} = Number of Years Expected to Live with treatment
AvHRQOL_{notx} = Average HRQOL without treatment
NYEL_{notx} = Number of Years Expected to Live without treatment

These terms can be depicted in graphical form:

[Figure: two boxes, a top box for no treatment (dimensions NYEL_{notx} and AvHRQOL_{notx}) and a bottom box for treatment (dimensions NYEL_{tx} and AvHRQOL_{tx}).]

The difference in "weight" (quality x quantity) between the top and bottom boxes constitutes the net effectiveness of the treatment. Note that the weight of the bottom (treatment) box may be less than that of the top box if treatment results in overall lower life expectancy and/or worse HRQOL.

In the end, the goal of outcomes management is to identify types of patients whose "treatment boxes" "weigh" significantly more than their "no-treatment boxes" with respect to a specific intervention. 
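
The comparison of box "weights" is a difference of two products, which the short Python sketch below makes concrete. The function name `effectiveness` and every number in the example calls are hypothetical, chosen only to mirror the three kinds of treatment described above (a quality-only gain, a quality-for-quantity trade-off, and a net loss); they are not data from any study.

```python
def effectiveness(av_hrqol_tx: float, nyel_tx: float,
                  av_hrqol_notx: float, nyel_notx: float) -> float:
    """Weight of the treatment box minus weight of the no-treatment box.

    Average HRQOL is assumed to be rescaled to the 0-1 range, and life
    expectancy is given in years, so the result is in quality-weighted years.
    """
    for q in (av_hrqol_tx, av_hrqol_notx):
        if not 0.0 <= q <= 1.0:
            raise ValueError("average HRQOL must be scaled between 0 and 1")
    if nyel_tx < 0 or nyel_notx < 0:
        raise ValueError("life expectancy cannot be negative")
    return av_hrqol_tx * nyel_tx - av_hrqol_notx * nyel_notx


# Quality gain only (made-up numbers): same 20 years, HRQOL 0.6 -> 0.8.
print(effectiveness(0.8, 20.0, 0.6, 20.0))

# Trade-off (made-up numbers): 3 years at 0.5 versus 2 years at 0.7.
print(effectiveness(0.5, 3.0, 0.7, 2.0))

# Net loss (made-up numbers): same 1 year, but HRQOL falls from 0.7 to 0.5.
print(effectiveness(0.5, 1.0, 0.7, 1.0))
```

A positive result means the treatment box outweighs the no-treatment box; how large the difference must be before it counts as significant is left open by the calculation itself.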
The extent to which the treatment box should outweigh the non-treatment box before the difference is considered "significant" -- and the intervention deemed effective -- constitutes a value judgment which, again, calls for public debate and the use of democratic evidence-based decision-making procedures.

IV. Conclusion

With due care, the collection and analysis of global HRQOL outcome data can provide valuable information concerning the effectiveness of medical treatments and procedures. Reporting of one's health outcomes should come to be viewed as a civic responsibility, encouraged by providers and policymakers. Large data sets must be assembled. Appropriate data collection infrastructures should be developed as soon as possible, and analysis and interpretation issues debated and resolved.

The festering problems of increasing health care costs and inequitable access to effective services demand that large scale outcomes measurement and management systems, as suggested by Ellwood over five years ago, become a reality.

REFERENCES

1. Ellwood PM. Outcomes management: a technology of patient experience. N Engl J Med 1988; 318: 1549-56.
2. Epstein AM. The outcomes movement: will it take us where we want to go? N Engl J Med 1991.
3. Relman AS. Assessment and accountability: the third revolution in medical care. N Engl J Med 1988; 319: 1220-2.
4. Wennberg JE. Outcomes research, cost containment, and the fear of health care rationing. N Engl J Med 1990; 323: 1202-4.
5. Hadorn DC. The problem of discrimination in health care priority setting. JAMA 1992; 268: 1454-1459.
6. Hadorn DC. Setting health care priorities in Oregon: cost-effectiveness meets the Rule of Rescue. JAMA 1991; 265: 2218-2225.
7. Eddy DM. Oregon's plan: should it be approved? JAMA 1991; 266: 2439-45.
8. Fox DM, Leichter HM. Rationing care in Oregon: the new accountability. Health Affairs 1991; 10(2): 7-27.
9. Gareiss R. Beyond billing. Am Med News, January 25, 1993, pp. 9-10.
10. Deyo RA, Carter WB. 
Strategies for improving and expanding the application of health status measures in clinical settings. Med Care 1992 Supplement; 30: MS176-MS186.
11. Lansky D, Butler JBV, Waller FT. Using health status measures in the hospital setting: from acute care to 'outcomes management.' Med Care 1992; 30 Supplement MS57-MS73.
12. Campbell DT, Stanley JC. Experimental and Quasi-Experimental Designs for Research. Chicago: Rand McNally, 1966.
13. Sechrest L, Perrin E, Bunker J (eds.). Research Methodology: Strengthening Causal Interpretations of Nonexperimental Data. Rockville, MD: United States Department of Health and Human Services, Agency for Health Care Policy and Research, 1991.
14. Kaplan RM, Bush JW. Health-related quality of life measurement for evaluation research and policy analysis. Health Psychology 1982; 1: 61-80.
15. Patrick DL, Deyo RA. Generic and disease-specific measures in assessing health status and quality of life. Medical Care 1989; 27 Supplement S217-S232.
16. Ware JE. Methodological Considerations in the Selection of Health Status Assessment Procedures. In Wenger NK, Mattson ME, Furberg CD, Elinson J (eds.), Assessment of Quality of Life in the Clinical Trials of Cardiovascular Therapies. New York: LeJacq Publishing, Inc., 1984, pp. 87-111.
17. Hadorn DC. The role of public values in setting health care priorities. Soc Sci Med 1991; 32: 773-782.
18. Bush JW, Fanshel S, Chen M. Analysis of a tuberculin testing program using a health status index. Social-Economic Planning Sciences 1972; 6: 49-69.
19. Bush JW, Chen M, Patrick DL. Cost-effectiveness using a health status index: analysis of the New York State PKU screening program. In Berg R (ed.), Health Status Indexes. Chicago: Hospital Research and Educational Trust, 1973, p. 172-208.
20. Stewart AL, Greenfield S, Hays RD, et al. Functional status and well-being of patients with chronic conditions. JAMA 1989; 262: 907-913.
21. McDowell I, Newell C. Measuring Health: A Guide to Rating Scales and Questionnaires. 
New York: Oxford University Press, 1987.
22. Ware JE, Sherbourne CD. The MOS 36-item short-form health survey (SF-36): I. conceptual framework and item selection. Med Care 1992; 30: 473-483.
23. Inter-Patient Outcome Research Team's Outcomes Assessment Work Group Survey of Outcomes Assessment Instruments Used in Primary Data Collection and Data Collection Instruments. Alexandria, VA: Walcott, 1992.
24. Kaplan RM, Anderson JP. The general health policy model: an integrated approach. In Quality of Life Assessments in Clinical Trials (Spilker B, ed.). New York: Raven Press, 1990.
25. Hunt SM, McEwen J, McKenna SP. Measuring health status: a new tool for clinicians and epidemiologists. J Roy Coll Gen Pract 1985; 35: 185-89.
26. McKenna SP, Hunt SM, McEwen J. Weighting the seriousness of perceived health problems using Thurstone's method of paired comparisons. Int J Epidem 1981; 10: 93-97.
27. Bergner M, Bobbitt RA, Carter WB, et al. The Sickness Impact Profile: development and final revision of a health status measure. Med Care 1981; 19: 787-805.
28. Rosser R, Kind P. A scale of valuations of states of illness: is there a social consensus? Int J Epidem 1978; 7: 347-58.
29. Hadorn DC, Hays RD. Multitrait-multimethod analysis of health-related quality of life measures. Med Care 1991; 29: 829-840.
30. Hadorn DC, Hays RD, Uebersax J, et al. Improving task comprehension in the measurement of health state preferences: a trial of informational figures and a paired comparison task. J Clin Epi 1992; 45: 233-243.
31. Analysis Under the Americans with Disabilities Act (ADA) of the Oregon Reform Demonstration. Washington, DC: US Department of Health and Human Services; August 3, 1992.
32. Najman J, Levine S. Evaluating the impact of medical care and technologies on the quality of life: A review and critique. Soc Sci Med 1981; 107-115.
33. Sackett DL, Torrance GW. The utility of different health states as perceived by the general public. J Chron Dis 1978; 31: 697-704.
34. 
Slevin ML, Stubbs L, Plant HJ, et al. Attitude to chemotherapy: comparing views ofpatients with cancer with those of doctors, nurses, and general public. BMJ 1990; 300:1458-1460.35. Christensen-Szalanski JJJ. Discount functions and the measurement of patients'values: Women's decisions during childbirth. Med Dec Making 1984; 4: 47-58.36. O'Connor AM, Boyd N, Warde P. Eliciting preferences for altemative drug therapies inoncology: Influence of treatment outcome description, elicitation technique andtreatment experience on preferences. J Chron Dis 1987; 40: 811-818.37. LLewellyn-Thomas HA, Sutherland HJ, Thiel EC. Do patients' evaluations of a futurehealth state change when they actually enter that state? Med Decis Making1991:11 :323 (abstract).38. Kaplan RM, Bush JW, Berry CC. The reliability, stability, and generalizability of ahealth status index. American Statistical Association, Proceedings of the Social StatusSection, 1978,704-709.39. Froberg DG, Kane RL. Methodology for measuring health-state preferences--III:population and context effects. J Clin Epidemiol 1989; 42: 585-92.40. Rokeach M. The Nature of Human Values. New York: The Free Press, 1973.41. Balaban DJ, Sagi PC, Goldfarb NI, et at. Weights for scoring the quality of well-beinginstrument among rheumatoid arthritics: A comparison to the general populationweights. Med Care 1986; 24: 973-980.42. LLewellyn-Thomas H, Sutherland HJ, Tibshirani R; et al. Describing health states:Methodological issues in obtaining values for health states. Med Care 1984; 22: 543­552.43. Boyle MH, Torrance GW. Developing multiattribute health indexes. Med Care 1984; 22:1045-1057.44. Thurstone L. A law of comparative judgment. Psychological Review 1927; 34: 273-286.45. Fylkesnes K, Forde OH. Determinants and dimensions involved in self-evaluation ofhealth. Soc Sci Med 1992; 35: 271-279.46. Barsky AJ, Cleary PO, Klerman GL. Determinants of perceived health status of medicaloutpatients. Soc Sci Med 1992; 34: 1147-1154.47. 
Campbell DT, Fiske OW. Converent and discriminant validation by the multitrait­multimethod matrix. Psych Bull 1959; 56: 81·105.48. Guyatt, G., Deyo, A.A., Charlson, M., et al, Responsivness and validity in health statusmeasurement: A clarification. J Clin Epidemiol 1989:42:403-408.49. Hays RD, Hadom DC. Responsiveness to change: an aspect of validity, not a separatedimension. Quality of Life Res 1992; 1: 73-75.50. Reuben DB, Rubenstein LV, Hirsch SH, Hays RD. Value of functional status as apredictor of mortality: results of a prospective study. Am J Med 1992; 93: 663-9.5651. Wolinsky FD, Johnson RJ. Perceived health status and mortality among older men andwomen. J Gerontol 1992; 47: S304-312.52. Reuben DB; Siu AL; Kimpau S. The predictive validity of self-report andperformance-based measures of function and health. J Gerontol 1992; 47: M106-110.53. Shepherd SL, Hovell MF; Slymen DJ; Harwood IR; Hofstetter CR; Granger LE; KaplanRM. Functional status as an overall measure of health in adults with cystic fibrosis:further validation of a generic health measure. J Clin Epidem 1992; 45: 117-125.54. Mossey JM, Shapiro E. Self-rated health: A predictor of mortality among the elderly.AJPH 1982; 72: 800-808.55. Gough IR, Fumival CM, Schilder L, et. aI. Assessment of the quality of life of patientswith advanced cancer. Eur J Cancer Clin Oncol 1983; 19: 1161-5.56. Feinstein AR. Benefits and obstacles for development of health status assessmentmeasures in clinical settings. Med Care 1992; 30 Supplement MS50·MS56.57. Andrews FM, Crandall R. The validity of measures of self-reported well-being. SocIndicat Res 1976; 3: 1-19.58. Andrews FM, Withey SB. Social indicators of well-being: Americans' perceptions of lifequality. New York: Plenum, 1976.59. Lehman AF, Ward NC, Unn Ls. Chronic mental patients: the quality of life issue. Am JPsychiatry 1982; 139: 1271-1276.60. Atkinson T. The stability and validity of quality of life measures. Soc Indicat Res 1982;10: 113-132.61. 
Campbell DT, Stanley JC. Experimental and Quasi-Experimental Designs for Research.Chicago: Rand McNally, 1966.62. Sechrest L, Perrin E, Bunker J., (eds.) Research Methodology: Strengthening CausalInterpretations of Nonexperimental Data. Rockville, MD: United States Department ofHealth and Human Services. Agency for Health Care Policy and Research, 1991.63. Weed DL. On the logic of causal inference. Am J Epidem 1986; 123: 965-979.64. Wonnacott RJ, Wonnacott TH. Introductory Statistics (4th edition). New York: JohnWiley, 1985.65. Rifkin G. New momentum for electronic patient records. New York Times, May 2, 1993,F8.66. Marshall RJ. Partitioning methods for classification and decision making in medicine.Stat Med 1986; 5: 517-526.67. Cook EF, Goldman L. Empiric comparison of multivariate analytic techniques:advantages and disadvantages of recursive partitioning analysis. J Chron Dis 1984; 37:721-731.68. Breiman L, Friedman JH, Olshen RA, Stone CJ. Classification and Regression Trees.Belmont, CA: Wadsworth International Group, 1984.5769. Hadom DC, Draper D, Rogers WH, Keeler EB, Brook RH. Cross-validationperformance of mortality prediction models. Stat Med 1991; 11: 475-489.70. Diamond GA, Denton TA. Alternative perspectives on the biased foundations of medicaltechnology assessment. Ann Int Med 1993; 118: 455-464.71. Hadom DC. Emerging parallels in the American health care and legal- judicial systems.Am J Law Med 1992; 18: 73-96.72. Darby M. Minnesota blues re-allocates resources using outcomes. Moo Guidelines andOutcomes Res 1992; 3: 1-2.73. Mitka M. Ready or not, here come outcomes 'report cards'. Am Moo News March22/29, 1993: 4.74. Bulpitt CJ. Subgroup analysis. Lancet July 2, 1988; 31-34.58
