UBC Faculty Research and Publications

Assessment of health-related quality of life in arthritis: conceptualization and development of five… Kopec, Jacek A; Sayre, Eric C; Davis, Aileen M; Badley, Elizabeth M; Abrahamowicz, Michal; Sherlock, Lesley; Williams, J I; Anis, Aslam H; Esdaile, John M Jun 2, 2006


Full Text

BioMed Central
Health and Quality of Life Outcomes
Open Access

Research

Assessment of health-related quality of life in arthritis: conceptualization and development of five item banks using item response theory

Jacek A Kopec*1,2, Eric C Sayre2,3, Aileen M Davis4, Elizabeth M Badley5, Michal Abrahamowicz6, Lesley Sherlock2, J Ivan Williams7, Aslam H Anis1,2 and John M Esdaile2,8

Address: 1Department of Health Care and Epidemiology, University of British Columbia, Vancouver, BC, Canada, 2Arthritis Research Centre of Canada, Vancouver, BC, Canada, 3Department of Statistics and Actuarial Science, Simon Fraser University, Burnaby, BC, Canada, 4Outcomes and Population Health, Toronto Western Research Institute, Toronto, Ontario, Canada, 5Department of Public Health Sciences, University of Toronto, Toronto, Ontario, Canada, 6Department of Epidemiology and Biostatistics, McGill University, Montreal, Quebec, Canada, 7Toronto Rehabilitation Institute, Toronto, Ontario, Canada and 8Division of Rheumatology, University of British Columbia, Vancouver, BC, Canada

Email: Jacek A Kopec* - jkopec@arthritisresearch.ca; Eric C Sayre - esayre@arthritisresearch.ca; Aileen M Davis - adavis@uhnresearch.ca; Elizabeth M Badley - e.badley@utoronto.ca; Michal Abrahamowicz - Michal@epimgh.mcgill.ca; Lesley Sherlock - lsherlock@richmond.ca; J Ivan Williams - williams.jack@torontorehab.on.ca; Aslam H Anis - aslam.anis@ubc.ca; John M Esdaile - jesdaile@arthritisresearch.ca

* Corresponding author

Abstract

Background: Modern psychometric methods based on item response theory (IRT) can be used to develop adaptive measures of health-related quality of life (HRQL). Adaptive assessment requires an item bank for each domain of HRQL. The purpose of this study was to develop item banks for five domains of HRQL relevant to arthritis.

Methods: About 1,400 items were drawn from published questionnaires or developed from focus groups and individual interviews and classified into 19 domains of HRQL.
We selected the following 5 domains relevant to arthritis and related conditions: Daily Activities, Walking, Handling Objects, Pain or Discomfort, and Feelings. Based on conceptual criteria and pilot testing, 219 items were selected for further testing. A questionnaire was mailed to patients from two hospital-based clinics and a stratified random community sample. Dimensionality of the domains was assessed through factor analysis. Items were analyzed with the Generalized Partial Credit Model as implemented in Parscale. We used graphical methods and a chi-square test to assess item fit. Differential item functioning was investigated using logistic regression.

Results: Data were obtained from 888 individuals with arthritis. The five domains were sufficiently unidimensional for an IRT-based analysis. Thirty-one items were deleted due to lack of fit or differential item functioning. Daily Activities had the narrowest range for the item location parameter (-2.24 to 0.55) and Handling Objects had the widest range (-1.70 to 2.27). The mean (median) slope parameter for the items ranged from 1.15 (1.07) in Feelings to 1.73 (1.75) in Walking. The final item banks are comprised of 31–45 items each.

Conclusion: We have developed IRT-based item banks to measure HRQL in 5 domains relevant to arthritis.
The items in the final item banks provide adequate psychometric information for a wide range of functional levels in each domain.

Published: 02 June 2006
Health and Quality of Life Outcomes 2006, 4:33 doi:10.1186/1477-7525-4-33
Received: 08 April 2006
Accepted: 02 June 2006
This article is available from: http://www.hqlo.com/content/4/1/33
© 2006 Kopec et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background
Over the past decade, item response theory (IRT) has been increasingly applied to the assessment of health-related quality of life (HRQL) [1]. IRT can be used to evaluate, modify, link, compare, and score existing measures as well as develop new instruments [1,2]. An important application of IRT is computerized adaptive assessment of HRQL [1-4]. The process is adaptive because it allows different respondents to answer different questions depending on their level of health for the specific domain (dimension) being evaluated. The computer selects the questions from an item bank, i.e., a pool of previously calibrated questions, using an adaptive algorithm. The selection of an item at a given stage is based on the pattern of responses to previous items and properties of the items available in the item bank. The final score for the respondent is derived from the responses to the administered items using maximum likelihood estimation [2,3].

Because HRQL is a multi-domain concept, adaptive assessment of HRQL requires an item bank for each domain.
Item banks for measuring the impact of headache [5], depression [6], anxiety [7], perceived stress [8], fatigue [9], pain [10], and physical function [11] have recently been developed and other item banks are under construction [3]. The objective of the current study was to develop item banks for the HRQL domains relevant to arthritis and related conditions. In this article we discuss the conceptual framework for our measurement system, describe the process of item generation, present the methodology and results of an empirical study to calibrate and select the items for each domain, and discuss the properties of the final items. Further studies of this measurement system, including validation studies, alternative scoring methods, and comparisons with other instruments, will be described in subsequent publications.

Content development
The World Health Organization (WHO) defined health as a state of complete physical, mental, and social well-being [12]. Ware proposed functional status, well-being, and general health perceptions as the minimum set of generic health concepts [13]. Other models of health and HRQL have been proposed [14], but none has been generally accepted. There are significant differences in the domains included in the leading HRQL instruments; furthermore, domains with similar content may have different names in different instruments [15-17].

The most comprehensive framework for describing health is the International Classification of Functioning, Disability and Health (ICF) [18,19]. The ICF considers four major areas of health and function, i.e., body structures (e.g., structure of lower extremity), body functions (e.g., movement functions), activities and participation (e.g., mobility), and environmental factors. For each area, there are multiple levels of classification. For example, mobility is divided into changing and maintaining body position, moving and handling objects, walking and moving, and moving around using transportation. Walking and moving, in turn, is subdivided into walking, moving around, moving around in different locations, and moving around using equipment. Finally, walking is classified into walking short distances, walking long distances, walking on different surfaces, and walking around obstacles. We felt that this multi-level structure and the large number of possible domains would make the ICF too complex for use as a measurement tool. Therefore, in developing our measurement system, we combined the ICF model with an empirical approach based on existing instruments.

Our objective was to create a large database of previously validated items that would serve as a starting point for the development of several item banks. To this end, we reviewed the content of a large number of published health and quality of life questionnaires, both generic and disease- or domain-specific. The review started with the instruments included in major texts and published literature reviews [20-23]. These were supplemented with additional instruments known to the investigators. Literature searches were then performed to look for additional questionnaires. We first entered items from widely used multidimensional measures. The content of the database was evaluated continuously and items from other instruments were entered, selected primarily for their domain content and perceived status as standard or well established measures. After items from 32 instruments were entered (Table 1), a consensus was reached among the investigators that the database was sufficiently comprehensive for the purpose of our study.

Table 1: Instruments used to select items for the preliminary item database

Instrument (acronym) [reference] | Number of items
Arthritis Impact Measurement Scales (1 & 2) (AIMS) [42] | 150
Beck Depression Inventory (BDI) [53] | 21
Cancer Rehabilitation Evaluation System (CARES) [54] | 140
Center for Epidemiologic Studies – Depressed Mood Scale (CES-D) [43] | 20
Clinical Back Pain Questionnaire (CBPQ) [55] | 19
Disabilities of the Arm, Shoulder and Hand (DASH) [56] | 68
Disability Rating Index (DRI) [57] | 12
EORTC Quality of Life Questionnaire – C30 (EORTC) [58] | 30
EuroQol (EQ-5D) [17] | 5
Functional Assessment of Cancer Therapy Scales (FACT) [41] | 177
Functional Living Index: Cancer (FLIC) [59] | 22
General Health Questionnaire (GHQ) [60] | 28
Health Utilities Index, Mark 2 (HUI2) [16] | 7
Health Utilities Index, Mark 3 (HUI3) [16] | 8
Musculoskeletal Functional Assessment Instrument (MFAI) [61] | 120
McMaster Toronto Arthritis Patient Preference Disability Questionnaire (MACTAR) [62] | 5
McMaster Health Index Questionnaire (MHIQ) [63] | 74
Michigan Hand Outcomes Questionnaire (MHOQ) [64] | 37
MOS Short Form-36 Health Survey (SF-36) [15] | 36
Multidimensional Health Assessment Questionnaire (MDHAQ) [65] | 9
National Population Health Survey Questionnaire (NPHS) [66] | 10
North American Spine Society Low Back Pain Outcome Instrument (NASS) [67] | 46
Nottingham Health Profile, Version 2 (NHP) [68] | 30
Pain Disability Index (PDI) [69] | 7
Profile of Mood States (POMS) [70] | 65
Revised Oswestry Pain Questionnaire (OPQ) [71] | 10
Rheumatoid Arthritis Quality of Life (RAQoL) [72] | 30
Rotterdam Symptom Checklist (RSC) [73] | 39
Sickness Impact Profile (SIP) [74] | 136
Stanford Health Assessment Questionnaire (modified) (HAQ) [50] | 20
WOMAC Osteoarthritis Index (WOMAC) [51] | 24
Zung Self-Rating Depression Scale (SDS) [75] | 20

The items were then reclassified using the ICF concepts of "body functions" and "activities and participation". Each item was described in terms of the source questionnaire, wording of the question and response options, original concept measured, and domain assigned according to our new classification. In this way, over 1,400 items were classified into the following 19 domains: lower extremity function, upper extremity function, pain, emotional function, cognitive function, communication, energy, sleep, vision, hearing, cardiopulmonary function, digestive function, sexual/reproductive function, urinary function, skin function/appearance, self-care activities, domestic activities, interpersonal activities, and major life activities.

Further work was limited to a smaller number of domains, as our main target population was persons with arthritis and related disorders. Based on the literature [24] and our experience in measuring health outcomes in musculoskeletal conditions, we identified the following domains as being highly relevant to this population: self-care, domestic and major life activities, lower extremity function, upper extremity function, pain, and emotional function. The first 3 domains represented the ICF concept of participation [18,19]. Lower and upper extremity functions were conceptualized as ability to perform activities that depend on these functions, such as those in the ICF domains of walking and handling objects, respectively [19]. We extended the concept of pain, which is part of sensory function in the ICF [19], to include discomfort, as this term has been used in some questionnaires [16]. Finally, emotional function was conceptualized based on the ICF as the spectrum of feelings, such as joy, sorrow, anger, or anxiousness [19]. We combined self-care, domestic and major life activities and modified the labels to arrive at the following five final domains: (1) Daily Activities, (2) Walking, (3) Handling Objects, (4) Pain or Discomfort, and (5) Feelings. Of the 1,400 items in the database, 624 were classified into these five domains and these items were considered for further testing and reduction.

Initial item reduction
To eliminate redundant items, i.e., items that measured the same facet of a given domain, the items were organized by content, grouping all similar items together. In this way it was possible to identify identical or very similar items and eliminate duplication. When choosing between items sharing similar content, we considered primarily the wording of the question and the format of the response options. While redundant items were removed, sometimes multiple items with similar content were included for empirical testing. This was particularly true in the Daily Activities and Pain or Discomfort domains, where the number of distinct areas of activity was limited. For example, items asking about the level of difficulty and degree of limitation due to health for the same type of activity were included in the final questionnaire. At this stage, the wording of most items and the number and wording of response options were modified to achieve a sufficient degree of uniformity. The level of functioning for items in the Daily Activities, Walking, and Handling Objects domains was measured in terms of difficulty, limitations, or need for help with specific activities. Pain or Discomfort was assessed in terms of impact, intensity, or frequency. Items measuring Feelings asked about the amount of time spent in a given emotional state. We included items assessing depression and anxiety, using both positive and negative wording. All items were worded to reflect a 4-week timeframe commonly used in HRQL instruments and had between 3 and 6 response options. We decided that a 4-week recall period was appropriate for studies in chronic conditions such as arthritis to remove the "noise" caused by short-term fluctuations in symptoms, although alternative versions of the questions with a shorter timeframe, e.g., 7 days or 24 hours, could be developed in the future.

The items were categorized according to the approximate level of HRQL they pertained to ("difficulty") to identify gaps and further redundancies. Upon careful inspection of the items in each domain, we noted that extreme levels of function were not covered well. Such items tend to have highly skewed response distributions and are often deleted from HRQL questionnaires in the content reduction phase. However, for an item bank it is important to include items that can discriminate at either a very high or very low level of function. The relative scarcity of such items, particularly those measuring the highest functional levels, required the development of new items. For example, in the Walking domain, we included an item asking about difficulty running or jogging 20 miles to discriminate among relatively healthy, younger individuals. Similarly, in Handling Objects, we added items about carrying 100 and 200 lbs. An item about planning a suicide was included in Feelings to discriminate among severely depressed individuals.
All new items used a standardized format, with a 4-week recall and 5 ordered response options.

Pre-testing and final revisions
The procedures described in the previous section reduced the number of items from over 600 to about 230. In the next step, the items in each domain were subjected to a multi-stage empirical pre-testing and iterative revision process [25]. Twenty-four volunteers pre-tested the item pool. Subjects ranged from 25 to 86 years of age (mean = 46) and 71% were female. Most had completed at least some college or university. Nearly half reported having osteoarthritis or back pain. Some pre-tests were conducted in groups, others individually. Following completion of the questionnaire, a discussion was held about the clarity of instructions, format and wording, as well as reactions to item content, e.g., to identify items considered controversial or irrelevant. Content development continued through the pre-testing stage, with new items being developed from focus groups and individual interviews.

Most items identified by the subjects as unclear either referred to more than one concept (e.g., items combining activities such as eating and bathing) or were considered too lengthy. "Some questions have more than one idea, which makes it unclear how to answer." "Ask how difficult it would have been for you, not whether or not you have done the activity." "Some of the long questions are over-worded." Clarification was also sought on items referring to distances. "It's easier to think in terms of blocks than yards." Some participants reacted positively to the inclusion of multiple items addressing the same function, while others disliked the repetition. "It is well organized, and I liked the repetition." "The repetition was irritating and made it seem like a test." Some subjects also reported difficulty in choosing the appropriate response because of recent health changes. The items were revised or deleted as testing progressed, based on subjects' comments. The introduction to the questionnaire was modified to help clarify the purpose of the questionnaire and to help subjects decide how to respond if their health state had changed recently. All 11 subjects completing the last two versions found the instructions very clear and 9/11 found the meaning of the questions very clear.

In the final stage of content development, a nominal group technique was used, where members of the investigative team reviewed all the items and reached consensus on the final item pool. This process resulted in a 219-item item calibration questionnaire (ICQ). The questionnaire contained 43 items in the Daily Activities domain, 38 in Walking, 54 in Handling Objects, 39 in Pain or Discomfort, and 45 in Feelings.

Item calibration study
Subjects in the item calibration study were patients drawn from two clinics at the Vancouver Hospital and Sciences Centre (VHSC) and a stratified random community sample in British Columbia (BC), Canada. We obtained a list of 554 patients with rheumatic conditions, treated by rheumatologists at the VHSC between 1994 and 2001. The vast majority had been diagnosed with rheumatoid arthritis (RA), although several patients had other types of inflammatory arthritis. We also obtained a list of 472 patients with radiographically confirmed osteoarthritis (OA) of the hip or knee waiting for joint replacement surgery. All patients able to complete the questionnaire were considered eligible for the study. For the community sample, a random computerized list of 3,000 telephone subscribers in BC, aged 18 years or older, was obtained to provide a representative sample of households in the province. The sample was randomly divided into two subsamples. For 2,000 subscribers we asked that the questionnaire be completed by the adult in the household whose birthday came next, following receipt of the questionnaire. For the remaining 1,000 subscribers we asked the oldest person in the household to complete the questionnaire. The reason for weighting the sample towards older persons was to increase the proportion of individuals with functional limitations.

A letter of introduction and the 219-item ICQ were mailed to each potential participant, along with a self-addressed pre-paid envelope. A reminder card was sent after one week. A second copy of the questionnaire was sent to non-respondents four weeks after the initial mailing, followed by a second reminder one week later. In the clinical samples, the remaining non-respondents were called to remind them to send in the questionnaire. Up to five phone calls were attempted at different times of the day and different days of the week and/or two voice messages were left. In the community samples, no phone calls were made to non-respondents; however, a cash draw incentive was offered to those completing the questionnaire. The study was approved by the University of British Columbia Ethics Board.

Data analysis
The items were analyzed in a step-wise fashion, as described in the literature [4-6]. The steps in the analyses were as follows: 1) analysis of item dimensionality; 2) derivation of item parameters and option characteristic curves (OCCs); 3) analysis of item fit; and 4) analysis of differential item functioning. Extreme response categories with fewer than 5 responses were collapsed with the next most extreme category prior to the analysis to ensure statistical stability of item parameters.

Dimensionality
Dimensionality of the items within each domain was investigated via factor analysis. We used polychoric correlations because of the categorical nature and skewed distribution of responses to many items [26]. We assessed the amount of variance explained by the first factor and plotted consecutive eigenvalues on a graph (scree plot) [27].
We fit a single-factor model to each domain and assessed factor loadings for each item, residual correlations between each item and all others, and root mean square (RMS) residual correlations. Factor loadings ≥0.4 are usually required to decide that an item is represented by a given factor [6,28].

IRT model
Several IRT models were considered for item calibration, ranging from the one-parameter Rasch model [29] to Muraki's Generalized Partial Credit Model (GPCM) [30]. A non-parametric approach developed by Ramsay (Testgraf) was also explored [31]. After a series of preliminary analyses, the GPCM, as implemented in Parscale version 3.5 [32], was chosen for further analyses. This model is flexible and appropriate for multi-categorical items with ordered response options [32]. It has been successfully applied to similar items by other authors [5-8].

In the GPCM, the probability of a given response to an item for a subject with the trait level θ is modeled as a function of the number of categories, the "location" and "slope" parameters for the item, and its "item-category" parameters [32]. We used Parscale to estimate item parameters and to obtain option characteristic curves (OCCs) for each item. OCCs represent the probabilities of selecting each response option as a function of the estimated trait level. Item parameters were estimated via marginal maximum likelihood [32,33]. The location parameter describes the difficulty of the task being asked about in the item and represents a "center of discrimination" [34]. For example, an item asking about difficulty walking a few steps will have a lower location than an item asking about walking 5 miles; the former item is intended to discriminate between low and very low trait levels and provides little discriminatory information for subjects at the high end. The location parameters are expressed on the same scale that is used to estimate HRQL scores for each respondent. The slope parameter indicates the degree to which the distribution of response categories varies as the trait level changes [33]. This parameter, in combination with the item-category parameters, describes the ability of an item to discriminate between trait levels. Higher slope indicates better discrimination, while greater spread of category parameters indicates a broader region of discrimination.

Item fit
Item fit depends on the level of agreement between the observed and model-predicted probabilities of selecting each response option by subjects at different levels of the trait. The statistical methodology for testing item fit in the GPCM is not well established [6]. We used graphical methods and a chi-square test similar to that proposed by Muraki [32]. The trait axis was divided into consecutive intervals of 0.2 length and exact chi-square goodness of fit tests were used to compare observed and expected counts on the options within each interval. The exact chi-square statistics were added and compared to a chi-square distribution with degrees of freedom equal to the sum of the individual interval-specific degrees of freedom. Use of exact tests instead of asymptotic tests ensured that fit statistics were robust to small cell sizes. P-values were adjusted for multiple comparisons by a modified Bonferroni method [35].

Each item failing the fit test was treated either by merging options together or by placing it in a different "block". All items in a block share the same category parameters and selecting the most appropriate block for a given item tends to improve item fit [32]. The appropriate treatment for an item was decided on with the aid of fit graphs that compared observed versus model-predicted counts for each option across several broad intervals of the trait. This procedure was iterative, with OCCs for all items re-estimated after each iteration of treatment. Item fit graphs were plotted with SAS.
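The interval-based comparison of observed and expected counts described above can be sketched in Python. This is a minimal illustration, not the authors' procedure: it sums ordinary (asymptotic) Pearson chi-square terms, whereas the study used exact goodness-of-fit tests, and it accumulates expected counts from each respondent's estimated trait level, which is an assumption about a detail the text does not specify. The names `fit_statistic` and `model_probs` are hypothetical.

```python
import math
from collections import defaultdict


def fit_statistic(thetas, responses, n_categories, model_probs, width=0.2):
    """Sum interval-specific Pearson chi-square terms for one item.

    thetas       -- estimated trait level of each respondent
    responses    -- chosen option of each respondent, coded 0..n_categories-1
    model_probs  -- callable: theta -> list of n_categories probabilities
                    (hypothetical stand-in for the fitted item model)
    width        -- length of each trait interval (0.2 in the text)

    Returns (chi_square, degrees_of_freedom).
    """
    # Group respondents into consecutive trait intervals of the given width.
    intervals = defaultdict(list)
    for theta, resp in zip(thetas, responses):
        intervals[math.floor(theta / width)].append((theta, resp))

    chi_square, dof = 0.0, 0
    for members in intervals.values():
        observed = [0] * n_categories
        expected = [0.0] * n_categories
        for theta, resp in members:
            observed[resp] += 1
            for k, p in enumerate(model_probs(theta)):
                expected[k] += p
        # Ignore options the model says cannot occur in this interval.
        cells = [(o, e) for o, e in zip(observed, expected) if e > 0]
        chi_square += sum((o - e) ** 2 / e for o, e in cells)
        dof += max(len(cells) - 1, 0)
    return chi_square, dof
```

The summed statistic would then be referred to a chi-square distribution with the returned degrees of freedom, and the resulting p-values adjusted for multiple comparisons across items.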
Items were considered for deletion if they did not fit the IRT model despite these modifications.

Differential item functioning
Differential item functioning (DIF) exists when responses to an item differ systematically across groups of respondents, e.g., males vs. females, that have similar values of the trait being measured [36]. DIF was examined with ordinal logistic regression [37]. We tested the effect of age and sex on the ordinal response to a given item while controlling for the IRT-based estimate of the trait. We also fit a model in which the estimate of the trait was the only independent variable. Statistical significance was determined based on the p-values for age or sex. The magnitude of DIF was measured by change in the Nagelkerke maximal rescaled R-square (delta-R-square) between models with and without age or sex [38]. Substantial DIF was defined as delta-R-square ≥0.02 [5,6]. Items that had both statistically significant and substantial DIF were considered for deletion. We also assessed other statistical properties of the items as well as their conceptual contribution to their respective domains.

Item and test information
Item parameters provided by Parscale were used to obtain item information functions as well as the overall test information function for each domain using the formula given by Muraki [32]. Psychometric information can be thought of as a measure of discrimination (or precision of estimation) at a given point along the trait spectrum, and depends on both the slope and category parameters [32]. A higher value of the information function indicates that the trait is estimated more precisely. Item information tends to be highest at high/low trait levels for items with high/low location parameters. Domain-specific test information at a given trait level is the sum of the information for all items in the domain [34].
Item and test information curves were plotted with SAS.

Results
Study sample
For the purpose of item analysis we selected subjects with arthritis, as this was the main target population for our instrument. We received 331 questionnaires from patients in the rheumatology clinic, 340 from patients on the orthopedic waiting list, and 217 from respondents with RA or OA in the community sample. These 3 groups formed our analysis dataset (N = 888). The overall response rate among eligible subjects in the two clinical samples was 80%. The response rate in the community sample was 33%. Key characteristics of the respondents are presented in Table 2. The proportion aged 65 years or older differed significantly between the samples and ranged from 25% in the rheumatology clinic to 50% in the community sample. The majority of respondents were female in all three samples, but the proportion of females was highest among patients from the rheumatology clinic. About one third of the participants in all three samples had college/university education. Co-morbid conditions were fairly common across the three samples and their frequencies were likely influenced by the age and sex distributions. For example, 32–47% reported back pain, 9–15% reported heart disease, 5–8% reported diabetes, and 9–11% reported depression. The proportion reporting fair or poor self-rated health was highest among patients from the rheumatology clinic (55%) and was similar among those on the orthopedic waiting list (26%) and in the community sample (29%).

Dimensionality
The results of factor analysis for each domain are presented in Table 3. The first factor explained between 58.3% (Feelings) and 72.3% (Walking) of the variance. All items loaded ≥0.4 on a single factor. RMS residual correlations were ≥0.1 for 4 items in the Daily Activities domain, 3 in the Walking domain, 11 in the Handling Objects domain, 7 in the Pain or Discomfort domain and 8 in the Feelings domain (Table 3). Most of these items had RMS residual correlations <0.12 and loadings >0.7 on a single factor; the largest RMS residual correlation was 0.15 (Item 16 in Pain or Discomfort). High RMS residual correlations were almost invariably associated with a highly skewed response distribution. The scree plots suggested a single factor for all domains, although we noted a slight indication of possible additional factors in the Handling Objects and Feelings domains (data not shown). To explore this further, we performed several additional factor analyses allowing for more than 1 factor and reviewed the content of the items loading on different factors to determine if these factors could represent distinct conceptual facets of a given domain. We also applied graphical methods of analysis, whereby the items were ordered by location and displayed as a series of lines, using different colors for items loading on different factors. These graphs showed very little overlap between the factors (data not shown). Both graphical analysis and content review indicated that these potential factors were related to item difficulty rather than content. When we considered all the results of factor analyses, the five domains were deemed sufficiently unidimensional for IRT modeling and no items were dropped at this stage.

Item fit
Examples of item fit plots are presented in Figure 1. These plots compare the observed probabilities of choosing specific response options (solid lines) with the probabilities estimated from the model (dashed lines). For each domain except Daily Activities we show examples of a well-fitting item and an item that was deleted due to lack of fit. Items that do not fit the model well (right column) display greater discrepancies between the observed and predicted OCCs.
In Daily Activities (Figure 1a–b), where no items were dropped due to lack of fit, we show two items that differ in difficulty. In Walking, Item 33 (good fit) asks about difficulty running or jogging 2 miles and Item 35 (poor fit) asks about standing on one's toes. In Feelings, Item 40 (good fit) asks about frequency of suicidal thoughts and Item 44 (poor fit) asks about feeling totally relaxed. Note that Item 40 was collapsed to just 2 options to achieve a good fit. Based on Bonferroni-corrected p-values ≤ 0.05, 0 items in the Daily Activities domain, 5 in Walking, 6 in Handling Objects, 2 in Pain or Discomfort and 3 in Feelings did not fit the IRT model (Table 4). These items were deleted after inspecting the item parameters and considering their contributions to their respective domains in terms of content validity and information.

Differential item functioning
Items that showed statistically significant and substantial DIF for age and/or sex are listed in Table 4. For example, Items 40, 41 and 42 in the Daily Activities domain, all pertaining to traveling, showed DIF for age, whereas Item 37 in the Handling Objects domain (putting hand in a pocket) showed DIF for sex. Item 31 in Feelings, asking about the occurrence of crying spells, also had DIF with respect to sex. Three anxiety-related items in this domain displayed DIF for age (feeling calm and peaceful, worrying about the future, and feeling carefree). All items displaying significant and substantial DIF were dropped after we assessed their statistical properties and their contribution to the content of their respective domains.

Table 2: Characteristics of respondents in the item calibration study (N = 888)

Characteristic | Community sample with arthritis (N = 217) n (%) | Rheumatology clinic (N = 331) n (%) | Joint replacement waiting list (N = 340) n (%)
Age
0–24 | 0 (0.0) | 5 (1.5) | 4 (1.2)
25–44 | 23 (10.6) | 71 (21.5) | 28 (8.2)
45–64 | 83 (38.2) | 160 (48.3) | 133 (39.1)
65+ | 108 (49.8) | 84 (25.4) | 166 (48.8)
Missing | 3 (1.4) | 11 (3.3) | 9 (2.7)
Sex
Males | 69 (31.8) | 69 (20.9) | 150 (44.1)
Females | 147 (67.7) | 250 (75.5) | 179 (52.7)
Missing | 1 (0.5) | 12 (3.6) | 11 (3.2)
Education
Less than high school | 50 (23.0) | 61 (18.4) | 81 (23.8)
High school completed | 94 (43.3) | 146 (44.1) | 133 (39.1)
Trade school/college/university | 72 (33.2) | 113 (34.1) | 115 (33.8)
Missing | 1 (0.5) | 11 (3.3) | 11 (3.2)
Self-reported co-morbidity (1)
Back pain | 101 (46.5) | 106 (32.0) | 135 (39.7)
Heart disease | 33 (15.2) | 30 (9.1) | 38 (11.2)
Lung disease | 10 (4.6) | 35 (10.6) | 5 (1.5)
Diabetes | 15 (6.9) | 26 (7.9) | 17 (5.0)
Depression | 20 (9.2) | 36 (10.9) | 30 (8.8)
Other (2) | 147 (67.7) | 237 (71.6) | 227 (66.8)
Self-rated health
Excellent | 8 (3.7) | 5 (1.5) | 26 (7.7)
Very good | 57 (26.3) | 40 (12.1) | 98 (28.8)
Good | 87 (40.1) | 103 (31.1) | 126 (37.1)
Fair | 52 (24.0) | 135 (40.8) | 77 (22.7)
Poor | 12 (5.5) | 46 (13.9) | 12 (3.5)
Missing | 1 (0.5) | 2 (0.6) | 1 (0.3)

(1) Cell counts do not sum to the sample size because a person may have multiple co-morbidities.
(2) Includes: high blood pressure; ulcer or stomach disease; kidney disease; liver disease; anemia or blood disease; cancer; other medical problem (excluding OA and RA).

Figure 1: Examples of item fit plots.
The solid lines indicate the observed probability of response and dashed lines indicate the estimated probability of response. Items dropped due to lack of fit are marked with an asterisk. a) Daily Activities: Item 35 (eating meals). b) Daily Activities: Item 38 (getting in/out of a car). c) Walking: Item 33 (running/jogging 2 miles). d) Walking: Item 35* (standing on toes). e) Handling Objects: Item 44 (grocery bag). f) Handling Objects: Item 49* (light furniture). g) Pain or Discomfort: Item 3 (prevents activities). h) Pain or Discomfort: Item 19* (perfectly healthy). i) Feelings: Item 40 (thinking killing self). j) Feelings: Item 44* (totally relaxed).
[Panels a–j plot probability (vertical axis) against trait (horizontal axis).]
Table 3: Factor loadings and root mean square residual correlations (RMS) for all items

Item  Daily Activities  Walking         Handling Objects  Pain or Discomfort  Feelings
      Loading  RMS      Loading  RMS    Loading  RMS      Loading  RMS        Loading  RMS
1     0.894    0.038    0.900    0.049  0.823    0.069    0.854    0.084      0.837    0.055
2     0.864    0.052    0.890    0.047  0.832    0.080    0.864    0.073      0.849    0.052
3     0.913    0.052    0.841    0.038  0.855    0.076    0.888    0.046      0.840    0.063
4     0.842    0.063    0.904    0.038  0.867    0.057    0.885    0.051      0.841    0.062
5     0.911    0.037    0.917    0.029  0.819    0.065    0.791    0.091      0.761    0.063
6     0.888    0.028    0.931    0.037  0.848    0.047    0.853    0.065      0.711    0.083
7     0.752    0.097    0.877    0.063  0.891    0.063    0.844    0.068      0.778    0.078
8     0.869    0.075    0.922    0.063  0.830    0.087    0.461    0.098      0.740    0.075
9     0.903    0.055    0.913    0.036  0.863    0.051    0.765    0.055      0.770    0.087
10    0.771    0.076    0.881    0.042  0.890    0.036    0.867    0.044      0.788    0.086
11    0.857    0.064    0.919    0.043  0.905    0.062    0.768    0.102      0.807    0.073
12    0.842    0.074    0.934    0.039  0.773    0.125    0.826    0.091      0.851    0.061
13    0.910    0.059    0.747    0.095  0.904    0.026    0.748    0.083      0.704    0.101
14    0.868    0.072    0.909    0.048  0.906    0.043    0.834    0.047      0.700    0.090
15    0.810    0.058    0.894    0.041  0.870    0.056    0.742    0.131      0.541    0.088
16    0.772    0.091    0.877    0.052  0.889    0.052    0.691    0.150      0.740    0.102
17    0.875    0.049    0.898    0.034  0.767    0.071    0.791    0.087      0.687    0.116
18    0.897    0.038    0.886    0.049  0.883    0.062    0.736    0.126      0.768    0.082
19    0.891    0.036    0.842    0.061  0.887    0.044    0.703    0.094      0.824    0.057
20    0.881    0.062    0.848    0.076  0.759    0.132    0.760    0.099      0.804    0.079
21    0.748    0.059    0.907    0.076  0.890    0.047    0.863    0.041      0.890    0.039
22    0.892    0.061    0.694    0.067  0.629    0.125    0.765    0.103      0.705    0.063
23    0.875    0.056    0.434    0.093  0.891    0.044    0.848    0.059      0.756    0.076
24    0.875    0.064    0.845    0.097  0.863    0.054    0.823    0.070      0.839    0.042
25    0.801    0.058    0.672    0.101  0.718    0.135    0.808    0.076      0.673    0.065
26    0.888    0.064    0.835    0.060  0.862    0.053    0.550    0.087      0.616    0.129
27    0.777    0.095    0.926    0.046  0.849    0.078    0.668    0.116      0.707    0.083
28    0.870    0.066    0.707    0.092  0.854    0.066    0.877    0.040      0.653    0.127
29    0.848    0.068    0.790    0.076  0.662    0.112    0.721    0.083      0.707    0.060
30    0.860    0.076    0.755    0.114  0.762    0.134    0.853    0.074      0.750    0.072
31    0.785    0.098    0.893    0.034  0.865    0.061    0.813    0.056      0.684    0.082
32    0.772    0.106    0.898    0.023  0.898    0.046    0.741    0.099      0.787    0.078
33    0.818    0.109    0.865    0.093  0.881    0.045    0.713    0.091      0.787    0.093
34    0.775    0.113    0.923    0.041  0.762    0.093    0.891    0.054      0.656    0.112
35    0.795    0.112    0.724    0.064  0.901    0.035    0.885    0.064      0.895    0.040
36    0.806    0.064    0.739    0.107  0.870    0.052    0.691    0.117      0.710    0.085
37    0.865    0.060    0.853    0.056  0.867    0.051    0.753    0.066      0.766    0.081
38    0.815    0.046    0.898    0.033  0.873    0.059    0.807    0.063      0.814    0.073
39    0.853    0.073                    0.772    0.081    0.832    0.084      0.853    0.070
40    0.841    0.073                    0.630    0.129                        0.719    0.108
41    0.840    0.069                    0.836    0.067                        0.785    0.108
42    0.839    0.064                    0.592    0.125                        0.757    0.049
43    0.804    0.097                    0.913    0.047                        0.780    0.099
44                                      0.801    0.107                        0.818    0.070
45                                      0.836    0.044                        0.755    0.099
46                                      0.858    0.049
47                                      0.871    0.049
48                                      0.824    0.067
49                                      0.754    0.111
50                                      0.748    0.121
51                                      0.809    0.066
52                                      0.889    0.032
53                                      0.854    0.061
54                                      0.842    0.045

Properties of final items

The numbers of items that were retained/eliminated in each of the five domains were as follows: Daily Activities 39/4, Walking 31/7, Handling Objects 45/9, Pain or Discomfort 36/3, and Feelings 37/8. Thus, the total number of items in the final domains was 188. Unadjusted p-values from the item fit chi-square test for these items are presented in Table 5, with lower p-values indicating worse item fit. Unadjusted p-values ≤ 0.01 were observed for 3 items in Daily Activities, 0 in Walking and 1 in each of the remaining domains.
While many items displayed statistically significant DIF, the magnitude of DIF, as measured by change in the Nagelkerke maximal rescaled R-square, was generally small (data not shown).

Descriptive statistics for the distribution of item parameters are shown in Table 6. The range of the location parameter was large compared to a standard normal distribution in all five domains, indicating that the items covered a wide range of the construct measured. The Daily Activities domain had the narrowest range (-2.24 to 0.55) and the Handling Objects domain had the widest range (-1.70 to 2.27). Items in the Walking domain tended to have the highest slopes, although most items in all five domains had slopes greater than 1.0. The mean (median) slope ranged from 1.15 (1.07) in Feelings to 1.73 (1.75) in Walking (Table 6) (see also the Appendix [Additional file 1]).

Table 4: Deleted items by domain and reason for deletion

Item Number  Item content  Lack of fit  DIF for sex  DIF for age

Daily Activities
27  Need for help with grooming  X
40  Difficulty traveling around the town or city  X
41  Difficulty traveling between cities  X
42  Difficulty traveling overseas  X

Walking
3   Limitations in walking or climbing stairs  X
4   Usual ability to walk  X
23  Difficulty moving toes  X
28  Difficulty lifting one foot off the ground  X
30  Difficulty getting in and out of bed  X
35  Difficulty standing on toes  X
38  Description of ability to walk  X

Handling Objects
5   Difficulty scratching the lower back  X
22  Difficulty putting on shoes, socks, or stockings  X
24  Difficulty cutting fingernails  X
29  Difficulty cutting toenails  X
34  Difficulty making bed  X
37  Difficulty putting a hand in a pocket  X
38  Difficulty wiping mouth with a napkin  X
40  Difficulty picking up clothing from the floor  X
49  Difficulty lifting and moving light furniture  X

Pain or Discomfort
8   Time free from any physical complaints  X
19  Feeling perfectly healthy  X
26  Having minor pains and aches  X

Feelings
8   Feeling tense or "high strung"  X
15  Losing temper  X
19  Feeling calm and peaceful  X
29  Worrying about the future  X
31  Having crying spells  X
41  Planning to commit suicide*
44  Feeling totally relaxed and free of tension  X
45  Feeling carefree  X

*Deleted due to insufficient cell sizes
DIF = Differential item functioning

Examples of OCCs are given in Figure 2. For each domain we show 2 items that differ in location, slope or both. For example, the OCCs for Item 23 in Handling Objects (difficulty brushing teeth) are shifted to the left, with the probability of selecting option 1 (no difficulty) reaching almost 100% for estimated trait scores greater than 0. The OCCs are fairly steep, consistent with a relatively high slope for this item. The item provides information for lower levels of health in the Handling Objects domain. By contrast, Item 50 (difficulty lifting and moving heavy furniture) is more informative at the higher end of the trait spectrum where item 23 is virtually non-informative. The OCCs for this item are less steep and the slope is lower.

Finally, overall test (domain) information functions, describing the amount of psychometric information for each domain according to trait level, are shown in Figure 3. As one would expect, the curves are mound-shaped, indicating that information is not evenly distributed. Also, the curves are shifted slightly to the left, towards lower levels of health, especially for Handling Objects. Nevertheless, these curves show that information is available for a wide range of functional levels in each domain. Since information is related to discrimination, the domain-specific scores should be able to discriminate between different levels of HRQL among relatively healthy people as well as among those with severe health problems.

Table 5: Unadjusted p-values from item fit chi-square tests by domain

Item No.  Daily Activities  Walking  Handling Objects  Pain or Discomfort  Feelings
1         0.471             0.500    0.063             0.708               0.029
2         0.884             0.036    0.046             0.744               0.850
3         0.864             -        0.201             0.964               0.976
4         0.398             -        0.155             0.023               0.916
5         0.141             0.988    -                 0.082               0.034
6         0.011             0.832    0.889             0.096               0.045
7         0.698             0.962    0.945             0.888               0.022
8         0.295             0.870    0.023             -                   -
9         0.018             0.945    0.405             0.145               0.073
10        0.197             0.579    0.646             0.806               0.997
11        0.986             0.336    1.000             0.521               0.023
12        0.776             0.990    0.017             0.512               0.610
13        0.998             0.549    0.146             0.235               0.415
14        0.202             0.985    0.775             0.806               0.626
15        0.713             0.125    0.253             0.190               -
16        0.566             0.831    1.000             1.000               0.397
17        0.549             0.560    0.125             0.111               0.465
18        0.534             0.447    0.582             0.995               0.120
19        0.204             0.697    1.000             -                   -
20        0.509             0.039    0.096             0.027               0.804
21        0.842             0.808    0.488             0.266               0.057
22        0.121             0.898    -                 0.579               0.025
23        0.696             -        1.000             0.057               0.108
24        0.002*            0.458    -                 0.353               0.989
25        0.513             0.045    0.011             0.005*              0.330
26        0.001*            0.364    0.992             -                   0.018
27        -                 0.752    0.329             0.880               0.316
28        0.006*            -        0.732             0.310               0.420
29        0.366             0.100    -                 0.145               -
30        0.723             -        0.009*            0.377               0.244
31        0.741             0.234    0.513             0.115               -
32        0.977             0.330    0.152             0.097               0.044
33        0.010             0.999    0.239             0.259               0.890
34        0.207             0.511    -                 0.058               0.663
35        0.997             -        0.275             0.356               0.676
36        0.140             0.658    0.887             0.731               0.331
37        0.751             0.198    -                 0.021               0.211
38        0.466             -        -                 0.887               0.061
39        0.689                      0.680             0.070               1.000
40        -                          -                                     0.976
41        -                          0.414                                 -
42        -                          0.054                                 0.002*
43        0.040                      1.000                                 0.119
44                                   0.233                                 -
45                                   0.494                                 -
46                                   0.246
47                                   0.068
48                                   0.938
49                                   -
50                                   0.173
51                                   0.479
52                                   0.995
53                                   0.846
54                                   0.988

A dash indicates that the item has been deleted. P-values ≤ 0.01 are marked by an asterisk.

Discussion

This article describes the development of item banks for five domains of HRQL relevant to arthritis and related conditions. The items were pre-tested and revised before the calibration study. Both conceptually and factor analytically the domains were unidimensional. The items were calibrated on a large sample of people with arthritis. We dropped 31 out of 219 items, either because of lack of fit, substantial DIF in relation to sex or age, or because of an extremely skewed distribution (one item).
The final item banks comprise 31–45 items each and appear appropriate for the application of computerized adaptive testing (CAT), though additional analyses will be required to evaluate their performance under CAT conditions.

Although the principles of item banking are fairly well established in the context of educational testing [39], their application to health assessment is a relatively new area of research. For valid application of IRT, the items should measure a single concept, fit the chosen IRT model, and not function differently across groups [39]. However, there is no consensus on the best methods and criteria for assessing dimensionality, model fit, and DIF. Furthermore, while dropping items that do not meet strict IRT criteria should improve validity of the scores, it may also reduce information, especially for extreme levels of the trait.

Unidimensionality can be assessed both statistically and conceptually. In all our domains, items with RMS residual correlations ≥0.1 tended to be very easy or very difficult. For example, Item 16 in Pain or Discomfort, which had the highest RMS residual correlation, asked how often pain prevented use of the toilet. Our analyses suggest that any additional "dimensions" in factor analysis were likely a statistical artifact related to item location rather than item content, a phenomenon well known in the literature [34,40]. Conceptually, Daily Activities could be considered a multi-dimensional domain, as it addresses limitations in self-care, work, recreation, and social activities. However, in our sample of persons with arthritis, all items in this domain loaded highly on a single factor and the scree plot suggested a single factor. The Handling Objects domain has items assessing hand function as well as arm and upper body function.
Interestingly, several misfitting items in this domain asked about activities typically affected by back problems, for example, putting on shoes, making bed, or picking up clothing from the floor.

In the Feelings domain, we included items assessing both depression and anxiety. While a mixture of depression and anxiety items is not uncommon in scales measuring emotional function [15,41,42], separate scales for these two related concepts have been developed [43-45]. Two of the items that were dropped due to lack of fit assessed anxiety (feeling tense and feeling totally relaxed). Additional anxiety items were dropped because of DIF. Thus the final domain is more strongly oriented toward depression than anxiety.

Table 6: Descriptive statistics for the location and slope parameters for the final items

Statistic  Daily Activities  Walking           Handling Objects  Pain or Discomfort  Feelings
           Location  Slope   Location  Slope   Location  Slope   Location  Slope     Location  Slope
Mean       -0.664    1.610   -0.097    1.729   -0.774    1.354   -0.497    1.294     -0.573    1.154
SD         0.805     0.463   0.859     0.550   0.897     0.409   0.834     0.544     0.819     0.390
Min.       -2.237    0.970   -1.274    0.611   -1.698    0.662   -1.896    0.649     -2.349    0.560
25%        -1.434    1.264   -0.708    1.346   -1.341    1.036   -1.169    0.859     -1.186    0.863
Median     -0.511    1.466   -0.380    1.747   -1.180    1.278   -0.352    1.102     -0.639    1.071
75%        0.014     1.947   0.456     2.145   -0.451    1.707   0.071     1.627     0.211     1.434
Max.       0.546     3.156   1.958     2.695   2.272     2.104   1.637     2.546     1.011     1.989

Figure 2: Examples of option characteristic curves (2 items per domain). a) Daily Activities: Item 29 (bathing). b) Daily Activities: Item 37 (heavy chores). c) Walking: Item 11 (walking 20 yards). d) Walking: Item 18 (on feet for 4 hours). e) Handling Objects: Item 23 (brushing teeth). f) Handling Objects: Item 50 (heavy furniture). g) Pain or Discomfort: Item 18 (grooming). h) Pain or Discomfort: Item 4 (normal work). i) Feelings: Item 33 (complete failure). j) Feelings: Item 26 (elated or overjoyed).
[Panels a–j plot probability (vertical axis) against trait (horizontal axis).]

The question of how best to assess item fit for polytomous IRT models is not yet resolved [6,46]. Graphical methods have been advocated in addition to formal statistical tests [47]. It has also been demonstrated that minor deviations from a perfect fit have very little effect on the scores [48]. In our study, there was generally good agreement between the plots and the chi-square test of fit; apparent discrepancies were usually related to small samples in certain intervals of the trait. Because we performed multiple tests, some very low p-values would be expected by chance. Some authors have used a p-value > 0.01 as a cut-off for acceptable item fit [5]. Our correction for multiple comparisons led to 6 items with unadjusted p-values ≤ 0.01 being retained. We believe this is acceptable, especially in a study with a large sample size such as this one.
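The effect of a Bonferroni correction on individual p-values can be illustrated with a short sketch. It assumes that the number of tests equals the number of items in the domain, which is an assumption made here for illustration and not a detail confirmed in the text; the function name is hypothetical. The unadjusted p-values are those of three retained items in Table 5.

```python
def bonferroni_adjust(p_values, n_tests):
    """Multiply each p-value by the number of tests, capped at 1."""
    return [min(1.0, p * n_tests) for p in p_values]


# (domain, item, unadjusted p-value from Table 5, assumed number of tests)
# Assumption: one test per item in the domain (54, 39 and 45 items).
examples = [
    ("Handling Objects", 30, 0.009, 54),
    ("Pain or Discomfort", 25, 0.005, 39),
    ("Feelings", 42, 0.002, 45),
]

ALPHA = 0.05  # threshold applied to the corrected p-values

for domain, item, p_value, n_tests in examples:
    adjusted = bonferroni_adjust([p_value], n_tests)[0]
    decision = "misfit" if adjusted <= ALPHA else "retained"
    print(f"{domain} item {item}: adjusted p = {adjusted:.3f} ({decision})")
```

Under this assumption each of the three items stays above the 0.05 threshold after correction, which is consistent with their retention despite unadjusted p-values ≤ 0.01.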
The level of misfit for any item in the final domains seems small and most likely has little effect on the scores.

Various approaches have been employed for measuring DIF and treating items that display significant or substantial DIF. We considered DIF to be important if it was both statistically significant and substantial, as suggested in the literature [5,6]. We assessed DIF with regard to age and sex, as these two variables are fundamental to almost any analysis of HRQL. Had we studied DIF for other variables, for example, education, income, ethnicity, type of arthritis, or co-morbidity, we would have undoubtedly found more items that functioned differentially. While DIF is sometimes considered a form of bias and indicates that responses to an item are systematically affected by factors other than the trait being measured, few items are totally free from such influences [34]. More research is needed on the effect of DIF on the validity of the scores and the most appropriate treatment of items that display DIF [49].

Some authors have used item discrimination as an additional criterion in item selection [6]. In our data, very difficult and very easy items tended to have low slopes and relatively flat information curves. However, such items were informative for extremely high or low levels of function and helped minimize floor and ceiling effects.

It has been suggested that an ideal item bank should have a "rectangular" distribution of the location parameter [39, p. 42]. Our initial distribution of location in all domains was mound-shaped and, to a varying degree, skewed and/or irregular, with areas of high and low density. A series of preliminary analyses revealed that in order to achieve a flat distribution, we would have to sacrifice a large number of items, including some highly informative and conceptually relevant items. A rectangular distribution may be achievable when one has a very large number of items to choose from at all levels of HRQL. In health assessment, such item pools are not available at this time. Besides, a rectangular distribution may be more important for dichotomous items than for the ordered categorical items used in this study.

Figure 3: Test information functions for the five domains of HRQL. a) Daily Activities. b) Walking. c) Handling Objects. d) Pain or Discomfort. e) Feelings.
[Panels a–e plot information (vertical axis) against trait (horizontal axis).]

Conclusion

The main reason for developing item banks is to apply CAT. Advantages of this technology in terms of bias, especially for high and low levels of the trait measured (reduced floor/ceiling effects), and efficiency (increased information per item), have been demonstrated both theoretically and empirically [34]. Thus, when a questionnaire is administered on a computer and a validated item bank is available, there seems to be little justification for using a conventional "fixed" questionnaire with similar items. Nevertheless, it may not be easy to convince the users of HRQL instruments to abandon well-established conventional measures. In arthritis, instruments such as the Health Assessment Questionnaire (HAQ) [50], Arthritis Impact Measurement Scales (AIMS) [42] or Western Ontario and McMaster Index (WOMAC) [51] have a long history of applications. Clinicians and researchers are familiar with those instruments and feel comfortable with their content.
Also, the user can relatively easily calculate the scores. With adaptive testing the user does not see all the questions in the item bank and must rely, for both questionnaire administration and scoring, on a complex computer program provided by the item bank developer. For these reasons, it seems that CAT is unlikely to completely replace conventional assessments of HRQL in the foreseeable future [52]. Wider use of the adaptive measurement system we have developed will be facilitated by a demonstration of superior psychometric properties, such as validity, reliability and responsiveness, as well as superior measurement efficiency, in head-to-head comparisons with conventional instruments.

Competing interests

The author(s) declare that they have no competing interests.

Authors' contributions

JAK conceived the study, supervised data collection and analysis, and led the drafting of the manuscript. ECS performed data analyses and drafted parts of the manuscript. AMD co-designed the study and participated in drafting the manuscript. EMB co-designed the study and participated in drafting the manuscript. MA helped to design the study and draft the manuscript. LS coordinated the study, collected the data, and drafted parts of the manuscript. JIW participated in designing the study and helped to draft the manuscript. AHA helped to design parts of the study and draft the manuscript. JME provided general support for this research, oversaw its clinical aspects, and participated in drafting the manuscript.

Additional material

Additional File 1: Kopec additional. Appendix: Abbreviated content, location, and slope parameters of the items in the five domains of HRQL. The items are ordered by location parameter; item numbers relate to order in the original questionnaire. Missing location and slope parameters indicate deleted items. [http://www.biomedcentral.com/content/supplementary/1477-7525-4-33-S1.doc]

Acknowledgements

This research was supported by a grant from the Canadian Arthritis Network (CAN). JA Kopec is a Michael Smith Foundation for Health Research Senior Scholar. EC Sayre and L Sherlock were supported by the research grant from CAN. AM Davis held an Investigator Award from the Canadian Institutes of Health Research. EM Badley is a Professor at the University of Toronto. JI Williams was Vice-President Research at the Toronto Rehabilitation Institute. M Abrahamowicz is a James McGill Professor at McGill University. AH Anis and JM Esdaile are Professors at the University of British Columbia. We thank Drs. D. Garbuz, N. Greidanus and A. Chalmers for assistance in obtaining access to their patients. The funding agency (CAN) played no role in the design of the study, collection, analysis, and interpretation of data, writing of the manuscript, or decision to submit the manuscript for publication.

References

1. McHorney CA: Generic health measurement: past accomplishments and a measurement paradigm for the 21st century. Ann Intern Med 1997, 15:743-750.
2. Hays RD, Morales LS, Reise SP: Item response theory and health outcomes measurement in the 21st century. Med Care 2000:II28-42.
3. Fries JF, Bruce B, Cella D: The promise of PROMIS: using item response theory to improve assessment of patient-reported outcomes. Clin Exp Rheumatol 2005, 23:S53-57.
4. Ware JE Jr, Kosinski M, Bjorner JB, Bayliss MS, Batenhorst A, Dahlof CG, Tepper S, Dowson A: Applications of computerized adaptive testing (CAT) to the assessment of headache impact. Qual Life Res 2003, 12:935-952.
5. Bjorner JB, Kosinski M, Ware JE Jr: Calibration of an item pool for assessing the burden of headaches: an application of item response theory to the headache impact test (HIT). Qual Life Res 2003, 12:913-933.
6. Fliege H, Becker J, Walter OB, Bjorner JB, Klapp BF, Rose M: Development of a Computer-adaptive Test for Depression (D-CAT). Qual Life Res 2005, 14:2277-2291.
7. Becker J, Walter OB, Fliege H, Bjorner JB, Kocalevent RD, Klapp BF, Rose M: Validating the German computerized adaptive test for anxiety [abstract]. Qual Life Res 2004, 13:1515.
8. Kocalevent RD, Walter OB, Becker J, Fliege H, Klapp BF, Bjorner JB, Rose M: Perceived stress – measured by a computerized adaptive test (STRESS-CAT) [abstract]. Qual Life Res 2004, 13:1515.
9. Lai JS, Cella D, Dineen K, Bode R, Von Roenn J, Gershon RC, Shevrin D: An item bank was created to improve the measurement of cancer-related fatigue. J Clin Epidemiol 2005, 58:190-197.
10. Lai JS, Dineen K, Reeve BB, Von Roenn J, Shervin D, McGuire M, Bode RK, Paice J, Cella D: An item response theory-based pain item bank can enhance measurement precision. J Pain Symptom Manage 2005, 30:278-288.
11. Bode RK, Lai JS, Dineen K, Heinemann AW, Shevrin D, Von Roenn J, Cella D: Expansion of a physical function item bank and development of an abbreviated form for clinical research. J Appl Meas 2006, 7:1-15.
12. Preamble to the Constitution of the World Health Organization as adopted by the International Health Conference, New York, 19–22 June, 1946; signed on 22 July 1946 by the representatives of 61 States (Official Records of the World Health Organization, no. 2, p. 100) and entered into force on 7 April 1948.
13. Ware JE: The status of health assessment. Annu Rev Public Health 1995, 16:327-354.
14. Cella DF: Quality of life: concepts and definition. J Pain Symptom Manage 1994, 9:186-192.
15. Ware JE, Sherbourne CD: The MOS 36-item Short-Form Health Survey (SF-36). Med Care 1992, 30:473-483.
16. Furlong WJ, Feeny DH, Torrance GW, Barr RD: The Health Utilities Index (HUI) system for assessing health-related quality of life in clinical studies. Ann Med 2001, 33:375-384.
17. EQ-5D The EuroQol Group: EuroQol – a new facility for the measurement of health-related quality of life. Health Policy 1990, 16:199-208.
18. Ustun TB, Chatterji S, Bickenbach J, Kostanjsek N, Schneider M: The International Classification of Functioning, Disability and Health: a new tool for understanding disability and health. Disabil Rehabil 2003, 25:565-571.
19. World Health Organization: The International Classification of Functioning, Disability and Health. [http://www3.who.int/icf/icftemplate.cfm]. Accessed on May 9, 2006.
20. Spilker B, Molinek FR Jr, Johnston KA, Simpson RL Jr, Tilson HH: Quality of life bibliography and indexes. Med Care 1990, 28(12 Suppl):DS1-77.
21. Spilker B, Ed: Quality of Life and Pharmacoeconomics in Clinical Trials. 2nd edition. New York: Lippincott-Raven; 1996.
22. MacDowell I, Newell C: Measuring Health: a guide to rating scales and questionnaires. 2nd edition. New York: Oxford University Press; 1996.
23. Bowling A: Measuring disease: a review of disease-specific quality of life measurement scales. Philadelphia: Open University Press; 1995.
24. Mason JH, Anderson JJ, Meenan RF: A model of health status for rheumatoid arthritis. A factor analysis of the Arthritis Impact Measurement Scales. Arthritis Rheum 1988, 31:714-720.
25. Streiner DL, Norman GR: Health Measurement Scales. A Practical Guide to Their Development and Use. 2nd edition. Oxford: Oxford University Press; 1995.
26. Nunnally JC, Bernstein IH: Elements of statistical description and estimation. In Psychometric Theory. 3rd edition. Edited by: Nunnally JC, Bernstein IH. New York: McGraw-Hill; 1994:127.
27. Cattell RB: The scree test for the number of factors. Multivariate Behav Res 1966, 1:629-637.
28. Tabachnick DG, Fidell LS: Using multivariate statistics. 3rd edition. New York, New York: Harper Collins; 1996.
29. Rasch G: Probabilistic Models for Some Intelligence and Attainment Tests. Denmark; 1960, University of Chicago Press; 1980.
30. Muraki E: A Generalized Partial Credit Model. In Handbook of Modern Item Response Theory. Edited by: van der Linden WJ, Hambleton RK. New York: Springer; 1996:153-164.
31. Ramsay JO: Testgraf. A Program for the Graphical Analysis of Multiple Choice Test and Questionnaire Data. McGill University, Montreal, Canada; 1995.
32. Muraki E, Bock RD: Parscale: IRT based test scoring and item analysis for graded open-ended exercises and performance tasks. Chicago: Scientific Software Int; 1993.
33. Muraki E: A Generalized Partial Credit Model: application of an EM algorithm. Appl Psych Meas 1992, 16:159-176.
34. Embretson S, Reise SP: Item Response Theory for Psychologists. Mahwah, NJ: Lawrence Erlbaum Associates; 2000.
35. Ryan TA: Multiple comparisons in psychological research. Psychol Bull 1959, 56:26-47.
36. Kelderman H, Macready GB: The use of loglinear models for assessing differential item functioning across manifest and latent examinee groups. J Ed Meas 1990, 27:307-327.
37. Swaminathan H, Rogers HJ: Detecting differential item functioning using logistic regression procedures. J Ed Meas 1990, 27:361-370.
38. Nagelkerke NJD: Miscellanea. A note on a general definition of the coefficient of determination. Biometrika 1991, 78:691-692.
39. Wainer H: Computerized Adaptive Testing: A primer. Hillsdale, NJ: Lawrence Erlbaum Associates; 1990.
40. Caroll JB: The effect of difficulty and chance success on correlations between items or between tests. Psychometrika 1945, 10:1-19.
41. Cella DF, Tulsky DS, Gray G, Sarafian B, Linn E, Bonomi A, Silberman M, Yellen SB, Winicour P, Brannon J, Eckberg K, Lloyd S, Purl S, Blendowski C, Goodman M, Barnicle M, Stewart I, McHale M, Bonomi P, Kaplan E, Taylor S IV, Thomas CR, Harris J: The functional assessment of Cancer Therapy Scale: development and validation of the general measure. J Clin Oncol 1993, 11:570-579.
42. Meenan RF, Gertman PM, Mason JH: Measuring health status in arthritis. The Arthritis Impact Measurement Scales. Arthritis Rheum 1980, 23:146-152.
43. Radloff LS: The CES-D scale: a self-report depression scale for research in the general population. Appl Psych Meas 1977, 1:385-401.
44. Bieling PJ, Antony MM, Swinson RP: The State-Trait Anxiety Inventory, Trait version: structure and content re-examined. Behav Res Ther 1998, 36:777-788.
45. Zigmond AS, Snaith RP: The hospital anxiety and depression scale. Acta Psychiatrica Scandinavica 1983, 67:361-370.
46. Hattie JA: Methodology review: assessing unidimensionality of tests and items. Appl Psych Meas 1985, 9:139-164.
47. Drasgow F, Levine MV, Tsien S, Williams BA, Mead AD: Fitting polytomous item response theory models to multiple-choice tests. Appl Psych Meas 1995, 19:143-165.
48. Tay-Lim BSH, Zhang J, Davis S, Tang C: The impact of item treatments on NEAP reporting scale scores. Paper presented at the Annual Meeting of National Council on Measurement in Education (NCME), Chicago. April 22–24 2003.
49. Crane PK, van Belle G, Larson EB: Test bias in a cognitive test:
50. Fries JF, Spitz P, Kraines RG, Holman HR: Measurement of patient outcome in arthritis. Arthritis Rheum 1980, 23:137-145.
51. Bellamy N, Buchanan WW, Goldsmith CH, Campbell J, Stitt LW: Validation study of WOMAC: a health status instrument for measuring clinically important patient relevant outcomes to antirheumatic drug therapy in patients with osteoarthritis of the hip or knee. J Rheumatol 1988, 15:1833-1840.
52. Sloan J, Wilson M: Item response theory: when is it useful and where does classical test theory fit in? Qual Life Res 2005, 14:1982.
53. Beck AT, Ward CH, Mendelson M, Mock J, Erbaugh J: An inventory for measuring depression. Arch Gen Psychiatry 1961, 4:561-71.
54. Schag CA, Heinrich RL: Development of a comprehensive quality of life measurement tool: CARES. Oncology (Williston Park) 1990, 4:135-8.
55. Ruta DA, Garratt AM, Wardlaw D, Russell IT: Developing a valid and reliable measure of health outcome for patients with low back pain. Spine 1994, 19:1887-96.
56. Hudak PL, Amadio PC, Bombardier C: Development of an upper extremity outcome measure: the DASH (disabilities of the arm, shoulder and hand) [corrected]. The Upper Extremity Collaborative Group (UECG). Am J Ind Med 1996, 29:602-8.
57. Salen BA, Spangfort EV, Nygren AL, Nordemar R: The Disability Rating Index: an instrument for the assessment of disability in clinical settings. J Clin Epidemiol 1994, 47:1423-35.
58. Aaronson NK, Ahmedzai S, Bergman B, Bullinger M, Cull A, Duez NJ, Filiberti A, Flechtner H, Fleishman SB, de Haes JC, Kaasa S, Klee M, Osoba D, Razavi D, Rofe PB, Schraub S, Sneeuw K, Sullivan M, Takeda F: The European Organization for Research and Treatment of Cancer QLQ-C30: a quality-of-life instrument for use in international clinical trials in oncology. J Natl Cancer Inst 1993, 85:365-76.
59. Schipper H, Clinch J, McMurray A, Levitt M: Measuring quality of life of cancer patients: the Functional Living Index-Cancer, development and validation. J Clin Oncol 1984, 2:472-483.
60. Goldberg DP, Hillier VF: A scaled version of the General Health Questionnaire. Psychol Med 1979, 9:139-45.
61. Martin DP, Engelberg R, Agel J, Snapp D, Swiontkowski MF: Development of a musculoskeletal extremity health status instrument: the Musculoskeletal Function Assessment instrument. J Orthop Res 1996, 14:173-81.
62. Tugwell P, Bombardier C, Buchanan WW, Goldsmith CH, Grace E, Hanna B: The MACTAR Patient Preference Disability Questionnaire – an individualized functional priority approach for assessing improvement in physical disability in clinical trials in rheumatoid arthritis. J Rheumatol 1987, 14:446-51.
63. Chambers LW, Macdonald LA, Tugwell P, Buchanan WW, Kraag G: The McMaster Health Index Questionnaire as a measure of quality of life for patients with rheumatoid disease. J Rheumatol 1982, 9:780-4.
64. Chung KC, Pillsbury MS, Walters MR, Hayward RA: Reliability and validity testing of the Michigan Hand Outcomes Questionnaire. J Hand Surg [Am] 1998, 23:575-87.
65.
Pincus T, Swearingen C, Wolfe F: Toward a MultidimensionalHealth Assessment Questionnaire (MDHAQ): assessment ofadvanced activities of daily living and psychological status inthe patient-friendly health assessment questionnaire format.Arthritis Rheum 1999, 42:2220-30.66. Statistics Canada: National Population Health Survey: 1994/5Household Component Questionnaire.   [http://www.statcan.ca/english/concepts/nphs/quest94e.pdf]. Accessed on May 9, 200667. Daltroy LH, Cats-Baril WL, Katz JN, Fossel AH, Liang MH: TheNorth American Spine Society lumbar spine outcomeassessment instrument: reliability and validity tests.  Spine1996, 21:741-9.68. Hunt SM, McKenna SP, McEwen J, Backett EM, Williams J, Papp E: Aquantitative approach to perceived health status: a valida-tion study.  J Epidemiol Community Health 34:281-6.69. Tait RC, Pollard CA, Margolis RB, Duckro PN, Krause SJ: The PainDisability Index: psychometric and validity data.  Arch Phys MedRehabil 1987, 68:438-41.70. McNair DM, Lorr M, Droppleman LF: Manual for the Profile of MoodStates San Diego: Educational and Industrial Testing Service; 1971. 71. Fairbank JC, Couper J, Davies JB, O'Brien JP: The Oswestry lowPage 16 of 17(page number not for citation purposes)differential item functioning in the CASI.  Stat Med 2004,23:241-256.back pain disability questionnaire.  Physiotherapy 1980, 66:271-3.Publish with BioMed Central   and  every scientist can read your work free of charge"BioMed Central will be the most significant development for disseminating the results of biomedical research in our lifetime."Sir Paul Nurse, Cancer Research UKYour research papers will be:available free of charge to the entire biomedical communitypeer reviewed and published immediately upon acceptancecited in PubMed and archived on PubMed Central Health and Quality of Life Outcomes 2006, 4:33 http://www.hqlo.com/content/4/1/3372. 
de Jong Z, van der Heijde D, McKenna SP, Whalley D: The reliabil-ity and construct validity of the RAQoL: a rheumatoid arthri-tis-specific quality of life instrument.  Br J Rheumatol 1997,36:878-83.73. de Haes JC, van Knippenberg FC, Neijt JP: Measuring psychologi-cal and physical distress in cancer patients: structure andapplication of the Rotterdam Symptom Checklist.  Br J Cancer1990, 62:1034-8.74. Bergner M, Bobbitt RA, Carter WB, Gilson BS: The SicknessImpact Profile: development and final revision of a healthstatus measure.  Med Care 1981, 19:787-805.75. Zung W: A Self-Rating Depression scale.  Arch Gen Psychiatry1965, 12:63-70.yours — you keep the copyrightSubmit your manuscript here:http://www.biomedcentral.com/info/publishing_adv.aspBioMedcentralPage 17 of 17(page number not for citation purposes)

