SIMULTANEOUS AND SEQUENTIAL ROC ANALYSES FOR DIAGNOSTIC TESTS

By

RAYMOND RUI FANG

B.Sc., Beijing Broadcasting University, 1983
M.Sc., Beijing Institute of Remote Sensing, 1989

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

in

THE FACULTY OF GRADUATE STUDIES
(Department of Statistics)

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
September 1991
(c) Raymond Rui Fang, 1991

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Statistics
The University of British Columbia
Vancouver, Canada

Abstract

Relative or receiver operating characteristic (ROC) analysis is a simple procedure which can be used to measure the accuracy of diagnostic tests. Diagnostic tests are often used to classify an individual as belonging to one of two populations. Based on statistical decision theory, ROC analysis was first developed to evaluate the performance of electronic signal detection, and has since been used to evaluate the accuracy of diagnostic tests. The ROC theory for evaluating a single test, or for comparing individual tests, is reasonably well understood. Where multiple tests are available, the question arises as to whether some combination of the tests is better than any single one.
In this paper, two ROC procedures for evaluating the aggregate performance of multiple diagnostic tests are presented: one for evaluating simultaneous multiple diagnostic tests, and the other for sequential diagnostic tests. These procedures are illustrated using a breast cancer data set.

Contents

Abstract
Contents
List of Tables
List of Figures
Acknowledgement
1 Introduction
2 ROC Analysis Methodology
  2.1 Constructing Empirical ROC Points
  2.2 Fitting a Smooth ROC Curve
    2.2.1 Parametric Fitting Methods
    2.2.2 Nonparametric Fitting Methods
    2.2.3 Linear and Transformed ROC Curve
  2.3 How to Summarize ROC Curves
    2.3.1 Some Typical Indices
    2.3.2 Simple Trapezoidal-Area Index
    2.3.3 Area Index Based on Statistical Models
    2.3.4 Estimation of Area Index Using U Statistic
    2.3.5 Comments on Area Index
    2.3.6 Complements
3 ROC Analysis For The Test With Multivariate Measurements
  3.1 Survey of Multivariate Discrimination Procedures
  3.2 Linear and Quadratic Discriminant Functions
    3.2.1 The General Discrimination Problem
    3.2.2 Classification of Two Multivariate Normal Populations
  3.3 Evaluating Multi-measurement Tests By ROC
  3.4 Summary
4 ROC Analysis For Evaluating Sequential Diagnostic Procedures
  4.1 Sequential Discrimination Procedures
  4.2 Performance Evaluation
    4.2.1 Comparison under Different Distribution Overlap
    4.2.2 Comparison under Different Correlation Coefficients
    4.2.3 Comparison under Different Discrimination Bounds
  4.3 Evaluating Sequential Procedures by ROC
  4.4 Some Initial Work On Determining Bounds
5 Analyses Of Breast Cancer Data Using ROC Procedures
  5.1 Tumor Markers and Breast Cancer Data
  5.2 Exploratory Data Analysis
  5.3 The Aggregate Performance of CEA, CA15.3, and MCA
  5.4 Sequential Diagnosis Using CEA, CA15.3 and MCA
6 Conclusions
Bibliography

List of Tables

2.1 Illustrative Data Table and Measures of Performance from a Binary Diagnostic Test
2.2 Illustrative Rating-Scale Data [3]
2.3 Illustrative 2 by 5 Table of Rating Data
3.1 The Form of Data for a Discrimination Analysis
4.1 Comparison under Different Overlap (1,000 observations)
4.2 Comparison under Different Correlation (1,000 observations)
4.3 Comparison under Different Discrimination Bounds
4.4 The Maximum P(TP) and the Corresponding A, B, and C
4.5 Comparison of P(TP) Values for Two Methods
5.1 Statistical Descriptions for Original Data in Three Groups
5.2 Statistical Descriptions for Rank-transformed Data
5.3 ROC Areas of Individual Test on the Patients in Group N and Group BC
5.4 ROC Areas of Individual Test on the Patients in Group HR and Group BC
5.5 ROC Areas When a Second Test is Added (N and BC)
5.6 ROC Areas of Sequential Tests on the Patients in Group N and Group BC
5.7 ROC Areas of Sequential Tests on the Patients in Group HR and Group BC

List of Figures

1.1 One Typical ROC Curve
1.2 Comparing Two ROC Curves
2.1 Empirical ROC Points
2.2 Sub-area under ROC Curve
3.1 Cutlines in Bivariate Case
5.1 Q-Q Plot For CEA in Group N
5.2 Histogram For CEA in Group N
5.3 Q-Q Plot For CA15.3 in Group N
5.4 Histogram For CA15.3 in Group N
5.5 Q-Q Plot For MCA in Group N
5.6 Histogram For MCA in Group N
5.7 Q-Q Plot For CEA in Group HR
5.8 Histogram For CEA in Group HR
5.9 Q-Q Plot For CA15.3 in Group HR
5.10 Histogram For CA15.3 in Group HR
5.11 Q-Q Plot For MCA in Group HR
5.12 Histogram For MCA in Group HR
5.13 Q-Q Plot For CEA in Group BC
5.14 Histogram For CEA in Group BC
5.15 Q-Q Plot For CA15.3 in Group BC
5.16 Histogram For CA15.3 in Group BC
5.17 Q-Q Plot For MCA in Group BC
5.18 Histogram For MCA in Group BC
5.19 Boxplot of the Ranked CEA Measurements
5.20 Boxplot of the Ranked CA15.3 Measurements
5.21 Boxplot of Ranked MCA Measurements
5.22 Boxplot of the RLDF in N and BC Groups
5.23 Boxplot of the RQDF in N and BC Groups
5.24 Boxplot of the RLDF in HR and BC Groups
5.25 Boxplot of the RQDF in HR and BC Groups
5.26 ROC Curves (Group N and Group BC)
5.27 ROC Curves (Group HR and Group BC)

Acknowledgement

I would like to thank my advisor, Dr. Andy Coldman, whose valuable support, help, and numerous constructive suggestions were crucial to this research. I would also like to thank Professor Michael Schulzer for his critical reading of the manuscript, Mr. Alec Lai for assistance with part of the programming work, and the Department of Statistics for financial assistance during the period of my graduate study at UBC.

Chapter 1

Introduction

Diagnosis is one of the central problems of medicine and is frequently achieved using physical, biochemical, or functional tests on tissue samples. Diagnostic tests are often used to classify individuals as belonging to one of two populations. For convenience, we will refer to one population as D+ (suggesting the diseased, positive, or case population) and the other population as D- (suggesting the non-diseased, negative, or control population). Here, we will be concerned with tests whose outcome is measured on a continuous numerical scale, where larger values of the diagnostic test are associated with the diseased population and smaller values are associated with the non-diseased population.
After the numerical scores are obtained, a cutpoint c is chosen by some criterion such that any individual whose test value, T, is larger than c is classified as from the diseased population, and any individual whose test value is smaller than c is classified as non-diseased. Sensitivity and specificity are measures of the accuracy of a binary diagnostic test. The probability that a randomly chosen individual is classified as diseased when that individual is, in fact, from the diseased population is called the sensitivity, or probability of a true positive, of the test. The probability that an individual is classified as non-diseased when that individual is from the non-diseased population is called the specificity, or probability of a true negative.

Two questions arise regarding diagnostic tests. The first is whether there exists a test which is free of error. The second is why, if a perfect test exists, we need a new test. No test may be 100% accurate, but many are very good. For example, by surgical operation, we can know with high probability whether a patient has breast cancer. In addition, with pathognomonic tests, if the test result is positive, the patient will be diagnosed as diseased with 100% accuracy. The existence of a gold standard is assumed; in other words, there exist accurate tests for diagnosis. But a problem arises if we operate on a patient before we confirm that the patient in fact has breast cancer. In many cases, the best tests, which are considered the gold standard, have disadvantages: they may be dangerous, expensive, or slow. Consequently, simple diagnostic tests, which are safe, cheap, and quick, are required. Unfortunately, such tests are usually not 100% accurate.

As described previously, a cutpoint is necessary for a diagnostic test to classify objects into one of the two populations. The problem is how to choose an optimal cutpoint. When a cutpoint is chosen, a sensitivity/specificity pair is determined.
Intuitively, we hope to choose a cutpoint such that the corresponding sensitivity and specificity are maximized at the same time (e.g. both equal to 1.0), but this is seldom possible. When the cutpoint is chosen to be minus infinity, the corresponding sensitivity will be unity and the specificity will be zero. In contrast, when the cutpoint is taken to be plus infinity, the sensitivity will be zero and the specificity will be one.

Sensitivity and specificity are important measures of the performance of the diagnostic test. There are direct relations between sensitivity and specificity and the statistical type I and type II errors, which are defined as the error that an observation is classified as positive when in fact it comes from D-, and the error that an observation is classified as negative when in fact it comes from D+. Under the null hypothesis that an observation is from the D- population, the probabilities of the type I and the type II errors are

$$P(\mathrm{I}) = P(T > c \mid D_-) = P(FP) = 1 - \text{specificity},$$

$$P(\mathrm{II}) = P(T < c \mid D_+) = 1 - P(T > c \mid D_+) = 1 - P(TP) = 1 - \text{sensitivity}.$$

We cannot, in general, simply compare the sensitivities and specificities of two tests. For example [1], suppose that there are two potential breast tumor markers, A and B. Each marker gives its test result as a score, and for each test we select a cutpoint to classify samples as diseased or non-diseased. The reason that we cannot simply compare them is that the choice of a different cutpoint may result in a different classification result. At one cutpoint, test A may give a higher correct classification rate, but at another cutpoint, this conclusion may be reversed.

Often in such cases, a single measure of the performance of the diagnostic test which is independent of cutpoint choice is desired. One such measure is based on the ROC curve. ROC analysis removes the arbitrary nature of selecting the cutpoint, and can be used to evaluate and compare the performances of diagnostic tests.
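To make the dependence on the cutpoint concrete, the following is a minimal sketch in Python. The scores are hypothetical values invented for illustration and the function name is our own; it classifies a score as positive when it exceeds the cutpoint, as in the text, and reports the sensitivity, the specificity, and the corresponding type I and type II error probabilities.

```python
def sensitivity_specificity(diseased, non_diseased, cutpoint):
    """Return (sensitivity, specificity) when T > cutpoint is called positive."""
    true_pos = sum(1 for t in diseased if t > cutpoint)
    true_neg = sum(1 for t in non_diseased if t <= cutpoint)
    return true_pos / len(diseased), true_neg / len(non_diseased)

# Hypothetical test scores, not taken from the thesis data.
diseased = [3.1, 4.0, 4.4, 5.2, 6.3]       # sample from D+
non_diseased = [1.0, 2.2, 2.9, 3.5, 4.1]   # sample from D-

for c in (2.0, 3.0, 4.0):
    sens, spec = sensitivity_specificity(diseased, non_diseased, c)
    # P(I) = 1 - specificity and P(II) = 1 - sensitivity, as in the text.
    print(f"c = {c}: sensitivity = {sens:.2f}, specificity = {spec:.2f}, "
          f"P(I) = {1 - spec:.2f}, P(II) = {1 - sens:.2f}")
```

Raising the cutpoint in this sketch lowers the sensitivity and raises the specificity, which is the trade-off that motivates a cutpoint-free summary.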
Based on statistical decision theory, ROC analysis was first developed to evaluate the performance of electronic signal detection methods [2]. In electronic signal detection, an experimenter obtains observations which may include a signal (with background noise), SN, or none (noise only), N. In other words, an observation may come either from the SN population or from the N population. The experimenter does not know which of the two populations the observation is drawn from and wishes to classify it. In this case, the experimenter works as a decision maker, and the action space is composed of two actions: a1, claim that a signal is present, and a2, claim that it is not. It is assumed that there are two elements in the state space, Θ: θ1 parameterizes the distribution of the SN population, and θ2 parameterizes the distribution of the N population. The ROC curve, in this case, is a plot of the hit rate against the false alarm rate, where the hit rate is the probability of correct identification of the signal, and the false alarm rate is the probability of an affirmative response when no signal is present.

Just as an observation can be classified as from the SN population or the N population in electronic detection, an individual may be classified as from D+ or D- in a diagnostic test, where the SN population and the N population correspond to D+ and D-, respectively, and the hit rate and false alarm rate are replaced by sensitivity and 1-specificity.

Drawing from this concept, ROC analysis may be used in medical applications to evaluate the accuracy of diagnostic tests. The ROC curve is a trace of (1-specificity, sensitivity) pairs in the unit square that are formed as the cutpoint varies over its range of possible values. In other words, corresponding to each particular value of the cutpoint, we obtain a single point on the plot of sensitivity against 1-specificity.
By varying the cutpoint, we obtain an ROC curve, a locus of points from a plot of empirical sensitivity against 1-specificity at each cutpoint. If the sample size is infinite, varying the cutpoint continuously gives the whole ROC curve. If the sample size is finite, we can only obtain a finite number of ROC points in the unit square, but by some parametric or nonparametric methods, which will be discussed in the next chapter, we can still fit a continuous ROC curve.

The ROC curve of a test which discriminates perfectly would pass through the point (0,1) in the unit square. In contrast, a test with no discriminating ability will have equal expected values for the sensitivity and 1-specificity. Consequently, the ROC curve associated with a non-informative test would follow the diagonal of the square. In practice, most diagnostic tests have ROC curves which fall between these two extremes. The better test has an ROC curve that rises more sharply toward the ideal point (0,1) and lies farther from the diagonal. One typical ROC curve is shown in Figure 1.1. In Figure 1.2, ROC curves of two different tests are given. Test A, with (1-specificity, sensitivity) values lying on the upper ROC curve, is more discriminating than test B, lying on the lower ROC curve.

After the ROC curve is obtained, we can derive from it some indices as measures of the performance of the corresponding test. Based on such indices, diagnostic tests can be evaluated and compared reasonably and effectively. The most commonly used index is the area under the ROC curve of the test, with greater area indicating a better test. The area index, as well as some other indices derived from the ROC curve, is described in detail in the next chapter.

In the diagnosis of a particular disease, there are often several tests available for the same disease.
When the result of a single test is subject to substantial error, we can carry out several different diagnostic tests on the same patient, whether simultaneously or sequentially. The ROC theory for evaluating a single test, or for comparing several single tests, is reasonably well understood. Where multiple tests are available, the question arises as to whether some combination of the tests is better than any single one. In other words, the question is how we can combine several test results and provide as much information as possible. In this paper, we will connect multivariate discrimination theory with the ROC methodology, and present an ROC method to evaluate the accuracy and compare the performance of combinations of several tests, which will be referred to as the ROC analysis for the multivariate-measurement test. Also, in many clinical situations, sequential discrimination procedures are increasingly used to classify patients into the D+ or D- populations. How to evaluate the performance of sequential discrimination procedures by ROC techniques will also be discussed in this paper.

Chapter 2

ROC Analysis Methodology

ROC principles are based on the notion of a "decision variable". The concept is needed because very few clinical diagnostic tests produce results which fall into two obviously defined categories with unequivocal boundaries. With some tests, an "explicit" variable exists; for example, a medical test may yield a numerical test result. In such situations, one can choose among an infinity of decision thresholds or cutpoints along the continuum of this variable to serve as the boundary above which one would declare the test positive: each different choice will yield a different true-positive fraction and a different false-positive fraction, with a decrease in one being accompanied by a decrease in the other.
The resulting 2 by 2 table (see, for example, Table 2.1) corresponds to one particular point on the ROC curve.

Table 2.1: Illustrative Data Table and Measures of Performance from a Binary Diagnostic Test

            Test result
            -       +       Total
  D-        475     25      500
  D+        150     350     500

  Performance:
  In D- group   False positive fraction                P(FP) = 0.05 (25/500)
                True negative fraction (specificity)   P(TN) = 1 - P(FP) = 0.95
  In D+ group   True positive fraction (sensitivity)   P(TP) = 0.70 (350/500)
                False negative fraction                P(FN) = 1 - P(TP) = 0.30

2.1 Constructing Empirical ROC Points

The ROC curve shows the trade-off between successes, P(TP) (or sensitivity), and errors, P(FP) (or 1-specificity), as one employs different decision boundaries. Consider the hypothetical data of Table 2.2, which represent numbers of rating responses made in each of five categories of a rating scale, both for cases truly diseased and for cases truly non-diseased. In each cell the raw frequencies of responses are given at the left. The cumulative response proportions given in parentheses at the right of each cell (moving downward) are the estimates of the conditional TP probabilities (left column) and FP probabilities (right column). The TP and FP probability (or sensitivity and 1-specificity) estimates are simply obtained by successively considering each rating-category boundary as if it were a binary-decision criterion. For convenience, Table 2.2 is usually arranged in the form of Table 2.3, where a rating of 1 corresponds to category ++, a rating of 2 corresponds to category +, and so on, until finally a rating of 5 corresponds to category --.

Considering only ratings of 1 as diseased, and ratings of 2 to 5 as non-diseased, we obtain an estimate of the FP probability of 0.05, and an estimate of the TP probability of 0.70. This ROC point of (0.05, 0.70) is the equivalent of one generated by a binary-decision criterion that is relatively stringent.
Move on now to regard ratings of both 1 and 2 as indicating diseased, and ratings of 3 to 5 as indicating non-diseased, a somewhat more lenient criterion. Then P(FP) equals 0.15 and P(TP) equals 0.80, and so on, through the fifth rating category, yielding the points (0.30, 0.88), (0.60, 0.95), and (1.0, 1.0). The corresponding empirical ROC points are shown in Figure 2.1. We see, incidentally, because the last category always yields P(FP) = P(TP) = 1.0, that rating judgments yield a number of ROC points which equals the number of judgment categories.

Besides the construction of the ROC points described above, some other aspects must also be considered. One important aspect is ROC curve fitting, and another is the development of indices to summarize ROC curves for the purpose of performance evaluation and comparison. ROC curve fitting will be described in Section 2.2. In Section 2.3 we will discuss some frequently used ROC indices, as well as some other aspects of the ROC analysis.

Table 2.2: Illustrative Rating-Scale Data [3]

                                    Stimulus
  Response                          Diseased      Non-diseased
  Very likely diseased, 1           350 (0.70)     25 (0.05)
  Probably diseased, 2               50 (0.80)     50 (0.15)
  Possibly diseased, 3               40 (0.88)     75 (0.30)
  Probably nondiseased, 4            35 (0.95)    150 (0.60)
  Very likely nondiseased, 5         25 (1.00)    200 (1.00)

2.2 Fitting a Smooth ROC Curve

After a number of points in the ROC space are generated from a diagnostic test, it is frequently desirable to obtain a smoothed estimate of the ROC curve. If the sole purpose is to give a very rough, almost qualitative statement and picture of whether the ROC points are close to the upper left corner, close to the diagonal, or halfway in between, we can use a simple eye-fit. Otherwise, if a more quantitative description is required, as in a formal comparison of two diagnostic tests, some form of objective curve fitting and inferential procedure is necessary.
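Before turning to specific fitting methods, the empirical points from which any such fit starts can be reproduced directly from the frequencies of Table 2.2. The following is a minimal sketch in Python; the function name is our own, and the code simply cumulates the rating frequencies from the most stringent category downward, as described in Section 2.1.

```python
from itertools import accumulate

# Rating frequencies from Table 2.2, ordered from rating 1 to rating 5.
diseased = [350, 50, 40, 35, 25]
non_diseased = [25, 50, 75, 150, 200]

def empirical_roc_points(pos_counts, neg_counts):
    """Treat each rating boundary as a binary criterion; return (P(FP), P(TP)) pairs."""
    n_pos, n_neg = sum(pos_counts), sum(neg_counts)
    tp = [c / n_pos for c in accumulate(pos_counts)]
    fp = [c / n_neg for c in accumulate(neg_counts)]
    return list(zip(fp, tp))

points = empirical_roc_points(diseased, non_diseased)
for fp, tp in points:
    print(f"P(FP) = {fp:.2f}, P(TP) = {tp:.2f}")
```

The five pairs printed are the cumulative proportions shown in parentheses in Table 2.2, with the last category giving (1.0, 1.0).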
Basically, ROC curve fitting methods can be divided into two categories: parametric and nonparametric. The objective is to fit a function Y = f(X), where Y = P(TP) (or sensitivity) and X = P(FP) (or 1-specificity).

Table 2.3: Illustrative 2 by 5 Table of Rating Data

          Test result (rating category)
          --          -           +/-         +           ++          Total
  D-      200 (1.00)  150 (0.60)  75 (0.30)   50 (0.15)   25 (0.05)   500
  D+       25 (1.00)   35 (0.95)  40 (0.88)   50 (0.80)  350 (0.70)   500

2.2.1 Parametric Fitting Methods

The ROC curve can be estimated by a parametric method if the distributions of the test measurements from the diseased (or signal-plus-noise) and non-diseased (or noise) populations are assumed known. For example, let us simply assume that the two underlying populations are normal with cumulative probability functions $F_{D_-}(x) \sim N(\mu, \sigma^2)$ and $F_{D_+}(x) \sim N(\mu + \Delta, \sigma^2)$. Then, given a value of the cutpoint, $c = c_k$, we can calculate the coordinates of the point $t_{c_k} = (P(TP), P(FP))$ on the ROC grid as

$$P(TP)\big|_{c=c_k} = P(X > c_k \mid X \in D_+) = 1 - \Phi\!\left(\frac{c_k - \mu - \Delta}{\sigma}\right) \qquad (2.1)$$

$$P(FP)\big|_{c=c_k} = P(X > c_k \mid X \in D_-) = 1 - \Phi\!\left(\frac{c_k - \mu}{\sigma}\right) \qquad (2.2)$$

where $\Phi$ is the cumulative probability function of the standard normal distribution. If the parameters, such as $\mu$, $\Delta$, and $\sigma$ in the above normal case, are known, we can determine the point on the ROC grid corresponding to each cutpoint, and obtain an exact continuous ROC curve.

Unfortunately, in practice, the parameters of the underlying distributions are usually unknown and must be estimated from random samples. Morgan [4], Dorfman and Alf [5, 6], and Ogilvie and Creelman [7] have proposed estimation procedures for the cases when noise and signal-plus-noise are assumed to have uniform, normal, and logistic distributions, respectively. Grey and Morgan [8] dealt with both the normal and the logistic cases. In these papers, maximum likelihood estimates of the parameters of the assumed family of distributions are used.
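As an illustration of (2.1) and (2.2) when the parameters are taken as known, the following minimal Python sketch evaluates ROC points for the equal-variance normal model. The parameter values are hypothetical, chosen only for illustration, and the function names are our own; the standard normal cumulative probability function is computed from the error function in the standard library. Note that the function returns the pair in plotting order, (P(FP), P(TP)).

```python
import math

def std_normal_cdf(z):
    """Cumulative probability function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binormal_roc_point(c, mu, delta, sigma):
    """Return (P(FP), P(TP)) at cutpoint c under N(mu, sigma^2) and N(mu + delta, sigma^2)."""
    p_tp = 1.0 - std_normal_cdf((c - mu - delta) / sigma)   # equation (2.1)
    p_fp = 1.0 - std_normal_cdf((c - mu) / sigma)           # equation (2.2)
    return p_fp, p_tp

# Hypothetical parameter values, assumed known for this sketch.
mu, delta, sigma = 0.0, 1.5, 1.0

for c in (-1.0, 0.0, 0.75, 1.5, 2.5):
    p_fp, p_tp = binormal_roc_point(c, mu, delta, sigma)
    print(f"c = {c:5.2f}: P(FP) = {p_fp:.3f}, P(TP) = {p_tp:.3f}")
```

Setting delta to zero in this sketch gives P(TP) = P(FP) at every cutpoint, which is the diagonal of a non-informative test.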
2.2.2 Nonparametric Fitting Methods

If one is reluctant to impose any prior assumptions or, in the absence of sufficient information, unable to justify a certain distribution, it is preferable to have a distribution-free, i.e. nonparametric, method of finding the relationship between P(TP) and P(FP). In other words, it is necessary to determine the equation Y = f(X) of an ROC curve without any prior distributional considerations.

There are rather few references on nonparametric procedures for ROC curve fitting in the literature. One algebraic method of fitting Y = f(X) was proposed by Birdsall [9] in the context of pure signal detection. According to Birdsall, the ROC curve can in general be viewed as part of a conic section. Jaraiedi and Heirin [10] devised a method which can be applied to any available data set to yield the equation of an ROC curve. They assumed that the ROC locus had a functional form Y = f(X), where X and Y correspond to P(FP) and P(TP), respectively. In their paper, two alternative approaches are developed for estimating its parameters. Suppose that J is the number of data points available and m is the number of parameters determining the function Y = f(X). In the first approach, the ROC curve is passed through the first m points, and then the sum of deviations for the remaining J - m points, including (1,1), is minimized. The second approach differs from the first only in the way the objective function is defined: all points are assigned equal weight (equal to 1) and the sum of squared deviations around all points, including (1,1), is minimized.

2.2.3 Linear and Transformed ROC Curve

It should be mentioned that some authors prefer to plot the empirical and the formally fitted ROC curves on transformed (normal, logistic, ...) axes rather than on the conventional linear (0, 1) ones [3].
Such a transformation is most useful when it corresponds to the inverse probability transform projecting the ROC curve from probability space to parameter space. The main advantages of doing this are that it is easier for the human eye to compare straight lines than curves, and that the lack of fit of a parametric model is more readily apparent on suitably transformed axes. Also, these axes make it easier to plot several curves, since the scales expand as one approaches the extremes, i.e., the crowding at the upper left corner of the unit square is avoided.

2.3 How to Summarize ROC Curves

Finally, some indices to summarize ROC curves must be derived as measures of accuracy.

2.3.1 Some Typical Indices

As an index of accuracy, one can use the TP fraction corresponding to a particular FP fraction, which one might refer to as $P(TP)_{FP}$. A second popular index is the area under the ROC curve. Also, one can use $P(TP)_{FP\,range}$, which is the average P(TP) over a restricted high level of FPs.

The $P(TP)_{FP}$ index was originally discussed by Greenhouse and Mantel [11] and, more recently, by Linnett [12]. The main advantage of the $P(TP)_{FP}$ index is that it is readily understood. However, it is unlikely that different authors will standardize their P(TP)s to the same value of P(FP), and one cannot always gather from a published report whether a P(FP) value was chosen in advance of a study or after inspection of the curve(s). Therefore, Swets and Pickett [3] recommended the area, A, under the ROC curve plotted on ordinary axes as the index of choice.

2.3.2 Simple Trapezoidal-Area Index

The trapezoidal area, $A_t$, under the "curve" that is formed simply by joining the empirical ROC points is one of the choices; it is nonparametric and easy to calculate.
However, it is likely to be affected by the location or spread (or small number) of the ROC points and generally yields a smaller area than one derived from a smooth curve, because the smoothed ROC curve is usually convex.

2.3.3 Area Index Based on Statistical Models

To distinguish the trapezoidal area from an area obtained from an ROC curve fitted under a probability model, we write $A_z$ for the area that is based on a model indexed by the parameter z, e.g., the normal model.

Generally, if the distributions of the two underlying populations are known, then we can calculate the area under the ROC curve exactly. Suppose that X and Y are independent continuous variables representing test scores from D- and D+, respectively. Then the area under the ROC curve will be

$$A(X,Y) = \int_{-\infty}^{\infty} P(Y > c)\, f_X(c)\, dc = P(Y > X) \qquad (2.3)$$

which provides an intuitive meaning for the area under an ROC curve. In other words, the area under the ROC curve of a diagnostic test equals the probability that a random measurement X from D- is exceeded by a random measurement Y from D+. The above formula can also be used to calculate the area once the model is decided. Replacing the quantities in (2.3) by their normal distribution counterparts, A(X,Y) becomes $A_z$.

2.3.4 Estimation of Area Index Using U Statistic

The area under an ROC curve can be estimated by a nonparametric method, in which the estimation is based only on the samples and no statistical model is specified. Suppose that $x_1, x_2, \ldots, x_{n_{D_-}}$ and $y_1, y_2, \ldots, y_{n_{D_+}}$ are samples from D- and D+, respectively, and that the x's and y's are independent. Let $c_k$, $k = 1, 2, \ldots, K$, be cutpoints such that $c_k > c_{k-1}$, where K is the number of cutpoints chosen; the $c_k$'s can be determined from the order statistics of the combined samples. Usually we choose K to be 5, 10, or 20. The maximum K is $n_{D_+} + n_{D_-} + 1$ when there are no tied observations.
From Figure 2.2, the k-th sub-area, $A(X,Y)_k$, can be calculated from the corresponding trapezoidal area as

$$A(X,Y)_k = \tfrac{1}{2}\,[P(x > c_{k-1}) - P(x > c_k)]\,[P(y > c_k) + P(y > c_{k-1})] = \tfrac{1}{2}\,P(x = c_{k-1})\,\{P(y > c_k) + P(y > c_{k-1})\} \qquad (2.4)$$

and the estimated total area is

$$A(X,Y) = \sum_k A(X,Y)_k = \tfrac{1}{2}\sum_k P(x = c_{k-1})\,P(y = c_{k-1}) + \sum_k P(x = c_{k-1})\,P(y > c_k)$$

$$= \tfrac{1}{2}\,[\text{Proportion of all } x, y \text{ pairs in data with } x = y] + [\text{Proportion of } x, y \text{ pairs with } x < y] = \frac{U}{n_{D_-}\, n_{D_+}} \qquad (2.5)$$

where U = (Number of pairs with X < Y) + (1/2)(Number of pairs with X = Y) is the Mann-Whitney U statistic. If we assume that the x's and y's are continuous variables, which is usually the case in practice, then Prop(x = y) ≈ 0 and

$$A = [\text{Proportion of } x, y \text{ pairs with } x < y] \qquad (2.6)$$

Since the area can be estimated by the U statistic, many properties of the U statistic can be applied to area estimation, such as the property that A is an asymptotically unbiased estimator of the true area. Bamber [13] showed the relationship between the area falling under the points comprising an empirical ROC curve and the Mann-Whitney U statistic. For the general theory of U statistics, see Lehmann [26].

2.3.5 Comments on Area Index

The area index has also been studied by various investigators, including Bamber [13], Hanley and McNeil [14, 15], Goddard [16], Dorfman and Alf [6], and Delong et al. [17]. Under certain circumstances, the area index may not be an adequate measure for comparing diagnostic tests. For example, the areas under two ROC curves may be equal although, in fact, one test dominates the other over an interval of clinically relevant FPs. A solution to this problem is to use the $P(TP)_{FP\,range}$ index. Part of the statistical theory for this type of index has been developed by Wieand et al. [18], in which the statistical procedures are established in the context of a class of indices and illustrated using continuous rather than rating data; however, they can be adapted to rating-method data.
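The equality in (2.5) between the trapezoidal area under the empirical ROC points and the scaled Mann-Whitney U statistic can be checked numerically. The following is a minimal Python sketch on a small hypothetical sample that includes ties; the data and function names are our own. Here every distinct observed value is used as a cutpoint and a score is called positive when it is at least the cutpoint, which is equivalent to placing the cutpoint just below that value.

```python
def mann_whitney_area(x, y):
    """Proportion of (x, y) pairs with x < y plus half the proportion with x = y."""
    wins = sum(1 for xi in x for yj in y if xi < yj)
    ties = sum(1 for xi in x for yj in y if xi == yj)
    return (wins + 0.5 * ties) / (len(x) * len(y))

def trapezoidal_area(x, y):
    """Area under the polygon joining the empirical ROC points, starting from (0, 0)."""
    cutpoints = sorted(set(x) | set(y), reverse=True)
    points = [(0.0, 0.0)]
    for c in cutpoints:
        fp = sum(1 for xi in x if xi >= c) / len(x)
        tp = sum(1 for yj in y if yj >= c) / len(y)
        points.append((fp, tp))
    area = 0.0
    for (fp0, tp0), (fp1, tp1) in zip(points, points[1:]):
        area += 0.5 * (fp1 - fp0) * (tp0 + tp1)   # one trapezoid per step
    return area

# Hypothetical scores with ties, invented for illustration.
x = [1, 2, 2, 3, 4]   # sample from D-
y = [2, 3, 4, 5, 5]   # sample from D+

print(mann_whitney_area(x, y), trapezoidal_area(x, y))
```

Up to floating-point rounding, the two printed values agree, as (2.5) states.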
2.3.6 Complements

Finally, it should be pointed out that besides the construction, fitting, and summarizing of ROC curves, some other important aspects should also be taken into consideration. For example, in order to score the correctness of each diagnosis, the true state of each observation (i.e., the population D- or the population D+) must be known. Unfortunately, obtaining such truth data for ROC studies of real cases is often difficult, as described in the first chapter, and investigators sometimes resort to using surrogate truth data. Revesz et al. [19] investigated the effect of various methods of approximating the truth on the conclusions of a study that compared three radiographic techniques. They found that any of the three techniques could be shown to be more accurate than the others, depending on which method was used to define the truth. Henkelman et al. [20] have proposed a method of ROC analysis that does not require truth data, for use when several very accurate tests are being compared, but their suggestion has not been followed up. The main drawback seems to be that, by essentially relying on the other tests to "define" the truth, it is difficult for the new modality to appear better. Because our emphasis is on the construction of the ROC curve for the combination of several tests, we simply assume that the data in our research are truth data.

Other aspects involved in ROC analysis can be found in Hanley's paper [21], which is a useful survey of ROC methodology.
How to use ROC methodology to evaluate combined diagnostic tests as well as sequential clinical discrimination tests will be developed in the rest of this paper.

Chapter 3

ROC Analysis For The Test With Multivariate Measurements

As mentioned in the first chapter, a problem in ROC curve plotting arises if we have more than one diagnostic test and would like to plot a single curve to represent the aggregate performance of those multiple test results. In fact, a combination of several tests can be considered as a single test with multivariate measurements. Here, for convenience, we refer to a test with univariate measurements as a univariate measurement test, or simply a univariate test, and to a combination of several tests as a multivariate measurement test, or simply a multivariate test. One purpose of this paper is to develop an ROC methodology to evaluate the performance of such multivariate diagnostic tests.

There are two obvious strategies for carrying out ROC analysis of multivariate tests. First, instead of using a cutpoint in constructing the ROC curve as we did for a univariate test, we can use a cutline for bivariate tests, a cutface for trivariate tests, or even a super-cutface for tests with more than three variates to calculate TP proportions and FP proportions. If the cutface is 'controlled' by a single parameter, we may then vary this parameter to obtain an ROC curve. Secondly, we can project the multivariate measurements onto a straight line along which the resulting univariate measurements from the two populations are separated as much as possible, and then use a simple cutpoint to obtain TP and FP proportions. It is clear that the second strategy is a special case of the first. The second approach may be implemented by using multivariate discriminant functions, as illustrated by Figure 3.1 in a simple two-dimensional example.
For convenience, we begin by defining the notation which will be used in this chapter. We use upper case $X$ to denote a random vector, and lower case $x$ to denote a realization. Further, let $\Psi$ be the sample space, $\Psi_1$ be the subset of $\Psi$ for which an observation $x$ would be classified as being $D_-$, and $\Psi_2$ be the subset of $x$ values for which the objects are classified as being $D_+$, where $\Psi = \Psi_1 \cup \Psi_2$ and $\Psi_1 \cap \Psi_2 = \emptyset$. In addition, let $f_{D_-}(x)$ and $f_{D_+}(x)$ be the probability density functions associated with the $p \times 1$ random vector $X$ for the two populations $D_-$ and $D_+$, respectively.

In this chapter, we first introduce several typical multivariate discriminant functions, and then describe how to use those functions together with ROC methodology to plot a single ROC curve for multivariate measurement tests.

3.1 Survey of Multivariate Discrimination Procedures

Multivariate discriminant analysis is used to separate two or more groups of individuals, given measurements for these individuals on several variables. A discriminant function is a univariate score based on $X$. In the two-population case there are random samples from the two populations, of sizes $m$ and $n$, respectively, and values are available for $p$ variables $X_1, X_2, \ldots, X_p$ for each sample member. Thus the data for a discriminant analysis take the form shown in Table 3.1.

Table 3.1: The Form of Data for a Discrimination Analysis

Population   Individual   X_1      X_2      ...   X_p
1            1            x_111    x_112    ...   x_11p
1            2            x_121    x_122    ...   x_12p
...          ...          ...      ...      ...   ...
1            m            x_1m1    x_1m2    ...   x_1mp
2            1            x_211    x_212    ...   x_21p
2            2            x_221    x_222    ...   x_22p
...          ...          ...      ...      ...   ...
2            n            x_2n1    x_2n2    ...   x_2np

Many procedures for multivariate discriminant analysis have been suggested in the literature. Fisher's linear discriminant function (LDF) [22], [23], [24] is the most widely used. Fisher's idea was to transform the multivariate observations $x$ to univariate observations $y$ such that the $y$'s in the two groups were separated as much as possible.
Under the assumption that the two populations are normally distributed with equal covariance matrices, the LDF arises as a likelihood ratio. Fisher suggested taking linear combinations of $x$ to create the $y$'s because they are simple functions of $x$ and are easily handled mathematically. If the assumption that the two populations have the same covariance matrix is not true, then it is appropriate to use quadratic discriminant functions (QDF) instead of linear ones [25].

Logistic regression [27] is another useful method for multivariate discrimination, under the assumption that logit $q$ is linear in the $X$'s, where $q$ denotes the probability of a subject being diseased. One advantage is that we only need to estimate the $p + 1$ parameters $\alpha_0$ and $\beta$ from the sample data, without having to specify the underlying distributions, whereas in the LDF, for example, there are $2p + \tfrac{1}{2}p(p+1)$ parameters to estimate. The logistic model is particularly useful for handling diagnostic data [27]. For example, as indicated in [27], if the sampling distributions are multivariate normal with identical dispersion matrices, then the diagnostic variable will be linear-logistically distributed.

Given a discriminant function $D(x)$, we can also determine an assignment rule based only on the ranked values of $D$. Some ranking procedures are considered by Randles et al. [28], [29] and Beckman and Johnson [30]. Alternatively, instead of forming the discriminant function and then using ranks, Conover and Iman [31] proposed ranking the data first and then using the LDF or QDF, called RLDF and RQDF. The ranked data are no longer normally distributed, but RLDF and RQDF work well. They compared RLDF and RQDF with several other nonparametric techniques, and concluded that if the data are multivariate normal, very little is lost by using the RLDF and RQDF methods instead of the LDF and QDF methods.
When the data were non-normal, the ranking methods were superior to LDF and QDF, and they compared favorably with the other nonparametric methods, such as nearest neighbor and kernel estimators. LDF, QDF, RLDF, and RQDF will be discussed and used in the ROC analyses of this and the next chapter.

In addition, there are several other multivariate discrimination procedures available. For example, the sequential discrimination method [32], in which the dimensions of $X$ are assumed to be gathered sequentially, is completely distribution-free. After each step the sequential discrimination method decides either to classify the current $x$ to one of the populations, $D_-$ and $D_+$, or to introduce another variable into $x$. The sequential discrimination method will be discussed in further detail in Chapter 4. Other discrimination procedures include nearest neighbor techniques [33], partitioning methods [34], and the kernel method [35].

3.2 Linear and Quadratic Discriminant Functions

In this section, our discussion proceeds from the general discrimination problem to the simple LDF and QDF procedures.

3.2.1 The General Discrimination Problem

It is clear that discrimination rules will not usually provide an error-free method of assignment. This is because there may not be a clear distinction between the measured characteristics of the populations; that is, the groups may overlap. It is then possible to incorrectly classify a $D_+$ object as belonging to $D_-$ or a $D_-$ object as belonging to $D_+$. A good discrimination procedure should result in few misclassifications. In addition, an optimal discrimination rule should take prior information into account, such as the prior probability of occurrence of each population. Finally, an optimal discrimination procedure should account for the costs associated with misclassification. For example, failing to diagnose an illness is generally more costly than concluding that the disease is present when it is not.
Let $P(+|D_-)$ denote the conditional probability of classifying an object as $D_+$ when, in fact, it is from $D_-$, and $P(-|D_+)$ denote the conditional probability of classifying an object as $D_-$ when, in fact, it is from $D_+$. Similarly, we define $P(+|D_+)$ and $P(-|D_-)$. Then we have

\[ P(+|D_-) = P(x \in \Psi_2 \,|\, D_-) = \int_{\Psi_2} f_{D_-}(x)\,dx \]
\[ P(-|D_+) = P(x \in \Psi_1 \,|\, D_+) = \int_{\Psi_1} f_{D_+}(x)\,dx \tag{3.1} \]

Let $\pi_1$ and $\pi_2$ be the prior probabilities that an observation comes from $D_-$ and $D_+$, respectively, where $\pi_1 + \pi_2 = 1$. Then the overall probabilities of correctly or incorrectly classifying objects are

\[ P(\text{correct}) = \pi_2 P(+|D_+) + \pi_1 P(-|D_-) \]
\[ P(\text{incorrect}) = \pi_1 P(+|D_-) + \pi_2 P(-|D_+) \tag{3.2} \]

If we take misclassification cost into consideration, then an optimal classification rule can be found in the sense of minimum expected cost of misclassification (MECM), where the expected cost of misclassification (ECM) is given by

\[ ECM = \rho(+|D_-)\,P(+|D_-)\,\pi_1 + \rho(-|D_+)\,P(-|D_+)\,\pi_2 \tag{3.3} \]

where $\rho(+|D_-)$ denotes the cost of classifying a patient as diseased when he or she is from the $D_-$ population. $\rho(-|D_+)$, $\rho(+|D_+)$, and $\rho(-|D_-)$ can be defined similarly. We assume that $\rho(-|D_-) = \rho(+|D_+) = 0$.

It is well known that the regions $\Psi_1$ and $\Psi_2$ that minimize the ECM are determined by the values $x$ satisfying the following:

\[ \Psi_2: \ \frac{f_{D_+}(x)}{f_{D_-}(x)} \geq \left(\frac{\rho(+|D_-)}{\rho(-|D_+)}\right)\left(\frac{\pi_1}{\pi_2}\right) \]
\[ \Psi_1: \ \frac{f_{D_+}(x)}{f_{D_-}(x)} < \left(\frac{\rho(+|D_-)}{\rho(-|D_+)}\right)\left(\frac{\pi_1}{\pi_2}\right) \tag{3.4} \]

In practice, when both the prior probability ratio and the misclassification cost ratio are unity, or one ratio is the reciprocal of the other, the optimal classification regions are determined simply by comparing the values of the density functions, i.e.,

\[ \Psi_2: \ \frac{f_{D_+}(x)}{f_{D_-}(x)} \geq 1, \qquad \Psi_1: \ \frac{f_{D_+}(x)}{f_{D_-}(x)} < 1 \tag{3.5} \]

3.2.2 Classification of Two Multivariate Normal Populations

Here we first suppose that the two populations have the same covariance matrix $\Sigma$. Let $f_{D_-}(x)$ and $f_{D_+}(x)$ be the joint densities for populations $D_-$ and $D_+$, respectively.
Then,

\[ f_{D_-}(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2}(x - \mu_1)'\Sigma^{-1}(x - \mu_1)\right], \qquad f_{D_+}(x) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left[-\tfrac{1}{2}(x - \mu_2)'\Sigma^{-1}(x - \mu_2)\right] \tag{3.6} \]

where $\mu_1$ and $\mu_2$ are the mean vectors of $D_-$ and $D_+$, and $\mu_1$, $\mu_2$ and $\Sigma$ are supposed to be known here. Then

\[ \frac{f_{D_+}(x)}{f_{D_-}(x)} = \exp\left[(\mu_2 - \mu_1)'\Sigma^{-1}x - \tfrac{1}{2}(\mu_2 - \mu_1)'\Sigma^{-1}(\mu_1 + \mu_2)\right] \tag{3.7} \]

Consequently, the MECM regions $\Psi_1$ and $\Psi_2$ in (3.4) become

\[ \Psi_2: \ (\mu_2 - \mu_1)'\Sigma^{-1}x - \tfrac{1}{2}(\mu_2 - \mu_1)'\Sigma^{-1}(\mu_1 + \mu_2) \geq \ln\left[\left(\frac{\rho(+|D_-)}{\rho(-|D_+)}\right)\left(\frac{\pi_1}{\pi_2}\right)\right] \]
\[ \Psi_1: \ (\mu_2 - \mu_1)'\Sigma^{-1}x - \tfrac{1}{2}(\mu_2 - \mu_1)'\Sigma^{-1}(\mu_1 + \mu_2) < \ln\left[\left(\frac{\rho(+|D_-)}{\rho(-|D_+)}\right)\left(\frac{\pi_1}{\pi_2}\right)\right] \tag{3.8} \]

If we ignore prior probabilities and misclassification costs and simply take both the prior probability ratio and the misclassification cost ratio to be unity, then (3.8) becomes

\[ \Psi_2: \ (\mu_2 - \mu_1)'\Sigma^{-1}x \geq \tfrac{1}{2}(\mu_2 - \mu_1)'\Sigma^{-1}(\mu_1 + \mu_2) \]
\[ \Psi_1: \ (\mu_2 - \mu_1)'\Sigma^{-1}x < \tfrac{1}{2}(\mu_2 - \mu_1)'\Sigma^{-1}(\mu_1 + \mu_2) \tag{3.9} \]

which is exactly the same as Fisher's linear discrimination rule.

Fisher's linear discriminant function converts the $D_-$ and $D_+$ multivariate populations into univariate populations such that the corresponding univariate population means are separated as much as possible. In fact, (3.9) can be used as a classification device which gives one score to each observation vector. For a new observation $x_0$, we obtain the linear discriminant function (LDF) as

\[ y = (\mu_2 - \mu_1)'\Sigma^{-1}x_0 \tag{3.10} \]

where $y$ and $x_0$ have a linear relation, and the optimal cutpoint between the two univariate population means as

\[ cp = \tfrac{1}{2}(\mu_2 - \mu_1)'\Sigma^{-1}(\mu_1 + \mu_2) \tag{3.11} \]

In practice, the population parameters $\mu_1$, $\mu_2$ and $\Sigma$ are usually unknown. They may be estimated from observations that have already been correctly classified. Then the estimated LDF is

\[ \hat{y} = (\bar{x}_2 - \bar{x}_1)'S_{pooled}^{-1}x_0 \tag{3.12} \]

and the optimal cutpoint is

\[ \widehat{cp} = \tfrac{1}{2}(\bar{x}_2 - \bar{x}_1)'S_{pooled}^{-1}(\bar{x}_1 + \bar{x}_2) \tag{3.13} \]

where $\bar{x}_1$ and $\bar{x}_2$ are the sample mean vectors, and $S_{pooled}$ is the pooled sample covariance matrix.

If the assumption that the two underlying populations have equal covariance matrices is not satisfied, the classification rule becomes more complicated.
Substituting densities with unequal covariance matrices $\Sigma_1$ and $\Sigma_2$ into (3.4) gives the quadratic classification rule that classifies $x_0$ to

\[ D_+ \ \text{if} \quad \tfrac{1}{2}x_0'(\Sigma_1^{-1} - \Sigma_2^{-1})x_0 + (\mu_2'\Sigma_2^{-1} - \mu_1'\Sigma_1^{-1})x_0 - b \geq \ln\left[\left(\frac{\rho(+|D_-)}{\rho(-|D_+)}\right)\left(\frac{\pi_1}{\pi_2}\right)\right] \]
\[ D_- \ \text{if} \quad \tfrac{1}{2}x_0'(\Sigma_1^{-1} - \Sigma_2^{-1})x_0 + (\mu_2'\Sigma_2^{-1} - \mu_1'\Sigma_1^{-1})x_0 - b < \ln\left[\left(\frac{\rho(+|D_-)}{\rho(-|D_+)}\right)\left(\frac{\pi_1}{\pi_2}\right)\right] \tag{3.14} \]

where

\[ b = \tfrac{1}{2}\ln\left(\frac{|\Sigma_2|}{|\Sigma_1|}\right) + \tfrac{1}{2}(\mu_2'\Sigma_2^{-1}\mu_2 - \mu_1'\Sigma_1^{-1}\mu_1) \tag{3.15} \]

When we ignore the prior probabilities and misclassification costs, we obtain the quadratic discriminant function (QDF) and the optimal cutpoint

\[ y = \tfrac{1}{2}x_0'(\Sigma_1^{-1} - \Sigma_2^{-1})x_0 + (\mu_2'\Sigma_2^{-1} - \mu_1'\Sigma_1^{-1})x_0 \tag{3.16} \]
\[ cp = \tfrac{1}{2}\ln\left(\frac{|\Sigma_2|}{|\Sigma_1|}\right) + \tfrac{1}{2}(\mu_2'\Sigma_2^{-1}\mu_2 - \mu_1'\Sigma_1^{-1}\mu_1) \tag{3.17} \]

and similarly the estimated QDF and cutpoint

\[ \hat{y} = \tfrac{1}{2}x_0'(S_1^{-1} - S_2^{-1})x_0 + (\bar{x}_2'S_2^{-1} - \bar{x}_1'S_1^{-1})x_0 \tag{3.18} \]
\[ \widehat{cp} = \tfrac{1}{2}\ln\left(\frac{|S_2|}{|S_1|}\right) + \tfrac{1}{2}(\bar{x}_2'S_2^{-1}\bar{x}_2 - \bar{x}_1'S_1^{-1}\bar{x}_1) \tag{3.19} \]

If either outliers exist or the two underlying populations are not normally distributed, RLDF (or RQDF) or logistic regression methods could be used for the same purpose.

3.3 Evaluating Multi-measurement Tests by ROC

In the above sections we discussed several useful multivariate discrimination functions. By using such functions, multivariate measurements can be transformed into univariate measurements, and much of the information for separating the populations is retained in the univariate measurements. Meanwhile, each corresponding optimal cutpoint has been derived, upon which the samples can be classified into one of the two underlying populations.

In order to perform an ROC analysis, two requirements must be met. First, the measurements must be univariate. Secondly, cutpoints must be set up and varied across the whole range. Multivariate discrimination functions automatically produce such univariate measurements, so the traditional ROC analysis described in Chapter 2 can then be applied.

In this section, we will emphasize the utility of the rank transformation in ROC analysis. On the one hand, the area under the ROC curve depends only on the relative placement of the observed measurements.
Since the rank transformation of a univariate measurement does not change this relative placement, the area under the ROC curve remains the same whether the rank transformation is taken or not. On the other hand, outliers are influential on the multivariate discriminant functions described previously, because they change the estimated location and dispersion parameters of the two sample populations. After rank transformation, those outliers, such as very high observed values for diseased individuals or very low values for non-diseased individuals, will agree in magnitude with the other measurements in the same population. In other words, discrimination using rank transformations is not greatly influenced by outliers. Finally, as will be seen in the next chapter, the rank transformation can also play an important part in the ROC analysis of sequential diagnostic tests.

The rank transformation for a test with multivariate measurements involves ranking the $k$th components of all observations from smallest, with rank 1, to largest, with rank $N = m + n$. Each component is ranked separately for $k = 1$ through $k = p$. Then the sample means $\bar{x}_i(R)$ and sample covariance matrices $S_i(R)$ are computed on the ranks of the observations from each population separately.

In applying the ROC analysis described in this chapter to multivariate data, we first check for outliers by Q-Q plot. If obvious outliers exist, RLDF or RQDF can be tried. Meanwhile, if univariate normality is not acceptable for each component, even if there is no obvious outlier, RLDF or RQDF should be used because they make no model assumptions. Otherwise, LDF or QDF can be applied. In other words, the advisability of the rank transformation depends on the data set. The only difference between LDF and RLDF (or QDF and RQDF) is that in RLDF the observations are replaced by their ranks.
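The component-wise rank transformation followed by the estimated LDF (that is, the RLDF) can be sketched as below. This is only an illustration under stated assumptions: the data are hypothetical, tied values are given mid-ranks (the text does not say how ties are handled), and the helper names `column_ranks`, `pooled_cov`, `solve` and `ldf` are introduced here for the example.

```python
def column_ranks(rows):
    """Rank each component over the pooled sample (1 = smallest).
    Ties receive mid-ranks, which is an assumption of this sketch."""
    n, p = len(rows), len(rows[0])
    ranked = [[0.0] * p for _ in range(n)]
    for k in range(p):
        order = sorted(range(n), key=lambda i: rows[i][k])
        i = 0
        while i < n:
            j = i
            while j + 1 < n and rows[order[j + 1]][k] == rows[order[i]][k]:
                j += 1
            for t in range(i, j + 1):
                ranked[order[t]][k] = (i + j) / 2 + 1
            i = j + 1
    return ranked


def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]


def pooled_cov(a, b):
    """Pooled sample covariance matrix of two groups."""
    p = len(a[0])
    s = [[0.0] * p for _ in range(p)]
    for rows in (a, b):
        m = mean(rows)
        for r in rows:
            d = [r[k] - m[k] for k in range(p)]
            for i in range(p):
                for j in range(p):
                    s[i][j] += d[i] * d[j]
    dof = len(a) + len(b) - 2
    return [[v / dof for v in row] for row in s]


def solve(m, v):
    """Solve m w = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (aug[r][n] - sum(aug[r][k] * w[k] for k in range(r + 1, n))) / aug[r][r]
    return w


def ldf(neg, pos):
    """Weights S_pooled^{-1}(xbar2 - xbar1) and the midpoint cutpoint."""
    m1, m2 = mean(neg), mean(pos)
    w = solve(pooled_cov(neg, pos), [b - a for a, b in zip(m1, m2)])
    cut = 0.5 * sum(wi * (a + b) for wi, a, b in zip(w, m1, m2))
    return w, cut


neg = [[1.0, 2.0], [2.0, 1.0], [2.5, 3.0], [3.0, 2.5]]    # hypothetical D-
pos = [[3.5, 4.0], [4.0, 3.0], [5.0, 5.5], [60.0, 4.5]]   # hypothetical D+, one outlier
ranked = column_ranks(neg + pos)
w, cut = ldf(ranked[:4], ranked[4:])
scores = [sum(wi * xi for wi, xi in zip(w, r)) for r in ranked]
print(scores, cut)
```

After ranking, the outlying value 60.0 is simply the largest rank (8) in its column, so it no longer dominates the estimated means and covariance. The resulting univariate scores can be passed to the ROC analysis of Chapter 2.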
Finally, we summarize the major steps in carrying out ROC analysis for a test with multivariate measurements as follows.

Step 1: Consider $p$ diagnostic tests as a single test with $p$-variate measurements.

Step 2: Check the data set for outliers and normality and choose a suitable discriminant function. Then do the discriminant analysis, and obtain univariate scores.

Step 3: Use ROC methodology to analyze the univariate scores from the discriminant function, and obtain the final ROC curve, which represents the performance of the corresponding multiple diagnostic test results.

In Chapter 5, we will apply the ROC techniques described in this chapter to a breast cancer data set, which contains three diagnostic test measurements obtained from 381 individuals by three different tumor markers.

3.4 Summary

In this chapter we have discussed the ROC technique for evaluating the aggregate performance of several diagnostic tests. We consider several tests as a single test with multivariate measurements. Then we use discrimination functions, such as LDF and QDF or RLDF and RQDF, to project the multivariate measurements onto a straight line along which the measurements from the two populations are separated as much as possible. Finally, the traditional ROC analysis is applied to the univariate scores obtained from the projection, and the aggregate performance of the original multiple tests can be evaluated.

As mentioned before, in the diagnosis of a particular disease there are often several tests available for the same disease. If those tests are applied sequentially, we carry out one test and then decide, based on its result, whether we need a second test. This process continues until the status of the patient has been decided or we run out of tests. In the next chapter, we will describe sequential discrimination procedures, and then examine the ROC technique for evaluating sequential tests.
Chapter 4

ROC Analysis For Evaluating Sequential Diagnostic Procedures

Sequential discrimination procedures can be of significant interest in clinical diagnosis. To reduce risk, expense, and time, we hope to use fewer tests to determine whether an individual is really diseased or not. For example, suppose that we have three diagnostic tests for diagnosing breast cancer. If we are quite sure from the result of the first diagnostic test that an individual is cancer free, we do not need to carry out a second test on that individual. Otherwise, we should apply successive tests to that individual until we identify their status.

In the following we will describe a sequential discrimination procedure. Then, by simulation, we compare the performance of the sequential discrimination procedure with that of the simultaneous multivariate discrimination procedure in terms of total error rate and number of tests needed. Finally, we will introduce our new method of applying ROC to evaluate sequential diagnostic procedures.

4.1 Sequential Discrimination Procedures

A sequential discrimination procedure is as follows. For each individual, after each step we decide either to allocate the current $x$ to one of the two populations, $D_-$ and $D_+$, or to introduce another variable into $x$. In other words, for each patient, after each diagnostic test we decide either to classify him/her to the diseased or non-diseased population, or to delay judgement and carry out a further test.

One simple sequential discrimination procedure is suggested by Kendall and Stuart [36] and Kendall [37]. They first order the $x_1$ values of all the $m + n$ sample points and choose $a_1$ and $b_1$ such that all observations with $x_1 < a_1$ belong to $D_-$, and all observations with $x_1 > b_1$ belong to $D_+$.
For those observations with $a_1 \leq x_1 \leq b_1$ we go through the same procedure with $x_2$, choosing $a_2$ and $b_2$ such that the observations with $x_2 < a_2$ belong to $D_-$ and those with $x_2 > b_2$ belong to $D_+$. For those observations with $a_1 \leq x_1 \leq b_1$ and $a_2 \leq x_2 \leq b_2$ we proceed to $x_3$ and continue the process until all $m + n$ observations are classified to their correct groups, or we run out of variables. In this way, there are usually some unclassified observations left.

How to choose the first test to start the sequential diagnosis depends on circumstances. If our emphasis is on the accuracy of the sequential tests, $x_1$ should be the most accurate test. If we are concerned with economic factors, we can choose the cheapest test as $x_1$ even if it is not very accurate.

Assuming that potentially $x$ has infinite dimensions, most sequential discrimination procedures are based on the log-likelihood ratio and can be expressed as follows. Suppose that $f_{D_-}(x)$ and $f_{D_+}(x)$ are the density functions for $x$ from $D_-$ and $D_+$, respectively. Let $LR_k = \log\left(f_{D_+}(x_1, x_2, \ldots, x_k)/f_{D_-}(x_1, x_2, \ldots, x_k)\right)$; then the classification rule is

If $LR_k < B_k$, classify $x$ to $D_-$;
If $LR_k > A_k$, classify $x$ to $D_+$;
If $B_k \leq LR_k \leq A_k$, introduce $x_{k+1}$ and calculate $LR_{k+1}$.   (4.1)

A general theory of sequential discrimination is given by Hora [32].

It is a difficult problem to decide optimal bounds at each stage of the sequential tests, especially when joint distribution functions are involved. Since the main purpose of this chapter is to develop an ROC technique to evaluate sequential diagnostic tests, not to design the test itself, we simply adopt the method described in [33] to determine the discrimination bounds. In Section 4.4, we will show some initial research results on determining optimal bounds in the sense that $P(TP)$ is maximized at any given $P(FP)$.
In this paper, assuming $x$ to be normally distributed, we will use discrimination rule (4.1) for the first $p - 1$ stages, which utilizes all the test measurements available at each stage, and a forced discrimination procedure at the $p$th stage, where $p$ is the total number of tests available. By the method in [33], we observe $x_1$ and calculate

\[ A = \log\left(\frac{1 - \beta}{\alpha}\right) \tag{4.2} \]
\[ B = \log\left(\frac{\beta}{1 - \alpha}\right) \tag{4.3} \]
\[ LR_1 = \log\left[\frac{f_{D_+}(x_1)}{f_{D_-}(x_1)}\right] \tag{4.4} \]

where $\alpha$ and $\beta$ are the type I and type II error probabilities. If $x_1$ follows $N(\mu_1, \sigma^2)$ when it comes from $D_-$ and $N(\mu_2, \sigma^2)$ when it comes from $D_+$, then (4.4) becomes

\[ LR_1 = \left[x_1 - \tfrac{1}{2}(\mu_1 + \mu_2)\right]\sigma^{-2}(\mu_2 - \mu_1) \equiv D(x_1) \tag{4.5} \]

Then we assign $x$ either to $D_-$ if $LR_1 < B$ or to $D_+$ if $LR_1 > A$; otherwise, we take a second observation and compute the log-likelihood ratio for the two bivariate distributions,

\[ LR_2 = \log\left[\frac{f_{D_+}(x_1, x_2)}{f_{D_-}(x_1, x_2)}\right] \tag{4.6} \]

When $x = (x_1, x_2)'$ follows either $N(\mu_1, \Sigma)$ or $N(\mu_2, \Sigma)$, then, similarly to (4.5), we get

\[ LR_2 = \left[x - \tfrac{1}{2}(\mu_1 + \mu_2)\right]'\Sigma^{-1}(\mu_2 - \mu_1) \equiv D(x) \tag{4.7} \]

Then we assign $x$ either to $D_-$ if $LR_2 < B$ or to $D_+$ if $LR_2 > A$; otherwise, we take a third observation and compute the log-likelihood ratio for the two trivariate distributions, $LR_3$, and so on. Generally, at the first $p - 1$ stages, the discrimination rule is to assign $x$

\[ \text{to } D_- \ \text{if} \quad LR_k = \log\left[\frac{f_{D_+}(x_1, \ldots, x_k)}{f_{D_-}(x_1, \ldots, x_k)}\right] < B \tag{4.8} \]
\[ \text{to } D_+ \ \text{if} \quad LR_k = \log\left[\frac{f_{D_+}(x_1, \ldots, x_k)}{f_{D_-}(x_1, \ldots, x_k)}\right] > A \tag{4.9} \]

and to observe $x_{k+1}$ otherwise. In the multinormal case, $LR_k$ can be written in the same form as (4.7) but with $x$ having $k$ components.

At the $p$th stage, in which $x_p$ is the last test measurement available, we simply apply the forced discriminant procedure discussed in Chapter 3 to those observations which were not classified at the previous stages, that is, we assign $x$ to $D_-$ if $LR_p < 0$ and to $D_+$ if $LR_p > 0$.
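The sequential rule of this section can be sketched for the equal-covariance normal case as follows. This is an illustration under stated assumptions rather than the implementation used in the simulations: the bounds are taken to be $A = \log((1-\beta)/\alpha)$ and $B = \log(\beta/(1-\alpha))$, which reproduce the values 2.77 and -1.56 quoted in Table 4.3 for $\alpha = 0.05$ and $\beta = 0.20$; the function names and the example observation are hypothetical.

```python
import math


def solve(m, v):
    """Solve m w = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[piv] = aug[piv], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for k in range(c, n + 1):
                aug[r][k] -= f * aug[c][k]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (aug[r][n] - sum(aug[r][k] * w[k] for k in range(r + 1, n))) / aug[r][r]
    return w


def log_lr(x, mu1, mu2, sigma):
    """LR_k for the first k = len(x) tests, N(mu1, Sigma) versus N(mu2, Sigma)."""
    k = len(x)
    sub = [row[:k] for row in sigma[:k]]          # leading k x k block of Sigma
    w = solve(sub, [mu2[i] - mu1[i] for i in range(k)])
    return sum(w[i] * (x[i] - 0.5 * (mu1[i] + mu2[i])) for i in range(k))


def classify_sequentially(x, mu1, mu2, sigma, alpha=0.05, beta=0.20):
    """Return (stage, label, LR) for one subject with p test results in x."""
    upper = math.log((1 - beta) / alpha)          # bound A (assumed Wald form)
    lower = math.log(beta / (1 - alpha))          # bound B (assumed Wald form)
    p = len(x)
    for k in range(1, p):                         # stages 1, ..., p - 1
        lr = log_lr(x[:k], mu1, mu2, sigma)
        if lr > upper:
            return k, "D+", lr
        if lr < lower:
            return k, "D-", lr
    lr = log_lr(x, mu1, mu2, sigma)               # forced decision at stage p
    return p, ("D+" if lr > 0 else "D-"), lr


mu1 = [0.0, 0.0, 0.0]
mu2 = [3.0, 2.5, 2.25]
sigma = [[1.00, 0.25, 0.50],
         [0.25, 1.00, 0.75],
         [0.50, 0.75, 1.00]]
early = classify_sequentially([3.5, 2.0, 2.0], mu1, mu2, sigma)
late = classify_sequentially([1.6, 2.0, 2.0], mu1, mu2, sigma)
print(early, late)
```

In this example the first subject is classified after one test, while the second remains between the bounds at stages 1 and 2 and is decided by the forced rule at stage 3.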
4.2 Performance Evaluation

The performance of sequential diagnostic procedures can be compared with that of simultaneous multivariate discrimination procedures in terms of total error rate and number of tests needed, which will be defined later. The performance of sequential diagnostic procedures can also be evaluated by ROC. In this chapter we will develop a new method based on the rank transformation and ROC methodology to evaluate sequential diagnostic procedures.

The main advantage of the sequential discrimination procedure is that patients can be classified to either the $D_-$ or the $D_+$ population with fewer diagnostic tests than the simultaneous procedure requires. Of course, the benefit of saving tests comes with some loss of accuracy. In the limited simulation below, the two procedures were compared in terms of total error rate and number of tests needed. Though we did not take all the different situations into account, the limited comparisons still give a rough idea that, in the large sample case, the sequential procedure is nearly as accurate as the simultaneous procedure, while giving a significant reduction in the number of tests needed.

4.2.1 Comparison under Different Distribution Overlap

Samples were generated from each of the two populations, $D_-$ and $D_+$, with parameters chosen to give roughly three, two, and one standard deviations of separation. The trivariate normal distributions sampled were as follows:

a. Weak overlap:

\[ \mu_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 3 \\ 2.5 \\ 2.25 \end{pmatrix}, \quad \text{and} \quad \Sigma = \begin{pmatrix} 1.00 & 0.25 & 0.50 \\ 0.25 & 1.00 & 0.75 \\ 0.50 & 0.75 & 1.00 \end{pmatrix} \]

b. Medium overlap:

\[ \mu_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 2 \\ 1.5 \\ 1.25 \end{pmatrix}, \quad \text{and} \quad \Sigma = \begin{pmatrix} 1.00 & 0.25 & 0.50 \\ 0.25 & 1.00 & 0.75 \\ 0.50 & 0.75 & 1.00 \end{pmatrix} \]

c. Strong overlap:

\[ \mu_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 1 \\ 0.5 \\ 0.25 \end{pmatrix}, \quad \text{and} \quad \Sigma = \begin{pmatrix} 1.00 & 0.25 & 0.50 \\ 0.25 & 1.00 & 0.75 \\ 0.50 & 0.75 & 1.00 \end{pmatrix} \]

We generated samples of size 500 from each of these distributions and classified them by the two procedures, with $\alpha = 0.05$ and $\beta = 0.20$ used in the sequential procedure (Equations 4.2 and 4.3). We repeated this process five times and calculated the average proportion of observations which were misclassified, and the average number of tests needed to classify all the observations. The results are given in Table 4.1.

Table 4.1: Comparison under Different Overlap (1,000 observations)

Method         Overlap   Error Rate   Tests Needed : Correctly Classified : Incorrectly Classified   RE
Simultaneous   Weak      0.045        3,000:955:45                                                    0.00
Sequential     Weak      0.052        1,233:948:52                                                    0.8835
                                      (1,000:813:30 / 157:78:3 / 76:57:19)
Simultaneous   Medium    0.121        3,000:879:121                                                   0.00
Sequential     Medium    0.138        1,892:862:138                                                   0.554
                                      (1,000:458:22 / 520:133:15 / 372:271:101)
Simultaneous   Strong    0.274        3,000:726:274                                                   0.00
Sequential     Strong    0.276        2,908:724:276                                                   0.046
                                      (1,000:7:0 / 993:64:14 / 915:653:262)

In Table 4.1, the error rates, which are defined as

\[ \text{Error Rate} = \frac{\text{Number of False Positives} + \text{Number of False Negatives}}{\text{Total Number of Observations}} \]

are larger for the sequential discrimination procedure than for the simultaneous procedure, but quite close to them. Further, let $p$ be the number of different tests available, $N(\text{Sequential})$ be the number of tests actually done by the sequential procedure, and $M$ be the number of subjects. Then we define the relative efficiency (RE) of the sequential test as

\[ RE = \frac{\text{Number of tests saved}}{M \times (p - 1)} = \frac{N(\text{Simultaneous}) - N(\text{Sequential})}{M \times (p - 1)} = \frac{M \times p - N(\text{Sequential})}{M \times (p - 1)} \]

When no test can be saved, $RE = 0$, and when all the second and later tests can be saved, $RE = 1$. From Table 4.1, the number of tests saved by the sequential procedure is significant, especially when the overlap between the two populations is not great.
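As a small check of the relative efficiency definition, the RE column of Table 4.1 can be recomputed from the numbers of tests needed. The function name is introduced here for illustration only.

```python
def relative_efficiency(tests_used, n_subjects, n_tests):
    """RE = (M * p - N(Sequential)) / (M * (p - 1))."""
    saved = n_subjects * n_tests - tests_used
    return saved / (n_subjects * (n_tests - 1))


# Tests needed by the sequential procedure in Table 4.1 (M = 1,000, p = 3)
re_weak = relative_efficiency(1233, 1000, 3)
re_medium = relative_efficiency(1892, 1000, 3)
re_strong = relative_efficiency(2908, 1000, 3)
print(re_weak, re_medium, re_strong)
```

The simultaneous procedure always uses $M \times p = 3{,}000$ tests here, so its relative efficiency is zero by construction.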
Meanwhile, as the distribution overlap increased, fewer observations could be classified by the first two sequential discrimination rules. Therefore, more and more observations could not be classified until the last stage, i.e., the forced discrimination stage.

4.2.2 Comparison under Different Correlation Coefficients

To investigate the effect of the correlation coefficients among different diagnostic test measurements on the relative efficiency of the sequential procedure, we generated samples of size 500 from each of the two populations in the following three situations.

a. Weakly correlated:

\[ \mu_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 3 \\ 2.5 \\ 2.25 \end{pmatrix}, \quad \text{and} \quad \Sigma = \begin{pmatrix} 1.00 & 0.10 & 0.10 \\ 0.10 & 1.00 & 0.10 \\ 0.10 & 0.10 & 1.00 \end{pmatrix} \]

b. Medium correlated:

\[ \mu_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 3 \\ 2.5 \\ 2.25 \end{pmatrix}, \quad \text{and} \quad \Sigma = \begin{pmatrix} 1.00 & 0.50 & 0.50 \\ 0.50 & 1.00 & 0.50 \\ 0.50 & 0.50 & 1.00 \end{pmatrix} \]

c. Strongly correlated:

\[ \mu_1 = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \quad \mu_2 = \begin{pmatrix} 3 \\ 2.5 \\ 2.25 \end{pmatrix}, \quad \text{and} \quad \Sigma = \begin{pmatrix} 1.00 & 0.90 & 0.90 \\ 0.90 & 1.00 & 0.90 \\ 0.90 & 0.90 & 1.00 \end{pmatrix} \]

The proportions of misclassification and the relative efficiencies are given in Table 4.2, where $\alpha = 0.05$ and $\beta = 0.20$ for the sequential procedure. As the correlations increased, the information contained in the three tests approached that of a single test. If the three tests were identical, then three tests would be only as good as a single one. In Table 4.2 the error rates increased and the relative efficiencies decreased as the correlations increased.

4.2.3 Comparison under Different Discrimination Bounds

The two discrimination bounds, $A$ and $B$, which depend upon $\alpha$ and $\beta$, are crucial to the relative efficiency of the sequential procedure.
Table 4.2: Comparison under Different Correlation (1,000 observations)

Method         Correlation   Error Rate   Tests Needed : Correctly Classified : Incorrectly Classified   RE
Simultaneous   Weak          0.015        3,000:985:15                                                    0.00
Sequential     Weak          0.037        1,205:963:37                                                    0.8975
                                          (1,000:813:30 / 157:106:3 / 48:44:4)
Simultaneous   Medium        0.047        3,000:953:47                                                    0.00
Sequential     Medium        0.057        1,274:943:57                                                    0.863
                                          (1,000:813:30 / 157:39:1 / 117:91:26)
Simultaneous   Strong        0.067        3,000:933:67                                                    0.00
Sequential     Strong        0.085        1,253:922:78                                                    0.843
                                          (1,000:813:30 / 157:0:0 / 157:102:55)

In the following situations we compared the sequential and simultaneous procedures at

a: $\alpha = 0.10$ and $\beta = 0.20$
b: $\alpha = 0.05$ and $\beta = 0.20$
c: $\alpha = 0.01$ and $\beta = 0.01$

Using samples of size 500 from each of the two populations with medium correlation, we obtained the results given in Table 4.3. As $\alpha$ and $\beta$ decrease, the upper bound becomes higher and the lower bound becomes lower. At the first two stages, more and more observations then fall into the interval between the two bounds and cannot be classified. Thus, the total number of tests needed increased and the error rates decreased as the intervals enlarged.

As a whole, the sequential discrimination procedure is nearly as good as the simultaneous procedure in accuracy, and the number of tests needed for classification is reduced significantly compared to the simultaneous procedure. In addition, it should be pointed out that the performance of the sequential test depends heavily on the separation of the two populations on $x_1$, i.e., the best test among all the tests. If the best test is good enough, the second and the third tests will be unnecessary. In contrast, if all three tests are quite poor, the sequential test will not yield sufficient savings.

4.3 Evaluating Sequential Procedures by ROC

In the above sections we discussed sequential discrimination procedures, and compared them with the simultaneous procedures.
Although the sequential procedure is suboptimal in the sense of minimum misclassification rate, the number of tests saved is significant relative to the simultaneous ones. In this section, we will develop a procedure to evaluate the performance of sequential diagnostic procedures.

Table 4.3: Comparison under Different Discrimination Bounds

Method         Bounds                   Error Rate   Tests Needed : Correctly Classified : Incorrectly Classified   RE
Simultaneous                            0.047        3,000:953:47                                                    0.00
Sequential     α = 0.10; β = 0.20       0.069        1,195:931:69                                                    0.9025
               log(B) = -1.5                         (1,000:828:47 / 125:54:1 / 70:49:21)
               log(A) = 2.08
Sequential     α = 0.05; β = 0.20       0.057        1,274:943:57                                                    0.863
               log(B) = -1.56                        (1,000:813:30 / 157:39:1 / 117:91:26)
               log(A) = 2.77
Sequential     α = 0.01; β = 0.01       0.049        1,682:951:49                                                    0.659
               log(B) = -4.6                         (1,000:647:7 / 346:20:0 / 326:284:42)
               log(A) = 4.6

The major problem in using ROC here is how to set high-dimensional cutpoints in order to obtain an ROC curve. At the first stage we need a cutpoint to separate the univariate measures in $x_1$ space, at the second stage we need a surface to separate bivariate measures in $(x_1, x_2)$ space, and so on. Following the idea of projection in Chapter 3, we can use multivariate discrimination functions to transform the multivariate measurements to univariate ones, and use cutpoints at each stage. The problem then arises of how we should combine the univariate measurements obtained at each stage.

The purpose here is the same as that in Chapter 3, i.e., to combine multivariate sequential measurements into univariate scores. The univariate scores will be expressed in terms of ranks. The diseased individuals should have high ranks, and the non-diseased ones should have low ranks. Further, the individuals who are classified by the first test as diseased (or non-diseased) are considered to have the highest ranks (or lowest ranks). The individuals who cannot be classified until the last test should have moderate ranks. This idea is described in detail in the following.
In sequential discrimination, only those individuals with either very high or very low values among all the individuals would be classified by the first test. We consider these people as very diseased and very non-diseased, respectively. Therefore, the ranks of those classified individuals would be either the highest or the lowest ones. Then, at the second stage, the discriminant function based on the two best tests, applied to those unclassified at the first stage, gives univariate measurements, among which those with either relatively high or relatively low values would be classified at this stage. Those people are considered as diseased and non-diseased, respectively. Thus, the ranks of the individuals with relatively high observations classified at the second stage would come immediately below those of the individuals with very high observations classified at the first stage, while the ranks of the individuals with relatively low observations classified at the second stage would come immediately above those of the individuals with very low observations classified at the first stage, and so on.

For example, suppose that there are a total of $N$ observations to be classified. At the first stage, $n_1$ and $m_1$ observations are classified as positive and negative, respectively. Thus, those $n_1$ observations would rank from $N$ down to $N - n_1 + 1$, and those $m_1$ observations would rank from 1 up to $m_1$. Then, at the second stage, $n_2$ and $m_2$ observations are classified as positive and negative, respectively. Therefore, those $n_2$ observations would rank from $N - n_1$ down to $N - n_1 - n_2 + 1$, and those $m_2$ observations would rank from $m_1 + 1$ up to $m_1 + m_2$. Finally, the remaining $N - n_1 - n_2 - m_1 - m_2$ observations would be forced to be classified at the third stage, and their ranks would be in the middle. The relative ranks of the observations classified at each stage depend on the absolute values of the univariate measurements.
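The rank assignment described above can be sketched as follows. The input format (the stage at which each subject was classified, the label assigned, and the univariate discriminant value at that stage) and the function name are assumptions made for illustration; the data are hypothetical.

```python
def sequential_ranks(stage, label, score, n_stages):
    """Combine stage-wise decisions into a single ranking from 1 to N.

    stage[i] : stage (1..n_stages) at which subject i was classified
    label[i] : '+' or '-' assigned at that stage
    score[i] : univariate discriminant value at that stage
    """
    def block(i):
        if stage[i] == n_stages:
            return n_stages                  # forced stage: middle ranks
        if label[i] == '-':
            return stage[i]                  # earlier negatives rank lower
        return 2 * n_stages - stage[i]       # earlier positives rank higher

    # Order by block first, then by the discriminant value within a block.
    order = sorted(range(len(stage)), key=lambda i: (block(i), score[i]))
    ranks = [0] * len(stage)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


# Hypothetical results for N = 8 subjects and three tests
stage = [1, 1, 2, 3, 3, 2, 1, 3]
label = ['+', '-', '-', '-', '+', '+', '+', '-']
score = [5.0, -4.0, -2.0, -0.3, 0.4, 3.0, 4.0, -0.1]
ranks = sequential_ranks(stage, label, score, n_stages=3)
print(ranks)
```

Within each block all scores come from the same stage, so they are comparable; scores from different stages are never compared directly. The resulting ranks serve as the univariate "measurements" for the ROC analysis.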
After the above rank transformation based on the sequential discrimination procedure, univariate rank "measurements" are obtained. Subsequently, the traditional ROC technique can be applied to evaluate the performance of the corresponding sequential discrimination procedure. We summarize the above ROC procedure as follows, where the best test is understood to be the most accurate one.

Step 1: Apply the ROC procedure to each diagnostic test separately, and find the best test in terms of the largest area under the ROC curve. In the same way, decide the second best discriminator, and so on.

Step 2: Start the sequential discrimination procedure from the test which is the best discriminator, with given $\alpha$ and $\beta$. Using (4.6) and comparing with $A$ and $B$, part of the observations are classified and their ranks are decided.

Step 3: For those observations which have not been classified at the first stage, add the measurements of the second test, i.e., the test which is best conditional on the first best test being used, and construct the resulting bivariate measurements. Using (4.8) and (4.9), and comparing with the discrimination bounds, those observations which fall outside the interval are classified and their ranks are determined as described earlier.

Step 4: Suppose that only three diagnostic tests are available; then at the third stage all the observations left unclassified so far are classified.

Step 5: Combining all the observations by the ranks derived at the different stages, we obtain univariate rank measurements. Finally, ROC analysis is applied to those rank measurements, and the performance of the sequential diagnostic procedure can be evaluated.

4.4 Some Initial Work On Determining Bounds

In the previous discussion of sequential discrimination procedures, the discrimination bounds were determined by the preassigned type I and type II errors. In other words, once the two errors $\alpha$ and $\beta$ are chosen, the discrimination bounds $A$ and $B$ are decided.
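As a small numerical illustration of how $\alpha$ and $\beta$ fix the bounds, the log bounds listed in Table 4.3 agree, to the precision printed there, with Wald-type bounds $A = (1-\beta)/\alpha$ and $B = \beta/(1-\alpha)$. Equations (4.2) and (4.3) are not reproduced in this part of the text, so treating them as having this form is an assumption; the helper name `wald_log_bounds` is hypothetical.

```python
from math import log


def wald_log_bounds(alpha, beta):
    """Return (log B, log A), assuming A = (1 - beta)/alpha and B = beta/(1 - alpha)."""
    log_b = log(beta / (1.0 - alpha))
    log_a = log((1.0 - beta) / alpha)
    return log_b, log_a


# The three (alpha, beta) pairs used in Table 4.3.
bounds = {}
for alpha, beta in [(0.10, 0.20), (0.05, 0.20), (0.01, 0.01)]:
    bounds[(alpha, beta)] = wald_log_bounds(alpha, beta)
    log_b, log_a = bounds[(alpha, beta)]
    print(f"alpha={alpha:.2f}  beta={beta:.2f}  log(B)={log_b:.2f}  log(A)={log_a:.2f}")
```

Larger $\alpha$ and $\beta$ pull the two bounds towards each other, which is why more individuals are classified at the early stages.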
Usually, $\alpha$ and $\beta$ are chosen according to the required power ($= 1 - \beta$) and the significance level ($= \alpha$) of the test. The constant bounds given by (4.2) and (4.3) are simple, but unlikely to be optimal. There may be other ways to choose optimal discrimination bounds, in the sense that for a given value of $P(FP)$, say $P_{fp}$, we find bounds such that $P(TP)$ is maximized. For example, in a simple two-stage diagnostic test, we have three bounds, $A$, $B$, and $C$, to be decided. Suppose that object $x$ has two sequential measurements, $x_1$ and $x_2$, and our classification rule is

At the first stage:
    If $x_1 > A$, classify $x$ to $D_+$;
    If $x_1 < B$, classify $x$ to $D_-$;                          (4.10)
    If $x_1 \in (B, A)$, go to the second measurement $x_2$;

and at the second stage:
    If $x_2 > C$, classify $x$ to $D_+$;
    If $x_2 < C$, classify $x$ to $D_-$.                          (4.11)

Assume that the two populations are normally distributed with known parameters, where $D_-$ has mean vector $\mu_1$ and $D_+$ has mean vector $\mu_2$,

$$\mu_1 = \begin{pmatrix} \mu_{11} \\ \mu_{12} \end{pmatrix}, \qquad \mu_2 = \begin{pmatrix} \mu_{21} \\ \mu_{22} \end{pmatrix},$$

with unit variances and correlation coefficient $\rho$ between $x_1$ and $x_2$. Then we can calculate $P(TP)$ and $P(FP)$ as

$$P(TP) = P\{x_1 > A \mid f_{D_+}\} + P\{x_1 \in (B, A),\ x_2 > C \mid f_{D_+}\}$$
$$= 1 - \Phi(A - \mu_{21}) + \int_C^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_2 - \mu_{22})^2} \left[ \Phi\!\left( \frac{A - \mu_{21} - \rho(x_2 - \mu_{22})}{\sqrt{1 - \rho^2}} \right) - \Phi\!\left( \frac{B - \mu_{21} - \rho(x_2 - \mu_{22})}{\sqrt{1 - \rho^2}} \right) \right] dx_2 \tag{4.12}$$

$$P(FP) = P\{x_1 > A \mid f_{D_-}\} + P\{x_1 \in (B, A),\ x_2 > C \mid f_{D_-}\}$$
$$= 1 - \Phi(A - \mu_{11}) + \int_C^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_2 - \mu_{12})^2} \left[ \Phi\!\left( \frac{A - \mu_{11} - \rho(x_2 - \mu_{12})}{\sqrt{1 - \rho^2}} \right) - \Phi\!\left( \frac{B - \mu_{11} - \rho(x_2 - \mu_{12})}{\sqrt{1 - \rho^2}} \right) \right] dx_2 \tag{4.13}$$

Let $\psi(A, B, C) = P(TP)$ and $\phi(A, B, C) = P_{fp} - P(FP)$. The problem then becomes that of finding $A$, $B$, and $C$ so as to maximize $\psi(A, B, C)$ subject to $\phi(A, B, C) = 0$. In other words, we need to maximize

$$\Gamma = \psi(A, B, C) + \lambda\, \phi(A, B, C) = P(TP) + \lambda \left( P_{fp} - P(FP) \right) \tag{4.14}$$

where $\lambda$ is the Lagrange multiplier. A direct solution of the above optimization problem is difficult. One suggested way to solve the problem iteratively is to use the Newton-Raphson method. Unfortunately, we cannot give a general criterion for deciding the initial values of $A$, $B$, and $C$.
In the following we suggest a possible procedure for choosing optimal bounds in the sense that $P(TP)$ is maximized when $P(FP)$ is given. The idea is that, for given values of $A$ and $B$, we choose $C$ such that $P(FP) = P_{fp}$, and then calculate the corresponding $P(TP)(A, B, C)$. Different values of $A$ and $B$ result in different $C$ and $P(TP)(A, B, C)$. The optimal values of $(A, B, C)$ are the values at which $P(TP)$ reaches its maximum. The problem is then how to choose a limited number of reasonable pairs $(A, B)$ among the infinitely many combinations. Since an individual with $x_1 > A$ is always classified as positive and one with $x_1 < B$ never is, (4.13) gives

$$P(FP) > 1 - \Phi(A - \mu_{11}) \tag{4.15}$$
$$P(FP) < 1 - \Phi(B - \mu_{11}) \tag{4.16}$$

When $P(FP) = P_{fp}$, there exist constants $a$ and $b$ such that (4.15) and (4.16) are equivalent to the following equalities:

$$1 - \Phi(A - \mu_{11}) = a\, P_{fp} \tag{4.17}$$
$$\Phi(B - \mu_{11}) = b\, (1 - P_{fp}) \tag{4.18}$$

where $0 < a < 1$ and $0 < b < 1$. Once $a$ and $b$ are selected, the bounds $A$ and $B$ can be solved from (4.17) and (4.18), and $C$ can be computed from (4.13). For example, let $a$ as well as $b$ be 0.1, 0.3, 0.5, 0.7, and 0.9, respectively. We then have 25 combinations of corresponding $(A, B)$. Given $P(FP) = P_{fp}$, the bound $C$ can be calculated, assuming that the two distributions are known. Thus, 25 values of $P(TP)$ are obtained. We choose as approximately optimal the bounds $A$, $B$, and $C$ for which the corresponding value of $P(TP)$ is the largest among the 25 values. Of course, the more values $a$ and $b$ take, the more accurate the values of $A$, $B$, and $C$ will be, and the more calculation will be needed. We can also write a program to do the search automatically.

For example, given the distribution parameters $\mu_{11} = \mu_{12} = 0$, $\mu_{21} = 1.5$, $\mu_{22} = 1.0$, and $\rho = 0$, we find the maximum value of $P(TP)$ and the corresponding $A$, $B$, and $C$ for a given value of $P_{fp}$. Choosing a set of values for $P_{fp}$, we summarize the results in Table 4.4. From Table 4.4, as the value of $P_{fp}$ increased, the corresponding $P(TP)$ increased as well.
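The search over $(a, b)$ can be sketched as follows for the uncorrelated case $\rho = 0$, where the condition $P(FP) = P_{fp}$ can be solved for $C$ in closed form. This is a minimal sketch under that assumption, not the program used for Table 4.4; the names `p_positive` and `search_bounds` and the closed-form step are introduced here for illustration.

```python
from statistics import NormalDist

N01 = NormalDist()  # standard normal distribution


def p_positive(A, B, C, m1, m2):
    """P(classified positive) when x1 ~ N(m1, 1) and x2 ~ N(m2, 1) are independent."""
    undecided = N01.cdf(A - m1) - N01.cdf(B - m1)  # P(B < x1 < A)
    return 1.0 - N01.cdf(A - m1) + undecided * (1.0 - N01.cdf(C - m2))


def search_bounds(p_fp, neg_means, pos_means, grid):
    """Grid search over (a, b) in (4.17)-(4.18); returns (P(TP), A, B, C)."""
    mu11, mu12 = neg_means
    best = None
    for a in grid:
        for b in grid:
            A = mu11 + N01.inv_cdf(1.0 - a * p_fp)    # from (4.17)
            B = mu11 + N01.inv_cdf(b * (1.0 - p_fp))  # from (4.18)
            # With rho = 0, P(FP) = a*p_fp + P(B < x1 < A) * P(x2 > C),
            # so P(x2 > C) must equal q below for P(FP) = p_fp to hold.
            q = (1.0 - a) * p_fp / (1.0 - a * p_fp - b * (1.0 - p_fp))
            C = mu12 + N01.inv_cdf(1.0 - q)
            tp = p_positive(A, B, C, *pos_means)
            if best is None or tp > best[0]:
                best = (tp, A, B, C)
    return best


grid = [0.1, 0.3, 0.5, 0.7, 0.9]  # the 25 combinations mentioned in the text
best_tp, A, B, C = search_bounds(0.01, (0.0, 0.0), (1.5, 1.0), grid)
print(f"P(TP)={best_tp:.4f}  A={A:.4f}  B={B:.4f}  C={C:.4f}")
```

A finer grid for `a` and `b` costs more evaluations but tracks the maximum more closely, which is the trade-off noted above.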
The bounds $A$ and $B$ shifted to the left, and the bound $C$ changed accordingly.

Table 4.4: The Maximum $P(TP)$ and the Corresponding $A$, $B$, and $C$

P_fp   Maximum P(TP)   A         B         C
0.01   0.2599           2.6525    1.2320    1.5788
0.02   0.3622           2.4093    1.1851    1.2309
0.03   0.4309           2.2576    1.1408    1.0084
0.04   0.4828           2.1449    1.0985    0.8411
0.05   0.5243           2.0542    1.0581    0.7059
0.06   0.5588           1.9778    1.0194    0.5917
0.08   0.6172           1.8526    0.6307    0.8173
0.10   0.6606           1.4758    0.3319    1.23
0.20   0.8271           1.2817   -1.3514    1.1600
0.30   0.9442           1.0364   -1.4132    0.8630
0.50   0.9561           0.3853   -0.6745    0.32
0.90   0.9986          -0.8779   -1.8808   -0.15

Finally, we compare the $P(TP)$ value at each $P_{fp}$ point of our nearly optimal procedure (OP) with that of the procedure whose discrimination bounds are calculated by Kendall and Stuart's method (KS), and give the results in Table 4.5.

Table 4.5: Comparison of $P(TP)$ Values for Two Methods

          False Positive Proportion (P_fp)
Method    0.01     0.02     0.04     0.06     0.10     0.20     0.30     0.50     0.90
OP        0.2599   0.3622   0.4828   0.5588   0.6643   0.8271   0.9442   0.9561   0.9987
KS        0.2278   0.3118   0.4181   0.4908   0.5922   0.7407   0.8272   0.9242   0.9958

Clearly, when $P_{fp}$ is fixed, the larger $P(TP)$ is, the stronger the discrimination ability. In Table 4.5, all the $P(TP)$ values by OP are higher than those by KS, especially when $P_{fp}$ is small. Thus the improvement by the OP method is obvious.

Finally, it should be pointed out that the results from the optimal algorithm in Tables 4.4 and 4.5 are theoretical values, obtained under the assumption that the two populations are normally distributed with known parameters and common variance. If these assumptions are not satisfied, the performance will decrease. Moreover, the optimal bounds $A$, $B$, and $C$ calculated from a fixed value of $P(FP)$ do not necessarily result in the same false positive proportion for the data set, because the estimated parameters are not exactly equal to the true values.
Chapter 5

Analyses Of Breast Cancer Data Using ROC Procedures

In the previous chapters we described the traditional ROC methodology, and developed two modified ROC procedures, which can be used to evaluate the aggregate performance of several diagnostic tests and of sequential diagnostic methods, respectively. In this chapter we apply those two modified ROC procedures to breast cancer data obtained from three tumor markers. The data set is used with permission from the British Columbia Cancer Agency. It was first studied by Silver et al. [1], and consists of measurements of three potential breast tumor markers, CEA, CA15.3, and MCA.

5.1 Tumor Markers and Breast Cancer Data

Data were obtained using three tumor markers, CEA, CA15.3, and MCA. We arrange the data in three groups: normal (N), breast cancer (BC), and high risk patients (HR).

Tumor markers are important tools in clinical diagnosis, in which they work as discriminators between breast-cancer patients and non-breast-cancer patients. CEA was the first to be used and is well accepted as a tumor marker for diagnosing cancer [38]. In recent years, more and more potentially specific tumor markers have been developed [39], among which are CA15.3 and MCA. Here, we do not go into too much detail about tumor markers, but only give some general descriptions of the three tumor markers, as described in [1].

Carcinoembryonic antigen (CEA) is a large, complex, high-molecular-weight glycoprotein associated with the cellular glycocalyx. The measurement of CEA employs an enzyme immunoassay method using heat extraction and polyclonal anti-CEA antibody.

The CA15.3 test uses a double determinant radioimmunoassay utilizing two different monoclonal antibodies, 115D8 and DF3. 115D8 is immobilized on polystyrene beads to complete the double antibody sandwich with 125I-DF3.
Mucinous-like carcinoma-associated antigen (MCA) is a 350 kD glycoprotein produced by mammary carcinomas and some normal tissues. The monoclonal antibody used to detect MCA is known as b12.

The data to be used here are from three groups. The normal group (N) contained samples from 100 female donors, aged from 16 to 91, collected and donated by the Canadian Red Cross. Donors did not suffer from known neoplastic, inflammatory, or liver disease. In the breast cancer (BC) group, samples were from 158 female patients with histologically confirmed breast carcinoma. The third group (HR) consisted of samples from 33 women who were believed to be at high risk of developing breast carcinoma on the basis of physical examination or mammogram. Each of the 33 women had undergone diagnostic cytology which did not reveal evidence of malignancy.

5.2 Exploratory Data Analysis

Before evaluating the three tumor markers by the modified ROC procedures, we first undertook exploratory data analyses on the 291 original observations from the three groups. Table 5.1 summarizes the basic descriptive statistics for the measurements of the three tumor markers in each group.

In Table 5.1 we notice that the dispersion of the observations in both Group N and Group HR is not severe. The mean and median values in each group are quite close, and the standard deviations are relatively small. This implies that the data in these two groups agree with one another well in magnitude, which is largely confirmed by the histograms and the Q-Q plots shown in Figure 5.1 to Figure 5.12. The Q-Q plots show that the observations in the two groups are approximately normal.

Table 5.1 also shows that the distributions of the three measurements in Group BC are not symmetric at all. Figures 5.13, 5.15, and 5.17 show that the three measurements in this group depart even further from normality.
The reason for this is that there are several extremely large measurements, which appear clearly in the histograms in Figures 5.14, 5.16, and 5.18. Unfortunately, these observations cannot simply be treated as outliers and deleted, because they agree with the other observations in the sense that large values correspond to disease. But the existence of such extreme values influences the estimation of the population parameters, which is crucial to the implementation of the multivariate discriminant functions in ROC analyses for evaluating the aggregate performance of multiple tests. Under such circumstances, as described in Chapter 3, the rank transformation can be used so that the observations in each group agree with one another in magnitude. The resulting descriptive statistics for the rank-transformed data are shown in Table 5.2.

Table 5.1: Statistical Descriptions for Original Data in Three Groups

Group      Test     Mean     SD        Max     Q3      Median   Q1      Min
Normal     CEA      0.95     0.67      2.8     1.45    0.80     0.40    0.1
           CA15.3   15.02    4.93      28.2    17.95   15.10    11.00   5.5
           MCA      6.01     3.36      13.5    8.15    5.95     3.15    0.6
High Risk  CEA      1.55     1.11      4.1     2.3     1.0      0.8     0.4
           CA15.3   20.10    8.60      51.0    24.3    19.7     14.1    8.0
           MCA      7.95     4.64      18.0    10.6    9.2      3.4     0.6
Cancer     CEA      125.58   1431.96   18000   4.2     1.60     0.7     0.1
           CA15.3   274.87   1977.19   24600   48.0    22.00    16.0    6.2
           MCA      46.97    190.80    2100    15.8    7.45     3.7     0.5

Table 5.2: Statistical Descriptions for Rank-transformed Data

Group      Test     Mean    SD     Max     Q3      Median   Q1      Min
Normal     CEA      108.7   69.2   237.0   170.5   105.0    41.5    11.5
           CA15.3   99.0    62.5   228.0   142.3   99.0     41.5    1.5
           MCA      122.3   69.0   242.0   174.8   129.8    61.0    3.0
High Risk  CEA      150.4   66.6   250.5   221.5   131.5    105.0   41.5
           CA15.3   147.5   77.7   254.5   212.0   167.0    76.5    13.0
           MCA      151.0   81.0   255.0   212.0   188.0    69.0    3.0
Cancer     CEA      168.7   87.9   291.0   252.0   181.5    93.0    11.5
           CA15.3   175.4   84.3   291.0   251.0   193.5    107.0   3.5
           MCA      160.0   90.4   291.0   251.5   161.3    78.6    1.0

5.3 The Aggregate Performance of CEA, CA15.3, and MCA

Based on the above rank-transformed data, the modified ROC analysis is carried out in this section to evaluate the aggregate performance of the three tumor markers.

As indicated in Chapter 2, the area under an ROC curve represents the probability that a random observation $X$ from $D_-$ is exceeded by a random observation $Y$ from $D_+$. The more separated the samples from the two populations, the larger the area under the corresponding ROC curve. In other words, a good diagnostic test results in a large separation between the samples from $D_-$ and $D_+$. The separation between the two populations can be illustrated using a boxplot.

In order to evaluate the performance of multiple tests quantitatively, we also present the performance of each single test. For the comparison of Group N with Group BC and of Group HR with Group BC, Figures 5.19, 5.20, and 5.21 show the boxplots of CEA, CA15.3, and MCA, respectively. These figures suggest that CA15.3 separates the two populations better than either CEA or MCA does, and that CEA is a little better than MCA. These impressions from the boxplots are confirmed by the traditional ROC analysis, whose results are given in the first three rows of Table 5.3 and Table 5.4.

Then we consider the three tests as a single test with three measurements, and use the RLDF or RQDF described in Chapter 3, which do not need the normality assumption, to obtain the univariate scores. The corresponding boxplots for pairwise comparison are given in Figures 5.22, 5.23, 5.24, and 5.25. Finally, we follow the steps listed at the end of Chapter 3 and calculate the ROC index values. The values for RLDF and RQDF are given in the last two rows of Table 5.3 and Table 5.4, and evaluate the aggregate performance of the three tests.

Table 5.3: ROC Areas of Individual Test on the Patients in Group N and Group BC

Test     A_z      SD(A_z)   A_t      SD(A_t)
CEA      0.7141   0.0311    0.7022   0.0321
CA15.3   0.7722   0.0283    0.7573   0.0294
MCA      0.6462   0.0332    0.6239   0.0349
RLDF     0.8105   0.0262    0.8153   0.0258
RQDF     0.8208   0.0254    0.8159   0.0258

Note: A_z and A_t are the area under the fitted curve and the trapezoidal (Wilcoxon) area, and SD(A_z) and SD(A_t) are the corresponding estimated standard deviations. RLDF and RQDF are combined tests based on the linear and the quadratic discriminant function, respectively.

Table 5.4: ROC Areas of Individual Test on the Patients in Group HR and Group BC

Test     A_z      SD(A_z)   A_t      SD(A_t)
CEA      0.5714   0.0446    0.5744   0.0528
CA15.3   0.6259   0.0488    0.6116   0.0510
MCA      0.5625   0.0494    0.5485   0.0539
RLDF     0.6657   0.0471    0.6799   0.0466
RQDF     0.7121   0.0425    0.7096   0.0443

From the results of the ROC analysis in Table 5.3, where we compared the normal (N) people and the breast cancer (BC) patients, it is immediately apparent that CA15.3 was the best discriminator of the three tumor markers in this case, and CEA was the second best. Further, the combined test from the three markers, based on either the linear or the quadratic discriminant function, was better than any single tumor marker, which means that we can get more information from the combined test than from any single one of the three markers. Similarly, when we tested how well the tumor markers discriminated between the high risk (HR) and breast cancer (BC) groups, CA15.3 was still the best discriminator among the three single markers, as indicated in Table 5.4. In both tables the combined tests based on RQDF were, according to the areas under the fitted ROC curves, somewhat better than the ones based on RLDF. Since we could not obtain valid random samples from the two underlying populations, we did not carry out a hypothesis test of equal covariance matrices, and simply give all the results in Tables 5.3 and 5.4. The empirical ROC curves are shown in Figures 5.26 and 5.27.

5.4 Sequential Diagnosis Using CEA, CA15.3 and MCA

Sequential diagnostic procedures are often important and necessary in clinical diagnosis. For reasons of risk, expense, time, and so on, we hope to diagnose a cancer patient based on a minimal number of tests. Here we can start from the diagnostic test which is best in the sense of minimum misclassification rate, and check the measurement from that test. If its value is greater than one preset value, $A$, we conclude that the individual is diseased. In contrast, if the value is below another preset value, $B$, we diagnose that individual as non-diseased. In such cases, we do not need to carry out any more diagnostic tests on that individual. Otherwise, we need to use more tests to classify that individual. This diagnostic procedure is called the sequential discrimination procedure, and was formally introduced in Chapter 4. Following the steps of the modified ROC analysis summarized in Chapter 4, we evaluated the performance of such a sequential diagnostic procedure based on the three tumor markers.

By the results of the ROC analysis on the three tumor markers in Table 5.3, CA15.3 is globally the best discriminator in terms of the largest area under the ROC curve, MCA is the worst discriminator, and CEA is in the middle. Therefore, we use CA15.3 as the first test. But which test should be used as the second test? We know that this second test is not the globally second best test, but the best test conditional on the first best test. Thus, given that CA15.3 is used, we introduce CEA and MCA in turn, and check the performance increase obtained by adding a second test.

Table 5.5: ROC Areas When a Second Test is Added (N and BC)

Test           A_z      SD(A_z)   A_t      SD(A_t)
CA15.3         0.7722   0.0283    0.7573   0.0294
CA15.3 & CEA   0.7791   0.0279    0.7714   0.0286
CA15.3 & MCA   0.8093   0.0264    0.8089   0.0263
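The trapezoidal (Wilcoxon) area $A_t$ reported in the tables of this chapter can be computed directly from two samples of scores. The following is a minimal Python sketch with made-up scores, not the program used for the thesis; the name `wilcoxon_area` is hypothetical, and ties are counted as one half, which is the usual convention for the trapezoidal area.

```python
def wilcoxon_area(neg_scores, pos_scores):
    """Estimate P(X < Y) + 0.5 * P(X = Y) for X from D- and Y from D+."""
    wins = 0.0
    for y in pos_scores:
        for x in neg_scores:
            if y > x:
                wins += 1.0
            elif y == x:
                wins += 0.5
    return wins / (len(neg_scores) * len(pos_scores))


neg = [1.0, 2.0, 2.0, 3.0]  # hypothetical non-diseased scores
pos = [2.0, 3.5, 4.0]       # hypothetical diseased scores
area = wilcoxon_area(neg, pos)

# The area depends only on the ordering of the scores, so any strictly
# increasing transformation (such as ranking) leaves it unchanged.
area_cubed = wilcoxon_area([x ** 3 for x in neg], [y ** 3 for y in pos])
print(area, area_cubed)
```

Because only the ordering matters, the same function can be applied to raw marker values, to discriminant scores, or to the sequential ranks of Chapter 4.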
Using the RLDF and the ROC technique described in Chapter 3, the aggregate performances, in terms of ROC areas, for adding a second test to CA15.3 are given in Table 5.5.

Although the MCA test is globally the worst among the three tests, Table 5.5 shows that adding MCA to CA15.3 increases the discrimination performance more (the trapezoidal area $A_t$ increases by 6.81%) than adding CEA does ($A_t$ increases by 1.86%). Therefore, CA15.3 is the best test, and MCA is the second best test given that CA15.3 is used. In our sequential diagnosis, we use the CA15.3 measurements first, then add MCA to CA15.3 for those left unclassified by CA15.3 alone. For those not classified by CA15.3 and MCA, CEA is included last.

As mentioned in Chapter 4, the discrimination bounds $A$ and $B$, which depend upon the type I and type II errors, are crucial to the performance of the sequential discrimination procedure. Choosing different values of the type I and type II errors, we first applied the sequential diagnostic procedure to the patients in Group N and Group BC, and used the modified ROC procedure to evaluate the resulting performances. The results are shown in Table 5.6.

Table 5.6: ROC Areas of Sequential Tests on the Patients in Group N and Group BC

α      β      A_z      SD(A_z)   A_t      SD(A_t)   Relative Efficiency
0.01   0.01   0.8105   0.0262    0.8153   0.0258    0.00%
0.10   0.20   0.8045   0.0266    0.8094   0.0262    9.69%
0.10   0.30   0.7999   0.0269    0.8008   0.0268    19.51%

As we increased the type I and type II errors, the interval between the two bounds became smaller, and so did the area under the ROC curve. Thus, the accuracy of the sequential diagnostic procedure decreased. Fortunately, these decreases in performance were very small, and the performance was not sensitive to the discrimination bounds. On the other hand, as the interval between the two bounds became smaller, the number of tests needed to classify those patients was reduced.
For example, when $\alpha = 0.10$ and $\beta = 0.30$, the trapezoidal ROC area was reduced by only about 2% compared with the simultaneous procedure, but about 20% of the tests were saved. We then carried out the same ROC analysis on the patients in Group HR and Group BC. Similar conclusions can be drawn, and the results of the ROC analysis are given in Table 5.7.

Table 5.7: ROC Areas of Sequential Tests on the Patients in Group HR and Group BC

α      β      A_z      SD(A_z)   A_t      SD(A_t)   Relative Efficiency
0.05   0.10   0.6649   0.0471    0.6799   0.0466    0%
0.35   0.35   0.6646   0.0501    0.6799   0.0466    4.19%
0.40   0.40   0.6569   0.0474    0.6391   0.0494    31.24%

From the above analyses, it is apparent that the sequential diagnostic procedure is a potentially efficient and economical procedure in clinical diagnosis. Compared with the simultaneous diagnostic procedure, a small decrease in accuracy is accompanied by significant savings in tests.

Chapter 6

Conclusions

In this paper, two ROC-based techniques have been developed to evaluate the aggregate performance of several diagnostic tests: one for evaluating simultaneous multiple diagnostic tests, and the other for sequential tests.

Quite often, several tests are available for the diagnosis of a particular disease. When these univariate tests are applied to the same patients in a group, the tests can be treated as one test with multivariate measurements. In this case, Fisher's linear discriminant functions or their quadratic forms can be used to combine the multivariate measurements into univariate measurements, to which the traditional ROC analysis can be applied. The indices derived from the ROC analysis then represent the aggregate performance of the multiple diagnostic tests involved.

On the other hand, in consideration of the risk, expense, and time involved in diagnostic tests, we prefer to use as few tests as possible to classify an individual as belonging to the diseased or the non-diseased population.
In such situations, sequential diagnostic procedures are strongly recommended. Both the simulation study comparing the sequential discrimination procedures with the simultaneous ones and the analyses of the breast cancer data show that the sequential procedures are nearly as good as the corresponding simultaneous ones in accuracy, while the number of tests needed for classifying individuals is reduced significantly. By the method of taking ranks described in Chapter 4, sequential diagnostic test scores can be transformed into corresponding univariate rank scores, to which the traditional ROC analysis can be applied. The indices derived from the ROC analysis then represent the aggregate performance of such sequential diagnostic procedures.

In addition, the importance of the rank transformation in ROC analysis was emphasized in this paper. First, the area indices under the ROC curve remain the same whether or not the rank transformation is applied to the original data. Second, after the rank transformation, outliers (extremely large values in the diseased population, or extremely small values in the non-diseased population) agree with the other observations in the same population. Finally, when the data are not normally distributed, multivariate discrimination methods based on the rank transformation are superior to the ones without it.

It should be mentioned that the use of ROC techniques for evaluating the performance of clinical diagnosis has grown considerably in the past years. Methods for statistical inference have been derived for most situations. However, a number of questions remain to be answered. For example, what types of questions can or cannot be answered by ROC studies? Can ROC analysis be improved for the conduct of multicenter imaging trials? What is the best way to evaluate quantitative tests, and what is the best way to compare quantitative tests with qualitative ones?
Therefore, ROC-based techniques need to be developed further in future studies.

Bibliography

[1] Silver, H. K. B., Archibald, B. L., Ragaz, J., and Coldman, A. J., Relative Operating Characteristic Analysis and Group Modelling for Tumor Markers: Comparison of CA 15.3, Carcinoembryonic Antigen, and Mucin-like Carcinoma-associated Antigen in Breast Carcinoma, Cancer Research, 51, p1904-1909, 1991.
[2] Green, D. M. and Swets, J. A., Signal Detection Theory and Psychophysics, John Wiley & Sons, New York, 1966.
[3] Swets, J. A. and Pickett, R. M., Evaluation of Diagnostic Systems: Methods from Signal Detection Theory, Academic Press, New York, 1982.
[4] Morgan, B. J., The Uniform Distribution in Signal Detection Theory, British Journal of Statistical Psychology, 29, p81-88, 1976.
[5] Dorfman, D. D. and Alf, E., Maximum Likelihood Estimation of Parameters of Signal Detection Theory - A Direct Solution, Psychometrika, 33, p117-124, 1968.
[6] Dorfman, D. D. and Alf, E., Maximum Likelihood Estimation of Parameters of Signal Detection Theory and Determination of Confidence Intervals: Rating Method Data, Journal of Mathematical Psychology, 6, p487-496, 1969.
[7] Ogilvie, J. C. and Creelman, C. D., Maximum Likelihood Estimation of ROC Curve Parameters, Journal of Mathematical Psychology, 5, p377-391, 1968.
[8] Grey, D. R. and Morgan, B. J., Some Aspects of ROC Curve Fitting: Normal and Logistic Models, Journal of Mathematical Psychology, 9, p128-139, 1972.
[9] Birdsall, T. G., The Theory of Signal Detectability: ROC Curves and Their Characters, unpublished dissertation, Department of Electrical and Computer Engineering, The University of Michigan, Ann Arbor, 1973.
[10] Jaraiedi, M. and Herrin, G. D., Effect of Human Inspector Error on Sample Plan Design, Proceedings, 1985 Fall Industrial Engineering Conference, Chicago, 1985, p436-439.
[11] Greenhouse, S. W. and Mantel, N., The Evaluation of Diagnostic Tests, Biometrics, 6, p399-412, 1950.
[12] Linnett, K., Comparison of Quantitative Diagnostic Tests: Type I Error, Power, and Sample Size, Statistics in Medicine, 6, p147-158, 1987.
[13] Bamber, D., The Area Above the Ordinal Dominance Graph and the Area Below the Receiver Operating Characteristic Graph, Journal of Mathematical Psychology, 12, p387-415, 1975.
[14] Hanley, J. A. and McNeil, B. J., The Meaning and Use of the Area Under a Receiver Operating Characteristic (ROC) Curve, Radiology, 143, p29-36, 1982.
[15] Hanley, J. A. and McNeil, B. J., A Method of Comparing the Areas Under Receiver Operating Characteristic Curves Derived from the Same Cases, Radiology, 148, p839-843, 1983.
[16] Goddard, M. J. and Hinberg, I., ROC Curves for Non-normal Data, presented paper, Joint Statistical Meetings, Chicago, 1986.
[17] Delong, E. R., Delong, D. M., and Clarke-Pearson, D. L., Comparing the Area Under Two or More Correlated Receiver Operating Characteristic Curves: A Nonparametric Approach, Biometrics, 44, p837-845, 1988.
[18] Wieand, S. et al., A Family of Nonparametric Statistics for Comparing Diagnostic Tests with Paired or Unpaired Data, Biometrika, 76, p585-592, 1989.
[19] Revesz, G., Kundel, H. L., and Bonitatibus, M., The Effect of Verification on the Assessment of Imaging Techniques, Investigative Radiology, 18, p194, 1983.
[20] Henkelman, R. M., Kay, I. B., and Bronskill, M. L., Receiver Operator Characteristic (ROC) Analysis without Truth, unpublished document, Department of Medical Biophysics, University of Toronto, Toronto, 1986.
[21] Hanley, J. A., Receiver Operating Characteristic (ROC) Methodology: The State of Art, Critical Reviews in Diagnostic Imaging, 29, Issue 3, p307-335, 1989.
[22] Fisher, R. A., The Use of Multiple Measurements in Taxonomic Problems, Annals of Eugenics, 7, p179-188, 1936.
[23] Fisher, R. A., The Statistical Utilization of Multiple Measurements, Annals of Eugenics, 8, p376-386, 1938.
[24] Johnson, R. A. and Wichern, D. W., Applied Multivariate Statistical Analysis, 2nd edition, Prentice Hall, 1988.
[25] Seber, G. A. F., Multivariate Observations, John Wiley & Sons, 1984.
[26] Lehmann, E. L., Nonparametrics: Statistical Methods Based on Ranks, Holden-Day, 1975.
[27] Dawid, A. P., Properties of Diagnostic Data Distributions, Biometrics, 32, p647-658, 1976.
[28] Randles, R. H., Broffitt, J. D., and Hogg, R. V., Discriminant Analysis Based on Ranks, J. Am. Stat. Assoc., 73, p379-384, 1978.
[29] Randles, R. H., Broffitt, J. D., and Hogg, R. V., Generalized Linear and Quadratic Discriminant Functions Using Robust Estimates, J. Am. Stat. Assoc., 73, p564-568, 1978.
[30] Beckman, R. J. and Johnson, M. E., A Ranking Procedure for Partial Discriminant Analysis, J. Am. Stat. Assoc., 76, p671-675, 1981.
[31] Conover, W. J. and Iman, R. L., The Rank Transformation as a Method of Discrimination with Some Examples, Communications in Statistics, A9(5), p465-487, 1980.
[32] Hora, S. C., Sequential Discrimination, Commun. Statist. Theor. Meth., A9(9), p905-916, 1980.
[33] Lachenbruch, P. A., Discriminant Analysis, Hafner: New York, 1975.
[34] Gordon, L. and Olshen, R. A., Asymptotically Efficient Solution to the Classification Problem, Ann. Stat., 6, p515-533, 1978.
[35] Glick, N., Sample-based Classification Procedures Derived from Density Estimators, J. Am. Stat. Assoc., 67, p116-122, 1972.
[36] Kendall, M. G. and Stuart, A., The Advanced Theory of Statistics, Vol. III, Griffin: London, 1966.
[37] Kendall, M. G., Multivariate Analysis, Griffin: London, 1975.
[38] Beard, D. B. and Haskell, C. M., Carcinoembryonic Antigen in Breast Cancer, Am. J. Med., 80, p241-245, 1986.
[39] Tondini, C., Hayes, D. F., and Kufe, D. W., Circulating Tumor Markers in Breast Cancer, Hematol. Oncol. Clin. North Am., 3, p653-674, 1989.
Figure 1.1 One Typical ROC Curve (horizontal axis: 1 - Specificity)
Figure 1.2 Comparing Two ROC Curves (horizontal axis: 1 - Specificity)
Figure 2.1 Empirical ROC Points (horizontal axis: 1 - Specificity)
Figure 2.2 Sub-area under ROC Curve
Figure 3.1 Outlines in Bivariate Case
Fig. 5.1 to Fig. 5.6 Q-Q Plots (odd numbers, against Quantiles of Standard Normal) and Histograms (even numbers) For Group N: Original CEA, CA15.3, and MCA
Fig. 5.7 to Fig. 5.12 Q-Q Plots (odd numbers) and Histograms (even numbers) For Group HR: Original CEA, CA15.3, and MCA
Fig. 5.13 to Fig. 5.18 Q-Q Plots (odd numbers) and Histograms (even numbers) For Group BC: Original CEA, CA15.3, and MCA
Figure 5.19 Boxplot of the Ranked CEA Measurements (Normal, High Risk, Breast Cancer)
Figure 5.20 Boxplot of the Ranked CA15.3 Measurements (Normal, High Risk, Cancer)
Figure 5.21 Boxplot of the Ranked MCA Measurements (Normal, High Risk, Cancer)
Fig. 5.22 Boxplot of RLDF in N and BC
Fig. 5.23 Boxplot of RQDF in N and BC
Fig. 5.24 Boxplot of RLDF in HR and BC
Fig. 5.25 Boxplot of RQDF in HR and BC
Figure 5.26 ROC Curves (Group N and Group BC) (horizontal axis: 1 - Specificity)
Figure 5.27 ROC Curves (Group HR and Group BC) (horizontal axis: 1 - Specificity)
Item Metadata
Title | Simultaneous and sequential ROC analyses for diagnostic tests |
Creator | Fang, Raymond Rui |
Publisher | University of British Columbia |
Date Issued | 1991 |
Description | Relative or receiver operating characteristic (ROC) analysis is a simple procedure which can be used to measure the accuracy of diagnostic tests. Diagnostic tests are often used to classify an individual as belonging to one of two populations. Based on statistical decision theory, ROC was first developed to evaluate the performance of electronic signal detection, and has been used to evaluate the accuracy of diagnostic tests. The ROC theory for evaluating one single test, or comparing individual tests is reasonably well understood. The question arises in cases where multiple tests are available as to whether some combination of the tests are better than any single one. In this paper, two ROC procedures of evaluating the aggregate performance of multiple diagnostic tests were presented, one is for evaluating simultaneous multiple diagnostic tests, and the other is for sequential diagnostic tests. These procedures are illustrated by using a breast cancer data set. |
Genre | Thesis/Dissertation |
Type | Text |
Language | eng |
Date Available | 2010-11-05 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0098505 |
URI | http://hdl.handle.net/2429/29835 |
Degree | Master of Science - MSc |
Program | Statistics |
Affiliation | Science, Faculty of; Statistics, Department of |
Degree Grantor | University of British Columbia |
Campus | UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |