DETECTION OF MALIGNANCY ASSOCIATED CHANGES IN CERVICAL CELL NUCLEI USING FEED-FORWARD NEURAL NETWORKS

By Roger Kemp
B.Sc. (Physics), Simon Fraser University, 1991

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF PHYSICS

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 1994
© Roger Kemp, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Physics
The University of British Columbia
Vancouver, Canada

Abstract

It has been recognized that normal cells in the presence of a precancerous lesion undergo subtle changes that affect the DNA distribution in their nuclei. These changes have been termed Malignancy Associated Changes (MACs). This thesis examines the design of a classifier that separates normal slides from slides containing MACs in the presence of a severely dysplastic lesion.

Classifiers were designed using discriminant functions and feed-forward neural networks with various structures. The discriminant function correctly separated MACs from normal cells with a classification rate of 61.6% for a 16904 cell test set. Neural network classifiers were able to achieve up to 72.5% separation for this cell-by-cell classification task when four hidden units were used. Using more than four hidden units led to a decline in the test set performance.

The slide-by-slide classification rates were calculated for each classifier based on the distribution of classifier values for the cells on each slide. The discriminant function scored 69.5% on the test set containing 197 slides. The neural network classifiers all scored between 74% and 77% when used for slide-by-slide classification.

Table of Contents

Abstract ii
Table of Contents iii
List of Tables v
List of Figures vii
Acknowledgements xi
1 Introduction 1
2 Classification with Discriminant Functions and Neural Networks 5
  2.1 Linear Discriminant Functions 6
  2.2 Neural Networks 11
    2.2.1 Introduction 11
    2.2.2 Feed Forward Networks with Error Back-propagation 12
    2.2.3 Alternatives to the Back-Propagation Algorithm 16
    2.2.4 The Quickprop Algorithm 18
    2.2.5 Decision Surfaces 19
    2.2.6 Neural network solutions to the classification example 21
    2.2.7 Overfitting Training Data 26
3 Classification of Cytological Smears 28
  3.1 Cancer 28
  3.2 The Epithelium 29
    3.2.1 The Nucleus 30
  3.3 Differences Between Normal and Malignant Cells 31
    3.3.1 Staging of Malignancies 32
  3.4 Malignancy Associated Changes 34
  3.5 Cervical Cytology and Cytometry 37
  3.6 Design of a Cytological Classifier 39
4 Experiments and Results 44
  4.1 Description of Data 44
  4.2 Nuclear Features 47
  4.3 Feature Correlations 50
  4.4 Training Method Selection 51
  4.5 Spurious Classifications 56
  4.6 Discriminant Function Analysis 57
  4.7 Neural Network Experiments 58
  4.8 Principal Component Analysis 80
  4.9 Slide-by-Slide Classification 85
5 Conclusions and Further Work 90
Bibliography 94

List of Tables

3.1 Descriptive differences between the cells that make up the squamous epithelium
. , , 314.1 Frequency of the number of cells per slide used in the design of the cell-by-cell classifier 464.2 The breakdown of the number of cells of each type used to train and testthe cell-by-cell classifiers is shown, 464.3 The frequency distribution for the absolute values of the pairwise correlation between the 53 features is shown. Correlatioi values near 1 are ofinterest because they indicate variables which can be eliminated from theclassifier model 514.4 The average performance of 15 neural networks with 20—2—1 structuretrained with Quickprop on 1000 cells (500 MACs, 500 negatives) is shownfor different weight decay levels. The test results for an independent setof 1000 cells should be closer to those of the training data for larger decaylevels 554.5 The linear discriminant function attained a 62.1% classification rate onthe training set 584.6 The linear discriminaut function attained a 61.6% classification rate onthe test set 584.7 The number of network weights and number of training iterations used areshown for networks with different numbers of hidden units 60v4.8 The average performance and standard deviations for 20 randomly initialized networks are shown when different numbers of hidden units are used.The training set performance improves as the number of hidden units isincreased, but the test set performance peaks when four hiddens are used.The performance of the best network on the test data is given for each case 624.9 The staining dates of the 41 slides that make up peak 3 (see text) areshown. The total number of slides from each date is given in the rightmostcolumn 774.10 The first 18 principal components are shown for the MAC training data.They are obtained by finding the eigenvalues of the covariance matrix for53 feature training set. The contribution of each principal component tothe total variation of the data and the cumulative total are also given. . 834.11 The average performance and standard deviations for 20 randomly initialized networks are shown for different numbers of hidden units. Thenetworks were trained on the first ten principal components of the training set, which accounts for 86.9% of the variation of the data 844.12 Breakdown of the number of slides of each type used to design slide-byslide classifiers 864.13 Correlation values of the five slide features are shown for the five hidden unit cell-by-cell classifier. These values were calculated for the slidetraining set 874.14 The performances of the slide features for classification of the test set of197 slides are shown for the best neural networks and discriminant function 88viList of Figures2.1 Decision boundary and contours of equal probability density are shown fortwo populations that identical covariances of the form S = o.21 82.2 A linear discriminarit vector is shown along with the separating boundary(solid line) for an example of a two class classification problem, Discrirninant function value contours are lines perpendicular the discriminantvector as shown by the two additional contours in the figure. . 92.3 The decision boundary for a quadratic discriminant function is shown forthe classification problem. 
This discriminant function correctly separates85.59o of the training examples 112.4 Structure of a feed-forward with one layer of hidden units 132.5 A sigmoid shaped function (one that saturates for extreme input values) isthe most common nonlinear activation function used with neural networks 162.6 Decision surfaces are shown for feed-forward networks with one, two andthree layers of weights. In the first case the separating surface will be ahyperplane and the network performs like discriminant function vector.When there are two layers of weights, the decision surface will be curvedand can separate a single region of arbitrary shape. If more than twoweight layers are used, the network can form boundaries around multimodal distributions in the feature space 202.7 The combination of two vectors defines a decision region in the lower leftcorner that approximates the original distribution of the training examples 22vii2.8 The structure of a neural network with two hidden neurons that implements the decision boundary defined by the two vectors is shown. . . 232.9 The decision boundary is shown for a neural network before and after itwas optimized using a conjugate gradient training technique. 242.10 Three neural network output contours are shown for a 5 hidden unit solution to the classification example. The decision boundary is shown by the.50 output value contour. The .01 and .99 contours nearly coincide withthe decision boundary. Thus all points in the plane have neural networkoutput values of nearly zero or one 253.1 The types of cells that make up the squamous epithelium are shownschematically, There are usually several layers of parabasal, intermediate and superficial cells in normal epithelia 293.2 The differences between the different stages of precancerous lesions areshown. The images on the left are the nuclei of typical cells of the fourtypes scanned using the automated image cytometer described in Section3.5 353.3 The decision tree used by the automated classifier system at the B.C.Cancer Agency Imaging Department. Each node represents a decisioncurrently made using a discriminant function or threshold 404.1 Every third cell nucleus was selected for inclusion in the training data andthe remainder of the data was used for testing 474.2 The cell-by-cell performance of neural networks with different numbers ofhidden units is shown for the training and test sets. Each result is averagedfor 20 runs using randomly initialized networks. The standard deviationsare shown as error bars for each result 61yin4.3 The ROC curves for the the discriminant function and networks givingthe best test set performance are shown. The ROC curves for the bestnetworks using three, five, six and seven hidden units nearly coincide withthose of the two and four hidden networks and are not shown,.., 644.4 The histogram of discriminant function values is shown for the test data.The discriminant function correctly classifies 61.6% of the test examples 674.5 The histogram of neural network output values for the optimal networkwith 1 hidden unit is shown for the test data. This network correctlyclassifies 63.5% of the test examples 684.6 The histogram of perceptron output values is shown for the test data. Theperceptron assigns negative cells to extreme output values 694.7 The ROC curves for the the perceptron and the combination of the perceptron and threshold are shown. 
The ROC curves for the neural networkwith two hidden units is shown for comparison purposes 714.8 The histogram of neural network output values for the optimal networkwith 2 hidden units is shown for the test data. This network correctlyclassifies 71.9% of the test examples 724.9 The histogram of neural network output values for the optimal networkwith three hidden units is shown for the test data. This network correctlyclassifies 72.4% of the test examples 744.10 The histogram of neural network output values for the optimal networkwith four hidden units is shown for the test data. This network correctlyclassifies 72.5% of the test examples 754.11 The histograms of neural network output values for the optimal networkwith 4 hidden units is shown for the whole negative test set and for thecase where 41 slides have been removed.,,.,..,,...,,,,,,, 76ix4.12 The outputs for every eight training example for the two perceptrons thatmake up the two hidden unit networks are plotted. 784.13 The two principal components are shown for an artificial data distributionin two dimensions shown by the shaded region. The first principal component lies alollg the direction of greatest variation of the data, The secondcomponent is constrained to be perpendicular to the first 824.14 The slide-by-slide ROC curves for the the discriminant function and neuralnetworks with two and five hidden units are shown. The slide test setconsists of 125 normal and 72 severe slides 89xAcknowledgementsI would like to acknowledge several people who have helped me a great deal during myresearch and writing of this thesis. First, I would like to thank Dr. Branko Palcic forintroducing me to the field of cytology and making this thesis possible. Dr. CalumMacAulay has provided me with good advice and suggestions on many topics relating topattern classifiers. I would also like thank Henry Lee for his friendship and wisdom, andPeter Payne who has been my guru for all things cytopathological.xiChapter 1IntroductionCaicer is one of the leading causes of death in society today. In 1987, the death ratesfor cancer in Canada were 212 per 100,000 for men and 168 per 100,000 for women [44].This places cancer, which accounts for 26% of all fatalities, number two behind heartdisease, which accounts for over 40% of all fatalities.The incidence of cancer in British Columbia is on the rise. The incidence rate for newcancers in 1991 was approximately 300 per 100,000 [3], which is up 9% over incidencedata from 10 years ago. This statistic was standardized to account for the aging of thepopulation. For these reasons, cancer remains a research priority. Research is conductedinto treatment, prevention and early detection of cancers. Early detection is importantbecause many cancers are more successfully treated when detected early.For some types of cancer, early detection can be accomplished by public screeningprograms. British Columbia has had a cervical cancer screening program in place since1949. Over the period from 1955 to 1985 the mortality rate from cervical cancer hasdropped from approximately 11 per 100,000 to 3 per 100,000 [1]. This improvement islargely due to the early diagnosis of this cancer, where it can be effectively treated.In the case of cervical cancer, a sample smear is done by scraping the surface of thecervix to obtain a cell sample and fixing the cells on a slide. The slide is stained and thediagnosis is made by a pathologist or cytotechnologist. 
The diagnosis is made by searchingthe slide for examples of abnormal cells, which are characteristic of a malignancy. Thisis done by viewing different portions of the slide under a microscope and distinguishing1Chapter 1. Introduction 2the cells from other materials and debris on the slide. On average, there are over 150,000cells on each slide and rendering the diagnosis requires careful examination of most ofthese objects.Under the current screening program in British Columbia, women are examined everytwo years between the onset of sexual activity and age 35, and every 5 years thereafter.The cytotechnologists at the B.C. Cancer Agency, therefore, must examine between 2500and 3000 slides daily. This labor intensive task can be accomplished in British Columbiabecause of the amount of resources devoted to the screening program. However, mostcountries, including the United States, do not have such public screening programs inplace.A computerized system to screen cervical cell smears would be a valuable tool toaid cytotechnologists. Recently, researchers at several laboratories [32, 38, 39, 51] havebuilt prototype systems to screen smears automatically. These systems attempt to detectabnormal cells in order to aid the cytotechnologist’s diagnosis of the slide.A new approach to the diagnosis of cervical smears involves detecting MalignancyAssociated Changes (MACs)[34, 36] in apparently normal cells. It has been shown thatnormal cells in the presence of a precancerous lesion undergo subtle changes that affecttheir nuclear features [4, 7, 30, 31]. These changes cannot be usually detected by eye,but can be seen when making quantitative measurements of the DNA distribution inthe nucleus. A decision process which includes the detection of MACs could be used todetect cancer in cases where no tumor cells are present on the cervical smears.A new automated image cytometer [15] developed by Xillix Technology Corporationand the B.C. Cancer Agency uses both abnormal cells and the presence of MACs todiagnose cervical smears. The device scans the nuclei of stained cervical smears andcalculates up to 120 quantitative nuclear features. These features are used to classifyeach object on the slide and ultimately the entire slide itself. The usual decision methodsChapter 1. Introduction 3used in this process are linear discrimiriant functions in combination with thresholds onthe nuclear features. This thesis compares the use of linear discriminant functions andneural networks for the detection of MACs.Neural networks have become a popular pattern classification technique since thepublication of the back-propagation algorithm [42] nearly ten years ago. They are usefulfor decision problems where a nonlinear decision surface separates different classes of data.For this reason they can be successful in problems where standard statistical techniquessuch as the discriminant function fail.Chapter two of this thesis derives the linear discriminant function and the back-propagation algorithm for training neural networks. The relationship between the twodecision methods is shown using a two-class classification problem. This problem issolved using discriminant functions and neural networks with different structures. It isalso used to demonstrate how overfitting of the training data can occur when classifierswith too many free parameters are used. 
Alternative neural network training methodsare introduced and discussed.Chapter three gives backround regarding cancer and its effects on cells, The structure of the squamous epithelium, which lines the uterine cervix, is described. A briefdescription of the nature of MACs and what is known about them is given in Section 3.4,Cytological screening and issues regarding the design of an automated cytological classifier are discussed in this chapter.Chapter four contains the results of a series of experiments performed on a datasetconsisting of MAC cells and normal cells. A detailed description of the data used in thestudies is given in Section 4.1. The results are given for experiments to:1. calculate the correlations between nuclear features,2. compare three neural network training methods,Chapter 1. Introduction 43. study the effects of varying neural network weight decay parameters,4. design classifiers that separate cells from different staining batches,5. design cell-by-cell discriminant function and neural network classifiers,6. use principal component analysis to reduce the dimensionality of the data, and7. perform slide-by-slide classification using a discriminant function and optimal neural networks of various structures.Chapter five summarizes the results of this thesis and suggests further worked basedon these results.Chapter 2Classification with Discriminant Functions and Neural NetworksA pattern classifier is a rule or procedure for assigning a label to an object. An objectcan be represented by a vector of featuresx = (x1,x2,..., xd) (2.1)where each z is some measurement of a feature of x and d is the number of featuresused to describe the object. A classifier transforms objects in the d dimensional featurespace into scalar values which may then be interpreted as probabilities or measures ofthe degree to which objects belong to a particular class. The features can be physicalmeasurements or enumerations of some quality that distinguishes the classes, such ascolour, etc.The design of a classifier begins with choosing a set of prototypes from the desiredclasses and selecting a classification model such as a discriminant function, neural networkor Bayes network [45]. The classifier is trained with the prototypes and parametersare found for the selected model that minimize the error function appropriate to themodel. If the model is sufficiently complex, i.e. if a large number of parameters areused, it is possible to form a rule that correctly classifies most or all of the trainingexamples. However, the performance of the classifier will generally be worse on anymutually exclusive validation (test) set. Consequently, the use of a separate test set isimportant to confirm how well the classifier can be expected to classify new data. Itsperformance on the training data can be misleading, a point which will be demonstrated5Chapter 2. Classification with Discriminant Functions and Neural Networks 6further in this chapter. The performance of the classifier on a test set is commonlyreferred to as its ability to generalize [10].2.1 Linear Discriminant FunctionsThe linear discriminant function [11] is one of the most common methods used for patternclassification. It was introduced by Fisher [13] as a technique for reducing multidimensional data to scalar “discriminants” such that the means of populations of differentclasses are maximally separated, The linear discriminant vector described by Fisher canalso be derived from Bayesian principles [20]. 
If the populations of the two classes arenormally distributed and have identical covariance matrices, the maximum-likelihoodBayesian classifier turns out to be the linear discriminant vector.Let x1,.. . , x be n samples with n0 of class 0 from set X0 and n1 of class 1 fromset X. Let w he the discriminant vector whose components we wish to find. Then, let= wx aild define Y0 and Y1 to contain each y for which the corresponding x, iscontained in Xo or X1. The sample means are given bym=--x (2.2)xXand= y = I wTx = wTmj (2.3)yEY xeXiLet the covariance matrices for the populations beS, = (x — m)(x — mj)T (2.4)xEXand define scatter of the Y populations to be= (y )2 (2.5)yEYChapter 2. Classification with Discriminant Functions and Neural Networks 7Thn isproportional to the variance within the class and can be written as= (wTx — wTmj)2 = w(x — — m)Tw = WTSW (2.6)xEX XEXThen the Fisher linear discriminant is the vector w which maximizes the expression2 TiJ( ) — rn0 — rn1 — W m0 m1 (2 7)w—— WT(S+Si)wThe numerator of this expression is the separation between the means of the two dataclasses while the denominator is proportional to the variances within the classes. Thus,maximizing J(w) seeks to maximize the separation between the classes while minimizingthe spread within the classes. If the populations have identical covariance matrices, i.e.So = S1 then there is a solution for w which maximizes J(w), namelyw = . (mo — mi) (2.8)For each new object x we form the product y = wTx and assign x to class 1 if yexceeds some threshold, otherwise to class 0. The discriminant vector w measures theassociation of objects to the two classes. When the expression for y is written usingthe components of w as y= j wjxj = d(x), d(...) is referred to as a discriminantfunction. The terms discriminant function and discriminant vector will both be usedwhere appropriate.The second term in the product in Equation 2.8 is the vector that connects the meansof the two populations. If the covariance S = cr2I where I is the identity matrix, andeach feature has identical variance 2, then the objects fall into identical hypersphericalclusters centered around the respective class means.This situation is shown in Figure 2.1 for a two dimensional classification task. Inthis case the discrirninant vector is, to within a multiplicative constant, (mo— mi) andthe decision surface will be a hyperplane perpendicular to the vector at its midpoint.Chapter 2. Classification with Discriminant Functions and Neural Networks 8When S has some other form, the vector w is adjusted from the direction (mo — mi) bythe matrix S (Equation 2.8). This means that if a feature has a large variance withrespect to the separation between the class means, it makes a smaller contribution to thediscriminant vector.x2Figure 2.1: Decision boundary and contours of equal probability density are shown fortwo populations that identical covariances of the form S, = a21.Figure 2.2 shows a two dimensional classification problem which will be used todemonstrate the behavior of discriminant functions and neural networks. The two groupsof objects were generated by taking uniformly distibuted points from slightly overlappingannular regions in the first quadrant and stretching them along the vector (1, 1). Eachgroup consists of 100 points, and these points were used to train the neural networksand discriminant functions. The entire training set of 200 points is shown in Figure 2.2.A larger validation set of 1000 points was constructed to test the performance of theclassifiers. 
It consisted of 500 objects from each class drawn from the same probabilityDecision BoundaryChapter 2. Classification with Discriminant Functions and Neural Networks 9deilsity distributions.1.81.61.41.2Feature 20.80.60.40.20Two class classification problem solved with a discriminant functionFigure 2.2: A linear discriminant vector is shown along with the separating boundary(solid line) for an example of a two class classification problem. Discriminant functionvalue contours are lines perpendicular the discriminant vector as shown by the two additional contours in the figure.Figure 2.2 also shows the linear discriminant vector which best separates the twogroups. Lines perpendicular to the vector represent contours of equal discriminant value.The decision boundary is shown by the solid line between the two groups. Adjusting theposition of decision threshold is equivalent to adding a constant to each of the y values.This shifts the location of the decision boundary along the direction of the discriminantvector, The decision boundary shown in the figure correctly separates 81.5% of the 200training examples and 83.7% of the 1000 test examples.0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8Feature 1Chapter 2. Classification with Discriminant Functions and Neural Networks 10In practice, the condition that the covariances be equal is rarely observed. If thepopulations are both multivariate normally distributed but the covariances are different,then the optimal discriminant function will be quadratic. The optimal discriminantfunction g(x) will have the formd d d ct*g(x) = wo+w3x+WjkxjXk = az(x) (2.9)j=1 j—i kjwhere d* is the dimension of a feature space containing all the original features plus allthe terms of products of pairs of features, The value of d* is given by d* (d+ 1)(d+ 2)and a and z are d*dimensional vectors.Points in the d-dimensional space are mapped into points in the d*dimensiona1 space.In z-space the decision surfaces will be hyperplanes, and the linear discriminant vectorcan be found as described above. In x-space this discriminant function will containquadratic terms of the x variables so the decision surfaces will be hyperparabloids. Thisis commonly referred to as a quadratic discriminant function.This highlights a problem with the terminology for discriminant functions. The discriminant functions are linear because they result from the inversion of the covariancematrix S, not because of the form of the decision surface in the original feature space.Duda [11] uses the term generalized linear discriminant to refer to the discriminantswhich can be expressed as g(x) = aTy.The curse of dimensionality can make it difficult to use generalized discriminants.For a problem with ten free features, a quadratic discriminant will have 66 componentswhile a third order (cubic) discriminant would have 286 components. Thus it becomesimpractical to use discriminants of higher orders unless there are very few features orvery many training examples.Figure 2.3 shows the decision surface for a generalized linear discriminant with quadratic terms of the two features. It does a better job of approximating the curvature ofChapter 2. Classification with Discriminant Functions and Neural Networks 11the decision boundary between the two classes. 
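Both the linear rule of Equation 2.8 and the generalized quadratic discriminant of Equation 2.9 reduce to a few lines of array code. The sketch below is an illustration of the calculations described above, not the software used in this thesis; the array names X0 and X1 (one row per object, one column per feature) are assumptions, and the constant term w0 of Equation 2.9 is absorbed into the decision threshold.

    import numpy as np
    from itertools import combinations_with_replacement

    def fisher_discriminant(X0, X1):
        # Fisher discriminant vector w = Sw^-1 (m0 - m1), Equation 2.8.
        # X0, X1: (n_samples, n_features) arrays for the two classes.
        m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
        S0 = (X0 - m0).T @ (X0 - m0)       # within-class scatter, Equation 2.4
        S1 = (X1 - m1).T @ (X1 - m1)
        return np.linalg.solve(S0 + S1, m0 - m1)

    def quadratic_expand(X):
        # Map objects into the z-space of Equation 2.9: the original features
        # plus all pairwise products x_j * x_k (j <= k).
        d = X.shape[1]
        products = [X[:, j] * X[:, k]
                    for j, k in combinations_with_replacement(range(d), 2)]
        return np.column_stack([X] + products)

    # Linear rule: y = w.x, assigned to class 1 when y exceeds a threshold.
    # w = fisher_discriminant(X0, X1)
    # Quadratic rule: the same machinery applied in z-space.
    # a = fisher_discriminant(quadratic_expand(X0), quadratic_expand(X1))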
The boundary correctly separates 85.5%of the training examples and 86.9% of the test examples.1.81.61.41.2Feature 20.80.60.40.20Discriminant function solution containing quadratic termsFeature 12 2.2 2.4 2.6 2.8Figure 2.3: The decision boundary for a quadratic discriminant function is shown forthe classification problem. This discriminant function correctly separates 85.5% of thetraining examples.2.2 Neural Networks2.2.1 IntroductionThere has been a great deal of interest in using neural networks for pattern classificationduring the last ten years. Pattern classification is done by the brain all the time. Imagesare recognized by the brain in a fraction of a second (say,-‘. .1— .2 seconds). Yet, the**10SSS0 S0*S0*5SS 5S55SS•SSS SClass 0 0Class 1 SDecision boundary0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8Chapter 2. Classification with Discriminant Functions and Neural Networks 12firing rate of neurons in the brain is of the order .005 seconds. This means that the braincan recognize an image in roughly 30 neuron cycles.Todays computers operate approximately a million times faster than the brain. Yet,the fastest computers running sophisticated pattern recognition algorithms cannot comeclose to what the brain does routinely. It is the distributed nature of the brain that givesit its ability to quickly process images and categorize them. This distributed processingfeature is the motivation behind developing neural network systems for classification.It is knowll that there are 1011 neurons in the brain each with iO interconnections. Each neuron performs a non-linear transformation on a weighted combination ofits mputs and passes the result to others via synaptic junctions. The term neural networks is used to describe a variety of systems that have this property, i.e. interconnectedelements each performing simple computations. The term multilayer perceptron, afterthe terminology introduced by Rosenblatt [41], is used by many authors to maintain thedistinction between the implementation of neuron-like algorithms on computers and thetrue neural networks of the brain.2.2.2 Feed Forward Networks with Error Back-propagationThe most popular neural network paradigm is the feed forward network trained by theback propagation of errors [42]. The back propagation algorithm describes how theweights of a neural network are changed in order to train it to give a particular outputwhen presented with a set of input/output pairs (x,y).Figure 2.4 shows a schematic of a feed forward neural network with two layers ofconnection weights. The d inputs are shown by x and the m outputs are shown by Ok.Between the inputs and outputs are a layer of n hidden units denoted V. The inputsare connected to the hidden units by connection weights wj, and the hidden units areconnected to the outputs by connection weights T47jk Subscript i refers to only inputChapter 2. Classification with Discriminant Functions and Neural Networks 1jXA U1’Sw FIRST WEIGHT LAYERHIDDEN UNiTSW1, SECOND WEIGHT LAYEROk OUTPUTSFigure 2.4: Structure of a feed-forward with one layer of hidden units.units, j to hidden units and k to output units.Let the training examples (x, y) come from a set of p training pairs. For eachtraining pattern, hidden unit j receives the sum= (2.10)and produces the output= g(h) = g(w3x) (2.11)where g(...) is a nonlinear function. Output unit k receives the sum= = (2.12)and generates the output= g(h) = g( T4/jk g( wx)) (2.13)This process is repeated for as many layers of weights as exist in the network.Wi1 W23Chapter 2. 
The standard error function is the sum-of-squares error measure

E(\mathbf{w}) = \frac{1}{2} \sum_{\mu} \sum_{k} \left( y_k^\mu - O_k^\mu \right)^2   (2.14)

E is differentiable with respect to the weights provided g(...) is also differentiable. The derivative of E with respect to the hidden-to-output weights is

\frac{\partial E}{\partial W_{jk}} = -\sum_\mu \left( y_k^\mu - O_k^\mu \right) g'(h_k^\mu)\, V_j^\mu = -\sum_\mu \delta_k^\mu V_j^\mu   (2.15)

where

\delta_k^\mu = g'(h_k^\mu) \left( y_k^\mu - O_k^\mu \right)   (2.16)

The derivative of E with respect to the input-to-hidden weights is calculated using the chain rule

\frac{\partial E}{\partial w_{ij}} = -\sum_\mu \sum_k \left( y_k^\mu - O_k^\mu \right) g'(h_k^\mu)\, W_{jk}\, g'(h_j^\mu)\, x_i^\mu = -\sum_\mu \delta_j^\mu x_i^\mu   (2.17)

where

\delta_j^\mu = \sum_k \left( y_k^\mu - O_k^\mu \right) g'(h_k^\mu)\, W_{jk}\, g'(h_j^\mu) = g'(h_j^\mu) \sum_k W_{jk} \delta_k^\mu   (2.18)

For multi-layer networks the derivatives are found by further applying the chain rule to each layer of weights.

With the back-propagation algorithm, network weights are updated using

\Delta W_{jk} = -\eta \frac{\partial E}{\partial W_{jk}} \quad \text{and} \quad \Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}   (2.19)

in order to minimize the error function E. The parameter \eta is referred to as the learning rate and controls the size of the steps taken in the direction of the error gradient. The task of training a neural network becomes a multivariate optimization problem with the network weights as the optimization parameters. Training patterns x^\mu are applied to the network inputs and the results are fed forward through the network until an output pattern O^\mu is obtained. The output is compared to the desired output pattern y^\mu, and the error function E is computed. From this, the error contributions of the individual units (the \delta's) can be calculated. These errors are propagated back through all the network weights towards the inputs. The prescription for updating each weight (Equation 2.19) can then be applied.

The function g(...) is referred to as the transfer or activation function. As the equations above show, it must be differentiable. It must also be nonlinear, otherwise the neural network as a whole will be linear and all non-input units can be replaced by a single unit. The standard activation function is sigmoidal in shape (Figure 2.5). It saturates for small and large inputs. This means that for inputs with large magnitude (positive or negative), the function returns a value that asymptotically approaches the minimum or maximum output value. This nonlinear behavior is the key feature which gives neural networks the ability to perform decisions that cannot be made with discriminant functions.

The two most common activation functions are both exponential in form:

g(h) = \frac{1}{1 + e^{-2\beta h}}   (2.20)

is used for units that have a [0,1] output range and

g(h) = \tanh(\beta h)   (2.21)

for units that have a [-1,1] output range. Both functions have a central range where their behavior is almost linear. The parameter \beta controls the steepness at which the activation function approaches its extreme values.

Figure 2.5: A sigmoid-shaped function (one that saturates for extreme input values) is the most common nonlinear activation function used with neural networks.

It is usual for each non-input unit in the network to have a bias or threshold. Standard practice is to add an extra connection weight between each unit and an artificial input unit clamped at ±1. These connection weights are treated like regular weights when calculating derivatives and performing weight updates.
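Equations 2.10 through 2.19 translate directly into a few lines of array code. The following sketch is illustrative only and is not the software used in this work: the array shapes are assumptions, bias units are omitted for brevity, and g is the sigmoid of Equation 2.20.

    import numpy as np

    def g(h, beta=1.0):
        # Sigmoid activation of Equation 2.20, output range [0, 1].
        return 1.0 / (1.0 + np.exp(-2.0 * beta * h))

    def backprop_step(x, y, w, W, eta=0.1, beta=1.0):
        # One gradient-descent update (Equation 2.19) for a network with a
        # single layer of hidden units.  x: (d,) input pattern, y: (m,) target,
        # w: (d, n) input-to-hidden weights, W: (n, m) hidden-to-output weights.
        # Forward pass (Equations 2.10 to 2.13).
        V = g(x @ w, beta)                   # hidden unit outputs
        O = g(V @ W, beta)                   # network outputs
        # Error deltas (Equations 2.16 and 2.18); g'(h) = 2*beta*g(h)*(1 - g(h)).
        delta_out = 2.0 * beta * O * (1.0 - O) * (y - O)
        delta_hid = 2.0 * beta * V * (1.0 - V) * (W @ delta_out)
        # Weight changes, Delta w = -eta * dE/dw (Equations 2.15, 2.17, 2.19).
        W += eta * np.outer(V, delta_out)
        w += eta * np.outer(x, delta_hid)
        return 0.5 * np.sum((y - O) ** 2)    # pattern error, Equation 2.14

The update is written here per training pattern; the derivatives in Equations 2.15 and 2.17 can equally be accumulated over the whole training set before the weights are changed.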
Thus, the number of weights ina one hidden layer feed-forward network with full connections between layers will beweights = + l) X hiddens + (rihiddens + 1) X outputs (2.22)2.2.3 Alternatives to the Back-Propagation AlgorithmThe back-propagation algorithm is a non-linear optimization algorithm that takes stepsof a given length i along the error gradient. If is very small, the convergence of thenetwork to an acceptable solution will be very slow. If j is too large, the training errorwill oscillate wildly as many of the update steps will overshoot the minima of local valleysof the error value in the weight-space.-5 -4 -3 -2 -1 0 1 2 3 4 5Chapter 2. Classification with Discriminant Functioiis and Neural I\Tetworks 17Hundreds of papers have been published in which the authors describe some improvement of the back-propagation algorithm. The most common improvement is theintroduction of a momentum parameter. a. The weight update equation (Equation 2.19)is modified so that each step includes a contribution of the previous weight update:w(t + 1) = + aw(t) (2.23)wtJThe momentum, a, has a value between zero and one, The best value of a andare usually found by trial and error. Several authors describe adaptive techniques toautomatically adjust these parameters [48, 19] to improve the rate of convergence.The realm of operations research provides many alternatives to gradient descent whichare much better thaii the back-prop algorithm. Some of these methods rely on calculatingthe second derivative of E with respect to the network weights. The Taylor expansion ofE about the current set of weights w0 isE(w)=Eo+(w_wo).VE(wo)+(w_wo)T.H.(w_wo)+... (2.24)where H is the Hessian matrix of second derivatives= 62E (2.25)8w1awjIgnoring higher order terms, the derivative of Equation 2.24 with respect to wVE(w) = VE(wo) + H (w — wo) (2.26)To minimize E we set VE(w) = 0 and rearrange (Equation 2.26) in terms of w toobtainw = w0— ‘VE(wo) (2.27)This technique is Newton’s method for function minimization [40]. It requires that wecompute and invert a d x d matrix of second derivatives at each iteration. This canChapter 2. Classification with Discrirninant Functions and Neural Networks 18be costly in terms of both memory and CPU time when implemented in an algorithm.Further, unless the error function is quadratic with respect to the weights it can performin an unstable manner. As such, it is not practical to implement in this form.It is possible to obtain an optimization algorithm with second order convergencewithout explicitly calculating H [40]. This fact is used by conjugate gradient methodssuch as the Polak-Ribiere method, which are the basis of several neural network trainingtechniques [23, 28]. At each iteration of the algorithm, a line search is made along adirection dTh that is selected to contain contributions from both the gradient and theprevious search direction:—VE’ + 7d0 (2.28)The factor 7 is calculated at each step to optimize the contribution of the previous searchdirection. These algorithms can provide an order of magnitude increase in the rate ofconvergence over the standard back-propagation algorithm for some problems.2.2.4 The Quickprop AlgorithmSeveral neural network training procedures have beell developed which use approximations of the second derivative matrix. These methods perform repeated line searchesbased on an approximation to H1 and are known as quasi-Newton or variable metricmethods. 
Watrous [49] has studied their effectiveness for training neural networks andfound them to be an order of magnitude more efficient than standard back-propagation.An alternative to using the full second derivative matrix is to ignore the off-diagonalelements of H. Fahlmann’s Quickprop algorithm [12] treats the change in the slope ofthe error function E(w) as a function of each weight independently. In this case it ispossible to use the pseudo-Netwon rule=_-- (2.29)5W3.Chapter 2. Classification with Discriminant Functions and Neural Networks 19to updat.e each weight. Further he assumes that the error function is a parabolic functionopening upwards. In this case Equation 2.29 reduces toSE— Sw(t)SE SE W —Sw(t—1) Sw(t)where t refers to the current iteration and t — 1 to the previous. Since the updaterule reduces to using the current and previous derivatives of the error it is efficient toimplement it in a training algorithm. This compensates for the simplification that ismade by assuming that the error surface is quadratic.There are situations when following this update rule can lead to difficulties. If thecurrent and previous error slopes are nearly equal then the denominator of Equation 2.30will be near zero and the rule will prescribe an enormous update to the weights. Toprevent this, Fahlmanll introduces a parameter t which he calls the “maximum growthfactor”. No step is allowed to exceed a previous step by a factor greater than ,u. Afterexperimenting with a variety of values he recommends using t = 1.75 for most problems.To prevent the algorithm from oscillating around the local error minimum a gradientdescent term is added to the weight update. This term adds times the current slope toeach wj if the slope of has the same sign as If the two slopes have differentsigns then the minimum has been overshot and only the quadratic update rule is used.2,2.5 Decision SurfacesAs was shown in Figure 2.2, a discriminant vector forms a hyperplane separating boundary between two populations. This is also the case for a neural network that contains asingle layer of weights (Figure 2.6). The vector of network weights, w, can be interpretedas a discriminant vector because the non-linear activation function, g(w), is invertible.This type of neural network structure is called a perceptron [41].When two layers of weights (one layer of hidden units) are used, each hidden unit inChapter 2. Classification with Discriminant Functions and Neural Networks 20Three or MoreOne Weight Layer Two Weight Layers Weight LayersMultiple Decision SurfacesFigure 2.6: Decision surfaces are shown for feed-forward networks with one, two andthree layers of weights. In the first case the separating surface will be a hyperplane andthe network performs like discriminant function vector. When there are two layers ofweights, the decision surface will be curved and can separate a single region of arbitraryshape. If more than two weight layers are used, the network can form boundaries aroundmultirnodal distributions in the feature space.the network acts as a discriminant vector in the feature space. Since these discriminantvalues pass through the activation function (Equation 2.11) before reaching the outputunits, the resulting decision boundary will be a single curved surface. The corners shownin the decision surfaces in Figure 2.6 will be rounded because hidden unit outputs aresummed at the output unit and contributions from two hidden units will adjust theposition of the switching threshold [24]. 
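The Quickprop rule of Section 2.2.4 (Equation 2.30 together with the maximum growth factor and the conditional gradient-descent term) can be summarized per weight as in the sketch below. This is a schematic reading of the algorithm rather than Fahlman's reference implementation, and the tolerance guarding the denominator is an assumption.

    import numpy as np

    def quickprop_step(slope, prev_slope, prev_dw, eta=0.01, mu=1.75):
        # slope, prev_slope: current and previous dE/dw for each weight;
        # prev_dw: previous weight change.  Returns the new weight change.
        denom = prev_slope - slope
        safe = np.abs(denom) > 1e-12
        # Parabolic (pseudo-Newton) step of Equation 2.30.
        dw = np.where(safe, slope / np.where(safe, denom, 1.0) * prev_dw, 0.0)
        # Maximum growth factor: no step may exceed mu times the previous step.
        limit = mu * np.abs(prev_dw)
        dw = np.clip(dw, -limit, limit)
        # Gradient-descent term, added only when the current and previous
        # slopes share a sign (the local minimum has not been overshot).
        dw = dw - eta * slope * (slope * prev_slope > 0)
        return dw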
If the weights between the inputs and eachhidden unit are scaled by some large value it is possible to approach arbitrarily close tothe boundaries shown in the figures. It has been shown that a neural network with a...Hyperplane Boundary Single Decision SurfaceChapter 2. Olassification with Discriminant Functions and Neural Networks 21single layer of hidden units can represent any continous function to an arbitrary accuracygiven enough hidden units [9].Note that if only a single hidden unit is used in the hidden layer, the decision surfacewill be a hyperplane, as with the perceptron. The neural network requires at least twohidden units to form a nonlinear decision surface.When three or more layers of weights (two or more layers of hidden units) are usedit is possible to approximate multimodal population distributions. Given enough hiddenunits it is possible to approximate any function with arbitrary accuracy with exactly twohidden layers [8].2.2.6 Neural network solutions to the classification exampleThe relationship between the decision surface of a feed-forward neural network and adiscriminant function vector can be demonstrated with the classification example fromthe previous section. Figure 2.2 showed the optimal boundary when a single separatingplane is used. The curved boundary could be approximated better if two separating lineswere used in combination as shown in Figure 2.7. Neither boundary alone provides goodseparation for all the data, but the combination of two, shown by the region in the lowerleft, provides a better fit to the training examples. The vectors to which these boundariescorrespond are also shown in the figure. The first points along the direction (0.4, 1.0)while the second points along the direction (1.0,0.4).It is possible to construct a simple neural network which contains these two vectorsamong its weights. Figure 2.8 shows a network with two hidden units connected to theinputs. The weights leading to the first hidden unit w1 are set to a constant k1 timesthe first discriminant vector (1.0, 0.4), and similarly w2 = k2(O.4, 1.0). Then, making afew decisions regardiiig how fast the output should change from zero to one, the valuesfor k1, k2 and the remaining weights can be found.Chapter 2. Classification with Discriminant Functions and Neural Networks 22The combination of two vectors defines a decision region in the lower leftapproximates the original distribution of the training examples.The decision surface for this neural network is shown by the dotted line in Figure 2.9,Although only a rough approximation to the boundary in Figure 2.7, it shows how theweights connecting the inputs to each hidden unit can be interpreted as discriminantvectors in the feature space. This network was then optimized using the Polak-Ribiereconjugate gradient optimization routine. The training set was presented to the networkfor 200 iterations of the algorithm, resulting in the optimized design shown in the figure.The optimized network correctly classifies 89% of the training examples and 88.1% of thetest set.There are two ways to improve the performance a neural network solution to thisSeparation of the two classes with a combination of two discriminants44.400000 oO0cjP 00e0o 00eQo41.81.61.41.2Feature 20.80.60.40.20Figure 2.7:corner that0••00* 004*4•• •0•0*00a..Class 0 0Class 1 *Decision boundary 2Decision boundary 1.00 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6 1.8 2 2.2 2.4 2.6 2.8Feature 1Chapter 2. 
classification with Discriminant Functions and Neural Networks 23Feature 1 Feature 2w1=k1(1.0, 0.4) 1.0)Figure 2.8: The structure of a neural network with two hidden neurons that implementsthe decision boundary defined by the two vectors is shown.problem. The first is to repeat the training process with more iterations. The performance of the previous solution, obtained after 200 iterations of the conjugate gradientmethod, could be improved slightly by optimizing the network for a longer time. Inthis case the improvement is marginal, and there is 110 detectable difference between thedecision boundaries for the two networks.The second way is to use a larger number of hidden units. Adding extra hiddenunits to the network adds parameters to the system, thereby increasing its ability to fitthe training examples. Figure 2.10 shows three contours of neural network output for aneural network with 5 hidden units.The activation function used for the neural network classifiers is Equation 2.20, with= 1.0. Since the range of the neural network output (Equation 2.13) is the range ofg(.. .), each point in the plane of Figure 2.10 is mapped to an output value between zeroand one. The contours of output value equal to 0.01, 0.50, and 0.99 are shown. Thenetwork was obtained by performing 10,000 iterations of the conjugate gradient trainingmethod using the 200 training examples.Because of the large number of presentations of the training patterns the trainingprocedure produced a network that is precisely tuned to the training examples. TheOutputChapter 2. Classification with Discriminant Functions and Neural Networks 241.61.41.21Feature 20.80.60.40.2Decision boundary for a neural network classifier with two hidden units.1 1.2 1.4 1.6 1.8Feature 1Figure 2.9: The decision boundary is shown for a neural network before and after it wasoptimized using a conjugate gradient training technique.decision boundary (solid line) appears to be a series of five segments, each due to theinfluence of one of the hidden units. The contours of neural network output equal to 0.01and 0.99 are both shown as dotted lines in the figure. They almost coincide the decisionboundary in many areas. This means that there are almost no regions where the trainingexamples have intermediate values (between .01 and .99). All points in the plane aremapped to either nearly zero or one. This demonstrates how the neural network output,although continuous, can come arbitrarily close to a discontinuous mapping.There is almost no rounding of the corners where the decision surfaces for these vectorsmeet. This occurs because of the large network weights that can result from training the1.800 *00.....,00SS0•S*0SS St0 S0%0e0•00Soo0SaClass 0 0Class 1 oInitial designOptimized designS.00.0...0 0.2 0.4 0.6 0.8 2 2.2 2.4 2.6 2.8Chapter 2. Classification with Discriminant Functions and Neural Networks 25Contours of neural network output for a five hidden unit network1.81.61.41.21Feature 20.80.60.40.201 1.2 1.4 1.6 1.8Feature 1Figure 2.10: Three neural network output contours are shown for a 5 hidden unit solutionto the classification example. The decision boundary is shown by the .50 output valuecontour. The .01 and .99 contours nearly coincide with the decision boundary. Thus allpoints in the plane have neural network output values of nearly zero or one.network for many iterations. 
Given that the original distribution being approximatedhad a smooth, nearly circular boulldary this result is not desirable.The network correctly identifies 92.5% of the training examples but only 86% of thetest examples. Thus, the performance of the five hidden unit neural network on the testset is poorer than that of the two hidden unit neural network. This is an example ofwhat is referred to as overfitting. The large number of parameters in the model allowsthe network to give undue attention to individual training patterns. Its training setperformance is better than that of the simpler network but no longer reflects how well ite*000.0•0 .0*000 00•s.00.000Class 0 0Class 1 •NN output= .1NN output= .5NN output= .90.0 0.2 0.4 0.6 0.8 2 2.2 2.4 2.6 2.8Chapter 2. Classification with Discriminant Functions and Neural Networks 26will perform on a separate test set.2.2.7 Overfitting Training DataThe performance of the discriminant vector (Figure 2.2) and the five hidden unit network(Figure 2.10) show the difficulty of selecting the right number of parameters to fit thetraming data. If too few parameters are used the decision rule will not be able toadequately separate the classes, while if too many parameters are used the generalizationwill be poor.Most of the time, it is difficult to know how many parameters to use in a classifierin advance, It is possible that a problem may be linearly separable, which means that aplane can be used to separate the two distributions. In this case a neural network withhidden units is not even needed. The optimal neural network will have oniy one layer ofweights (i.e. a perceptron) and is functionally equivalent to a discriminant vector.It is likely that there are errors in the training and test sets used in most real worldapplications. There may be misciassifications of examples or measurement errors in theirfeatures. Consequently, it is not desirable to fit the data beyond a certain level of accuracybecause we may be attempting to fit the noise in the system.To avoid overfitting, several methods have been introduced to reduce the number ofparameters in the system. The most widely used approach is weight decay. A penaltyterm is added to each weight during each iteration to reduce its size [23]. This can leadto some weights having nearly zero magnitude, whereupon they can be removed from thenetwork either during the training [18] or afterwards. The penalty term causes weightsthat are not constantly being reinforced to decrease in value.One way to introduce the penalty term is to multiply each weight by a factor (1 — e)Chapter 2. Classification with Discriminant Functions and Neural Networks 27after each iteration, so thateW= (1 — (2.31)where e is the weight decay term. This method is the same as modifying the error functionto be= E0H + w (2.32)2The value of used will depend on the amount of noise in the training set, although usinga value of iO or iO appears to be common [23].Reducing all weights by a constant fraction may excessively penalize the use of largeweights causing to the network to use many smaller weights. 
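In an implementation, the decay of Equation 2.31 amounts to scaling every weight array once per iteration, after the regular update. A minimal sketch follows; passing the weights as a list of arrays is an assumed calling convention.

    def apply_weight_decay(weight_arrays, eps=1e-4):
        # Multiply each weight by (1 - eps) after an update (Equation 2.31).
        # eps is the weight decay term; 1e-3 or 1e-4 are the values cited above.
        for w in weight_arrays:        # e.g. [input_to_hidden, hidden_to_output]
            w *= (1.0 - eps)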
Although this methodmay solve the problem of weights becoming exceedingly large it also introduces a decayterm in the derivative [16] which can lead to error terms decreasing with each iteration.Further, it forces the network to perform the classification task using small contributionsfrom every feature rather than allowing important features to predominate.Chapter 3Classification of Cytological Smears3.1 CancerCancer is the general name applied to a group of over 100 disease-states [43] characterizedby uncontrolled cell growth and the ability of cells to invade other tissues. Normal cellsdivide through mitosis, producing two cells genetically identical to the original. Theblueprint for cell division is contained within the DNA in the nucleus of the cell. The DNAcan become damaged due to the presence of certain chemicals, viruses, high frequencyradiation or other environmental stresses. When this happens, the normal cell growthand division behavior can be altered. If the portion of the DNA responsible for growthregulation is changed, the cell may divide more rapidly than the surrounding tissue. Thenext generation of cells produced by the cell will tend to have unstable DNA. Throughmitotic errors and further environmental stresses, the errors in the DNA of the succeedinggenerations of the original cell may multiply.The overall behavior of this population is difficult to predict. While still localizedaround the site of the original cell, the population is referred to as a premalignant (precancerous) lesion. The lesion may progress to some stage and then halt. If the growthhalts, it may then remain static for the life of the bearer or actually regress, as normalcells surrounding the lesion outgrow those that make up the lesion. The lesion may alsogrow and spread to other areas of the body. This type of lesion is termed invasive cancer.28Chapter 3. Classification of Cytological Smears 29SCHEMATIC OF SQUAMOUS EPIThELUM CELL TYPESFigure 3.1: The types of cells that make up the squamous epithelium are shown schematically. There are usually several layers of parabasal, intermediate and superficial cells innormal epithelia.3.2 The EpitheliumEpithelia are tissues that line the surfaces of organs or glandular tissue characterized byclose packed cells and a free surface. They protect the underlying tissue physically andchemically. One of the most common types of epithelial tissue is the squamous epithelium,which is the name for most epithelia in direct contact with the external environment.This includes the skin, mouth, esophagus, vagina and uterine cervix. Cells of uterinecervix are of particular interest to this thesis.Figure 3.1 shows a schematic representation of stratified squamous epithelial tissue.There are four types of cells that make up the stratified squamous epithelium. The basalcells form the base of the epithelium. In healthy tissue, they are the only cells that areable to divide. Their cytoplasm tends to be smaller than the other type of cells and havea large nuclear area compared to the surrounding cytoplasm.4 SUPERFICIAL4 ll’4ThRMEDIATE—Y—PARABASAL—JChapter 3. Classification of C’ytological Smears 30The basal cells divide periodically, and the new cells are pushed towards the surface.These cells enlarge to become parabasal cells. Their cytoplasmic area is approximatelytwice that of the basal cells and their nuclear area is around 50% larger.The third type of cell is the intermediate cell. Intermediate cells are between threeand six times larger than basal cells. 
Like the parabasal cells, their nuclei are 50% largerthan that of the basal cells, and they may be oval in shape. They are the most commontype of cells in the epithelium and are the most useful for diagnosing the condition ofthe tissue.The cells at the surface are the superficial cells. As intermediate cells are pushedto the surface, they grow in size and differentiate to become the superficial layer. Thenucleus condenses into a very small size in the superficial layer. Such nuclei are calledpykilotic (dense) nuclei.3.2.1 The NucleusThe main component of the nucleus observed in stained cell samples is the nuclear chromatim. These structures are made up of DNA and associated proteins. The quantity ofchromatin is the measure of the amount of DNA in the nucleus. DNA content is one ofthe most important methods for identifying cancerous cells.In basal cells the chromatin has a coarse structure. Parabasal and intermediate cellshave granular chromatin. As the nucleus condenses during the formation of superficialcells, the chromatin appears very dark and little structure is seen. Table 3.1 summarizesdifferences between the types of cells that make up the squamous epithelium,Chapter 3. Classification of Cytological Smears 31Criteria Basal Parabasal Intermediate SuperficialSize 8—10tm l5—2OJLm 30—60jtm 40-60izmShapePolygonal 0% 5% 85% 75%Oval 5% 40% 10% 20%Round 95 % 55% 5% 5%Nuclear Diameter 7—9jm 8—l3itm 10-l2jim 5—7tmNuclear Shape Round Round or oval Round or oval RoundChromatii Pattern Coarse Granular Finely granular PykiloticTable 3.1: Descriptitve differences between the cells that make up the squamous epithehum.33 Differences Between Normal and Malignant CellsWhen cells become malignant, there are usually changes in the cytoplasm and nucleusthat are noticeable when viewed through a microscope. These changes depend on boththe type and location of the cancer. Detecting them accurately is important for bothdiagnosing the state of the tissue and determining the prognosis for the patient.Cancer cells tend to have a greater variability in their size when compared to normalcells of the same cell type. However, cells can become variable from benign disorders,non-malignant viral infections, or other causes. Therefore, using the cytoplasmic area isnot a reliable method to diagnose the tissue. In squamous epithelial tissue, normal cellsvary between l5—60tm ill size, as is shown in Table 3.1. This makes it even more difficultto determine whether a cell is abnormally large, since the normal variability in the sizeof the cells making up the epithelium is already a factor of four.A population of cancer cells will have larger size variations than a population of normalcells. When viewing a tissue section, the structure of the tissue ca be determined (asin Figure 3.1). In this case it is possible to use the presence or absence of size variationsamong the cell population to aid the overall diagnosis.Chapter 3. Classification of Cyto1ogical Smears 32Nuclear abnormalities are the most noticeable differences between cancer cells aildnormal cells. The amount of DNA in the nuclei usually increases with the progressionof the lesion. The area of the nucleus is roughly proportional to the amount of DNAcontained therein, so doubling the amount of DNA usually causes the nuclear area todouble. It is not uncommon for nuclei of malignant cells to be four times the size ofnormal cells taken from the same site.The nuclei of normal cells undergoing mitosis will temporarily increase in size as DNAsynthesis occurs. 
Just before division these cells will contain twice the DNA of normal,nondividing cells. Consequently, the absolute size of the nucleus or DNA content is notan exact characteristic of a malignant cell. It is necessary to consider the DNA contentof suspected malignant cells in relation to that of the population of normal cells.Normal cells have round or oval nuclei with no bumps or indentations. Irregularitiessuch as notches in the perimeter of the nucleus or protrusions are common indicators ofcancer cells.The texture of cancer cell nuclei tend to be coarser than that of normal cells. Stainedcancer cells tend to have dark nuclei, which is termed hyperchromasia. Chromatin inthese cells aggregates into clumps, giving the appearance of dark spots in the nucleiwhen viewed under a microscope.3.3.1 Staging of MalignanciesAll of the characteristic differences between normal and premalignailt cells describedabove are used to assess the severity of a precancerous lesion. It is important to notethat the existence of a precancerous lesion alone, no matter what its severity, does notautomatically imply a poor prognosis for the patient, if left untreated. What is importantis that lesion does not progress to a stage where it spreads to the surrounding tissue.In order to succinctly characterize the stages of premalignancies, oncologists haveChapter 3. Classification of Cytological Smears 33attempted to develop a standard nomenclature for epithelial lesions. Several differentstaging systems are used by cancer specialists to describe the tissue state. Staging systemsgroup precancerous tissues into classes based on the degree of change of the cellularcytoplasm, nuclei and tissue organizations from normal appearance, called the degree ofdysplasia. Such tissues are referred to as being dysplastic.As a lesion progresses, it is thought that the cells that make up the lesion passthrough three stages: mild, moderate and severe dysplasia. If a lesion progresses beyondthis stage, but has not yet spread to another site it is referred to as carcinoma in situ(carcinoma= “cancer of the epithelium”, in situ= “localized”).A cancer spreads by breaking through the base of the epithelium and sending cellsthrough the blood or lymphatic system to other parts of the body. This condition iscalled metastasis. These cells may then take hold in other organs creating new cancerouspopulations which do not perform the regular functions of the cells native to that site.This accounts for the deadly nature of a cancer. If a cancer spreads away from the siteof origin then it is called metastatic carcinoma which is a full-blown cancer.Lesions that fall under the category of mild dysplasia are characterized by smallmorphological changes of the epithelial cells. They are “low grade” lesions in the sensethat 60% to 70% of them will naturally regress when left untreated [22]. The cells tend tohave slightly enlarged nuclei, and their nuclear chromatill, although mostly fine grained,is just beginning to form clumps. The nucleus is still round or oval in shape and anyprotrusions or indentations in the nucleus should be small.Moderately dysplastic cells usually have significantly larger nuclei than normal cells.Dark spots can be seen in the nucleus as the chromatin clumps together. There may besignificant irregularities in the shape of the nucleus.Lesions that are severely dysplastic are also known as “high grade” lesions since mostwill progress to invasive cancer. Severely dysplastic cells have very large nuclei or a highChapter 3. 
Classification of Cyto1ogical Smears 34nuclear area to cytoplasm ratio compared to normal cells from the same site. The interiorof the nucleus will usually have large clumps of chromatin and the nuclear shape is oftenirregular. In some cases, the nucleus appears to be in the midst of splitting into morethan one lobe. These types of cells are easier to identify than cells with other grades ofdysplasia.The descriptions of the different stages of dysplasia are summarized in Figure 3.2. Thefigure also shows normal, mildly dyspiastic, moderately dysplastic and severely dysplasticcell nuclei. These nuclei were stained and scanned using the automated image cytometersystem described in Section 3.5.In general, there is some difficulty in categorizing cells into the different classes. Thereis a continuum of changes that occur as a lesion progresses. These changes cannot beadequately represented using a few descriptive labels. As well, the definitions of the stagesof dysplasia are vague enough that there is some overlap between the classes. It is commonfor two different cytopathologists to classify the same object into different classes. Thiscaii occur even if they confer on the diagnosis, as each specialist’s opinion will dependon what their past experience and training have taught them. For this reason, there issome overlap between what is called benign/mild, mild/moderate, and moderate/severe.There are fewer instances of disagreement of diagnosis where the diagnoses are two ormore classes apart.3.4 Malignancy Associated ChangesNormal cells in the presence of a malignancy undergo subtle changes which alter someof the characteristics of the nucleus, Nieburgs [34, 35, 36, 37] described changes in thenuclei of normal cells taken from blood and liver tissue when a tumor was nearby. Thesechanges were termed Malignancy Associated Changes (MAC). The term MACs or MACChapter 3. Classification of Cytological Smears 35STAGING OF PRECANCEROUS LESIONSNORMALINTERMEDIATECELL- Nucleus is round or oval in shape- Chromatin is uniformlydistributed throughout the nucleusMILDDYSPLASIAMODERATEDYSPLASIASEVEREDYSPLASIA- Nucleus is slightly enlarged orirregular in shape- Nuclear chromatin is justbeginning to form clumps- Significantly larger nucleus thannormal cells; irregular shape- Distinct clumps of chromatinare seen within the nucleus- Nucleus is large, irregularlyshaped and appears dark due toextra amounts of chromatin- Chromatin tends to be seen inlarge clumps- Nuclear area is very largecompared to the surroundingcytoplasm (not seen in photo)Figure 3.2: The differences between the different stages of precancerous lesions are shown.The images on the left are the nuclei of typical cells of the four types scanned using theautomated image cytometer described in Section 3.5.Chapter 3. Classification of Cytological Smears 36cells then denotes apparently normal cells which have undergone such a change.The changes that were observed were difficult to detect by eye. Some question existedas to whether the observed changes were artifacts of the cell samples. Since then, carefulcytometric measurements have been done by different groups [7, 21, 50] on normal cellsand cancer cells to reveal significant changes in quantitative nuclear features. Cytometricmeasurements of MACs have been made in colon, lung and cervical carcinomas. Theyhave been detected up to 10cm [33] away from the site of a carcinoma.It is believed that MACs occur because of growth factors emitted by tumor cells. 
Thegrowth factors are proteins that are absorbed by the normal cells around the lesion andenter the nuclei where they instigate changes in the nucleus. Cytometric measurementsof the nuclear features have shown differences in the nuclear roundness [4], nuclear area,and DNA content [31] of MACs versus normal cells from normal-diagnosed patients. Ithas also been noticed that there is a difference in the chromatin distribution betweennormal cells and MACs taken from in situ cervical carcinomas, where the chromatin inthe MACs was noticed to be closer to the nuclear periphery [30].Since MACs are difficult to detect while abnormal cells are fairly easy to detect, it isnatural to question why one would bother to search for MACs in potentially malignanttissue. When a cell smear is taken, the cells are scraped off one area of the epithelialtissue. If this area does not contain any abnormal cells, there is the possibility that apatient with a premalignaut lesion or cancer will be diagnosed as normal. Since the rangefor MAC expression has been measured to be 10cm in some cases, it is possible that acancer can be detected using only the apparently normal cells from the cell smear. Thus,a cytological classifier which includes the detection of MACs may be able to diagnose theexistence of a lesion that would otherwise be missed.Chapter 3. Classification of Cytological Smears 373.5 Cervical Cytology and CytometryCervical cell smears are obtained by scraping the surface of the cervix with a devicesimilar to a spatula. The spatula is then smeared over a slide and the cells are allowedto air dry or are fixed to the slide by dipping the slide in a fixative solution such as95% ethanol. A smear generally contains over 100,000 objects including epithelial cells,cellular debris, leukocytes (white blood cells), inacrophages, microorganisms, etc.When nuclear features are being used to diagnose the cells, the slide is stained witha stoichiometric nuclear stain, where the amount of stain absorbed by the nucleus isproportional to the amount of DNA in the nucleus. This gives the nuclei high contrastagainst the cytoplasm.The number of overlapping objects on the slide depends largely how the smear hasbeen applied to the slide. This has consequences with respect to the ability of cytopathologists or cytotechnologists to identify premalignant or malignant cells. If the nuclei oftwo cells overlap, then the technologist may see what appears to be a very dark nucleus,or a large nucleus with an abnormal protrusion. It may be possible for the technologistto resolve the overlap, but an automated cytometric device would likely decide that theobject is an abnormal cell.The cytologist renders the diagnosis for the slide by examining areas throughoutthe slide under a microscope. Several thousand objects must be examined iii order tocompletely view the slide. The diagnosis of the slide is determined by the most severelevel of dysplasia of any cell on the slide.An automated image cytometer developed by Xillix Technologies Corporation andthe Cancer Imaging Department of the B.C. Cancer Agency [15] can serve as an aidto the cytotechnologist by automatically scanning the slide collecting cell images. Thecytometer consists of a digital camera which uses a scientific CCD array containing 1.4Chapter 3. Classification of Cytological Smears 38million pixels to scan the field of view of a light microscope. The objects in the field ofview are segmented using a combination of an adaptive thresholding algorithm and anedge detection algorithm. 
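The segmentation code itself is not reproduced in this thesis. As a rough sketch only, the Python/NumPy fragment below applies a local-mean adaptive threshold and keeps connected components above a minimum area; the window size, offset, minimum area and function names are illustrative assumptions, and the edge-detection refinement used by the real system is omitted.

    import numpy as np
    from scipy import ndimage

    def segment_nuclei(image, window=65, offset=0.02, min_area=50):
        """Crude adaptive-threshold segmentation of dark nuclei on a bright
        background.  `image` is a 2-D float array of transmitted intensities
        scaled to [0, 1]; all parameter values are illustrative only."""
        # Local mean intensity estimated with a sliding window.
        local_mean = ndimage.uniform_filter(image, size=window)
        # A pixel is "nuclear" if it is darker than its neighbourhood by `offset`.
        mask = image < (local_mean - offset)
        # Label connected components and drop objects too small to be nuclei.
        labels, n = ndimage.label(mask)
        areas = ndimage.sum(mask, labels, index=np.arange(1, n + 1))
        keep = np.flatnonzero(areas >= min_area) + 1
        return labels, keep

    # Example with a synthetic image: one dark disc on a bright background.
    yy, xx = np.mgrid[0:128, 0:128]
    img = np.ones((128, 128)) - 0.5 * ((yy - 64) ** 2 + (xx - 64) ** 2 < 20 ** 2)
    labels, keep = segment_nuclei(img)
    print(len(keep), "object(s) kept")

In practice the threshold map and the area cut would be tuned to the optics and magnification of the instrument; the sketch only shows the shape of the computation.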
The segmented nuclei are then stored in a computer as 256grey level images.The current automated diagnostic system uses discriminant functions calculated withover 70 available nuclear features. Fifty-three of these features are used in this thesis forthe separation of MACs from true normal cells. The features consist of shape features,DNA quantity features, Markovian nuclear texture features, and discrete nuclear texturefeatures. A complete description of these features can be found in Calum MacAulay’sthesis [25].Some of the features are those that have been demonstrated in the past to be usefulin separating normal cells from abnormal cells. The integrated optical density (IOD)feature, which is the measure of the amount of chromatin in the ilucleus is the mostcommonly used quantitative nuclear feature. It is calculated by summing the logarithmof the iitensity ratio between each pixel of the segmented image and the backroundintensity. This raw TOD value is then normalized by dividing it by the mean IOD ofthe intermediate cell nuclei on the slide. Normal, non-dividing cell nuclei should have anormalized IOD near 1.0 while those undergoing division should have a normalized JODnear 2.0.The IOD and nuclear area are powerful features for separating normal from abnormalcells. This is to be expected given the changes that occur when cells become malignant.For MACs however, the changes are difficult to distinguish by eye. It is not clear whichfeatures will give the best discrimination for this decision process.Chapter 3. Classification of Cytological Smears 393.6 Design of a Cytological ClassifierThere are several components to consider in the design of a cytological classifier system:decision strategy, model selection, data selection, training and testing/validation. Eachcomponent is important because it affects the accuracy and usefulness of the final product.The decision strategy addresses how the overall system will treat the information onthe slide. Of the thousands of objects on a typical slide specimen, roughly 50% of theobjects convey no useful information. These objects consist of cellular debris, overlappingcells, white blood cells and other tissue components that do not aid the overall diagnosis.Since the final diagnosis is made from the diagnostic cells, these cells must be efficientlyseparated from the other material.The decision strategy will contain rules to separate the objects into subgroups, andcan be represented as a tree-like process. The tree shown in Figure 3.3 is the decisiontree process used by the automated classifier system at the Cancer Imaging Departmentof the B.C. Cancer Agency. The decision at each node is performed by a combinationof thresholds and a discriminant function. The second decision, which splits objects intothree groups based on integrated optical density, is performed by applying a thresholdto the normalized IOD nuclear feature. An object may pass through as many as sevendecision nodes in the process of being classified.The diagnostic cells consist of the normal and abnormal cells on the slide. These makeup only a fraction of what will be scanned by the cytometer. Further, it is necessary thatthe cells be in good focus and are segmented correctly by the image acquisition software.If the segmentation is incorrect, or the focus blurred, the values of the nuclear featureswill be affected. 
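As a rough illustration of the IOD feature defined above (and not the feature code actually used by the cytometer), the fragment below sums the optical density log10(background/pixel) over a segmented nucleus and normalizes by the mean IOD of the slide's intermediate cells; the variable names and the choice of logarithm base are assumptions.

    import numpy as np

    def integrated_optical_density(pixels, background):
        """Raw IOD of one nucleus: sum over nuclear pixels of the optical
        density log10(I_background / I_pixel).  `pixels` is a 1-D array of
        transmitted intensities inside the segmented nucleus."""
        od = np.log10(background / np.asarray(pixels, dtype=float))
        return od.sum()

    def normalized_iod(raw_iod, intermediate_iods):
        """Divide by the mean raw IOD of the intermediate cells on the slide,
        so a normal non-dividing nucleus scores near 1.0 and a nucleus that
        has doubled its DNA scores near 2.0."""
        return raw_iod / np.mean(intermediate_iods)

    # Toy example: a reference population of intermediate nuclei and one
    # test nucleus containing roughly twice as much absorbing material.
    rng = np.random.default_rng(0)
    reference = [integrated_optical_density(rng.uniform(120, 180, 500), 200.0)
                 for _ in range(50)]
    test = integrated_optical_density(rng.uniform(120, 160, 800), 200.0)
    print(round(normalized_iod(test, reference), 2))

Because the value is a ratio against the slide's own intermediate cells, it is largely insensitive to slide-to-slide staining differences, which is also why segmentation or focus errors that change the summed pixels show up directly in the feature.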
The orientation of the cell on the slide also affects its nuclear features.A cell that is folded or merely viewed on edge will have an apparently smaller nucleararea and irregular nuclear shape as compared to the same cell when viewed face on.Chapter 3. Classification of Cytological Smears 40DECISION TREE FOR CYTOLOGICAL SCREENINGSTARIS OBJECT ALEUKOCYTE?NME.Figure 3.3: The decision tree used by the automated classifier system at the B.C. CancerAgency Imaging Department. Each node represents a decision currently made using adiscriminant function or threshold.Chapter 3. Classification of Cytological Smears 41Since there may be hundreds or thousands of cells with diagnostic value on a slide,how many should be used in the decision process? If a large number of cells is used,then adding more cells to the decision process should not affect the the slide diagnosis.The question is then how to determine what number of cells is required to adequatelyrepresent the slide distribution as a whole. It has been the experience of researchers in theCancer Imaging Department that no significant improvement is made in a slide diagnosiswhen classifications using more than 1000 cells are compared with the classification madeusing around 1000 cells [26].Model selection refers to choosing the right size model to solve the problem. Thesimplest decision model is the application of a threshold on a single feature, such asthe clustering of objects by IOD in Figure 3.3. Choosing the right model requires someinformation about the classes of objects being separated. If the classes are linearlyseparable, multivariate regression or linear discriminant functions are appropriate. If thedata being used is not linearly separable, a higher order discriminant function or neuralnetwork will provide a nonlinear separation between the two classes of objects. Generally,as more parameters are added to the model, the resulting classifier becomes less robustto variations in the feature parameters.The problem of representability is also hidden in the model selection component of theclassifier. We assume that the measured features are adequate to provide discriminationbetween the classes. If this assumption is incorrect, we may infer erroneous conclusionsfrom the results about the true nature of the classes.As an example, imagine designing a classifier to separate apples from oranges usingthe basic features of volume and mass. These features, obviously, are not the best onesto use to distinguish the two classes, With a little enterprise, it might be possible todesign a non-linear classifier which could correctly classify 75% of the training and testdata. If our approach to the model selection was systematic, we may find that usingchapter 3. Classification of Cytological Smears 42a simpler or more complicated model oniy reduces the classifier performance, It wouldbe a mistake to conclude that the classes are only 75% separable, since introducing thefeature of colour to the model would certainly improve the performance of the model tonearly 100%.This situation is analogous to the one of detecting malignancies on the slide basedon MACs. Since the changes to the normal cells in the presence of malignancies are notdetectable by eye, there is no way to determine whether the features we measure completely describe the differences between the populations. In the case of the apple/orangeclassifier, colour is an obvious addition to the feature set. 
With MAC classifiers, optimalfeature selection is not readily apparent.When selecting the data used to design the classifier, it is necessary to ensure that theslides providing the sample cells be representative of all the slides to which the classifierwill be applied. The slides are stained with a stoichiometric DNA stain that makes thenuclei stand out from the cytoplasm. There is batch-to-batch variation of the apparentdarkness of the nuclei when slides are stained due to variations in the concentrations ofthe stain components. The coefficient of variation in the slide-by-slide normalized IODdistribution of intermediate cells within has been recently measured to be approximately7% [46] for cervical smears prepared with Feulgen-Thionin stain. This variation is largeenough that slides should be selected from many different staining batches when designinga classifier in order to obtain robust performance.The MAC classifier is trained using intermediate cells that have been selected by anexperienced technologist. The resulting decision process is validated using an independenttest set also selected by the technologist. If the training data has been representative ofthe data that is encountered in everyday use, one can be confident that the validationperformance is an accurate representation of the classifier performance on intermediatecells in general.Chapter 3. Classification of Cytological Smears 43However, as Figure 3.3 shows, the objects to which the MAC classifier is applied arethose which have successfully passed through the five or six previous decision steps. Someobjects that are not intermediate cells are likely to make it through these discriminantfunctions. This will reduce the true performance of the MAC classifier when used as apart of the heirarchical system. Thus it is necessary to seek the best possible decisionrule at each node of the tree.Chapter 4Experiments and Results4.1 Description of DataThe data used in this thesis were taken from cervical cell smears obtained from the B.C.Cancer Agency. The cell nuclei used to design the cell-by-cell classifier came from 334slides from premenopausal patients. A total of 190 slides diagnosed “normal” and 144slides diagnosed “severely dysplastic” were used. Each slide was individually diagnosedby three or four independent cytotechnologists. Oniy the slides for which three of thediagnoses were identical were candidates for use in this study.The slides were stoichiometrically stained with a Feulgen-Thionin stain, which isa quantitative DNA stain. To prevent batch-to-batch variations in the average stainingintensity of the slides from affecting the results, slides were selected from different stainingbatches. This ensured that the training data would be representative of data normallyscanned by the automated image cytometer,There were a total of 292 normal slides available for use in the study, 283 of whichwere stained in 24 different staining batches. A total of 190 normal slides were used inthe study and were taken from 14 of the 24 staining batches. Similarly, there are a totalof 160 severe slides available from 21 different staining batches. Data was acquired from144 of these severe slides which spanned the 21 staining batches.The slides were scanned by an automated image cytometer designed by Xillix Technologies Corporation and the B.C. Cancer Agency. Each slide contains hundreds of44Chapter 4. Experiments and Results 45thousands of objects, including cells, debris and connective tissue. 
A decision process involving thresholds and discriminant functions was applied to each scanned object to separate out the obvious junk from the collected objects, leaving possible cells. The computer program controlling the cytometer was configured to collect 900 possible cells per slide.

The discriminant function and neural network cell-by-cell classifiers were trained using intermediate cells taken from normal and severe slides. The intermediate cells from normal slides will be referred to as negatives and the intermediate cells from severe slides will be referred to as MACs. It was not practical to use all available cells to design the classifier. Hence, it was necessary to restrict the number of cells selected from each slide. A maximum of 75 negatives and 150 MACs were selected from each of the respective slides for use in the design of the cell classifier. These limits were imposed to prevent a slide that had a high intermediate cell count from being overrepresented in the design data. Intermediate cells make up a smaller fraction of the cells scanned from the severe slides than from normal slides. It was necessary to accept twice as many MACs as negatives to provide adequately sized samples of each class of data.

Each object was examined five times to determine which objects were intermediate cell nuclei in good focus. Only those cell nuclei that satisfied these criteria were included in this study. Consequently, the number of available cells varied from slide to slide. Table 4.1 shows the frequency of the number of intermediate cells in good focus per slide used in the design of the cell-by-cell classifier. Most of the normal slides had more than the maximum 75 cell limit. Only 47 of 144 severe slides had 150 intermediate cells among the 1000 scanned objects.

A total of 25360 cells were used to design the cell-by-cell classifier, 12680 from each class. One third of the data was used for training each classifier and the remainder was used for testing. Table 4.2 provides a breakdown of the number of cells of each type used in the study.

Negative slides                          Severe slides
Number of cells     Number of slides     Number of cells     Number of slides
from slide                               from slide
1-10                7                    1-20                29
11-20               6                    21-40               10
21-30               7                    41-60               14
31-40               3                    61-80               15
41-50               3                    81-100              9
51-60               2                    101-120             9
61-74               15                   121-149             11
75                  147                  150                 47
Total               190                  Total               144

Table 4.1: Frequency of the number of cells per slide used in the design of the cell-by-cell classifier.

Cell type     Training     Testing     Total cells of type
negative      4228         8452        12680
MAC           4228         8452        12680
Total         8456         16904       25360

Table 4.2: The breakdown of the number of cells of each type used to train and test the cell-by-cell classifiers is shown.

The data were divided into the two groups by selecting every third cell for inclusion in the training set. This is shown schematically in Figure 4.1. This method of selecting the cells prevented any slide from being overrepresented in either the training or the test dataset. Since there are batch-to-batch variations in the staining intensity, it is important to train on as wide a cross-section of slides as possible. The intensity differences between staining batches can be large enough that a classifier can be trained to reliably discern intermediate cells from different batches. This point will be demonstrated in Section 4.5.
[Figure 4.1 schematic: cell nuclei from Slide 1 through Slide 334, with every third nucleus marked for the training set.]

Figure 4.1: Every third cell nucleus was selected for inclusion in the training data and the remainder of the data was used for testing.

4.2 Nuclear Features

Each cell nucleus used in this thesis was represented by a feature vector. Fifty-three of the 66 available features were selected for the classification experiments. The 53 features consisted of optical density, shape, Markovian and discrete texture features. A complete description of the features can be found in the MacAulay thesis [25].

The value of each feature was normalized by converting it to a z-score using

    z_i^j = (x_i^j - \bar{x}^j) / \sigma^j          (4.1)

where z_i^j is the normalized value of the jth feature of the ith cell, \bar{x}^j is the sample mean of the jth feature, and \sigma^j is the standard deviation of the jth feature. The means and standard deviations were calculated for the training set population of 8456 cells.

Normalization, although not strictly necessary when using neural networks, is useful because it allows an easier interpretation of the important features from the strength of their corresponding weights in the network. Some of the features, such as the nuclear area (AREA), have means that are several orders of magnitude larger than those of others, such as the Markovian energy feature (ENERGY). When a neural network is being trained, the connection weights corresponding to important features are usually increased. The magnitude of the weights can then be used to indicate which features contribute to the performance of the network. If the inputs to the network differ in magnitude, the average magnitude of the inputs must be multiplied by the corresponding connection weights when examining the weights to determine the relative importance of the input features.

It is important to note that the strength of a connection weight for a particular input does not always give a measure of its importance. If there are correlations between features, the training procedure may assign a significant weight to inputs that provide only moderate discriminating ability for the network. Imagine training a network whose features include two that are correlated at 1.0 and that provide little discrimination ability. Through the initialization of the network and the particular training data used, the weights connecting these inputs to the hidden units could assume significant values. The weights connecting the input of the second feature to the hidden neurons would simply be equal and opposite in sign to those connecting the first to the hidden neurons.

In this situation, the weight strengths alone are not enough to determine the importance of the features. The procedure that many authors recommend to determine the contribution of a feature input to a particular network is to remove the input and retrain the network. If the network performance can be recaptured, the input makes no significant contribution to this particular network. This procedure is time consuming, though, as the network must be retrained after each input is removed.

This procedure does not address whether a feature is important to the classification problem in general. It only helps determine whether the feature is important to the network for the current combination of weights. A feature that can apparently be removed may prove to be a significant feature if the experiment is repeated with different initialization weights.

The most important reason for normalizing the feature values relates to the initialization of the networks for training.
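Before turning to initialization, a minimal sketch of the normalization of Equation 4.1, written in Python with NumPy purely for illustration: the per-feature means and standard deviations are computed from the training cells and then reused, unchanged, for the test cells. The array shapes match this study, but the variable names are assumptions.

    import numpy as np

    def fit_zscore(train_features):
        """Per-feature mean and standard deviation from the training cells."""
        mean = train_features.mean(axis=0)
        std = train_features.std(axis=0)
        return mean, std

    def apply_zscore(features, mean, std):
        """z_i^j = (x_i^j - mean_j) / std_j for every cell i and feature j."""
        return (features - mean) / std

    # Illustrative shapes only: 8456 training cells and 16904 test cells,
    # each described by 53 nuclear features.
    rng = np.random.default_rng(1)
    train = rng.normal(size=(8456, 53)) * 10 + 100
    test = rng.normal(size=(16904, 53)) * 10 + 100
    mean, std = fit_zscore(train)
    z_test = apply_zscore(test, mean, std)
    print(z_test.mean(), z_test.std())

Fitting the statistics on the training set alone keeps the test set strictly independent of the scaling.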
The training of the network is a multivariate optimization procedure that requires the network to be initialized with some (usually random) starting weights. The weights are initialized to small values so that the sigmoid activation function g(...) in Equation 2.11 does not saturate to either zero or one.

The derivative of the error with respect to the input-to-hidden weights is calculated using Equation 2.17. This equation includes the derivative of the activation function, g'(h_j), evaluated at h_j = \sum_i w_{ji} x_i, where the x_i are inputs and the w_{ji} are input-to-hidden-unit connection weights. When h_j has a large magnitude, g'(h_j) is nearly zero (Figure 2.5). In this situation the derivative will be small and the update performed by Equation 2.19 will also be small.

As a result, the weights connecting the inputs to the jth hidden unit will be changed very slowly by the optimization procedure. This hidden unit will make little contribution to the performance of the network. This situation is avoided by choosing weights such that h_j in Equation 2.10 is of order one. When the feature values are normalized, all the weights can be treated in the same fashion. There is no need to consider the average magnitude of the inputs when initializing the weights.

4.3 Feature Correlations

The 53 nuclear features were used to train the neural networks without regard for any correlations which exist between the features. It is known from regression theory that using too many features in a model, at best, has no effect on the model and, in many cases, adversely affects the performance of the classifier. Neural networks are subject to these effects as well. Using redundant features increases the dimensionality of the feature space. This slows the convergence of the optimization procedures to a satisfactory minimum.

Discriminant function analysis (DFA) does not suffer from this problem, though. With the stepwise addition rule used in DFA, features are added to the model one at a time, starting with an empty model. Each feature is temporarily added to the current model and the model is tested. The feature which gives the best performance when added to the current model is kept in the model for the next iteration. The procedure then repeats the test with each of the remaining features until the addition of features no longer improves the performance of the classifier or no remaining feature meets the inclusion criterion.

There are 1378 pairwise correlations between the 53 features calculated for all the 8456 cells used for training the classifiers. Table 4.3 gives the frequency distribution of the absolute values of the correlations between the 53 features. Of particular interest are the correlations between features that are close to one. There are seven feature pairs with a correlation value between .95 and 1.0. Seven features likely could have been removed from the dataset without the potential for a significant loss of performance of the classifiers.

Range of absolute correlation value     Number of occurrences
0-.5                                    1081
.5-.7                                   183
.7-.9                                   92
.9-.95                                  15
.95-1.0                                 7

Table 4.3: The frequency distribution for the absolute values of the pairwise correlations between the 53 features is shown. Correlation values near 1 are of interest because they indicate variables which can be eliminated from the classifier model.

Removing features whose correlations with others are slightly lower is problematic. Since the differences between negatives and MACs are subtle, one is reluctant to throw away any feature which might provide some discrimination value.
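The screening behind Table 4.3 can be expressed in a few lines: compute the 53 x 53 correlation matrix over the training cells and list the feature pairs whose absolute correlation exceeds a cut-off such as 0.95. The sketch below is illustrative only and is run on toy data; the cut-off and array layout are the only assumptions.

    import numpy as np

    def redundant_pairs(features, threshold=0.95):
        """Return (i, j, r) for every feature pair whose absolute correlation
        exceeds `threshold`, computed over all cells (rows) in `features`."""
        corr = np.corrcoef(features, rowvar=False)   # 53 x 53 for this data
        pairs = []
        n = corr.shape[0]
        for i in range(n):
            for j in range(i + 1, n):                # 53*52/2 = 1378 pairs
                if abs(corr[i, j]) > threshold:
                    pairs.append((i, j, corr[i, j]))
        return pairs

    # Toy data: feature 1 is an exact copy of feature 0, so one pair is flagged.
    rng = np.random.default_rng(2)
    x = rng.normal(size=(1000, 5))
    x[:, 1] = x[:, 0]
    for i, j, r in redundant_pairs(x):
        print("features", i, "and", j, ": r =", round(float(r), 3))

One member of each flagged pair is a candidate for removal before training, which is the pruning contemplated above.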
Feature selection forsuch problems is a weighty subject on its own and should be investigated separately in athorough manner.4.4 Training Method SelectionThere are several things to consider when selecting a training method. The most obviousconsideration is the speed of convergence of the method when training a network. Mostauthors report a rate of convergence an order of magnitude higher for their methodsover standard back-propagation. The benchmark problems used to conduct the tests areusually not representative of real-world problems. Ill most such problems there is a trueglobal minimum, for which the network weights exactly solve the training criteria. Inreal applications there is noise or errors in the training data and the problem does nothave an exact solution for most neural network structures.A second consideration is the reliability of the training method. Some algorithms aresensitive to the initial network weights. The error-weight space is full of spurious localChapter 4. Experiments and Results 52minima. When starting with random network weights, it is easy to become trapped inthe region of such a minimum.A common occurrence is for the weights of the network to grow without bound whellnear a spurious mrnimum. The training algorithm is able to make marginal improvementsin the performance of the network on the training data by making tenfold increases insome of the weights. This problem is avoided in some methods, such as Quickprop, bylimiting the maximum step size.A third consideration for selection of the training method is the sensitivity of thetraining procedure to its input parameters. Most training techniques have input param—eters which control step sizes, acceleration factors, momentum, etc. Much time can bespent merely tuning these parameters to provide satisfactory performance for a particularproblem. If the number of network weights is changed or a different number of trainingexamples are used it is necessary to readjust the parameters. The time spent findingtraining parameters rivals that spent finding a good solution to the problem.Three different training procedures were evaluated for designing a MAC classifier: Jacobs’ method [19], Polak-Ribiere conjugate gradient optimization [40], and the Quickpropalgorithm by Fahlmann [12]. Training was performed on a group of 500 negative cells and500 MACs, each represented by 20 features. Each method was tested for 1000 iterationsusing a variety of training parameters on 20 randomly initialized neural networks.The networks had 20 inputs, two hidden units and a single output unit. In general,the structure of a network with 20 inputs, two hiddens and one output would be denotedas 20—2—1. The network was trained to give an output of zero for MACs and one fornegative cells. This training criterion was used for all experiments for separating MACsand negatives.Jacobs’ method has five parameters which must be set for each training attempt.The best values for these parameters are determined by trial and error. When theChapter 4. Experiments and Results 53proper training parameters were selected it performed as well as the other two methods.However, selecting poor values for the parameters caused the method to produce networkswith poor performance on the test data of 500 MACs and 500 negatives. 
Of the threemethods examined, Jacob’s method was the most susceptible to spurious local minima.Jacob’s method was used with several different combinations of input parameters.The magnitude of the vector containing the network weights was found for each neuralnetwork and compared to the network performance. In 14 of the 20 networks trainedusing Jacob’s method the magnitude of the final neural network weight vector was atleast twice that of the average weight vector of the networks trained using Quickprop.The average performance of these 14 networks was 69% on the training data, while theperformance of the remaining six networks was 72%. The neural networks with largerweights had a poorer performance than the others. This occurred because these networkswere in the region of spurious minima. This behavior was also seen when the five inputparameters were varied.The conjugate gradient method is the simplest to use as it requires no input parameters, other than a stopping criterion, which all methods require. It starts with a set ofCartesian search directions which it adjusts to fit the local error-weight landscape. Themethod assumes that the error function is quadratic with respect to the variables [14].This assumption can lead to the search directions becoming linearly dependent when themethod is used for general optimization problems. This means that the algorithm endsup searching only a subspace of the weight space. The implementation of the conjugategradient algorithm in Numerical Recipes [40] avoids this by limiting the number of iterations of the algorithm to a maximum of 200. The search directions can be reset to theoriginal directions at this time and the procedure repeated.Like Jacobs’ method, the conjugate gradient method was found to be susceptible toundesirable local minima when optimizing a randomly initialized network. This trainingChapter 4. Experiments and Results 54method was used in an identical experiment to the one performed with Jacob’s algorithm. The network weights of eight of the 20 neural networks grew without bound.They grew to as large as 50 times the average magnitude of the weights of the optimalneural networks obtained with the Quickprop algorithm. Once in the neighborhood ofa minimum, however, the algorithm performed as well as the Quickprop algorithm forconvergence. This algorithm appears to be useful when used in conjunction with othertraining methods.The Quickprop algorithm has two parameters (, ) to set when training a network.The algorithm’s author recommends using the maximum step size, i=l.75 for most problems [12], leaving just one parameter to adjust. This had the fastest rate of convergenceand performed the most reliably of the three tested. When the previous experimentwas repeated with this algorithm, the average network performance was 72%, which wasachieved by only a few of the networks trained with the other methods. Consequently,it was used for all subsequent classifier design attempts.One way to prevent the network weights from growing without bound is to implement the weight decay formula (Equation 2.31). After each iteration of the algorithm,new network weights are multiplied a factor (1-c). This scheme is used to prevent thenetwork weights from becoming too finely tuned to the individual training examples.Only weights which receive reinforcement from many training examples should survivethe decay process.Ultimately, weight decay is used because it should lead to better generalization bythe network. 
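As a sketch of where the decay enters, the loop below applies the multiplicative factor (1 - epsilon) of Equation 2.31 after every weight update. The update itself is written as plain gradient descent on a toy quadratic error rather than the Quickprop step actually used, and the gradient function is a stand-in, so this shows only the placement of the decay.

    import numpy as np

    def train_with_decay(weights, grad_fn, iterations=5000,
                         learning_rate=0.1, decay=1e-3):
        """Generic training loop with the multiplicative weight decay of
        Equation 2.31: w <- (1 - decay) * (w + delta_w).  `grad_fn` must
        return dE/dw for the current weights."""
        w = weights.copy()
        for _ in range(iterations):
            delta_w = -learning_rate * grad_fn(w)   # stand-in for the Quickprop step
            w = (1.0 - decay) * (w + delta_w)       # only reinforced weights survive
        return w

    # Toy quadratic error E(w) = 0.5 * ||w - target||^2, so dE/dw = w - target.
    target = np.array([3.0, -2.0, 0.0])
    grad_fn = lambda w: w - target
    w0 = np.zeros(3)
    print(train_with_decay(w0, grad_fn, iterations=200).round(3))

In the toy example the decayed solution settles slightly short of the target, which is the intended effect: weights that are not continually reinforced by the data are pulled back toward zero.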
This can be tested by training a network with different levels of weightdecay and checking the disparities between the training and test set performances. Theclassification rate for the test set should be close to that of the training data if thegeneralization ability of the network is good.This test was conducted for the training and test sets of 1000 cells described earlier.Chapter 4. Experiments and Results 55The Quickprop algorithm was used with 15 randomly initialized networks. The learningrate was set to = .1 and each network was trained for 5000 iterations of the algorithm.Table 4.4 shows the results of this experiment for four weight decay levels: = 0,= = iO—, and = l0_2, There is a slight reduction in the performance on thetraining data as the level of weight decay is increased, The test set classification ratedoes not improve significantly as the decay is increased.Weight decay (e) Classification RateTraining Set Test Set0 70.9% 63.3%iO 70.0% 61.5%iO 68.0% 61.5%102 67.5% 63.6%Table 4.4: The average performance of 15 neural networks with 20—2—1 structure trainedwith Quickprop on 1000 cells (500 MACs, 500 negatives) is shown for different weightdecay levels. The test results for an independent set of 1000 cells should be closer tothose of the training data for larger decay levels.This experiment was repeated by increasing the number of training iterations from5000 to 20000 with similar results. There was no significant difference between theperformances of those networks trained with weight decay versus those trained withoutweight decay. The effect of weight decay should be more apparent in larger networks,where the the larger number of free parameters should lead to overfitting of the trainingdata. When the network was enlarged to 20—3—1 for this problem, the training setperformances were all around 72% while the test set performances remained at 61% forthe different weight decay levels. Thus, there were no strong results in favour or againstthe use of weight decay.This result may not be scalable up to situations where more features are used. It isChapter 4. Experiments and Results 56certain that with 53 features, some of which highly correlated, the reduction or elimination of some of the weights is possible. The use of a simple weight decay factor, however,may not be adequate to obtain the best possible neural network solutions to the problem.It may be better to limit the size of the network and improve the test set performance byjudicious elimination of features through other techniques. The remainder of the neuralnetwork classifiers obtained in this thesis were trained without any weight decay beingapplied to the network during training.4.5 Spurious ClassificationsThe importance of the data selection method can be demonstrated by performing anexperiment to detect the differences between negative cells from different staining batches.A set of 4000 negative cells were drawn from 66 slides, most of which were stained onsix different dates in 1991. Roughly half the slides were stained on three of these dates,and half on the other three dates. The 4000 cells were divided evenly into two groups of2000 cells: those stained in three batches in early February, and those stained later inFebruary and March. 
The two groups were split into training and test sets by selecting every other cell for inclusion in the training data, similar to what is shown in Figure 4.1.

A simple neural network with 53 input features, 1 hidden unit and 1 output (53-1-1) was trained to separate the two groups of 1000 cells of each class. The network was trained by performing 200 iterations of the Quickprop training algorithm. It correctly classified 82.3% of the 2000 training examples. The network was then applied to the test set of 2000 examples, where it had a correct classification rate of 80.6%.

The features of all scanned cells are normalized to prevent staining batch variations from affecting their values. Yet a simple classifier can consistently separate over 80% of the negative cells from three staining batches versus three others. This indicates the potential for obtaining spurious classifiers if one is not careful in the selection of the data for training and testing. If a larger neural network had been used, the classification rate would likely have been higher.

As a control, a related experiment was performed by selecting 2000 cells uniformly from across all staining batches of the same 4000 cell group for a new training set. Half of the cells were randomly selected to be class 0 and the remainder were called class 1. The remaining 2000 cells were also randomly divided into two artificial groups to form the test set. In this situation, there should be no differences between the two groups and it should be impossible to obtain a classifier that gives such remarkable results.

A neural network with structure 53-1-1 was trained for 200 and then a further 2000 iterations of the Quickprop algorithm. The network only had a performance of 55% after 200 iterations, so it was optimized for a further 2000 iterations. After the 2200 iterations the training set classification rate was 56.2% on the 2000 examples. The test set performance was 49.4%, which is what one would expect from a completely random decision, i.e. an equal likelihood of assigning objects to either class. This result highlights the significance of the 80% classification rate of the previous experiment.

4.6 Discriminant Function Analysis

Discriminant function analysis was applied to the training set (Section 4.1) of 8456 cells. The function was obtained using the 7M module of the BMDP statistical software package. The software includes a routine for stepwise addition of variables for inclusion in the discriminant function. The resulting discriminant function uses 43 of the 53 features. It correctly separated 62% of the training examples (Table 4.5) and 61.6% of the test examples (Table 4.6).

Cell type     Classified "non-MAC"     Classified "MAC"     Total
negative      2615 (61.8%)             1613                 4228
MAC           1595                     2633 (62.3%)         4228
Total         4210                     4246                 8456

Table 4.5: The linear discriminant function attained a 62.1% classification rate on the training set.

Cell type     Classified "non-MAC"     Classified "MAC"     Total
negative      5142 (60.8%)             3310                 8452
MAC           3178                     5274 (62.1%)         8452
Total         8320                     8584                 16904

Table 4.6: The linear discriminant function attained a 61.6% classification rate on the test set.

A quadratic discriminant function calculated on 53 features has 1432 variables to enter in the discriminant model. Since there are only 8456 training examples, there is nearly one free parameter for every five examples.
This has the potential problem of overfitting the data, so no quadratic discriminants were calculated for the MAC data.

4.7 Neural Network Experiments

Experiments were performed to train feed-forward neural networks with the training set to separate MACs from negatives. To find the network with the optimum number of hidden units, networks with different numbers of hidden units were used. The networks each had 53 inputs, one output and between one and seven hidden units. They were trained using the Quickprop algorithm.

The training procedure consists of optimizing the sum of squares error measure (Equation 2.14). For training purposes MACs have been designated class 0 and negatives have been designated class 1. Each cell, represented as a 53-component vector, is used as the input to the neural network. The feed-forward mechanism is applied to the network and the output unit returns a continuous value between zero and one. The error contribution for each cell is just the difference between the designated class and the output value.

It is necessary to provide a stopping criterion to end the training procedure when no progress is being made. Normally, training is stopped when the network error is no longer decreasing with each iteration or when the length of the vector derivative of the error with respect to the weights drops below a certain limit. It was decided to train each network for a fixed number of iterations of the training procedure. The number of iterations was selected to be large enough that each network had sufficient time to reach a local error minimum. In the case of simple networks, with one or two hidden units, 1000 iterations sufficed. For large networks, with seven hidden units, an upper limit of 20000 iterations was used.

This upper limit was due, in part, to the amount of time required to perform 20000 iterations with a large network. During each iteration, the entire training set of 8456 cells must be presented to the network and the error derivative calculated for each cell. The training of a seven hidden unit network required more than 50 hours of CPU time on the HP Series 700 Apollo workstation used for these experiments.

Table 4.7 gives the number of training iterations used when training networks with different numbers of hidden units. It also lists the number of weights that are present with each network structure according to Equation 2.22. The largest network used had 386 free parameters. All networks attempt to optimize the fit of 8456 equations simultaneously. Thus, the ratio of training examples to free parameters is always kept above 20:1.

Number of        Number of weights     Number of training
hidden units     in network            iterations used
0                54                    -
1                56                    1000
2                111                   1000
3                166                   5000
4                221                   10000
5                276                   20000
6                331                   20000
7                386                   20000

Table 4.7: The number of network weights and number of training iterations used are shown for networks with different numbers of hidden units.

Each neural network was trained 20 times using different random starting weights. This was done in order to obtain an estimate of the average performance that could be expected from networks with such a structure. Each network was then tested using the independent validation set containing 16904 cells. Table 4.8 shows the average performance of each network structure on the training and test sets. The standard deviations of the training and test performances are also given.
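To make the training criterion concrete, the following is a compact sketch of a 53-h-1 network of the kind described above: sigmoid units, targets of 0 for MACs and 1 for negatives, a sum-of-squares error, and a 0.5 threshold on the output. It is written with NumPy and plain batch gradient descent rather than the Quickprop optimizer actually used, and the data are random placeholders, so it illustrates only the structure of the computation.

    import numpy as np

    def sigmoid(h):
        return 1.0 / (1.0 + np.exp(-h))

    class SmallNet:
        """53-h-1 feed-forward network with sigmoid units, trained to output
        0 for MACs and 1 for negatives by minimizing the sum-of-squares error."""

        def __init__(self, n_in=53, n_hidden=4, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_in))  # input -> hidden
            self.b1 = np.zeros(n_hidden)
            self.W2 = rng.normal(0.0, 0.1, n_hidden)          # hidden -> output
            self.b2 = 0.0

        def forward(self, X):
            self.H = sigmoid(X @ self.W1.T + self.b1)
            return sigmoid(self.H @ self.W2 + self.b2)

        def train_step(self, X, t, lr=0.5):
            y = self.forward(X)
            # Back-propagate the sum-of-squares error 0.5 * sum (y - t)^2.
            d_out = (y - t) * y * (1.0 - y) / len(t)
            d_hid = np.outer(d_out, self.W2) * self.H * (1.0 - self.H)
            self.W2 -= lr * (self.H.T @ d_out)
            self.b2 -= lr * d_out.sum()
            self.W1 -= lr * (d_hid.T @ X)
            self.b1 -= lr * d_hid.sum(axis=0)
            return 0.5 * np.sum((y - t) ** 2)

    # Random placeholder data standing in for the normalized nuclear features.
    rng = np.random.default_rng(3)
    X = rng.normal(size=(400, 53))
    t = (X[:, 0] + 0.3 * X[:, 1] > 0).astype(float)  # fabricated labels
    net = SmallNet()
    for _ in range(2000):
        net.train_step(X, t)
    pred = (net.forward(X) >= 0.5).astype(float)     # threshold the output at 0.5
    print("training accuracy:", (pred == t).mean())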
For reference, the performance of the discriminant function is also shown in Table 4.8.

The simplest neural networks, with one hidden unit, form linear decision boundaries between the classes and are comparable to linear discriminant functions. As expected, their performance is almost identical to that of the discriminant function obtained for the training data. These networks and the DF achieve approximately 61% correct classification on the training and test sets.

There is a large performance improvement when two hidden units are used. The training set performance rises to 72.9% while the test set performance is 71.6%. The difference between the training and test set performances increases as more hidden units are used. The training set results improve marginally with each extra hidden unit, but the test set performance peaks when four hidden units are used. When five, six or seven hidden units are used, the average test set results actually decline. These results are shown graphically in Figure 4.2.

[Figure 4.2: plot of per cent correct (58-78) versus number of hidden units (1-7) for the training and test sets.]

Figure 4.2: The cell-by-cell performance of neural networks with different numbers of hidden units is shown for the training and test sets. Each result is averaged for 20 runs using randomly initialized networks. The standard deviations are shown as error bars for each result.

Number of        Average training     Average test        Best network
hidden units     set performance      set performance     performance
discriminant     62.0%                61.6%               -
1                61±3%                61±3%               63.5%
2                72.9±.3%             71.6±.2%            71.9%
3                73.4±1.0%            71.8±.2%            72.4%
4                74.7±.6%             71.9±.4%            72.5%
5                75.7±.4%             71.6±.3%            72.0%
6                76.3±.4%             71.0±.4%            71.6%
7                76.6±.7%             71.1±.4%            71.6%

Table 4.8: The average performance and standard deviations for 20 randomly initialized networks are shown when different numbers of hidden units are used. The training set performance improves as the number of hidden units is increased, but the test set performance peaks when four hiddens are used. The performance of the best network on the test data is given for each case.

The disparity between training and test set results grows as more hidden units are used. The significance of the difference between the training and test results can be seen when considering the standard deviation of each result. The error bars in the figure show the standard deviations of each average and are small compared to the difference between the two curves.

This can be understood in terms of the discussion of Section 2.2.7. A network with six or seven hidden units and the 53 input features has too many free parameters for this training set. Although the number of training examples is between 20 and 30 times the number of weights, the neural networks are still overfitting the training examples. This is due, in part, to an overlap in the feature space between the two classes, as was the case in the classification example of Chapter 2. When classifying MACs, the overlap is caused by misclassified cells among the training examples and noise in the feature values. In both cases using too many hidden units causes the network to attempt to fit errors and noise in the training set. Figure 2.10 shows how this occurs for the classification example.
This behavior is also present when six or seven hiddens were used for the MACcell-by-cell classifier.The classification percentages for the neural network and discriminant functions wereobtained by choosing a threshold and applying it to the classifier output. In the case ofa neural network that is trained to call objects either class 0 or class 1 it is normal tochoose 0.5 as the threshold. This threshold can be set arbitrarily, allowing one to changethe levels of false positives and false negatives.In this study, MACs are designated class 0 while negatives are designated class 1. Ifit is the case that the threat of calling a negative cell “MAC” (false positive) is moredangerous than calling a MAC “negative” (false negative), the decision boundary can bemoved to say, 0.6. Then all objects that score in the range 0.0 to 0.6 are called “MAC”reducing the number of false negatives at the expense of more false positives.This information can be represented visually in a graph known as a Receiver-OperatorCharacteristic (ROC) curve. This graph displays the false negative versus false positiverate for the classifier. The ROC curves were calculated for the neural networks thatgave the best test set performance for each choice of hidden units. Figure 4.3 shows theROC curves for the discriminant function and neural networks with one, two and fourhidden units. The ROC curves for the best networks containing three, five, six and sevenhidden units nearly coincide with those of the two and four hidden-unit curves and arenot shown.The diagonal line in the figure represents the performance of a classifier which gives nodiscrimiuation (i.e. randomly assigning a class to each object). For example, calling allobjects “MAC” would give a false positive rate of 100%, or calling all objects “negative”would give a false negative rate of 100%. The ROC curves for the better classifiers dipmore deeply below this line, than do those for the poorer classifiers. The ROC curve foran ideal classifier would trace a path along the horizontal axis and up the vertical axis.Chapter 4. Experiments and Results 6410090807060False 50Negative40302010060 70 80 90 100Figure 4.3: The ROC curves for the the discriminant function and networks giving thebest test set performance are shown. The ROC curves for the best networks using three,five, six and seven hidden units nearly coincide with those of the two and four hiddennetworks and are not shown.ROC curve for cell by cell classifiers on the test data0 10 20 30 40 50False PositiveChapter 4. Experiments and Results 65Although the discriminant function and the neural network with one hidden unit areboth linear classifiers, their ROC curves are different. The neural network converged to aset of weights that gives it a different behavior than that of the discriminant function. Itgives a lower false negative rate when the false positive rate is allowed to be large. Thatmakes this neural network more suited to the task of identifying MACs in a situationwhere the penalty for accidentally classifying a MAC as “negative” must be avoided.The discriminant function, however, performs better in a situation where misclassifyingnegatives as “MAC” has a more severe penalty, Its ROC curve is lower in the regionwhere the false positive rate is constrained to be small.The difference between these two classifiers is apparent when the weights that connectthe inputs to the hidden unit are treated as a vector. 
The angle between this perceptronvector and the discriminant function vector can be found from their inner product. Theangle between the vector of perceptron weights and the linear discriminant vector is107°, which is 17° from orthogonal. Yet, both classifiers score 61% on the test data.Thus, there exist nearly orthogonal hyper-planes that separate significant portions of thetwo classes. This argues that a nonlinear decision surface should give better separationbetween MACs and negatives.The difference between the one hidden unit network and discriminant function vectorclassifiers can also be demonstrated by examining histograms of discriminant functionand neural network values. Figure 4.4 shows a histogram of the discriminant functionvalues for the test data. The decision boundary is marked by a discriminant value ofzero. The MAC population has a slightly lower discriminant value than the negativepopulation, which is seen in the small separation between the two peaks. This can becontrasted to the histogram of the output of the neural network with 1 hidden unit seen inFigure 4.5. Here the decision boundary giving optimal performance is at neural networkoutput equal to 0.5. This classifier assigns most test examples to a network value aroundChapter 4. Experiments and Results 660.4, which causes many negatives to be misclassified as “MAC”.The narrow peak of the neural network histogram is the result of the application ofthe activation function g(...) to the sum of the inputs multiplied by the first weightlayer. Since g(...) saturates for extreme positive or negative values, many test examplesare grouped into the same bins at neural network values 0.4 and 0.85. If we remove thehidden unit and the activation function from this network, what remains is the classicalperceptron. The histogram of the perceptron output is shown in Figure 4.6.The appearance of the histogram of the neural network (Figure 4.5) and that of theperceptron are markedly different. The test examples with perceptron output between-20 and 0 are squashed by the activation function into a narrow neural network outputrange at 0.38. As well, the examples that receive a perceptron output between 7 and 20are grouped at the neural network peak between .825—.85.“Unsquashing” the neural network output has revealed an interesting result. Theperceptron assigns negative examples to both extremes. This is not visible in either theROC curve or histogram for the neural iletwork because the bins used to calculate thehistogram were too wide to resolve details in the peak. The ROC curve for the perceptronis shown in Figure 4.7. When the ROC curve for the network is replotted on a finer scaleit is identical to that of the perceptron. This is expected since the activation functionthat is applied to perceptron output is monotonic.Chapter 4. Experiments and Results 67700600500400Frequency3002001000-2“MAC” Discriminant ValueFigure 4.4: The histogram of discriminant function values is shown for the test data.The discriminant function correctly classifies 61.6% of the test examples.Histogram of discriminant function value for the cell by cell classifier-1.5 -1 -0.5 0 0.5 1 1.5 2“Negative”Chapter 4. Experiments and Results 68Figure 4.5: The histogram of neural network output values for the optimal network with1 hidden unit is shown for the test data. 
Figure 4.6: The histogram of perceptron output values for the test data. The perceptron assigns negative cells to extreme output values.

The ROC curve shows that the perceptron actually performs worse than chance when the false positive rate is constrained to be small. This occurs because, in the region between perceptron outputs -20 and -12 in Figure 4.6, more negatives receive a “MAC” score than true MACs. To obtain the best performance from the perceptron classifier it is therefore necessary to apply a threshold in combination with it: all objects that receive a perceptron score less than the threshold are automatically classified as “negative”. The optimal threshold, calculated from the training examples, occurs at a perceptron value of approximately -11.5. Setting the threshold at this value gives the highest correct classification rate for the objects that score below it.

The new classifier, which applies this threshold in combination with the perceptron, was applied to the training and test examples. It correctly classifies 69.2% of the training examples and 68.3% of the test examples. The ROC curve for this new classifier is also shown in Figure 4.7.

Figure 4.7: The ROC curves on the test data for the perceptron and for the combination of the perceptron and threshold. The ROC curve for the neural network with two hidden units is shown for comparison purposes.
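A minimal sketch of this combined decision rule follows, using the conventions above (class 0 for “MAC”, class 1 for “negative”, and the cut-off of roughly -11.5 estimated from the training set). The function and parameter names are assumptions, and the ordinary decision boundary must be supplied as the perceptron value corresponding to the chosen network-output threshold.

import numpy as np

def hybrid_classify(perceptron_scores, boundary, cutoff=-11.5):
    """Combined perceptron-plus-threshold classifier.

    Cells scoring below the cutoff are forced to "negative" (class 1);
    the remaining cells are classified by the usual decision boundary,
    with scores below the boundary called "MAC" (class 0).
    """
    scores = np.asarray(perceptron_scores)
    labels = np.where(scores < boundary, 0, 1)  # ordinary perceptron decision
    labels[scores < cutoff] = 1                 # override the extreme low tail as "negative"
    return labels

# Usage sketch: preds = hybrid_classify(scores, boundary=b, cutoff=-11.5),
# where b is the perceptron value corresponding to a network output of 0.5.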
The histogram of the best neural network (of the 20 training attempts using this structure) with two hidden units is shown in Figure 4.8. There is a significant reduction in the number of negatives being called “MAC”; this is seen by comparing the height of the negative peak around network values 0.2–0.3 in Figures 4.5 and 4.8. This difference accounts for much of the performance improvement over the perceptron and discriminant function classifiers. The ROC curve for the two hidden-unit classifier (Figure 4.3) is significantly lower than those of the other classifiers in the region where false positives are kept below 50%, and it approaches the performance of the perceptron at higher false positive values.

Figure 4.8: The histogram of neural network output values for the optimal network with two hidden units, shown for the test data. This network correctly classifies 71.9% of the test examples.

The histogram also shows frequency peaks for the negative test examples at network values 0.85 and 0.95. It appears that some feature differences exist among the population of negative examples. This trend is also seen in the histogram of network values for the optimal network with three hidden units, shown in Figure 4.9. Here, the test examples are more evenly distributed between the two frequency peaks near 1.0 than in the former. The performance of the three hidden-unit network for identifying MACs is similar to that of the two hidden-unit network: both have a MAC frequency distribution that assigns most MACs a network value of around 0.3, and both tail off slowly to zero as the network value approaches 1.0.

Figure 4.9: The histogram of neural network output values for the optimal network with three hidden units, shown for the test data. This network correctly classifies 72.4% of the test examples.

The optimal network with four hidden units provides the most interesting histogram (Figure 4.10). Four distinct frequency peaks are seen at network values 0.12, 0.32, 0.8 and 1.0, labelled peaks 1 through 4 in the figure. Peaks 1 and 4 are the expected result of training the network to separate MACs and negatives. The first two processing layers of the network assign each training example a value (Equation 2.12) that is mapped into the range 0–1. This causes many MAC examples to be grouped together near zero and negative examples near one. Peak 4 occurs near 1.0, but peak 1 is found at a network value around 0.12. This is most likely due to an inability of the network to satisfy both the constraint of assigning negatives exactly 1.0 and that of assigning MACs a value of 0.0. This condition is more conspicuous with the simpler networks (Figures 4.5, 4.8, 4.9), where neither peak occurs at zero or one.

Figure 4.10: The histogram of neural network output values for the optimal network with four hidden units, shown for the test data. This network correctly classifies 72.5% of the test examples.

Peaks 2 and 3 are unexpected and require interpretation. Both consist of more than 1000 cells over three histogram bins, which is roughly 12% of the 8452-cell population of either class. A possible explanation is that the peaks are an artifact of the decision surface of the neural network. Peak 3, however, is also seen in the simpler neural networks that use two and three hidden neurons.

Most of the 1300 negative cells that make up peak 3 come from 41 of the 190 normal slides. The histograms for these slides contain frequency peaks near 0.8 and have almost no cells near 1.0. Figure 4.11 shows the histogram of neural network output for the negative cell population from the remaining 149 slides after these 41 slides have been removed; the original histogram for all the negative cells is also shown. When the 41 slides are removed, the frequency peak near 0.8 is also removed.
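One way such slides could be flagged is to group the network outputs by slide and look for slides whose cells concentrate near 0.8 while almost none reach values near 1.0. This is only an illustrative reconstruction of the per-slide inspection described above; the slide identifiers, the cut-offs and the data structures are assumptions, not values taken from this study.

import numpy as np
from collections import defaultdict

def flag_peak_slides(slide_ids, outputs, low=0.75, high=0.85, tail=0.95, tail_frac=0.05):
    """Group per-cell network outputs by slide and flag slides whose
    negative cells pile up near 0.8 with almost no cells near 1.0."""
    by_slide = defaultdict(list)
    for sid, value in zip(slide_ids, outputs):
        by_slide[sid].append(value)

    flagged = []
    for sid, values in by_slide.items():
        values = np.asarray(values)
        frac_peak = np.mean((values >= low) & (values <= high))  # mass near 0.8
        frac_tail = np.mean(values >= tail)                      # mass near 1.0
        if frac_peak > 0.5 and frac_tail < tail_frac:
            flagged.append(sid)
    return flagged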
The first explanation examined was whether staining differences can account for the unique distribution of neural network values for the cells on the 41 slides. Table 4.9 shows the breakdown of staining dates for the 41 slides; the right-hand column lists the total number of slides stained on each date that were used in the study. The table shows that the slides came from a total of seven staining batches, with a majority of the slides coming from three of the batches. For each batch, however, there is a significant number of slides that do not share the characteristics of the 41 slides. If the peak were an artifact of staining, one would expect almost all of the slides from a given batch to share that characteristic.

Figure 4.11: The histograms of neural network output values for the optimal network with four hidden units, shown for the whole negative test set and for the case where the 41 slides have been removed.

Staining date   Number of slides with characteristic   Total number of slides used in study
2/12/91         1                                       7
2/19/91         1                                       3
6/20/91         8                                       38
8/15/91         12                                      20
10/03/91        11                                      30
12/04/91        4                                       8
10/10/91        4                                       11
Total           41                                      117

Table 4.9: The staining dates of the 41 slides that make up peak 3 (see text). The total number of slides from each date is given in the rightmost column.

Another hypothesis was that the scanning of the 41 slides by the automated image cytometer accounts for the differences between these slides and the rest. Examination of the scanning dates reveals that 40 of the 41 slides have a scanning date of 02/10/93. However, 133 of the 190 normal slides also have this scanning date. It was not possible for the automated cytometer to scan all 133 slides in one day; the scanning most likely occurred over the course of a week and all slides were given the same scanning date. It is unlikely that the configuration of the apparatus was changed during this time. The scanning dates and the ages of the patients (which were uniformly distributed) do not appear to provide a reasonable explanation for the phenomenon.

The two frequency peaks for negative cells were also present in the histogram of network output for the neural network using two hidden units. For this network it is possible to interpret the data visually. The two hidden units in this network each act as a perceptron, and their outputs are combined in a nonlinear expression to obtain the final neural network output. Each perceptron can be treated as a vector in the feature space. A scatter plot of the perceptron outputs for this network is shown in Figure 4.12. For clarity, only every eighth training example is plotted; the appearance of the graph would be similar if any other portion of the training data were plotted.

Figure 4.12: The outputs of the two perceptrons that make up the two hidden-unit network, plotted for every eighth training example.

The figure shows that the negative examples are clustered at the extremes of the distribution. The examples that receive a large output value from perceptron 1 (the rightmost cluster) make up the peak at 0.80 in Figure 4.8, while the examples that receive a large output value from perceptron 2 (the leftmost cluster) make up the peak at 0.95. Although the figure demonstrates the existence of the two clusters, the reason for their existence remains unknown.
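A sketch of how such a scatter plot could be produced from the trained weights: each hidden unit's pre-activation is the inner product of its weight vector with the cell's feature vector plus a bias term. The weight and bias names below are placeholders rather than the notation used in this thesis.

import numpy as np

def perceptron_outputs(features, w1, b1, w2, b2):
    """Pre-activation outputs of the two hidden units of a two hidden-unit
    network, treating each hidden unit as a perceptron in feature space.

    features: (n_cells, n_features) array of nuclear feature vectors.
    w1, w2:   weight vectors of the two hidden units.
    b1, b2:   the corresponding bias terms.
    """
    X = np.asarray(features)
    p1 = X @ np.asarray(w1) + b1
    p2 = X @ np.asarray(w2) + b2
    return p1, p2

# The pairs (p1[i], p2[i]) are the coordinates plotted in the scatter plot;
# plotting every eighth example corresponds to p1[::8], p2[::8].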
Peak 2 in Figure 4.10 is also difficult to explain. It occurs at a network value of around 0.32 for both the MACs and the negatives. There does not appear to be a preferred staining date among the MACs that make up this peak. Roughly 1000 of the 1300 cells in the region 0.3–0.4 come from 44 of the 144 slides, and these 44 slides were stained on at least 11 different staining dates, so staining does not appear to be the common thread between them. Further, the slides that contain the cells making up peak 2 also contain cells that make up peak 1.

Since there is a smaller peak of negatives at a neural network value of 0.3, it is possible that this peak is an artifact of the neural network. The weights for this particular network may have been optimized such that they reached an error minimum in a region of the weight space where more of the negatives were given a network value of 1.0 at the expense of increasing the average MAC value. This peak is not present in either of the histograms for the two and three hidden-unit networks, making this explanation more likely.

A second explanation of this peak is that there are two subpopulations in the MAC training data, or that the MAC data are non-normally distributed. If the second subpopulation has features which more closely resemble those of the negative cells, the network will be unable to fulfill the constraints of assigning the negative population 1.0 while also assigning the “negative-like” MACs 0.0. This would also explain why there is a small negative peak at 0.3: these negatives could have feature values very close to those of the peak 2 subpopulation and would therefore be misclassified into the region near peak 2.

Such a subpopulation could exist if there were misclassified cells in the data. If MAC expression in cells near a premalignant lesion is not assured (i.e. if it is possible to have a lesion where the intermediate cells around it do not display MAC characteristics), then much of the data which has been called MAC could be in error. Such a biological interpretation would require more analysis of MACs themselves and cannot be inferred from this study of statistical classifiers.

Another interpretation could be tested by measuring whether the cells found in this peak are located further from the lesion than the cells of the same slide that were correctly assigned to peak 1. It has been mentioned that MACs have been detected 10 cm from the tumor site in some organs. If this property varies or is subject to other limiting factors, then some intermediate cells further from the lesion may not exhibit MAC expression. Nieburgs has observed two different types of MAC expression [6]: he has noted that MAC expression in intermediate cells that are in close proximity to a tumor differs from MAC expression in cells that are distant from the tumor.

This study cannot be done on the cytological smears used in this thesis because when the physician scrapes cells from the cervical epithelium and smears them on a slide, all spatial relationships between the cells are lost. It could be done on biopsies, where the tissue is removed in a single piece, sliced and fixed on the slide, so that the spatial distribution of the cells is maintained.

4.8 Principal Component Analysis

Principal component analysis (PCA) is a statistical technique for data analysis. It is a data reduction method that finds a ranked set of orthogonal vectors that accounts for as much of the variance of the data as possible. The goal is to find the smallest possible set of vectors that spans the subspace occupied by the data.
This is useful for making large multi-dimensional data sets more manageable for analysis. It is often the case that most of the variance of the data can be accounted for by the first few principal components.

The first principal component is designated to be the vector along which the variance of the data is greatest. The second principal component is then constrained to be the vector in the subspace perpendicular to the first that accounts for the greatest portion of the remaining data variance. This definition is applied in a similar fashion to all the remaining principal components until the entire space is spanned by the mutually orthogonal vectors. This is shown graphically in Figure 4.13 for a simple data distribution in two dimensions. The density distribution is shown as the shaded region and the two principal component vectors are shown beneath. The first principal component points along the direction of greatest spread of the data. In two dimensions, the second principal component is then completely determined and lies perpendicular to the first.

Figure 4.13: The two principal components of an artificial data distribution in two dimensions, shown by the shaded region. The first principal component lies along the direction of greatest variation of the data; the second component is constrained to be perpendicular to the first.

The principal components of a data distribution are obtained by calculating the covariance matrix for the distribution and finding the eigenvectors of this matrix. The kth principal component is the eigenvector corresponding to the kth largest eigenvalue of the matrix. The magnitudes of the eigenvalues also give the relative contribution of each principal component vector in accounting for the variation of the data.

The principal components were calculated for the training set of 8456 cells. Table 4.10 shows the eigenvalues for the first 18 principal components. The contribution of each principal component to the total variation of the data is shown in the third column and the cumulative total in the fourth. The results reveal that the first 13 components account for 90% of the data variation and the first 28 components account for 99%. This demonstrates that the underlying structure of the data can be explained with 28 features made up of components of all 53 original features. This result is expected since there are large pairwise correlations between some of the 53 features. However, the result also tells exactly how many orthogonal directions are needed to represent the data to a specified accuracy. In this case each data point can be represented as a 28-component vector with only a 1% loss of the expected variation of the data.

The first ten principal components account for 86.9% of the variation of the training data. The training and test data were transformed into ten-component vectors using these ten principal components, and the new data sets were used to train and test neural network classifiers to separate MACs and negatives. The neural networks were trained with the Quickprop algorithm as before, using different numbers of hidden units. Each training run was performed 20 times using different randomly initialized networks and the results averaged. Table 4.11 shows the performance of the neural networks trained on the first ten principal components of the MAC dataset.
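A minimal sketch of this computation, using the eigen-decomposition of the covariance matrix as described above; the variable names and the use of NumPy are assumptions, and the implementation actually used in this study may differ.

import numpy as np

def principal_components(X):
    """Return eigenvalues and principal component vectors of the data X,
    sorted from largest to smallest eigenvalue.

    X: (n_cells, n_features) array of feature vectors.
    """
    Xc = X - X.mean(axis=0)                 # centre the data
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the features
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: the covariance matrix is symmetric
    order = np.argsort(eigvals)[::-1]       # sort by decreasing eigenvalue
    return eigvals[order], eigvecs[:, order]

# Fraction of the total variation carried by the first k components, and the
# projection of the data onto those k directions:
# eigvals, eigvecs = principal_components(train_features)
# explained = eigvals[:k].sum() / eigvals.sum()
# reduced = (train_features - train_features.mean(axis=0)) @ eigvecs[:, :k]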
Number   Eigenvalue   Percentage of total variation   Cumulative percentage
1        161058.8     36.5                            36.5
2        60906.4      13.8                            50.3
3        34315.3      7.8                             58.1
4        30629.8      6.9                             65.0
5        25316.6      5.7                             70.8
6        20038.1      4.5                             75.3
7        17602.1      4.0                             79.3
8        13403.9      3.0                             82.4
9        11307.7      2.6                             84.9
10       8644.4       2.0                             86.9
11       6094.7       1.4                             88.3
12       5801.5       1.3                             89.6
13       5233.1       1.2                             90.8
14       4959.3       1.1                             91.9
15       4721.8       1.1                             93.0
16       4398.5       1.0                             94.0
17       3409.2       0.8                             94.7
18       2702.5       0.6                             95.3

Table 4.10: The first 18 principal components of the MAC training data, obtained by finding the eigenvalues of the covariance matrix of the 53-feature training set. The contribution of each principal component to the total variation of the data and the cumulative total are also given.

Number of hidden units   Average training set performance   Average test set performance
1                        59.0±.1%                           58.4±.1%
2                        62.9±.3%                           61.6±.2%
3                        64.7±.4%                           63.6±.2%
4                        65.1±.4%                           63.7±.3%
5                        65.3±.4%                           63.6±.3%
6                        65.6±.4%                           63.7±.3%
7                        66.0±.4%                           64.0±.3%

Table 4.11: The average performance and standard deviations of 20 randomly initialized networks for different numbers of hidden units. The networks were trained on the first ten principal components of the training set, which account for 86.9% of the variation of the data.

The results show that on average the training set performance does not exceed 67% and the test set results peak near 64%. The reduced classification rate is expected since the classifier is training on only a subset of the full feature space. In this situation the training rate is more reflective of the expected classification rate on independent data sets: the training rate is within 2% of the test rate in each case, which was true for only the smaller networks trained on all 53 features (Table 4.8). This suggests that the networks trained on all 53 features are being overfitted by the use of too many features.

Although PCA is useful for reducing the dimensionality of the data, it does not suggest which features are important for discrimination. The principal components point along the directions of greatest variance, but the features with the greatest variance are not necessarily those which differ between the MACs and the negatives. A better strategy for reducing the dimensionality of the data would be to examine the features individually for inclusion in the model and to perform this in conjunction with PCA.

4.9 Slide-by-Slide Classification

It is the slide-by-slide classification rate of a classifier that is of interest to cancer researchers. This rate is more meaningful than the cell-by-cell rate since it addresses the reliability of the classifier's diagnosis of the patient. In general, the cell-by-cell classifier is used as a decision step in the larger classification system, as shown in Figure 3.3, and the slide-by-slide rate would normally be calculated for the whole system.

For comparison purposes, the slide-by-slide classification rates were obtained for the MAC cell-by-cell classifiers. The data used to train and test the slide-by-slide classifiers consisted of 93494 cells taken from a total of 395 slides. All normal and severe slides with at least 20 cells were included. There were 65645 negative cells taken from 251 normal slides and 27985 MAC cells taken from 144 severe slides.

The data used to train and test the cell-by-cell classifiers were a subset of the data used in the slide-by-slide classification. This introduces the possibility of a biasing error in the design of the slide-by-slide classifier.
The cells used to design the best neural networks were used again in the slide-by-slide classification. This potential error is mitigated by the fact that the cells used to train the cell-by-cell classifiers comprised less than 10% of the total data. The statistical features used by the slide classifiers are calculated from all the available cells on a slide. Most slide datasets contain images of between 200 and 600 intermediate cells, and the nuclear features of all of these cells are used to calculate the slide features. For some of these slides, up to 50 cells had been used in the cell-by-cell classifier study; their influence on the value of each of the slide features is small and their effect on the results should be negligible.

The slides were split into two groups by selecting every second slide for inclusion in the training set. The training set consisted of 198 slides and the test set of 197 slides. Table 4.12 provides a breakdown of the number of slides of each type used for slide-by-slide classification.

Slide type   Training   Testing   Total slides of type
negative     126        125       251
MAC          72         72        144
Total        198        197       395

Table 4.12: Breakdown of the number of slides of each type used to design slide-by-slide classifiers.

When selecting features for the slide classifier, an obvious choice is the average classifier value assigned to the MAC and negative cells. This statistic (called AVGVAL) was selected as one of the features. A total of five features were calculated for the intermediate cells on each slide (a sketch of how these statistics could be computed follows the list):

1. MAC: the fraction of the slide population with neural network value less than 0.5, or discriminant function value less than 0,

2. VMAC: the fraction of the slide population with neural network value less than 0.25 (not used for the discriminant function),

3. AVGVAL: the average value of the neural network output or discriminant function value,

4. STDDEV: the standard deviation of the neural network values, and

5. SKEW: the skewness of the distribution of neural network values.
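The sketch below computes these five slide features from the per-cell classifier outputs of a single slide, using the 0.5 and 0.25 cut-offs given above (for the discriminant function the MAC cut-off is 0 and VMAC is not used). The function name and the dictionary layout are assumptions, not the representation used in this study.

import numpy as np

def slide_features(cell_outputs):
    """Compute the five slide features from the cell-by-cell classifier
    outputs (neural network values) of the intermediate cells on one slide."""
    x = np.asarray(cell_outputs, dtype=float)
    mean = x.mean()
    std = x.std()
    skew = float(np.mean((x - mean) ** 3) / std ** 3) if std > 0 else 0.0
    return {
        "MAC":    float(np.mean(x < 0.5)),    # fraction of cells called "MAC"
        "VMAC":   float(np.mean(x < 0.25)),   # fraction of cells well below the boundary
        "AVGVAL": float(mean),                # average classifier value on the slide
        "STDDEV": float(std),                 # spread of the classifier values
        "SKEW":   skew,                       # skewness of the distribution
    }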
There is significant overlap in some of the slide feature definitions. For example, any slide which receives a large VMAC (“Very MAC”) score will necessarily have a high MAC score, so these features will be highly correlated. Likewise, any of the slides described above will have a low AVGVAL score, indicating a strong negative correlation of the AVGVAL feature to these other features.

Table 4.13 shows the correlations between the five slide features for the slide training set, calculated using the five hidden-unit network. The features MAC, VMAC, AVGVAL and SKEW have large pairwise correlations, while the STDDEV feature has only a small correlation to these variables. This behavior was also observed for the slide features calculated using the other neural networks.

Feature   MAC   VMAC   AVGVAL   STDDEV   SKEW
MAC       1.0   .92    -.98     .14      .84
VMAC       —    1.0    -.91     .06      .78
AVGVAL     —     —      1.0     -.20     -.90
STDDEV     —     —       —      1.0      .35

Table 4.13: Correlation values of the five slide features for the five hidden-unit cell-by-cell classifier, calculated for the slide training set.

Table 4.14 shows the results of performing slide-by-slide classification using thresholds calculated for individual features. The results were obtained by setting a threshold using the training slide set and recording the classification rate of this threshold when applied to the test slide set. The STDDEV feature provided almost no discrimination ability when used alone or in conjunction with the other features and is not included in the table.

All the neural networks with two or more hidden units scored between 74% and 77% correct slide classification for at least one of the slide features. This is a significant improvement over the perceptron and the discriminant function, which both gave approximately a 69% correct classification rate. The performance of the thresholds on the training slides was always 1–2% higher than the test results recorded in Table 4.14. Each score is sensitive to the threshold position, varying by up to 2% when the threshold is shifted by a small amount; this occurs because there are only 197 test examples.

Number of hidden units   MAC    VMAC   AVGVAL   SKEW
Discriminant             69.0    —     69.5     66.0
1                        68.6    —     69.0     63.6
2                        75.6   76.6   76.0     75.1
3                        76.1   75.1   75.6     72.1
4                        74.1   74.6   73.6     74.1
5                        77.0   76.1   76.6     74.6
6                        75.6   76.6   76.0     75.1
7                        75.1   75.1   74.1     72.3

Table 4.14: The performance (percent correct) of the slide features for classification of the test set of 197 slides, shown for the best neural networks and the discriminant function.

For each classifier the four slide features produced nearly the same results, which is expected given the strength of the correlations between the features. It is interesting to note that the neural network with four hidden units, which gave the highest cell-by-cell classification rate, scored lower than the other networks for slide classification. Figure 4.14 shows the ROC curves for slide-by-slide classification using AVGVAL for three classifiers. The ROC curves for the neural networks with three, four, six and seven hidden units are not significantly different from those shown for the two and five hidden-unit networks.

Figure 4.14 displays the performance advantage of neural networks over the linear discriminant for classifying MAC slides. The ROC curve for any of the neural networks with two or more hidden units shows a significant improvement over that of the discriminant: there is at least a 5–8% increase in the slide-by-slide classification rate when using neural networks. The figure also shows that the networks with more than two hidden units offered no significant improvement over the network with two hidden units for slide-by-slide classification.

Figure 4.14: The slide-by-slide ROC curves for the discriminant function and for neural networks with two and five hidden units. The slide test set consists of 125 normal and 72 severe slides.

Chapter 5

Conclusions and Further Work

Pattern classifiers based on feed-forward neural networks are a relatively recent development; their use has spread since the publication of the back-propagation algorithm in 1985. They have recently been applied by several researchers to the diagnosis of different types of cancer, including skin cancer [5], liver cancer [27] and breast cancer [2]. They have also been used to diagnose cervical smear images by McKenna [29], who trained multilayer networks using 80 features to separate normal cells from dysplastic cells with a 93.7% correct classification rate.

This thesis has examined the use of neural networks in the design of a cervical smear classifier based on the detection of MACs.
The performance advantages of using neural networks in place of discriminant functions have been demonstrated for this classification problem. The success of neural networks in this situation is due to the subtle differences between MACs and negative cells. The author believes that neural networks should be used in other cytological screening decision problems (Figure 3.3) where standard statistical techniques have not performed adequately.

A linear discriminant function calculated for the training set correctly separated 61.6% of the 16904 test examples (Table 4.6). Twenty randomly initialized neural networks with two hidden units were trained to perform this classification task; on average they correctly classified 71.6% of the test examples (Table 4.8). This classification rate improved as the number of hidden units was increased, with four hidden units giving the highest average performance (71.9%). Using more than four hidden units overfitted the training examples, leading to a slight decrease in the classification rate for the test cells.

The optimal four hidden-unit neural network attained a classification rate of 72.5%. It separated the negative examples into two subpopulations (Figure 4.10). This separation may have been an artifact of the classifier, but it was also present in the histograms of the neural networks using two and three hidden units. Most of the cells that made up one of the subpopulations came from 41 of the slides; these 41 slides contained almost exclusively cells that were given a neural network value of 0.8 by the classifier. Neither staining nor scanning dates could account for the difference between these slides and the rest.

For slide-by-slide classification, the neural networks giving the best cell-by-cell classification rates were compared to the linear discriminant function. Five slide features were defined from the distributions of neural network and discriminant function values, and four of the five provided nearly the same results for each classifier. The average discriminant function value of the intermediate cells on a slide could correctly classify 69.5% of the 197 test slides, while the neural networks using between two and seven hidden units all had slide-by-slide classification rates between 74% and 77%.

The best slide-by-slide performance by a neural network was 77.0%, achieved by a network with five hidden units. This was only marginally better than the network using two hidden units, which scored 76.6%. When deciding which classifier to use in practice, robustness is an important issue: very little performance is gained by using the larger network. With this in mind, one would choose the network with two hidden units, which uses 111 free parameters, over the five hidden-unit network, which uses 276 parameters.

The data used to train the discriminant function and neural network classifiers consisted of vectors of 53 features calculated from the images of the cell nuclei. No analysis was performed on the usefulness of the features for inclusion in the classifier models. Seven of the features had strong pairwise correlations (magnitude .95 or larger) to others and could have been removed at the outset. Since the differences between negatives and MACs are subtle, any features which might provide some discriminating value were included in the study.

The problem of feature selection merits further study.
Using too many features in the training of the neural networks at best slows the convergence of the training methods; it can also lead to an overall reduction in the performance of the classifiers. The neural networks can be overfitted on the training examples by features with marginal discriminating ability, and the performance of such networks on the training set will not reflect their expected performance in practice.

Discriminant function analysis does not suffer from this problem because features are added to the discriminant model one at a time. Features are only added to the model if they provide significant discriminating ability with respect to the features already included in the model.

Principal component analysis of the training set of 8456 cells showed that the MAC training data could be represented by 28 mutually orthogonal vectors with a 1% loss of accuracy. The uncertainty of the nuclear features calculated from the segmented images possibly matches or exceeds this value. Thus, the training of the classifiers could be shortened by converting each 53-component feature vector into a 28-component vector using the first 28 principal components. Given that the neural network cell-by-cell classifiers performed between one and five percent worse on the test set than on the training set when using all 53 features (Table 4.8), the loss of accuracy from using these 28-component vectors would not be significant.

Although principal components can be used to reduce the dimensionality of the data, this does not lead to an improvement of the overall classifier performance. The principal component vectors point in the directions of the greatest variance of the data, and these are not necessarily the directions along which the data classes show the greatest separation.

The variability of the feature values due to staining batch differences appeared to pose a problem in designing a robust MAC classifier. A set of 1000 cells sampled from slides taken from three staining batches was designated class 0, and another set of 1000 cells from slides from three other staining batches was designated class 1. A simple perceptron classifier was trained using these 2000 cells to separate cells of slides from the first three batches from cells of slides from the other three batches. It achieved a classification rate of 82.3% on the training set and correctly classified 80.6% of a 2000-cell test set drawn from the same slides used for training.

A control experiment was performed using the same 2000 training cells by randomly assigning them to either of two classes and training neural networks to separate the classes. In this case it was impossible to obtain a classifier which performed better than chance. These two results show that staining batch variations among the features are large enough that they can eclipse the feature differences between the MAC and negative populations. This problem was lessened by choosing training data from slides that spanned the staining batches. However, further work should be conducted to find a way to normalize the features so that staining variations do not affect feature values.
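One possible form such a normalization could take, offered purely as an illustration of the direction suggested above rather than as a method evaluated in this work, is to standardize each feature within its staining batch so that batch-to-batch shifts in offset and scale are removed before classification. The batch identifiers and the function name are assumptions.

import numpy as np

def normalize_within_batch(features, batch_ids):
    """Standardize each feature within its staining batch (zero mean, unit
    variance per batch), so that batch-to-batch shifts do not dominate the
    subtle MAC/negative differences.

    features:  (n_cells, n_features) array of nuclear features.
    batch_ids: length n_cells sequence identifying each cell's staining batch.
    """
    X = np.asarray(features, dtype=float).copy()
    batch_ids = np.asarray(batch_ids)
    for batch in np.unique(batch_ids):
        rows = batch_ids == batch
        mean = X[rows].mean(axis=0)
        std = X[rows].std(axis=0)
        std[std == 0] = 1.0  # guard against features that are constant within a batch
        X[rows] = (X[rows] - mean) / std
    return X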
Bibliography

[1] Anderson, G.H., D.A. Boyes, J.L. Benedet, J.C. LeRiche, J.P. Matistic, K.C. Suen, A.J. Worth, A. Miliner, and O.M. Bennett (1988). The organization and results of the cervical cytology screening program in British Columbia from 1955 to 1985. Lancet, 1988.

[2] Astion, M.L. and P. Wilding (1992). Application of Neural Networks to the Interpretation of Laboratory Data in Cancer Diagnosis. Clinical Chemistry, 38, 34–38.

[3] BC Cancer Agency (1993). Annual Report 1992–1993. British Columbia Cancer Agency.

[4] Bibbo, M., A.G. Montag, E. Lerma-Puertas, H.E. Dytch, S. Leelakusolvong, and P.H. Bartels (1989). Karyometric Marker Features in Tissue Adjacent to Invasive Cervical Carcinomas. Analytical and Quantitative Cytology and Histology, 11, 281–285.

[5] Bostock, R.T.J., E. Claridge, A.J. Harget and P.N. Hall (1993). Towards a neural network based system for skin cancer diagnosis. Third International Conference on Artificial Neural Networks, IEEE Conference Publication 372, 215–219.

[6] Palcic, B. (1994). Personal communication based on unpublished observations by H.E. Nieburgs.

[7] Burger, G., U. Jutting, K. Rodenacker (1981). Changes in Benign Cell Populations in Cases of Cervical Cancer and Its Precursors. Analytical and Quantitative Cytology, 3, 261–271.

[8] Cybenko, G. (1988). Continuous Valued Neural Networks with Two Hidden Layers Are Sufficient. Technical Report, Dept. of Computer Science, Tufts University, Medford, MA.

[9] Cybenko, G. (1989). Approximation by Superposition of a Sigmoidal Function. Mathematics of Control, Signals and Systems, 2, 303–314.

[10] Denker, J., D. Schwartz, B. Wittner, S. Solla, R. Howard, L. Jackel, and J.J. Hopfield (1987). Large Automatic Learning, Rule Extraction, and Generalization. Complex Systems, 1, 877–892.

[11] Duda, R.O. and P.E. Hart (1973). Pattern Classification and Scene Analysis. Wiley and Sons.

[12] Fahlman, S.E. (1988). An Empirical Study of Learning Speed in Back-Propagation Networks. Carnegie Mellon University Technical Report CMU-CS-88-162.

[13] Fisher, R.A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188; also found in Contributions to Mathematics and Statistics (1950), John Wiley.

[14] Fletcher, R. and C.M. Reeves (1964). Function minimization by Conjugate Gradients. Computer Journal, 7, 149.

[15] Garner, D.M., A. Harrison, J. Korbelik, B. Palcic, C. MacAulay, J. Matisic, and G.H. Anderson (1992). Automated Cervical Cell PreScreening System (ACCESS).

[16] Hanson, S.J. and L.Y. Pratt (1989). Comparing Biases for Minimal Network Construction with Back-Propagation. In Advances in Neural Information Processing Systems 1, 177–185, Denver 1988, Ed: D. Touretzky, Morgan Kaufmann.

[17] Hertz, J., A. Krogh and R.G. Palmer (1991). Introduction to the Theory of Neural Computation. Santa Fe Institute Studies in the Sciences of Complexity, Addison-Wesley.

[18] Heywood, M. and P. Noakes (1993). Simple Addition to Back-Propagation Learning for Dynamic Weight Pruning, Sparse Network Extraction and Faster Learning. In IEEE International Conference on Neural Networks 1993, V2, 620–625, IEEE.

[19] Jacobs, R.A. (1988). Increased Rates of Convergence Through Learning Rate Adaptation. Neural Networks, 1, 295–307.

[20] James, M. (1985). Classification Algorithms. William Collins and Sons.

[21] Katzko, M.W., M.M. Pahlplatz, P.S. Oud, and G.P. Vooijs (1987). Carcinoma In Situ Specimen Classification Based on Intermediate Cell Measurements. Cytometry, 8, 9–13.

[22] Koss, L.G. (1992). Diagnostic Cytology and its Histopathological Bases. Fourth Edition, J.B. Lippincott Company.

[23] Kramer, A.H. and A. Sangiovanni-Vincentelli (1989). Efficient Parallel Learning Algorithms for Neural Networks. In Advances in Neural Information Processing Systems 1, 40–48, Denver 1988, Ed: D. Touretzky, Morgan Kaufmann.

[24] Longstaff, I.D. and J.F. Cross (1987). A pattern recognition approach to understanding the multi-layer perceptron. Pattern Recognition Letters, 5, 315–319.
[25] MacAulay, C.E. (1989). Development, implementation, and evaluation of segmentation algorithms for the automatic classification of cervical cells. PhD Thesis, University of British Columbia, Canada.

[26] MacAulay, C.E. (1993). Personal Communication.

[27] Maclin, P.S. and J. Dempsey (1993). A Neural Network To Diagnose Skin Cancer. IEEE International Conference on Neural Networks, III, 1492–1497.

[28] Makram-Ebeid, S., J.A. Sirat and J.R. Viala (1989). A Rationalized Back-Propagation Learning Algorithm. In International Joint Conference on Neural Networks (Washington 1989), V-II, 373–380, IEEE.

[29] McKenna, S.J., I.W. Ricketts, A.Y. Cairns and K.A. Hussein (1993). A comparison of neural network architectures for cervical cell classification. Third International Conference on Artificial Neural Networks, IEEE Conference Publication 372, 105–109.

[30] Montag, A.C., P.H. Bartels, E. Lerma-Puertas, H.E. Dytch, S. Leelakusolvong, M. Bibbo (1989). Karyometric Marker Features in Tissue Adjacent to in Situ Cervical Carcinomas. Analytical and Quantitative Cytology and Histology, 11, 275–280.

[31] Montag, A.C., P.H. Bartels, H.E. Dytch, E. Lerma-Puertas, F. Michelassi, M. Bibbo (1991). Karyometric Features in Nuclei Near Colonic Adenocarcinoma. Statistical Analysis. Analytical and Quantitative Cytology and Histology, 13, 159–167.

[32] Mukawa, A., Y. Kamitsuma, S. Tsunekawa, and N. Tanaka (1989). Report on a long term trial of Cybest Model 2 for prescreening squamous cell carcinoma of the uterine cervix. Analytical Cellular Pathology, 1, 225–233.

[33] Ngoi, S.S., L. Staiano-Coico, T.A. Godwin, R.J. Wong, and J.J. Decoss (1990). Abnormal DNA ploidy and proliferative patterns in superficial colonic epithelium adjacent to colorectal cancer. Cancer, 66, 953–959.

[34] Nieburgs, H.E., R.G. Zak, D.C. Allen, H. Reisman and T. Clardy (1959). Systematic cellular changes in material from human and animal tissues in presence of tumors. Trans. 7th Ann. Inter-Soc. Cytology Council, 137.

[35] Nieburgs, H.E., A.D. Parents, V. Perez, C. Boudreau (1965). Cellular changes in liver tissue adjacent to and distant from malignant tumors. Arch. Pathol., 80, 262–272.

[36] Nieburgs, H.E., A.F. Goldberg, B. Bertini, J. Silagi, D. Pacheco, and H. Reisman (1967). Malignancy associated changes (MAC) in blood and bone marrow cells of patients with malignant tumors. Acta Cytologica, 11, 415–423.

[37] Nieburgs, H.E., and A.F. Goldberg (1968). Changes in polymorphonuclear leukocytes as a manifestation of malignant neoplasia. Cancer, 22, 35–42.

[38] Nordin (1989). The development of an automatic prescreener for the early detection of cervical cancer: Algorithms and implementation. PhD Thesis, Uppsala University, Sweden.

[39] Ploem, J.S., A.M.J. van Driel-Kulker, L. Goyarts-Veldstra, J.J. Ploem-Zaaijer, N.Y. Verwoerd, and M. van der Zwan (1986). Image analysis combined with quantitative cytochemistry. Results and instrumental developments for cancer diagnosis. Histochemistry, 84, 549–555.

[40] Press, W.H., B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling (1986). Numerical Recipes. Cambridge University Press.

[41] Rosenblatt, F. (1962). Principles of Neurodynamics. Spartan.

[42] Rumelhart, D.E., J.L. McClelland, and the PDP Research Group (1986). Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press.

[43] Schein, P.S. (1985). Etiology of Cancer. In Medical Oncology: Basic Principles and Clinical Management of Cancer. Ed: P. Calabresi, P.S. Schein, and S.A. Rosenberg, Macmillan Publishing.
[44] Statistics Canada (1989). Leading Causes of Death at Different Ages 1987. Statscan Health Division, Government of Canada, July 1989.

[45] Spiegelhalter, D.J., A.P. Dawid, S.L. Lauritzen, and R.G. Cowell (1993). Bayesian Analysis in Expert Systems. Statistical Science, 8, 219–283.

[46] Tezcan, A., D.M. Garner, P. Lam, J. Korbelik, B. Palcic (1994). An investigation of the three nuclear stains thionin, gallocyanin and hematoxylin for automated quantitative image cytometry of specimens from the uterine cervix. Submitted for publication to Cytometry, February 1994.

[47] Thodberg, H.H. (1991). Improving Generalization of Neural Networks Through Pruning. International Journal of Neural Systems, 1(4), 317–326.

[48] Vogl, T.P., J.K. Mangis, A.K. Rigler, W.T. Zink, and D.L. Alkon (1988). Accelerating the Convergence of the Back-Propagation Method. Biological Cybernetics, 59, 257–263.

[49] Watrous, R.L. (1987). Learning Algorithms for Connectionist Networks: Applied Gradient Methods of Nonlinear Optimization. In IEEE First International Conference on Neural Networks (San Diego 1987), V-II, 619–627, Ed: M. Caudill, IEEE.

[50] Wied, G.L., P.H. Bartels, M. Bibbo, J.J. Sychra (1980). Cytomorphometric Markers for Uterine Cancer in Intermediate Cells. Analytical and Quantitative Cytology, 2, 257–263.

[51] Zahniser, D.J., P.S. Oud, M.C.T. Raaijmakers, G.P. Vooys, and R.T. Van de Walle (1980). Field test results using the BioPEPR cervical smear prescreening system. Cytometry, 1(3), 200–203.
