A STATISTICAL ANALYSIS OF ELECTROENCEPHALOGRAPHIC SPIKES IN BENIGN ROLANDIC EPILEPSY OF CHILDHOOD

by

ROBERTO BENCIVENGA

Dottore in Matematica, University of Naples, Italy, 1976
M.Sc., Memorial University of Newfoundland, 1978
Ph.D., Memorial University of Newfoundland, 1982

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in THE FACULTY OF GRADUATE STUDIES, Department of Statistics

We accept this thesis as conforming to the required standards

THE UNIVERSITY OF BRITISH COLUMBIA
August 1987
© Roberto Bencivenga, 1987

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Roberto Bencivenga, Department of Statistics, The University of British Columbia, 1956 Main Mall, Vancouver, Canada V6T 1Y3. Date: September 11, 1987.

ABSTRACT

The occurrence of spikes during an electroencephalogram is a basic feature of Benign Rolandic Epilepsy of Childhood (BREC). In this thesis we analyze several problems related to the structure of such spikes. The currently used mathematical model describing the spike assumes that all the inter-spike variations are due to background activity. We show that non-negligible additional variability is present during the spike and propose a slightly richer model which takes such variability into account. In particular we conclude that background noise may not be used to assess the precision of the estimates of the signal. The technique of spike averaging is presently used to obtain more precise estimates of the signal.
By comparing averaging with the trimmed mean, the median and the "lowess" smoother, we find no discrepancies indicating the presence of skewness or long tails in the underlying distribution of the data, and conclude that spike averaging is an adequate method for estimating the deterministic part of the spike. Next, three automated procedures for the detection of the peak of the spike are compared to the existing method, which is based on a visual analysis of the EEG tracing. None of the alternative methods is found to be superior, but the methodology developed for this problem is rather general and could be applied to other similar comparisons. Finally we address the question of whether "atypical" BREC patients, who are characterized by having other neurological abnormalities besides seizures, have a spike structure different from that of the "typical" patients. The nonparametric method of "classification trees" is discussed and then applied to find whether certain features of the spike can discriminate between typical and atypical patients. The location and amplitude of the spike are found to provide a satisfactory classification rule, suggesting that the two groups may be affected by different types of epilepsy. We have used, throughout the thesis, simple methods which do not require strong assumptions. In particular we have tried to avoid assumptions of normality and linearity and to rely mostly on nonparametric methods.
Table of Contents

Abstract
List of Tables
List of Figures

Chapter 1: Medical Background
1.1 Benign Epilepsy
1.2 Electroencephalography
1.3 Some History
1.4 EEG Channels, Waves and Spikes
1.5 Digitization and Topographic Mapping
1.6 Rolandic Spikes and Benign Epilepsy
1.7 Typical and Atypical BREC

Chapter 2: Collection of Data
2.1 The Equipment
2.2 Patient Selection
2.3 Spike Selection
2.4 Computing Facilities

Chapter 3: Averaging
3.1 Introduction
3.2 Modeling "Signal" and "Noise" in an Individual Patient
3.3 Estimation of the Signal and the Averaging Process
3.4 Analyzing the "Time-Locked" Assumption
3.5 Estimating the Increase in Variability
3.6 Assessment of the Average as an Estimator

Chapter 4: Methods of "Peak" Detection
4.1 Introduction
4.2 Four Detection Methods
4.3 Methods of Comparison
4.4 Comparing the Methods: Location of Minimum Value
4.5 Comparing the Methods: Maximum Voltage Difference
4.6 Comparing the Methods: Location of the Focus
4.7 Conclusions

Chapter 5: Classification Trees
5.1 The Classification Problem
5.2 Classification Rules, Partitions and Trees
5.3 Constructing a Classification Rule from a Sample
5.4 Choosing the Splitting Criteria
5.5 "Pruning" the Tree
5.6 Estimates of the Misclassification Rate
5.7 Variable Misclassification Costs
5.8 Surrogate and Competing Splits
5.9 Some Theoretical Considerations

Chapter 6: Classification of BREC Spikes
6.1 Introduction
6.2 Choice of Variables
6.3 Selection of Sample
6.4 Preliminary Analysis
6.5 Using Linear Criteria
6.6 Variable Costs
6.7 Classifying the Averages
6.8 Conclusions
6.9 Some Comments on CART

References
Appendix: Competing Estimators for the Average
A.1 Trimmed Mean
A.2 Median
A.3 Lowess
Tables

List of Tables

Table 1: Names and codes of the EEG channels in the International 10-20 system
Table 2: Background data on the patients
Table 3: Number of spikes and selected channel for each set of spikes
Table 4: Observed values of Std* and Std*
Table 5: Observed values of D(…), D(58) and D(935)
Table 6: Observed values of R
Table 7: Observed values of Shannon's measure d(P,i)
Table 8: Observed values of Sm, m = 1,2,3,4
Table 9: Observed values of |Sm|
Table 10: Results of preliminary CART runs
Table 11: Analysis of the classification of averaged segments

List of Figures

Figure 1: Segment from a normal EEG
Figure 2: Electrode placement positions according to the International 10-20 System
Figure 3: Typical tracing of a spike
Figure 4: Different types of spikes
Figure 5: Outline of the cerebral cortex
Figure 6: Boxplots of std(T)
Figure 7: Plots of devt(T) for the trimmed mean
Figure 8: Plots of devt(T) for the median
Figure 9: Plots of devt(T) for lowess
Figure 10: A binary classification tree
Figure 11: Classification tree of the first sample
Figure 12: Classification tree of the second sample

Chapter 1: Medical Background

1.1 Benign Epilepsy

Although epilepsy has been known and feared since the early times of civilization, little was known about its biological basis until certain theories proposed by J.H. Jackson at the end of the 19th century (Jackson, 1931) were confirmed by electroencephalographic studies. Subsequent progress has led to the development of various forms of therapy which, even when they are unable to cure the problem, can be effective in suppressing its most devastating symptom, namely seizures (O'Donohoe, 1985).
Moreover, during the last thirty years physicians have observed that some patients who experienced epileptic seizures during childhood did not suffer from seizures after puberty and did not show any permanent neurological damage. Research performed in Europe and in the United States focused on this observation and led to the description of a syndrome referred to as "Benign Epilepsy of Childhood" (Lombroso, 1967; O'Donohoe, 1985; Lerman and Kivity, 1986). As with any form of epilepsy, the fundamental feature of this benign form is the presence of seizures. In this case seizures are infrequent (they may even occur only once), are typically nocturnal, occurring during sleep, involve mostly the oral-facial region, and may produce drooling and, if awakening occurs, speech arrest. The age at which these seizures first occur is between 4 and 14 years. Patients having this condition show no neurological abnormalities and have normal intelligence. Certain electroencephalographic features associated with this benign form of epilepsy will be the object of this study and will be described after some basic concepts are introduced. The benign nature of this syndrome can be asserted with some confidence (Lombroso, 1967; Lerman and Kivity, 1975). The prognosis includes the disappearance of seizures by the age of 14, the absence of permanent neurological damage and a normal life. As a consequence, no treatment is usually required, except for suppressing the occasional seizures.

1.2 Electroencephalography

Electroencephalography is a common non-invasive technique for studying the electrical activity of the brain (Cooper, Osselton and Shaw, 1980). It consists of placing a set of electrodes on the patient's scalp, at specific sites, together with a reference electrode at an electrically neutral location, usually the earlobe or the cheek.
The differences of potential between each of the scalp electrodes and the reference electrode are measured and then either transmitted to a writing device or stored electronically. The word "channel" is often used to denote a site, as well as the electrical activity recorded at that site. The abbreviation "EEG" is usually applied to both the technique and its output; the context always indicates which meaning is attached to it. When a writing device is used, one obtains an "electroencephalogram": a number of pens, one per channel, record the activity in the form of tracings outlined on continuously rolling paper. These can be studied and constitute the classical EEG. A typical section of an EEG is presented in Figure 1. Often one wants to save the information obtained from the EEG on a magnetic tape or disk, either to make storage more efficient or to be able to review it and analyze it further. In this case a dedicated computer samples the differences of potential at regular intervals, of the order of a few milliseconds, simultaneously on all channels, thus storing a discrete and digitized version of the tracing.

FIGURE 1: Segment from a normal EEG (scale: 70 µV, 1 sec).

We shall discuss this technique in more detail later. An EEG may be obtained either with the patient at rest or under some type of stimulus. In the first case the aim is to detect spontaneously occurring activity of the brain, while in the second one wants to study the way in which the brain, or one of its regions, reacts to the stimulus administered.

1.3 Some History

Experimental work on the detection of electrical activity in the brain by means of electrodes began in the latter part of the 19th century. Electrodes implanted in the brains of animals showed the existence of electrical activities which could be altered by external stimuli.
The first major studies on the human brain were performed by Hans Berger at the beginning of the 20th century, but were not published until the early 1930's (Gloor, 1969). Berger, a German psychiatrist, persuaded some neurosurgeons to implant zinc-plated electrodes in the epidural tissue of certain patients, close to the cerebral cortex, and was then able to monitor the activity of the brain. This allowed him to detect, describe and study certain periodic features of such activity and to apply "frequency analysis" (Cooper et al., 1980) to them. However his work was ignored until Adrian and Matthews reported the same results in the mid and late 1930's (see, e.g., Adrian and Matthews, 1934). By that time Berger's papers had already described many of the EEG concepts and methods which are presently being used. Later advances in electroencephalography were due more to new developments in electronic technology than to theoretical or methodological breakthroughs.

1.4 EEG Channels, Waves and Spikes

In this work we shall consider EEG's obtained by placing 20 scalp electrodes according to the so-called "International 10-20 system", as illustrated in Figure 2. The names of the channels and their corresponding code names are given in Table 1. It is important to notice that the mid-frontal polar channel, called F', usually does not correspond to a physical electrode, but is linearly interpolated from Fp1 and Fp2. We shall consider it as a separate electrode in order to maintain the structure of the data as they are usually collected, but shall remember its nature whenever needed. The main classical features of an EEG tracing are "waves" and "spikes" (Chatrian et al., 1974). By "waves" we shall mean periodic oscillations of the voltage related to ongoing activities of the brain. Early studies of EEG focused on, and were prompted by, certain types of waves which were characteristic of well-defined physiological or pathological processes, such as sleep, seizures, etc.
The analysis of wave patterns forms the core of EEG interpretation. A "spike" is an occasional burst of activity, appearing on the paper tracing as a rapid increase of voltage (see Fig. 3) which can be clearly distinguished from the ongoing activity. It appears at unpredictable intervals and disappears within a short period of time, usually less than 150 milliseconds (abbreviated ms.). Some authors refer to a spike lasting more than 70 ms. as a "sharp wave" (Chatrian et al., 1974), but this distinction is not relevant to our purposes and hence we shall only use the term "spike". Spikes are described as "monopolar", if in all relevant channels the voltage increases in the same direction or "polarity" (Fig. 4a), and as "multipolar", if different areas of the scalp show an increase in opposite directions (Fig. 4b). A monopolar spike affecting all channels is usually generated by a muscular movement rather than by the electrical activity of the brain. In that case it is considered an artifact and ignored. Also, a spike is said to be "monophasic" if the initial increase is followed immediately by a return to background level (Fig. 4c), but "polyphasic" if the initial increase is followed by one or more reversals of polarity before returning to the background level (Fig. 4d).

FIGURE 2: Electrode placement positions according to the "International 10-20 System".

FIGURE 3: Typical tracing of a spike.

FIGURE 4: Different types of spikes. a: monopolar; b: multipolar; c: monophasic; d: polyphasic.

1.5 Digitization and Topographic Mapping

With the present-day availability of computers, data obtained in an EEG may be easily manipulated in many ways in order to bring out those features which are too minute or too numerous to be recordable by hand or visual inspection.
For this purpose computers have been developed which are dedicated to the EEG machine and allow the simultaneous sampling of all channels at regular time intervals and the transcription and storage of the information into digital form. The sampling interval is called a "dwell" and usually ranges from one to several milliseconds. The term "dwell" is also used to denote a sample point; the intended meaning of the word will be clear from the context. When short-lived phenomena such as spikes are of interest, it is useful to isolate a short sequence of dwells. This is referred to as a "segment" and its length in milliseconds as the "epoch". Even when only these "interesting" sections of the EEG are stored, the amount of data so generated is enormous and any software managing the digitization process must be economical in its storage method. Parsimony is usually achieved by storing each entry as a sequence of n bits, so that 2^n different values are detectable, each representing a multiple of a smallest quantity. This quantity depends on the level of amplification of the signal and is chosen so as to avoid upper and lower clipping. Hence it varies from patient to patient and from channel to channel. Recently, digitization and advances in computer graphics have been combined to construct "topographic mappings" (Duffy et al., 1979). These are color-coded maps of the scalp obtained as follows. First a color scale is associated with the observed range of voltage values, so that each color represents one value or a sufficiently small range of values. Then a schematic contour of the scalp is drawn (views from different angles are usually available) and the pixels (minimum graphical elements on the screen) covering the position of an electrode are colored in accordance with the voltage value observed there. The color to be used at each of the other pixels is determined by linearly interpolating the voltage values of the 3 or 4 closest electrodes.
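The interpolation step just described can be sketched in a few lines. This is only an illustration: the thesis specifies "linear interpolation of the 3 or 4 closest electrodes" without giving the exact weighting, so the inverse-distance weighting, the function name `interpolate_pixel` and the electrode coordinates below are all assumptions.

```python
import numpy as np

def interpolate_pixel(pixel_xy, electrode_xy, voltages, k=4):
    """Estimate the voltage at one pixel from the k closest electrodes,
    using inverse-distance weighting as one simple form of linear
    interpolation between neighboring electrode values."""
    d = np.linalg.norm(electrode_xy - pixel_xy, axis=1)
    if np.any(d < 1e-9):                 # the pixel sits on an electrode
        return float(voltages[np.argmin(d)])
    nearest = np.argsort(d)[:k]          # the 3 or 4 closest electrodes
    w = 1.0 / d[nearest]
    return float(np.sum(w * voltages[nearest]) / np.sum(w))

# Hypothetical layout: three electrodes and one pixel between them.
xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
v = np.array([10.0, 20.0, 30.0])
print(interpolate_pixel(np.array([0.2, 0.2]), xy, v, k=3))
```

By construction the interpolated value always lies between the smallest and largest voltage of the electrodes used, which is why the color scale needs to cover only the observed range of voltages.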
These maps may be scrolled forward in time to provide a movie-like representation of the evolution of the brain's activity during the epoch: onset and decay of a spike, transmission of a visual signal from the front (eyes) to the back (visual cortex), etc.

1.6 Rolandic Spikes and Benign Epilepsy

As mentioned in Section 1.1, certain EEG findings are associated with benign epilepsy (Nayrac and Beaussart, 1958; Lombroso, 1967; Lerman and Kivity, 1975). In fact a characteristic feature of this condition is the occurrence of spikes usually affecting the "Rolandic" region of the brain, a region approximately bounded by T3, F3 and C3 on the left and symmetrically on the right (Fig. 5). These spikes, which last approximately 100 ms., are characteristic in the sense that they are detected in all BREC patients, occur frequently, especially during sleep, and are believed to be absent in normal individuals. In fact, an otherwise normal individual whose EEG revealed such spikes would be considered suspect; some asymptomatic siblings of patients with benign epilepsy have revealed Rolandic spikes (cf. Lerman and Kivity, 1986). In some patients a spike may be immediately followed by a longer wave, referred to as an "after-wave". Since these waves are often absent, or at least go undetected, they are not considered as distinguishing features of the condition. We shall take them into account only in Chapter 6.

FIGURE 5: Outline of the cerebral cortex. Top: left lateral view indicating the main cerebral regions and the Rolandic fissure. Bottom: top view with the Rolandic region indicated by the shaded area.

The importance attached to these spikes in the diagnostic process has led some investigators to describe the syndrome as "Benign Rolandic Epilepsy of Childhood" (Nayrac and Beaussart, 1958; Gregory and Wong, 1984), with the acronym "BREC", which we shall use.
Other names have been used in the literature, such as "benign focal epilepsy", "benign epilepsy with Rolandic foci" and "benign centrotemporal epilepsy", but they all refer to the same syndrome (O'Donohoe, 1985). Rolandic spikes have been a subject of research at the Neurology Department of the British Columbia Children's Hospital (BCCH) in Vancouver, by a team led by Dr. Peter Wong. Their work has led to the observation that BREC spikes are usually dipolar and to the conjecture that such spikes are generated in a small area of the brain by electrically opposite activities of two neighboring sets of neurons. In this case information about the spatial structure of this electrical field, such as location, strength or synchronization among channels, may contain useful information about the nature of BREC and about the clinical evaluation of a patient (Gregory and Wong, 1984). This in turn has pointed to the usefulness of analyzing the spike in more detail, both in a classical visual way and in a mathematical one.

1.7 Typical and Atypical BREC

Dr. Wong's group has also observed (Gregory and Wong, 1984) that some patients, while satisfying the requirements for the diagnosis of BREC with respect to age, seizure type and presence of spikes, also had some neurological abnormalities, such as weakness, hemiplegia (difficulty in muscle control on one side of the body) and gross motor problems, or revealed a level of intelligence or of school performance below normal. These patients were diagnosed as suffering from "atypical" BREC, a name aimed at pointing out both the similarities and the differences between their manifestation of the disease and the "typical" description of BREC. At present all of the atypical BREC patients are still in their childhood, so the question arises whether the good prognosis associated with typical BREC may be extended to them.
Since the presence of spikes is believed to be strictly related to epileptic activities (O'Donohoe, 1985; Lerman and Kivity, 1986), we addressed this question by analyzing whether certain characteristics of the structure of such spikes could clearly discriminate between the two groups. In Chapter 5 we discuss a nonparametric classification methodology which is then applied in Chapter 6 to this problem. A positive answer might suggest that the additional neurological abnormalities are likely to be directly linked to the epileptic activities of the brain, possibly limiting a benign prognosis for the atypical group. On the other hand a negative result might support the hypothesis that the prognosis associated with the typical BREC syndrome may be extended to the atypical group, despite the presence of neurological abnormalities. The results of this analysis may be compared in the future with the eventual observed condition of the atypical patients after puberty.

Chapter 2: Collection of Data

2.1 The Equipment

All the EEG data used in our study were obtained at BCCH using a Nihon Kohden Model 4221 electroencephalograph manufactured by Nihon Kohden America Inc. (Irvine, California). The machine was connected to a Model DE100 digital tape recorder manufactured by Telefactor Corp. (West Conshohocken, Pennsylvania), so that the output was stored on magnetic tapes in addition to being displayed on paper. The digitization was obtained using a dedicated computer called "Brain Atlas" (abbreviated BA), distributed by Bio-Logic Systems Corporation (Mundelein, Illinois). As mentioned in Section 1.5, computers of this type store each voltage value as a sequence of n bits. BA uses 8 bits, so that 256 different values are distinguishable on any one reading of one channel.
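The 8-bit storage scheme can be illustrated with a minimal sketch. The thesis states only that each stored value is a multiple of a smallest quantity chosen to avoid clipping; the particular resolution `step_uv`, the mid-range offset and the function names below are assumptions made for illustration.

```python
def quantize(voltage_uv, step_uv):
    """Store a voltage as an 8-bit code: the nearest multiple of the
    smallest quantity step_uv, clipped to the 256 representable levels."""
    code = round(voltage_uv / step_uv) + 128   # offset so 0 uV maps mid-range
    return max(0, min(255, code))              # clipping at the range limits

def reconstruct(code, step_uv):
    """Map an 8-bit code back to a voltage in microvolts."""
    return (code - 128) * step_uv

# With a hypothetical 2 uV resolution, -100 uV is representable exactly.
code = quantize(-100.0, 2.0)
print(code, reconstruct(code, 2.0))   # 78 -100.0
```

The sketch also shows why the smallest quantity must be matched to the amplification: too coarse a step loses detail, while too fine a step pushes large spikes outside the 256 levels and produces the clipping mentioned in Section 2.3.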
The maximum sampling rate attainable by our version of BA was 500 dwells per second, but for our study it was set at 200, a value considered small enough to allow storage of large portions of EEG, but large enough to retain satisfactory information about the individual spikes. Hence for our purposes one dwell lasts 5 milliseconds. All the operations described in Section 2.3 were performed using BA functions and were aimed at obtaining data suitable for statistical analysis.

2.2 Patient Selection

Since both the diagnosis of BREC and the techniques of EEG digitization and topographic mapping are rather new developments in neurophysiology, only relatively recent data are available. The earliest magnetic EEG records obtained at BCCH on BREC patients are from 1982. All available tapes were reviewed by a trained technician and the patients selected on the basis of acceptable technical quality of the corresponding tapes. Several patients had to be eliminated from the analysis because of deterioration of the tape or because of technical mistakes made during the recording. This happened particularly with the earlier tapes, recorded at a time when the newness of the techniques and the lack of a clearly defined direction of analysis led to records which, in retrospect, were unacceptable. Finally, tracings from 24 patients were deemed to be of sufficiently good quality and searched for spikes. Background data for these patients are given in Table 2, where each patient is identified by the code number used in the storage of the data. Of the 24 patients, 11 were boys and 13 girls, 12 were typical and 12 atypical BREC, and the average age was 9.3 years (median 10 years, standard deviation 2.8). The distribution of ages was slightly different between the two groups, with a higher average (10.4 vs. 8.2) and smaller standard deviation (2.1 vs. 3.2) for the atypical group.
However the difference was well within the acceptance limits for a Kolmogorov-Smirnov test (cf., for instance, Lehmann, 1975) and was not believed to have an influence on the type of analysis we wanted to perform.

2.3 Spike Selection

For each of the selected patients, the data stored on tape were transformed back into visual tracings scrolling on a monitor and inspected, again by the technician, so that occurrences of spikes could be detected. Once a spike was detected, the technician stopped the scrolling and aligned the cursor on the screen with the dwell containing the peak of the spike. The computer then labelled the chosen dwell as the 129th and transferred data corresponding to dwells 1 to 256 to disk. Since a spike lasts approximately 100 ms., or 20 dwells, a 256-dwell segment also contains an initial portion of background noise and a final portion with after-waves or more noise. In order to obtain useful information about the background noise (see Chapter 3), the technician was instructed to ignore a spike if the initial portion of the corresponding segment contained another similar spike or a different kind of large wave or artifact. Since this kind of contamination occurs in a relatively small number of cases and is not known to be related to the phenomenon of interest, there is no reason to believe that this selection introduced a bias into the study. The peak of the spike was identified with the earliest dwell at which one channel achieved a noticeable negative peak of voltage. Alternative methods of peak identification are discussed in Chapter 4.
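The alignment convention just described (the peak placed at dwell 129 of a 256-dwell segment, one dwell every 5 ms) can be sketched as follows. The function `extract_segment` and the simulated tracing are hypothetical, not part of the BA software; the sketch only illustrates the bookkeeping.

```python
import numpy as np

DWELL_MS = 5     # 200 samples per second, as used in the study
SEG_LEN = 256    # dwells per stored segment
PEAK_POS = 129   # the peak is labelled as the 129th dwell (1-based)

def extract_segment(tracing, peak_index):
    """Cut a 256-dwell segment out of a longer tracing so that the
    sample at peak_index (0-based) becomes dwell 129 (1-based)."""
    start = peak_index - (PEAK_POS - 1)
    if start < 0 or start + SEG_LEN > len(tracing):
        raise ValueError("peak too close to the edge of the tracing")
    return tracing[start:start + SEG_LEN]

# Hypothetical tracing: background noise with a sharp negative peak.
rng = np.random.default_rng(0)
tracing = rng.normal(0.0, 5.0, 5000)
tracing[1000] = -120.0
seg = extract_segment(tracing, 1000)
print(len(seg), np.argmin(seg) + 1)   # 256 dwells, minimum at dwell 129
```

With this convention the first 128 dwells (640 ms) precede the peak, which is what makes the initial portion of each segment usable as a sample of background noise in Chapter 3.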
We then reviewed all the segments stored on disk and rejected those which contained unacceptable features missed by the technician, such as clipping of the spike when the voltage had achieved values larger than those used by BA as limits, occurrence of large waves in the first portion of the segment, or obvious misalignment between the spike and the 129th dwell, indicating a technical error during the transfer of the data from tape to disk. In all these cases the segments were rejected only after Dr. Wong had reviewed them and agreed with the decision. Finally BA was used to convert the disk files, containing the data in 8-bit format, into numerical files according to the "American Standard Code for Information Interchange", abbreviated "ASCII". A final check was made on the ASCII files to ensure that no segment had been transferred twice. Given the methodology used, we could think of only two ways for this to happen: either the technician had issued the transferring command twice before going on to the next spike, or we had converted the same segment twice into two different ASCII files. In both cases the resulting ASCII files would have been identical. A FORTRAN program was therefore written comparing the first 10 dwells of each pair of files from a given patient. As soon as two corresponding values were found to be different, the two spikes were declared different. Three cases of duplication were found in this way and in each case only one of the files was retained. Each of the two patients identified by numbers 36 and 39 revealed two different types of spikes, one involving the left side of the brain, the other the right. Since it was clear that left and right spikes could not be viewed as resulting from the same process, they were separated. Thus set 37, containing the right spikes of patient 36, and set 40, containing the left spikes of patient 39, were formed.
In conclusion, the selection produced 26 sets of spikes, each set containing between 6 and 20 spikes, as listed in Table 3.

2.4 Computing Facilities

Given the large amount of data involved in the study we had to rely on computers to perform all the arithmetical operations needed. FORTRAN programs were written to perform operations on the ASCII files created by BA. These were compiled by a Microsoft compiler and executed on a Zenith personal computer. Each program was carefully tested to eliminate possible mistakes. The classification program "CART", distributed by California Statistical Software (Lafayette, California), was run on the Microvax II microcomputer of the Department of Statistics of the University of British Columbia. The same computer was used for the remaining analyses and for the graphics, which were all obtained by programs run in the "S" statistical environment (Becker and Chambers, 1984).

Chapter 3: Averaging

3.1 Introduction

One of the problems of EEG analysis is that of separation of signal from noise (Cooper et al., 1980). The live brain is in constant activity, so that at any time the voltages detected by the EEG are generated by the superposition of a large number of neuronal discharges. Some of this activity is directly related to the process under study and contributes to what is considered the "signal", while the rest is related to other processes not of current interest and is therefore considered "noise". When, as in the case of BREC spikes, one may assume that the signal consists of repeated occurrences of the same short-term process, an averaging procedure is used to decrease the noise level and highlight the structure of the signal (Cooper et al., 1980). In this chapter we describe the averaging procedure and the mathematical model, which will be referred to as the "time-locked" model, on which the procedure is usually based (Glaser and Ruchkin, 1976).
We then propose a richer model and show, on the basis of the available data, that, at least for BREC spikes, the additional parameters of the new model are not negligible. This is relevant since certain quantities of interest depend on these parameters, and their estimates based on the time-locked model may be misleading. Since the same process of averaging is carried out on each channel separately, we have restricted our analysis to a single channel for each set of spikes. The rationale used is that the time-locked model should be considered insufficient if it is shown to be insufficient for one channel. Since Rolandic spikes are usually more evident in the midtemporal region, we have selected channel T3 for patients exhibiting left spikes and T4 for those exhibiting right spikes (see Table 3). The only exception is the one patient providing sets 36 and 37, for whom we have chosen C3 and C4 respectively, since in that case the observed spikes involve the midtemporal channels only in a minimal way. In order to simplify the notation, no index is used to denote the channel. Clearly, a multivariate version of our approach, involving several channels at a time, may be a worthwhile topic for further investigation. It is important to realize that, for biological reasons, the parameters of the models under investigation may have different values in different patients. As a consequence we use methods that allow the pooling of information from all patients without requiring that the parameters be the same for all patients. However, to further simplify the notation, no index to denote the patient is used in Sections 3.2 and 3.3 and at the beginning of Section 3.4.
3.2 Modeling "Signal" and "Noise" in an Individual Patient

Definitions 3.2.1: At any time τ, the voltage detected at the selected channel will be denoted by v(τ); the voltage generated by the phenomenon of interest will be called "signal" and denoted by s(τ), while the voltage generated by other background activity will be called "noise" and denoted by n(τ).

Our first assumption is based on the fact that electrical fields are additive (see, for example, Halliday and Resnick, 1970).

Assumption 3.2.2: At any time τ, v(τ) = s(τ) + n(τ).

Further, in the absence of unusual cerebral activities and provided that the EEG machine is properly calibrated, we can make the following assumption.

Assumption 3.2.3: The noise n(τ) constitutes a second-order stationary random process with E[n(τ)] = 0 and Var[n(τ)] = A > 0.

Unfortunately not much can be assumed in general about the covariance structure of this process, even within a single patient. This is because the background activity has different patterns at different times, changing even within seconds. In particular wave frequencies may change quite markedly, making it impossible to rely on one particular structure for all replicates of the spike or for all patients. In order to simplify the discussion, it is sometimes assumed in the EEG literature (cf., for example, Glaser and Ruchkin, 1976) that the sampled sequence of noise voltages is "white noise"; that is, it has zero correlation. Since this assumption is unlikely to hold, we have chosen not to assume a specific structure, but rather to use methods that are influenced in small measure, if any, by the presence of correlations. In each specific case we shall state the conditions needed for these tests and discuss the sense in which such conditions may be considered satisfied by our data. The signal s(τ) represents, in our case, the extra voltage occasionally generated in the Rolandic region and appearing on the tracing as spikes.
We shall consider it as a process which is identically 0 except during the finite intervals of time when the spike occurs. These intervals can be represented in the form [w(j), w(j)+h], where j=1,2,... is a positive integer, w(j) is a random increasing sequence of positive real numbers and h is a constant which has a value of approximately 100 ms. A study of the stochastic nature of the sequence w(j) and, in particular, of the distribution of the waiting times W(j) = w(j) - w(j-1) would be of interest, but would require longer and more complete EEG records, which were not available to us. In view of the selection procedure described in Section 2.3, we assume that W(j) > h for any j. We also assume that W(j) is large enough so that all the replicates of the spike may be regarded as independent. The following assumption describes the time-locked model.

Assumption 3.2.4 : For any integer j and for any t in [0,h], s(w(j)+t) = f(t), where f(t) is a continuous function with support on [0,h].

The alternative model that we propose includes variability generated by the spike.

Assumption 3.2.5 : For any integer j and for any t in [0,h], s(w(j)+t) = f(t)Fj + e(t)j, where the various quantities are defined as follows:
- the function f(t) is continuous with support on [0,h];
- the Fj's are independent replicates of a positive random variable F, with E[F]=1 and Var[F]=B;
- for each j, e(t)j is an independent replicate of a stationary process e(t) defined on [0,h], with E[e(t)]=0 and Var[e(t)]=C, giving rise, when sampled every 5 ms, to white noise.

The variable F may be viewed as the quantity regulating the strength of the electrical field giving rise to the spike. Since the spikes are detected by visual inspection of the tracing, we are de facto assuming that F does not take values so small as to make the spike undetectable.
This means assuming that B is sufficiently small, an assumption justified by current knowledge of the structure of the spike and brought to the limit in the time-locked assumption. The process e(t) may be viewed as caused by the activities of neurons surrounding the main generators and involved only marginally and briefly with the spike. These activities are numerous, minute and rapid enough to give rise to the postulated white noise. The "time-locked" assumption may therefore be interpreted as a special case of Assumption 3.2.5, saying that all replicates of the spike have the same strength and follow the same spatial pattern. This might be true for evoked responses to identical stimuli, although it has been questioned even in that case (Mocks et al., 1984), but it is less believable for spontaneously occurring phenomena like BREC spikes. Our aim is to show that the additional sources of variability postulated by Assumption 3.2.5 cannot be neglected and that the simplifications in Assumption 3.2.4 may therefore generate misleading results. Although it is possible to develop more complex models that better describe the processes at work, the model based on Assumption 3.2.5 provides an alternative to the time-locked model which is sufficient to test the appropriateness of the latter. Finally we shall assume that n(w(j)+t), e(t)j and Fj are produced by independent biological processes and are therefore statistically independent.

3.3 Estimation of the Signal and the Averaging Process

In light of Assumptions 3.2.4 and 3.2.5, the first objective in a study of Rolandic spikes consists of estimating f(t). Since each occurrence of a spike may be viewed as an observation of f(t) distorted by variable factors, we shall refer to each such observation as a "realization of the signal".
Current knowledge of the structure of BREC spikes suggests that f(t) is strictly monotone in the interval [0,p] for some p in (0,h), has an extremum at p and may have more extrema in (p,h) if the spike is polyphasic.

Definition 3.3.1 : The point p in (0,h) at which f(t) achieves its first extremum is said to be the "peak" of the spike.

As described in Section 2.3, the information available from a recorded spike consists of a sequence of voltages {v(T), T=1,...,256}, so that for the analysis we need a discrete version of the model developed so far. The relation between continuous time and discrete time, represented by the dwell number, may be established as follows.

Definition 3.3.2 : If the j-th recorded segment includes the j'-th spike and if p is aligned precisely with the P-th dwell, then dwell number T corresponds to time t = w(j') + p - 5(P-T) and we shall make the following identifications:

v(T)j = v(w(j')+p-5(P-T))
n(T)j = n(w(j')+p-5(P-T))
f(T) = f(p-5(P-T))
e(T)j = e(p-5(P-T))j .

It is with reference to this definition that from now on we shall consider v(T), n(T) and e(T) as random variables and shall use the subscript j to index the recorded spikes.

Proposition 3.3.3 : For any T, v(T) is an unbiased estimator of f(T). Under Assumption 3.2.4 its variance is Var[v(T)] = A, while under 3.2.5 its variance is Var[v(T)] = A + f(T)²B + C.

Proof: This follows immediately from 3.2.4, 3.2.5 and 3.3.2.

Since we are assuming that the interval between spikes is usually large enough so that different segments may be assumed to contain independent realizations of the spike, the usual arithmetic average may be used to improve the precision of this estimate.

Definition 3.3.4 : For a given set of k segments the average voltage at dwell T is given by

v̄(T) = (1/k) Σ_{j=1}^{k} v(T)j .

Proposition 3.3.5 : The average v̄(T) is an unbiased estimator of f(T) with variance

Var[v̄(T)] = A/k

under 3.2.4 and

(3-a)    Var[v̄(T)] = (A + f(T)²B + C)/k

under 3.2.5.
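The variance claims of Propositions 3.3.3 and 3.3.5 are easy to check numerically. The sketch below is a minimal illustration, with invented parameter values and an invented Gaussian bump standing in for f(T): it simulates replicates under Assumption 3.2.5 and compares the per-dwell sample variance with A + f(T)²B + C.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values only: A (background noise), B (strength), C (shape).
A, B, C = 1.0, 0.04, 0.5
k, n_dwells = 100_000, 21
t = np.arange(n_dwells)
f = 40.0 * np.exp(-0.5 * ((t - 10) / 3.0) ** 2)   # stand-in spike shape f(T)

# v(T)j = f(T) Fj + e(T)j + n(T)j, as in Assumptions 3.2.2 and 3.2.5.
# (F is modeled as normal with mean 1 and small variance B, so it is
# positive with overwhelming probability, as the text requires.)
F = rng.normal(1.0, np.sqrt(B), size=(k, 1))
e = rng.normal(0.0, np.sqrt(C), size=(k, n_dwells))
n = rng.normal(0.0, np.sqrt(A), size=(k, n_dwells))
v = f * F + e + n

emp = v.var(axis=0, ddof=1)        # empirical Var[v(T)] across replicates
theory = A + f**2 * B + C          # Proposition 3.3.3
print(float(np.max(np.abs(emp / theory - 1))))   # small relative error
```

Dividing `theory` by k gives the variance 3-a of the average of k such replicates.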
Under both models the average provides an estimate of f(T) with lower variance and its use is therefore desirable. This estimate is sufficient for studying the general structure of the spike and for detecting certain features that may be characteristic of the spike or of certain groups of patients. However, an estimate of Var[v̄(T)] for T in {120,...,140} is often needed to construct "signal-to-noise ratios" and confidence intervals for certain quantities which may also be characteristic of the spike, such as the peak voltage or the amplitude, that is, the difference between maximum and minimum voltage (Cooper et al., 1980). Such an estimate may be obtained, under the "time-locked" assumption, by using the early portion of the averaged segment, which has a known mean, namely zero, and a larger number of dwells and so, presumably, more information. However, by using this method one obtains estimates of A/k, and clearly, if B and C are not negligible, the quantities obtained by assuming A/k rather than 3-a to be the variance of v̄(T) could lead to smaller confidence intervals and hence to inaccurate, and usually overconfident, conclusions.

3.4 Analyzing the "Time-Locked" Assumption

Since each spike is assumed to be centered at the 129th dwell and lasts approximately 20 dwells, we shall assume that f(T) is zero outside the set {120,...,140}. The last portion of each segment, corresponding to values of T in {141,...,256}, will be ignored, since it may contain after-waves, while our interest for this problem is focused on the spike. Also, since after-waves do not occur consistently after every spike, an increase of variance in the last portion due to their presence is likely. In light of 3.2.4 and 3.2.5 we can state the null hypothesis as:

H0 : B = C = 0 for T = 120,...,140 ,

the alternative hypothesis being:

H1 : B + C > 0 for T = 120,...,140 .
In particular we shall view H1 as stating that the variability increases during the spike, so that our approach for testing this hypothesis will be to compare the observed variability before and during the spike.

Proposition 3.4.1 : Under H0, the sequence {var(T), T=1,...,140}, where

var(T) = (1/(k-1)) Σ_{j=1}^{k} (v(T)j - v̄(T))² ,

is first order stationary.

Proof: Under H0, Var[v(T)] = A and is therefore independent of T. Since E[var(T)] = Var[v(T)], the result follows.

The absence of skewness is important for the validity of certain tests that we shall perform, so the skewness of var(T), generated by its quadratic nature, is not a desirable feature. In order to decrease it we apply a square root transformation (Miller, 1986) and hence we shall use the corresponding sequence of standard deviations

(3-b)    std(T) = √var(T) .

Proposition 3.4.1 does not extend automatically to the sequence 3-b, since E[X]=E[Y] does not imply that E[√X]=E[√Y]. However the rationale behind Assumption 3.2.3 is that the variability of n(t) is independent of time, so that we shall assume such independence for 3-b as well.

Definition 3.4.2 : For each set i we denote by std(T,i) the value of 3-b observed from that set. Moreover the observed average values of std(T,i) before and during the spike will be defined, respectively, by

std°(i) = (1/115) Σ_{T=1}^{115} std(T,i)

std*(i) = (1/21) Σ_{T=120}^{140} std(T,i) .

Assumption 3.4.3 : For each i the variables std°(i) and std*(i) are independent and each has a symmetric distribution.

The theoretical motivation for this Assumption is as follows.

Definition 3.4.4 : Let { X(T); T>0 } be a sequence of random variables and let m be a positive integer. Then this sequence is said to be "m-dependent" if for any positive integers n and k the variable X(n+m+k) is independent of {X(1),...,X(n)}.
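As a concrete illustration of Definition 3.4.2, the sketch below computes the per-dwell standard deviations and their before/during averages for a simulated set of segments; the array layout, sizes and injected extra variability are all invented for the example.

```python
import numpy as np

def std_summaries(segments):
    """Per-dwell standard deviations std(T) and the before/during averages
    of Definition 3.4.2.  `segments` is a k x 256 array, one recorded
    segment per row; dwell numbers are 1-based in the text, so 0-based
    indices shift by one (T=1,...,115 -> 0:115; T=120,...,140 -> 119:140).
    """
    std = segments.std(axis=0, ddof=1)   # the 3-b sequence, sqrt of var(T)
    std_before = std[0:115].mean()       # average over T = 1,...,115
    std_during = std[119:140].mean()     # average over T = 120,...,140
    return std, std_before, std_during

# Toy set of k=30 segments: pure noise, plus extra variability injected
# during the spike window so that H0 fails by construction.
rng = np.random.default_rng(1)
segs = rng.normal(0.0, 1.0, size=(30, 256))
segs[:, 119:140] += rng.normal(0.0, 1.0, size=(30, 21))
_, s_before, s_during = std_summaries(segs)
print(s_before, s_during)   # the during-spike average is clearly larger
```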
Theorem 3.4.5 : Let {X(T)} be a sequence of m-dependent, uniformly bounded random variables and denote by S(n) the quantity

S(n) = Σ_{T=1}^{n} X(T) .

If, as n → ∞, Var[S(n)] → ∞, then

(S(n) - E[S(n)]) / {Var[S(n)]}^(1/2)

converges in distribution to a standard normal variable.

Proof: See, for example, Chung, 1974.

Corollary 3.4.6 : Suppose that {X(T)} is a sequence of m-dependent, uniformly bounded random variables such that Var[X(T)] = V for any T, and Cov[X(T),X(T+k)] = Ck > 0 for k ≤ m. Then for any large positive integers n, n' and for any p > m, the two averages

X° = (1/n) Σ_{T=1}^{n} X(T)

X* = (1/n') Σ_{T=n+p+1}^{n+p+n'} X(T)

are independent and each has an approximate normal distribution.

Proof: Independence follows from the definition of m-dependence. Moreover, in this case

(3-c)    Var[S(n)] = nV + 2(n-1)C1 + 2(n-2)C2 + ... + 2(n-m)Cm > nV

is of order n, so that the conditions of the theorem are satisfied and the result holds for X°. Applying the same procedure to the sequence {X(T), T>n+p} one verifies the result for X*.

Now to arrive at Assumption 3.4.3 we must check whether the available data conform to the conditions of Corollary 3.4.6. Given its nature, we may assume that v(T), and hence std(T,i), are uniformly bounded. Computations of the sample correlation coefficients for the sequences {std(T,i), T=1,...,115} show that in all available sets these are positive at lags less than 6 and become negligible after that. In fact for 22 of the 26 sets they are smaller than 0.5 at lags larger than 4 and in 17 they are smaller than 0.4 at lags larger than 4. By assuming that this lack of correlation is generated by independence, we then conclude that, under H0, the sequence {std(T)} is m-dependent for some m not much larger than 4. The final step is then to assume m=4 and that 115 and 21 are large enough with respect to 4 to justify the normal approximation and hence the assumption of symmetry.
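The lag-correlation check described above can be sketched in a few lines. Here a 5-term moving average of white noise serves as an invented stand-in for a std(T,i) sequence, since such a series is exactly 4-dependent: its correlations are positive up to lag 4 and vanish beyond.

```python
import numpy as np

def sample_autocorr(x, max_lag):
    """Sample autocorrelations of a sequence at lags 1,...,max_lag."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    denom = float(np.dot(x, x))
    return np.array([np.dot(x[:-k], x[k:]) / denom
                     for k in range(1, max_lag + 1)])

# A 5-term moving average of white noise is 4-dependent, mimicking the
# behavior reported for the std(T,i) sequences.
rng = np.random.default_rng(2)
z = rng.normal(size=5000)
x = np.convolve(z, np.ones(5) / 5, mode="valid")
r = sample_autocorr(x, 8)
print(np.round(r, 2))   # large at lags 1-4, near zero at lags 5-8
```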
Further evidence for the assumption of symmetry is given by the fact that box-plots of the values {std(T,i), T=1,...,115} for each i do not reveal any systematic asymmetry and are mildly skewed only in a few cases, as shown in Figure 6.

Proposition 3.4.7 : Under H0, P{std°(i) > std*(i)} = P{std°(i) < std*(i)} = 0.5, and hence the statistic

S = Σ_{i=1}^{26} I{std°(i) > std*(i)} ,

where I{ } is the usual indicator function, has a binomial distribution with n=26 and p=0.5.

Proof: Under H0, E[std°(i)] = E[std*(i)]. Let μ be the common mean and let

X = std°(i) - μ ;  Y = std*(i) - μ .

Then X and Y are independent, continuous and symmetrically distributed around 0. This implies that P{X>Y} = P{X<Y} = 0.5. But X>Y if and only if std°(i) > std*(i), so the result follows.

FIGURE 6 Boxplots of std(T), one panel per set.

The alternative hypothesis we are considering states that during the spike the voltage variability increases, so that under H1 one would expect higher values of std(T) and of std*. This leads us to reject H0 for small values of S. One possible problem is the fact, noticed earlier in this Section, that an increase in E[var(T)] does not necessarily imply an increase in E[std(T)]. However we shall still reject the null hypothesis for small values of S for two reasons. First, disagreement between the behavior of the two expected values would only happen if the distribution of v(T) were rather pathological; for instance it could not happen if the distribution were normal. Second, our concern is that the standard deviation not be underestimated, whatever the reason for its decrease.
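The rejection region for small S can be calibrated exactly from the Binomial(26, 0.5) null distribution; for instance, the exact lower-tail probability of observing S ≤ 3:

```python
from math import comb

n, s = 26, 3
# Exact lower-tail probability P{S <= 3} for S ~ Binomial(26, 0.5).
p_tail = sum(comb(n, k) for k in range(s + 1)) / 2**n
print(round(p_tail, 6))   # 4.4e-05, agreeing with the value in the text
```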
A low value of S indicates that P{std° > std*} < 0.5 and this should be of concern regardless of whether it is caused by a true increase in variability or by an abnormal change in the distribution of the data. The observed values of std°(i) and std*(i) are reported in Table 4 and they show that in our case S=3. Direct calculations show that P{ S ≤ 3 | H0 } = 4.4·10⁻⁵ and that an exact two-sided 95% binomial confidence interval for P{std*(i) > std°(i)} based on this value of S is given by (.698, .956). As a result we reject the null hypothesis and conclude that there is additional variability during the spike. Since the previous procedure was based on a series of approximations, we decided to test the null hypothesis by a different method, the rationale being that if different procedures arrive at the same conclusion, our confidence in the assumptions used increases. The second approach is based on the nonparametric Hodges-Lehmann estimator of shift (Hodges and Lehmann, 1963) and may be described as follows.

Definition 3.4.8 : Let X(T), for T=1,...,n, be independent random variables from a continuous distribution G(x) and let Y(T'), T'=1,...,n', be independent random variables from the distribution G(x - Q). We shall denote by D(T,T') the n×n' pairwise differences

D(T,T') = X(T) - Y(T')

and by D(k), k=1,...,n×n', their order statistics. If n×n' is odd and we let q=(n×n'+1)/2, then D(q) is the median of the D(T,T') and may be used as an estimator of Q.

Theorem 3.4.9 : If n and n' are odd then P{ D(q) > Q } = 0.5. Moreover, if G is symmetric, then D(q) is unbiased for Q and its distribution is also symmetric.

Proof: See Lehmann, 1975.

Under the null hypothesis that Q(i)=0, the variables X(T)=std(T,i) for T=1,...,115 and Y(T')=std(T',i) for T'=120,...,140 have the same distribution for each set i. Hence, according to 3.4.8, each event {D(1208)(i) > 0} would have probability 0.5 and the sign test could again be applied.
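The estimator of Definition 3.4.8 is simple to compute. The sketch below applies it to invented data shaped like the before/during comparison (115 and 21 values, with an upward shift during the spike), so the median pairwise difference X - Y comes out negative, mirroring the behavior expected under H1.

```python
import numpy as np

def hodges_lehmann(x, y):
    """Median of all n x n' pairwise differences X(T) - Y(T')
    (Definition 3.4.8)."""
    d = np.subtract.outer(np.asarray(x, float), np.asarray(y, float))
    return float(np.median(d))

# Toy data: y is shifted upward by 2 relative to x, so the median
# pairwise difference x - y sits near -2.
rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=115)
y = rng.normal(2.0, 1.0, size=21)
print(hodges_lehmann(x, y))   # close to -2
```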
However two aspects of this procedure require some consideration in our context. First, although the statistic D(q) is known to be reliable for non symmetric distributions and robust with respect to gross errors (cf., for instance, Høyland, 1965), we could not assess its theoretical properties in the presence of correlations. In order to decrease the effect that correlation would have in our setting, we perform this analysis on the variables

X(T) = std(T,i) ,  T = 1+5k ,  k=0,1,...,22
Y(T') = std(T',i) ,  T' = 120+5k ,  k=0,1,...,4 ,

which, according to our previous assumption of 4-dependence, should be independent. The corresponding estimator is denoted by Ds(58), where the superscript "s" stands for "selected". The second point is the interpretation of the shift model in this context. Clearly in our model, under H1, the distributions of std(T) and std(T') are not simply shifted with respect to one another. However, under H0, Q=0, so that the first part of 3.4.9 is satisfied and the statistic

S = Σ_{i=1}^{26} I{Ds(58)(i) > 0}

has a binomial distribution with n=26 and p=0.5. Under H1, as discussed previously, the values of std(T') would tend to increase and hence P{Ds(58) > 0} would decrease, so that S would take small values. It is with respect to this behavior that the statistic S will be used. The observed values of Ds(58) are given in Table 5. From the table we can notice that Ds(58) is negative for exactly the same sets for which std*(i) is larger than std°(i). The agreement between the two results strengthens the validity of our rejection of H0.

3.5 Estimating the Increase in Variability

Having concluded that B+C should be assumed to be positive, we address the question of how influential its omission is for the estimate of the standard deviation.
Since this estimate appears as a multiplicative factor in the analyses of interest, the relevant quantity to consider should be

(3-e)    √[(A + f(T)²B + C) / A] .

Before discussing which statistic to use as an estimator, we notice that there is a problem in the fact that this quantity depends on T, so that one would have to estimate it for each T=120,...,140. However, for reasons already discussed, we cannot pool across patients, so that for each T and each set of segments only one observation would be available. A more meaningful, even if less informative, approach is to look, in each patient, for an "average" quantity across the set {120,...,140}, a quantity that could be interpreted as a characteristic of the patient and could be estimated from 21 values for each patient. This average may be interpreted in two ways, either as a quantity similar to 3-e, but with f(T) replaced by its mean value over its support, or as the quantity to which 3-e reduces when B=0, that is, in the absence of strength variability. We shall draw on the latter interpretation to simplify the notation and collapse f(T)²B+C to a single parameter denoted by C. We first consider an estimator of 3-e whose theoretical properties can be assessed, at least within some reasonable approximations. Then we shall describe three other estimators which have mostly heuristic appeal; the values they provide will be compared to those provided by the first estimator. The presence of major disagreements may indicate problems in the assumptions made on the distribution of the first estimator and hence little reliability for the observed estimates.

Definition 3.5.1 : Let A*(i) and (A+C)*(i) be defined by

A*(i) = (1/115) Σ_{T=1}^{115} var(T,i)

(A+C)*(i) = (1/21) Σ_{T=120}^{140} var(T,i) .

Then an estimator of 3-e is given by

R1(i) = √[(A+C)*(i) / A*(i)] .

Since R1 is the square root of a ratio of estimators, it will likely be biased. The following result will be used to obtain information concerning such bias.
Theorem 3.5.2 : Let X and Y be two random variables with E[X]=a and E[Y]=b. Let g(x,y) be a function having continuous second partial derivatives. Then

E[g(X,Y)] ≈ g(a,b) + (1/2) Var[X] gxx(a,b) + (1/2) Var[Y] gyy(a,b) + Cov[X,Y] gxy(a,b) .

Proof: This is based on a Taylor expansion of the function. See, for instance, Mood, Graybill and Boes, 1974.

Now let X=(A+C)*, Y=A* and notice that the hypothesis of 4-dependence may be equivalently applied to the sequence {var(T)}, so that the covariance term appearing in 3.5.2 may be ignored. Also let

g(x,y) = √(x/y) ,

so that

gxx(x,y) = -(1/4) x^(-3/2) y^(-1/2) ;  gyy(x,y) = (3/4) x^(1/2) y^(-5/2) .

Then we can approximate the expected value of R1 by

E[R1] ≈ √[(A+C)/A] [ 1 + (1/8) ( 3 Var[A*]/A² - Var[(A+C)*]/(A+C)² ) ] .

In order to assess the magnitude of the bias we notice that if Var[var(T)]=V, Cov[var(T),var(T+k)]=Ck and W = V + 2C1 + ... + 2C4 is approximately the same both before and during the spike, we can write

Var[A*] ≈ (1/115²)(115V + 228C1 + ... + 222C4) ≈ (1/115)(V + 2C1 + ... + 2C4) = W/115

and similarly Var[(A+C)*] ≈ W/21. Finally we obtain

E[R1] ≈ √[(A+C)/A] [ 1 + (W/8) ( 3/(115A²) - 1/(21(A+C)²) ) ] .

So R1 will tend to underestimate 3-e for small values of C, while for larger values of C, roughly 3C > A, it will tend to overestimate. However, if 3C > A we can already conclude that the contribution of C is relevant, so in any case R1 will not provide misleading information. The observed values of R1 are also given in Table 6 and they show that the increase in variability is rather substantial, being larger than 20% in 20 cases and larger than 100% in 3 cases. We believe that increases of this order of magnitude should not be ignored and that more accurate estimates should be used whenever √(A+C) is needed. The following Definition describes three alternative estimators which have an intuitive motivation and, in the absence of distributional abnormalities, should provide values similar to R1.
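As a rough numerical check of this delta-method approximation (not part of the original analysis), the sketch below simulates the two averages from gamma draws with arbitrarily chosen means A = 1 and A + C = 1.5 and compares the Monte Carlo mean of R1 with the approximate value.

```python
import numpy as np

rng = np.random.default_rng(4)

# Stand-ins for the two averages: A*(i) is a mean of 115 values with
# E = A = 1, and (A+C)*(i) a mean of 21 values with E = A + C = 1.5.
# Gamma draws are used only to give known means and variances.
reps = 50_000
A_star = rng.gamma(shape=2.0, scale=0.5, size=(reps, 115)).mean(axis=1)
AC_star = rng.gamma(shape=2.0, scale=0.75, size=(reps, 21)).mean(axis=1)
R1 = np.sqrt(AC_star / A_star)

# Delta-method value from Theorem 3.5.2 with g(x, y) = sqrt(x / y).
A, AC = 1.0, 1.5
var_A = 2.0 * 0.5**2 / 115     # Var[A*]  = per-draw variance / 115
var_AC = 2.0 * 0.75**2 / 21    # Var[(A+C)*] = per-draw variance / 21
approx = np.sqrt(AC / A) * (1 - var_AC / (8 * AC**2) + 3 * var_A / (8 * A**2))
print(float(R1.mean()), float(approx))   # the two agree to ~3 decimals
```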
Definition 3.5.3 : Let std(58)(i) be the median of the std(T,i) for T=1,...,115, std(11)(i) be the median of the std(T,i) for T=120,...,140 and D(1208)(i) be as described in 3.4.8. Then we define the estimators R2, R3 and R4 as follows:

R2(i) = std*(i) / std°(i)

R3(i) = std(11)(i) / std(58)(i)

R4(i) = 1 - D(1208)(i) / std(58)(i) .

The values obtained for each of these four estimators are given in Table 6 and they show a high degree of agreement, thus increasing our confidence in the corresponding estimates. One question of interest for further investigation is whether the additional variances B and C are intrinsic physiological quantities or whether they are mostly due to a discrepancy between the true time of the peak and the one assumed during the storage of the segment. Since by increasing the sampling rate one should be able to identify the peak with greater accuracy, one way to address this question would be to collect a set of segments containing BREC spikes sampled at a higher rate and to repeat on them the analyses we have performed here. A comparison between the two results might then provide information relevant to the problem.

3.6 Assessment of the Average as an Estimator

In Proposition 3.3.3 we have noticed that v(T) is an unbiased estimator of f(T) regardless of the magnitude of B and C and this fact, together with the simplicity of its calculation, provides powerful support for its practical use. However, it is well known that certain optimality properties of the average are only valid when the underlying distribution of the data is normal and that this estimator may be grossly influenced by "outliers". Moreover for non symmetric distributions the mean may not be the quantity of interest (cf., for example, Lehmann, 1980). So, although the use of the average has the advantage of computational simplicity, a comparison with other methods would be useful to rule out pathologies in the distribution of the data.
Three estimators are compared to the average: trimmed mean, median and "lowess" (Cleveland, 1979). The definitions and properties of the estimators are summarized in the Appendix. For each of them we look for the presence of major disagreements indicating some abnormalities in the distribution of the data. Since we are interested also in the temporal aspect of the problem, we do not use histograms or other similar graphical methods, since we would have to rely on too few observations for each dwell. Instead the investigation is based on the following considerations. If, for a given T, the distribution of the noise n(T) were normal, then for a set of k spikes the interval

v̄(T) ± t(k-1, 0.975) √(var(T)/k)

would contain the average f(T) with 95% probability. Let CE(T) be the value obtained from a competing estimator and assume that this is in fact the true value of the average. Then, if the estimates obtained from different sets of spikes and different values of T were independent, we would expect that in about 95% of the cases

(3-f)    devt(T) = (v̄(T) - CE(T)) / √(var(T)/k)  lies in  (t(k-1, 0.025), t(k-1, 0.975)) .

On this basis we construct the sequence {devt(T); T=1,...,256} for each set of spikes and for each competing estimator, with the aim of detecting the occurrence of lower percentages or of temporal patterns, which might indicate problems in the distribution of the errors. We have already stated that the two assumptions on which this analysis was based are not satisfied in our model, but for our purposes only gross deviations from the expected result were of interest and worth analyzing further. In fact, for both the trimmed mean and the median the values of devt(T) were all well within the confidence limits described in 3-f (Figures 7 and 8). In the case of lowess the values of devt(T) were also within the limits in the majority of cases, but they tended to get larger during the occurrence of the spike and in the same direction as the voltage (Figure 9).
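The diagnostic 3-f is straightforward to compute. The sketch below applies it to simulated normal noise with the per-dwell median as the competing estimator; the sizes are invented, and for this k the 97.5% normal quantile is used as a close stand-in for the t quantile in 3-f.

```python
import numpy as np
from statistics import NormalDist

def devt(segments, competing):
    """Standardized deviations 3-f of a competing per-dwell estimate from
    the average; `segments` is a k x n array of recorded segments."""
    k = segments.shape[0]
    vbar = segments.mean(axis=0)
    var = segments.var(axis=0, ddof=1)
    return (vbar - competing) / np.sqrt(var / k)

# With normal noise and the median as competing estimator, devt(T) should
# stay well within the limits, as reported for the real data.
rng = np.random.default_rng(5)
segs = rng.normal(0.0, 1.0, size=(30, 256))
d = devt(segs, np.median(segs, axis=0))
limit = NormalDist().inv_cdf(0.975)
print(float(np.mean(np.abs(d) < limit)))   # nearly all dwells within limits
```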
This means that lowess tended to underestimate the magnitude of the signal during the spike, an effect which is usually desired in a smoother, but is not particularly useful in our context, since sharpness is an important feature of a spike. Hence we conclude that the arithmetic average can be considered as the estimator of choice for f(T) and that, at least on the basis of our data, there is probably no need for further investigation into this question.

FIGURE 7 Plots of devt(T) for the trimmed mean, one panel per set.

FIGURE 8 Plots of devt(T) for the median, one panel per set.

FIGURE 9 Plots of devt(T) for lowess, one panel per set. The horizontal lines correspond to the limits 3-f.

Chapter 4: Methods of "Peak" Detection

4.1 Introduction

As we have seen in Chapters 2 and 3, a major problem in carrying out the analysis of BREC spikes in general, and the averaging procedure in particular, is the detection of the "peak" of the spike. While Definition 3.3.1 is not ambiguous when applied to a single channel, one runs into problems when trying to extend it to the case when all channels are considered simultaneously, since often not all the channels which detect the spike achieve the first extremum at the same time. In this Chapter we shall consider four different definitions of "peak" with the aim of deciding which one is preferable in practice.

4.2 Four Detection Methods

Since each spike is recorded as a sequence of 256 dwells, a peak detection method may be viewed as a way of choosing, for each segment, one of these dwells.
The method currently applied by the technicians identifies the peak with the earliest dwell at which a large negative peak is observed in one channel and, as mentioned in Section 2.3, it results in the choice of the 129th dwell. This method is based on the observation that, during the spike, negative voltages usually have larger magnitudes than positive ones, so that they are possibly the more important parts of the spike. This leads to the following Definition.

Definition 4.2.1 : For each segment the peak P1 coincides with the 129th dwell.

In order to avoid the dependence of this definition on the subjective visual judgement of the technician, it would be desirable to construct an automated detection procedure based on the same criterion. Several spike detection algorithms have been suggested (Gotman and Gloor, 1976), but they require an accurate knowledge of the background noise, usually obtained from longer periods of time than the one available from the early portion of a recorded segment. In fact when we adapted the method proposed by Gotman and Gloor to a single segment and verified its performance, we found many instances in which the chosen dwell did not correspond to any meaningful feature of the spike or was related to a channel not even affected by the spike. We therefore decided not to use such an automated procedure as an alternative method. However, in view of the conclusions of this study (see Section 4.6), this problem remains an interesting topic for further investigation. Three more criteria are now considered which can be easily automated. These criteria were implemented by means of FORTRAN programs. The problems encountered in trying to automate the search according to the first criterion were avoided for the other methods by simply restricting the search of the maximum to values of T in {120,...,140}.
In keeping with the idea that the negative voltages observed during the spike are of great importance, the second criterion identifies the peak with the dwell at which the largest negative voltage is attained.

Definition 4.2.2 : For a given segment let v(T,c) be the voltage observed on channel c at dwell T. Then the peak P2 is defined by

v(P2,c) = Min[v(T,c); T=120,...,140, c=1,...,21] .

The third criterion is based on the observation, mentioned in Section 3.1, that the effect of the spike on the midtemporal channel of the affected side seems to be of major importance. By using the same choice of channel as given in Table 3, this criterion identifies the peak with the dwell at which the selected channel achieves its first peak. The implementation was made more effective by inputting the polarity of the first peak, obtained by visual inspection of the EEG tracing.

Definition 4.2.3 : Let c0 denote the selected channel and let p(c0) equal +1 or -1 according to the polarity of the segment in question. Then the peak P3 is defined by:

p(c0) v(P3,c0) = Max[p(c0) v(T,c0); T=120,...,140] .

It should be noted that this is not equivalent to maximizing over the absolute values |v(T,c0)|, since the latter might lead to the selection of the second extremum when the spike is polyphasic, a choice which does not agree with the intended meaning of "peak". The last criterion makes use of the observed dipolar structure of BREC spikes. If, during the spike, different channels increase their voltage in opposite polarity, the inter-channel variance should increase as well, so that the dwell at which the maximum variance occurs may be indicative of the peak of the spike. In this case the channel Fpz is ignored, since it is not a real channel (cf. Section 1.4).

Definition 4.2.4 : For a given segment, denote the inter-channel mean at dwell T by

CA(T) = (1/20) Σ_{c=1,...,21; c≠2} v(T,c)

and the inter-channel variance by

CV(T) = (1/19) Σ_{c=1,...,21; c≠2} (v(T,c) - CA(T))² .

Then the peak P4 is defined by

CV(P4) = Max[CV(T); T=120,...,140] .
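The three automated criteria can be sketched as follows; the array layout, channel indices and the toy dipolar spike are all invented for illustration (the original implementation was in FORTRAN).

```python
import numpy as np

# Sketches of Definitions 4.2.2-4.2.4 for one segment stored as a
# dwell x channel array.  Dwell numbers are 1-based in the text; indices
# here are 0-based, so the search window {120,...,140} becomes 119:140.

def peak_p2(v, lo=119, hi=140):
    """Dwell of the overall minimum voltage (Definition 4.2.2)."""
    window = v[lo:hi]
    return lo + int(np.unravel_index(np.argmin(window), window.shape)[0])

def peak_p3(v, c0, polarity, lo=119, hi=140):
    """Dwell at which the selected channel c0 attains its extremum of the
    given polarity (+1 or -1), as in Definition 4.2.3."""
    return lo + int(np.argmax(polarity * v[lo:hi, c0]))

def peak_p4(v, lo=119, hi=140, exclude=(1,)):
    """Dwell of maximum inter-channel variance (Definition 4.2.4); the
    interpolated channel (0-based index 1, the text's channel 2, Fpz)
    is ignored."""
    cols = [c for c in range(v.shape[1]) if c not in exclude]
    cv = v[lo:hi, cols].var(axis=1, ddof=1)   # ddof=1 gives the 19 divisor
    return lo + int(np.argmax(cv))

# A toy dipolar spike: one channel swings negative and another positive,
# both peaking at dwell index 128 (the nominal 129th dwell).
rng = np.random.default_rng(6)
v = rng.normal(0.0, 1.0, size=(256, 21))
shape = 60.0 * np.exp(-0.5 * ((np.arange(256) - 128) / 2.0) ** 2)
v[:, 0] -= shape
v[:, 2] += shape
print(peak_p2(v), peak_p3(v, c0=0, polarity=-1), peak_p4(v))  # all 128
```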
The value of 19 in the denominator of CV(T) is not essential and was in fact not used in the program.

4.3 Methods of Comparison

The detection method which uses P1 is the one currently used and accepted, so our aim is to compare each of the alternative methods to this one. Accordingly, all our tests are pairwise and one-sided. Since we do not know when the true peak occurs, we have no way of directly assessing the bias of a given method. Instead we evaluate the methods on the basis of the variability which they generate in the measurement of certain variables deemed characteristic of the peak. More precisely, let X(T) be a random variable describing a feature of the electrical field at dwell T. If P is the peak dwell and X(P) is an important characteristic of the spike, we would like it to have small variability. Three such variables are considered in the next sections and the detection methods are compared on the basis of their variability, where "variability" will be measured in different ways, depending on the nature of the variable. With this approach there is the obvious danger of giving preference to a method having small variance but large bias. However, each of the four methods is based on a condition which, according to current knowledge, is specifically associated with the peak, so that the bias should be small and this danger minimal. Another general comment concerns the probabilistic approach used. The four methods use different criteria but are aimed at the same phenomenon, so that correlation exists among them. Moreover, since P may take only one of 21 values, the probability of ties is non-zero even when X(T) is a continuous variable. Hence we shall view the problem in light of a "randomization model" (cf., for example, Lehmann, 1975). In the comparison of two methods, under H0, two dwells, possibly identical, are chosen from each segment and then assigned at random, with equal probabilities, to the two methods.
Definition 4.3.1 : Let X and Y be two random variables having f_{X,Y}(x,y) as their joint density or mass function. We say that X and Y are "interchangeable" if, for any x and y,

f_{X,Y}(x,y) = f_{X,Y}(y,x) .

Proposition 4.3.2 : If X and Y are interchangeable then

P( X > Y | X ≠ Y ) = P( X < Y | X ≠ Y ) = 0.5 .

One can easily see that under the randomization model, for m=2,3,4, X(P_m) and X(P1) are interchangeable, and so are the respective measures of variability. This will be used as the basis for an application of the sign test. The following proposition, the proof of which is immediate, will provide a measure of the difference in variability between the two methods which is comparable across different sets of spikes.

Proposition 4.3.3 : Let X and Y be two positive random variables and consider the statistic D(X,Y) defined by

D(X,Y) = (X - Y) / (X + Y) .

Then |D(X,Y)| < 1 and, if X and Y are interchangeable, D(X,Y) has a distribution symmetric around 0.

4.4 Comparing the Methods: Location of Minimum Value

Definition 4.4.1 : Let T be a dwell of a given recorded segment and define MIN(T) to be the channel defined by the relation

v(T,MIN(T)) = Min[v(T,c); c=1,...,21] .

Notice that MIN(T) may not equal 2, since the corresponding channel, F_pz, is interpolated from the adjacent frontal channels (cf. Section 1.4, Figure 2 and Table 1). This fact, however, does not affect the properties that we shall discuss later, so we have included it in Definition 4.4.1 to simplify the notation. In particular, if we denote by P the dwell selected for the given segment by one detection method, then we can consider the variable MIN(P), which identifies the channel of minimum voltage at the time of the peak. Since MIN(P) is a categorical variable, taking one of 21 values, we may assume that it has a multinomial distribution F with parameters (p_1, p_2, ..., p_21), where each p_j is in [0,1] and

Σ_{j=1}^{21} p_j = 1 .

The method giving rise to MIN(P) is then "good" if for each patient most of the p_j's are close to 0.
A positive-valued function of F achieving small values for "good" distributions would then assume the role played by the variance for continuous variables. Several such functions have been suggested (cf. Patil and Taillie, 1982); we shall rely on the following definition.

Definition 4.4.2 : A function from the space of multinomial distributions to the set [0,∞) which is 0 for distributions with zero variance and is maximum for the uniform distribution is called a "diversity measure". In particular the diversity measure defined by

d(F) = - Σ_{j=1}^{21} p_j ln(p_j) ,

with the convention that p_j ln(p_j) = 0 if p_j = 0, is called "Shannon's measure".

The properties of Shannon's measure are discussed in detail in, e.g., Aczel and Daroczy (1975). The sample analogue of Shannon's measure is defined, in our context, as follows. Consider the function

I(c;i,k) = 1 if MIN(P) = c for the k-th spike of the i-th set, and 0 otherwise.

If the i-th set contains k_i spikes, consider, for each channel c, the quantity

p(c;i) = (1/k_i) Σ_{k=1}^{k_i} I(c;i,k) ,

that is, the proportion of spikes which have their minimum at time P in channel c. Finally, for each set construct the statistic

d(P,i) = - Σ_{c=1}^{21} p(c;i) ln(p(c;i)) ,

where p(c;i) ln(p(c;i)) = 0 if p(c;i) = 0. (Again we point out that p(2;i) = 0 always, since the minimum can never occur at F_pz.) This statistic is always non-negative and, since in our case k_i is at most 20, the maximum value that d(P,i) can achieve is ln(k_i). Moreover it is easy to see that d(P,i) is zero when all p(c;i) except one are 0 and increases as the observed distribution approaches the uniform distribution. The observed values of d(P,i) for each i and for m=1,...,4 are given in Table 7. In view of Proposition 4.3.2 we can assume that, under H0,

P{ d(P1,i) > d(P_m,i) | d(P1,i) ≠ d(P_m,i) } = 0.5 ,

so that, if P1 and P_m have t non-tied pairs, the statistic

D = Σ_{i=1}^{26} I{ d(P1,i) > d(P_m,i) ; d(P1,i) ≠ d(P_m,i) }

has a binomial distribution with n=t and p=0.5 .
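The sample version of Shannon's measure is computed directly from the list of observed minimum-voltage channels for one set of spikes. A minimal Python sketch (the function name and data layout are ours, not the thesis's FORTRAN code):

```python
from math import log

def shannon_diversity(labels):
    """Sample Shannon measure d(P, i): `labels` lists the channel MIN(P)
    observed for each spike of one set; 0 * ln(0) is taken as 0, as in
    Definition 4.4.2."""
    n = len(labels)
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    return -sum((k / n) * log(k / n) for k in counts.values())
```

When every spike has its minimum on the same channel the measure is 0; when the k_i spikes fall on k_i distinct channels it reaches its maximum ln(k_i), as noted in the text.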
From Table 7 we see that for m=2 we have n=19 and D=4, while for m=4 we have n=23 and D=2. Since we would reject only for large values of D, in both these cases we do not reject H0 and hence conclude that neither P2 nor P4 is better than P1. For m=3 we have n=20 and D=12, providing a p-value of 0.252. This value is not significant at usual levels, but it is not very high either. In order to check whether this is caused by an existing difference not detected by our test because of low power, we use a more sensitive test. In keeping with the randomization model, we apply the Wilcoxon signed-rank test (Lehmann, 1975). Moreover, since d(P,i) may only achieve a maximum of ln(k_i), we perform this test on the normalized statistics d(P,i)/ln(k_i) which, though not having the same distribution for every i, have the same range and therefore provide a fairer way of ranking the appropriate differences. This produces a value of the test statistic of 208 and, by using a normal approximation, a p-value of approximately 0.138. (Note: for this analysis we included the six tied pairs in the ranking. Other approaches are available for the treatment of ties (cf. Pratt, 1959), but the negative results obtained so far do not suggest the need for further investigation along this line.) Thus we conclude that none of the three alternative methods is better than the existing one on the basis of the location of the minimum.

Remark : Since only 21 points of the scalp are monitored, the real minimum voltage will in fact occur at some point not corresponding to any channel. Thus it is possible that a small variability in the real position of the minimum generates a large variability in MIN, or vice versa. Unfortunately this problem is inherent in the structure of the data and cannot be avoided. In Section 4.6 we consider a different estimator of location that is possibly affected to a lesser extent by this problem.
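The one-sided sign-test p-values quoted above follow from the Binomial(n, 1/2) null distribution of D established in Section 4.3. A small stand-alone Python check (not the thesis's code):

```python
from math import comb

def sign_test_pvalue(d, n):
    """One-sided p-value P{ Bin(n, 1/2) >= d } for the sign test,
    tied pairs having already been excluded from n."""
    return sum(comb(n, k) for k in range(d, n + 1)) / 2 ** n
```

For m=3 (n=20, D=12) this gives approximately 0.252, the value quoted above; for m=2 (n=19, D=4) the p-value is close to 1, which is why H0 is clearly not rejected there.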
4.5 Comparing the Methods: Maximum Voltage Difference

Definition 4.5.1 : For each segment and each dwell T define the variable diff(T) by

diff(T) = Max[v(T,c); c=1,...,21] - Min[v(T,c); c=1,...,21] .

As in the previous section, we are interested in diff(P) for a detection method P. This quantity can be interpreted as a measure of the strength of the generating dipole, and since the dipolar structure is believed to be important, we would like to record it with high precision.

Definition 4.5.2 : For a given set i of segments we denote by S_m(i) the sample variance of diff(P_m) obtained from that set.

The observed values of S_m(i) are given in Table 8. As in the previous section, the sign test does not reject the null hypothesis, and we apply the Wilcoxon signed-rank test. Since for different patients both diff(P) and its variance could be quite different, we rely on Proposition 4.3.3 to obtain a variable whose magnitude may be compared across patients. By letting X = S_m, m=2,3,4, and Y = S_1, Proposition 4.3.3 assures us that the statistic D(X,Y) will have the same range for any set of spikes. Assuming that values of the statistic from different sets can be compared on the basis of equality of range, we use the Wilcoxon signed-rank test. This provides one-sided p-values larger than 0.25 in all cases. A different way of making the different values comparable is to multiply D(S_m,S_1) by a suitable coefficient, depending on the set, so that the resulting statistic will have approximately the same variance for all sets. To determine this variance-stabilizing transformation, we assume that for any i, diff(P_m) and diff(P_1) are independent normal variables, so that S_m(i) and S_1(i) are independent chi-square random variables with k_i - 1 degrees of freedom.

Lemma 4.5.3 : Let X, Y be independent random variables, each having a chi-square distribution with k degrees of freedom. Then the statistic D(X,Y) defined in 4.3.3 has variance equal to (k+1)^(-1).
Proof : Under the stated hypotheses the function

B(X,Y) = Y / (X + Y)

has a Beta distribution with parameters (k/2, k/2) (see, for example, Mood et al., 1974) and hence its variance equals

(k/2)^2 / [ (k+1) k^2 ] = 1 / [ 4(k+1) ] .

But D(X,Y) = 1 - 2B(X,Y) and hence the result follows.

Using Lemma 4.5.3, it follows that under H0, for any set i, the statistic (√k_i) D(S_1,S_m) is symmetric and has variance approximately equal to 1. By using again the Wilcoxon signed-rank test on this transformed statistic we obtain p-values of 0.46 for m=2, 0.33 for m=3 and 0.28 for m=4. Our conclusion is that, according to this criterion as well, none of the alternative methods is better than the existing one.

4.6 Comparing the Methods: Location of the Focus

It is believed that the electrical activity giving rise to the spike is generated within a small area of the brain in a form similar to that of an electrical dipole (Gregory and Wong, 1984). A question of interest is to find the location of this generator, which we shall refer to as the "focus" of the spike. A good method for detecting the peak should locate such a focus with small variability. In this section we use a simple method for locating the focus and then compare the four detection methods with respect to the locations so observed. It is important to keep in mind that the results obtained in this way refer to both the detection and the localization methods, and it is possible that better localization methods would provide a more informative answer to the detection question. However, at the moment no reliable localization method is available, although research is being performed in this area (Weinberg et al., 1986), so we consider our method a first approximation.

Given positive integers a and b, let rem(a;b) be the remainder obtained when dividing a by b and let [a] be the integer part of a.
Let us now associate with electrode site number c (see Table 1) a pair of integer coordinates (X(c), Y(c)). For the fifteen central sites, 4 ≤ c ≤ 18,

X(c) = rem(c-4;5) - 2   and   Y(c) = 1 - [(c-4)/5] ,

while the three sites with c ≤ 3 and the three with c > 18 are assigned the corresponding integer coordinates in the rows Y(c) = 2 and Y(c) = -2 respectively. This has the effect of representing the scalp in a two-dimensional plane, with the electrodes at the integer coordinate points surrounding the origin.

Definition 4.6.1 : Let MIN(P) be as in Definition 4.4.1 and define VMIN(P) by

VMIN(P) = v(P,MIN(P)) .

Similarly define MAX(P) by

v(P,MAX(P)) = Max[v(P,c); c=1,...,21]

and VMAX(P) by

VMAX(P) = v(P,MAX(P)) .

Intuitively, the projection of the focus on the scalp at dwell P should lie on the line segment connecting MAX to MIN and should be closer to the site having the higher voltage magnitude.

Definition 4.6.2 : The "focus" of the spike at time P is identified with the point of coordinates (FX(P), FY(P)), where

FX(P) = [ VMAX X(MAX) - VMIN X(MIN) ] / (VMAX - VMIN)

FY(P) = [ VMAX Y(MAX) - VMIN Y(MIN) ] / (VMAX - VMIN) .

As in the previous section, our null hypothesis is that for m=2,3,4 the vectors (FX(P1), FY(P1)) and (FX(P_m), FY(P_m)) are interchangeable and have the same distribution. Also as before, the alternative is that the first vector has a larger variability, so that the following definition is needed.

Definition 4.6.3 : Let (X,Y) be a random vector with covariance matrix A. Its "generalized variance" is the determinant of A, denoted by |A| (cf. Anderson, 1958).

An estimate of |A| is given by the determinant of the sample covariance matrix, denoted by |S|. For a given set of spikes let us now denote by |S_m| the determinant of the sample covariance matrix of (FX(P_m), FY(P_m)). The observed values of the statistic are given in Table 9. Proposition 4.3.3 can now be applied with X = |S_m| and Y = |S_1| and we can proceed as in the previous section.
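Definition 4.6.2 amounts to a weighted average of the two extreme electrode positions, weighted by voltage magnitude (recall that VMIN < 0 at the peak, so subtracting it adds weight). A minimal sketch with hypothetical input values of our own choosing:

```python
def focus(vmax, x_max, y_max, vmin, x_min, y_min):
    """Focus coordinates of Definition 4.6.2: a voltage-magnitude-weighted
    average of the MAX and MIN electrode positions (vmin is negative)."""
    w = vmax - vmin
    fx = (vmax * x_max - vmin * x_min) / w
    fy = (vmax * y_max - vmin * y_min) / w
    return fx, fy
```

For example, with VMAX = 30 at site (2, 0) and VMIN = -90 at site (-2, 0), the focus falls at (-1, 0): on the segment joining the two sites, pulled toward the site with the larger voltage magnitude, as intended.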
The Wilcoxon signed-rank test applied to the variables D(|S_m|,|S_1|) provides p-values larger than 0.3, so that, once again, we try a variance-stabilizing transformation.

Theorem 4.6.4 : If (X,Y) has a bivariate normal distribution with covariance matrix A and if S is the sample covariance matrix based on n observations, then the quantity (n-1)^2 |S|/|A| is distributed as the product of two independent chi-square random variables with n-1 and n-2 degrees of freedom respectively.

Proof : See, for instance, Anderson (1958).

In this case it is not easy to compute the exact value of the variance of D(|S_m|,|S_1|) under the assumptions of Theorem 4.6.4, but we can derive an approximate value by using a result similar to Theorem 3.5.1.

Theorem 4.6.5 : Let X, Y and g(x,y) be as in Theorem 3.5.2. Then

4-a    Var[g(X,Y)] ≈ Var[X] g_x^2(a,b) + Var[Y] g_y^2(a,b) + 2 Cov[X,Y] g_x(a,b) g_y(a,b) .

Proof : See, for instance, Mood et al., 1974.

Corollary 4.6.6 : If X and Y are independent random variables, each distributed as the product of two independent chi-square random variables with k-1 and k-2 degrees of freedom, then the statistic D(X,Y) defined in 4.3.3 is symmetric around zero and has variance approximately equal to 2(k-2)^(-1).

Proof : Under the given assumptions

E[X] = E[Y] = (k-1)(k-2) ,
Var[X] = Var[Y] = (k+1)k(k-1)(k-2) - (k-1)^2(k-2)^2 ,
Cov[X,Y] = 0 .

If we use these values in the approximation of Theorem 4.6.5 we obtain the result.

However, by using the correspondingly transformed statistic, both the signed-rank test and the test based on a normal approximation still provide high p-values.

4.7 Conclusions

The three comparisons attempted do not suggest that any of the alternative methods should be preferred to the existing one. However, the fact that the one-sided p-values were seldom larger than 0.5 may suggest that no definite advantage can be claimed for P1 either. Perhaps a refinement of one of the automated methods, or of a combination of them, might prove to be the key to a better detection method.
In any case, the methodology employed for this analysis could be applied in the future to test other detection methods or to examine other variables of interest. Finally, given the results of this chapter, we shall use only method P1 in the application of CART discussed in Chapter 6.

Chapter 5: Classification trees

5.1 The classification problem

Let E_1, E_2, ..., E_r be subsets, possibly finite, of the real line and let B be a finite set. We shall denote by E the product E_1 x E_2 x ... x E_r, by x = (x_1, x_2, ..., x_r) an element of E and by y an element of B. Assume now that a probability measure P is defined on ExB. A "classification rule" is a formula or algorithm that assigns a value of y to each value of x. Such a rule may be desirable in one of two situations: either when, for each pair (x,y), the value of y is of interest but only the value of x can be observed, or when one wants to find whether a mechanism exists that can "explain" y on the basis of x. The "classification problem" is then the problem of constructing a classification rule able to determine the correct value of y with high probability.

Definition 5.1.1 : The random variables X_j, j=1,...,r, defined by

X_j(x,y) = x_j

are called "explanatory variables", the random vector X = (X_1, ..., X_r) is called the "explanatory vector" and the random variable Y defined by

Y(x,y) = y

is called the "response variable".

The particular case in which Y is a binary variable and B = {0,1} is frequently encountered in practice and will be the one discussed here. As with so many problems in statistics, satisfactory solutions to the classification problem have been obtained when the X_j's are normally distributed. For instance, if one assumes that, for i=0,1, the conditional distribution of X given that Y=i is multivariate normal with mean vector μ_i and covariance matrix Σ, then it is known that the discriminant function allows the construction of a classification rule with optimal properties (cf. Johnson and Wichern, 1982).
Another frequently used approach consists of assuming that the logit function

5-a    ln [ P(Y=1 | X=x) / P(Y=0 | X=x) ]

is a linear function of x. A classification rule is then obtained by estimating the coefficients of this linear relation and by setting y=1 if the value of 5-a corresponding to the observed value of x, obtained from the estimated linear function, is greater than 0 or some other threshold value. This method is called "logistic discrimination" (cf. Dillon and Goldstein, 1984). However, in many cases the assumptions on which these methods are based may not be satisfied. Also, the resulting models may be difficult to interpret, especially when some of the X_j are categorical and may take more than two values, so that dummy variables have to be introduced into the model. Finally, when collinearity is present, the coefficients of the resulting model may be unstable and may lead to incorrect conclusions about the importance of certain explanatory variables or to incorrect hypotheses about the mechanism relating the explanatory vector and the response variable. In the application of interest to us we want to use as explanatory variables several measures describing the electrical field at the time of the spike as well as certain indicators of the temporal structure of the spike. Except for the binary variables, we do not wish to assume any distributional form and in particular we do not want to assume normality, since earlier observations made by Dr. Wong's group, confirmed by histograms of our observed data, have revealed the presence of skewness in some voltage measures. More importantly, those observations suggested that the two groups might be juxtaposed on the basis of extreme versus intermediate values of some location variables, a phenomenon that could not be picked up by linear methods.
Finally, since this is an exploratory study, we want to use all the variables believed to be relevant, even if they are correlated, without being confused or misled by possible collinearity. For these reasons we use a different approach that does not make any distributional assumptions, accommodates all types of variables and may be easier to interpret. This approach, called "classification and regression trees" and referred to by the acronym "CART", is described in detail in Breiman et al., 1984, and implemented by the CART software package, distributed by California Statistical Software (Lafayette, California). In the following sections we shall describe the main aspects of this methodology in the case of a binary response variable. All the definitions and results we present here may be found, in a more general form, in the above-cited monograph, but we have adapted them to the case of interest and, in some cases, formalized them further in order to clarify the exposition.

5.2 Classification Rules, Partitions and Trees

Definition 5.2.1 : The two subsets Ex{0} and Ex{1} will be referred to as the two "classes" of the sample space.

Definition 5.2.2 : A "classification rule" is a function d : E -> B.

As mentioned in the previous section, a classification rule d is "good" if it determines the true value of y with high probability. This is formalized in the following Definition.

Definition 5.2.3 : The quantity M(d) defined by

M(d) = 1 - P{(x,d(x)); x ∈ E} = P{(x,1-d(x)); x ∈ E}

is referred to as the "misclassification rate" of the rule d.

For example, let us assume that E is the set of real numbers and consider a density function f on ExB which is positive only when (2y-1)x > 0. Then the best classification rule is d(x) = 1 if x > 0 and d(x) = 0 otherwise, which simply follows the sign of x and has M(d) = 0. In the more realistic, and hence interesting, cases M(d) > 0 for any rule d. The values

P0 = P{(x,0); x ∈ E} ,   P1 = P{(x,1); x ∈ E}

indicate the "a priori" probabilities that, for a pair (x,y), y=0 or y=1 respectively. Moreover, the classification rules d0 and d1 defined by

d0(x) = 0 for every x ,   d1(x) = 1 for every x

have misclassification rates P1 and P0 respectively. They correspond to classifying every sample point in the same class. When P0 is known, but no other information is available, the best classification rule one can construct is d0 if P0 > 0.5 and d1 if P0 < 0.5.

Proposition 5.2.4 : Let T be a partition of E, denote by N an element of T and suppose that, for each N, P{Nx{0}} is known. Then the classification rule d_T defined, for an element x of N, by

d_T(x) = 0 if P{Nx{0}} ≥ P{Nx{1}} ,   d_T(x) = 1 if P{Nx{1}} > P{Nx{0}} ,

has misclassification rate

M(d_T) = Σ_{N ∈ T} P{Nx{1-d_T(N)}}

and is the best classification rule which is constant on each element of T.

This proposition implies that a good classification rule may be obtained by constructing a partition T with the property that each N in T contains mostly elements of one class only. Any such partition may be obtained iteratively as follows. First set a stopping criterion for deciding whether a subset of E contains a satisfactory majority of elements of one class, so that it will not be split. If E satisfies this criterion then stop at the trivial partition. Otherwise split E into two subsets N1 and N2, and denote by C the "splitting criterion", that is, the condition on x which defines the split. We shall write x ∈ C if x satisfies C, so that

N1 = { x | x ∈ C } ;   N2 = E - N1 .

If N_i, i=1,2, does not satisfy the stopping criterion, split it into two subsets according to a condition C_i. Continue splitting until all subsets so obtained satisfy the stopping criterion.
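The majority rule of Proposition 5.2.4 and its misclassification rate are straightforward to compute once the class probabilities of each cell of the partition are known. An illustrative sketch (the dictionary layout and function name are ours):

```python
def best_partition_rule(cell_probs):
    """Proposition 5.2.4 sketch: `cell_probs` maps each cell N of the
    partition to the pair (P{N x {0}}, P{N x {1}}).  Returns the majority
    rule d_T and its misclassification rate M(d_T)."""
    # Assign each cell to its majority class (ties go to class 0).
    rule = {N: (0 if p0 >= p1 else 1) for N, (p0, p1) in cell_probs.items()}
    # M(d_T) sums, over cells, the probability of the minority class.
    m = sum(p1 if rule[N] == 0 else p0 for N, (p0, p1) in cell_probs.items())
    return rule, m
```

For a two-cell partition with probabilities (0.4, 0.1) and (0.1, 0.4), the rule assigns the cells to classes 0 and 1 and attains M(d_T) = 0.2, against the 0.5 achieved by either constant rule.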
Although infinite partitions can be obtained in this framework, our ultimate goal is to construct partitions based on a finite sample, so that only finite partitions are of interest and we shall assume that the procedure just described does terminate. The choice of splitting criteria will be discussed in Section 5.4 and the stopping decision in Section 5.5.

Definition 5.2.5 : The procedure just described is called an "iterated partition".

Any iterated partition may be represented by a binary tree as in Figure 8. A branching node of the tree represents both a subset of one of the intermediate partitions and the criterion which splits it. A branch stemming from a node represents one answer to the criterion used to split it, and a terminal node represents an element of the final partition. Hence the following terminology will be adopted.

Definition 5.2.6 : A "classification tree" consists of a partition T of E, a tree representing it and the corresponding classification rule described in Proposition 5.2.4.

The symbol "T" will be used to denote both the tree and the partition whenever no confusion will arise, and the misclassification rate of T will be denoted by M(T). A subset N of E represented by a node of the tree will also be called a "node" and a subset representing a terminal node will also be called a "terminal node". In particular, E will be referred to as the "root" node.

[FIGURE 10: A Binary Classification Tree. The circles represent split nodes, the squares terminal nodes. An observation descends to the left node if it satisfies the criterion expressed under the node. (from Breiman et al., 1984)]

The set of nodes contained in a node N will be denoted by <N> and the set of terminal nodes contained in a node N will be denoted by {N}. The correspondence between finite classification trees and finite partitions is surjective, but not injective, since the same partition may be obtained by performing the splits in different ways.
5.3 Constructing a Classification Rule from a Sample

Suppose now that a sample R consisting of n observations,

R = { (x_j, y_j) ; j=1,...,n } ,

has been drawn from ExB. By assuming that the distribution observed in R is a good approximation of the distribution on ExB, we can construct an iterated partition of E by using the information provided by R. In particular, the choice of splitting criteria and the stopping decision will be based on the proportions observed in the sample. The sample R is usually referred to as the "learning sample".

Definition 5.3.1 : For any node N we denote by |N| the number of observations contained in N. For i=0,1 the proportion of observations belonging to class i among all observations contained in N is given by

p(i|N) = (number of observations of class i contained in N) / |N| ,

while, if N is a subnode of M, the proportion of observations contained in node N among those contained in node M is given by

p(N|M) = |N| / |M|

and, in particular, p(N) = p(N|R).

The following result is analogous to Proposition 5.2.4.

Proposition 5.3.2 : Given a sample R and a classification tree T, define a classification rule d_{R,T} by setting, for each node N of T,

d_{R,T}(x) = i if x ∈ N and p(i|N) ≥ p(1-i|N) .

Then d_{R,T} minimizes the misclassification rate for elements of R among all rules which are constant on each element of T.

In light of Proposition 5.3.2, the classification rule d_{R,T} will be the one associated with a tree T and a sample R, and the subscripts will be omitted whenever no confusion shall arise.

5.4 Choosing the Splitting Criteria

So far we have not specified how the splitting criteria should be chosen so as to obtain a "good" rule. Two objectives are important in this choice. The first is that the resulting rule should have a low misclassification rate. The second is that the classification rule should be easy to interpret and implement. The second objective is partially achieved as follows.
Definition 5.4.1 : A "standard splitting criterion" is a criterion of the form

C : X_j ≤ c ,   c ∈ (-∞,∞) ,

for ordered scalar variables, or

C : X_j ∈ S ,   S ⊂ E_j ,

for categorical variables. A "linear criterion" is a criterion of the form

C : Σ_{j=1}^{r} a_j X_j ≤ c ,   c ∈ (-∞,∞) ,

where the a_j are real numbers and a_j = 0 if X_j is categorical.

All the criteria we use are standard or linear criteria. The other problem related to this objective is to ensure that the tree does not become unreasonably large. This is discussed in Section 5.5. The following definition will be used to achieve the first objective.

Definition 5.4.2 : Let f(p) be a real-valued function twice differentiable on [0,1] such that: 1) f(0) = f(1) = 0 ; 2) f(p) = f(1-p) for p in [0,1] ; 3) f''(p) < 0 on (0,1). Then the "impurity" of a node N is defined by

f*(N) = f(p(0|N)) .

Thus a node will be most impure when p(0|N) is close to 1/2 and will be increasingly pure as p(0|N) approaches 0 or 1, that is, when the node contains mostly units of the same class. The concavity requirement is made to ensure that impurity decreases at a faster rate as p approaches either one of the two extremes. Shannon's measure of diversity, described in Chapter 4, gives rise to the impurity function corresponding to

f(p) = - ( p ln(p) + (1-p) ln(1-p) ) ,

while the simpler quadratic function

f(p) = p(1-p)

is referred to as "Gini's impurity function" (cf. Patil and Taillie, 1982).

Definition 5.4.3 : The overall impurity of a tree T is defined by

f*(T) = Σ_{N ∈ {E}} f*(N) p(N) .

Assume that C is a binary criterion splitting N into nodes N1 and N2. Let T be the tree existing before the split and T' the tree obtained after the split. Then, to measure how effective such a split is, we may consider the decrease of tree impurity it generates, namely

5-b    f*(T) - f*(T') = f*(N) p(N) - f*(N1) p(N1) - f*(N2) p(N2) .

Clearly, with this criterion the optimal split at a given node depends only on the node itself.
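The search for the standard split maximizing 5-b can be sketched, for a single ordered variable, as follows. This is a simplified illustration of our own, not the CART implementation (which also searches categorical and linear criteria):

```python
def gini(p):
    """Gini's impurity function f(p) = p(1 - p)."""
    return p * (1.0 - p)

def node_impurity(labels, f=gini):
    """f*(N) = f(p(0|N)) for the 0/1 labels falling in a node."""
    return f(labels.count(0) / len(labels))

def best_standard_split(xs, ys, f=gini):
    """Maximize the impurity decrease 5-b over standard criteria X <= c,
    c ranging over the observed values of one ordered variable."""
    n = len(xs)
    parent = node_impurity(ys, f)
    best = (None, -1.0)
    for c in sorted(set(xs))[:-1]:          # every split leaving both sides non-empty
        left = [y for x, y in zip(xs, ys) if x <= c]
        right = [y for x, y in zip(xs, ys) if x > c]
        decrease = parent - (len(left) / n) * node_impurity(left, f) \
                          - (len(right) / n) * node_impurity(right, f)
        if decrease > best[1]:
            best = (c, decrease)
    return best
```

On a toy sample where class membership changes exactly at x = 2, the search recovers that cutpoint, and the decrease equals the full parent impurity since both children become pure.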
Definition 5.4.4 : For any node N the corresponding splitting criterion is the standard or linear criterion that maximizes 5-b.

Several splitting criteria may be devised in this way, and others not related to an impurity function may also be considered, but Breiman et al. suggest that the final structure of the tree is rather insensitive to the criterion used. Hence only two criteria are included in the CART software: Gini's criterion and the so-called "twoing" criterion, which is not related to an impurity function. The two, however, can be shown to coincide for the two-class problem (Breiman et al., pages 104-108), so we shall refer only to Gini's criterion henceforth. A nice interpretation of Gini's criterion, in terms of general principles of classification, may be obtained as follows. Assume that N is split into nodes N1 and N2 and that, for j=1,2,

P{x ∈ N_j | x ∈ N} = p(N_j|N) ,   P{Y=i | x ∈ N_j} = p(i|N_j) .

Also, suppose that we want to construct a classification rule able to reproduce the distribution of classes observed in N_j, j=1,2, so that if x is in N_j we assign it to class i with probability p(i|N_j). Denoting this rule by d_N, we observe that the conditional probability of misclassifying x, given that x is in N, can be written as

P{d_N(x) = 1-y} = Σ_{j=1}^{2} Σ_{i=0}^{1} P{d_N(x)=i | x ∈ N_j} P{y=1-i | x ∈ N_j} P{x ∈ N_j | x ∈ N}

and hence it equals

5-c    2 [ p(0|N1) p(1|N1) p(N1|N) + p(0|N2) p(1|N2) p(N2|N) ] .

On the other hand, for the given node N, maximizing 5-b is equivalent to minimizing f*(N1)p(N1) + f*(N2)p(N2) or, after dividing by p(N)/2, to minimizing 5-c. Therefore, by using a proportional assignment rule, the use of Gini's function can be seen as an effort to minimize the misclassification rate produced by a splitting of the node in question.

5.5 "Pruning" the Tree

Following the procedure we have described so far, one may continue splitting nodes until each terminal node either contains elements of the same class or contains elements with identical values of x.
However, the classification rule obtained in this way would be very artificial and data dependent, in the sense that for many nodes P{(N, d_T(N))} would likely be too small to be of any practical relevance, and indeed it may be true that P{(N, d_T(N))} < P{(N, 1-d_T(N))}, thus increasing the true misclassification rate. The intuitive idea of not splitting a node when the decrease of impurity it produces is below a certain threshold soon runs into two types of problems. The first is that, if the threshold is too small or too large, the resulting tree may be overly rich or may have important splits missing. The second is that, while a node may not produce a decrease larger than the threshold, one, or both, of the subnodes resulting from its best split may in turn produce relevant splits. A different approach may be taken as follows.

Definition 5.5.1 : Let T0 be the tree obtained from the sample R by splitting until each terminal node either has no impurity or contains units with the same value of x.

Definition 5.5.2 : A tree T is said to be contained in the tree T' if every node of T is a node of T'.

Now it is intuitively clear and easy to show (Breiman et al.) that, for any node N, M(N) ≥ M({N}).

Definition 5.5.3 : The tree T1 is obtained from T0 by removing all those nodes contained in a non-terminal node N for which M(N) - M({N}) = 0.

Clearly the tree T1 will have the same misclassification rate as T0, but fewer terminal nodes, and hence it may be considered a better tree. The next step might be to consider the tree T2 obtained from T1 by excluding the nodes contained in the node N for which M(N) - M({N}) is smallest. It is easy to see that this criterion would restrict the analysis to those nodes which contain only two terminal nodes. Since our aim with this procedure is to simplify the structure of the tree, it would be desirable to avoid such a restriction.
Then for each node N of TQ define M(N) = p(l-d(N)\N)) p(N) M(N') N'€{N} 71 Definition 5.5.4 : The "link strength" of a node /V is defined by L ( N ) = WN) ~ M(<N>) \<N>\ - 1 where L(N)= oo if N is terminal. Definition 5.5.5 : Given the tree Tj described in Definition 5.5.3, define iteratively 7^ to be the subtree of 7^ -7 obtained by eliminating the nodes properly contained in the node, or nodes, of T^j having smallest link strength. Such smallest value is called the "complexity parameter" of and is denoted by Wk, with fX j=0. The motivation for Definition 5.5.5 is given by the following theorem, proved in Breiman et al. Theorem 5.5.6: For every (X in the interval t &k'^k+l^ t n e t r e e ^k m m i m i z e s t n e quantity M(T) + 0( |71 among all trees contained in TQ. Moreover, if there exists another tree T contained in TQ for which M(T) + c< |71 = M(TK) + OC \TK\, then TK is contained in T. This procedure is called "pruning" and it generates a decreasing sequence of trees 5-d {Tj, T2 T^EJ, each tree of the sequence having more terminal nodes and lower misclassification rate than the following one. The optimal tree can now be chosen from among the trees of 72 sequence 5-d in two ways, either by choosing the tree corresponding to a desired value of the complexity parameter, or by choosing the smallest tree whose misclassification rate is comparable to that of TQ. The first method is equivalent toachieving a desired number of terminal nodes. For the second, one needs an estimate of the misclassification rate (see Section 5.6) and a formal criterion, such as, for example, the one given in the following Definition, referred to as the "1 standard error rule". Definition 5.5.7: Let m(T'•) be an estimate of the misclassification rate of the tree Tj, and denote by sm(T j) its estimated standard error. 
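The "1 standard error rule" is easy to state procedurally: among the trees of the pruned sequence, take the smallest one whose estimated rate is within one standard error of the best estimate. A minimal sketch (the function name is ours), assuming the sequence is listed from largest tree to smallest:

```python
def one_se_rule(rates, ses):
    """Given misclassification estimates m(T1), ..., m(TK) and their
    standard errors for a pruned sequence (largest tree first), return
    the index of the smallest tree whose estimated rate is within one
    standard error of the best rate."""
    best = min(range(len(rates)), key=lambda j: rates[j])
    cutoff = rates[best] + ses[best]
    # later indices correspond to smaller trees; take the last one under the cutoff
    return max(j for j in range(len(rates)) if rates[j] <= cutoff)
```

For example, with estimated rates (0.20, 0.18, 0.19, 0.25) and a common standard error of 0.03, the rule selects the third tree: it is smaller than the minimizer and still within one standard error of it.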
The "optimal" classification tree is that tree Tj* of sequence 5-d defined by the following conditions: 1) m(Tj) < m(Tj*) implies j* > j; 2) m(Tj*) ≤ m(Tj) + sm(Tj) for every j = 1, ..., K. In particular, the misclassification rate of the optimal tree is within one standard error of the best misclassification rate obtained by a tree of the sequence 5-d. Breiman et al. prove that the tree described in Definition 5.5.7 exists and is unique.

5.6 Estimates of the Misclassification Rate

In the previous section we noticed that the choice of the optimal tree depends on the misclassification rates of the trees of sequence 5-d, so that in order for the procedure to work well we need good estimates of such rates. Since each tree tends to minimize the misclassification rate of the learning sample, estimating the misclassification rate of a tree by the proportion of elements in the learning sample that are misclassified may lead to optimistic underestimates. These are called "resubstitution" estimates. Instead we shall use "cross-validated" estimates, defined as follows.

Definition 5.6.1 : Given the tree Tk, divide R randomly into v subsets of equal size, denoted by R1, ..., Rv, and let R(j) = R - Rj. For each j = 1, ..., v construct, by using the previous procedure, the smallest tree based on R(j) having complexity parameter not larger than αk. Define M*(Rj) as the proportion of elements of Rj misclassified by the tree so constructed, and define

M*(Tk) = (1/|R|) Σ_{j=1}^{v} |Rj| M*(Rj)

to be the "cross-validated" estimate of M(Tk).

This definition is based on the assumption that if v is sufficiently large, Tk and each of the smaller trees constructed as described will be similar, hence with similar misclassification rates. On the other hand, each subset Rj used to compute M*(Rj) is independent of the sample R(j) used to construct the corresponding tree, so that the cross-validated estimate does not have the same inherent tendency to underestimate as the resubstitution estimate.
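The mechanics of the cross-validated estimate can be sketched as follows. For illustration only, the tree-construction step of Definition 5.6.1 is replaced here by a trivial majority-class rule, since reproducing the full CART fitting procedure is beyond the scope of this example; all names are ours.

```python
import random

def cross_validated_error(data, v, train, seed=0):
    """v-fold cross-validated misclassification estimate.
    `data` is a list of (x, y) pairs; `train` maps a learning sample
    to a classification rule d(x)."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)
    folds = [idx[j::v] for j in range(v)]
    errors = 0
    for fold in folds:
        held_out = set(fold)
        rule = train([data[i] for i in idx if i not in held_out])
        errors += sum(1 for i in fold if rule(data[i][0]) != data[i][1])
    return errors / len(data)

# Illustrative stand-in for tree construction: always predict the
# majority class of the learning sample.
def majority_rule(sample):
    ones = sum(y for _, y in sample)
    label = 1 if ones * 2 >= len(sample) else 0
    return lambda x: label
```

Each held-out subset is scored by a rule built without it, which is what removes the optimistic bias of the resubstitution estimate.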
Reasonable estimates of the standard error of M*(Tk) may be obtained by heuristic arguments as described in Breiman et al.

5.7 Variable misclassification costs

In certain problems different types of misclassification are associated with different degrees of severity. For instance, withdrawing a safe and effective treatment from a sick patient because of an incorrect diagnosis may be worse than applying the treatment to a healthy individual. In these cases it is desirable to construct a classification rule that minimizes a weighted combination of the different types of misclassification. There are two approaches to this problem. The first consists of constructing the tree T0 as before and then pruning according to the following definitions of classification rule and misclassification rate.

Definition 5.7.1 : Let ci be the cost of classifying an element of class i as belonging to class 1-i. For any node N of T0 let

d(N) = i if p(i|N) ci > p(1-i|N) c(1-i) ,

M(N) = p(1-d(N)|N) p(N) c(1-d(N)) ,

and define M(N) as the "misclassification cost" of N.

By modifying the definition of link strength accordingly, one obtains a pruning procedure that deletes from the tree those splits that produce a small improvement in misclassification cost. The second approach is to modify the value of the proportions used in the impurity function so that elements of one class are given greater relevance.

Definition 5.7.2 : Let ci be as in Definition 5.7.1 and define

π(i) = ci / (c0 + c1) .

The "cost adjusted" proportion p*(i|N) is then defined by

p*(i|N) = [π(i)/p(i|R)] p(i|N) / { [π(0)/p(0|R)] p(0|N) + [π(1)/p(1|R)] p(1|N) } .

If π(i) = p(i|R) then p*(i|N) = p(i|N). Since p*(i|N) is an increasing function of π(i), if π(i) > p(i|R), elements of class i will count for a larger proportion and hence their misclassification rate will tend to be lower.

By introducing variable costs with the first method we simply obtain a different subtree of the same maximal tree T0 obtained with equal costs. With the second method the tree may be based on entirely different splits.
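One reading of the cost-adjusted proportion, in which each class proportion is reweighted by π(i)/p(i|R) and then renormalized (so that π(i) = p(i|R) recovers p(i|N)), can be transcribed directly. The function name is ours:

```python
def cost_adjusted_proportion(i, p_node, p_root, costs):
    """Cost-adjusted proportion p*(i|N): p_node[j] = p(j|N),
    p_root[j] = p(j|R), costs[j] = c_j, classes j in {0, 1}."""
    pi = [c / sum(costs) for c in costs]
    w = [pi[j] / p_root[j] * p_node[j] for j in (0, 1)]
    return w[i] / (w[0] + w[1])
```

With equal costs and a rare class 1 (say p(1|R) = 0.2), a node with p(1|N) = 0.5 receives a cost-adjusted proportion p*(1|N) = 0.8, illustrating how the reweighting pulls the split criterion toward the class with relatively higher misclassification cost.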
5.8 Surrogate and competing splits

A problem encountered frequently in regression as well as classification procedures is that of collinearity. When two, or more, explanatory variables are strongly dependent, the resulting model may rely on either one with the same efficacy, so that two slightly different samples may provide different models. In particular, some explanatory variables may be judged irrelevant for the problem when in fact they are almost as good as those highlighted by the model and highly correlated with them. In CART this problem is handled by considering "surrogate" and "competitor" splits.

Definition 5.8.1 : Let C be a splitting criterion which divides the node N into N1 and N2. Let C' be another splitting criterion dividing N into N'1 and N'2. Then the "association" of C and C' is given by

a(C,C') = ( |N1 ∩ N'1| + |N2 ∩ N'2| ) / |N| .

Clearly, if C and C' produce similar splits of N their association will be large.

Definition 5.8.2 : Let C be a splitting of a node N. The "first surrogate split" of C is that split generated by the criterion C' for which a(C,C') is maximum. Similarly one may define second, third and higher surrogate splits.

An analysis of the surrogate splits may reveal variables that, at least within a node, are similar to the one used for the best split. Moreover, surrogate splits may be used in the predictive stage of the classification when the value of the variable generating the best split is not known, but the value of a surrogate variable is known. The latter application, however, is not relevant to the spike classification problem discussed in Chapter 6, since the values of all the variables used there are obtained from one set of data, namely the segment, so that it is not possible to miss only some variables.

Definition 5.8.3 : Let C be the optimal split for the node N. The "first competitor" split is the one that achieves the second highest value of 5-b.
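The association measure of Definition 5.8.1 is simply the fraction of units of N that two splits send to the same side. A minimal sketch (names ours), representing each split as a predicate that returns True for the units sent to the left subnode:

```python
def association(split_c, split_c2, units):
    """Association a(C, C') between two splits of a node N: the
    fraction of units of N assigned to the same side by both splits,
    i.e. (|N1 ∩ N'1| + |N2 ∩ N'2|) / |N|."""
    agree = sum(1 for u in units if split_c(u) == split_c2(u))
    return agree / len(units)
```

For instance, the thresholds "u < 5" and "u < 6" applied to the units 0, ..., 9 disagree only on u = 5, giving an association of 0.9.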
Again we can define second, third and higher competitor splits, and these may be analyzed to identify variables that, although not used in the tree, may provide reasonably good splits at some nodes.

5.9 Some theoretical considerations

The procedure for constructing a classification tree is based mostly on heuristic arguments. The choices of impurity function and of splitting criteria, the pruning procedure and the selection of the optimal tree are made from general principles using methods which are either of common use in statistics or make sense for the goal to be achieved. No theoretical results are available, however, to qualify these aspects of the methodology as "optimal" in some sense. Breiman et al. provide a theorem (Theorem 12.19) showing that, under certain assumptions, the procedure is "consistent", meaning that asymptotically its "risk", in the Bayesian sense, tends to the theoretical lower bound. In practice, however, both the hypotheses and the conclusion of this theorem may not be relevant. In fact, the conditions needed for the theorem relate to the distribution of the sample space, which, in most of the cases in which one would use this methodology, is unknown. So, even though these conditions are mild, one can never be sure that they are satisfied. Moreover, in order for the asymptotic approximation to become reasonable one is likely to need a tree with a very large number of nodes, which may be rather complex and may shed little light on the way in which the explanatory vectors are related to the response variable. We believe that CART should be viewed primarily as an organized attempt to analyze all possible nodes and to assess the predominance of one class in a given node, so that an important theoretical consideration is whether, for a given node N, the observed proportions of the two classes reflect the corresponding probabilities.
More formally, one needs to know whether, given a learning sample R = {(x1,y1), ..., (xn,yn)},

5-e    E[ Σ_{j=1}^{n} I{xj ∈ N, yj = 0} / Σ_{j=1}^{n} I{xj ∈ N} ] = P{N×{0}} / P{N} .

If N is a terminal node the question of interest may be whether the |N| units belonging to that node provide a fair assessment of the proportions of classes, in the sense of identity 5-e.

Proposition 5.9.1 : Identity 5-e is satisfied upon conditioning on the number of units |N| observed in node N.

Proof : Assume, without loss of generality, that the first |N| units belong to node N. Then the left hand side of identity 5-e may be rewritten as

E[ (1/|N|) Σ_{j=1}^{|N|} I{yj = 0} ]

and the sum is a binomial random variable with n = |N| and p = P{N×{0}}/P{N}, so that the result follows.

The unconditional expectation, however, is more relevant in the context of the search for "good" nodes. In this case the exact bias is difficult to assess in general, but we may use again Theorem 3.5.2 to obtain relevant information.

Proposition 5.9.2 : If the sample consists of n independent observations from the sample space, then 5-e is satisfied up to terms of order n⁻¹.

Proof : Let

X = Σ_{j=1}^{n} I{xj ∈ N, yj = 0} ,   Y = Σ_{j=1}^{n} I{xj ∈ N} ,

so that

E[X] = n P{N×{0}} ,   E[Y] = n P{N} ,
Var[Y] = n P{N} (1 - P{N}) ,
Cov[X,Y] = n Cov[ I{xj ∈ N, yj = 0}, I{xj ∈ N} ]
         = n ( E[ I{xj ∈ N, yj = 0} I{xj ∈ N} ] - E[ I{xj ∈ N, yj = 0} ] E[ I{xj ∈ N} ] )
         = n P{N×{0}} (1 - P{N}) .

Now consider the function g(x,y) = x/y with second derivatives

g_xx = 0 ;   g_yy = 2x/y³ ;   g_xy = -y⁻² .

If we apply Theorem 3.5.2, we conclude that the left hand side of 5-e is approximately equal to

n P{N×{0}} / (n P{N}) + [ n P{N×{0}} / (n³ P{N}³) ] n P{N} (1 - P{N}) - [ 1 / (n² P{N}²) ] n P{N×{0}} (1 - P{N})

and hence, the two correction terms of order n⁻¹ cancelling each other, to P{N×{0}}/P{N}, as claimed.

When correlations exist among the observations the above result may no longer hold, but the CART procedure may still be effective, if the sample provides a reasonable representation of the distribution of the sample space.
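The near-unbiasedness asserted by Propositions 5.9.1 and 5.9.2 lends itself to a quick numerical sanity check: for independent observations, the in-node proportion of class 0, averaged over many samples, should be very close to P{N×{0}}/P{N}. A small simulation sketch (all parameter values and names are arbitrary choices of ours):

```python
import random

def node_ratio_bias(n, p_node, p0_given_node, trials, seed=1):
    """Monte Carlo check of identity 5-e: average the observed in-node
    proportion of class 0 over many independent samples and compare it
    with the target P{N×{0}}/P{N} = p0_given_node."""
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(trials):
        in_node = [rng.random() < p_node for _ in range(n)]
        m = sum(in_node)
        if m == 0:
            continue  # the ratio in 5-e is undefined when no unit falls in N
        zeros = sum(1 for b in in_node if b and rng.random() < p0_given_node)
        total += zeros / m
        count += 1
    return total / count
```

With, say, P{N} = 0.4 and P{y=0 | x ∈ N} = 0.3, the averaged ratio settles near 0.3, consistent with the cancellation of the order n⁻¹ terms in the proof above.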
In our case correlations exist, since several replicates are available of each spike, and in Section 6.3 we describe the sense in which our use of CART is to be considered "reasonable". Finally, keeping in mind the exploratory nature of this methodology, it is important, in assessing a proposed tree, to consider also factors like agreement of the estimates obtained by different methods and interpretation of the nodes, while de-emphasizing minor inconsistencies among threshold values. In particular, a low value of the estimated misclassification rate should be considered as more reliable when it is obtained by using few splits, since, as explained in Section 5.5, a misleadingly low estimate may often be obtained by using a large number of splits.

Chapter 6: Classification of BREC spikes

6.1 Introduction

As mentioned in Section 1.7, the distinction between "typical" and "atypical" BREC patients is made exclusively on the basis of clinical data. Both groups of patients have the same type of seizures, but while atypical patients also suffer from some other neurological abnormalities or have a level of intelligence or school performance lower than normal, typical patients do not have any other neurological problem. The question of interest is whether this dichotomy is truly associated with two different types of epilepsy, possibly leading to different prognoses, or whether the presence of neurological complications is simply an added feature to the same form of epilepsy. Since Rolandic spikes are present in both groups and are believed to be strictly related to the epileptic activities (Gregory and Wong, 1984; Lerman and Kivity, 1986), a step towards answering this question may be taken by finding whether a classification rule able to distinguish satisfactorily between the two groups may be constructed on the basis of certain features of these spikes.
By using the terminology and notation described in the previous Chapter, we may formalize this question as follows. For each spike, the value of the binary response variable Y is 0 if the spike occurs in a typical patient and 1 if it occurs in an atypical patient. The explanatory vector X consists of a number of variables, described in Section 6.2, determined by certain features of the spike. What is the misclassification rate of the "optimal" classification rule that can be constructed from the available data? Can this rule be interpreted in terms of meaningful physiological characteristics? If the "optimal" rule has a low misclassification rate and can be interpreted physiologically, then one would conclude that the two groups can be differentiated on the basis of epilepsy-related EEG features and hence that they are possibly affected by different types of epilepsy. On the other hand, a different conclusion might be reached if no significant differences are found between the two groups in the EEG patterns. The classification tree methodology described in the previous Chapter will be used to obtain the required classification rule. In this Chapter we shall use the term "peak" as defined in Definition 4.2.1.

6.2 Choice of variables

The variables described in this Section were selected as explanatory variables because previous studies (Gregory and Wong, 1984; Lerman and Kivity, 1986) and preliminary visual analyses performed by Dr. Wong's team had pointed to them as either potential classifiers, or as being related to some important underlying neurological phenomenon. An attempt was always made at keeping the measurements objective, but for the variables described in Definitions 6.2.1 through 6.2.4 visual analysis of the relevant topographic maps turned out to be the best method of evaluation (cf. Section 4.2), so that an element of subjectivity did enter. Contingency tables did not reveal the presence of correlation among these variables.
Definition 6.2.1 : The categorical variable PHASE describes the temporal development in the polarity of the channels. For a given spike it takes the value 1 if no involved channel exhibits a reversal of polarity after the peak; the value 2 if channels with both peak polarities have a reversal; the value 3 if only channels of a given peak polarity exhibit a reversal, the others being monophasic.

Definition 6.2.2 : The categorical variable POLES describes the electrical field structure at the peak of the spike. For a given spike it takes the value 1 if the spike is monopolar; the value 2 if it is dipolar and the involved channels represent a small area of the scalp; the value 3 if it is dipolar and a large area of the scalp is involved; the value 4 if more than two poles are evident at the time of the peak.

Definition 6.2.3 : The categorical variable SYNCH describes the degree of synchronism among the channels affected by the spike. For a given spike it takes the value 1 if all channels achieve their peak within 2 dwells of each other; the value 2 if the involved channels may be divided into two groups, with channels within the same group peaking simultaneously and channels from different groups peaking more than 2 dwells apart; the value 3 if a major asynchronization among channels occurs.

Definition 6.2.4 : The binary variable WAVE detects the presence of an after-wave following the spike. For a given spike it takes the value 0 if no after-wave follows the spike; the value 1 if an after-wave follows the spike.

All the remaining explanatory variables describe features observed at the peak of the spike and are those defined in 4.3.1, 4.5.1 and 4.5.2, namely MIN, the location of minimum voltage, VMIN, the value of the minimum voltage, MAX, the location of maximum voltage, VMAX, the value of maximum voltage, and FX and FY, the coordinates of the estimated focus.
These were evaluated from the ASCII files by means of FORTRAN programs and provide a way of describing the location and strength of the electrical field generating the spike. The only interesting correlation is found between the variables VMIN and VMAX and it equals -0.739. This may be viewed as an indication that both variables are measures of the strength of the spike. The absolute values of all other correlation coefficients among continuous variables were smaller than 0.3.

6.3 Selection of sample

As we have mentioned in Section 5.7, the data available cannot be considered independent, since they consist of several replicates of the same spike for each set of spikes. It is intuitively clear that if a large number of replicates of the same spike are part of the learning sample, they might generate a terminal node which is in fact describing only one patient and may therefore have little descriptive value and lead to a high misclassification rate. To put this observation in a formal way, we notice that we may view the sample space as the union of a finite number of subsets, each corresponding to a patient and each having the same probability measure. (For the purpose of this discussion we shall use the word "patient" to mean a set of spikes, as defined in Section 2.3.) Associated with each patient is a distribution for the explanatory vector, so that the distribution of the sample space is a mixture of the individual distributions, with equal weights. Our sampling scheme consists of two stages. In the first we select from the population of patients, with equal probabilities, the n individuals to analyze. In the second we make ki independent observations from patient i, for a total of k recorded segments. Let us denote by xij the j-th observed explanatory vector of the i-th selected patient.
If we condition on the first stage of the sample, then for a node N the expected proportion of units belonging to that node is given by

E[ (1/k) Σ_{i=1}^{n} Σ_{j=1}^{ki} I{xij ∈ N} ] = Σ_{i=1}^{n} (ki/k) P{N|i} ,

and this quantity is independent of the number of replicates per patient only when ki = k/n for every patient. In fact, in that case the above expectation equals P{N}. On the other hand, if the ki are different then the observed distribution within a node may be heavily influenced, as previously stated, by a single patient. Since the smallest set available consisted of 6 spikes, and in order to maintain a certain degree of randomness in all sets, we used 5 spikes from each set. Thus, for the i-th set we selected 5 random integers between 1 and ki, without repetitions, and the corresponding spikes became part of the learning sample. Hence a learning sample consists of 130 observations. We produced three such learning samples and performed the preliminary analysis on all three. In accordance with the last comment of Section 5.9, the reliability of the classification rule constructed from each of them was then assessed on the basis of consistency of several diagnostic statistics.

6.4 Preliminary analysis

In the preliminary analysis we used only standard splits; the optimal trees were chosen from cross-validation estimates of the misclassification rate with v=10 and using the "1 standard error rule". The characteristics of these trees, in terms of complexity and misclassification rates, are summarized in Table 10. The first two samples provide rather similar results, while the third gives rise to a much larger tree, with small terminal nodes and a noticeable discrepancy between cross-validation and resubstitution estimates of the misclassification rate. Moreover, despite being larger, the third tree provides misclassification rates for the entire available sample similar to those of the first two, thus indicating that the greater number of splits does not achieve any real improvement.
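The balanced second-stage selection described here, with the same number of spikes drawn without repetition from every set, can be sketched as follows (names and the fixed seed are ours):

```python
import random

def draw_learning_sample(sets_of_spikes, per_set=5, seed=0):
    """Balanced two-stage sample: draw the same number of spikes,
    without replacement, from every set, so that each patient
    contributes equally (ki = k/n for all i)."""
    rng = random.Random(seed)
    sample = []
    for spikes in sets_of_spikes:
        sample.extend(rng.sample(spikes, per_set))
    return sample
```

With 26 sets and 5 spikes per set, this yields the learning samples of 130 observations used in the analysis.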
By pruning this tree further, until it is left with four terminal nodes, we obtain a tree with rather inconsistent estimates of the misclassification rates, both between typical and atypical groups and between cross-validation and resubstitution estimates. The former discrepancy indicates that this tree identifies only a small portion of the atypical observations and thus fails to separate the two groups effectively. The latter indicates an instability of the tree structure and the possibility that the resubstitution estimates are too optimistic and hence unreliable. Our conclusion, therefore, is that the information provided by the third sample is not reliable, and we shall not use it in further analyses. The first two samples give rise to trees with a noticeable degree of agreement (Figures 11 and 12). Both include a split on the variable VMIN, with the threshold value being approximately -50 in both cases. Moreover, the first split obtained in the first tree (based on MIN) has, as first surrogate, the first split obtained in the second tree (based on FX), with an association value of 0.67, while the latter has the former as competitor split. This indicates that the two trees are being generated by very similar considerations. However, the tree obtained from the second sample has a larger complexity parameter and lower values for all types of estimates of the misclassification rate. These are all indications of a more stable structure, so that we believe this tree to be more reliable than the first one. The optimal tree obtained from the second sample will be denoted, from now on, by T* and its terminal nodes can be described as follows (in brackets are the numbers of typical and atypical spikes, respectively, corresponding to the given node).
Node 1: FX < -.805 => Typical (25-2)
Node 2: FX > -.805, VMIN < -50.5, MIN ∉ {1,3,17,21} => Typical (32-12)
Node 3: FX > -.805, VMIN < -50.5, MIN ∈ {1,3,17,21} => Atypical (1-9)
Node 4: FX > -.805, VMIN > -50.5 => Atypical (7-42).

Node 1: Split: To left if MIN is in {2,5,6,7,9,11,14,15,16,20}. First Surrogate: FX < -0.496, a=0.67. First Competitor: FX < -0.892.
Node 2: Split: To left if VMIN < -45.3. First Surrogate: VMAX > 36.9, a=0.54. First Competitor: POLES is in {2,4}.
Node 3: Split: To left if POLES is in {2,4}. First Surrogate: MAX is not in {4,12,17,18,21}, a=0.42. First Competitor: MAX is in {1,2,5,7,9,10,13,14,15,16,19,20}.

FIGURE 11 Classification Tree For The First Sample.

Node 1: Split: To left if FX < -.805. First Surrogate: MIN is in {2,4,5,6,7,8,9,11,14,16,20}, a=0.67. First Competitor: MIN is in {2,5,6,7,8,9,10,11,12,13,14,15,16,20}.
Node 2: Split: To left if VMIN < -50.5. First Surrogate: VMAX > 35.6, a=0.55. First Competitor: MIN is in {2,4,5,6,7,8,10,11,12,13,15,16,20,21}.
Node 3: Split: To right if MIN is in {1,3,17,21}. First Surrogate: FY > -0.852, a=0.40. First Competitor: POLES is in {2,4}.

FIGURE 12 Classification Tree For The Second Sample.

This may be interpreted as saying that a typical spike is either generated in the extreme left side of the brain, or has a strong negative pole in a region other than the frontal or occipital. The criteria on the location variables FX and MIN are consistent with earlier observations that typical BREC spikes occur predominantly in the Rolandic region (cf. Chapter 1), and the requirement on the magnitude of the negative voltage may also be explained in terms of location, since a superficial spike gives rise to higher voltages than a deep one.
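The partition defining T* amounts to a three-question decision rule and can be transcribed directly; a sketch (the function name and string labels are ours; the thresholds and the MIN set are those of T*):

```python
def classify_T_star(fx, vmin, min_loc):
    """Classification rule associated with the tree T*: classifies a
    spike from its focus coordinate FX, minimum voltage VMIN and
    location of minimum MIN."""
    if fx < -0.805:
        return "typical"                     # node 1
    if vmin < -50.5:
        # nodes 2 and 3: the location of the minimum decides
        return "atypical" if min_loc in {1, 3, 17, 21} else "typical"
    return "atypical"                        # node 4
```

Reading the rule this way makes the interpretation above explicit: a far-left focus, or a strong negative pole outside the frontal and occipital locations, leads to a "typical" classification.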
The lack of symmetry of the conditions on FX and MIN cannot be explained by existing theories and in fact its most likely explanation is a lack of information: the spikes available from the right side may not be sufficient to generate splitting criteria symmetric to those described above. The split generated by the variable POLES in the tree obtained from the first sample appears as a competitor split in the tree obtained from the second sample and is consistent with earlier observations (Gregory and Wong, 1984) that typical spikes generate a dipolar field focalized in the Rolandic region. Therefore, although it is not used as a splitting variable in T*, this variable should be considered as important for the problem. The variables SYNCH and PHASE play a role only in the tree obtained from the third sample, and then only to split small nodes. This indicates that they may not have any classification value, and in fact they were not used for the analyses described in the following Sections. Finally we point out that, except for the similarity between FX and MIN noted earlier, no large association values were noted among surrogate splits. This indicates that none of the variables used are so highly correlated or collinear as to justify the deletion of one of them. Clearly some overlapping information does exist among the location variables, but we believe that this would be best corrected by using more refined location algorithms.

6.5 Using linear criteria

In the second stage of the analysis we also allowed linear splits. On the basis of the results of the preliminary analysis, we used only the second sample for this stage and did not include the variables SYNCH and PHASE. The corresponding optimal tree has four terminal nodes which can be described as follows. Let

L1 = 0.0346(VMAX) - 0.0467(VMIN) - 0.99(FX) ,
L2 = 0.01(VMAX) - 0.834(FX) - 0.551(FY) .
Then:

Node 1: L1 < 2.29 => Atypical (3-35)
Node 2: L1 > 2.29, L2 < 0.844, MIN ∈ {2,5,6,7,8,11,12,13,15,16,20} => Typical (25-7)
Node 3: L1 > 2.29, L2 < 0.844, MIN ∈ {1,3,4,9,10,14,17,18,19,21} => Atypical (5-23)
Node 4: L1 > 2.29, L2 > 0.844 => Typical (32-0).

The misclassification estimates for this tree are summarized in Table 10. The interpretation of this partition is not clear. However we notice that, as in the tree obtained without linear combinations, a spike generated in the extreme left side and with large values of VMAX and VMIN would likely end up in nodes 2 or 4 and be classified as typical. Our conclusion is that the improvement in the misclassification rates achieved by using linear splits is only marginal when compared with the less clear interpretation it requires. Therefore there is insufficient reason to prefer it to the tree T* obtained by standard splits only.

6.6 Variable costs

The next stage of analysis involved the use of variable misclassification costs and, as in the previous Section, we only used the second sample for this stage. By using both methods described in Section 5.7, the resulting trees were only marginally different from T*. Using costs in the ratios 2:1 up to 5:1 in either direction we only obtained splits of some of the existing terminal nodes into nodes too small to be of real value. As expected, misclassification rates became skewed, without an overall improvement. The use of variable costs, however, did lead to an observation of interest for our main question. The tree obtained by using cost-adjusted proportions and a cost ratio of 3:1 in favor of atypical spikes can be obtained from T* by splitting node 3 according to the variable FX. Observations for which FX > 1.32 are classified as typical, the others as atypical. This can be interpreted as a criterion symmetric to the one used in T* to obtain node 1.
In Section 6.3 we observed that the lack of symmetry for that criterion could not be easily explained in terms of biological theories. The fact that such symmetry is obtained once variable costs are used strengthens our hypothesis that the problem may have been generated by insufficient data.

6.7 Classifying the averages

In Chapter 3 we have addressed the question of whether for a single channel the average described in Definition 3.3.4 is an appropriate estimate of the deterministic part of the signal. A related question is whether that averaging procedure, when applied simultaneously to all channels, retains the relevant spatial features of the spike. For instance, if three neighboring channels are strongly affected by the spike, it is possible that only two of them appear as the location of the minimum value in the individual spikes, but the third appears as such a location in the averaged segment. In order to obtain information on this question we applied the classification rule associated with T* to the 26 averaged segments and analyzed its outcome. Table 11 contains, for each set of spikes, the terminal node of T* to which the averaged segment is assigned, the classification it implies, the proportion of spikes of that set which were assigned to the same node, and the correct classification. Agreement between the classification of the averaged segment and the majority of the individual spikes was observed in all sets except the two for which half of the spikes were classified as typical and half as atypical (numbers 25 and 54). We also constructed an optimal tree from the 26 averaged segments. This tree had three terminal nodes defined as follows:

Node 1: FX < -1.07 => Typical (5-0)
Node 2: FX > -1.07, VMIN < -60.2 => Typical (6-2)
Node 3: FX > -1.07, VMIN > -60.2 => Atypical (2-11).

This partition is very similar to that defined by T*, although based on fewer observations.
The conclusion is, therefore, that the averaging process does retain those features which were considered relevant and used in this study. This knowledge could be useful if a similar study is repeated with a larger number of patients for which only the averaged segments have been retained.

6.8 Conclusions

The misclassification rate of the selected tree T* is estimated to be approximately 20% by all the methods used. While this value may not be sufficiently small to justify a definitive statement that the two groups are indeed distinct, other factors suggest that we view it as important evidence in favor of such a statement. In particular, the facts that the rule requires only three criteria (cf. last comment of Chapter 5), that these criteria can be interpreted in physiological terms, and that they are consistent with previous empirical observations suggest that the rule is not a simple artifact of the procedure and the samples used, but is likely to be based on real differences. Hence our conclusion is that the typical and atypical forms of BREC, while similar, should be considered as separate. It is possible that a lower misclassification rate may be obtained by using more refined versions of the explanatory variables used here. For instance, the variables FX and MIN, which play a basic role in the tree, are both estimators of location and, as mentioned in Sections 4.4 and 4.6, one could obtain better estimates by increasing the number of electrodes or by using a more advanced algorithm for locating the focus. The importance of the variable VMIN is supported by the routine use of voltage magnitudes in classical EEG interpretation, so that its use in the classification rule is not surprising. The fact that none of the visually obtained variables is directly used in the tree T* may be a consequence of the fact that they are obtained subjectively and that they only take few values, thus providing few possible splits (cf. Breiman et al.).
In particular, as noted in Section 6.4, the variable POLES seems to be relevant for the problem, so that further analyses of it may be worthwhile. The variables MAX and VMAX generate only surrogate or competitor criteria, but with a non-negligible degree of association with the corresponding best split. Moreover, they are also descriptors of location and strength, the two factors that seem most important to the classification, and, as reported in Section 6.2, the variable VMAX is well correlated with VMIN. Hence they should not be dismissed as irrelevant. A hypothesis suggested by the partition associated with T* is that typical and atypical spikes may be differentiated on the basis of the variability of their focus: atypical spikes may be generated in a more extended region of the brain than the typical ones. This might then be associated with the presence of other neurological abnormalities. To confirm this hypothesis a more precise method of focus localization would be needed and, perhaps, a larger number of patients and of spikes per patient. Finally, we note that it would be of interest to perform a similar analysis using a different peak detection method, or by pooling the information obtained from different detection methods.

6.9 Some comments on CART

As noted in Section 5.1, the problem of classification of BREC spikes was well suited to the CART methodology. The possibility of constructing a classification rule without relying on any models or distributional assumptions was very important, and the finding of a possible nonlinear dependence of the response variable on FX has confirmed its value. Many of the features of CART described in Chapter 5 have been useful in obtaining information on both the problem and the solutions obtained.
For instance, the ability to control the splitting and pruning mechanisms has allowed us to draw reasonable conclusions about the reliability of the trees, while the analysis of surrogate and competing splits has shed light on the value of certain variables and partitions. The main problem we find with this methodology is that its simplicity may lead to the formulation of overconfident conclusions based on its outcome. For instance, our use of three samples has highlighted the possibility of obtaining different information from presumably similar learning samples, and hence the need to check all the available estimates for consistency before drawing conclusions. The commercially available computer package which implements the CART methodology proved to be simple to use, both in interactive and in batch mode, but we did notice some aspects which could be improved. First of all, the absence of a well organized manual, with a list of all possible options, gives rise to the need for extensive initial empirical exploration before the program can be used to its fullest capabilities. Moreover, it makes it mandatory to learn the meaning of all the technical terms from the CART monograph, which is not organized as a manual. Another deficiency is the inability to control the choice of the learning sample from the full set of available information. The program does perform a random choice of subsamples, but it is not clear how the selection is made or whether it is reproducible, nor is it possible to make a stratified choice, as was needed, for instance, in our case. Further, when linear combinations are requested, all possible ordered variables are used, while in some cases one may want to consider linear combinations of certain variables only. One final comment relates to the cross-validation estimates of the misclassification rates.
As mentioned in Section 5.6, these are obtained by constructing auxiliary trees from smaller random subsamples of the learning sample. In order to better understand the importance of the estimates and assess the stability of the tree under consideration, it may be useful to know how similar the auxiliary trees are among themselves, as well as to the tree in question. The problem of a quantitative assessment of the similarity among trees is not addressed in the CART monograph and may be an interesting topic for further research; in the meantime, a useful option for the software would be the possibility of describing the auxiliary trees, so that obvious similarities or dissimilarities may be noticed by the investigator.

REFERENCES

Aczel J., Daroczy Z. (1975) On measures of information and their characterization. Academic Press, New York, N.Y.

Adrian E.D., Matthews B.H.C. (1934) "The Berger rhythm: potential changes from the occipital lobes of man." Brain 57:355.

Becker R.A., Chambers J.M. (1984) S: An interactive environment for data analysis and graphics. Wadsworth Int., Belmont, California.

Breiman L., Friedman J.H., Olshen R.A., Stone C.J. (1984) Classification and regression trees. Wadsworth Int., Belmont, California.

Chatrian G.E. (Chairman), Bergamini L., Dondey M., Klass D.W., Lennox-Buchtal M., Petersen I. (1974) "A glossary of terms most commonly used by clinical electroencephalographers." Electroenceph.Clin.Neurophysiol. 37(5):538-548.

Chung K.L. (1974) A course in probability theory. Second Ed., Academic Press, Orlando, Florida.

Cleveland W.S. (1979) "Robust locally weighted regression and smoothing scatterplots." J.Am.Statist.Assoc. 74:829-836.

Cooper R., Osselton J.W., Shaw J.C. (1980) EEG technology. Third Ed., Butterworths, London.

Dillon W.R., Goldstein M. (1984) Multivariate analysis, methods and applications. John Wiley and Sons Inc., New York, N.Y.

Duffy F., Burchfield J.L., Lombroso C.T.
(1979) "Brain electrical activity mapping (BEAM): a method for extending the clinical utility of EEG and evoked potential data." Annals of Neurology 5(4):309-321.

Glaser E.M., Ruchkin D.S. Principles of neurobiological signal analysis. Academic Press Inc., New York, N.Y.

Gloor P. (1969) "Hans Berger on the electroencephalogram of man. The fourteen original reports on the human electroencephalogram." Electroenceph.Clin.Neurophysiol. Supplement 28.

Gotman J., Gloor P. (1976) "Automatic recognition and quantification of interictal epileptic activity in the human scalp EEG." Electroenceph.Clin.Neurophysiol. 41:513-529.

Gregory D.L., Wong P.K.H. (1984) "Topographical analysis of the centrotemporal discharges in benign Rolandic epilepsy of childhood." Epilepsia 25(6):705-711.

Halliday D., Resnick R. (1970) Fundamentals of physics. John Wiley and Sons Inc., New York, N.Y.

Hodges J.L., Lehmann E.L. (1963) "Estimates of location based on rank tests." Ann.Math.Statist. 34:598-611.

Høyland A. (1965) "Robustness of the Hodges-Lehmann estimates for shift." Ann.Math.Statist. 36:174-197.

Jackson J.H. (1931) Selected writings of John Hughlings Jackson. 2 vols., edited by Taylor J., Hodder and Stoughton, London.

Johnson R.A., Wichern D.W. (1982) Applied multivariate statistical analysis. Prentice-Hall Inc., Englewood Cliffs, New Jersey.

Lehmann E.L. (1975) Nonparametrics: statistical methods based on ranks. Holden-Day Inc., San Francisco, California.

Lehmann E.L. (1983) Theory of point estimation. John Wiley and Sons, New York, N.Y.

Lerman P., Kivity S. (1975) "Benign focal epilepsy of childhood. A follow-up study of 100 recovered patients." Arch.Neurol. 32:261-264.

Lerman P., Kivity S. (1986) "Benign focal epilepsies of childhood." In Recent advances in epilepsy, Number 3, edited by Pedley T.A. and Meldrum B.S., pages 137-156, Churchill Livingstone, New York, N.Y.

Lombroso C. (1967) "Sylvian seizures and midtemporal spike foci in children." Arch.Neurol.
17:52-59.

Miller R.G. Jr. (1986) Beyond ANOVA: basics of applied statistics. John Wiley and Sons, New York, N.Y.

Möcks J., Pham D.T., Gasser T. (1984) "Testing for homogeneity of noisy signals evoked by repeated stimuli." Ann.Statist. 12(1):193-209.

Mood A.M., Graybill F.A., Boes D.C. (1974) Introduction to the theory of statistics. Third Ed., McGraw-Hill Inc., New York, N.Y.

Nayrac P., Beaussart M. (1958) "Les pointes-ondes prérolandiques: expression EEG très particulière. Étude électroclinique de 21 cas." Rev.Neurol. 99:203-205.

O'Donohoe N.V. (1985) Epilepsies of childhood. Second Ed., Butterworths, London.

Patil G.P., Taillie C. (1982) "Diversity as a concept and its measurement." J.Am.Statist.Assoc. 77:548-560.

Pratt J.W. (1959) "Remarks on zeros and ties in the Wilcoxon signed-rank procedures." J.Am.Statist.Assoc. 54:655-667.

Silverman B.W. (1985) "Some aspects of the spline smoothing approach to non-parametric regression curve fitting." J.R.Statist.Soc. B 47(1):1-52.

Weinberg H., Brickett P., Coolsma F., Baff M. (1986) "Magnetic localisation of intracranial dipoles: simulation with a physical model." Electroenceph.Clin.Neurophysiol. 64:159-170.

Appendix: Competing Estimators for the Average

A.1 Trimmed Mean

Let $X_1, X_2, \ldots, X_n$ be (usually i.i.d.) random variables and let $X_{(j)}$, $j = 1, \ldots, n$, be the corresponding order statistics. Then the "a% trimmed mean" $\bar{X}'$ is defined by
$$ \bar{X}' = \frac{1}{n - 2p} \sum_{j=p+1}^{n-p} X_{(j)}, $$
where $p = [na/100]$ is the greatest integer not larger than $na/100$. Since the trimmed mean ignores the $p$ largest and the $p$ smallest values, it is less sensitive to the presence of extreme values, and so it may differ markedly from the average when the distribution is skewed or has heavy tails. In our case, for a given dwell T, $X_j = v(T)_j$, and the corresponding trimmed mean shall be denoted by TM(T).
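The definition of the trimmed mean, together with the choice of a that recovers the sample median (given in Section A.2), can be sketched as follows; this is an illustrative implementation, not the code used in the thesis:

```python
def trimmed_mean(x, a):
    """a% trimmed mean: discard the p = [na/100] smallest and the p largest
    order statistics, then average the n - 2p remaining values."""
    xs = sorted(x)
    n = len(xs)
    p = int(n * a / 100.0)   # greatest integer not larger than na/100
    kept = xs[p:n - p]       # drop p values from each end
    return sum(kept) / len(kept)

def sample_median(x):
    """Sample median as the a% trimmed mean with a = 50(n - par(n))/n,
    where par(n) is 1 for odd n and 2 for even n (Section A.2)."""
    n = len(x)
    par = 1 if n % 2 == 1 else 2
    return trimmed_mean(x, 50.0 * (n - par) / n)
```

With a = 100/6, as used here, this trims one observation from each end when n < 12, two when 12 <= n <= 17, and three when n > 17.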
Given the size of our sets we chose $a = 100/6$, which resulted in trimming one observation from each end for sets of size $n < 12$, three for sets of size $n > 17$ and two for the others. As mentioned in Chapter 3, a strong agreement was observed in all cases between average and trimmed mean, and no pattern was evident in the plots of devt(T) (Figure 7). Although the assumption of independence is violated by our data, we believe that such consistent results suggest the absence of heavy tails in the distribution of the noise.

A.2 Median

Using some of the notation and terminology of the previous section, we may define the "sample median" as the a% trimmed mean with
$$ a = 50\,\frac{n - \mathrm{par}(n)}{n}, $$
where par(n), the parity of $n$, equals 1 if $n$ is odd and 2 if $n$ is even. It is easy to check that this definition is equivalent to the usual one (Lehmann, 1983). It is well known that the sample median is a robust estimator of the population median, and hence it differs from the average when the underlying distribution is not symmetric. As for the trimmed mean, when the median was used as competing estimator no values of devt(T) fell outside of the confidence limits and no peculiar patterns were observed (Figure 8). These results suggest that the distribution of the noise is not skewed.

A.3 Lowess

The two estimators considered so far produce estimates of f(T) based on the values observed at dwell T only. Since, however, we are assuming that f(T) is continuous, it may be interesting to assess the performance of an estimator that also takes into account values observed at surrounding dwells. The "locally weighted scatterplot smoother", abbreviated "lowess" (Cleveland, 1979), achieves this purpose as follows. For each dwell T, consider all points of the form $(T+k, v(T+k)_\bullet)$, where $k = -n, -n+1, \ldots, n$ and $n$ is specified through a smoothness parameter.
A weighted regression line g(t) is fitted to these points according to a weight function w(t) with support on [T-n, T+n], and the value g(T) is used to estimate f(T). The procedure can be generalized to the case when the values of the independent variable are not equally spaced, and to obtain estimates at the endpoints. In order to make the estimates less sensitive to outliers, the procedure is repeated and points which produced large residuals on the first pass are down-weighted on the second iteration. If necessary, the iteration may be repeated until a sufficient amount of smoothness is achieved. By its nature, this method tends to eliminate local oscillations due to noise, but it also tends to lower peaks and raise valleys, a side effect that is not desirable at the approximately 20 dwells during which the spike is seen. After trying several values of the smoothing parameter on individual spikes, we decided that by setting it at 0.02, corresponding to n = 2, we could obtain a satisfactory smoothing effect without excessively deflating the extreme values occurring during the spike. We would therefore expect devt(T) to be within the set limits most of the time, to be positive close to the local maxima of the spike and negative close to its minima. The plots of devt(T) show exactly this behaviour (Figure 9), and in fact even the larger deviations observed during the spike were not far from the limits. Thus our conclusion is that no unusual local phenomena were occurring that would require a more longitudinal method of estimation. Although other smoothers exist with different properties and performances from lowess (cf., for example, Silverman, 1985), the results obtained here did not suggest that it would be worthwhile to compare them to the average, and therefore we did not pursue this point any further.

Table 1: Names and codes of the EEG channels according to the International 10-20 System.
Site   Name                       Code
 1     Left frontal polar         Fp1
 2     Mid frontal polar          Fpz
 3     Right frontal polar        Fp2
 4     Left anterior temporal     F7
 5     Left frontal               F3
 6     Mid frontal                Fz
 7     Right frontal              F4
 8     Right anterior temporal    F8
 9     Left mid temporal          T3
10     Left central               C3
11     Mid central                Cz
12     Right central              C4
13     Right mid temporal         T4
14     Left posterior temporal    T5
15     Left parietal              P3
16     Mid parietal               Pz
17     Right parietal             P4
18     Right posterior temporal   T6
19     Left occipital             O1
20     Mid occipital              Oz
21     Right occipital            O2

Table 2: Background data on the patients.

Patient   BREC      Age in   Sex   Date of
number    type(*)   years          EEG
11        A         11       M     1985
14        T          4       F     1985
15        A         12       F     1985
16        T         10       M     1985
17        T         11       F     1986
20        T          9       M     1986
23        T          8       M     1986
24        T          8       M     1985
25        A         12       M     1986
27        T         14       F     1984
30        T          5       F     1984
33        A         10       F     1986
34        A         13       M     1983
35        A          8       F     1983
36        A         11       F     1982
38        T          5       F     1983
39        T         12       F     1985
42        A         10       M     1983
43        A         10       M     1983
48        A         13       M     1985
50        A          6       F     1985
53        T          5       M     1985
54        T          8       F     1985
60        A          9       F     1986

*: T stands for typical BREC, A stands for atypical BREC.

Table 3: Number of spikes and selected channel for each set of spikes.

Set number   Number of spikes   Channel
11           14                 T4
14            8                 T3
15           12                 T4
16           10                 T3
17           17                 T4
20           12                 T4
23           20                 T3
24           17                 T3
25           12                 T3
27           14                 T3
30           17                 T4
33           19                 T4
34            8                 T4
35            9                 T4
36           13                 C3
37           11                 C4
38           14                 T4
39            9                 T4
40           11                 T3
42           11                 T4
43           10                 T4
48           12                 T4
50           13                 T3
53           18                 T3
54            6                 T3
60           14                 T4

Table 4: Observed values of std* and std̂.

Set number   std*     std̂
11            8.798   11.264
14            3.625    6.617
15           10.308   10.635
16            5.510    6.686
17           10.918   10.047
20           11.332   16.428
23           23.732   29.653
24           21.024   18.357
25           10.506   12.253
27            6.176    9.788
30           30.076   37.455
33            8.725   18.724
34            3.021    4.495
35            5.817    7.959
36           15.622   19.175
37           10.142   13.286
38           11.823   27.917
39            8.413   14.071
40            7.071    6.729
42            7.796   11.637
43            8.166   11.146
48           13.946   14.300
50           16.361   37.398
53           14.004   18.891
54            5.703    6.073
60            2.326    4.275

Table 5: Observed values of D(1208), D(58) and D(935).

Set number   D(1208)   D(58)     D(935)
11            2.107     2.009     1.082
14            2.910     0.611     1.389
15            0.648     0.855    -1.185
16            0.421     0.524    -0.394
17           -1.096    -2.089    -1.967
20            5.252     4.416     4.007
23            5.703     4.549     3.895
24           -2.388    -2.821    -3.655
25            1.812     2.101     0.794
27            3.422     3.334     2.542
30            7.497     6.426     5.227
33            7.391     5.359     6.343
34            1.355     1.588     0.975
35            2.030     1.558     1.436
36            3.784     3.271     2.529
37            3.101     3.810     2.031
38           15.723    15.180    14.241
39            4.242     5.234     3.061
40           -1.031    -0.979    -1.550
42            3.819     2.117     2.165
43            2.844     2.563     1.973
48            0.927     0.775    -0.053
50           20.059    22.072    18.010
53            5.972     6.096     4.575
54            0.608     0.551    -0.343
60            1.808     1.370     1.376

Table 6: Observed values of R1, R2, R3 and R4.

Set number   R1      R2      R3      R4
11           1.306   1.280   1.256   1.253
14           1.952   1.825   1.888   1.815
15           1.046   1.032   1.123   1.063
16           1.292   1.213   0.964   1.077
17           0.932   0.920   0.864   0.898
20           1.435   1.450   1.522   1.451
23           1.248   1.249   1.227   1.248
24           0.880   0.873   0.917   0.886
25           1.162   1.166   1.173   1.170
27           1.614   1.583   1.488   1.547
30           1.242   1.245   1.351   1.258
33           2.235   2.146   1.901   1.866
34           1.494   1.488   1.363   1.466
35           1.359   1.368   1.354   1.358
36           1.194   1.227   1.224   1.246
37           1.322   1.310   1.245   1.311
38           2.245   2.361   2.379   2.435
39           1.736   1.672   1.490   1.519
40           0.900   0.888   0.771   0.849
42           1.492   1.493   1.588   1.533
43           1.341   1.365   1.356   1.356
48           1.002   1.025   1.084   1.068
50           2.281   2.286   2.208   2.248
53           1.292   1.349   1.468   1.460
54           1.055   1.069   1.246   1.114
60           1.872   1.838   1.786   1.796

Table 7: Observed values of Shannon's measure d(P,i).

Patient   P1      P2      P3      P4
11        1.093   1.291    .683   1.909
14         .974   1.256   1.040   1.040
15         .287    .566    .888    .837
16         .000    .500    .000    .500
17         .224    .000    .000    .872
20         .000    .000    .000    .983
23         .647    .791    .693    .824
24        1.253    .709    .224    .998
25         .287    .287    .000   1.350
27         .257    .000    .000    .509
30         .678   1.115    .362   1.447
33         .681    .681    .681    .681
34         .661   1.040    .562   1.494
35        1.149    .349   1.310   1.465
36         .000    .000    .000    .536
37         .000    .000    .000    .600
38         .000   1.227    .000    .755
39         .530    .687    .937    .965
40         .995   1.241    .760    .995
42        1.294   1.367   1.034   2.146
43        1.089   1.471    .639   1.696
48         .562    .562    .693   1.028
50        1.266   1.306    .859   1.413
53         .215    .655    .000    .557
54         .000    .000   1.330    .000
60         .992   1.116   1.240    .956

Table 8: Values of Sm, m=1,2,3,4.

Set   S1        S2        S3        S4
11     236.39    219.67    241.37    259.87
14     338.99    338.31    300.12    341.73
15     450.86    618.74    615.88    531.68
16      89.59     86.43     85.60     56.26
17     787.61    548.37    603.13    666.45
20     386.07    743.06    743.06   1103.00
23    2677.25   2974.01   2715.75   2866.13
24    1021.73   1166.42   1243.55    780.99
25     380.28    430.82    410.20    374.80
27     513.76    383.08    383.08    426.42
30    2142.47   1506.67   2430.54   1556.05
33    5714.73   3710.12   5594.46   3687.68
34      92.15     70.95     40.78     67.57
35     138.85    332.25    200.88    233.55
36     526.48    412.23    412.23    498.97
37     331.66    308.03    308.03    343.18
38    2048.98   3589.68   2814.34   3086.75
39     528.21    510.21    536.56    558.23
40     394.05    741.33    666.51    613.53
42     210.53    114.85     93.06     79.66
43     177.53    101.84    138.87     99.64
48     336.21    609.04    581.60    656.75
50    7007.37   4932.08   3967.40   3631.44
53     625.29    694.33    677.37    627.56
54     217.46     57.30    124.82     71.83
60     102.43    143.90     71.67    157.48

Table 9: Values of |Sm|, m=1,2,3,4.

Set   |S1|     |S2|     |S3|     |S4|
11    .01602   .01882   .00367   .01177
14    .40865   .40260   .44551   .50112
15    .00612   .01498   .01299   .02031
16    .01643   .02350   .02139   .03021
17    .02310   .00428   .00752   .01191
20    .08577   .19828   .19828   .27689
23    .06433   .04567   .04580   .06887
24    .01097   .00560   .00242   .00243
25    .01511   .00547   .00590   .00975
27    .00493   .00391   .00391   .00897
30    .06654   .25288   .03649   .09297
33    .02196   .01928   .01692   .02111
34    .05149   .03665   .03828   .05730
35    .02367   .01985   .02018   .02459
36    .00459   .00323   .00323   .01084
37    .01549   .01128   .01128   .01140
38    .00795   .27581   .01294   .02893
39    .00152   .00422   .00529   .00756
40    .01438   .01505   .01186   .02020
42    .03083   .03012   .02548   .04653
43    .02920   .06042   .03811   .03238
48    .00143   .00515   .00599   .00464
50    .00466   .01024   .00340   .00583
53    .01267   .00758   .00638   .00855
54    .00006   .00186   .23847   .00258
60    .05710   .10772   .09963   .05678

Table 10: Results of preliminary CART runs.

                                 Sample 1   Sample 2   Sample 3   Sample 3    Sample 2
                                                                  (4 nodes)   (lin. spl.)
Number of terminal nodes         15
Complexity parameter             0.0192     0.0346     0.00387    0.0250      0.0154
Cross-validated estimates of M(d):
  Class 0                        0.38       0.23       0.17       0.38        0.29
  Class 1                        0.40       0.22       0.21       0.38        0.22
Resubstitution estimates of M(d):
  Class 0                        0.15       0.12       0.03       0.08        0.12
  Class 1                        0.23       0.22       0.05       0.34        0.11
Full sample estimates of M(d):
  Class 0                        0.15       0.14       0.10       0.14        0.14
  Class 1                        0.26       0.25       0.23       0.48        0.17

Table 11: Analysis of the classification of averaged segments.

Set   Number of   Type   Terminal node    Classification   Spikes assigned
      spikes             of the average   of the average   to the same node
11    14          A      4                A                14
14     8          T      4                A                 7
15    12          A      4                A                 7
16    10          T      1                T                 8
17    17          T      2                T                11
20    12          T      2                T                11
23    20          T      2                T                14
24    17          T      1                T                13
25    12          A      1                T                 5
27    14          T      1                T                14
30    17          T      2                T                17
33    19          A      3                A                18
34     8          A      4                A                 8
35     9          A      4                A                 9
36    13          A      2                T                 9
37    11          A      4                A                10
38    14          T      2                T                13
39     9          T      2                T                 7
40    11          T      1                T                 6
42    11          A      4                A                11
43    10          A      4                A                10
48    12          A      2                T                10
50    13          A      2                T                 8
53    18          T      1                T                16
54     6          T      4                A                 3
60    14          A      4                A                10