A User-Customized Self-Paced Brain Computer InterfacebyHossein BashashatiB.Sc., University of Tehran, 2008M.Sc., University of Tehran, 2011A THESIS SUBMITTED IN PARTIAL FULFILLMENTOF THE REQUIREMENTS FOR THE DEGREE OFDoctor of PhilosophyinTHE FACULTY OF GRADUATE AND POSTDOCTORALSTUDIES(Electrical and Computer Engineering)The University of British Columbia(Vancouver)April 2017c© Hossein Bashashati, 2017AbstractMuch attention has been directed towards synchronous Brain Computer Interfaces (BCIs).For these BCIs, the user can only operate the system during specific system-defined periods.Self-paced BCIs, however, allow users to operate the system at any time he/she wishes. Theclassification of Electroencephalography (EEG) signals in self-paced BCIs is extremelychallenging, as the BCI system does not have any clue about the start time of a control task.Also, the data contains a large number of periods during which the user has no intention tocontrol the BCI.For sensory motor self-paced BCIs (focus of this thesis), the brain of a user goes throughseveral well-defined internal state changes while performing a mental task. Designing clas-sifiers that exploit such temporal correlations in EEG data can enhance the performance ofBCIs. It is also important to customize these BCIs for each user, because the brain charac-teristics of different people are not the same.In this thesis, we first develop a unified comparison framework to compare the per-formance of different classifiers in sensory motor BCIs followed by rigorous statisticaltests. This study is the largest of its kind as it has been performed on 29 subjects of syn-chronous and self-paced BCIs. We then develop a Bayesian optimization-based strategythat automatically customizes a synchronous BCI based on the brain characteristics of eachindividual subject. Our results show that our automated algorithm (which relies on less so-phisticated feature extraction and classification methods) yields similar or superior resultsiicompared to the best performing designs in the literature.We then propose an algorithm that can capture the time dynamics of the EEG signalfor self-paced BCI systems. We show that this algorithm yields better results comparedto several well-known algorithms, over 13 self-paced BCI subjects. Finally, we propose afully automatic, scalable algorithm that customizes a self-paced BCI system based on thebrain characteristics of each user and at the same time captures the dynamics of the EEGsignal. Our final algorithm is an important step towards transitioning BCIs from researchenvironments to real-life applications, where automatic, scalable and easy to use systemsare needed.iiiPrefaceThis thesis presents the research conducted by Hossein Bashashati, in collaboration withProf. Rabab K. Ward, Dr. Ali Bashashati, Dr. Gary Birch and Dr. Amr Mohamed.Below is the list of the scientific papers by Hossein Bashashati that were published (orsubmitted) through the course of his studies as a Doctor of Philosophy (PHD) candidate atUniversity of British Columbia (UBC).J1 - H. Bashashati, R. K. Ward, G. H. Birch and A. Bashashati, Comparing DifferentClassifiers in Sensory Motor Brain Computer Interfaces, PloS one. 2015 Jun 19.J2 - H. Bashashati, R. K. Ward and A. Bashashati, User-Customized Brain ComputerInterfaces Using Bayesian Optimization, Journal of neural engineering, 2016 Jan 29.C1 - H. Bashashati, R. K. Ward and A. Bashashati, Hidden Markov Support VectorMachines for Self-Paced Brain Computer Interfaces, International Conference on MachineLearning Applications (IEEE ICMLA), 2015.C2 - H. Bashashati, R. K. Ward and A. Bashashati, Bayesian Optimization of BCIparameters, IEEE Canadian Conference on Electrical and Computer Engineering (IEEECCECE), 2016C3 - H. Bashashati, R. K. Ward, A. Bashashati and A. Mohamed, Neural NetworkConditional Random Fields for Self-Paced Brain Computer Interfaces, Accepted in Inter-national Conference on Machine Learning Applications (IEEE ICMLA), 2016.Hossein Bashashati (primary author and the main contributor of the papers) proposedivthe algorithms, implemented them, and performed the evaluations.Dr. Rabab K. Ward supervised the development of work, and provided technical feed-back as well as editorial comments throughout writing of all the papers (J1, J2, C1, C2 andC3).Dr. Ali Bashashati provided technical feedback as well as editorial comments through-out writing of the papers (J1, J2, C1, C2 and C3).Dr. Gary Birch helped in manuscript writing for J1. Dr. Amr Mohamed helped inmanuscript writing for C3.A version of Chapter 2 appeared in J1. A version of Chapter 3 appeared in J2 and C2.A version of Chapter 4 appeared in C1 and C3.vTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ixList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xiiiAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Different Neurological Phenomena . . . . . . . . . . . . . . . . . . . . . . 21.2 Self-Paced vs. Synchronous BCIs . . . . . . . . . . . . . . . . . . . . . . 41.3 Block Diagram of a BCI . . . . . . . . . . . . . . . . . . . . . . . . . . . 51.4 Properties of EEG Signal . . . . . . . . . . . . . . . . . . . . . . . . . . . 61.5 Motivations and Contributions . . . . . . . . . . . . . . . . . . . . . . . . 82 A Unified Comparison Framework for Sensory-Motor BCIs . . . . . . . . . . 152.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152.2 General Structure of Comparison Framework . . . . . . . . . . . . . . . . 17vi2.2.1 Spatial and Frequency Filtering . . . . . . . . . . . . . . . . . . . . 182.2.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 192.2.3 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202.2.4 Performance Measure . . . . . . . . . . . . . . . . . . . . . . . . . 242.2.5 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252.2.6 Statistical Tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403 Customizing Brain Computer Interfaces . . . . . . . . . . . . . . . . . . . . . 413.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 413.2 Bayesian Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433.2.1 Modeling Posterior Distribution . . . . . . . . . . . . . . . . . . . 443.2.2 Acquisition Function . . . . . . . . . . . . . . . . . . . . . . . . . 473.3 Proposed Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 483.3.1 Learning Algorithm (Objective Function) . . . . . . . . . . . . . . 503.3.2 Selecting Hyper-Parameters in BCIs . . . . . . . . . . . . . . . . . 523.3.3 Results Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . 533.3.4 Termination of the Algorithm . . . . . . . . . . . . . . . . . . . . . 553.3.5 Time Complexity of the Algorithm . . . . . . . . . . . . . . . . . . 553.4 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563.4.1 Comparing with Other Hyper-Parameter Search Algorithms . . . . 573.4.2 Comparing the Results with Literature . . . . . . . . . . . . . . . . 633.4.3 Effect of Number of Optimization Iterations . . . . . . . . . . . . . 663.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68vii4 Discriminative Sequence Learning Algorithms for Self-paced BCIs . . . . . . 714.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 714.2 EEG Sequence Labeling with Hidden Markov Models . . . . . . . . . . . . 744.3 Discriminative Sequence Labeling Algorithms . . . . . . . . . . . . . . . . 754.3.1 Hidden Markov Support Vector Machines . . . . . . . . . . . . . . 764.3.2 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . . 784.3.3 Neural Network Conditional Random Fields . . . . . . . . . . . . . 804.4 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 824.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 845 Ensemble of Discriminative Sequence Labeling Classifiers for Self-Paced BCIs 865.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 865.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 895.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 926 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 936.1 Summary of Our Contributions . . . . . . . . . . . . . . . . . . . . . . . . 936.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100viiiList of TablesTable 2.1 List of classifier parameters that were optimized using cross-validationin the training phase. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27Table 2.2 Specification of datasets used in this thesis. . . . . . . . . . . . . . . . . 28Table 2.3 The accuracy of classifiers for synchronous BCI operation over all sub-jects. The accuracies are calculated using the unseen test dataset. Foreach classification algorithm the first column shows the results of BPfeatures and the second column shows the results of Morlet features. . . 32Table 2.4 The Area Under the Curve (AUC) of classifiers for self-paced subjects.For each classification algorithm the first column shows the results ofBP features and the second column shows the results of morlet features. . 34Table 2.5 Average Rankings of the classification algorithms for both synchronousand self-paced datasets. The number in the parenthesis corresponds tothe average rank of the algorithm among different subjects. For eachfeature extraction method the classifiers typed in bold are the recom-mended ones. The recommended classifiers are selected based on theresults of the statistical tests. . . . . . . . . . . . . . . . . . . . . . . . . 36ixTable 2.6 P-values corresponding to pairwise comparison of different classifiers.α is chosen to be 0.1. For settings 1 and 2 all hypothesis with p-valueless than 0.0333 are rejected. For setting 3 and 4 all hypothesis with p-value less than 0.05 are rejected. The results are rounded up to 4 decimalplaces. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37Table 2.7 Average Rankings of the classification algorithms for binary and multi-class classification in synchronous datasets. The number of subjects inbinary task was 12 and the number of subjects in multi-task BCIs was9. The number in the parenthesis corresponds to the average rank of thealgorithm among different subjects. For each feature extraction methodthe classifiers typed in bold are the recommended ones. The recom-mended classifiers are selected based on the results of the statistical tests. 38Table 2.8 P-values corresponding to pairwise comparison of different classifiers.α is chosen to be 0.1. For binary task BCIs with BP features all hypoth-esis with p-value less than 0.02 are rejected. For multi-task BCIs withBP features all hypothesis with p-value less than 0.0333 are rejected.For binary task BCIs with Morlet features all hypothesis with p-valueless than 0.1 are rejected. For multi-task BCIs with Morlet features allhypothesis with p-value less than 0.025 are rejected. The results arerounded up to 4 decimal places. . . . . . . . . . . . . . . . . . . . . . . 39Table 3.1 Accuracy of each algorithm with Morlet features. For each search al-gorithm we used MLR, voting (VOTE) and averaging (AVG) for aggre-gating the results of the classifiers. MIN is the result of the best level 1classifier in optimization phase. Highlighted cells show the configura-tion for which the performance is the best. . . . . . . . . . . . . . . . . 59xTable 3.2 Accuracy of each algorithm with BP features. For each search algorithmwe used MLR, voting (VOTE) and averaging (AVG) for aggregating theresults of the classifiers. MIN is the result of the best level 1 classifier inoptimization phase. Highlighted cells show the configuration for whichthe performance is the best. . . . . . . . . . . . . . . . . . . . . . . . . 60Table 3.3 Average ranking and average accuracy of the optimization algorithmswith (a) Morlet features and (b) BP features. . . . . . . . . . . . . . . . 62Table 3.4 P-values corresponding to pairwise comparison of different hyper-parametersearch algorithms (a) Morlet features and (b) BP features. α is chosen tobe 0.05. Bergmann’s procedure rejects hypotheses 9 and 10 for the Mor-let features. For the BP features, hypotheses 7, 8, 9 and 10 are rejected.The results are rounded up to 3 decimal places. . . . . . . . . . . . . . . 64Table 3.5 Accuracy of our algorithms compared to the results reported in the lit-erature for dataset III3b. Highlighted cells show the best performingalgorithm. The text inside the parenthesis corresponds to the originallabel of the subjects used in the BCI competition. . . . . . . . . . . . . . 64Table 3.6 Accuracy of our algorithms compared to the results reported in the lit-erature for dataset IV2b. Highlighted cells show the best performingalgorithm. The text inside the parenthesis corresponds to the originallabel of the subjects used in the BCI competition. . . . . . . . . . . . . . 65Table 3.7 Accuracy of our algorithms compared to the results reported in the lit-erature for dataset IV2a. Highlighted cells show the best performingalgorithm. The text inside the parenthesis corresponds to the originallabel of the subjects used in the BCI competition. . . . . . . . . . . . . . 65xiTable 4.1 The results of comparing AUC of different algorithms on the test dataset.Highlighted cells show the algorithm for which the performance is thebest. The last row shows the average rank of different classifiers acrossall subjects. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83Table 5.1 The results of comparing different classifier after performing Bayesianoptimization on the test dataset. MIN is the results of applying the bestperforming classifier in optimization phase on the test data. AVG cor-responds to the results of the ensemble of classifiers. Highlighted cellsshow the algorithm for which the performance is the best. . . . . . . . . 90xiiList of FiguresFigure 1.1 (a) Self-paced BCI in which the user can control the system at anytime (b) Synchronous BCI in which the user is restricted to control thesystem at system-defined periods. . . . . . . . . . . . . . . . . . . . . 4Figure 1.2 A simple block diagram of an EEG-based BCI. . . . . . . . . . . . . . 6Figure 1.3 Our proposed hyper-parameter optimization in BCIs. . . . . . . . . . . 12Figure 2.1 The BCI framework. Our framework has three main steps: 1. Filtering,2. Feature Extraction and 3. Classification. The output of each step isfed to the next step. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17Figure 3.1 An example of sequences of trials in BCI calibration phase in whicheach trial is 8s. In the first trial, the subject performed a left hand (L1)imagery movement followed by a No-Control (NC1) interval of 4s. Inthe second trial, the subject performed a right hand imagery movement(R1) followed by a 4s NC interval. . . . . . . . . . . . . . . . . . . . . 51Figure 3.2 Accuracy of our algorithm with the BP features versus the number ofiterations. The label of cross validation curves finish with ’ CV’ andthe label of test data accuracies finish with ’ Test’. . . . . . . . . . . . . 67xiiiFigure 3.3 Accuracy of our algorithm with the Morlet features versus the numberof iterations. The label of cross-validation curves finish with ’ CV’ andthe label of test data accuracies finish with ’ Test’. . . . . . . . . . . . . 68Figure 5.1 Diagram of our optimization algorithm at iteration t. . . . . . . . . . . 88Figure 5.2 Average AUC of different settings for each classifier across all subjects.The blue bars correspond to the average AUC when the default valueof the hyper-parameters have been used. The red bars correspond tothe average AUC after using Bayesian optimization. The green barscorrespond to the average AUC when we used an ensemble of differentclassifiers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91xivAcknowledgmentsI would like to express my sincere gratitude to my supervisor, Dr. Rabab K. Ward. Shehas provided me with full support and encouragement throughout my studies. She has al-ways provided time, helpful advice and guidance, and displayed great understanding duringchallenging times.Finally, I would like to express my deepest gratitude to my parents and siblings, fortheir understanding, love, encouragement and many years of support.This work was funded by the National Science of Engineering Research Council ofCanada (NSERC) and Qatar National Research Fund (QNRF).xvChapter 1IntroductionA Brain Computer Interface (BCI) aims at detecting the presence of specific patterns ina person’s brain activity. These patterns relate to the user’s intention to control a device[102]. If such patterns are detected in the brain waves, then the BCI issues specific signalsto control the device.Traditionally, the most popular application of BCIs has been to assist people with dif-ferent kinds of motor impairments to communicate with the external world [78]. A BCI asa non-muscular communication channel allows disabled people to control different assis-tive devices. Besides rehabilitation applications, many new BCI applications have emergedrecently [101] [17] [1]. These applications include playing video games, mental state mon-itoring, training and education, cognitive biometrics etc.There are few different methods to record the brain activity of a person. Not all ofthese however are practical to use in real life applications. There are at least three factorsthat determine the practicality of BCIs. These factors include risk in use, signal temporalresolution and device portability. Invasive BCIs such as (implanted) electrocorticography-based ones record the brain activity from within the skull. Although the recorded signal hasa high signal to noise ratio and is less prone to artifacts, the practicality of these methods1is limited because of the need for the surgical installation of the electrodes inside the skull(risk of use) [66]. Magnetoencephalography and functional Magnetic Resonance Imagingare non-invasive but bulky, non-portable, and have very low temporal resolution.Among the various means to measure the brain activity, the non-invasive EEG-basedBCI systems can measure the changes in the brain activity within milliseconds (high tem-poral resolution); they are inexpensive, non-invasive and portable [75]. However, they alsohave several drawbacks. The signal to noise ratio of the EEG signal is low, i.e., the signalshave very low amplitude (i.e. about 10 to 100 micro volts) compared to the backgroundnoise. They are also contaminated by artifacts. Therefore, detecting the intentions of aperson from his/her measured EEG brain signals is a challenging task and has been at theforefront of research. Furthermore, the resolution of the EEG brain signals referred to asthe spatial resolution is low and dependent on the number of electrodes that could be placedon the subject’s head.1.1 Different Neurological PhenomenaPresent-day EEG-based BCIs operate based on four major electrophysiological activitiesof the brain [75].Visual Evoked Potential (VEP) are typically generated in the EEG signal in responseto a visual stimulus while the user is looking at the screen. An example of such a BCI isfrequency-VEP in which a set of letters are flashing at different frequencies. When the usergazes at a letter, the BCI recognizes the intended letter. This type of BCI does not need anytraining but the drawback is that the user should stare at the screen while operating the BCI.Therefore, it is not suitable for some users such as patients at late stages of ALS disease[13].P300 evoked potentials are generated in response to an irregular stimulus amongst sev-eral different stimuli. 300 milliseconds after the appearance of an infrequent stimulus a2positive peak appears in the EEG signal of the subject. An example of such a BCI is whena grid of letters is shown to the subject. Each time one of the rows or columns of the gridflashes. The subject focuses on the letter he/she wants to communicate. When the row orthe column containing the character flashes, a P300 evoked potential is generated [37].Like VEP-based BCIs, P300 BCIs do not need user training. However, their need forexternal stimulus restricts their applicability. Another drawback is that after sometimeof using a P300-based BCI, the user gets accustomed to the infrequent stimulus and thedetection of the peak becomes difficult [84]. Also, during the whole operation of the BCI,the user should stare at the screen [40].Slow Cortical Potentials (SCP) are slow voltage changes in the EEG generated bythe subject. SCP usually lasts between 0.5 to several seconds. Subjects need training toself-regulate these SCPs, and usually the training needs up to several months of practice.Self-regulation performance varies for different subjects and some subjects might not beable to master SCP control. Besides, many factors such as the physiological and physicalstate of the subjects can affect the performance of the SCP-based BCIs [48].Sensory-motor BCIs rely on the changes in sensory-motor rhythms of the brain. Sensory-motor rhythms consist of the mu, beta and gamma rhythms of the brain. The mu rhythmsof the brain are generally in the range 8-12 Hz, the beta band is 13-30 Hz and the gammais in the range 30-100 Hz.Among the different ways to operate a BCI, those that rely on the sensory motorrhythms have been of special interest, e.g. [71] [58] [89]. Unlike VEP-based and P300-based BCIs, these kinds of BCIs do not need any stimuli from outside to control the system,and the subject can learn to generate the appropriate brain pattern to control a device vol-untarily. The training process for sensory-motor based BCIs is not lengthy like SCP-basedBCIs. Of particular interest is that a motor imagery activity (i.e., an imagined movement)generates patterns in the EEG that are similar to those generated by actual movements [70].3In addition, BCIs that are motor imagery-based are especially suitable for disabled peoplewho do not have control over their limbs [59].Figure 1.1: (a) Self-paced BCI in which the user can control the system at any time(b) Synchronous BCI in which the user is restricted to control the system atsystem-defined periods.1.2 Self-Paced vs. Synchronous BCIsBCI systems can be classified into two categories: synchronous and self-paced systems [69][9]. In synchronous BCIs, subjects control the BCI output during specific short periods.That is, the user can only issue a control command when he is prompted to operate thesystem, therefore, the onset of the mental task is known beforehand. Figure 1.1 part (b)shows a synchronous BCI system.Self-paced BCIs, on the other hand give the users the option of controlling the systemwhenever they wish to. The periods during which the user is controlling the system arecalled control states and those during which the user is not controlling the system are calledNo-Control (NC) states. During the NC states, the BCI system should not issue any controlsignal i.e. a false positive output should not be produced. In Figure 1.1 part (a) a self-paced4BCI is depicted.Compared to synchronous BCI systems, designing self-paced BCIs remains extremelychallenging. The challenges faced by self-paced BCIs stem from the following two prop-erties of these BCIs:1) In self-paced BCIs, the user can control the system at any time. Therefore, the BCIsystem does not have any clue about the start time of a control command. Also, the BCIshould analyze and classify the EEG data at all times. This makes the design and evaluationof these BCIs more challenging.2) During the NC periods, the user can do whatever he/she wants and can be in anymental state (except for those intended to control the BCI). Therefore the vast majorityof EEG signals is formed of NC periods. This means that self-paced BCI data contains alarge variety and number of periods during which the user has no intention to control theBCI. This makes it difficult to find specific patterns in the EEG during NC periods andtherefore classification and controlling false positives associated to NC states become verychallenging.1.3 Block Diagram of a BCIFigure 1.2 shows a simple block diagram of the signal processing part of an EEG-basedBCI. The main two components are Feature Extraction and Classification.5Feature Extraction ClassificationElectrodesDeviceFigure 1.2: A simple block diagram of an EEG-based BCI.The aim of the feature extraction block is to extract useful information i.e. specificpatterns from the brain signal. The features capture the characteristics of the brain signalswhich are then used by the classification block to distinguish different mental tasks. Differ-ent mental tasks produce different patterns in the brain signals. A good feature extractionalgorithm should be able to capture the information so that distinguishing between differentmental tasks becomes easy.The features should be able to capture time information, as to when certain changes inthe brain signal occur. Frequency information should be considered because many mentaltasks are frequency specific. Also, spatial information should be exploited because differentmental tasks are related to different parts of the brain.The aim of the classification block is to use the extracted features to detect the inten-tion of the subject. Classifiers are calibrated using a small set of collected samples in thetraining sessions. The calibrated classifier is used in the test sessions on unseen data. Sev-eral different classifiers have been applied in different BCI studies. In general, simpleralgorithms are preferred to more complex algorithms [8][75].1.4 Properties of EEG SignalThe following properties of the EEG signal makes classification of the data challenging:6Data is noisy and contains outliers. The amplitude of the EEG signal is small andin the order of µV s, and the signal is usually contaminated with different types of noise(or artifacts). The noise in the EEG signal can have physiological sources such as elec-trooculogram and electromyogram signals or it might originate from outside sources suchas power-line noise [38].Specific variations in the EEG signal over time should be taken into account. Thesevariations might happen in specific frequency bands. When a subject performs a sen-sory motor mental task, the amplitude of the signal changes in specific frequency bands.It is worth noting that these frequency bands vary from one person to another i.e. they aresubject specific [102]. Therefore, besides the time information in the EEG signal, the fre-quency information should also be taken into consideration when designing feature extrac-tion and classification algorithms. Equally important is that the BCI should be customizedfor each person.The EEG is of high dimensions. The EEG signal is collected by several electrodeslocated on the surface of the scalp. Each electrode is referred to as an EEG channel. De-pending on the mental task being performed, some locations on the scalp might providemore discriminative information than the other locations (for classifying the mental tasks).It is usually important to apply channel reduction algorithms to reduce the number of sig-nals we are dealing with; this reduces the dimension of the feature space. It has beenshown that the performance of the BCI is affected by the choice of the selected channels[45][5][56].The EEG signal is non-stationary. The statistical properties of the EEG signal changeover time [57]. These changes originate from different causes such as changes in the mentalstate of the subject, etc. Subsequently, the BCI needs to adapt according to the mental stateof the subject.BCI datasets are usually very small. Since collecting data and subject training are7very time consuming processes, the available EEG datasets are usually very small. Thismakes it impossible to apply some powerful classifiers e.g. deep learning [62] to this data.1.5 Motivations and ContributionsEven though much interest and progress have been made during the past two decades inBCI research, this field is still at its infancy. Many improvements are still necessary inorder to encourage its widespread adoption. Current BCI systems are not accurate enoughand are far from ideal for use in on-line practical settings. Particularly, for noninvasiveBCIs which are by far the most widely used BCIs, the characteristics of the recorded signal(e.g. the low signal to noise ratio, susceptibility to artifacts, etc.) make the extraction ofpeople’s intentions from their brain waves a challenging task. Therefore, the accuracy ofBCI systems has not yet reached the high level required for their use in day-to-day humanlife applications. This is especially true for self-paced BCIs, which are the more naturalway to operate a BCI.The main aim of this thesis is to design a sensory-motor based self-paced BCI. BCIsthat depend on sensory-motor rhythms are convenient because no stimulus is required inthe BCI control and there is no need for the constant involvement of the sensory modalitiessuch as vision or auditory systems. Because of the challenges in designing a self-pacedBCI, it is imperative to exploit the properties of the BCI data as much as possible. Asdiscussed in section 1.4, while the subject is performing the sensory motor mental task,the variations over time in the EEG signal in specific frequency bands should be taken intoconsideration. These frequency bands are the sensory motor (i.e. alpha, beta and gamma)frequencies of the brain.When a person performs a motor activity such as a limb movement, the amplitude of thesensory motor rhythms in the motor cortex of the brain changes. Generally such changesmay result in a decrease or increase in the amplitude of the EEG signal in the sensory motor8frequency bands of the brain [77]. The decrease in the amplitude is called Event RelatedDesynchronization (ERD), and the increase is referred to as Event Related Synchronization(ERS).For the mu rhythm, ERD begins shortly before movement onset and continues for fewseconds. For the beta rhythm, after a short ERD, the ERS which is the increase in theamplitude of the EEG signal occurs and lasts for a few seconds after the occurrence ofthe movement. In addition to alpha and beta rhythms, the gamma rhythm also exhibitsamplitude changes during the onset of the movement.The other important property is that the brain characteristics of each person is differentfrom that of others. These brain characteristics are fed to the BCI system in the form ofsome parameters referred to as hyper-parameters. For instance, the frequency ranges of thealpha and beta waves of the brain are some of the hyper-parameters of the system. Theoptimal frequency bands of these waves that discriminate between different sensory motormental tasks vary between different subjects. The value of these frequencies should beidentified for each individual person.Briefly, in this thesis, due to the lack of suitable benchmarks we first design and developa general BCI comparison framework (aim 1). Using this framework we then develop auser-customized synchronous BCI to show the importance and feasibility of customizingBCI based on each subject’s brain characteristics (aim 2). Then we propose new classifiersthat are able to capture the variations over time of the EEG signal in self-paced BCIs (aim3). Finally, based on the frameworks and findings from aims 1-3, we design and developa user-customized self-paced BCI (aim of this thesis) that exploits the time and frequencyfeatures of the EEG signal.In the following we briefly describe the aims and contributions in this thesis:9Aim 1) Design and Implementation of a General BCI Comparison Framework.Due to the large variations in BCI technologies and the lack of a general comparison frame-work in BCIs, it is almost infeasible to compare existing BCI technologies. Consequently,the progress in the field has been slow, from the signal processing and machine learningpoint of view. There are some general purpose software systems [87] [85] that facilitate theresearch in BCIs. However, these systems are mostly suitable for data acquisition and forreal time processing of brain signals.In the first part of this thesis (Chapter 2), we design and develop a general frameworkto compare the performance of different algorithms in sensory-motor-based BCIs. Thenumber of subjects considered in our experiments is 29 which is by far the largest study ofits kind in this field. Other studies have been performed on smaller sample sizes and as aresult, their findings may not be generalized for a larger subject pool.We perform rigorous statistical tests to compare the performance of different algo-rithms. This enables us to examine whether or not there is significant differences betweenthe performances of different classifiers. Other studies have not employed rigorous statis-tical tests for comparisons. This is mainly due to the smaller sample sizes they employed.They have relied on statistical tests that are not suitable for comparing several algorithmson multiple datasets.We believe that our work provides a more comprehensive comparison between well-known BCI settings and will have profound impact in the progress of the BCI field.Aim 2) Design and Implementation of a User-Customized BCI for SynchronousBCIs.As discussed in section 1.4, one of the characteristics of the sensory-motor BCI data isthat the changes in EEG signal are frequency-specific. ERD and ERS occur in specificfrequency bands which are different for each subject. The majority of approaches have10treated the problem of finding optimal values for these frequency bands (and other hyper-parameters of the system) separately from the methods used in the feature extraction andclassification blocks. These approaches usually need considerable knowledge and expertisein EEG signal, and do not take into consideration the interconnection between preprocess-ing, feature extraction and the classification parts of a BCI.An optimal learning framework for a BCI problem should aim at studying preprocess-ing of the data, feature extraction and classification jointly, considering the fact that theperformance of the classifier depends on the choice of the features and the frequency rangesthat are used to filter the raw EEG signal. In other words, we should customize the clas-sifier and the feature extractor jointly i.e. based on each other. For instance, selecting aparticular feature extraction method might cause the data samples of different brain statesto be linearly separable in the feature space. As a result, selecting a linear classifier wouldbe a better choice for discriminating between the brain states.Another problem in the BCI literature is that the majority of the approaches in the field,have treated frequency, time, and the spatial dimensions of the BCI separately. The choiceof the frequency ranges are important in the accuracy of the BCI, but so is the choice ofthe channels used for feature extraction. Usually the time segment from which features areextracted is also important in order to achieve maximum discrimination. The BCI systemshould exploit the interdependency between these dimensions to yield better performance.In Chapter 3, we propose an automatic algorithm that jointly optimizes the values of theBCI hyper-parameters for two different types of synchronous motor-imagery-based BCIs.Our algorithm does not rely on the operator expertise and knowledge of EEG to obtain thevalues of the hyper-parameters. It iteratively tunes the hyper-parameter values based on theperformance of the BCI.Figure 1.3 shows a simple block diagram of our algorithm. In our approach the BCI islike a self-regulating system which improves itself based on the feedback that it receives11from the classification block. The performance of the classification block is given to thehyper-parameter optimization block, and this block proposes new values for the hyper-parameters for each iteration of the algorithm. The hyper-parameter values are used toextract features from the EEG signal, and a new classifier is trained using the featuresextracted from the training data. Based on the cross-validation performance of the classi-fier, the hyper-parameter optimization block proposes another set of values, and the wholeprocess continues until the algorithm converges.Feature Extraction ClassificationElectrodesDeviceHyper-parameter optimizationAccuracyHyper-parameter values, e.g. Frequency bands, time segment, channelsFigure 1.3: Our proposed hyper-parameter optimization in BCIs.We show the importance of hyper-parameter optimization using 21 subjects from publicBCI competition datasets. We then compare our algorithm to other approaches used forfinding the hyper-parameter values, and finally we compare our method to the best reportedresults in the literature.Aim 3) Using Time Information in Designing a Self-Paced BCI.Another characteristic of the BCI data is that the changes in EEG signal are time-specific.As discussed in the previous sections, the changes in the EEG signal while a user is per-forming the mental task are variable over time. For example, ERD and ERS occur during12the movement onset and continue for few seconds after the occurrence of the movement.Exploiting this time related information is crucial in improving the performance of theBCI, Specially in self-paced BCIs where the start and the end time of the mental task is notknown, and the BCI should continuously analyze and classify the EEG data.To exploit the time information, three possible approaches can be considered:1) Extract features from different parts (i.e. time windows) of the EEG signal, con-catenate the features and give the feature vector to the classifier. This is the most popularapproach in the literature [21] [43] [88] [99] [96], however this approach does not capturethe temporal ordering of different features.2) Another approach is to train different classifiers on different parts of the EEG signal,and then aggregate the results. The problem of this approach is that the correlation betweendifferent parts of the signal is not taken into account.3) None of the above approaches properly capture the dynamics of the EEG signal.A better approach is to apply sequence labeling classifiers on the EEG data. Sequencelabeling classifiers predict a sequence of labels associated with a sequence of observations.In this approach the EEG signal is divided into several consecutive segments which wouldresult in a sequence of observations and labels. The set of these observation/label sequencesbuilds the training set which forms the input to the classifier.In chapter 4, we propose a sequence labeling algorithm which is a combination of neu-ral networks and Conditional Random Fields. Our algorithm combines the power of neuralnetworks (to extract features from the EEG signal) with the conditional random field clas-sifier (to capture the dynamics of the EEG signal). We performed our experiments on 13(self-paced) BCI subjects and compared the performance of our algorithm to several se-quence labeling and classical classifiers. All classifiers were evaluated in an online setting,i.e. classifiers continuously classified the test data similar to real applications.13Aim 4) Design and Implementation of a User-Customized BCI for Self-Paced BCIs.Changes in the EEG signal while a subject is performing a mental task are time andfrequency-specific. In Chapter 5, we design a self-paced BCI which not only takes intoaccount the time information for self-paced BCIs but also is customized based on the char-acteristics of each subject’s EEG signal.Our BCI exploits the time information by using a sequence labeling classifier. We useBayesian optimization to tune subject specific frequencies, channels and other parametersof the self-paced BCI system. Our algorithm is a combination of the methods proposedin Chapters 3 and 4. While in Chapter 3, many different methods of customizing a syn-chronous BCI are studied, in chapter 5, we use an ensemble approach which led to morestable and better average performance across all subjects.In Chapter 5, we describe our algorithm and compare our method to several well-knownclassifiers. We conduct our experiments using 13 subjects in self-paced BCIs. In ourexperiments, the aim is to discriminate the control from the no-control states of the brain.All classifiers are evaluated in a setting similar to online BCIs, i.e. classifiers continuouslyclassify the test data by only looking at the past data. We show that tuning the hyper-parameters of a self-paced BCI system and at the same time exploiting the dynamics of theEEG signal, can considerably improve the performance of self-paced BCIs.14Chapter 2A Unified Comparison Framework forSensory-Motor BCIs2.1 IntroductionDue to wide variations in BCI technologies and the lack of a general comparison frame-work in BCIs, it is almost infeasible to compare existing BCI technologies and speed up theprogress in the field. As the first part of my thesis, I have developed a general framework tocompare the performance of different algorithms in EEG-based BCIs. As this research fo-cuses mostly on the classification part of a BCI system, the unified comparison frameworkthat we developed was used to evaluate the performance of different classifiers on severalsensory motor BCI datasets. The data used are those from BCI competitions1, as well asone of our own datasets [20].There have been some efforts to compare the performance of different classifiers in BCIsystems [3] [63] [6] [91] [19]. However, these comparisons have been performed on a smallnumber of subjects and for a small number of classification methods. In [68], the authors1http://www.bbci.de/competition15surveyed the performance of different classifiers in BCIs that used different subjects anddifferent features. However, the best way to compare different classifiers is to evaluate theirperformance in the same context, i.e., on the same set of subjects and using the same set offeatures with the same set of parameters.The general trend in EEG-based BCI systems has been to utilize the Linear Discrimi-nant Analysis (LDA) classifier for classification purposes [8] [53]. The binary LDA classi-fier has a very strong assumption that the conditional probability densities of the two classeshave a Gaussian distribution with the same covariance function. As a result, the discrimi-nant function is linear and may thus not be suitable for non-linear separable feature spaces.In addition, this classifier is very sensitive to outliers. On the other hand, as discussed insection 1.4 the main challenges in classification of EEG data in BCIs is that the data arenon-stationary, noisy, contain outliers, are usually of high dimensions and the training setis unfortunately small [67] [103]. As a result, the classification strategies in BCI systemsshould be able to cope with these problems appropriately.This part of our study provides open source comparison framework for BCI systemsand is unique in three aspects. First, the number of subjects considered in this chapter are21 subjects that used synchronous BCIs and 8 subjects that used self-paced BCIs (i.e. anoverall of 29 subjects). Other studies were performed on smaller sample sizes (less than5 subjects) and as a result, their findings may not be generalized to a larger subject pool.Secondly, we have used rigorous statistical tests to compare the performance of differentalgorithms. This will enable us to examine whether or not there are significant differencesbetween the performance of different classifiers. Other studies have not employed statis-tical tests for comparisons, mainly because of their smaller sample sizes and have solelyrelied on the mean performance of the classifiers (which is not a robust measure), as asurrogate for the overall performance of a specific classifier. Thirdly, to increase the trans-parency, the source code of our experiments has been made openly available so that other16researchers can apply it on their own data and benchmark their algorithms with standardmethods.In the following sections we first explain the general structure of the comparison frame-work, then we explain the datasets used and finally the results of the different algorithmsare compared.Figure 2.1: The BCI framework. Our framework has three main steps: 1. Filtering,2. Feature Extraction and 3. Classification. The output of each step is fed to thenext step.2.2 General Structure of Comparison FrameworkThe overall structure of our comparison framework is shown in Figure 2.1. This frame-work has three main sequential processing components, Filtering, Feature Extraction andClassification. All data from the available EEG channels are fed as the input, the Filteringcomponent performs frequency filtering and spatial filtering, and the Feature Extractioncomponent extracts features from the filtered data. Finally, in the last step the extractedfeatures are combined to build a dataset which is then fed to the classification component.In the following subsections, the details of each of the above mentioned components areexplained.172.2.1 Spatial and Frequency FilteringThe first step of our framework was composed of frequency filtering and spatial filtering.Frequency filtering was done using a filter bank. A filter bank is an array of band passfilters and contains n blocks. Each block corresponds to filtering the original signal in aspecific sub-band. For frequency filtering, we used the fifth order Butterworth filter. Thisfilter has a flat frequency response in the pass band. The result of applying each filter-bankblock was fed to the next block in which spatial filtering is performed on the signal.Common Spatial Patterns (CSP) [82] [18] was used for spatial filtering. CSP has beenwidely used in BCI systems and has yielded considerable improvements in the signal tonoise ratio of the EEG signal. CSP projects multiple channels of the EEG data onto a sur-rogate sensor space by applying a linear transformation on the EEG data. This techniqueis a supervised method of combining multiple channels and has been developed for binaryclassification. This method maximizes the variance of signals for one class while minimiz-ing the variance of the signals for the other. As the signals are standardized before applyingCSP, the variance of the signals would be equivalent to their band power. Thereby, if weuse the band power as the extracted feature in the surrogate space, the discrimination of theclasses (e.g., right and left hand imagery movements) would be maximized.Suppose that the normalized covariance matrices of both classes are given by Σc wherec ∈ {+,−}, and + and − correspond to different mental tasks in BCI. The CSP algorithmmaximizes the following equation:argmaxWW TΣ+WW TΣ−W(2.1)where W is the projection matrix. This equation can be solved by applying simultaneousdiagonalization of the covariance matrices of both classes. By multiplying the projectionmatrix by the original EEG signal we can obtain the uncorrelated brain signals.18The benefit of applying CSP is that we can select a subset of filters that preservesas much information as possible and discriminates the two classes very well. However,choosing the number of filters (i.e., spatial patterns) is difficult, and is usually determinedby heuristic approaches. CSP is inherently designed for 2-class BCI tasks, and we haveused a one-against-one scheme to use CSP for the multi-class dataset. In a one-against-onescheme a separate CSP filter is used to discriminate each pair of mental tasks.2.2.2 Feature ExtractionIn the feature extraction step, different kinds of features can be extracted from the filteredsignal. The extracted features should properly represent the information hidden in the rawEEG signal. We used two of the most common and successful feature extraction methodsutilized in motor-imagery based BCI systems. Both methods calculate the band-power (BP)of the signal [25] [24] [47].In the first approach, the band-power of a signal is extracted by directly applying to asignal, a filter that only allows the frequencies inside each filter band to pass. Assuminga perfect block filter, we can then estimate the power of the filtered signal by summingthe squares of the magnitude of the filtered signal. Since the logarithm of band-powerfeatures has also been used in the literature [25], we have included both the band-powerand its logarithm in our feature set. In this study, we extracted the band power features forthe α [8-12]Hz and β [16-24]Hz frequency bands of the brain signals. We refer to theseband-power features as the BP features in the remainder of the text.In the second feature extraction approach, we used the Morlet wavelet to extract theband-power features. This method is one of most the successful feature extraction methodsused in sensory motor BCI systems [25]. In this method, the EEG signal is decomposedusing the Morlet mother wavelet and the power in each frequency band is calculated. Weused the exact configuration as in [24] to extract these features; i.e., for each channel, all19the frequency bands are chosen to be in the 4 to 30 Hz range. This results in 26 featuresfor each channel. This setting for extracting BP features will result in high-dimensionalfeature spaces, especially for datasets with many EEG channels.After extracting features from each block of the filter bank, the extracted features arecombined to build a feature vector, which is fed to the classification component of ourframework.2.2.3 ClassificationIn supervised learning, given a set of training samples D = {( f1,y1),( f2,y2), ,( fn,yn)}, theaim is to find an approximation of the unknown function g : F→Y which has generated thesamples. Each fi is a feature vector and yi is the label of the corresponding feature vector. Fis the feature matrix and Y is a vector corresponding to the label of each row (i.e. fi) of X. Inprobabilistic classification, instead of approximating the function g, a posterior probabilityP(c| f ) (where c is the predicted class labels vector and f is the feature matrix) is found.Eventually the label is assigned to the class with the highest probability. This probabilitycan be calculated using the Bayes rule. P(c| f ) is called the posterior, and P( f |c) is theclass conditional density.P(c| f ) = P( f |c)P(c)P( f )(2.2)In the following, we briefly describe the classifiers we have compared using our frame-work.Gaussian Discriminant AnalysisWhen the class conditional density is assumed to have a Gaussian distribution, then theresulting classifier is referred to as the Gaussian Discriminant Analysis [73]. This modelfits a Gaussian distribution to the samples of each class. If the covariance matrices of both20classes are considered to be the same, the result will be a Linear Discriminant Analysis(LDA) classifier in which the decision boundary is a linear surface. If we do not assumeany constraints on the covariance function, the resulting decision boundary is a quadraticfunction and the corresponding classifier is called Quadratic Discriminant Analysis.These classifiers impose very strong assumption on the underlying distribution of thedata. This has the advantage that the computation of the discriminative function is veryefficient. Therefore, the LDA classifier has been popular in the BCI field. It is importantto mention that various algorithms [105] [42] [65] have been proposed to eliminate theshortcomings of LDA. These algorithms are more robust and some of them, such as Z-LDA, [105] can handle the case where the covariance of two classes are different.Logistic Regression (LR)Unlike Linear Discriminant Analysis (LDA) which makes strong assumptions about theunderlying distribution of the data, logistic regression does not make any assumptions andhas been preferred over LDA in machine learning applications [80].This classifier is a discriminative learning classifier that directly estimates the param-eters of the distribution p(c| f ) where c is the class label and f is the feature vector. Thisalgorithm assumes the distribution p(c| f ) is modeled using a softmax function:p(c| f ) = exp(WTk f )Σ jexp(W Tj f )(2.3)where Wjs are the parameters to estimate. Then the maximum likelihood approach is usedto directly approximate Wjs. As the Hessian matrix for the logistic regression model ispositive definite, the error function has a unique minimum. Over-fitting can occur in logisticregression when the data is sparse and high dimensional (which is the case in BCIs). Weused l1 and l2 regularization jointly to cope with the over-fitting problem.21Random Forests (RF)The Random Forest classifier is an ensemble learning algorithm that is constructed by com-bining multiple decision trees at the training stage and produces a result that is the averageof the output of all individual trees [29]. This powerful learning algorithm injects random-ness into each tree in two ways. The first uses bootstrapping to sample from the originaldataset (i.e. the algorithm takes M samples with replacement from the original dataset).The second selects a subset of the features to split each node of the tree. Injecting random-ness in the process of building Random Forests, makes these classifiers robust and causethem to have a good performance when the data have many outliers, which is the case inBCIs [33]. Another consequence of injecting randomness in random forests is the abilityto rank the different features and also to acquire a measure for feature importance.The seminal paper of Random Forests [22] claims that increasing the number of treesdoes not cause the random forest to over-fit. However, [90] has found that RFs can over-fitwith noisy datasets. As a consequence, along with other important parameters of RF wehave tuned the number of trees (2.2.5).Support Vector Machines (SVM)An SVM [51] is a discriminative classification algorithm which is able to find a decisionhyper-plane with the maximum distance (margin) to the nearest data points (Support Vec-tors) of each class. As a result, this method has a high generalization power. The decisionfunction of an SVM is fully specified by a subset of the training data points, which leads toa sparse solution for SVM. The cost function of an SVM is a convex function that leads toan optimal solution for the optimization task.The mathematical formulation of an SVM gives us the ability to use what is known asthe kernel trick, to map the original finite dimensional space into a destination space withmuch higher dimensions. The use of the kernel trick is beneficial particularly when the data22points cannot be separated with a hyper-plane in the original feature space. This may leadto an easier separation of the data points in the destination space. In this study, we haveonly used SVMs with a radial basis function (RBF) kernel. As stated in [55], under someconditions, linear kernel SVM can be approximated with an RBF kernel. Even though thedata points might be mapped into a very high dimensional or even an infinite space, thecomplexity of computing a kernel matrix can be far smaller. Hence, the SVM algorithmcircumvents the curse of dimensionality by relying on the data points only through thekernel. SVMs are inherently designed for two-class classification tasks. Many differentmethods have been proposed to adapt SVMs to multi-class problems [52]. We used a one-against-one scheme to adapt SVMs for multi-class problems. Given a multi-class problemwith M classes, in a one-against-one scheme, we train(M2)classifiers to discriminate eachpair of classes. At the test time, a voting scheme is used to predict the label of a new unseensample.Boosting AlgorithmA Boosting classifier [41] is a learning algorithm based on the observation that buildinga highly accurate rule for classification is a really difficult task. This algorithm works bybuilding many rough rules of thumb. These rules are called weak classifiers. Each of theseweak classifiers is trained on a subset of the data points. These weak classifiers are thencombined into a single prediction rule which would be much more accurate than individualweak classifiers.There are many variants of the boosting algorithm. The differences amongst thesealgorithms stem from two issues. The first is concerned with how to select the subset of thedata points for each weak learning algorithm. The second is related to how to combine theoutput of a weak classifier into a single prediction rule. In this study, we used a variant ofboosting called the Adaboost algorithm. In this algorithm, the subset of data points that are23used for training a weak learning are the ones that are misclassified by the previous weaklearner. For generating a single prediction rule, Adaboost takes a weighted majority voteof the prediction of the weak learners.Multi Layer Perceptron (MLP)The last classifier used in this study is a feed forward neural network [14] with one hiddenlayer. It has been shown that an MLP with enough number of neurons in the hidden layercan approximate a wide variety of functions[50]. Despite the flexibility and capability ofthis algorithm to approximate any nonlinear function, this algorithm can easily overfit andthe cost function to optimize is a non-convex function. The cost function we used is thenegative log-likelihood function with both l1 and l2 regularization to avoid overfitting.2.2.4 Performance MeasureFor the synchronous BCI datasets, we first obtained the labels of the different classes ofmotor imagery tasks, then the classification accuracy was used as the measure to comparethe performance of different algorithms.For self-paced datasets, the true positive rate (TPR) and the false positive rate (FPR)are the most popular measures in the field. It is common to fix the FPR to a small value(e.g. 1 percent) and compare the performance of TPRs of different methods. However, thismethod does not exploit all the information given by the output of the classifiers because allthe classifiers applied in this research produce probability estimates (i.e. confidence) of asample belonging to each class. Since, the output of a classifier is not just the label of eachsample, a good performance measure should exploit the added information. The receiveroperating characteristic (ROC) [39] curve is the best way to compare these classifiers asit illustrates the performance of the classifiers for varied discrimination thresholds. ROCis well-suited for self-paced BCIs because ROC is suitable for evaluating the performancewhen we have imbalanced data or unequal misclassification costs. To compare the perfor-24mances of the methods, we use the Area Under the ROC Curve (AUC). AUC representsthe probability that a randomly selected positive sample gets a higher rank than a randomlychosen negative sample. AUC varies between 0 and 1 and in general a higher AUC is better.2.2.5 Model SelectionModel selection consists of finding the appropriate set of parameters for each learning al-gorithm. We adjusted the value of a set of parameters (referred to as ”BCI parameters”)consisting of parameters of the sensory motor EEG signal and the classifier-specific param-eters using a grid-based approach.For each individual subject, a set of candidate parameters including both the BCI pa-rameters and the classifier parameters were tuned for each algorithm, and 5 times 5-foldcross validation was performed to find the best value of the parameters. Then the parametervalues with the best mean performance were used to train a final classifier on the trainingset for that subject. The final classifier was used to evaluate the performance on indepen-dent test dataset for the same subject. The values of all these parameters (i.e. the BCIparameters and the classifier parameters) were adjusted jointly.The BCI parameters used in the synchronous BCI dataset consisted of selecting the timesegment from which the features are extracted and the number of CSP filters. To adjust thevalues of these parameters, the system should be customized based on the brain character-istics of each individual subject. To get an acceptable performance with the synchronousdatasets, the usual practice is to discard some parts of the movement period in each trial.The choice of the time segment from which features are extracted, has a major role in theperformance of the learning system. In datasets with many channels we have also usedCSP to combine different channels. As a result, the number of CSP filters was included asa parameter in model selection.In self-paced BCIs, the classification problem is a sequential learning problem. To be25able to apply static classifiers to this problem, we used sliding windows with overlap. Thesize of the window, the overlap of two consecutive windows, and the number of CSP filtersformed the set of BCI parameters in the self-paced datasets.The set of classifier-specific parameters that we adjusted for each classifier are given inTable 2.1.2.2.6 Statistical TestsInstead of using empirical approaches that have been commonly used in the BCI field tocompare the performance of different algorithms, the Friedman statistical test is used inthis study. The Friedman test [32] [44] is a non-parametric statistical test, which ranksdifferent classifiers for each subject separately. Then it averages the ranks over all subjects.The null hypothesis assumes that all algorithms have the same performance so they havethe same rank. In other words, it assumes the difference between different algorithms israndom. The test statistic has an approximate Chi-square distribution, and if the p-value islow enough to reject the null hypothesis, we can conclude that the difference between thealgorithms is not random.26Table 2.1: List of classifier parameters that were optimized using cross-validation inthe training phase.Random ForestNumber of treesMaximum number features evaluated to split each nodeMaximum depth of each treeMinimum number of samples in each leafSVMCGammaGLMRegularization TypeRegularizer CoefficientBoostingNumber of treesMaximum number features evaluated to split each nodeMaximum depth of each treeLearning rateMLPNumber of neurons in hidden layerL1 coefficientL2 coefficientLearning rateGDA NoneIf the null hypothesis is rejected, another statistical test is performed to identify whichalgorithms are the source of difference. We then conduct the Holm’s test as the post-hocstatistical test. The classifier with the best rank is selected as the control classifier, then apairwise comparison of all the other classifiers and the control classifier is performed. Inthis statistical test, the null hypothesis states that the control classifier and the other clas-sifier have the same mean rank. We have divided the algorithms based on the outcome of27the Holm’s test into two categories: the recommended and not recommended. The recom-mended classifiers include the control classifier, i.e., the best one and any other classifierthat we are not able to prove it has a worse performance than the best classifier. All theothers were deemed to belong to the not recommended category.Table 2.2: Specification of datasets used in this thesis.Dataset Type Number of Subjects Task Number of ChannelsBCICIII3b Synchronous 3 left hand vs. right hand 2BCICIV2b Synchronous 9 left hand vs. right hand 3BCICIV2a Synchronous 9 left hand vs. right hand vs. both feet vs. tongue 22BCICIV1 self-paced 4 left hand, right hand and foot vs. No-Control 59SM2 self-paced 4 right index finger vs. No-Control 102.3 DatasetsFive sensory motor BCI datasets consisting of 29 subjects were used to evaluate differentmethodologies studied in this chapter. These are the datasets I [15], IIa [26] and IIb [64]of the BCI competition IV, and dataset IIIa [16] from BCI competition III and SM2 from[20].Table 2.2 shows the specifications of each dataset. While SM2 and BCICIV1 datasetswere used to evaluate different BCI designs in the self-paced paradigm, the remainingdatasets were used for synchronous BCI systems evaluation. For each subject in eachdataset we have two sets of data: calibration and evaluation data. The calibration data isused to train (calibrate) the BCI system. The evaluation data is used for the evaluation ofthe BCI system.Dataset I from competition IV (BCICIV1) was recorded from four subjects performingmotor imagery tasks (left hand, right hand or foot imagery). Each subject participated intwo sessions of brain signal recording. The first session was used for training the BCI28system, and the second session of signal recording was used for the evaluation of the BCIsystem. This dataset consisted of 59 EEG channels (corresponding to 59 sensors). In thecalibration session, each subject was assigned to perform two of the three classes of motorimagery tasks: left hand, right hand, or foot imagery movements. There were 200 trialsof imagery movements that were balanced between two classes. Each trial was 8 secondslong, the length of the motor imagery intervals in the evaluation session varied from 1.5 to8s. The NC intervals were also between 1.5 and 8s. Getting a very good performance onthis 3-class self-paced dataset is very challenging, therefore we converted the problem intoa binary classification in which the aim is to decide whether an output is a control state ora No-Control state.The SM2 dataset was collected from four subjects attempting to activate a switch. Atrandom intervals (mean = 7 seconds), a cue was displayed for the subjects. The subjectsattempted to activate the switch by moving their right index finger after the cue appeared.In this dataset, the EEG was recorded from 10 channels positioned over the supplementarymotor area and the primary motor cortex (i.e. FC1-4, FCz, C1-4, Cz). For each subject,an average of 10 sessions of EEG recording was performed for six days. In each session,the period between any two trials varied and the subjects performed actual movement. Foreach subject a fixed set of trials (mean = 500 trials) was used for training the BCI and therest for evaluating the BCI. A detailed description of this dataset can be found in [20].Dataset IIa from the BCI competition IV (BCICIV2a) was recorded for nine subjectsperforming 4-class motor imagery (left hand and right hand, both feet and tongue im-agery movements) tasks. The data consisted of 19 channels along the scalp. The datawas recorded in two sessions and each session comprised of 288 trials. The first sessionwas used to train (calibrate) the BCI and the second session was used to evaluate the BCIsystem.Dataset IIb from BCI competition IV (BCICIV2b) was recorded for nine subjects per-29forming 2-class motor imagery (left hand and right hand) tasks. The data consisted of threechannels (i.e. C3, CZ and C4) along the motor cortex of the scalp. For each subject, fivesessions were recorded. The first three sessions were used for training the BCI system, andthe remaining two sessions were used for testing the system.Dataset IIIb from BCI competition III (BCICIII3b) was recorded from three subjectsperforming 2-class motor imagery (left hand and right hand) tasks. The data had two chan-nels (i.e. C3, C4) from the motor cortex area of the brain. For each subject, three sessionsof EEG recordings was provided.2.4 Results and DiscussionWe applied different combinations of feature extraction, classification and model selectionmethods to five datasets. Tables 2.3 and 2.4 show the performance for synchronous andself-paced datasets, respectively, with the best performing feature extraction/classificationcombinations highlighted in blue. Qualitative comparison of the different feature extrac-tion/classification combinations suggests the following: 1) for synchronous BCI systems,logistic regression classification outperforms the other classifiers (Table 2.3). This is re-gardless of the feature extraction methodology used. 2) For self-paced BCIs, (Table 2.4),both logistic regression and MLP classifiers yield better performances. In addition, in bothself-paced and synchronous BCIs, Tables 1 and 2 show that for the subjects with the highernumber of EEG channels, the Band-Power (BP) outperforms the Morlet features mainlydue to the application of CSP on the channels. For datasets having a lower number of EEGchannels, the use of Morlet features outperforms that of BP features.The Friedman statistical test was performed to compare the performance of the differ-ent classification methods. This was done for both the BP and Morlet feature extractionmethods for both types of BCIs (synchronous and self-paced), resulting in four differentcomparison settings. In Settings 1 and 2, the BP and Morlet features with different clas-30sifiers were applied on the synchronous datasets, respectively, and in settings 3 and 4, theBP and Morlet features with different classifiers were applied on the self-paced datasets,respectively.31Table 2.3: The accuracy of classifiers for synchronous BCI operation over all subjects. The accuracies are calculated usingthe unseen test dataset. For each classification algorithm the first column shows the results of BP features and the secondcolumn shows the results of Morlet features.SubjectsClassifier Boosting Logistic Random Forest SVM LDA QDA MLPBP Morlet BP Morlet BP Morlet BP Morlet BP Morlet BP Morlet BP MorletSubject1 (O3) 75.47 80.5 80.5 82.39 74.84 79.25 81.76 77.36 81.13 74.21 79.25 60.38 78.62 83.65Subject2 (S4) 71.11 77.96 70.19 83.89 68.52 79.26 71.11 83.52 70 81.11 70.37 72.41 70.37 82.22Subject3 (X11) 72.22 78.15 74.26 78.15 71.48 77.78 72.78 77.78 74.07 76.48 74.81 72.22 74.81 76.67Subject4 (100) 66.23 61.84 60.96 68.86 63.6 65.35 64.04 64.91 61.4 75.88 63.16 58.77 63.6 69.3Subject5 (200) 53.47 51.84 56.33 58.37 54.29 54.29 56.73 59.18 56.33 55.92 54.69 55.51 56.33 58.37Subject6 (300) 56.52 54.35 56.09 53.48 51.74 51.74 51.74 45.22 56.09 49.13 54.35 46.52 57.83 51.74Subject7 (400) 89.25 90.88 94.79 94.79 91.21 92.83 95.11 94.14 93.81 94.46 95.77 83.39 92.18 94.46Subject8 (500) 61.9 86.45 67.77 91.58 63.37 85.71 68.13 85.71 65.57 87.18 67.03 76.56 67.4 85.71Subject9 (600) 74.1 79.68 75.7 82.87 74.1 80.48 65.74 80.88 76.1 82.07 75.3 60.96 76.89 83.27Subject10 (700) 54.31 70.69 53.02 72.84 49.14 76.29 52.59 70.69 59.05 71.55 57.33 64.66 51.29 71.12Subject11 (800) 91.74 83.91 92.17 83.48 90 86.52 92.17 80.87 92.17 80.87 86.52 67.83 90.87 80Subject12 (900) 78.37 82.45 77.55 86.12 75.92 82.86 76.73 86.12 77.96 76.33 77.96 68.57 77.14 83.67Subject13 (1) 79 71.53 79 60.85 81.85 73.31 81.14 61.57 74.38 50.89 52.67 26.33 81.85 58.01Subject14 (2) 51.59 53.36 61.13 54.42 53.36 51.24 57.24 57.6 62.9 32.86 38.87 25.8 58.3 54.77Subject15 (3) 78.75 83.15 86.45 86.81 78.02 82.78 84.25 78.02 80.22 50.55 41.03 27.47 85.35 83.15Subject16 (4) 71.49 32.02 73.68 41.23 73.68 45.18 71.05 36.84 60.96 35.53 36.84 30.26 72.81 34.65Subject17 (5) 56.16 33.33 60.14 40.58 58.33 35.14 56.88 41.67 50 28.26 34.06 28.26 60.14 38.77Subject18 (6) 51.63 24.65 56.74 26.98 52.09 26.05 57.21 25.58 54.42 27.91 33.95 26.98 59.07 26.05Subject19 (7) 84.48 64.98 87.36 56.32 81.95 71.48 87.36 64.26 70.4 32.49 25.27 31.41 87 60.29Subject20 (8) 77.12 54.98 80.81 61.62 79.34 63.47 81.55 62.36 75.65 37.27 46.49 31 80.44 61.62Subject21 (9) 78.03 46.97 83.71 39.39 82.95 56.44 84.09 45.08 73.86 41.29 34.47 25 84.85 44.3232The reason we compared the performance of different classifiers when a single featureextraction methodology was used was that the nature of feature spaces was different foreach setting and the results could thus be misleading. For example, when we use the BPfeatures, the feature space is of low dimensions compared to the settings where the Morletfeature is used. Therefore, different classification methodologies could only be comparedwhen they are applied on the same feature space. Since the classification problem in syn-chronous and self-paced BCIs was different, it only make sense to compare the classifierson self-paced and synchronous datasets separately.33Table 2.4: The Area Under the Curve (AUC) of classifiers for self-paced subjects. For each classification algorithm the firstcolumn shows the results of BP features and the second column shows the results of morlet features.SubjectsClassifier Boosting Logistic Random Forest SVM LDA QDA MLPBP Morlet BP Morlet BP Morlet BP Morlet BP Morlet BP Morlet BP MorletSubject22 (22) 0.46 0.56 0.64 0.58 0.49 0.49 0.41 0.48 0.64 0.57 0.62 0.55 0.63 0.56Subject23 (23) 0.54 0.61 0.67 0.68 0.53 0.55 0.5 0.59 0.64 0.68 0.63 0.53 0.66 0.7Subject24 (24) 0.6 0.58 0.66 0.58 0.58 0.51 0.6 0.52 0.65 0.58 0.64 0.54 0.63 0.57Subject25 (25) 0.36 0.31 0.78 0.77 0.6 0.69 0.47 0.6 0.79 0.77 0.68 0.66 0.79 0.73Subject26 (a) 0.59 0.53 0.66 0.55 0.64 0.53 0.56 0.53 0.65 0.53 0.61 0.5 0.66 0.57Subject27 (b) 0.79 0.77 0.82 0.83 0.78 0.81 0.72 0.76 0.66 0.8 0.72 0.73 0.82 0.83Subject28 (f) 0.49 0.54 0.51 0.53 0.51 0.53 0.49 0.51 0.5 0.53 0.48 0.51 0.49 0.52Subject29 (g) 0.52 0.5 0.53 0.58 0.52 0.51 0.53 0.51 0.53 0.55 0.58 0.52 0.53 0.5834The comparison between different classifiers was performed as follows: (a) we rankedeach classification algorithm based on its performance on the test data. For example, forsubject1 (O3), when the Morlet features were used, the MLP was the best classifier, LRwas the second best and so on. We ranked the algorithms over each subject separately,then we averaged the ranks, (b) the Friedman test was then applied on the resulting rankedresults. (c) if the null hypothesis was rejected by the Friedman statistical test, we performeda second set of statistical tests. In particular, we used the Holm’s test and compared allclassifiers to the control classifier. The control classifier is the one with the best averagerank on the test data.The average rank of the different classifiers that are based on the number of wins forthe synchronous and the self-paced datasets respectively are given in Table 2.5. Theseresults show that depending on the feature sets and the dataset type (synchronous versusself-paced), both linear classifiers such as LR and LDA and non-linear ones (e.g., MLP,RF) could be among the top performing classifiers. Another interesting observation is thatin all settings, the average rank of the LR classifier is better than the LDA’s. This is due thefact that the BCI data is noisy and usually has many outliers. As discussed in the previoussections, unlike LR, LDA is sensitive to outliers and is therefore not robust. Comparingthe ensemble classifiers shows that the RF classifier has a better average rank compared toAda-Boosting (i.e. BST); this is again due the sensitivity of BST to outliers.35Table 2.5: Average Rankings of the classification algorithms for both synchronousand self-paced datasets. The number in the parenthesis corresponds to the aver-age rank of the algorithm among different subjects. For each feature extractionmethod the classifiers typed in bold are the recommended ones. The recom-mended classifiers are selected based on the results of the statistical tests.Data Synchronous Self-pacedFeature BP Morlet BP Morlet(Setting 1) (Setting 2) (Setting 3) (Setting 4)1 MLP(2.92) LR(2.45) LR(1.81) LR(1.87)2 LR(2.97) RF(3.3) MLP(2.75) MLP(2.56)3 SVM(3.11) MLP(3.42) LDA(3.06) LDA(2.81)4 LDA(4.09) SVM(3.57) QDA(4.18) BST(4.25)5 BST(4.54) BST(4.16) RF(4.87) RF(4.87)6 QDA(5.11) LDA(4.45) SVM(5.5) SVM(5.81)7 RF(5.21) QDA(6.61) BST(5.81) QDA(5.81)In Setting 1, as shown in Table 2.5, MLP was the best performing classifier, i.e., whenthe BP features were used on synchronous datasets. Since the null hypothesis was rejectedby the Friedman statistical test (p-value=1.4e-4), the Holm’s post-hoc was performed andall classifiers were compared to Multi-Layer Perceptron (control classifier).The p-values corresponding to pairwise comparison of classifiers are shown in Table2.6. For α = 0.1, all hypothesis with p-value less than 0.0333 were rejected. Accordingto the Holm’s test results, there is no significant difference between the performance ofMLP, LR, SVM and LDA. Thereby, these classifiers are the recommended ones for theBP features in synchronous data. The other classifiers, i.e., BST, QDA and RF had a poorperformance for this type of features.36Table 2.6: P-values corresponding to pairwise comparison of different classifiers. αis chosen to be 0.1. For settings 1 and 2 all hypothesis with p-value less than0.0333 are rejected. For setting 3 and 4 all hypothesis with p-value less than0.05 are rejected. The results are rounded up to 4 decimal places.Setting 1(Synchrounous, BP) Setting 2(Synchrounous, Morlet) Setting 3(Self-paced, BP) Setting 4(Self-paced, Morlet)hypothesis P− value hypothesis P− value hypothesis P− value hypothesis P− valueRF vs. MLP 0.0006 QDA vs. LR 0.0 SVM vs. LR 0.0002 SVM vs. LR 0.0002QDA vs. MLP 0.0010 LDA vs. LR 0.0026 BST vs. LR 0.0006 QDA vs. LR 0.0002BST vs. MLP 0.0151 BST vs. LR 0.0101 RF vs. LR 0.0045 RF vs. LR 0.0054LDA vs. MLP 0.0801 SVM vs. LR 0.0932 QDA vs. LR 0.0278 BST vs. LR 0.0278SVM vs. MLP 0.7750 MLP vs. LR 0.1431 LDA vs. LR 0.2471 LDA vs. LR 0.3854LR vs. MLP 0.9430 RF vs. LR 0.1985 MLP vs. LR 0.3854 MLP vs. LR 0.5244In Setting 2, the best performing classifier was LR (Table 2.5). The p-value for theFriedman test was 1.7e-8; therefore, the difference between the classifiers is not random.The Holm’s test suggests that RF, MLP, and SVM are as good as the LR classifier (Table2.6 setting 2).In Settings 3 and 4, the LR classifier performed better than others (Table 2.5). In Setting3 the Friedman’s test p-value was 7.16e-4, and in Setting 4 Friedman’s test p-value was1.8e-4. The Holm’s test results suggested that there was no significant difference betweenMLP, LDA and LR (Table 2.6 setting 3 and 4). All hypotheses with p-value less than 0.05were rejected.Among the classifiers used in this study, RF, BST and MLP are inherently designed tohandle multi-class classification and the others (i.e., SVM, LR, LDA and QDA) are usedin a one against others setting to handle multi-task problems. Therefore, in addition to thefour settings discussed above, we have also performed two other statistical tests. The aimwas to determine which classifiers would yield the best results in binary-task BCIs andwhich one(s) would yield the best results in multi-task BCIs.37Table 2.7: Average Rankings of the classification algorithms for binary and multi-class classification in synchronous datasets. The number of subjects in binarytask was 12 and the number of subjects in multi-task BCIs was 9. The number inthe parenthesis corresponds to the average rank of the algorithm among differentsubjects. For each feature extraction method the classifiers typed in bold arethe recommended ones. The recommended classifiers are selected based on theresults of the statistical tests.Feature BP MorletData Binary Multiclass Binary Multiclass1 SVM(3.33) MLP(2.11) LR(1.87) RF(2.50)2 LDA(3.41) LR(2.22) MLP(3.20) SVM(3.00)3 MLP(3.54) SVM(2.83) RF(3.91) LR(3.22)4 LR(3.54) RF(3.88) LDA(3.91) MLP(3.72)5 QDA(3.70) BST(4.94) SVM(4.0) BST(3.94)6 BST(4.25) LDA(5.0) BST(4.33) LDA(5.16)7 RF(6.20) QDA(7.0) QDA(6.74) QDA(6.44)From the total of 21 subjects in the synchronous BCIs datasets, 12 had performed binarytasks and 9 had performed multi-task control of BCIs. Therefore, we performed separatestatistical tests for binary and for multi-task datasets. The average rank of different classi-fiers for the binary and for the multi-task datasets are given in Table 2.7. This Table showsthat in binary-task BCIs for both BP and Morlet features the best performing classifier isan inherently binary classifier (i.e., SVM in binary-task BCIs with BP features and LR inbinary-task BCIs with Morlet features). Furthermore, in multi-task BCIs for both kinds offeatures the best performing classifier is an inherently multi-class classifier (i.e. MLP inmulti-task BCIs with BP features and RF in multi-task BCIs with Morlet features).38Table 2.8: P-values corresponding to pairwise comparison of different classifiers. αis chosen to be 0.1. For binary task BCIs with BP features all hypothesis withp-value less than 0.02 are rejected. For multi-task BCIs with BP features allhypothesis with p-value less than 0.0333 are rejected. For binary task BCIs withMorlet features all hypothesis with p-value less than 0.1 are rejected. For multi-task BCIs with Morlet features all hypothesis with p-value less than 0.025 arerejected. The results are rounded up to 4 decimal places.(Binary, BP) (Multi-task, BP) (Binary, Morlet) (Multi-task, Morlet)hypothesis P− value hypothesis P− value hypothesis P− value hypothesis P− valueRF vs. SVM 0.0011 QDA vs. MLP 0.0 QDA vs. LR 0.0 QDA vs. RF 0.0001BST vs. SVM 0.2986 LDA vs. MLP 0.0045 BST vs. LR 0053 LDA vs. RF 0.0088QDA vs. SVM 0.6706 BST vs. MLP 0.0053 SVM vs. LR 0.0159 BST vs. RF 0.1560LR vs. SVM 0.8132 RF vs. MLP 0.0808 RF vs. LR 0.0206 MLP vs. RF 0.2300MLP vs. SVM 0.8132 SVM vs. MLP 0.4781 LDA vs. LR 0.0206 LR vs. RF 0.4781LDA vs. SVM 0.9247 LR vs. MLP 0.9131 MLP vs. LR 0.1305 SVM vs. RF 0.6234For each group of subjects, we performed the Friedman test and the Holm’s post-hoctest. This is done to statistically compare the performance of other classifiers with respectto the best performing classifier in each case. The p-values corresponding to pairwise com-parison of classifiers are given in Table 2.8. Table 2.8 suggests that for the BP features,MLP, LR and SVM are recommended for both the binary and multi-class BCIs. Accord-ing to Table 2.5, these classifiers are also recommended in Setting 1 (corresponding to 21subjects performing synchronous BCIs with BP feature extraction method). Table 2.8 alsosuggests that, for Morlet features, LR and MLP classifiers are recommended for both thebinary and multi-class BCIs. According to Table 2.5, these classifiers are among the rec-ommended ones in Setting 2 (corresponding to 21 subjects performing synchronous BCIswith Morlet feature extraction method). The results suggest that among the classifiers thatare designed to handle multi-task classification, RF and MLP are recommended for multi-task BCIs. The other observation is that even in multi-task BCIs, some inherently binaryclassifiers are among the recommended classifiers. The results of Table 2.5 are, however,39more reliable as the number of subjects in Settings 1 and 2 is almost twice the number ofsubjects in Table 2.7.2.5 ConclusionIn this chapter, we built a general open source framework to compare the performanceof different algorithms in BCIs. Using this framework, we performed a comprehensivecomparison between 14 different BCI designs (two feature extraction methods and sevenclassification methods for each feature extraction methodology) over 29 BCI subjects per-forming sensory motor tasks in synchronous and self-paced BCI paradigms. We backedour results with rigorous statistical tests.Our results show that the Logistic Regression (LR) and Multi-Layer Perceptron (MLP)classifiers are among the best performing classifiers and are recommended for all differentdesigns. LR is a linear classifier like Linear Discriminant Analysis (LDA) while MLP is avery powerful nonlinear classifier. Both LR and MLP classifiers are prone to over-fitting;however, in both cases we have included regularization terms to avoid over-fitting. Theobservation that LR was among the best classifiers suggested that the feature space of ourtask was somewhat linearly separable.Finally, we should emphasize that classification is just one step in our framework, andto get acceptable performance other steps are also important. Pre-processing of the data,feature extraction, and feature selection all change the distribution of the data in the featurespace and have a major role in getting good results. Therefore, a BCI system should beviewed as a unit consisting of different blocks in which all the block settings and parametersshould be adjusted jointly for each individual subject.40Chapter 3Customizing Brain Computer Interfaces3.1 IntroductionAs discussed in chapter 1, different subjects have different brain characteristics, and a BCIsystem should be customized for each individual subject. For example, the frequency bandsin which the alpha and beta activities occur differs from one individual to another. In sen-sory motor BCIs, the event related desynchronization (ERD) and the event related synchro-nization (ERS) occur in these frequency bands [79].Generally the range of the alpha and beta brain waves are ≈ [8− 12]Hz and ≈ [16−24]Hz respectively. However, the value of these brain waves vary from one person toanother [77]. Finding these frequency bands would thus play a significant role in obtaininga desirable performance. In synchronous BCIs, the selection of the EEG time segment(from which the features are extracted), for each person has a major role in determiningthe accuracy of a learning system. Also, in BCI systems, the choice as to which channelsto use is very important. These subject-specific brain characteristics form some of thehyper-parameters that are fed to a BCI system. The value of these hyper-parameters aredetermined before the feature extraction and classification processes.41The problem of finding the appropriate values for the hyper-parameters is known ashyper-parameter optimization. This problem is of great importance in any machine learningtask. Two of the most common approaches for finding good values for hyper-parametersare grid search and manual search. To apply a grid search method on a continuous space,we usually discretize the space. In the grid search approach, when the dimensionality ofthe search space, or the granularity of the discretization of continuous hyper-parametersincreases, the computation time to carry out the search process increases vastly (curse ofdimensionality). Manual search, has the drawback that its results are not reproducible.This is because researchers usually do not report the hyper-parameter values used in theirmethods, and this makes the reproduction of their results difficult. Both these methodsare thus not applicable in high-dimensional search spaces. Therefore, there is a need forsmart methods that find good solutions efficiently i.e. using a small number of functionevaluations.In the BCI field, the values of these hyper-parameters have been mostly selected man-ually i.e. selected by EEG experts. There exists some prior work in the field that havepartially addressed the hyper-parameter optimization task. Some studies only adjust thefrequency bands and the time interval [93] [18] [28] [94] [31] [11], and some others addressthe channel selection problem [45] [104] [5] [61]. These methods are however not scalable,they also do not optimize all the hyper-parameters jointly. Furthermore, in these methodsthe process of selecting the subject specific hyper-parameters is completely independentfrom the classification and feature extraction processes. That is, the hyper-parameter op-timization task has no knowledge of the nature of the feature extraction and classificationalgorithms.In this work, we take a totally different approach from those in the literature. An ideallearning framework for a BCI problem should aim at studying the hyper-parameter opti-mization, the feature extraction and the classification processes jointly. In other words, the42hyper-parameters of the BCI system should be optimized based on the selected classifierand feature extractor methods.In this chapter, we first show the importance of finding the optimal values for the hyper-parameters of BCIs. Then, we propose a learning framework that is capable of performingsubject-specific customization for BCIs. we use Bayesian optimization to tune the hyper-parameters of the BCI system. The proposed framework is computationally in-expensiveand can be used to customize any number of hyper-parameters in a BCI, and the experi-ments have led to significant improvements in motor imagery based BCIs.3.2 Bayesian OptimizationBayesian optimization (BO) is capable of optimizing functions that do not have closed formmathematical expressions and are computationally expensive to evaluate [23]. This ap-proach has several advantages over other global optimization algorithms [54] and has thusgained much attention for hyper-parameter optimization in machine learning. Bayesianoptimization is very efficient in terms of the number of function evaluations. Its efficiencyoriginates from its ability to incorporate prior knowledge about the optimization task. Es-pecially in the case of optimizing the hyper-parameters, the candidate points in the neigh-borhood of a certain point x have almost the same function value (i.e. the function beingoptimized is smooth). In Bayesian optimization, we can incorporate this domain knowledgeabout the system through a Kernel function. Bayesian optimization uses all the informationgained from previous function evaluations to find a new candidate point. This is done byutilizing the global information captured in the probability distribution fitted to the data topropose a new candidate.Bayesian optimization optimizes an objective function g(x) over some bounded set X .The objective function does not have a closed form expression and its derivative is un-known. A Bayesian optimization algorithm sequentially constructs a probabilistic model43for g(x) and then uses this model to select a candidate for the optimization task. In ourcase, the objective function is the accuracy of the learning framework used.Let us assume that at time t − 1, the sequence D1:t−1 ={(x1,g(x1)),(x2,g(x2)), ...,(xt−1,g(xt−1))} have been observed. The Bayesian opti-mization algorithm creates a posterior distribution over the objective function g given theobservations i.e. P(g |D1:t−1). This posterior function expresses the algorithm’s beliefabout the objective function g, and can be seen as a way of estimating the objectivefunction.Based on this posterior function, the Bayesian optimization algorithm proposes anothercandidate to be evaluated in the next iteration. In particular, it creates a new utility functionbased on the posterior distribution P(g |D1:t−1). This function is called the acquisitionfunction. Unlike the original optimization task, finding the maximum of the acquisitionfunction is not difficult, however this function is still nonconvex. To optimize this function,a gradient descent algorithm or a derivative free optimization algorithm can be used. A localmaximum of the acquisition function is the new candidate (i.e. xt). This whole process iscontinued for T iterations.For the Bayesian optimization algorithm, there are two main components that shouldbe determined in advance. The first component is how to model the posterior (P(g |D1:t))distribution over the objective function. Such modeling should be powerful enough to beable to model complex posterior distributions of the objective function. This is discussedin Section 3.2.1. The second component is the choice of the acquisition function. This isdiscussed in Section 3.2.2.3.2.1 Modeling Posterior DistributionTo model the posterior distribution two of the most powerful methods for regression areused. The first method utilizes Gaussian Processes (GP) [83] which is a very convenient44method for modeling non-linear regression tasks (Section 3.2.1). GPs assume the corre-lation between samples in all parts of the space has the same structure, and they fail tomodel cases in which the dataset is piece-wise continuous. The second approach utilizesRandom Forests (RF) for the regression task (Section 3.2.1). In [29] Criminisi et al. havecompared the performance of RF and GP regression methods. They stated that GP andRF have different behaviors when large gaps exist in the training data. In such cases RFscan model the uncertainty in the gaps appropriately. They also concluded that RFs canmodel multi-modal distributions of data in the gaps while GPs are intrinsically uni-modalGaussian distribution.Gaussian Processes (GP)Gaussian Processes (GPs) have been shown to be well-suited to estimate the posterior dis-tribution over an objective function [72]. GPs are one of the most popular families ofstochastic processes for modeling regression problems. One characteristic of GPs is theirability to model the uncertainty of predictions. A GP is a non-parametric learning algo-rithm which represents an infinite dimensional Gaussian distribution over functions. Agood property of GP is that solving the prediction problem associated with GPs is straight-forward and there is a closed form solution for that.As is the case of an ordinary Gaussian distribution, Gaussian processes can be fullydetermined by their first and second order moments. As a result, to model a regressionproblem with a GP, the mean and the variance i.e. kernel function of the GP should bedetermined. The family of functions represented by a GP is determined by the choice ofthe mean and kernel function of the prior distribution. Usually we assume the mean of theprior is zero and only the kernel function defines the functions represented by a GP. Sucha kernel function controls the smoothness of the functions represented by a GP. This isspecially useful in hyper-parameter optimization, where we can design a smooth posterior45over the hyper-parameter space by choosing an appropriate kernel.To model the posterior distribution with a GP, we assume that the sequence of hyper-parameter candidates up to time t− 1 is x1:t−1 = [x1,x2, ...,xt−1]T , and the correspondingvalues of the objective function for this sequence are g1:t−1 = [g(x1),g(x2), ...,g(xt−1)]T .At time t, the joint density distribution of g1:t−1 and gt (i.e. g(xt)) is Gaussian, i.e.,g1:t ∼ N(0,[ K kt,1:t−1kTt,1:t−1 kt,t]), (3.1)where K is the kernel matrix and is the result of applying the kernel function (ki, j = k(xi,x j))on all pairs of previous observations x1:t−1:K =k1,1 ... k1,t−1. . .. . .. . .kt−1,1 ... kt−1,t−1. (3.2)In Equation 3.1, kt,1:t−1 is [kt,1,kt,2, ...,kt,t−1]T and kt,t = k(xt ,xt). Then the posterior dis-tribution is calculated using equation 3.3:P(gt |g1:t−1,xt)= N(µt ,σ2t), (3.3)whereµt = kTt,1:t−1K−1g1:t−1, (3.4)σ2t = kt,t− kTt,1:t−1K−1kt,1:t−1. (3.5)46Random Forests (RF)If we use random forests to model the posterior distribution, at each iteration t of theBayesian optimization algorithm, a random forest is trained using D1:t−1. Assuming Mregression trees are trained on the data, then:p(gt |g1:t−1,xt)=1MM∑m=1pm(gt |g1:t−1,xt)(3.6)where pm(gt |g1:t−1,xt)is the probability density function over the output variable producedby the mth tree.3.2.2 Acquisition FunctionThe second component of a Bayesian optimization algorithm is the acquisition function.The role of the acquisition function is to direct the search process to the new candidatepoints. The new points should be selected so that they have higher probability i.e., yieldhigher values of the objective function. A good acquisition function should be able tobalance the trade-off between exploration and exploitation. In other words, the algorithmshould search globally in the beginning of the optimization process. As the search proceeds,the posterior over the objective function becomes more accurate, and the algorithm shouldbecome more exploitative.One simple acquisition function is to select a candidate that has the maximum prob-ability of improving the objective function value over the current best optimal solution.However, this method does not explore the search space properly, and the points that areinfinitesimally greater than the current best value will have a higher chance to be selected.A better alternative is to maximize the Expected Improvement (EI). EI considers both the47probability of improvement and the amount of improvement.xt = argmaxxE(max(0,g(x)−g+) |D1:t−1), (3.7)where g+ is the maximum observed value until iteration t − 1. The maximization insidethe expectation in equation 3.7 states that we only evaluate those points which our modelbelieves would result in improvement over the current best point.Under the assumption of a Gaussian posterior distribution, EI can be evaluated ana-lytically. Therefore, the expectation for a possible candidate xt in Equation 3.7 can beevaluated, as following:(µt−g+)Φ(Zt)+σtφ(Zt) σt > 00 σt = 0, (3.8)andZt =µt−g+σt, (3.9)where φ(.) and Φ(.) are the probability density function and the cumulative distributionfunction of a standard Gaussian distribution. The candidate with the maximum EI value inequation 3.8 is selected as the candidate for the tth iteration of Bayesian optimization. Forboth Gaussian process and random forest posteriors, we have used equation 3.8.3.3 Proposed AlgorithmFor any BCI system, the goal is to achieve the best possible accuracy. The traditionalapproach is to model the BCI design task as a machine learning problem and solve theproblem accordingly. In this section however, we solve the BCI task from an optimizationpoint of view. We define an objective function (Section 3.3.1), and we then use a Bayesian48optimization algorithm (Section 3.3.2) to optimize this function. Our objective functionis the performance of the learning system, and we optimize this function over the hyper-parameter space. Subsequently, our algorithm builds an ensemble of classifiers (Section3.3.3) using the results of the optimization phase.The pseudo code of our algorithm is given in Algorithm 1. At each iteration (t) ofour algorithm, the optimizer suggests a new candidate (xt) in the hyper-parameter space,then a new classifier (lt) is built based on the values of the new candidate. Based on theperformance of lt at iteration t, another candidate is suggested for iteration (t +1) and thisprocess continues for T iterations. After T iterations, we have T classifiers, each trainedon a different subset of the hyper-parameter space. The outputs of the classifiers are thencombined into a single prediction rule which is much more accurate than the individualclassifiers.Algorithm 1: The pseudo-code of the proposed algorithm1 Select an x1 from the hyper-parameter space uniformly at random;2 Create a feature matrix f1 from the brain signals using x1;3 Train a new Classifier (l1) using feature matrix f1, and calculate the its performanceg1;4 Create the posterior distribution;5 for t:=2 to T do6 Find candidate (xt) from the hyper-parameter space (equation 3.7);7 Create a feature matrix ft from the brain signals using xt ;8 Train a new Classifier (lt) using feature matrix ft , and calculate the itsperformance gt ;9 Update the posterior distribution p(gt |g1,t−1,xt) (equations 3.3 for GP and 3.6for RF);10 end11 Combine l1...lT to build an ensemble classifier;493.3.1 Learning Algorithm (Objective Function)The objective function g : H→ A is a function that maps the hyper-parameter space (H) tothe set of possible performances (A) of the learning system. Using Bayesian optimizationwe decide which hyper-parameters yield the best performance. Below, we explain ourlearning system in more details.We used the learning framework described in section 2.2 as our learning system. Forfrequency filtering we used the fifth order Butterworth filter, and for spatial filtering we usedCSP (section 2.2.1). For feature extraction we used BP and Morlet features as described insection 2.2.2. In our experiments the number of BP features extracted from each channel,were either 2 or 4 (Section 3.3.1). For the classification step, we used Logistic Regression.Comparing classifiers in sensory motor BCIs in chapter 2, Logistic Regression was amongthe best performing classifiers. To select the appropriate values for the parameters of theclassifier, 5× 5-fold cross validation is performed. The parameters with the best meanperformance were used to train the classifier.The accuracy in the classification is used as the performance measure for evaluatingdifferent algorithms. As the number of samples are the same for different imagery move-ments, this measure is appropriate to use in this study. Therefore, the output domain of theobjective function is taken as the cross validation accuracy of all possible classifiers. Inother words, our algorithm uses 5×5-fold cross-validation to assess each classifier.Input Domain of the Objective Function (Hyper-Parameter Space)In this section, we explain the domain i.e. the hyper-parameter space of the objectivefunction. We have considered three groups of parameters to form our hyper-parameterspace. We find the values of these hyper-parameters simultaneously for each subject:(i) the time interval from which features are extracted: In order to explain this hyper-parameter, a sequence of trials in a movement sensory BCI dataset is illustrated in Figure503.1. Each trial consists of a movement interval (left or right hand movement) and a No-Control (NC) interval. During a movement interval, the subject is requested to performa movement imagery task, and in an NC interval the subject should stop controlling thesystem.As depicted in Figure 3.1, the movement and NC interval are consecutive. The presenceof these transition states (i.e. switching from a NC state to movement state or vice versa)in each brain state segment makes classification more difficult. To alleviate this problem, achunk of the signal is discarded from the beginning and the end of the trials. The featuresare extracted from the remaining signal.Since, this BCI hyper-parameter has a real value, we have discretized the search spacein our experiments.Figure 3.1: An example of sequences of trials in BCI calibration phase in whicheach trial is 8s. In the first trial, the subject performed a left hand (L1) imagerymovement followed by a No-Control (NC1) interval of 4s. In the second trial,the subject performed a right hand imagery movement (R1) followed by a 4sNC interval.(ii) EEG Channels: For datasets containing data from many channels (dataset IV2a),selecting a small set of appropriate channels forms another hyper-parameter of the system.Our algorithm selects the number of CSP filters (N = 2, 4, 6) or has the option of notapplying CSP (i.e. uses all the channels without applying any spatial filtering).51(iii) Frequency bands: Another important hyper-parameter in sensory motor BCIs, isthe choice of the frequency bands used for feature extraction. As discussed in Section3.1, ERD and ERS occur in the upper alpha and the lower beta rhythms. This hyper-parameter only applies to BP features. As mentioned in Section 2.2.2, for Morlet features,the frequency ranges were selected based on [24].For BP features, in each iteration of the optimization algorithm, the algorithm selectsbetween two options. The first option is to apply a filter-bank with 2 separate blocks cor-responding to the alpha and the beta frequencies of the brain on each channel, and then,extract band-power of the filtered signal. The second option is to apply a filter with a largebandwidth in the range ≈ [4−35]Hz that includes both alpha and beta brain waves of eachchannel.The frequency band hyper-parameter is real valued, therefore we have discretized thespace.3.3.2 Selecting Hyper-Parameters in BCIsChoice of the Bayesian Optimization AlgorithmTwo different Bayesian optimization algorithms were used here. The first one uses Gaus-sian process posterior with Matern 5/2 Kernel [92]. This covariance function results infunctions that are twice differentiable, an assumption used to perform quasi-newton opti-mization. The behavior of the prior function is governed by the choice of its parameters.The most common approach to derive appropriate values for the parameters is to use a pointestimate (e.g. Maximum Likelihood) of these values. However in [92] a fully Bayesian ap-proach is used to obtain a marginalized acquisition function. The acquisition function isthe expected improvement (EI), as it can be written in a closed form.As mentioned in Section 3.2.1, a Gaussian distribution is uni-modal, and it is not ap-propriate for use in modeling the data when the underlying distribution is multi-modal. To52handle such cases, we have also applied a second type of Bayesian optimization whichuses Random Forest (RF) posterior. The acquisition function used for RF posterior is theexpected improvement (EI).Random searchIn addition to the Bayesian optimization algorithm, we have also used random search forfinding the hyper-parameter values. Bergstra et al. have shown empirically and theoret-ically that random search can perform as good as grid-search [10] in less computationaltime.3.3.3 Results AggregationAfter performing optimization on the objective function, we obtain T classifiers which aretrained using the T candidate points proposed by the search algorithm (i.e. Bayesian opti-mization or random search). We can use the classifier with the minimum cross validationerror as our final classifier. However, instead of selecting a single classifier to predict theclass label of a given new sample, we exploit all the information gained from trainingseveral classifiers on the data, by aggregating all the results. The above hyper-parameteroptimization leads to classifiers which are trained on different parts of the data. We callthese classifiers the level one classifiers. Each level one classifier can be considered as anexpert on a specific part of the dataset. We then combine all these classifiers into one finalmeta-level classifier using three different methodologies as explained below:(i) Voting: Given a new sample, the output of each classifier (i.e. the predicted classlabel) is considered as the vote of that classifier. Then the label with the majority of votesis selected as the final label for a given sample. This method is most useful for non-probabilistic classifiers.(ii) Averaging: In probabilistic classifiers, the classifier output represents its beliefabout the class label of a new sample. Then based on the classifiers belief, we decide53the label of the new sample. This refinement of the voting method for probabilistic clas-sifiers, uses the belief of each classifier for final prediction. In this method we take theaverage of the probabilities returned from each level one classifier. In our case the numberof classifiers is the same as the number of iterations of the hyper-parameter optimizer (i.e.number of candidates we have examined).P(C f inal = k| f ) ∝ 1TT∑i=1P(ci = k| f ) k ∈ 1,2, ...,K (3.10)Where T is the number of iterations (i.e the number of level one classifiers), K is the numberof classes and f is the new sample. C f inal represents the final classifier’s label.(iii) Stacking: In this method, a learning algorithm is employed to combine the classi-fiers. A meta level learning algorithm decides the label of a new given sample.One of the most successful stacking methods is the multi-response linear regression(MLR) algorithm [36] [95]. In this meta learning algorithm, a classification problem withK output classes is converted into K regression problems. Given a new sample, the be-lief of the meta level classifier about each class label is a linear combination of the classprobability of all T classifiers. Therefore, the linear regression for class k is given byP(C f inal = k| f ) =T∑i=1αkiP(ci = k| f ), (3.11)where P(ci = k| f ) is the class probability of the ith classifier, P(C f inal = k| f ) correspondsto the class probability of the final (meta) classifier and the parameters (αki) of the linearmodels are learned using least squares.Given a new sample f , we first plug f in all of the T classifiers, then we calculate theclass probabilities P(ci = k| f ) for each k. Finally, we plug the class probabilities corre-sponding to the class k in each of the K linear models. The class k with the maximum valueof P(C f inal = k| f ) is selected as the final label of the given sample ( f ).543.3.4 Termination of the AlgorithmIn our Bayesian optimization algorithm, the algorithm stops when the termination conditionis reached, otherwise the algorithm continues for T iterations. In fact, T is the maximumpossible number of iterations for our algorithm. The termination condition basically checksif there has been an improvement in the cross-validation accuracy in the last τ iterations ofthe algorithm. To check if there is no improvement, we use the following equation:MAX−minm1≤ ε (3.12)In equation 3.12, MAX is the maximum cross validation accuracy in the last τ iterations ofthe algorithm, min is the minimum cross validation accuracy in the last τ iterations of thealgorithm, and m1 is the initial cross validation accuracy at time t = 1. ε is the threshold tostop the algorithm.3.3.5 Time Complexity of the AlgorithmThere are three factors that govern the running time of the Bayeisan optimization algorithm.1) the sum of the running times of feature extraction and classification, 2) the runningtime of each iteration of the Bayesian optimization, and 3) the number of iterations of thealgorithm.1) The sum of the running times of classification and feature extraction: In our case, therunning time of this part is small. The running time of extracting Morlet and BP featuresis low. The running time of Logistic Regression is also very low. The Hessian of the costfunction in Logistic Regression can easily and quickly be calculated. Using second orderoptimization algorithms makes the convergence of the algorithm very fast.2) Running time of each iteration of Bayesian optimization: The running time of eachiteration of the Bayesian optimization is governed by the complexity of building the pos-55terior distribution, and in finding the optimal value of the acquisition function. The com-plexity of Gaussian process is O(T 3) (assuming the parameters of the covariance functionare known). T is the number of Bayesian optimization iterations. The complexity of gen-erating a Random Forest is O(M(mT logT )), M is the number of trees, m is the number ofhyper-parameters selected to split each node of the tree. In our case M was equal to 50.In Bayesian optimization, the size of T is very small so the complexity of one iteration ofBayesian optimization is almost negligible.3) The number of Bayesian optimization iterations: Depending on the dimensionalityof the hyper-parameter space, the number of iterations required by Bayesian optimiza-tion (to converge) increases. The overall running time of the Bayesian optimization is thenumber of iterations multiplied by the sum of the running time of feature extraction andclassification, and the running time of each step of Bayesian optimization.3.4 Results and DiscussionThree sensory motor BCI datasets consisting of 21 subjects were used to evaluate themethodology proposed in this chapter. These are datasets IIa [26] and IIb [64] of BCI com-petition IV, and dataset IIIa [16] of BCI competition III. All datasets are for synchronousBCIs. We have used the training data of these datasets for our training phase and tested ourmethod on their test data (section 2.3).To show the importance of finding the appropriate values for hyper-parameters, wehave first compared the performance of different search (optimization) algorithms in Sec-tion 3.4.1. Then in Section 3.4.2, we have compared our results to those reported in theliterature. In both cases, we have conducted separate statistical tests. To gain more in-sights about the effects of the number of optimization iterations we have performed a set ofexperiments which are described in Section 3.4.3.563.4.1 Comparing with Other Hyper-Parameter Search AlgorithmsThe purpose of this section is to compare our fully automated algorithm with other searchmethods (e.g. random search, manual search), and demonstrate the importance of hyper-parameter optimization.The most common approach to search for the hyper-parameter values is the grid-search.Since, the BCI hyper-parameter space has real valued hyper-parameters (e.g. frequencybands for alpha and beta bands), we have discretized the search space (section 3.3.2). Un-fortunately, even with discretization, the search space would still contain hundreds or thou-sands of candidate points to examine. This makes it infeasible to find the optimal valuesusing the grid search. Therefore, we compare our method with the manual search method.In the manual search method, a set of hyper-parameter values that were specified by theuser are tested.We tested 3 different automated optimization algorithms: 1) Bayesian optimizationwith Gaussian Process (BO-GP) posterior, 2) Bayesian optimization with Random Forest(BO-RF) posterior and 3) random search. The initial point used to start the Bayesian op-timization was selected randomly. For each of the 3 search algorithms, we aggregated theresults of the trained classifiers using the three methods discussed in section 3.3.3; MLR,voting and averaging. We also reported the results of the best classifier. This classifier wastrained using the candidate hyper-parameter set that had minimum cross-validation error.To extract the Band Power (BP) features, we optimized 1) the part of the movementinterval from which the features are extracted, 2) the frequency bands used to extract fea-tures, and 3) the number of CSP filters for dataset (IV2a) with 22 channels. To extract theMorlet features, we optimized the same parameters excluding the frequency bands, becausewe used the same frequency bands as [24] for the Morlet features.In addition to the results obtained by our proposed fully automatic algorithm and themanual search method, we have also included the results of using the default values of57the hyper-parameters. In our results, we refer to this as the ”No-Search” method. In thismethod, the features were extracted from the whole segment of the movement interval, andno channel selection was performed. Also, in the case of BP features, the frequency bandswere in the range [8−12]Hz and [16−24]Hz which correspond to the standard alpha andbeta frequencies of the brain.The results of applying all these search algorithms when the Morlet features are usedare given in Table 3.1, and those for the BP features are given in Table 3.2. For datasetsIII3b and IV2b, the maximum number of iterations was 40 (i.e. T = 40), the maximumnumber of iterations for dataset IV2a was 60, and for all algorithms ε (equation 3.12) wase−4. We have repeated our automatic search algorithms 10 times to reduce the effects ofthe random seed.The Qualitative comparison of the different algorithms in Tables 3.1 and 3.2 suggeststhe following results: 1) for almost all subjects, Bayesian optimization significantly im-proves the results compared to the manual search method. There are however some casesin which Random Search has the best performance. 2) For all subjects, the results of ap-plying Bayesian optimization is significantly better than the results of using the defaultvalues of the hyper-parameters (i.e. No-Search columns in Tables 3.1 and 3.2). 3) For theMorlet features (Table 3.1), Bayesian optimization with Random Forest posterior outper-forms other methods. 4) For BP features (Table 3.2), Bayesian optimization with Gaussianprocess posterior outperforms other methods.As discussed in 2.2.6, according to Demsˇar et al. [32], when comparing several algo-rithms across several datasets, comparing the average ranks of different algorithms is morereliable than comparing average accuracies. Comparing the average ranks of different algo-rithms shows that for the Morlet features, the best performing algorithm is BO-RF (AVG),and for the BP features, the best algorithm is BO-GP (AVG).58Table 3.1: Accuracy of each algorithm with Morlet features. For each search algorithm we used MLR, voting (VOTE)and averaging (AVG) for aggregating the results of the classifiers. MIN is the result of the best level 1 classifier inoptimization phase. Highlighted cells show the configuration for which the performance is the best.No Manual BO-GP BO-RF Random SearchSearch Search MLR VOTE AVG MIN MLR VOTE AVG MIN MLR VOTE AVG MINsubject 1 73.58 82.39 86.10±0.91 83.77±0.55 83.84±0.80 84.28±0.00 85.60±0.77 82.70±0.42 83.65±0.69 84.28±0.00 84.28±0.63 83.14±1.01 81.07±0.71 83.96±0.94subject 2 83.88 83.89 84.19±0.09 83.30±0.26 83.35±0.23 83.15±0.00 84.20±0.12 83.11±0.23 83.28±0.35 83.15±0.00 83.81±0.42 82.17±0.58 82.80±0.29 82.39±1.20subject 3 77.4 78.15 86.20±0.22 83.89±0.23 84.98±0.33 80.74±0.00 86.46±0.31 83.37±0.44 84.70±0.45 80.74±0.00 86.46±0.55 83.74±0.51 84.98±0.54 80.70±0.11subject 4 68.85 68.86 73.33±0.26 70.44±0.59 70.70±0.70 70.18±0.00 73.68±0.48 70.83±0.35 71.40±0.75 69.69±1.45 73.51±1.24 70.13±1.15 71.80±0.79 70.35±2.15subject 5 57.95 58.37 56.82±1.34 58.73±0.94 58.37±0.45 58.78±0.00 56.53±1.57 61.27±1.36 60.57±1.21 57.92±1.23 57.14±1.49 58.69±0.94 58.41±0.76 57.22±1.19subject 6 55.21 53.48 53.26±1.74 54.96±0.71 55.74±0.91 56.87±0.26 53.48±0.00 58.52±0.40 58.09±0.29 56.96±0.00 53.48±1.63 54.87±1.06 55.70±1.31 55.35±3.03subject 7 96.09 94.79 96.06±0.27 96.64±0.21 96.35±0.24 97.62±0.15 95.44±0.00 97.04±0.18 97.13±0.20 97.69±0.10 96.19±0.90 95.96±0.49 96.48±0.41 97.04±0.76subject 8 86.44 91.58 88.86±0.72 91.61±0.45 91.50±0.32 88.39±0.17 89.19±0.34 90.22±0.33 91.32±0.23 88.46±0.18 89.41±1.09 90.84±0.91 90.33±0.85 86.92±3.03subject 9 84.86 82.87 85.62±0.28 84.86±0.31 85.10±0.32 85.66±0.00 85.86±0.74 86.18±0.86 85.94±0.36 85.46±1.47 85.26±1.39 85.10±0.69 85.46±1.03 84.86±1.61subject 10 73.27 72.84 72.11±0.51 75.34±0.42 74.57±0.61 72.41±0.00 72.41±0.70 77.37±0.59 76.98±0.65 72.59±0.52 74.83±1.52 76.98±0.93 75.86±0.96 72.41±1.49subject 11 85.21 83.48 89.09±0.23 88.43±0.21 88.65±0.13 84.78±0.00 89.09±0.41 91.35±0.71 91.96±0.68 85.22±1.51 89.30±0.83 91.17±0.87 90.91±1.17 85.30±0.80subject 12 85.3 86.12 81.96±0.31 85.92±0.46 87.06±0.41 85.71±0.00 79.71±1.16 84.61±0.73 84.53±0.76 85.63±0.24 81.55±1.81 84.78±1.13 84.82±0.87 84.82±1.25subject 13 54.51 77.77 63.23±4.52 84.86±0.62 85.00±0.66 83.68±0.00 68.85±9.76 81.53±1.44 82.12±1.20 83.68±0.00 68.06±8.33 82.08±1.46 82.29±1.31 82.26±2.69subject 14 34.72 52.77 35.49±1.35 44.51±0.43 44.03±0.97 46.53±0.00 30.07±2.80 40.94±1.48 44.86±2.15 42.43±3.10 33.72±1.64 40.76±1.17 42.33±0.91 46.08±3.40subject 15 59.72 86.11 82.85±0.78 86.18±0.51 86.60±0.59 85.28±0.23 82.26±1.28 85.66±0.97 86.60±0.88 84.97±0.31 81.32±3.44 83.85±1.21 84.79±0.69 85.00±1.57subject 16 44.44 47.91 43.65±4.79 60.49±0.73 61.98±1.31 59.38±0.00 41.04±6.30 64.86±1.06 66.28±1.44 58.82±1.06 48.23±8.20 61.91±1.61 62.33±1.95 58.19±2.28subject 17 32.638 38.54 43.65±0.16 43.06±1.24 43.82±0.95 42.71±0.00 38.72±2.44 46.08±1.18 48.72±1.54 43.78±2.39 33.19±4.63 44.72±1.88 46.04±2.24 43.09±2.34subject 18 36.8 42.7 47.29±1.61 44.90±0.68 47.50±1.16 41.94±1.60 44.10±0.38 53.61±1.57 53.30±1.55 44.79±0.00 41.70±2.81 47.74±1.93 51.32±1.46 42.53±1.30subject 19 58.68 58.33 64.55±1.46 67.60±0.35 69.41±0.96 64.93±0.00 67.15±3.09 70.07±1.36 72.64±1.78 64.03±1.36 66.53±4.33 72.05±1.67 75.03±1.15 64.48±2.04subject 20 43.75 79.16 83.51±0.70 82.57±0.58 83.75±0.67 83.40±0.14 79.76±6.44 80.66±1.97 82.33±2.18 79.55±3.18 72.50±10.80 82.50±1.84 83.23±1.70 80.66±1.53subject 21 49.65 77.08 75.00±0.00 78.65±1.23 79.41±0.93 79.51±0.00 73.85±3.44 76.18±0.78 76.35±0.74 79.51±0.00 72.36±4.65 76.01±1.43 76.74±1.26 76.84±3.3159Table 3.2: Accuracy of each algorithm with BP features. For each search algorithm we used MLR, voting (VOTE) and aver-aging (AVG) for aggregating the results of the classifiers. MIN is the result of the best level 1 classifier in optimizationphase. Highlighted cells show the configuration for which the performance is the best.No Manual BO-GP BO-RF Random SearchSearch Search MLR VOTE AVG MIN MLR VOTE AVG MIN MLR VOTE AVG MINsubject 1 69.81 80.5 83.08±0.82 83.21±0.63 83.33±1.02 84.03±0.50 74.09±1.46 74.59±0.64 75.41±0.66 74.21±0.89 81.19±1.88 80.44±1.24 81.51±1.23 80.88±1.47subject 2 69.62 70.19 77.59±0.37 75.87±0.34 75.85±0.36 77.78±0.00 65.65±1.49 60.61±0.63 61.33±0.92 65.11±1.52 73.74±1.09 68.24±1.04 69.43±1.48 72.07±2.22subject 3 74.07 74.26 78.65±0.51 78.46±0.37 78.70±0.51 77.78±0.97 61.56±4.15 59.63±1.10 60.30±1.13 63.28±1.36 77.39±0.84 73.89±0.77 74.85±0.72 77.33±1.54subject 4 62.28 60.96 67.81±0.76 67.02±1.14 67.19±1.03 66.75±0.94 50.09±3.49 52.02±1.11 53.03±2.02 54.47±2.87 65.66±1.81 61.32±0.92 62.54±0.79 66.23±1.30subject 5 57.55 56.33 59.35±1.24 60.73±1.35 61.80±1.23 60.78±1.54 50.61±2.45 53.27±0.58 55.22±1.28 52.78±2.42 61.63±1.49 59.88±1.10 62.45±0.77 60.86±1.59subject 6 53.91 56.09 55.91±0.85 58.61±1.73 59.35±1.43 57.13±0.52 54.30±2.60 52.48±1.15 52.87±1.39 53.87±1.71 55.78±3.41 56.74±2.24 56.09±1.08 57.78±1.91subject 7 92.18 94.79 94.82±0.57 92.87±0.63 93.39±0.39 92.44±1.07 78.44±3.81 66.38±3.47 71.63±3.62 78.14±4.54 95.08±0.98 93.84±0.79 94.20±0.65 92.25±1.41subject 8 63.73 67.77 72.78±0.98 70.95±1.25 71.50±1.02 71.79±0.00 55.57±2.59 56.85±1.20 57.14±1.87 57.77±2.79 71.61±1.94 64.80±0.93 65.64±1.60 70.59±3.32subject 9 72.11 75.7 83.55±0.99 79.40±1.17 79.52±0.80 80.52±2.24 65.66±3.15 60.88±1.47 62.47±1.29 65.46±2.78 82.07±1.83 75.78±1.10 77.73±1.55 80.12±2.02subject 10 49.56 53.02 76.21±1.50 73.23±1.16 74.57±1.31 73.71±2.75 49.14±0.00 49.14±0.00 49.78±0.65 50.47±2.76 75.17±0.99 64.91±1.14 68.88±1.07 73.58±2.39subject 11 92.17 92.18 91.96±0.40 91.57±0.62 91.61±0.62 91.00±1.08 53.48±5.22 51.61±3.62 51.48±7.56 52.87±11.22 90.22±1.09 90.43±0.75 90.83±0.36 90.43±1.73subject 12 77.55 77.55 85.67±0.69 84.41±0.48 83.76±0.70 83.76±1.35 52.94±4.11 52.16±1.51 53.63±2.27 56.57±3.88 81.84±1.92 78.49±1.27 77.96±0.98 80.82±3.10subject 13 53.47 72.56 74.86±2.42 80.56±1.20 79.65±1.19 76.98±2.60 64.10±3.88 64.72±2.13 63.78±2.31 67.78±1.12 69.76±4.19 78.58±2.14 79.13±1.10 75.45±2.74subject 14 40.27 52.08 41.46±2.24 45.24±1.55 47.33±2.02 40.49±3.25 31.67±1.50 33.30±0.88 33.58±0.82 32.78±2.52 45.14±4.03 45.31±1.88 46.56±1.29 43.85±2.66subject 15 62.15 71.18 70.35±2.27 72.85±0.89 73.30±0.77 69.44±1.98 73.96±1.31 75.31±0.94 75.73±1.12 75.21±1.10 68.96±2.23 71.15±0.78 71.98±0.91 68.61±2.22subject 16 45.13 60.76 58.26±1.85 59.72±0.60 61.25±0.91 57.40±1.77 57.88±1.58 58.47±2.02 59.83±1.90 57.26±2.35 58.92±3.34 64.69±0.84 63.92±1.15 57.85±2.84subject 17 35.76 31.59 32.19±1.15 37.57±1.67 35.56±1.42 30.97±0.21 36.88±1.76 36.49±2.39 36.25±2.87 32.99±2.50 36.01±2.13 40.24±1.07 40.49±1.73 34.69±1.70subject 18 41.66 40.62 48.61±2.02 54.44±1.61 55.45±1.22 43.23±0.17 44.38±2.62 47.74±2.82 48.12±2.37 41.49±1.38 47.12±3.97 46.63±1.76 49.72±2.30 45.52±3.07subject 19 56.59 60.06 75.42±2.63 77.53±0.84 77.64±0.70 73.99±2.28 70.83±3.69 71.46±3.52 70.90±2.21 68.40±4.73 67.57±2.58 75.28±0.77 74.69±1.37 67.85±4.82subject 20 56.25 78.47 75.62±1.55 79.72±0.61 80.52±1.01 75.52±1.08 74.65±3.28 77.88±2.89 80.03±2.50 76.28±3.14 78.47±2.52 81.15±1.18 81.98±1.47 76.18±2.44subject 21 61.45 66.66 75.03±2.57 80.76±1.43 81.15±1.28 72.26±2.92 76.18±2.34 77.01±1.18 77.43±0.89 76.18±1.09 73.85±4.11 78.44±2.03 79.51±1.64 72.85±3.6760Statistical Testswe use the Friedman test to statistically validate the obtained results. The null hypothesisassumes that all algorithms have the same performance i.e., they have the same rank. Ifthe p-value is low enough to reject the null hypothesis, it is concluded that the differencebetween the algorithms is not random. If the null hypothesis is rejected, we use Bergmann’stest [44] as the post-hoc statistical test. For this test, we carried out a pairwise comparisonof all the five algorithms (i.e. No-Search, Manual Search, BO-GP, BO-RF and RandomSearch). Since our goal is to show the importance of hyper-parameter optimization andthe power of our proposed method (in finding good candidate hyper-parameter values), wehave compared MIN column of the automatic methods to Manual Search and No-Searchscenarios. MIN column corresponds to the performance of the BCI in the test phase, whenthe corresponding hyper-parameter value results in the best cross-validation accuracy in theoptimization phase.In the Bergmann’s test, each null hypothesis assumes that two different algorithms havethe same performance. In pairwise comparison of algorithms, we have(52)= 10 hypothesesto test, and each of them can be true or false. These hypotheses are logically related i.e.some combinations of true and false hypothesis cannot hold at the same time. Bergmann’stest considers the logical relation between the different hypotheses. This algorithm takesall the sets of hypotheses Ψ that can be true at the same time. Then it computes a set A, andevery hypothesis H which does not belong to A is rejected.A = ∪{S : S ∈Ψ and for each hypothesis Hi ∈ S, Pi > α/|S|}, (3.13)where Pi is the p-value corresponding to hypothesis Hi, and α is the significance level.To perform the statistical tests for each feature extraction method, we first ranked thefive different search algorithms based on their performance. The ranking for each subject61Table 3.3: Average ranking and average accuracy of the optimization algorithms with(a) Morlet features and (b) BP features.(a) Morlet FeaturesMethod Average Ranking Mean Accuracy (%)No-Search 4.21 63.95Manual Search 3.19 71.29BO-GP (MIN) 2.14 73.14BO-RF (MIN) 2.30 72.83Random Search (MIN) 3.14 72.40(b) BP FeaturesMethod Average Ranking Mean Accuracy (%)No-Search 3.97 61.30Manual Search 2.83 66.35BO-GP (MIN) 2.0 69.42BO-RF (MIN) 4.0 59.68Random Search (MIN) 2.19 68.85was done separately, then we averaged the rank of different algorithms over the subjects.For both the Morlet and the BP features, the average ranks of each algorithm are shown inTable 3.3.For the Morlet features, after performing the Friedman’s test, the null hypothesis wasrejected (p-value = 1.2E-4), indicating that there is a significant difference between theperformance of different algorithms. Bergmann’s post-hoc test (Table 3.4) rejected the fol-lowing hypotheses: No-Search vs. BO-GP (MIN) and No-Search vs. BO-RF (MIN). Forthe BP features, the null hypothesis was also rejected (p-value = 3.7E-6) when we appliedFriedman statistical test. Therefore, there are significant differences between the algo-rithms, and the differences between their performances are not random. The Bergmann’sprocedure rejected the following hypotheses: No-Search vs. BO-GP (MIN), No-Search vs.Random Search (MIN), BO-GP (MIN) vs. BO-RF (MIN) and BO-RF (MIN) vs. RandomSearch (MIN) (Table 3.4).Based on the results of all the above statistical tests, we can conclude that for the Morlet62features, Bayesian optimization with Random Forest or Gaussian process posterior are thebest suited methods for hyper-parameter optimization. Also, for the BP features, Bayesianoptimization with Gaussian process posterior and Random Search algorithms are the bestchoices for hyper-parameter optimization.The results of the performed statistical tests support the idea that searching for suitablevalues for the hyper-parameters can significantly improve the performance of the algo-rithm. In addition to the results of the statistical tests, we have also included in Table 3.3the average accuracy of different algorithms across all subjects. As shown in Table 3.3, forboth feature extraction methods, Bayesian optimization can considerably improve the av-erage accuracy. Comparing the average accuracies across all subjects shows that Bayesianoptimization can boost the accuracy from 63.95% to 73.14% for the Morlet features. Forthe BP features, the average accuracy across all subjects increases from 61.30% to 69.42%if we use Bayesian optimization.3.4.2 Comparing the Results with LiteratureAs discussed in the previous section, for the Morlet features, Bayesian optimization withRandom Forest and averaging (BO-RF (AVG)) yields the best results, and for the BPfeatures, Bayesian optimization with Gaussian process posterior and averaging (BO-GP(AVG)) yields the best results. Below, we compare the results in the literature with thesetwo algorithms (i.e. BO-RF (AVG) and BO-GP (AVG)).Table 3.5 shows the results of comparing our algorithms to those reported in the litera-ture for dataset III3b. For datasets IV2b and IV2a, the results are given in Tables 3.6 and3.7 respectively. In all Tables, the last row shows the average rank of different algorithms.To compare the results of our algorithm to those reported in the literature, we have used theresults of their best algorithms, reported in each of [25], [24], [106], [4] and [7].To compare the different algorithms, we have performed the Friedman’s statistical test63Table 3.4: P-values corresponding to pairwise comparison of different hyper-parameter search algorithms (a) Morlet features and (b) BP features. α is chosento be 0.05. Bergmann’s procedure rejects hypotheses 9 and 10 for the Morletfeatures. For the BP features, hypotheses 7, 8, 9 and 10 are rejected. The resultsare rounded up to 3 decimal places.(a) Morlet Featuresi Hypothesis p-value10 No-Search vs. BO-GP (MIN) 0.0009 No-Search vs. BO-RF (MIN) 0.0008 No-Search vs. Random Search (MIN) 0.0287 Manual Search vs. BO-GP (MIN) 0.0316 No-Search vs. Manual Search 0.0355 BO-GP (MIN) vs. Random Search (MIN) 0.0404 Manual Search vs. BO-RF (MIN) 0.0713 BO-RF (MIN) vs. Random Search (MIN) 0.0872 BO-GP (MIN) vs. BO-RF (MIN) 0.7321 Manual Search vs. Random Search (MIN) 0.922(b) BP Featuresi Hypothesis p-value10 BO-GP (MIN) vs. BO-RF (MIN) 0.0009 No-Search vs. BO-GP (MIN) 0.0008 BO-RF (MIN) vs. Random Search 0.0007 No-Search vs. Random Search (MIN) 0.0006 Manual Search vs. BO-RF (MIN) 0.0165 No-Search vs. Manual Search 0.0194 Manual Search vs. BO-GP (MIN) 0.0873 Manual Search vs. Random Search (MIN) 0.1872 BO-GP (MIN) vs. Random Search (MIN) 0.6961 No-Search vs. BO-RF (MIN) 0.961Table 3.5: Accuracy of our algorithms compared to the results reported in the litera-ture for dataset III3b. Highlighted cells show the best performing algorithm. Thetext inside the parenthesis corresponds to the original label of the subjects usedin the BCI competition.BO-GP(AVG) BO-RF(AVG) results of results of results of(BP) (Morlet) [25] [24] [106]subject 1 (O3) 83.33 83.65 77.1 80.7 89.3subject 2 (S4) 75.85 83.28 81.5 81.7 72.96subject 3 (X11) 78.7 84.7 80.4 80.9 74.26Average 79.29 83.88 79.67 81.1 78.84Average Rank 3.67 1.33 3.67 2.67 3.6764Table 3.6: Accuracy of our algorithms compared to the results reported in the litera-ture for dataset IV2b. Highlighted cells show the best performing algorithm. Thetext inside the parenthesis corresponds to the original label of the subjects usedin the BCI competition.BO-GP(AVG) BO-RF(AVG) results of results of results of(BP) (Morlet) [4] [25] [24]subject 4 (100) 67.19 71.4 70 77.5 74.4subject 5 (200) 61.8 60.57 60.5 56.4 56.1subject 6 (300) 59.35 58.09 61 51.9 52.2subject 7 (400) 93.39 97.13 97.5 93.4 95.6subject 8 (500) 71.5 91.32 93 96.9 95.9subject 9 (600) 79.52 85.94 80.5 87.8 89.1subject 10 (700) 74.57 76.98 78 80.6 72.2subject 11 (800) 91.61 91.96 92.5 80 89.1subject 12 (900) 83.76 84.53 87 78.8 82.8Average 75.85 79.77 80.0 78.14 78.6Average Rank 3.8 2.6 2.1 3.2 3.3Table 3.7: Accuracy of our algorithms compared to the results reported in the litera-ture for dataset IV2a. Highlighted cells show the best performing algorithm. Thetext inside the parenthesis corresponds to the original label of the subjects usedin the BCI competition.BO-GP(AVG) BO-RF(AVG) results of results of(BP) (Morlet) [4] [7]subject 13 (1) 79.65 82.12 76 80.5subject 14 (2) 47.33 44.86 56.5 53.5subject 15 (3) 73.3 86.6 81.25 79subject 16 (4) 61.25 66.28 61 62.5subject 17 (5) 35.56 48.72 55 44.5subject 18 (6) 55.45 53.3 45.25 50.5subject 19 (7) 77.64 72.64 82.75 76.75subject 20 (8) 80.52 82.33 81.25 78.25subject 21(9) 81.15 76.35 70.75 82Average 65.76 68.13 67.75 67.5Average Rank 2.9 2.0 2.5 2.665for each dataset separately. The Last row in Tables 3.5, 3.6 and 3.7 show the average rank-ing of different algorithms in datasets III3b, IV2b and IV2a, respectively. In all datasets,the Friedman’s statistical test did not reject the null hypothesis. The p-values for the Fried-man’s test in datasets III3b, IV2b and IV2a were 0.280, 0.138 and 0.471, respectively.These results suggest that there is no statistical difference between the best performingresults in the literature and our results.To be able to draw conclusions about the performance of different algorithms, we havealso compared the mean accuracy of different methods. In datasets III3b and IV2a, theaverage accuracies of our algorithms are 2.78% and 0.38% better than the results reportedin the literature, respectively. In dataset IV2b, the mean accuracy of [4] is 0.23% betterthan our algorithms. It is also important to mention that in both datasets III3b and IV2a,the average rank of our algorithm is considerably better than the others.While our proposed algorithm yielded similar or better performance compared to theliterature, it is important to note that the best reported results in the literature were achievedbased on manual fine-tuning of the hyper-parameters of the BCI system and based on moresophisticated feature extraction and classification methods. Our results show that our au-tomated parameter optimization of a BCI system with less sophisticated feature extractionand classification methods yields similar/superior results compared to best performing de-signs in the literature.3.4.3 Effect of Number of Optimization IterationsTo gain more insights about the effects of the number of optimization iterations (i.e., thenumber of classifiers) on the BCI performance results, we have conducted an experimentto observe the cross-validation accuracy and test set accuracy with different numbers ofoptimization iterations ranging from T=1 to T=80 iterations.For subject 1 of dataset IV2a, Figure 3.2 shows the cross-validation accuracy and the660 10 20 30 40 50 60 70 80Number of iterations0102030405060708090Accuracy (%)BO-GP (MLR_CV)BO-GP (MLR_Test)BO-GP (AVG_CV)BO-GP (AVG_Test)BO-GP (VOTE_CV)BO-GP (VOTE_Test)Figure 3.2: Accuracy of our algorithm with the BP features versus the number ofiterations. The label of cross validation curves finish with ’ CV’ and the labelof test data accuracies finish with ’ Test’.test set accuracy of our algorithm as a function of the number of optimization iterationswith BP features. Figure 3.3 repeats the same but with the Morlet features.As shown in these figures, as the number of iterations increases the cross-validationaccuracy eventually reaches a plateau. The termination condition in our algorithm checkswhen the cross-validation accuracy stops improving and then stops the optimization pro-cess. In the case of MLR algorithm, if we do not stop the optimization process, we start tosee some signs of over-training i.e. the performance on the test set declines compared tothe cross-validation accuracy. This happens after about 25 iterations in the Morlet featuresand 50 iterations in the BP features. We did not observe any signs of over-fitting with othermethods that combine the classifiers (i.e., AVG and VOTE). However, to decrease the riskof over-training, we have used the termination condition described in Section 3.3.4 ratherthan using a fixed number of iterations which is prone to over-fitting.670 10 20 30 40 50 60 70 80Number of iterations0102030405060708090Accuracy (%)BO-GP (MLR_CV)BO-GP (MLR_Test)BO-GP (AVG_CV)BO-GP (AVG_Test)BO-GP (VOTE_CV)BO-GP (VOTE_Test)Figure 3.3: Accuracy of our algorithm with the Morlet features versus the number ofiterations. The label of cross-validation curves finish with ’ CV’ and the labelof test data accuracies finish with ’ Test’.We performed the same analysis for all the other subjects (in all datasets) and observedthe same results as subject 1 of dataset IV2a.3.5 ConclusionCustomizing the parameters of a BCI system is a necessary step to capture each subject’sunique brain signal characteristics. Such characteristics are captured by a collection of pa-rameters of the different blocks of a BCI system, and form the hyper-parameters of a BCIsystem. These hyper-parameters are either tuned manually based on the operator experi-ence (which is very cumbersome and not optimal); or in a semi-automated approach.In this chapter, we proposed a fully automated optimization-based approach for tuningthe hyper-parameters of a BCI system for each subject with the goal of achieving a user-customized BCI. To be more specific, we demonstrated the utility of Bayesian optimization68with Gaussian process and random forest posteriors to tune the parameters of a motor-imagery BCI. The BCI was based on the band-power and the Morlet wavelet features andLogistic Regression classification followed by ensemble classification. In our framework,we treated each of the classifiers (built upon each set of hyper-parameters) as a member ofthe ensemble of classifiers. This is to harness as much information as possible since eachhyper-parameter set captures specific characteristics of the brain signals.The results of comparison of the performance of different algorithms on 21 subjects inthe previous section, showed that, when the BP features were used, Bayesian optimizationwith GP posterior and averaging (BO-GP (AVG)) outperform other methods. When theMorlet features were used, the proposed Bayesian optimization with RF posterior and av-eraging (BO-RF (AVG)) yielded better performance. In addition to providing a frameworkfor automated tuning of the parameters of a BCI system for each subject, our findings con-firm that customizing a BCI for each subject has a major role in improving the performanceof the system.More specifically, using our proposed approach, the mean accuracies improved by8.12% and 9.19% when the band power and the Morlet features were used in the BCIsystems respectively. These results highlight the importance of parameter optimization andutility of our approach in finding optimal parameters of a BCI system. We have also backedour results with proper statistical tests.We have compared the performance of the optimized BCIs with those reported in theliterature and have shown that hyper-parameter optimization leads to similar or better per-formance compared to the literature. To be more specific, in dataset III3b, our results werebetter than the ones reported in the literature. The average accuracy before optimizationwas 71.16% for the Morlet features. After applying our proposed algorithm, the averageaccuracy is boosted to 83.88%. The best reported method in the literature has an averageaccuracy of 81.1%. In dataset IV2b, the average accuracy across all subjects using the69hyper-parameter values usually used in the literature was 77.02%; our algorithm boosted itto 79.77%. The average accuracy of the best reported results in the literature was 80%. Fordataset IV2a, the average accuracy for the Morlet features before optimization was 46.1%and after optimization it became 68.13%. The competition winners average accuracy acrossall subjects was 67.75%.We demonstrated the utility of our proposed optimization-based approach to customizethe parameters of two BCI systems (based on the BP and the Morlet features) that performmotor-imagery tasks. The proposed framework however can be applied to any BCI systemthat is based on other neurological phenomena and to any combination of other featureextraction and classification schemas.Our results show that applying a good hyper-parameter selection algorithm can boosta simple baseline algorithm’s performance to enable it to compete with the state of theart BCIs. It is also important to mention that our algorithm can also be applied to any BCIdesign irrespective of the number of parameters that need to be optimized. Our algorithm iseasy to apply, fully automatic, inexpensive and does not need expertise in EEG. In addition,unlike similar works in the field, our algorithm is scalable and can tune any number ofhyper-parameters in a BCI system.70Chapter 4Discriminative Sequence LearningAlgorithms for Self-paced BCIs4.1 IntroductionAs we described in chapter 1, BCI systems can be classified into two categories: syn-chronous and self-paced systems [9]. In synchronous BCIs, subjects control the BCI outputduring short periods specified by the system, while self-paced BCIs, give users the optionof controlling the system whenever they wish to. For the latter type of BCIs, the periodsduring which the user is controlling the system are called Control states and those duringwhich the user is not controlling the system are called No-Control (NC) states.Designing a self-paced BCI which is ultimately the more natural way to operate BCIsis extremely challenging compared to synchronous BCI systems. The task of classifyingbrain signals in a self-paced BCI is formulated as a sequential supervised learning problem[34]. In sequential learning, a sequence of observations and their corresponding labels areknown, and the goal is to construct a classifier that predicts the sequence of the labels of anew sequence of observations.71The most popular way to obtain observation sequences for sequential supervised learn-ing is to use a sliding window over the signal (these windows might overlap). This approachtakes consecutive input windows of the EEG brain signal, each of length w milliseconds,extracts features from each window and assigns a label to each window. The assigned labelcorresponds to the intention of the user, as to whether or not he/she wants to operate thedevice. The sequence of the extracted feature vectors and their corresponding labels areused to train a classier. For a new sequence, the trained classifier estimates the label ofeach window (i.e. the intention of the user).The advantage of using the sliding window approach is that it converts the sequen-tial supervised learning problem into a standard classification problem. Thus any classicalsupervised learning method could be used to solve the problem. The majority of the pub-lications in the field have used this approach to build different self-paced BCIs [8]. Thedisadvantage of the sliding window approach is that the sequential correlation in the labelsof consecutive EEG windows is not exploited. The observations and the labels of nearbywindows are usually related to each other, and it is thus important to use the informationabout sequential correlation between adjacent EEG windows.In [77] the authors show that the brain goes through several well-defined internal statechanges while the subject is carrying out a motor imagery mental task. Utilizing an al-gorithm that can model/exploit these state transitions can enhance the performance of theBCI. The class of sequence labeling algorithms has been used to exploit the temporal struc-ture of the EEG data. Among these algorithms the most popular is the Hidden MarkovModel (HMM) [81] which is a generative classifier. A generative classifier models thejoint probability of observations and label sequences. Although the HMM classifier hasbeen successful in synchronous BCIs, in [27] the authors concluded that the sliding win-dow approach (i.e. using classical classifiers) is superior to HMM in self-paced BCIs. Dueto intractability issues, HMMs assume the observations are independent given the states.72This makes it difficult to incorporate the knowledge about the structure of the EEG datainto the model by extracting informative overlapping observations.Another type of sequence labeling classifiers (besides HMMs) are the discriminativeclassifiers. These classifiers directly maximize the conditional likelihood of the label se-quence given the observations. These algorithms have yielded very promising results in thenatural language processing [74], and activity recognition [100] fields (which have verysimilar nature to the task of self-paced classification of BCIs). The advantage of thesemodels is that they give the user the freedom to extract many informative and overlappingfeatures from the observation sequence 1. These features might be extracted from previouswindows of the brain signal and may be correlated.In [46] and [86], the authors applied discriminative sequence labeling algorithms toself-paced BCIs to classify different motor imagery tasks. We, on the other hand, usediscriminative sequence labeling algorithms to discriminate between NC and control states.As discussed above, during the NC periods the subject’s brain can be in any state thereforeit is extremely difficult to find specific patterns in the signal. This makes the discriminationbetween NC and control states more difficult compared to the discrimination of differentmovement imagery tasks.In this chapter, we propose a new discriminative sequence labeling classifier to solvethe self-paced BCI problem. Our method combines the power of one of the most popu-lar discriminative sequence labeling classifier, i.e. Conditional Random Fields (CRF) [60]to exploit the correlation in consecutive EEG windows. It also utilizes a neural network toextract high-level features from the observations. We call our method Neural Network Con-ditional Random Fields (NNCRF). Using NNCRF the observation vector is passed througha multi-layer neural network. The neural network applies several layers of nonlinear trans-formation on the observations vector. Then, the output of the neural network is given to a1These observations correspond to the features extracted from the raw EEG signal.73CRF classifier. It is also important to note that the parameters of the CRF and the neuralnetwork parts are learned jointly.4.2 EEG Sequence Labeling with Hidden MarkovModelsHidden Markov Models directly model the joint distribution of observations and labels[81]. The non-stationary nature of EEG signals makes it difficult for the HMM to modelthe observations. As this method is a generative learning algorithm, defining the emissiondistribution is difficult when the feature space is of high dimensions. In such cases thenumber of parameters is too large and this might lead to over-fitting. Therefore it is com-mon to use simplifying assumptions, e.g. conditional independence of the elements of theobservation vectors given the labels. This however makes this technique less powerful. Theemission function we used here is the mixture of Gaussian distributions.To apply HMMs on self-paced BCIs, we trained different HMMs for the different men-tal tasks. We trained two different HMMs, one was trained using NC data and the otherone using the samples of the movement (Control) task. Then the likelihood of a new givensequence was calculated using the forward-back algorithm and the classification was per-formed by comparing the likelihoods of the different HMMs. In this case, each HMMfocuses on learning the structure of the mental task that it is trained on, instead of learningto discriminate between different tasks.The emission function used for HMM is the mixture of Gaussian distributions. Theparameters of the HMM include the transition probabilities, and the parameters of the mix-ture of Gaussian distributions. The parameters are learned by maximizing the likelihood ofthe training dataset. Baum-Weltch algorithm is used to learn the parameters of the HMMmodel [12].744.3 Discriminative Sequence Labeling AlgorithmsFor sequence labeling algorithms, each data sample in the training set consists of a se-quence of observations2, and its corresponding label sequence. Assuming the training setis {xi,yi}Ni=1 where N is the number of training samples. For a training sample there areK windows i.e. xi is the sequence of observation vectors from K consecutive windows andyi is the corresponding sequence of K labels. As in the sliding window approach, every wmilliseconds of the EEG creates a window. xi is created by concatenating the vectors ofobservations of each of K consecutive windows i.e. xi = [xi1, ...,xik, ...,xiK] and xik corre-sponds to the vector of observations in the kth window in the sequence number i 3. Like-wise, yi is created by concatenating the labels of each of these K consecutive windows i.e.yi = [yi1, ...,yik, ...,yiK] and yik corresponds to the label of the kth window in the sequencenumber i.Discriminative sequence labeling classifiers directly maximize the conditional likeli-hood of the label sequence given the observations. These models make it possible to extractmany informative and overlapping features from the observation sequence. Furthermore,these models have the ability to represent long range dependencies between observations.The conditional nature of these models gives them the freedom to relax the independenceassumptions made by HMM.The power of these methods comes from their ability to design a set of features basedon the structure of the data. Each feature is a function of the joint observations and labelpair. In classical classification the feature vector is built based on the observations (X) only.However, here the idea is to extract features from the joint observations (X) and label (Y )spaces. In this way, each feature function Φ measures the compatibility of x (xi) and y (yi)4. Although more complex types of features can be used, in all the following discriminative2These observations correspond to the features extracted from the raw EEG signal.3In our case, xik corresponds to the bandpower of the window in a specific frequency band.4For the sake of clarity, we drop the index i in the remainder of the text75sequence labeling classifiers the feature functions are inspired by a first order HMM.In the following, we breifly describe three different discriminative sequence labelingclassifiers, the Hidden Markov Support Vector Machines (HMSVM), the Conditional Ran-dom Fields (CRF) and the Neural Networks Conditional Random Fields (NNCRF).4.3.1 Hidden Markov Support Vector MachinesSequential supervised learning can be modeled as a special case of structured learning inwhich both observations and labels are sequences. In structured learning problems, theoutput of the classifier has a particular structure. It is possible to solve a structural learningtask using any multi-class classifier. However, in this case the internal structure of theoutput is not exploited and the number of classes might be very large which makes usingthese algorithms almost impossible. Structural Support Vector Machine (SSVM) [98] is ageneralization of multi-class SVMs which allows us to build classifiers for structured data(such as sequence of observations and labels). SSVM has the advantage of being a largemargin classifier like SVM.A Hidden Markov Support Vector Machine [2] is a special case of SSVM in which thefeatures are designed to capture the sequential nature of the data. The set of feature vectorswe use, capture the dependency between consecutive labels in a sequence (yk and yk−1), andmeasure the relation between the observation in the kth window (xk) and its correspondinglabel yk in a sequence.The first set of features captures the dependency between two consecutive labels in asequence:φσ1σ2k(k−1) = φ(yk,yk−1)= I(yk = σ1∧ yk−1 = σ2), σ1,σ2 ∈ Σ,(4.1)I(.) denotes the indicator function which has the value 1 when the input is true and 0otherwise. Σ corresponds to the set of possible values of the label of each window i.e.76movement imagery or NC state. The vector φk(k−1) is formed by stacking φσ1σ2k(k−1) elementsfor all possible values of σ1 and σ2.The second type of features measures the relation between the observation token xk andits corresponding label yk in the sequence:φσk = φ(xk,yk) = I(yk = σ)xk,σ ∈ Σ, (4.2)where xk represents the observations from the kth window of the sequence. The vector φkis formed by stacking φσk for all possible values of σ .By concatenating the two type of feature vectors, we obtain a single vector of all fea-tures for the kth window. Therefore, the whole set of features for the kth window (φ(x,y,k))is equal to [φk,φk(k−1)]. To create the final set of features Φ(x,y) for the sequence, we sumthe feature vectors over the sequence:Φ(x,y) =K∑k=1φ(x,y,k). (4.3)Therefore, Φ(x,y) for type one features corresponds to the number of transitions betweendifferent labels. For type two features, it is the sum of the band-power of the consecutivewindows in a sequence for each possible label. These types of features are very similar tothe maximum likelihood solution of an HMM.As in [2], we have used a linear function of the observation/label pair for classificationof the brain signals. Given a new unseen sequence xN+1 the goal is to predict its corre-sponding sequence of labels. For inference i.e. assigning a sequence of labels to xN+1 wemaximize the following equation:ypredicted = argmaxyW TΦ(xN+1,y), (4.4)77W TΦ(x,y) is a linear function defined on the joint space of observation/Labels sequencespair. Like HMMs, the solution of Equation 4.4 is found using the Viterbi algorithm.For the training phase i.e. finding an optimal value of the vector W , SSVM solves thefollowing minimization problem:minW,ξ12||W ||2+CN∑i=1ξi,s.t. ∀ j ∈ [1,N],∀y ∈ Y,W TΦ(x j,y j)−W TΦ(x j,y)≥ ∆(y j,y)−ξ j,(4.5)where ∆(y j,y) is the loss function that measures the penalty of estimating a wrong label forthe sequence. This maximization problem is a convex problem however, it involves a largenumber of linear inequality constraints. These constraints make the distance between thetrue label of a sequence and all other possible values of the sequence label maximum [97].4.3.2 Conditional Random FieldsThe Conditional Random Field (CRF) is another discriminative classifier which is wellsuited for sequential supervised learning. In general the linear chain Conditional RandomFields classifier models the posterior distribution of the formPr(Y |X) =exp(J∑j=1λ jK∑k=1Φ j(yk−1,yk,x,k))Z(X), (4.6)where K is the sequence length, J is the number of features extracted from the joint obser-vation labels pair, andΦ j is the feature function. λ js are the parameters of the model which78are learned based on the training data. Z(x) is the normalization factor (partition function):Z(X) =∑y′1∑y′2...∑y′Kexp(J∑j=1λ jK∑k=1Φ j(y′k−1,y′k,x,k)) (4.7)To calculate the partition function we use forward-backward algorithm which is a dynamicprogramming algorithm.The set of feature functions (φ j) should be specified before using CRF. The first featurewe used, is defined asφσ0 = I(yk = σ), σ ∈ Σ, (4.8)where I represents the indicator function, and Σ corresponds to the set of possible labelvalues (i.e. NC states and movement states).The second set of features φk,1:L captures the relation between the observation vectorand the kth label in the sequence. We are assuming the observation vector is of dimensionL. Each element φσk,l of the vector φσk,1:L is defined asφσk,l = xk,lI(yk = σ), σ ∈ Σ, (4.9)where xk,l is the lth dimension of the observation vector.The third type of features is defined asφσ1,σ2k(k−1) = I(yk = σ1∧ yk−1 = σ2), σ1,σ2 ∈ Σ, (4.10)where φσ1,σ2k(k−1) captures the relation between the kth and (k−1)th labels in the sequence.Using the above mentioned set of features (equations 4.8, 4.9 and 4.10) the posterior79function will be of the formPr(y1:K|x1:K) =exp( K∑k=1(byk +WTyk,1:Lxk)+K∑k=2Vyk−1,yk)Z(X), (4.11)where byk , Wyk,1:L and Vyk−1,yk are the parameters of the posterior distribution and should belearned using the training dataset. The set of parameters byk , Wyk,1:L and Vyk,yk−1 capturesthe importance of the first, second and third type of features respectively.To learn the optimal values of the parameters, we minimize the L2 regularized negativelog likelihoodminλN∑i=1−LogPr(yi|xi)+CλTλ , (4.12)where the vector λ consists of the parameters of the model (i.e. byk , Wyk,1:L and Vyk−1,yk),and C is the regularization coefficient. To find the optimal value of the parameters, weuse the stochastic gradient descent algorithm. As the loss function is convex, the gradientdescent algorithm converges to the global optimum of the loss function.For the inference i.e. assigning a sequence of labels to xN+1, we assign the most prob-able sequence as the predicted labels i.e.ypredicted = argmaxyP(y|xN+1). (4.13)We used a Viterbi like algorithm to find the most probable sequence of labels.4.3.3 Neural Network Conditional Random FieldsThe linearity of the exponent term in CRF classifiers makes them have less expressivepower compared to the models that exploit kernels. A good approach that makes thesealgorithms more powerful is to combine feed-forward neural network with CRF. Neuralnetwork transforms the observation vector into high level features which are then used as80the input to the CRF. As a result, the exponent term in CRF (i.e. the exponent term inequation 4.11) becomes non-linear because of the several layers of non-linear activationfunctions that have been applied on the observation vector [35] [76].Neural Network Conditional Random Field (NNCRF) can be viewed as a standard lin-ear chain CRF that uses a high level representation of the observations. Therefore, theposterior function of the NNCRF has the same form as that in equation 4.11 except that thexk terms are replaced with the hM(xk) term which represents the output of a feed-forwardneural networks with M layers.Pr(Y |X) =exp(K∑k=1byk +WTyk,1:Lh(M)(xk)yk +K∑k=2Vyk−1,yk)Z(X). (4.14)The output of the (M)th layer (h(M)) ish(M)(xk) = tanh(b(M−1)+W (M−1)hM−1(xk)), (4.15)where b(M−1) and W (M−1) are the weights of the (M− 1)th layer, hM−1(xk) is the outputof the (M− 1)th layer of the neural network, and h0(xk) = xk. To avoid over-fitting, theweights of the neural networks are the same for all observations xk in a sequence i.e. thevalues of the weights do not depend on k.Initialization of a neural networks is very crucial for this algorithm to converge to agood local optima. For initialization of the neural network, the value of each parameter isa sample taken from a uniform distribution U [−γ,+γ] whereγ =√6√size(M)+ size(M−1) , (4.16)where size(M) is the number of hidden units in the Mth layer of the neural networks.81The loss function is the same as the loss function of the standard CRF (Equation 4.12).We use the stochastic gradient descent algorithm to jointly optimize the values of the pa-rameters of the neural networks and of the CRF. The learning rate of the stochastic gradientdescent algorithm is adjusted using the bold driver approach. For inference the same algo-rithm as in standard linear chain CRF is used.4.4 DatasetsTo perform the experiments two self-paced sensory motor BCI datasets have been used.The first dataset, SM2, was collected from 4 subjects attempting to activate a switch byperforming a right index finger movement (as explained in section 2.3).The second dataset, BCICIV2a, is the dataset IIa from the BCI competition IV (Section2.3) which is recorded from 9 subjects performing 4-class motor imagery (left hand andright hand, both feet and tongue imagery movements) tasks. We have treated this datasetas a self-paced BCI dataset. In other words, to evaluate the performance of the classifierson this dataset the time of transition from previous mental task to the new one (the timecue was displayed) has not been used. We have also converted the problem into a binaryclassification task i.e. separating movement imagery from NC states. All 4-classes ofmotor-imagery are considered as movement (control states) and the periods in which thesubject did not control the system are considered as the No-Control class.4.5 ResultsWe compared the performance of NNCRF with the standard CRF classifier, Hidden MarkovSupport Vector Machines (HMSVM), Hidden Markov Models (HMM) and two popularclassical classifiers, Logistic Regression and Support Vector Machines (SVM).For frequency filtering, we applied a filter-bank with 2 blocks in ranges [8-12]Hz and[16-24]Hz corresponding to the typical ranges of the alpha and beta brain waves. For spatial82filtering, we used CSP with 2, 4 and 6 filters. The values of these hyper-parameters alongwith each classifier’s parameters are adjusted jointly using 5-fold cross-validation. Thenthe parameters with the best mean cross-validation accuracy were used to train a classifieron the training set. To evaluate the performance of the classifiers on the test dataset, weused the Area Under the Curve (AUC) measure.In our experiments, the band power of the EEG signal is used as the extracted features(observations) for the classification phase. The window length was equal to the samplingrate of the dataset and we used the last two seconds of the data to perform the classificationin the test phase.Table 4.1 shows the results of comparing different classifiers. We also included theaverage rank of each algorithm in Table 4.1. The average ranks of different classifiersare calculated by ranking them based on their AUC for each subject separately, and thenaveraging the ranks over all subjects.Table 4.1: The results of comparing AUC of different algorithms on the test dataset.Highlighted cells show the algorithm for which the performance is the best. Thelast row shows the average rank of different classifiers across all subjects.Subject LR SVM HMM HMSVM CRF NNCRF1 0.617 0.472 0.470 0.567 0.554 0.5652 0.605 0.535 0.577 0.602 0.640 0.6813 0.666 0.659 0.646 0.643 0.690 0.6914 0.673 0.618 0.556 0.593 0.594 0.5405 0.559 0.560 0.511 0.510 0.473 0.4936 0.679 0.653 0.484 0.629 0.692 0.7017 0.725 0.710 0.605 0.683 0.672 0.6668 0.709 0.706 0.509 0.542 0.629 0.7189 0.616 0.606 0.527 0.565 0.591 0.58822 0.767 0.723 0.711 0.708 0.789 0.84023 0.724 0.728 0.762 0.706 0.828 0.79824 0.610 0.604 0.548 0.580 0.615 0.63825 0.585 0.574 0.604 0.588 0.669 0.707Average Rank 2.54 3.62 5 4.54 2.85 2.4683As shown in Table 4.1, the average rank of NNCRF is better than those of other algo-rithms, the second best classifier is the LR classifier. Another observation is that HMM hasthe worst performance, which suggests that HMM has no advantage over classical classi-fiers (this observation has also been reported in [27]). It is worth mentioning that for somesubjects none of the discriminative sequence labeling classifiers have a good performance,this means that these classifiers are not able to capture the temporal structure of the signalin these subjects.4.6 ConclusionIn this chapter, we proposed a discriminative sequence labeling algorithm (classifier), tocapture the dynamics of the EEG signal. We evaluated the performance of our algorithm(which we denote as NNCRF) on two self-paced BCI datasets and showed that it is superior,compared to classical classifiers and sequence labeling classifiers. NNCRF is a combina-tion of a neural network and a CRF classifier. We demonstrated that CRF and NNCRF cancapture the temporal properties of the EEG signal and improve the accuracy of the BCI inmost of the subjects. In some subjects however, the temporal structure of the EEG data isdifficult to capture by these classifiers. The neural network part of NNCRF converts theoriginal observation vector into a new representation which is then fed to the CRF part.The non-linear transformation (by the neural network) of the observation vector helps theCRF part to discriminate between the different control and NC easily.Overall, the reason for the poor performance of classical approaches is that they do notexploit the inherent dynamics in the EEG signal. As for the HMM, although this algorithmmodels the temporal correlations in an EEG signal, its poor performance stems from thefact that it focuses on learning the mental task without learning to discriminate between thedifferent mental tasks. On the other hand, discriminative sequence labeling classifiers donot only model the temporal properties of each mental task, they also model the transition84from one mental task to another (e.g. transition from movement to NC state).85Chapter 5Ensemble of Discriminative SequenceLabeling Classifiers for Self-Paced BCIs5.1 IntroductionThe aim of this chapter is to design a self-paced BCI that can be automatically customizedfor each user. In Chapter 3, we showed the importance of customizing synchronous BCIsbased on the brain characteristics of each user. We then proposed an approach for customiz-ing two different kinds of synchronous BCIs based on band-power features and Morlet fea-tures with logistic legression classifier. Here, we adapt the same approach to customizeself-paced BCIs (proposed in chapter 4) based on the brain characteristics of each user.So, our algorithm not only takes the correlation between consecutive EEG windows intoaccount, but also is customized based on the characteristics of each subject’s EEG signal.Customizing a BCI for each subject involves optimizing a cost function over a hyper-parameter space. The cost function used here is the cross-validation accuracy of the BCI onthe training data, and the hyper-parameters are the parameters of the BCI system which areselected before feature extraction and classification. We use the algorithm we proposed in86chapter 3 which is an iterative algorithm that optimizes the values of the hyper-parameters.After the optimization is finished, we can either use the best performing classifier that ourclassifier selected in the optimization phase or we can combine all the trained classifiersinto one final classifier.In general, for our final classifier we seek a model with low bias and high variance asour final classifier. When the training data is small but the classifier has high variance, thenby averaging the results of individual classifiers we reduce the variance of the final classifierwhile preserving the low bias of a single model. Especially in the case of neural networkswhich can get stuck in a local optimum and has a high variance, creating an ensemble ofneural networks which are trained on different parts of the hyper-parameter space, mayresult in a better approximation of the best possible classifier.In self-paced BCI, the hyper-parameter space of every classifier consists of the fre-quency ranges over which the EEG signal is filtered, the selected channels (from which thefeatures are extracted), and the window length (w). For the first (i.e. frequency ranges) andsecond (i.e. channels) types of hyper-parameters, we use the same settings as in section3.3.1. The third hyper-parameter is the length w of the sliding window. For all the classi-fiers, we have used the past two seconds of the EEG signal to build the observation vectorfrom. For classical classifiers, we only look at the past w milliseconds of the data i.e. onlythe last window. For sequence labeling classifiers, we build a chain of consecutive windowsas our observation vector. The length of the chain is therefore 2|w|×1000. For the proposedNNCRF classifier which we proposed in chapter 4, we have an additional hyper-parameterwhich is the number of neurons in the hidden layer of the neural network. We optimize allthese hyper-parameters jointly.Using Bayesian optimization, we search through the hyper-parameter space to find thebest hypothesis (classifier) that makes a good prediction. Bayesian optimization iterativelyproposes different points (values) in the hyper-parameter space. Each of these points are87Figure 5.1: Diagram of our optimization algorithm at iteration t.used to train a classifiers. After the optimization is finished, we can either select the bestperforming classifier obtained in the optimization phase as our final classifier or we cancombine all trained classifiers into one final classifier.To make our algorithm clear, we have added the diagram (Figure 5.1) of our algorithmthat elaborates one iteration of our algorithm. In each iteration of our algorithm, we firstupdate the posterior distribution. Then based on the posterior distribution, we build theacquisition function and find the maximum point of the acquisition function.88The maximum point (proposed candidate by the Bayesian optimization) is used to per-form filtering on the EEG. The features are then extracted and a classifier is trained on thetraining samples. We then, calculate the accuracy of the classifier, this will be used in thenext iteration of our algorithm. In each iteration of our algorithm, we combine the classi-fiers built in the previous iterations, and examine the stopping condition to decide whetherto stop the algorithm or continue to the next iteration.5.2 ResultsThe set of classifiers and the feature extraction we used are the same as those of chapter 4.The datasets are also the same and are explained in 4.4. For the experiments, we comparedthe ensemble of the classifiers generated in the optimization phase.Table 5.1 shows the results of comparing different classifiers when we used ensem-ble of classifiers. The columns with the MIN label correspond to the results of applyingthe best performing classifier obtained in the optimization phase on the independent testdataset. The Bayesian optimization algorithm was run for at most 50 iterations or untilcross-validation accuracy plateaued. We have repeated our experiments five times to re-duce the effect of the random seed. The average number of iterations of the Bayesianoptimization algorithm for all different classifiers was 27. In Table 5.1, the columns withAVG label correspond to the results of using all the generated classifiers in the optimiza-tion phase. To create the ensemble of classifiers, we averaged the results of all individualclassifiers.89Table 5.1: The results of comparing different classifier after performing Bayesian optimization on the test dataset. MIN is theresults of applying the best performing classifier in optimization phase on the test data. AVG corresponds to the resultsof the ensemble of classifiers. Highlighted cells show the algorithm for which the performance is the best.LR SVM HMM HSVM CRF NNCRFSubject MIN AVG MIN AVG MIN AVG MIN AVG MIN AVG MIN AVG1 0.677±0.001 0.708±0.003 0.612±0.027 0.684±0.008 0.571±0.001 0.572±0.003 0.626±0.008 0.682±0.004 0.646±0.015 0.651±0.006 0.663±0.0 0.653±0.0042 0.653±0.0 0.663±0.004 0.655±0.004 0.668±0.002 0.597±0.0 0.647±0.014 0.592±0.008 0.647±0.003 0.643±0.003 0.66±0.004 0.641±0.012 0.658±0.0043 0.689±0.001 0.709±0.003 0.686±0.005 0.721±0.004 0.705±0.0 0.688±0.003 0.663±0.0 0.724±0.001 0.724±0.011 0.735±0.002 0.735±0.003 0.743±0.0044 0.743±0.0 0.769±0.002 0.737±0.002 0.758±0.003 0.554±0.0 0.571±0.004 0.695±0.0 0.749±0.004 0.748±0.0 0.725±0.006 0.736±0.003 0.722±0.0035 0.595±0.006 0.629±0.003 0.561±0.001 0.622±0.007 0.518±0.004 0.509±0.003 0.605±0.0 0.655±0.003 0.637±0.002 0.618±0.006 0.632±0.016 0.632±0.0046 0.657±0.001 0.692±0.004 0.678±0.012 0.729±0.003 0.512±0.024 0.63±0.015 0.62±0.002 0.718±0.002 0.691±0.021 0.731±0.005 0.711±0.016 0.73±0.0067 0.738±0.0 0.762±0.002 0.729±0.008 0.778±0.002 0.64±0.009 0.71±0.01 0.737±0.0 0.802±0.002 0.769±0.0 0.751±0.003 0.769±0.0 0.755±0.0048 0.716±0.002 0.729±0.002 0.716±0.004 0.736±0.002 0.547±0.126 0.633±0.034 0.505±0.021 0.658±0.013 0.648±0.018 0.697±0.011 0.723±0.004 0.779±0.0049 0.588±0.016 0.626±0.006 0.57±0.012 0.638±0.007 0.565±0.046 0.598±0.013 0.681±0.006 0.687±0.009 0.702±0.006 0.669±0.005 0.709±0.0 0.686±0.00322 0.795±0.004 0.826±0.004 0.786±0.0 0.818±0.002 0.704±0.016 0.732±0.006 0.741±0.001 0.827±0.004 0.815±0.02 0.871±0.005 0.847±0.005 0.881±0.00423 0.716±0.0 0.733±0.004 0.752±0.007 0.763±0.004 0.641±0.012 0.637±0.01 0.701±0.009 0.777±0.003 0.813±0.009 0.831±0.004 0.817±0.007 0.833±0.00624 0.594±0.007 0.597±0.002 0.577±0.003 0.6±0.003 0.556±0.013 0.554±0.006 0.58±0.011 0.617±0.003 0.639±0.009 0.626±0.005 0.633±0.004 0.631±0.00325 0.607±0.0 0.605±0.002 0.61±0.03 0.642±0.008 0.521±0.007 0.518±0.004 0.626±0.009 0.667±0.007 0.699±0.028 0.682±0.011 0.734±0.003 0.723±0.013AVERAGE Rank 7.65 5.23 8.50 4.23 11.23 10.65 9.54 4.15 4.92 4.65 3.88 3.35900.5500.5700.5900.6100.6300.6500.6700.6900.7100.7300.750LR SVM HMM HMSVM CRF NNCRFDefault parameters Optimized parameters Ensemble of classifiersFigure 5.2: Average AUC of different settings for each classifier across all subjects.The blue bars correspond to the average AUC when the default value of thehyper-parameters have been used. The red bars correspond to the average AUCafter using Bayesian optimization. The green bars correspond to the averageAUC when we used an ensemble of different classifiers.Qualitative comparison of different algorithms in Tables 4.1 and 5.1 suggests the fol-lowing: 1) optimizing the values of the hyper-parameters improves the performance of anyclassifier considerably (comparing MIN columns of Table 5.1 with Table 4.1), 2) in almostall subjects and for almost all classifiers, using an ensemble of classifiers improves theperformance, 3) HMM is the worst performing classifier, 4) NNCRF outperforms other al-gorithms in terms of average rank of the algorithm, and 5) interestingly, for some subjectsnone of the discriminative sequence labeling classifiers have yielded a good performance,this means that these classifiers are not able to capture the temporal structure of the signalin these subjects.Figure 5.2 shows the average AUC of different algorithms across all subjects. The bluebars correspond to the average AUC when the default value of the hyper-parameters isused. The red bars correspond to the average AUC after using Bayesian optimization. Thegreen bars correspond to the average AUC when we use an ensemble of different classifiers.91Figure 5.2 shows that for all classifiers using Bayesian optimization considerably improvesthe average AUC of classifiers across all subjects, and using an ensemble of the trainedclassifiers further improves the results. The best performing algorithm is the one that usesan ensemble of NNCRF classifiers.5.3 ConclusionTo further improve the performance of our algorithms used in chapter 4, we customized thehyper-parameters of the BCI (using the algorithm proposed in 3). We evaluated differentalgorithms on the same subjects that we used in chapter 4. We showed that customizingthe BCI hyper-parameters improves the performance of every classifier used in this study.After customizing the parameters of every classifier used, the best performing classifierwas NNCRF, while CRF was the second best classifier.We also showed that using an ensemble of classifiers that have been trained on dif-ferent parts of the BCI hyper-parameter space can further improve the performance mea-sure (AUC). We used Bayesian optimization to find the different values of the BCI hyper-parameters. Selecting different values for the hyper-parameters exposes the classifier todifferent parts of the BCI hyper-parameter space. We believe that this diversification is thekey to the superiority of using the ensemble of classifiers compared to the single classifierapproach. The performance of the classifiers is very sensitive to the choice of the hyper-parameters of the BCI and using an ensemble of classifiers can decrease the variance of theclassifiers. The best performing algorithm was the ensemble of NNCRF classifiers.92Chapter 6Conclusion and Future Work6.1 Summary of Our ContributionsThis thesis addresses the design of Brain Computer Interface (BCI) systems, in general, butspecifically self-paced BCIs. In designing a self-paced BCI system that is based on sensorymotor rhythms, it is important to exploit the properties of the brain signal. We should takeinto consideration the variations over time of the EEG signal in specific frequency bands. Itis common to filter the data in these specific frequency bands during specific time segmentsand exploit the information embedded in the signal so as to gain maximum discriminationbetween different mental tasks. These frequency bands vary from one person to another.The optimal value of these frequency bands should be adjusted for each individual subject.Also, the spatial information as to which channels should be used for a specific men-tal task has a major role in the performance of the BCI. All such information about thecharacteristics of the EEG signal is fed to the BCI system as a set of hyper-parameters.Thus the selection of the value of these parameters is very important as they determine thesuccess and accuracy of the BCI. The value of all BCI hyper-parameters should be cus-tomized based on the brain characteristics of each subject. Furthermore, in sensory motor93self-paced BCIs, while the subject is performing a mental task, his/her brain goes throughseveral well-defined internal state changes. Therefore, it is important to use a classifier thatis able to capture the dynamics of the EEG signal (i.e. takes the time course of the EEGsignal into account).In this thesis we study 1) how to automatically adjust the values of self-paced BCIparameters, and 2) how to exploit the time-frequency properties and user-specific charac-teristics of the EEG. Our final algorithm exploits the brain characteristics of the user inself-paced BCIs. Our algorithm has the following advantages, it is easy to apply to anyBCI, it is fully automatic, it does not need EEG expertise, scalable in the number of hyper-parameters and computationally inexpensive. Moreover, in contrast to the general trend inBCIs our algorithm uses the sequential information in the EEG signal that further improvesthe performance of a self-paced BCI system.We believe that our proposed system is an important step towards transitioning BCIsfrom research devices to in-home communication assistive devices. In the following, wewill describe a summary of our contributions in each chapter of this thesis.Chapter 2In this chapter of the thesis, we developed an open source framework for comparing dif-ferent BCI systems that is unique in two aspects. First, we performed a comprehensivecomparison between 14 different sensory-motor-based BCI designs over 29 subjects. Thisis the first study that includes large number of designs and subjects. Secondly, we per-formed rigorous statistical tests to statistically validate the results. The strength of thisstudy is that the source code and data of our experiments are openly available so that otherresearchers can (a) apply it on their own data and benchmark their data with standard meth-ods, and (b) extend the framework to include other machine learning and signal processingmethodologies.94Unlike most publications in the BCI field which recommend the linear discriminantanalysis classifier as the best classifier, our findings show that for each feature extractionmethod many classifiers should be tried, and then the best classifier should be selectedbased on the cross-validation results of the classifiers. In general, we found that there isnot a best classifier or best feature extraction method that outperforms all others. For eachsubject, the combination of the classifier, the features and model parameters should all betuned together, and finally the method with the best performance on the training data setshould be selected as the final model for testing on unseen data.Chapter 3In the third chapter of this thesis, we proposed an algorithm to automatically customizea BCI based on the brain characteristics of each user. The study was performed on 21subjects in synchronous BCIs from the BCI competitions.In our first set of experiments, we showed the importance of the joint tuning of hyper-parameters in BCIs. We showed that tuning the values of the BCI parameters using Bayesianoptimization can considerably improve the mean accuracy across all 21 subjects. In thesecond part of the experiments, we compared our algorithms to several other approachesfor tuning the values of the hyper-parameters. We also backed our results with the properstatistical tests.Moreover, we compared the results of the proposed algorithm to the results reported inthe literature on the same subjects. We have shown that our algorithm has at least similarperformance, but usually outperforms the results reported in the literature.A big advantage of our proposed method is that it is fully automatic. While our pro-posed algorithm yielded similar or better performance compared to the literature, it is im-portant to note that the best reported results in the literature were achieved based on lengthyextensive manual fine-tuning of the parameters of the BCI system and based on more so-95phisticated feature extraction and classification methods. Our results show that our au-tomated parameter optimization of a BCI system that relies on less sophisticated featureextraction and classification methods yields similar/superior results compared to the bestperforming designs in the literature.Overall the results show the importance of selecting the appropriate values for thehyper-parameters of the system. Furthermore, these results show that applying a goodhyper-parameter selection algorithm can boost a simple baseline algorithm’s performanceso it can compete with the state of the art algorithms and results.Chapter 4In chapter 4, we proposed an algorithm that can capture the time dynamics inherent in theEEG signal. The study was performed on 13 self-paced BCI subjects. The aim was toevaluate the performance of different algorithms in discriminating between a control task(movement or imagery movement) and no-control states. The main challenges in self-pacedBCIs are that the start and end times of the control task and the No-Control (NC) state arenot known, and it is extremely difficult to find the specific patterns in the NC states, this isbecause the subject can be in any mental state during the NC periods.In chapter 4, we proposed a discriminative sequence labeling algorithm (classifier) tocapture the dynamics of the EEG signal. The advantage of such classifiers is that they notonly model the temporal properties of each mental task, but also model the transition fromone metal task to another (e.g. transition from a movement to an NC state).We evaluated the performance of our algorithm (which we denoted as NNCRF) andshowed that it is superior, compared to classical classifiers and the sequence labeling classi-fiers. The proposed NNCRF is a combination of a neural network and a conditional randomfield classifier. The neural network part of NNCRF converts the original observation vectorinto a new representation which is then fed to the CRF part. The non-linear transformation96(by the neural network) of the observation vector helps the CRF part to better discriminatebetween a control task and NC states easily. We demonstrated that the CRF and NNCRFcan capture the temporal properties of the EEG signal and improve the accuracy of the BCIin most subjects.Chapter 5In chapter 5, we proposed an algorithm that customizes a self-paced BCI to each user i.e.exploits the brain characteristics of each subject. Our algorithm not only fine-tunes the(hyper-)parameters of the self-paced BCI, but also exploits to the time information specificto each user. The study was performed on 13 self-paced BCI subjects.In this chapter, we showed that customizing the self-paced BCI hyper-parameters im-proves the performance of every classifier used in the study. We also showed that usingan ensemble of classifiers (where every classifier has been trained on a different part ofthe BCI hyper-parameter space) can further improve the performance measure of the BCI.We used the algorithm proposed in chapter 3 to find the different values of the BCI hyper-parameters. Selecting different values for the hyper-parameters exposes the classifier todifferent parts of the BCI hyper-parameter space. We believe that this diversification is thekey to the superiority of using an ensemble of classifiers compared to the single classifierapproach. The performance of a classifiers is very sensitive to the choice of the hyper-parameters of the BCI and using an ensemble of classifier can decrease the variance of theclassifiers.The algorithm proposed in chapter 5, exploits the time and frequency characteristicsof the specific user’s brain in self-paced BCIs. We believe that our proposed system is animportant step towards transitioning BCIs from research environment to real-life applica-tions, this is because it is fully automatic, scalable and easy to use. Overall our resultsshow the importance of selecting good values for the hyper-parameters, and at the same97time exploiting the time information in self-paced BCI systems.6.2 Future WorkSome possible future directions of this work are briefly discussed below:In this study, we used linear chain discriminative sequence labeling classifiers for self-paced BCIs, however, using other more complex classifiers that can better capture the timecorrelation between different parts of the EEG signal could be beneficial. Another possi-ble future direction is to use more advance feature functions for discriminative sequencelabeling classifiers to incorporate more knowledge about the EEG. The type of the featurefunctions used in this study were all inspired by Hidden Markov Models. So they are alllocal in nature, i.e. each feature function only depends on the current or the previous labelin the sequence. It is possible to use global features such as the ones that capture higherorders of dependency of the transition between consecutive labels, or feature functions thatcapture dependency between the EEG signal and labels of the EEG signal from distant past.These types of feature functions have the ability to capture more complex structures in thedata and increase the power of classifier.One of the major problems in BCIs is that the datasets are very small. Collecting moredata will enable us to use deep learning algorithms [62]. Specially in our proposed NNCRFalgorithm, having larger datasets would enable us to utilize a deep neural network whichwill eventually gives the algorithm the ability to extract better and more informative fea-tures from the EEG signal. The current trend in BCIs is to hand-craft feature extractors fromEEG signal that result in high level features from EEG signal. On the other hand, by usingdeep learning algorithms feature extraction process becomes a part of the automated sys-tem which can lead to easier classification of the BCI data. In addition to NNCRF, anotherclass of deep learning algorithms which can be used here, are recurrent neural networks.These algorithms have been very successful in different sequence labeling applications that98are very similar in nature to the self-paced classification of BCIs [49] .Non-stationarity of the EEG data makes real-life applications of (sensory-motor) BCIsdifficult to use. These variations in the EEG signal properties might deteriorate the BCIperformance. One possible approach to overcome this problem is to use semi-supervisedlearning algorithms. These algorithms have the ability to use a small set of labeled data witha large amount of unlabeled data. Specially, in deep learning methods unlabeled data can beused to pre-train the neural network and the labeled data can be used to perform supervisedfine-tuning of the network. The semi-supervised sequence learning [30] algorithms can beespecially useful for self-paced BCIs. These classifiers can be trained using a small amountof training data. While the subject is operating the BCI, the classifier can improve itselfbased on the new coming EEG data.99Bibliography[1] Sarah N. Abdulkader, Ayman Atia, and Mostafa-Sami M. Mostafa. Brain computerinterfacing: Applications and challenges. Egyptian Informatics Journal, 16(2):213– 230, 2015. → pages 1[2] Yasemin Altun, Ioannis Tsochantaridis, Thomas Hofmann, et al. Hidden markovsupport vector machines. In ICML, volume 3, pages 3–10, 2003. → pages 76, 77[3] Omar AlZoubi, Irena Koprinska, and Rafael A Calvo. Classification ofbrain-computer interface data. In Proceedings of the 7th Australasian Data MiningConference-Volume 87, pages 123–131. Australian Computer Society, Inc., 2008.→ pages 15[4] Kai Keng Ang, Zheng Yang Chin, Chuanchu Wang, Cuntai Guan, and HaihongZhang. Filter bank common spatial pattern algorithm on bci competition iv datasets2a and 2b. Frontiers in Neuroscience, 6, 2012. → pages 63, 65, 66[5] Mahnaz Arvaneh, Cuntai Guan, Kai Keng Ang, and Chai Quek. Optimizing thechannel selection and classification accuracy in eeg-based bci. BiomedicalEngineering, IEEE Transactions on, 58(6):1865–1873, 2011. → pages 7, 42[6] Onder Aydemir and Temel Kayikcioglu. Comparing common machine learningclassifiers in low-dimensional feature vectors for brain computer interfaceapplications. → pages 15[7] Alexandre Barachant, Ste´phane Bonnet, Marco Congedo, and Christian Jutten.Multiclass brain–computer interface classification by riemannian geometry.Biomedical Engineering, IEEE Transactions on, 59(4):920–928, 2012. → pages63, 65[8] Ali Bashashati, Mehrdad Fatourechi, Rabab K Ward, and Gary E Birch. A surveyof signal processing algorithms in brain–computer interfaces based on electricalbrain signals. Journal of Neural engineering, 4(2):R32, 2007. → pages 6, 16, 72[9] Ali Bashashati, Rabab K Ward, and Gary E Birch. Towards development of a3-state self-paced brain-computer interface. Computational intelligence andneuroscience, 2007, 2007. → pages 4, 71100[10] James Bergstra and Yoshua Bengio. Random search for hyper-parameteroptimization. The Journal of Machine Learning Research, 13(1):281–305, 2012.→ pages 53[11] Martin Billinger, Vera Kaiser, Christa Neuper, and Clemens Brunner. Automaticfrequency band selection for BCIs with ERDS difference maps. → pages 42[12] Jeff A Bilmes et al. A gentle tutorial of the em algorithm and its application toparameter estimation for gaussian mixture and hidden markov models. 1998. →pages 74[13] G. Bin, X. Gao, Y. Wang, B. Hong, and S. Gao. Vep-based brain-computerinterfaces: time, frequency, and code modulations [research frontier]. IEEEComputational Intelligence Magazine, 4(4):22–26, November 2009. → pages 2[14] Christopher M Bishop. Pattern recognition and machine learning, volume 1.springer New York, 2006. → pages 24[15] Benjamin Blankertz, Guido Dornhege, Matthias Krauledat, Klaus-Robert Mu¨ller,and Gabriel Curio. The non-invasive berlin brain–computer interface: fastacquisition of effective performance in untrained subjects. NeuroImage,37(2):539–550, 2007. → pages 28[16] Benjamin Blankertz, K Muller, Dean J Krusienski, Gerwin Schalk, Jonathan RWolpaw, Alois Schlogl, Gert Pfurtscheller, Jd R Millan, M Schroder, and NielsBirbaumer. The bci competition iii: Validating alternative approaches to actual bciproblems. Neural Systems and Rehabilitation Engineering, IEEE Transactions on,14(2):153–159, 2006. → pages 28, 56[17] Benjamin Blankertz, Michael Tangermann, Carmen Vidaurre, Siamac Fazli,Claudia Sannelli, Stefan Haufe, Cecilia Maeder, Lenny E Ramsey, Irene Sturm,Gabriel Curio, and Klaus R Mueller. The berlin brain-computer interface:Non-medical uses of bci technology. Frontiers in Neuroscience, 4(198), 2010. →pages 1[18] Benjamin Blankertz, Ryota Tomioka, Steven Lemm, Motoaki Kawanabe, and K-RMuller. Optimizing spatial filters for robust eeg single-trial analysis. SignalProcessing Magazine, IEEE, 25(1):41–56, 2008. → pages 18, 42[19] Reza Boostani, Bernhard Graimann, MH Moradi, and Gert Pfurtscheller. Acomparison approach toward finding the best feature and classifier in cue-based bci.Medical & biological engineering & computing, 45(4):403–412, 2007. → pages 15[20] Jaimie F Borisoff, Steven G Mason, Ali Bashashati, and Gary E Birch.Brain-computer interface design for asynchronous control applications:improvements to the lf-asd asynchronous brain switch. Biomedical Engineering,IEEE Transactions on, 51(6):985–992, 2004. → pages 15, 28, 29101[21] Andreas Trllund Boye, Ulrik Qvist Kristiansen, Martin Billinger, Omar Feixdo Nascimento, and Dario Farina. Identification of movement-related corticalpotentials with optimized spatial filtering and principal component analysis.Biomedical Signal Processing and Control, 3(4):300 – 304, 2008. → pages 13[22] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001. → pages 22[23] Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on bayesianoptimization of expensive cost functions, with application to active user modelingand hierarchical reinforcement learning. arXiv preprint arXiv:1012.2599, 2010. →pages 43[24] Nicolas Brodu, Fabien Lotte, and Anatole Lcuyer. Exploring two novel features foreeg-based brain–computer interfaces: Multifractal cumulants and predictivecomplexity. Neurocomputing, 79:87–94, 2012. → pages 19, 52, 57, 63, 64, 65[25] Nicolas Brodu, Fabien Lotte, and Anatole Le´cuyer. Comparative study ofband-power extraction techniques for motor imagery classification. InComputational Intelligence, Cognitive Algorithms, Mind, and Brain (CCMB), 2011IEEE Symposium on, pages 1–6. IEEE, 2011. → pages 19, 63, 64, 65[26] C Brunner, R Leeb, G Mu¨ller-Putz, A Schlo¨gl, and G Pfurtscheller. Bcicompetition 2008–graz data set a. → pages 28, 56[27] Silvia Chiappa, Nicolas Donckers, Samy Bengio, and Fre´de´ric Vrins. Hmm andiohmm modeling of eeg rhythms for asynchronous bci systems. In ESANN, 2004.→ pages 72, 84[28] Zheng Yang Chin, Kai Keng Ang, Chuanchu Wang, Cuntai Guan, and HaihongZhang. Multi-class filter bank common spatial pattern for four-class motor imagerybci. In Engineering in Medicine and Biology Society, 2009. EMBC 2009. AnnualInternational Conference of the IEEE, pages 571–574. IEEE, 2009. → pages 42[29] A Criminisi, J Shotton, and E Konukoglu. Decision forests for classification,regression, density estimation, manifold learning and semi-supervised learning. →pages 22, 45[30] Andrew M Dai and Quoc V Le. Semi-supervised sequence learning. In Advancesin Neural Information Processing Systems, pages 3079–3087, 2015. → pages 99[31] Michele Dalponte, Francesca Bovolo, and Lorenzo Bruzzone. Automatic selectionof frequency and time intervals for classification of eeg signals. Electronics Letters,43(25):1406–1408, 2007. → pages 42[32] Janez Demsˇar. Statistical comparisons of classifiers over multiple data sets. TheJournal of Machine Learning Research, 7:1–30, 2006. → pages 26, 58102[33] Thomas G Dietterich. An experimental comparison of three methods forconstructing ensembles of decision trees: Bagging, boosting, and randomization.Machine learning, 40(2):139–157, 2000. → pages 22[34] Thomas G Dietterich. Machine learning for sequential data: A review. InStructural, syntactic, and statistical pattern recognition, pages 15–30. Springer,2002. → pages 71[35] Trinh Do, Thierry Arti, et al. Neural conditional random fields. In InternationalConference on Artificial Intelligence and Statistics, pages 177–184, 2010. → pages81[36] Saso Dzˇeroski and Bernard Zˇenko. Is combining classifiers with stacking betterthan selecting the best one? Machine learning, 54(3):255–273, 2004. → pages 54[37] Lawrence Ashley Farwell and Emanuel Donchin. Talking off the top of your head:toward a mental prosthesis utilizing event-related brain potentials.Electroencephalography and clinical Neurophysiology, 70(6):510–523, 1988. →pages 3[38] Mehrdad Fatourechi, Ali Bashashati, Rabab K Ward, and Gary E Birch. Emg andeog artifacts in brain computer interface systems: A survey. Clinicalneurophysiology, 118(3):480–494, 2007. → pages 7[39] Tom Fawcett. An introduction to roc analysis. Pattern recognition letters,27(8):861–874, 2006. → pages 24[40] Reza Fazel-Rezai, Brendan Z Allison, Christoph Guger, Eric W Sellers, Sonja CKleih, and Andrea Ku¨bler. P300 brain computer interface: current challenges andemerging trends. Frontiers in Neuroengineering, 5:14, 2012. → pages 3[41] Yoav Freund and Robert Schapire. A short introduction to boosting.Journal-Japanese Society For Artificial Intelligence, 14(771-780):1612, 1999. →pages 23[42] Jerome H Friedman. Regularized discriminant analysis. Journal of the Americanstatistical association, 84(405):165–175, 1989. → pages 21[43] Ferran Gala´n, Marnix Nuttin, Eileen Lew, Pierre W Ferrez, Gerolf Vanacker, JohanPhilips, and J del R Milla´n. A brain-actuated wheelchair: asynchronous andnon-invasive brain–computer interfaces for continuous control of robots. ClinicalNeurophysiology, 119(9):2159–2169, 2008. → pages 13[44] Salvador Garcıa and Francisco Herrera. An extension on statistical comparisons ofclassifiers over multiple data sets for all pairwise comparisons. Journal of MachineLearning Research, 9:2677–2694, 2008. → pages 26, 61103[45] Bashar Awwad Shiekh Hasan and John Q Gan. Multi-objective particle swarmoptimization for channel selection in brain-computer interfaces. → pages 7, 42[46] Bashar Awwad Shiekh Hasan and John Q Gan. Conditional random fields asclassifiers for three-class motor-imagery braincomputer interfaces. Journal ofNeural Engineering, 8(2):025013, 2011. → pages 73[47] Pawel Herman, Girijesh Prasad, Thomas Martin McGinnity, and Damien Coyle.Comparative analysis of spectral approaches to feature extraction for eeg-basedmotor imagery classification. Neural Systems and Rehabilitation Engineering,IEEE Transactions on, 16(4):317–326, 2008. → pages 19[48] T. Hinterberger, S. Schmidt, N. Neumann, J. Mellinger, B. Blankertz, G. Curio, andN. Birbaumer. Brain-computer communication and slow cortical potentials. IEEETransactions on Biomedical Engineering, 51(6):1011–1018, June 2004. → pages 3[49] Sepp Hochreiter and Ju¨rgen Schmidhuber. Long short-term memory. Neuralcomputation, 9(8):1735–1780, 1997. → pages 99[50] Kurt Hornik. Approximation capabilities of multilayer feedforward networks.Neural networks, 4(2):251–257, 1991. → pages 24[51] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, et al. A practical guide tosupport vector classification. → pages 22[52] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multiclass supportvector machines. Neural Networks, IEEE Transactions on, 13(2):415–425, 2002.→ pages 23[53] Han-Jeong Hwang, Soyoun Kim, Soobeom Choi, and Chang-Hwan Im. Eeg-basedbrain-computer interfaces: A thorough literature survey. International Journal ofHuman-Computer Interaction, 29(12):814–826, 2013. → pages 16[54] Donald R Jones. A taxonomy of global optimization methods based on responsesurfaces. Journal of global optimization, 21(4):345–383, 2001. → pages 43[55] S Sathiya Keerthi and Chih-Jen Lin. Asymptotic behaviors of support vectormachines with gaussian kernel. Neural computation, 15(7):1667–1689, 2003. →pages 23[56] Jun-Yeup Kim, Seung-Min Park, Kwang-Eun Ko, and Kwee-Bo Sim. Optimal eegchannel selection for motor imagery bci system using bpso and ga. In RobotIntelligence Technology and Applications 2012, pages 231–239. Springer, 2013. →pages 7[57] Wlodzimierz Klonowski. Everything you wanted to ask about eeg but were afraidto get the right answer. Nonlinear Biomedical Physics, 3(1):2, 2009. → pages 7104[58] Roman Krepki, Benjamin Blankertz, Gabriel Curio, and Klaus-Robert Mu¨ller. Theberlin brain-computer interface (bbci)–towards a new communication channel foronline control in gaming applications. Multimedia Tools and Applications,33(1):73–90, 2007. → pages 3[59] Andrea Ku¨bler, Femke Nijboer, Ju¨rgen Mellinger, Theresa M Vaughan, HannelorePawelzik, Gerwin Schalk, Dennis J McFarland, Niels Birbaumer, and Jonathan RWolpaw. Patients with als can use sensorimotor rhythms to operate abrain-computer interface. Neurology, 64(10):1775–1777, 2005. → pages 4[60] John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional randomfields: Probabilistic models for segmenting and labeling sequence data. InProceedings of the eighteenth international conference on machine learning,ICML, volume 1, pages 282–289. → pages 73[61] Tian Lan, Deniz Erdogmus, Andre Adami, Misha Pavel, and Santosh Mathan.Salient eeg channel selection in brain computer interfaces by mutual informationmaximization. In Engineering in Medicine and Biology Society, 2005. IEEE-EMBS2005. 27th Annual International Conference of the, pages 7064–7067. IEEE, 2006.→ pages 42[62] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature,521(7553):436–444, 2015. → pages 8, 98[63] Felix Lee, Reinhold Scherer, Robert Leeb, Christa Neuper, Horst Bischof, and GertPfurtscheller. A comparative analysis of multi-class eeg classification for braincomputer interface. → pages 15[64] R Leeb, C Brunner, GR Mu¨ller-Putz, A Schlo¨gl, and G Pfurtscheller. Bcicompetition 2008–graz data set b. → pages 28, 56[65] Xu Lei, Ping Yang, and Dezhong Yao. An empirical bayesian framework forbrain–computer interfaces. Neural Systems and Rehabilitation Engineering, IEEETransactions on, 17(6):521–529, 2009. → pages 21[66] Eric C Leuthardt, Kai J Miller, Gerwin Schalk, Rajesh PN Rao, and Jeffrey GOjemann. Electrocorticography-based brain computer interface-the seattleexperience. IEEE Transactions on Neural Systems and Rehabilitation Engineering,14(2):194–198, 2006. → pages 2[67] Peiyang Li, Peng Xu, Rui Zhang, Lanjin Guo, and Dezhong Yao. L1 norm basedcommon spatial patterns decomposition for scalp eeg bci. Biomed. Eng. Online, 12,2013. → pages 16[68] Fabien Lotte, Marco Congedo, Anatole Le´cuyer, Fabrice Lamarche, Bruno Arnaldi,et al. A review of classification algorithms for eeg-based brain–computerinterfaces. Journal of neural engineering, 4, 2007. → pages 15105[69] Steven G Mason and Gary E Birch. A brain-controlled switch for asynchronouscontrol applications. Biomedical Engineering, IEEE Transactions on,47(10):1297–1307, 2000. → pages 4[70] Dennis J McFarland, Laurie A Miner, Theresa M Vaughan, and Jonathan RWolpaw. Mu and beta rhythm topographies during motor imagery and actualmovements. Brain topography, 12(3):177–186, 2000. → pages 3[71] Jose´ del R Milla´n, Fre´de´ric Renkens, Josep Mourin˜o, and Wulfram Gerstner.Brain-actuated interaction. Artificial Intelligence, 159(1):241–259, 2004. → pages3[72] Jonas Mockus. Application of bayesian approach to numerical methods of globaland stochastic optimization. Journal of Global Optimization, 4(4):347–365, 1994.→ pages 45[73] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.→ pages 20[74] Nam Nguyen and Yunsong Guo. Comparisons of sequence labeling algorithms andextensions. In Proceedings of the 24th international conference on Machinelearning, pages 681–688. ACM, 2007. → pages 73[75] Luis Fernando Nicolas-Alonso and Jaime Gomez-Gil. Brain computer interfaces, areview. Sensors, 12(2):1211–1279, 2012. → pages 2, 6[76] Jian Peng, Liefeng Bo, and Jinbo Xu. Conditional neural fields. In Advances inneural information processing systems, pages 1419–1427, 2009. → pages 81[77] Gert Pfurtscheller and Fernando H Lopes da Silva. Event-related eeg/megsynchronization and desynchronization: basic principles. Clinical neurophysiology,110(11):1842–1857, 1999. → pages 9, 41, 72[78] Gert Pfurtscheller, Gernot R Mu¨ller, Jo¨rg Pfurtscheller, Hans Ju¨rgen Gerner, andRu¨diger Rupp. thought–control of functional electrical stimulation to restore handgrasp in a patient with tetraplegia. Neuroscience letters, 351(1):33–36, 2003. →pages 1[79] Gert Pfurtscheller and Christa Neuper. Motor imagery and direct brain-computercommunication. Proceedings of the IEEE, 89(7):1123–1134, 2001. → pages 41[80] Maja Pohar, Mateja Blas, and Sandra Turk. Comparison of logistic regression andlinear discriminant analysis. MetodoloA˚ aki zvezki, 1(1):143–161, 2004. → pages21106[81] Lawrence R Rabiner. A tutorial on hidden markov models and selected applicationsin speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989. → pages72, 74[82] Herbert Ramoser, Johannes Muller-Gerking, and Gert Pfurtscheller. Optimalspatial filtering of single trial eeg during imagined hand movement. RehabilitationEngineering, IEEE Transactions on, 8(4):441–446, 2000. → pages 18[83] Carl Edward Rasmussen. Gaussian processes for machine learning. 2006. → pages44[84] Daran Ravden and John Polich. On p300 measurement stability: habituation,intra-trial block variation, and ultradian rhythms. Biological psychology,51(1):59–76, 1999. → pages 3[85] Yann Renard, Fabien Lotte, Guillaume Gibert, Marco Congedo, Emmanuel Maby,Vincent Delannoy, Olivier Bertrand, and Anatole Le´cuyer. Openvibe: anopen-source software platform to design, test, and use brain-computer interfaces inreal and virtual environments. Presence: teleoperators and virtual environments,19(1):35–53, 2010. → pages 10[86] Delgado Saa, F Jaime, and Mujdat Cetin. Discriminative methods for classificationof asynchronous imaginary motor tasks from eeg data. Neural Systems andRehabilitation Engineering, IEEE Transactions on, 21(5):716–724, 2013. → pages73[87] Gerwin Schalk, Dennis J McFarland, Thilo Hinterberger, Niels Birbaumer, andJonathan R Wolpaw. Bci2000: a general-purpose brain-computer interface (bci)system. Biomedical Engineering, IEEE Transactions on, 51(6):1034–1043, 2004.→ pages 10[88] R. Scherer, F. Lee, A. Schlogl, R. Leeb, H. Bischof, and G. Pfurtscheller. Towardself-paced brain-computer communication: Navigation through virtual worlds.IEEE Transactions on Biomedical Engineering, 55(2):675–682, Feb 2008. →pages 13[89] Reinhold Scherer, GR Muller, Christa Neuper, Bernhard Graimann, and GertPfurtscheller. An asynchronously controlled eeg-based virtual keyboard:improvement of the spelling rate. IEEE Transactions on Biomedical Engineering,51(6):979–984, 2004. → pages 3[90] Mark R Segal. Machine learning benchmarks and random forest regression. Centerfor Bioinformatics & Molecular Biostatistics, 2004. → pages 22[91] AE Selim, M Abdel Wahed, and YM Kadah. Machine learning methodologies inbrain-computer interface systems. In Biomedical Engineering Conference, 2008.CIBEC 2008. Cairo International, pages 1–5. IEEE, 2008. → pages 15107[92] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesianoptimization of machine learning algorithms. In Advances in Neural InformationProcessing Systems, pages 2951–2959, 2012. → pages 52[93] Heung-Il Suk and Seong-Whan Lee. Subject and class specific frequency bandsselection for multiclass motor imagery classification. International Journal ofImaging Systems and Technology, 21(2):123–130, 2011. → pages 42[94] Gufei Sun, Jinglu Hu, and Gengfeng Wu. A novel frequency band selection methodfor common spatial pattern in motor imagery based brain computer interface. InNeural Networks (IJCNN), The 2010 International Joint Conference on, pages 1–6.IEEE, 2010. → pages 42[95] Kai Ming Ting and Ian H Witten. Issues in stacked generalization. arXiv preprintarXiv:1105.5466, 2011. → pages 54[96] G. Townsend, B. Graimann, and G. Pfurtscheller. Continuous eeg classificationduring motor imagery-simulation of an asynchronous bci. IEEE Transactions onNeural Systems and Rehabilitation Engineering, 12(2):258–265, June 2004. →pages 13[97] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machinelearning for interdependent and structured output spaces. In InternationalConference on Machine Learning (ICML), pages 104–112, 2004. → pages 78[98] Ioannis Tsochantaridis, Thorsten Joachims, Thomas Hofmann, and Yasemin Altun.Large margin methods for structured and interdependent output variables. InJournal of Machine Learning Research, pages 1453–1484, 2005. → pages 76[99] Chun Sing Louis Tsui and John Q. Gan. Asynchronous BCI Control of a RobotSimulator with Supervised Online Training, pages 125–134. Springer BerlinHeidelberg, Berlin, Heidelberg, 2007. → pages 13[100] Douglas L Vail, Manuela M Veloso, and John D Lafferty. Conditional random fieldsfor activity recognition. In Proceedings of the 6th international joint conference onAutonomous agents and multiagent systems, page 235. ACM, 2007. → pages 73[101] Jan BF Van Erp, Fabien Lotte, Michael Tangermann, et al. Brain-computerinterfaces: beyond medical applications. Computer-IEEE Computer Society-,45(4):26–34, 2012. → pages 1[102] Jonathan R Wolpaw, Niels Birbaumer, Dennis J McFarland, Gert Pfurtscheller, andTheresa M Vaughan. Brain–computer interfaces for communication and control.Clinical neurophysiology, 113(6):767–791, 2002. → pages 1, 7[103] Peng Xu, Ping Yang, Xu Lei, and Dezhong Yao. An enhanced probabilistic lda formulti-class brain computer interface. PloS one, 6(1):e14634, 2011. → pages 16108[104] Xinyi Yong, Rabab K Ward, and Gary E Birch. Sparse spatial filter optimization foreeg channel reduction in brain-computer interface. In Acoustics, Speech and SignalProcessing, 2008. ICASSP 2008. IEEE International Conference on, pages417–420. IEEE, 2008. → pages 42[105] Rui Zhang, Peng Xu, Lanjin Guo, Yangsong Zhang, Peiyang Li, and Dezhong Yao.Z-score linear discriminant analysis for eeg based brain-computer interfaces. PloSone, 8(9):e74433, 2013. → pages 21[106] Mingjun Zhong, Fabien Lotte, Mark Girolami, and Anatole Le´cuyer. Classifyingeeg for brain computer interfaces using gaussian processes. Pattern RecognitionLetters, 29(3):354–359, 2008. → pages 63, 64109
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- A user-customized self-paced brain computer interface
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
A user-customized self-paced brain computer interface Bashashati, Hossein 2017
pdf
Page Metadata
Item Metadata
Title | A user-customized self-paced brain computer interface |
Creator |
Bashashati, Hossein |
Publisher | University of British Columbia |
Date Issued | 2017 |
Description | Much attention has been directed towards synchronous Brain Computer Interfaces (BCIs). For these BCIs, the user can only operate the system during specific system-defined periods. Self-paced BCIs, however, allow users to operate the system at any time he/she wishes. The classification of Electroencephalography (EEG) signals in self-paced BCIs is extremely challenging, as the BCI system does not have any clue about the start time of a control task. Also, the data contains a large number of periods during which the user has no intention to control the BCI. For sensory motor self-paced BCIs (focus of this thesis), the brain of a user goes through several well-defined internal state changes while performing a mental task. Designing classifiers that exploit such temporal correlations in EEG data can enhance the performance of BCIs. It is also important to customize these BCIs for each user, because the brain characteristics of different people are not the same. In this thesis, we first develop a unified comparison framework to compare the performance of different classifiers in sensory motor BCIs followed by rigorous statistical tests. This study is the largest of its kind as it has been performed on 29 subjects of synchronous and self-paced BCIs. We then develop a Bayesian optimization-based strategy that automatically customizes a synchronous BCI based on the brain characteristics of each individual subject. Our results show that our automated algorithm (which relies on less sophisticated feature extraction and classification methods) yields similar or superior results compared to the best performing designs in the literature. We then propose an algorithm that can capture the time dynamics of the EEG signal for self-paced BCI systems. We show that this algorithm yields better results compared to several well-known algorithms, over 13 self-paced BCI subjects. Finally, we propose a fully automatic, scalable algorithm that customizes a self-paced BCI system based on the brain characteristics of each user and at the same time captures the dynamics of the EEG signal. Our final algorithm is an important step towards transitioning BCIs from research environments to real-life applications, where automatic, scalable and easy to use systems are needed. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2017-04-18 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0343651 |
URI | http://hdl.handle.net/2429/61248 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2017-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |
Download
- Media
- 24-ubc_2017_may_bashashatisaghezchi_hossein.pdf [ 951.47kB ]
- Metadata
- JSON: 24-1.0343651.json
- JSON-LD: 24-1.0343651-ld.json
- RDF/XML (Pretty): 24-1.0343651-rdf.xml
- RDF/JSON: 24-1.0343651-rdf.json
- Turtle: 24-1.0343651-turtle.txt
- N-Triples: 24-1.0343651-rdf-ntriples.txt
- Original Record: 24-1.0343651-source.json
- Full Text
- 24-1.0343651-fulltext.txt
- Citation
- 24-1.0343651.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0343651/manifest