Dynamic Bayesian Networks Modeling and Analysis of Neural Signals

by Junning Li
B.Sc., Tsinghua University, 2002
M.Sc., Tsinghua University, 2005

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
in The Faculty of Graduate Studies (Electrical and Computer Engineering)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
August, 2009
© Junning Li 2009

Abstract

Studying interactions between different brain regions or neural components is crucial in understanding neurological disorders. Dynamic Bayesian networks, a type of statistical graphical model, have been suggested as a promising tool to model neural communication systems. This thesis investigates the use of dynamic Bayesian networks for analyzing neural connectivity, with a focus on three topics: structural feature extraction, group analysis, and error control in learning network structures.

Extracting interpretable features from experimental data is important for clinical diagnosis and for improving experiment design. A framework is designed for discovering structural differences, such as the pattern of sub-networks, between two groups of Bayesian networks. The framework consists of three components: Bayesian network modeling, statistical structure-comparison, and structure-based classification. In a study on stroke using surface electromyography, this method detected several coordination patterns among muscles that could effectively differentiate patients from healthy people.

Group analyses are widely conducted in neurological research. However, for dynamic Bayesian networks, the performance of different group-analysis methods had not been systematically investigated. To provide guidance on selecting group-analysis methods, three popular methods, i.e.
the virtual-typical-subject, the common-structure and the individual-structure methods, were compared in a study on Parkinson’s disease, from the aspects of their statistical goodness-of-fit to the data, and more importantly, their sensitivity in detecting the effect of medication. The three methods led to considerably different group-level results, and the individual-structure approach was more sensitive to the normalizing effect of medication.

Controlling errors is a fundamental problem in applying dynamic Bayesian networks to discovering neural connectivity. An algorithm is developed for this purpose, particularly for controlling the false discovery rate (FDR). It is proved that the algorithm is able to curb the FDR under user-specified levels (for example, conventionally 5%) at the limit of large sample size, and meanwhile recover all the true connections with probability one. Several extensions are also developed, including a heuristic modification for moderate sample sizes, an adaption to prior knowledge, and a combination with Bayesian inference.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Statement of Co-Authorship
1 Introduction and Overview
  1.1 Discovering Neural Interactions
  1.2 Current Methods
  1.3 Dynamic Bayesian Networks
  1.4 Applications of Bayesian Networks
  1.5 Thesis Outline
  Bibliography
2 Extracting Structural Features from Bayesian Networks
  2.1 Introduction
  2.2 Methods
    2.2.1 Framework and Components
    2.2.2 Bayesian Networks
    2.2.3 Sub-network Patterns and Muscle Synergies
    2.2.4 Classification
    2.2.5 Modeling sEMG Signals
  2.3 Results
    2.3.1 Real sEMG Datasets
    2.3.2 Learned Bayesian Networks
    2.3.3 Triple Patterns
    2.3.4 Classification Performance
  2.4 Conclusions and Discussions
  Bibliography
3 Comparing Group-Analysis Methods Based on Bayesian Networks
  3.1 Introduction
  3.2 Materials and Methods
    3.2.1 fMRI Data
    3.2.2 Dynamic Bayesian Networks
    3.2.3 Learning DBNs
    3.2.4 Comparison of Three Approaches
    3.2.5 Group Comparison
  3.3 Results
  3.4 Discussion
  Bibliography
4 Controlling the Error Rate in Learning Network Structures
  4.1 Introduction
    4.1.1 More Recent Related Works
  4.2 Controlling FDR with PC Algorithm
    4.2.1 Notations and Preliminaries
    4.2.2 PC Algorithm
    4.2.3 False Discovery Rate
    4.2.4 PC Algorithm with FDR
    4.2.5 Asymptotic Performance
    4.2.6 Computational Complexity
    4.2.7 Miscellaneous Discussions
  4.3 Empirical Evaluation
    4.3.1 Simulation Study
    4.3.2 Applications to Real fMRI Data
  4.4 Conclusions and Discussions
  4.5 Proof of Theorems
  4.6 Statistical Tests with Asymptotic Power Equal to One
  4.7 Error Rates of Interest
  Bibliography
5 Extending Error-Rate Control to Dynamic Bayesian Networks
  5.1 Integration of Prior Knowledge
    5.1.1 False Discovery Rate
    5.1.2 PCfdr* Algorithm with Prior Knowledge
    5.1.3 Conditional-Independence Test for fMRI Signals
    5.1.4 Test of Relative Consistence Among Graphs
    5.1.5 Application to Parkinson’s Disease
  5.2 Bayesian Inference with FDR-Controlled Prior
    5.2.1 Bayesian Inference with Uniform Prior Distribution
    5.2.2 Modelling with Prior Distribution Derived from FDR-Controlled PC Algorithm
    5.2.3 Simulation
    5.2.4 Application to Parkinson’s Disease
  5.3 Conclusions and Discussions
  Bibliography
6 Selecting Regions of Interest in fMRI Analysis
  6.1 Introduction
  6.2 Experimental Methods
    6.2.1 FMRI Image Pre-processing
    6.2.2 LLDA Algorithm
    6.2.3 Backward Step-wise Discrimination
    6.2.4 Contrasting Multiple Tasks
    6.2.5 Significance of Discrimination
    6.2.6 SPM Methodology
  6.3 Results
  6.4 Discussion
  6.5 Technical Appendix
  Bibliography
7 Conclusions and Discussions
  7.1 Summary
  7.2 Contributions and Potential Applications
  7.3 Future Work
  Bibliography

List of Tables

1.1 Publications on discovering neural interactions with Bayesian networks
2.1 Muscle coordination patterns significantly different among subjects
2.2 Classification error-rates using sEMG signals
3.1 Group-level BIC scores of simulated data
3.2 Group-level BIC scores of real data
3.3 Numbers of connections normalized by medication
3.4 Connections statistically significantly normalized by medication
4.1 Possible outcomes of multiple-hypothesis testing
4.2 Criteria for multiple-hypothesis testing
4.3 Ratios of extra computational time spent on FDR-control by PCfdr algorithm
4.4 Ratios of extra computational time spent on FDR-control by PCfdr* algorithm
4.5 Brain regions involved in bulb-squeezing data set
4.6 Brain regions involved in sentence-picture data set
4.7 Realized error rates on bulb-squeezing and sentence-picture data sets
5.1 Possible outcomes of multiple-hypothesis testing

List of Figures

1.1 Example of dynamic Bayesian networks
2.1 Representing the interactions of muscle activities with a Bayesian network
2.2 Example of essential graphs
2.3 Coordination patterns among three muscles
2.4 Cross-subject validation
2.5 Within-subject validation
2.6 Auto-correlation of sEMG “carrier” signals
2.7 Histogram of the temporal distribution of sEMG “carrier” signals
2.8 Muscle coordination networks learned from sEMG signals
2.9 Structural differences among muscle coordination networks
2.10 Boxplot of the frequency of muscle coordination patterns
2.11 Classification trees using networks as the classification features
2.12 Classification trees using sub-network patterns as the classification features
2.13 Principal components of the log likelihood of Bayesian networks learned from sEMG signals
3.1 Experimental task design
3.2 Example of dynamic Bayesian networks
3.3 Comparisons of three group-analysis methods: at structural level
3.4 Comparisons of three group-analysis methods: at parametric level
3.5 Variances of network structures learned with the “individual-structure” approach
3.6 Connections statistically significantly normalized by medication
4.1 Simulation design: directed acyclic graphs
4.2 Simulation results: the false discovery rate
4.3 Simulation results: the detection power
4.4 Simulation results: the type I error rate
4.5 Simulation results: computation time
4.6 Networks learned from bulb-squeezing data
4.7 Networks learned from sentence-picture data
5.1 Possible connections of dynamic Bayesian networks
5.2 Consistent brain connectivity inferred from fMRI data
5.3 Simulation design: dynamic Bayesian network
5.4 Simulation results: the false discovery rate
5.5 Simulation results: the detection power
5.6 Simulation results: histogram of Bayesian-information-criterion scores
5.7 Brain connectivity inferred from fMRI data
6.1 Comparison of fMRI group-analysis methods
6.2 Activation paradigm for fMRI motor task
6.3 Comparison of group results: EG + IG
6.4 Comparison of group results: EG vs IG
6.5 Discriminability of overall discriminant function across subjects

Acknowledgements

This work would not have been possible without the support, enthusiasm and guidance of my supervisor, Dr. Z. Jane Wang, to whom I owe a debt of thanks. I would also like to thank Dr. Martin J. McKeown (Department of Medicine) for his constructive comments and valuable suggestions. Many thanks to Dr.
Janice Eng (Department of Physical Therapy), and PhD candidate Samantha Palmer (Department of Medicine), with whom I collaborated to produce the manuscripts contained herein. I would also like to acknowledge the feedback and support from my fellow graduate students in our signal-processing laboratory.

I am indebted to many people for their generosity and support during my four-year study at the University of British Columbia. Dr. Max Hui-Bon-Hoa (Hong Kong University), my spiritual mentor, gave me constant love, support and wise advice during my life challenges. My dear friends, Zhi-Ming Fan, Stephen Ney, Jia Li, Henry Ngan, Michelle Tan, Kevin Khang, Eric Vyskocil, and Xue-Gui Song, not only shared in my excitement, but also encouraged me when I was frustrated during this long PhD journey. Rod Foist, my fellow PhD student, helped me proofread some of my writing. Andrey Vlassov, our excellent software engineer, helped me recover my research data from crashed hard drives. I thank all my friends in Vancouver, Beijing, and California.

The work produced here was funded partially by the University of British Columbia Graduate Fellowship.

Statement of Co-Authorship

This thesis is a compilation of several manuscripts, resulting from collaboration between several researchers. Chapter 2 is based on a research article published in IEEE Transactions on Biomedical Engineering, co-authored with Dr. Z. Jane Wang, Dr. Janice J. Eng and Dr. Martin J. McKeown. Chapter 3 is based on a research article published in NeuroImage, co-authored with Dr. Z. Jane Wang, Samantha J. Palmer and Dr. Martin J. McKeown. Chapter 4 is based on a research article published in the Journal of Machine Learning Research, co-authored with Dr. Z. Jane Wang.
Chapter 5 is based on an article published in the Proceedings of the 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, and an article accepted by the 3rd International Conference on Bioinformatics and Biomedical Engineering. They were co-authored with Dr. Z. Jane Wang and Dr. Martin J. McKeown. Chapter 6 is based on a research article published in NeuroImage, co-authored with Dr. Martin J. McKeown, Dr. Xuemei Huang, Dr. Mechelle M. Lewis, Dr. Seungshin Rhee, Dr. K.N. Young Truong and Dr. Z. Jane Wang. The authors’ specific contributions to these collaborations are detailed below.

The research outline was designed jointly by the author, Dr. Z. Jane Wang and Dr. Martin J. McKeown. The majority of the research, including literature survey, model design, proofs of theorems, numerical simulation, statistical data analysis and compilation of results, was conducted by the author, with suggestions from Dr. Z. Jane Wang. The manuscripts were primarily drafted by the author, with helpful comments from Dr. Z. Jane Wang and Dr. Martin J. McKeown. The neurological part of Chapter 6 was written by Dr. McKeown. He also wrote the paragraph about the neurological background in the introductions of Chapters 2 and 3, and helped with the neurological interpretation of the analysis results in Chapters 2, 3 and 4. Part of the technical review in the introductions of Chapters 2 and 3 was written by Dr. Z. Jane Wang. The paragraph giving the kinetic interpretation of the analysis results in Chapter 2 was written by Dr. Janice J. Eng. The description of the data-acquisition process in Chapters 3 and 4 was originally written by Samantha J. Palmer. The statistical parametric mapping in Chapter 6 was conducted by Dr. Xuemei Huang, Dr. Mechelle M. Lewis, Dr. Seungshin Rhee and Dr. K.N. Young Truong.

Chapter 1
Introduction and Overview

1.1 Discovering Neural Interactions

Studying the interactions among neural components, e.g.
different regions of the brain and nerve cells driving different muscles, is crucial in understanding neurological disorders. All of our sensations, thoughts and movements trigger neural signals transmitted between nerve cells. If nerve cells cannot communicate correctly, functions are not coordinated and the body does not respond properly. For example, Parkinson’s disease patients can no longer move voluntarily when dopamine, a neurotransmitter that carries signals between nerve cells, is depleted.

Researchers studying neural interactions want to find out which neural components interact with each other, when they interact, and how they interact, i.e. stimulatively or repressively. Further, researchers are also interested in whether certain interactions are associated with the state of health, for instance, medication, years of onset and severity of disease.

Very commonly, experiments are performed at group level to discover features that are associated with a population, for example, stroke patients. Researchers are usually interested not only in each individual but also, and even more so, in a group of people. They want to answer the questions, “what are the interactions commonly shared by a population?” and “what are the interactions possessed only by a specific individual?”

Modern neuro-signal recording techniques provide us with the ability to sense neural impulses in living humans, and also encourage the development of suitable mathematical analysis tools. Functional magnetic resonance imaging (fMRI) non-invasively measures the increase in oxygen in venous blood. Electromyography (EMG) detects the electrical impulses directly driving muscles. Positron emission tomography (PET) and electroencephalography (EEG) can also record the brain’s activities.

These neural signals share certain common features. First, they are all stochastic processes accompanied by physiological noise. Repeated experimental trials yield similar but not identical signals.
Noise induced by respiration, heartbeat, etc. is not simply white noise, but actually has certain patterns and structures. Second, experiments can generate long, multi-channel time series, but the number of subjects is usually relatively small because of the high cost of experiments or the limited availability of patients. Analysis methods should be adaptable to these features and also meet the concerns of biomedical researchers.

1.2 Current Methods

A wide range of mathematical methods, such as correlation threshold [9], linear decomposition [8, 30], multivariate auto-regression, structural equation models (SEM) [5], dynamic causal models [15] and clustering, have been employed to discover neural interactions.

The correlation threshold (CT) method [9, 18, 23, 32, 44, 46] estimates how strongly two neural components interact with each other by calculating the correlation coefficient between their activities. If the correlation coefficient is so high that it is very unlikely to be a result of chance alone, then the two components are considered associated with each other. Backed by random field theory [9], CT is statistically rigorous and is able to control the Type-I error rate and to estimate the strength of interactions. However, it analyzes only two components at a time, so it cannot distinguish whether the two components interact directly or indirectly through a third component [21].

Linear decomposition methods such as principal component analysis (PCA) and independent component analysis (ICA) [8, 30] assume that observed neural signals are a combination of many simultaneous functional activities. The goal of decomposition is to find the underlying functional activities that correlate closely with the input stimuli, and to detect the neural components executing the functions. Linear decomposition is computationally efficient because it does not require a model-selection procedure.
Its applications to sEMG data [30] and fMRI group analysis [8] have produced meaningful results. However, simulation showed that it is not sensitive to focal interactions [46].

Multivariate auto-regression (MAR) models [6, 19, 45], structural equation models (SEM) [5, 22, 29, 41, 49] and dynamic causal models (DCM) [14, 15, 39–41] are the most popular multivariate linear regressions used in discovering neural interactions. The three models share a common format and can be unified, as shown in Eq. (1.1):

\[ X(t) = \sum_{i=0}^{m} A_i X(t-i) + e(t), \tag{1.1} \]

where $X(t)$ is a multi-channel time series, the $A_i$ ($i = 0, \ldots, m$) are regression matrices and $e(t)$ is a noise term. A non-zero coefficient, $A_i(j, k)$, implies that the $j$-th channel depends on the $k$-th channel with a lag of $i$ time points, which suggests that neural component $j$ interacts with another component $k$. MAR, SEM and DCM emphasize different research purposes by setting different constraints on Eq. (1.1). MAR focuses only on lagged interactions, by setting the constraint $A_0 = 0$. Traditional SEM focuses only on simultaneous interactions, by setting the constraint $A_i = 0$ ($i = 1, \ldots, m$). DCM is distinct from the other two in its consideration of modulated interactions [41] and in modelling neural activities as hidden variables. These features possibly bring DCM closer to reality.

Multivariate linear regression models are commonly statistically rigorous, and are supported by many well-developed algorithms for various purposes, such as spectrum analysis, model selection, and statistical inference. They are also very flexible. For instance, modulated interactions can be modelled by adding bilinear terms, as in DCM. Dependence relationships can be visualized as graphs, to which further network analysis can be applied. Nevertheless, these models usually demand much computation, especially when features such as latent variables or bilinear terms are involved.
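As a rough illustration of how a non-zero regression coefficient in Eq. (1.1) is read as an interaction, the following sketch simulates the MAR special case with a single lag (m = 1, A_0 = 0) and re-estimates A_1 by ordinary least squares. It is a minimal sketch, not the procedure used in any of the cited studies: the function names, the three-channel matrix `A_true`, the noise level and the cut-off `THRESHOLD` are all assumptions chosen for illustration, and a fixed cut-off stands in for a proper statistical test.

```python
import random

def simulate_mar1(A1, n_steps, noise_std=0.1, seed=0):
    """Simulate X(t) = A1 X(t-1) + e(t), i.e. Eq. (1.1) with m = 1 and A0 = 0."""
    rng = random.Random(seed)
    c = len(A1)
    X = [[0.0] * c]
    for _ in range(n_steps - 1):
        prev = X[-1]
        X.append([
            sum(A1[j][k] * prev[k] for k in range(c)) + rng.gauss(0.0, noise_std)
            for j in range(c)
        ])
    return X

def solve_linear(G, b):
    """Solve G a = b by Gaussian elimination with partial pivoting."""
    n = len(G)
    M = [row[:] + [b[i]] for i, row in enumerate(G)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    a = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][k] * a[k] for k in range(r + 1, n))
        a[r] = (M[r][n] - s) / M[r][r]
    return a

def fit_mar1(X):
    """Least-squares estimate of A1; row j holds the coefficients of channel j."""
    c = len(X[0])
    pairs = list(zip(X[:-1], X[1:]))  # (X(t-1), X(t)) for every usable t
    # Normal equations: (sum of X(t-1) X(t-1)^T) a_j = sum of X(t-1) x_j(t)
    G = [[sum(p[a] * p[b] for p, _ in pairs) for b in range(c)] for a in range(c)]
    A_hat = []
    for j in range(c):
        rhs = [sum(p[a] * x[j] for p, x in pairs) for a in range(c)]
        A_hat.append(solve_linear(G, rhs))
    return A_hat

# Hypothetical ground truth: channel 0 depends on channel 1 at lag 1.
A_true = [[0.5, 0.4, 0.0],
          [0.0, 0.5, 0.0],
          [0.0, 0.0, 0.5]]
X = simulate_mar1(A_true, n_steps=5000)
A_hat = fit_mar1(X)

THRESHOLD = 0.2  # illustrative cut-off, not a significance test
edges = {(j, k) for j in range(3) for k in range(3)
         if j != k and abs(A_hat[j][k]) > THRESHOLD}
print(edges)  # pairs (j, k): channel j depends on channel k at lag 1
```

The recovered pairs can be drawn as arrows from channel k to channel j, which is the graph view of the dependence relationships mentioned above. Fitting a non-zero $A_0$ (the SEM case) would need a different estimator, since $X(t)$ then appears on both sides of Eq. (1.1).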
Clustering techniques, such as self-organizing map (SOM) [13, 38], k-means clustering [2, 3, 12], hierarchical clustering [10] and graph clustering [11, 20, 28], provide a data-driven approach to exploring unknown neural interactions without assuming a predefined interaction pattern. The underlying idea of the clustering approach is that if neural components cluster together, they behave similarly, which suggests interactions among them. Clustering is usually implemented with fast, heuristic algorithms and thus is suitable for large-scale problems where it is difficult to perform rigorous statistical analysis. New interactions that are discovered can be verified later with rigorous statistical models. However, the data-driven feature also brings disadvantages. Certain algorithms either become trapped in locally optimal solutions or cannot be proved to converge. Statistical criteria such as specificity and sensitivity cannot be theoretically analyzed for clustering methods.

Bayesian networks, a type of graphical model, are suitable for discovering neural interactions due to their graphical nature and rigorous underlying theory. First, the structural similarity between Bayesian networks and the nervous system makes the former a promising tool for modelling the latter. The nervous system is a network of connected neurons that transmit electrochemical signals to each other through nerve fibres. The topology of this complicated system can be naturally abstracted as a graph, that is, nodes connected with edges. Second, the edges of a Bayesian network are directional, which is suitable for modeling the transmission path of neural signals. Third, the node variables of a Bayesian network depend only locally on their parent nodes, which is similar to neurons’ local and direct interactions with their neighbor neurons through nerve fibres. Fourth, Bayesian networks are modular and flexible.
Given a network structure, different statistical models can be used to describe the dependence relationships between nodes and their parent nodes. Fifth, plenty of model-learning and computation methods have been developed for Bayesian networks by researchers in the field of artificial intelligence.

1.3 Dynamic Bayesian Networks

A dynamic Bayesian network (DBN) [36, 42] is an extension of a Bayesian network (also called a belief network) to stochastic processes. The term “dynamic” does not mean that the model itself changes temporally, but that the network models a dynamic system. Also, “Bayesian” does not mean that Bayesian statistics are employed, but that Bayes’ rule is employed in inference. Though the term “temporal belief network” seems to be a more suitable name, the term “dynamic Bayesian network” has been more popular.

A Bayesian network employs a directed acyclic graph (DAG) to encode conditional independence among random variables. The essential concept in the encoding is d-separation [24, 37], which we will introduce after defining related concepts in graph theory. A DAG $G$ is a pair $(V, E)$ where $V$ is a set of vertices and $E \subseteq V \times V$ is a set of arrows without cycles. A chain between two vertices $\alpha$ and $\beta$ is a sequence $\alpha = \alpha_0, \ldots, \alpha_n = \beta$ of distinct vertices such that $(\alpha_{i-1}, \alpha_i) \in E$ or $(\alpha_i, \alpha_{i-1}) \in E$ for all $i = 1, \ldots, n$. Vertex $\beta$ is a descendant of vertex $\alpha$ if and only if there is a sequence $\alpha = \alpha_0, \ldots, \alpha_n = \beta$ of distinct vertices such that $(\alpha_{i-1}, \alpha_i) \in E$ for all $i = 1, \ldots, n$. If three disjoint subsets $A$, $B$ and $S \subseteq V$ satisfy the condition that any chain $\pi$ between any $\alpha \in A$ and any $\beta \in B$ contains a vertex $\gamma \in \pi$ such that either (i) the arrows of $\pi$ do not meet head-to-head at $\gamma$ and $\gamma \in S$, or (ii) the arrows of $\pi$ meet head-to-head at $\gamma$ and $\gamma$ is neither in $S$ nor has any descendants in $S$, then $S$ d-separates $A$ and $B$.
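The definition of d-separation above can be checked mechanically on a small graph. The sketch below follows the wording of the definition directly: it enumerates every chain between a vertex of A and a vertex of B and tests whether some interior vertex blocks it. The function names and the four-vertex example graph are assumptions made for illustration; enumerating all chains is exponential in the worst case, so this is only meant for toy graphs, not as an efficient algorithm.

```python
def descendants(edges, v):
    """All vertices reachable from v along directed arrows (v itself excluded)."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    seen, stack = set(), [v]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def chains(edges, start, end):
    """Yield every chain from start to end: distinct vertices, arrows in either direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def extend(path):
        if path[-1] == end:
            yield path
            return
        for n in neighbours.get(path[-1], ()):
            if n not in path:
                yield from extend(path + [n])

    yield from extend([start])

def chain_is_blocked(edges, chain, S):
    """True if some interior vertex of the chain satisfies condition (i) or (ii)."""
    edge_set = set(edges)
    for prev, mid, nxt in zip(chain, chain[1:], chain[2:]):
        head_to_head = (prev, mid) in edge_set and (nxt, mid) in edge_set
        if head_to_head:
            # Condition (ii): collider outside S with no descendant in S.
            if mid not in S and not (descendants(edges, mid) & S):
                return True
        elif mid in S:
            # Condition (i): non-collider that belongs to S.
            return True
    return False

def d_separates(edges, A, B, S):
    """True if S d-separates A and B in the DAG given by the list of arrows."""
    return all(chain_is_blocked(edges, chain, S)
               for a in A for b in B for chain in chains(edges, a, b))

# Hypothetical DAG: a -> c <- b, and c -> d.
edges = [("a", "c"), ("b", "c"), ("c", "d")]
print(d_separates(edges, {"a"}, {"b"}, set()))   # collider c blocks the only chain
print(d_separates(edges, {"a"}, {"b"}, {"d"}))   # conditioning on a descendant of c
print(d_separates(edges, {"a"}, {"d"}, {"c"}))   # non-collider c is in S
```

In the example, a and b are d-separated by the empty set because the only chain between them meets head-to-head at c, but conditioning on c or on its descendant d opens that chain.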
Let $P$ denote a probability distribution of random variables $x$, let $x_\alpha$ denote the random variable represented by vertex $\alpha$, and let $x_A = \{x_\alpha \mid \alpha \in A\}$. $P$ is said to admit the global Markov property according to a DAG $G$ if

\[ S \text{ d-separates } A \text{ and } B \;\Longrightarrow\; x_A \perp x_B \mid x_S, \]

i.e. $x_A$ and $x_B$ are independent given $x_S$. The same set of conditional-independence relationships can be encoded by different DAGs, and a DAG can be converted to an essential graph that uniquely encodes the set of conditional-independence relationships [1]. If $P$ obeys the global Markov property, it can be factorized according to $G$ as

\[ f(x) = \prod_{\alpha \in V} f\left(x_\alpha \mid x_{\mathrm{pa}(\alpha)}\right), \tag{1.2} \]

where $f(x)$ is the joint probability density function (pdf) of $P$ and $\mathrm{pa}(\alpha) = \{\beta \mid (\beta, \alpha) \in E\}$ are the parents of $\alpha$ [24]. In summary, a Bayesian network is composed of a DAG that encodes the conditional-independence relationships and a set of conditional probability distributions that specifies the joint distribution.

Figure 1.1: An example of dynamic Bayesian networks. $Z_t = [U_t, X_t, Y_t]^T$ is a first-order Markov process with dependence relationships specified as in the figure. $X_t$ is a Markov process whose transition distribution $P(X_t \mid X_{t-1}, U_t)$ varies according to the input $U_t$. Arrows from $X_{t-1}$ and $U_t$ to $X_t$ are associated with the transition distribution. $Y_t$ is the output observation at time $t$. The arrow from $X_t$ to $Y_t$ is associated with the distribution $P(Y_t \mid X_t)$. Such a process can be represented by the first two time-slices circled by dots.

A multi-channel stochastic process can be modelled with a Bayesian network of $C \times T$ vertices, where $C$ is the number of channels, $T$ is the number of time points, and each vertex represents the signal of a channel at a time point. In this case, the DAG is subject to an additional constraint: vertices at time $t$ cannot have vertices after $t$ as their parents, because the future cannot influence either the present or the past.
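The C × T-vertex network and its temporal constraint can be made concrete with a short sketch. The code below expands a small set of dependence rules of the form (source channel, target channel, lag) into arrows between (channel, time) vertices, and rejects any rule whose parent would lie in the future of its child. The rule format, the function names and the choice of four time points are assumptions made for illustration; the three rules mirror the arrows described in the caption of Fig. 1.1. The sketch does not check that the lag-0 rules are themselves acyclic.

```python
def unroll(template, n_time_points):
    """Expand (source, target, lag) rules into arrows between (channel, time) vertices."""
    edges = []
    for t in range(n_time_points):
        for source, target, lag in template:
            if lag < 0:
                # A negative lag would make a future vertex the parent of an earlier one.
                raise ValueError("a parent cannot lie in the future of its child")
            if t - lag >= 0:
                edges.append(((source, t - lag), (target, t)))
    return edges

def parents(edges, vertex):
    """Sorted list of the parents pa(vertex) in the unrolled DAG."""
    return sorted(p for p, child in edges if child == vertex)

# Rules read off the caption of Fig. 1.1: U_t -> X_t, X_{t-1} -> X_t, X_t -> Y_t.
template = [("U", "X", 0), ("X", "X", 1), ("X", "Y", 0)]
edges = unroll(template, n_time_points=4)

print(len(edges))                 # number of arrows among the 3 x 4 vertices
print(parents(edges, ("X", 2)))   # parents of X at time 2
```

Because the same rules are applied at every time point, the parent set of a vertex has the same form at every t once t is at least the largest lag, which is what allows the repeated structure to be summarized by a few time-slices.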
If the same dependence relationships repeat time after time and the signals at $t$ depend only on the signals from $t - N$ to $t$, then the whole network can be “rolled up” as its DBN representation, a DAG composed of only vertices from $t - N$ to $t$ (see Fig. 1.1). In certain contexts, we call normal Bayesian networks “static Bayesian networks” to distinguish them from DBNs. Though the directed arcs of static Bayesian networks cannot be simply explained as causality, the arcs of DBNs from the past to the present can be interpreted as Granger causality [17]. Many classical multivariate probabilistic models, such as hidden Markov models, Kalman filters, and multivariate auto-regression models, are actually special cases of DBNs [36].

1.4 Applications of Bayesian Networks

Only a limited number of papers have been published since (D)BNs were introduced into the field of discovering neural interactions in around 2003. At the current stage, researchers are still exploring suitable techniques for using DBNs in this field. In this section, we chronologically review the development of (D)BNs for discovering neural interactions.

In 2003, Mitchell et al. proposed Naive Bayesian Networks (NBNs), a special case of static Bayesian networks with strong independence assumptions, to identify instantaneous cognitive states from fMRI data [34]. Because the structure of NBNs is predefined, they can be applied to huge networks, as the authors did to networks of about 2000 nodes. Statistical tests showed that NBNs were able to utilize the discriminating information contained in the fMRI data, but the authors did not compare NBNs with other classifiers to investigate whether the former exploited the discriminating information better or not.

Table 1.1: Publications on Applications of Bayesian networks in Discovering Neural Interactions.
S = Static, D = Dynamic; CPD = Conditional Probability Distribution, Gau = Gaussian, Cat = Categorical, Bin = Binary; NBN = Naive Bayesian Network, MAR = Multivariate Auto-Regression, DML-HMM = Dynamically Multi-Linked Hidden Markov Model, MCMC = Markov Chain Monte Carlo; SEM = Structure Expectation Maximization (search algorithm) or Structural Equation Model (comparison), SVM = Support Vector Machine, GNBN = Gaussian Naive Bayesian Network, KNN = K-Nearest Neighbor.

Paper | [34] | [7] | [47] | [48] | [25] | [27]
Year | 2003 | 2004 | 2005 | 2006 | 2006 | 2007
Bayesian Network
  Static or Dynamic | S | D | D | D | S | D
  CPD | Gau. | Cat. | Bin. & Gau. | Gau. | Gau. | Gau.
  Number of Nodes | 7-20,000 | 150 | 10 (5 hidden) | 10-13 | 7 | 7
  Constraint | NBN | MAR | DML-HMM | No | No | Yes
Structure Search
  Score Function | No | BDE | BIC | BIC | BIC | BIC
  Search Algorithm | No | Greedy | SEM | MCMC | MCMC | MCMC
  Final Result | No | Best | Best | Best | Best | Best
Data Set
  Sample Size | 10-33 subj. | 28 subj. | 6-28 subj. | 3 subj. | 12 subj.
  Signal | fMRI | fMRI | fMRI | fMRI | sEMG | fMRI
  Number of Components | 7-20,000 | 150 | 5 | 10-13 | 7 | 6
Group Analysis
  Group Model | Pool | Pool | Pool | Pool | Individual | Common Structure
Classification | $P(G_1 \mid X)$ | $P(X \mid G_1)$ | $P(X \mid G_1)$ | $P(G_1 \mid X)$ | No | No
  | $P(G_2 \mid X)$ | $P(G_2 \mid X)$ | $P(X \mid G_2)$ | $P(X \mid G_2)$ | |
Structure Analysis | No | No | Edge | No | Sub-network | Edge
Validation
  Biological | 1 Cit. | 10 Cit. | No | ≥ 16 Cit. | No | No
  Stat. Infer. | Yes | Yes | No | No | No | No
  Simulation | No | No | No | No | No | No
  Comparison | No | SVM & GNBN | KNN | SEM | No | No

Although the purpose of the study was classification rather than discovering neural interactions, the authors did suggest that Bayesian networks are an applicable and promising tool for studying neural signals.

One year later, Burge et al. proposed DBNs for both detecting altered neural connections and classification, with an application to fMRI data [7]. Compared with Mitchell et al.’s NBNs, Burge et al.’s DBNs were much more complex: the structures of the DBNs were not predefined, but were learned from the data.
Since the authors studied networks with a considerable number of nodes (about 150), they excluded simultaneous interactions and focused only on lagged interactions, which decomposes the model-selection problem into many simple local variate-selection problems. Though Gaussian NBNs, special cases of static Bayesian networks, outperformed DBNs in classification, hypothesis testing strongly suggested that temporal information should be exploited, which was consistent with previous research [35]. The authors also compared the DBNs of the case group against those of the control group, but did not perform rigorous hypothesis testing in this comparison.

To the best of our knowledge, the most complicated Bayesian network used in this field is a DBN with hidden nodes [47] (2006), where the true neural activities were modelled as hidden nodes and the dynamic structure between the hidden nodes was learned from the data with the structure expectation maximization algorithm. Because of the high computational complexity of this algorithm, the authors investigated a network of only 5 hidden nodes. However, the results were very promising. The DBNs not only outperformed k-nearest neighbor (KNN) classifiers but also discovered interaction patterns that were consistent across subjects.

Zheng et al.’s paper published in NeuroImage [48] (2006) supported Bayesian networks as a promising approach to discovering neural interactions. Though the method used in this study was simple, just static Gaussian Bayesian networks, the interactions discovered were interpreted neurologically and were backed by more than 15 citations. In a simulation, Bayesian networks yielded higher likelihood than structural equation models (SEM) did. However, the study also found that when there were more than 13 nodes, the learned structure of Bayesian networks did not match the true structure well.

Li et al. explored group analysis based on Bayesian networks in papers [26, 27].
In [7, 34, 47, 48], group analysis was performed by pooling or averaging all the data of the same experimental group, and then learning one Bayesian network from the pooled or averaged data as if they were sampled from the same subject. An alternative approach was explored in [26]: a Bayesian network was learned for each subject individually, and group analysis was performed by comparing two groups of Bayesian networks with the concept of “network motif” [33]. In [27], a trade-off between the two approaches was proposed: the Bayesian networks of the individual subjects were constrained to share the same structure but allowed to have different parameters to accommodate inter-subject variability.

1.5 Thesis Outline

The research presented in this thesis focuses on adapting classical methods, or creating novel methods, for dynamic Bayesian networks to meet the particular needs of neurological research and the limitations of current neural-signal acquisition technology. Since biomedical research usually has high and special requirements of interpretability, generality, and reliability, we mainly investigated the following topics: feature extraction from Bayesian networks, group analysis, and error control in learning network structures.

Chapter 2 describes our research on extracting structural features from Bayesian networks. To provide guidance on further experiment design and clinical diagnosis, mathematical models in biomedical research should not only fit experimental data well, but also be able to extract interpretable features from data. A framework is designed to discover statistically significant structural differences, such as the pattern of sub-networks, between two groups of Bayesian networks. The framework includes three components: Bayesian-network modeling, statistical structure-comparison, and structure-based classification.
The framework is demonstrated in an application to discovering the coordination patterns of muscle activities using surface electromyography.

Chapter 3 presents our research on group-analysis methods for dynamic Bayesian networks. A common and fundamental challenge in biomedical studies is to meaningfully combine population-common features and individual-specific features for group inference. Dynamic Bayesian networks have recently been introduced for modelling neural connectivity, but their group-analysis methods have not been systematically investigated. In this chapter, three popular group-analysis methods, i.e. the “virtual-typical-subject” approach, the “individual-structure” approach [16] and the “common-structure” approach [22, 31], are compared in a study on Parkinson’s disease using fMRI, in terms of their statistical goodness-of-fit to the data and, more importantly, their sensitivity in detecting the effect of medication on the disease. Three fundamental questions are investigated: “which approach most accurately reflects the underlying biomedical behavior?”, “do the approaches lead to considerably different analysis results?”, and “how can the suitable approach be selected?” This chapter ends with a summary of our observations in the comparison, including the limitations of the three methods, as well as a vision for future research on group analysis.

Chapter 4, a highlight of the thesis, elaborates our research on error control in learning network structures. In real-world applications, graphical statistical models such as dynamic Bayesian networks are not only a tool for operations such as classification or prediction; the network structures of the models themselves are usually also of great interest (e.g. in modeling brain connectivity). The false discovery rate (FDR) [4, 43], the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications.
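As a rough numerical illustration of why this criterion matters for sparse networks, the sketch below estimates the share of false connections among all claimed connections when every candidate connection is tested at a fixed type I error rate. The function name and all numbers (1000 candidate connections, 20 or 500 true ones, a 5% type I error rate, 90% power) are hypothetical, and the ratio of expected counts is used only as a simple stand-in for the FDR, which is formally the expectation of the ratio.

```python
def approx_fdr(n_candidates, n_true, alpha, power):
    """Expected false claims divided by expected total claims.

    This ratio of expectations only approximates the FDR (the
    expectation of the ratio), but it is adequate for a rough feel.
    """
    expected_false = alpha * (n_candidates - n_true)
    expected_true = power * n_true
    return expected_false / (expected_false + expected_true)


# Hypothetical sparse network: 20 true connections among 1000 candidates.
sparse = approx_fdr(n_candidates=1000, n_true=20, alpha=0.05, power=0.9)

# Hypothetical dense network: 500 true connections among 1000 candidates.
dense = approx_fdr(n_candidates=1000, n_true=500, alpha=0.05, power=0.9)

print(f"sparse: {sparse:.3f}, dense: {dense:.3f}")
# prints "sparse: 0.731, dense: 0.053"
```

Under these assumed numbers, roughly 73% of the connections claimed in the sparse case would be false, against about 5% in the dense case, even though the per-test type I error rate is 5% in both.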
Controlling the FDR gives researchers statistical confidence in the learned networks. However, current structure-learning algorithms have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. In this chapter, a novel algorithm is designed to control the FDR of the network connections inferred from experimental data under user-specified levels. It is proved that the new algorithm is able to curb the FDR under user-specified levels (for example, conventionally 5%) at the limit of large sample size, and meanwhile recover all the true connections with probability one. We named this algorithm the PCfdr algorithm. This chapter includes a detailed explanation of the algorithm and its heuristic modification, theoretical proofs of its asymptotic performance, an analysis of its computational complexity, extensive evaluation with simulated data and real data, and a discussion of its alternatives. Based on the original version of the PCfdr algorithm, several modified or extended versions are developed. Hereafter, we refer to this family of algorithms as the “PCfdr algorithms”, and refer to the original version as the “PCfdr algorithm”.

Chapter 5 extends the control over the false discovery rate from static Bayesian networks to dynamic Bayesian networks. The PCfdr algorithms in Chapter 4 are only basic implementations of the main idea. In this chapter, two extensions are designed for dynamic Bayesian networks. One is an adaptation to prior knowledge, allowing users to specify which edges must appear in the network, which cannot, and which are to be learned from data. This extension is naturally applicable to dynamic Bayesian networks, by simply regarding them as Bayesian networks that cannot have edges from time $t + 1$ to time $t$.
The other extension is using the PCfdr algorithms to improve Bayesian inference of dynamic Bayesian networks. The idea is to first learn a network with the PCfdr algorithms, and then make Bayesian inference based on a prior distribution derived from the learned network. It accelerates Bayesian inference and is relatively robust to perturbing noise.

Chapter 6, as a supplement to Chapters 2 to 5, provides a fast method to select node variables (e.g. brain regions-of-interest) for Bayesian-network modelling. Methods in Chapters 2 to 5 assume that the node variables of interest have been pre-defined, which, however, is not always the case in exploratory research. The method presented in Chapter 6, designed for fMRI studies, uses extended linear discriminant analysis to preliminarily select brain regions of interest from thousands of candidate voxels. The activities of those selected brain regions can be analyzed further with Bayesian networks.

Chapter 7 briefly summarizes the results of the thesis, provides concluding remarks, and discusses potential future work.

The thesis itself is written in a manuscript style as permitted by the Faculty of Graduate Studies at the University of British Columbia, with each chapter representing an independent research effort. Each chapter has its own motivation and literature review and can be read independently. For convenience, the references of each chapter are placed in an individual bibliography at the end of the chapter.

Bibliography

[1] Steen A. Andersson, David Madigan, and Michael D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25(2):505–541, 1997. ISSN 00905364.
[2] R. Baumgartner, C. Windischberger, and E. Moser. Quantification in functional magnetic resonance imaging: Fuzzy clustering vs. correlation analysis. Magnetic Resonance Imaging, 16(2):115–125, February 1998.
URL http://www.sciencedirect.com/science/article/B6T9D-3ST7H3N-3/2/02ebb7afa3164c60363bbd30fce5777e.
[3] R. Baumgartner, L. Ryner, W. Richter, R. Summers, M. Jarmasz, and R. Somorjai. Comparison of two exploratory data analysis methods for fmri: fuzzy clustering vs. principal component analysis. Magnetic Resonance Imaging, 18(1):89–94, January 2000. URL http://www.sciencedirect.com/science/article/B6T9D-3Y86GMC-B/2/94ad78dc51396e0191859290c5909a87.
[4] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188, 2001.
[5] Kenneth A. Bollen. Structural Equations With Latent Variables. John Wiley, 1989.
[6] Steven L. Bressler, Mingzhou Ding, and Weiming Yang. Investigation of cooperative cortical dynamics by multivariate autoregressive modeling of event-related local field potentials. Neurocomputing, 26-27:625–631, June 1999. URL http://www.sciencedirect.com/science/article/B6V10-40D0KHS-2X/2/2b4784b3fc693729da541528957a10e2.
[7] J. Burge, V.P. Clark, and T. Lane. Bayesian classification of fmri data: Evidence for altered neural networks in dementia. Technical Report TR-CS-2004-28, University of New Mexico, 2004.
[8] V.D. Calhoun, T. Adali, G.D. Pearlson, and J.J. Pekar. A method for making group inferences from functional mri data using independent component analysis. Human Brain Mapping, 14(3):140–151, 2001. URL http://dx.doi.org/10.1002/hbm.1048.
[9] Jin Cao and Keith Worsley. The geometry of correlation fields with an application to functional connectivity of the brain. The Annals of Applied Probability, 9(4):1021–1057, Nov. 1999. ISSN 10505164.
[10] Dietmar Cordes, Vic Haughton, John D. Carew, Konstantinos Arfanakis, and Ken Maravilla. Hierarchical clustering to measure connectivity in fmri resting-state data. Magnetic Resonance Imaging, 20(4):305–317, May 2002. URL
http://www.sciencedirect.com/science/article/B6T9D-46FVC9C-1/2/b6231627fc9e0ad8a0bee18087a10bd5.
[11] Silke Dodel, J. Michael Herrmann, and Theo Geisel. Functional connectivity by cross-correlation clustering. Neurocomputing, 44-46:1065–1070, June 2002. URL http://www.sciencedirect.com/science/article/B6V10-45FSV63-2/2/61e5cc4a13f98a77067450ee4a33da7a.
[12] Peter Filzmoser, Richard Baumgartner, and Ewald Moser. A hierarchical clustering method for analyzing functional mr images. Magnetic Resonance Imaging, 17(6):817–826, July 1999. URL http://www.sciencedirect.com/science/article/B6T9D-3WTP987-2/2/fa8a4b8fae9aa692aae3d02bf600dfd0.
[13] Harald Fischer and Jürgen Hennig. Neural network-based analysis of mr time series. Magnetic Resonance in Medicine, 41(1):124–131, 1999. URL http://dx.doi.org/10.1002/(SICI)1522-2594(199901)41:1<124::AID-MRM17>3.0.CO;2-9.
[14] K. J. Friston, C. Buechel, G. R. Fink, J. Morris, E. Rolls, and R. J. Dolan. Psychophysiological and modulatory interactions in neuroimaging. NeuroImage, 6(3):218–229, October 1997. URL http://www.sciencedirect.com/science/article/B6WNP-45M2XTF-8/2/e020927398c2c86b4e21f007575c440e.
[15] K. J. Friston, L. Harrison, and W. Penny. Dynamic causal modelling. NeuroImage, 19(4):1273–1302, August 2003.
[16] Miguel S. Goncalves, Deborah A. Hall, Ingrid S. Johnsrude, and Mark P. Haggard. Can meaningful effective connectivities be obtained between auditory cortical regions? NeuroImage, 14(6):1353–1360, December 2001.
[17] C. W. J. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424–438, Aug. 1969. ISSN 00129682.
[18] Michelle Hampson, Bradley S. Peterson, Pawel Skudlarski, James C. Gatenby, and John C. Gore. Detection of functional connectivity using temporal correlations in mr images. Human Brain Mapping, 15(4):247–262, 2002. URL http://dx.doi.org/10.1002/hbm.10022.
[19] L. Harrison, W. D. Penny, and K. Friston.
Multivariate autoregressive modeling of fmri time series. NeuroImage, 19(4):1477–1491, August 2003. URL http://www.sciencedirect.com/science/article/B6WNP-494HMVJ-1/2/53fdf428cc1a70702f21d590d64e6f2c.
[20] Ruth Heller, Damian Stanley, Daniel Yekutieli, Nava Rubin, and Yoav Benjamini. Cluster-based analysis of fmri data. NeuroImage, 33(2):599–608, November 2006. URL http://www.sciencedirect.com/science/article/B6WNP-4KTVP14-1/2/e045916a60c150fe7da168d313c5b570.
[21] M. Kaminski. Determination of transmission patterns in multichannel data. Phil. Trans. R. Soc. B, 360(1457):947–952, May 2005. URL http://dx.doi.org/10.1098/rstb.2005.1636.
[22] Jieun Kim, Wei Zhu, Linda Chang, Peter M. Bentler, and Thomas Ernst. Unified structural equation modeling approach for the analysis of multisubject, multivariate functional MRI data. Human Brain Mapping, 28(2):85–93, 2007. URL http://dx.doi.org/10.1002/hbm.20259.
[23] Pierre-Jean Lahaye, Jean-Baptiste Poline, Guillaume Flandin, Silke Dodel, and Line Garnero. Functional connectivity: studying nonlinear, delayed interactions between bold signals. NeuroImage, 20(2):962–974, October 2003. URL http://www.sciencedirect.com/science/article/B6WNP-49CSYWR-2/2/4f68d9258711b374a2961701fc5d09aa.
[24] Steffen L. Lauritzen. Graphical Models, chapter 3.2.2, page 46. Clarendon Press, Oxford University Press, 1996.
[25] Junning Li, Jane Wang, and Martin McKeown. Dynamic Bayesian networks (DBNs) demonstrate impaired brain connectivity during performance of simultaneous movements in Parkinson’s disease. In Proceedings of IEEE International Symposium on Biomedical Imaging, 2006.
[26] Junning Li, Z.J. Wang, and M.J. McKeown. Bayesian network modeling for discovering “directed synergies” among muscles in reaching movements. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, volume 2, pages II–1156–II–1159, 2006.
[27] Junning Li, Z.J. Wang, and M.J. McKeown. A multi-subject dynamic Bayesian network (DBN) framework for brain effective connectivity. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007 Proceedings. 2007 IEEE International Conference on, Accepted, 2007.
[28] G. Lohmann and S. Bohn. Using replicator dynamics for analyzing fmri data of the human brain. Medical Imaging, IEEE Transactions on, 21(5):485–492, May 2002. doi: 10.1109/TMI.2002.1009384.
[29] A.R. McIntosh and F. Gonzalez-Lima. Structural modeling of functional neural pathways mapped with 2-deoxyglucose: effects of acoustic startle habituation on the auditory system. Brain Research, 547(2):295–302, May 1991. URL http://www.sciencedirect.com/science/article/B6SYR-4847P0T-5H/2/2c90a40fb74b6f93ece2e4f203e1992f.
[30] Martin J. McKeown. Cortical activation related to arm-movement combinations. Muscle & Nerve, 23(S9):S19–S25, 2000.
[31] Andrea Mechelli, Will D. Penny, Cathy J. Price, Darren R. Gitelman, and Karl J. Friston. Effective connectivity and intersubject variability: Using a multisubject network to test differences and commonalities. NeuroImage, 17(3):1459–1469, November 2002.
[32] Ruth M. Mickey, Olive Jean Dunn, and Virginia A. Clark. Analysis of Variance and Regression, chapter 11, page 280. John Wiley, 2004.
[33] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298:824–827, 2002.
[34] Tom M. Mitchell, Rebecca Hutchinson, Marcel Adam Just, Radu S. Niculescu, Francisco Pereira, and Xuerui Wang. Classifying instantaneous cognitive states from fmri data. In American Medical Informatics Association Annual Symposium Proceedings, pages 465–469, 2003.
[35] Karsten Muller, Gabriele Lohmann, Volker Bosch, and D. Yves von Cramon. On multivariate spectral analysis of fmri time series. NeuroImage, 14(2):347–356, August 2001. URL http://www.sciencedirect.com/science/article/B6WNP-457VFJT-19/2/495421aedd5d6d64a3777408b65ecb46.
[36] Kevin Patrick Murphy. Dynamic bayesian networks: representation, inference and learning. PhD thesis, University of California, Berkeley, 2002.
[37] Judea Pearl. Fusion, propagation, and structuring in belief networks. Artificial Intelligence, 29(3):241–288, September 1986. URL http://www.sciencedirect.com/science/article/B6TYF-47X2B3Y-5K/2/4eda90d282f96723f99937a5c13d7a26.
[38] Scott J. Peltier, Thad A. Polk, and Douglas C. Noll. Detecting low-frequency functional connectivity in fmri using a self-organizing map (som) algorithm. Human Brain Mapping, 20(4):220–226, 2003. URL http://dx.doi.org/10.1002/hbm.10144.
[39] W. Penny, Z. Ghahramani, and K. Friston. Bilinear dynamical systems. Phil. Trans. R. Soc. B, 360(1457):983–993, May 2005. URL http://dx.doi.org/10.1098/rstb.2005.1642.
[40] W. D. Penny, K. E. Stephan, A. Mechelli, and K. J. Friston. Comparing dynamic causal models. NeuroImage, 22(3):1157–1172, July 2004. URL http://www.sciencedirect.com/science/article/B6WNP-4CF1682-2/2/cb7aaf13b9a57d4d3813d8117f7d21a6.
[41] W.D. Penny, K.E. Stephan, A. Mechelli, and K.J. Friston. Modelling functional integration: a comparison of structural equation and dynamic causal models. NeuroImage, 23(Supplement 1):S264–S274, 2004. URL http://www.sciencedirect.com/science/article/B6WNP-4DD95D7-4/2/324a5af334ccd7d717a39838b0e898b9.
[42] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, chapter 15.5, page 559. Prentice Hall, 2003.
[43] John D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):479–498, 2002. doi: 10.1111/1467-9868.00346. URL http://www.blackwell-synergy.com/doi/abs/10.1111/1467-9868.00346.
[44] Felice T. Sun, Lee M. Miller, and Mark D’Esposito.
Measuring interregional functional connectivity using coherence and partial coherence analyses of fmri data. NeuroImage, 21(2):647–658, February 2004. URL http://www.sciencedirect.com/science/article/B6WNP-4BDM2NW-1/2/4c3eb37ef66975777e2c4c1e38cb80df.
[45] P. Valdes-Sosa, J. Sánchez-Bornot, A. Lage-Castellanos, M. Vega-Hernández, J. Bosch-Bayard, L. Melie-García, and E. Canales-Rodríguez. Estimating brain functional connectivity with sparse multivariate autoregression. Phil. Trans. R. Soc. B, 360(1457):969–981, May 2005. URL http://dx.doi.org/10.1098/rstb.2005.1654.
[46] K. Worsley, J. Chen, J. Lerch, and A. Evans. Comparing functional connectivity via thresholding correlations and singular value decomposition. Phil. Trans. R. Soc. B, 360(1457):913–920, May 2005. URL http://dx.doi.org/10.1098/rstb.2005.1637.
[47] Lei Zhang, Dimitris Samaras, Nelly Alia-Klein, Nora Volkow, and Rita Goldstein. Modeling neuronal interactivity using dynamic Bayesian networks. In Y. Weiss, B. Schölkopf, and J. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1593–1600. MIT Press, Cambridge, MA, 2006.
[48] Xuebin Zheng and Jagath C. Rajapakse. Learning functional structure from fmr images. NeuroImage, 31(4):1601–1613, July 2006. URL http://www.sciencedirect.com/science/article/B6WNP-4JGJJ8N-6/2/9ecc1d089a83a0c7778fe75666ca0f5f.
[49] Jiancheng Zhuang, Stephen LaConte, Scott Peltier, Kan Zhang, and Xiaoping Hu. Connectivity exploration with structural equation modeling: an fmri study of bimanual motor coordination. NeuroImage, 25(2):462–470, April 2005. URL http://www.sciencedirect.com/science/article/B6WNP-4FB9GV5-2/2/86ce4b912542c0022e75970be54e1046.
Chapter 2

Extracting Structural Features from Bayesian Networks with Applications to Discovering “Dependent Synergies” among Muscles¹

To provide guidance on further research exploration and clinical diagnosis, mathematical models in biomedical research should not only fit experimental data well, but also be able to extract interpretable features from data. In this chapter, a structural-feature extraction framework is developed to discover statistically significant structural differences, such as the patterns of sub-networks, between two groups of Bayesian networks. The framework includes three components: Bayesian-network modeling, statistical structure-comparison, and structure-based classification. It is demonstrated in a study on the coordination pattern of muscle activities using surface electromyography. It discovered several groups of muscles whose coordinated activities in reaching movement significantly differ between healthy people and stroke patients, and it also achieved very low classification error-rates.

¹ A version of this chapter has been published. Junning Li, Z. Jane Wang, Janice J. Eng and Martin J. McKeown (2008) Bayesian Network Modeling for Discovering “Dependent Synergies” among Muscles in Reaching Movements. IEEE Transactions on Biomedical Engineering 55: 298–310. Portions reprinted, with permission, from Junning Li, Z. Jane Wang, Janice J. Eng and Martin J. McKeown (2008) Bayesian Network Modeling for Discovering “Dependent Synergies” among Muscles in Reaching Movements. IEEE Transactions on Biomedical Engineering 55: 298–310. © 2008 IEEE.

2.1 Introduction

An important goal of motor control studies is to understand how the central nervous system (CNS) selects and co-ordinates the muscle activity patterns necessary to achieve a variety of natural motor behaviors [6]. A key emerging concept in motor control is the importance of synergies [6], or groups of muscles that act together.
Work on frogs has suggested that a complex repertoire of movements can emerge from the appropriate control and selection of only a few synergies, each of which represents a primitive movement [33]. However, identifying muscle synergies from all possible muscle patterns and efficiently decomposing complex and variable motor behaviors into meaningful synergies remain challenging problems. To address this goal, a necessary intermediate step is to determine how muscles efficiently collaborate during movements. In this paper, we plan to infer muscle interaction patterns from surface electromyogram (sEMG) recordings during reaching movements. In particular, we are interested in investigating whether certain muscle interactions are selectively recruited across subjects based on hand dominance or affliction by stroke.

A sEMG is a semi-stochastic signal whose properties depend upon a number of factors including the anatomical and physiological properties of the contracting muscles, the amount of subcutaneous fat, and the choice of electrodes [21]. Standard analytical methods, including frequency-based ones, may be particularly sensitive to parameters that are difficult to measure, such as capacitive effects of muscles and subcutaneous tissue [31]. Nevertheless, despite its limitations, the non-invasive nature of sEMG makes it practical to record several muscles simultaneously in humans, and hence allows the investigation of synergies.

A sEMG signal can be modelled as a zero-mean wide-sense stationary stochastic process (i.e. the so-called carrier signal) modulated by the sEMG amplitude [5]. A common practice in the sEMG literature is to focus only on the amplitude data, i.e. rectifying and then low-pass filtering the raw sEMG signal, while the carrier data is generally ignored. Here we use a fundamentally different approach and model the sEMG carrier signal after the effects of amplitude have been minimized.
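A minimal sketch of the amplitude/carrier model described above, using only synthetic data: a unit-variance Gaussian carrier is modulated by a slowly varying amplitude, the amplitude is estimated by rectification followed by a moving-average low-pass filter, and the carrier is recovered by dividing the raw signal by that estimate. The Gaussian carrier, the sinusoidal amplitude, the window length, and all function names are assumptions made for illustration; this is not the preprocessing pipeline used in the study.

```python
import math
import random


def moving_average(values, half_window):
    """Centered moving average; the window is truncated at the edges."""
    n = len(values)
    out = []
    for i in range(n):
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


rng = random.Random(0)
n = 4000

# Synthetic signal: slowly varying amplitude times a zero-mean carrier.
amplitude = [1.0 + 0.8 * math.sin(2 * math.pi * t / 1000) for t in range(n)]
carrier = [rng.gauss(0.0, 1.0) for _ in range(n)]
signal = [a * c for a, c in zip(amplitude, carrier)]

# Amplitude estimate: rectify, low-pass filter, then rescale by
# E|c| = sqrt(2/pi), the mean absolute value of a unit Gaussian carrier.
rectified = [abs(s) for s in signal]
smoothed = moving_average(rectified, half_window=50)
amplitude_hat = [m / math.sqrt(2 / math.pi) for m in smoothed]

# Carrier estimate: remove the estimated amplitude from the raw signal.
carrier_hat = [s / a for s, a in zip(signal, amplitude_hat)]

amp_corr = correlation(amplitude, amplitude_hat)
car_corr = correlation(carrier, carrier_hat)
```

In this toy setting both the amplitude and the carrier are known, so `amp_corr` and `car_corr` give a direct check of how well the two components are separated; with real recordings only the raw signal is available.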
In recent years, linear decomposition methods such as Principal Component Analysis (PCA), linear Independent Component Analysis (ICA) [19, 20], and Nonnegative Matrix Factorization (NMF) [32] have been suggested to infer synergistic action between muscles. For instance, linear ICA was applied to noisy sEMG data and revealed meaningful interactions between muscles [19]. These methods are characterized by a number of latent variables and project multi-channel sEMG signals to a subspace. They share a common assumption that a small set of source signals are upstream of the muscles and produce the sEMG signals. For example, ICA assumes that the observations are combinations of statistically independent components, and the goal of ICA is to find the underlying sources.

Here we propose a different approach, the Bayesian network (BN) modeling approach, to directly represent interactions between muscles without using latent variables. Methods such as PCA and ICA do not reveal interactions explicitly, but only implicitly through underlying sources. Our BN framework provides an alternative which captures the interactions directly from the observed sEMG signals by detecting conditional dependence/independence between muscle activities, i.e. whether the activities of two muscles are associated given that of a third muscle.

There are a number of biological and statistical reasons that make the assumption of conditional dependence between muscle activities plausible. Depending upon the neural context, the same neurons participating in central pattern generators can demonstrate remarkably dissimilar behaviors [28]. Activity in muscles themselves may be modulated by Ia inhibitory interneurons from antagonists or Renshaw cell activity in the spinal cord that may project to other motor neurons in the spinal cord [3].
Ordinary coherence, often used to infer connectivity coupling between muscles, cannot distinguish whether two channels are directly connected or indirectly connected via other channels, suggesting that conditional dependence may need to be considered [14]. In fact, failure to identify potential conditional dependence between muscles may lead to erroneous interpretations regarding the overall interactions between muscles, necessitating the use of partial coherence, which often results in a distinctly different connection pattern [14]. A BN represents conditional dependence/independence through a graph of nodes and edges connected according to rigorous statistical rules (see Sec. 2.2.2), so it is suitable for discovering conditional interactions between muscles.

There are a number of reasons why BNs, as opposed to other choices of graphical models (such as Boolean networks), may be particularly well-suited for modeling muscle interaction networks. First, BN models have a solid basis in statistics, enabling them to deal with the stochastic and nonlinear aspects of sEMG measurements in a natural way. As a rigorous probabilistic model, a BN allows incorporation of the stochastic nature of sEMG recordings that may be caused by any number of biological factors along the cortex → spinal cord → peripheral nerve → neuromuscular junction → muscle pathway. Second, the modular nature of BNs makes them easily extensible to the task of modeling sub-networks of sEMG signals. In BNs, conditional probability distributions (CPDs) are specified locally at each node to encode dependence relationships between a node and its parents, so the whole network can be decomposed into many small sub-networks. Since a node is independent of its ancestors given its parents, a simple conditional independence relationship between muscles could be that the interaction between muscle-A and muscle-B does not depend on the activity in muscle-C.
Further, the rich repertoire of techniques developed for network analysis in other areas can be used for inferring muscle networks using sEMG. Finally, BNs can be used when incomplete knowledge is available, and can also deal with dynamical aspects of muscle interactions through generalizations like dynamical BNs.

Therefore, in this paper, we develop a BN modeling framework to statistically capture the interactions between muscle activity patterns directly. More specifically, we generalize the muscle synergy idea into the concept of a muscle network, defined as a set of muscle activity patterns with probabilistic and conditional interactions between them that are coordinated to achieve specific motor behaviors. In our approach, we first model the overall muscle activity across several simultaneously recorded sEMG signals. From the learned muscle network we then define muscle synergies as statistically significant subnetworks or network-motifs [23] and use the term “dependent synergies” to refer to (conditionally) dependent muscle (not necessarily pair-wise) interactions.

We also plan to tackle the problem of investigating consistent muscle synergies across subjects within a certain group. While there are many potential factors that may affect muscle synergies and thus sEMG patterns during a reaching movement, here we focus on the effects of stroke and hand dominance. Hand dominance has been reported as an important factor in motor control [26], and different muscle activations between the dominant and non-dominant hands have recently been observed during reaching movements [29]. Though hand dominance as a factor in motor and functional performance has been studied in the literature, to the best of our knowledge, no studies have investigated the impact of hand dominance in healthy or stroke subjects in terms of muscle association/interaction patterns.
The main contributions of this paper are as follows:

- To present a framework for learning muscle interaction networks during reaching movements based on BN modeling of sEMG data.
- To demonstrate how the trained BNs can then be probed with network motif analysis to determine “dependent synergies”.
- To demonstrate that some network structure features are relatively robust across subjects, and thus can be used to distinguish factors such as handedness and stroke status.
- To demonstrate that specific three-muscle synergies may provide insights into the compensatory changes seen in reaching movements after stroke.
- To indicate that the sEMG “carrier” signal (after the amplitude information is estimated and removed) can also be informative, even though it has traditionally been ignored by sEMG analysis methods.

The paper is organized as follows. In Section 2.2, we describe the proposed BN framework for learning muscle networks and analyzing the sub-network patterns. A real case study utilizing sEMG recordings from stroke and healthy subjects, including data from both dominant and non-dominant arms, is discussed in Section 2.3. Finally, we conclude the paper and suggest some directions for future research.

2.2 Methods

2.2.1 Framework and Components

Our Bayesian network (BN) framework includes three components: BN modeling, graph structure analysis and classification. First, BNs are applied to multi-muscle sEMG signals, with directed acyclic graphs (DAGs) encoding the overall interactions between muscles. Researchers can choose different types of BNs according to their interest and prior knowledge about a specific application. For example, they can use a static BN to model the invariant interactions, a dynamic BN [25] to model the dynamics of muscles, or a BN with hidden nodes [25] to model the unobserved neural signals which drive the muscles.
Secondly, graph structure analysis is conducted on the learned BNs to extract structural features which characterize the interaction patterns among muscle activity patterns. Structural features can be the number of edges into or out of a node (called “degree” in graph theory), or the length of the shortest path from one node to another (called “distance” in graph theory). To go beyond the node and edge levels to the sub-network level, we employed the “network motif” concept [23]. Thirdly, classification is performed based on the BNs. As statistical models, BNs can be naturally extended to statistical classifiers with the posterior probability criterion. Alternatively, the DAGs of BNs can also be used as input features to other classifiers such as classification trees. In the following sub-sections, we elaborate on each of these three components.

Figure 2.1: An example of representing the stochastic interactions of sEMG recordings of different muscles with a Bayesian network. This particular graph is only for illustrative purposes and does not have intrinsic biological meaning. Each node represents the sEMG recording of a muscle and the whole graph represents the interaction among the muscles in a movement. This DAG encodes the independence relationships $X_1 \perp X_2$ (but $X_1 \not\perp X_2 \mid X_3$) and $(X_1, X_2) \perp X_4 \mid X_3$ (but $(X_1, X_2) \not\perp X_4$). Nodes $X_1$ and $X_2$ do not have any parent and they are associated with unconditional probability distributions. Nodes $X_3$ and $X_4$ have parent(s) and they are associated with conditional probability distributions. The joint probability distribution can be factorized according to the DAG as $P(X) = P(X_1) P(X_2) P(X_3 \mid X_1, X_2) P(X_4 \mid X_3)$.

2.2.2 Bayesian Networks

Introduction

A Bayesian network (BN) [16], also referred to as a “Bayesian belief network” or simply a “belief network”, is a graphical model that consists of a directed acyclic graph (DAG) and a set of (conditional) probability distributions.
The DAG encodes the (conditional) dependence/independence relationships among random variables, and the probability distributions constitute the joint probability distribution with Bayes’ rule. A BN, in short, is a representation of the joint distribution over random variables that indicates the conditional dependence/independence relationships with a DAG.

A DAG encodes a set of (conditional) independence relationships between node variables through the concept of d-separation [16]. For instance, let $X_1$, $X_2$ and $X_3$ denote three node variables. According to the global Markov property [16], if $X_1$ is d-separated from $X_2$ by $X_3$ in the DAG, then $X_1$ is said to be conditionally independent of $X_2$ given $X_3$, i.e. $P(X_1, X_2 \mid X_3) = P(X_1 \mid X_3) P(X_2 \mid X_3)$, and we denote this as $X_1 \perp X_2 \mid X_3$. The definition of conditional independence is similar to that of unconditional independence, $P(X_1, X_2) = P(X_1) P(X_2)$, except that it is conditional on a third random variable $X_3$. Here we focus on two simplified but important corollaries within the broader concept of d-separation. (1) If $X_1$ and $X_2$ are connected (i.e. there is a path in the DAG from $X_1$ to $X_2$ or vice versa), then $X_1$ is not independent of $X_2$, which we denote as $X_1 \not\perp X_2$. (2) If $X_1$ precedes $X_2$ (i.e. there is at least one path from $X_1$ to $X_2$) and $X_3$ blocks all the paths from $X_1$ to $X_2$, then $X_1 \perp X_2 \mid X_3$.

Fig. 2.1 shows an example of a BN which represents a network of four muscles. (Note that this figure is only for illustrative purposes and does not necessarily represent any real muscle network.) According to the above corollaries, this DAG encodes a set of relationships: $X_1 \perp X_2$ (but $X_1 \not\perp X_2 \mid X_3$) and $(X_1, X_2) \perp X_4 \mid X_3$ (but $(X_1, X_2) \not\perp X_4$). This example also shows that conditional independence does not imply unconditional independence, and vice versa.
If the joint probability distribution of a set of random variables $X$ satisfies the Markov property, it can be factorized according to the DAG as

$$P(X) = \prod_{\mathrm{pa}[X_i] \neq \emptyset} P(X_i \mid \mathrm{pa}[X_i], \theta_i) \prod_{\mathrm{pa}[X_i] = \emptyset} P(X_i \mid \theta_i), \tag{2.1}$$

where the set $\theta = \{\theta_1, \ldots, \theta_n\}$ denotes the parameters used in the probability distributions and $\mathrm{pa}[X_i]$ denotes the parents of $X_i$, i.e. the nodes with an edge to $X_i$. If a node $X_i$ has parent nodes, i.e. $\mathrm{pa}[X_i] \neq \emptyset$, it is associated with a conditional probability distribution $P(X_i \mid \mathrm{pa}[X_i])$. If a node $X_i$ does not have any parent, it is associated with an unconditional probability distribution $P(X_i)$. For example, the joint probability of the four-muscle network in Fig. 2.1 can in general be decomposed as Eq. 2.2 according to the chain rule, and further as Eq. 2.3 according to the (conditional) independence relationships specified by the DAG.

$$P(X) = \prod_{i=1}^{4} P(X_i \mid X_1, \ldots, X_{i-1}) \tag{2.2}$$

$$\phantom{P(X)} = P(X_1) P(X_2) P(X_3 \mid X_1, X_2) P(X_4 \mid X_3). \tag{2.3}$$

Using BNs to represent muscle interactions can be summarized as follows. sEMG signals are regarded as a vector-valued stochastic process $X(t) = [X_1(t), X_2(t), \ldots, X_n(t)]^T$, where $n$ denotes the number of muscles and $X_i(t)$ the observed signal of the $i$th muscle at time $t$. A DAG with nodes $X = \{X_1, \ldots, X_n\}$ indicates the interactions between muscles. If a node $X_i$ is connected to another node $X_j$, then the corresponding muscles are considered to interact. If $X_i$ is d-separated from $X_j$ by another node $X_k$, then the muscles represented by $X_i$ and $X_j$ do not interact conditionally on the activity of the muscle $X_k$. Conditional and unconditional probability distributions are associated with the nodes, describing how the muscle activity patterns interact with each other. In this study, we employed Gaussian BNs to model the multi-muscle sEMG signals, i.e. we modelled each variable $X_i$ as the sum of Gaussian noise and a linear combination of its parents $\mathrm{pa}[X_i]$.
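The independence relationships encoded by the DAG of Fig. 2.1 can be checked numerically on simulated data. The following is a minimal sketch, not the analysis used in this study: it samples from a linear Gaussian BN with the structure of Fig. 2.1, where the unit edge weights, unit noise variances and the helper names `corr`, `partial_corr` and `sample_fig21` are all illustrative assumptions. For jointly Gaussian variables, a zero partial correlation corresponds to conditional independence, so partial correlation is used as the test quantity.

```python
import math
import random


def corr(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)


def partial_corr(a, b, c):
    """Correlation of a and b after linearly removing c (first-order partial correlation)."""
    rab, rac, rbc = corr(a, b), corr(a, c), corr(b, c)
    return (rab - rac * rbc) / math.sqrt((1 - rac ** 2) * (1 - rbc ** 2))


def sample_fig21(n, seed=0):
    """Draw n samples from a linear Gaussian BN with the DAG of Fig. 2.1.

    Each node is a linear combination of its parents plus Gaussian noise.
    The unit weights and unit noise variances are assumptions for illustration.
    """
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]                # no parents
    x2 = [rng.gauss(0, 1) for _ in range(n)]                # no parents
    x3 = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]  # parents: X1, X2
    x4 = [c + rng.gauss(0, 1) for c in x3]                  # parent: X3
    return x1, x2, x3, x4


x1, x2, x3, x4 = sample_fig21(20000)
print("corr(X1, X2)      =", round(corr(x1, x2), 3))              # close to zero
print("corr(X1, X2 | X3) =", round(partial_corr(x1, x2, x3), 3))  # clearly non-zero
print("corr(X1, X4)      =", round(corr(x1, x4), 3))              # clearly non-zero
print("corr(X1, X4 | X3) =", round(partial_corr(x1, x4, x3), 3))  # close to zero
```

In this sketch $X_1$ and $X_2$ are marginally uncorrelated but become dependent once $X_3$ is given, while $X_1$ and $X_4$ are dependent but become uncorrelated once $X_3$ is given, which mirrors the relationships listed for Fig. 2.1.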
Gaussian BNs are not only applicable here but are also among the most popular BNs for modeling multi-channel continuous variables.

It should be pointed out that before DAGs are analyzed to reveal muscle interactions, they must be converted to essential graphs (EGs) [1], also referred to in the literature as completed partially directed acyclic graphs (CPDAGs). This is because several different DAGs often represent the same set of conditional independence relationships, whereas exactly one EG encodes that set, as shown in Fig. 2.2. An EG has the same edges as a DAG does, except that some edges are not directed. Algorithms to convert a DAG to an EG have been proposed previously [1]. Directed edges, whether in the DAG of a BN or in an EG, encode (conditional) dependence/independence; they do not necessarily imply causality but rather association. Hence our model does not conflict with the fact that the activities of multiple muscles are often coupled by the kinematics and dynamics of the bones and joints. For further details on BNs, EGs and d-separation, the reader is referred to Lauritzen’s book (1996) [16] and Andersson’s paper (1997) [1].

Figure 2.2: An example of essential graphs (EGs). Both directed acyclic graphs, DAG 1 and DAG 2, represent the same set of conditional independence relationships: $A \perp (C, D)$, $B \perp D \mid C$ and $B \perp D \mid (A, C)$. Since the edge between C and D is reversible, its direction is removed in the EG. Directions in either the DAGs or the EG do not necessarily imply causality.

Learning Bayesian networks

Learning a BN includes two steps: (1) structure learning and (2) parameter learning. Structure learning selects an appropriate DAG among many candidate DAGs. Parameter learning estimates the parameters of the conditional and unconditional distributions given the DAG.
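The two learning steps can be illustrated for a single node of a Gaussian BN. The sketch below is an illustration under stated assumptions rather than the implementation used in this study (which was built on BNT for Matlab): parameter learning is done by least squares, which is the maximum likelihood estimate for a linear Gaussian node, and candidate parent sets are compared with a penalized log-likelihood of the form given in Eq. (2.7). The intercept term, the simulated data and the function names `solve`, `fit_node` and `bic_node` are assumptions. Because the log-likelihood factorizes over nodes as in Eq. (2.1) and the parameter counts add up, the score of a whole DAG is the sum of such node scores.

```python
import math
import random


def solve(a, b):
    """Solve the small linear system a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_node(child, parents):
    """Maximum likelihood fit of child = w0 + sum_k w_k * parent_k + Gaussian noise.

    Returns the weights (intercept first), the noise variance and the maximized log-likelihood.
    """
    n = len(child)
    cols = [[1.0] * n] + list(parents)  # design columns, intercept first
    gram = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    rhs = [sum(u * y for u, y in zip(ci, child)) for ci in cols]
    w = solve(gram, rhs)
    rss = sum((y - sum(wk * c[t] for wk, c in zip(w, cols))) ** 2
              for t, y in enumerate(child))
    var = rss / n
    loglik = -0.5 * n * (math.log(2 * math.pi * var) + 1)
    return w, var, loglik


def bic_node(child, parents):
    """Node-wise score: maximized log-likelihood minus 0.5 * K * log(N)."""
    w, _, loglik = fit_node(child, parents)
    k = len(w) + 1  # weights including intercept, plus the noise variance
    return loglik - 0.5 * k * math.log(len(child))


# Simulated data with the structure of Fig. 2.1 (weights chosen for illustration only).
rng = random.Random(1)
N = 5000
x1 = [rng.gauss(0, 1) for _ in range(N)]
x2 = [rng.gauss(0, 1) for _ in range(N)]
x3 = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
x4 = [0.8 * c + rng.gauss(0, 1) for c in x3]

for name, pa in [("no parents", []), ("X3", [x3]), ("X1, X2, X3", [x1, x2, x3])]:
    print("score of X4 with parents {%s}: %.1f" % (name, bic_node(x4, pa)))
```

On data of this kind the parent set {X3} is expected to score highest for X4: dropping X3 loses a large amount of likelihood, while adding X1 and X2 gains very little likelihood and pays the penalty for two extra parameters.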
In structure learning, we attempted to select the most probable DAG based on the observations according to the maximum a posteriori (MAP) criterion. Let $X$ denote the observations and $S$ the DAG; the best structure from the viewpoint of Bayesian statistics is

$$\hat{S} = \arg\max_{S} p(S \mid X), \tag{2.4}$$

where according to Bayes’ rule, we have

$$p(S \mid X) = \frac{p(X \mid S)\, p(S)}{p(X)}, \tag{2.5}$$

$$p(X \mid S) = \int p(X \mid \theta, S)\, p(\theta \mid S)\, d\theta, \tag{2.6}$$

where $p(S)$ is the prior probability of the structure $S$ and $p(\theta \mid S)$ is the probability of the parameter $\theta$ given the structure $S$. The MAP criterion has the advantage of allowing users to incorporate their knowledge in the prior probability. As the denominator in Eq. (2.5) does not depend on $S$, only the numerator needs to be maximized. If $p(S)$ (the prior probability) is uniform over all the possible structures, only $p(X \mid S)$ in Eq. (2.6) (the conditional probability) needs to be maximized. The uniform assumption is reasonable in practice since we do not prefer any structure before we observe the data. However, alternatives to a uniform prior distribution may be considered to enhance computational efficiency [9].

The MAP approach can be implemented by selecting the structure with the largest Bayesian Information Criterion (BIC) score [30], which is defined as

$$\mathrm{BIC}(S) = \sup_{\theta} \log P(X \mid S, \theta) - 0.5 K \log N, \tag{2.7}$$

where $N$ denotes the sample size of $X$ and $K$ denotes the number of free parameters in $\theta$. In the comparison between two models $S_1$ and $S_2$, $\exp[\mathrm{BIC}(S_1) - \mathrm{BIC}(S_2)]$ asymptotically approximates the ratio of their posterior probabilities $P(S_1 \mid X)/P(S_2 \mid X)$ if the two models have the same prior probability, i.e. $P(S_1) = P(S_2)$, and $p(\theta \mid S)$ is uniform [30]. (For a rigorous proof, please refer to Schwarz’s 1978 paper [30].) The large sample size in our sEMG study, 1000 time points (see Sec. 2.3.1), should satisfy the condition of the asymptotic approximation. As shown in Eq.
(2.7), the BIC consists of two terms: the maximum log-likelihood term $\sup_{\theta} \log P(X \mid S, \theta)$ and the penalty term $0.5 K \log N$. The penalty term prevents “over-fitting”, i.e. choosing a structure which has too many edges relative to the data size. Because a structure with more edges tends to have a larger likelihood, we would inevitably choose a fully connected DAG if only the maximum likelihood criterion were used. Therefore, a penalty term is needed in model selection. The BIC penalty term is proportional to the number of free parameters ($K$), and “punishes” structures with redundant edges. After structure learning, we estimated the parameters of the conditional and unconditional distributions via the maximum likelihood criterion.

Since the number of possible DAGs is super-exponential in the number of nodes, it is impractical to search exhaustively for the best DAG. To avoid the local maxima found by greedy algorithms, we employed the Markov chain Monte Carlo (MCMC) algorithm [13] to learn the structure. Our implementation of learning BNs was developed based on the Bayes Net Toolbox (BNT) [24] for Matlab.

2.2.3 Sub-network Patterns and Muscle Synergies

Milo and Shen-Orr [23] reported that in real-world networks (for instance, gene regulation networks), certain connection patterns of sub-networks appear more frequently than would be expected by chance, and these patterns (named “network motifs” by the authors) can be used to characterize the networks. Since in the current case sub-networks represent the co-activation between muscles, we adopt a similar idea to discover muscle synergies from the DAGs of BNs. However, in contrast to the original proposal by Milo and Shen-Orr, where the goal was to determine network motifs that appear more frequently in real graphs than in randomized graphs, our goal is to identify network motifs that distinguish one group of graphs from another group.
Specifically, the question of interest is: given two groups of muscle interaction graphs derived from sEMG recordings (e.g. recordings of stroke and normal subjects), is it possible to determine which network motif(s) distinguish one group from the other?

Figure 2.3: Different ways that three muscles can interact within a BN framework. The directed triple graph is first converted to an undirected graph, and then it is classified as one of the patterns. The four patterns are abbreviated as B, L, V and T respectively.

We propose the following way to detect network motifs that distinguish between two groups of graphs. First, the occurrences of each possible connection pattern in each graph are counted. As a result, the count of a particular connection pattern in one group of graphs is a group of numbers. Then, the two groups of numbers are compared with a hypothesis test such as a t-test. Finally, patterns appearing significantly more frequently in one group than in the other are selected as the feature network motifs of that group. We note that patterns which are functionally important but not statistically significant could exist and could be missed by this approach.

In our sEMG study, we focus on triplet network motifs, i.e. sub-networks with three nodes. Though sub-networks with more nodes can be analyzed similarly without theoretical difficulty, we did not pursue more than three in this exploratory research due to limited computation power and the observation that triplets demonstrate our framework adequately. The complete and detailed procedure of detecting and evaluating triplets from two groups of graphs is as follows. First, DAGs are converted to EGs because an EG uniquely determines the dependence relationships among nodes (see Sec. 2.2.2). Secondly, the frequencies of appearance of the triplet connection patterns are counted. The possible triplet patterns are shown in Fig. 2.3.
Thirdly, a t-test is performed to evaluate whether a specific connection pattern appears significantly more often in one group of EGs than in another. As a result, each triplet pattern is associated with a level of significance, or p-value. Fourth, the p-values are adjusted for the effect of multiple testing with the Sidak correction as in Eq. (2.8),

$$p_a = 1 - (1 - p)^h, \tag{2.8}$$

where $p$ is the original p-value, $p_a$ is the adjusted one and $h$ is the number of hypotheses tested simultaneously. In this context, $h$ equals the number of connection patterns of interest. The effect of multiple testing can also be handled by controlling the false discovery rate (FDR) [2], i.e. the expected proportion of falsely rejected hypotheses among those rejected, which yields q-values. Finally, connection patterns with adjusted p-values or q-values lower than 0.05 are selected as network motifs.

While the above general network motifs provide information on the overall connectivity patterns of the network, we are also interested in specific muscle triplets because it is possible that alterations in the interactions of a few particular muscles may significantly influence the classification of reaching movements between groups. The identification procedure is similar to that of the general network motifs, except that the connection patterns are counted for each combination of three specific muscles individually. For $n$ muscles, all the $C_n^3$ triplets are exhaustively examined. Instead of a t-test, Fisher’s exact test is employed to check whether a connection pattern of a specific muscle triple appears significantly more often in one group of graphs than in another. The effect of multiple testing is also adjusted for, but the number of simultaneous hypothesis tests $h$ is much larger. If there are $n$ muscles and $m$ patterns of interest, the number $h$ is $m C_n^3$.
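The counting and adjustment steps of this procedure can be sketched as follows. This is an illustrative sketch, not the code used in the study: it classifies each node triple by the number of linked pairs after dropping edge directions (0, 1, 2 or 3 links, read here as the B, L, V and T patterns of Fig. 2.3), counts the patterns in one graph, and applies the Sidak adjustment of Eq. (2.8). The function names and the example graph (the DAG of Fig. 2.1) are assumptions, and the hypothesis tests themselves (t-test, Fisher's exact test) are not shown.

```python
from itertools import combinations
from math import comb

# Number of linked pairs within a triple -> pattern label (Blank, Line, V, Triangle).
PATTERNS = {0: "B", 1: "L", 2: "V", 3: "T"}


def triplet_pattern(undirected_edges, triple):
    """Classify a node triple by how many of its three pairs are linked, ignoring direction."""
    links = sum(frozenset(pair) in undirected_edges for pair in combinations(triple, 2))
    return PATTERNS[links]


def count_patterns(nodes, edges):
    """Count the B, L, V and T patterns over all triples of one graph."""
    undirected = {frozenset(e) for e in edges}
    counts = {label: 0 for label in PATTERNS.values()}
    for triple in combinations(nodes, 3):
        counts[triplet_pattern(undirected, triple)] += 1
    return counts


def sidak(p, h):
    """Sidak-adjusted p-value for h simultaneous tests, Eq. (2.8)."""
    return 1.0 - (1.0 - p) ** h


# Example: the four-node DAG of Fig. 2.1 (X1 -> X3, X2 -> X3, X3 -> X4).
print(count_patterns([1, 2, 3, 4], [(1, 3), (2, 3), (3, 4)]))

# Number of simultaneous tests for specific muscle triples: m patterns, n muscles.
n_muscles, m_patterns = 7, 4
h = m_patterns * comb(n_muscles, 3)
print("h =", h, "adjusted p for p = 0.0001:", sidak(0.0001, h))
```

With seven muscles and four patterns, each raw p-value is adjusted for 140 simultaneous tests, which is why the specific-triple analysis requires much smaller raw p-values than the general-pattern analysis.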
2.2.4 Classification

BNs can be used for classification in two ways: (1) as statistical models, they can be extended to statistical classifiers; (2) as graphical models, their structures can be input as features to other classifiers.

BNs can be extended naturally to a statistical classifier with the posterior probability criterion. Suppose $M_1$ and $M_2$ are the statistical models of the sEMG signals of two groups of subjects respectively, e.g. a control group and a stroke group. Given sEMG signals $X$, the model index can be predicted by the posterior probability criterion as in Eq. (2.9). The larger $p(M_1 \mid X)$ is relative to $p(M_2 \mid X)$, the more likely it is that $X$ belongs to group 1, and vice versa. If the prior probabilities $p(M_1)$ and $p(M_2)$ are equal, the ratio $p(M_1 \mid X)/p(M_2 \mid X)$ is the same as the ratio $p(X \mid M_1)/p(X \mid M_2)$ according to Bayes’ rule. Suppose there are a total of $N_i$ subjects in group $i$ and $M_{ij}$ represents the BN model of the $j$th subject in group $i$. With the assumption that each individual model of group $i$ is equally representative of the group, the group model $M_i$ ($i = 1, 2$) can be built by averaging the BN models trained from individual subjects within the same group, as expressed in Eq. (2.10).

$$\text{If } \frac{p(M_1 \mid X)}{p(M_2 \mid X)} \begin{cases} > 1, & \text{then } X \text{ belongs to group 1}, \\ < 1, & \text{then } X \text{ belongs to group 2}; \end{cases} \tag{2.9}$$

$$p(X \mid M_i) = \frac{\sum_{j=1}^{N_i} p(X \mid M_{ij})}{N_i}, \quad i = 1, 2. \tag{2.10}$$

Structure features of BNs can also be used as the input to other classifiers in various ways. As we mentioned in Sec. 2.2.2, DAGs should be converted to EGs before being used to represent the interactions between muscles. An EG of $n$ nodes can be encoded as an $n$-by-$n$ binary adjacency matrix $A = \{a_{ij}\}$, where $a_{ij} = 1$ indicates an edge from node $i$ to node $j$. Because an EG cannot have any edge which circles from and to the same node, the diagonal elements are all zeros, and they are uninformative.
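Both uses of BNs for classification can be sketched in a few lines. The code below is a hedged illustration, not the study's implementation: `group_log_likelihood` averages per-subject likelihoods as in Eq. (2.10), working from log-likelihoods with the log-sum-exp trick (a numerical choice made here because likelihoods of long signals underflow), `classify` applies Eq. (2.9) under equal priors, and `structure_features` drops the diagonal of an adjacency matrix to obtain binary features. The function names, the example log-likelihood values, the tie-breaking rule and the convention that an undirected EG edge is entered in both directions are all assumptions.

```python
import math


def group_log_likelihood(subject_logliks):
    """log p(X | M_i) from Eq. (2.10), given log p(X | M_ij) for each subject model j.

    The average of the likelihoods is computed in the log domain (log-sum-exp)
    so that very small likelihood values do not underflow.
    """
    top = max(subject_logliks)
    total = sum(math.exp(v - top) for v in subject_logliks)
    return top + math.log(total) - math.log(len(subject_logliks))


def classify(logliks_group1, logliks_group2):
    """Eq. (2.9) with equal priors: choose the group with the larger averaged likelihood.

    Eq. (2.9) leaves a ratio of exactly 1 undefined; ties go to group 2 here (an assumption).
    """
    l1 = group_log_likelihood(logliks_group1)
    l2 = group_log_likelihood(logliks_group2)
    return 1 if l1 > l2 else 2


def structure_features(n, edges):
    """Flatten the off-diagonal entries of an n-by-n adjacency matrix into n*(n-1) binary features.

    A directed edge (i, j) sets a_ij = 1; an undirected EG edge is assumed to be
    passed as both (i, j) and (j, i).
    """
    a = [[0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = 1
    return [a[i][j] for i in range(n) for j in range(n) if i != j]


# Hypothetical log-likelihoods of one test trial under the subject models of two groups.
print(classify([-1200.0, -1210.0, -1195.0], [-1300.0, -1290.0]))
print(len(structure_features(7, [(0, 1), (1, 2), (2, 1)])))
```

For seven muscles the structural feature vector has 7 × 6 = 42 entries, matching the dimension quoted for the EG features in this study.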
The off-diagonal elements can be lined up as a binary vector with $n(n-1)$ elements and then used as input to a classification tree. We chose a classification tree rather than other classifiers such as the support vector machine (SVM) [34] for two reasons. First, a classification tree is especially suitable for categorical data such as an adjacency matrix; secondly, and most importantly, it is easier to interpret, since a classification tree explicitly gives the conditions for predicting the class. In contrast, despite its popularity, SVM results are hard to interpret, as the SVM algorithm implicitly maps features to a high-dimensional imaginary space. Triple patterns derived from a BN’s structure can also be used as input features to a classification tree. A graph of $n$ nodes contains $C_n^3$ triples, which can then be converted to a categorical vector of $C_n^3$ elements. In this study, $n$ equals 7, resulting in 42-dimensional classification features for an EG and 35-dimensional features for exploring triple patterns.

Figure 2.4: Cross-subject validation

Figure 2.5: Within-subject validation

The performance of the BN-based classifiers was evaluated with both cross-subject validation and within-subject validation on a real sEMG data set containing repeated trials of arm reaching movements. In cross-subject validation (Fig. 2.4), all the trials of one arm side of one subject were kept aside as the testing data, and all the other data were used to train a classifier which was then used to predict the stroke state and hand dominance of the testing arm side. The stroke state and hand dominance of each testing trial were predicted individually, and then all the predictions voted on the state of the testing arm side. In this way, all the trials of the arm being tested were used to predict its group membership. This procedure was repeated for each subject in a leave-one-out cross-validation manner.
Cross-subject validation evaluates whether data of the same group share common features while data of different groups have distinguishing features.

The strategy of within-subject validation is shown in Fig. 2.5. One trial of a subject was kept aside as the testing data, and all the other trials of the same subject were used to train a classifier which was then used to predict the group membership of the testing trial. This procedure was repeated, and each time a different trial was selected as the testing trial. Within-subject validation was used to evaluate whether different trials of one arm from the same subject were consistent, while trials of the other arm from the same subject were different.

The performance of the BN-based classification trees applied to the sEMG carrier was compared with that of a PCA-based SVM applied to the sEMG amplitude [11]. The PCA-based SVM approach includes two steps: (1) reduce the dimension of the sEMG amplitude with PCA; (2) input the dimension-reduced data to SVMs for classification. In our study, 6 principal components (PCs) were needed to explain 80% of the total variance. The PC coefficients of the seven muscles in our study were then concatenated into input feature vectors for the SVMs, of length 6 × 7 = 42.

We constrained both the classification trees and the SVMs to use no more than three elements of the input features to avoid over-fitting. We tried inputting all of the 35 or 42 features to the classifiers, but the corresponding classification error rates were as high as 30% to 50%, so we set the constraint above to improve the performance. All the combinations of no more than three features were exhaustively searched, and finally the best performance was selected. The results showed that this modified implementation provided much better performance than using all the features together.
We think that the comparisons between the BN-based classification trees and the PCA-based SVMs are fair because they were subject to the same constraint, the best combination of features was searched exhaustively for both, and the classification trees were not provided with more features than the SVMs were.

2.2.5 Modeling sEMG Signals

An sEMG signal $y(t)$ is usually considered to be a zero-mean, Gaussian, band-limited and wide-sense stationary stochastic process $x(t)$ modulated by the EMG amplitude $a(t)$ [4, 5], expressed as

$$y(t) = x(t)\, a(t), \tag{2.11}$$

where $t$ indicates time and $x(t)$ is what we call the “carrier”, a term borrowed from the field of communication. Accepting these assumptions, and consistent with what we observed in this study, we assume that

1. $x(t)$ is wide-sense stationary;
2. $x(t)$ follows a Gaussian process;
3. $x(t)$ is approximately white;
4. $x(t)$ is ergodic.

Figure 2.6: A typical auto-correlation graph of the “carrier” signal $x(t)$ of a muscle (auto-correlation against lag). $x(t_1)$ and $x(t_2)$ ($t_1 \neq t_2$) are almost independent.

Figure 2.7: A typical histogram of the time-distribution of the “carrier” signal $x(t)$ of a muscle (count against value). The distribution of $x(t)$ in the time domain is almost Gaussian.

Fig. 2.6, a typical auto-correlation plot of $x(t)$, shows that $x(t)$ is approximately white, i.e. $x(t_1)$ and $x(t_2)$ are approximately independent if $t_1 \neq t_2$. Fig. 2.7, a typical histogram of the distribution of $x(t)$, shows that the distribution of $x(t)$ is almost Gaussian in the time domain. Since the distribution of $x(t)$ is suggested to be Gaussian both spatially and temporally, the ergodicity assumption is at least not severely violated, although it is not rigorously proved.

The four assumptions as a whole imply that a single-channel “carrier” signal $x(t)$ is independent and identically distributed (iid) at different time points, and that the distribution is well approximated as Gaussian.
Therefore, we can model multi-channel “carrier” signals with static Gaussian BNs by adding one more assumption: that the joint distribution of the multi-channel signals follows a multivariate Gaussian distribution. Since static models are used here, what we attempt to discover is not the dynamics of the muscles’ activities, but the invariant interaction patterns among the muscles during the reaching movements.

Most of the existing literature on sEMG describes rectification and low-pass filtering of the data and hence emphasizes the amplitude $a(t)$ [5], while the carrier $x(t)$ is generally discarded. In contrast, here we focus on the “carrier” signals rather than the amplitude, which can provide novel insights into the underlying system. As supported by our analysis results reported in Sec. 2.3, the “carrier” signal is also informative and provides a robust way to deal with the challenging issue of inter-subject variability in sEMG data.

2.3 Results

2.3.1 Real sEMG Datasets

All research was approved by the University of British Columbia Ethics Board. Thirteen stroke subjects and 9 healthy subjects were recruited. In the experiment, subjects sat in a chair with their hands on the thigh, and then reached to a shoulder-height target as fast as they could for five to ten trials with each arm. The sEMG of the following seven muscles was collected: the deltoids (anterior and lateral), the triceps (long and lateral heads), the biceps brachii, the latissimus dorsi, and the brachioradialis. A bipolar montage was used to minimize the effect of crosstalk. The seven-channel sEMG signals were amplified, high-pass filtered at 20 Hz to reduce movement artifact, and then sampled at 600 Hz. (Please refer to [18] for further details on the sEMG experiment procedure.) The amplitude of the sEMG was estimated with the root-mean-square (RMS), with a moving window of 0.1 second.
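The amplitude estimate, combined with the model of Eq. (2.11), suggests how a carrier signal can be obtained from a recording. The following is a minimal sketch under stated assumptions rather than the processing code of this study: the text specifies only the RMS estimator, the 0.1 s window and the 600 Hz sampling rate, so the centred window, its truncation at the signal edges, the `eps` guard and the function names are illustrative choices.

```python
import math


def rms_amplitude(y, fs=600, window_s=0.1):
    """Moving-window RMS estimate of the amplitude a(t).

    The window is centred on each sample and truncated at the edges (an assumption);
    with fs = 600 Hz and window_s = 0.1 it spans 30 samples on each side.
    """
    half = int(round(fs * window_s)) // 2
    amp = []
    for t in range(len(y)):
        seg = y[max(0, t - half): t + half + 1]
        amp.append(math.sqrt(sum(v * v for v in seg) / len(seg)))
    return amp


def carrier(y, amp, eps=1e-12):
    """Carrier x(t) = y(t) / a(t), rearranged from Eq. (2.11); eps avoids division by zero."""
    return [v / max(a, eps) for v, a in zip(y, amp)]


# Toy signal: a constant-amplitude alternating sequence, so the amplitude is known to be 2.
y = [2.0, -2.0] * 100
a = rms_amplitude(y)
x = carrier(y, a)
print(a[0], x[0], x[1])
```

Dividing by the estimated amplitude removes the slowly varying envelope, leaving a unit-scale signal whose statistical structure is what the Gaussian BNs are fitted to.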
As an EMG signal can be considered a wide-sense stationary stochastic process modulated by the EMG amplitude [5], the carrier stochastic process (see Sec. 2.2.5) was also calculated by dividing the sEMG signal by the estimated amplitude. Finally, the sEMG signals (both the amplitude and the carrier) of different trials were resampled with cubic spline interpolation so that the overall movement duration from reaching-start to target-touching was exactly 1000 time points. This prevents the sub-network analysis (see Sec. 2.2.3) from being biased by unequal data lengths, since a DAG learned from more time points tends to include more connections than one learned from fewer time points. Preliminary studies included testing the effect of the moving window by increasing its width stepwise from 20 ms to 300 ms. It was determined visually that 100 ms gave the best estimation, as it yielded a good amplitude estimate and produced an approximately wide-sense stationary carrier signal.

Since we were interested in the influence of two factors on sEMG patterns, stroke condition and hand dominance, the sEMG recordings were grouped into four types of experimental groups: healthy dominant hand (HD), healthy non-dominant hand (HN), stroke more affected side involving the dominant hand (SD) and stroke more affected side involving the non-dominant hand (SN). To sharpen the contrast between the stroke and healthy states, the less affected side of stroke subjects was excluded because it may not be a valid comparison against the healthy state. Most individuals with stroke have subtle deficits on the non-paretic side due to a number of factors, including the contribution of the small portion of corticospinal tracts that do not decussate and remain ipsilateral. To focus on the effect of one factor at a time, we fixed the state of one factor and compared the two states of the other factor in four group comparisons: HD vs. HN, SD vs. SN, HN vs. SN and HD vs. SD.

2.3.2 Learned Bayesian Networks

Examples of the learned DAGs of the BNs are given in Fig. 2.8, where the left side illustrates a typical DAG of the dominant hand side of a healthy subject (i.e. an HD case) and the right side a typical DAG of the non-dominant hand side of a stroke subject (i.e. an SN case). The two DAGs showed different connection features. For example, the lateral deltoid is connected to all the other six muscles in the DAG of the HD case, while it is completely isolated in the SN case. The long head of the triceps is more connected with others in the HD case than in the SN case. The biceps is connected to the lateral and long heads of the triceps and the lateral deltoid in the HD case, but to the anterior deltoid, the latissimus dorsi, the brachioradialis and the lateral triceps in the SN case. These differences between the two typical subjects suggest that both stroke and hand dominance can affect the sEMG muscle association patterns during reaching movements.

Figure 2.8: Examples of typical DAGs of subjects with different hand dominance and different stroke state. Panels: (a) Healthy, Dominant; (b) Stroke, Non-dominant. Nodes: Tri. Long, Tri. Lat., Biceps, A. Deltoid, L. Deltoid, Brachior, Lats.

Comparisons between the DAGs of different experimental groups are shown in Fig. 2.9. The width of each displayed edge is proportional to the log odds ratio (LOR) of its appearance rates in the two groups. First the DAGs were converted to EGs, then the appearance rate of each edge in each group of EGs was calculated, and finally the rates were converted to the LOR. Only the edges whose LOR exceeded ln(2) in absolute value are shown in the figure. A solid edge means that it appears more frequently in the first group than in the second, and a dashed edge means a higher frequency of appearance in the second group. Sub-figures (a) and (b) compare the BNs learned from the dominant hand and non-dominant hand groups. We note that networks from the dominant hand groups have more connections between the muscle pairs (biceps, triceps long head), (lateral triceps, lateral deltoid) and (anterior deltoid, lateral deltoid). Sub-figures (c) and (d) compare the BNs of healthy subjects and stroke subjects. We note that the healthy subject group has more connections between the muscle pairs (triceps long head, biceps) and (brachioradialis, lateral deltoid).

Figure 2.9: Mean differences of overall network structures as a function of experimental condition. Panels: (a) Healthy, Dominant vs. Non-dominant; (b) Stroke, Dominant vs. Non-dominant; (c) Non-dominant, Healthy vs. Stroke; (d) Dominant, Healthy vs. Stroke. The labels under the sub-figures are the experimental conditions being compared. A solid edge means that it appears more frequently in the first type than in the second, and a dashed edge means the opposite. The width of an edge is proportional to the contrast of the frequencies of its appearance. The contrast is measured with the log odds ratio (LOR), defined as $\mathrm{LOR} = \ln\{[f_1/(1 - f_1)]/[f_2/(1 - f_2)]\}$, where $f_1$ and $f_2$ are the appearance frequencies of the edge in the two types of experimental situations being contrasted. If the edge appears $k$ times in $n$ trials, the appearance frequency $f$ is estimated as $(k + 1)/(n + 2)$ with the Bayes estimator. The Bayes estimator is more robust than the MLE when $k$ and $n$ are small, and converges to the MLE when $k$ and $n$ are large. Only edges whose LOR is greater than ln(2) in absolute value are shown in the figure.

2.3.3 Triple Patterns

The learned BNs from different experimental groups demonstrated different triple connection patterns, as shown in Fig.
2.10. The V pattern (see Fig. 2.3) appears significantly more frequently in the HN group than in the SN group (p = 0.0002), and more often in the SD group than in the SN group. The Line pattern (see Fig. 2.3) appears significantly more frequently in the SN group than in the HN group (p = 0.0103), and also in the SN group than in the SD group (p = 0.0077). In addition to the general triple patterns, specific muscle triples also showed significantly different connection patterns across experimental groups. Muscles involved in these triples are the deltoid (both anterior and lateral), triceps and brachioradialis, as shown in Table 2.1. However, no significant results were discovered about the Blank pattern and the Triangle pattern (see Fig. 2.3).

Since the V pattern is the most efficient pattern to connect three muscles and a Line pattern only connects two muscles, muscles of the HN and SD groups seem to cooperate more closely than those of the SN group. This observation coincides with clinical experience, where the SN group typically has the most difficulty in performing reaching movements. Lack of normal cooperation between the muscles may explain this empirical observation of the SN group’s inferior performance [12]. The importance of the deltoid (both anterior and lateral) in these results is consistent with a previous traditional analysis of these data [18], where the deltoids’ activation was found to be significantly altered after stroke.

[Figure 2.10, box plots of Count (vertical axis) for the HD, HN, SD and SN groups in four panels: Blank, Line, V shape, Triangle. Annotated t-tests: V shape, HN vs. SN: t stat = 4.2422, df = 98, 95% CI = 1.5955 to 4.4044, p = 0.0002; Line, HN vs. SN: t stat = −3.0936, df = 98, 95% CI = −3.2465 to −0.7091, p = 0.0103; Line, SD vs. SN: t stat = −3.1905, df = 94, 95% CI = −3.5533 to −0.8272, p = 0.0077]

Figure 2.10: Comparison of the count of appearances of connection patterns in different experimental groups. The distribution of the number of appearances is shown with box plots. The boxes have lines at the lower quartile, the median and the upper quartile. The whiskers are the extent of the rest of the data and the plus symbols are the outliers. If the notches of two boxes do not overlap, their medians differ significantly with a type I error rate of less than 5%. The means of the distributions are also compared with the t-test and significant results are labeled with arrows. p-values are adjusted for multiple comparisons with the Sidak correction. Since the notches compare the medians and the t-test compares the means, their results may differ from each other when the distributions are skewed, for example in the comparison of the pattern “Line” between HN and SN. There are no statistical differences in the number of edges between the above groups, and thus the statistically different results between groups are not based on the number of edges.

Comparison | Triple | Pattern | Count | OR and 95% CI | p-value | q-value
HD vs. SD | A. Deltoid, L. Deltoid and Tri. Lat. | L | 13/45 vs. 0/41 | inf, (3.5052, inf) | 0.0151 | 0.0076
HD vs. SD | L. Deltoid, Tri. Lat. and Tri. Long | L | 13/45 vs. 0/41 | inf, (3.5052, inf) | 0.0151 | 0.0076
HN vs. SN | Brachior, L. Deltoid and Tri. Lat. | L | 10/45 vs. 34/55 | 0.1765, (0.0650, 0.4638) | 0.0140 | 0.0141
SD vs. SN | A. Deltoid, Brachior and Tri. Long | B | 24/41 vs. 9/55 | 7.2157, (2.5614, 21.0100) | 0.0036 | 0.0036
SD vs. SN | A. Deltoid, L. Deltoid and Tri. Lat. | L | 0/41 vs. 14/55 | 0.0000, (0.0000, 0.3312) | 0.0390 | 0.0157

Table 2.1: Muscle coordination patterns significantly different among subjects. The connection patterns of the specific muscle triples appear significantly more/less frequently in one type of experimental group than in another. Counts are in the form of (No. of appearances / No. of trials). OR and CI are short for “odds ratio” and “confidence interval”. p-values are originally calculated with Fisher’s exact test and are then adjusted for multiple comparisons with the Sidak correction or converted to q-values with the FDR [2].
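The contrast measures used in this subsection can be illustrated with a minimal sketch. It assumes that the LOR of Fig. 2.9 uses the Bayes frequency estimate (k + 1)/(n + 2), that the OR column of Table 2.1 is the plain sample odds ratio (this reproduces the tabulated values 0.1765 and 7.2157), and that the Sidak adjustment takes its standard form 1 − (1 − p)^m. The function names and the number of comparisons m are hypothetical and are not taken from the original analysis; the confidence intervals and Fisher’s exact test are not reproduced here.

```python
import math


def bayes_frequency(k, n):
    """Bayes estimate of an appearance frequency: (k + 1) / (n + 2)."""
    return (k + 1) / (n + 2)


def log_odds_ratio(k1, n1, k2, n2):
    """LOR = ln{[f1 / (1 - f1)] / [f2 / (1 - f2)]} with Bayes-estimated f1, f2."""
    f1 = bayes_frequency(k1, n1)
    f2 = bayes_frequency(k2, n2)
    return math.log((f1 / (1 - f1)) / (f2 / (1 - f2)))


def sample_odds_ratio(k1, n1, k2, n2):
    """Plain sample odds ratio; infinite when its denominator is zero."""
    numerator = k1 * (n2 - k2)
    denominator = (n1 - k1) * k2
    return math.inf if denominator == 0 else numerator / denominator


def sidak_adjust(p, m):
    """Sidak-adjusted p-value for m comparisons (m is a hypothetical input)."""
    return 1 - (1 - p) ** m


# An edge seen in 8 of 10 trials in one group and 2 of 10 in the other
# (hypothetical counts) would be drawn, because |LOR| exceeds ln(2).
lor = log_odds_ratio(8, 10, 2, 10)
edge_is_drawn = abs(lor) > math.log(2)

# Odds ratio for the HN vs. SN row of Table 2.1 (10/45 vs. 34/55).
or_hn_sn = sample_odds_ratio(10, 45, 34, 55)
```

The Bayes estimate keeps the LOR finite even when an edge never appears in one group, which is why it is used for the figure, whereas the sample odds ratio of Table 2.1 can be zero or infinite.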
The functional connectivity between the brachioradialis and the deltoid that we detected (Fig. 2.9 c-d) during reaching movements has been suggested by previous studies. Lemon et al. [17] used transcranial magnetic stimulation during reaching movements in human subjects and found evidence of a strong cortical drive to both the deltoid and brachioradialis throughout a reaching movement. The connectivity between the deltoid and triceps found more prominently in the stroke subjects (Fig. 2.9 c-d) may suggest a more traditional stroke synergy, where there is a breakdown in the normal independent activation of muscles involving the shoulder girdle and those involved in movement of the elbow [10].

Probably because the reaching task is sufficiently simple that healthy subjects mastered it easily even with their non-dominant hands, no significant difference between the HD and HN groups was found. An alternative explanation is that handedness may not be strongly contrasted in individuals that exhibit forms of ambidexterity.

2.3.4 Classification Performance

The across-subject classification performances are reported in Table 2.2. The proposed methods, which use the structure features of the BNs learned from the carrier signals, provide very high classification accuracy in the four classification tasks, and outperform the PCA-based SVM approach applied to the amplitude signals. This excellent classification performance is unlikely to be related to over-fitting, because the error rate was estimated with cross-subject validation and the classifiers were provided with almost equal chances to achieve good performance.

Signal | Feature | Classifier | HD vs. HN | SD vs. SN | HN vs. SN | HD vs. SD
Car. | EG | CT | 0.0556 | 0.0000 | 0.0000 | 0.1333
Car. | Triple | CT | 0.1111 | 0.0000 | 0.0000 | 0.0000
Amp. | PCs | SVM | 0.1667 | 0.0769 | 0.1250 | 0.0667

Table 2.2: The error rates of cross-subject classification. “Car.” and “Amp.” are short for the carrier and the amplitude respectively. “EG” and “Triple” are the essential graphs and the triple patterns of the carrier’s Bayesian networks. “PC” is short for principal component. “CT” and “SVM” are short for classification trees and support vector machines respectively. All the error rates were estimated with the cross-subject validation procedure in Fig. 2.4. Bold numbers are the best performance of the four classifiers.

Although most current studies on sEMG focus on the amplitude signals for classification, our classification trees are based on the “carrier” signal (see Sec. 2.2.5), which is usually lost in the traditional process of rectification, smoothing and other preprocessing steps. In addition to almost perfect classification involving stroke, the proposed methods also offer visible interpretations.

The BN-based classification trees generally provided better performance when the triple patterns were used as the input classification features than when the adjacency matrices of EGs were used. We think this is because an EG just encodes pair-wise interactions but a triple involves three muscles. Figs. 2.11 and 2.12 show the best classification trees for the four classification tasks. These trees include interactions between agonist-antagonist pairs (e.g. long head of triceps, biceps), muscles with similar actions (e.g. biceps, brachioradialis), and muscles with no obvious similarity of function that might be part of larger synergies (e.g. brachioradialis and latissimus dorsi).

As previously mentioned in Sec. 2.2.4, in addition to using their structure features, BNs by themselves can be extended straightforwardly to a statistical classifier. The BN statistical classifier separated the trials of the same subject perfectly (i.e. the within-subject cross-validation error rate = 0%) but performed poorly in cross-subject validation (i.e. an error rate near 50%). Fig.
2.13 demonstrates the usage of a BN classifier for classifying healthy subjects’ dominant and non-dominant hands. Although the BN classifier showed high trial-to-trial reliability, its across-subject classification performance was poor. We believe that the poor across-subject performance of the BN statistical classifiers is not due to deficiencies in the method, but rather reflects the underlying biology. We note that the likelihood function is very consistent across trials of the same subject, which suggests that it is robust to various artifacts that may corrupt sEMG signals. Nevertheless, there may be considerable variations between individuals due to factors such as variance in genetics, developmental environment, compensatory strategies, and ongoing plasticity in response to environmental stressors. Thus a key result of the present work is the implication that most within-group, inter-subject variability is not in the network structure (which itself is sensitive to handedness and effects of stroke), but rather in the parameters that specify the interactions within the network structure.

[Figure 2.11, four classification-tree panels: (a) Healthy, Dominant vs. Non-dominant; (b) Stroke, Dominant vs. Non-dominant; (c) Non-dominant side, Healthy vs. Stroke; (d) Dominant side, Healthy vs. Stroke]

Figure 2.11: The best classification trees by using EG structures as the classification features. If two trees have equal cross-subject classification error rates, the one using fewer input edges is chosen. For each node, a label specifies its group membership and its predicted value (in bold). For a branch, its label is the decision rule. For the performances of these trees, please refer to Table 2.2.

[Figure 2.12, four classification-tree panels: (a) Healthy, Dominant vs. Non-dominant; (b) Stroke, Dominant vs. Non-dominant; (c) Non-dominant side, Healthy vs. Stroke; (d) Dominant side, Healthy vs. Stroke]

Figure 2.12: The best classification trees by using triple patterns as the classification features. If two trees have equal cross-subject classification error rates, the one using fewer input edges is chosen. For each node, a label specifies its group membership and its predicted value (in bold). For a branch, its label is the decision rule. For the performances of these trees, please refer to Table 2.2.

[Figure 2.13, scatter plot of the 1st principal component (horizontal axis, ×10^4) against the 2nd principal component (vertical axis, ×10^4); trials are plotted as numeric labels, with HD and HN in the legend]

Figure 2.13: Principal components of the log likelihood of BNs trained from the sEMG data of the dominant (HD) and non-dominant (HN) arms of healthy subjects. As demonstrated in Eqs. (2.9) and (2.10), a statistical BN classifier is a function of $p(x|M)$, where $x$ is the data of a trial and $M$ is the BN model of another trial. Thus, a trial is represented as a log likelihood vector composed of $p(x|M_i)$, where the $M_i$ are the models of many trials. To visualize the high-dimensional vectors, we plot their first two principal components. Note that trials of the same arm tend to cluster together, which suggests that the BN represents the reaching movement reliably and consistently. Trials of HD and HN are not separated, which suggests that arms of the same type do not share a common distinguishing pattern in their log likelihood, but each has a different pattern. Since a statistical BN classifier is based on log-likelihood, its cross-subject performance is poor while its within-subject performance is excellent.

2.4 Conclusions and Discussions

In this paper, we have developed a Bayesian network (BN) based framework for modeling muscle networks to represent muscle co-ordinations in motor control. To demonstrate the benefits of the proposed approach, we applied this method to multi-channel sEMG data simultaneously recorded during reaching movements in healthy and stroke subjects.
We noted that dependent muscle synergies can be revealed by first using a BN to model the muscle interaction network and then analyzing subnets of the derived BN. Further examination of the muscle synergies (muscle association patterns) suggested that stroke may particularly affect the interaction between a few specific muscles, especially the deltoid, during reaching movements. Classification trees based on the BNs’ structure features can effectively classify the reaching performed by the healthy dominant, healthy non-dominant, stroke dominant and stroke non-dominant arms across subjects. Classification trees provide an additional benefit: their classification processes can be easily visualized.

A key result of this study was that statistically significant differences between the sEMG recorded under different conditions were found by analyzing the “carrier” signal, which is obtained by estimating and removing the amplitude information from the raw sEMG data. This is a noteworthy observation, since in many sEMG studies only the amplitude is investigated and the carrier signal is discarded. Although not specifically explored in this study, we believe that the widespread statistical dependencies between the “carrier” of the sEMG signals from different muscles may reflect widespread synchronization between different cortical areas and muscles known to exist during dynamic movements [8] [20]. We are intrigued that the significant features of the networks (Fig. 2.9 c-d) suggest a significant statistical interaction between the deltoid and brachioradialis, consistent with previously-described cortical-muscle interactions during reaching movements in normal subjects [17].

Another promising result from this study is the finding that the structure of BNs and their subsets are quite robust across individuals within the same group, yet demonstrate enough sensitivity to detect handedness and the effects of stroke.
One of the fundamental challenges in classification of movements after stroke is how to deal with the inherent subject-to-subject variability in sEMG recordings, yet still be sensitive enough to detect impairment. The overall BN models demonstrated robustness to trial-to-trial variability within subjects (Fig. 2.13), but were quite different across subjects within the same group. However, the success of the classification trees based on the structure features of the subject-specific BNs suggests that the BN structure is sensitive to the effects of stroke or handedness, but robust to inter-subject variability.

Although we suggest that our results demonstrate strong evidence to support the use of BNs as a tool to study sEMG signals, there are nevertheless shortcomings of the proposed method. The BN model, as proposed here, assumes stationarity of the muscle interactions. Yet there is evidence that muscle interactions may be dynamically affected by a number of factors. For example, different heads of the gastrocnemius muscle may be activated during walking as a function of activity in sensory afferents [7], an observation with a neuroanatomical basis [15]. Presumably, integrating the time-varying amplitude information with a dynamic Bayesian network (DBN), in addition to the carrier information, for classification of sEMG data may be a fruitful avenue to explore in the future. Each sEMG channel is a measure of the depolarization of muscle fibers, which is the end result of motor cortex activity, conduction along peripheral nerves, propagation across the neuromuscular junction, and propagation within the muscles.
At each stage of motor propagation, there is the possibility for temporal variability: for example, “neuromotor noise” in the central nervous system, which may be particularly important in disease states; conduction velocity along the peripheral nerves, which is a strong function of temperature; jitter at the neuromuscular junction; and possible disease states in the muscles themselves [22, 27]. However, successfully incorporating these features in the future will additionally involve explicitly modeling the dynamic temporal relationships between sEMG recordings.

We do not suggest that the model is a biologically-accurate description of the generative process of EMG activity. Nevertheless, we note that both Renshaw cells and Ia inhibitory interneurons have complex effects on the firing of α motor neurons in the spinal cord. Since these modulatory cells are themselves activated by muscle activities [28], the “dependent synergies” proposed – where interactions between muscles are influenced by the activity of other muscle(s) – are physiologically plausible. However, explicitly incorporating inhibitory interneurons would require the addition of hidden nodes, which is beyond the scope of the current proposed method.

Acknowledgment

This research was partially funded by a National Parkinsons Foundation (NPF) Center of Excellence Grant and a Natural Sciences and Engineering Research Council of Canada (NSERC) Grant.

Bibliography

[1] Steen A. Andersson, David Madigan, and Michael D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs. The Annals of Statistics, 25:505–541, 1997.
[2] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 57(1):289–300, 1995.
[3] Timothy J. Carroll, Evan R. L. Baldwin, and David F. Collins.
Task dependent gain regulation of spinal circuits projecting to the human flexor carpi radialis. Experimental Brain Research, 161(3):299–306, March 2005.
[4] E. A. Clancy. Electromyogram amplitude estimation with adaptive smoothing window length. IEEE Transactions on Biomedical Engineering, 46(6):717–729, 1999.
[5] E. A. Clancy, E. L. Morin, and R. Merletti. Sampling, noise-reduction and amplitude estimation issues in surface electromyography. Journal of Electromyography and Kinesiology, 12(1):1–16, February 2002.
[6] A. d’Avella, P. Saltiel, and E. Bizzi. Combinations of muscle synergies in the construction of a natural motor behavior. Nature Neuroscience, 6:300–308, 2003.
[7] J. Duysens, B. M. H. van Wezel, T. Prokop, and W. Berger. Medial gastrocnemius is more activated than lateral gastrocnemius in sural nerve induced reflexes during human gait. Brain Research, 727(1-2):230–232, July 1996.
[8] Bernd Feige, Ad Aertsen, and Rumyana Kristeva-Feige. Dynamic Synchronization Between Multiple Cortical Motor Areas and Muscle Activity in Phasic Voluntary Movements. J Neurophysiol, 84(5):2622–2629, 2000.
[9] Nir Friedman and Daphne Koller. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1):95–125, January 2003.
[10] Paul L. Gribble and D. J. Ostry. Independent coactivation of shoulder and elbow muscles. Experimental Brain Research, 123(3):355–360, November 1998.
[11] Nihal Fatma Guler and Sabri Kocer. Classification of EMG signals using PCA and FFT. Journal of Medical Systems, 29(3):241–250, June 2005.
[12] JE Harris and JJ Eng. Individuals with the dominant hand affected following stroke demonstrate less impairment than those with the non-dominant hand affected. Neurorehabilitation and Neural Repair, in press.
[13] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
doi: 10.1093/biomet/57.1.97.
[14] M. Kaminski. Determination of transmission patterns in multichannel data. Phil. Trans. R. Soc. B, 360(1457):947–952, May 2005.
[15] L. A. LaBella, J. P. Kehler, and D. A. McCrea. A differential synaptic input to the motor nuclei of triceps surae from the caudal and lateral cutaneous sural nerves. J Neurophysiol, 61(2):291–301, 1989.
[16] Steffen L. Lauritzen. Graphical models, volume 17. Clarendon Press, Oxford University Press, Oxford, New York, 1996.
[17] R. N. Lemon, R. S. Johansson, and G. Westling. Modulation of corticospinal influence over hand muscles during gripping tasks in man and monkey. Canadian Journal of Physiology and Pharmacology, 74:547–558, 1996.
[18] Patrick H. McCrea, Janice J. Eng, and Antony J. Hodgson. Saturated Muscle Activation Contributes to Compensatory Reaching Strategies After Stroke. J Neurophysiol, 94(5):2999–3008, 2005. doi: 10.1152/jn.00732.2004.
[19] Martin J. McKeown. Cortical activation related to arm-movement combinations. Muscle & Nerve, 23(S9):S19–S25, 2000.
[20] M. J. McKeown and R. Radtke. Phasic and tonic coupling between EEG & EMG revealed with independent component analysis (ICA). J Clin Neurophysiology, 18:45–57, 2001.
[21] R. Merletti, D. Farina, and A. Granata. Non-invasive assessment of motor unit properties with linear electrode arrays. Electroencephalography and clinical neurophysiology. Supplement, 50:293–300, 1999.
[22] K. R. Mills. Specialised electromyography and nerve conduction studies. J Neurol Neurosurg Psychiatry, 76(suppl 2):ii36–40, 2005.
[23] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298:824–827, 2002.
[24] Kevin Murphy. Bayes net toolbox for Matlab (BNT). URL http://bnt.sourceforge.net/.
[25] Kevin Patrick Murphy. Dynamic Bayesian networks: representation, inference and learning. PhD thesis, University of California, Berkeley, 2002.
Chair: Stuart Russell.
[26] K. A. Provins. Handedness and speech: A critical reappraisal of the role of genetic and environmental factors in the cerebral lateralization of function. Psychological Review, 104(3):554–571, Jul. 1997.
[27] David J. Reinkensmeyer, Mario G. Iobbi, Leonard E. Kahn, Derek G. Kamper, and Craig D. Takahashi. Modeling reaching impairment after stroke using a population vector model of movement control that incorporates neural firing-rate variability. Neural Comp., 15(11):2619–2642, 2003.
[28] Serge Rossignol, Rejean Dubuc, and Jean-Pierre Gossard. Dynamic sensorimotor interactions in locomotion. Physiol. Rev., 86(1):89–154, 2006.
[29] Robert L. Sainburg. Evidence for a dynamic-dominance hypothesis of handedness. Experimental Brain Research, 142(2):241–258, Jan. 2002. doi: 10.1007/s00221-001-0913-8.
[30] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, Mar. 1978.
[31] N. S. Stoykov, M. M. Lowery, A. Taflove, and T. A. Kuiken. Frequency- and time-domain FEM models of EMG: capacitive effects and aspects of dispersion. IEEE Transactions on Biomedical Engineering, 49(8):763–72, Aug. 2002.
[32] Matthew C. Tresch, Vincent C. K. Cheung, and Andrea d’Avella. Matrix factorization algorithms for the identification of muscle synergies: evaluation on simulated and experimental data sets. J Neurophysiol, 95:2199–2212, 2006.
[33] Matthew C. Tresch, Philippe Saltiel, and Emilio Bizzi. The construction of movement by the spinal cord. Nature Neuroscience, 2(2):162–7, 1999.
[34] Vladimir Naumovich Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

Chapter 3

Comparing Group-Analysis Methods Based on Bayesian Networks with Applications in Modelling Brain Connectivity²

To make analysis results generally applicable to a population, rather than just a specific individual, experimental data should be studied at group level.
In a study on Parkinson’s disease using functional magnetic resonance imaging, three popular group-analysis methods, i.e. the “virtual-typical-subject” approach, the “individual-structure” approach, and the “common-structure” approach, were compared from the aspects of their statistical goodness-of-fit to the data, and more importantly their sensitivity in detecting the effect of medication on the disease. They led to considerably different group-level results, learning different network structures, and detecting different numbers of connections normalized by the medication. The “virtual-typical-subject” approach fitted the data of the healthy people best, while the “individual-structure” approach fitted the data of the patients best. The “individual-structure” approach was more sensitive in detecting the normalizing effect of the medication on brain connectivity, but it also tended to yield results supporting its assumption even when the assumption actually is incorrect.

² A version of this chapter has been published. Junning Li, Z. Jane Wang, Samantha J. Palmer and Martin J. McKeown (2008) Dynamic Bayesian Network Modeling of fMRI: A Comparison of Group-Analysis Methods. NeuroImage 41: 398–407.

3.1 Introduction

Effective brain connectivity, defined as the neural influence that one brain region exerts over another [4, 6], is important for the assessment of normal brain function, and its impairment is associated with neurodegenerative diseases such as Alzheimer’s or Parkinson’s disease (PD). Various mathematical methods, such as structural equation modelling (SEM) [18], multivariate auto-regressive modelling (MAR) [8], dynamic causal modelling (DCM) [5], bilinear dynamical system (BDS) [24] and Bayesian networks (BN) [27, 38], have been proposed for inferring effective connectivity from functional magnetic resonance imaging (fMRI).
These models can all be visualized as a graph whose nodes denote brain regions and whose directed edges denote connections between brain regions. Connection parameters are associated with edges, indicating the strength of the connections. MAR, SEM and BNs are ordinary graphical models, but with different constraints on the network structures, regarding whether time lags are considered and whether cycles are allowed on the graph. DCM and BDS add a hidden layer to the graph for the unobserved underlying neural activities, and also allow the interaction between two regions to be modulated by the activities of a third region, at the cost of intensive computation.

fMRI experiments are usually performed to infer brain activations consistently shared by a population or to identify differences between populations. Therefore, it is important to develop group-analysis methods for the aforementioned graphical models if they are to be more fully adopted in group analysis rather than just being used at the individual level [27]. Here we investigate potential group-analysis approaches, using BNs as a prototype of graphical models. We are especially interested in BN modelling of brain connectivity because it flexibly handles both categorical and continuous data, as well as both linear and nonlinear relationships [13], and also because plenty of methods for model learning and computation have been developed for BNs in the field of artificial intelligence.

Group analyses of any sort fundamentally involve two factors: common features shared by group members, and the specific features of each individual. The common features seen within a group are typically of ultimate interest, although individually-specific features cannot simply be ignored, as they may indicate the presence of outliers, potentially biasing the overall group result. Many studies have found that fMRI activation patterns may considerably differ among subjects even within the same group.
For example, Sugihara et al. [33] found that three of the five regions related to writing were inconsistently activated among subjects; in the study of Vandenbroucke et al. [36], eighteen of twenty-nine subjects showed activation that varied in both location and size. These results suggest that the similarity and diversity among group members should be balanced in group models, rather than either over-emphasized or neglected.

There have been several group-analysis techniques implicitly employed in graphical modelling of fMRI data, based on models such as SEM, MAR and BNs. A review of the literature reveals that these techniques can be divided into three broad categories, namely the “virtual-typical-subject” (VTS) approach, the “individual-structure” (IS) approach and the “common-structure” (CS) approach.

The VTS approach assumes that every subject within a group performs the same function with exactly the same connectivity network, and it does not generally accommodate inter-subject variability. These methods reconstruct a virtual typical subject to represent the whole group, by pooling or averaging group data as if they were sampled from a single virtual subject [7, 27, 38], and then learn the connectivity network of the virtual typical subject. When subjects are homogeneous within a group or the inter-subject variability follows certain regular distributions such as the Gaussian distribution, the VTS approach could increase sensitivity, because pooling can yield a relatively large data set, and averaging can enhance the signal-to-noise ratio.

At the other extreme, the “individual-structure” (IS) approach learns a network for each subject separately, and then performs group analysis on the individually-learned networks [7, 15]. The IS approach is consistent with the concept of functional degeneracy, i.e. “the ability of elements that are structurally different to perform the same function or yield the same output” [3], or more plainly, “there are multiple ways of completing the same task” [26]. The IS approach certainly considers inter-subject variability, but may not integrate group data tightly enough to enable correct inference about statistically significant differences between groups.

The “common-structure” (CS) approach is a trade-off between the two extremes, imposing the same network structure on the statistical graphical models of every subject, while allowing the parameters of the models to differ across subjects [12, 19]. The CS approach assumes that cognitive functions invoke a similar connectivity pattern for every subject, but the exact details of the connectivity patterns, in terms of connectivity strength (coefficients), differ across subjects. Thus, the CS approach addresses group similarity at the structural level and inter-subject variability at the parameter level.

We investigate the three approaches using data collected from subjects with Parkinson’s disease (PD), as the consistent dramatic effects of L-dopa medication seen in this population provide an additional qualitative “ground truth” for the effects of group inference. One of the cardinal features of PD is bradykinesia, a slowness of performed voluntary movements, which represents a major source of disability in PD and is related to impairments in daily activities such as walking and writing. Impaired ramping of force may be fundamentally related to the clinical feature of bradykinesia [31]. Similarly, several studies have demonstrated disturbances of rhythmic movement in PD [25] that are likely related to bradykinesia. However, bradykinesia is thus far not fully explained. A recent study used PET imaging to infer brain areas that appear related to bradykinesia [35], such as under-activity in the sensorimotor cortex contralateral to the moving arm, bilateral dorsal premotor cortices, and ipsilateral cerebellum.
However, the PET modality did not enable inference about the interaction between these regions. A local problem in the basal ganglia circuit in Parkinson’s disease may cause disruption of downstream distributed motor control networks. A greater understanding of the relative contribution of neural regions to bradykinesia in healthy controls and patients may be of diagnostic and therapeutic significance for patients with Parkinson’s disease.

The main reason that we chose PD as an example for assessing group analyses is that the standard medication used for this condition, L-dopa, has dramatic effects against bradykinesia and rigidity (although less effect against tremor, balance and gait). The effects of L-dopa on idiopathic disease are sufficiently dramatic that lack of response to this medication makes the diagnosis of Parkinson’s disease questionable [10]. Thus, after introduction of the L-dopa medication, we would expect group features of analyses from PD subjects performing a paradigm assessing bradykinesia to approach those of normal subjects.

In this paper, we investigate the performances of the three group-analysis approaches (VTS, IS, CS) by applying them to an fMRI study on Parkinson’s disease. Broadly speaking, we attempted to answer three fundamental questions: “which approach most accurately reflects the underlying biomedical behavior?”, “do the approaches lead to considerably different analysis results?”, and “how can the suitable approach be selected?” The three approaches are compared from the aspects of their statistical goodness-of-fit to the data, and more importantly their sensitivity in detecting the effect of L-dopa medication on the disease. To the best of our knowledge, this is the first study specifically devoted to group analysis of fMRI with BN modelling.
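To make the distinction among the three approaches concrete, the following is a toy sketch and not the procedure used in this study. It assumes a single candidate edge X → Y, a linear-Gaussian model, and a BIC score for deciding whether the edge is present; all function names and the two synthetic subjects are hypothetical. VTS pools the data before learning one structure, IS learns one structure per subject, and CS selects one structure while fitting the parameters separately for each subject.

```python
import math


def gaussian_bic(xs, ys, use_edge):
    """BIC score of y = a + b*x + noise (edge) or y = a + noise (no edge)."""
    n = len(ys)
    mean_y = sum(ys) / n
    if use_edge:
        mean_x = sum(xs) / n
        sxx = sum((x - mean_x) ** 2 for x in xs)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        slope = sxy / sxx
        rss = sum((y - mean_y - slope * (x - mean_x)) ** 2
                  for x, y in zip(xs, ys))
        n_params = 3  # intercept, slope, noise variance
    else:
        rss = sum((y - mean_y) ** 2 for y in ys)
        n_params = 2  # intercept, noise variance
    sigma2 = max(rss / n, 1e-12)  # floor guards against log(0)
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return loglik - 0.5 * n_params * math.log(n)


def virtual_typical_subject(subjects):
    """VTS: pool all subjects' data, then learn a single structure."""
    xs = [x for sx, _ in subjects for x in sx]
    ys = [y for _, sy in subjects for y in sy]
    return gaussian_bic(xs, ys, True) > gaussian_bic(xs, ys, False)


def individual_structure(subjects):
    """IS: learn a structure separately for every subject."""
    return [gaussian_bic(xs, ys, True) > gaussian_bic(xs, ys, False)
            for xs, ys in subjects]


def common_structure(subjects):
    """CS: one shared structure, parameters fitted per subject."""
    with_edge = sum(gaussian_bic(xs, ys, True) for xs, ys in subjects)
    without_edge = sum(gaussian_bic(xs, ys, False) for xs, ys in subjects)
    return with_edge > without_edge


# Two synthetic subjects whose connection strengths have opposite signs.
xs = [(i - 10) / 10 for i in range(21)]
noise = [0.1 * (-1) ** i for i in range(21)]
subject_a = (xs, [x + e for x, e in zip(xs, noise)])
subject_b = (xs, [-x + e for x, e in zip(xs, noise)])
subjects = [subject_a, subject_b]

vts_edge = virtual_typical_subject(subjects)
is_edges = individual_structure(subjects)
cs_edge = common_structure(subjects)
```

In this constructed case the opposite-signed effects cancel in the pooled data, so the VTS rule drops the edge, whereas the IS and CS rules keep it. This only illustrates how the assumptions of the approaches can change the learned structure; it says nothing about which approach suits real fMRI data.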
3.2 Materials and Methods

3.2.1 fMRI Data

The fMRI data were collected from ten healthy people and ten Parkinson’s disease (PD) patients, each of whom was asked to squeeze a rubber bulb at three different speeds or at a constant force, as cued by visual instruction. The patients performed the entire task twice, once before and once after L-dopa medication, which is most effective against slowness of movement and rigidity. Six regions were selected for the analysis according to the Talairach atlas [34]: the left and right primary motor cortex (M1) (Brodmann Area 4), supplementary motor cortex (SMA) (Brodmann Area 6), and lateral cerebellar hemispheres (CER). fMRI time courses were collected at a sampling frequency of 0.503 Hz for 4 minutes and 18 seconds, for a total of 130 time points.

Subjects: The study was approved by the University of British Columbia Ethics Board. Subjects gave written informed consent prior to participating. Ten volunteers with clinically diagnosed PD participated in the study (4 men, 6 women, mean age 66 ± 8 years, 8 right-handed, 2 left-handed). All patients had mild to moderate PD (Hoehn and Yahr stage 2-3) [9] with mean symptom duration of 5.8 ± 3 years. Exclusion criteria included atypical Parkinsonism, presence of other neurological or psychiatric conditions, and use of antidepressants, sleeping tablets, or dopamine blocking agents. All patients were taking L-dopa with an average daily dose of 685 ± 231 mg. We also recruited ten healthy, age-matched control subjects without active neurological disorders (3 men, 7 women, mean age 57.4 ± 14 years, 9 right-handed, 1 left-handed). All patients stopped their anti-Parkinson medications overnight for a minimum of 12 hours before the study. The mean Unified Parkinson Disease Rating Scale (UPDRS) motor score during this off-L-dopa state was 26 ± 8. There were no significant correlations between UPDRS motor scores and age.
All patients exhibited some aspects of bradykinesia on examination. After completing the experiment in an off-medication state, patients were given the equivalent of their usual morning dose of L-dopa in immediate release form (mean 125 ± 35.3 mg L-dopa). They then repeated the same tasks post-medication following an interval of approximately 1 hour to allow time for L-dopa to reach peak dose.

Experimental task: Subjects were instructed to lie on their back in the magnetic resonance scanner viewing a computer screen via a projection mirror system. All subjects used an in-house designed response device in their left hand, which was a custom-built MR-compatible rubber squeeze-bulb connected to a pressure transducer outside the scanner room. They lay with their forearm resting down in a stable position, and were instructed to squeeze the bulb using an isometric hand grip and to keep their grip constant throughout the study. Each subject had their maximum voluntary contraction (MVC) measured at the start of the experiment and all subsequent movements were scaled to this, so that they had to squeeze at 5-15% of maximal force to accomplish the task. Using the squeeze bulb, subjects were required to control the width of an inflatable ring (shown as a black horizontal bar on the screen) in order to keep the ring within an undulating pathway without scraping the sides. Applying greater pressure to the bulb increased the width of the bar, and releasing pressure from the bulb decreased the width of the bar. To avoid scraping the sides of the tunnel, the required pressure was between 5 and 15% MVC.

Figure 3.1: Graph of the first half of the computer task subjects had to perform. Subjects were either asked to statically squeeze at 10% maximum voluntary contraction (MVC), or squeeze sinusoidally (between 5 and 15% of MVC) at differing frequencies in 20s blocks. All frequencies were performed twice per session.
No additional visual feedback or error reporting was given when subjects went outside the white target lines, so subjects had to monitor their own performance carefully. The pathway used a block design with sinusoidal sections at three different frequencies (0.25, 0.5 and 0.75 Hz) in a pseudo-random order, and straight parts in between where the subjects had to keep a constant force of 10% of MVC. The frequencies were chosen based on prior findings, and pilot studies were used to determine that PD patients could comfortably perform the required task. Each block lasted 19.85 seconds (exactly 10 TR intervals), alternating a sinusoid, constant force, sinusoid and so on (Fig. 3.1) to a total of 4 minutes and 18 seconds. PD patients performed the task after an overnight withdrawal of their anti-Parkinson drugs (minimum of 12 hours since the last dose of L-dopa) and repeated the same series one hour after administration of L-dopa. Before the first scanning session, subjects practiced the task at each frequency until errors stabilized and they were familiar with the task requirements. Custom Matlab software (Mathworks) and the Psychtoolbox [1, 23] were used to design and present stimuli, and to collect behavioral data from the response devices.

FMRI acquisition: Functional MRI was conducted on a Philips Achieva 3.0 T scanner (Philips, Best, the Netherlands) equipped with a head-coil. We collected echo-planar (EPI) T2*-weighted images with blood oxygenation level-dependent (BOLD) contrast. Scanning parameters were: repetition time 1985 ms, echo time 3.7, flip angle 90°, field of view (FOV) [240.00 240.00 240.00] mm, matrix size = 128 × 128, voxel size 1.9 mm × 1.9 mm × slice thickness. Each functional run lasted 4 minutes and 18 seconds. Thirty-six axial slices of 3 mm thickness were collected in each volume, with a gap thickness of 1 mm. We selected slices to cover the dorsal surface of the brain and include the cerebellum ventrally.
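The timing figures quoted for the task and the acquisition can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text (a repetition time of 1985 ms, 130 time points, 10 TR intervals per block); the variable names are illustrative and not taken from the study's software.

```python
# Consistency check of the timing figures quoted in the text.
TR_S = 1.985          # repetition time in seconds (1985 ms)
N_VOLUMES = 130       # time points per functional run
TRS_PER_BLOCK = 10    # each task block spans exactly 10 TR intervals

sampling_hz = 1.0 / TR_S                   # one volume per TR
run_seconds = N_VOLUMES * TR_S             # length of one functional run
block_seconds = TRS_PER_BLOCK * TR_S       # length of one task block
n_blocks = N_VOLUMES // TRS_PER_BLOCK      # alternating sinusoid / constant blocks
minutes, seconds = divmod(run_seconds, 60)

print(f"sampling frequency: {sampling_hz:.4f} Hz")
print(f"run length: {int(minutes)} min {seconds:.2f} s in {n_blocks} blocks "
      f"of {block_seconds:.2f} s")
```

The run length comes out at just over 4 minutes and 18 seconds, and the sampling frequency at roughly 0.50 Hz, in line with the values given above.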
A high-resolution, 3-dimensional T1-weighted image consisting of 170 axial slices was acquired of the whole brain to facilitate anatomical localization of activation for each subject. Head motion was minimized by a foam pillow placed around the subject’s head within the coil. Subjects also used ear plugs to minimize the noise of the scanner. The subjects constantly viewed visual stimuli on a screen through a mirror built into the head coil.

Preprocessing: The functional MRI data were preprocessed for each subject, using Brain Voyager trilinear interpolation for 3D motion correction and sinc interpolation for slice timing correction. No temporal or spatial smoothing was performed on the data. The data were then further motion corrected with Motion Corrected Independent Component Analysis (MCICA), a computationally expensive but highly accurate method for motion correction [16, 17]. The Brain Extraction Tool in MRIcro [28] (http://www.sph.sc.edu/comd/rorden/mricro.html) was used to strip the skull off the anatomical and first functional image from each run, to enable a more accurate alignment of the functional and anatomical scans. Custom scripts in Amira software (Amira 3D Visualization and Volume modelling) were used to co-register the anatomical and functional images. The following ROIs were drawn separately in each hemisphere, based upon anatomical landmarks and guided by the Talairach atlas [34]: primary motor cortex (M1) (Brodmann Area 4), supplementary motor cortex (SMA) (Brodmann Area 6), and lateral cerebellar hemispheres (CER). The labels on the segmented anatomical scans were resliced at the fMRI resolution. The raw time courses of the voxels within each ROI were averaged as the overall activity of the ROI, and the averaged time courses were then detrended and normalized to unit variance before input into the BN models.
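The last preprocessing step (averaging the voxels of an ROI, detrending, and scaling to unit variance) can be sketched as follows. This is a minimal illustration, not the study's code: the text does not state the order of the detrending, so a linear trend is assumed here, and the function name `roi_time_course` and the synthetic data are hypothetical.

```python
import numpy as np

def roi_time_course(voxels):
    """Average, detrend and normalize the voxel signals of one ROI.

    voxels: array of shape (n_voxels, n_timepoints) holding raw time courses.
    Returns a 1-D array with zero linear trend and unit variance.
    """
    mean_tc = voxels.mean(axis=0)                    # average over voxels
    t = np.arange(mean_tc.size)
    slope, intercept = np.polyfit(t, mean_tc, 1)     # ASSUMPTION: linear detrending
    detrended = mean_tc - (slope * t + intercept)
    return detrended / detrended.std()               # unit variance

# Synthetic example: 50 voxels, 130 time points, with a slow drift added.
rng = np.random.default_rng(0)
raw = rng.standard_normal((50, 130)) + 0.02 * np.arange(130)
tc = roi_time_course(raw)
```

Normalizing each ROI to unit variance puts the regression coefficients of different connections on a comparable scale, which matters later when coefficients are compared across groups.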
3.2.2 Dynamic Bayesian Networks

A dynamic Bayesian network (DBN) [21] is a graphical model for stochastic processes. The term “dynamic” does not mean that the model itself changes over time, but that the Bayesian network (BN) models a dynamic system. As an extension of BN, a DBN follows the same rules as a regular BN does, encoding the conditional-independence/dependence relationships among random variables with a directed acyclic graph (DAG). If a random variable $x_a$ directly depends on another random variable $x_b$, i.e. $x_a$ still depends on $x_b$ even given all the other random variables, then the dependence is encoded as an edge between nodes $a$ and $b$ in the DAG (see Fig. 3.2 for an example, and footnote 3). In this text, nodes $a$ and $b$ are associated with $x_a$ and $x_b$, respectively. When fMRI signals of ROIs are modeled with BNs, an edge in the DAG implies that two ROIs interact with each other even after removing the influence from all the other ROIs. According to Bayes’ rule, the joint probability (density) function of all the random variables can be factorized according to the DAG as:

$$f(x) = \prod_{a \in W} f(x_a \mid x_{pa[a]}), \tag{3.1}$$

where $W$ denotes the set of all the nodes and $pa[a]$ denotes the set of node $a$’s parent node(s).

Figure 3.2: An example of dynamic Bayesian networks. This DBN is a first-order Kalman filter process. Arrows from $X_{t-1}$ and $U_t$ to $X_t$ ($t = 2, 3, \ldots$) mean that $X_t$ depends on $U_t$ and $X_{t-1}$, and are associated with the transition distribution $f(X_t \mid X_{t-1}, U_t)$, which varies according to the input $U_t$. Arrows from $X_t$ to $Y_t$ ($t = 1, 2, \ldots$) mean that the output $Y_t$ depends on $X_t$, and are associated with $f(Y_t \mid X_t)$. As the same dependence relationships repeat, the process can be represented by just the first two time slices circled by dots. The joint probability density function of the process is $f(X, Y, U) = f(U_1) f(X_1 \mid U_1) f(Y_1 \mid X_1) \prod_{t=2}^{T} f(U_t) f(X_t \mid X_{t-1}, U_t) f(Y_t \mid X_t)$, where $T$ is the number of time points.
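To make the factorization in Eq. (3.1) concrete, the sketch below evaluates the joint log-density of a small linear-Gaussian BN as a sum of one conditional term per node. The two-node network, its coefficient and the helper names are illustrative assumptions, not part of the thesis.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def joint_logpdf(values, parents, weights, noise_var):
    """Log of Eq. (3.1) for a linear-Gaussian BN.

    values:    {node: observed value}
    parents:   {node: list of parent nodes}  (this is pa[a])
    weights:   {(parent, node): regression coefficient}
    noise_var: {node: variance of the node's regression error}
    """
    total = 0.0
    for node, x in values.items():
        # Conditional mean of the node given its parents.
        mean = sum(weights[(p, node)] * values[p] for p in parents[node])
        total += gaussian_logpdf(x, mean, noise_var[node])
    return total

# Hypothetical two-node network a -> b with x_b = 0.5 * x_a + noise.
values = {"a": 0.3, "b": -1.2}
parents = {"a": [], "b": ["a"]}
weights = {("a", "b"): 0.5}
noise_var = {"a": 1.0, "b": 1.0}
log_f = joint_logpdf(values, parents, weights, noise_var)
```

For this toy network the result agrees with the bivariate Gaussian density whose covariance matrix is [[1, 0.5], [0.5, 1.25]], which is what the factorization predicts.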
Footnote 3: BNs graphically encode conditional independence according to the Markov properties: if two sets of nodes $A$ and $B$ are d-separated by a third set of nodes $C$ according to the DAG, and the three sets of nodes are disjoint, then $x_A$ and $x_B$ are conditionally independent given $x_C$ [14], i.e. $P(x_A, x_B \mid x_C) = P(x_A \mid x_C) P(x_B \mid x_C)$. The d-separation is a complex concept. For its exact definition, please refer to [14].

A multi-channel stochastic process can be modelled with a BN of $C \times T$ nodes, where $C$ denotes the number of channels, $T$ denotes the number of time points, and each node represents the signal of a channel at one time point. Because the future can influence neither the present nor the past, nodes at time $t+1$ can have only nodes after time $t$ as their children. BNs that account for all time points will become intractably large, so first-order Markov and stationarity assumptions are usually applied. In this case, the same dependence relationships repeat time after time and signals at $t$ only depend on signals from $t-1$ to $t$, so the whole network can be “rolled up” as its DBN representation, a DAG composed of only nodes from $t-1$ to $t$, as Fig. 3.2 illustrates.

We modelled fMRI signals with first-order Gaussian DBNs by regarding the signals of ROIs as a multi-channel stochastic Gaussian process. Gaussian DBNs, whose conditional probability distributions are all Gaussian, are limited to modelling linear relationships between random variables, while discrete DBNs are capable of modelling non-linear relationships, but at the cost of precision and accuracy. Gaussian DBNs, whose parameters explicitly indicate the strength of interactions between ROIs, also facilitate quantitative comparison across groups. To accommodate nonlinear responses of brain activities to the task frequencies, we represented the bulb-squeezing frequency level (see Section 3.2.1) as an extra categorical input node $L$ in the network.
If node $L$ is a parent of node $a$, $x_a$ is regressed on its other parent variables conditionally on the frequency level $x_L$, as in Eq. (3.2):

$$x_a = \sum_{b \in pa[a] \setminus \{L\}} \beta_{ba|l}\, x_b + \epsilon_{a|l}, \tag{3.2}$$

where $x_a$ and $x_b$ are the signals of nodes $a$ and $b$, respectively, $\beta_{ba|l}$ is the connection coefficient from node $b$ to node $a$ conditional on the frequency level $x_L = l$, and $\epsilon_{a|l}$, which follows a zero-mean Gaussian distribution, is the regression error conditional on $x_L = l$. If node $L$ is not a parent node of $a$, $\beta_{ba|l}$ takes the same value at all four frequency levels. If there is no arrow from $b$ to $a$, the connection coefficients are set to zero, i.e. $\beta_{ba|l} = 0$.

We imposed on the DBN structures a constraint that there must be at least one path from the input node $L$ to every ROI node. The constraint was imposed to ensure that the response of the ROIs to the bulb-squeezing task was accommodated. It is worth mentioning that in preliminary studies we had relaxed the constraint to require only a path from the input node to at least one ROI, but all the structures learned under the relaxed constraint also satisfied the more rigorous one.

3.2.3 Learning DBNs

Because many different DBNs may fit the same data almost equally well, we mixed them together according to their posterior probabilities, to provide a more stable approximation to the data than a single best DBN does. The mixture formula is $M = \sum_i P(M_i \mid X) M_i$, where $M$ is the mixed model, $M_i$ is a possible DBN and $P(M_i \mid X)$ is its posterior probability given data $X$. We used its mixed structure and its mixed coefficients to profile the mixture model, which was composed of thousands of DBNs.

The mixed structure is encoded as a matrix $G = \{g_{ab}\}$ whose element $g_{ab}$ is the posterior probability of the existence of the connection from $a$ to $b$, defined as Eq. (3.3):

$$g_{ab} = \sum_i P(M_i \mid X)\, g_{ab}(i), \tag{3.3}$$

where the binary variable $g_{ab}(i)$ indicates whether the connection from $a$ to $b$ appears in model $M_i$ or not.
The mixed coefficients $B = \{\beta_{ab|l}\}$ are the posterior expected values of the connection coefficients, defined as Eq. (3.4):

$$\beta_{ab|l} = \sum_i P(M_i \mid X)\, \beta_{ab|l}(i), \tag{3.4}$$

where $\beta_{ab|l}(i)$ is the coefficient of model $M_i$.

The mixed structure and the mixed coefficients were estimated with the Bayesian-information-criterion (BIC) scores [30] and Markov chain Monte Carlo (MCMC) [20]. The BIC score is defined as Eq. (3.5), where $N$ denotes the sample size of data $X$ and $K$ denotes the number of free parameters $\theta$ of DBN model $M$:

$$\mathrm{BIC}(M \mid X) = \sup_{\theta} \ln P(X \mid M, \theta) - 0.5 K \ln N. \tag{3.5}$$

If two models $M_1$ and $M_2$ have the same prior probability, i.e. $P(M_1) = P(M_2)$, and the prior distributions of their parameters $\theta$ are uniform distributions, then the ratio of their posterior probabilities can be asymptotically approximated as $\exp[\mathrm{BIC}(M_1) - \mathrm{BIC}(M_2)]$ (see footnote 4). In our implementation, 1500 different DBN structures were sampled with MCMC according to their relative posterior probabilities, i.e. $\exp[\mathrm{BIC}(M_i \mid X)]$, which was calculated when $M_i$ was sampled, and then all the sampled DBNs were averaged according to their appearance frequencies in MCMC, as the estimation of the mixture model $M$.

Based on the assumption that subjects perform the task with different patterns of brain connectivity, the IS approach learns a mixture DBN model individually for each subject. Thus each subject is associated with its own $G$ and $B$. The group mean structure (GMS) $\bar{G}$ and the group mean coefficients (GMC) $\bar{B}$ are the mean of $G$ and the mean of $B$ over all the subjects in the same group, respectively. The group-level BIC score of applying a combination of subject-specific DBNs $\mathcal{M} = \{M_1, \ldots, M_S\}$ (where $S$ is the number of subjects) respectively to the data of each subject is:

$$\mathrm{BIC}(\mathcal{M} \mid X_1, \ldots, X_S) = \sum_{s=1}^{S} \mathrm{BIC}(M_s \mid X_s) = \sum_{s=1}^{S} \left[ \sup_{\theta_s} \ln P(X_s \mid M_s, \theta_s) - 0.5 K_s \ln N_s \right], \tag{3.6}$$

where all the notations are similar to those in Eq. (3.5), except that the subscript $s$ indicates that they are associated with subject $s$.
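The mixing in Eqs. (3.3) and (3.4) can be illustrated on a toy set of enumerated models. The sketch below is a simplification: the thesis averages MCMC samples by their appearance frequencies, whereas here the posterior weights are computed directly as normalized $\exp[\mathrm{BIC}]$ values under the equal-prior assumption stated above. The three models, their scores and the function names are hypothetical.

```python
import numpy as np

def posterior_weights(bic_scores):
    """Weights proportional to exp(BIC), assuming equal model priors."""
    b = np.asarray(bic_scores, dtype=float)
    w = np.exp(b - b.max())          # subtract the maximum for numerical stability
    return w / w.sum()

def mix_models(bic_scores, edges, coefs):
    """Mixed structure G (Eq. 3.3) and mixed coefficients B (Eq. 3.4).

    edges[i, a, b] is 1 if model i contains the connection a -> b, else 0.
    coefs[i, a, b] is that connection's coefficient in model i (0 if absent).
    """
    w = posterior_weights(bic_scores)
    G = np.tensordot(w, np.asarray(edges, dtype=float), axes=1)
    B = np.tensordot(w, np.asarray(coefs, dtype=float), axes=1)
    return G, B

# Three hypothetical models over two ROIs (entries are lagged connections).
bic_scores = [-100.0, -100.0, -105.0]
edges = np.array([[[0, 1], [0, 0]],
                  [[0, 1], [1, 0]],
                  [[0, 0], [0, 0]]])
coefs = np.array([[[0.0, 0.6], [0.0, 0.0]],
                  [[0.0, 0.4], [0.2, 0.0]],
                  [[0.0, 0.0], [0.0, 0.0]]])
G, B = mix_models(bic_scores, edges, coefs)
```

Because a connection absent from a model contributes a zero coefficient, an entry of the mixed coefficients shrinks toward zero as the posterior probability of the corresponding connection falls.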
Footnote 4: For a rigorous proof, please refer to Schwarz [30].

Based on the assumption that subjects perform the task with the same pattern of brain connectivity but differ in the details of the interaction between brain regions, the CS approach applies DBN models with the same network structure to the data of every subject in the same group, but optimizes the connection coefficients individually for each subject. Every subject is associated with the same $G$, and each subject is associated with its own $B$. The GMS $\bar{G}$ equals $G$, and the GMC $\bar{B}$ is the mean of $B$ over all the subjects in the same group. The group-level BIC score of applying a DBN $M$ to the group data in the manner of the CS approach is the sum of its BIC scores over the data of each subject:

$$\mathrm{BIC}(M \mid X_1, \ldots, X_S) = \sum_{s=1}^{S} \left[ \sup_{\theta_s} \ln P(X_s \mid M, \theta_s) - 0.5 K \ln N_s \right], \tag{3.7}$$

where $K$ is without the subscript $s$ because DBNs with the same structure have the same number of parameters.

Based on the assumption that inter-subject variability is so small that it can be neglected, the VTS approach pools together all subjects’ data and learns a mixture model for the whole group. Every subject is associated with the same $G$ and $B$, which are therefore also the GMS $\bar{G}$ and the GMC $\bar{B}$, respectively. Because the data are pooled together, the group-level BIC score of a model $M$ is

$$\mathrm{BIC}(M \mid X_1, \ldots, X_S) = \sup_{\theta} \left[ \sum_{s=1}^{S} \ln P(X_s \mid M, \theta) \right] - 0.5 K \ln\left( \sum_{s=1}^{S} N_s \right). \tag{3.8}$$

3.2.4 Comparison of Three Approaches

The IS, VTS and CS approaches were compared in terms of goodness of fit, the similarity between their models at the group level, and sensitivity to the effect of the L-dopa medication. First, their best group-level BIC scores found in the MCMC sampling were compared. As a criterion of model selection, the greater the BIC score is, the better the model fits the data (see Section 3.2.3).
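The difference between the summed scores of Eqs. (3.6) and (3.7) and the pooled score of Eq. (3.8) can be seen on a single regression. The sketch below is an illustration under simplifying assumptions: one child node with one parent, Gaussian noise, and synthetic data for 10 subjects with 130 time points each; $K$ is taken as the number of coefficients plus one noise variance. All names and numbers are hypothetical.

```python
import numpy as np

def max_loglik(y, X):
    """Maximised Gaussian log-likelihood of y regressed on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / y.size            # ML estimate of the noise variance
    return -0.5 * y.size * (np.log(2.0 * np.pi * sigma2) + 1.0)

def bic(y, X):
    """Eq. (3.5) with K = number of coefficients + 1 (noise variance)."""
    k = X.shape[1] + 1
    return max_loglik(y, X) - 0.5 * k * np.log(y.size)

def group_bic_separate(ys, Xs):
    """Eqs. (3.6)/(3.7): parameters fitted per subject, scores summed.
    Under CS every X has the same columns; under IS the columns may differ."""
    return sum(bic(y, X) for y, X in zip(ys, Xs))

def group_bic_pooled(ys, Xs):
    """Eq. (3.8): all subjects pooled, one parameter set for the group."""
    return bic(np.concatenate(ys), np.vstack(Xs))

def simulate(coefs, n=130, seed=0):
    """One parent signal and one child signal per subject."""
    rng = np.random.default_rng(seed)
    Xs = [rng.standard_normal((n, 1)) for _ in coefs]
    ys = [c * X[:, 0] + rng.standard_normal(n) for c, X in zip(coefs, Xs)]
    return ys, Xs

ys_var, Xs_var = simulate([0.8, -0.8] * 5)     # coefficients differ by subject
ys_same, Xs_same = simulate([0.8] * 10)        # identical coefficients

print(group_bic_separate(ys_var, Xs_var), group_bic_pooled(ys_var, Xs_var))
print(group_bic_separate(ys_same, Xs_same), group_bic_pooled(ys_same, Xs_same))
```

When coefficients vary across subjects, the per-subject fit gains enough likelihood to outweigh its larger penalty; when subjects share the same coefficients, pooling wins because it pays the penalty for $K$ parameters only once.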
More parameters do not guarantee a model a larger BIC score because the penalty term $0.5 K \ln N$ in Eq. (3.5) punishes models with redundant parameters. Group-level BIC scores of the IS approach, the CS approach and the VTS approach are defined in Eqs. (3.6), (3.7) and (3.8), respectively.

To examine the feasibility of using the best BIC scores for selecting among the IS, VTS and CS approaches, we carried out simulation studies to investigate the following question: can the true underlying generating models be correctly identified by the highest BIC scores? In our simulation, similar to the setting of our real fMRI data set, fMRI time courses of the 6 ROIs at 130 time points were simulated for 10 subjects from mixture DBNs derived from the real fMRI data set. We learned mixture DBN models from the real fMRI data of a group of 10 subjects (e.g. PD patients before medication) with the IS, VTS and CS approaches respectively. Then the learned mixture models were used as true models of the IS, VTS and CS approaches respectively to generate the simulation data.

Different approaches may fit the data with different BIC scores, but it is still possible for them to yield similar network structures and parameters at the group level. To inspect whether the choice of the analysis approach impacts the group-level model, we compared the GMSs $\bar{G}$ or GMCs $\bar{B}$ across approaches. For each of the three comparisons (IS vs. CS, IS vs. VTS and CS vs. VTS), we plotted the GMSs or GMCs element-wise across the two approaches under comparison. Because an element of $\bar{B}$ will become zero when its corresponding element of $\bar{G}$ is zero, an element of $\bar{B}$ is plotted only if its corresponding element of $\bar{G}$ is larger than 1%, to eliminate the information that has already been presented in the comparison of GMSs. If the dots on the plotted graphs are located roughly along the diagonal line, the two approaches yield similar network structures or parameters at the group level.
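The selection rule for the coefficient plots can be sketched with a boolean mask. The text does not say whether the 1% threshold is applied to the structure of one approach or of both, so the sketch below assumes both; the function name and the toy matrices are hypothetical.

```python
import numpy as np

def comparable_coefficients(G_a, B_a, G_b, B_b, threshold=0.01):
    """Pairs of group mean coefficients to plot for two approaches.

    ASSUMPTION: a connection is kept only if its group-mean posterior
    probability exceeds the threshold under both approaches.
    """
    mask = (G_a > threshold) & (G_b > threshold)
    return B_a[mask], B_b[mask]

# Toy 2 x 2 group mean structures and coefficients for two approaches.
G_is = np.array([[0.90, 0.005], [0.50, 0.00]])
G_cs = np.array([[0.80, 0.600], [0.02, 0.00]])
B_is = np.array([[0.70, 0.001], [0.30, 0.00]])
B_cs = np.array([[0.65, 0.400], [0.01, 0.00]])

x, y = comparable_coefficients(G_is, B_is, G_cs, B_cs)
# Points (x[i], y[i]) close to the diagonal indicate similar coefficients.
```

Plotting `x` against `y` for each pair of approaches gives the kind of element-wise scatter described above.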
As a statistical metric, a BIC score alone cannot indicate whether a model approximates the underlying biomedical truth well, so the sensitivity of the three approaches to the possible effect of the L-dopa medication was also compared. Because it has been demonstrated that disturbances of rhythmic movement in PD [25] are related to bradykinesia (slowness of movement) and L-dopa is most effective against bradykinesia [10], we assume that the L-dopa medication will generally normalize PD patients’ brain connectivity in this frequency-related bulb-squeezing task. The overall effect of the medication was verified by checking whether the mixture DBN model of each patient before medication (denoted as the Ppre group) changed toward those of the normal group (denoted as the N group) after the patient took the medication (denoted as the Ppost group).

At the structure level, a connection is considered to have been changed toward the normal group if elements of $G_{Ps_{\mathrm{post}}} - G_{Ps_{\mathrm{pre}}}$ and $\bar{G}_N - G_{Ps_{\mathrm{pre}}}$ (where subscripts “$Ps_{\mathrm{pre}}$” and “$Ps_{\mathrm{post}}$” indicate that the comparison is about patient $s$) have the same signs, or equivalently if elements of

$$G'_s = (G_{Ps_{\mathrm{post}}} - G_{Ps_{\mathrm{pre}}}) \circ \mathrm{sign}(\bar{G}_N - G_{Ps_{\mathrm{pre}}}) \tag{3.9}$$

are positive, where “$\circ$” denotes element-wise products. At the parameter level, a connection was considered to have been changed toward the normal group if elements of $B_{Ps_{\mathrm{post}}} - B_{Ps_{\mathrm{pre}}}$ and $\bar{B}_N - B_{Ps_{\mathrm{pre}}}$ had the same sign, or equivalently if elements of

$$B'_s = (B_{Ps_{\mathrm{post}}} - B_{Ps_{\mathrm{pre}}}) \circ \mathrm{sign}(\bar{B}_N - B_{Ps_{\mathrm{pre}}}) \tag{3.10}$$

are positive. To summarize this patient-matched comparison at the group level, we averaged $G'_s$ and $B'_s$ over all the patients as $G' = \sum_{s=1}^{S} G'_s / S$ and $B' = \sum_{s=1}^{S} B'_s / S$ respectively. If an element of $G'$ or $B'$ is positive, we call the corresponding connection a “normalized connection” at the structure or the parameter level respectively. The same patient performed the task twice, once before and once after the medication, so the comparison is patient-matched. Because the normal group consists of healthy people who cannot be individually matched with patients, its group means ($\bar{G}_N$ and $\bar{B}_N$) are used to represent healthy people. The signs of $G'_s$ and $B'_s$ are standardized relative to the signs of $\bar{G}_N - G_{Ps_{\mathrm{pre}}}$ and $\bar{B}_N - B_{Ps_{\mathrm{pre}}}$ respectively, so that positive values indicate that the change is toward that of healthy people.

Based on the “normalized connections”, we performed tests between the two hypotheses below:

H0: medication does not functionally change brain connectivity in PD overall.

H1: medication changes the PD subjects’ connections between brain regions functionally toward those of normal subjects.

The probability of a connection appearing to be “normalized” by the medication is parametrized as $\gamma$. The value of $\gamma$ equals 0.5 under H0, due to the effect of randomness; $\gamma$ is expected to be greater than 0.5 under H1. If $m$ out of $n$ connections appear to be normalized by medication, the right-tailed p-value of this observation under H0 is $\sum_{i=m}^{n} C_n^i / 2^n$.

To further assess the similarity/diversity of the DBN structures learned with the IS approach, the variance of a connection’s posterior probabilities, i.e. $\mathrm{Var}[g_{ab}]$, was calculated among subjects’ mixture DBNs. A small $\mathrm{Var}[g_{ab}]$ suggests that the connection from ROI $a$ to ROI $b$ consistently appears (or does not appear) in subjects’ mixture DBNs; a large variance suggests that its appearance differs considerably among subjects’ mixture DBNs. If a large portion of connections have small $\mathrm{Var}[g_{ab}]$, then the DBNs learned individually for each subject are structurally consistent; if a large portion of connections have large $\mathrm{Var}[g_{ab}]$, then the DBNs differ structurally among subjects. Please note that the upper bound of the variance is 0.5, not 1, because posterior probabilities are in the range of [0, 1].
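The normalization index of Eqs. (3.9) and (3.10) and the right-tailed p-value under H0 can both be computed in a few lines. The sketch below is a minimal illustration: the function names and the toy values for a single patient are hypothetical, while the counts 59 and 66 are those reported later in Table 3.3.

```python
from math import comb
import numpy as np

def normalization_index(pre, post, normal_mean):
    """Eqs. (3.9)/(3.10): positive entries moved toward the normal group."""
    return (post - pre) * np.sign(normal_mean - pre)

def right_tailed_p(m, n):
    """P(at least m of n connections appear normalized) when gamma = 0.5."""
    return sum(comb(n, i) for i in range(m, n + 1)) / 2 ** n

# Toy posterior probabilities of two connections for one patient.
pre = np.array([0.2, 0.8])
post = np.array([0.6, 0.9])
normal_mean = np.array([0.9, 0.1])
index = normalization_index(pre, post, normal_mean)
# First connection moved toward the normal group, second moved away from it.

p_59_of_66 = right_tailed_p(59, 66)
print(index, p_59_of_66)
```

Averaging such index matrices over patients and counting the positive entries gives the $m$ that enters the binomial tail; for 59 of 66 the tail probability is on the order of $10^{-11}$.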
3.2.5 Group Comparison

The group analysis is to identify those connections that appear to be statistically significantly normalized by the medication. Though a connection appears to be normalized if its corresponding element in $B'$ is positive, the statistical significance of the normalization still needs to be tested. In this section, we employed mixed-effect models to verify whether elements of $B'$ are significantly positive.

Because the data of the normal group were best fitted with the VTS approach and the data of the patient groups were best fitted with the IS approach (see Table 3.2), we applied the $\bar{B}_N$ of the VTS approach and the $B_{Ps_{\mathrm{pre}}}$ and $B_{Ps_{\mathrm{post}}}$ of the IS approach to Eq. (3.10) to calculate $B'$.

An element $\beta'_{ab|l}$ of $B'$ is associated with a connection from $a$ to $b$ and a frequency level $l$. When the connection from $a$ to $b$ is under investigation, $\beta'_{ab|l}$ from all the patients and at all the frequency levels can be regressed on the frequency level as in Eq. (3.11),

$$\beta'_{ab} = c + \mathrm{Freq} + e, \tag{3.11}$$

where $\beta'_{ab}$ is a vector of $\beta'_{ab|l}$, $c$ is a constant, and the vector $e$ contains the regression errors, which follow a zero-mean multivariate Gaussian distribution. $\beta'_{ab}$ can also be regressed on just a constant, as in Eq. (3.12) without the term Freq, if the effect of the medication does not differ at different task frequencies:

$$\beta'_{ab} = c + e. \tag{3.12}$$

To accommodate the variability among the randomly sampled subjects as well as the noise in the fMRI of each subject, the variance of $e$ is composed of two parts, $V = V_a + V_w$, which are the variance among and within subjects respectively, in the manner of the summary statistics approach. Since the variance term includes both among-subject and within-subject terms, this is a mixed-effect model that can infer about the population, rather than just the pool of the recruited subjects. $V_a$ has the pattern of $\sigma^2 I$, under the assumption that subjects are independently sampled from the population.
$V_w$ is a block-diagonal matrix whose diagonal blocks are variances within subjects. The diagonal blocks of $V_w$ are estimated at the stage of calculating $\beta_{ab|l}$, as $V_{w_s} = V(\beta_{Ps_{\mathrm{pre}}}) + V(\beta_{Ps_{\mathrm{post}}})$, where the subscript $s$ indicates that the block is associated with patient $s$. Eqs. (3.11) and (3.12) were solved with the restricted maximum likelihood method [2] to ensure an unbiased estimation of $V$.

The analysis includes three steps. First, the nested models Eq. (3.11) and Eq. (3.12) are compared with a likelihood-ratio test, to test whether the medication altered the connections differently for tasks at different frequencies. If Eq. (3.11) fits the data significantly better than Eq. (3.12) does, then the medication has different effects for tasks at different frequencies. However, significant results were not discovered for any connection (see Section 3.3), so all the subsequent analyses were performed with Eq. (3.12). Second, whether $c$ is equal to or significantly above zero is tested with right-tailed t-tests. A p-value was calculated for each possible connection in the DBN. Finally, the effect of testing multiple connections simultaneously was adjusted by Storey’s positive-false-discovery-rate (pFDR) procedure [32], which controls the pFDR (characterized by a q-value), the expected ratio of falsely rejected hypotheses among all those being rejected. Connections whose q-values were smaller than 5% were selected, and regarded as having been statistically significantly normalized by the medication.

3.3 Results

Table 3.1 shows the group-level BIC scores of the simulated data set. The generating models were correctly identified with the highest BIC scores. Table 3.2 shows the group-level BIC scores of the real data set. For the normal group, the VTS approach yielded the highest group-level BIC score; for the pre-medication and post-medication patient groups, the IS approach yielded the highest group-level BIC scores. The best fitting approaches outperformed the second best fitting approaches by at least 190 BIC units, which implies that if the conditions of the BIC’s asymptotic approximation to the log posterior probability are satisfied, then the model of the best fitting approach is $e^{190}$ times more likely than that of the second best fitting approach. Though the conditions of the BIC’s asymptotic approximation may not be guaranteed in this study, the sizeable contrasts among the BIC scores still suggest that the goodness-of-fit of the three approaches is considerably different.

Table 3.1: Group-level BIC Scores of Simulated Data

                  Learning Approaches
True Models     IS          CS          VTS (Pool)
IS              -5.40e3*    -5.85e3     -6.47e3
CS              -5.42e3     -5.35e3*    -5.68e3
VTS             -6.27e3     -6.36e3     -5.63e3*

Group-level BIC scores of the IS approach, the CS approach and the VTS approach are defined in Eqs. (3.6), (3.7) and (3.8) respectively. The highest BIC score found by the three approaches in each row (bold in the original) is marked with an asterisk.

Table 3.2: Group-level BIC Scores of Real Data

                IS          CS          VTS (Pool)
N               -5.25e3     -5.38e3     -5.06e3*
Ppre            -4.99e3*    -5.21e3     -5.31e3
Ppost           -6.11e3*    -6.55e3     -6.39e3

Group-level BIC scores of the IS approach, the CS approach and the VTS approach are defined in Eqs. (3.6), (3.7) and (3.8) respectively. The highest BIC score found by the three approaches in each row (bold in the original) is marked with an asterisk. “N”, “Ppre” and “Ppost” denote the normal group, the pre-medication group and the post-medication group respectively.

Fig. 3.3 shows the element-wise comparison of the group mean structures learned with the three approaches. If two approaches learn similar networks at the group level, dots on the figure should be located close to the (0, 0) to (1, 1) diagonal lines. However, this phenomenon is not observed.
The connections’ posterior probabilities estimated by the CS approach tend to cluster at extreme values near 0 and 1, while those estimated by the IS approach cover the range from 0 to 1 more evenly.

[Figure 3.3: three scatter panels (CS vs. IS, VTS vs. IS, VTS vs. CS), both axes from 0 to 1, with series N, Ppre and Ppost.]

Figure 3.3: Comparisons of the group mean structures. The group-mean-structure matrices $\bar{G}$ of the IS, CS and VTS approaches are plotted element-wise. Plots of the three experimental groups are overlaid with different dot symbols. Legends “N”, “Ppre” and “Ppost” denote the normal group, the pre-medication group and the post-medication group respectively.

[Figure 3.4: three scatter panels (CS vs. IS, VTS vs. IS, VTS vs. CS), both axes from -1 to 1, with series N, Ppre and Ppost.]

Figure 3.4: Comparisons of the group mean coefficients. The group mean coefficients $\bar{B}$ of the IS, CS, and VTS approaches are plotted element-wise. An element of $\bar{B}$ is plotted only if its corresponding element of $\bar{G}$ is larger than 1%. Plots of the three experimental groups are overlaid with different dot symbols. Legends “N”, “Ppre” and “Ppost” denote the normal group, the pre-medication group and the post-medication group respectively.

Table 3.3: The counts of normalized connections

             IS                  CS                  VTS (Pool)
             Count     p         Count     p         Count     p
Struct.      59/66     0.0000    15/66     1.0000    29/66     0.8661
.00Hz        64/66     0.0000    27/66     0.9456    35/66     0.3561
.25Hz        62/66     0.0000    27/66     0.9456    34/66     0.4511
.50Hz        62/66     0.0000    27/66     0.9456    35/66     0.3561
.75Hz        64/66     0.0000    27/66     0.9456    34/66     0.4511
Mean         63/66     0.0000    27/66     0.9456    34/66     0.4511

The format of counts is “the number of normalized connections / the number of possible connections”. In row “Struct.” are the counts of positive elements in $G'$; in rows from “.00Hz” to “.75Hz” are the counts of positive elements in $B'$, given the task frequency level; in row “Mean” are the counts of positive elements in $B'$, with the elements at the four frequency levels averaged.

Fig. 3.4 shows the element-wise comparison of the group mean parameters learned with the three approaches. To remove the information that Fig. 3.3 has already shown, an element of $\bar{B}$ is plotted only if its corresponding element of $\bar{G}$ is larger than 1%. Dots are located roughly along the (-1, -1) to (1, 1) diagonal lines, which suggests that the three approaches tend to learn similar coefficients at the group level for those connections that always appear in the learned networks no matter which approach is applied.

Table 3.3 shows the number of normalized connections detected with the three approaches. The IS approach detected more normalized connections than the other two approaches. It found that at least 59 connections out of all the 66 possible ones were normalized after the medication. The p-value of this observation is smaller than $1.193 \times 10^{-11}$, under the hypothesis that the medication does not improve patients’ brain connectivity overall.

Fig. 3.5 shows the histogram of $\mathrm{Var}[g_{ab}]$, i.e. the variances of connections’ posterior probabilities among the mixture DBN models learned with the IS approach.
Given the fact that the variance of posterior probabilities cannot exceed 0.5, a large portion of connections vary considerably among the networks of different subjects, which suggests that different subjects’ networks learned with the IS approach are not consistent. Even though subjects of the normal group probably share the same brain connectivity pattern, which is suggested by the highest BIC score learned with the VTS approach, networks still vary considerably from subject to subject when the IS approach is applied.

[Figure 3.5: histogram with bars for N, Ppre and Ppost; horizontal axis from −0.1 to 0.6, vertical axis from 0 to 35.]

Figure 3.5: The histogram of the variances of connections’ posterior probabilities estimated with the IS approach. The posterior probability of a connection’s existence was estimated individually for each subject, and the variance of a connection’s posterior probabilities among different subjects was then calculated. This figure shows the histogram of the posterior-probability variances of the connections. Legends “N”, “Ppre” and “Ppost” denote the normal group, the pre-medication group and the post-medication group respectively.

Table 3.4: Connections whose coefficients were statistically significantly normalized toward those of the normal group.

Connection         t-stat   d.f.   p        q
R CER ↦ R CER      2.5828   18     0.0094   0.0424
R CER ↦ L CER      4.1458   15     0.0004   0.0065
R M1 ↦ R M1        3.0563   18     0.0034   0.0275
R SMA ↦ R SMA      2.6912   27     0.0060   0.0362
L CER ↦ R CER      6.6237   15     0.0000   0.0001
L CER ↦ L CER      2.6565   9      0.0131   0.0424
R CER → R M1       2.4417   12     0.0155   0.0424
R M1 → R CER       3.0219   18     0.0037   0.0275
R M1 → L M1        2.4717   12     0.0147   0.0424
R SMA → R CER      2.3685   18     0.0146   0.0424
L SMA → R SMA      2.4601   12     0.0150   0.0424

Arrows ↦ denote connections with a time lag; arrows → denote connections without a time lag. “d.f.” is short for “degrees of freedom”.

The likelihood-ratio tests show that Eq. (3.11) does not fit $\beta^1_{ab}$ significantly better than Eq. (3.12) does for any connection, with the minimum p-value larger than 0.0769, before correction for the effect of multiple comparisons. This result does not support the hypothesis that the medication altered the connections differently for tasks at different frequencies. This does not mean that the medication did not alter the connections, only that the alterations are probably the same at different frequency levels. The connections whose coefficients were significantly normalized according to the right-tailed t-test with Eq. (3.12) are listed in Table 3.4 and further graphically illustrated in Fig. 3.6. These results indicate that the right SMA, right M1 and left cerebellum, all part of the motor system expected to be activated with the left hand, were modulated by L-dopa medication.

[Figure 3.6: network diagram of the connections listed in Table 3.4.]

Figure 3.6: Connections whose coefficients were statistically significantly normalized toward those of the normal group. Solid arrows denote connections with a time lag; line arrows denote those without a time lag.

3.4 Discussion

The comparison of the three DBN-based group-analysis approaches (the “virtual-typical-subject” (VTS), “common-structure” (CS) and “individual-structure” (IS) approaches) suggests that no single approach is universally superior to the other two. For this fMRI study on Parkinson’s disease that included a motor task at different frequencies, the VTS approach fits the data of the normal group best, while the IS approach fits the data of the pre-medication and post-medication patient groups best, from the statistical point of view with the BIC score as the goodness-of-fit metric (see Table 3.2).

The three approaches led to considerably different group-level results, learning different network structures and detecting different numbers of connections normalized by the medication, from the same data set. Fig.
3.3 shows that the probability of a connection’s existence may be estimated to be quite high with one approach, but quite low with another. The overall effect of the L-dopa medication on the brain connectivity of Parkinson’s disease patients is also estimated quite differently with the three approaches, with the proportion of normalized connections varying from 15/66 to 64/66, as shown in Table 3.3. The results of the IS approach support the assumption that the L-dopa medication normalizes the patients’ brain connectivity better than those of the other two approaches do. The regions showing normalized activity after L-dopa medication are consistent with motor tasks involving the left hand. The sizeable differences among the results of the three approaches suggest that the choice among them influences the analysis considerably. Analysts should not arbitrarily choose an approach without justification. However, as Fig. 3.4 shows, the three approaches tend to learn similar coefficients at the group level for those connections that always appear in the learned networks no matter which approach is applied.

If the IS approach is applied, the networks tend to vary considerably from subject to subject, even when statistical metrics support that the underlying true models probably share the same structure. For the normal group in this study, the highest BIC score was obtained with the VTS approach, suggesting that the data of those normal people are more similar than dissimilar. However, the networks learned with the IS approach still vary considerably from subject to subject: Fig. 3.5 shows sizeable variances of connections’ posterior probabilities among the individual DBN models. The reason behind this phenomenon is probably that the limited number of time points in each individual’s data does not support an accurate and robust estimation of the true model.
The inconsistency among individual networks is not effective evidence of inter-subject variability, if only the IS approach is applied.

Each of the three approaches has its limitations in practice. The IS approach does not take advantage of the similarity among group members, so when each subject’s data are not long enough, it tends to learn non-robust networks for each subject, as mentioned above with Fig. 3.5. The VTS approach does not accommodate differences among subjects, but inter-subject variability is believed to be a common factor in biomedical studies. As shown in Table 3.3, ignoring the differences among patients may limit the analysis’ sensitivity to the effect of the L-dopa medication. The CS approach balances the similarity and diversity among group members at the point where the connectivity structure is the same for all the subjects but the details of the connectivity differ. However, it is not sensitive to the effect of the L-dopa medication either, as shown in Table 3.3. It is possible that the subjects’ connectivity networks are the same at some connections but differ at others, in which case none of the three approaches is suitable.

It is ultimately desirable to develop a method that allows researchers to adjust the degree of the balance between the similarity and diversity among group members. The method proposed by Niculescu-Mizil and Caruana [22] is a possible solution. It allows network structures to differ while penalizing excessive diversity among the structures, and it controls the degree of the penalty with an adjustable parameter. Detection of sub-groups and outliers can also improve group analysis when inter-subject variability cannot be ignored [11].

Group-analysis approaches should be selected according to comprehensive criteria, such as both statistical and biomedical evidence.
In exploratory research where the ground truth is usually unknown, statistical evidence is helpful in the selection of group-analysis methods. Besides the BIC score used in the paper, the model evidence calculated with MCMC is a more accurate metric, at the cost of intensive computation, since the BIC score, which asymptotically approximates the log posterior probability, is inaccurate when the sample size is small. Other goodness-of-fit metrics, such as the Dirichlet Prior Score Metric (DPSM) and the Bayesian Dirichlet metrics (BDe), can also be used [37]. The confidence of the statistical evidence can be verified with data simulated from the learned models. The data are generated from the models learned from the real data so that the simulated data are as similar to the real data as possible. However, statistical evidence alone is not sufficient for making decisions, especially in biomedical studies, if the heterogeneity of group members is not considered from other aspects such as age and gender. If possible, researchers can design biomedical markers and select group-analysis approaches according to their ability to detect those markers, as we did with the assumed effect of the L-dopa medication in this study.

While the author of this thesis was finalizing the writing, he noticed a valuable reflection by Robert L. Savoy on current fMRI-based research that is closely related to group analysis. In [29], Savoy cautioned against the current torrent of studies that use a large number of subjects to conduct group analysis, and recommended focusing instead on studying a small number of subjects. From the negative aspect, he argued, first, that due to possible misregistration it is problematic to conform different brains to a standard shape, and second, that group averaging can be misleading when outliers exist.
From the positive aspect, he pointed out, first, that some interesting discoveries actually relied on the finding of exceptional subjects, and second, that some fMRI work, such as that in presurgical planning, is inherently just for a single subject. He recommended thoroughly studying a few subjects with various experimental factors, such as gender or education background. He concluded “it may someday turn out that the information from a few brains, thoroughly studied, will reveal more about the universal aspects of human brain function and organization than the current torrent of studies from large collections of brains.” [29]

Acknowledgments

This work was supported by NSERC Grant CHRPJ 323602-06 (MJM) and CIHR grant CPN-80080 (MJM).

Bibliography

[1] D. H. Brainard. The psychophysics toolbox. Spat Vis, 10(4):433–436, 1997.
[2] R. R. Corbeil and S. R. Searle. A comparison of variance component estimators. Biometrics, 32(4):779–791, Dec. 1976. ISSN 0006341x.
[3] Gerald M. Edelman and Joseph A. Gally. Degeneracy and complexity in biological systems. PNAS, 98(24):13763–13768, 2001.
[4] Richard S.J. Frackowiak, John T. Ashburner, William D. Penny, and Semir Zeki. Human Brain Function, chapter 48 and 50. Academic Press, 2nd edition, 2004.
[5] K. J. Friston, L. Harrison, and W. Penny. Dynamic causal modelling. NeuroImage, 19(4):1273–1302, August 2003.
[6] Karl J. Friston. Functional and effective connectivity in neuroimaging: A synthesis. Human Brain Mapping, 2(1-2):56–78, 1994.
[7] Miguel S. Goncalves, Deborah A. Hall, Ingrid S. Johnsrude, and Mark P. Haggard. Can meaningful effective connectivities be obtained between auditory cortical regions? NeuroImage, 14(6):1353–1360, December 2001.
[8] L. Harrison, W. D. Penny, and K. Friston. Multivariate autoregressive modeling of fMRI time series. NeuroImage, 19(4):1477–1491, August 2003.
[9] M. M. Hoehn and M. D. Yahr. Parkinsonism: onset, progression and mortality. Neurology, 17(5):427–442, May 1967.
[10] Joseph Jankovic and Eduardo Tolosa, editors. Parkinson’s Disease and Movement Disorders. Williams & Wilkins, 4th edition, 2002.
[11] Ferath Kherif, Jean-Baptiste Poline, Sebastien Meriaux, Habib Benali, Guillaume Flandin, and Matthew Brett. Group analysis in functional neuroimaging: selecting subjects using similarity measures. NeuroImage, 20(4):2197–2208, December 2003.
[12] Jieun Kim, Wei Zhu, Linda Chang, Peter M. Bentler, and Thomas Ernst. Unified structural equation modeling approach for the analysis of multisubject, multivariate functional MRI data. Human Brain Mapping, 28(2):85–93, 2007.
[13] Steffen L. Lauritzen. Graphical models, volume 17. Clarendon Press, Oxford University Press, Oxford, New York, 1996.
[14] Steffen L. Lauritzen. Graphical Models, chapter 3.2.2, pages 46–52. Clarendon Press, Oxford University Press, 1996.
[15] Junning Li, Z.J. Wang, and M.J. McKeown. A multi-subject dynamic Bayesian network (DBN) framework for brain effective connectivity. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 1, pages I–429–I–432, 2007.
[16] Rui Liao, Jeffrey L Krolik, and Martin J McKeown. An information-theoretic criterion for intrasubject alignment of FMRI time series: motion corrected independent component analysis. IEEE Trans Med Imaging, 24(1):29–44, Jan 2005.
[17] Rui Liao, Martin J McKeown, and Jeffrey L Krolik. Isolation and minimization of head motion-induced signal variations in fMRI data using independent component analysis. Magn Reson Med, 55(6):1396–1413, Jun 2006. doi: 10.1002/mrm.20893.
[18] A. R. McIntosh and F. Gonzalez-Lima. Structural equation modeling and its application to network analysis in functional brain imaging. Human Brain Mapping, 2:2–22, 1994.
[19] Andrea Mechelli, Will D. Penny, Cathy J. Price, Darren R. Gitelman, and Karl J. Friston. Effective connectivity and intersubject variability: Using a multisubject network to test differences and commonalities. NeuroImage, 17(3):1459–1469, 11 2002.
[20] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087–1092, 1953.
[21] Kevin Patrick Murphy. Dynamic Bayesian networks: representation, inference and learning. PhD thesis, University of California, Berkeley, 2002.
[22] Alexandru Niculescu-Mizil and Rich Caruana. Inductive transfer for Bayesian network structure learning. In Proceedings of the 11th International Conference on AI and Statistics (AISTATS ’07), 2007.
[23] D. G. Pelli. The videotoolbox software for visual psychophysics: transforming numbers into movies. Spat Vis, 10(4):437–442, 1997.
[24] W. Penny, Z. Ghahramani, and K. Friston. Bilinear dynamical systems. Phil. Trans. R. Soc. B, 360(1457):983–993, May 2005.
[25] Paul A Pope, Peter Praamstra, and Alan M Wing. Force and time control in the production of rhythmic movement sequences in Parkinson’s disease. Eur J Neurosci, 23(6):1643–1650, Mar 2006. doi: 10.1111/j.1460-9568.2006.04677.x.
[26] Cathy J. Price and Karl J. Friston. Degeneracy and cognitive anatomy. Trends in Cognitive Sciences, 6(10):416–421, October 2002.
[27] Jagath C. Rajapakse and Juan Zhou. Learning effective brain connectivity with dynamic Bayesian networks. NeuroImage, 37:749–60, 2007. doi: 10.1016/j.neuroimage.2007.06.003.
[28] Chris Rorden and Matthew Brett. Stereotaxic display of brain lesions. Behav Neurol, 12(4):191–200, 2000.
[29] R.L. Savoy. Using small numbers of subjects in fMRI-based research. Engineering in Medicine and Biology Magazine, IEEE, 25(2):52–59, March–April 2006. ISSN 0739-5175.
[30] Gideon Schwarz. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, Mar. 1978. Short Communications.
[31] G. E. Stelmach, N. Teasdale, J. Phillips, and C. J. Worringham. Force production characteristics in Parkinson’s disease. Exp Brain Res, 76(1):165–172, 1989.
[32] John D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(3):479–498, 2002.
[33] Genichi Sugihara, Tatsuro Kaminaga, and Morihiro Sugishita. Interindividual uniformity and variety of the “writing center”: A functional MRI study. NeuroImage, 32(4):1837–1849, October 2006.
[34] J. Talairach and P. Tournoux. Co-Planar Stereotaxic Atlas of the Human Brain. Thieme Medical Publishers, 1988.
[35] Robert S Turner, Scott T Grafton, Anthony R McIntosh, Mahlon R DeLong, and John M Hoffman. The functional anatomy of Parkinsonian bradykinesia. NeuroImage, 19(1):163–179, May 2003.
[36] M. W. G. Vandenbroucke, R. Goekoop, E. J. J. Duschek, J. C. Netelenbos, J. P. A. Kuijer, F. Barkhof, Ph. Scheltens, and S. A. R. B. Rombouts. Interindividual differences of medial temporal lobe activation during encoding in an elderly population studied by fMRI. NeuroImage, 21(1):173–180, January 2004.
[37] Shulin Yang and Kuo-Chu Chang. Comparison of score metrics for Bayesian network learning. Systems, Man and Cybernetics, Part A, IEEE Transactions on, 32(3):419–428, 2002. ISSN 1083-4427.
[38] Xuebin Zheng and Jagath C. Rajapakse. Learning functional structure from fMR images. NeuroImage, 31(4):1601–1613, July 2006.

Chapter 4
Controlling the Error Rate in Learning Network Structures⁵

In real world applications, graphical statistical models are not only a tool for operations such as classification or prediction, but usually the network structures of the models themselves are also of great interest (e.g. in modeling brain connectivity). The false discovery rate (FDR), the expected ratio of falsely claimed connections to all those claimed, is often a reasonable error-rate criterion in these applications.
However, current learning algorithms for graphical models have not been adequately adapted to the concerns of the FDR. The traditional practice of controlling the type I error rate and the type II error rate under a conventional level does not necessarily keep the FDR low, especially in the case of sparse networks. We propose embedding an FDR-control procedure into the PC algorithm to curb the FDR of the skeleton of the learned graph. It is proved that the proposed method can control the FDR under user-specified levels at the limit of large sample sizes. In the cases of moderate sample size (about several hundred), empirical experiments show that the method is still able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR accurately around the user-specified level. The proposed method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models.

Footnote 5: A version of this chapter has been published. Junning Li and Z. Jane Wang (2009) Controlling the False Discovery Rate of the Association/Causality Structure Learned with the PC Algorithm. Journal of Machine Learning Research 10: 475–514.

4.1 Introduction

Graphical models have attracted increasing attention in the fields of data mining and machine learning in the last decade. These models, such as Bayesian networks (also called belief networks) and Markov random fields, generally represent events or random variables as vertices (also referred to as nodes), and encode conditional-independence relationships among the events or variables as directed or undirected edges (also referred to as arcs) according to the Markov properties [see 18, chapt. 3]. Of particular interest here are Bayesian networks [see 25, chapt.
3.3] that encode conditional-independence relationships according to the directed Markov property [see 18, pages 46–53] with directed acyclic graphs (DAGs) (i.e. graphs with only directed edges and with no directed cycles). The directed acyclic feature facilitates the computation of Bayesian networks because the joint probability can be factorized recursively into many local conditional probabilities.

As a fundamental and intuitive tool to analyze and visualize the association and/or causality relationships among multiple events, graphical models have been increasingly explored in biomedical research, such as discovering gene regulatory networks and modelling functional connectivity between brain regions. In these real world applications, graphical models are not only a tool for operations such as classification or prediction; often the network structures of the models themselves are also an output of great interest: a set of association and/or causality relationships discovered from experimental observations. For these applications, a desirable structure-learning method needs to account for the error rate of the graphical features of the discovered network. Thus, it is important for structure-learning algorithms to control the error rate of the association/causality relationships discovered from a limited number of observations closely below a user-specified level, in addition to finding a model that fits the data well. As edges are fundamental elements of a graph, error rates related to them are of natural concern.

In a statistical decision process, there are basically two sources of errors: the type I errors, i.e. falsely rejecting negative hypotheses when they are actually true; and the type II errors, i.e. falsely accepting negative hypotheses when their alternatives, the positive hypotheses, are actually true.
In the context of learning graph structures, a negative hypothesis could be that an edge does not exist in the graph, while the positive hypothesis could be that the edge does exist. Because of the stochastic nature of random sampling, data of a limited sample size may appear to support a positive hypothesis more than a negative hypothesis even when the negative hypothesis is actually true, or vice versa. Thus it is generally impossible to prevent both types of errors simultaneously; instead, one has to set a threshold on a certain type of error, or keep a balance between them, for instance by minimizing a loss function associated with the errors according to Bayesian decision theory. For example, when diagnosing cancer, to catch the potential chance of saving a patient’s life, doctors probably hope that the type II error rate, i.e. the probability of falsely diagnosing a cancer patient as healthy, is low, such as less than 5%. Meanwhile, when diagnosing a disease whose treatment is so risky that it may cause the loss of eyesight, to avoid the unnecessary but great risk for healthy people, doctors probably hope that the type I error rate, i.e. the probability of falsely diagnosing a healthy person as affected by the disease, is extremely low, such as less than 0.1%. Learning network structures may face scenarios similar to the two cases above of diagnosing diseases. Given data of a limited sample size, no algorithm can guarantee a perfect recovery of the structure of the underlying graphical model, and any algorithm has to compromise on the two types of errors.

For problems involving simultaneously testing multiple hypotheses, such as verifying the existence of edges in a graph, there are several different criteria for their error-rate control (see Table 4.2), depending on researchers’ concerns or the scenario of the study.
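To make the two error types concrete for graph skeletons, the following sketch compares a learned set of undirected edges with a known true set and reports the realized error rates. It is an illustration only: the representation of edges as vertex pairs and the function name `edge_error_rates` are assumptions introduced here, and the true graph is of course unknown in real applications.

```python
from itertools import combinations


def edge_error_rates(true_edges, learned_edges, n_vertices):
    """Realized error rates of a learned undirected skeleton (hypothetical helper)."""
    all_pairs = {frozenset(p) for p in combinations(range(n_vertices), 2)}
    true_set = {frozenset(e) for e in true_edges}
    learned_set = {frozenset(e) for e in learned_edges}

    false_pos = learned_set - true_set   # edges claimed but absent: type I errors
    false_neg = true_set - learned_set   # edges present but missed: type II errors
    absent = all_pairs - true_set        # pairs for which the negative hypothesis is true

    return {
        "type_I": len(false_pos) / len(absent) if absent else 0.0,
        "type_II": len(false_neg) / len(true_set) if true_set else 0.0,
        # Realized false discovery proportion; the FDR is its expectation.
        "fdp": len(false_pos) / len(learned_set) if learned_set else 0.0,
    }


rates = edge_error_rates(
    true_edges=[(0, 1), (1, 2)],
    learned_edges=[(0, 1), (0, 2), (2, 3)],
    n_vertices=4,
)
print(rates)
```

In this toy case two of the three claimed edges are false, so the false discovery proportion is much larger than a per-pair view of the errors might suggest.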
Generally, no error-rate criterion is mathematically or technically superior to the others if the research scenario is not specified. One error-rate criterion may be favoured in one scenario while another criterion may be of interest in a different scenario, just as in the aforementioned examples of diagnosing diseases. In real world applications, selecting the error rate of interest is largely not an abstract question “which error rate is superior to the others?”, but a practical question “which error rate is the researchers’ concern?” For extended discussions on why there are no generally superior relationships among different error-rate criteria, please refer to Appendix 4.7, where examples of typical research scenarios, accompanied by theoretical discussions, illustrate that each of the four error-rate criteria in Table 4.2 may be favoured in a certain study.

The false discovery rate (FDR) [see 4, 34], defined as the expected ratio of falsely discovered positive hypotheses to all those discovered, has become an important and widely used criterion in many research fields, such as bioinformatics and neuroimaging. In many real world applications that involve multiple hypothesis testing, the FDR is more reasonable than the traditional type I error rate and type II error rate. Suppose that in a pilot study researchers are selecting candidate genes for genetic research on schizophrenia. Due to limited funding, only a limited number of genes can be studied thoroughly in the subsequent genetic research. To use the funding efficiently, researchers would hope that 95% of the candidate genes selected in the pilot study are truly associated with the disease. In this case, the FDR is chosen as the error rate of interest and should be controlled under 5%.
Simply controlling the type I error rate and the type II error rate under certain levels does not necessarily keep the FDR sufficiently low, especially in the case of sparse networks. For example, suppose a gene regulatory network involves 100 genes, where each gene interacts on average with 3 others, i.e. there are 150 edges in the network. Then an algorithm with the realized type I error rate = 5% and the realized power = 90% (i.e. the realized type II error rate = 10%) will recover a network with $150 \times 90\% = 135$ correct connections and $[100 \times (100 - 1)/2 - 150] \times 5\% = 240$ false connections. This means that $240/(240 + 135) = 64\%$ of the claimed connections actually do not exist in the true network. Due to the popularity of the FDR in research practice, it is highly desirable to develop structure-learning algorithms that allow control over the FDR of network structures.

However, current structure-learning algorithms for Bayesian networks have not been adequately adapted to explicitly control the FDR of the claimed “discovered” networks. Score-based search methods [see 13] look for a suitable structure by optimizing a certain criterion of goodness-of-fit, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or the Bayesian Dirichlet likelihood equivalent metric (BDE), with a random walk (e.g. simulated annealing) or a greedy walk (e.g. hill-climbing), in the space of DAGs or their equivalence classes.⁶ It is worth noting that the restricted case of tree-structured Bayesian networks has been optimally solved, in the sense of Kullback-Leibler divergence, with the method of Chow and Liu [6], and that Chickering [5] has proved that the greedy equivalence search can identify the true equivalence class in the limit of large sample sizes. Nevertheless, scores do not directly reflect the error rate of edges, and the sample sizes in real world applications are usually not large enough to guarantee the perfect asymptotic identification.
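The gene-network arithmetic given earlier in this passage (100 genes, 150 true edges, 5% type I error rate, 90% power) can be reproduced and varied with a few lines of code. The function name and signature below are illustrative, not part of the thesis.

```python
def expected_discoveries(n_vertices, n_true_edges, type_i_rate, power):
    """Expected true/false edge claims and the resulting false proportion.

    Hypothetical helper: treats every vertex pair as one hypothesis test
    with the given realized type I error rate and power.
    """
    n_pairs = n_vertices * (n_vertices - 1) // 2
    true_pos = n_true_edges * power
    false_pos = (n_pairs - n_true_edges) * type_i_rate
    return true_pos, false_pos, false_pos / (true_pos + false_pos)


true_pos, false_pos, false_fraction = expected_discoveries(100, 150, 0.05, 0.90)
print(true_pos, false_pos, false_fraction)

# A denser network with the same error rates gives a much lower false fraction.
print(expected_discoveries(100, 2000, 0.05, 0.90)[2])
```

The second call illustrates why sparsity matters: with many more true edges, the same per-test error rates lead to far fewer false claims relative to correct ones.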
Footnote 6: An equivalence class of DAGs is a set of DAGs that encode the same set of conditional-independence relationships according to the directed Markov property.

The Bayesian approach first assumes a certain prior probability distribution over the network structures, and then estimates the posterior probability distribution of the structures after data are observed. Theoretically, the posterior probability of any structure features, such as the existence of an edge, the existence of a path, or even the existence of a sub-graph, can be estimated with the Bayesian approach. This consequently allows the control of the posterior error rate of these structure features, i.e. the posterior probability of the non-existence of these features.

It should be pointed out that the posterior error rate is conceptually different from those error rates such as the type I error rate, the type II error rate, and the FDR, basically because they are from different statistical perspectives. The posterior error rate is defined from the perspective of Bayesian statistics. From the Bayesian perspective, network structures are assumed to be random, according to a probability distribution, and the posterior error rate is the probability of the non-existence of certain features according to the posterior probability distribution over the network structures. Given the same data, different posterior distributions will be derived from different prior distributions. The type I error rate, the type II error rate, and the FDR are defined from the perspective of classical statistics. From the classical perspective, there is a true, yet unknown, model behind the data, and the error rates are defined by comparing with the true model. Nevertheless, a variant of the FDR, the positive false discovery rate (pFDR) proposed by Storey [34], can be interpreted from the Bayesian perspective [35].
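For reference, the FDR and the pFDR can be written side by side. The notation is introduced here for illustration and does not come from this section: $V$ is the number of falsely claimed features and $R$ is the total number of claimed features. These are the standard definitions as commonly stated in the multiple-testing literature.

```latex
\mathrm{FDR} = E\!\left[\frac{V}{\max(R,1)}\right]
             = E\!\left[\frac{V}{R}\,\middle|\,R>0\right] P(R>0),
\qquad
\mathrm{pFDR} = E\!\left[\frac{V}{R}\,\middle|\,R>0\right].
```

The two quantities differ only by the factor $P(R>0)$, so they nearly coincide whenever at least one feature is almost surely claimed.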
The Bayesian approach for structure learning is usually conducted with the maximum-a-posteriori-probability (MAP) method or the posterior-expectation method. The MAP method selects the network structure with the largest posterior probability. The optimal structure is usually searched for in a score-based manner, with the posterior probability, or more often an approximation to the relative posterior probability (for instance the BIC score), being the score to optimize. Cooper and Herskovits [7] developed a heuristic greedy search algorithm called K2⁷ that can finish the search in polynomial time with respect to the number of vertices, given the order of vertices. The MAP method provides us with a single network structure, the one with the highest posterior probability, but does not address error rates in the Bayesian approach.

To control errors in the Bayesian approach, the network structure should be learned with the posterior-expectation method, i.e. calculating the posterior probabilities of network structures, and then deriving the posterior expectation of the existence of certain structure features. Though theoretically the posterior-expectation method can control the error rate of any structure features, in practice its capacity is largely limited for computational reasons. The number of DAGs increases super-exponentially as the number of vertices increases [29]. For 10 vertices, there are already about $4.2 \times 10^{18}$ DAGs. Though the number of equivalence classes of DAGs is much smaller than the number of DAGs, it is still forbiddingly large, empirically asymptotically decreasing to 1/13.652 of the number of DAGs as the number of vertices increases [33]. Therefore, exact inference of posterior probabilities is only feasible for small scale problems, or under certain additional constraints.

Footnote 7: The algorithm is named K2 because it evolved from a system named Kutato [14].
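The super-exponential growth of the number of DAGs is easy to see numerically. The sketch below uses the standard inclusion-exclusion recurrence for counting labeled DAGs, commonly attributed to Robinson; that this is the derivation used in the cited reference [29] is an assumption, and the function name is hypothetical.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n):
    """Number of labeled DAGs on n vertices.

    Recurrence: a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k),
    with a(0) = 1, where k counts the vertices that have no incoming edge.
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


for n in range(1, 11):
    print(n, count_dags(n))
```

For 10 vertices the count is on the order of $10^{18}$, in line with the figure quoted in the text, which is why exhaustive enumeration of structures is out of reach beyond very small networks.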
For certain prior distributions, and given the order of vertices, Friedman and Koller [12] have derived a formula that can be used to calculate the exact posterior probability of a structure feature with the computational complexity bounded by $O(N^{D_{in}+1})$, where $N$ is the number of vertices and $D_{in}$ is the upper bound of the in-degree for each vertex. Considering similar prior distributions, but without the restriction on the order of vertices, Koivisto and Sood [17] have developed a fast exact-Bayesian-inference algorithm based on dynamic programming that is able to compute the exact posterior probability of a sub-network with the computational complexity bounded by $O(N 2^N + N^{D_{in}+1} L(m))$, where $L(m)$ is the complexity of computing a marginal conditional likelihood from $m$ samples. In practice, this algorithm runs fairly fast when the number of vertices is less than 25. For networks with more than 30 vertices, the authors suggested setting more restrictions or combining with inexact techniques. These two breakthroughs made exact Bayesian inference practical for certain prior distributions. However, as Friedman and Koller [12] pointed out, the prior distributions which facilitate the exact inference are not hypothesis equivalent [see 13], i.e. different network structures that are in the same equivalence class often have different priors. The simulation performed by Eaton and Murphy [8] confirmed that these prior distributions deviate far from the uniform distributions. This implies that the methods cannot be applied to the widely accepted uninformative prior, i.e. the uniform prior distribution over DAGs. For general problems, the posterior probability of a structure feature can be approximated with Markov chain Monte Carlo (MCMC) methods [21]. As a versatile implementation of Bayesian inference, the MCMC method can estimate the posterior probability given any prior probability distribution.
However, MCMC usually requires intensive computation and the results may depend on the initial state of randomization.

Constraint-based approaches, such as the SGS algorithm (footnote 8) [see 32, pages 82–83], the inductive causation (IC) algorithm (footnote 9) [see 26, pages 49–51], and the PC algorithm (footnote 10) [see 32, pages 84–89], are rooted in the directed Markov property, the rule by which Bayesian networks encode conditional independence. These methods first test hypotheses of conditional independence among random variables, and then combine those accepted hypotheses of conditional independence to construct a partially directed acyclic graph (PDAG) according to the directed Markov property. The computational complexity of these algorithms is difficult to analyze exactly, though in the worst case, which rarely occurs in real-world applications, it is surely bounded by $O(N^2 2^N)$, where $N$ is the number of vertices. In practice, the PC algorithm and the fast-causal-inference (FCI) algorithm [see 32, pages 142–146] can achieve polynomial time if the maximum degree of a graph is fixed. It has been proved that if the true model satisfies the faithfulness constraints [see 32, pages 13 and 81] and all the conditional-independence/dependence relationships are correctly identified, then the PC algorithm and the IC algorithm can exactly recover the true equivalence class, and so can the FCI algorithm and the IC* algorithm (footnote 11) [see 26, pages 51–54] for problems with latent variables. Kalisch and Bühlmann [15] have proved that for Gaussian Bayesian networks, the PC algorithm can consistently estimate the equivalence class of an underlying sparse DAG as the sample size $m$ approaches infinity, even if the number of vertices $N$ grows as fast as $O(m^{\lambda})$ for any $0 < \lambda < \infty$. Yet, since in practice hypotheses of conditional independence are tested with statistical inference from limited data, false decisions cannot be entirely avoided and thus the ideal recovery cannot be achieved.
In current implementations of the constraint-based approaches, the error rate of testing conditional independence is usually controlled individually for each test, under a conventional level such as 5% or 1%, without correcting the effect of multiple hypothesis testing. Therefore these implementations may fail to curb the FDR, especially for sparse graphs.

Footnote 8: "SGS" stands for Spirtes, Glymour and Scheines who invented this algorithm.
Footnote 9: An extension of the IC algorithm which was named as IC* [see 26, pages 51–54] was previously also named as IC by Pearl and Verma [27]. Here we follow Pearl [26].
Footnote 10: "PC" stands for Peter Spirtes and Clark Glymour who invented this algorithm. A modified version of the PC algorithm which was named as PC* [see 32, pages 89–90] was previously also named as PC by Spirtes and Glymour [31]. Here we follow Spirtes et al. [32].
Footnote 11: See footnote 9.

In this paper, we propose embedding an FDR-control procedure into the PC algorithm to curb the error rate of the skeleton of the learned PDAGs. Instead of individually controlling the type I error rate of each hypothesis test, the FDR-control procedure considers the hypothesis tests together to correct the effect of simultaneously testing the existence of multiple edges. We prove that the proposed method, named the PCfdr-skeleton algorithm, can control the FDR under a user-specified level at the limit of large sample sizes. In the case of moderate sample sizes (about several hundred), empirical experiments show that the method is able to control the FDR under the user-specified level, and a heuristic modification of the method is able to control the FDR more accurately around the user-specified level. Schäfer and Strimmer [30] have applied an FDR procedure to graphical Gaussian models to control the FDR of the non-zero entries of the partial correlation matrix.
Different from the work of Schäfer and Strimmer [30], our method, built within the framework of the PC algorithm, is not only applicable to the special case of Gaussian models, but also generally applicable to any models for which conditional-independence tests are available, such as discrete models.

We are particularly interested in the PC algorithm because it is rooted in conditional-independence relationships, the backbone of Bayesian networks, and p-values of hypothesis testing represent one type of error rate. We consider the skeleton of graphs because constraint-based algorithms usually first construct an undirected graph, and then annotate it into different types of graphs while keeping the skeleton the same as that of the undirected one.

The PCfdr-skeleton algorithm is not designed to replace, nor claimed to be superior to, the standard PC algorithm, but to provide the PC algorithm with the ability to control the FDR over the skeleton of the recovered network. The PCfdr-skeleton algorithm controls the FDR while the standard PC algorithm controls the type I error rate, as illustrated in Section 4.3.1. Since there are no general superiority relationships between different error-rate criteria, as explained in Appendix 4.7, neither are there any between the PCfdr-skeleton algorithm and the standard PC algorithm. In research practice, researchers first decide which error rate is of interest, and then choose appropriate algorithms to control that error rate. Generally they will not select an algorithm that sounds "superior" but controls the wrong error rate. Since the purpose of the paper is to provide the PC algorithm with control over the FDR, we assume in this paper that the FDR has been selected as the error rate of interest, and selecting the error rate of interest is out of the scope of the paper.
4.1.1 More Recent Related Works

After this paper had been accepted for publication, we kept tracking the advancement in FDR control in network structure learning, and noticed that two recent conference papers also addressed the problem. To keep the review of related works up-to-date, we add this subsection to discuss the two conference papers.

Listgarten and Heckerman [20] investigated the problem with both Bayesian and classical approaches. The Bayesian approach, as normally regarded, estimates the FDR through posterior probabilities. The classical approach, quite different from the method proposed in this paper, employs permutation to estimate the number of spurious edges. The advantage of this permutation approach is that it is generally applicable to any structure learning algorithm. On the other hand, it also needs to repeat the structure learning at least dozens of times to achieve a robust estimation, which scales up the total computation time. Although its implementation in the paper is derived from intuition, lacks concrete theoretical justification, is based on a fixed order of nodes, and leads to over-conservative results, it is still very interesting and worth further development.

In a paper [37] published four months after the submission of this paper, Tsamardinos and Brown proposed an estimation of the upper bound of the FDR. The method in [37] and the PCfdr algorithm here share certain similarities, but are technically different. Both are rooted in Proposition 1 and both use an FDR-control procedure to correct p-values. Unlike the PCfdr algorithm, the method in [37] does not embed the FDR-control procedure in the structure learning procedure, but separates it as an afterward error-rate estimation. Consequently, it does not allow users to specify the desired FDR in advance. Its asymptotic performance is not discussed in [37].

The remainder of the paper is organized as follows.
In Section 4.2, we review the PC algorithm, present the FDR-embedded PC algorithm, prove its asymptotic performance, and analyze its computational complexity. In Section 4.3, we evaluate the proposed method with simulated data, and demonstrate its real-world applications to learning functional connectivity networks between brain regions using functional magnetic resonance imaging (fMRI). Finally, we discuss the advantages and limitations of the proposed method in Section 4.4.

4.2 Controlling FDR with PC Algorithm

In this section, we first briefly introduce Bayesian networks and review the PC algorithm. Then, we elaborate on the FDR-embedded PC algorithm and its heuristic modification, prove their asymptotic performances, and analyze their computational complexity. At the end, we discuss other possible ideas of embedding FDR control into the PC algorithm.

4.2.1 Notations and Preliminaries

To assist the reading, notations frequently used in the paper are listed as follows:

$a, b, \ldots$ : vertices
$X_a, X_b, \ldots$ : variables respectively represented by vertices $a, b, \ldots$
$A, B, \ldots$ : vertex sets
$X_A, X_B, \ldots$ : variable sets respectively represented by vertex sets $A, B, \ldots$
$V$ : the vertex set of a graph
$N = |V|$ : the number of vertices of a graph
$a \to b$ : a directed edge or an ordered pair of vertices
$a \sim b$ : an undirected edge, or an unordered pair of vertices
$E$ : a set of directed edges
$E^{\sim}$ : the undirected edges derived from $E$, i.e. $\{a \sim b \mid a \to b \text{ or } b \to a \in E\}$
$G = (V, E)$ : a directed graph composed of vertices in $V$ and edges in $E$
$G^{\sim} = (V, E^{\sim})$ : the skeleton of a directed graph $G = (V, E)$
$\mathrm{adj}(a, G)$ : vertices adjacent to $a$ in graph $G$, i.e. $\{b \mid a \to b \text{ or } b \to a \in E\}$
$\mathrm{adj}(a, G^{\sim})$ : vertices adjacent to $a$ in graph $G^{\sim}$, i.e. $\{b \mid a \sim b \in E^{\sim}\}$
$a \perp b \mid C$ : vertices $a$ and $b$ are d-separated by vertex set $C$
$X_a \perp X_b \mid X_C$ : $X_a$ and $X_b$ are conditionally independent given $X_C$
$p_{a \perp b \mid C}$ : the p-value of testing $X_a \perp X_b \mid X_C$

A Bayesian network encodes a set of conditional-independence relationships with a DAG $G = (V, E)$ according to the directed Markov property defined as follows.

Definition 1 (the Directed Markov Property): if $A$ and $B$ are d-separated by $C$, where $A$, $B$ and $C$ are three disjoint sets of vertices, then $X_A$ and $X_B$ are conditionally independent given $X_C$, i.e. $P(X_A, X_B \mid X_C) = P(X_A \mid X_C) P(X_B \mid X_C)$ [see 18, pages 46–53].

The concept of d-separation [see 18, page 48] is defined as follows. A chain between two vertices $a$ and $b$ is a sequence $a = a_0, a_1, \ldots, a_n = b$ of distinct vertices such that $a_{i-1} \sim a_i \in E^{\sim}$ for all $i = 1, \ldots, n$. Vertex $b$ is a descendant of vertex $a$ if and only if there is a sequence $a = a_0, a_1, \ldots, a_n = b$ of distinct vertices such that $a_{i-1} \to a_i \in E$ for all $i = 1, \ldots, n$. For three disjoint subsets $A$, $B$ and $C \subseteq V$, $C$ d-separates $A$ and $B$ if and only if any chain $\pi$ between any $a \in A$ and any $b \in B$ contains a vertex $\gamma \in \pi$ such that either arrows of $\pi$ do not meet head-to-head at $\gamma$ and $\gamma \in C$, or arrows of $\pi$ meet head-to-head at $\gamma$ and $\gamma$ is neither in $C$ nor has any descendants in $C$.

Moreover, a probability distribution $P$ is faithful to a DAG $G$ [see 32, pages 13 and 81] if all and only the conditional-independence relationships derived from $P$ are encoded by $G$. In general, a probability distribution may possess other independence relationships besides those encoded by a DAG.

It should be pointed out that there are often several different DAGs encoding the same set of conditional-independence relationships; together they are called an equivalence class of DAGs.
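To make d-separation concrete, here is a minimal sketch of a d-separation check for single vertices. It does not enumerate chains as in the definition above; instead it assumes the well-known equivalent criterion that $C$ d-separates $a$ and $b$ exactly when $C$ separates them in the moralized graph of the smallest ancestral set containing $a$, $b$ and $C$. The function name `d_separated` and the `parents` dictionary representation are hypothetical choices for this illustration.

```python
def d_separated(parents, a, b, C):
    """Return True if vertex set C d-separates a and b in the DAG.

    parents maps each vertex to the set of its parents (missing key = no parents).
    Assumes a and b are distinct and neither belongs to C.
    """
    # 1. Smallest ancestral set containing a, b and C.
    relevant, stack = set(), [a, b, *C]
    while stack:
        v = stack.pop()
        if v not in relevant:
            relevant.add(v)
            stack.extend(parents.get(v, ()))
    # 2. Moralize: link every vertex to its parents and marry co-parents.
    nbrs = {v: set() for v in relevant}
    for v in relevant:
        ps = list(parents.get(v, ()))
        for i, p in enumerate(ps):
            nbrs[v].add(p)
            nbrs[p].add(v)
            for q in ps[i + 1:]:
                nbrs[p].add(q)
                nbrs[q].add(p)
    # 3. Search from a in the moral graph without passing through C.
    blocked, seen, stack = set(C), {a}, [a]
    while stack:
        v = stack.pop()
        for w in nbrs[v]:
            if w == b:
                return False          # an unblocked path reaches b
            if w not in seen and w not in blocked:
                seen.add(w)
                stack.append(w)
    return True


if __name__ == "__main__":
    chain = {1: {0}, 2: {1}}          # 0 -> 1 -> 2
    collider = {2: {0, 1}, 3: {2}}    # 0 -> 2 <- 1, 2 -> 3
    print(d_separated(chain, 0, 2, {1}))      # conditioning on the middle vertex
    print(d_separated(collider, 0, 1, set())) # head-to-head vertex not in C
    print(d_separated(collider, 0, 1, {3}))   # descendant of the collider in C
```

The collider example mirrors the second clause of the definition: the head-to-head vertex blocks the chain only while neither it nor any of its descendants is in $C$.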
An equivalence class can be uniquely represented by a completed partially directed acyclic graph (CPDAG) (also called the essential graph in the literature) that has the same skeleton as a DAG does except that some edges are not directed [see 3].

4.2.2 PC Algorithm

If a probability distribution $P$ is faithful to a DAG $G$, then the PC algorithm [see 32, pages 84–89] is able to recover the equivalence class of the DAG $G$, given the set of the conditional-independence relationships. In general, a probability distribution may include other independence relationships besides those encoded by a DAG. The faithfulness assumption assures that the independence relationships can be perfectly encoded by a DAG. In practice, the information on conditional independence is usually unknown but extracted from data with statistical hypothesis testing. If the p-value of testing a hypothesis $X_a \perp X_b \mid X_C$ is lower than a user-specified level $\alpha$ (conventionally 5% or 1%), then the hypothesis of conditional independence is rejected while the hypothesis of conditional dependence $X_a \not\perp X_b \mid X_C$ is accepted.

The first step of the PC algorithm is to construct an undirected graph $G^{\sim}$ whose edge directions will later be further determined with other steps, while the skeleton is kept the same as that of $G^{\sim}$. Since we restrict ourselves to the error rate of the skeleton, here we only present in Algorithm 1 the first step of the PC algorithm, as implemented in the software Tetrad version 4.3.8 (see http://www.phil.cmu.edu/projects/tetrad), and refer to it as the PC-skeleton algorithm.

Algorithm 1 PC-skeleton
Input: the data $X_V$ generated from a probability distribution faithful to a DAG $G_{true}$, and the significance level $\alpha$ for every statistical test of conditional independence
Output: the recovered skeleton $G^{\sim}$
 1: Form the complete undirected graph $G^{\sim}$ on the vertex set $V$.
 2: Let depth $d = 0$.
 3: repeat
 4:   for each ordered pair of adjacent vertices $a$ and $b$ in $G^{\sim}$, i.e. $a \sim b \in E^{\sim}$ do
 5:     if $|\mathrm{adj}(a, G^{\sim}) \setminus \{b\}| \geq d$, then
 6:       for each subset $C \subseteq \mathrm{adj}(a, G^{\sim}) \setminus \{b\}$ and $|C| = d$ do
 7:         Test hypothesis $X_a \perp X_b \mid X_C$ and calculate the p-value $p_{a \perp b \mid C}$.
 8:         if $p_{a \perp b \mid C} \geq \alpha$, then
 9:           Remove edge $a \sim b$ from $G^{\sim}$.
10:           Update $G^{\sim}$ and $E^{\sim}$.
11:           break the for loop at line 6
12:         end if
13:       end for
14:     end if
15:   end for
16:   Let $d = d + 1$.
17: until $|\mathrm{adj}(a, G^{\sim}) \setminus \{b\}| < d$ for every ordered pair of adjacent vertices $a$ and $b$ in $G^{\sim}$.

The theoretical foundation of the PC-skeleton algorithm is Proposition 1: if two vertices $a$ and $b$ are not adjacent in a DAG $G$, then there is a set of other vertices $C$, either all neighbors of $a$ or all neighbors of $b$, such that $C$ d-separates $a$ and $b$, or equivalently, $X_a \perp X_b \mid X_C$, according to the directed Markov property. Since two adjacent vertices are not d-separated by any set of other vertices (according to the directed Markov property), Proposition 1 implies that $a$ and $b$ are not adjacent if and only if there is a d-separating $C$ in either the neighbors of $a$ or the neighbors of $b$. Readers should notice that the proposition does not imply that $a$ and $b$ are d-separated only by such a $C$, but just guarantees that a d-separating $C$ can be found in either the neighbors of $a$ or the neighbors of $b$.

Proposition 1 If vertices $a$ and $b$ are not adjacent in a DAG $G$, then there is a set of vertices $C$ which is either a subset of $\mathrm{adj}(a, G) \setminus \{b\}$ or a subset of $\mathrm{adj}(b, G) \setminus \{a\}$ such that $C$ d-separates $a$ and $b$ in $G$.

This proposition is a corollary of Lemma 5.1.1 on page 411 of the book Causation, Prediction, and Search [32].

The logic of the PC-skeleton algorithm is as follows, under the assumption of perfect judgment on conditional independence. The most straightforward application of Proposition 1 to structure learning is to exhaustively search all the possible neighbors of $a$ and $b$ to verify whether there is such a d-separating $C$ to disconnect $a$ and $b$.
Since the possible neighbors of $a$ and $b$ are unknown, to guarantee the detection of such a d-separating $C$, all the vertices other than $a$ and $b$ should be searched as possible neighbors of $a$ and $b$. However, such a straightforward application is very inefficient because it probably searches many unnecessary combinations of vertices by considering all the vertices other than $a$ and $b$ as their possible neighbors, especially when the true DAG is sparse. The PC-skeleton algorithm searches more efficiently, by updating the possible neighbors of a vertex as soon as some previously considered possible neighbors have been found to be not connected with the vertex. Starting with a fully connected undirected graph $G^{\sim}$ (step 1), the algorithm searches for the d-separating $C$ progressively by increasing the size of $C$, i.e. the number of conditional variables, from zero (step 2) with the step size of one (step 16). Given the size of $C$, the search is performed for every vertex pair $a$ and $b$ (step 4). Once a $C$ d-separating $a$ and $b$ is found (step 8), $a$ and $b$ are disconnected (step 9), and the neighbors of $a$ and $b$ are updated (step 10). In the algorithm, $G^{\sim}$ is continually updated, so $\mathrm{adj}(a, G^{\sim})$ is also constantly updated as the algorithm progresses. The algorithm stops when all the subsets of the current neighbors of each vertex have been examined (step 17). The Tetrad implementation of the PC-skeleton algorithm examines an edge $a \sim b$ as two ordered pairs $(a, b)$ and $(b, a)$ (step 4), each time searching for the d-separating $C$ in the neighbors of the first element of the pair (step 6). In this way, both the neighbors of $a$ and the neighbors of $b$ are explored.

The accuracy of the PC-skeleton algorithm depends on the discriminability of the statistical test of conditional independence. If the test can perfectly distinguish dependence from independence, then the algorithm can correctly recover the true underlying skeleton, as proved by Spirtes et al.
[32], pages 410–412. The outline of the proof is as follows. First, all the true edges will be recovered because an adjacent vertex pair $a \sim b$ is not d-separated by any vertex set $C$ that excludes $a$ and $b$. Second, if the edge between a non-adjacent vertex pair $a$ and $b$ has not been removed, subsets of either $\mathrm{adj}(a, G) \setminus \{b\}$ or $\mathrm{adj}(b, G) \setminus \{a\}$ will be searched until the $C$ that d-separates $a$ and $b$ according to Proposition 1 is found, and consequently the edge between $a$ and $b$ will be removed.

If the judgments on conditional independence and conditional dependence are imperfect, the PC-skeleton algorithm is unstable. If an edge is mistakenly removed from the graph in the early stage of the algorithm, then other edges which are not in the true graph may be included in the graph [see 32, page 87].

4.2.3 False Discovery Rate

In a statistical decision process, there are basically two sources of errors: the type I errors, i.e. falsely rejecting negative hypotheses when they are actually true; and the type II errors, i.e. falsely accepting negative hypotheses when their alternatives, the positive hypotheses, are actually true. The FDR [see 4] is a criterion to assess the errors when multiple hypotheses are simultaneously tested. It is the expected ratio of the number of falsely claimed positive results to that of all those claimed to be positive, as defined in Table 4.2. A variant of the FDR, the positive false discovery rate (pFDR), defined as in Table 4.2, was proposed by Storey [34]. Clearly, $\mathrm{pFDR} = \mathrm{FDR} / P(R_2 > 0)$, so the two measures will be similar if $P(R_2 > 0)$ is close to 1, and quite different otherwise.

                          Test Results
Truth       Negative               Positive               Total
Negative    TN (true negative)     FP (false positive)    $T_1$
Positive    FN (false negative)    TP (true positive)     $T_2$
Total       $R_1$                  $R_2$

Table 4.1: Results of multiple hypothesis testing, categorized according to the claimed results and the truth.

Full Name                                   Abbreviation    Definition
False Discovery Rate                        FDR             $E(\mathrm{FP}/R_2)$ (see note *)
Positive False Discovery Rate               pFDR            $E(\mathrm{FP}/R_2 \mid R_2 > 0)$
Family-Wise Error Rate                      FWER            $P(\mathrm{FP} \geq 1)$
Type I Error Rate (False Positive Rate)     $\alpha$        $E(\mathrm{FP}/T_1)$
Specificity (True Negative Rate)            $1 - \alpha$    $E(\mathrm{TN}/T_1)$
Type II Error Rate (False Negative Rate)    $\beta$         $E(\mathrm{FN}/T_2)$
Power (Sensitivity, True Positive Rate)     $1 - \beta$     $E(\mathrm{TP}/T_2)$

Table 4.2: Criteria for multiple hypothesis testing. Here $E(x)$ means the expected value of $x$, and $P(A)$ means the probability of event $A$. Please refer to Table 4.1 for related notations. * If $R_2 = 0$, $\mathrm{FP}/R_2$ is defined to be 0.

The FDR is a reasonable criterion when researchers expect the "discovered" results to be trustworthy and dependable in subsequent studies. For example, suppose that in a pilot study we are selecting candidate genes for genetic research on Parkinson's disease. Because of limited funding, we can only study a limited number of genes in the subsequent genetic research. Thus, when selecting candidate genes in the pilot study, we hope that 95% of the selected candidate genes are truly associated with the disease. In this case, the FDR is chosen as the error rate of interest and should be controlled under 5%. Since similar situations are quite common in research practice, the FDR has been widely adopted in many research fields such as bioinformatics and neuroimaging.

In the context of learning the skeleton of a DAG, a negative hypothesis could be that a connection does not exist in the DAG, and a positive hypothesis could be that the connection exists. The FDR is the expected proportion of the falsely discovered connections to all those discovered.
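As a small worked illustration of the criteria in Table 4.2, the sketch below computes the realized proportions for a single outcome of multiple testing (the criteria themselves are expectations of these proportions over repeated experiments). The function name `error_proportions` and the example counts are hypothetical and chosen only for illustration.

```python
def error_proportions(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Realized error proportions for one outcome, using Table 4.1 notation.

    Assumes at least one truly negative and one truly positive hypothesis.
    """
    t1, t2 = tn + fp, fn + tp      # truly negative / truly positive hypotheses
    r2 = fp + tp                   # hypotheses claimed to be positive
    return {
        # FP / R2, defined as 0 when nothing is claimed positive (note * of Table 4.2)
        "false_discovery_proportion": fp / r2 if r2 > 0 else 0.0,
        "type_I": fp / t1,
        "type_II": fn / t2,
        "power": tp / t2,
    }


if __name__ == "__main__":
    # A sparse network: 1000 absent connections, 10 present ones.
    print(error_proportions(tn=950, fp=50, fn=1, tp=9))
```

In this hypothetical sparse example the type I proportion is 5% and the power is 90%, yet 50 of the 59 claimed connections are false, which shows why the FDR has to be controlled explicitly rather than through the per-test error rates.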
Learning network structures may face scenarios similar to the aforementioned pilot study, but FDR control has not yet received adequate attention in structure learning.

Benjamini and Yekutieli [4] have proved that, when the test statistics have positive regression dependency on each of the test statistics corresponding to the true negative hypotheses, the FDR can be controlled under a user-specified level $q$ by Algorithm 2. In other cases of dependency, the FDR can be controlled with a simple conservative modification of the procedure by replacing $H^*$ in Eq. (4.1) with $H(1 + 1/2 + \cdots + 1/H)$. Storey [34] has provided algorithms to control the pFDR for independent test statistics. For a review and comparison of more FDR methods, please refer to the work of Qian and Huang [28]. It should be noted that the FDR procedures do not control the realized FDR of a trial under $q$, but control the expected value of the error rate when the procedures are repetitively applied to randomly sampled data.

Algorithm 2 FDR-stepup
Input: a set of p-values $\{p_i \mid i = 1, \ldots, H\}$, and the threshold of the FDR $q$
Output: the set of rejected null hypotheses
1: Sort the p-values of the $H$ hypothesis tests in ascending order as $p_{(1)} \leq \ldots \leq p_{(H)}$.
2: Let $i = H$, and $H^* = H$ (or $H^* = H(1 + 1/2 + \cdots + 1/H)$, depending on the assumption of the dependency among the test statistics).
3: while
     $\frac{H^*}{i} p_{(i)} > q$ and $i > 0$,    (4.1)
   do
4:   Let $i = i - 1$.
5: end while
6: Reject the null hypotheses associated with $p_{(1)}, \ldots, p_{(i)}$, and accept the null hypotheses associated with $p_{(i+1)}, \ldots, p_{(H)}$.

Besides the FDR and the pFDR, other criteria, as listed in Table 4.2, can also be applied to assess the uncertainty of multiple hypothesis testing. The type I error rate is the expected ratio of the type I errors to all the negative hypotheses that are actually true. The type II error rate is the expected ratio of the type II errors to all the positive
hypotheses that are actually true. The family-wise error rate is the probability that at least one of the accepted positive hypotheses is actually wrong. Generally, there are no mathematical or technical superiority relationships among these error-rate criteria. Please refer to Appendix 4.7 for examples of typical research scenarios where each particular error rate is favoured.

Controlling both the type I and the type II error rates under a conventional level (such as $\alpha <$ 5% or 1% and $\beta <$ 10% or 5%) does not necessarily curb the FDR at a desired level. As shown in Eq. (4.2), if $\mathrm{FP}/T_1$ and $\mathrm{FN}/T_2$ are fixed and positive, $\mathrm{FP}/R_2$ approaches 1 when $T_2/T_1$ is small enough. This is the case of sparse networks where the number of existing connections $T_2$ is much smaller than the number of non-existing connections $T_1$.

$$\frac{\mathrm{FP}}{R_2} = \frac{\mathrm{FP}/T_1}{\mathrm{FP}/T_1 + \left(1 - \mathrm{FN}/T_2\right) T_2/T_1}. \qquad (4.2)$$

4.2.4 PC Algorithm with FDR

Steps 8–12 of the PC-skeleton algorithm control the type I error rate of each statistical test of conditional independence individually below a pre-defined level $\alpha$, so the algorithm cannot explicitly control the FDR. We propose embedding an FDR-control procedure into the algorithm to curb the error rate of the learned skeleton. The FDR-control procedure collectively considers the hypothesis tests related to the existence of multiple edges, correcting the effect of multiple hypothesis testing. The proposed method is described in Algorithm 3, and we name it the PCfdr-skeleton algorithm. Similar to the PC-skeleton algorithm, $G^{\sim}$, $\mathrm{adj}(a, G^{\sim})$ and $E^{\sim}$ are constantly updated as the algorithm progresses.

Algorithm 3 PCfdr-skeleton
Input: the data $X_V$ generated from a probability distribution faithful to a DAG $G_{true}$, and the FDR level $q$ for the discovered skeleton
Output: the recovered skeleton $G^{\sim}$
 1: Form the complete undirected graph $G^{\sim}$ on the vertex set $V$.
 2: Initialize the maximum p-values associated with edges as $P^{\max} = \{p^{\max}_{a \sim b} = -1\}_{a \neq b}$.
 3: Let depth $d = 0$.
 4: repeat
 5:   for each ordered pair of adjacent vertices $a$ and $b$ in $G^{\sim}$, i.e. $a \sim b \in E^{\sim}$ do
 6:     if $|\mathrm{adj}(a, G^{\sim}) \setminus \{b\}| \geq d$, then
 7:       for each subset $C \subseteq \mathrm{adj}(a, G^{\sim}) \setminus \{b\}$ and $|C| = d$ do
 8:         Test hypothesis $X_a \perp X_b \mid X_C$ and calculate the p-value $p_{a \perp b \mid C}$.
 9:         if $p_{a \perp b \mid C} > p^{\max}_{a \sim b}$, then
10:           Let $p^{\max}_{a \sim b} = p_{a \perp b \mid C}$.
11:           if every element of $P^{\max}$ has been assigned a valid p-value by step 10, then
12:             Run the FDR procedure, Algorithm 2, with $P^{\max}$ and $q$ as the input.
13:             if the non-existence of certain edges is accepted, then
14:               Remove these edges from $G^{\sim}$.
15:               Update $G^{\sim}$ and $E^{\sim}$.
16:               if $a \sim b$ is removed, then
17:                 break the for loop at line 7.
18:               end if
19:             end if
20:           end if
21:         end if
22:       end for
23:     end if
24:   end for
25:   Let $d = d + 1$.
26: until $|\mathrm{adj}(a, G^{\sim}) \setminus \{b\}| < d$ for every ordered pair of adjacent vertices $a$ and $b$ in $G^{\sim}$.
* A heuristic modification at step 15 of the algorithm is to remove from $P^{\max}$ the $p^{\max}_{a \sim b}$s whose associated edges have been deleted from $G^{\sim}$ at step 14, i.e. to update $P^{\max}$ as $P^{\max} = \{p^{\max}_{a \sim b}\}_{a \sim b \in E^{\sim}}$ right after updating $E^{\sim}$ at step 15. This heuristic modification is named the PCfdr*-skeleton algorithm.

The PCfdr-skeleton and the PC-skeleton algorithms share the same search strategy, but differ in the judgment of conditional independence. Like the PC-skeleton algorithm, the PCfdr-skeleton algorithm increases $d$, the number of conditional variables, from zero (step 3) with the step size of one (step 25), and also keeps updating the neighbors of vertices (steps 14 and 15) when some previously considered possible neighbors have been judged not connected (step 13). The PCfdr-skeleton algorithm differs from the PC-skeleton algorithm in the inference of d-separation, with its steps 11–20 replacing steps 8–12 of the PC-skeleton algorithm.
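The FDR procedure invoked at step 12 (Algorithm 2, FDR-stepup) can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function name `fdr_stepup`, the `conservative` flag, and the choice to return indices of the rejected hypotheses are assumptions made for this example.

```python
def fdr_stepup(pvalues, q, conservative=False):
    """Step-up FDR procedure; returns the indices of the rejected null hypotheses.

    conservative=False uses H* = H (positive regression dependency);
    conservative=True uses H* = H * (1 + 1/2 + ... + 1/H) (arbitrary dependency).
    """
    H = len(pvalues)
    h_star = H * sum(1.0 / k for k in range(1, H + 1)) if conservative else H
    # Step 1: indices of the p-values in ascending order of p-value.
    order = sorted(range(H), key=lambda k: pvalues[k])
    # Steps 2-5: step down from i = H while (H* / i) * p_(i) > q.
    i = H
    while i > 0 and h_star * pvalues[order[i - 1]] / i > q:
        i -= 1
    # Step 6: reject the hypotheses with the i smallest p-values.
    return set(order[:i])


if __name__ == "__main__":
    p = [0.01, 0.02, 0.03, 0.5]
    print(fdr_stepup(p, q=0.05))                     # H* = H
    print(fdr_stepup(p, q=0.05, conservative=True))  # harmonic-sum correction
```

With these four hypothetical p-values and $q = 0.05$, the plain procedure rejects the three smallest, whereas the conservative variant rejects none, which illustrates how much stricter the correction for arbitrary dependency can be.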
In the PC-skeleton algorithm, two vertices are regarded as d-separated once the conditional-independence test between them yields a p-value larger than the pre-defined significance level $\alpha$. In this way, the type I error of each statistical test is controlled separately, without consideration of the effect of multiple hypothesis testing. The PCfdr-skeleton algorithm records in $p^{\max}_{a \sim b}$ the up-to-date maximum p-value associated with an edge $a \sim b$ (steps 9 and 10), and progressively removes those edges whose non-existence is accepted by the FDR procedure (step 12), with $P^{\max} = \{p^{\max}_{a \sim b}\}_{a \neq b}$ and the pre-defined FDR level $q$ being the input. The FDR procedure, Algorithm 2, is invoked at step 12, either immediately after every element of $P^{\max}$ has been assigned a valid p-value for the first time, or later once any element of $P^{\max}$ is updated.

The $p^{\max}_{a \sim b}$ is the upper bound of the p-value of testing the hypothesis that $a$ and $b$ are d-separated by at least one of the vertex sets $C$ searched in step 7. According to the directed Markov property, $a$ and $b$ are not adjacent if and only if there is a set of vertices $C \subseteq V \setminus \{a, b\}$ d-separating $a$ and $b$. As the algorithm progresses, the d-separations between $a$ and $b$ by vertex sets $C_1, \ldots, C_K \subseteq V \setminus \{a, b\}$ are tested respectively, and consequently a sequence of p-values $p_1, \ldots, p_K$ is calculated. If we use $p^{\max}_{a \sim b} = \max_{i=1}^{K} p_i$ as the statistic to test the negative hypothesis that there is, though unknown, a $C_j$ among $C_1, \ldots, C_K$ d-separating $a$ and $b$, then due to

$$P(p^{\max}_{a \sim b} \leq p) = P(p_i \leq p \text{ for all } i = 1, \ldots, K) \leq P(p_j \leq p) = p, \qquad (4.3)$$

$p^{\max}_{a \sim b}$ is the upper bound of the p-value of testing the negative hypothesis. Eq. (4.3) also clearly shows that the PC-skeleton algorithm controls the type I error rate of the negative hypothesis, since its step 8 is equivalent to "if $p^{\max}_{a \sim b} \geq \alpha$, then . . . " if $p^{\max}_{a \sim b}$ is recorded in the PC-skeleton algorithm.
The statistical tests performed at step 8 of the PCfdr-skeleton algorithm generally are not independent of each other, since the variables involved in two hypotheses of conditional independence may overlap. For example, the conditional-independence relationships $a \perp b_1 \mid C$ and $a \perp b_2 \mid C$ both involve $a$ and $C$. It is very difficult to prove whether elements of $P^{\max}$ have positive regression dependency or not, so rigorously the conservative modification of Algorithm 2 should be applied at step 12. However, since $p^{\max}_{a \sim b}$ is probably a loose upper bound of the p-value of testing $a \sim b$, in practice we simply apply the FDR procedure that is correct for positive regression dependency.

It should be noted that, different from step 9 of the PC-skeleton algorithm, step 14 of the PCfdr-skeleton algorithm may remove edges other than just $a \sim b$, because the decisions on other edges can be affected by the updating of $p^{\max}_{a \sim b}$.

A heuristic modification of the PCfdr-skeleton algorithm is to remove $p^{\max}_{a \sim b}$ from $P^{\max}$ once edge $a \sim b$ has been deleted from $G^{\sim}$ at step 14. We name this modified version the PCfdr*-skeleton algorithm. In the PCfdr-skeleton algorithm, $p^{\max}_{a \sim b}$ is still recorded in $P^{\max}$ and input to the FDR procedure after the edge $a \sim b$ has been removed. This guarantees that the algorithm can asymptotically keep the FDR under the user-specified level $q$ (see Section 4.2.5). The motivation of the heuristic modification is that if an edge has been eliminated, then it should not be considered in the FDR procedure any longer. Though we cannot theoretically prove the asymptotic performance of the heuristic modification in the sense of controlling the FDR, it is shown in our empirical experiments to control the FDR closely around the user-specified level and to gain more detection power than the PCfdr-skeleton algorithm (see Section 4.3).
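The search loop of the PCfdr-skeleton algorithm (without the heuristic modification) can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: `ci_pvalue(a, b, C)` is a hypothetical user-supplied conditional-independence test returning a p-value, vertices are assumed to be sortable, the candidate neighbors of `a` are fixed when the subset loop for a pair starts, and the step-up helper uses $H^* = H$.

```python
from itertools import combinations


def accepted_nulls(pmax, q):
    """Step-up procedure with H* = H; returns edges whose non-existence is accepted."""
    order = sorted(pmax, key=pmax.get)
    H = i = len(order)
    while i > 0 and H * pmax[order[i - 1]] / i > q:
        i -= 1
    return set(order[i:])


def pcfdr_skeleton(vertices, ci_pvalue, q):
    """Sketch of the PCfdr-skeleton search; returns a set of frozenset edges."""
    adj = {v: set(vertices) - {v} for v in vertices}           # complete graph
    pmax = {frozenset(e): -1.0 for e in combinations(vertices, 2)}
    d = 0
    while True:
        for a in vertices:
            for b in sorted(adj[a]):                           # snapshot of neighbors
                if b not in adj[a]:                            # removed meanwhile
                    continue
                edge = frozenset((a, b))
                for C in combinations(sorted(adj[a] - {b}), d):
                    p = ci_pvalue(a, b, set(C))
                    if p <= pmax[edge]:
                        continue
                    pmax[edge] = p
                    if min(pmax.values()) < 0:                 # some edge untested
                        continue
                    # Removed edges stay in pmax, as in the non-heuristic version.
                    for e in accepted_nulls(pmax, q):
                        u, v = tuple(e)
                        adj[u].discard(v)
                        adj[v].discard(u)
                    if b not in adj[a]:
                        break
        d += 1
        if all(len(adj[a] - {b}) < d for a in vertices for b in adj[a]):
            break
    return {frozenset((a, b)) for a in vertices for b in adj[a]}


if __name__ == "__main__":
    # Idealized test for the chain 0 -> 1 -> 2: only {1} d-separates 0 and 2.
    oracle = lambda a, b, C: 1.0 if {a, b} == {0, 2} and C == {1} else 0.0
    print(pcfdr_skeleton([0, 1, 2], oracle, q=0.05))
```

With the idealized oracle in the usage example, the edge between vertices 0 and 2 is removed at depth 1 and the two true edges of the chain are kept. With real data, `ci_pvalue` would be replaced by a statistical test such as those mentioned in Appendix 4.6.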
4.2.5 Asymptotic Performance

Here we prove that the PCfdr-skeleton algorithm is able to control the FDR under a user-specified level $q$ ($q > 0$) at the limit of large sample sizes if the following assumptions are satisfied:

(A1) The probability distribution $P$ is faithful to a DAG $G_{true}$.
(A2) The number of vertices is fixed.
(A3) Given a fixed significance level of testing conditional-independence relationships, the power of detecting conditional-dependence relationships with statistical tests approaches 1 at the limit of large sample sizes. (For the definition of power in hypothesis testing, please refer to Table 4.2.)

Assumption (A1) is generally assumed when graphical models are applied, although it restricts the probability distribution $P$ to a certain class. Assumption (A2) is usually implicitly stated, but here we explicitly emphasize it because it simplifies the proof. Assumption (A3) may seem demanding, but actually it can be easily satisfied by standard statistical tests, such as the likelihood-ratio test introduced by Neyman and Pearson [24], if the data are identically and independently sampled. Two statistical tests that satisfy Assumption (A3) are listed in Appendix 4.6.

The detection power and the FDR of the PCfdr-skeleton algorithm and its heuristic modification at the limit of large sample sizes are elucidated in Theorems 1 and 2. The detailed proofs are provided in Appendix 4.5.

Theorem 1 Assuming (A1), (A2) and (A3), both the PCfdr-skeleton algorithm and its heuristic modification, the PCfdr*-skeleton algorithm, are able to recover all the true connections with probability one as the sample size approaches infinity:

$$\lim_{m \to \infty} P(E^{\sim}_{true} \subseteq E^{\sim}) = 1,$$

where $E^{\sim}_{true}$ denotes the set of the undirected edges derived from the true DAG $G_{true}$, $E^{\sim}$ denotes the set of the undirected edges recovered with the algorithms, and $m$ denotes the sample size.
Theorem 2 Assuming (A1), (A2) and (A3), the FDR of the undirected edges recovered with the PCfdr-skeleton algorithm approaches a value not larger than the user-specified level q as the sample size m approaches infinity:

\[ \limsup_{m \to \infty} \mathrm{FDR}\left(E^{\sim}, E^{\sim}_{\mathrm{true}}\right) \le q, \]

where $\mathrm{FDR}(E^{\sim}, E^{\sim}_{\mathrm{true}})$ is defined as

\[ \mathrm{FDR}\left(E^{\sim}, E^{\sim}_{\mathrm{true}}\right) = \mathbb{E}\left( \frac{|E^{\sim} \setminus E^{\sim}_{\mathrm{true}}|}{|E^{\sim}|} \right), \qquad \text{with } \frac{|E^{\sim} \setminus E^{\sim}_{\mathrm{true}}|}{|E^{\sim}|} \text{ defined as } 0 \text{ if } |E^{\sim}| = 0. \]

4.2.6 Computational Complexity

The PCfdr-skeleton algorithm spends most of its computation on performing statistical tests of conditional independence at step 8 and controlling the FDR at step 12. Since steps 13 to 19 of the PCfdr-skeleton algorithm play a role similar to that of steps 8 to 12 of the PC-skeleton algorithm, and all the other parts of both algorithms employ the same search strategy, the computation spent by the PCfdr-skeleton algorithm on statistical tests has the same complexity as that spent by the PC-skeleton algorithm. The only extra computational cost of the PCfdr-skeleton algorithm is at step 12, for controlling the FDR.

The computational complexity of the search strategy employed by the PC algorithm has been studied by Kalisch and Bühlmann [15]; see also Spirtes et al. [32], pages 85-87. To make this chapter self-contained, we briefly summarize the results as follows. It is difficult to analyze the complexity exactly, but if the algorithm stops at the depth $d = d_{\max}$, then the number of conditional-independence tests required is bounded by

\[ T = 2 C_N^2 \sum_{d=0}^{d_{\max}} C_{N-2}^{d}, \]

where N is the number of vertices, $C_N^2$ is the number of combinations of choosing 2 unordered and distinct elements from N elements, and similarly $C_{N-2}^{d}$ is the number of combinations of choosing d elements from N - 2 elements. In the worst case, $d_{\max} = N - 2$, the complexity is bounded by $2 C_N^2 2^{N-2}$. The bound is usually very loose, because it assumes that no edge has been removed until $d = d_{\max}$. In real-world applications, the algorithm is very fast for sparse networks.
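To give a feel for how quickly the bound grows, the following Python sketch evaluates T for given N and $d_{\max}$. The function name `max_ci_tests` is hypothetical; the formula is the one stated above.

```python
from math import comb


def max_ci_tests(n_vertices: int, d_max: int) -> int:
    """Upper bound T = 2 * C(N, 2) * sum_{d=0}^{d_max} C(N - 2, d)."""
    ordered_pairs = 2 * comb(n_vertices, 2)
    # Number of candidate conditioning sets of size 0..d_max drawn from
    # the N - 2 vertices other than the tested pair.
    conditioning_sets = sum(comb(n_vertices - 2, d) for d in range(d_max + 1))
    return ordered_pairs * conditioning_sets


# For N = 15 vertices, stopping at depth 2 versus the worst case d_max = N - 2.
print(max_ci_tests(15, 2))
print(max_ci_tests(15, 13))
```

For N = 15 the bound is 19,320 tests at $d_{\max} = 2$ but 1,720,320 tests in the worst case, which illustrates why sparse networks, where the search stops at a small depth, are learned quickly.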
The computational complexity of the FDR procedure, Algorithm 2, is generally $O(H \log(H) + H) = O(H \log(H))$, where $H = C_N^2$ is the number of input p-values. The sorting at step 1 costs $H \log(H)$, and the "while" loop from step 3 to step 5 repeats at most H times. However, if the sorted $P^{\max}$ is recorded during the computation, then each time an element of $P^{\max}$ is updated at step 10 of the PCfdr-skeleton algorithm, the complexity of keeping the updated $P^{\max}$ sorted is only $O(H)$. With this optimization, the complexity of the FDR-control procedure is $O(H \log(H))$ at its first operation and $O(H)$ afterwards. The FDR procedure is invoked only when $p_{a \perp b \mid C} > p^{\max}_{a \sim b}$. In the worst case, where $p_{a \perp b \mid C}$ is always larger than $p^{\max}_{a \sim b}$, the total computation spent on the FDR control is bounded by $O(C_N^2 \log(C_N^2) + T C_N^2) = O(N^2 \log(N) + T N^2)$, where T is the number of performed conditional-independence tests. This is a very loose bound, because it is rare that $p_{a \perp b \mid C}$ is always larger than $p^{\max}_{a \sim b}$.

The computational complexity of the heuristic modification, the PCfdr*-skeleton algorithm, is the same as that of the PCfdr-skeleton algorithm, since they share the same search strategy and both employ the FDR procedure. In the PCfdr*-skeleton algorithm, the size of $P^{\max}$ keeps decreasing as the algorithm progresses, so each operation of the FDR procedure is more efficient. However, since the PCfdr*-skeleton algorithm adjusts for the effect of multiple hypothesis testing less conservatively, it may remove fewer edges than the PCfdr-skeleton algorithm does, and therefore invoke more conditional-independence tests. Nevertheless, the complexity of both algorithms is bounded by the same limit in the worst case.
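The optimization of keeping $P^{\max}$ sorted can be sketched in Python as follows. This is only an illustration under the assumption that $P^{\max}$ is stored as a plain sorted list and that Algorithm 2 is the step-up procedure; the helper names `update_sorted` and `stepup_cutoff` are hypothetical.

```python
import bisect
from typing import List


def update_sorted(sorted_p: List[float], old_value: float,
                  new_value: float) -> None:
    """Replace one occurrence of old_value by new_value, keeping the order.

    The binary searches cost O(log H); the element shifts performed by
    pop and insort cost O(H). No full re-sort is needed.
    """
    idx = bisect.bisect_left(sorted_p, old_value)
    if idx == len(sorted_p) or sorted_p[idx] != old_value:
        raise ValueError("old_value is not in the list")
    sorted_p.pop(idx)
    bisect.insort(sorted_p, new_value)


def stepup_cutoff(sorted_p: List[float], q: float) -> int:
    """Largest rank i with p_(i) <= (i / H) * q, found in one O(H) pass."""
    H = len(sorted_p)
    cutoff = 0
    for rank, p in enumerate(sorted_p, start=1):
        if p <= rank / H * q:
            cutoff = rank
    return cutoff
```

After the initial sort, each update followed by a re-evaluation of the cutoff therefore needs only linear time in the number of recorded p-values.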
It is unfair to directly compare the computational time of the PCfdr-skeleton algorithm against that of the PC-skeleton algorithm, because if the q of the former is set at the same value as the α of the latter, the former will remove more edges and perform far fewer statistical tests, due to its more stringent control over the type I error rate. A more reasonable approach is to compare the time spent on the FDR control at step 12 against that spent on conditional-independence tests at step 8 in each run of the PCfdr-skeleton algorithm. If $P^{\max}$ is kept sorted during the learning process as described above, then each time (except the first time) the FDR procedure needs only linear computation time (with respect to the size of $P^{\max}$), using simple operations such as division and the comparison of two numerical values. Thus, we expect that the FDR procedure will not contribute much to the total computation time of the structure learning. In our simulation study in Section 4.3.1, the extra computation added by the FDR control was only a tiny portion, less than 0.5%, of that spent on testing conditional independence, performed with the Cochran-Mantel-Haenszel (CMH) test [see 1, pages 231-232], as shown in Tables 4.3 and 4.4.

4.2.7 Miscellaneous Discussions

An intuitive and attractive idea for adapting the PC-skeleton algorithm to the FDR control is to "smartly" determine an appropriate threshold α of the type I error rate such that the errors are controlled at the pre-defined FDR level q. Given a particular problem, it is very likely that the FDR of the graphs learned by the PC-skeleton algorithm is a monotonically increasing function of the pre-defined threshold α of the type I error rate. If this hypothesis is true, then there is a one-to-one mapping between α and q for the particular problem. Though we cannot prove this hypothesis rigorously, the following argument may be enlightening.

Instead of directly focusing on $\mathrm{FDR} = \mathbb{E}(FP/R_2)$ (see Table 4.2), the expected ratio of the number of false positives (FP) to the number of accepted positive hypotheses ($R_2$), we first focus on $\mathbb{E}(FP)/\mathbb{E}(R_2)$, the ratio of the expected number of false positives to the expected number of accepted positive hypotheses, since the latter is easier to link with the type I error rate according to Eq. (4.2), as shown in Eq. (4.4),

\[
\frac{\mathbb{E}(FP)}{\mathbb{E}(R_2)}
= \frac{\mathbb{E}(FP/T_1)}{\mathbb{E}(FP/T_1) + \mathbb{E}(1 - FN/T_2)\, T_2/T_1}
= \frac{\mathbb{E}(FP/T_1)}{\mathbb{E}(FP/T_1) + \left[1 - \mathbb{E}(FN/T_2)\right] T_2/T_1}
= \frac{\alpha}{\alpha + (1 - \beta)\, T_2/T_1}, \tag{4.4}
\]

where α and β are the type I error rate and the type II error rate, respectively. A set of sufficient conditions for $\mathbb{E}(FP)/\mathbb{E}(R_2)$ to be a monotonically increasing function of the type I error rate is (I) $(1 - \beta)/\alpha > \partial(1 - \beta)/\partial\alpha$, (II) $T_1 > 0$ and (III) $T_2 > 0$, where $\partial(1 - \beta)/\partial\alpha$ is the derivative of $(1 - \beta)$ with respect to α. If $(1 - \beta)$, regarded as a function of α, is a concave curve from (0, 0) to (1, 1), then condition (I) is satisfied. Recalling that $(1 - \beta)$ versus α is the receiver operating characteristic (ROC) curve, and that with an appropriate statistic the ROC curve of a hypothesis test is usually a concave curve from (0, 0) to (1, 1), we speculate that condition (I) is not difficult to satisfy. With the other two mild conditions, (II) $T_1 > 0$ and (III) $T_2 > 0$, we could expect that $\mathbb{E}(FP)/\mathbb{E}(R_2)$ is a monotonically increasing function of α.

$\mathbb{E}(FP)/\mathbb{E}(R_2)$ is the ratio of the expected values of two random variables, while $\mathbb{E}(FP/R_2)$ is the expected value of the ratio of two random variables. Generally, there is no monotonic relationship between $\mathbb{E}(FP)/\mathbb{E}(R_2)$ and $\mathbb{E}(FP/R_2)$. Nevertheless, if the average number of false positives, $\mathbb{E}(FP)$, increases proportionally faster than that of the accepted positives, $\mathbb{E}(R_2)$, we speculate that under certain conditions the $\mathrm{FDR} = \mathbb{E}(FP/R_2)$ also increases accordingly. Thus the FDR may be a monotonically increasing function of the threshold α of the type I error rate for the PC-skeleton algorithm.
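The role of condition (I) can be made explicit with a short derivation. The shorthand $g(\alpha) = 1 - \beta(\alpha)$ for the power, $r = T_2/T_1$ and $f(\alpha)$ for the right-hand side of Eq. (4.4) is introduced here only for this sketch, which also assumes that g is differentiable.

```latex
\[
f(\alpha) = \frac{\alpha}{\alpha + r\, g(\alpha)}, \qquad
f'(\alpha)
= \frac{\bigl(\alpha + r\, g(\alpha)\bigr) - \alpha \bigl(1 + r\, g'(\alpha)\bigr)}
       {\bigl(\alpha + r\, g(\alpha)\bigr)^{2}}
= \frac{r \bigl( g(\alpha) - \alpha\, g'(\alpha) \bigr)}
       {\bigl(\alpha + r\, g(\alpha)\bigr)^{2}} .
\]
% With r > 0 (conditions II and III), f'(\alpha) > 0 if and only if
% g(\alpha)/\alpha > g'(\alpha), which is condition (I).
\[
\text{If } g \text{ is concave with } g(0) = 0: \quad
0 = g(0) \le g(\alpha) + g'(\alpha)\,(0 - \alpha)
\;\Longrightarrow\;
\frac{g(\alpha)}{\alpha} \ge g'(\alpha) \quad (\alpha > 0),
\]
% and the inequality is strict when g is strictly concave.
```

The second display is the tangent-line inequality for concave functions evaluated at 0, which is why a concave ROC curve through (0, 0) is enough for condition (I).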
However, even though the FDR of the PC-skeleton algorithm may decrease as the pre-defined significance level α decreases, the FDR of the PC-skeleton algorithm still cannot be controlled at the user-specified level for general problems by "smartly" choosing an α beforehand. It has to be controlled in a slightly different way, as the PCfdr-skeleton algorithm does.

First, the value of such an α for the FDR control depends on the true graph, but unfortunately the graph is unknown in problems of structure learning. According to Eq. (4.2), the realized FDR is a function of the realized type I and type II error rates, as well as of $T_2/T_1$, which in the context of structure learning is the ratio of the number of true connections to the number of non-existing connections. Since $T_2/T_1$ is unknown, such an α cannot be determined completely in advance without any information about the true graph, but has to be estimated from the observed data.

Secondly, the FDR method we employ is precisely a method that estimates the α from the data to control the FDR of multiple hypothesis testing. The output of the FDR algorithm is the rejection of those null hypotheses associated with the p-values $p_{(1)}, \ldots, p_{(i)}$ (see Algorithm 2). Given $p_{(1)} \le \ldots \le p_{(H)}$, the output is equivalent to the rejection of all those hypotheses whose p-values are smaller than or equal to $p_{(i)}$. In other words, it is equivalent to setting $\alpha = p_{(i)}$ in the particular multiple hypothesis testing.

Thirdly, the PCfdr-skeleton algorithm is a valid solution to combining the FDR method with the PC-skeleton algorithm. Because the estimation of the α depends on p-values, and p-values are calculated one by one as the PC-skeleton algorithm progresses with hypothesis tests, the α cannot be estimated separately before the PC-skeleton algorithm starts running. Instead, the estimation has to be embedded within the algorithm, as in the PCfdr-skeleton algorithm.
Another idea for the FDR control in structure learning is a two-stage algorithm. The first stage is to draft a graph that correctly includes all the existing edges and their orientations but may also include non-existing edges. The second stage is to select the real parents for each vertex, with the FDR controlled, from the set of potential parents determined in the first stage. The advantage of this algorithm is that the selection of real parent vertices in the second stage is completely decoupled from the determination of edge orientations, because all the parents of each vertex have been correctly connected with the particular vertex in the first stage.

However, a few concerns should be noted before researchers start developing this two-stage algorithm. First, to avoid missing many existing edges in the first stage, a considerable number of non-existing edges may have to be included. To guarantee perfect protection of the existing edges given any randomly sampled data, the first stage must output a graph whose skeleton is a fully connected graph. The reason is that the type I error rate and the type II error rate trade off against each other, and the latter generally reaches zero only when the former approaches one (see Appendix 4.7). Rather than protecting existing edges perfectly, the first stage should trade off between the type I and the type II errors, in favour of keeping the type II error rate low. Second, selecting parent vertices from a set of candidate vertices in the second stage can, in a certain sense, be regarded as learning the structure of a sub-graph locally, in which error-rate control remains a crucial problem. Thus error-rate control is still involved in both stages. Though this two-stage idea may not essentially reduce the problem of the FDR control to an easier task, it may break the big task of simultaneously learning all edges into many local structure-learning tasks.
4.3 Empirical Evaluation

The PCfdr-skeleton algorithm and its heuristic modification are evaluated with simulated data sets, in comparison with the PC-skeleton algorithm, in terms of the FDR, the type I error rate and the power. The PCfdr-skeleton and the PC-skeleton algorithms are also applied to two real functional-magnetic-resonance-imaging (fMRI) data sets, to check whether the two algorithms correctly curb the error rates that they are supposed to control in real-world applications.

4.3.1 Simulation Study

The simulated data sets are generated from eight different DAGs, shown in Figure 4.1, with the number of vertices N = 15, 20, 25 or 30, and the average degree of vertices D = 2 or 3. The DAGs are generated as follows:

(1) Sample $N D / 2$ undirected edges from $\{a \sim b \mid a, b \in V \text{ and } a \neq b\}$ with equal probabilities and without replacement to compose an undirected graph $G^{\sim}_{\mathrm{true}}$.
(2) Generate a random order $>$ of the vertices by permutation.
(3) Orient the edges of $G^{\sim}_{\mathrm{true}}$ according to the order $>$. If a is before b in the order $>$, then orient the edge $a \sim b$ as $a \to b$. Denote the oriented graph as a DAG $G_{\mathrm{true}}$.

For each DAG, we associate its vertices with (conditional) binary probability distributions as follows, to extend it to a Bayesian network.

(1) Specify the strength of (conditional) dependence as a parameter δ > 0.
(2) Randomly assign each vertex $a \in V$ a dependence strength $\delta_a = 0.5\delta$ or $-0.5\delta$, with equal probabilities.
(3) Associate each vertex $a \in V$ with a logistic regression model

\[
\Delta = \sum_{b \in pa[a]} X_b\, \delta_b, \qquad
P(X_a = 1 \mid X_{pa[a]}) = \frac{\exp(\Delta)}{1 + \exp(\Delta)}, \qquad
P(X_a = -1 \mid X_{pa[a]}) = \frac{1}{1 + \exp(\Delta)},
\]

where $pa[a]$ denotes the parent vertices of a.
Figure 4.1: DAGs used in the simulation study (one panel for each combination of N = 15, 20, 25, 30 and D = 2, 3). N denotes the number of vertices and D denotes the average degree of the vertices. Unshaded vertices are associated with positive dependence strength 0.5δ, and shaded ones are associated with negative dependence strength -0.5δ.

The parameter δ reflects the strength of dependence because, if the values of all the other parent variables are fixed, the absolute difference between the logits of the conditional probabilities of $X_a = 1$ given a parent variable $X_b = 1$ and $X_b = -1$ is

\[
\left| \operatorname{logit}\left[P(X_a = 1 \mid X_b = 1, X_{pa[a] \setminus \{b\}})\right] - \operatorname{logit}\left[P(X_a = 1 \mid X_b = -1, X_{pa[a] \setminus \{b\}})\right] \right| = |2\delta_b| = \delta,
\]

where the logit function is defined as $\operatorname{logit}(x) = \log\left(\frac{x}{1 - x}\right)$.

Since the accuracy of the PC-skeleton algorithm and its FDR versions is related to the discriminability of the statistical tests, we generated data with different values of δ (δ = 0.5, 0.6, 0.7, 0.8, 0.9 and 1.0) to evaluate the algorithms' performance at different levels of power for detecting conditional dependence. The larger the absolute value of δ is, the more easily the dependence can be detected with statistical tests. Because, for the structure-learning algorithms, statistical tests are abstract queries yielding p-values about conditional independence, the accuracy of the algorithms is determined not by the particular procedure of a statistical test, or by a particular family of conditional probability distributions, but by the discriminability of the statistical tests.
Given a fixed sample size, the stronger the conditional-dependence relationships are, the higher the discriminability of the statistical tests is. By varying the dependence strength δ of the binary conditional probability distributions, we have varied the discriminability of the statistical tests, much as one would by varying the dependence strength in other families of probability distributions. To give the reader an intuitive sense of the dependence strength of these δ values, we list examples of probability pairs whose logit contrasts are equal to these δ values:

0.5 = logit(0.5622) - logit(0.4378),
0.6 = logit(0.5744) - logit(0.4256),
0.7 = logit(0.5866) - logit(0.4134),
0.8 = logit(0.5987) - logit(0.4013),
0.9 = logit(0.6106) - logit(0.3894),
1.0 = logit(0.6225) - logit(0.3775).

In total, we performed the simulation with 48 Bayesian networks generated with all the combinations of the following parameters:

N = 15, 20, 25, 30;
D = 2, 3;
δ = 0.5, 0.6, 0.7, 0.8, 0.9, 1.0.

From each Bayesian network, we repeatedly generated 50 data sets, each of 500 samples, to estimate the statistical performance of the algorithms. A non-model-based test, the Cochran-Mantel-Haenszel (CMH) test [see 1, pages 231-232], was employed to test conditional independence among random variables. Both the significance level α of the PC-skeleton algorithm and the FDR level q of the PCfdr-skeleton algorithm and its heuristic modification were set at 5%.

Figures 4.2, 4.3 and 4.4 respectively show the empirical FDR, power and type I error rate of the algorithms, estimated from the 50 data sets repeatedly generated from each Bayesian network, with error bars indicating the 95% confidence intervals of these estimates.
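The data-generating procedure described in this section can be sketched in Python as follows. This is a minimal illustration rather than the code used in the study (which was implemented in Matlab): the function names are hypothetical, only the standard library is used, and $N D / 2$ is rounded down when it is not an integer, which is an assumption of this sketch because the rounding rule is not stated in the text.

```python
import math
import random
from itertools import combinations
from typing import Dict, List, Tuple


def random_dag(n: int, avg_degree: int,
               rng: random.Random) -> Tuple[Dict[int, List[int]], List[int]]:
    """Return (parents, order) for a random DAG with about n*avg_degree/2 edges."""
    all_pairs = list(combinations(range(n), 2))
    # Assumption: round N*D/2 down when it is not an integer.
    edges = rng.sample(all_pairs, n * avg_degree // 2)  # without replacement
    order = list(range(n))
    rng.shuffle(order)  # random order of the vertices
    position = {v: i for i, v in enumerate(order)}
    parents: Dict[int, List[int]] = {v: [] for v in range(n)}
    for a, b in edges:
        # Orient each edge from the earlier vertex to the later one.
        src, dst = (a, b) if position[a] < position[b] else (b, a)
        parents[dst].append(src)
    return parents, order


def sample_dataset(parents: Dict[int, List[int]], order: List[int],
                   delta: float, m: int, rng: random.Random) -> List[List[int]]:
    """Draw m samples of +1/-1 variables from the logistic model."""
    # Each vertex gets strength +0.5*delta or -0.5*delta with equal probability.
    strength = {v: rng.choice((0.5 * delta, -0.5 * delta)) for v in order}
    data = []
    for _ in range(m):
        x: Dict[int, int] = {}
        for a in order:  # parents always precede their children in `order`
            big_delta = sum(x[b] * strength[b] for b in parents[a])
            p_one = 1.0 / (1.0 + math.exp(-big_delta))  # exp(D) / (1 + exp(D))
            x[a] = 1 if rng.random() < p_one else -1
        data.append([x[v] for v in range(len(order))])
    return data
```

A data set comparable to one replicate of the study would be obtained, for example, with `parents, order = random_dag(15, 2, random.Random(0))` followed by `sample_dataset(parents, order, 0.5, 500, random.Random(1))`.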
The PCfdr-skeleton algorithm controls the FDR under the user-specified level of 5% for all the 48 Bayesian networks, and the PCfdr*-skeleton algorithm steadily controls the FDR closely around 5%, while the PC-skeleton algorithm yields an FDR ranging from about 5% to about 35%, and above 15% in many cases, especially for the sparser DAGs with the average degree of vertices D = 2. The PCfdr-skeleton algorithm is conservative, with the FDR notably lower than the user-specified level, while its heuristic modification controls the FDR more accurately around the user-specified level, although the correctness of the heuristic modification has not been theoretically proved.

As the discriminability of the statistical tests increases, the power of all the algorithms approaches 1. When their FDR level q is set at the same value as the α of the PC-skeleton algorithm, the PCfdr-skeleton algorithm and its heuristic modification control the type I error rate more stringently than the PC-skeleton algorithm does, so their power is generally lower than that of the PC-skeleton algorithm. Figure 4.4 also clearly shows, as Eq. (4.3) implies, that it is the type I error rate, rather than the FDR, that the PC-skeleton algorithm controls under 5%.

Figure 4.5 shows the average computational time spent during each run of the PCfdr-skeleton algorithm and its heuristic modification on the statistical tests of (conditional) independence at step 8 and on the FDR control at step 12. The computational time was measured on a platform with an Intel Xeon 1.86 GHz CPU and 4 GB of RAM, with the code implemented in Matlab R14. Tables 4.3 and 4.4 show the average ratios of the computational time spent on the FDR control to that spent on the statistical tests. The average ratios are not more than 2.57e-03 (about 0.26%) for all the 48 Bayesian networks. The relatively small standard deviations, shown in brackets in the tables, indicate that these estimated ratios are reliable.
Because the PCfdr-skeleton algorithm and its heuristic modification employ the same search strategy as the PC-skeleton algorithm does, this result shows that the extra computational cost of achieving control over the FDR is trivial in comparison with the computation already spent by the PC-skeleton algorithm on statistical tests.

Figure 4.2: The FDR (with 95% confidence intervals) of the PC-skeleton algorithm, the PCfdr-skeleton algorithm and the PCfdr*-skeleton algorithm on the DAGs in Figure 4.1, as the dependence parameter δ shown on the x-axes increases from 0.5 to 1.0. (Panels: rows N = 15, 20, 25, 30; columns D = 2, 3; curves: PC, PC-fdr, PC-fdr*.)
Figure 4.3: The power (with 95% confidence intervals) of the PC-skeleton algorithm, the PCfdr-skeleton algorithm and the PCfdr*-skeleton algorithm on the DAGs in Figure 4.1, as the dependence parameter δ shown on the x-axes increases from 0.5 to 1.0. (Panels: rows N = 15, 20, 25, 30; columns D = 2, 3; curves: PC, PC-fdr, PC-fdr*.)

Figure 4.4: The type I error rates (with 95% confidence intervals) of the PC-skeleton algorithm, the PCfdr-skeleton algorithm and the PCfdr*-skeleton algorithm on the DAGs in Figure 4.1, as the dependence parameter δ shown on the x-axes increases from 0.5 to 1.0. (Panels: rows N = 15, 20, 25, 30; columns D = 2, 3; curves: PC, PC-fdr, PC-fdr*.)
Figure 4.5: The average computational time (in seconds, with 95% confidence intervals) spent on the FDR control and statistical tests during each run of the PCfdr-skeleton algorithm and its heuristic modification. (Panels: rows PCfdr-skeleton, PCfdr*-skeleton; columns D = 2, 3; logarithmic y-axes in seconds; x-axes δ from 0.5 to 1.0; curves for N = 15, 20, 25, 30, separately for statistical tests and FDR control.)

D   δ     N=15                  N=20                  N=25                  N=30
2   0.5   1.11e-03 (3.64e-04)   7.19e-04 (2.32e-04)   5.44e-04 (1.79e-04)   4.81e-04 (1.37e-04)
2   0.6   1.48e-03 (2.03e-04)   1.24e-03 (2.15e-04)   1.21e-03 (3.32e-04)   1.09e-03 (1.44e-04)
2   0.7   1.58e-03 (1.68e-04)   1.61e-03 (2.01e-04)   1.59e-03 (1.31e-04)   1.64e-03 (1.09e-04)
2   0.8   1.63e-03 (1.64e-04)   1.81e-03 (1.61e-04)   1.93e-03 (1.20e-04)   1.89e-03 (1.04e-04)
2   0.9   1.59e-03 (1.50e-04)   1.83e-03 (1.50e-04)   2.06e-03 (1.19e-04)   1.95e-03 (9.63e-05)
2   1.0   1.64e-03 (1.51e-04)   1.88e-03 (1.59e-04)   2.15e-03 (1.12e-04)   2.01e-03 (9.01e-05)
3   0.5   1.69e-03 (3.70e-04)   1.50e-03 (2.55e-04)   9.80e-04 (1.90e-04)   9.10e-04 (1.52e-04)
3   0.6   2.06e-03 (2.82e-04)   2.22e-03 (1.45e-04)   1.93e-03 (1.71e-04)   1.82e-03 (1.47e-04)
3   0.7   2.11e-03 (1.84e-04)   2.36e-03 (1.24e-04)   2.31e-03 (1.19e-04)   2.29e-03 (1.12e-04)
3   0.8   2.02e-03 (1.68e-04)   2.35e-03 (1.20e-04)   2.45e-03 (1.28e-04)   2.20e-03 (1.15e-04)
3   0.9   2.04e-03 (1.50e-04)   2.34e-03 (9.05e-05)   2.53e-03 (1.15e-04)   2.03e-03 (1.23e-04)
3   1.0   1.99e-03 (1.41e-04)   2.28e-03 (9.18e-05)   2.57e-03 (9.66e-05)   1.92e-03 (1.25e-04)

Table 4.3: The average ratios (with their standard deviations in brackets) of the computational time spent on the FDR control to that spent on the statistical tests during each run of the PCfdr-skeleton algorithm.

D   δ     N=15                  N=20                  N=25                  N=30
2   0.5   8.88e-04 (2.57e-04)   5.93e-04 (2.25e-04)   3.94e-04 (1.60e-04)   3.24e-04 (1.15e-04)
2   0.6   1.12e-03 (2.99e-04)   8.82e-04 (3.19e-04)   7.08e-04 (2.58e-04)   5.86e-04 (1.81e-04)
2   0.7   1.17e-03 (2.76e-04)   1.04e-03 (2.92e-04)   9.02e-04 (1.51e-04)   7.45e-04 (1.37e-04)
2   0.8   1.16e-03 (2.66e-04)   1.04e-03 (1.83e-04)   1.03e-03 (1.38e-04)   8.23e-04 (1.24e-04)
2   0.9   1.22e-03 (2.73e-04)   1.08e-03 (1.91e-04)   1.06e-03 (8.76e-05)   8.37e-04 (1.35e-04)
2   1.0   1.21e-03 (2.68e-04)   1.11e-03 (2.09e-04)   1.12e-03 (1.04e-04)   8.31e-04 (1.05e-04)
3   0.5   1.33e-03 (3.42e-04)   1.01e-03 (1.86e-04)   6.74e-04 (1.54e-04)   5.49e-04 (1.08e-04)
3   0.6   1.44e-03 (3.46e-04)   1.19e-03 (1.63e-04)   1.08e-03 (2.08e-04)   7.97e-04 (8.83e-05)
3   0.7   1.43e-03 (2.41e-04)   1.20e-03 (1.10e-04)   1.09e-03 (1.51e-04)   7.19e-04 (7.80e-05)
3   0.8   1.36e-03 (1.38e-04)   1.17e-03 (9.36e-05)   1.10e-03 (1.20e-04)   6.48e-04 (9.32e-05)
3   0.9   1.35e-03 (1.34e-04)   1.24e-03 (1.03e-04)   1.10e-03 (8.31e-05)   6.19e-04 (6.57e-05)
3   1.0   1.39e-03 (1.72e-04)   1.29e-03 (1.01e-04)   1.14e-03 (9.77e-05)   5.98e-04 (4.86e-05)

Table 4.4: The average ratios (with their standard deviations in brackets) of the computational time spent on the FDR control to that spent on the statistical tests during each run of the PCfdr*-skeleton algorithm.

4.3.2 Applications to Real fMRI Data

We applied the PCfdr-skeleton and the PC-skeleton algorithms to real-world research tasks, studying the connectivity network between brain regions using functional magnetic resonance imaging (fMRI). The purpose of the applications is to check whether the two algorithms correctly curb the error rates in real-world applications. The purpose is not, and should not be, to answer the question "which algorithm, the PCfdr-skeleton or the PC-skeleton, is superior?", for the following reasons. First, the two algorithms control different error rates, neither of which is superior to the other (see Appendix 4.7).
Secondly, the error rate of interest for a specific application is selected largely not by mathematical superiority, but by researchers’ interest and the scenario of research (see Appendix 4.7). Thirdly, the simulation study has clearly revealed the properties of and the differences (not superiority) between the two algorithms. Lastly, the approximating graphical models behind the real fMRI data are unknown, so the comparison on the real fMRI data is rough, rather than rigorous. The two algorithms were applied to two real fMRI data sets, one including 11 discrete variables and 1300 observations, and the other including 25 continuous variables and 1098 observations. The first data set, denoted by “the bulb-squeezing data set”, was collected from 10 healthy subjects each of whom was asked to squeeze a rubber bulb with their left hand at three different speeds or at a constant force, as cued by visual instruction. The data involve eleven variables: the speed of squeezing and the activities of the ten brain regions listed in Table 4.5. The speed of squeezing is coded as a discrete variable with four possible values: the high speed, the medium speed, the low speed, and the constant force. The activities of the brain regions are coded as discrete variables with three possible values: high activation, medium activation and low activation. The data of each subject include 130 time points. The data of the ten subjects are pooled together, so in total there are 1300 time points. For details of the data set, please refer to Li et al. [19]. The second data set, denoted by “the sentence-picture data set”, was collected from a single subject performing a cognitive task. In each trial of the task, the subject was shown in sequence an affirmative sentence and a simple picture, and then answered whether the sentence correctly described the picture. In half of the trials, the picture was presented first, followed by the sentence. 
In the remaining trials, the sentence was presented first, followed by the picture. The data involve the activities of 25 brain regions, as listed in Table 4.6, encoded as continuous variables, at 1098 time points. For details of the data set, please refer to Keller et al. [16] and Mitchell et al. [23].

The PCfdr-skeleton and the PC-skeleton algorithms were applied to both the bulb-squeezing and the sentence-picture data sets. Both the FDR level q of the PCfdr-skeleton algorithm and the type-I-error-rate level α of the PC-skeleton algorithm were set at 5%. For the bulb-squeezing data set, all of whose variables are discrete, conditional independence was tested with Pearson's chi-square test; for the sentence-picture data set, all of whose variables are continuous, conditional independence was tested with the t-test for partial correlation coefficients [10].

The networks learned from the bulb-squeezing data set and the networks learned from the sentence-picture data set are shown in Figures 4.6 and 4.7 respectively. For ease of comparison, the networks learned by the two algorithms are overlaid. Thin solid black edges are those connections detected by both algorithms; thin dashed blue edges are those detected only by the PCfdr-skeleton algorithm; thick solid red edges are those detected only by the PC-skeleton algorithm. In Figure 4.6, there are 17 thin solid black edges, no thin dashed blue edges and 1 thick solid red edge; in Figure 4.7, there are 39 thin solid black edges, 3 thin dashed blue edges and 12 thick solid red edges.

The results intuitively, though not rigorously, support our expectation of the performance of the two algorithms in real-world applications.
First, since the data sets are relatively large, with sample sizes of more than 1000, it is expected that both algorithms will recover many of the existing connections, and consequently the networks recovered by the two algorithms may share many common connections. This is consistent with the fact that in Figures 4.6 and 4.7 there are many thin solid black edges, i.e. the connections recovered by both algorithms.

Table 4.5: Brain regions involved in the bulb-squeezing data set. The prefixes “L” or “R” in the abbreviations stand for “Left” or “Right”, respectively.

Full Name                                    Abbreviation
Left/Right anterior cingulate cortex         L_ACC, R_ACC
Left/Right lateral cerebellar hemispheres    L_CER, R_CER
Left/Right primary motor cortex              L_M1, R_M1
Left/Right pre-frontal cortex                L_PFC, R_PFC
Left/Right supplementary motor cortex        L_SMA, R_SMA

Figure 4.6: The networks learned from the bulb-squeezing data set, by the PCfdr-skeleton and the PC-skeleton algorithms. For ease of comparison, the networks learned by the two algorithms are overlaid. Thin solid black edges are those connections detected by both algorithms; thick solid red edges are those connections detected only by the PC-skeleton algorithm. For the full names of the brain regions, please refer to Table 4.5. (Nodes: L_PFC, R_PFC, L_ACC, R_ACC, L_SMA, R_SMA, L_M1, R_M1, Speed, L_CER, R_CER.)

Table 4.6: Brain regions involved in the sentence-picture data set. The prefixes “L” or “R” in the abbreviations stand for “Left” or “Right”, respectively.

Full Name                                    Abbreviation
Calcarine fissure                            CALC
Left/Right dorsolateral prefrontal cortex    L_DLPFC, R_DLPFC
Left/Right frontal eye field                 L_FEF, R_FEF
Left inferior frontal gyrus                  L_IFG
Left/Right inferior parietal lobe            L_IPL, R_IPL
Left/Right intraparietal sulcus              L_IPS, R_IPS
Left/Right inferior temporal lobule          L_IT, R_IT
Left/Right opercularis                       L_OPER, R_OPER
Left/Right posterior precentral sulcus       L_PPREC, R_PPREC
Left/Right supramarginal gyrus               L_SGA, R_SGA
Supplementary motor cortex                   SMA
Left/Right superior parietal lobule          L_SPL, R_SPL
Left/Right temporal lobe                     L_T, R_T
Left/Right triangularis                      L_TRIA, R_TRIA

Figure 4.7: The networks learned from the sentence-picture data set, by the PCfdr-skeleton and the PC-skeleton algorithms. For ease of comparison, the networks learned by the two algorithms are overlaid. Thin solid black edges are those connections detected by both algorithms; thin dashed blue edges are those connections detected only by the PCfdr-skeleton algorithm; thick solid red edges are those connections detected only by the PC-skeleton algorithm. For the full names of the brain regions, please refer to Table 4.6. (Nodes: L_IPS, L_SPL, R_SPL, R_IPS, L_IPL, R_IPL, L_SGA, SMA, L_PPREC, R_SGA, R_PPREC, L_IT, L_FEF, R_IT, CALC, R_FEF, L_T, R_T, L_OPER, L_DLPFC, R_DLPFC, R_OPER, L_IFG, L_TRIA, R_TRIA.)

Second, since the PCfdr-skeleton algorithm is designed to control the FDR while the PC-skeleton algorithm is designed to control the type I error rate, it is expected that the two algorithms will control the corresponding error rate under or around the pre-defined level, which is 5% in this study. To verify whether the error rates were controlled as expected, we need to know which connections really exist and which do not. Unfortunately, this is very difficult for real data sets because, unlike for simulated data, the true models behind the real data are unknown, and in the literature researchers usually tend to report evidence supporting the existence of connections rather than evidence supporting their non-existence. However, since the sample sizes of the two data sets are relatively large, more than 1000, we can speculate that both algorithms have recovered most of the existing connections. Extrapolating this speculation a bit, we intuitively assume that those connections detected by both algorithms truly exist while all the others do not. In other words, we assume that all and only the thin black edges in the figures truly exist. We refer to this assumption as the “True Intersection” (TI) assumption. The statistics about Figures 4.6 and 4.7, under the TI assumption, are listed in Table 4.7.

Table 4.7: The realized error rates of the PCfdr-skeleton and the PC algorithms on the bulb-squeezing and sentence-picture data sets, under the TI assumption that all and only those connections detected by both of the two algorithms truly exist.

                             Assumed Truth                   Realized Detection
Data Set          Algorithm  Exist  Non-Exist                Correct  False  FDR     Type I Error Rate
Bulb-Squeezing    PCfdr      17     11(11-1)/2 - 17 = 38     17       0      0.00%   0.00%
Bulb-Squeezing    PC         17     38                       17       1      5.56%   2.63%
Sentence-Picture  PCfdr      39     25(25-1)/2 - 39 = 261    39       3      7.14%   1.14%
Sentence-Picture  PC         39     261                      39       12     23.5%   4.60%
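The realized error rates in Table 4.7 follow directly from the edge counts. The sketch below recomputes them as a sanity check; the class name `Detection` and its field names are illustrative assumptions, not part of the thesis or its Matlab package, and the counts are those reported in Table 4.7 under the TI assumption.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Edge counts for one algorithm on one data set (hypothetical helper)."""
    n_vertices: int  # number of nodes in the network
    n_true: int      # edges assumed to exist under the TI assumption
    n_correct: int   # detected edges that belong to the assumed truth
    n_false: int     # detected edges that do not belong to the assumed truth

    def n_nonexistent(self) -> int:
        # All vertex pairs, N(N-1)/2, minus the edges assumed to exist.
        return self.n_vertices * (self.n_vertices - 1) // 2 - self.n_true

    def realized_fdr(self) -> float:
        # False detections divided by all detections (0 if nothing is detected).
        detected = self.n_correct + self.n_false
        return self.n_false / detected if detected else 0.0

    def realized_type1(self) -> float:
        # False detections divided by all non-existing connections.
        return self.n_false / self.n_nonexistent()


# PC-skeleton algorithm on the sentence-picture data set (counts from Table 4.7).
pc_sentence = Detection(n_vertices=25, n_true=39, n_correct=39, n_false=12)
print(f"{pc_sentence.realized_fdr():.2%}")    # 23.53%
print(f"{pc_sentence.realized_type1():.2%}")  # 4.60%
```

The same two methods reproduce the bulb-squeezing row of the PC-skeleton algorithm from 11 vertices, 17 assumed edges and 1 false detection.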
The realized FDRs of the PCfdr-skeleton algorithm on the bulb-squeezing and sentence-picture data sets are 0.00% and 7.14%, respectively; the realized type I error rates of the PC-skeleton algorithm on the bulb-squeezing and sentence-picture data sets are 2.63% and 4.60%, respectively. Considering that the realized error rate, as a statistic extracted from just one trial, may slightly deviate from its expected value, these results, derived under the TI assumption, support that the two algorithms controlled the corresponding error rate under or around the pre-defined level of 5%.

Third, according to Eq. (4.2), the sparser and the larger the true network is, the higher the FDR of the PC-skeleton algorithm will be. For the bulb-squeezing data set, there are 11 vertices and, under the TI assumption, 17 existing connections and 38 non-existing connections. In this case, the realized FDR of the PC-skeleton algorithm is only 5.56% (Table 4.7). For the sentence-picture data set, there are 25 vertices and, under the TI assumption, 39 existing connections and 261 non-existing connections. In this case, the realized FDR of the PC-skeleton algorithm rises to 23.5% (Table 4.7). This notable increase of the realized FDR is consistent with the prediction based on Eq. (4.2).

It should be noted that the preceding arguments are rough rather than rigorous, since they are based on the TI assumption rather than the true models behind the data. However, because the true models behind the real data are unknown, the TI assumption is a practical and intuitive approach to assessing the performance of the two algorithms in the two real-world applications.

4.4 Conclusions and Discussions

We have proposed a modification of the PC algorithm, the PCfdr-skeleton algorithm, to curb the false discovery rate (FDR) of the skeleton of the learned Bayesian networks.
The FDR-control procedure embedded into the PC algorithm collectively considers the hypothesis tests related to the existence of multiple edges, correcting the effect of multiple hypothesis testing. Under mild assumptions, it is proved that the PCfdr-skeleton algorithm can control the FDR under a user-specified level q (q > 0) at the limit of large sample sizes (see Theorem 2). In the cases of moderate sample size (about several hundred), empirical experiments have shown that the method is still able to control the FDR under the user-specified level. The PCfdr*-skeleton algorithm, a heuristic modification of the proposed method, has shown better performance in the simulation study, steadily controlling the FDR closely around the user-specified level and gaining more detection power, although its asymptotic performance has not been theoretically proved. Both the PCfdr-skeleton algorithm and its heuristic modification can asymptotically recover all the edges of the true DAG (see Theorem 1). The idea of controlling the FDR can be extended to other constraint-based methods, such as the inductive causation (IC) algorithm [see 26, pages 49–51] and the fast-causal-inference (FCI) algorithm [see 32, pages 142–146].

The simulation study has also shown that the extra computation spent on achieving the FDR control is almost negligible when compared with that already spent by the PC algorithm on statistical tests of conditional independence. The computational complexity of the new algorithm is closely comparable with that of the PC algorithm.

As a modification based on the PC algorithm, the proposed method is modular, consisting of the PC search strategy, statistical tests of conditional independence and an FDR-control procedure. Different statistical tests and FDR-control procedures can be “plugged in”, depending on the data type and the statistical model.
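As an illustration of the pluggable FDR-control component, the following is a minimal sketch of a step-up FDR procedure. It assumes that the procedure is the Benjamini-Hochberg rule, or the Benjamini-Yekutieli variant [4] for arbitrarily dependent p-values, which matches the rejection rule used in the proofs of Section 4.5 (reject every hypothesis whose p-value does not exceed the largest ordered p-value p_(r) with (H*/r) p_(r) ≤ q). The function name `fdr_reject` and its arguments are hypothetical and are not taken from the released Matlab package.

```python
def fdr_reject(p_values, q, arbitrary_dependence=False):
    """Return the indices of the hypotheses rejected by a step-up FDR procedure.

    Sketch under the assumption stated in the text: H* is H for the
    Benjamini-Hochberg rule, or H * (1 + 1/2 + ... + 1/H) for the
    Benjamini-Yekutieli rule under arbitrary dependence.
    """
    h = len(p_values)
    if h == 0:
        return set()
    if arbitrary_dependence:
        h_star = h * sum(1.0 / i for i in range(1, h + 1))
    else:
        h_star = float(h)
    # Indices of the hypotheses sorted by ascending p-value.
    order = sorted(range(h), key=lambda i: p_values[i])
    # Find the largest rank r with (H*/r) * p_(r) <= q.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if h_star / rank * p_values[idx] <= q:
            cutoff = rank
    # Reject the hypotheses with the `cutoff` smallest p-values.
    return set(order[:cutoff])


p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(sorted(fdr_reject(p, q=0.05)))  # [0, 1]
```

In the structure-learning setting, each hypothesis would correspond to the non-existence of one edge, so a rejected hypothesis is an edge that is kept in the skeleton.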
Thus, the method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models.

It should be noted that the PCfdr-skeleton algorithm is not proposed to replace the PC-skeleton algorithm. Instead, it provides an approach to controlling the FDR, a certain error-rate criterion for testing the existence of multiple edges. When multiple edges are involved in structure learning, there are different applicable error-rate criteria, such as those listed in Table 4.2. The selection of these criteria depends on researchers’ interest and the scenarios of studies, which is beyond the scope of this paper. When the FDR is applied, the PCfdr-skeleton algorithm is preferable; when the type I error rate is applied, the PC-skeleton algorithm is preferable. The technical difference between the two algorithms is that the PCfdr-skeleton algorithm adaptively adjusts the type I error rate according to the sparseness of the network to achieve the FDR control, while the PC-skeleton algorithm fixes the type I error rate.

Currently the FDR control is applied only to the skeleton of the graph, and not yet to the directions of the edges. The final output of the PC algorithm is a partially directed acyclic graph that uniquely represents an equivalence class of DAGs, so a possible improvement for the PCfdr-skeleton algorithm is to extend the FDR control to the directions of the recovered edges. Because both type I and type II errors may lead to wrong directions in the later steps of the PC algorithm, minimizing direction errors may lead to a related, yet different, error-control task.

The asymptotic performance of the PCfdr-skeleton algorithm has only been proved under the assumption that the number of vertices is fixed. Its behavior when both the number of vertices and the sample size approach infinity has not been studied yet.
Kalisch and Bühlmann [15] proved that for Gaussian Bayesian networks, the PC algorithm consistently recovers the equivalence class of the underlying sparse DAG, as the sample size m approaches infinity, even if the number of vertices N grows as quickly as $O(m^{\lambda})$ for any $0 < \lambda < \infty$. Their idea is to adaptively decrease the type I error rate α of the PC-skeleton algorithm as both the number of vertices and the sample size increase. It is desirable to study whether similar behavior can be achieved with the PCfdr-skeleton algorithm if the FDR level q is adjusted appropriately as the sample size increases.

It is also worth developing FDR control for structure-learning methods not based on hypothesis tests, such as the popular and fast “least absolute shrinkage and selection operator” (lasso) [36]. For Gaussian graphical models, the lasso is a robust and efficient method for recovering sparse networks. It uses the L1-norm to curb the complexity of linear regression models when selecting predictors. Subject to an upper bound on the L1-norm of the regression coefficients, it minimizes the sum of squared residuals. Graphical lasso [22] considers learning connections as selecting predictors locally for each node variable. In comparison with other methods, it is extremely fast, able to solve a problem with 1000 nodes in about a minute [11]. Under and almost only under the “irrepresentable condition” [15, 39], graphical lasso is able to consistently recover the true network asymptotically, even when the number of nodes grows exponentially with the sample size. A method for controlling the type I error rate with lasso has been presented in [22], but the problem of FDR control has not been addressed and remains an interesting problem for future research.

A Matlab® package of the PCfdr-skeleton algorithm and its heuristic modification is downloadable at www.junningli.org/software.

Acknowledgment

The authors thank Dr. Martin J.
McKeown for sharing the functional magnetic resonance imaging (fMRI) data [19] and helpful discussions.

4.5 Proof of Theorems

To assist the reading, we list below notations frequently used in the proof:

$G^{\sim}_{\mathrm{true}}$: the skeleton of the true underlying Bayesian network.

$A_{a \sim b}$: the event that edge $a \sim b$ is in the graph recovered by the PCfdr-skeleton algorithm.

$A_{E^{\sim}_{\mathrm{true}}}$: $A_{E^{\sim}_{\mathrm{true}}} = \bigcap_{a \sim b \in E^{\sim}_{\mathrm{true}}} A_{a \sim b}$, the joint event that all the edges in $G^{\sim}_{\mathrm{true}}$, the skeleton of the true DAG, are recovered by the PCfdr-skeleton algorithm.

$E^{\nsim}_{\mathrm{true}}$: the set of the undirected edges that are not in $G^{\sim}_{\mathrm{true}}$.

$p_{a \sim b}$: the value of $p^{\max}_{a \sim b}$ when the PCfdr-skeleton algorithm stops.

$C^{*}_{ab}$: a certain vertex set that d-separates $a$ and $b$ in $G_{\mathrm{true}}$ and that is also a subset of either $\mathrm{adj}(a, G^{\sim}_{\mathrm{true}}) \setminus \{b\}$ or $\mathrm{adj}(b, G^{\sim}_{\mathrm{true}}) \setminus \{a\}$, according to Proposition 1. $C^{*}_{ab}$ is defined only for vertex pairs that are not adjacent in the true DAG $G_{\mathrm{true}}$.

$p^{*}_{a \sim b}$: the p-value of testing $X_a \perp X_b \mid X_{C^{*}_{ab}}$. The conditional-independence relationship may not be really tested during the process of the PCfdr-skeleton algorithm, but $p^{*}_{a \sim b}$ can still denote the value as if the conditional-independence relationship was tested.

$H^{*}$: the value in Eq. (4.1) that is either $H$ or $H(1 + 1/2 + \cdots + 1/H)$, depending on the assumption of the dependency of the p-values.

Lemma 1 If, as $m$ approaches infinity, the probabilities of $K$ events $A_1(m), \ldots, A_K(m)$ approach 1 at speed
\[ P(A_i(m)) = 1 - o(\beta(m)), \]
where $\lim_{m \to \infty} \beta(m) = 0$ and $K$ is a finite integer, then the probability of the joint of all these events approaches 1 at speed
\[ P\Big( \bigcap_{i=1}^{K} A_i(m) \Big) \ge 1 - K\, o(\beta(m)) \]
as $m$ approaches infinity.

Proof
∵ $\overline{\bigcap_{i=1}^{K} A_i(m)} = \bigcup_{i=1}^{K} \bar{A}_i(m)$.
∴ $P\big( \bigcap_{i=1}^{K} A_i(m) \big) = 1 - P\big( \bigcup_{i=1}^{K} \bar{A}_i(m) \big) \ge 1 - \sum_{i=1}^{K} P(\bar{A}_i(m)) = 1 - \sum_{i=1}^{K} [1 - P(A_i(m))] = 1 - \sum_{i=1}^{K} o(\beta(m)) = 1 - K\, o(\beta(m))$.

Corollary 1 If $A_1(m), \ldots, A_K(m)$ are a finite number of events whose probabilities each approach 1 as $m$ approaches infinity:
\[ \lim_{m \to \infty} P(A_i(m)) = 1, \]
then the probability of the joint of all these events approaches 1 as $m$ approaches infinity:
\[ \lim_{m \to \infty} P\Big( \bigcap_{i=1}^{K} A_i(m) \Big) = 1. \]

Lemma 2 If there are $F$ ($F \ge 1$) false hypotheses among $H$ tested hypotheses, and the p-values of all the false hypotheses are smaller than or equal to $\frac{F}{H^{*}} q$, where $H^{*}$ is either $H$ or $H(1 + 1/2 + \cdots + 1/H)$, depending on the assumption of the dependency of the p-values, then all the $F$ false hypotheses will be rejected by the FDR procedure, Algorithm 2.

Proof Let $p_i$ ($i = 1, \ldots, H$) denote the p-value of the $i$th hypothesis, $p_f$ denote the maximum of the p-values of the $F$ false hypotheses, and $r_f$ denote the rank of $p_f$ in the ascending order of $\{p_i\}_{i=1,\ldots,H}$.
∵ $p_f$ is the maximum of the p-values of the $F$ false hypotheses.
∴ $r_f = |\{p_i \mid p_i \le p_f\}| \ge F$.
∴ $\frac{H^{*}}{r_f} p_f \le \frac{H^{*}}{F} p_f$.
∵ $p_f \le \frac{F}{H^{*}} q$.
∴ $\frac{H^{*}}{r_f} p_f \le \frac{H^{*}}{F} p_f \le q$.
∴ Hypotheses with p-values not greater than $p_f$ will be rejected.
∵ The p-values of the $F$ false hypotheses are not greater than $p_f$.
∴ All the $F$ false hypotheses will be rejected by the FDR procedure, Algorithm 2.

Proof of Theorem 1
If there is not any edge in the true DAG $G_{\mathrm{true}}$, then the proof is trivially $E^{\sim}_{\mathrm{true}} = \emptyset \subseteq E^{\sim}$. In the following part of the proof, we assume $E^{\sim}_{\mathrm{true}} \ne \emptyset$.

For the PCfdr-skeleton algorithm and its heuristic modification, whenever the FDR procedure, Algorithm 2, is invoked, $p^{\max}_{a \sim b}$ never exceeds $\max_{C \subseteq V \setminus \{a,b\}} \{p_{a \perp b \mid C}\}$, and the number of p-values input to the FDR algorithm is never more than $C_N^2$. Thus, according to Lemma 2, if
\[ \max_{a \sim b \in E^{\sim}_{\mathrm{true}}} \Big\{ \max_{C \subseteq V \setminus \{a,b\}} \{ p_{a \perp b \mid C} \} \Big\} \le \frac{|E^{\sim}_{\mathrm{true}}|}{C_N^2 \sum_{i=1}^{C_N^2} \frac{1}{i}}\, q, \qquad (4.5) \]
then all the true connections will be recovered by the PCfdr-skeleton algorithm and its heuristic modification.

Let $A'_{a \perp b \mid C}$ denote the event
\[ p_{a \perp b \mid C} \le \frac{|E^{\sim}_{\mathrm{true}}|}{C_N^2 \sum_{i=1}^{C_N^2} \frac{1}{i}}\, q, \]
$A'_{E^{\sim}_{\mathrm{true}}}$ denote the event of Eq. (4.5), and $A_{E^{\sim}_{\mathrm{true}}}$ denote the event that all the true connections are recovered by the PCfdr-skeleton algorithm and its heuristic modification.
∵ $A'_{E^{\sim}_{\mathrm{true}}}$ is a sufficient condition for $A_{E^{\sim}_{\mathrm{true}}}$, according to Lemma 2.
∴ $A_{E^{\sim}_{\mathrm{true}}} \supseteq A'_{E^{\sim}_{\mathrm{true}}}$.
∴ $P(A_{E^{\sim}_{\mathrm{true}}}) \ge P(A'_{E^{\sim}_{\mathrm{true}}})$.
∵ $A'_{E^{\sim}_{\mathrm{true}}}$ is the joint of a limited number of events as
\[ A'_{E^{\sim}_{\mathrm{true}}} = \bigcap_{a \sim b \in E^{\sim}_{\mathrm{true}}} \; \bigcap_{C \subseteq V \setminus \{a,b\}} A'_{a \perp b \mid C}, \]
and $\lim_{m \to \infty} P(A'_{a \perp b \mid C}) = 1$ according to Assumption (A3).
∴ According to Corollary 1, $\lim_{m \to \infty} P(A'_{E^{\sim}_{\mathrm{true}}}) = 1$.
∴ $1 \ge \lim_{m \to \infty} P(A_{E^{\sim}_{\mathrm{true}}}) \ge \lim_{m \to \infty} P(A'_{E^{\sim}_{\mathrm{true}}}) = 1$.
∴ $\lim_{m \to \infty} P(A_{E^{\sim}_{\mathrm{true}}}) = 1$.

Lemma 3 Given any FDR level $q > 0$, if the p-value vector $P = [p_1, \ldots, p_H]$ input to Algorithm 2 is replaced with $P' = [p'_1, \ldots, p'_H]$, such that (1) for those hypotheses that are rejected when $P$ is the input, $p'_i$ is equal to or less than $p_i$, and (2) for all the other hypotheses, $p'_i$ can be any value between 0 and 1, then the set of rejected hypotheses when $P'$ is the input is a superset of those rejected when $P$ is the input.

Proof Let $R$ and $R'$ denote the sets of the rejected hypotheses when $P$ and $P'$ are respectively input to the FDR procedure. If $R = \emptyset$, then the proof is trivially $R' \supseteq \emptyset = R$. If $R \ne \emptyset$, let us define $\alpha = \max_{i \in R} p_i$ and $\alpha' = \max_{i \in R} p'_i$. Let $r = |R|$ denote the rank of $\alpha$ in the ascending order of $P$ and $r'$ denote the rank of $\alpha'$ in the ascending order of $P'$.
∵ $p'_i \le p_i$ for all $i \in R$.
∴ $\alpha' \le \alpha$.
∵ $\alpha' = \max_{i \in R} p'_i$.
∴ $r' \ge |R| = r$.
∵ $\frac{H^{*}}{r} \alpha \le q$.
∴ $\frac{H^{*}}{r'} \alpha' \le \frac{H^{*}}{r} \alpha \le q$.
∴ When $P'$ is the input, hypotheses with $p'_i$ smaller than or equal to $\alpha'$ will be rejected.
∵ $p'_i \le \alpha'$, $\forall i \in R$.
∴ $R \subseteq R'$, equivalently $R' \supseteq R$.

Corollary 2 Given any FDR level $q > 0$, if the p-value vector $P = [p_1, \ldots, p_H]$ input to Algorithm 2 is replaced with $P' = [p'_1, \ldots, p'_H]$ such that $p'_i \le p_i$ for all $i = 1, \ldots, H$, then the set of rejected hypotheses when $P'$ is the input is a superset of those rejected when $P$ is the input.

Proof of Theorem 2
Let $E^{\sim}_{\mathrm{stop}}$ and $E^{\nsim}_{\mathrm{stop}}$ denote the undirected edges respectively recovered and removed by the PCfdr-skeleton algorithm when the algorithm stops.
Let the sequence $P^{\max}_1, \ldots, P^{\max}_K$ denote the values of $P^{\max}$ when the FDR procedure is invoked at step 12 as the algorithm progresses, in the order of the update process of $P^{\max}$, and let $E^{\nsim}_k$ denote the set of removable edges indicated by the FDR procedure, with $P^{\max}_k$ as the input. $E^{\nsim}_k$ may include edges that have already been removed.
∵ The PCfdr-skeleton algorithm accumulatively removes edges in $E^{\nsim}_k$.
∴ $E^{\nsim}_{\mathrm{stop}} = \bigcup_{k=1}^{K} E^{\nsim}_k$.
∵ $P^{\max}$ is updated increasingly at step 10 of the algorithm.
∴ According to Corollary 2, $E^{\nsim}_1 \subseteq \cdots \subseteq E^{\nsim}_K$.
∴ $E^{\nsim}_{\mathrm{stop}} = \bigcup_{k=1}^{K} E^{\nsim}_k = E^{\nsim}_K$.

Let $P = \{p_{a \sim b}\}$ denote the value of $P^{\max}$ when the PCfdr-skeleton algorithm stops.
∵ The FDR procedure is invoked whenever $P^{\max}$ is updated.
∴ The value of $P^{\max}$ does not change after the FDR procedure is invoked for the last time.
∴ $P = P^{\max}_K$.
∴ $E^{\sim}_{\mathrm{stop}}$ is the same as the edges recovered by directly applying the FDR procedure to $P$.

The theorem is proved through comparing the result of the PCfdr-skeleton algorithm with that of applying the FDR procedure to a virtual p-value set constructed from $P$. The virtual p-value set $P^{*}$ is defined as follows. For a vertex pair $a \sim b$ that is not adjacent in the true DAG $G_{\mathrm{true}}$, let $C^{*}_{ab}$ denote a certain vertex set that d-separates $a$ and $b$ in $G_{\mathrm{true}}$ and that is also a subset of either $\mathrm{adj}(a, G^{\sim}_{\mathrm{true}}) \setminus \{b\}$ or $\mathrm{adj}(b, G^{\sim}_{\mathrm{true}}) \setminus \{a\}$. Let us define $P^{*} = \{p^{*}_{a \sim b}\}$ as:
\[ p^{*}_{a \sim b} = \begin{cases} p_{a \perp b \mid C^{*}_{ab}} & : a \sim b \in E^{\nsim}_{\mathrm{true}}, \\ p_{a \sim b} & : a \sim b \in E^{\sim}_{\mathrm{true}}. \end{cases} \]
Though $p_{a \perp b \mid C^{*}_{ab}}$ may not be actually calculated during the process of the algorithm, $p_{a \perp b \mid C^{*}_{ab}}$ can still denote the value as if it was calculated. Let us design a virtual algorithm, called Algorithm*, that recovers edges by just applying the FDR procedure to $P^{*}$, and let $E^{\sim}_{*}$ denote the edges recovered by this virtual algorithm. This algorithm is virtual and impracticable because the calculation of $P^{*}$ depends on the unknown $E^{\sim}_{\mathrm{true}}$, but this algorithm exists because $E^{\sim}_{\mathrm{true}}$ exists.

For any vertex pair $a$ and $b$ that is not adjacent in $G_{\mathrm{true}}$:
∵ $X_a$ and $X_b$ are conditionally independent given $X_{C^{*}_{ab}}$.
∴ $p_{a \perp b \mid C^{*}_{ab}}$ follows the uniform distribution on [0, 1].
∴ The FDR of Algorithm* is under $q$.

When all the true edges are recovered by the PCfdr-skeleton algorithm, i.e. $E^{\sim}_{\mathrm{true}} \subseteq E^{\sim}_{\mathrm{stop}}$, the conditional independence between $X_a$ and $X_b$ given $X_{C^{*}_{ab}}$ is tested for all the falsely recovered edges $a \sim b \in E^{\nsim}_{\mathrm{true}} \cap E^{\sim}_{\mathrm{stop}}$, because for these edges, subsets of $\mathrm{adj}(a, G^{\sim}_{\mathrm{true}}) \setminus \{b\}$ and subsets of $\mathrm{adj}(b, G^{\sim}_{\mathrm{true}}) \setminus \{a\}$ have been exhaustively searched and $C^{*}_{ab}$ is one of them. Therefore, $p_{a \sim b} \ge p^{*}_{a \sim b}$ for all $a \sim b \in E^{\sim}_{\mathrm{stop}}$ when event $A_{E^{\sim}_{\mathrm{true}}}$ happens. Consequently, according to Lemma 3, if event $A_{E^{\sim}_{\mathrm{true}}}$ happens, $E^{\sim}_{\mathrm{stop}} \subseteq E^{\sim}_{*}$.

Let $q(E^{\sim})$ denote the realized FDR of reporting $E^{\sim}$ as the recovered skeleton of the true DAG:
\[ q(E^{\sim}) = \begin{cases} \dfrac{|E^{\sim} \cap E^{\nsim}_{\mathrm{true}}|}{|E^{\sim}|} & : E^{\sim} \ne \emptyset, \\ 0 & : E^{\sim} = \emptyset. \end{cases} \]
The FDRs of the PCfdr-skeleton algorithm and Algorithm* are $E[q(E^{\sim}_{\mathrm{stop}})]$ and $E[q(E^{\sim}_{*})]$ respectively. Here $E[x]$ means the expected value of $x$. In the following, $A$ abbreviates the event $A_{E^{\sim}_{\mathrm{true}}}$ and $\bar{A}$ its complement.
∵ $E[q(E^{\sim}_{\mathrm{stop}})] = E[q(E^{\sim}_{\mathrm{stop}}) \mid A] P(A) + E[q(E^{\sim}_{\mathrm{stop}}) \mid \bar{A}] P(\bar{A}) \le Q + P(\bar{A})$, where $Q = E[q(E^{\sim}_{\mathrm{stop}}) \mid A] P(A)$.
∴ $\limsup_{m \to \infty} E[q(E^{\sim}_{\mathrm{stop}})] \le \limsup_{m \to \infty} Q + \limsup_{m \to \infty} P(\bar{A})$.
∵ $\lim_{m \to \infty} P(A) = 1$, according to Theorem 1.
∴ $\limsup_{m \to \infty} P(\bar{A}) = \lim_{m \to \infty} P(\bar{A}) = 0$.
∴ $\limsup_{m \to \infty} E[q(E^{\sim}_{\mathrm{stop}})] \le \limsup_{m \to \infty} Q$.
∵ $Q \le E[q(E^{\sim}_{\mathrm{stop}})]$.
∴ $\limsup_{m \to \infty} Q \le \limsup_{m \to \infty} E[q(E^{\sim}_{\mathrm{stop}})]$.
∴ $\limsup_{m \to \infty} E[q(E^{\sim}_{\mathrm{stop}})] = \limsup_{m \to \infty} Q = \limsup_{m \to \infty} E[q(E^{\sim}_{\mathrm{stop}}) \mid A] P(A)$.

Similarly, $\limsup_{m \to \infty} E[q(E^{\sim}_{*})] = \limsup_{m \to \infty} E[q(E^{\sim}_{*}) \mid A] P(A)$.
∵ Given event $A$, $E^{\sim}_{\mathrm{true}} \subseteq E^{\sim}_{\mathrm{stop}} \subseteq E^{\sim}_{*}$.
∴ Given event $A$,
\[ q(E^{\sim}_{\mathrm{stop}}) = \frac{|E^{\sim}_{\mathrm{stop}} \cap E^{\nsim}_{\mathrm{true}}|}{|E^{\sim}_{\mathrm{stop}}|} = 1 - \frac{|E^{\sim}_{\mathrm{true}}|}{|E^{\sim}_{\mathrm{stop}}|} \le 1 - \frac{|E^{\sim}_{\mathrm{true}}|}{|E^{\sim}_{*}|} = \frac{|E^{\sim}_{*} \cap E^{\nsim}_{\mathrm{true}}|}{|E^{\sim}_{*}|} = q(E^{\sim}_{*}). \]
∴ $\limsup_{m \to \infty} E[q(E^{\sim}_{\mathrm{stop}}) \mid A] P(A) \le \limsup_{m \to \infty} E[q(E^{\sim}_{*}) \mid A] P(A)$.
∴ $\limsup_{m \to \infty} E[q(E^{\sim}_{\mathrm{stop}})] \le \limsup_{m \to \infty} E[q(E^{\sim}_{*})]$.
∵ Algorithm* controls the FDR under $q$.
∴ $E[q(E^{\sim}_{*})] \le q$.
∴ $\limsup_{m \to \infty} E[q(E^{\sim}_{*})] \le q$.
∴ $\limsup_{m \to \infty} E[q(E^{\sim}_{\mathrm{stop}})] \le q$.

4.6 Statistical Tests with Asymptotic Power Equal to One

Assumption (A3) on the asymptotic power of detecting conditional dependence appears demanding, but actually the detection power of several standard statistical tests approaches one as the number of identically and independently sampled observations approaches infinity. Listed as follows are two statistical tests satisfying Assumption (A3) for Gaussian models or discrete models.

Fisher’s z transformation on sample partial-correlation-coefficients for Gaussian models

In multivariate Gaussian models, $X_a$ and $X_b$ are conditionally independent given $X_C$ if and only if the partial-correlation-coefficient of $X_a$ and $X_b$ given $X_C$ is zero [see 18, pages 129–130]. The partial-correlation-coefficient $\rho$ is defined as:
\[ \rho = \frac{\mathrm{Cov}[Y_a, Y_b]}{\sqrt{\mathrm{Var}[Y_a]\,\mathrm{Var}[Y_b]}}, \]
\[ Y_a = X_a - \langle W_a, X_C \rangle, \qquad Y_b = X_b - \langle W_b, X_C \rangle, \]
\[ W_a = \arg\min_{w} E[(X_a - \langle w, X_C \rangle)^2], \qquad W_b = \arg\min_{w} E[(X_b - \langle w, X_C \rangle)^2]. \]
The sample partial-correlation-coefficient $\hat{\rho}$ can be calculated from $m$ i.i.d. samples $[x_{ai}, x_{bi}, x_{Ci}]$ ($i = 1, \ldots, m$) as:
\[ \hat{\rho} = \frac{\frac{1}{m} \sum_{i=1}^{m} [(\hat{y}_{ai} - \bar{y}_a)(\hat{y}_{bi} - \bar{y}_b)]}{\sqrt{\frac{1}{m} \sum_{i=1}^{m} (\hat{y}_{ai} - \bar{y}_a)^2 \; \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_{bi} - \bar{y}_b)^2}}, \]
\[ \bar{y}_a = \frac{1}{m} \sum_{i=1}^{m} \hat{y}_{ai}, \qquad \bar{y}_b = \frac{1}{m} \sum_{i=1}^{m} \hat{y}_{bi}, \]
\[ \hat{y}_{ai} = x_{ai} - \langle \hat{W}_a, x_{Ci} \rangle, \qquad \hat{y}_{bi} = x_{bi} - \langle \hat{W}_b, x_{Ci} \rangle, \]
\[ \hat{W}_a = \arg\min_{w} \sum_{i=1}^{m} (x_{ai} - \langle w, x_{Ci} \rangle)^2, \qquad \hat{W}_b = \arg\min_{w} \sum_{i=1}^{m} (x_{bi} - \langle w, x_{Ci} \rangle)^2. \]
The asymptotic distribution of $z(\hat{\rho})$, where $z(x)$, the Fisher’s z transformation [see 9], is defined as
\[ z(x) = \frac{1}{2} \log \frac{1 + x}{1 - x}, \]
is the normal distribution with mean $z(\rho)$ and variance $1/(m - |C| - 3)$ [see 2, pages 120–134].
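The following minimal sketch shows how the asymptotic normality of $z(\hat{\rho})$ yields a two-sided p-value for the null hypothesis $\rho = 0$, and how the power of the test approaches one as $m$ grows, as Assumption (A3) requires. The sample partial correlation is taken as an input rather than computed from regression residuals, and the function names `fisher_z`, `p_value` and `asymptotic_power` are illustrative assumptions, not names from the thesis or its Matlab package. The power formula relies on the normal approximation stated above.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution, mean 0 and variance 1


def fisher_z(x: float) -> float:
    """Fisher's z transformation, z(x) = 0.5 * log((1 + x) / (1 - x))."""
    return 0.5 * math.log((1.0 + x) / (1.0 - x))


def p_value(rho_hat: float, m: int, cond_size: int) -> float:
    """Two-sided p-value for rho = 0, given m samples and |C| = cond_size."""
    scale = math.sqrt(m - cond_size - 3)  # requires m > cond_size + 3
    return 2.0 * (1.0 - _STD_NORMAL.cdf(abs(scale * fisher_z(rho_hat))))


def asymptotic_power(rho: float, m: int, cond_size: int, alpha: float = 0.05) -> float:
    """Approximate probability of rejecting rho = 0 when the true value is rho."""
    scale = math.sqrt(m - cond_size - 3)
    crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    shift = scale * fisher_z(rho)
    # Upper rejection region plus lower rejection region.
    return (1.0 - _STD_NORMAL.cdf(crit - shift)) + _STD_NORMAL.cdf(-crit - shift)


# The power for a weak dependence (rho = 0.1) grows with the sample size.
for m in (50, 200, 1000):
    print(m, round(asymptotic_power(0.1, m, cond_size=2), 3))
```

With a true partial correlation of zero, `asymptotic_power` returns the type I error rate `alpha`, which is a quick check that the rejection region is set up correctly.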
When the type I error rate is kept lower than $\alpha$, the power of detecting $\rho \ne 0$ with Fisher’s z transformation is the probability that $\sqrt{m - |C| - 3}\; z(\hat{\rho})$ falls in the range $(-\infty, \Phi^{-1}(\alpha/2)]$ or $[\Phi^{-1}(1 - \alpha/2), \infty)$, where $\Phi$ is the cumulative distribution function of the standard normal distribution and $\Phi^{-1}$ is its inverse function. Without loss of generality, we assume the true partial-correlation-coefficient $\rho$ is greater than zero; then the asymptotic power is
\[ \lim_{m \to \infty} \mathrm{Power} \ge \lim_{m \to \infty} P\big( \sqrt{m - |C| - 3}\; z(\hat{\rho}) \ge \Phi^{-1}(1 - \alpha/2) \big) = \lim_{m \to \infty} \big( 1 - \Phi[\Phi^{-1}(1 - \alpha/2) - \sqrt{m - |C| - 3}\; z(\rho)] \big) = 1 - \Phi[-\infty] = 1. \]

The likelihood-ratio test generally applicable to nested models

The likelihood ratio is the ratio of the maximum likelihood of a restricted model to that of a saturated model [see 24]. Let $f(x, \theta)$ denote the probability density function of a random vector $x$ parametrized with $\theta = [\theta_1, \ldots, \theta_k]$. The null hypothesis restricts $\theta$ to a set $\Omega$ specified with $r$ ($r \le k$) constraints
\[ \xi_1(\theta) = \xi_2(\theta) = \cdots = \xi_r(\theta) = 0. \]
Given i.i.d. observations $x_1, \ldots, x_m$, let $L(\theta)$ denote the likelihood function
\[ L(\theta) = \prod_{i=1}^{m} f(x_i, \theta). \]
The likelihood ratio $\Lambda$ given the observations is defined as
\[ \Lambda = \frac{\sup_{\theta \in \Omega} L(\theta)}{\sup_{\theta} L(\theta)}. \]
Wald [38] has proved that under certain assumptions on $f(x, \theta)$ and $\xi_1(\theta), \ldots, \xi_r(\theta)$, the limit distribution of the statistic $-2 \log \Lambda$ is the $\chi^2_r$ distribution with $r$ degrees of freedom if the null hypothesis is true. If the null hypothesis is not true, the distribution of $-2 \log \Lambda$ approaches the non-central $\chi^2_r(\lambda)$ distribution with $r$ degrees of freedom and the non-central parameter
\[ \lambda = m D(\theta) \ge 0, \]
where $D(\theta)$ is a function of the constraints $\xi_i(\theta)$, their partial derivatives $\partial \xi_i / \partial \theta_p$, and the expectation $E_\theta$ of second-order partial derivatives, with respect to $\theta_p$ and $\theta_q$, of a term in $f(x, \theta)$ [see 38]. If $D(\theta) > 0$, then $\lim_{m \to \infty} \lambda = \infty$.

Let $t$ ($t < \infty$) denote the threshold of rejecting the null hypothesis with type I error rate under $\alpha$ ($\alpha > 0$). The asymptotic power of detecting a $\theta$ that is not in $\Omega$ and whose $D(\theta)$ is greater than 0 is $\lim_{\lambda \to \infty} P(\chi^2_r(\lambda) > t)$. The mean and the variance of the $\chi^2_r(\lambda)$ distribution are $u = r + \lambda$ and $\sigma^2 = 2(r + 2\lambda)$, respectively. When $\lambda$ is large enough,
\[ P(\chi^2_r(\lambda) > t) \ge P\big( t < \chi^2_r(\lambda) < u + (u - t) \big) = 1 - P\big( |\chi^2_r(\lambda) - u| \ge u - t \big). \]
According to Chebyshev’s inequality,
\[ P\big( |\chi^2_r(\lambda) - u| \ge u - t \big) \le \frac{\sigma^2}{(u - t)^2} = \frac{2(r + 2\lambda)}{(r + \lambda - t)^2}. \]
∴ When $\lambda$ is large enough, $P(\chi^2_r(\lambda) > t) \ge 1 - \frac{2(r + 2\lambda)}{(r + \lambda - t)^2}$.
∵ $\lim_{m \to \infty} \lambda = \infty$ and both $r$ and $t$ are fixed.
∴ $\lim_{m \to \infty} \frac{2(r + 2\lambda)}{(r + \lambda - t)^2} = 0$.
∴ $\lim_{m \to \infty} P(\chi^2_r(\lambda) > t) = 1$.

4.7 Error Rates of Interest

Statistical decision processes usually involve choices between negative hypotheses and their alternatives, positive hypotheses. In the decision, there are basically two sources of errors: the type I errors, i.e. falsely rejecting negative hypotheses when they are actually true; and the type II errors, i.e. falsely accepting negative hypotheses when their alternatives, the positive hypotheses, are actually true. In the context of learning graph structures, a negative hypothesis could be that an edge does not exist in the graph, while the positive hypothesis could be that the edge does exist.

It is generally impossible to absolutely prevent the two types of errors simultaneously, because observations of a limited sample size may appear to support a positive hypothesis more than a negative hypothesis even when the negative hypothesis is actually true, or vice versa, due to the stochastic nature of random sampling. Moreover, the two types of errors generally contradict each other. Given a fixed sample size and a certain statistic extracted from the data, decreasing the type I errors will increase the type II errors, and vice versa. To guarantee the absolute prevention of the type I errors in any situation, one must accept all negative hypotheses, which will generally drive the type II error rate to one, and vice versa. The contradiction between the two types of errors is clearly revealed by the monotone increase of receiver operating characteristic (ROC) curves.
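The trade-off between the two types of errors can be made concrete with a toy example. The sketch below assumes, purely for illustration, a test statistic that is normally distributed with unit variance, with mean 0 when the negative hypothesis is true and mean `effect` when the positive hypothesis is true; neither this model nor the name `error_rates` comes from the thesis. Raising the rejection threshold lowers the type I error rate and raises the type II error rate, which traces out one ROC curve.

```python
from statistics import NormalDist


def error_rates(threshold: float, effect: float = 2.0):
    """Error rates of the rule 'reject the negative hypothesis when statistic > threshold'.

    Hypothetical model: statistic ~ N(0, 1) under a true negative hypothesis
    and statistic ~ N(effect, 1) under a true positive hypothesis.
    """
    null = NormalDist(0.0, 1.0)
    alt = NormalDist(effect, 1.0)
    type1 = 1.0 - null.cdf(threshold)  # true negative hypothesis falsely rejected
    type2 = alt.cdf(threshold)         # negative hypothesis falsely accepted
    return type1, type2


for t in (0.0, 1.0, 2.0, 3.0):
    t1, t2 = error_rates(t)
    print(f"threshold={t:.1f}  type I={t1:.3f}  type II={t2:.3f}")
```

Under this toy model, only a larger `effect` (a better statistic or a larger sample) reduces both error rates at once; moving the threshold merely exchanges one for the other.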
Thus the errors must be controlled by setting a threshold on a certain type of errors, or trading off between them, for instance, by minimizing a certain loss function associated with the errors according to the Bayesian decision theory.

Rooted in the two types of errors, there are several different error-rate criteria (as listed in Table 4.2) for problems involving simultaneously testing multiple hypotheses, such as verifying the existence of edges in a graph. The type I error rate is the expected ratio of the type I errors to all the negative hypotheses that are actually true; the type II error rate is the expected ratio of the type II errors to all the positive hypotheses that are actually true; the false discovery rate (FDR) [see 4, 34] is the expected ratio of falsely accepted positive hypotheses to all those accepted positive hypotheses; the family-wise error rate is the probability that at least one of the accepted positive hypotheses is actually wrong.

Generally, there are no mathematically or technically superior relationships among these error-rate criteria. Each of these error rates may be favoured in certain research scenarios. For example:

We are diagnosing a dangerous disease whose treatment is so risky that it may cause the loss of eyesight. Due to the great risk of the treatment, we hope that less than 0.1% of healthy people will be falsely diagnosed as patients of the disease. In this case, the type I error rate should be controlled under 0.1%.

We are diagnosing cancer patients. Because failure in detecting the disease will miss the potential chance to save the patient’s life, we hope that 95% of the cancer patients will be correctly detected. In this case, the type II error rate should be controlled under 5%.

In a pilot study, we are selecting candidate genes for genetic research on Parkinson’s disease. Because of the limited funding, we can only study a limited number of genes in the subsequent genetic research, so when selecting candidate genes in the pilot study, we hope that 95% of the selected candidate genes are truly associated with the disease. In this case, the FDR will be chosen as the error rate of interest and should be controlled under 5%.

We are selecting electronic components to make a device. Any error in any component will cause the device to malfunction. To guarantee that the device functions well with a probability higher than 99%, the family-wise error rate should be controlled under 1%.

In these examples, the particular error-rate criteria are selected not for mathematical or technical superiority, but according to the researchers’ interest, to minimize a certain loss function associated with the errors according to the Bayesian decision theory.

Learning network structures in real-world applications may face scenarios similar to the above examples. Excellent discrimination between negative hypotheses and positive hypotheses cannot be achieved by “smartly” setting a threshold on a “superior” error-rate criterion. Setting a threshold on a certain type of error rate is just choosing a cut-point on the ROC curve. If the ROC curve is not sharp enough, any cut-point on the curve away from the ends (0,0) and (1,1) still leads to considerable errors. To discriminate more accurately between a negative hypothesis and a positive hypothesis, one must design a better statistic or increase the sample size to achieve a sharper ROC curve.

Bibliography

[1] A. Agresti. Categorical Data Analysis (2nd edition). John Wiley & Sons, Inc., 2002.
[2] T. W. Anderson. An Introduction to Multivariate Statistical Analysis (2nd edition). John Wiley & Sons, Inc., 1984.
[3] S. A. Andersson, D. Madigan, and M. D. Perlman. A characterization of Markov equivalence classes for acyclic digraphs.
The Annals of Statistics, 25(2):505–541, 1997.
[4] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188, 2001.
[5] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.
[6] C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, 14(3):462–467, 1968.
[7] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9(4):309–347, 1992.
[8] D. Eaton and K. Murphy. Bayesian structure learning using dynamic programming and MCMC. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, 2007.
[9] R. A. Fisher. Frequency distribution of the values of the correlation coefficients in samples from an indefinitely large population. Biometrika, 10(4):507–521, 1915.
[10] R. A. Fisher. The distribution of the partial correlation coefficient. Metron, 3:329–332, 1924.
[11] J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 2007.
[12] N. Friedman and D. Koller. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1):95–125, 2003.
[13] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.
[14] E. H. Herskovits and G. F. Cooper. Kutato: An entropy-driven system for the construction of probabilistic expert systems from databases. In Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, pages 54–62, 1990.
[15] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8:613–636, 2007.
[16] T. A. Keller, M. A. Just, and V. A. Stenger. Reading span and the time-course of cortical activation in sentence-picture verification. In Annual Convention of the Psychonomic Society, 2001.
[17] M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004.
[18] S. L. Lauritzen. Graphical Models. Clarendon Press, Oxford University Press, Oxford, New York, 1996.
[19] J. Li, Z. J. Wang, S. J. Palmer, and M. J. McKeown. Dynamic Bayesian network modelling of fMRI: A comparison of group analysis methods. NeuroImage, 41:398–407, 2008.
[20] J. Listgarten and D. Heckerman. Determining the number of non-spurious arcs in a learned DAG model: Investigation of a Bayesian and a frequentist approach. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence, 2007.
[21] D. Madigan, J. York, and D. Allard. Bayesian graphical models for discrete data. International Statistical Review, 63(2):215–232, 1995.
[22] N. Meinshausen and P. Bühlmann. High dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34:1436–1462, 2006.
[23] T. M. Mitchell, R. Hutchinson, R. S. Niculescu, F. Pereira, X. Wang, M. Just, and S. Newman. Learning to decode cognitive states from brain images. Machine Learning, 57(1):145–175, 2004.
[24] J. Neyman and E. S. Pearson. On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika, 20A:175–240, 1928.
[25] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[26] J. Pearl. Causality. Cambridge University Press, 2000.
[27] J. Pearl and T. S. Verma. A statistical semantics for causation. Statistics and Computing, 2(2):91–95, 1992.
[28] H. Qian and S. Huang. Comparison of false discovery rate methods in identifying genes with differential expression. Genomics, 86(4):495–503, 2005.
[29] R. W. Robinson. Counting labeled acyclic digraphs. In New Directions in the Theory of Graphs, pages 239–273. Academic Press, 1973.
[30] J. Schäfer and K. Strimmer. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21(6):754–764, 2005.
[31] P. Spirtes and C. Glymour. An algorithm for fast recovery of sparse causal graphs. Social Science Computer Review, 9:62–72, 1991.
[32] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. The MIT Press, 2001.
[33] B. Steinsky. Enumeration of labelled chain graphs and labelled essential directed acyclic graphs. Discrete Mathematics, 270(1-3):267–278, 2003.
[34] J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3):479–498, 2002.
[35] J. D. Storey. The positive false discovery rate: A Bayesian interpretation and the q-value. The Annals of Statistics, 31(6):2013–2035, 2003.
[36] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B, 58:267–288, 1996.
[37] I. Tsamardinos and L. E. Brown. Bounding the false discovery rate in local Bayesian network learning. In Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, July 2008.
[38] A. Wald. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Transactions of the American Mathematical Society, 54(3):426–482, 1943.
[39] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
Chapter 5

Extending Error-Rate Control to Dynamic Bayesian Networks^12

The PCfdr algorithms presented in Chapter 4 contribute to the field of network learning by introducing a theoretically justified approach to controlling the false discovery rate. However, the algorithms in Chapter 4 are only basic implementations of the main idea. To extend control of the false discovery rate to dynamic Bayesian networks, the PCfdr algorithms must be adapted and further developed.

In this chapter, we present two extensions of the PCfdr algorithms designed for dynamic Bayesian networks. The first is an adaptation to prior knowledge, allowing users to specify which edges must appear in the network, which cannot, and which are to be learned from data. This extension is naturally applicable to dynamic Bayesian networks, by simply regarding them as Bayesian networks that cannot have edges from time t + 1 to time t. The second extension uses the PCfdr algorithms to improve Bayesian inference of dynamic Bayesian networks. The idea is to first learn a network with the PCfdr algorithms, and then make Bayesian inference based on a prior distribution derived from the learned network. It accelerates Bayesian inference and is relatively robust to perturbing noise.

^12 A version of Section 5.1 of this chapter has been published. Junning Li, Z. Jane Wang and Martin J. McKeown (2008) Learning Brain Connectivity with the False-Discovery-Rate-Controlled PC-Algorithm. Proceedings of the 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 4617–4620. A version of Section 5.2 of this chapter has been accepted for publication. Junning Li, Z. Jane Wang and Martin J. McKeown (2009) Incorporating Error-Rate-Controlled Prior in Modelling Brain Functional Connectivity. The 3rd International Conference on Bioinformatics and Biomedical Engineering.
5.1 Integration of Prior Knowledge^13

The PCfdr algorithms in Chapter 4 do not consider prior knowledge and therefore cannot be applied to models such as dynamic Bayesian networks. In this section, we present an extension of the PCfdr algorithms that is able to incorporate prior knowledge. We also demonstrate how to apply it to learning brain connectivity from continuous fMRI data, with an application to a study on Parkinson’s disease.

^13 Portions reprinted, with permission, from Junning Li, Z. Jane Wang and Martin J. McKeown (2008) Learning Brain Connectivity with the False-Discovery-Rate-Controlled PC-Algorithm. Proceedings of the 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 4617–4620. © 2008 IEEE.

5.1.1 False Discovery Rate

When the existence of many connections is investigated simultaneously, the error rate should be controlled appropriately. The false discovery rate (FDR) [1] is an error-rate criterion for multiple testing, defined as the expected ratio of falsely rejected negative hypotheses to all those rejected. Referring to Table 5.1, the FDR is formally defined as

\[ \mathrm{FDR} = E\left( \frac{FP}{R_2} \right), \tag{5.1} \]

where $FP / R_2$ is defined to be zero when $R_2$ is zero. A variation of the FDR, the positive false discovery rate (pFDR), defined as

\[ \mathrm{pFDR} = E\left( \frac{FP}{R_2} \,\Big|\, R_2 > 0 \right), \tag{5.2} \]

was proposed in [11]. In the context of learning brain connectivity, a negative hypothesis is that a connection between two brain regions does not exist, and a positive hypothesis is that the connection exists. The FDR is the expected proportion of the falsely discovered connections among all those discovered.

Table 5.1: Results of multiple hypothesis testing

                              Truth
Test results    Negative               Positive               Total
Negative        TN (true negative)     FN (false negative)    R_1
Positive        FP (false positive)    TP (true positive)     R_2
Total           T_1                    T_2

Among the many error-rate criteria, such as the type I error rate and the type II error rate, the FDR is a reasonable criterion for biomedical applications, because it directly reflects the uncertainty of reported positive results.

Yoav Benjamini and Daniel Yekutieli [1] have proved that, when the test statistics have positive regression dependency on each of the test statistics corresponding to the true negative hypotheses, the FDR can be controlled under the user-specified level q by the following procedure.

(1) Sort the p-values of the H hypothesis tests in ascending order as $p_{(1)} \le \ldots \le p_{(H)}$.
(2) Find the largest k such that $p_{(k)} \le \frac{k}{H} q$.
(3) Reject hypotheses 1, ..., k.

In other cases of dependency, the FDR can be controlled with a simple conservative modification of the procedure, by replacing H in step 2 with $H (1 + 1/2 + \ldots + 1/H)$.

5.1.2 PCfdr* Algorithm with Prior Knowledge

The extended PCfdr* algorithm is described in Algorithm 4. It incorporates prior knowledge with two inputs: $E_{must}$ and $E_{test}$. $E_{must}$ is the set of edges that must appear in the true graph according to the prior knowledge, and $E_{test}$ is the set of edges that are to be tested from the observed data X. Let $E_{full}$ denote the edges of the fully connected graph. Because the union of $E_{must}$ and $E_{test}$ does not necessarily cover all the edges of the fully connected graph, the edge set E initialized in step 1 may exclude some edges in $E_{full}$. In this way, prior knowledge about impossible edges is incorporated as well. The original PCfdr* algorithm can be regarded as a special case of the extended algorithm, obtained by setting $E_{must} = \emptyset$ and $E_{test} = E_{full}$.
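As an illustration of the step-up procedure in Section 5.1.1, the following is a minimal Python sketch, not the implementation used in this thesis. The function name `fdr_step_up`, its arguments and the example p-values are hypothetical; the `arbitrary_dependence` flag applies the conservative modification that replaces H with H(1 + 1/2 + ... + 1/H).

```python
def fdr_step_up(p_values, q, arbitrary_dependence=False):
    """Return the (sorted) indices of the hypotheses rejected at FDR level q.

    Steps: sort the p-values, find the largest k with p_(k) <= k * q / H,
    and reject the k hypotheses with the smallest p-values.
    """
    H = len(p_values)
    if H == 0:
        return []
    denom = float(H)
    if arbitrary_dependence:
        # Conservative variant: H is replaced by H * (1 + 1/2 + ... + 1/H).
        denom *= sum(1.0 / i for i in range(1, H + 1))
    order = sorted(range(H), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / denom:
            k = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k])


# Hypothetical p-values of six edge tests.
p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
print(fdr_step_up(p, q=0.05))                             # [0, 1]
print(fdr_step_up(p, q=0.05, arbitrary_dependence=True))  # [0]
```

Because the rule looks for the largest passing rank, a hypothesis can be rejected even when a smaller p-value fails its own threshold; this is what distinguishes the step-up rule from testing each p-value in isolation.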
Algorithm 4 The extended PCfdr* algorithm

Notations: a, b, ... and A, B, ... denote vertices and vertex sets respectively; $X_a$, $X_b$, ... and $X_A$, $X_B$, ... denote the variables and variable sets represented by their subscripts respectively. $a \sim b$ denotes an undirected edge. G denotes an undirected graph and E denotes the undirected-edge set of G. $adj(a, G)$ denotes the vertices adjacent to a in graph G. $X_a \perp X_b \mid X_C$ denotes the conditional independence between $X_a$ and $X_b$ given $X_C$.

Input: the data X; the undirected edges $E_{must}$ that must appear in the true undirected graph $G_{true}$ according to prior knowledge; the undirected edges $E_{test}$ ($E_{must} \cap E_{test} = \emptyset$) whose existence is to be tested from the data X; and the FDR level q for testing $E_{test}$.

Output: an undirected graph G.

 1: Initialize the undirected edge set E as $E = E_{test} \cup E_{must}$, and form an undirected graph G from E accordingly.
 2: Initialize the maximum p-values associated with the edges in $E_{test}$ as $P^{max} = \{ p^{max}_{a \sim b} = -1 \}_{a \sim b \in E_{test}}$.
 3: Let depth d = 0.
 4: repeat
 5:   for each ordered pair of vertices a and b such that $a \sim b \in E_{test}$ and $|adj(a, G) \setminus \{b\}| \ge d$ do
 6:     for each subset $C \subseteq adj(a, G) \setminus \{b\}$ with $|C| = d$ do
 7:       Test the hypothesis $X_a \perp X_b \mid X_C$ and calculate the p-value $p_{a \perp b \mid C}$. If $p_{a \perp b \mid C} > p^{max}_{a \sim b}$, then let $p^{max}_{a \sim b} = p_{a \perp b \mid C}$.
 8:       if every element of $P^{max}$ has been assigned a valid p-value and $p^{max}_{a \sim b}$ has just been updated by step 7, then
 9:         Apply the FDR procedure to $P^{max}$ to control the FDR under q. Let $E_{negative}$ denote the edges whose non-existence is accepted by the FDR procedure. Remove $E_{negative}$ from E and also remove their corresponding elements from $P^{max}$. Update G accordingly.
10:         if $a \sim b$ is removed, then
11:           break the for loop at line 6.
12:         end if
13:       end if
14:     end for
15:   end for
16:   Let d = d + 1.
17: until $|adj(a, G) \setminus \{b\}| < d$ for every ordered pair of vertices a and b such that $a \sim b$ is in $E \cap E_{test}$.

The extended PCfdr* algorithm is applicable to dynamic Bayesian networks, to which the original PCfdr* algorithm is not.
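To make the role of the two edge-set inputs concrete for a dynamic Bayesian network, the sketch below enumerates the candidate undirected edges of one repeating two-slice pattern of a first-order model. It is only an illustrative assumption about how the edge sets could be encoded, not the code used in this thesis: the function name, the (channel, slice) node encoding, and the choice of an empty must-appear set are hypothetical. Lagged connections between the two slices and unlagged connections within the later slice are put in the test set; connections within the earlier slice are excluded, on the assumption that under a repeating pattern they duplicate those of the later slice.

```python
from itertools import combinations, product


def first_order_dbn_edge_sets(channels):
    """Build (e_must, e_test, e_excluded) for one two-slice pattern.

    Nodes are (channel, slice) pairs, with slice 1 for time t and slice 2
    for time t + 1. Undirected edges are frozensets of two nodes.
    """
    earlier = [(c, 1) for c in channels]
    later = [(c, 2) for c in channels]

    # Connections with a one-time-point lag (directions ignored).
    lagged = {frozenset(pair) for pair in product(earlier, later)}
    # Connections without time lag, kept only inside the later slice.
    unlagged = {frozenset(pair) for pair in combinations(later, 2)}

    e_test = lagged | unlagged
    e_must = set()  # assumption: no edge is forced by prior knowledge
    e_full = {frozenset(pair) for pair in combinations(earlier + later, 2)}
    e_excluded = e_full - e_test - e_must
    return e_must, e_test, e_excluded


e_must, e_test, e_excluded = first_order_dbn_edge_sets(["a", "b"])
print(len(e_test))         # 5 candidate edges for two channels
print(sorted(map(sorted, e_excluded)))  # [[('a', 1), ('b', 1)]]
```

For M channels this encoding tests M^2 lagged and M(M - 1)/2 unlagged connections; anything outside both returned sets is never considered by the learning procedure.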
A multi-channel stochastic process can be modelled with a dynamic Bayesian network of M × T nodes, where M and T denote the number of channels and the number of time points respectively, and each node represents the signal of a channel at one time point. Because the future can influence neither the present nor the past, nodes at time t + 1 can have only nodes after time t as their children. To reduce the complexity of the model, the order of dynamics is usually restricted. For example, it may be assumed that nodes at time t cannot have nodes at or after t + 2 as their children. In these cases, many connections must be excluded from the network. To reduce the complexity further, the same connection pattern is usually assumed to repeat over time.

Figure 5.1: All the possible connections (with the directions ignored) of a first-order dynamic Bayesian network for a two-channel stochastic process.

Figure 5.1 shows all the possible connections (with the directions ignored) of a first-order dynamic Bayesian network for a two-channel stochastic process. Because this is a first-order dynamic model, connections from $a_1$ and $b_1$ to $a_3$ and $b_3$ should be excluded. If the same connection pattern is assumed to repeat, then all the possible connections in the repeating pattern can be represented by the sub-network circled by dots. The repeating connection pattern can be learned with the extended PCfdr* algorithm by excluding the connection $a_1 \sim b_1$ from both $E_{must}$ and $E_{test}$. The original PCfdr* algorithm cannot incorporate the exclusion of $a_1 \sim b_1$.

5.1.3 The Conditional-Independence Test for fMRI Signals

A basic module of the extended PCfdr* algorithm is the test of whether two random variables $X_a$ and $X_b$ are conditionally independent given a set of other random variables $X_C$. $X_a$ and $X_b$ are conditionally independent given $X_C$ if and only if $P(X_a, X_b \mid X_C) = P(X_a \mid X_C) P(X_b \mid X_C)$.
fMRI data are continuous, so we modelled the fMRI signals of multiple brain regions with multivariate Gaussian stochastic processes. The t-test on the partial correlation coefficient is the most widely used test of conditional independence for multivariate Gaussian distributions. If $[X_a, X_b, X_C]$ follows a multivariate Gaussian distribution $N(u, \Sigma)$, then the partial correlation coefficient between $X_a$ and $X_b$ given $X_C$, denoted as $\rho_{ab|C}$, is

\[ \rho_{ab|C} = \frac{-k_{ab}}{\sqrt{k_{aa} k_{bb}}}, \tag{5.3} \]

where $k_{ab}$, $k_{aa}$ and $k_{bb}$ are the elements of $\Sigma^{-1}$ corresponding to $X_a$ and $X_b$. $\rho_{ab|C}$ is the correlation coefficient between $X_a$ and $X_b$ after the effect of $X_C$ has been removed. $X_a$ and $X_b$ are conditionally independent given $X_C$ if and only if $\rho_{ab|C}$ is zero [6]. The sample partial correlation coefficient $r_{ab|C}$ can be calculated in a similar way by replacing $\Sigma$ with the covariance matrix estimated from the data. Under the null hypothesis of conditional independence, if $r_{ab|C}$ is estimated from S independent and identically distributed samples, then the statistic

\[ t = \sqrt{S - 2 - |C|} \; \frac{r_{ab|C}}{\sqrt{1 - r_{ab|C}^2}} \tag{5.4} \]

follows the t-distribution with $S - 2 - |C|$ degrees of freedom, and the two-tailed t-test can be used to test the null hypothesis [7].

5.1.4 Test of Relative Consistence Among Graphs

The group analysis was conducted with the individual-structure approach, i.e. the connectivity networks were learned individually for each subject, and then those connections that are consistently recruited by subjects in the same experimental group were extracted. If the frequency with which a certain connection appears among a group of graphs is statistically significantly higher than expected by chance, the connection is considered to be relatively consistently recruited. Suppose there are N graphs containing $L_n$ (n = 1, ..., N) connections respectively, and the total number of possible connections of a graph is L.
If connections in the same graph are recruited with equal chances, then given $L_n$ (n = 1, ..., N), the number of a connection’s appearances in the N graphs is a random number $Y = \sum_{n=1}^{N} X_n$, where $X_n \sim \mathrm{Bernoulli}(L_n / L)$. Under this hypothesis of randomness, the probability that a connection appears y times or more is $P(Y \ge y)$. We applied the test of relative consistence to all the possible connections, adjusted for the effect of multiple testing with the FDR procedure, and finally selected those whose q-values are lower than 5% as relatively consistent connections to compose a connectivity network at the group level.

5.1.5 Application to Parkinson’s Disease

Data Collection: The study was approved by the University of British Columbia ethics board and all subjects gave written informed consent prior to participating. Ten healthy people and ten Parkinson’s disease patients participated in the study. While in the MRI scanner, subjects were instructed to squeeze a rubber bulb with their left hand at four frequencies (0.00 Hz, 0.25 Hz, 0.5 Hz and 0.75 Hz) in 30 s blocks, arranged in a pseudo-random order. The patients performed the experiment twice, once before L-dopa medication and once after the medication. fMRI data of their brain activities were collected with a Philips Achieva 3.0 T scanner. The following twelve brain regions in the Talairach atlas were selected with the local linear discriminant analysis [9] as the regions of interest (ROIs) in the study: the left and right (denoted by the prefixes “L” and “R” in the abbreviated names of the ROIs) primary motor cortex (L_M1 and R_M1), supplementary motor cortex (L_SMA and R_SMA), lateral cerebellar hemisphere (L_CER and R_CER), putamen (L_PUT and R_PUT), caudate (L_CAU and R_CAU), and thalamus (L_THA and R_THA).

Preprocessing: After motion correction, the fMRI time courses of the voxels within each ROI were averaged as the summary activity of the ROI. Then, the averaged time courses were detrended and normalized to unit variance.
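Before turning to the results, the test of relative consistence from Section 5.1.4 can be sketched in a few lines. Under the randomness hypothesis, Y is a sum of independent Bernoulli variables with unequal success probabilities, so its tail probability can be computed exactly by convolving the N two-point distributions. The function names and the example counts below are hypothetical and are not taken from the study.

```python
def poisson_binomial_pmf(probs):
    """Exact distribution of a sum of independent Bernoulli(p_n) variables.

    Returns a list pmf with pmf[k] = P(Y = k) for k = 0, ..., len(probs).
    """
    pmf = [1.0]  # distribution of an empty sum: P(Y = 0) = 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # the connection is absent in this graph
            nxt[k + 1] += mass * p      # the connection is present in this graph
        pmf = nxt
    return pmf


def consistency_p_value(y, edge_counts, n_possible):
    """P(Y >= y) when graph n recruits each connection with chance L_n / L."""
    probs = [l_n / n_possible for l_n in edge_counts]
    return sum(poisson_binomial_pmf(probs)[y:])


# Hypothetical example: ten subject-level graphs, each with 20 of 210
# possible connections; one connection is observed in 6 of the graphs.
print(consistency_p_value(6, [20] * 10, 210))
```

The p-values obtained this way for all possible connections would then be passed to the FDR procedure of Section 5.1.1 to select the relatively consistent connections.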
Results: The connectivity networks of the normal people and of the patients before and after the medication are shown in Figure 5.2. The graphs are clearly not the result of randomness. For example, the connections between the left ROIs and their right counterparts appear in all three graphs. Since all the subjects performed the same motor task, it is reasonable for the networks to share certain similarities. This also suggests that controlling the error rate yields robust results, curbing the effect of randomness.

We observe that after the L-dopa medication, the network of the patients changed toward that of the normal people. Connections L_SMA – L_M1, L_THA – L_THA and R_THA – R_THA appear in the network of the normal people, but not in that of the patients before the medication. After the medication, these connections appear in the network again (Figure 5.2c). Connection L_CAU – R_THA does not appear in the network of the normal people, but does appear in that of the patients before the medication. After the medication, this connection disappears from the network (Figure 5.2c). These observations are consistent with the fact that L-dopa has dramatic effects against bradykinesia and rigidity, cardinal features of Parkinson’s disease [5].

[Figure 5.2 panels: (a) Normal Subjects; (b) Patients Before Medication; (c) Patients After Medication. Nodes are the twelve ROIs.]

Figure 5.2: The relatively consistent connections at the group level. Edge labels are the q-values of the edges in the group analysis.
Prefixes “L” and “R” before ROI names are short for “left” and “right” respectively.

5.2 Bayesian Inference with FDR-Controlled Prior^14

The PCfdr and the PCfdr* algorithms can make fast and reliable inference of the structure of a Bayesian network, with the false discovery rate being controlled at user-specified levels. If the data were perturbed away from the generating model by additional noise, the algorithms would not control the FDR precisely, but would rather yield a fast and rough estimate of the generating model. On the other hand, Bayesian inference is relatively robust to perturbation, but requires cumbersome computation. For example, sometimes even the overhead “burn-in” process of its generic implementation, Markov chain Monte Carlo, needs thousands of iterations. We propose a post-hoc method that combines the PCfdr algorithms and Bayesian inference, taking advantage of both. The basic idea is to first learn a network with the PCfdr algorithms, and then, for Bayesian inference, to propose a prior distribution of network structures based on the learned network. This method may accelerate Bayesian inference and is more robust to perturbing noise.

^14 Portions reprinted, with permission, from Junning Li, Z. Jane Wang and Martin J. McKeown (2009) Incorporating Error-Rate-Controlled Prior in Modelling Brain Functional Connectivity. Proceedings of the 3rd International Conference on Bioinformatics and Biomedical Engineering. DOI: 10.1109/ICBBE.2009.5162945 © 2009 IEEE.

5.2.1 Bayesian Inference with Uniform Prior Distribution

One intuitive approach to controlling error rates is Bayesian inference, i.e. inferring the posterior probability of network structures from the observed data based on a certain prior probability distribution. The uniform distribution is usually employed as a noninformative prior. As a powerful, generic tool for Bayesian inference, Markov chain Monte Carlo (MCMC) has been widely employed for structure learning of DBNs [8]. MCMC sampling is usually implemented as follows. A neighborhood is first defined for each network structure, usually by adding, deleting or reversing an edge of the network. Then, starting from a randomly selected network structure, the chain iteratively moves from one structure to one of its neighbors according to a certain probability derived from the Bayes factor between them. Instead of sampling in the space of directed acyclic graphs, Chickering suggested sampling in the space of equivalence classes [3]. In the simulations that follow, we use MCMC to sample network structures in the space of equivalence classes according to their posterior probabilities derived from the uniform prior.

5.2.2 Modelling with Prior Distribution Derived from FDR-Controlled PC Algorithm

The proposed method, named the PCfdr*-Prior procedure, is summarized as follows:

(1) Learn the network structure S using the PCfdr or the PCfdr* algorithm.
(2) Derive a prior probability distribution D over the network structures, parameterizing our confidence in S.
(3) Make Bayesian inference with the prior distribution D.

In step (2), we use the realized FDR and the realized detection power to parameterize our confidence in S, the output of the PCfdr or the PCfdr* algorithm. Because both the realized FDR and the realized detection power are random variables within [0, 1], it is natural to model them with beta distributions, as in Eqs. (5.5) and (5.6), where G is the existing, yet unknown, true network structure, and $|S|$ denotes the number of edges in S.

\[ fdr = \frac{|S \cap G|}{|S|} \sim \mathrm{Beta}(a_{fdr}, b_{fdr}), \tag{5.5} \]

\[ power = \frac{|S \cap G|}{|G|} \sim \mathrm{Beta}(a_{power}, b_{power}). \tag{5.6} \]

Note that, as written, the variable fdr in Eq. (5.5) is the fraction of the edges in S that also belong to G, i.e. the complement of the realized false discovery proportion; this is the quantity used when edges are selected from S in the generating process.

We employ the following process to generate network structures from the prior distribution D. First, generate random numbers fdr and power from Eqs. (5.5) and (5.6).
Second, select $|S| \times fdr$ undirected edges from network S, and $|S| \times fdr / power - |S| \times fdr$ from those not in S. Finally, determine the directions of the edges randomly according to a random order of the nodes. If the PCfdr* algorithm is used in step (1), we call the method the PCfdr*-Prior method to point out this specification; the name “PCfdr-Prior” remains the more general one, covering the choice between the PCfdr and PCfdr* algorithms.

We must point out that the PCfdr-Prior procedure is not rigorously correct from the view of Bayesian inference, because the information in the data is used twice during the inference, once in step (1) and again in step (3). However, we justify the procedure as what we call “third-party” Bayesian inference. Traditional Bayesian inference is a process involving two parties: the observer and the data. The observer has prior knowledge directly about the model, and then updates this knowledge with the information in the data. In the real world, inference processes may involve not just the observer and the data, but also, for example, an extra “counsellor”. Let us consider a consulting procedure as follows. The observer has certain prior knowledge about the problem, but not directly about the model; rather, it is indirect knowledge about the counsellor’s ability to read information from the data. The observer first hands the data to the counsellor and, according to his trust in the counsellor, accepts a prior based on the counsellor’s report; later he studies the data by himself with the accepted prior. In the PCfdr-Prior procedure, the PCfdr algorithms play the role of the counsellor in step (1). Equations (5.5) and (5.6) for step (2) describe the observer’s trust in the PCfdr algorithms, that is, his indirect prior knowledge about the problem. In step (3), the observer makes Bayesian inference by himself.
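The three-step process for generating network structures from the prior distribution D can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the simulations: the function name and arguments are hypothetical, undirected edges are assumed to be stored as `(u, v)` tuples in the same canonical form in both inputs, and the truncation of the edge counts to integers and their clamping to the number of available edges are choices made only for this sketch.

```python
import random


def sample_structure_from_prior(s_edges, candidate_edges, nodes,
                                a_fdr, b_fdr, a_power, b_power, rng):
    """Draw one directed structure from the prior D (illustrative sketch).

    s_edges:         undirected edges (u, v) of the learned network S
    candidate_edges: all admissible undirected edges, in the same (u, v) form
    rng:             a random.Random instance
    """
    s_edges = list(s_edges)
    s_set = set(s_edges)
    outside = [e for e in candidate_edges if e not in s_set]

    # Step 1: draw the two confidence parameters, Eqs. (5.5) and (5.6).
    fdr = rng.betavariate(a_fdr, b_fdr)
    power = rng.betavariate(a_power, b_power)

    # Step 2: keep |S| * fdr edges of S and add |S| * fdr / power - |S| * fdr
    # edges from outside S (integer truncation and clamping are assumptions).
    n_keep = int(len(s_edges) * fdr)
    target = n_keep / power if power > 0 else float("inf")
    n_add = int(min(len(outside), max(0.0, target - n_keep)))
    chosen = rng.sample(s_edges, n_keep) + rng.sample(outside, n_add)

    # Step 3: orient each edge from the earlier to the later node of a random
    # node order, which guarantees that the result is acyclic.
    order = list(nodes)
    rng.shuffle(order)
    rank = {v: i for i, v in enumerate(order)}
    return {(u, v) if rank[u] < rank[v] else (v, u) for u, v in chosen}


# Hypothetical usage with a six-node network and made-up beta parameters.
rng = random.Random(0)
nodes = list(range(6))
candidates = [(u, v) for u in nodes for v in nodes if u < v]
learned = [(0, 1), (1, 2), (2, 3), (3, 4)]
print(sample_structure_from_prior(learned, candidates, nodes, 8, 2, 8, 2, rng))
```

For a dynamic Bayesian network, edges with a time lag would additionally have to point forward in time; that constraint is left out of this sketch.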
Although the correctness of this “third-party” Bayesian inference is still arguable, we decided to keep it in the thesis for the following reasons. First, it provides an approach for fusing different data analysis methods. Second, we do not want to be overly conservative about new yet arguable techniques, but would rather protect the bud of potential innovations. When Bayesian inference was first introduced, statisticians argued about it, and some did not accept it. Nowadays, it has become an active and important branch of statistics. This “third-party” Bayesian idea, though arguable, has a real-world consulting procedure to justify it, so we keep it and leave it for future investigation.

Alternatively, to be theoretically correct from the view of Bayesian inference, instead of using the outputs of the PCfdr algorithms as the prior, we can use such outputs as the initialization of standard MCMC sampling. This usage can accelerate the burn-in process of MCMC and significantly reduce the computational time. This idea is strongly supported by the results in Section 5.2.3.

5.2.3 Simulation

Synthetic Data

Data were generated from a dynamic Bayesian network, with the network structure shown in Fig. 5.3 and the conditional-dependence relationships defined in Eqs. (5.7) and (5.8):

\[ x_a = \sum_{b \in pa[a]} \beta_{ba} x_b + e_a, \tag{5.7} \]

\[ y_a = x_a + \epsilon_a, \tag{5.8} \]

where a and b denote node indices, $pa[a]$ denotes the parent nodes of a, $\beta_{ba}$ is the connection coefficient from b to a, and $e_a$ and $\epsilon_a$ are the additive Gaussian noise terms related to node a. $e_a$ represents the intrinsic randomness of the stochastic system, while $\epsilon_a$ is the perturbation noise occurring during data acquisition. It should be noted that, due to the $\epsilon_a$ term, Fig. 5.3 represents only the conditional-independence relationships among $\{x_a, x_b, \ldots\}$, but not necessarily those among the observations $\{y_a, y_b, \ldots\}$. However, $\{y_a, y_b, \ldots\}$ but not $\{x_a, x_b, \ldots\}$ were available for structure learning. Here we deliberately included $\epsilon_a$ and analyzed the observations $\{y_a, y_b, \ldots\}$ to investigate the effect of noise perturbation in practice.

Figure 5.3: The simulation model (20 nodes). Solid blue arrows are edges from time t to t + 1; dashed red arrows are edges without time lags.

Connection coefficients $\beta_{ba}$ were randomly selected from the uniform distribution on [0.2, 0.4], and the variances of $\{e_a\}$ were randomly selected from the uniform distribution on [0.2, 1.2]. From a single model for $\{x_a, x_b, \ldots\}$, data sets of 200 time points and of 500 time points were each sampled 50 times. Based on the signals $\{x_a, x_b, \ldots\}$, we added $\epsilon_a$ at the signal-to-noise ratio (SNR) levels 16, 9, 4 and 3 to synthesize the observations $\{y_a, y_b, \ldots\}$.

Results

Figs. 5.4 and 5.5 respectively show the FDRs and the detection powers of the PCfdr*-Prior method, the PCfdr* algorithm and the Bayesian inference based on the uniform prior distribution. We note that, due to the perturbation noise, none of the algorithms controlled the FDR closely around the expected level of 5%. It should be pointed out that the performance of the PCfdr* algorithm without perturbation noise has been investigated in Chapter 4, where it controlled the FDR accurately around 5%.

Figure 5.4: The estimated false discovery rate (FDR) of the PCfdr*-Prior method, the PCfdr* algorithm and the uniform prior, for (a) 200 time points and (b) 500 time points. The x-axis is the level of the perturbation noise, in terms of $1/\sqrt{SNR}$; the y-axis is the FDR. The bars are the standard deviations of the estimates.
The data used in the simulation here have been perturbed from the true generating model by the extra noise in Eq. (5.8). The empirical FDRs of the Bayesian inference based on the uniform prior distribution (implemented with MCMC of 5000 steps) are higher than those of the other two algorithms, especially in the case of 200 time points. Meanwhile, its detection power is lower than that of the PCfdr* algorithm, and only slightly higher than that of the PCfdr*-Prior method. As the perturbation level increases, the FDR of the PCfdr*-Prior method increases more slowly than that of the PCfdr* algorithm. This suggests that the post-hoc Bayesian average makes the structure learning more robust to perturbation noise. The PCfdr* algorithm and the Bayesian inference based on the uniform prior distribution steadily detected more true edges than the PCfdr*-Prior method did, but at the cost of higher FDRs, as shown in Fig. 5.4. In the case of 500 time points, all three methods detected more than 80% of the true connections. In the case of 200 time points, the detection power dropped sharply as the perturbation noise increased.

Figure 5.5: The estimated detection power of the PCfdr*-Prior method, the PCfdr* algorithm and the uniform prior, for (a) 200 time points and (b) 500 time points. The x-axis is the level of the perturbation noise, in terms of $1/\sqrt{SNR}$; the y-axis is the detection power. The bars are the standard deviations of the estimates.

Though the theoretical foundation of this post-hoc Bayesian inference, the “third-party” Bayesian inference, is not well established, step 2 of the PCfdr-Prior method can still be used to initialize standard MCMC sampling, accelerating the burn-in process and significantly reducing the computational cost.
This usage does not violate the rules of Bayesian statistics. For the case of $SNR(\epsilon_a) = 16$ and 200 time points, Fig. 5.6 shows the average trajectory of the Bayesian-information-criterion (BIC) scores of the MCMC sampling based on the posterior distribution derived from the uniform distribution. The chain became mature after about 2500 steps (BIC = −6511.4), which took more than an hour with Matlab on our workstation (Intel Xeon 1.86 GHz CPU and 4 GB RAM). The horizontal histogram shows the BIC scores sampled from the prior distribution built in step 2 of the PCfdr*-Prior method. According to the histogram, the probability Pr(BIC ≥ −6511.4) is approximately 0.3791, implying that if we randomly sample 20 networks from the distribution, then with probability $1 - (1 - 0.3791)^{20} \approx 99.99\%$ we can find a BIC score that takes the MCMC trajectory about 2500 steps to achieve. The overhead of initializing MCMC with the PCfdr algorithms is the preliminary learning of a network structure with the PCfdr algorithms. For the simulation presented here, it took no more than 4 seconds for the case of 200 samples, and around 10 seconds for the case of 500 samples.

Figure 5.6: The histogram of the BIC scores sampled from the prior distribution built in step 2 of the PCfdr*-Prior method, and the average trajectory of the Bayesian-information-criterion (BIC) scores during the MCMC sampling (5000 steps) for Bayesian inference based on the uniform prior distribution.

5.2.4 Application to Parkinson’s Disease

Data collection

Ten normal people and ten Parkinson’s disease (PD) patients participated in the study. Subjects continually squeezed a bulb in their right hand to control an “inflatable” ring so that the ring moved through an undulating tunnel without touching the sides. Each trial of the task lasted five minutes.
Normal people performed only one trial; the patients performed two, one before L-dopa medication and the other after the medication. fMRI data were collected with a Philips Achieva 3.0 T scanner at 0.5038 Hz. The following eighteen brain regions were selected as regions of interest (ROIs) in the study: the left and right (denoted by the prefixes “L” and “R” in the abbreviated names of the ROIs) primary motor cortex (L_M1 and R_M1), supplementary motor cortex (L_SMA and R_SMA), lateral cerebellar hemisphere (L_CER and R_CER), putamen (L_PUT and R_PUT), caudate (L_CAU and R_CAU), thalamus (L_THA and R_THA), anterior cingulate cortex (L_ACC and R_ACC), globus pallidus (L_GLP and R_GLP), and prefrontal cortex (L_PFC and R_PFC).

After motion correction, the fMRI time courses of the voxels within each ROI were averaged as the summary activity of the ROI. The averaged time courses were then detrended and normalized to unit variance. Data were grouped into three categories: group N for the normal people, group Ppre for the PD subjects before medication, and group Ppost for the PD subjects after taking L-dopa medication. Data in the same group were pooled together for group analysis.

Results

The connectivity networks learned with the PCfdr*-Prior method are shown in Fig. 5.7. We argue that the results are not random, but informative. Fig. 5.7(d) shows the connections that are shared by all three groups, N, Ppre and Ppost. Most of these connections are either auto-regressive connections within the same ROI, or connections between left and right counterparts. The pathway GLP – PUT – CAU is also symmetric. Fig. 5.7(e) shows those connections that are shared by the N and Ppost groups, but not by Ppre. In other words, we can regard these connections as “impaired” by the disease but later “recovered” by the L-dopa medication. The pathway L_M1 – L_SMA – L_PFC is compatible with recruitment of the left sensorimotor system.
Interactions between the SMA and M1 have previously been demonstrated to be abnormal in PD, and to improve with medication [2, 4]. Fig. 5.7(f) shows the connections that appear only in the Ppre group, in neither the N nor the Ppost group. A prominent feature is the importance of the L_CER, the contralateral cerebellum. The cerebellum has recently been shown to be hyperactive in PD [12], which may represent a compensatory mechanism of the disease. Our results extend prior observations by demonstrating that not only the amplitude of the fMRI signal in the cerebellum increases, but also its connections to other brain regions.

[Figure 5.7 network diagrams omitted. Panels: (a) Normal subjects (N); (b) Patients pre-medication (Ppre); (c) Patients post-medication (Ppost); (d) Shared by N, Ppre and Ppost; (e) Shared by N and Ppost, but not Ppre; (f) In Ppre but neither in N nor in Ppost.]

Figure 5.7: The functional connectivity networks learned from the bulb-squeezing data with the PCfdr-Prior method. The prefixes “L” and “R” before ROI names are short for “left” and “right” respectively. Solid blue edges are those with a one-time-point lag; dashed red edges are those without time lags.

5.3 Conclusions and Discussions

Graphical models have been increasingly investigated as an exploratory tool for inferring brain connectivity in fMRI analyses. It is critical to control the error rate in the “discovered” connectivity networks in real fMRI studies. Two extensions of the PCfdr algorithms are designed to introduce control over the false discovery rate to dynamic Bayesian networks.

The first one is an adaptation to prior knowledge. It allows users to specify which edges must appear in the network, which cannot, and which are to be learned from data. It is naturally applicable to dynamic Bayesian networks by simply treating the structure of dynamic Bayesian networks as a special set of prior constraints. In a study on Parkinson’s disease using functional magnetic resonance imaging, it showed promising performance. It learned connectivity networks that are consistent with the dramatic effects of L-dopa against bradykinesia and rigidity, cardinal features of the disease. It yielded robust results, curbing the effect of randomness.

The second one uses the PCfdr algorithms to improve the Bayesian inference of dynamic Bayesian networks. The basic idea is to make a post-hoc Bayesian inference based on the prior distribution derived from the network structure learned with the PCfdr algorithms. Simulations showed its superior performance in terms of both computational cost and robustness to perturbation noise. In a study on Parkinson’s disease using functional magnetic resonance imaging, it detected the normalizing effect of L-dopa medication in motor pathways, as well as regions involved in the compensatory mechanisms of the disease. This post-hoc Bayesian inference is not rigorously correct from the viewpoint of Bayesian statistics, because the information in the data is used twice during the inference. However, it can be justified with a new idea, which we call “third-party” Bayesian inference.
Though it is arguable, it describes real-world consulting procedures.

Though the theoretical foundation of this post-hoc Bayesian inference, the “third-party” Bayesian inference, is not well established, Step 2 of the PCfdr-Prior method can still be used to initialize standard MCMC sampling, accelerating the burn-in process and significantly reducing the computational cost. This usage does not violate the rule of Bayesian statistics.

The proposed structure-learning algorithms are based on the PC algorithm [10], which does not consider latent variables. If certain brain regions in the functional system are not included in the structure learning, then a connection between two brain regions may result either from their direct communication or from the common influence exerted on them by a latent region. To distinguish whether two brain regions are directly connected or both driven by a third but unincluded brain region, the Fast-Causal-Inference (FCI) algorithm [10], which considers latent variables, should be used as the analysis framework.

Bibliography

[1] Yoav Benjamini and Daniel Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188, 2001.
[2] C. Buhmann, V. Glauche, H. J. Sturenburg, M. Oechsner, C. Weiller, and C. Buchel. Pharmacologically modulated fMRI – cortical responsiveness to levodopa in drug-naive hemiparkinsonian patients. Brain, 126(Pt 2):451–61, 2003.
[3] David Maxwell Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.
[4] B. Haslinger, P. Erhard, N. Kampfe, H. Boecker, E. Rummeny, M. Schwaiger, B. Conrad, and A. O. Ceballos-Baumann. Event-related functional magnetic resonance imaging in Parkinson’s disease before and after levodopa. Brain, 124(Pt 3):558–70, 2001.
[5] Joseph Jankovic and Eduardo Tolosa, editors. Parkinson’s Disease and Movement Disorders. Williams & Wilkins, 4th edition, 2002.
[6] Steffen L. Lauritzen. Graphical Models. Clarendon Press, Oxford University Press, Oxford, New York, 1996.
[7] Kenneth J. Levy and Subhash C. Narula. Testing hypotheses concerning partial correlations: Some methods and discussion. International Statistical Review, 46(2):215–218, Aug 1978. ISSN 0306-7734.
[8] David Madigan, Jeremy York, and Denis Allard. Bayesian graphical models for discrete data. International Statistical Review, 63(2):215–232, Aug 1995. ISSN 0306-7734.
[9] Martin J. McKeown, Junning Li, Xuemei Huang, Mechelle M. Lewis, Seungshin Rhee, K. N. Young Truong, and Z. Jane Wang. Local linear discriminant analysis (LLDA) for group and region of interest (ROI)-based fMRI analysis. NeuroImage, 37(3):855–865, Sep 2007. doi: 10.1016/j.neuroimage.2007.04.072.
[10] Peter Spirtes, Clark Glymour, and Richard Scheines. Causation, Prediction, and Search. The MIT Press, 2001.
[11] John D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3):479–498, 2002.
[12] H. Yu, D. Sternad, D. M. Corcos, and D. E. Vaillancourt. Role of hyperactive cerebellum and motor cortex in Parkinson’s disease. NeuroImage, 35(1):222–233, 2007.

Chapter 6

Selecting Regions of Interest in fMRI Analysis Using Local Linear Discriminant Analysis^15

6.1 Introduction

Group analysis in fMRI is typically done in several consecutive steps. First, fMRI data are corrected for motion, despite the fact that most methods cannot easily distinguish changes in fMRI signal from those induced by motion [11, 12]. Data are then spatially transformed to a common space, such as the atlas by Talairach [23] or the probabilistic space suggested by the Montreal Neurological Institute [3], to minimize intersubject differences.
However, because of the variability in human brain anatomy, the intersubject registration is typically imperfect, so spatial low-pass filtering (“smoothing”) is performed to de-emphasize anatomical differences [6]. Once data have been motion corrected, warped to a common space, and spatially smoothed, the task-related activation of a voxel of subject k is estimated with linear regression techniques:

\[ Y_k = X_k \beta_k + \varepsilon_k, \quad \mathrm{Cov}(\varepsilon_k) = \sigma_k^2 V_k \tag{6.1} \]

where $Y_k$ is the $T_k \times 1$ time course of the voxel, $X_k$ is the $T_k \times D$ design matrix containing the hypothesized activation (often incorporating estimates of the hemodynamic response function) as well as other covariates, $\varepsilon_k$ is the $T_k \times 1$ vector of residuals, $\sigma_k^2$ is the homogeneous variance of the residuals and $V_k$ is the correlation matrix. The subscript k indicates that all the variables are related to subject k.

^15 A version of this chapter has been accepted for publication. Martin J. McKeown, Junning Li, Xuemei Huang, Mechelle M. Lewis, Seungshin Rhee, K. N. Young Truong and Z. Jane Wang (2007) Local Linear Discriminant Analysis (LLDA) for Group and Region of Interest (ROI)-Based fMRI Analysis. NeuroImage 37: 855–865.

As fMRI data are typically not temporally white, data are often pre-whitened using a whitening matrix $W_k$ such that

\[ W_k V_k W_k^T = I \tag{6.2} \]

(for an excellent summary the reader is referred to [16]). If each term in (6.1) is premultiplied by $W_k$, we have

\[ Y_k^* = X_k^* \beta_k + \varepsilon_k^* \tag{6.3} \]

where the superscript * denotes the whitened quantities. The whitening matrix $W_k$ is estimated from the residuals $\varepsilon_k$ and $V_k$ as

\[ W_k = V_k^{-1/2}. \tag{6.4} \]

The regression coefficients of Eq. (6.3) can then be estimated by Ordinary Least Squares (OLS) to give the Generalized Least Squares estimate of Eq. (6.1):

\[ \hat{\beta}_k^{GLS} = \left( X_k^{*T} X_k^* \right)^{-1} X_k^{*T} Y_k^* \tag{6.5} \]

\[ \mathrm{Cov}\left( \hat{\beta}_k^{GLS} \right) = \sigma_k^2 \left( X_k^{*T} X_k^* \right)^{-1} \tag{6.6} \]

Contrasts between conditions are of most interest in an experiment, e.g. contrasting the BOLD signal during performance of a given task with that at rest. In a study of a single subject, the null hypothesis is that the contrast between the least-squares estimates is zero:

\[ H_0: c\beta_k = 0 \]

where c is the contrast row vector. For example, if we are interested in the comparison between task 1 and task 2, c is [1, -1].

Group analyses are usually done using a Summary Statistics method, which is a two-stage approach: first, individual models are fit to each subject as described above; then a second-level model is applied to make group inference on the $c\beta_k$ [16]. In the usual situation where one is contrasting activation across two groups, the second level is a multivariate regression equation with the design matrix encoded with group inclusion indicators (Figure 6.1 (a)):

\[ \beta_{cont} = X_g \beta_g + \varepsilon_g \tag{6.7} \]

where

\[ X_g = \begin{bmatrix} 1 & 0 \\ \vdots & \vdots \\ 1 & 0 \\ 0 & 1 \\ \vdots & \vdots \\ 0 & 1 \end{bmatrix} \]

is a binary $K \times 2$ matrix coded to show group inclusion (K is the number of subjects from the two groups), $\beta_{cont}$ is composed of the contrasts $c\beta_k$ for each individual as defined in the first stage, $\beta_g = [\beta_{g1}, \beta_{g2}]^T$ is the mean activation of the two groups, and $\varepsilon_g \sim N(0, \delta_g^2 V_g)$ is the residual, with variance $\delta_g^2$ and correlation matrix $V_g$ being a diagonal matrix, typically just I. Here the null hypothesis is that the group activations for a given voxel in the common spatially transformed space are not significantly different:

\[ H_0: \beta_{g1} - \beta_{g2} = 0. \]

[Figure 6.1 diagram omitted. Panel (a): per-subject models $Y_k = X_k \beta_k + \varepsilon_k$ for subjects S1 to Sn, whose contrasts $c\beta_k$ are collected into the group-level model $\beta_{cont} = X_g \beta_g + \varepsilon_g$. Panel (b): per-subject voxel t-statistics of ROI A against ROI B for task 1 and task 2, with directions $U_1$, $U_2$, $U_3$ combined as $\bar{U} = \sum_k \alpha_k U_k$.]

Figure 6.1: Comparison of fMRI group analysis methods. (a) With a standard GLM approach, a two-step approach is typically employed. An individually-specific general linear model is first used to estimate task-related activity. In the second stage, the individual contrasts from each subject are collected with a dummy matrix encoding group membership. Note that this implicitly assumes that each voxel across subjects is comparable, i.e. each subject’s data have been spatially warped to the same space. (b) With the LLDA approach, voxel-based activation t-statistics from different ROIs are compared for each subject. The optimum combination of ROIs resulting in maximal discriminability (represented by the straight lines and the $U_i$’s) is calculated. Note that a joint optimization is performed (represented by the curved arrows), so that each $U_i$ is not calculated in isolation. (Lower panel) The individual $U_i$’s can then be (weighted) averaged to get an overall linear combination of ROIs that maximally discriminates between tasks.

A number of different implementations have been proposed to carry out the above analysis in a practical way. The fMRIStat method uses Restricted Maximum Likelihood (ReML) to estimate $\sigma_g^2$ [25], then smooths the data to increase its degrees of freedom and accuracy, and finally tests the hypothesis with t-statistics. The SPM2 package [7, 8] estimates the $\delta_g^2 V_g$ term with ReML under the simplifying assumption that all the subjects share a common covariance matrix, $\delta_k^2 V_k = \delta^2 V$, and then tests the hypotheses with F-statistics. The FMRIB software library estimates $\sigma_g^2$ with the maximum a posteriori (MAP) criterion, then screens out obviously insignificant voxels with Z-statistics, and finally performs a Bayesian inference on the significance of the remaining voxels with a slower but more accurate Markov Chain Monte Carlo (MCMC) simulation [1].

Nevertheless, there are a number of shortcomings with the previously described methods. The above methods work on the voxel level, which assumes that after suitable spatial transformation there is a perfect correspondence between the same voxel across subjects. While this may be mitigated somewhat by spatial smoothing, such low-pass filtering degrades the spatial resolution of the data.
Activation estimates in small, subcortical structures such as the basal ganglia or thalami, which abut functionally different tissues (e.g. the internal capsule), may be particularly affected by mis-registration errors.

One way to partially circumvent the difficulties associated with spatially transforming functional maps to a common space is to manually draw anatomical regions of interest (ROIs) for each subject and perform analyses at the ROI level, as opposed to the individual voxel level. Using standard atlases, a particular brain region (e.g. the lateral cerebellar hemisphere) is manually circumscribed on the high-resolution structural MRI scans that have been co-registered with the functional data, and the voxels within this region are analyzed. The benefit of this method is that it does not require rigid spatial transformation, preventing possible gross distortion of a particular brain area, as may occur if the anatomy of a given individual differs significantly in size and shape from the homologous area in the exemplar brain. However, drawing ROIs is labor-intensive, subject to human error, and requires the assumption that a functionally active region (the SMA, for example) of a given brain will lie within an anatomically standardized index (e.g. Brodmann’s Area 9) that is used to draw the ROI.

In addition to the possibilities of mis-registration, the previously described voxel-based methods do not explicitly model interactions between brain regions. Covarying regions are often of interest, but are not included in the group methods described above.

Conceptually, group methods are done in two stages: in the first stage, individually-specific regression models are fit to the data; in the second, the results of these models are used in a group-level analysis. Because the goal of these methods is to test a specific hypothesis, the two stages may be conducted sequentially.
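The sequential two-stage procedure can be illustrated with a minimal Python sketch. It is not the implementation used by any of the packages named above. It assumes a first-level design with one indicator regressor each for task and rest (so that the OLS estimates are the condition means and the contrast c = [1, -1] is their difference), no pre-whitening, and a second level with $V_g = I$, for which testing $\beta_{g1} - \beta_{g2} = 0$ reduces to a pooled-variance two-sample t-test. The function names and the toy numbers are hypothetical.

```python
import math
from statistics import mean


def first_level_contrast(task_samples, rest_samples):
    """Stage 1 (one subject, one voxel): c * beta_k with c = [1, -1].

    Assumes indicator regressors for task and rest, so the OLS
    estimates are simply the two condition means.
    """
    return mean(task_samples) - mean(rest_samples)


def second_level_t(contrasts_group1, contrasts_group2):
    """Stage 2: group means of the contrasts and a pooled-variance t statistic.

    Corresponds to a group-indicator design matrix with V_g = I.
    """
    n1, n2 = len(contrasts_group1), len(contrasts_group2)
    b1, b2 = mean(contrasts_group1), mean(contrasts_group2)
    ss = (sum((v - b1) ** 2 for v in contrasts_group1)
          + sum((v - b2) ** 2 for v in contrasts_group2))
    pooled_var = ss / (n1 + n2 - 2)
    t = (b1 - b2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (b1, b2), t


# Toy data (hypothetical): each subject is (task samples, rest samples).
group1 = [([2.0, 3.0, 4.0], [1.0, 2.0, 3.0]),
          ([4.0, 5.0], [2.0, 3.0]),
          ([5.0, 6.0], [2.0, 3.0])]
group2 = [([1.0, 2.0], [1.0, 2.0]),
          ([2.0, 3.0], [1.0, 2.0]),
          ([4.0, 5.0], [2.0, 3.0])]

c1 = [first_level_contrast(task, rest) for task, rest in group1]
c2 = [first_level_contrast(task, rest) for task, rest in group2]
(beta_g1, beta_g2), t_stat = second_level_t(c1, c2)
```

The second stage only ever sees the first-stage contrasts, which is what makes the sequential treatment possible.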
In contrast, if the goal is to find which combination of brain regions is maximally different between tasks, it is desirable to jointly optimize the individual statistical models and the overall model simultaneously. In individual-subject fMRI analysis, in addition to hypothesis-driven methods, there is a role for data-driven methods, such as Independent Component Analysis (ICA) [2, 14], which do not need rigorous a priori specification of activation patterns. In an analogous manner, there may be particular interest in discovering the combinations of brain regions (specified by ROIs) that are maximally contrasted during performance of certain tasks (Figure 6.1 (b)).

There is therefore a need for a multivariate discriminant analysis approach that works at the region of interest (ROI) level as opposed to the individual voxel level. Previous work has taken individual activations (or the t-statistics associated with them) and used a multivariate discriminant approach [15]. In order to apply a discriminant approach, we first assume that some statistical analysis, such as a simple t-test or regression, has been performed to assign a t-statistic, related to activation, to each voxel (e.g. the t-statistic of a voxel comparing the mean BOLD signal for a given condition to the BOLD signal for that voxel at rest). Since most multivariate discriminant approaches assume Gaussian distributions, we extract features from the t-statistics with bootstrap methods to ensure that the Gaussian assumption holds. For a subject k, we randomly select a voxel from within each of its N ROIs, and assemble the result into a column t-statistic vector:

\[ t(k, V) = [t(k, 1, v_1), t(k, 2, v_2), \ldots, t(k, N, v_N)]^T, \tag{6.8} \]

where $t(k, r, v_r)$ is the t-statistic of the $v_r$th voxel in the rth ROI of subject k and the voxel index is $V = [v_1, v_2, \ldots, v_N]^T$. The random selection is repeated a number of times, say $B_t$ times, and the ith draw is denoted $t(k, V_i)$.
After a reasonable number of draws (e.g., $B_t = 30$), the scaled mean of the t-statistic vectors is taken as the feature, so that the data can be modeled as multivariate Gaussian:

\[ f_k = \frac{1}{\sqrt{B_t}} \sum_{i=1}^{B_t} t(k, V_i). \tag{6.9} \]

The above process is repeated $B_f$ times ($B_f$ is several hundred or several thousand) and all feature vectors are then collected:

\[ F_k = [f_k(1), f_k(2), \ldots, f_k(B_f)], \tag{6.10} \]

where $f_k(i)$ is the ith random sample of a feature vector. This process is then repeated for all S subjects in the groups and all the $F_k$’s are concatenated into a big matrix

\[ X = [F_1, F_2, \ldots, F_S]. \tag{6.11} \]

The benefit of formulating the problem in this way is twofold: the $F_k$’s will be asymptotically normally distributed (by the multivariate central limit theorem [5]) even if the voxel-based statistics are not, and p-variate linear analyses can now be performed on X [5].

The variability of the amplitude of activation across subjects is well known, and typically this is dealt with by using a random effects analysis [16]. In random-effects models, the data are assumed to be derived from a hierarchy of different populations whose differences are constrained by the hierarchy. Each subject in a group (e.g. the group of normal subjects) is considered to be representative of the corresponding population (e.g. all normal subjects). In some cases the magnitude of inter-subject differences in fMRI activation can exceed the task-specific differences within individuals. To deal with this situation while still maintaining the benefits of linear discriminant analysis, we propose using a recently developed solution, Local Linear Discriminant Analysis (LLDA), that was initially designed to solve a somewhat different, but still related, problem [10]. In discriminant analysis for classifying images of faces, the difference in discriminant features between different poses of the same face can often greatly exceed the difference in discriminant features between faces.
Finding a classifier that is sensitive to images from different subjects, yet insensitive to different poses of the same subject, has been problematic. Kim and Kittler suggested LLDA as a solution to this problem [10].

We therefore propose using LLDA to sensitively discriminate between task-dependent ROI-based patterns of activity, while being relatively robust to the differences between subjects (Figure 6.1 (b)). We apply this method to data derived from a motor paradigm that would be expected to activate cortical and subcortical structures. We show that the proposed method, consistent with prior neuroscience knowledge derived from animal models, detects significant group activation in subcortical structures that was not detected when the same data were analyzed using standard methods utilizing spatial normalization.

6.2 Experimental Methods

To demonstrate the proposed method, we utilized fMRI data from a task that would be expected, based on prior knowledge, to activate subcortical structures. We enrolled 10 healthy volunteers, 5 males and 5 females (age range 27–45 years). All subjects were right-hand dominant, had normal neurological examinations, had no history of neurological disease, and were not currently using any psychoactive prescription medications. Handedness was determined according to the Edinburgh Handedness Inventory.

The paradigm consisted of externally guided (EG) or internally guided (IG) movements based on three different finger sequencing movements (FSMs) performed alternately by either the right or left hand. For FSM #1, subjects had to (a) make finger-to-thumb opposition movements in the specific order of the index, middle, ring and little finger; (b) open and clench the fist twice; (c) complete finger-to-thumb oppositions in the opposite order (i.e., little, ring, middle and index finger); (d) open and clench the fist twice again; and then (e) repeat the same series of movements.
FSM #2 was the same as above except that the sequence for (a) changed to index, ring, middle and little fingers and (c) changed to the reverse of the revised (a) (i.e., little, middle, ring, and index fingers). FSM #3 was the same as above except that the sequence for (a) changed to middle, little, index, and ring fingers and (c) changed to the reverse of the revised (a) (i.e., ring, index, little, and middle fingers). Three sequences (instead of only one) were chosen to ensure the continuous engagement of the subjects’ attention.

The above FSMs were performed in two test conditions: following (externally guided movements, EG) and continuation (internally guided movements, IG). In the EG condition (30 s), subjects followed the finger tapping sequence shown on the screen (tapping frequency of 1 Hz). In the IG condition (30 s), the visual cue was discontinued and the subjects were instructed to continue tapping the same sequence as shown earlier on the screen (Figure 6.2). The two consecutive conditions were preceded and followed by a rest (R) period (30 s). The EG, IG, and R periods were designated using the visual cues “FOLLOW”, “CONTINUE”, and “REST”, respectively. The visual cues remained on the screen with a pair of hands labeled as “Right” and “Left” to mirror the subjects’ hands throughout the fMRI session. The FOLLOW-CONTINUE-REST cycle was repeated four times during each run (total duration of 6 minutes). A total of 4-6 runs were performed on each subject (depending upon the tolerance of each subject).

Figure 6.2: Activation paradigm for fMRI motor task. A block design paradigm was used wherein subjects held their right hand at rest, followed the sequence on the screen, or generated the previous sequence internally. Each block was 30 seconds in duration and the rate of finger tapping was 1 Hz.
Subjects practiced the task for about 20 minutes prior to the scanning session, achieving more than 80% accuracy for the sequence. Subjects were not able to see their own hands at any time during the scans. All subjects were monitored for performance accuracy throughout the study with a video camera mounted on a tripod. These videos were assessed for accuracy and the results tabulated for each subject.

6.2.1 fMRI Image Pre-processing

The fMRI data were preprocessed for each individual independently for motion correction, smoothing, and time realignment using Statistical Parametric Mapping software (SPM5). The time series of functional images were aligned for each slice in order to minimize the signal changes related to small motion of the subject during the acquisition. Temporal filtering of functional time series included removal of the linear drifts of the signal with respect to time from each voxel’s time course and low-pass filtering of each voxel’s time course with a one-dimensional Gaussian filter with FWHM = 6 s. The data were not spatially smoothed and were not spatially transformed to a common space.

Eight regions of interest (ROIs) were defined bilaterally (total = 16 ROIs) and manually traced for each subject based on anatomical sulcal landmarks and with the guidance of a brain atlas [4]: anterior cingulate cortex (ACC), supplementary motor area (SMA), primary motor cortex (PMC), dorsolateral prefrontal cortex (DLPFC), caudate (Caud), globus pallidus/putamen (GP/Put), thalamus, and lateral cerebellar hemisphere.

The proposed method is a post-processing method that utilizes statistical parametric maps. To ensure that any benefits from the proposed method were not due to the methods of obtaining the statistical parametric maps themselves, we utilized simple t-tests based on the BOLD signal changes in all runs between a task and rest (e.g., right hand IG vs. rest).
The voxels in the statistical maps were then labeled by the appropriate ROIs drawn on the anatomical image. The labeled statistical parametric maps for different conditions (e.g. right hand IG vs. rest contrasted to right hand EG vs. rest) were then contrasted with LLDA.

6.2.2 LLDA Algorithm

The underlying idea of LLDA [10] is that global nonlinear data structures are in many cases locally linear, and that local structures can be linearly aligned. LLDA linearly transforms each local structure (called a “cluster”) to a common vector space with an individual transformation matrix and optimizes the discrimination between different classes globally in the common space.

Starting with the resampled feature matrix X in Eq. (6.11) in a study involving K subjects and two tasks, we regard the data of each subject as a local linear structure (cluster) and attempt to find an individual transformation matrix for it such that the transformed data of all the subjects are globally and optimally discriminated between the tasks (classes). Let x, an $N \times 1$ vector, be a column of X that belongs to a subject $k \in \{1, 2, \ldots, K\}$ and a task $c \in \{1, 2\}$. (The notations $x \in k$ and $x \in c$ mean, respectively, that x belongs to subject k and to task c.) Next, x is transformed to y in the common vector space with Eq. (6.12):

\[ y = U_k^T (x - m_k), \quad x \in k \tag{6.12} \]

\[ m_k = \frac{1}{N_k} \sum_{x \in k} x \tag{6.13} \]

where $U_k = [u_{k1}, \ldots, u_{kn}, \ldots, u_{kN}]$ is the $N \times N$ orthogonal transformation matrix of cluster k with $u_{kn}$ being its nth base, $m_k$ is the mean vector of cluster k, and $N_k$ is the number of x’s that belong to cluster k. The mean of each cluster is removed in the transformation. The discrimination after the transformation is scored with Eq. (6.14):

\[ J = \log \frac{|\tilde{B}|}{|\tilde{W}|} \tag{6.14} \]

where $\tilde{B}$ and $\tilde{W}$ are the between-class and within-class scatter matrices in the common space.
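The per-cluster transformation of Eqs. (6.12) and (6.13) can be sketched in a few lines of Python. This is an illustration only: the 2-D rotation used as $U_k$ is a hypothetical stand-in for an orthogonal matrix (in LLDA, $U_k$ is obtained by maximizing J), and the data and function names are made up for the example.

```python
import math


def cluster_mean(xs):
    """m_k: component-wise mean of the vectors belonging to one cluster."""
    n, dim = len(xs), len(xs[0])
    return [sum(x[j] for x in xs) / n for j in range(dim)]


def transform_cluster(xs, U):
    """Apply y = U^T (x - m_k) to every x of one cluster.

    U is an N x N orthogonal matrix stored as a list of rows.
    """
    m = cluster_mean(xs)
    dim = len(m)
    ys = []
    for x in xs:
        centered = [x[j] - m[j] for j in range(dim)]
        # (U^T c)_n = sum_j U[j][n] * c[j]
        ys.append([sum(U[j][n] * centered[j] for j in range(dim))
                   for n in range(dim)])
    return ys


def rotation(theta):
    """A 2 x 2 rotation matrix, used here only as an example of an orthogonal U_k."""
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]


# Toy cluster (hypothetical values): three 2-D feature vectors of one subject.
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
ys = transform_cluster(xs, rotation(0.3))
```

Two properties used in the text can be read off this sketch: the transformed vectors of each cluster have zero mean, and because $U_k$ is orthogonal the transformation preserves distances within the cluster.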
The transformed scatter matrices are defined as:

\[ \tilde{B} = \sum_{c=1}^{C} N_c (\tilde{m}_c - \tilde{m})(\tilde{m}_c - \tilde{m})^T = \sum_{c=1}^{C} N_c \tilde{m}_c \tilde{m}_c^T \tag{6.15} \]

\[ \tilde{W} = \sum_{c=1}^{C} \sum_{x \in c} (y - \tilde{m}_c)(y - \tilde{m}_c)^T \tag{6.16} \]

where $\tilde{m} = \frac{1}{N} \sum_{x} y = 0$ and $\tilde{m}_c = \frac{1}{N_c} \sum_{x \in c} y$ are the global mean and the mean of class c respectively after the transformation, and $N_c$ is the number of x’s that belong to class c. Because the mean of each cluster is removed in the transformation, in our case $\tilde{m}$ equals 0. $\tilde{B}$ and $\tilde{W}$ can also be written in matrix form as:

\[ \tilde{B} = \begin{bmatrix} U_1^T & \cdots & U_K^T \end{bmatrix} \begin{bmatrix} B_{11} & \cdots & B_{1K} \\ \vdots & \ddots & \vdots \\ B_{K1} & \cdots & B_{KK} \end{bmatrix} \begin{bmatrix} U_1 \\ \vdots \\ U_K \end{bmatrix} = U^T B U \tag{6.17} \]

\[ \tilde{W} = \begin{bmatrix} U_1^T & \cdots & U_K^T \end{bmatrix} \begin{bmatrix} W_{11} & \cdots & W_{1K} \\ \vdots & \ddots & \vdots \\ W_{K1} & \cdots & W_{KK} \end{bmatrix} \begin{bmatrix} U_1 \\ \vdots \\ U_K \end{bmatrix} = U^T W U \tag{6.18} \]

where $B_{ij} = \sum_{c=1}^{C} N_c m_{ci} m_{cj}^T$, $m_{ck} = \frac{1}{N_c} \sum_{x \in c, x \in k} (x - m_k)$, and

\[ W_{ij} = \begin{cases} -B_{ij}, & \text{if } i \neq j \\ \sum_{x \in k} (x - m_k)(x - m_k)^T - B_{kk}, & \text{if } i = j = k \end{cases} \tag{6.19} \]

LLDA attempts to maximize $J = \log \left( |U^T B U| / |U^T W U| \right)$ under the orthonormality constraint $U_k U_k^T = U_k^T U_k = I$. The constrained nonlinear programming problem is solved by successively calculating the bases of $U_k$ from the subspace orthogonal to the already calculated bases. Unlike the original LLDA description by Kim and Kittler, we propose using an overall optimization procedure with two routines, a “subspace” routine and a “one-base” LLDA routine, which we found more robust and reliable for fMRI data. The “subspace” routine creates subspaces orthogonal to the already calculated bases $u_{k1}, \ldots, u_{k(n-1)}$ and calculates $u_{kn}$ in the subspace by calling the “one-base” routine, which solves a one-base LLDA problem. The “subspace” routine repeats iteratively until all the bases are calculated. The “one-base” routine solves a problem that is similar to LLDA but with a different constraint, $U_k^T U_k = 1$, where $U_k$ is just a column vector and is subject only to the normalization constraint. The pseudo-code of the two routines is given in the appendix.

The procedure proposed here has several advantages over the original one proposed by Kim and Kittler for LLDA [10].
First, the orthogonalization is performed only once for every $u_{kn}$ in our procedure (at step 4 of the “subspace” routine), because the orthogonality constraint is enforced before solving the one-base LLDA problem by projecting $u_{kn}$ (at steps 3 and 5 of the “subspace” routine) onto the subspace basis $A_k$, which is orthogonal to the already calculated $u_{k1}, \ldots, u_{k(n-1)}$. In contrast, the orthogonality constraint in the original LLDA implementation [10] is applied while solving the one-base LLDA problem, and thus the orthogonalization must be performed at every iteration, which is less computationally efficient. Second, in the one-base LLDA problem, we do not randomize the starting point as Kim and Kittler did, but estimate it by solving a generalized eigenvector problem, which results in much faster and more stable convergence.

As pointed out by Kherif et al. [9], averaging of fMRI data across individuals is only prudent when the mean is a good representation of the group. They suggested a way of selecting subjects with “similar” activation patterns. In the current situation, since the activation statistics from each individual have been transformed to a common vector space, defined by the y’s, we can now selectively weight each subject so that the weighted y’s are maximally discriminable in the transformed vector space. Specifically, we can weight each subject k by a small positive factor $\alpha_k$,

\[ \psi = \alpha_k y = \alpha_k U_k^T (x - m_k), \tag{6.20} \]

subject to:

\[ \sum_{k=1}^{K} \alpha_k = 1, \quad \alpha_k \in [0, 1] \tag{6.21} \]

where k is the subject index, K is the total number of subjects, and the weights are chosen so that the means of the weighted y’s are maximally discriminable by, e.g., a standard t-test. The $\alpha_k$’s can then be estimated by constrained non-linear optimization methods.
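As a rough illustration of the weighting in Eqs. (6.20) and (6.21), the Python sketch below searches for the subject weights in the simplest possible setting. Everything in it is an assumption made for the example: there are only two subjects, the transformed data are one-dimensional, discriminability is measured by the absolute value of an unequal-variance two-sample t statistic on the pooled weighted values, and a coarse grid search over the simplex stands in for a proper constrained non-linear optimizer. The function names and data are hypothetical.

```python
import math


def welch_t(a, b):
    """Two-sample t statistic with unequal variances."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    return (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))


def weighted_t(alpha, subjects):
    """t statistic between tasks after scaling subject k's values by alpha[k].

    subjects[k] = (y_task1, y_task2): 1-D transformed values of subject k.
    """
    psi_task1 = [alpha[k] * y for k, (y1, _) in enumerate(subjects) for y in y1]
    psi_task2 = [alpha[k] * y for k, (_, y2) in enumerate(subjects) for y in y2]
    return welch_t(psi_task1, psi_task2)


def grid_search_alpha(subjects, steps=20):
    """Coarse search over alpha = (a, 1 - a), a in [0, 1], for two subjects."""
    best_alpha, best_t = None, -1.0
    for i in range(steps + 1):
        a = i / steps
        alpha = (a, 1.0 - a)  # satisfies the sum-to-one and range constraints
        t = abs(weighted_t(alpha, subjects))
        if t > best_t:
            best_alpha, best_t = alpha, t
    return best_alpha, best_t


# Toy data (hypothetical): subject 0 separates the tasks well, subject 1 does not.
subjects = [
    ([2.0, 2.2, 1.8, 2.1], [-2.0, -1.9, -2.2, -2.1]),
    ([0.5, -0.4, 0.1, -0.3], [0.2, -0.5, 0.4, -0.1]),
]
alpha_hat, t_hat = grid_search_alpha(subjects)
```

Because the t statistic is invariant to a common rescaling of all weights, the sum-to-one constraint only fixes the overall scale; what the search actually determines is the relative weight of each subject.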
To estimate the overall transformation, i.e., the linear combination of ROIs that maximally discriminate between groups, we take advantage of the individual subject αk ’s computed with (6.20) and summarize all the individual Uk s by: U¯ K ¸ ak Uk (6.22) k 1 In the appendix, we explicitly demonstrate how U¯ , at least for the two-class problem, can be jointly estimated during the LLDA optimization. This joint optimization procedure gaver results virtually identical to that given by (6.22). 6.2.3 Backward Step-wise Discrimination To determine which ROIs should actually be included in the discriminant analysis, a backward step-wise procedure can be employed. This method first assumes that all ROIs significantly contribute to the discrimination. The least insignificant element of U¯ is then removed from the analysis, and the procedure is repeated. The fewest number of overall included ROIs that result in the maximum number of significant ROIs is then retained. 197 6.2. Experimental Methods 6.2.4 Contrasting Multiple Tasks In fMRI experiments there is frequently interest in contrasting multiple tasks. For example, given the model: Yk X k βk εk the contrast of choice for a single task and rest is: [ 1, -1, 0, 0, . . . ], where the first two columns of the design matrix represent the anticipated activation and rest respectively. In the current case, we are interested in contrasting activations resulting from right or left hand movement, independent of whether or not the task was externally-guided (EG) or internally-guided (IG), resulting in a contrast citation[ 1/2, 1/2, -1, 0, 0, . . . ] if the design matrix represented [ EG, IG, rest, . . . ]. However, for the LLDA implementation we can proceed two ways: (1) we can pool the t-statistics from the EG and IG tasks — in effect placing the same ROI mask over both statistical maps from the EG vs. rest and IG vs. rest to create an overall group of voxel-based statistics representing that ROI. 
We note that this is equivalent to creating a contrast such as [1/2, 1/2, -1, 0, 0, ...]; (2) we can treat each task as a separate cluster and calculate the $U_k$ for each task of each subject. Although standard approaches employ method (1), we have had much better empirical results with method (2). While it is difficult to analytically determine why this would be so, a possible and intuitive explanation is “Simpson's paradox” [21]. The paradox shows that the conditional association between two factors A and B given another factor C may differ completely from their marginal association, i.e., their association when C is ignored. In our situation, right (or left) hand movement and ROI activity are the two factors A and B, while the condition EG or IG is the third factor. Method (1) pools the EG and IG data together and thus studies the marginal association, possibly missing meaningful interactions. Method (2) considers EG and IG separately and thus studies the conditional association.

6.2.5 Significance of Discrimination

A necessary but not sufficient condition for assessing the role of individual ROIs in the overall transformation, $\bar{U}$, is that the transformation results in a separation that is statistically significant. Specifically, we determine the separation on a subject by subject basis by evaluating the differences with t-tests [20]:

$$t_i = \frac{\bar{y}_{A,i} - \bar{y}_{B,i}}{\sqrt{\sigma^2_{y_{A,i}} / \eta_{y_{A,i}} + \sigma^2_{y_{B,i}} / \eta_{y_{B,i}}}}, \qquad (6.23)$$

where $y$ is the transformed data using the mean transformation, $\bar{U}$, A and B refer to the different activation maps being compared, and $i$ is the subject index. We can then test the null hypothesis that the $t_i$'s from all subjects do not differ significantly from 0.

In order to determine whether or not the overall discriminant, $\bar{U}$ (Eq. (6.22)), resulted in false positives, ROC curves were drawn to compare the distributions of $\bar{U}$ under two hypotheses:

H0: None of the ROIs are differently activated when the subjects perform the two tasks.
H1: Some of the ROIs are differently activated when the subjects perform the two tasks.

The distributions of $\bar{U}$ are estimated from data with two resampling techniques, bootstrapping and permutation. Bootstrapping resamples S subjects from the real data with replacement and simulates the random recruitment of subjects from a population. If the data of the same subject are sampled more than once, they are considered as different subjects, for the purposes of LLDA, in the resampled data. Permutation randomly shuffles the task labels of the data matrices of each subject. The distribution of $\bar{U}$ under H0 is estimated in two steps: (1) generate many new data sets with bootstrapping followed by permutation; (2) apply LLDA to the new data sets and calculate a $\bar{U}$ for each data set. We performed 2000 bootstraps to estimate $\bar{U}$ under H0. The distribution of $\bar{U}$ under H1 was estimated similarly, but only bootstrapping was employed in the resampling. With the estimated distributions of $\bar{U}$ under H0 and H1, ROC curves were drawn.

6.2.6 SPM Methodology

We also performed SPM 5 analysis on our data. The fMRI data were pre-processed (motion corrected, spatially normalized to a common stereotactic space and smoothed with a Gaussian kernel with 6-mm full width at half maximum (FWHM)). The first level comparisons were made between right hand and left hand finger sequential movements, and the individual activation maps (right > left and left > right) for each subject were first generated by a fixed-effect model with the voxels that exceeded a probability threshold of p = 0.05 (false discovery rate (FDR) corrected). In order to provide a reasonable comparison, the first level analysis was restricted to those voxels within the ROIs, using the PickAtlas software package [13]. The second level analysis was made based on the results of the individual activation maps generated in the first level comparison.
That is, the contrast images, one from each of the 10 subjects from the first level comparison (for example, right > left), were assessed using a one-sample t-test in a random-effect model. This can be expressed in the equation $C\hat{\beta} = X_g \beta_g + \varepsilon_g$, where $C\hat{\beta}$ is the contrast result from the first level analysis, $X_g$ is a design matrix that is simply a single column of 1's, $\beta_g$ is a second level parameter, and $\varepsilon_g \sim N(0, \sigma_g^2)$. The regions that have clusters with at least 5 contiguous voxels exceeding a probability threshold of p = 0.001 (uncorrected) were identified as activated regions.

6.3 Results

Using SPM 5, we found activation in the left primary motor cortex during right hand movement, and similarly right primary motor cortex activation during left hand movement. Activation in the left cerebellar hemisphere, right somatosensory cortex, and right thalamus was detected during left hand movement only (Figure 6.3a). Similarly, during right hand movement, significant activation was detected with LLDA in the left primary motor cortex. With left hand movement, LLDA detected significant activity in the right primary motor cortex, the right supplementary motor area, the right striatum, the right thalamus and the left cerebellar hemisphere (Figure 6.3b). The ROC curves demonstrated large deviations from the diagonal, suggesting that the distributions under the null hypothesis (a region is not active) and the alternative hypothesis (it is active) have little overlap, so that a threshold can easily be set to separate them. The ROC curves also demonstrated superior performance of LLDA (solid line) over LDA (dotted line).

In comparing the SPM and LLDA results contrasting EG vs. IG, there were also distinct differences (Figure 6.4). SPM analysis did not identify any clusters as having significant activity for either the EG > IG comparison or the IG > EG comparison. In contrast, for the EG vs.
IG comparison LLDA detected significant activity in the left primary motor cortex and the left SMA when comparing IG > EG tasks; the SMA is a structure previously found to display increased activity during IG tasks [22]. When comparing EG > IG, significant activity was found in the dorsolateral prefrontal cortex bilaterally, consistent with previous results [18]. The weighting across subjects for the joint optimization was relatively consistent (Figure 6.5). In all cases the separation across each subject (cf. Eq. (6.23)) was significantly different from zero (Figure 6.5).

[Figure 6.3: Comparison of group results. Note that this is a combination of EG and IG movements. (a, top panel) SPM results. (b, bottom panel) LLDA results. Panels: Right Hand > Left Hand and Left Hand > Right Hand. ROC curves are shown for significantly active regions. A comparison between LLDA (solid line) and regular LDA (dotted line) is shown.]

[Figure 6.4: Comparison of group results, examining movement type (EG vs IG). (a, top panel) SPM results. (b, bottom panel) LLDA results. Panels: Right IG > EG and Right EG > IG.]

[Figure 6.5: Discriminability of overall discriminant function, $\bar{U}$, across subjects. Panel: Right hand, EG vs IG. Horizontal axis: Subject Number (1 to 10). Vertical axis: Relative Weighting.]

6.4 Discussion

We have utilized a modification of the local linear discriminant analysis (LLDA) algorithm to perform group-wise analysis on activation maps that have been manually labelled with individually-drawn Regions of Interest (ROIs).
In addition to finding the contralateral primary motor cortical activation and ipsilateral cerebellar activation similar to the SPM approach, we found a number of regions that were significantly active, including bilateral dorsolateral prefrontal cortex during externally guided movement. We also found differences between the use of the right and left hand, such that the right SMA was activated only during left hand performance. This result most likely reflects differences due to hand dominance.

In contrast to linear discriminant analysis (LDA), which treats each subject independently, LLDA tries to simultaneously discriminate combinations of ROIs differing between tasks within individuals, while still trying to determine a common transformation across individuals (Figure 6.1). We are particularly interested in the vector space of ROI-based activation statistics that is invariant across individuals, but differs by task activation. A permutation/bootstrapping approach has been employed to prevent false positives.

A useful feature of LLDA is that it transforms the activation statistics from each subject (or each task of each subject) into a common vector space with respect to relative activation of ROIs during task performance. This allows us to meaningfully compare how similar the combined activation, specified by $\bar{U}$, is across subjects (Figure 6.5), providing a check for subjects whose activations represent outliers.

The proposed method may be computationally expensive. To run 10 subjects with 16 ROIs, using full backward step-wise discrimination, with LLDA at each step, took approximately 30 minutes in total on a P4 PC. As this was implemented in Matlab for programming convenience, it is expected that compilation could improve the computational performance.

By repetitively resampling voxel-based statistics to ensure a multivariate Gaussian distribution of the feature vectors (Eq. (6.9)) there is a risk that our bootstrap samples may no longer be i.i.d.
Often researchers restrict the ROIs to the intersection between anatomically derived ROIs and voxels deemed “active” for a given task. However, by restricting subsequent analyses to these active voxels, the distribution of activation statistics will necessarily be highly non-Gaussian. A balance must therefore be obtained between making the data more Gaussian and the risk of using non-i.i.d. bootstrap samples. Since recent work has suggested that the i.i.d. assumption can be violated without disrupting the results excessively (e.g., [19]), we suggest the current approach is reasonable.

Since we are interested in subcortical structures, some consideration must be given to ROIs that have few voxels. We suggest that this will result in discriminant functions that are overly conservative. For example, consider just two ROIs, $\mathrm{ROI}_{small}$ with relatively few voxels and $\mathrm{ROI}_{large}$ with many voxels. The effective oversampling of $\mathrm{ROI}_{small}$ to create a $2 \times n$ bootstrap matrix ($n$ being the number of bootstrap samples, cf. Eq. (6.11)) will result in conservative estimates of the discriminant function, since the heavily oversampled $\mathrm{ROI}_{small}$ voxels will be distributed over the entire range of voxels in $\mathrm{ROI}_{large}$.

There is growing recognition that warping of individual subjects' brains to a common space may cause particular problems with registration, especially with small subcortical structures [17, 24]. Despite widespread evidence that subcortical structures such as the thalamus are an integral part of the network used for motor control, only the LLDA approach, when applied to unwarped data, was able to detect significant activation differences between right and left handed task performance. We suggest that experimenters employing tasks expected to activate subcortical structures should use caution when warping individual subjects' brains to a common space.

6.5 Technical Appendix

“Subspace” routine: incrementally calculate $u_{kn}$

(1) Initialize: $n = 1$, $A_k = I_{N \times N}$. $A_k$, an $N \times (N + 1 - n)$ matrix, is the subspace onto which $u_{kn}$ is projected.
(2) Call the “One-base” routine to calculate $u'_{kn}$. $u'_{kn}$, a column vector of $N + 1 - n$ elements, is the projection of $u_{kn}$ in $A_k$.
(3) Convert $u'_{kn}$ to $u_{kn}$: $u_{kn} = A_k u'_{kn}$. $u_{kn}$ is a column vector of $N$ elements.
(4) Create a subspace $S_k$ which is orthogonal to $u'_{kn}$. $S_k$ is an $(N + 1 - n) \times (N - n)$ matrix.
(5) Prepare for the next iteration: $A_k = A_k S_k$, $S = \mathrm{diag}(S_1, \ldots, S_K)$, $B = S^T B S$, $W = S^T W S$, $n = n + 1$.
(6) If $n > N$, stop; else go back to step 2.

“One-base” routine: solve the one-base LLDA problem

The non-linear program with the quadratic constraint is solved with a sequential quadratic programming (SQP) method implemented in the Matlab function fmincon. The starting point $U^{int}_k$ is initialized by solving a generalized eigenvector problem as shown below.

(1) $B = \sum_{i=1}^{N} \sum_{j=1}^{N} B_{ij}$, $W = \sum_{i=1}^{N} \sum_{j=1}^{N} W_{ij}$.
(2) $B u^{int} = \lambda W u^{int}$, where $\lambda$ is the largest generalized eigenvalue.
(3) $U^{int}_k = u^{int}$.

Joint optimization of mean U for the two-class problem

The ultimate goal of the optimization is to maximize the discriminant function of $y^{mean}$, i.e., the weighted $y$. As $y^{mean} = \alpha_k y = \alpha_k U_k^T (x - m_k)$, the discriminant function of $y^{mean}$ is

$$J = \log \frac{|\tilde{B}|}{|\tilde{W}|},$$

where

$$\tilde{B} = \begin{bmatrix} \alpha_1 U_1^T & \cdots & \alpha_K U_K^T \end{bmatrix} \begin{bmatrix} B_{11} & \cdots & B_{1K} \\ \vdots & \ddots & \vdots \\ B_{K1} & \cdots & B_{KK} \end{bmatrix} \begin{bmatrix} \alpha_1 U_1 \\ \vdots \\ \alpha_K U_K \end{bmatrix} = U^T B U,$$

$$\tilde{W} = \begin{bmatrix} \alpha_1 U_1^T & \cdots & \alpha_K U_K^T \end{bmatrix} \begin{bmatrix} W_{11} & \cdots & W_{1K} \\ \vdots & \ddots & \vdots \\ W_{K1} & \cdots & W_{KK} \end{bmatrix} \begin{bmatrix} \alpha_1 U_1 \\ \vdots \\ \alpha_K U_K \end{bmatrix} = U^T W U.$$

In the case of a two-class problem, $U$ is a column vector composed of the column vectors $\alpha_i U_i$. Though $U_i$ and $\alpha_i$ are subject to the constraints $|U_i| = 1$, $\sum_i \alpha_i = 1$ and $\alpha_i \geq 0$, the optimization can be solved without the constraints as follows. First, we solve

$$\max_U J = \log \frac{|\tilde{B}|}{|\tilde{W}|} = \log \frac{|U^T B U|}{|U^T W U|}, \qquad U = \begin{bmatrix} U_1 \\ \vdots \\ U_K \end{bmatrix},$$

without the constraints by calculating the generalized eigenvectors of $B$ and $W$. Then, we normalize each $U_i$:

$$b_i = |U_i|, \qquad U_i = U_i / b_i.$$

Finally, we derive the weight $\alpha_i$ from $b_i$:

$$\alpha_i = \frac{b_i}{\sum_j b_j}.$$

Because the $U_i$ and $\alpha_i$ solved in this way satisfy the constraints, and they optimize the unconstrained problem whose solution is not worse than that of the constrained problem, they are also the solution of the constrained problem.

Acknowledgments

This work was supported by a National Parkinson's Foundation Center of Excellence grant (MJM), NSERC Grant CHRPJ 323602-06 (MJM), CIHR grant CPN-80080 (MJM) and NIH grants AG21491 (XH), and RR00046 (GCRC).

Bibliography

[1] Christian F Beckmann, Mark Jenkinson, and Stephen M Smith. General multilevel linear modeling for group analysis in FMRI. Neuroimage, 20(2):1052–1063, Oct 2003. doi: 10.1016/S1053-8119(03)00435-X. URL http://dx.doi.org/10.1016/S1053-8119(03)00435-X.
[2] V. D. Calhoun, T. Adali, L. K. Hansen, J. Larsen, and J. J. Pekar. ICA of functional MRI data: An overview. In 4th International Symposium on Independent Component Analysis and Blind Signal Separation, 2003.
[3] D. L. Collins, A. P. Zijdenbos, V. Kollokian, J. G. Sled, N. J. Kabani, C. J. Holmes, and A. C. Evans. Design and construction of a realistic digital brain phantom. IEEE Transactions on Medical Imaging, 17:463–8, 1998.
[4] H. Damasio. Human Brain Anatomy in Computerized Images. Oxford University Press, 2005.
[5] B. Flury. A First Course in Multivariate Statistics. Springer-Verlag, 1997.
[6] K. J. Friston. Statistical parametric mapping and other analyses of functional imaging data. Brain Mapping, The Methods, pages 363–385, 1996.
[7] K. J. Friston, D. E. Glaser, R. N. A. Henson, S. Kiebel, C. Phillips, and J. Ashburner. Classical and Bayesian inference in neuroimaging: applications. Neuroimage, 16(2):484–512, Jun 2002. doi: 10.1006/nimg.2002.1091. URL http://dx.doi.org/10.1006/nimg.2002.1091.
[8] K. J. Friston, W. Penny, C. Phillips, S. Kiebel, G. Hinton, and J. Ashburner. Classical and Bayesian inference in neuroimaging: theory.
Neuroimage, 16(2):465–483, Jun 2002. doi: 10.1006/nimg.2002.1090. URL http://dx.doi.org/10.1006/nimg.2002.1090.
[9] Ferath Kherif, Jean-Baptiste Poline, Sébastien Mériaux, Habib Benali, Guillaume Flandin, and Matthew Brett. Group analysis in functional neuroimaging: selecting subjects using similarity measures. Neuroimage, 20(4):2197–2208, Dec 2003.
[10] T.K. Kim and J. Kittler. Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 318–327, 2005.
[11] Rui Liao, Jeffrey L Krolik, and Martin J McKeown. An information-theoretic criterion for intrasubject alignment of FMRI time series: motion corrected independent component analysis. IEEE Trans Med Imaging, 24(1):29–44, Jan 2005.
[12] Rui Liao, Martin J McKeown, and Jeffrey L Krolik. Isolation and minimization of head motion-induced signal variations in fMRI data using independent component analysis. Magn Reson Med, 55(6):1396–1413, Jun 2006. doi: 10.1002/mrm.20893. URL http://dx.doi.org/10.1002/mrm.20893.
[13] Joseph A Maldjian, Paul J Laurienti, Robert A Kraft, and Jonathan H Burdette. An automated method for neuroanatomic and cytoarchitectonic atlas-based interrogation of fMRI data sets. Neuroimage, 19(3):1233–1239, Jul 2003.
[14] M. J. McKeown, S. Makeig, G. G. Brown, T. P. Jung, S. S. Kindermann, A. J. Bell, and T. J. Sejnowski. Analysis of fMRI data by blind separation into independent spatial components. Hum Brain Mapp, 6(3):160–188, 1998.
[15] Martin J McKeown and Colleen A Hanlon. A post-processing/region of interest (ROI) method for discriminating patterns of activity in statistical maps of fMRI data. J Neurosci Methods, 135(1-2):137–147, May 2004. doi: 10.1016/j.jneumeth.2003.12.021. URL http://dx.doi.org/10.1016/j.jneumeth.2003.12.021.
[16] Jeanette A Mumford and Thomas Nichols. Modeling and inference of multisubject fMRI data.
IEEE Eng Med Biol Mag, 25(2):42–51, 2006.
[17] Alfonso Nieto-Castanon, Satrajit S Ghosh, Jason A Tourville, and Frank H Guenther. Region of interest based analysis of functional imaging data. Neuroimage, 19(4):1303–1316, Aug 2003.
[18] Satoru Otani. Prefrontal cortex function, quasi-physiological stimuli, and synaptic plasticity. J Physiol Paris, 97(4-6):423–430, 2003. doi: 10.1016/j.jphysparis.2004.01.002. URL http://dx.doi.org/10.1016/j.jphysparis.2004.01.002.
[19] D. N. Politis. The impact of bootstrap methods on time series analysis. Statistical Science, 18:219–230, 2003.
[20] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery. Numerical Recipes in C (2nd ed.): the Art of Scientific Computing. Cambridge University Press, 1992.
[21] E. H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society, Ser. B, 13:238–241, 1951.
[22] PL Strick, RP Dum, and N. Picard. Motor areas on the medial wall of the hemisphere. In Novartis Found Symp, volume 218, pages 64–75, 1998.
[23] J. Talairach and P. Tournoux. Co-Planar Stereotaxic Atlas of the Human Brain: 3-Dimensional Proportional System: An Approach to Cerebral Imaging. Thieme, 1988.
[24] B. Thirion, A. Roche, P. Ciuciu, and J.B. Poline. Improving sensitivity and reliability of fMRI group studies through high level combination of individual subjects results. In Proc. MMBIA, 2006.
[25] K. J. Worsley, C. H. Liao, J. Aston, V. Petre, G. H. Duncan, F. Morales, and A. C. Evans. A general statistical analysis for fMRI data. Neuroimage, 15(1):1–15, Jan 2002. doi: 10.1006/nimg.2001.0933. URL http://dx.doi.org/10.1006/nimg.2001.0933.

Chapter 7
Conclusions and Discussions

The previous chapters have, at length, discussed several approaches of applying dynamic Bayesian networks to analyzing neural signals. This chapter completes the study by reviewing them in light of current research in the field.
The topics, methods and results of the conducted research are summarized in Section 7.1. The contributions and potential applications of the technology developed in the research are highlighted in Section 7.2. The limitations and possible future work are discussed in Section 7.3.

7.1 Summary

Graphical models are a suitable prototype for modeling the interactions among neurons. It is very natural to abstract the nervous system mathematically as a network whose neurons (or nodes) connect with each other through nerve fibres (or edges). Bayesian networks have been actively studied for decades in the field of artificial intelligence, but to meet the particular need of biomedical research and the limitations of current neural-signal acquisition technology, classical methods still need to be adapted and novel methods still need to be developed.

Since biomedical research usually has high requirements of interpretability, generality and reliability, we mainly investigated the following topics: structural-feature extraction, group analysis, and error control. To provide guidance on further research exploration and clinical diagnosis, mathematical models in biomedical research should not only fit experimental data well, but also be able to extract interpretable features from data. To make the discoveries generally applicable to a population, rather than just a specific individual, experimental data should be studied at group level, with both population homogeneity and heterogeneity being considered. To produce reliable analysis results, the rate of errors must be curbed at a low level or balanced against the chance of missing true evidence.

Structural-Feature Extraction

Our research on extracting informative features from Bayesian networks was described in Chapter 2.
We developed a structural-feature extraction framework for Bayesian networks that is able to discover structural features that are significantly different between two groups of Bayesian networks, such as the pattern of sub-networks. The framework includes three components: Bayesian-network modeling, statistical structure-comparison, and structure-based classification. It was demonstrated with a study on the coordination patterns of muscle activities using surface electromyography (sEMG).

First, Bayesian networks were applied to modelling the sEMG signals of multiple muscles, with directed acyclic graphs encoding the interactions between muscle activities. Second, statistical comparison was conducted between the learned Bayesian networks, to extract structural features which characterize the interaction patterns of muscle activities. Structural features can be the number of edges in or out from a node (which is called degree in graph theory), or the length of the shortest path from one node to another (which is called distance in graph theory). To go beyond the level of individual nodes and individual edges, we worked on sub-networks, adapting the concept of network motif [14]. Third, interpretable classifiers, such as classification trees, were built on those structural features. We used classification trees because their classification processes can be easily visualized.

In our study on thirteen stroke patients and nine healthy subjects, the classification trees could effectively (with error rates lower than 6%) classify the reaching movements performed by the healthy subjects and the stroke patients.

Comparison of Group-Analysis Methods

Our research on group analysis based on dynamic Bayesian networks was presented in Chapter 3. Group analyses of any sort fundamentally involve two factors: common features shared by group members, and specific features of each individual.
A review of the literature reveals that group-analysis methods for graphical models can be divided into three broad categories, namely the “virtual-typical-subject” approach [5, 16, 23], the “individual-structure” approach [5, 11] and the “common-structure” approach [8, 13]. The three approaches handle the similarity and diversity of a group of graphical models, e.g. Bayesian networks, at different structural and parametric levels.

We investigated the performances of the three group-analysis approaches in an fMRI study on Parkinson’s disease with L-dopa medication. Broadly speaking, we attempted to answer three fundamental questions: “which approach most accurately reflects the underlying biomedical behavior?”, “do the approaches lead to considerably different analysis results?”, and “how can the suitable approach be selected?” We compared the three approaches from the aspects of their statistical goodness-of-fit to the data, and more importantly their sensitivity in detecting the effect of the medication on the disease.

The three approaches led to considerably different group-level results, learning different network structures, and detecting different numbers of connections normalized by the medication, from the same data set. The “virtual-typical-subject” approach fitted the data of the healthy people best, while the “individual-structure” approach fitted the data of the patients best, from the view of the Bayesian information criterion (BIC). The “individual-structure” approach was more sensitive than the other two approaches in detecting the normalizing effect of the medication on brain connectivity, finding sixty-four out of sixty-six connections between brain regions being normalized by the medication.

Error-Rate Control in Structure Learning

Our research on error control was elaborated in Chapter 4 and then further developed in Chapter 5.
We designed an algorithm for static Bayesian networks that is able to control the false discovery rate (FDR) [1, 22] of the network connections inferred from experimental data under user-specified levels, for example, conventionally 5%. The FDR is the expected ratio of falsely claimed positive hypotheses to all those claimed. In the context of learning network connections, the FDR is the expected ratio of falsely “discovered” connections to all those “discovered”. In other words, not rigorously, the FDR can tell how many among the “discovered” connections are actually spurious. The algorithm is based on the popular PC algorithm invented by Peter Spirtes and Clark Glymour [19] which is fast but does not control the effect of multiple testing. We built an FDR-control procedure [1] into the PC algorithm to correct the effect of simultaneously testing the existence of multiple edges. We named the FDR-embedded PC algorithm the PCfdr algorithm, and a heuristic modification the PCfdr* algorithm. Theoretically, we proved that under mild conditions the PCfdr algorithm is able to curb the FDR under user-specified levels at the limit of large sample size, and that both the PCfdr algorithm and the PCfdr* algorithm are able to recover all the true connections with probability one as the sample size approaches infinity. Empirically, we tested the two new algorithms on simulated data of moderate sample sizes, such as several hundred. In the test, the PCfdr algorithm controlled the FDR under user-specified levels, and the PCfdr* algorithm controlled the FDR accurately around the user-specified levels. The original version of the PCfdr and PCfdr* algorithms was designed only for static Bayesian networks, taking only the classical approach. Two advanced versions were then developed. One is an adaptation to prior knowledge, allowing users to specify which edges must appear in the network, which cannot, and which are to be learned from data. 
This extension is naturally applicable to dynamic Bayesian networks, by simply regarding them as Bayesian networks that cannot have edges from time t + 1 to time t. The other extension is using the PCfdr algorithms to improve Bayesian inference of (dynamic) Bayesian networks. The idea is to first learn a network with the PCfdr algorithms, and then make Bayesian inference based on a prior distribution derived from the learned network. It accelerates Bayesian inference and is relatively robust to perturbing noise.

These new FDR-controlled structure-learning algorithms were applied to studying the functional connectivity among brain regions using functional magnetic resonance imaging (fMRI). Evidence was found to support the normalizing effect of medication on Parkinson’s disease, and the compensation mechanism of the disease.

Selection of Regions of Interest

Chapter 6, as a supplement to Chapters 2 to 5, provides a fast method to select brain regions of interest for further modeling their connectivity with Bayesian networks.

7.2 Contributions and Potential Applications

The major contributions of this research to dynamic Bayesian-network modeling of neural signals are: algorithms that are able to learn reliable network structures from experimental data, a framework for extracting interpretable structural features, and guidance on selecting group-analysis strategies.

Structure-Learning Algorithms with Error Rates Controlled

The PCfdr and the PCfdr* algorithms, as the reviewers of the Journal of Machine Learning Research commented, “have a significant contribution in the field of network learning, in that it introduces an adequate method of handling errors with finite data and theoretical justification.” The algorithms provide explicit control over the false discovery rate of network structures learned from experimental data. They are modular, flexible and computationally efficient. Their asymptotic performances have been theoretically proved, and in the case of moderate sample size they have been extensively evaluated with simulated data.

Current structure-learning algorithms for Bayesian networks have not been adequately adapted to explicitly control the FDR of the claimed “discovered” networks. Score-based search methods [6] look for a suitable network structure by optimizing a certain criterion of goodness-of-fit, such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), or the Bayesian Dirichlet likelihood equivalent metric (BDE), but scores of goodness-of-fit cannot directly reflect the error rate. The Bayesian approach, theoretically, can estimate the exact posterior probability of any structural feature, but in practice its capacity is largely limited for computational reasons, and it is only feasible for small scale problems, or under certain additional constraints. For certain special prior distributions, exact-inference algorithms [9] are available, and are efficient for problems with fewer than thirty nodes. However, they are not applicable to the widely accepted uninformative prior, i.e. the uniform prior distribution over DAGs [3, 4]. Markov chain Monte Carlo (MCMC) [12], a generic implementation of Bayesian inference, usually requires intensive computation and the results may depend on the initial state of the randomization.

The PCfdr and PCfdr* algorithms provide direct control over the FDR of the networks learned from experimental data, reporting reliable analysis results. We proved that under mild conditions the PCfdr algorithm is able to curb the FDR under user-specified levels at the limit of large sample size, and that both the PCfdr algorithm and the PCfdr* algorithm are able to recover all the true connections with probability one as the sample size approaches infinity.
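To make the notion of FDR control concrete, the following is a minimal sketch of the Benjamini-Hochberg step-up procedure, one widely used FDR-control procedure, applied to the p-values of candidate edges. It is an illustration only: whether it matches the exact procedure embedded in the PCfdr algorithm is an assumption, and the function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k (1-based, in ascending p-value
    order) such that p_(k) <= q * k / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= q * rank / m:
            largest_rank = rank
    rejected = [False] * m
    for index in order[:largest_rank]:
        rejected[index] = True
    return rejected


# Hypothetical p-values for four candidate edges, tested at FDR level q = 0.05.
edge_p_values = [0.02, 0.02, 0.03, 0.5]
edges_kept = benjamini_hochberg(edge_p_values, q=0.05)
```

Note the step-up behaviour in the example: the smallest p-value alone would not pass its own threshold, but it is still rejected because a larger rank satisfies the criterion.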
In the cases of moderate sample size (about several hundred), simulation experiments have shown that the method is still able to control the FDR under the user-specified level, and its heuristic modification, the PCfdr* algorithm, is able to control the FDR accurately around the user-specified level.

The simulation study has also shown that the extra computational cost to achieve the FDR control is negligible when compared with that already spent by the PC algorithm on statistical tests of conditional independence. The computational complexity of the new algorithm is closely comparable with that of the PC algorithm.

The PCfdr and the PCfdr* algorithms are modular, consisting of the PC search strategy, statistical tests of conditional independence and FDR-control procedures. Different statistical tests and FDR-control procedures can be “plugged in” depending on the type of data and the statistical model. Thus, the method is applicable to any models for which statistical tests of conditional independence are available, such as discrete models and Gaussian models. A previous method [17] is only applicable to graphical Gaussian models.

The PCfdr and the PCfdr* algorithms are flexible. Their extended versions allow users to specify which edges must appear in the network, which cannot, and which are to be learned from data. The extended versions are not only applicable to static Bayesian networks, but also to dynamic Bayesian networks. The PCfdr algorithms can also be used to accelerate Bayesian inference on dynamic Bayesian networks.

The algorithms of the PCfdr family are widely applicable to research where network learning is involved and the uncertainty of the learned networks should be controlled or assessed, for example, discovering gene regulatory networks, or cellular metabolic networks.
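The use of prior knowledge to handle dynamic Bayesian networks can be sketched as follows: a dynamic network unrolled over time slices is treated as a static network in which edges pointing backward in time are forbidden. This is a minimal illustration under stated assumptions; the node representation as (variable, slice) pairs and the function name are hypothetical and are not taken from the thesis implementation.

```python
def allowed_edges(variables, n_slices=2):
    """Enumerate candidate directed edges of an unrolled dynamic network.

    Nodes are (variable, time_slice) pairs. An edge is a candidate only if
    it does not point from a later slice to an earlier one, so edges from
    time t + 1 to time t are excluded as prior knowledge.
    """
    nodes = [(name, t) for t in range(n_slices) for name in variables]
    return {
        (source, target)
        for source in nodes
        for target in nodes
        if source != target and source[1] <= target[1]
    }


# Two hypothetical brain regions observed at two consecutive time slices.
candidates = allowed_edges(["x", "y"], n_slices=2)
```

A structure-learning routine would then test only the candidate edges, leaving the forbidden ones out of the network from the start.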
We have successfully applied the algorithms to discovering connectivity between brain regions using functional magnetic resonance imaging (fMRI). The discoveries were consistent with the normalizing effect of medication on Parkinson’s disease.

Effective Framework for Structural-Feature Extraction

The framework based on Bayesian networks for structural-feature extraction provides an effective and flexible method for discovering interaction patterns among neural activities, and its application to sEMG signals offers a fundamentally new understanding of the nature of sEMG signals.

The framework is effective and flexible in discovering interaction patterns among neural activities. As demonstrated in a study on stroke using sEMG, it discovered several groups of muscles whose coordinated activities in reaching movement differ significantly between healthy people and stroke patients, and in classification it achieved error rates lower than 6%. The analysis framework, consisting of Bayesian-network modeling, statistical structure-comparison, and structure-based classification, is also adaptable to different research specifications. Different types of Bayesian networks can be chosen according to prior knowledge and specific research scenarios. For example, dynamic Bayesian networks can be used to model temporal interactions, and Bayesian networks with hidden nodes to model unobservable neural signals driving muscles. Various structural features can be extracted with the analysis framework, for instance, the number of edges into or out of a node, the length of the shortest path from one node to another, or the pattern of sub-networks.

Its application to modelling sEMG signals offers a fundamentally new understanding of the nature of sEMG signals.
Traditionally, sEMG signals are decomposed into two components: the amplitude, which is extracted for further analysis, and the modulated carrier, which is ignored as random noise [2]. However, as we still lack a comprehensive understanding of the nature of the carrier signals, the common practice of disregarding them is questionable. With the framework for structural-feature extraction, the carrier signals could classify the reaching movements of stroke patients and healthy people with very low error rates, between 0.00% and 5.56%. The classification error rates based on the amplitude with traditional standard methods were considerably higher, between 7.69% and 16.67%. This implies that the carrier signals are not purely random, but also informative.

Guidance on Selection of Group-Analysis Methods

The comparison of group-analysis methods was, to the best of our knowledge, the first study specifically devoted to group analysis with graphical models in real biomedical applications. It reveals the limitations of current methods, showing directions for future development. It also provides guidance on the selection of group-analysis methods in current research practice.

The three popular group-analysis methods, that is, the virtual-typical-subject approach, the common-structure approach, and the individual-structure approach, yielded considerably different group-level results, and tended to find results supporting their own assumptions even when the assumptions were actually wrong. This clearly revealed their limitations: none of them provides a tunable balance between group-common features and subject-specific features, or a means to estimate the degree of inter-subject variability, so that they are confined to their rigid assumptions. While the author was finalizing the thesis, Stephan, Penny, et al., in a recently accepted paper [21], proposed a group Bayesian model to address these limitations.
The comparison also suggests that biomedical evidence is an effective criterion for selecting group-analysis models. The sensitivity to the normalizing effect of medication on Parkinson’s disease was used as the criterion for selecting group-analysis methods, and it effectively distinguished the individual-structure approach, in its favor, from the other two candidate approaches. This suggests that whenever possible researchers should design biomedical markers, and use them to select group-analysis strategies.

7.3 Future Work

Structure-learning and group analysis are important for applying dynamic Bayesian networks to neural signals, and much work is left to be done. Some of the methods we developed herein are promising prototypes but still need to be improved to better match the challenges of real-world problems. Some are successful inventions, both theoretically and empirically, and they can be extended to fully exploit their potential. Future research can focus on enhancing their ability to handle large-scale problems, and on integrating them with methods from other broad research fields.

High-Capacity Structure-Learning Algorithms

Current network-learning methods for graphical models can only reliably recover networks of about twenty nodes from data of a practical sample size, providing results at very low spatial resolution. Random graph theory has revealed that many real-world networks, such as the cellular metabolic networks, actually possess surprising scale-free properties, such as the small-world property, or self-similarity [18]. For neural-interaction networks, such as brain functional connectivity, there are also scattered reports and conjectures on their structural properties, from the perspective of random graph theory.
Since this kind of prior knowledge has not been exploited in current network-learning methods, it would be interesting to embed these properties into structure-learning methods to enhance their ability to handle networks at high spatial resolution. This development will potentially benefit not only the study of neural-interaction networks, but also many other studies that involve discovering networks from real-world data.

Extensions of PCfdr Algorithms

The scheme used in the PCfdr algorithms for controlling the false discovery rate (FDR) of network structures can be extended in many respects to better meet the needs of biomedical studies. It has been theoretically proved and empirically validated that the PCfdr algorithms are able to robustly control the false discovery rate of networks learned from experimental data. Nevertheless, the algorithms are initial inventions, and the potential of the fundamental scheme used in them is far from being fully exploited.

First, it can be embedded into algorithms that are similar to but more advanced than the PC algorithm. Besides the PC algorithm, many other algorithms, for instance, the Inductive Causation (IC) algorithm [15], the IC* algorithm [15], and the Fast-Causal-Inference (FCI) algorithm [19], are also rooted in the Markov properties [10] and based on tests of conditional independence. Because these algorithms share the same theoretical foundation, the scheme we developed for the PC algorithm can be “transplanted” into them. The FCI algorithm and the IC* algorithm can solve problems with latent variables, so it is especially interesting to adapt the FDR-control scheme for them.

Second, the PCfdr algorithms can be extended for group analysis to control the FDR at group level. Current PCfdr algorithms are designed for data sampled from a single Bayesian network. With modification, they can be extended for data sampled from a group of Bayesian networks.
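The algorithms discussed here all rest on tests of conditional independence between two variables given a conditioning set. For Gaussian models, one common choice is a Fisher z-test on the partial correlation; the sketch below is a minimal pure-Python illustration under that assumption. The function names, the toy correlation matrix, and the sample size are hypothetical and are not taken from the thesis.

```python
import math
from typing import Sequence


def partial_corr(corr: Sequence[Sequence[float]], a: int, b: int,
                 cond: Sequence[int]) -> float:
    """Partial correlation of variables a and b given the indices in cond.

    Uses the standard recursion that removes one conditioning variable
    at a time from an ordinary correlation matrix.
    """
    if not cond:
        return corr[a][b]
    c, rest = cond[0], list(cond[1:])
    r_ab = partial_corr(corr, a, b, rest)
    r_ac = partial_corr(corr, a, c, rest)
    r_bc = partial_corr(corr, b, c, rest)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


def fisher_z_p_value(r: float, n_samples: int, n_cond: int) -> float:
    """Two-sided p-value for H0: the partial correlation is zero."""
    dof = n_samples - n_cond - 3
    if dof <= 0:
        raise ValueError("sample size too small for this conditioning set")
    z = 0.5 * math.log((1 + r) / (1 - r))  # Fisher z-transform
    stat = math.sqrt(dof) * abs(z)         # approximately N(0, 1) under H0
    return math.erfc(stat / math.sqrt(2))  # equals 2 * (1 - Phi(stat))


if __name__ == "__main__":
    # Toy chain X -> Z -> Y with variable order (X, Z, Y): corr(X, Y) = 0.6 * 0.5.
    corr = [[1.0, 0.6, 0.3],
            [0.6, 1.0, 0.5],
            [0.3, 0.5, 1.0]]
    marginal = fisher_z_p_value(partial_corr(corr, 0, 2, []), 200, 0)
    given_z = fisher_z_p_value(partial_corr(corr, 0, 2, [1]), 200, 1)
    print(marginal, given_z)
```

In the toy chain, X and Y are marginally dependent but become independent once Z is conditioned on, so the second p-value is large while the first is small. A group-level extension would need a different test statistic; this sketch covers only the single-network case.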
A key inference procedure in the PCfdr algorithms is testing whether two random variables a and b are conditionally independent given a set of other random variables C. If we extend the test of conditional independence to group level, i.e. testing whether a and b are conditionally independent at the group level, then the FDR control can be applied to group analysis.

The asymptotic performance of the PCfdr algorithm has only been proved under the assumption that the number of vertices is fixed. Its behavior when both the number of vertices and the sample size approach infinity has not been studied yet. Kalisch and Bühlmann proved that for Gaussian Bayesian networks, the PC algorithm consistently recovers the equivalence class of an underlying sparse DAG, as the sample size m approaches infinity, even if the number of vertices N grows as quickly as $O(m^{\lambda})$ for any $0 \le \lambda < \infty$ [7]. Their idea is to adaptively decrease the type I error rate α of the PC algorithm as both the number of vertices and the sample size increase. It is desirable to study whether similar behavior can be achieved with the PCfdr algorithm if the FDR level q is adjusted appropriately as the sample size increases.

Modular, Updatable and Scalable Group-Analysis Methods

Current group-analysis methods for Bayesian networks cannot effectively balance group-common features and subject-specific features, or meet the needs of various experiment designs. It is necessary to develop more advanced group-analysis methods in brand new frameworks. As we discussed in Section 7.2, the three popular group-analysis approaches, i.e. the virtual-typical-subject approach, the common-structure approach, and the individual-structure approach, tend to find results supporting their own assumptions even when the assumptions are actually wrong.
One of the underlying reasons is that none of the three methods provides a tunable balance between group-common features and subject-specific features, or a means to estimate the degree of cross-subject variability. They build an unnecessarily rigid boundary between the two different yet related concepts, structures and parameters. This sheds light on the development of new group-analysis methods: to transcend the boundary. In a recently accepted paper [21], Stephan, Penny, et al. have proposed a Bayesian group model to smooth the boundary between structures and parameters.

To set clear goals for the development of new group-analysis methods, we suggest three highly desirable features: being modular, being incrementally updatable, and being scalable.

Being modular means that a group-analysis method is not designed only for a particular type of single-subject model, but is versatile and applicable to different types of single-subject models. For example, both Bayesian networks and structural equation models are applicable to fMRI signals at the single-subject level, so the group-analysis method should not be restricted to only one of them, but should be able to work with both of them. If a group-analysis model can be a module of itself, then it will be able to handle hierarchical group structures.

Being incrementally updatable means that group-inference results can be condensed into summary statistics and carried forward to analyses involving newly collected data. This feature is very useful in research practice because experimental data are usually collected incrementally. For example, after a study on eighty subjects half a year ago, twenty more subjects might be recruited. In this case, if the group analysis must start over rather than update incrementally, it may need cumbersome computation to analyze the data of the one hundred subjects as a whole.
On the other hand, if the group inference is incrementally updatable, it may need much less computation to incorporate the twenty additional subjects.

Being scalable means that a group-analysis method can handle fast-growing diversity among subjects. Because modern exploratory research usually involves investigating a large number of candidate models, scalability has become a highly desirable feature for group analysis. For example, if we assume that connectivity among brain regions is different from one subject to another, and the connectivity among ten brain regions is investigated, then a group-analysis method should be able to handle the diversity of about $3.1 \times 10^{17}$ [20] different directed acyclic graphical models of brain connectivity.

Bibliography

[1] Y. Benjamini and D. Yekutieli. The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics, 29(4):1165–1188, 2001.
[2] E. A. Clancy, E. L. Morin, and R. Merletti. Sampling, noise-reduction and amplitude estimation issues in surface electromyography. Journal of Electromyography and Kinesiology, 12(1):1–16, February 2002.
[3] D. Eaton and K. Murphy. Bayesian structure learning using dynamic programming and MCMC. In Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence, 2007.
[4] N. Friedman and D. Koller. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50(1):95–125, 2003.
[5] Miguel S. Goncalves, Deborah A. Hall, Ingrid S. Johnsrude, and Mark P. Haggard. Can meaningful effective connectivities be obtained between auditory cortical regions? NeuroImage, 14(6):1353–1360, December 2001.
[6] D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.
[7] M. Kalisch and P. Bühlmann. Estimating high-dimensional directed acyclic graphs with the PC-algorithm.
Journal of Machine Learning Research, 8:613–636, 2007.
[8] Jieun Kim, Wei Zhu, Linda Chang, Peter M. Bentler, and Thomas Ernst. Unified structural equation modeling approach for the analysis of multisubject, multivariate functional MRI data. Human Brain Mapping, 28(2):85–93, 2007.
[9] M. Koivisto and K. Sood. Exact Bayesian structure discovery in Bayesian networks. Journal of Machine Learning Research, 5:549–573, 2004.
[10] S. L. Lauritzen. Graphical Models. Clarendon Press, Oxford University Press, Oxford, New York, 1996.
[11] Junning Li, Z.J. Wang, and M.J. McKeown. A multi-subject dynamic Bayesian network (DBN) framework for brain effective connectivity. In Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on, volume 1, pages I–429–I–432, 2007.
[12] D. Madigan, J. York, and D. Allard. Bayesian graphical models for discrete data. International Statistical Review, 63(2):215–232, 1995.
[13] Andrea Mechelli, Will D. Penny, Cathy J. Price, Darren R. Gitelman, and Karl J. Friston. Effective connectivity and intersubject variability: Using a multisubject network to test differences and commonalities. NeuroImage, 17(3):1459–1469, November 2002.
[14] R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: Simple building blocks of complex networks. Science, 298:824–827, 2002.
[15] J. Pearl. Causality. Cambridge University Press, 2000.
[16] Jagath C. Rajapakse and Juan Zhou. Learning effective brain connectivity with dynamic Bayesian networks. NeuroImage, 37:749–760, 2007. doi: 10.1016/j.neuroimage.2007.06.003.
[17] J. Schäfer and K. Strimmer. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21(6):754–764, 2005.
[18] Chaoming Song, Shlomo Havlin, and Hernán A. Makse. Self-similarity of complex networks. Nature, 433(7024):392–395, Jan 2005. doi: 10.1038/nature03248.
URL http://dx.doi.org/10.1038/nature03248.
[19] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. The MIT Press, 2001.
[20] B. Steinsky. Enumeration of labelled chain graphs and labelled essential directed acyclic graphs. Discrete Mathematics, 270(1-3):267–278, 2003.
[21] Klaas Enno Stephan, Will D. Penny, Jean Daunizeau, Rosalyn J. Moran, and Karl J. Friston. Bayesian model selection for group studies. NeuroImage, accepted, 2009.
[22] J. D. Storey. A direct approach to false discovery rates. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 64(3):479–498, 2002.
[23] Xuebin Zheng and Jagath C. Rajapakse. Learning functional structure from fMR images. NeuroImage, 31(4):1601–1613, July 2006.
Dynamic Bayesian networks : modeling and analysis of neural signals Li, Junning 2009
Item Metadata
Title | Dynamic Bayesian networks : modeling and analysis of neural signals |
Creator | Li, Junning |
Publisher | University of British Columbia |
Date Issued | 2009 |
Extent | 2068950 bytes |
Genre | Thesis/Dissertation |
Type | Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-08-31 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0065463 |
URI | http://hdl.handle.net/2429/12618 |
Degree | Doctor of Philosophy - PhD |
Program | Electrical and Computer Engineering |
Affiliation | Applied Science, Faculty of; Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2009-11 |
Campus | UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
AggregatedSourceRepository | DSpace |