Parallel Computation of Large Power System Networks Using the Multi-Area Thévenin Equivalents

by

Marcelo Aroca Tomim

B.Sc., Escola Federal de Engenharia de Itajubá, 2001
M.A.Sc., Universidade Federal de Itajubá, 2004

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Doctor of Philosophy

in

The Faculty of Graduate Studies
(Electrical & Computer Engineering)

The University Of British Columbia
(Vancouver)

July, 2009

© Marcelo Aroca Tomim 2009

Abstract

The requirements of today's power systems are very different from those of the systems of the past. Among others, energy market deregulation, proliferation of independent power producers, unusual power transfers, and the increased complexity and sensitivity of the equipment demand from power system operators and planners a thorough understanding of the dynamic behaviour of such systems in order to ensure a stable and reliable energy supply. In this context, on-line Dynamic Security Assessment (DSA) plays a fundamental role in helping operators to predict the security level of future operating conditions that the system may undergo. Among the tools that make up DSA are the Transient Stability Assessment (TSA) tools, which aim at determining the dynamic stability margins of present and future operating conditions. The systems employed in on-line TSA, however, are very much simplified versions of the actual systems, due to the time-consuming transient stability simulations that are still at the heart of TSA applications. Therefore, there is an increasing need for improved TSA software capable of simulating bigger and more complex systems in less time.

In order to achieve such a goal, a reformulation of the Multi-Area Thévenin Equivalents (MATE) algorithm is proposed. The intent of the algorithm is to parallelize the solution of the large sparse linear systems associated with transient stability simulations and, therefore, to speed up the overall on-line TSA cycle.
As part of the developed work, the matrix-based MATE algorithm was re-evaluated from an electric network standpoint, which yielded the network-based MATE presently introduced. In addition, a performance model of the proposed algorithm is developed, from which a theoretical speedup limit of $p/2$ was deduced, where $p$ is the number of subsystems into which a system is torn apart. Applications of the network-based MATE algorithm to the solution of actual power systems (about 2,000 and 15,000 buses) showed the attained speedup to closely follow the predictions made with the formulated performance model, even on a commodity cluster built out of inexpensive off-the-shelf computers.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgements
Dedication

1 Introduction
  1.1 Motivating Parallel Computing
    1.1.1 Commodity Off-The-Shelf Computer Systems
    1.1.2 Performance Metrics for Parallel Systems
  1.2 Parallel Transient Stability Solution
    1.2.1 Parallel Newton Methods
    1.2.2 Parallel Waveform Relaxation Methods
    1.2.3 Parallel Alternating Methods
  1.3 Parallel Network Solutions
    1.3.1 Fine Grain Schemes
    1.3.2 Coarse Grain Schemes
  1.4 Other Related Work
    1.4.1 Parallel Direct Methods
    1.4.2 Parallel Iterative Methods
  1.5 Thesis Motivation
  1.6 Thesis Contributions
  1.7 Publications

2 Network-based Multi-Area Thévenin Equivalents (MATE)
  2.1 Problem Statement
  2.2 MATE Original Formulation
  2.3 Network-based MATE Formulation
    2.3.1 MATE Algorithm Summary
    2.3.2 Multi-Node Thévenin Equivalents
    2.3.3 Multi-Area Thévenin Equivalents
    2.3.4 Subsystems Update
  2.4 MATE: Original versus Network-based
  2.5 Conclusion

3 Network-based MATE Algorithm Implementation
  3.1 MATE Algorithm Flow Chart
  3.2 MATE Performance Model
    3.2.1 Performance Model Preliminaries
    3.2.2 Computational Aspects of MATE
    3.2.3 Communication Aspects of MATE
    3.2.4 MATE Speedup and Efficiency
    3.2.5 MATE Performance Qualitative Analysis
  3.3 Hardware/Software Benchmarks
    3.3.1 Sparse Linear Solver Benchmark
    3.3.2 Dense Linear Solver Benchmark
    3.3.3 Communication Libraries Benchmark
  3.4 Western Electricity Coordinating Council System
    3.4.1 WECC System Partitioning
    3.4.2 Timings and Performance Predictions for the WECC System
  3.5 Conclusion

4 MATE-based Parallel Transient Stability
  4.1 Transient Stability Problem
    4.1.1 Transient Stability Solution Techniques
    4.1.2 Transient Stability Models
  4.2 Sequential Transient Stability Simulator
  4.3 MATE-based Parallel Transient Stability Simulator
    4.3.1 System Partitioning Stage
    4.3.2 Pre-processing Stage
    4.3.3 Solution Stage
  4.4 Performance Analysis
    4.4.1 South-Southeastern Brazilian Interconnected System Partitioning
    4.4.2 Timings for the SSBI System
  4.5 Conclusion

5 Conclusion
  5.1 Summary of Contributions
  5.2 Future Work
  5.3 Final Remarks

Bibliography

Appendices

A LU Factorization
  A.1 Problem Formulation
  A.2 LU Factorization Process
  A.3 Computational Complexity of LU Factorization and Solution
    A.3.1 Expected Computational Complexity of Sparse LU Factorization

B Transient Stability Solution Techniques
  B.1 Problem Discretization
  B.2 Simultaneous Solution Approach
  B.3 Simultaneous versus Alternating Solution Approach

C Network Microbenchmarks
  C.1 Procedure

List of Tables

1.1 SuperLU_DIST timings on a commodity PC cluster.
3.1 Data fitting summary for the sparse operations.
3.2 Data fitting summary for the dense operations.
3.3 Parameter summary for Ethernet and SCI networks.
3.4 Partitioning of the Western Electricity Coordinating Council system.
4.1 Summary of the SSBI system.
4.2 Statistics of the SSBI system simulation.
4.3 Summary of the sequential transient stability simulation.
4.4 Partitioning of the South-Southeastern Brazilian Interconnected system.
5.1 Timings for the network-based MATE and SuperLU_DIST.
A.1 Operation count for sparse and dense LU factorization.
A.2 Floating-point operation count for sparse and dense LU factorization.
A.3 Complex basic operations requirements in terms of floating-point count.

List of Figures

1.1 Generic COTS computer system
1.2 Sparse matrix and its associated task graph.
1.3 Partitioning techniques.
1.4 SuperLU_DIST two-dimensional block-cyclic partitioning scheme
1.5 Decomposition of the elimination tree adopted within MUMPS
2.1 Generic electric network partitioned into subsystems
2.2 Thévenin equivalent voltages.
2.3 Multi-node Thévenin equivalent impedances construction.
2.4 Multi-node Thévenin equivalent of a generic subsystem
2.5 Link Thévenin equivalent of subsystem $\mathcal{S}_2$ in Figure 2.1 in detailed and compact forms, respectively.
2.6 Multi-area Thévenin equivalent of the system in Figure 2.1 in its detailed and compact versions, respectively.
2.7 Relationships among the various mappings $\mathbf{R}_k$, $\mathbf{Q}_k$ and $\mathbf{P}_k$.
3.1 MATE algorithm flow chart
3.2 MATE timeline for two subsystems, $\mathcal{S}_1$ and $\mathcal{S}_2$, and the link solver.
3.3 Link currents scatter according to PLogP model.
3.4 Multi-node Thévenin equivalents gather according to PLogP model.
3.5 Sparse operations benchmark.
3.6 Dense operations benchmark.
3.7 Sparse operations throughput.
3.8 Dense operations throughput.
3.9 Benchmark results of two MPI implementations over SCI and Gigabit Ethernet networks.
3.10 Sending and receiving overheads and inter-message gaps for MPI over SCI and Gigabit Ethernet network.
3.11 Bandwidth of MPI over SCI and Gigabit Ethernet networks.
3.12 Comparison of Ethernet and SCI network timings for the network-based MATE communications.
3.13 North American Electric Reliability Council (NERC) Regions
3.14 Western Electricity Coordinating Council System admittance matrix
3.15 Link solver and communication penalty factors relative to the WECC system.
3.16 Comparison between MATE timings for multilevel recursive bisection and multilevel k-way partitioning algorithms.
3.17 MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 1 factorization and different partitioning strategies.
3.18 MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 100 factorizations and different partitioning strategies.
3.19 MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 500 factorizations and different partitioning strategies.
3.20 MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 1 factorization and different partitioning strategies.
3.21 MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 100 factorizations and different partitioning strategies.
3.22 MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 500 factorizations and different partitioning strategies.
3.23 MATE performance metrics for the solution of the WECC system for 1000 steps and (a) 1, (b) 100 and (c) 500 factorizations.
4.1 Basic structure of a power system model for transient stability analysis.
4.2 Equivalent π circuit of a transmission line
4.3 Transformer with off-nominal ratio
4.4 Equivalent circuit of the polynomial load model
4.5 Synchronous machine qd reference frame and system reference frame
4.6 Synchronous generator equivalent circuits
4.7 Steady-state equivalent circuit of the synchronous machine.
4.8 Steady-state phasor diagram of the synchronous machine
4.9 Flow chart of a transient stability program based on the partitioned approach.
4.10 MATE-based parallel transient stability program flow chart: Pre-Processing Stage
4.11 MATE-based parallel transient stability program flow chart: Solution Stage
4.12 Contingency statuses check employed in the parallel MATE-based transient stability simulator.
4.13 Network-based MATE parallel linear solver detail.
4.14 Convergence check employed in the parallel MATE-based transient stability simulator.
4.15 Brazilian National Interconnected System
4.16 South-Southeastern Brazilian Interconnected System admittance matrix.
4.17 Link solver and communication penalty factors relative to the SSBI system.
4.18 Timings of the parallel transient stability simulator for different partitioning heuristics.
4.19 Timings of the link solver of the parallel transient stability simulator for different partitioning heuristics.
4.20 Convergence and contingency status check time $T_{CCC}^{p}$ measured and fitted with a $p \log p$ function.
4.21 Performance metrics of the parallel transient stability simulator.
C.1 Average issue time, yielded by Algorithm 1, for zero-byte messages communicated by MPI routines over a Gigabit Ethernet network.

Glossary

Acronyms

BBDF: Block-Bordered Diagonal Form
BLAS: Basic Linear Algebra Subprograms
CG: Conjugate Gradient
CGS: Conjugate Gradient Squared
COTS: Commodity (or Commercial) Off-The-Shelf
CPU: Central Processing Unit
CSMA/CD: Carrier Sense Multiple Access With Collision Detection
DSA: Dynamic Security Assessment
FACTS: Flexible Alternating Current Transmission System
flop: floating-point operation
GPU: Graphics Processing Unit
HVDC: High-Voltage Direct-Current transmission
IP: Internet Protocol
LAPACK: Linear Algebra PACKage
MATE: Multi-Area Thévenin Equivalents
Mflops: $10^6$ flops per second
NIC: Network Interface Card
PC: Personal Computer
PDE: Partial Differential Equations
RAM: Random-Access Memory
RMS: Root Mean Square
SCI: Scalable Coherent Interface
SCTP: Stream Control Transmission Protocol
SMP: Symmetric Multiprocessing
SSBI: South-Southeastern Brazilian Interconnected system
SVC: Static-Var Compensator
TCP: Transmission Control Protocol
TSA: Transient Stability Assessment
VDHN: Very DisHonest Newton
WECC: Western Electricity Coordinating Council system

Mathematical Terms

$\mu(\cdot)$: Arithmetic mean
$\sigma(\cdot)$: Standard deviation
$O(\cdot)$: Indicates the same order of complexity as the function it operates upon
$(\cdot)^{-1}$: Inverse matrix operator
$(\cdot)^{T}$: Transpose matrix operator
$\Im\{\cdot\}$: Imaginary part
$\Re\{\cdot\}$: Real part
$j$: Pure imaginary number, i.e., $\sqrt{-1}$
$(\cdot)^{*}$: Complex conjugate operator
$\bar{E}, \bar{V}, \bar{I}, \bar{Z}$: RMS phasor quantities
$\rho$: The ratio of branches to buses in $\mathcal{S}$
$b_k$: Number of border nodes contained in $\mathcal{B}_k$
$k$: Subsystem index, where $k = 1, \ldots, p$
$l$: Number of global links contained in $\mathcal{L}$
$l_k$: Number of local links contained in $\mathcal{L}_k$
$n$: Number of nodes in the untorn system $\mathcal{S}$
$N_i$: Number of times the current injections associated with the untorn system $\mathcal{S}$ change
$n_k$: Number of internal nodes contained in $\mathcal{N}_k$
$N_z$: Number of times the untorn system $\mathcal{S}$ undergoes topological changes
$p$: Number of subsystems into which the system $\mathcal{S}$ is torn
$\mathcal{B}_k$: Set of the $b_k$ border nodes, which are connected to the subsystem $\mathcal{S}_k \in \mathcal{S}$
$\mathcal{L}_k$: Set of the $l_k$ local links, which are connected to the subsystem $\mathcal{S}_k \in \mathcal{S}$
$\mathcal{L}$: Set of the $l$ global links, which interconnect all subsystems $\mathcal{S}_k \in \mathcal{S}$
$\mathcal{N}_k$: Set of the $n_k$ internal nodes, which belong to the subsystem $\mathcal{S}_k \in \mathcal{S}$
$\mathcal{S}_k$: $k$-th subsystem associated with $\mathcal{S}$, i.e., $\mathcal{S}_k \in \mathcal{S}$
$\mathcal{S}$: Original untorn system
$\mathbf{e}_k^b$: Thévenin voltages vector at the $b_k$ border nodes of $\mathcal{S}_k$, i.e., in $\mathcal{B}_k$
$\mathbf{e}_k^l$: Vector of dimension $l$ with the Thévenin voltages of the subsystem $\mathcal{S}_k$, i.e., in $\mathcal{B}_k$, which are connected to the global links in $\mathcal{L}$
$\mathbf{i}_k^b$: Vector of dimension $b_k$ with current injections into the border nodes in $\mathcal{B}_k$
$\mathbf{v}_k^b$: Vector of dimension $b_k$ with voltage drops across subsystem $\mathcal{S}_k$, or seen from the border nodes in $\mathcal{B}_k$
$\mathbf{v}_k^l$: Vector of dimension $l$ with voltage drops across subsystem $\mathcal{S}_k$, or seen from $\mathcal{B}_k$, due to the global link currents in $\mathcal{L}$
$\mathbf{e}_k$: Thévenin voltages vector at all $n_k$ nodes in $\mathcal{N}_k$
$\mathbf{i}_k$: Current injections vector of dimension $n_k$ due to the local links in $\mathcal{L}_k$
$\mathbf{j}_k$: Internal current injections vector of dimension $n_k$ associated with the system $\mathcal{S}_k$
$\mathbf{v}_k$: Nodal voltages vector of dimension $n_k$ associated with the system $\mathcal{S}_k$
$\mathbf{i}$: Current injections vector of dimension $n$ associated with the system $\mathcal{S}$
$\mathbf{v}$: Nodal voltages vector of dimension $n$ associated with the system $\mathcal{S}$
$\mathbf{e}^l$: Multi-area Thévenin voltages vector of dimension $l$ exciting the global links $\mathcal{L}$
$\mathbf{i}^l$: Global link currents vector of dimension $l$ associated with the global links in $\mathcal{L}$
$\mathbf{Z}_k^b$: Multi-node Thévenin impedance matrix of dimension $b_k \times b_k$ with respect to the border nodes in $\mathcal{B}_k$
$\mathbf{Z}^{l0}$: Primitive impedance matrix of dimension $l \times l$ associated with the global links in $\mathcal{L}$
$\mathbf{Z}_k^l$: Link Thévenin impedance matrix of dimension $l \times l$ associated with the subsystem $\mathcal{S}_k$
$\mathbf{Z}^l$: Multi-area Thévenin impedance matrix of dimension $l \times l$ associated with the global links $\mathcal{L}$
$\mathbf{L}_k$: Lower triangular factor of the admittance matrix $\mathbf{Y}_k$, obtained from the LU factorization
$\mathbf{P}_k$: Link-to-subsystem transformation matrix of dimension $n_k \times l$, which maps $\mathcal{L}$ onto $\mathcal{N}_k$. The subsystem-to-link transformation matrix is given by $(\mathbf{P}_k)^T$
$\mathbf{Q}_k$: Subsystem-to-border transformation matrix of dimension $b_k \times n_k$, which maps $\mathcal{N}_k$ onto $\mathcal{B}_k$. The border-to-subsystem transformation matrix is given by $(\mathbf{Q}_k)^T$
$\mathbf{R}_k$: Link-to-border transformation matrix of dimension $b_k \times l$, which maps $\mathcal{L}$ onto $\mathcal{B}_k$. The border-to-link transformation matrix is given by $(\mathbf{R}_k)^T$
$\mathbf{U}_k$: Upper triangular factor of the admittance matrix $\mathbf{Y}_k$, obtained from the LU factorization
$\mathbf{Y}_k$: Admittance matrix of dimension $n_k \times n_k$ associated with each subsystem $\mathcal{S}_k$
$\mathbf{Y}$: Admittance matrix of dimension $n \times n$ associated with the system $\mathcal{S}$
$E_{MATE}$: Efficiency achieved by the parallel network-based MATE algorithm when solving the system $\mathcal{S}$
$S_{MATE}$: Speedup achieved by the parallel network-based MATE algorithm over the sequential sparse linear solver when solving the system $\mathcal{S}$
$g(m)$: Minimum time, or gap, between consecutive message transmissions or receptions, which also considers all contributing factors, including $o_s(m)$ and $o_r(m)$
$L$: End-to-end latency from process to process, which includes the time for copying the data to and from the network interfaces and transferring the data over the physical network
$o_r(m)$: Period of time in which the processor is engaged in receiving the message of size $m$
$o_s(m)$: Period of time in which the processor is engaged in sending the message of size $m$
$T_C$: Timing associated with the communication overhead incurred by the network-based MATE algorithm
$T_L$: Timing associated with the sequential tasks performed by the process handling the link solver
$T_{MATE}$: Time spent by the parallel network-based MATE solver to solve the original system $\mathcal{S}$
$T_P$: Timing associated with parallel tasks performed by the processes handling subsystems
$T_{SPARSE}$: Time spent by the sequential sparse linear solver to solve the original system $\mathcal{S}$
$T_{1fact}(\mathcal{S}_k)$: Time spent performing the first-time factorization of $\mathbf{Y}_k$
$T_{2fact}(\mathcal{S}_k)$: Time spent performing the same-pattern factorization of $\mathbf{Y}_k$. In this case, the symbolic factorization, required in the first-time factorization, is not repeated
$T_{comm}^{i}(\mathcal{L})$: Time spent scattering the border node current injections $\mathbf{i}_k^b$ to the subsystems $\mathcal{S}_k \in \mathcal{S}$
$T_{comm}^{i}(\mathcal{S}_k)$: Time spent receiving the border node current injections $\mathbf{i}_k^b$ from the link solver
$T_{comm}^{Thv}(\mathcal{L})$: Time spent receiving $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$ from the subsystems $\mathcal{S}_k \in \mathcal{S}$
$T_{comm}^{Thv}(\mathcal{S}_k)$: Time spent sending $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$ to the link solver process
$T_{comp}^{i}(\mathcal{L})$: Time spent setting up and computing the link current equations
$T_{comp}^{Thv}(\mathcal{S}_k)$: Time spent computing the multi-node Thévenin equivalent $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$ for the subsystem $\mathcal{S}_k \in \mathcal{S}$
$T_{comp}^{v}(\mathcal{S}_k)$: Time spent computing the nodal voltages in the subsystem $\mathcal{S}_k$
$T_{comp}^{v}(\mathcal{S}_k)$: Time spent solving the node voltages $\mathbf{v}_k$ for the updated current injections $\mathbf{i}_k$
$T_{fact}(\mathcal{L})$: Time spent performing the dense factorization of $\mathbf{Z}^l$
$T_{idle}^{lnk}(\mathcal{S}_k)$: Time spent waiting for the link solver process to compute the link currents $\mathbf{i}^l$
$T_{idle}^{subs}(\mathcal{L})$: Time spent waiting for the subsystems $\mathcal{S}_k \in \mathcal{S}$ to compute $\mathbf{Z}_k^b$, $\mathbf{e}_k^b$ and $\mathbf{v}_k$
$T_{solv}(\mathcal{L})$: Time spent performing the forward/backward substitutions using the $\mathbf{L}^l$ and $\mathbf{U}^l$ factors associated with $\mathbf{Z}^l$
$T_{solv}(\mathcal{S}_k)$: Time spent performing the forward/backward substitutions using the $\mathbf{L}_k$ and $\mathbf{U}_k$ factors associated with $\mathbf{Y}_k$
$T(\mathcal{L})$: Time spent by the link solver process solving the link system $\mathcal{L}$
$T(\mathcal{S}_k)$: Time spent by the processes solving the subsystem $\mathcal{S}_k$

Acknowledgements

This thesis arose, in part, out of years of research carried out by the University of British Columbia Power Engineering Group, led by Dr. José R. Martí.
It is indeed a great pleasure to be part of one of the most prominent research groups on dynamic simulations of electric power systems. In particular, I would like to thank Dr. José R. Martí for the supervision, advice and guidance throughout the Ph.D. program, which was vital to the conduct of the present work.

Since I started the Ph.D. program, I have had the opportunity to meet a great number of people of the most varied cultural and professional backgrounds, who contributed in assorted ways to my growth as a person and a researcher. To those, I would like to record my most sincere acknowledgements. A special mention should be made to Mazana Armstrong for introducing me to the previous works related to my thesis topic, to Michael Wrinch for his valuable comments and advice in both the scientific and professional realms, and to Tom De Rybel for introducing me to the Linux world and for his ceaseless technical support with the computing cluster.

I would also like to thank the CAPES Foundation (Fundação de Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) of Brazil for funding my Ph.D. program. In addition, I would like to thank the British Columbia Transmission Corporation and Powertech Labs Inc. for extending the funding of this project and providing the data relative to the Western Electricity Coordinating Council system, which was used in the performance analysis presented in this work. I would also like to acknowledge Dr. Lei Wang of Powertech Labs Inc. for his crucial comments, which further motivated me to accomplish this work.

I gratefully thank Dr. Paulo Nepomuceno Garcia of the Federal University of Juiz de Fora (UFJF), Brazil, for further appreciating the results of this work and helping me organize the body of the thesis. I would also like to acknowledge and thank my research committee members, Dr. K. D. Srivastava, Dr. Hermann Dommel and Dr.
Juri Jatskevich for the valuable time they put into reading my thesis and for their equally important feedback, which definitely helped improve the content of this thesis.

Last, but far from least, I would like to thank my family for the unconditional support, patience and love that motivate me to be a better person every day.

Dedication

To Adriana and Nicholas,
You are my life.

Chapter 1

Introduction

Modern electric power systems are often less secure than the systems of the past. This is a result of reduced attention to the transmission infrastructure caused by deregulation, proliferation of independent power producers, unusual power transfers driven by market activities, the use of complex controls and special protection schemes, and a general lack of systemwide oversight regarding reliable planning and operation (Wang & Morison, 2006). In this context, the number of possible types and combinations of energy transactions occurring at any given time may grow enormously. Therefore, today's power systems can no longer be operated in a structured manner (Kundur et al., 2000).

One solution to mitigate this uncertainty is the prediction of future operating conditions, for example through the use of on-line Dynamic Security Assessment (DSA). Such a tool takes a snapshot of the system operating condition, performs a comprehensive security assessment in near-real-time, and provides the operators with warnings of abnormal situations as well as remedial measure recommendations (Wang & Morison, 2006).

One of the most computationally costly tools of on-line DSA is undoubtedly the Transient Stability Assessment (TSA). Many alternatives to tackle this problem have been proposed in the literature (Xue et al., 1993; Mansour et al., 1995; Marceau & Soumare, 1999; Ernst et al., 2001; Kassabalidis, 2002; Juarez T. et al., 2007), including more efficient time-domain solutions, direct stability, pattern recognition and expert systems/neural networks, as well as hybrid methods.
Many of these methods, however, still rely on transient stability time-domain simulation tools, due to the high level of accuracy and flexibility of these tools in terms of complexity of the system models. Hence, speeding up TSA tools can significantly improve the overall performance of on-line DSA.

In the past two decades, dramatic improvements in microprocessor technology have been observed. According to Grama et al. (2003), the average number of cycles per instruction of high-end processors improved by roughly an order of magnitude during the 1990s. However, the speed of a single computer is bounded not only by physical limits, such as the speed of light and the effectiveness of heat dissipation techniques, but also by cost, since the cost of advanced single-processor computers increases more rapidly than their power. As personal computer (PC) performance has increased and prices have fallen steeply, for both PCs and the networks employed to interconnect them, dedicated clusters of PC workstations have become an interesting alternative to traditional supercomputers (Gropp et al., 1999, 2003). This picture tends to shift high-performance computing even further towards clusters of PCs and parallelism, as multi-core processors have recently become the standard.

Many attempts have been made to parallelize transient stability simulations, but only a few have aimed at taking full advantage of commodity PC clusters. Therefore, further research on porting industrial-grade transient stability simulators onto today's parallel computing systems with minimal programming effort and maximum returns in terms of computational throughput is still of significant importance. Furthermore, inexpensive parallel computing systems present themselves as a strong alternative to provide electric utilities' operating centres with the much-needed real-time DSA applications.
In what follows, further reasons that motivate parallelism and some useful performance metrics applied to analyzing parallel programs will be presented. Afterwards, a literature review on parallel algorithms applied to solving the transient stability problem will be presented. Lastly, the motivation of the present work will be introduced, followed by the major achievements resulting from the present research project.

1.1 Motivating Parallel Computing

A brief survey of trends in applications, computer architecture, and networking, presented by Foster (1995), suggests a future in which parallelism plays a vital role not only in the development of supercomputers but also in workstations, personal computers, and networks. In this future, programs will be required to exploit the multiple processors located inside each computer and the additional processors available across a network. Moreover, because most existing algorithms are specialized for a single processor, this situation implies a need for new algorithms and program structures able to perform many operations at once.

As pointed out by Grama et al. (2003), the role of concurrency in accelerating computing elements has been recognized for several decades. In the past, however, when trends in hardware development were not clear and standardized parallel programming interfaces were not yet established, development of parallel software had traditionally been thought of as highly time and effort intensive. Nowadays, many arguments support parallelism, such as the computational power, memory and disk speed, and data communication requirements of today's applications.

On the computational power argument, it is recognized that increased computational performance will not come from faster processors, but from parallel ones. This is because processor clock cycle times, which ultimately limit the time to perform a basic operation, are decreasing at a much slower pace as physical limits such as the speed of light are approached. Therefore, the only way to increase the number of operations that can be performed by a computer architecture is by executing them in parallel (Foster, 1995).

In conjunction with the speed of the processor, the latency and bandwidth of the memory system represent another factor that influences computational performance. Usually, memory access times are higher than the execution times of floating-point operations, which imposes a tremendous computational performance bottleneck. The use of hierarchical and faster memory devices, called caches, reduces the problem, but not entirely. Parallel platforms typically yield better memory system performance because they provide larger aggregate caches and higher aggregate bandwidth to the memory system (Grama et al., 2003).

As problems increase in size and complexity, larger databases are required, which may be infeasible to collect in a central location. In such situations, parallel computing appears as the only solution. In addition, an extra motivation for computational parallelism comes in cases where central data storage is also undesirable for unusual technical and/or security reasons.

1.1.1 Commodity Off-The-Shelf Computer Systems

Advances in computer technology over the past decades have made general-purpose personal computers (PCs) a cost-effective alternative to traditional supercomputers for computing large problems quickly.

In the early 1990s, Dr. Thomas Sterling and Dr. Donald Becker, two engineers at NASA, hypothesized that using inexpensive, commodity off-the-shelf (COTS) computer systems hooked together with high-speed networking (even with speeds as low as 10 Mbit/s Ethernet) could duplicate the power of supercomputers, particularly for applications that could be converted into highly parallelized threads of execution.
They theorized that the price/performance of these COTS systems would more than make up for the overhead of having to send data between the different nodes to have that additional computing done (Gropp et al., 2003). Eventually, this concept became known as Beowulf clusters (Beowulf.org, 2009), whose generic structure is illustrated in Figure 1.1.

Figure 1.1. Generic COTS computer system (http://upload.wikimedia.org/wikipedia/commons/4/40/Beowulf.png).

Another advantage of COTS systems over the early supercomputers is the ease of programming and maintenance. Supercomputers were often hand-made systems that required programming techniques specific to only certain models. Commodity clusters, in contrast, employ common processor architectures, which allow code reuse and, therefore, significantly reduce the development and deployment time of parallel applications.

Regarding the network interconnects, many types are available in the market with different underlying hardware approaches. However, parallel programming standards, such as the Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM), helped abstract programming from the hardware, which further leveraged the success of commodity computing clusters.

Nowadays, low-cost commodity clusters built from individual PCs and interconnected by private low-latency and high-bandwidth networks provide users with unprecedented price/performance and configuration flexibility for high-performance parallel computing.

1.1.2 Performance Metrics for Parallel Systems

When analyzing the performance of parallel programs in order to find the best algorithm, a number of metrics have been used, such as speedup and efficiency.

Speedup

Speedup is a relative measure that quantifies the acceleration delivered by a parallel algorithm with respect to the equivalent best known sequential algorithm.
Mathematically, speedup, given in (1.1), corresponds to the ratio of the time taken to solve a specific problem using a single processor, $T_s$, to the time demanded to solve the same problem in a parallel environment with $p$ identical processors, $T_p$.

$$S = \frac{T_s}{T_p} \tag{1.1}$$

Moreover, the parallel timing $T_p$ can be further split into three smaller components. Even though parallel programs aim at utilizing the $p$ available processors as much as possible, there are always parts of the problem solution that remain sequential, which in turn incur a sequential timing, $T_{ps}$. Also included in the total parallel timing $T_p$ is the overhead timing $T_{po}$, which comprises extra computation timings and interprocessor communication timings. Last but not least, $T_{pw}$ is the actual time spent executing parallel work. The speedup can then be written as

$$S = \frac{T_s}{T_{ps} + T_{pw} + T_{po}} = \frac{1}{f_{ps} + f_{pw} + f_{po}} \tag{1.2}$$

where $f_{pk} = T_{pk}/T_s$, with $k \in \{s, w, o\}$, are the associated quantities normalized with respect to the best sequential timing $T_s$.

In fact, these normalized quantities are usually functions that depend on the problem to be solved, the number of available processors $p$ and characteristics intrinsic to the hardware/software setup. Therefore, finding such relationships requires a thorough understanding of the underlying parallel algorithm and of the costs involved in solving a given problem on a parallel architecture.

For example, assume that a problem of size $N$ can be solved in parallel by any number of processors, which can be, hypothetically, infinite. Consider also that the parallel overhead $f_{po}$ is kept negligible for any number of processors, and that the parallel work fraction $f_{pw}$ diminishes asymptotically with the number of processors, i.e., $f_{pw}$ tends to zero as $p$ heads towards infinity. In such a case, one can observe that the speedup achieved with the parallel solution will never exceed $1/f_{ps}$. Such behavior of parallel programs was first observed by Amdahl (1967), who stated that the sequential portion of the program alone would place an upper limit on the speedup, even if the sequential processing were done in a separate processor.

In an ideal parallel system, however, both the sequential processing required by the parallel algorithm and the overhead times are null, whereas the parallel work can be approximated by the original sequential time $T_s$ split evenly among $p$ processors. In such a case, the time required to solve the problem in parallel would be $T_s/p$, which yields a $p$-fold speedup.

In practice, speedups greater than $p$ (also known as superlinear speedups) can be observed when the work performed by the sequential program is greater than that of its parallel counterpart. Such a phenomenon is commonly observed when the data necessary to solve a problem is too large to fit into the cache of a single processor, but fits into the caches of several units, which can be accessed concurrently (large aggregate cache).

Efficiency

Efficiency is a measure of the fraction of the time for which a processing unit is usefully employed (Grama et al., 2003). Its mathematical definition is given by the ratio of the speedup to the number of processing units used, as given below.

$$E = \frac{S}{p} = \frac{T_s}{p\,T_p} \tag{1.3}$$

From (1.3), it can be readily verified that the efficiency of an ideal parallel system equals unity. In practice, however, communication overhead tends to increase as the number of processors grows, which typically yields efficiencies lower than unity.
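As a small numerical illustration of (1.2) and (1.3), the sketch below evaluates the speedup and efficiency for a hypothetical sequential fraction. The even split of the remaining work over p processors (the helper `even_split`) and the value 0.05 are assumptions chosen for illustration only; they are not taken from any measurement in this work.

```python
def speedup(f_ps, f_pw, f_po=0.0):
    """Speedup from (1.2): S = 1 / (f_ps + f_pw + f_po)."""
    return 1.0 / (f_ps + f_pw + f_po)


def efficiency(s, p):
    """Efficiency from (1.3): E = S / p."""
    return s / p


def even_split(f_ps, p):
    """Assumed model: the non-sequential work (1 - f_ps) is split evenly over p processors."""
    return (1.0 - f_ps) / p


if __name__ == "__main__":
    f_ps = 0.05  # hypothetical sequential fraction, illustration only
    for p in (1, 8, 64, 1024):
        s = speedup(f_ps, even_split(f_ps, p))
        # The last column is the upper bound 1 / f_ps discussed in the text.
        print(f"p={p:5d}  S={s:6.2f}  E={efficiency(s, p):5.3f}  bound={1.0 / f_ps:.1f}")
```

Under these assumptions the speedup approaches, but never reaches, the bound 1/f_ps as p grows, while the efficiency falls steadily.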
1.2 Parallel Transient Stability Solution

Strategies for parallelizing traditional sequential transient stability solutions rely on the structure of the differential-algebraic equations given below:

$$\dot{x} = f(x, v) \tag{1.4a}$$
$$Y v = i(x, v) \tag{1.4b}$$

where $x$ represents a vector of dynamic variables (or state variables), whose first derivatives $\dot{x}$ are defined by a vector function $f$, normally dependent on $x$ and on the vector of nodal voltages $v$. In addition, $i$ represents a vector function that defines the nodal current injections, which also depend on the state variables $x$ and the nodal voltages $v$. Lastly, $Y$ represents the complex-valued nodal admittance matrix of the system under study.

The differential equations (1.4a) are usually associated with dynamic devices connected to the passive network. Usually, there is no direct connection between dynamic devices, and all interactions occur through the network. As a consequence, state variables are usually grouped by device, which keeps each group independent from the others and, hence, makes them suitable for concurrent computations.

As for the network equations (1.4b), parallelization is not as straightforward as for the differential equations. Generally, electric networks are highly sparse with irregular connectivity patterns, which makes parallel solutions for such networks very difficult to find (Tylavsky et al., 1992).

Parallelization of these two major tasks for the solution of a single time step is often referred to as the parallel-in-space approach, because it exploits the mathematical structure of the problem and the system topology. Other strategies adopt a parallel-in-time approach, where multiple time steps of the same transient stability simulation are solved simultaneously. In practice, parallel-in-time and parallel-in-space methods are combined in order to enhance the efficiency of the overall simulation.
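The independence of the device equations in (1.4a) can be illustrated with a toy alternating step: the network equations (1.4b) are solved once, and each device state is then updated on its own. Everything specific in this sketch is an assumption made for illustration: the two-bus real-valued network, the device model f(x, v) = -x + v, the injection i(x) = x, and the explicit Euler update. A production simulator would use implicit integration and complex-valued sparse solvers, and Python threads only demonstrate the independence of the tasks rather than true concurrency.

```python
from concurrent.futures import ThreadPoolExecutor


def solve_2x2(Y, i):
    """Solve the 2x2 network equations Y v = i by Cramer's rule."""
    (a, b), (c, d) = Y
    det = a * d - b * c
    return [(d * i[0] - b * i[1]) / det, (a * i[1] - c * i[0]) / det]


def device_step(args):
    """Explicit Euler update of one device; hypothetical model f(x, v) = -x + v."""
    x, v, h = args
    return x + h * (-x + v)


def alternating_step(x, Y, h):
    """One time step: sequential network solve, then independent device updates."""
    v = solve_2x2(Y, x)  # assumed injection model: i(x) = x
    with ThreadPoolExecutor() as pool:
        # Each device needs only its own state and terminal voltage.
        x_new = list(pool.map(device_step, [(xk, vk, h) for xk, vk in zip(x, v)]))
    return x_new, v


if __name__ == "__main__":
    Y = [[2.0, -1.0], [-1.0, 2.0]]  # hypothetical admittance matrix
    x, v = alternating_step([1.0, 0.0], Y, h=0.1)
    print("voltages:", v, "updated states:", x)
```

The network solve remains a single sequential task in this sketch, which mirrors the bottleneck discussed in the remainder of this section.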
Various parallel algorithms for transient stability computation have been proposed, but few have actually been tested on parallel machines. Moreover, many possible algorithmic combinations that take advantage of parallelization in both space and time are reported in the literature. In what follows, a few of these parallelization techniques applied to the transient stability problem are briefly reviewed.

1.2.1 Parallel Newton Methods

The first parallel-in-space variation of the Newton-Raphson method² was introduced by Hatcher et al. (1977). The approach adopted in that work was to distribute sets of discretized differential equations (1.4a) among different processors, and to solve the network equations (1.4b) by means of two main algorithms: a sequential sparse LU factorization and a parallel Successive Over Relaxation (SOR) method.

Later on, Alvarado (1979) proposed a parallel-in-time approach based on the Newton-Raphson method applied to differential equations discretized by the trapezoidal integration method. Network equations, however, were not considered at that time.

La Scala et al. (1990b, 1991) formalized and extended the mathematical formulation of the previous method, yielding a parallel-in-space-and-time method. The basic idea of this technique was to assign T × p processors to solve multiple time steps, where each group of p processors computes one of the T steps of a predefined time window. More specifically, blocked Gauss-Jacobi (La Scala et al., 1990b) and Gauss-Seidel (La Scala et al., 1991) methods were proposed for the iterations between time steps, while each time step was solved by the parallel-in-space Very DisHonest Newton (VDHN) method. Again, network equations were computed sequentially.

Chai et al. (1991) describe the first reported implementations of a parallel transient stability simulator on two different computing architectures, a distributed-memory and a shared-memory system.
In this study, the parallel VDHN and SOR-Newton methods were also compared. For the tested 662-bus, 91-generator system, the SOR-Newton method presented a speedup of 7.4 on the shared-memory machine and 3.5 on the distributed-memory system when 8 processors were employed. Such metrics make evident that communication-intensive methods, such as SOR-Newton, are more efficient when implemented on shared-memory systems. As for the VDHN method, speedups of 3.9 and 4.6 were reported for the same 8 processors on the shared-memory and distributed-memory systems, respectively. This shows that, on a shared-memory system, the performance of the VDHN method is much lower than that observed for the SOR-Newton method, due to the sequential solution of the network equations. On a distributed-memory system, however, the VDHN method performs slightly better than SOR-Newton due to the reduced communication overhead, and its speedup is not higher only because of the sequential solution of the network equations. In addition, the parallel-in-space VDHN method was used to compute multiple time steps simultaneously, yielding a parallel-in-space-and-time VDHN method. When solving the same 662-bus system with 32 processors on the distributed-memory system, this combined method achieved a speedup of about 9 times in comparison to the sequential VDHN method.

² For details, see Appendix B.

Later, Chai & Bose (1993) summarized practical experiences with parallel Gauss-type, VDHN, SOR-Newton, Maclaurin-Newton and Newton-W matrix methods, where the latter is discussed in Section 1.3.

Parallel implementations of the VDHN method on both a research-purpose and a production-grade transient stability simulator are presented in (Wu et al., 1995). The adopted approach was a step-by-step solution of the entire system, with differential equations and network equations split among several processors on a shared-memory system (Cray).
The parallel network solution was based on the methodology introduced in (Huang & Wing, 1979; Wu & Bose, 1995). As part of the results of the paper, the parallelization of the differential equations is shown to scale almost linearly with the number of processors on such an architecture (about 16 times speedup with 20 processors, for the machine equations computations only). As for the parallel factorization and triangular solutions of the network equations, a strong speedup saturation is observed, with speedups of about 11 times for the factorization and 5 times for the triangular solutions on 20 processors. The overall speedup of the parallel transient stability solution was about 7 times on the same 20 processors.

Decker et al. (1996) introduce a new class of parallel iterative solvers into the Newton-based transient stability solutions, namely Conjugate Gradient (CG) based methods. Both parallel-in-time and parallel-in-space approaches were considered in solving the transient stability problem on a distributed-memory architecture. Moreover, network equations were also parallelized as part of the CG-like iterative solutions. Despite the high level of parallelism provided by such methods, intensive communication overhead along with convergence issues considerably degraded the performance of the algorithm with respect to sequential solutions.

Other parallel implementations of the VDHN method, which employ the same parallel sparse factorization and triangular solution suggested by Huang & Wing (1979), are discussed in (La Scala et al., 1996; Hong & Shen, 2000). According to La Scala et al. (1996), the parallel VDHN method implemented on a shared-memory system was able to achieve roughly 7.7 times speedup on 20 processors when solving a system with 662 buses and 91 generators.
On an 8-processor distributed-memory system, Hong & Shen (2000) report speedups of about 3.6 and 5.5 times when solving systems with 1017 buses (150 generators) and 3021 buses (450 generators), respectively.

Hong & Shen (2000) also apply the Gauss-Seidel method, proposed in (La Scala et al., 1991), on a distributed-memory architecture. Taking the sequential VDHN method as a reference, the parallel-in-time-and-space Gauss-Seidel method achieved 9 and 13 times speedup when solving the aforementioned 1017-bus and 3021-bus systems, respectively, on 32 processors.

1.2.2 Parallel Waveform Relaxation Methods

Waveform relaxation methods (WRM) had been shown to be very effective for the transient analysis of Very-Large-Scale-Integration (VLSI) circuits and were introduced to the transient stability analysis of large power systems by Ilić-Spong et al. (1987). The basic idea of the WRM is to solve for the waveform of one state variable, considering the waveforms of all other state variables fixed at their previous values. The same is then repeated for the other state variables using the updated waveforms until convergence is observed. This method is similar to the Gauss-Jacobi method, but applied to the whole state variable waveform. Although predictions were made with respect to the linear scalability of the method when implemented on a truly parallel computing environment, no comparisons with the best available transient stability solution, based on Newton-like algorithms, were presented.

Improvements on the parallel WRM applied to transient stability analysis are proposed in (Crow & Ilić, 1990), which include aspects related to windowing and the partitioning of the state variables among distinct processors. Results showed that the method is linearly scalable, although the presented simulations did not consider communication overhead.

In another attempt at solving the transient stability problem by means of WRM-like methods, La Scala et al.
(1994, 1996) proposed the Shifted-Picard method, also known as the WRM-Newton method, for parallelizing the transient stability problem. As synthesized in the article, the main idea of the algorithm consists in linearizing the nonlinear differential-algebraic equations (DAEs) about a given initial guess waveform. Then, the guess waveform is updated by solving the linear set of DAEs derived by linearization. The original equations are then re-linearized about the updated guess in an iterative fashion. Experiments with a system with 662 buses and 91 generators on a distributed-memory system, which took a sequential VDHN-based simulator as the reference, show that this algorithm achieved about 7 times speedup with 32 processors, which represents an efficiency of about 22%. For a system with 2583 buses and 511 generators, Aloisio et al. (1997) showed that the performance of the Shifted-Picard method decreases drastically due to intensive interprocessor communication on a distributed-memory architecture (IBM SP2). In such a case, the observed speedup with respect to the serial VDHN algorithm was 0.5 on 8 processors. Moreover, network equations were solved sequentially.

Wang (1998) proposed a parallel-in-time relaxed Newton method as an improvement to the SOR-Newton method. An implementation of the proposed method on a distributed-memory Cray-T3D presented a speedup of about 9.4 times, at 47% efficiency, on 20 processors with respect to a sequential VDHN-based transient stability program.

1.2.3 Parallel Alternating Methods

As observed in Section 1.2, in the alternating solution approach the differential equations can be straightforwardly solved concurrently in a parallel-in-space fashion. The network equations, however, still remain a major problem to be efficiently parallelized, due to their irregular connectivity pattern.

In order to overcome the sequentiality of the network equations solutions, La Scala et al.
(1990a) exploited the in-space parallelism inherent to the SOR method, at the expense of the quadratic convergence proper to Newton-based methods, for the network solutions given in (1.4b).

Another solution strategy, which relies on the network decomposition originally proposed in (Ogbuobiri et al., 1970; Huang & Wing, 1979), was used by Decker et al. (1992, 1996). In this work, the network equations were solved iteratively by a parallel implementation of the Conjugate Gradient (CG) method on a distributed multiprocessor machine. In spite of the high level of parallelism inherent to the CG method, speedups of only up to 7 were observed when 32 processors were employed in the computations.

Also based on network decomposition (Ogbuobiri et al., 1970; Huang & Wing, 1979), Shu et al. (2005) adopted a node-tearing technique (discussed in Section 1.3) for parallelizing the network equations on a cluster of multi-core processors.

1.3 Parallel Network Solutions

The vast majority of problems encountered in engineering demand the solution of large sparse linear systems, expressed mathematically by (1.5).

$$A x = b \tag{1.5}$$

where $x$ represents the vector of unknowns, $b$ is a known vector, usually specifying boundary or contour conditions, and $A$ is a large problem-dependent sparse square matrix.

In power systems engineering, more specifically, such linear systems are often associated with large electric networks, which can span entire continents. As such networks present irregular and sparse connectivity patterns, their associated nodal equations (represented by the matrix $A$ in (1.5)) end up assuming the same topological pattern. Irregular sparsity is due to the fact that, in large power systems, nodes are directly connected to just a few other neighboring nodes in a variety of ways (Sato & Tinney, 1963).

Since Tinney & Walker (1967), sparsity techniques applied to the solution of large sparse linear systems of equations have been heavily studied.
As a consequence, direct sparse LU factorization became the standard tool for solving the sparse linear systems that arise from many power systems problems, such as fault analysis, power flow analysis and transient stability simulations.

Although sparsity techniques have been extremely successful on sequential computers, parallel algorithms that take full advantage of the sparsity of power systems problems are more difficult to develop. The difficulty is mainly due to the aforementioned irregularity of power networks, in addition to the lack of evident parallelism in the required computational operations.

In order to address the problem, a number of parallel algorithms have been proposed in the literature, which can be categorized into two major classes: fine and coarse grain parallelization schemes. Independently of the class, these algorithms often need a preprocessing stage that aims at properly scheduling independent tasks on different processors. The topology of the networks and their factorization paths, described by means of graphs, provide extremely useful information that helps exploit the inherent parallelism in the computations, even though it is not readily evident.

1.3.1 Fine Grain Schemes

In fine grain schemes, parallelism is achieved by dividing the factorization process and the forward/backward substitutions into a sequence of small elementary tasks and scheduling them on several processors.

In (Huang & Wing, 1979; Wing & Huang, 1980; Wu & Bose, 1995), these elementary tasks consist of one or two floating-point operations, which are scheduled based on precedence relationships defined by the system's topology and node ordering. An illustration of the method is given in Figure 1.2. In this example, the list of required operations to eliminate the sparse matrix³ is given, along with its associated task graph.
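The levels of such a task graph can be derived mechanically from the operations themselves. The sketch below is a minimal illustration, not the scheduling algorithm of the cited works: it encodes the 14 elementary operations listed in Figure 1.2 as (task, entry written, entries read) and assigns each task to the level following the latest task that wrote one of its operands. The encoding and the function name `task_levels` are assumptions introduced here.

```python
# (task id, matrix entry written, matrix entries read), following the task list of Figure 1.2.
TASKS = [
    (1, "a14", ["a14", "a11"]),
    (2, "a44", ["a44", "a41", "a14"]),
    (3, "a25", ["a25", "a22"]),
    (4, "a55", ["a55", "a52", "a25"]),
    (5, "a34", ["a34", "a33"]),
    (6, "a36", ["a36", "a33"]),
    (7, "a44", ["a44", "a43", "a34"]),
    (8, "a46", ["a46", "a43", "a36"]),
    (9, "a64", ["a64", "a63", "a34"]),
    (10, "a66", ["a66", "a63", "a36"]),
    (11, "a46", ["a46", "a44"]),
    (12, "a66", ["a66", "a64", "a46"]),
    (13, "a56", ["a56", "a55"]),
    (14, "a66", ["a66", "a65", "a56"]),
]


def task_levels(tasks):
    """Group tasks into levels; tasks within one level have no mutual dependencies.

    Only read-after-write and write-after-write dependencies are tracked, which is
    sufficient here because every task also reads the entry it overwrites.
    """
    last_writer_level = {}  # entry -> level of the most recent task that wrote it
    levels = {}
    for task_id, written, read in tasks:
        level = 1 + max(last_writer_level.get(entry, 0) for entry in read)
        levels.setdefault(level, []).append(task_id)
        last_writer_level[written] = level
    return levels


if __name__ == "__main__":
    for level, ids in sorted(task_levels(TASKS).items()):
        print(f"Level {level}: tasks {ids}")
```

For this example the sketch yields six levels, so at least six sequential stages are needed regardless of how many processors are available.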
The main idea is to take advantage of the independence of the tasks belonging to the same level of the task graph and to perform them in parallel. An optimal scheduling based on the levels of the task graph associated with the optimally ordered system matrix was suggested in (Huang & Wing, 1979). Wu & Bose (1995) extend the previous work and present an implementation of the algorithm on a shared-memory computer. For the parallel LU factorization, results show almost linearly scalable speedups of up to 17 times on 20 processors, i.e., an efficiency of 85%, while for the parallel forward/backward substitutions, speedups strongly saturate at 8 processors and peak at about 5.5 times when 16 processors are used.

Employing a different approach, in (Alvarado et al., 1990; Enns et al., 1990), the available parallelism was exploited from the W matrix defined below.

$$W = L^{-1} = (L_1 L_2 \cdots L_m)^{-1} = L_m^{-1} \cdots L_2^{-1} L_1^{-1}$$

where $A = L D L^T$, $D$ is a diagonal matrix and $L$ is a lower triangular factor matrix of $A$.⁴

Based on the fact that each matrix $L_k^{-1}$, for $k = 1, \ldots, m$, describes one update operation required in the Gaussian elimination, parallelism can be obtained by properly aggregating sets of $L_k^{-1}$, so that $W = W_a W_b \cdots W_z$. Each intermediate multiplication is then performed in parallel. Indirect performance measurements of this methodology, when applied to the transient stability problem on a shared-memory machine, are discussed in (Chai & Bose, 1993). A maximum speedup of about 5 times was observed on 16 processors, and it strongly saturated for higher numbers of processors.

Figure 1.2. Sparse matrix and its associated task graph.

(a) Sparse matrix (× original nonzeros, ◦ fill-ins, · zeros):

row 1:  ×  ·  ·  ×  ·  ·
row 2:  ·  ×  ·  ·  ×  ·
row 3:  ·  ·  ×  ×  ·  ×
row 4:  ×  ·  ×  ×  ·  ◦
row 5:  ·  ×  ·  ·  ×  ×
row 6:  ·  ·  ×  ◦  ×  ×

(b) Task list:

Task 1:  a14 ← a14/a11
Task 2:  a44 ← a44 − a41 a14
Task 3:  a25 ← a25/a22
Task 4:  a55 ← a55 − a52 a25
Task 5:  a34 ← a34/a33
Task 6:  a36 ← a36/a33
Task 7:  a44 ← a44 − a43 a34
Task 8:  a46 ← a46 − a43 a36
Task 9:  a64 ← a64 − a63 a34
Task 10: a66 ← a66 − a63 a36
Task 11: a46 ← a46/a44
Task 12: a66 ← a66 − a64 a46
Task 13: a56 ← a56/a55
Task 14: a66 ← a66 − a65 a56

(c) Task graph levels:

Level 1: tasks 1, 3, 5, 6
Level 2: tasks 2, 4, 8, 9, 10
Level 3: tasks 7, 13
Level 4: task 11
Level 5: task 12
Level 6: task 14

³ Notice that this is a perfect elimination matrix, defined in Section A.3.
⁴ For further details on the LU factorization, see Appendix A.

1.3.2 Coarse Grain Schemes

Schemes of this nature rely on the fact that electric power systems can be partitioned into smaller subsystems, which are in turn interconnected or interfaced by a limited number of branches or nodes. As long as all subsystems are independent from one another, apart from the interconnection or interface system, each subsystem can be solved concurrently on distinct processors. Therefore, algorithms that partition large systems into independent subsystems are at the heart of any coarse grain parallel scheme.

The basic partitioning schemes for electrical networks are based on branch and node tearing algorithms, illustrated in Figure 1.3.

Figure 1.3. Partitioning techniques: (a) branch tearing, a bordered block-diagonal system with diagonal blocks Y1, Y2 and −ZI and coupling blocks P1I, P2I, PI1, PI2; (b) node tearing, a bordered block-diagonal system with diagonal blocks Y1, Y2 and YI and coupling blocks Y1I, Y2I, YI1, YI2.

Branch tearing techniques

In branch tearing methods, subsystems are completely disconnected from each other when the interconnecting branches are removed from the system. This group of branches is also known as the cut-set of the system, and is represented by the currents associated with the interconnection system ZI in Figure 1.3a.

The most well-known example of such a technique is Diakoptics, developed by Gabriel Kron and his followers (Happ, 1970, 1973; Kron, 1953, 1963). As pointed out by Happ (1973, 1974),
the basic idea of diakoptics is to solve a large system by breaking, or tearing, it apart into smaller subsystems; to first solve the individual parts, and then to combine and modify the solutions of the torn parts to yield the solution of the original untorn problem. The combination or modification of the torn solutions so that they apply to the untorn original problem is the crux of the method. It was precisely this task that Kron accomplished through a series of contour transformations and link divisions.

Many authors attempted to clarify the concepts proposed by Kron and further developed by Happ. Among them, Wu (1976) explained the basic idea of diakoptics as merely the partitioning of the branches and of Kirchhoff's laws. He also observed that practical large networks are usually sparsely connected, and thus that sparse matrix techniques, pioneered by Tinney & Walker (1967), could be used to make diakoptics more computationally efficient. Alvarado et al. (1977) also investigated the feasibility of employing diakoptics as a means of solving large networks, and concluded that sparsity techniques are fundamental to improving the performance of the diakoptics-based algorithms proposed at that time.

More recently, the Multi-Area Thévenin Equivalents (MATE) method was proposed by Martí et al. (2002), which extends the idea of diakoptics in terms of interconnected Thévenin equivalents. As summarized by the authors, the MATE algorithm embodies the concepts of multi-node Thévenin equivalents, diakoptics, and the modified nodal analysis of Ho et al. (1975) in a simple and general formulation. Another application of the MATE algorithm is demonstrated by Hollman & Martí (2003), who employed a distributed-memory PC cluster for calculating electromagnetic transients in real time.
Although the work by Hollman & Martí (2003) has its foundations in the MATE algorithm, it further exploits computational parallelism from the time decoupling provided by the transmission lines existing in power systems (Dommel, 1996). This technique completely eliminates the need for combining the subsystems' solutions in order to obtain the solution of the untorn system and, therefore, only requires exchanges of past values of voltages (i.e., known values) between subsystems interconnected by a given transmission line. Timings for the solution of a 234-node/349-branch system show an achieved speedup of about 3.5 times using 5 PCs.

Before the present work, no parallel implementation of any branch tearing based algorithm applied to the solution of large power systems had been found in the literature. Performance predictions of proposed algorithms, however, were presented in (Wu, 1976; Alvarado et al., 1977).

Node tearing techniques

In node tearing methods, subsystems are completely disconnected from each other when the nodes that lie in a common interface (YI) are removed from the system, as depicted in Figure 1.3b.

Node tearing, or network decomposition, was first proposed as a means of reordering sparse matrices in order to reduce fill-ins during the LU factorization, as suggested by Tinney & Walker (1967) and Ogbuobiri et al. (1970). Due to the matrix shape illustrated in Figure 1.3b, network decomposition is also often referred to as the Block-Bordered Diagonal Form (BBDF) method.

At first, parallel applications based on such partitioning techniques were only qualitatively evaluated. Brasch et al. (1979) showed that BBDF-based parallel algorithms would severely suffer from scalability issues, due to the strong speedup saturation predicted for increasing numbers of processors.

Lau et al. (1991) presented a partitioning algorithm based on the computation of precedence relationships embedded in the factorization paths described by Tinney et al. (1985).
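The way a BBDF system separates into independent subsystem tasks plus one small interface solve can be sketched with scalar blocks. The sketch assumes the block layout of Figure 1.3b with 1×1 blocks; the function name and numerical values are hypothetical, and in a real network each block would be a sparse matrix handled by factorization rather than by division.

```python
def solve_bbdf(subsystems, y_i, i_i):
    """Solve a block-bordered diagonal system with scalar blocks.

    Each subsystem is a tuple (Y_k, Y_kI, Y_Ik, i_k). The equations are
        Y_k v_k + Y_kI v_I = i_k            for every subsystem k
        sum_k Y_Ik v_k + Y_I v_I = i_I      for the interface
    """
    # Step 1 (independent per subsystem, hence parallelizable):
    # contribution of each subsystem to the interface equations.
    reduced_y, reduced_i = y_i, i_i
    for y_k, y_ki, y_ik, i_k in subsystems:
        reduced_y -= y_ik * y_ki / y_k
        reduced_i -= y_ik * i_k / y_k

    # Step 2 (sequential): solve the reduced interface system.
    v_i = reduced_i / reduced_y

    # Step 3 (independent per subsystem): back-substitute the interface voltage.
    v = [(i_k - y_ki * v_i) / y_k for y_k, y_ki, _, i_k in subsystems]
    return v, v_i


if __name__ == "__main__":
    # Hypothetical two-subsystem example.
    subs = [(2.0, -1.0, -1.0, 1.0), (4.0, -2.0, -2.0, 2.0)]
    v, v_i = solve_bbdf(subs, y_i=5.0, i_i=3.0)
    print("subsystem voltages:", v, "interface voltage:", v_i)
```

Steps 1 and 3 grow with the size of each subsystem, whereas step 2 grows with the size of the interface, which is one way to see why the interface solve can limit the achievable speedup.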
The partitioned factorization path was then structured according to BBDF and properly scheduled on distinct processors of a distributed-memory system. Results confirmed previous predictions, with speedups strongly saturating for more than 4 processors. For a 662-bus system, the maximum observed speedups were slightly below 2 times on the same 4 processors, which corresponds to an efficiency of about 50%.

Other implementations of parallel BBDF methods can be found in the literature, although their timings are not clearly separated from the global timings of the problem solution, such as in the transient stability problem discussed previously.

1.4 Other Related Work

Parallel computing has been employed in solving scientific and engineering problems for decades. Structural analysis of materials, weather forecasting, simulation of astronomical phenomena, simulation of air flow about an aircraft and calculation of electromagnetic fields around electrical equipment are just a few examples of problems that are extensively tackled by parallel computing due to their computation and data intensiveness.

A computational task shared by all the previous problems is the solution of large sparse linear systems of equations, often extracted from sets of discretized ordinary or partial differential equations. Many parallel algorithms have been proposed to solve such large sparse systems. These can be categorized into parallel direct methods and parallel iterative methods.

1.4.1 Parallel Direct Methods

Parallel direct solvers are based on classical Gaussian elimination combined with partial or total pivoting strategies. Amongst the most well-known parallel direct sparse solvers available are SuperLU_DIST (Li & Demmel, 2002) and MUMPS (Amestoy & Duff, 2000). The algorithms underlying these two solvers represent a vast class of solvers (Amestoy et al., 2001).
Hence, an overview of these two solvers summarizes, to some extent, two of the best algorithms employed in computer science for direct sparse linear solutions.

SuperLU_DIST partitions the matrix in a two-dimensional block-cyclic fashion. The blocks are formed by groups of submatrices and assigned to a two-dimensional process grid of shape r × c, as depicted in Figure 1.4. Each block is defined based on the notion of unsymmetric supernodes, suggested by Demmel et al. (1999). As described in (Li & Demmel, 2002), a supernode is a range of columns of L with the triangular block just below the diagonal being full, and the same nonzero structure elsewhere (either full or zero). The off-diagonal block may be rectangular and need not be full. By the block-cyclic layout, the block (i, j) is mapped onto the process at coordinate ((i − 1) mod r, (j − 1) mod c) of the process grid. During the factorization, the block L(i, j) is only needed by the process row (i − 1) mod r, which restricts the interprocess communication. Similarly, the block U(i, j) is only needed by the processes on the process column (j − 1) mod c. As also reported in (Li & Demmel, 2002), SuperLU_DIST was able to solve matrices of various sizes and patterns, originating from different fields of application, with decreasing timings for up to 256 processors of a Cray T3E-900.

Figure 1.4. SuperLU_DIST two-dimensional block-cyclic partitioning scheme. (Extracted from (Li & Demmel, 2002))

On the other hand, MUMPS (Multifrontal Massively Parallel Solver) aims at solving unsymmetric or symmetric positive definite linear systems of equations on distributed memory computing architectures. There are two main levels of parallelism within MUMPS: tree and node parallelism. As explained by Amestoy & Duff (2000), in the factorization process, data is first assembled at a node combining the Schur complements from the children nodes with data from the original matrix.
The original matrix's data comprises rows and columns corresponding to variables that the analysis forecasts should be eliminated at this node. This data is usually supplied in so-called arrowhead format (i.e., BBDF format), with the matrix ordered according to the permutation from the analysis phase. This data and the contribution blocks from the children nodes are then assembled (or summed) into a frontal matrix using indirect addressing (sometimes called an extended add operation). Since the nodes belonging to the same level of the elimination tree and their children are independent (see Figure 1.5), the previous procedure can be performed concurrently for all nodes at the same level. This explains the tree parallelism. The node parallelism lies in the fact that the Schur complements of each node of the elimination tree are obtained from local blocked update operations, which can be performed by means of parallel implementations of the Level 3 BLAS (Blackford et al., 2002; Goto & Van De Geijn, 2008b).

Figure 1.5. Decomposition of the elimination tree adopted within MUMPS.

According to Amestoy et al. (2001), although the total volume of communication is comparable for both solvers, MUMPS requires many fewer messages, especially with large numbers of processors. The difference found was up to two orders of magnitude. This is partly intrinsic to the algorithms, and partly due to the 1D (MUMPS) versus 2D (SuperLU) matrix partitioning. Furthermore, MUMPS was, for most tested matrices, faster in both the factorization and solve phases. The speed penalty for SuperLU partly comes from the code complexity that is required to preserve the irregular sparsity pattern, and partly from the greater number of communication messages. With more processors, SuperLU showed better scalability, because its 2D partitioning scheme keeps all of the processors busier despite the fact that it introduces more messages.
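The two-dimensional block-cyclic layout used by SuperLU_DIST, referred to above, can be illustrated with a few lines of Python. This is only a sketch of the index mapping quoted from (Li & Demmel, 2002); the function names `owner` and `layout` are hypothetical and are not part of the SuperLU_DIST interface.

```python
def owner(i, j, r, c):
    """Process-grid coordinate owning block (i, j) of the matrix.

    Block indices are 1-based, as in the text; the grid has shape r x c.
    """
    return ((i - 1) % r, (j - 1) % c)


def layout(nblocks, r, c):
    """Owner of every block of an nblocks x nblocks block matrix."""
    return [[owner(i, j, r, c) for j in range(1, nblocks + 1)]
            for i in range(1, nblocks + 1)]


# Example: a 4 x 4 block matrix distributed over a 2 x 3 process grid.
for row in layout(4, 2, 3):
    print(row)
```

Every block in block row i lands on process row (i − 1) mod r, which is why a block L(i, j) only needs to be communicated along a single process row.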
In order to illustrate the performance of these tools on a commodity cluster, a series of timings obtained as part of this thesis with SuperLU_DIST on a cluster of 16 AMD Athlon™ 64 2.5 GHz processors, interconnected by a Dolphin SCI network, is presented in Table 1.1. Both the factorization and the forward and backward substitution (F/B Subst.) procedures were timed for two complex admittance matrices, extracted from the South-Southeastern Brazilian Interconnected (SSBI) system (see Section 4.4) and the North American Western Electricity Coordinating Council (WECC) system (see Section 3.4).

As can be observed, regardless of the number of processors employed in the computations, the parallel timings are always greater than the timing for a single processor. This seemingly contradictory fact is explained by the additional overhead incurred by the interprocessor communications. The timings also show a marginal improvement for the WECC system (14,327 nodes) in comparison to the timings recorded for the SSBI system (1,916 nodes). According to the obtained timings, the parallel factorization is roughly 3 to 25 times slower than its sequential counterpart for the SSBI system, and 3 to 20 times slower for the WECC system, while the parallel F/B substitutions are about 10 to 45 times slower for the SSBI system, and 7 to 30 times slower for the WECC system. It can also be observed that the choice of the process grid topology greatly influences the timings. In this case, the factorization benefits from column-wise partitioning, while row-wise partitioning benefits the F/B substitutions.

Table 1.1. SuperLU_DIST timings on a cluster of 16 AMD Athlon™ 64 2.5 GHz processors built on a single rack and interconnected by a dedicated network.

Process   SSBI (1,916 nodes)            WECC (14,327 nodes)
grid      Fact. [s]   F/B Subst. [s]    Fact. [s]   F/B Subst. [s]
1         0.0030      0.0005            0.0265      0.0047
1×2       0.0100      0.0121            0.0794      0.0916
2×1       0.0274      0.0051            0.2039      0.0427
1×3       0.0105      0.0146            0.0816      0.1063
3×1       0.0457      0.0054            0.3387      0.0404
2×2       0.0307      0.0113            0.2197      0.0829
4×1       0.0170      0.0051            0.3688      0.0383
1×4       0.0355      0.0195            0.0796      0.1419
5×1       0.0264      0.0050            0.5256      0.0348
1×5       0.0578      0.0183            0.0781      0.1305
2×3       0.0186      0.0100            0.2120      0.0714
3×2       0.0299      0.0073            0.3374      0.0516
6×1       0.0768      0.0046            0.5701      0.0316
1×6       0.0098      0.0226            0.0750      0.1591

The explanation for such low performance is twofold. Firstly, algorithms such as the one employed by SuperLU_DIST have massive supercomputers as their target architectures, whose processors communicate by means of specialized low-latency and high-bandwidth network interconnects. Secondly, the dimension of the problems aimed at by such tools is much bigger than that of power systems related problems. One can understand bigger problems as those for which the computational burden per processor is much greater than the communication overhead due to the parallelization. In this sense, the foregoing power systems admittance matrices (∼15,000), often considered big in the power systems field, turn out to be small in comparison to problems often encountered in the computer science field, for which tools such as MUMPS and SuperLU_DIST are developed.

1.4.2 Parallel Iterative Methods

According to Saad & Vorst (2000), the usage of iterative methods was originally restricted to systems related to elliptic partial differential equations, discretized with finite difference techniques [footnote 5]. For other problems, for instance those related to finite element modeling and large electric and electronic circuit calculations, direct solution techniques were preferred, because of the lack of robustness of iterative solvers for a large number of classes of matrices.
Until the end of the 1980s, almost none of the commercial packages for finite element problems included iterative solution techniques. However, for many PDE-related problems, the complexity of the elimination process increases too much to make realistic 3D modeling feasible. For instance, irregularly structured finite element problems of order one million may be solved by direct methods, given a large enough computer (memory-wise), but at a tremendous cost and difficulty. However, some of these problems can be solved with fewer resources and more efficiently by means of iterative solution techniques, if an adequate preconditioner can be constructed.

As reported by Pai & Dag (1997), iterative solution techniques applied to both static and dynamic simulation problems, typical of large scale power systems, still cannot compete with direct methods because of possible convergence problems. This statement agrees with Saad & Vorst (2000), who reported that large electric and electronic circuits are not easy to solve in an efficient and reliable manner by iterative methods.

The performance of direct methods, both for dense and sparse systems, is largely bound by the factorization of the matrix. This operation is absent in iterative methods, which also lack dense matrix suboperations. Since such operations can be executed at very high efficiency on most current computer architectures, a lower flop rate is expected for iterative than for direct methods. Furthermore, the basic operations in iterative methods often use indirect addressing, which depends on the data structure of the matrix. Such operations also have a relatively low efficiency of execution. On the other hand, iterative methods are usually simpler to implement than direct methods and, since no full factorization has to be stored, they can handle much larger systems than direct methods (Barrett et al., 1994).
Vorst & Chan (1997) compare an iterative with a direct method for a sparse system related to a finite element discretization of a second order PDE over an irregular grid, where $n$ is the order of the problem. It is shown that the cost of a direct method varies with $n^{3/2}$ for a 2D grid and with $n^{5/3}$ for a 3D grid. Comparing then the flop counts for the direct method with those for the Conjugate Gradient (CG) method, it was concluded that the CG method might be preferable in case different systems have to be solved each time with large $n$. Furthermore, in case many systems with several different right-hand sides have to be solved, it seems likely that direct methods will be more efficient, for 2D problems, as long as the number of right-hand sides is so large that the cost of constructing the LU factorization is relatively small. For 3D systems, however, it is concluded that such a choice is unlikely, because the flop counts for the two triangular solves associated with a direct solution method are proportional to $n^{5/3}$, whereas the number of flops for the iterative solver varied according to $n^{4/3}$.

[Footnote 5] Such systems originate from oil reservoir engineering, weather forecasting, electronic device modeling, etc.

Following the argument presented in the previous section on parallel direct methods, it can be concluded that iterative solvers are recommended for solving even bigger problems than the ones tackled by parallel direct solvers, mainly because of memory constraints. For instance, as reported in (Li & Demmel, 2002), SuperLU_DIST has played a critical role in the solution of a long-standing problem of scattering in a quantum system of three charged particles. This problem requires solving a complex, nonsymmetric and very ill-conditioned sparse linear system, whose dimension reached 8 million. The task performed by SuperLU_DIST was building the block diagonal preconditioners for the Conjugate Gradient Squared (CGS) iterative solver.
For a block of size 1 million, SuperLU_DIST took 1209 seconds to factorize using 64 processors of an IBM SP and 26 seconds to perform triangular solutions. The total execution time of the problem was about 1 hour. This scientific breakthrough result was reported in a cover article of Science magazine (Rescigno et al., 1999).

1.5 Thesis Motivation

Based on the methodologies and experiences reported in the literature overviewed in the previous sections, one can readily identify the parallel solution of power networks as one of the major bottlenecks in achieving fully parallel transient stability simulations. Although many have addressed the problem of parallelizing the solution of large sparse linear systems, no standard methodology has yet been agreed upon. This can be partially explained by the fact that some algorithms are more suitable for shared-memory systems than for distributed-memory ones, and vice-versa.

From the hardware standpoint, inexpensive commodity clusters, also known as Beowulf clusters (Gropp et al., 2003; Beowulf.org, 2009), provide a technically and economically viable alternative to the expensive vector-based supercomputers. Nowadays, high performance computing clusters can be easily built with off-the-shelf symmetric multiprocessing (SMP) machines interconnected through high-speed and high-bandwidth network cards. Moreover, due to limitations on increasing the number of cores on a single chip, the cluster setup is expected to become the standard in the computer industry.

Bearing in mind the aforementioned high performance computing architectures, the most successful parallel algorithms will likely combine coarse and fine grain approaches. In this sense, by means of coarse grain approaches, one can assign tasks to SMP nodes, which in turn can extract further parallelism by adopting fine grain approaches.

From a software point of view, much effort has been put into developing direct sparse linear solvers.
In this way, algorithms that require a complete redesign and re-implementation of existing sparse solvers are bound to be treated with a high level of scepticism, mainly for economic reasons. Such an undertaking would certainly require re-educating developers in the use of new algorithms, which in turn would require time from developers for programming and testing the new routines. The reusability of available and highly-tested sparse routines therefore becomes significant in reducing the costs and time of deployment of newly developed parallel tools.

In this context, the Multi-Area Thévenin Equivalents (MATE) approach, proposed by Martí et al. (2002) and extended by Armstrong et al. (2006), has the potential to provide both the coarse and fine granularities required by SMP clusters. Moreover, the MATE approach also allows one to employ ready-to-use sparse routines, as will be shown in this thesis. As a consequence, with a minimum programming effort, a performance boost is expected in the solution of large sparse systems, which are at the heart of essential power system analysis tools, such as on-line transient stability assessment.

1.6 Thesis Contributions

The main contributions of this thesis are:
• Introduction of the network-based MATE algorithm, which further optimizes the original matrix-based MATE algorithm formulation (Martí et al., 2002) in terms of computation and communication overhead;
• Implementation of the network-based MATE algorithm on a commodity cluster, built with single-core nodes interfaced by a dedicated high-speed network, employing ready-to-use sparsity libraries;
• Development of a performance model for the network-based MATE algorithm, which enabled the establishment of a theoretical speedup limit for the method with respect to traditional sequential sparsity-oriented linear solvers;
• Application of the parallel network-based MATE algorithm to the solution of the network equations associated with transient stability simulations.
1.7 Publications

• Tomim, M. A., De Rybel, T., & Martí, J. R. (2009). Multi-Area Thévenin Equivalents Method Applied To Large Power Systems Parallel Computations. IEEE Transactions on Power Systems (submitted).
• Tomim, M. A., Martí, J. R., & Wang, L. (2009). Parallel solution of large power system networks using the Multi-Area Thévenin Equivalents (MATE) algorithm. International Journal of Electrical Power & Energy Systems, In Press.
• Tomim, M. A., Martí, J. R., & Wang, L. (2008). Parallel computation of large power system network solutions using the Multi-Area Thévenin Equivalents (MATE) algorithm. In 16th Power Systems Computation Conference, PSCC2008, Glasgow, Scotland.
• De Rybel, T., Tomim, M. A., Singh, A., & Martí, J. R. (2008). An introduction to open-source linear algebra tools and parallelisation for power system applications. In Electrical Power & Energy Conference, Vancouver, Canada.

Chapter 2
Network-based Multi-Area Thévenin Equivalents (MATE)

The Multi-Area Thévenin Equivalents (MATE) method was proposed by Martí et al. (2002) and extends the idea of diakoptics in terms of interconnected Thévenin equivalents. As summarized by the authors, the MATE algorithm embodies the concepts of multi-node Thévenin equivalents, diakoptics, and the modified nodal analysis by Ho et al. (1975), in a simple and general formulation.

Although the possibility of parallel computations at the subsystems level was conceived in (Martí et al., 2002; Armstrong et al., 2006), no implementation of the MATE algorithm in a distributed computer architecture had been realized until recently. This fact can be attributed to the only recent availability of economically viable commodity clusters, built from off-the-shelf personal computers interconnected by dedicated local networks, and of multicore processors. Tomim et al.
(2008, 2009) present an evaluation of the MATE algorithm applicability to the solution of nodal equations of large electric power systems in a distributed computer architecture. The MATE algorithm was implemented using ready-to-use highly optimized sparse and dense matrix routines in a 16-computer cluster interconnected by a dedicated network.

In this chapter, a formulation of the MATE algorithm from an electrical network perspective will be set forth. In comparison to the algorithm proposed by Martí et al. (2002), this network-based approach adds novel concepts to the algorithm, and further optimizes the amount of operations, memory usage, and data exchange among parallel processes during the actual solution. However, before introducing the network-based MATE algorithm, the original problem will be restated along with the original MATE formulation. Lastly, the matrix and network-oriented approaches for formulating the MATE algorithm will be qualitatively and quantitatively compared.

2.1 Problem Statement

In order to solve an electric network, one needs to compute the voltages at all its nodes due to the excitation sources to which the system is subjected. A widely used technique for solving power systems is the well-known nodal approach (Grainger & Stevenson, 1994), where the system is modeled as an admittance matrix $Y$ and all excitation sources are converted into current injections, which populate the vector $i$.

$$ Y v = i \tag{2.1} $$

Then, taking into consideration the relationship between the nodal voltages $v$ and the current injections $i$, given by the linear system of equations (2.1), one can finally compute $v$ through Gaussian elimination, for instance.
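As a minimal sketch of the nodal approach just described, the snippet below assembles the admittance matrix of a small network and solves (2.1) by Gaussian elimination with back substitution. The three-node network, its real-valued branch and shunt admittances, and the helper names `build_admittance` and `gauss_solve` are illustrative assumptions only; actual power system admittances are complex and the matrices are stored in sparse form.

```python
from fractions import Fraction


def build_admittance(n, branches, shunts):
    """Assemble Y from (node_a, node_b, admittance) branches and
    (node, admittance) shunts to the reference node (0-based indices)."""
    Y = [[Fraction(0)] * n for _ in range(n)]
    for a, b, y in branches:
        Y[a][a] += y
        Y[b][b] += y
        Y[a][b] -= y
        Y[b][a] -= y
    for a, y in shunts:
        Y[a][a] += y
    return Y


def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [Fraction(0)] * n
    for r in reversed(range(n)):          # back substitution
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


# Hypothetical network: branches 0-1 (y=2) and 1-2 (y=1), shunts at nodes 0 and 2.
Y = build_admittance(3, [(0, 1, 2), (1, 2, 1)], [(0, 1), (2, 1)])
i = [1, 0, 0]                             # unit current injected at node 0
v = gauss_solve(Y, i)
print(v)
```

Exact rational arithmetic is used here only to keep the illustration free of round-off; a production solver would rely on floating-point sparse LU routines.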
A second approach, the one analyzed presently in greater detail, consists of tearing the system apart into smaller subsystems, solving the individual parts, and subsequently combining and modifying the solutions of the torn parts to yield the solution of the original untorn problem.

In this case, assume the system $S$ is subdivided into $p$ disconnected areas, or subsystems, such that $S = \{S_1, S_2, \ldots, S_p\}$, and that these subsystems are interconnected by a set of global links $L = \{\alpha_1, \alpha_2, \ldots, \alpha_l\}$, with $l$ elements. The nodes that form the interface between the subsystem $S_k$, with $k = 1, \ldots, p$, and the interconnection system $L$ will be called border nodes and form the set $B_k$ with $b_k$ nodes. Finally, the links connected to the subsystem $S_k$ are called local links and form the set $L_k$ with $l_k$ links.

For illustrative purposes, an example of such a system is depicted in Figure 2.1. In this example, there are three disconnected areas, which makes $p = 3$ and, consequently, $S = \{S_1, S_2, S_3\}$. In addition, four global links interconnect the three subsystems, making $l = 4$ and $L = \{\alpha_1, \alpha_2, \alpha_3, \alpha_4\}$. As for the local quantities, subsystem $S_1$ has $b_1 = 3$ and $l_1 = 3$; subsystem $S_2$ has $b_2 = 2$ and $l_2 = 2$; and subsystem $S_3$ has $b_3 = 2$ and $l_3 = 3$.

Figure 2.1. Generic electric network partitioned into three subsystems interconnected by four tie lines.

Consider, now, that each subsystem $S_k \in S$ is modeled by the nodal equations given in (2.2), where $Y_k$ is the admittance matrix of the isolated subsystem $S_k$, $v_k$ is a vector with nodal voltages, $j_k$ a vector with internal injections and $i_k$ a vector with currents injected by the local links $L_k$. If the subsystem $S_k$ has $n_k$ local buses, $Y_k$ is a square and, usually, sparse matrix of dimension $n_k$, whereas $v_k$, $i_k$ and $j_k$ are vectors of order $n_k$. Additionally, the set of nodes in the subsystem $S_k$ is defined as $N_k$.

$$ Y_k v_k = i_k + j_k \tag{2.2} $$

Usually, during power systems computations, e.g., dynamic simulations, the subsystem matrices $Y_k$ and their associated local injections $j_k$ are known, while the local nodal voltages $v_k$ and local link injections $i_k$ are unknown. Notice, however, that if the link contributions $i_k$ were known, $v_k$ would be straightforward to compute. Therefore, in the MATE context, the link currents $i_k$, which depend on how the subsystems are connected to one another, need to be computed first and then submitted to their corresponding subsystems so that the local voltages $v_k$ can be obtained.

2.2 MATE Original Formulation

Following the same reasoning presented in (Martí et al., 2002), the three-area electric system depicted in Figure 2.1 can be mathematically represented by the hybrid modified nodal equations given in (2.3). In other words, the latter model comprises nodal equations for the subsystems $S_1$, $S_2$ and $S_3$, and branch equations for the links $\alpha_1$, $\alpha_2$, $\alpha_3$ and $\alpha_4$.

$$
\begin{bmatrix}
Y_1 & & & P_1 \\
 & Y_2 & & P_2 \\
 & & Y_3 & P_3 \\
(P_1)^T & (P_2)^T & (P_3)^T & -Z^l_0
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ i^l \end{bmatrix}
=
\begin{bmatrix} j_1 \\ j_2 \\ j_3 \\ 0 \end{bmatrix}
\tag{2.3}
$$

The matrices $P_k$, defined in (2.4), capture which nodes in subsystem $S_k$ are connected to links, and whether each link is injecting current into, or drawing current from, that specific node.

$$
P_k(i, j) =
\begin{cases}
\phantom{-}1 & \text{if link } j \text{ injects current into bus } i, \\
-1 & \text{if link } j \text{ draws current from bus } i, \\
\phantom{-}0 & \text{otherwise,}
\end{cases}
\tag{2.4}
$$

where $i$ refers to buses in the subsystem $S_k$ and $j$ to global links.

In order to solve (2.3), one can apply Gaussian elimination to the row corresponding to the link currents $i^l$, which yields the new set of equations given in (2.5).

$$
\begin{bmatrix}
Y_1 & & & P_1 \\
 & Y_2 & & P_2 \\
 & & Y_3 & P_3 \\
 & & & Z^l
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ i^l \end{bmatrix}
=
\begin{bmatrix} j_1 \\ j_2 \\ j_3 \\ e^l \end{bmatrix}
\tag{2.5}
$$

Here, $Z^l$ and $e^l$ are defined below, with $k = 1, 2, 3$.
$$ Z^l = Z^l_0 + \sum_{\forall k} Z^l_k, \qquad Z^l_k = (P_k)^T (Y_k)^{-1} P_k \tag{2.6} $$

$$ e^l = \sum_{\forall k} e^l_k, \qquad e^l_k = (P_k)^T (Y_k)^{-1} j_k \tag{2.7} $$

In the transformed system of equations (2.5), each pair $Z^l_k$ and $e^l_k$ represents the Thévenin equivalent of the subsystem $S_k$ with respect to the links $\alpha_i$, with $i = 1, \ldots, 4$, whereas $Z^l$ and $e^l$ represent the reduced version of the original system from the links' perspective. Ultimately, the original system can be solved by solving the link currents system (at the bottom of (2.5)) and then appropriately injecting the currents $i^l$ into the subsystems $S_k$, with $k = 1, 2, 3$.

As pointed out by Martí et al. (2002), the main advantage of the MATE algorithm lies in the fact that subsystems can be computed independently from each other, apart from the links system of equations. This fact, in turn, leads to other possibilities: subsystems can be solved concurrently at different rates, with different integration methods, or even in different domains (such as the time and quasi-stationary phasor domains). The possibility of parallel computations at the subsystems level was conceived in (Martí et al., 2002; Armstrong et al., 2006), but only implemented on a distributed computer architecture, for solving nodal equations of large electric power systems, in (Tomim et al., 2008).

Recent investigations of the MATE algorithm, however, indicated that the set of equations (2.5), (2.6) and (2.7) hides a few interesting features that help reduce the amount of operations performed in the subsystems and the data exchange between the link solver and the subsystems. Therefore, in the subsequent sections, the MATE algorithm will come under close scrutiny in order to make such features evident.
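The reduction expressed by (2.5) to (2.7) can be checked numerically on a toy case. The sketch below is an illustration under stated assumptions, not the thesis implementation: it uses two hypothetical two-node subsystems joined by a single link, real-valued data, exact rational arithmetic, and small dense helpers (`solve`, `matmul`, `transpose`, `add`, `sub`) written only for this example. The signs follow (2.3) exactly as written above.

```python
from fractions import Fraction


def solve(A, B):
    """Solve A X = B (A is n x n, B is n x m) by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(x) for x in ra] + [Fraction(x) for x in rb]
         for ra, rb in zip(A, B)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        piv = M[c][c]
        M[c] = [x / piv for x in M[c]]
        for r in range(n):
            if r != c and M[r][c] != 0:
                f = M[r][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [row[n:] for row in M]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


# Hypothetical data: two subsystems, one link with impedance Zl0.
Y = [[[3, -1], [-1, 2]], [[2, -1], [-1, 3]]]
j = [[[1], [0]], [[0], [2]]]
P = [[[0], [-1]], [[1], [0]]]     # link draws from S1 node 2, injects into S2 node 1
Zl0 = [[1]]

# Equations (2.6) and (2.7): accumulate each subsystem's Thevenin contribution.
Zl, el = Zl0, [[0]]
for Yk, jk, Pk in zip(Y, j, P):
    Pt = transpose(Pk)
    Zl = add(Zl, matmul(Pt, solve(Yk, Pk)))
    el = add(el, matmul(Pt, solve(Yk, jk)))

il = solve(Zl, el)                                   # bottom row of (2.5)
v = [solve(Yk, sub(jk, matmul(Pk, il)))              # top rows of (2.5)
     for Yk, jk, Pk in zip(Y, j, P)]

# Reference: direct solution of the untorn hybrid system (2.3).
full = [[3, -1, 0, 0, 0],
        [-1, 2, 0, 0, -1],
        [0, 0, 2, -1, 1],
        [0, 0, -1, 3, 0],
        [0, -1, 1, 0, -1]]
x = solve(full, [[1], [0], [0], [2], [0]])
print(x == v[0] + v[1] + il)
```

Each pass of the loop touches only one subsystem's data, which is the property that allows the subsystem computations to be distributed; only `Zl`, `el` and `il` involve information from all subsystems.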
2.3 Network-based MATE Formulation

Even though the set of equations given in (2.5) is sufficient to describe the algorithm under study, it hides a few interesting features that help reduce the number of operations performed in the subsystems and the amount of data exchanged between the link solver and the subsystems. Therefore, in the next sections, the MATE algorithm will be restated from an electric network viewpoint in order to make these features evident.

2.3.1 MATE Algorithm Summary

For solving the problem stated in Section 2.1, one could start by extracting the links from the original system $S$ and finding a multi-node Thévenin equivalent with respect to the border nodes, as suggested in (Dommel, 1996). Since the subsystems are uncoupled a priori, the Thévenin equivalents computation can also be done for each subsystem $S_k$ separately.

The multi-node Thévenin equivalent of each subsystem, as seen from its border nodes, can be constructed in two basic steps:
(a) Find the voltages at the border nodes due to the internal current and voltage sources;
(b) Find the self and mutual impedances seen from the border nodes (short-circuiting internal voltage sources and opening internal current sources).

As a result of the previous steps, reduced-order networks, completely isolated from each other, will be generated for each subsystem $S_k$. These equivalent networks, along with the interconnection system, will form a reduced-order version of the original system, as seen from the links. The currents flowing in each link $\alpha_i$, with $i = 1, \ldots, l$, can then be computed employing the newly constructed equivalent system.

Once the currents flowing in the links are known, each vector $i_k$ can be formed according to the link connections to the subsystems. Lastly, combining the $i_k$ vectors with the internal voltage and current sources represented by $j_k$ enables the computation of each subsystem's nodal voltages $v_k$ using (2.2).

In the next sections, each MATE algorithm building block will be explained in more detail.

2.3.2 Multi-Node Thévenin Equivalents

The first step in the MATE algorithm is constructing multi-node Thévenin equivalents for all subsystems $S_k \in S$ with respect to their border node sets $B_k$. Such equivalent circuits fully describe the voltage-current characteristic of each subsystem at its border nodes, and will provide the fundamental structures for solving the link currents at a later stage. In addition, as an aid to the construction of the multi-node Thévenin equivalents, a subsystem-to-border mapping will also be introduced.

Subsystem-to-Border Mapping

Given a specific subsystem $S_k \in S$, it will be necessary to map its $b_k$ border node quantities onto its $n_k$ local bus quantities, i.e., map quantities defined in $B_k$ onto $N_k$. One such situation occurs when one needs to express current injections defined in the border node set $B_k$ in terms of the local node set $N_k$. The inverse mapping, i.e., the mapping of $N_k$ onto $B_k$, is equally important, since it allows one to gather border node voltages from local nodal voltages.

Formally, the mapping of $N_k$ onto $B_k$ can be seen as a $b_k \times n_k$ matrix $Q_k$, defined in (2.8). The matrix $Q_k$ is filled with zeros, except for $b_k$ ones, which indicate that a specific local node $j$ (related to the columns of $Q_k$) corresponds to the border node $i$ (related to the rows of $Q_k$).

$$
Q_k(i, j) =
\begin{cases}
0 & \text{if } B_k(i) \neq N_k(j), \\
1 & \text{if } B_k(i) = N_k(j),
\end{cases}
\tag{2.8}
$$

where $i = 1, \ldots, b_k$ and $j = 1, \ldots, n_k$.

As for the inverse direction of the mapping, it is achieved by $(Q_k)^T$. This inverse mapping simply scatters the vectors defined in the border node set $B_k$ so that they are expressed in terms of the local node set $N_k$. Although the vectors produced by $(Q_k)^T$ have $n_k$ elements, only $b_k$ of them can be non-zero, namely those corresponding to the border nodes.
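The subsystem-to-border mapping of (2.8), together with the gathering and scattering it provides, can be sketched as follows. The node labels, numerical values and function names (`border_mapping`, `gather`, `scatter`) are hypothetical and serve only to illustrate the definition; in practice $Q_k$ would be kept as an index list rather than as an explicit matrix.

```python
def border_mapping(border, nodes):
    """Q[i][j] = 1 if border node i is local node j, and 0 otherwise (eq. 2.8)."""
    return [[1 if b == n else 0 for n in nodes] for b in border]


def gather(Q, v):
    """Border quantities from local ones: Q v."""
    return [sum(q * x for q, x in zip(row, v)) for row in Q]


def scatter(Q, ib):
    """Local quantities from border ones: Q^T ib."""
    return [sum(Q[i][j] * ib[i] for i in range(len(Q)))
            for j in range(len(Q[0]))]


nodes = ["n1", "n2", "n3", "n4", "n5"]   # hypothetical local node set N_k
border = ["n4", "n2"]                     # hypothetical border node set B_k
Q = border_mapping(border, nodes)

v = [1.0, 0.98, 1.02, 0.97, 1.01]         # local nodal voltages
vb = gather(Q, v)                         # voltages at the border nodes
i = scatter(Q, [0.5, -0.2])               # border injections placed in an n_k vector
print(Q, vb, i)
```

Note that the ordering of the border set is independent of the ordering of the local nodes, which is why the ones in $Q_k$ need not lie on a diagonal.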
Thus, if the vector $v_k$ contains voltages associated with the set of local nodes $N_k$, gathering the border node voltages $v^b_k$ from $v_k$ can be summarized as the matrix-vector multiplication expressed by (2.9).

$$ v^b_k = Q_k v_k \tag{2.9} $$

On the other hand, consider now a vector $i^b_k$ that contains current injections associated with the set of border nodes $B_k$. If one wants to scatter the injections $i^b_k$ into another vector, say $i_k$, defined in $N_k$, one only needs the matrix-vector multiplication expressed by (2.10).

$$ i_k = (Q_k)^T i^b_k \tag{2.10} $$

For example, for the subsystem $S_1$ of the system depicted in Figure 2.1, the matrix $Q_1$ is shown in (2.11), assuming that the set of border nodes $B_1$ contains $b_1 = 3$ nodes. The rows of $Q_1$ correspond to the border nodes and the columns to the system nodes; the column labels indicate which system nodes are border nodes 1, 3 and 2.

$$
Q_1 =
\begin{array}{c|cccccc}
 & 1 & \cdots & 3 & \cdots & 2 & \cdots \\ \hline
1 & 1 & \cdots & \cdots & \cdots & \cdots & \cdots \\
2 & \cdots & \cdots & \cdots & \cdots & 1 & \cdots \\
3 & \cdots & \cdots & 1 & \cdots & \cdots & \cdots
\end{array}
\tag{2.11}
$$

Multi-Node Thévenin Voltage

Multi-node Thévenin voltages are the nodal voltages detected at the border nodes of a specific subsystem, due to internal voltage and current sources only. In such a situation, all links connected to the subsystem must be open and all subsystem internal sources active during the computation of the border nodal voltages. Figure 2.2 illustrates the concept of multi-node Thévenin voltages when applied to subsystem $S_1$, depicted in Figure 2.1.

Figure 2.2. Thévenin equivalent voltages.

Mathematically, the computation of the multi-node Thévenin voltages can be summarized by the two-step procedure shown in (2.12).

$$ Y_k e_k = j_k \tag{2.12a} $$
$$ e^b_k = Q_k e_k \tag{2.12b} $$

First, by solving the linear system (2.12a), the new voltage vector $e_k$ is obtained, which contains the Thévenin voltages at all nodes of the subsystem $S_k$ described by $Y_k$.
However, since only the Thévenin voltages at the border nodes are needed, one needs to gather the corresponding voltages in $\mathbf{e}_k$ and store them in $\mathbf{e}_k^b$ by means of the subsystem-to-border mapping matrix $\mathbf{Q}_k$, according to (2.12b).

Multi-Node Thévenin Impedance

Multi-node Thévenin impedances are a set of self and mutual impedances seen from the border nodes of a specific subsystem $S_k$ when all its internal voltage and current sources are turned off, i.e., the voltage sources become short circuits and the current sources open circuits. For compactness, they can also be represented as a $b_k \times b_k$ matrix $\mathbf{Z}_k^b$, which relates the currents $\mathbf{i}_k^b$ injected at the border nodes to the border node voltages $\mathbf{v}_k^b$.

\[ \mathbf{v}_k^b = \mathbf{Z}_k^b \mathbf{i}_k^b \tag{2.13} \]

In the general case, the matrix $\mathbf{Z}_k^b$ can be obtained column by column, by injecting a unit current into the corresponding border node and measuring the voltages at all border nodes. Thus, bearing in mind that each subsystem $S_k$ is represented by its nodal equations, the voltages at the border nodes can be gathered from repeated solutions of (2.1) for individual current injections at each border node. In turn, each unit current injection can be represented by a single column vector $\mathbf{u}_k^{b_i}$ of size $b_k$, which has a 1 at position $i$, associated with the border node $\mathcal{B}_k(i)$, while all other components are zero.

Now, taking into consideration that $\mathbf{Y}_k$ is related to the local nodes in $\mathcal{N}_k$, one first needs to map $\mathbf{u}_k^{b_i}$ onto $\mathcal{N}_k$, using the border-to-subsystem mapping $(\mathbf{Q}_k)^T$. So, collecting all the vectors $\mathbf{u}_k^{b_i}$ associated with $i = 1, \ldots, b_k$, in this order, and applying the transformation $(\mathbf{Q}_k)^T$ to them, leads to (2.14) and (2.15). This shows that the unit current injections required for computing $\mathbf{Z}_k^b$ are readily provided by the columns of $(\mathbf{Q}_k)^T$.

\[
\begin{bmatrix} (\mathbf{Q}_k)^T \mathbf{u}_k^{b_1} & (\mathbf{Q}_k)^T \mathbf{u}_k^{b_2} & \cdots & (\mathbf{Q}_k)^T \mathbf{u}_k^{b_{b_k}} \end{bmatrix}
= (\mathbf{Q}_k)^T \begin{bmatrix} \mathbf{u}_k^{b_1} & \mathbf{u}_k^{b_2} & \cdots & \mathbf{u}_k^{b_{b_k}} \end{bmatrix}
\tag{2.14}
\]
\[
(\mathbf{Q}_k)^T \begin{bmatrix} \mathbf{u}_k^{b_1} & \mathbf{u}_k^{b_2} & \cdots & \mathbf{u}_k^{b_{b_k}} \end{bmatrix}
= (\mathbf{Q}_k)^T
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}
= (\mathbf{Q}_k)^T
\tag{2.15}
\]

Hence, a general procedure for computing $\mathbf{Z}_k^b$ with respect to the border nodes of $S_k$ is summarized by (2.16). Since $(\mathbf{Q}_k)^T$ has $b_k$ columns, equation (2.16a) is equivalent to solving a sparse linear system of order $n_k$, defined by $\mathbf{Y}_k$, $b_k$ times, and storing each solution in one column of $\mathbf{Z}_k$. Afterwards, one only needs to gather the impedances associated with the border nodes in $\mathbf{Z}_k$ and store them in $\mathbf{Z}_k^b$, as stated by (2.16b).

\[ \mathbf{Y}_k \mathbf{Z}_k = (\mathbf{Q}_k)^T \tag{2.16a} \]
\[ \mathbf{Z}_k^b = \mathbf{Q}_k \mathbf{Z}_k \tag{2.16b} \]

Figure 2.3 illustrates the construction of the multi-node Thévenin impedances when applied to subsystem $S_1$, depicted in Figure 2.1.

Figure 2.3. Multi-node Thévenin equivalent impedances construction. Panels (a), (b) and (c) show a unit current injected at border nodes 1, 2 and 3 of $S_1$, respectively, and the resulting border node voltages.

Multi-Node Thévenin Equivalents

After equations (2.12) and (2.16) have been solved for each subsystem $S_k \in \mathcal{S}$, a series of circuits, synthesized as shown in Figure 2.4, is constructed. Algebraically, the multi-node Thévenin equivalent can be represented by (2.17), where $\mathbf{i}_k^b$ represents the current injections at the border nodes, $\mathbf{v}_k^b$ lumps all the voltage drops across the subsystem $S_k$, and $\mathbf{e}_k^b$ is the vector of Thévenin voltages that lie behind the Thévenin impedance network $\mathbf{Z}_k^b$.

\[ \mathbf{v}_k^b = \mathbf{Z}_k^b \mathbf{i}_k^b + \mathbf{e}_k^b \tag{2.17} \]

where

\[
\mathbf{v}_k^b =
\begin{bmatrix}
\bar{V}_k^{1'} - \bar{V}_k^{1} \\ \vdots \\ \bar{V}_k^{i'} - \bar{V}_k^{i} \\ \vdots \\ \bar{V}_k^{b_k'} - \bar{V}_k^{b_k}
\end{bmatrix}
\qquad
\mathbf{i}_k^b =
\begin{bmatrix}
\bar{I}_k^{1} \\ \vdots \\ \bar{I}_k^{i} \\ \vdots \\ \bar{I}_k^{b_k}
\end{bmatrix}
\qquad
\mathbf{e}_k^b =
\begin{bmatrix}
\bar{E}_k^{1} \\ \vdots \\ \bar{E}_k^{i} \\ \vdots \\ \bar{E}_k^{b_k}
\end{bmatrix}
\]

Figure 2.4. Multi-node Thévenin equivalent of a generic subsystem $S_k$ with $b_k$ border nodes.
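The two procedures (2.12) and (2.16) can be sketched as follows. This is a minimal illustration rather than the implementation used in this work: a small dense Gaussian elimination stands in for the sparse solver, and the three-node resistive network, the function names `solve` and `thevenin_equivalent`, and the `border` index list (which plays the role of $\mathbf{Q}_k$) are assumptions made for the example.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.
    Dense and O(n^3): a stand-in for a sparse solver, for illustration only."""
    n = len(A)
    M = [[float(x) for x in A[r]] + [float(b[r])] for r in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))  # pivot row
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(M[r][j] * x[j] for j in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def thevenin_equivalent(Y, j, border):
    """Return (Zb, eb) for one subsystem.
    border[i] is the local node index of border node B_k(i); gathering with
    this list is equivalent to multiplying by Q_k."""
    n, b = len(Y), len(border)
    e = solve(Y, j)                       # (2.12a): Y_k e_k = j_k
    eb = [e[node] for node in border]     # (2.12b): e_k^b = Q_k e_k
    Zb = [[0.0] * b for _ in range(b)]
    for col, node in enumerate(border):   # one solve per border node
        unit = [0.0] * n
        unit[node] = 1.0                  # column `col` of (Q_k)^T
        z = solve(Y, unit)                # one column of (2.16a)
        for row, m in enumerate(border):
            Zb[row][col] = z[m]           # (2.16b): Z_k^b = Q_k Z_k
    return Zb, eb


# Hypothetical subsystem: chain node 0 -- node 1 -- node 2 of 1 S branches,
# a 1 S shunt at node 0 and a 1 A source injected at node 0.
Y = [[2, -1, 0], [-1, 2, -1], [0, -1, 1]]
j = [1, 0, 0]
Zb, eb = thevenin_equivalent(Y, j, border=[2, 0])
```

With border nodes taken at local nodes 2 and 0, in that order, the sketch needs one solution for the Thévenin voltages and $b_k = 2$ further solutions for the impedances, which matches the operation count stated above.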
2.3.3 Multi-Area Thévenin Equivalents

Once all subsystems $S_k \in \mathcal{S}$ have been reduced to their multi-node Thévenin equivalents, one needs to interconnect them through the link system so that the currents flowing between adjacent subsystems can be calculated. This final equivalent of the original system with respect to all links is then called the Multi-Area Thévenin Equivalent (MATE). Analogously to what has been done for the construction of the multi-node Thévenin equivalents, a link-to-border mapping will also be introduced.

Link-to-Border Mapping

The aim of the link-to-border mapping is to represent the current injections at the border nodes $\mathcal{B}_k$ in terms of the currents flowing in the links defined in the set of global links $\mathcal{L}$. For instance, for the subsystem $S_1$ in Figure 2.1, which has $b_1 = l_1 = 3$, the current injection at each border node can be seen as an injection coming from a single link, respecting the proper link current orientation. Thus, if $\bar{I}_k^i$ is the current injection at border node $\mathcal{B}_k(i)$, then $\bar{I}_1^1 = -\bar{I}^{\alpha_1}$, $\bar{I}_1^2 = \bar{I}^{\alpha_3}$, and $\bar{I}_1^3 = -\bar{I}^{\alpha_2}$. In matrix form, the previous relationships yield (2.18), where $\mathbf{i}_1^b$ is a vector with the border node injections, $\mathbf{i}^l$ is a vector with the link currents according to the orientation adopted in Figure 2.1, and $\mathbf{R}_1$ is a $b_1 \times l$ matrix that represents the mapping between the two vectors.

\[ \mathbf{i}_1^b = \mathbf{R}_1 \mathbf{i}^l \tag{2.18} \]

where

\[
\mathbf{i}_1^b = \begin{bmatrix} \bar{I}_1^1 \\ \bar{I}_1^2 \\ \bar{I}_1^3 \end{bmatrix}
\qquad
\mathbf{i}^l = \begin{bmatrix} \bar{I}^{\alpha_1} \\ \bar{I}^{\alpha_2} \\ \bar{I}^{\alpha_3} \\ \bar{I}^{\alpha_4} \end{bmatrix}
\qquad
\mathbf{R}_1 = \begin{bmatrix} -1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & -1 & 0 & 0 \end{bmatrix}
\]

For the subsystem $S_3$ in the same example, $b_3 = 2$, $l_3 = 3$, and the current injections at the border nodes become combinations of one or more link currents. In this case, $\bar{I}_3^1 = \bar{I}^{\alpha_4}$ and $\bar{I}_3^2 = \bar{I}^{\alpha_2} - \bar{I}^{\alpha_3}$, which leads to (2.19).

\[ \mathbf{i}_3^b = \mathbf{R}_3 \mathbf{i}^l \tag{2.19} \]

where

\[
\mathbf{i}_3^b = \begin{bmatrix} \bar{I}_3^1 \\ \bar{I}_3^2 \end{bmatrix}
\qquad
\mathbf{R}_3 = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 1 & -1 & 0 \end{bmatrix}
\]

So far, only the current injections have been related by means of the link-to-border mapping. In the case of the voltages, the mapping direction has to be reversed.
This can be explained by the fact that all local links connected to the same border node sense the same nodal voltage. However, the link orientation may change the polarity of the voltage. This is the case in subsystem $S_3$ in Figure 2.1, where links $\alpha_2$ and $\alpha_3$ are joined together at the border node $\mathcal{B}_3(2)$. Thus, if $\bar{V}_k^i$ is the nodal voltage at border node $\mathcal{B}_k(i)$, and $\bar{V}_k^{\alpha_j}$ the voltage at the terminal of link $\alpha_j$ that lies in the subsystem $S_k$, equation (2.20) applies.

\[ \mathbf{v}_3^l = (\mathbf{R}_3)^T \mathbf{v}_3^b \tag{2.20} \]

where

\[
\mathbf{v}_3^b = \begin{bmatrix} \bar{V}_3^1 \\ \bar{V}_3^2 \end{bmatrix}
\qquad
\mathbf{v}_3^l = \begin{bmatrix} \bar{V}_3^{\alpha_1} \\ \bar{V}_3^{\alpha_2} \\ \bar{V}_3^{\alpha_3} \\ \bar{V}_3^{\alpha_4} \end{bmatrix}
\qquad
(\mathbf{R}_3)^T = \begin{bmatrix} 0 & 0 \\ 0 & 1 \\ 0 & -1 \\ 1 & 0 \end{bmatrix}
\]

In a general form, the mapping $\mathbf{R}_k$, from $\mathcal{L}$ onto $\mathcal{B}_k$, can be defined by (2.21). Conversely, the border-to-link mapping is achieved by employing $(\mathbf{R}_k)^T$.

\[
\mathbf{R}_k(i, j) =
\begin{cases}
1 & \text{if } \mathcal{L}(j) \text{ injects current into } \mathcal{B}_k(i), \\
-1 & \text{if } \mathcal{L}(j) \text{ draws current from } \mathcal{B}_k(i), \\
0 & \text{otherwise,}
\end{cases}
\tag{2.21}
\]

where $i = 1, \ldots, b_k$ and $j = 1, \ldots, l$.

Note that $\mathbf{R}_k$ is sparse and that each row, associated with one border node of subsystem $S_k$, has as many non-zero entries as the number of links connected to that node. Therefore, the number of non-zeros in $\mathbf{R}_k$ always equals the number of local links in $\mathcal{L}_k$, i.e., $l_k$.

Link Thévenin Equivalents

At this point, all subsystems $S_k \in \mathcal{S}$ are reduced to their sets of border nodes $\mathcal{B}_k$ by means of multi-node Thévenin equivalents and can finally be reconnected through the link system. Since equation (2.17) is expressed in terms of the border nodes in $\mathcal{B}_k$, each multi-node Thévenin equivalent has to be referenced to the set of global links $\mathcal{L}$, so that all equivalents can be connected to one another. The resulting equivalents are called link Thévenin equivalents, and will be discussed next.
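As an illustration of definition (2.21), the sketch below builds $\mathbf{R}_k$ from a list of oriented links. The link list is an assumption: it is a reconstruction of the orientation of Figure 2.1 inferred from the matrices given in (2.18), (2.19) and (2.23), and the function name `link_to_border_matrix` and the tuple encoding of the link terminals are hypothetical.

```python
# Each link is (tail, head), with each terminal given as
# (subsystem number, border node index), both 1-based. The link current is
# assumed to flow from tail to head: it is drawn from the tail border node
# and injected into the head border node.
LINKS = [
    ((1, 1), (2, 1)),  # alpha_1
    ((1, 3), (3, 2)),  # alpha_2
    ((3, 2), (1, 2)),  # alpha_3
    ((2, 2), (3, 1)),  # alpha_4
]


def link_to_border_matrix(k, b_k, links):
    """Build the b_k x l matrix R_k of (2.21) for subsystem k."""
    R = [[0] * len(links) for _ in range(b_k)]
    for j, (tail, head) in enumerate(links):
        if tail[0] == k:
            R[tail[1] - 1][j] = -1   # link j draws current from B_k(i)
        if head[0] == k:
            R[head[1] - 1][j] = 1    # link j injects current into B_k(i)
    return R


R1 = link_to_border_matrix(1, 3, LINKS)
R2 = link_to_border_matrix(2, 2, LINKS)
R3 = link_to_border_matrix(3, 2, LINKS)

# Number of non-zeros in each R_k, which should equal the local link count l_k.
nonzeros = [sum(1 for row in R for x in row if x != 0) for R in (R1, R2, R3)]
```

Under the assumed link list, the non-zero counts come out as 3, 2 and 3, in agreement with $l_1 = 3$ and $l_3 = 3$ quoted in the text.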
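Recalling that the vectors $\mathbf{v}_k^b$ and $\mathbf{e}_k^b$, with the border node voltages and the Thévenin voltages, respectively, can be mapped onto $\mathcal{L}$ by means of $(\mathbf{R}_k)^T$, and that the link currents vector $\mathbf{i}^l$ is mapped onto each $\mathcal{B}_k$ by means of the mapping $\mathbf{R}_k$, one can pre-multiply (2.17) by $(\mathbf{R}_k)^T$ and make $\mathbf{i}_k^b = \mathbf{R}_k \mathbf{i}^l$, as follows:

\[ (\mathbf{R}_k)^T \mathbf{v}_k^b = (\mathbf{R}_k)^T \mathbf{Z}_k^b \mathbf{R}_k \mathbf{i}^l + (\mathbf{R}_k)^T \mathbf{e}_k^b \]

which yields:

\[ \mathbf{v}_k^l = \mathbf{Z}_k^l \mathbf{i}^l + \mathbf{e}_k^l \tag{2.22a} \]
\[ \mathbf{Z}_k^l = (\mathbf{R}_k)^T \mathbf{Z}_k^b \mathbf{R}_k \tag{2.22b} \]
\[ \mathbf{e}_k^l = (\mathbf{R}_k)^T \mathbf{e}_k^b \tag{2.22c} \]

In the above equations, $\mathbf{e}_k^l$ represents the vector with the Thévenin voltages associated with subsystem $S_k$ that partially drive the global link currents $\mathbf{i}^l$. The vector $\mathbf{v}_k^l$ contains the total voltage drops across the subsystem $S_k$ due to $\mathbf{i}^l$ and $\mathbf{e}_k^l$. Lastly, the $l \times l$ matrix $\mathbf{Z}_k^l$, given in (2.22b), is the set of impedances $\mathbf{Z}_k^b$ seen from the global links.

For example, applying the link-to-border mapping $\mathbf{R}_2$ to the multi-node Thévenin equivalent $\mathbf{Z}_2^b$ with respect to the border nodes of subsystem $S_2$ in Figure 2.1, as given in (2.23), one can obtain $\mathbf{e}_2^l$, $\mathbf{v}_2^l$ and $\mathbf{Z}_2^l$, given in (2.24), which form the link Thévenin equivalent of subsystem $S_2$.

\[
\mathbf{Z}_2^b = \begin{bmatrix} \bar{Z}_2^{11} & \bar{Z}_2^{12} \\ \bar{Z}_2^{21} & \bar{Z}_2^{22} \end{bmatrix}
\qquad
\mathbf{R}_2 = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix}
\tag{2.23}
\]

\[
\mathbf{e}_2^l = (\mathbf{R}_2)^T \mathbf{e}_2^b =
\begin{bmatrix} \bar{E}_2^{\alpha_1} \\ \bar{E}_2^{\alpha_2} \\ \bar{E}_2^{\alpha_3} \\ \bar{E}_2^{\alpha_4} \end{bmatrix} =
\begin{bmatrix} \bar{E}_2^{1} \\ 0 \\ 0 \\ -\bar{E}_2^{2} \end{bmatrix}
\tag{2.24a}
\]
\[
\mathbf{v}_2^l = (\mathbf{R}_2)^T \mathbf{v}_2^b =
\begin{bmatrix} \bar{V}_2^{\alpha_1} \\ \bar{V}_2^{\alpha_2} \\ \bar{V}_2^{\alpha_3} \\ \bar{V}_2^{\alpha_4} \end{bmatrix} =
\begin{bmatrix} \bar{V}_2^{1} \\ 0 \\ 0 \\ -\bar{V}_2^{2} \end{bmatrix}
\tag{2.24b}
\]
\[
\mathbf{Z}_2^l = (\mathbf{R}_2)^T \mathbf{Z}_2^b \mathbf{R}_2 =
\begin{bmatrix}
\bar{Z}_2^{11} & 0 & 0 & -\bar{Z}_2^{12} \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
-\bar{Z}_2^{21} & 0 & 0 & \bar{Z}_2^{22}
\end{bmatrix}
\tag{2.24c}
\]

As a matter of fact, equation (2.22) can be seen as a polyphase voltage source $\mathbf{e}_k^l$ in series with the polyphase impedance $\mathbf{Z}_k^l$. Thus, the link Thévenin equivalent of the subsystem $S_2$, described by (2.24), can be represented by the equivalent circuits shown in Figure 2.5.

Figure 2.5. Link Thévenin equivalent of subsystem $S_2$ in Figure 2.1 in (a) detailed and (b) compact forms.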
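The worked example (2.23) and (2.24) can be checked numerically with a few lines of code. The sketch below applies (2.22b) and (2.22c) to $\mathbf{R}_2$; the numerical values chosen for $\mathbf{Z}_2^b$ and $\mathbf{e}_2^b$ are arbitrary real numbers picked for illustration (actual values would be complex phasors), and the helper names are hypothetical.

```python
def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def link_thevenin(R, Zb, eb):
    """Refer a multi-node Thevenin equivalent to the global links."""
    Rt = transpose(R)
    Zl = matmul(matmul(Rt, Zb), R)                           # (2.22b)
    el = [sum(r * e for r, e in zip(row, eb)) for row in Rt]  # (2.22c)
    return Zl, el


R2 = [[1, 0, 0, 0], [0, 0, 0, -1]]      # from (2.23)
Zb2 = [[2.0, 0.5], [0.5, 3.0]]          # arbitrary illustrative values
eb2 = [1.0, 0.8]                        # arbitrary illustrative values
Zl2, el2 = link_thevenin(R2, Zb2, eb2)
```

The result reproduces the pattern of (2.24): only the rows and columns of links $\alpha_1$ and $\alpha_4$ are populated, and the entries coupling the two links, as well as the Thévenin voltage seen from $\alpha_4$, change sign.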
Multi-Area Thévenin Equivalents

Once the link Thévenin equivalents are found for all subsystems $S_k \in \mathcal{S}$, there will exist for each subsystem a polyphase representation similar to the one shown in Figure 2.5. Since each phase of each link Thévenin equivalent is associated with one single global link, they can all be connected in series as shown in Figure 2.6.

Figure 2.6. Multi-area Thévenin equivalent of the system in Figure 2.1 in its (a) detailed and (b) compact versions.

In addition to the subsystems, one can also add the impedances of the interconnection system, represented by the matrix $\mathbf{Z}_0^l$. In the case of the system example depicted in Figure 2.6, $\mathbf{Z}_0^l$ is a diagonal matrix, because all links are considered uncoupled. In a more general case, however, couplings among links could be straightforwardly added as off-diagonal elements of $\mathbf{Z}_0^l$.

Finally, based on the circuit in Figure 2.6b, one can combine all polyphase impedances and voltage sources accordingly, which results in the linear system given in (2.25), which can then be solved for the link currents $\mathbf{i}^l$.

\[ \mathbf{Z}^l = \mathbf{Z}_0^l + \sum_{S_k \in \mathcal{S}} \mathbf{Z}_k^l \tag{2.25a} \]
\[ \mathbf{e}^l = \sum_{S_k \in \mathcal{S}} \mathbf{e}_k^l \tag{2.25b} \]
\[ \mathbf{Z}^l \mathbf{i}^l = \mathbf{e}^l \tag{2.25c} \]

2.3.4 Subsystems Update

Since all link currents $\mathbf{i}^l$ are known at this stage, one only needs to build the proper $\mathbf{i}_k$ from the link currents $\mathbf{i}^l$ and, together with the internal currents $\mathbf{j}_k$, solve the linear system (2.1) for the nodal voltages $\mathbf{v}_k$ of each subsystem $S_k$.

For building $\mathbf{i}_k$, a two-step conversion is proposed, first from links to border nodes and then from border nodes to subsystem nodes.
Mathematically, this task is achieved by means of the transformations $\mathbf{R}_k$ and $(\mathbf{Q}_k)^T$, defined in (2.21) and (2.8), respectively, and expressed in (2.26).

\[ \mathbf{i}_k = (\mathbf{Q}_k)^T \mathbf{R}_k \mathbf{i}^l \tag{2.26} \]

By the end of this stage, each subsystem $S_k \in \mathcal{S}$ will have its set of nodal voltages $\mathbf{v}_k$ known, which completely solves the original untorn system.

2.4 MATE: Original versus Network-based

The difference between the original MATE formulation, introduced by Martí et al. (2002), and the network-based formulation presently discussed lies in the definitions of the subsystem-to-border mapping $\mathbf{Q}_k$, given in (2.8), and the link-to-border mapping $\mathbf{R}_k$, given in (2.21), and in how they relate to $\mathbf{P}_k$, given in (2.3).

From (2.3), notice that $\mathbf{P}_k$ converts the global link currents $\mathbf{i}^l$ into currents injected into the subsystem $S_k$. Therefore, each matrix $\mathbf{P}_k$ can be seen as a link-to-subsystem mapping from $\mathcal{L}$ to $\mathcal{N}_k$, as defined in (2.4) and emphasized in (2.27).

\[ \mathbf{i}_k = \mathbf{P}_k \mathbf{i}^l \tag{2.27} \]

Comparing (2.27) to (2.26), one can verify that the following equality applies.

\[ \mathbf{P}_k = (\mathbf{Q}_k)^T \mathbf{R}_k \tag{2.28} \]

Similarly to the other mappings, the reverse mapping of $\mathbf{P}_k$ is given by its transpose, $(\mathbf{P}_k)^T$, and can be used to convert the nodal voltages defined in $\mathcal{N}_k$ to voltages at the link terminals in $\mathcal{L}$. The relationships among all defined mappings are summarized in Figure 2.7, which includes the flow directions of node voltages and current injections among the subsystem nodes set $\mathcal{N}_k$, the border nodes set $\mathcal{B}_k$ and the global links set $\mathcal{L}$.

The main benefits of using the $\mathbf{R}_k$ and $\mathbf{Q}_k$ mappings instead of $\mathbf{P}_k$ are the following:

Less computational effort to obtain the multi-area Thévenin equivalent $\mathbf{Z}^l$: The multi-area Thévenin equivalent $\mathbf{Z}^l$ can be interchangeably defined by (2.25a) or (2.6). However, if each link Thévenin equivalent $\mathbf{Z}_k^l$, which will later form the multi-area Thévenin equivalent $\mathbf{Z}^l$, is computed according to (2.7), the nodal equations associated with the subsystem $S_k$ need to be solved $l$ times, where $l$ is the number of global links.
Even when the empty columns of $\mathbf{P}_k$ are skipped during the computation of $\mathbf{Z}_k^l$, the subsystem $S_k$ needs to be solved at least $l_k$ times, where $l_k$ is the number of local links. On the other hand, if each link Thévenin equivalent $\mathbf{Z}_k^l$ is computed according to (2.22b), the computational burden will be concentrated in forming the multi-node Thévenin equivalent $\mathbf{Z}_k^b$, which requires only $b_k$ solutions for each subsystem $S_k$. Taking into consideration that, for large systems, the number of border nodes satisfies $b_k \le l_k \ll l$, a reasonable gain in performance should be expected.

Figure 2.7. Relationships among the various mappings $\mathbf{R}_k$, $\mathbf{Q}_k$ and $\mathbf{P}_k$, showing the flow of voltages and currents among $\mathcal{N}_k$, $\mathcal{B}_k$ and $\mathcal{L}$.

Less data exchange between subsystem processes and link solver: The link-to-border mapping $\mathbf{R}_k$ works as an indirect reference to the multi-node Thévenin equivalent $\mathbf{Z}_k^b$, which has only $b_k^2$ elements, to form the link Thévenin equivalent $\mathbf{Z}_k^l$, which in turn has $l_k^2$ elements. Without the information carried by the link-to-border mapping $\mathbf{R}_k$, $\mathbf{Z}_k^l$ would have to be formed by the subsystems and sent to the link solver. Moreover, without $\mathbf{R}_k$, the link solver would have no other option but to send all link currents to all subsystems $S_k \in \mathcal{S}$. With $\mathbf{R}_k$, on the other hand, the link solver only needs to send the net currents injected at the border nodes of the subsystems. Therefore, the link-to-border mapping $\mathbf{R}_k$ plays an important role in minimizing data exchange between the subsystems and the link solver.

Global link information is confined to the link solver process: If the link Thévenin equivalents $\mathbf{Z}_k^l$ are formed by the subsystem processes, information regarding the global links is required by all subsystems and, therefore, needs to be replicated in all processes participating in the solution.
The multi-node Thévenin equivalents $\mathbf{Z}_k^b$, however, only require information regarding the local links, which again reduces memory usage and the amount of global information that needs to be shared among the processes.

2.5 Conclusion

In this chapter, a step-by-step formulation of the MATE algorithm from a network standpoint has been developed. The formulation revealed properties of the subsystem interface nodes (or border nodes) and of the links that can be used to minimize both subsystem computations and communication overhead.

The distinct mappings among subsystem nodes, border nodes and links, introduced in this formulation, also make subsystem computations and communications rely on local information only. This characteristic also plays an important role in the aforementioned reduction of communication and computation.

In the next chapter, a thorough assessment of the computation and communication effort required by the foregoing methodology, in terms of a performance model of the network-based MATE algorithm, will be presented.

Chapter 3

Network-based MATE Algorithm Implementation

Although diakoptics theory is very well established in the literature, not many implementations of such techniques are available in the power systems field. This is likely due to the great success of sparsity techniques in solving power systems problems on sequential computers, and to the high costs involved with early parallel machines. However, with the advent of cheaper commodity computer clusters and multi-core processors, diakoptics-based as well as other parallel algorithms have been drawing more attention from the power systems field.

Therefore, decision makers need appropriate tools to compare existing traditional software to newly developed parallel alternatives. This requires an understanding of the related computational costs, which can be developed and formalized in mathematical performance models.
These models can be used to compare the efficiency of different algorithms, to evaluate scalability, and to identify possible bottlenecks and other inefficiencies, all before investing substantial effort in an implementation (Foster, 1995).

Even though many parallel algorithms have been proposed in the computer science and power systems fields for solving large sparse linear systems (see Chapter 1 for references), their performance varies greatly according to the available computing architecture and the type and size of the problem under analysis. Depending on the target architecture, the maximum communication overhead that a parallel algorithm can tolerate while remaining reasonably efficient may decrease drastically, as is the case for distributed systems. As a consequence, fine-grain parallel algorithms may not be recommendable or may even become unfeasible.

An implementation of the MATE algorithm suitable for distributed computing architectures will be set forth, according to the theoretical development presented in Chapter 2. One of the reasons supporting this choice is that, in the MATE algorithm, the amount of communication remains limited as long as the system under analysis is partitioned into subsystems in such a way that the interconnections among them also remain limited.

In the development, the proposed implementation will also be modeled in terms of performance metrics, such as computation and communication times, efficiency and speedup. For this task, the implementation will be assessed from the computational and communication standpoints, aiming at the timings required to compute the subsystem and link equations, and to exchange data between the subsystem and link solver processes. Afterwards, the aforementioned performance metrics for the MATE parallel algorithm, relative to sequential sparsity techniques, will be evaluated and analyzed.
3.1 MATE Algorithm Flow Chart

In Figure 3.1, the proposed MATE algorithm flow chart is depicted. Since all subsystems are similar in behavior, only one generic subsystem $S_k$ is represented. Moreover, each communication point between the subsystem $S_k$ and the link solver should actually be seen as a collective communication point, where all subsystems $S_k \in \mathcal{S}$ communicate with the link solver and vice versa.

At the beginning, all processes need to acquire the data they rely upon. At this stage, all subsystems concurrently load the local admittance matrices $\mathbf{Y}_k$, the local current injections $\mathbf{j}_k$, and the information regarding the links connected to each specific subsystem, which is summarized by $\mathcal{L}_k$. As for the link solver, the information about all links interconnecting all subsystems, denoted simply by $\mathcal{L}$, needs to be loaded.

Subsequently, the subsystems identify the border node sets $\mathcal{B}_k$ and form the subsystem-to-border mappings $\mathbf{Q}_k$ and the link-to-border mappings $\mathbf{R}_k$, according to the definitions shown in (2.8) and (2.21). Afterwards, the link-to-border mappings $\mathbf{R}_k$ are sent to the link solver, so that it becomes aware of which links are associated with each subsystem $S_k$ and their respective border nodes.

At this point, all subsystems start computing their multi-node Thévenin equivalents, formed by $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$, according to (2.12), (2.16) and (2.17), and send them to the link solver. After receiving these equivalents, the link solver assembles the multi-area Thévenin equivalent, formed by $\mathbf{Z}^l$ and $\mathbf{e}^l$, and computes the link currents $\mathbf{i}^l$, following the procedure described by (2.22) and (2.25).

With the link currents $\mathbf{i}^l$ available, the link solver scatters them appropriately, i.e., respecting the link-to-border mappings $\mathbf{R}_k$, among the subsystems, which can finally update the local injections $\mathbf{j}_k$ with the border node injections coming from the link solver and compute the local node voltages $\mathbf{v}_k$.
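The last two steps of the flow chart, scattering the link currents and updating the local injections, only involve the mappings $\mathbf{R}_k$ and $(\mathbf{Q}_k)^T$. The sketch below illustrates them for subsystem $S_1$, using $\mathbf{R}_1$ from (2.18). Everything else is an assumption made for the example: the five-node subsystem, the local positions of the border nodes, the numerical values of the link currents and injections, and the helper names. Because $\mathbf{Q}_k$ is a selection matrix, it is stored here as a plain index list.

```python
def link_to_border(R, il):
    """i_k^b = R_k i^l: net current injected at each border node."""
    return [sum(r * i for r, i in zip(row, il)) for row in R]


def border_to_subsystem(border, ib, n):
    """i_k = (Q_k)^T i_k^b: scatter border injections into an n-node vector."""
    ik = [0.0] * n
    for node, value in zip(border, ib):
        ik[node] = value
    return ik


R1 = [[-1, 0, 0, 0], [0, 0, 1, 0], [0, -1, 0, 0]]  # from (2.18)
border = [0, 4, 2]   # hypothetical local node indices of B_1(1), B_1(2), B_1(3)
n1 = 5               # hypothetical number of nodes in S_1
il = [1.0, 2.0, 3.0, 4.0]          # hypothetical link currents
j1 = [0.5, 0.0, 0.0, 0.0, 0.0]     # hypothetical internal injections

# Step performed by the link solver before sending: links -> border nodes.
ib1 = link_to_border(R1, il)
# Steps performed by the subsystem: border nodes -> local nodes, then update.
i1 = border_to_subsystem(border, ib1, n1)
j1_updated = [a + b for a, b in zip(j1, i1)]

# Cross-check against the single link-to-subsystem mapping P_1 = (Q_1)^T R_1.
Q1 = [[1 if col == node else 0 for col in range(n1)] for node in border]
P1 = [[sum(Q1[i][row] * R1[i][col] for i in range(len(border)))
       for col in range(len(il))] for row in range(n1)]
i1_direct = [sum(p * i for p, i in zip(row, il)) for row in P1]
```

In this sketch only the three border injections `ib1` would need to travel from the link solver to the subsystem, rather than the four global link currents, and the subsystem never needs to know the global link numbering.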
In the next sections, a more elaborate discussion of the foregoing procedure will be presented, along with performance aspects of each task involved in the network-based MATE implementation.

Figure 3.1. MATE algorithm flow chart. Subsystem $S_k$: read $\mathbf{Y}_k$, $\mathbf{j}_k$ and $\mathcal{L}_k$; find $\mathcal{B}_k$; form $\mathbf{Q}_k$ and $\mathbf{R}_k$; send $\mathbf{R}_k$; compute $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$; send $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$; receive $\mathbf{i}_k^b$; update $\mathbf{j}_k \leftarrow (\mathbf{Q}_k)^T \mathbf{i}_k^b + \mathbf{j}_k$; solve $\mathbf{Y}_k \mathbf{v}_k = \mathbf{j}_k$. Link solver: read $\mathcal{L}$; gather $\mathbf{R}_k$ for $k = 1, \ldots, p$; gather $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$; form $\mathbf{Z}^l$ and $\mathbf{e}^l$; solve $\mathbf{Z}^l \mathbf{i}^l = \mathbf{e}^l$; scatter $\mathbf{i}_k^b$ for $k = 1, \ldots, p$.

3.2 MATE Performance Model

In this section, the MATE method for solving large linear electric networks on a distributed computer architecture will be modeled in terms of performance metrics, such as computation and communication times, efficiency and speedup. For this task, the algorithm presented earlier will be assessed from the computational and communication standpoints, aiming at the timings required to compute the subsystem and link equations, and to exchange data between the subsystem and link solver processes. Afterwards, the aforementioned performance metrics for the MATE parallel algorithm, relative to sequential sparsity techniques, will be evaluated and analyzed.

3.2.1 Performance Model Preliminaries

In order to model the performance of the MATE algorithm, one first needs to understand the computational and communication requirements of each of the MATE tasks, taking into perspective the type of problems intended to be solved.
For the processes performing tasks associated with subsystem $S_k$, the execution time $T(S_k)$ is defined as follows:

\[ T(S_k) = T_{comp}^{Thv}(S_k) + T_{comm}^{Thv}(S_k) + T_{idle}^{lnk}(S_k) + T_{comm}^{i}(S_k) + T_{comp}^{v}(S_k) \tag{3.1} \]

where

$T_{comp}^{Thv}(S_k)$ = time spent computing the multi-node Thévenin equivalent $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$ for the subsystem $S_k \in \mathcal{S}$;
$T_{comm}^{Thv}(S_k)$ = time spent sending $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$ to the link solver process;
$T_{idle}^{lnk}(S_k)$ = time spent waiting for the link solver process to compute the link currents $\mathbf{i}^l$;
$T_{comm}^{i}(S_k)$ = time spent receiving the border node injections $\mathbf{i}_k^b$ from the link solver;
$T_{comp}^{v}(S_k)$ = time spent solving for the node voltages $\mathbf{v}_k$ with the updated current injections $\mathbf{i}_k$.

For the link solver process, which solves the equations associated with the link system $\mathcal{L}$, or multi-area Thévenin equivalent, the execution time is given by:

\[ T(\mathcal{L}) = T_{idle}^{subs}(\mathcal{L}) + T_{comm}^{Thv}(\mathcal{L}) + T_{comp}^{i}(\mathcal{L}) + T_{comm}^{i}(\mathcal{L}) \tag{3.2} \]

where

$T_{idle}^{subs}(\mathcal{L})$ = time spent waiting for the subsystems $S_k \in \mathcal{S}$ to compute $\mathbf{Z}_k^b$, $\mathbf{e}_k^b$ and $\mathbf{v}_k$;
$T_{comm}^{Thv}(\mathcal{L})$ = time spent receiving $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$ from the subsystems $S_k \in \mathcal{S}$;
$T_{comp}^{i}(\mathcal{L})$ = time spent setting up and computing the link current equations;
$T_{comm}^{i}(\mathcal{L})$ = time spent scattering the border node current injections $\mathbf{i}_k^b$ to the subsystems $S_k \in \mathcal{S}$.

All of the above timings are shown in the MATE algorithm timeline example illustrated in Figure 3.2. Three processes are shown: two subsystem processes, A and B, and the link solver process.

Figure 3.2. MATE timeline for two subsystems, $S_1$ and $S_2$, and the link solver, between instants $t_0$ and $t_5$. Legend: communication, idle, computation.
At instant $t_0$, all subsystem processes start computing their corresponding multi-node Thévenin equivalents. Once the first of them is ready, at instant $t_1$, the link solver process can start receiving it, while the other processes finish their computation. Right after sending their multi-node Thévenin equivalents to the link solver, the subsystems start waiting for the link solver feedback. At instant $t_2$, the link solver has received all Thévenin equivalents and starts setting up and computing the link current equations. Once the link currents become available, the link solver process sends them to the subsystems at instant $t_3$. As soon as the subsystems receive the border node current injections due to their local links, they are able to update their local current injections and solve for their local node voltages. After instant $t_4$, the link solver process remains waiting for the next batch of Thévenin equivalents, while the subsystems finish updating the local variables.

From the previous discussion on the MATE timeline, the total time for the network-based MATE algorithm, $T_{MATE}$, can be estimated by the sum of the terms related to the most time-consuming subsystem $S_{max}$, in addition to the data exchanges and computational tasks associated with the link currents. Thus, the foregoing estimate can be written as follows:

\[ T_{MATE} = T_{comp}^{Thv}(S_{max}) + T_{comm}^{Thv}(\mathcal{L}) + T_{comp}^{i}(\mathcal{L}) + T_{comm}^{i}(\mathcal{L}) + T_{comp}^{v}(S_{max}) \tag{3.3} \]

Next, each time component introduced above will be formulated, based on the computational complexity or communication requirements of each of the MATE tasks described in Section 2.3.
In summary, the computational and communication tasks covered in the following sections are:

• Computational Tasks:
  – Multi-node Thévenin equivalents computation
  – Link currents computation
  – Subsystems update
• Communication Tasks:
  – Subsystems multi-node Thévenin equivalents gathering
  – Link currents scattering

3.2.2 Computational Aspects of MATE

In general, the most computationally expensive tasks of the MATE algorithm are the computation of the multi-node Thévenin equivalents at the subsystem level, and the solution of the link currents at the link solver level.

Since the main purpose of the MATE algorithm is to tackle very large systems, the subsystems, obtained by means of any partitioning heuristic, will also be large enough to be very sparse. Therefore, the application of well-known sparsity techniques will be fundamental to guarantee the efficiency of the subsystem tasks. As for the link solver tasks, traditional dense matrix operations are required.

Multi-node Thévenin equivalents

Based on (2.12) and (2.16), notice that both $\mathbf{e}_k^b$ and $\mathbf{Z}_k^b$ are solutions of the same linear system (2.1), defined by $\mathbf{Y}_k$, with only different right-hand sides, i.e., $\mathbf{j}_k$ for computing $\mathbf{e}_k^b$ and $(\mathbf{Q}_k)^T$ for computing $\mathbf{Z}_k^b$. In this way, each subsystem's local network, expressed by (2.1), has to be solved $b_k$ times for building $\mathbf{Z}_k^b$, and one time for $\mathbf{e}_k^b$. Since the matrix $\mathbf{Y}_k$ is usually very sparse, sparse techniques for solving this linear system should be employed. A widely used technique for solving large sparse linear systems is LU factorization.

The time necessary to compute the multi-node Thévenin equivalents can then be approximated by the following expression:

\[ T_{comp}^{Thv}(S_k) = T_{1fact}(S_k) + (N_z - 1)\, T_{2fact}(S_k) + (N_z b_k + N_i)\, T_{solv}(S_k) \tag{3.4} \]

where $T_{1fact}(S_k)$ represents the time for the first factorization of the admittance matrix $\mathbf{Y}_k$, which includes both symbolic and numerical factorizations.
In turn, $T_{2fact}(S_k)$ is the time spent in a numerical factorization that reuses the structures of the factors $\mathbf{L}_k$ and $\mathbf{U}_k$, acquired during the symbolic factorization of $\mathbf{Y}_k$. Additionally, $T_{solv}(S_k)$ is the time for solving the triangular systems defined by the factors $\mathbf{L}_k$ and $\mathbf{U}_k$. The parameter $N_z$ captures the number of topological changes in subsystem $S_k$, i.e., how often $\mathbf{Z}_k^b$ needs to be recalculated, whereas the parameter $N_i$ represents the number of changes in the injected currents.

Results obtained by Alvarado (1976) with respect to the order of complexity of sparse factorization and repeated solutions of triangular systems suggest that $T_{1fact}(S_k)$, $T_{2fact}(S_k)$ and $T_{solv}(S_k)$ can be approximated by polynomial functions of $\rho_k$ and $n_k$, as follows:

\[ T_{1fact}(S_k) = a_3 \rho_k^2 n_k + a_2 \rho_k n_k + a_1 n_k + a_0 \tag{3.5} \]
\[ T_{2fact}(S_k) = b_3 \rho_k^2 n_k + b_2 \rho_k n_k + b_1 n_k + b_0 \tag{3.6} \]
\[ T_{solv}(S_k) = c_2 \rho_k n_k + c_1 n_k + c_0 \tag{3.7} \]

where $n_k$ is the number of buses in subsystem $S_k$ and $\rho_k$ represents the branch-to-bus ratio of the same subsystem.

These relationships show that more interconnected systems, i.e., those with higher branch-to-bus ratios $\rho_k$, are more computationally expensive. On one hand, if a system has all its buses interconnected to one another, its branch-to-bus ratio $\rho$ equals $n - 1$, which makes its factorization and repeated solution times of order $O(n^3)$ and $O(n^2)$, respectively. On the other hand, for loosely interconnected systems, as is the case of most power systems, where each bus connects to just a few neighboring buses, the factorization and repeated solution times vary practically linearly with the number of buses $n$ in the system, which means an order of complexity $O(n)$ (see footnote 6).

In summary, the aspects that influence the computation of the multi-node Thévenin equivalents are: subsystem size, topology, ordering of the nodes, as well as the number of border nodes. Both the size and the number of border nodes are highly dependent on the partitioning used to tear the original system apart.
Ideally, the employed partitioning technique should make the subsystems as balanced as possible, while keeping the number of border nodes reasonably small. This also helps in reducing the amount of data that needs to be exchanged with the link solver.

Footnote 6: For further details, see Appendix A.

Subsystems update

Once the local current injections $\mathbf{j}_k + \mathbf{i}_k$ are available, the sparse linear system (2.2) needs to be solved. Since the subsystem matrix $\mathbf{Y}_k$ is the same as the one used to compute the multi-node Thévenin equivalents, the factors $\mathbf{L}_k$ and $\mathbf{U}_k$ can be reused to solve (2.2) for the node voltages $\mathbf{v}_k$, by means of sparse forward/backward substitutions. Hence, using $T_{solv}(S_k)$, which is defined in (3.7), the time required to calculate $\mathbf{v}_k$ is as follows:

\[ T_{comp}^{v}(S_k) = N_i\, T_{solv}(S_k) \tag{3.8} \]

Notice from (3.8) that each subsystem has to be solved as many times as its current injections change, i.e., $N_i$ times.

Link currents computation

Considering that the multi-node Thévenin equivalents of all $p$ subsystems are available at the link solver, one can build the multi-area Thévenin equivalent, formed by $\mathbf{Z}^l$ and $\mathbf{e}^l$, by means of the relationships denoted in (2.25). In these equations, the link Thévenin equivalents, defined by (2.22), are built in-place on top of the multi-area Thévenin equivalent, $\mathbf{Z}^l$ and $\mathbf{e}^l$, according to (3.9).

\[ \mathbf{Z}^l = \mathbf{Z}_0^l + \sum_{k=1}^{p} (\mathbf{R}_k)^T \mathbf{Z}_k^b \mathbf{R}_k \tag{3.9a} \]
\[ \mathbf{e}^l = \sum_{k=1}^{p} (\mathbf{R}_k)^T \mathbf{e}_k^b \tag{3.9b} \]

Once the multi-area Thévenin equivalent seen from the links is built, the link currents computation can proceed. In this case, the dense linear system denoted in (2.25c) needs to be solved, by first factorizing $\mathbf{Z}^l$ and then solving the triangular systems associated with the obtained dense $\mathbf{L}$ and $\mathbf{U}$ factors, by means of forward and backward substitutions.

Moreover, since subsystems may undergo topological changes during simulations, their corresponding link Thévenin equivalents $\mathbf{Z}_k^l$ will also change.
Therefore, if any subsystem $S_k \in \mathcal{S}$ has its topology changed $N_z$ times, the matrix $\mathbf{Z}^l$ also needs to be re-factorized $N_z$ times. Similarly, if the subsystem current injections change $N_i$ times over a simulation period, the link Thévenin voltages $\mathbf{e}^l$ also have to be computed $N_i$ times. So, based on the fact that the number of operations required to factorize $\mathbf{Z}^l$ is of the order $O(l^3)$, and forward/backward substitutions of the order $O(l^2)$, one can conclude that:

\[ T_{comp}^{i}(\mathcal{L}) = N_z\, T_{fact}(\mathcal{L}) + N_i\, T_{solv}(\mathcal{L}) \tag{3.10} \]

where $T_{fact}(\mathcal{L})$ and $T_{solv}(\mathcal{L})$ can be approximated by

\[ T_{fact}(\mathcal{L}) = a_3 l^3 + a_2 l^2 + a_1 l \tag{3.11} \]
\[ T_{solv}(\mathcal{L}) = b_2 l^2 + b_1 l \tag{3.12} \]

3.2.3 Communication Aspects of MATE

In the MATE algorithm, two main communication tasks are necessary: gathering the multi-node Thévenin equivalents in the link solver and scattering the link currents to the subsystems.

For establishing the performance model of such routines, the PLogP model, introduced by Kielmann et al. (2000), was employed. Known as the parameterized LogP model, PLogP is an extension of the LogP model, originally proposed by Culler et al. (1996). Unlike the original LogP, which considers fixed parameters for modeling the performance of dedicated networks, PLogP includes the dependence of the parameters on the message size.

The parameters that PLogP employs are described as follows:

• $L$: end-to-end latency from process to process, which includes the time for copying the data to and from the network interfaces and transferring the data over the physical network;
• $o_s(m)$: period of time in which the processor is engaged in sending a message of size $m$;
• $o_r(m)$: period of time in which the processor is engaged in receiving a message of size $m$;
• $g(m)$: minimum time, or gap, between consecutive message transmissions or receptions, which also considers all contributing factors, including $o_s(m)$ and $o_r(m)$.
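To make the computational timing expressions of Section 3.2.2 concrete, the sketch below evaluates (3.4) to (3.8) for a subsystem and (3.10) to (3.12) for the link solver. It is only an illustration of how the model is assembled: the function names are hypothetical, and the coefficient values used in the example are arbitrary placeholders, not measured or fitted values.

```python
def t_fact_subsystem(rho, n, coef):
    """(3.5)/(3.6): coef = (x3, x2, x1, x0) for a sparse factorization."""
    x3, x2, x1, x0 = coef
    return x3 * rho**2 * n + x2 * rho * n + x1 * n + x0


def t_solv_subsystem(rho, n, c):
    """(3.7): c = (c2, c1, c0) for sparse forward/backward substitution."""
    c2, c1, c0 = c
    return c2 * rho * n + c1 * n + c0


def t_thevenin(rho, n, b_k, Nz, Ni, a, b, c):
    """(3.4): time to compute the multi-node Thevenin equivalents."""
    return (t_fact_subsystem(rho, n, a)
            + (Nz - 1) * t_fact_subsystem(rho, n, b)
            + (Nz * b_k + Ni) * t_solv_subsystem(rho, n, c))


def t_update(rho, n, Ni, c):
    """(3.8): time to update the subsystem node voltages."""
    return Ni * t_solv_subsystem(rho, n, c)


def t_link_currents(l, Nz, Ni, a, b):
    """(3.10)-(3.12): a = (a3, a2, a1), b = (b2, b1) for dense operations."""
    t_fact = a[0] * l**3 + a[1] * l**2 + a[2] * l
    t_solv = b[0] * l**2 + b[1] * l
    return Nz * t_fact + Ni * t_solv


# Placeholder coefficients and sizes, in arbitrary time units.
thv = t_thevenin(rho=2, n=10, b_k=4, Nz=2, Ni=3,
                 a=(1, 1, 1, 0), b=(1, 0, 0, 0), c=(1, 0, 0))
upd = t_update(rho=2, n=10, Ni=3, c=(1, 0, 0))
lnk = t_link_currents(l=10, Nz=2, Ni=3, a=(2, 0, 0), b=(1, 0))
```

Written this way, the model makes the different scaling of the two levels visible: the subsystem terms grow linearly with $n_k$ for a fixed branch-to-bus ratio, whereas the link solver term grows with $l^3$.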
In the next sections, the link currents scattering will be explained first, followed by the multi-node Thévenin equivalents gathering.

Link currents scatter

At this stage of the solution, each of the $p$ subsystems $S_k \in \mathcal{S}$ has to receive from the link solver process, $\mathcal{L}$, the net currents $\mathbf{i}_k^b$ injected by the $l_k$ local links at the $b_k$ border nodes. These current injections are the result of a linear combination of the link currents $\mathbf{i}^l$, obtained by means of the link-to-border mapping $\mathbf{R}_k$, as shown in (3.13).

\[ \mathbf{i}_k^b = \mathbf{R}_k \mathbf{i}^l \tag{3.13} \]

The procedure described by (3.13) represents, in fact, a pre-processing step performed by the link solver on the link currents $\mathbf{i}^l$ just before sending them to the subsystems. As mentioned earlier, by sending the border node net current injections $\mathbf{i}_k^b$, instead of the global link currents vector $\mathbf{i}^l$, one reduces the communication overhead of the algorithm.

In addition, once the subsystems have received their net injections $\mathbf{i}_k^b$, these are added to the local subsystem current injections $\mathbf{j}_k$ by means of the border-to-subsystem mapping $(\mathbf{Q}_k)^T$, as shown in (3.14). The final vector $\mathbf{j}_k$ is later used for the subsystem update, as discussed in Section 3.2.2.

\[ \mathbf{j}_k \leftarrow \mathbf{j}_k + (\mathbf{Q}_k)^T \mathbf{i}_k^b \tag{3.14} \]

In fact, the present communication task corresponds to a customized scatter function, according to the MPI Standard (2008). This is shown graphically in Figure 3.3. In this procedure, the parameters $L$, $o_s(m)$, $o_r(m)$ and $g(m)$ characterize the behavior of a network, according to the PLogP model (Kielmann et al., 2000).

The link currents scatter routine starts in the link solver process, which sends $p$ consecutive messages to the distinct processes handling the subsystems. As seen in Figure 3.3, the first message alone takes a time equal to $o_s(m_1) + L + o_r(m_1)$ to be completely received by the destination process.
However, the link solver has to wait a minimum of $g(m_1)$ seconds before it can start sending the second message. The second message therefore leaves the link solver process only at the $g(m_1) + o_s(m_2)$ mark and is completely received by the respective subsystem at $g(m_1) + o_s(m_2) + L + o_r(m_2)$. Following the same reasoning, one can find the time consumed by the link solver process to send $p$ messages of size $m_k$ ($k = 1, \ldots, p$) to the subsystems' processes. The resulting link currents scatter timing $T^{i}_{comm}(\mathcal{L})$ is given in (3.15).

\[ T^{i}_{comm}(\mathcal{L}) = L + o_s(m_p) + o_r(m_p) + \sum_{k=1}^{p-1} g(m_k) \tag{3.15} \]

Recalling that each message of size $m_k$ is associated with the currents being injected at the $b_k$ border nodes of subsystem $S_k$, there is a linear relationship between $m_k$ and $b_k$, which depends on the data type that represents each injection. If a single current injection takes $c$ bytes of memory, the message size $m_k$ is defined by (3.16).

\[ m_k = c\, b_k \tag{3.16} \]

Figure 3.3. Link currents scatter according to PLogP model.

Considering that longer messages take longer to be transferred to another process, one can conclude that the function $g(m_k)$ is always positive and increases monotonically with the message size $m_k$. Therefore, for any message size $m_k$, $g(m_k) \le g(\bar{m}_k)$, where $\bar{m}_k$ is the longest message amongst all $m_k$, for $k = 1, \ldots, p$. Consequently, the following inequality is also true.

\[ \sum_{k=1}^{p-1} g(m_k) \le \sum_{k=1}^{p-1} g(\bar{m}_k) = (p - 1)\, g(\bar{m}_k) \tag{3.17} \]

The fact that the gap $g$ includes both overheads $o_s$ and $o_r$ leads to $g(\bar{m}_k) \ge o_s(m_k)$ and $g(\bar{m}_k) \ge o_r(m_k)$ (Kielmann et al., 2000).
Combining these last pieces of information with (3.15) and (3.17) yields (3.18), which represents an upper bound for the link currents scatter time $T^{i}_{comm}(\mathcal{L})$.

\[ T^{i}_{comm}(\mathcal{L}) \le L + (p + 1)\, g(\bar{m}_k) \tag{3.18} \]

Figure 3.4. Multi-node Thévenin equivalents gather according to PLogP model.

Multi-node Thévenin equivalents gather

In this case, each of the $p$ subsystems $S_k \in \mathcal{S}$ has to send its previously computed multi-node Thévenin equivalent, formed by $Z_{b_k}$ and $e_{b_k}$, to the link solver process. This task corresponds to a customized gather function (MPI Standard, 2008). Considering that each one of the multi-node Thévenin equivalents is associated with a message of size $m_k$, the gather routine follows the procedure described in Figure 3.4. In light of the PLogP model (Kielmann et al., 2000), the time that the link solver takes to receive $p$ consecutive messages of size $m_k$ corresponds to $L + o_s(m_p) + o_r(m_p) + \sum_{k=1}^{p-1} g(m_k)$ seconds.

Unlike the link currents scatter, the multi-node Thévenin gather procedure has to consider two kinds of messages: full multi-node Thévenin equivalents, consisting of $Z_{b_k}$ and $e_{b_k}$, and the multi-node Thévenin equivalent voltages $e_{b_k}$ alone. While the first kind is sent to the link solver only when a subsystem topology changes, the second is sent whenever the internal current injections vary. The sizes of the messages carrying the multi-node Thévenin equivalents of subsystem $S_k$ are summarized by (3.19),

\[ m_k = \begin{cases} c\, b_k (b_k + 1) & \text{if } Z_{b_k} \text{ and } e_{b_k} \text{ are sent,} \\ c\, b_k & \text{if only } e_{b_k} \text{ is sent,} \end{cases} \tag{3.19} \]

where $c$ represents the amount of memory, in bytes, that each element of either $Z_{b_k}$ or $e_{b_k}$ takes.
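As a rough illustration of the message-size rule (3.19) and of the timing expression and bound discussed above, the sketch below evaluates them in Python. The function names, the linear gap model, the element size of 16 bytes and all numeric values are assumptions made for this example only, not measurements; the overheads are arbitrarily set to half the gap so that $g(m) \ge o_s(m)$ and $g(m) \ge o_r(m)$ hold.

```python
def message_size(b_k, c=16, full=False):
    """Message size in bytes, following (3.16) and (3.19).

    c = 16 assumes one double-precision complex number per element.
    """
    return c * b_k * (b_k + 1) if full else c * b_k


def transfer_time(latency, o_s, o_r, g, sizes):
    """L + o_s(m_p) + o_r(m_p) + sum_{k=1}^{p-1} g(m_k), as in (3.15)."""
    m_p = sizes[-1]
    return latency + o_s(m_p) + o_r(m_p) + sum(g(m) for m in sizes[:-1])


def transfer_bound(latency, g, sizes):
    """Upper bound L + (p + 1) g(m_max), as in (3.18)."""
    return latency + (len(sizes) + 1) * g(max(sizes))


# Hypothetical network model: times in microseconds, sizes in bytes.
def g(m):
    return 5.0 + 0.005 * m


def o_s(m):
    return 0.5 * g(m)


def o_r(m):
    return 0.5 * g(m)


border_nodes = [40, 100, 25, 60]  # hypothetical b_k for p = 4 subsystems
sizes = [message_size(b) for b in border_nodes]
t = transfer_time(3.5, o_s, o_r, g, sizes)
bound = transfer_bound(3.5, g, sizes)
print(f"estimate = {t:.2f} us, bound = {bound:.2f} us")
```

Passing `full=True` to `message_size` gives the size of a full Thévenin equivalent, which grows quadratically with the number of border nodes and therefore dominates the gather time whenever a topology change occurs.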
Finally, the communication time $T^{Thv}_{comm}(\mathcal{L})$, perceived by the link solver process, can be calculated as follows:

\[ T^{Thv}_{comm}(\mathcal{L}) = L + o_s(m_p) + o_r(m_p) + \sum_{k=1}^{p-1} g(m_k) \tag{3.20} \]

Notice that when only $e_{b_k}$ needs to be gathered in the link solver, $T^{Thv}_{comm}(\mathcal{L})$ is equal to the link currents scatter time $T^{i}_{comm}(\mathcal{L})$, defined in (3.15). Moreover, the upper bound (3.18) found for $T^{i}_{comm}(\mathcal{L})$ is applicable to $T^{Thv}_{comm}(\mathcal{L})$ as well, which leads to (3.21).

\[ T^{Thv}_{comm}(\mathcal{L}) \le L + (p + 1)\, g(\bar{m}_k) \tag{3.21} \]

3.2.4 MATE Speedup and Efficiency

According to Foster (1995), both speedup and efficiency are relative quantities, usually measured with respect to the best known algorithm available to perform the task under analysis. Since sparsity-based network solvers are widely used in the power system industry (Tinney & Walker, 1967), they are a natural baseline against which to estimate speedup and efficiency.

Sparsity-based solver computation time

When applying sparsity techniques alone to solve a large electric network, two basic tasks need to be performed: first, factorization of the system matrix $Y$ and, second, solution of two sparse triangular systems for each current injection $i$. Thus, keeping in mind that the system topology may change $N_z$ times and its current injections change $N_i$ times, the sparsity-based network solver computation time $T_{SPARSE}$ may be defined as follows:

\[ T_{SPARSE} = T_{1fact}(\mathcal{S}) + (N_z - 1)\, T_{2fact}(\mathcal{S}) + N_i\, T_{solv}(\mathcal{S}) \tag{3.22} \]

where $T_{1fact}(\mathcal{S})$, $T_{2fact}(\mathcal{S})$ and $T_{solv}(\mathcal{S})$ follow the relationships (3.5), (3.6) and (3.7), respectively.
MATE versus Sparsity Alone

Plugging (3.3) and (3.22) into the definitions of speedup and efficiency yields the following expressions:

\[ S_{MATE} = \frac{T_{SPARSE}}{T_{MATE}} \tag{3.23} \]

\[ E_{MATE} = \frac{S_{MATE}}{p + 1} \tag{3.24} \]

The above expressions are multivariable functions that depend on characteristics of the system to be solved, such as system topology and partitioning strategy (see footnote 7). Furthermore, software implementation and hardware capabilities also have a considerable impact on metrics such as speedup and efficiency. Hence, the performance of the MATE algorithm needs to be evaluated on a case-by-case basis, considering the system under study in addition to the computer architecture and software available. In the following section, the proposed MATE performance model will be evaluated for a test case recently presented in Tomim et al. (2008).

3.2.5 MATE Performance Qualitative Analysis

In order to understand qualitatively the constraints on the network-based MATE algorithm, an analysis based on the computational and communication costs discussed previously is presented. For this analysis, the total MATE timing $T_{MATE}$, given by (3.3), is first split into the three distinct components given by (3.25): $T_P$, $T_L$ and $T_C$. The component $T_P$ is associated with the parallel tasks performed by the processes handling the subsystems. The second component, $T_L$, refers to the sequential computations carried out by the link solver. Finally, the component $T_C$ accumulates the communication overhead incurred by the MATE algorithm.
\[ T_P = T^{Thv}_{comp}(S_{max}) + T^{v}_{comp}(S_{max}) \tag{3.25a} \]

\[ T_L = T^{i}_{comp}(\mathcal{L}) \tag{3.25b} \]

\[ T_C = T^{Thv}_{comm}(\mathcal{L}) + T^{i}_{comm}(\mathcal{L}) \tag{3.25c} \]

Employing the expressions obtained in Section 3.2, the set of timings above can be written explicitly in terms of the variables that directly affect the performance of the MATE algorithm, as shown in (3.26).

Footnote 7: Note that $p$ refers to the number of partitions a given system is torn into, and the number of processors employed in the MATE algorithm equals $p + 1$. Therefore, $p + 1$ processors are used to compute the efficiency $E_{MATE}$.

\[ T_P = \left[ N_z k_{sf} + (N_z b_{max} + 2 N_i)\, k_{ss} \right] n_{max} \tag{3.26a} \]

\[ T_L = N_z k_{df}\, l^3 + N_i k_{ds}\, l^2 \tag{3.26b} \]

\[ T_C = (p + 1) \left[ 2 N_i\, g(m_e) + N_z\, g(m_t) \right] \tag{3.26c} \]

where $m_e$ and $m_t$ are the message sizes associated with the Thévenin equivalent voltages and the full Thévenin equivalent, respectively. Furthermore, $n_{max}$ represents the order of the biggest subsystem $S_{max}$, which, for well-balanced partitions, can be approximated by $N/p$, where $p$ is the number of subsystems that form the original untorn system of order $N$.

Under similar circumstances, the sequential sparse solver timing $T_{SPARSE}$, defined in (3.22), assumes the following form:

\[ T_{SPARSE} = (N_z k_{sf} + N_i k_{ss})\, N \tag{3.27} \]

Plugging the expressions defined above into the speedup definition (3.23) gives

\[ S_{MATE} \approx \frac{p}{\dfrac{N_z k_{sf} + (N_z b_{max} + 2 N_i)\, k_{ss}}{N_z k_{sf} + N_i k_{ss}} + \dfrac{N_z k_{df}\, l^3 + N_i k_{ds}\, l^2}{(N_z k_{sf} + N_i k_{ss})\, n_{max}} + \dfrac{(p + 1) \left[ 2 N_i\, g(m_e) + N_z\, g(m_t) \right]}{(N_z k_{sf} + N_i k_{ss})\, n_{max}}} \tag{3.28} \]

Next, two separate analyses of the speedup $S_{MATE}$ introduced in (3.28) are discussed: (a) the case in which the number of factorizations $N_z$ approximates the number of repeated solutions $N_i$, and (b) the case in which the number of repeated solutions $N_i$ is much greater than the number of factorizations $N_z$.
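To make the interplay of the three components concrete, the following Python sketch evaluates (3.26), (3.27) and the resulting speedup (3.23) for the two regimes just mentioned. It is only an illustration: the class and function names are invented here, and every coefficient, as well as the values of $N$, $p$, $l$ and $b_{max}$, is a hypothetical placeholder rather than a fitted or measured value.

```python
from dataclasses import dataclass


@dataclass
class MateCoefficients:
    """Hypothetical cost coefficients, all in seconds."""
    k_sf: float  # sparse factorization time per node
    k_ss: float  # sparse repeated-solution time per node
    k_df: float  # dense factorization coefficient multiplying l^3
    k_ds: float  # dense repeated-solution coefficient multiplying l^2
    g_e: float   # gap g(m_e) for a Thevenin-voltage message
    g_t: float   # gap g(m_t) for a full Thevenin-equivalent message


def mate_timings(c, N, p, l, b_max, Nz, Ni):
    """Return (T_P, T_L, T_C) following (3.26), with n_max ~ N / p."""
    n_max = N / p
    t_p = (Nz * c.k_sf + (Nz * b_max + 2 * Ni) * c.k_ss) * n_max
    t_l = Nz * c.k_df * l**3 + Ni * c.k_ds * l**2
    t_c = (p + 1) * (2 * Ni * c.g_e + Nz * c.g_t)
    return t_p, t_l, t_c


def mate_speedup(c, N, p, l, b_max, Nz, Ni):
    """Speedup (3.23) using T_SPARSE from (3.27)."""
    t_sparse = (Nz * c.k_sf + Ni * c.k_ss) * N
    return t_sparse / sum(mate_timings(c, N, p, l, b_max, Nz, Ni))


coeffs = MateCoefficients(k_sf=1.6e-6, k_ss=2.7e-7, k_df=6.5e-10,
                          k_ds=5.0e-9, g_e=2.0e-5, g_t=1.0e-4)
for Nz, Ni in [(100, 100), (1, 1000)]:
    s = mate_speedup(coeffs, N=14000, p=4, l=60, b_max=40, Nz=Nz, Ni=Ni)
    print(f"Nz={Nz:4d}  Ni={Ni:4d}  speedup ~ {s:.2f}")
```

With the link solver and communication costs set to zero and no factorizations, the function returns exactly $p/2$, because each subsystem is then solved twice for every solution of the untorn system.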
Case for $N_z \approx N_i$

In such a situation, (3.28) becomes:

\[ S_{MATE} \Big|_{N_z \approx N_i} \approx \frac{p}{\dfrac{k_{sf} + (b_{max} + 2)\, k_{ss}}{k_{sf} + k_{ss}} + \dfrac{k_{df}\, l^3 + k_{ds}\, l^2}{(k_{sf} + k_{ss})\, n_{max}} + \dfrac{(p + 1) \left[ 2\, g(m_e) + g(m_t) \right]}{(k_{sf} + k_{ss})\, n_{max}}} \tag{3.29} \]

In general, subsystems become more interconnected as the number of partitions increases, which makes the number of border nodes $b_{max}$, the number of links $l$, and the communication-related gaps $g(m_e)$ and $g(m_t)$ increase as well, while only $n_{max}$ decreases. Therefore, the denominator of (3.29) will always increase. Moreover, the term associated with the number of links $l$ can be expected to grow fastest, due to its cubic power.

As a result, the network-based MATE algorithm will lose performance drastically whenever systems require frequent factorizations. This performance loss can only be alleviated if the hardware/software-associated coefficients $k_{df}$, $k_{ds}$, $g(m_e)$ and $g(m_t)$ can be reduced. However, such a reduction can only be achieved to some extent, by means of specialized hardware and software for dense matrix operations and data communication.

Case for $N_z \ll N_i$

In this case, the speedup achieved with the MATE algorithm approximates the expression given in (3.30).

\[ S_{MATE} \Big|_{N_z \ll N_i} \approx \frac{p}{2 + \dfrac{k_{ds}\, l^2}{k_{ss}\, n_{max}} + \dfrac{2 (p + 1)\, g(m_e)}{k_{ss}\, n_{max}}} \tag{3.30} \]

Following the trend of the previous case, the denominator of the speedup expression still tends to increase as the number of partitions grows. However, it increases at a much slower rate, because the term associated with the number of links $l$ is now quadratic. From the expression above, even for partitions that are very weakly connected, i.e., $l \ll n_{max}$, the speedup would still be limited by $p/2$, which represents the theoretical speedup limit if both the dense operations in the link solver and the data communications were instantaneous. This is explained by the fact that each subsystem is solved twice each time the original untorn system is solved.
In order to keep the speedup as close to $p/2$ as possible, the two varying denominator terms should be much less than unity. In this sense, these terms can be seen as penalty factors on the network-based MATE algorithm, due to the sequential computations performed in the link solver and the data exchange among processes. These penalty factors are defined in (3.31), which also states the constraints that should be imposed on them in order to maximize the algorithm's performance.

\[ \lambda_{link} = \frac{k_{ds}\, l^2}{k_{ss}\, n_{max}} \ll 1 \tag{3.31a} \]

\[ \lambda_{comm} = \frac{2 (p + 1)\, g(m_e)}{k_{ss}\, n_{max}} \ll 1 \tag{3.31b} \]

The inequality given in (3.31a) serves as a constraint on the number of links that a system of size $N$ can have whenever it is torn into $p$ subsystems. In other words, it reflects the fact that the workload in the subsystems has to be much heavier than the workload in the link solver. Therefore, the MATE algorithm has better chances of achieving higher speedups when solving larger systems. Constraint (3.31b) provides a bound on the communication time associated with the Thévenin equivalent voltages, which depends on the inter-message gap $g(m_e)$. Similarly to the previous term, it reflects the fact that the communication overhead of the Thévenin equivalents should be much lower than the computation time. In other words, subsystems cannot be too small; otherwise the time spent communicating the Thévenin equivalents would eventually surpass their computation time, considerably degrading the algorithm's performance.

3.3 Hardware/Software Benchmarks

The parameters that describe the performance of the available hardware and software combination are of fundamental importance for predicting the performance of the network-based MATE implementation without actually implementing it.
Sparse and dense operations modules, as well as interprocess communication kernels, were benchmarked for the distributed computing environment available in the UBC Power Engineering Group Laboratory. As detailed in De Rybel et al. (2008), this environment consists of 16 AMD Athlon™ 64 2.5 GHz PC units built on a single rack, interconnected by two distinct private networks: one built with Dolphin SCI (Scalable Coherent Interface) network cards and another built with Gigabit Ethernet cards.

3.3.1 Sparse Linear Solver Benchmark

For the sparse routines, two specific kernels need to be benchmarked, namely, the LU factorization and the repeated triangular solutions. In this study, the sparse linear solver kernels implemented in the SuperLU 3.0 library were employed (Demmel et al., 1999; Li et al., 2003). For the timing procedure, a number of WECC subsystems, generated in the partitioning stage (see Section 3.4.1), were fully solved, i.e., factorized and solved, 1000 times. Computation time averages are plotted in Figure 3.5.

Another important aspect that has to be considered when solving large sparse linear systems is minimizing the fill-in of the factors L and U during the factorization process. Minimizing the fill-in has a twofold benefit: it reduces both the memory consumption and the number of required floating-point operations. Thus, choosing a good ordering technique for the sparse matrices of interest ensures that they are solved efficiently. For the present work, the employed ordering technique is known as multilevel nested dissection, whose implementation is available in the METIS 4.0 library (Karypis & Kumar, 1998a,b), introduced in Section 3.4.1.

By means of linear regression, the timing data previously acquired were fitted to the modeling functions (3.5), (3.6) and (3.7). A summary of the results of the data fitting procedure is presented in Table 3.1. In this table, the parameters associated with each component of
the modeling functions are shown. The term associated with $\rho^2 n$ is not observable in the fitted function, which shows that the computational effort required by the SuperLU routines to solve the WECC system increases linearly with the size of the subsystems. This is the same behavior expected from sparse band matrices (Alvarado, 1976).

Figure 3.5. Sparse operations benchmark. (Panels: first factorization, same-pattern factorization and repeated solution times in ms versus dimension (×1000), measured and fitted.)

The throughputs of the sparse routines are given in Figure 3.7. The results show that the repeated solutions are more efficient than the factorizations in terms of floating-point operations per second. As for the factorizations, it can be observed that the symbolic factorization required by the first-time factorization considerably reduces the efficiency of the routine when compared to the same-pattern factorization.

Table 3.1. Data fitting summary for the sparse operations.

First Factorization (R² = 0.99749)
Component    Value               C.I. (95%)*
ρn            1.4956 × 10^-11    ±9.7840 × 10^-13
n             5.0778 × 10^-6     ±8.1457 × 10^-8
1            −1.3855 × 10^-3     ±1.1733 × 10^-4

Same Pattern Factorization (R² = 0.99803)
Component    Value               C.I. (95%)*
ρn            2.0521 × 10^-12    ±2.4136 × 10^-13
n             1.5777 × 10^-6     ±2.0094 × 10^-8
1            −4.7687 × 10^-4     ±2.8943 × 10^-5

Repeated Solution (R² = 0.99869)
Component    Value               C.I. (95%)*
ρn            5.6283 × 10^-13    ±3.5275 × 10^-14
n             2.6740 × 10^-7     ±2.9368 × 10^-9
1            −6.7353 × 10^-5     ±4.2301 × 10^-6

* 95% confidence interval for the parameter estimates.

3.3.2 Dense Linear Solver Benchmark

Following the same procedure adopted for timing the sparse solver kernels, both dense routines, namely, the LU factorization and the repeated triangular solutions, were benchmarked.
The dense routines employed in this study are implemented in the GotoBLAS library (Goto, 2006; Goto & Van De Geijn, 2008a,b), which provides a large set of highly optimized BLAS (Basic Linear Algebra Subprograms) (Blackford et al., 2002) and LAPACK (Linear Algebra Package) (Anderson et al., 1999) routines. For the benchmarking process, instead of using matrices extracted from the previous partitions (multi-node Thévenin equivalents, for instance), randomly generated dense matrices were fully solved 1000 times each. The measured timings are shown in Figure 3.6. The results of the data fitting procedure, using the functions defined in (3.11) and (3.12), are shown in Table 3.2.

The throughput for the dense factorization and repeated solution is shown in Figure 3.8. Unlike what was observed for the sparse routines, the factorization is more efficient than the repeated solutions in terms of floating-point operations per second. This can be explained by the fact that, for dense matrices, the factorizations can be performed in blocks. This characteristic allows implementations tuned for modern processors, such as GotoBLAS, to minimize RAM access considerably and maximize cache usage.

Figure 3.6. Dense operations benchmark. (Panels: factorization and repeated solution times in ms versus dimension, measured and fitted.)

Such improved performance of dense matrix operations is often exploited in sparse-oriented algorithms. For instance, blocked strategies for handling sparse matrices are used in the aforementioned SuperLU (Demmel et al., 1999; Li et al., 2003) and in the UMFPACK (Davis, 2004) libraries.
Figure 3.7. Sparse operations throughput. (Throughput in Mflops versus dimension (×1000) for first factorization, same-pattern factorization and repeated solution.)

Figure 3.8. Dense operations throughput. (Throughput in Mflops versus dimension for factorization and repeated solutions.)

Table 3.2. Data fitting summary for the dense operations.

Factorization (R² = 0.99975)
Component    Value               C.I. (95%)*
l³            6.5066 × 10^-10    ±5.2271 × 10^-12
l²            2.7058 × 10^-8     ±1.1134 × 10^-9
l             8.8535 × 10^-8     ±1.9420 × 10^-8

Repeated Solution (R² = 0.99855)
Component    Value               C.I. (95%)*
l²            4.9757 × 10^-9     ±1.9051 × 10^-10
l            −2.2719 × 10^-7     ±9.1490 × 10^-8
1             2.8222 × 10^-6     ±8.9642 × 10^-7

* 95% confidence interval for the parameter estimates.

3.3.3 Communication Libraries Benchmark

As previously discussed in Section 3.2.3, in order to properly characterize a network according to the PLogP model (Kielmann et al., 2000), the latency $L$, the sending overhead $o_s(m)$, the receiving overhead $o_r(m)$ and the inter-message gap $g(m)$ are required. These parameters are influenced not only by the underlying hardware, but also by the software layer responsible for handling the interprocess communications.

The Message Passing Interface (MPI) was chosen for the implementation of the communication operations required by the network-based MATE algorithm. This choice is mainly due to the fact that, since its creation in 1994, the MPI Standard (2008), which specifies the interfaces to a number of point-to-point and collective interprocess communication kernels, has become the de facto standard in the high-performance computing industry. As such, parallel MPI-based programs are portable and can easily be compiled against many hardware-specific MPI implementations.

Among the several MPI implementations publicly available, the well-known MPICH2 (Argonne National Laboratory, 2007) and NMPI (NICEVT, 2005) libraries were selected.
MPICH2 is a widely used MPI implementation, developed by the Argonne National Laboratory, USA, which provides the entire MPI 2.1 standard (MPI Standard, 2008) for shared memory and a number of distributed network stacks, such as TCP, InfiniBand, SCTP and Myrinet. The NMPI library, based on MPICH2, implements the MPI standard over SCI networks. These two ready-to-use MPI libraries are examples of highly tested and hardware-specialized pieces of software, which abstract the programming from the hardware and significantly reduce the development time.

Since two network types are available in the employed parallel computing environment, the acquisition of the network parameters (latency $L$, sending overhead $o_s$, receiving overhead $o_r$ and minimum inter-message gap $g$) not only provides the means to predict the network performance, but also enables a comparison of the two networks. Results of the benchmark, summarized in Appendix C, are depicted in Figures 3.9 and 3.11 for the SCI and Ethernet-based networks available in the UBC Power System Engineering Group's cluster.

The minimum inter-message gaps $g$, shown in Figure 3.9 for both Ethernet and SCI networks, are inversely related to the maximum bandwidth of the network. Since the parameter $g(m)$ denotes the minimum time necessary to send an $m$-byte message to another process, the bandwidth of the network can be calculated as the ratio between the message size $m$ and its associated gap $g(m)$, as shown in (3.32).

\[ B(m) = \frac{m}{g(m)} \tag{3.32} \]

The inter-message gap $g(m)$ plots, shown in Figure 3.11, suggest that, for the Ethernet and SCI networks, $g(m)$ can be approximated by a piecewise linear function, such as (3.33). In the expression, the coefficient $G$ represents the gap per byte in a long message, according to the LogGP model proposed by Alexandrov et al. (1995), whereas $g_0$ coincides with the estimated minimum gap between zero-length messages, $g(0)$.
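The relationship between the linear gap model and the bandwidth (3.32) can be checked numerically with the short Python sketch below. It uses the fitted values of $G$ (in µs/kB) and $g_0$ (in µs) reported in Table 3.3; the function names are invented for this example, and the unit conventions (1 kB = 1024 B and 1 MB = 10^6 B) are assumptions, chosen because they appear to reproduce the maximum bandwidths quoted in the text.

```python
def gap_us(m_kb, G, g0):
    """Linear gap model g(m) = G m + g0, with m in kB and the result in us."""
    return G * m_kb + g0


def bandwidth_mb_per_s(m_kb, G, g0, bytes_per_kb=1024):
    """Bandwidth B(m) = m / g(m) from (3.32).

    Bytes per microsecond are numerically equal to 10^6 bytes per second,
    so the ratio below is directly in MB/s (assuming 1 MB = 10^6 B).
    """
    return m_kb * bytes_per_kb / gap_us(m_kb, G, g0)


# Fitted values from Table 3.3 (Ethernet: fit for messages above 128 kB).
networks = {
    "Ethernet": {"G": 8.4055, "g0": 152.68},
    "SCI": {"G": 5.6415, "g0": 6.6246},
}

for name, fit in networks.items():
    for m_kb in (1, 16, 160, 1e6):
        b = bandwidth_mb_per_s(m_kb, fit["G"], fit["g0"])
        print(f"{name:8s} m = {m_kb:>9} kB  B = {b:7.2f} MB/s")
```

For very long messages the constant $g_0$ becomes negligible and the computed bandwidth approaches `bytes_per_kb / G`, which is the numerical counterpart of the limit $1/G$.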
\[ g(m) = G\, m + g_0 \tag{3.33} \]

As a consequence, for very large messages, the bandwidth $B(m)$ reaches a steady state, i.e., the maximum network bandwidth, which is identical to the reciprocal of $G$, i.e., $B_{max} = 1/G$. According to the fitted parameters for both Ethernet and SCI networks, reported in Table 3.3, the maximum bandwidth of the Gigabit Ethernet network was 122 MB/s, while the SCI network yielded 182 MB/s.

The other parameters, namely the latency $L$ and the overheads $o_s$ and $o_r$ for sending and receiving messages, respectively, were also obtained according to the approach summarized in Appendix C, and are depicted in Figures 3.9 and 3.10.

One similarity observed for both networks is related to MPICH2's design strategies (Argonne National Laboratory, 2007) for handling short and large messages. Two message protocols are usually employed by MPI implementations to differentiate short and large messages, namely, the eager and the rendezvous protocols (see footnote 8). In both the MPICH2 and NMPI libraries, the default threshold for the message size is 128 kB, which shows up as sharp discontinuities in the benchmarks of both networks.

Footnote 8: For more information about these protocols, see Appendix C.

For the Ethernet network, the average values of $o_s$ and $o_r$ vary practically linearly, as observed in Figures 3.9a and 3.10. In this case, the receiving overhead $o_r$ is slightly less than the sending overhead $o_s$ for message sizes smaller than 128 kB. For large messages, however, due to the switch from the eager to the rendezvous protocol, the receiving overhead $o_r$ increases considerably, approaching the inter-message gap $g$, while the sending overhead $o_s$ presents just a slight increase. This increase in $o_r$ is related to the extra handshakes required between processes in order to guarantee that memory is available for the transmission and that the receiver is ready.
For the SCI network, in addition to the eager and rendezvous message protocols, an extra message protocol is provided, which takes advantage of hardware-specific features for sending very short messages of less than 12 kB. In this short message protocol, the messages are encoded so that the receiver can interpret them without communicating with the sender. The extra work required to decode short messages is reflected in the faster rate of increase observed for the receiving overhead $o_r$. As also observed in Figure 3.9b, the sending overhead $o_s$ coincides with the inter-message gap $g$, due to the error checking included in the implementation provided by the NMPI library, which blocks the process until the transfer is complete. In the rendezvous protocol region, similarly to the Gigabit Ethernet behavior, both overheads become practically equal because of the additional handshakes required between processes.

Other aspects of the performance of each network can be extracted by comparing the sending overhead $o_s$ and the inter-message gap $g$, according to Culler et al. (1996). For the Ethernet network, for instance, $o_s < g$ for all benchmarked message sizes, which shows that the hardware is the bottleneck for the communications, since the CPU is able to process data at a much faster pace than the NIC is able to handle them. On the other hand, the SCI network presented $o_s = g$ throughout the tested message size range, which shows that the software is the bottleneck of the communications.

The latencies cannot be measured directly from any process, but only inferred, because they capture the time that a message spends in the network. According to the procedure summarized in Appendix C, the latency $L$ associated with the MPICH2 implementation over the Gigabit Ethernet network was 12.8 µs.
As for the SCI network, because of the blocking behavior of the embedded error-checking functions, the latency cannot be extracted as accurately as for the Gigabit Ethernet network; only estimates can be obtained. In this case, half of the round-trip time for zero-byte messages yields an upper limit for the latency. The measured round-trip time for zero-byte messages was about 7 µs, which limits the latency to 3.5 µs. Such a latency clearly shows the superior processing power of the SCI network.

Table 3.3. Parameter summary for Ethernet and SCI networks.

Ethernet (latency L = 12.8 µs)
Coefficient    Value                C.I. (95%)*
g0 [µs]        0.0 (152.68)†        ±0.0 (±2020.7)†
G [µs/kB]      8.4944 (8.4055)†     ±0.0147 (±14.077)†

SCI (latency L = 3.5 µs)‡
Coefficient    Value                C.I. (95%)*
g0 [µs]        6.6246               ±0.5435
G [µs/kB]      5.6415               ±0.0081

* 95% confidence interval for the parameter estimates.
† Parameters in parentheses are fitted for messages of size greater than 128 kB.
‡ Latency upper bound (half of the round-trip time of a zero-byte message).

One major difference that should also be noticed is the variability of the two networks, which is remarkably lower for the SCI network than for the Ethernet network, as the error bars for the receiving overheads $o_r$ show in Figure 3.9. This variability depends very much on the loading conditions of the network and on the number of connections made to a single network interface. As a consequence, in collective communications, such as those necessary for implementing the network-based MATE algorithm, the Ethernet network is expected to present even higher variability if messages are sent to the same process from many others simultaneously.

Network-based MATE Communication Time

Now that all required network parameters are available, the performance of both Gigabit Ethernet and SCI networks can be compared in the context of the network-based MATE algorithm.
As previously analyzed in Section 3.2.3, the communications required by the network-based MATE algorithm involve gathering the subsystems' Thévenin equivalents in the link solver and scattering the link currents back to the subsystems. The timing estimates for these tasks are defined, respectively, as $T^{Thv}_{comm}(\mathcal{L})$, given in (3.20), and $T^{i}_{comm}(\mathcal{L})$, given in (3.15). Since these timings follow an identical structure, (3.34) will be used for comparing both network interfaces.

\[ T^{MATE}_{comm}(\mathcal{L}) = L + o_r(m) + o_s(m) + (p - 1)\, g(m) \tag{3.34} \]

Figure 3.9. Benchmark results of two MPI implementations over (a) Gigabit Ethernet and (b) SCI (Scalable Coherent Interface) networks. (Each panel: $o_s(m)$, $o_r(m)$ and $g(m)$ in µs versus message size $m$ in kB.)

Figure 3.10. Sending and receiving overheads and inter-message gaps for MPI over SCI and Gigabit Ethernet networks.

The message sizes $m$ depend on the communication task, on the size of the subsystems' Thévenin equivalents and on the number of link currents. For the Thévenin equivalents gather, for instance, there are two types of messages: full Thévenin equivalents, which consist of impedances and voltages, and the Thévenin voltages only. The size of the Thévenin equivalents varies from case to case and depends mostly on the system's topology and on the partitioning strategy.
Although the partitioning of the system will be discussed in the next section, the subsystem information presented in Table 3.4 shows that a real power system with about 15,000 buses, partitioned into 2 up to 14 subsystems, may present Thévenin equivalents of order ranging from 1 to 100. Considering that the Thévenin equivalents are complex-valued and that each complex number takes 16 B of memory, (3.19) gives, for the previous system, message sizes of up to about 1.6 kB for the Thévenin voltages alone and about 160 kB for the full Thévenin equivalents.

In Figure 3.12, the estimated timings $T^{MATE}_{comm}(\mathcal{L})$ are plotted for 2 and 14 subsystems. In the case of two subsystems, the estimated timings for the SCI and Ethernet networks are very close to one another for message sizes smaller than 128 kB, while, above that size, the estimated timings for the Ethernet network exceed those for the SCI network by about 20%. For 14 subsystems, the SCI network is expected to perform better throughout the observed message size range. In this case, the estimated timings for the Gigabit Ethernet network can be up to 55% higher than those estimated for the SCI network.

Figure 3.11. Bandwidth of MPI over SCI and Gigabit Ethernet networks. (Bandwidth in MB/s versus message size $m$ in kB; maximum bandwidths of 181.5 MB/s for SCI and 121.8 MB/s for Ethernet.)

The estimated timing $T^{MATE}_{comm}(\mathcal{L})$, defined in (3.34), however, does not consider data collisions, an assumption that is valid only for the SCI network (IEEE, 1993). In the case of Ethernet networks, data collisions are handled by the CSMA/CD algorithm (REFERENCE). In this algorithm, whenever the Ethernet card detects a data collision, the data is retransmitted after a time delay, also known as the backoff delay, determined by the truncated binary exponential backoff algorithm (see footnote 9).
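The backoff rule can be sketched in a few lines of Python. This is only an illustration of the general technique: after the i-th collision, a random number of slot times between 0 and 2^i − 1 is drawn, and the exponent stops growing once a cap is reached. The function name and the default cap of 10 (the value commonly cited for IEEE 802.3) are assumptions of this example, and the duration of one slot time is deliberately left out because it depends on the Ethernet variant.

```python
import random


def backoff_slots(collisions, rng, max_exponent=10):
    """Number of slot times to wait after the given number of collisions.

    The exponent is truncated at max_exponent (assumed cap), so the
    waiting window stops doubling after that many collisions.
    """
    exponent = min(collisions, max_exponent)
    return rng.randrange(2 ** exponent)  # uniform in 0 .. 2**exponent - 1


rng = random.Random(0)  # fixed seed only to make the illustration repeatable
for i in range(1, 6):
    window = 2 ** min(i, 10) - 1
    print(f"collision {i}: wait {backoff_slots(i, rng)} slots (window 0..{window})")
```

Because the waiting window doubles with every collision, several processes that repeatedly collide can quickly accumulate delays that are large compared with the transfer times estimated by (3.34).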
In the MATE context, such a characteristic may seriously degrade the efficiency of the Thévenin equivalents gathering routine and even lock up all subsequent computations. This undesirable behavior can be explained by the many send requests that the link solver process may receive from the subsystems' processes. The scenario may become even worse when the subsystems' computations are equally balanced and the send requests are posted to the link solver almost simultaneously.

Footnote 9: After $i$ collisions, a random number of slot times between 0 and $2^i - 1$ is chosen. The truncated aspect refers to the maximum number of acceptable increases before the exponentiation stops.

Figure 3.12. Comparison of Ethernet and SCI networks' timings for the network-based MATE communications. (Panels: $T^{MATE}_{comm}(\mathcal{L})$ in ms versus message size $m$ in kB, for 2 and 14 subsystems.)

Due to its better performance and inherent determinism, the SCI network will be employed for the network-based MATE algorithm timings presented in the next section.

3.4 Western Electricity Coordinating Council System

The Western Electricity Coordinating Council (WECC) is responsible for coordinating the bulk electric system, including generation and transmission, serving all or part of the 14 Western American States, in addition to British Columbia and Alberta in Canada. The WECC region encompasses a vast area of nearly 1.8 million square miles, which makes this system the largest and most diverse of the eight regional councils of the North American Electric Reliability Council (NERC) depicted in Figure 3.13 (http://www.wecc.biz/wrap.php?file=wrap/about.html).

For the present analysis, the admittance matrix that represents the WECC electric network will be employed.
The WECC system presented has 14,327 buses and 16,607 branches, resulting in an admittance matrix of order 14,327 with 47,541 non-zero elements. These numbers show that only about 0.023% of the WECC system matrix (47,541 out of 14,327² entries) is actually filled with non-zeros, which characterizes the high level of sparsity of such large power systems. The WECC admittance matrix pattern is shown in Figure 3.14.

3.4.1 WECC System Partitioning

Load balancing and the minimization of both the communication volume between subsystems and the number of links play a vital role when implementing the MATE algorithm. Ideally, subsystems should be as equal in size as possible and have as few links as possible. In such a case, the local computations (e.g., Y matrix factorization and v solution) should require about the same time for each subsystem, and interface operations (e.g., multi-node Thévenin equivalents computation and communication) should be as fast and balanced as possible.

A supporting tool, METIS 4.0 (Karypis & Kumar, 1998b,a), intended for partitioning large unstructured graphs, was employed to assist in the partitioning procedure. This general-purpose graph partitioner takes an undirected graph (defined by the Y matrix topology) as input and splits it into the requested number of partitions, so that they have nearly the same number of nodes and a minimum number of links interconnecting them.

The WECC system was partitioned into 2 to 14 subsystems using two different techniques available in the METIS package, namely:
(a) Multilevel recursive bisection method;
(b) Multilevel k-way method.

The results for each partitioning method are presented in Table 3.4 (see page 74). Based on these results, the aspects that influence the quality of the partitions obtained with both methods will be evaluated in light of the MATE algorithm requirements.
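The quantities used to judge a partition can be computed directly from the network graph and a node-to-subsystem assignment. The sketch below is only illustrative: it assumes that a link is a branch whose end nodes lie in different subsystems and that a border node is a node touched by at least one link. The function name and data layout are hypothetical, and the partition itself is taken as given (for instance, as produced by a graph partitioner).

```python
from collections import defaultdict

def partition_metrics(edges, part):
    """Summarize a partition of an undirected graph.

    edges: iterable of (u, v) node pairs (branches of the network graph)
    part:  dict mapping node -> subsystem index
    Returns (l, n_k, b_k, l_k): total number of links and, per subsystem,
    the number of nodes, border nodes and incident links.
    """
    n_k = defaultdict(int)
    for node, k in part.items():
        n_k[k] += 1

    border = defaultdict(set)   # subsystem -> nodes touched by a link
    l_k = defaultdict(int)      # subsystem -> links incident to it
    l = 0
    for u, v in edges:
        ku, kv = part[u], part[v]
        if ku != kv:            # branch straddles two subsystems: a link
            l += 1
            border[ku].add(u)
            border[kv].add(v)
            l_k[ku] += 1
            l_k[kv] += 1

    b_k = {k: len(border[k]) for k in n_k}
    return l, dict(n_k), b_k, {k: l_k[k] for k in n_k}

if __name__ == "__main__":
    # Small made-up graph: a 6-node ring with one chord, split in two halves.
    edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (1, 4)]
    part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
    print(partition_metrics(edges, part))
```

With this counting, every link is incident to exactly two subsystems, so the per-subsystem link counts add up to twice the total number of links.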
Load balance among subsystems

One of the first aspects to be discussed is how balanced the subsystems' operations are. From the subsystems performance model described by (3.4) and (3.8), and later in (3.25), it can be verified that the number of buses $n_k$ and the number of border nodes $b_k$ considerably affect the load balance among subsystems. In this sense, for both partitioning heuristics, the average subsystem size $\mu(n_k)$ decreases according to $n/p$, where n is the size of the original system and p the number of subsystems, which indicates a fair level of balance among subsystems. More specifically, the smaller standard deviations of $n_k$ (given by $\sigma(n_k)$ in Table 3.4) lead to the conclusion that the subsystems obtained with the recursive bisection method are generally better balanced in terms of $n_k$ than those yielded by the k-way method. In terms of border nodes $b_k$, however, the subsystems are not as well balanced, regardless of the partitioning method, given the larger values of the standard deviations $\sigma(b_k)$ relative to the averages $\mu(b_k)$.

Figure 3.13. North American Electric Reliability Council (NERC) Regions: Florida Reliability Coordinating Council (FRCC), Midwest Reliability Organization (MRO), Northeast Power Coordinating Council (NPCC), Reliability First Corporation (RFC), SERC Reliability Corporation (SERC), Southwest Power Pool (SPP), Texas Regional Entity (TRE), Western Electricity Coordinating Council (WECC). (http://www.nerc.com/page.php?cid=1%7C9%7C119)

Figure 3.14. Western Electricity Coordinating Council (WECC) System admittance matrix (14,327 buses and 16,607 branches; n = 14327, nnz = 47541).
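As a small check on how the balance statistics can be read, the sketch below recomputes the average and the standard deviation of the subsystem sizes for the two-subsystem case, using the minimum and maximum sizes listed in Table 3.4 (7163 and 7164 buses in one case, 7154 and 7173 in the other). The use of the population standard deviation is an assumption; it is the choice that reproduces the tabulated values of 0.50 and 9.50. The helper name is hypothetical.

```python
from statistics import mean, pstdev

def balance_stats(sizes):
    """Return (average, population standard deviation, relative spread)."""
    mu = mean(sizes)
    sigma = pstdev(sizes)       # population (not sample) standard deviation
    return mu, sigma, sigma / mu

if __name__ == "__main__":
    for sizes in ([7163, 7164], [7154, 7173]):
        mu, sigma, rel = balance_stats(sizes)
        print(f"sizes={sizes}: mu={mu:.2f}, sigma={sigma:.2f}, sigma/mu={rel:.2e}")
```

The ratio sigma/mu gives a scale-free measure of imbalance, which makes it easier to compare the spread of the number of buses with the much larger relative spread of the number of border nodes.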
The influence of an unbalanced number of border nodes $b_k$ on the timings of different subsystems can be shown explicitly by combining $T_{comp}^{Thv}(S_k)$ and $T_{comp}^{v}(S_k)$, given by (3.4) and (3.8), into the total subsystem computation time $T_{comp}^{subs}(S_k)$ given below.

\[
T_{comp}^{subs}(S_k) = \left[ T_{1fact}(S_k) + (N_z - 1)\, T_{2fact}(S_k) + 2 N_i\, T_{solv}(S_k) \right] + N_z\, b_k\, T_{solv}(S_k) \tag{3.35}
\]

Bearing in mind that the timings $T_{1fact}(S_k)$, $T_{2fact}(S_k)$ and $T_{solv}(S_k)$, given by (3.5), (3.6) and (3.7), respectively, depend on well-balanced quantities, such as the number of nodes $n_k$ and the branch-to-node ratio $\rho_k$, one can infer that the term between square brackets in (3.35) will also be well balanced across all subsystems. The term that depends on the number of border nodes $b_k$, however, will not. Moreover, this last component also depends on the number of factorizations $N_z$ the subsystems need to undergo. Thus, the fewer factorizations required, the less unbalanced the subsystems' computations become.

Communication volume

The communication volume involved in the MATE algorithm is directly associated with the size of the multi-node Thévenin equivalents that need to be gathered in the link solver, i.e., the number of border nodes $b_k$. This statement becomes evident from the timings $T_{comm}^{Thv}(L)$ and $T_{comm}^{i}(L)$ defined in (3.15) and (3.20).

Both evaluated partitioning algorithms use the number of links l interconnecting all subsystems as the objective function to be minimized. Indirectly, minimizing this objective function also minimizes the total number of border nodes in the system, but not necessarily the local number of border nodes. Imbalances in the size of the subsystems' multi-node Thévenin equivalents are indicated by the relatively large standard deviations $\sigma(b_k)$.

For large systems, the impact of the unbalanced number of border nodes $b_k$ will be hardly comparable to the impact of the subsystems and link solver workload.
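Returning to (3.35), the effect of the border-node imbalance can be made explicit with a short worked step. Consider two subsystems $S_j$ and $S_k$ and assume, as an idealization used only for illustration, that their bracketed terms and their solution times $T_{solv}$ are equal. Subtracting the two computation times then gives:

```latex
\[
T_{comp}^{subs}(S_j) - T_{comp}^{subs}(S_k) \approx N_z \,(b_j - b_k)\, T_{solv}
\]
```

Under this assumption, the imbalance between the two subsystems grows linearly with both the number of factorizations $N_z$ and the difference in their numbers of border nodes.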
For instance, whenever systems have to be factorized often, the workloads in the subsystems and in the link solver are $O(\rho^2 n)$ and $O(l^3)$, respectively, while the communication burden is $O(b_k^2)$. Hence, as long as $b_k^2 \ll \rho^2 n$ and $b_k^2 \ll l^3$, the computational workload will be much more significant than the communication burden.

For systems that require only a few factorizations during a simulation, the minimization of the links that straddle the subsystems can be expected to yield reasonable performance, as long as the constraints introduced in (3.31) are satisfied. Figure 3.15 shows that the communication penalty factors remain below unity for all adopted partitioning schemes. This fact indicates that the subsystems' workload is sufficiently more significant than the communication overhead.

Figure 3.15. Link solver and communication penalty factors relative to the WECC system. (Two panels, link solver penalty and communication penalty versus number of partitions, for the bisection and k-way methods.)

Sequential Link Solver

According to the computation time for the link currents $T_{comp}^{i}(L)$, defined in (3.10), the workload assigned to the link solver is minimal as long as the number of links l that straddle all subsystems is also minimal. From Table 3.4, it can be observed that both tested partitioning algorithms generate roughly similar partitions as far as the number of links l is concerned. Since neither of the partitioning heuristics results in a consistently smaller number of links, they can only be compared for specific numbers of partitions.

Additionally, given that the link solver penalty factors shown in Figure 3.15 remain below unity for all adopted partitioning schemes, one can conclude that the subsystems' workload is still sufficiently larger than the link solver's.
Therefore, for systems that need to be factorized only a few times, reasonable performance should be expected for the chosen partitioning schemes.

Table 3.4. Partitioning of the WECC system using METIS library: (a) multi-level recursive bisection algorithm; (b) multi-level k-way algorithm. Each quantity is listed once per sub-table.

Notation: max and min denote the maximum and minimum values of $x_k$; $\mu(x_k)$ means the average value of $x_k$; $\sigma(x_k)$ means the standard deviation of $x_k$.

p: 2 3 4 5 6 7 8 9 10 11 12 13 14
l: 25 43 82 73 139 103 265 242 151 145 187 187 199
σ(nk): 0.50 0.47 0.43 0.49 1.34 0.45 0.33 0.31 0.64 0.50 0.28 0.92 0.89
σ(nk): 9.50 47.25 91.46 49.83 96.07 35.62 59.00 79.90 74.37 36.25 59.19 18.66 27.70
µ(nk): 7163.50 4775.67 3581.75 2865.40 2387.83 2046.71 1790.88 1591.89 1432.70 1302.45 1193.92 1102.08 1023.36
µ(nk): 7163.50 4775.67 3581.75 2865.40 2387.83 2046.71 1790.88 1591.89 1432.70 1302.45 1193.92 1102.08 1023.36
min nk: 7154 4709 3478 2816 2195 1987 1710 1368 1212 1242 1014 1073 954
min nk: 7163 4775 3581 2865 2386 2046 1790 1591 1432 1302 1193 1100 1022
l: 46 43 71 111 134 125 137 170 152 159 189 201 199
σ(ρk): 0.05 0.35 0.37 0.18 0.24 0.29 0.28 0.30 0.30 0.32 0.33 0.32 0.40
min ρk: 3.69 3.25 3.21 3.30 3.27 3.09 3.11 3.09 2.99 2.98 3.06 2.95 2.85
max ρk: 3.78 4.08 4.25 3.78 3.91 3.98 3.95 4.02 3.93 3.95 3.97 4.15 4.05
µ(ρk): 3.63 3.67 3.62 3.70 3.54 3.55 3.45 3.47 3.57 3.56 3.52 3.47 3.47
σ(ρk): 0.07 0.06 0.12 0.20 0.13 0.19 0.19 0.14 0.25 0.25 0.18 0.26 0.21
min ρk: 3.56 3.58 3.45 3.49 3.37 3.30 3.19 3.24 3.14 3.13 3.15 3.08 3.16
max ρk: 3.69 3.73 3.77 4.05 3.73 3.94 3.84 3.63 4.08 3.89 3.74 3.91 4.02
µ(bk): 41.00 26.67 32.50 39.40 39.83 32.57 29.63 32.11 27.00 25.18 27.00 26.15 24.21
µ(bk): 23.00 25.00 37.25 25.40 41.00 25.86 55.38 45.89 26.50 22.73 25.67 24.08 24.29
µ(ρk): 3.74 3.71 3.68 3.53 3.51 3.54 3.51 3.46 3.49 3.49 3.47 3.44 3.43
max nk: 7173 4813 3682 2943 2455 2092 1840 1631 1474 1344 1229 1132 1049
max nk: 7164 4776 3582 2866 2390 2047 1791 1592 1434 1303 1194 1104 1025
σ(bk): 0.00 9.42 10.47 15.21 21.35 10.51 31.00 18.50 14.54 6.84 15.11 11.87 9.84
σ(bk): 3.00 3.77 7.02 14.32 18.13 6.90 11.64 7.05 8.26 7.81 8.47 9.91 9.01
min bk: 23 13 28 5 17 5 20 10 4 11 3 3 4
min bk: 38 24 25 23 20 24 11 24 8 9 6 7 7
max bk: 23 36 55 49 78 40 111 68 48 34 61 48 40
max bk: 44 32 40 63 62 43 53 44 42 37 37 43 41
µ(lk): 25.00 28.67 41.00 29.20 46.33 29.43 66.25 53.78 30.20 26.36 31.17 28.77 28.43
µ(lk): 46.00 28.67 35.50 44.40 44.67 35.71 34.25 37.78 30.40 28.91 31.50 30.92 28.43
σ(lk): 0.00 11.26 11.25 17.45 26.00 12.00 38.37 21.09 15.20 7.67 19.93 15.67 11.95
σ(lk): 0.00 3.30 8.20 17.06 18.28 8.14 13.27 8.70 10.08 8.32 10.59 12.77 10.93
min lk: 25 13 31 5 18 6 23 10 7 14 3 3 5
min lk: 46 25 27 25 21 27 12 28 9 12 7 8 7
max lk: 25 39 60 56 92 47 133 78 51 38 80 58 48
max lk: 46 33 46 74 66 48 60 56 50 43 45 50 52

3.4.2 Timings and Performance Predictions for the WECC System

To predict timings and performance, 1000 solutions of the WECC system were considered, i.e., $N_i = 1000$, along with three distinct numbers of factorizations, $N_z = \{1, 100, 500\}$. The predicted and measured timings are shown in Figures 3.17, 3.18 and 3.19.

As the number of partitions increases, so does the number of links interconnecting the subsystems, which, in turn, become smaller (see Table 3.4). Such behavior helps reduce the share of the subsystems' computations in the total time, while increasing the workload assigned to the link solver. Due to the significantly increased workload in the link solver, the timing ceases to decrease for higher numbers of partitions.
However, since dense matrix computations are more efficient than sparse computations on modern processors in terms of Mflops, the computational overhead incurred by the link solver increases at a much slower pace than the subsystems' workload decreases. This fact helps the present MATE implementation achieve up to 6 times speedup with respect to the sequential SuperLU algorithm, as shown in Figure 3.23.

Figures 3.17, 3.18 and 3.19 also show that, as the number of factorizations $N_z$ increases, the link solver operations increase with $O(N_z l^3)$ and, therefore, at a much faster rate than the subsystems' sparse operations, which increase with $O\left(\frac{n}{p} N_z\right)$, where n represents the number of buses in the original untorn WECC system. This explains the lower speedups achieved when the system needs to be factorized many times during the simulations, regardless of the number of partitions.

As mentioned in Section 3.4.1, the communication between the subsystems and the link solver has little impact on the global timings, regardless of the number of factorizations, although more data is exchanged whenever a factorization at the subsystem level is required. This is because of the massive amount of computation required in comparison with the amount of data that needs to be exchanged among processes.

In Figures 3.20, 3.21 and 3.22, the MATE algorithm performance is illustrated for the case in which the WECC system is torn into 14 subsystems by means of the multilevel recursive bisection partitioning method. In these graphs, the processes participating in the solution are represented on the x-axis, where process 0 is related to the link solver and the rest to the subsystems. Observe that, for fewer factorizations, the partitions obtained with the METIS library show good computational balance, which deteriorates as the number of required factorizations grows.
This tendency is explained by the fact that the numbers of border nodes $b_k$ are not well balanced across the subsystems (see Table 3.4 for details), which makes the multi-node Thévenin equivalents in distinct subsystems very different.

As for the performance metrics, speedup and efficiency were measured with respect to the SuperLU sparse solver timings. Results are depicted in Figure 3.23. It can be observed that, when the WECC system required just a few factorizations (less than 5% of the steps), the network-based MATE algorithm achieved about 6 times speedup with respect to the sequential SuperLU solver when 14 partitions were considered. For many factorizations (more than 10% of the steps), however, the algorithm speedup degrades considerably, to approximately 3 times for $N_z = 100$ and 2 times for $N_z = 500$.

Although the maximum speedup occurred for the 14-subsystem case, the maximum efficiency was observed for a lower number of partitions. For a small number of factorizations, an efficiency slightly higher than 50% was observed, while lower values were registered for the cases with many factorizations, i.e., less than 30% for $N_z = 100$ and less than 20% for $N_z = 500$.

Figure 3.16. Comparison between MATE timings for multilevel recursive bisection and multilevel k-way partitioning algorithms. (Three panels, for 1, 100 and 500 factorizations: time [s] versus number of subsystems.)

Lastly, the timings obtained with both partitioning algorithms are compared in Figure 3.16. Both multilevel recursive bisection and k-way deliver similar performance to the MATE algorithm when only a few factorizations are required. The differences between the employed partitioning heuristics become evident for higher numbers of factorizations.
These differences in performance result from the larger number of links and local border nodes, as can be observed in Table 3.4, when the WECC system is partitioned into 2, 8 and 9 subsystems.

Figure 3.17. MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 1 factorization and different partitioning strategies: (a) multilevel recursive bisection; (b) multilevel k-way partitioning. (Time [s] and participation [%] of $T_{comp}^{i}(L)$, $T_{comm}^{i}(L)$, $T_{comp}^{Thv}(S)$, $T_{comp}^{v}(S)$ and $T_{comm}^{Thv}(S)$ versus number of partitions.)

Figure 3.18. MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 100 factorizations and different partitioning strategies: (a) multilevel recursive bisection; (b) multilevel k-way partitioning. (Same quantities and axes as Figure 3.17.)

Figure 3.19. MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 500 factorizations and different partitioning strategies: (a) multilevel recursive bisection; (b) multilevel k-way partitioning. (Same quantities and axes as Figure 3.17.)

Figure 3.20. MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 1 factorization and different partitioning strategies: (a) multilevel recursive bisection; (b) multilevel k-way partitioning. (Time [s] per process for $T_{comp}^{v}(S)$, $T_{comp}^{i}(L)$, $T_{comm}^{Thv}(S)$, $T_{fact}(L)$, $T_{comp}^{Thv}(S)$ and $T_{comm}^{i}(L)$. Process 0 relates to L and processes 1-14 to S.)

Figure 3.21. MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 100 factorizations and different partitioning strategies: (a) multilevel recursive bisection; (b) multilevel k-way partitioning. (Same quantities and axes as Figure 3.20. Process 0 relates to L and processes 1-14 to S.)

Figure 3.22. MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 500 factorizations and different partitioning strategies: (a) multilevel recursive bisection; (b) multilevel k-way partitioning. (Same quantities and axes as Figure 3.20. Process 0 relates to L and processes 1-14 to S.)

Figure 3.23. MATE performance metrics for the solution of the WECC system for 1000 steps and (a) 1, (b) 100 and (c) 500 factorizations. Panels: (a) multilevel recursive bisection partitioning; (b) multilevel k-way partitioning. (Speedup and efficiency [%] versus number of partitions.)

3.5 Conclusion

In this chapter, a performance model for a specific implementation of the network-based MATE algorithm was discussed, along with an application to a real large power system, the Western Electricity Coordinating Council (WECC) system, with about 15,000 buses. The approach adopted in the present implementation was to combine traditional sparsity techniques with dense operations, in order to obtain an optimized parallel MATE-based linear solver.
From the developed performance model, it was shown that, for systems that need to be repeatedly solved but factorized only a few times, the theoretical speedup of the network-based MATE algorithm with respect to traditional sparsity-oriented linear solvers is $p/2$, for a system partitioned into p subsystems. As a consequence, for the same type of systems, the efficiency cannot exceed 50%. Nonetheless, compared with the speedup and efficiency values reported in the literature (see Section 1.2), these values can be considered competitive with other parallel algorithms employed in solving sparse linear systems, even on massive supercomputers.

The concept of link solver and communication penalty factors, introduced in (3.31), provides a normalized measure of the influence of the link solver computations, as well as of the communication overhead, in the network-based MATE algorithm, with respect to the subsystems' workload. In the future, such factors could even be used as objective functions in the system partitioning phase.

For the WECC system, the measured speedup was shown to approximate the theoretical one fairly well, with speedups slightly higher than 6 times obtained when solving the WECC system partitioned into 14 subsystems (see Figure 3.23). The efficiency of the algorithm was kept very close to the theoretical 50% for all tested partitioning schemes using the multi-level recursive bisection and multi-level k-way methods (Karypis & Kumar, 1998b).

Chapter 4

MATE-based Parallel Transient Stability

According to Kundur (1994), transient stability is the ability of the power system to maintain synchronism when subjected to a severe transient disturbance, such as a fault on transmission facilities, loss of generation or loss of load. As such, the main objective of transient stability simulations is to detect critical operating conditions under which power systems would lose synchronism due to a large disturbance.
In engineering applications, it is frequently desirable to run many system response simulations to calculate, for example, the effects of different fault locations and types, automatic switching, initial power system operating states, and different network, machine and control-system characteristics (Stott, 1979). Such studies provide vital information for system design, operation planning and, more recently, real-time assessment of large power systems. The sheer volume of computation required by such studies, however, imposes severe constraints on the size and complexity of the systems that can be analyzed by means of on-line Transient Stability Assessment (TSA) tools. Hence, speeding up transient stability calculations is one way of improving the reliability of operation and the security of modern power systems (Andersson et al., 2005).

In order to address the need for faster TSA tools, a parallel transient stability simulator, which employs the network-based MATE algorithm as the back-end for network solutions, will be presented. This implementation differs from other parallel transient stability programs in the sense that it is inherently designed for distributed-memory computing systems. This is because the features of the network-based MATE algorithm allow the complete separation of subsystems. This approach, in addition to avoiding extraneous data movement between central databases and subsystem processes, also allows the system data to be partitioned according to the topology of the network under study. This characteristic also improves the parallelism of other tasks required by the transient stability simulator, such as the integration of dynamic models.

For the sake of comparison, an implementation of one well-established transient stability solution technique is discussed in its sequential and parallel versions.
Timing results of the actual implementations, along with the speedup and efficiency of the solution, are shown for a reduced version of the Brazilian National Interconnected System, with 1916 buses, 2788 branches and 79 generators.

Figure 4.1. Basic structure of a power system model for transient stability analysis. (Blocks: generators with prime movers, speed regulators with valve/gate control, automatic voltage regulators and power system stabilizers; transmission system; motors; other dynamic devices.)

4.1 Transient Stability Problem

According to the literature (Dommel & Sato, 1972; Stott, 1979; Kundur, 1994), the power system transient stability problem is modelled by means of a set of non-linear differential-algebraic equations (DAEs), which can be summarized as follows:

\[
\dot{x} = f(x, v) \tag{4.1a}
\]
\[
Y v = i(x, v) \tag{4.1b}
\]

where x represents a vector of dynamic variables (or state variables), whose first derivatives $\dot{x}$ are defined by a vector function f, normally dependent on x itself and on the vector of nodal voltages v. In addition, i represents a vector function that defines the nodal current injections, which also depend on the state variables x and the nodal voltages v. Lastly, Y represents the complex-valued nodal admittance matrix of the system under study. For illustration purposes, a basic power system structure for transient stability studies is also given in Figure 4.1.

In this formulation, the set of dynamic equations (4.1a) comprises all non-linear differential equations related to generators and their associated prime movers (e.g., hydro and steam turbines) and controllers (e.g., automatic voltage regulators, speed governors and power system stabilizers), motors, FACTS devices (e.g., HVDC systems and SVCs), as well as any other dynamic device (e.g., dynamic loads).
The structure of the dynamic equations strongly depends on the models used for each of the aforementioned components, which ultimately define the strength of the non-linearities included in the system, as well as its overall response time characteristics.

The set of static equations (4.1b), in turn, includes all algebraic complex-valued nodal equations associated with the passive transmission network. This network is mainly composed of transformers and transmission lines, the stators of generators and motors, and voltage-dependent loads. Depending on the models used for the loads, the algebraic equations may also become non-linear.

The interface between the dynamic and static equations in (4.1) is accomplished by the voltage-dependent algebraic equations associated with each device connected to the passive network. The stator equations of generators are the most important interface equations as far as transient stability is concerned.

4.1.1 Transient Stability Solution Techniques

Many solution variations for the problem defined by (4.1) have been described in the literature. These variations combine different integration methods for the dynamic equations (4.1a), different solution methods for the algebraic equations (4.1b), and different ways of interfacing these dynamic and algebraic equations. A number of these methods are summarized in (Stott, 1979).

All proposed variations, however, fall into two major categories, which are closely related to the interface between the dynamic and algebraic equations. These categories are the (a) alternating (or partitioned) and (b) simultaneous solution approaches.

As pointed out by Stott (1979), the alternating approach is the most traditional method and has been adopted by many industrial-grade transient stability programs.
In addition, since the alternating method relies on repeated network solutions during a simulation, it presents itself as an ideal option for direct application of the network-based MATE algorithm, analyzed in the previous chapters. Therefore, the alternating algorithm will be detailed next, while more information on the simultaneous solution approach can be found in Appendix B.

Alternating Solution Approach

The alternating solution method is characterized by the fact that the dynamic equations (4.1a) are integrated separately from the solution of the algebraic equations (4.1b). In addition, the integration method considered will affect the way the interface between both sets of equations is realized.

In the case of an explicit integration method, such as the explicit Runge-Kutta methods, the derivatives $\dot{x}$ calculated at previous time steps are used for computing new values of x. Subsequently, x is used for the network computation, which turns out to be iterative when non-impedance loads are used. A convenient feature of explicit integration formulas lies in the fact that both sets of dynamic and algebraic equations are truly isolated from one another, and their interface solution is non-iterative. On the other hand, explicit integration methods are weakly stable and often require very small time steps.

In the case of an implicit integration rule, such as the Trapezoidal integration rule, extrapolation techniques are used for predicting the node voltages v, which are then employed during the integration procedure, yielding an approximate x. With the approximate values of x and v, the current injections i can be calculated and used for computing an updated set of voltages v, which are later employed in updating the dynamic equations (4.1a). In comparison with the explicit integration methods, their implicit counterparts present improved numerical robustness and convergence characteristics.
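The alternating scheme with an implicit rule, as described above, can be sketched on a toy problem. The example below is a minimal illustration under stated assumptions and not the simulator discussed in this chapter: it uses a single state variable with dx/dt = -a*x + b*v, a scalar "network" Y*v = c*x, the Trapezoidal rule, and linear extrapolation of v as the predictor. All names and coefficient values are hypothetical.

```python
import math

def simulate(x0, t_end, dt, a=1.0, b=1.0, c=1.0, Y=2.0, tol=1e-12, max_iter=50):
    """Alternating solution of the toy DAE  dx/dt = -a*x + b*v,  Y*v = c*x."""
    def f(x, v):
        return -a * x + b * v

    h = dt / 2.0
    x = x0
    v = c * x / Y                 # consistent initial network solution
    v_old = v                     # previous voltage, used for extrapolation
    for _step in range(round(t_end / dt)):
        x_hist = x + h * f(x, v)              # history term from past values
        v_m = 2.0 * v - v_old                 # predictor: linear extrapolation
        for _it in range(max_iter):
            # Trapezoidal update of the state; closed form since f is linear in x
            x_m = (x_hist + h * b * v_m) / (1.0 + h * a)
            v_next = c * x_m / Y              # network solution with new injection
            converged = abs(v_next - v_m) < tol
            v_m = v_next
            if converged:
                break
        v_old, v, x = v, v_m, x_m
    return x, v

if __name__ == "__main__":
    x, v = simulate(x0=1.0, t_end=1.0, dt=0.01)
    # With these coefficients v = x/2, so dx/dt = -x/2 and x(1) = exp(-0.5).
    print(x, math.exp(-0.5), v)
```

In this toy case the network equation can be eliminated by hand, which provides an analytical reference for checking the result. In a realistic simulator, the scalar division by Y would be replaced by a sparse network solution, which is the step the network-based MATE algorithm is meant to parallelize.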
In order to properly choose an implicit integration method for the solution of the transient stability problem, the requirements of power systems need to be considered first. Power system problems usually tolerate numerical errors on the order of a few percent, which diminishes the importance of the order of the integration method. The implicit method, however, has to present strong numerical stability and, therefore, must not accumulate errors during a simulation. Moreover, the implicit integration method should never produce stable solutions to unstable problems. Bearing these requirements in mind, the Trapezoidal integration rule has proven to be reliable, reasonably accurate and computationally inexpensive in comparison with higher-order implicit formulas (Dommel & Sato, 1972; Stott, 1979).

As such, the set of ordinary differential equations (4.1a) is first discretized according to the Trapezoidal integration rule. This procedure is shown in (4.2).

\[
x(t) = x(t - \Delta t) + \int_{t - \Delta t}^{t} f\big(x(\xi), v(\xi)\big)\, d\xi
\approx x(t - \Delta t) + \frac{\Delta t}{2} \Big[ f\big(x(t), v(t)\big) + f\big(x(t - \Delta t), v(t - \Delta t)\big) \Big] \tag{4.2}
\]

Collecting the present and past values yields (4.3), where $x_h(t)$ represents the history term of the state vector x, which is known at time t.

\[
x(t) = \frac{\Delta t}{2}\, f\big(x(t), v(t)\big) + x_h(t) \tag{4.3a}
\]
\[
x_h(t) = x(t - \Delta t) + \frac{\Delta t}{2}\, f\big(x(t - \Delta t), v(t - \Delta t)\big) \tag{4.3b}
\]

Combining the original set of algebraic equations defined in (4.1b) with the one produced by the integration procedure, given in (4.3), results in the alternating solution method given in (4.4).

\[
x_m = \frac{\Delta t}{2}\, f\big(x_m, v_m\big) + x_h \tag{4.4a}
\]
\[
Y v_{m+1} = i\big(x_m, v_m\big) \tag{4.4b}
\]

where $x_h(t)$, defined in (4.3b), is computed from previous values of x and v, and m = 0, 1, 2, . . . stands for the iteration count.

The equations in (4.4) describe the iterative process needed for the solution at a given time t. The process starts with the computation of $v_m$ by extrapolation from a few immediate past values of v.
Once $v_m$ is computed, a new iterate $x_m$ is obtained from (4.4a). Then, applying the previous $x_m$ and $v_m$ to calculate $i(x_m, v_m)$, (4.4b) can be solved for $v_{m+1}$, which can be used to start a new iteration. The iterative process continues until the difference between $v_{m+1}$ and $v_m$ is negligible.

4.1.2 Transient Stability Models

The set of equations (4.1), which define the transient stability problem, is highly dependent on the power system under investigation, as well as on its associated models, as discussed in Section 4.1. However, before delving into the modelling of power system devices for transient stability studies, one needs to keep in mind that the choice of the models has to be in agreement with the time scale of interest. As described by Kundur (1994), transient stability can be subdivided into three categories: short-term (up to 10 seconds), mid-term (from 10 seconds to a few minutes) and long-term stability (from a few minutes to tens of minutes).

The modelling for different time scales may differ considerably. For instance, in short-term stability studies, generators and their respective automatic controls (voltage and speed), in addition to the transmission system with the loads, are often enough, while for long-term stability, boiler dynamics of thermal power plants, penstock and conduit dynamics of hydro plants, automatic generation control, frequency deviation effects on loads and network, and excitation limiters, among others, have to be included.

Moreover, bearing in mind that the present implementation aims at showing the feasibility of parallel transient stability computations employing the network-based MATE algorithm (see Chapters 2 and 3), the modelling requirements for short-term stability will be considered. Thus, the most basic models for short-term transient stability, the generators and the transmission system, will be introduced.
Transmission Network

The basic consideration in transient stability studies, as far as the transmission system modelling goes, is that the frequency of the whole system is assumed to be nearly constant and close to the rated system frequency, such as 50 or 60 Hz. Moreover, the very fast decaying network transients are seldom of concern in the time scale of interest. Therefore, network dynamics are usually neglected in transient stability studies.

Under such considerations, the basic transmission system is modelled by a large set of nodal algebraic equations which relate the voltages at all buses of the system to their current injections. This set of equations is commonly represented by a large, sparse and usually complex-valued admittance matrix $\mathbf{Y}$, which is topologically symmetric but numerically unsymmetric. Furthermore, in transient stability problems, the $\mathbf{Y}$ matrix is also constant between topological changes, such as short-circuits and line tripping. The basic elements considered in the transmission network model are the transmission lines, the transformers and the buses.

(a) Transmission lines

A lumped π circuit, depicted in Figure 4.2, is normally employed for the characterization of transmission lines in transient stability studies. According to Kundur (1994), the parameters of the transmission line π circuit are given as follows:

$$
\bar{Z}_s = Z_C \sinh(\gamma l) \tag{4.5a}
$$

$$
\bar{Y}_{sh} = \frac{2}{Z_C} \tanh\!\left(\frac{\gamma l}{2}\right) \tag{4.5b}
$$

and

$$
Z_C = \sqrt{\frac{R + j\omega_s L}{G + j\omega_s C}} \tag{4.6}
$$

$$
\gamma = \sqrt{(R + j\omega_s L)(G + j\omega_s C)} \tag{4.7}
$$

where

$Z_C$ = characteristic impedance, in [Ω]
$\gamma$ = propagation constant, in [m⁻¹]
$l$ = length of the line, in [m]
$R$ = series resistance, in [Ω/m]
$L$ = series inductance, in [H/m]
$G$ = shunt conductance, in [℧/m]
$C$ = shunt capacitance, in [F/m]

[Figure 4.2. Equivalent π circuit of a transmission line, with series impedance $\bar{Z}_s$ and a shunt admittance $\bar{Y}_{sh}/2$ at each end.]

(b) Transformers

The single line diagram of a 2-winding transformer is shown in Figure 4.3a.
In this representation, the parameter $a$ is the per unit turns ratio and $\bar{Y}_T = 1/\bar{Z}_T$ is the nominal per unit series admittance of the transformer. As presented in Kundur (1994), the associated π circuit of the 2-winding transformer is depicted in Figure 4.3b and defined by the parameters given in (4.8).

$$
\bar{Y}_t = \frac{\bar{Y}_T}{a} \quad \text{[p.u.]} \tag{4.8a}
$$

$$
\bar{Y}_p = \frac{1-a}{a^2}\, \bar{Y}_T \quad \text{[p.u.]} \tag{4.8b}
$$

$$
\bar{Y}_s = \frac{a-1}{a}\, \bar{Y}_T \quad \text{[p.u.]} \tag{4.8c}
$$

[Figure 4.3. Transformer with off-nominal ratio and its π circuit representation: (a) single line diagram with impedance $\bar{Z}_T$ and ratio $a:1$; (b) π circuit with admittances $\bar{Y}_t$, $\bar{Y}_p$ and $\bar{Y}_s$.]

(c) Load buses

Modelling a load bus is a very complicated task due to the variety and seasonality of the loads that can be connected to the same bus, such as lamps, refrigerators, heaters, compressors, motors, computers and so forth. Even if all load models were known, implementing the millions of loads often supplied by power systems would render transient stability simulations infeasible. Therefore, it is common practice to adopt composite models for the load buses as seen from the power system.

One type of model that is widely used in the power system industry is the polynomial model, described in (4.9). The coefficients $p$ and $q$ represent the proportions of the constant power ($p_1$ and $q_1$), constant current ($p_2$ and $q_2$) and constant impedance ($p_3$ and $q_3$) components.

$$
P_L = P_0 \left( p_1 + p_2 V + p_3 V^2 \right) \quad \text{[p.u.]} \tag{4.9a}
$$

$$
Q_L = Q_0 \left( q_1 + q_2 V + q_3 V^2 \right) \quad \text{[p.u.]} \tag{4.9b}
$$

where $p_1 + p_2 + p_3 = q_1 + q_2 + q_3 = 1$.

In order to interface the foregoing composite load model with the network in the alternating method, as summarized by (4.4b), each component of the load needs to be interfaced with the network through an admittance, a current injection, or both.

In the case of the constant current component, its associated injection $\bar{I}_L^i$ is calculated from the steady-state condition, as shown in (4.10), and included in the right-hand side vector, $\mathbf{i}$, of the system of equations defined in (4.4b).

$$
\bar{I}_L^i = \frac{p_2 P_0 - j\, q_2 Q_0}{\bar{V}_0^*} \quad \text{[p.u.]}
$$
(4.10)

For the constant impedance component, similarly to the constant current case, the load admittance $\bar{Y}_L^z$ is calculated from the steady-state solution of the network by means of (4.11). This constant admittance is then included in the admittance matrix that describes the transmission system.

$$
\bar{Y}_L^z = \frac{p_3 P_0 - j\, q_3 Q_0}{V_0^2} \quad \text{[p.u.]} \tag{4.11}
$$

[Figure 4.4. Equivalent circuit of the polynomial load model: the current injections $\bar{I}_L^p$ and $\bar{I}_L^i$ in parallel with the admittances $\bar{Y}_L^p$ and $\bar{Y}_L^z$ at the bus voltage $\bar{V}$.]

For the constant power component, a Norton equivalent is used (Arrillaga & Watson, 2001), where the admittance $\bar{Y}_L^p$ is connected in parallel with a current injection $\bar{I}_L^p$. In this case, the current $\bar{I}_L^p$ provides a power adjustment for changes around the power initially drawn by the admittance $\bar{Y}_L^p$, so that the specified power is enforced. Similarly to the constant impedance case, $\bar{Y}_L^p$ is also included in the admittance matrix of the system.

$$
\bar{I}_L^p = \left( \bar{Y}_L^p - \frac{\bar{S}^*}{V^2} \right) \bar{V} \quad \text{[p.u.]} \tag{4.12a}
$$

$$
\bar{Y}_L^p = p_1 P_0 - j\, q_1 Q_0 \quad \text{[p.u.]} \tag{4.12b}
$$

Synchronous Generator

Synchronous machine models are commonly developed in the machine rotor qd0 reference frame (Kundur, 1994; Krause et al., 2002; Anderson et al., 2003). The full synchronous machine model describes both stator and rotor dynamics, in addition to the mechanical dynamics. However, since the very fast decaying network transients are seldom of concern in the time scale of interest, both the network and the stator dynamics are usually neglected in transient stability studies. Under such considerations, the stator equations become algebraic, while the rotor equations are kept dynamic (Kundur, 1994).

Another important assumption made when modelling synchronous machines for transient stability studies is that the electrical frequency $\omega_e$ of the machines deviates only slightly from the rated frequency of the system $\omega_s$. Under such conditions, the electromechanical torque
$T_e$ and power $P_e$ of the machine, when given in p.u. on the machine base quantities, are interchangeable, i.e., $T_e \approx P_e$.

There is a vast literature on synchronous generator models applied to the transient stability problem (Kundur & Dandeno, 1983; Kundur, 1994; Anderson et al., 2003); hence, only a summary of the classical and transient models will be presented. The extension of the transient to the subtransient model is straightforward and is also covered in the referenced literature.

(a) Mechanical Equations

The mechanical equations, given in (4.13), describe the electrical frequency $\omega_e$ of the machine and the angle between the net magnetic flux in the machine gap and the magnetic flux induced in the field winding, also known as the load angle of the machine, $\delta$.

$$
\dot{\delta} = \omega_s\, \omega \quad \left[\tfrac{\text{rad}}{\text{s}}\right] \tag{4.13a}
$$

$$
\dot{\omega} = \frac{1}{2H} \left( P_m - P_e - D\,\omega \right) \quad \left[\tfrac{\text{p.u.}}{\text{s}}\right] \tag{4.13b}
$$

$$
\omega = \frac{\omega_e - \omega_s}{\omega_s} \quad \text{[p.u.]} \tag{4.13c}
$$

where

$\delta$ = load angle, in [rad]
$\omega$ = frequency deviation from the rated electrical frequency $\omega_s$, in [p.u.]
$\omega_e$ = electrical speed or frequency, in [rad/s]
$\omega_s$ = system rated frequency, in [rad/s]
$P_m$ = mechanical power, in [p.u.]
$P_e$ = electrical power, in [p.u.]
$H$ = inertia constant, in [s]
$D$ = damping factor, in [p.u.]

Applying the Trapezoidal integration rule to (4.13) yields the new set of algebraic equations given in (4.14).

$$
\delta(t) = k_{\delta\omega}\, \omega(t) + \delta_h(t) \quad \text{[rad]} \tag{4.14a}
$$

$$
\omega(t) = k_{\omega P} \big[ P_m(t) - P_e(t) \big] + \omega_h(t) \quad \text{[p.u.]} \tag{4.14b}
$$

$$
\delta_h(t) = \delta(t-\Delta t) + k_{\delta\omega}\, \omega(t-\Delta t) \quad \text{[rad]} \tag{4.14c}
$$

$$
\omega_h(t) = k_{\omega P} \big[ P_m(t-\Delta t) - P_e(t-\Delta t) \big] + k_{\omega h}\, \omega(t-\Delta t) \quad \text{[p.u.]} \tag{4.14d}
$$

where

$$
k_{\delta\omega} = \frac{\omega_s\, \Delta t}{2}, \qquad k_{\omega P} = \frac{\Delta t}{4H + D\,\Delta t}, \qquad k_{\omega h} = \frac{4H - D\,\Delta t}{4H + D\,\Delta t}
$$

(b) Electrical Equations

The algebraic transient equations associated with the stator are given in (4.15), while the dynamic equations which govern the transient voltages induced onto the stator are presented in (4.16). These equations are defined in the qd reference frame of the machine, depicted in Figure 4.5.
$$
e'_q - v_q = r_a\, i_q - x'_d\, i_d \quad \text{[p.u.]} \tag{4.15a}
$$

$$
e'_d - v_d = r_a\, i_d + x'_q\, i_q \quad \text{[p.u.]} \tag{4.15b}
$$

$$
\frac{d e'_q}{dt} = \frac{1}{T'_{do}} \big[ e_{fd} + (x_d - x'_d)\, i_d - e'_q \big] \quad \left[\tfrac{\text{p.u.}}{\text{s}}\right] \tag{4.16a}
$$

$$
\frac{d e'_d}{dt} = \frac{1}{T'_{qo}} \big[ -(x_q - x'_q)\, i_q - e'_d \big] \quad \left[\tfrac{\text{p.u.}}{\text{s}}\right] \tag{4.16b}
$$

and

$$
P_e = e'_d\, i_d + e'_q\, i_q + (x'_d - x'_q)\, i_q\, i_d \quad \text{[p.u.]} \tag{4.17}
$$

where

$v_q, v_d$ = q- and d-axis voltages at the terminal of the machine, in [p.u.]
$i_q, i_d$ = q- and d-axis stator currents, in [p.u.]
$e'_q, e'_d$ = q- and d-axis transient voltages induced at the stator of the machine, in [p.u.]
$e_{fd}$ = excitation field voltage referenced to the stator of the machine, in [p.u.]
$P_e$ = electrical power transferred to the machine shaft, in [p.u.]
$x'_q, x'_d$ = q- and d-axis transient reactances of the machine, in [p.u.]
$T'_{qo}, T'_{do}$ = q- and d-axis open-circuit time constants of the machine, in [s]
$r_a$ = stator resistance, in [p.u.]

[Figure 4.5. Synchronous machine qd reference frame and system reference frame.]

Again, applying the Trapezoidal integration rule to (4.16) results in the set of algebraic equations given in (4.18).

$$
e'_q(t) = k_{q1}\, e_{fd}(t) + k_{q2}\, i_d(t) + e'_{qh}(t) \quad \text{[p.u.]} \tag{4.18a}
$$

$$
e'_d(t) = k_{d2}\, i_q(t) + e'_{dh}(t) \quad \text{[p.u.]} \tag{4.18b}
$$

$$
e'_{qh}(t) = k_{q1}\, e_{fd}(t-\Delta t) + k_{q2}\, i_d(t-\Delta t) + k_{q3}\, e'_q(t-\Delta t) \quad \text{[p.u.]} \tag{4.18c}
$$

$$
e'_{dh}(t) = k_{d2}\, i_q(t-\Delta t) + k_{d3}\, e'_d(t-\Delta t) \quad \text{[p.u.]} \tag{4.18d}
$$

where

$$
k_{q1} = \frac{\Delta t}{2T'_{do} + \Delta t}, \qquad k_{q2} = k_{q1}\,(x_d - x'_d), \qquad k_{q3} = \frac{2T'_{do} - \Delta t}{2T'_{do} + \Delta t}
$$

$$
k_{d1} = \frac{\Delta t}{2T'_{qo} + \Delta t}, \qquad k_{d2} = -k_{d1}\,(x_q - x'_q), \qquad k_{d3} = \frac{2T'_{qo} - \Delta t}{2T'_{qo} + \Delta t}
$$

In case the period of analysis is small compared with $T'_{do}$ and the effect of the amortisseurs is neglected, the machine model can be further simplified by assuming the voltage $e'_q$ constant and ignoring the equations associated with $e'_d$. These assumptions eliminate all differential equations related to the electrical characteristics of the machine (Kundur, 1994).
These are the foundation of the classical synchronous machine model with constant flux linkages.

(c) Network Interface

From (4.4), it can be noticed that dynamic models are interfaced with the network by means of the current injections $\mathbf{i}$, which depend on their state variables $\mathbf{x}$ and node voltages $\mathbf{v}$. In the case of the synchronous machine, the first step in computing the current injections is solving (4.15) for $i_q$ and $i_d$, which results in (4.19).

$$
\begin{bmatrix} i_q \\ i_d \end{bmatrix} = \frac{1}{r_a^2 + x'_d x'_q} \begin{bmatrix} r_a & x'_d \\ -x'_q & r_a \end{bmatrix} \begin{bmatrix} e'_q - v_q \\ e'_d - v_d \end{bmatrix} \tag{4.19}
$$

At this point, the stator current phasor $\bar{I}$ is required for interfacing the synchronous machine with the transmission network. Since each machine has its own qd reference frame, the currents $i_q$ and $i_d$ need to be converted to the synchronous reference frame of the system.

For this task, consider the generic phasor $\bar{A}_{qd}$ illustrated in Figure 4.5, which is originally given in the qd reference frame. In such a case, the phasor $\bar{A}_{qd}$ can be converted to the system reference frame by means of (4.20b), which results in the phasor $\bar{A}$. The procedure denoted by (4.20b) represents a rotation of $\bar{A}_{qd}$ by $\delta$ radians.

$$
\bar{A}_{qd} = \bar{A}_q + \bar{A}_d, \qquad \bar{A}_q = a_q, \qquad \bar{A}_d = j\, a_d \tag{4.20a}
$$

$$
\bar{A} = \bar{A}_{qd}\, e^{j\delta} \tag{4.20b}
$$

Therefore, using (4.20) and (4.19), the stator current $\bar{I}$ in the system reference frame can be found, as given in (4.21).

$$
\bar{I} = \left[ \frac{r_a - j\, x'_q}{r_a^2 + x'_d x'_q}\, (e'_q - v_q) + j\, \frac{r_a - j\, x'_d}{r_a^2 + x'_d x'_q}\, (e'_d - v_d) \right] e^{j\delta} \tag{4.21}
$$

Although (4.21) gives the complex-valued synchronous machine current injection required by (4.4b), Dommel & Sato (1972) report that injecting the foregoing current straight into the transmission network often makes the alternating solution method non-convergent. In order to circumvent such undesirable behavior, the authors proposed describing $\bar{I}$ in terms of a voltage source behind a fictitious admittance $\bar{Y}_M$, given in (4.22).

$$
\bar{Y}_M = \frac{r_a - j\, \tfrac{1}{2} \left( x'_d + x'_q \right)}{r_a^2 + x'_d x'_q} \tag{4.22}
$$

Figure 4.6.
Synchronous generator equivalent circuits.

In this way, one can rewrite (4.21) as follows:

$$
\bar{I} = -\bar{Y}_M \bar{V} + \bar{I}_M \tag{4.23a}
$$

$$
\bar{I}_M = \bar{Y}_M \bar{E}' + \bar{I}_{saliency} \tag{4.23b}
$$

$$
\bar{I}_{saliency} = j\, \frac{\tfrac{1}{2} \left( x'_d - x'_q \right)}{r_a^2 + x'_d x'_q} \left( \bar{E}'^{*} - \bar{V}^{*} \right) e^{j2\delta} \tag{4.23c}
$$

where all phasor quantities are given in the system reference frame and $\bar{E}' = (e'_q + j\, e'_d)\, e^{j\delta}$. The equivalent circuit associated with the synchronous machine, represented by the algebraic equations (4.23), is depicted in Figure 4.6.

For the classical synchronous machine model, a further approximation is often assumed, which ignores the transient saliency of the machine, i.e., $x'_d = x'_q$. This way, the fictitious admittance $\bar{Y}_M$ becomes the inverse of $r_a + j\, x'_d$, which corresponds to the transient impedance of the machine. Moreover, since flux linkages are also considered constant during the period of interest, the voltage $\bar{E}'$ also remains constant.

(d) Initial Conditions

The initial conditions for the differential equations of the machine, given in (4.13) and (4.16), are usually computed from a power flow solution, previously calculated for a specific operating condition of interest. Therefore, before starting a transient stability simulation, each synchronous machine is assumed to be in steady state, i.e., all derivatives are made equal to zero.

Since all generated powers and terminal voltages in the system reference frame are available from the power flow solution, one can first set the derivatives in (4.16) to zero and solve for $e'_q$ and $e'_d$, which yields (4.24).

$$
e'_q = e_{fd} + (x_d - x'_d)\, i_d \tag{4.24a}
$$

$$
e'_d = -(x_q - x'_q)\, i_q \tag{4.24b}
$$

[Figure 4.7. Steady-state equivalent circuit of the synchronous machine: voltage $\bar{E}_q$ behind the impedance $r_a + j\, x_q$, with terminal voltage $\bar{V}$ and current $\bar{I}$.]

Now, combining (4.24) with the stator algebraic equations (4.15) results in (4.25).
$$
v_q = -r_a\, i_q + x_d\, i_d + e_{fd} \tag{4.25a}
$$

$$
v_d = -r_a\, i_d - x_q\, i_q \tag{4.25b}
$$

Rewriting (4.25) using phasors in the system reference frame yields (4.26), which can be seen as a voltage $\bar{E}_q$ behind the impedance $r_a + j\, x_q$, as depicted in Figure 4.7. From (4.26b), it can be observed that the direction of the controlled voltage $\bar{E}_q$ coincides with the q-axis of the machine, as illustrated in Figure 4.8. Hence, the angle of $\bar{E}_q$ equals $\delta$.

$$
\bar{V} = -(r_a + j\, x_q)\, \bar{I} + \bar{E}_q \tag{4.26a}
$$

$$
\bar{E}_q = E_q\, e^{j\delta} \tag{4.26b}
$$

$$
E_q = e_{fd} + (x_d - x_q)\, i_d \tag{4.26c}
$$

Since the machine terminal voltage $\bar{V}$ and the apparent power $\bar{S} = P + j\,Q$ generated by the machine are known from the power flow calculations, one can compute the stator current $\bar{I}$ as shown below.

$$
\bar{I} = \left( \frac{P + j\,Q}{\bar{V}} \right)^{*} \tag{4.27}
$$

[Figure 4.8. Steady-state phasor diagram of the synchronous machine, where $\bar{E}_{fd} = e_{fd}\, e^{j\delta}$ and $\bar{I}_d = j\, i_d\, e^{j\delta}$.]

Now, substituting the previous $\bar{I}$ into (4.26a), one obtains $\bar{E}_q$, which, from (4.26b), gives the load angle $\delta$. Once $\delta$ is known, both current components $i_q$ and $i_d$ can also be found as follows:

$$
i_q = \Re \left\{ \bar{I}\, e^{-j\delta} \right\} \tag{4.28a}
$$

$$
i_d = \Im \left\{ \bar{I}\, e^{-j\delta} \right\} \tag{4.28b}
$$

For the initial value of the excitation field voltage $e_{fd}$, (4.26c) is used along with the voltage $E_q = |\bar{E}_q|$ previously calculated from (4.26a). To finalize the initialization of the electrical differential equations, the transient voltages $e'_q$ and $e'_d$ are calculated from (4.24).

As for the steady state of the mechanical equations (4.13), the frequency deviation $\omega$ must be set to zero, which guarantees that the load angle $\delta$ remains constant. Such a requirement is only met when the mechanical and electrical powers match, i.e.,

$$
P_m = P_e = e'_d\, i_d + e'_q\, i_q + (x'_d - x'_q)\, i_q\, i_d \tag{4.29}
$$

4.2 Sequential Transient Stability Simulator

The reason for implementing a sequential transient stability simulator is twofold.
First, it will serve as the basis for the parallel version, which is discussed in the next section. Second, the sequential simulator will provide a fair base for performance measurement of the parallel transient stability simulator, since both simulators share the same core functions and programming techniques.

The sequential transient stability simulator implemented complies with the alternating solution method, explained in Section 4.1.1. The flow chart of the present implementation is given in Figure 4.9.

The preparation of a simulation starts with a parsing routine that reads the system data file and the contingency information. The system data files include information regarding generators and their associated prime movers and controllers, loads, and any other dynamic device present in the system.

Subsequently, all data structures for dynamic devices, such as generators and loads, are allocated and initialized. Initial conditions for the specified operating condition are also computed for all dynamic devices connected to the system. Still during the pre-processing stage, the transmission system admittance matrix $\mathbf{Y}$ is formed, taking into consideration the models for transmission lines, transformers, loads and generators discussed in Section 4.1.2.

At this point the actual simulation starts. Before the computation of a time step proceeds, contingency statuses are checked, so that any due switching operation can be performed. At the beginning of the simulation, or whenever topology changes occur, the admittance matrix $\mathbf{Y}$ is factorized for later use during the iterative network solution. Non-integrable variables, such as the voltages $\mathbf{v}$ at load and generation buses, are also predicted by extrapolation, and the history terms $\mathbf{x}_h$ associated with the state variables are computed. The formula used for extrapolating the voltages is shown in (4.30) (Stott, 1979).
$$
\bar{V}_{ext} = \frac{\bar{V}^2(t-\Delta t)}{\bar{V}(t-2\Delta t)} \tag{4.30}
$$

At the end of this stage, the equations (4.4) are ready for computation, which is performed next.

Now, the inner loop responsible for solving the nonlinear network starts with the iteration counter $m = 0$. At the beginning of the network solution, the currents injected into the passive network are computed for each dynamic and algebraic device, such as generators and polynomial loads. This procedure encompasses the integration of the dynamic models, denoted by (4.4a), and the computation of the current injections $\mathbf{i}^m$ due to non-impedance loads, required by (4.4b). Afterwards, the voltages $\mathbf{v}^{m+1}$ across the system are solved from (4.4b), using the previously factorized $\mathbf{Y}$ matrix of the transmission network.

With the first iterate $\mathbf{v}^{m+1}$, the dynamic devices can update their state variables. If the difference between the initially predicted and updated voltages is negligible, i.e., $\|\mathbf{v}^{m+1} - \mathbf{v}^{m}\| \le \varepsilon_v$, the simulation can move on to the next time step; otherwise, the iteration counter $m$ is incremented by one, the current injections $\mathbf{i}^m$ are recalculated and the process repeats.

Test cases simulated with an implementation of the present algorithm will be discussed in Section 4.4.

[Figure 4.9. Flow chart of a transient stability program based on the partitioned approach.]
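The time loop and the inner network iteration described in this section can be sketched on a deliberately small case. The example below is only an illustration under stated assumptions: a single classical machine (constant $|\bar{E}'|$, no transient saliency) tied to an infinite bus through one reactance, so that the network equation (4.4b) collapses to a single complex equation. All numerical values and function names are hypothetical and are not taken from the simulator described here. The electrical power is evaluated with the latest load-angle iterate, which is one simple way of handling the implicit relation (4.4a).

```python
import cmath
import math


def simulate(t_final=0.5, dt=0.01, delta0=0.5, pm_step=0.0, tol=1e-10, max_iter=50):
    """Alternating solution for one classical machine tied to an infinite bus.

    Returns the final load angle and the largest inner iteration count used.
    """
    ws = 2.0 * math.pi * 60.0            # rated frequency [rad/s]
    inertia, damping = 3.5, 0.0          # H [s] and D [p.u.], assumed values
    e_mag = 1.1                          # constant |E'| [p.u.]
    y_m = 1.0 / complex(0.0, 0.3)        # machine admittance, x'd = x'q = 0.3
    y_line = 1.0 / complex(0.0, 0.5)     # tie line to the infinite bus
    v_inf = complex(1.0, 0.0)

    k_dw = ws * dt / 2.0                 # integration constants of (4.14)
    k_wp = dt / (4.0 * inertia + damping * dt)
    k_wh = (4.0 * inertia - damping * dt) / (4.0 * inertia + damping * dt)

    def solve_network(delta):
        # One-bus form of Y v = i: (Y_M + y_line) V = Y_M E' + y_line V_inf
        i_m = y_m * e_mag * cmath.exp(1j * delta)
        return (i_m + y_line * v_inf) / (y_m + y_line)

    def electrical_power(delta, v):
        e = e_mag * cmath.exp(1j * delta)
        return (e * (y_m * (e - v)).conjugate()).real

    delta, omega = delta0, 0.0
    v = v_prev = solve_network(delta)
    pe = electrical_power(delta, v)
    pm = pe + pm_step                    # steady state plus an optional step
    worst = 0
    for _ in range(round(t_final / dt)):
        delta_h = delta + k_dw * omega               # history terms (4.14c-d)
        omega_h = k_wp * (pm - pe) + k_wh * omega
        v_m = v * v / v_prev                         # extrapolation (4.30)
        v_prev = v
        delta_m = delta
        for m in range(1, max_iter + 1):
            pe_m = electrical_power(delta_m, v_m)
            omega_m = k_wp * (pm - pe_m) + omega_h   # (4.14b)
            delta_m = k_dw * omega_m + delta_h       # (4.14a)
            v_new = solve_network(delta_m)           # network solution (4.4b)
            converged = abs(v_new - v_m) <= tol
            v_m = v_new
            if converged:
                break
        worst = max(worst, m)
        delta, omega, v = delta_m, omega_m, v_m
        pe = electrical_power(delta, v)
    return delta, worst
```

Calling `simulate()` with no disturbance should leave the load angle at its initial value, whereas `simulate(pm_step=0.1)` applies a step in mechanical power and lets the rotor swing. In a realistic system, `solve_network` would be replaced by a sparse solution with the factorized admittance matrix.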
4.3 MATE-based Parallel Transient Stability Simulator

A parallel transient stability simulator based on the network-based MATE algorithm is discussed next. The objective of such an application is to further assess the potential of the network-based MATE algorithm in improving existing transient stability simulators with the least programming effort and investment.

The present parallel transient stability simulator was targeted to run on distributed computing environments, such as computer clusters built with off-the-shelf PCs and network cards. Although distributed memory was the system of choice, all ideas discussed next can be directly applied to shared-memory environments.

This implementation combines the previously presented sequential transient stability simulator, as the basis of the code, and the network-based MATE algorithm, introduced in Chapters 2 and 3, as the back end of the sparse linear system solver. As such, the parallel alternating method for the transient stability solution is summarized by (4.31). In this set of equations, all variables are computed at time $t$, so the time argument is kept implicit, except for the history term $\mathbf{x}_{hk}(t)$, which depends on past computed values.

$$
\mathbf{x}_k^{m} = \frac{\Delta t}{2}\, \mathbf{f}_k\big(\mathbf{x}_k^{m}, \mathbf{v}_k^{m}\big) + \mathbf{x}_{hk} \tag{4.31a}
$$

$$
\mathbf{Y}\, \mathbf{v}^{m+1} = \mathbf{i}\big(\mathbf{x}_k^{m}, \mathbf{v}_k^{m}\big) \tag{4.31b}
$$

where $k = 1, \ldots, p$ indexes the $p$ subsystems into which the original system is torn, $m = 0, 1, \ldots$ is the iteration counter, and the history term $\mathbf{x}_{hk}(t)$ is given below.

$$
\mathbf{x}_{hk}(t) = \mathbf{x}_k(t-\Delta t) + \frac{\Delta t}{2}\, \mathbf{f}_k\big(\mathbf{x}_k(t-\Delta t), \mathbf{v}_k(t-\Delta t)\big) \tag{4.32}
$$

In (4.31), the state variables are presented in a partitioned manner, where each $\mathbf{x}_k^m$, assigned to the subsystem $S_k$, is associated with only a portion of the state variables of the untorn system $\mathbf{x}^m$.
Since each subsystem contains only a subset of the dynamic devices originally connected to the untorn system, each set of discretized dynamic equations, given in (4.31a), depends only on the voltages $\mathbf{v}_k^m$ at buses belonging to the same subsystem $S_k$. Transient stability simulators can benefit considerably from the partitioning of the dynamic equations alone, since most production-grade transient stability simulators usually spend from 60 to 80% of the simulation time integrating the dynamic models (Brasch et al., 1979; Wu et al., 1995).

The actual MATE-based parallel transient stability simulator can be split into three main stages: the partitioning stage, the pre-processing stage and the solution stage. Each of these stages is discussed in turn.

4.3.1 System Partitioning Stage

Similarly to the network-based MATE algorithm, the parallel transient stability simulator needs a system partitioning phase, where subsystems and interconnection links are topologically identified prior to the simulation.

During the partitioning phase, the system under study is torn into a number of subsystems, namely, $p$. Each of these subsystems will, therefore, contain a number of buses, branches and their associated dynamic devices. In turn, these subsystems are interconnected by the system of links.

Since actual power systems evolve incrementally, by the addition of new buses or lines, their associated subsystems are expected to follow the same trend. Keeping this fact in mind, repartitioning of the system is expected to be required mostly when a different number of partitions is needed. Therefore, the approach adopted for this implementation was to split the system data files into the several files required by the subsystems and the link solver. As a consequence, the partitioning routine does not need to be invoked in every simulation.
The subsystem files contain the data associated with their own transmission networks, generators, loads and local links. The link solver file, in turn, contains the global link information. Each link is defined by the following parameters:

Global ID: Identification of the link among the global links.
Local ID: Identification of the link among the local links.
From Subsystem ID: Subsystem which this link is leaving from.
From Bus ID: Bus which this link is leaving from.
To Subsystem ID: Subsystem which this link is arriving to.
To Bus ID: Bus which this link is arriving to.
Orientation: 1 for an orientation matching the one defined for the global link; -1, otherwise.

This set of parameters allows the assembling of the subsystem-to-border mappings $\mathbf{Q}_k$, defined in (2.8), and the link-to-border mappings $\mathbf{R}_k$, defined in (2.21), which ultimately interconnect all subsystems.

For the actual partitioning of the system, the METIS library (Karypis & Kumar, 1998a,b) was employed. This library provides a set of partitioning routines for undirected graphs based on multilevel algorithms, such as multilevel recursive bisection and multilevel k-way partitioning. Among the objective functions minimized during the partitioning phase was the number of global links in the system.

The basic tasks in the system partitioning stage are:

(a) Parsing of the system data and initialization of the system structures.
(b) Undirected graph generation for the transmission network of the system under study. The graph data format is described in (Karypis & Kumar, 1998b).
(c) Undirected subgraph generation for each generated subsystem, including local and global links, according to the partitioning produced by METIS.
(d) Subsystem construction based on the constructed subgraphs and the original system data, such as generators and loads.
(e) Subsystem and link solver file generation.
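The identification of links and border nodes from a given partition can be sketched as follows. This is a minimal illustration, not the partitioning tool itself: the bus-to-subsystem assignment, which METIS would produce, is written by hand for a hypothetical six-bus ring, and the record below keeps only part of the link parameters listed above (the local ID and orientation are omitted). The names `Link`, `find_links` and `border_nodes` are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """Simplified link record (local ID and orientation left out)."""
    global_id: int
    from_subsystem: int
    from_bus: int
    to_subsystem: int
    to_bus: int


def find_links(branches, subsystem_of):
    """Branches whose end buses lie in different subsystems become links."""
    links = []
    for from_bus, to_bus in branches:
        k_from, k_to = subsystem_of[from_bus], subsystem_of[to_bus]
        if k_from != k_to:
            links.append(Link(len(links), k_from, from_bus, k_to, to_bus))
    return links


def border_nodes(links, k):
    """Buses of subsystem k that terminate at least one link."""
    nodes = set()
    for link in links:
        if link.from_subsystem == k:
            nodes.add(link.from_bus)
        if link.to_subsystem == k:
            nodes.add(link.to_bus)
    return sorted(nodes)


# Hypothetical six-bus ring torn into two subsystems of three buses each.
branches = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
subsystem_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
links = find_links(branches, subsystem_of)
```

The number of links found this way is the edge cut of the partition, which is the quantity the partitioner tries to keep small.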
4.3.2 Pre-processing Stage

In the pre-processing stage, the data structures required for storing and accessing the system data are allocated and initialized. Following the network-based MATE algorithm idea, two types of processes are spawned at the beginning of the simulation: subsystem and link solver processes.

According to the flow chart presented in Figure 4.10, each process starts by loading and parsing its appropriate data files. In addition to the system data associated with the transmission network, generators and loads, subsystems also require an extra file with the information regarding the event to be simulated, such as a fault followed by a line tripping.

During this stage, each subsystem $S_k$ identifies its own set of border nodes $B_k$. The border nodes contained in $B_k$ are then used, along with the local link information discussed in Section 4.3.1, to assemble the subsystem-to-border mappings $\mathbf{Q}_k$, defined in (2.8), and the link-to-border mappings $\mathbf{R}_k$, defined in (2.21).

Moreover, during the iterative network solution, the link-to-border mappings $\mathbf{R}_k$ are required by the link solver to form the multi-area Thévenin equivalent from the subsystems' multi-node Thévenin equivalents, as observed from (2.25) and (2.22). Therefore, all mappings $\mathbf{R}_k$ need to be gathered at the link solver prior to the simulation. This will be further discussed in the next section.

[Figure 4.10. Flow chart of the pre-processing stage of the MATE-based parallel transient stability program (continues on Figure 4.11). Subsystem $S_k$: parse subsystem data and contingency information; initialize network, generators, loads and controllers; identify $B_k$; form $\mathbf{Q}_k$ and $\mathbf{R}_k$; send $\mathbf{R}_k$. Link solver: parse link solver data; gather $\mathbf{R}_k$ for $k = 1, \ldots, p$.]

4.3.3 Solution Stage

The solution stage of the parallel transient stability simulator closely follows the procedure discussed in Section 4.2, and is summarized in the flow chart presented in Figure 4.11.
Therefore, only the main differences between the two simulators will be analyzed next. These differences lie exactly at the points where interprocess communications are required, namely, the contingency statuses check, the network-based MATE solver and the convergence check.

[Figure 4.11. Flow chart of the solution stage of the MATE-based parallel transient stability program (continued from Figure 4.10).]

Contingency Statuses Check

Contingencies are usually applied within a subsystem and usually require the refactorization of the subsystem's local admittance matrix $\mathbf{Y}_k$, which, in turn, requires the link solver to rebuild and refactorize the multi-area Thévenin equivalent. Therefore, whenever a subsystem has its admittance matrix refactorized, the link solver needs to receive a signal, so that it can proceed with the proper calculations.

Another aspect to be considered whenever any of the subsystems has to be refactorized is related to the manner in which the MPI-1 standard (MPI Standard, 2008; Gropp et al., 1999) defines the interface for sending or receiving messages. First, one needs to keep in mind that the most basic MPI routines, MPI_SEND for sending and MPI_RECV for receiving messages, require, as arguments, the amount of data of a given type to be sent or received and which processes are participating in the communication.
Now, recall that the multi-node Thévenin equivalents, consisting of $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$, need to be sent from the subsystems to the link solver only when the admittance matrices $\mathbf{Y}_k$ are (re)factorized, and only $\mathbf{e}_k^b$ otherwise. Hence, in the Thévenin equivalents gather (Section 3.2.3), the link solver has to keep track not only of the contingency status changes, but also of which subsystems have changed.

Bearing these aspects in mind, the functionality of the contingency statuses check is described in Figure 4.12. In the adopted approach, each subsystem $S_k$ sends to the link solver its own contingency status $s_k$, calculated according to (4.33a), by means of the MPI_GATHER routine. Afterwards, the link solver contingency status, $s_0$, is computed by finding the maximum of all gathered contingency statuses, as indicated in (4.33b). In this way, $s_0$ informs the link solver when a change occurred in the system, whereas the $s_k$ identify the subsystems that suffered the change.

$$
s_k = \begin{cases} 1 & \text{if } \mathbf{Y}_k \text{ was refactorized} \\ 0 & \text{otherwise} \end{cases} \tag{4.33a}
$$

$$
s_0 = \max_{k=1,\ldots,p} s_k \tag{4.33b}
$$

Network-based MATE Solver

In the present parallel transient stability simulator, the parallel network-based MATE solver (Chapters 2 and 3) is employed in the sparse linear solutions required by the iterative network computations (4.4b). The flow chart presented in Figure 4.13 summarizes the tasks performed by the network-based MATE solver.

At the beginning of a network solution, all subsystems $S_k$ start computing their multi-node Thévenin equivalents, formed by $\mathbf{Z}_k^b$ and $\mathbf{e}_k^b$, according to (2.12), (2.16) and (2.17). As discussed previously, the recalculation of the impedance matrix $\mathbf{Z}_k^b$ depends on the contingency status of the same subsystem, i.e., it is performed only when the admittance matrix $\mathbf{Y}_k$ changes.

Once the Thévenin equivalents are computed, they can be gathered in the link solver process. This collective communication task depends on the contingency statuses of all subsystems. The contingency statuses are first checked on both the subsystem and the link solver processes.
Figure 4.12. Contingency statuses check employed in the parallel MATE-based transient stability simulator. (Each subsystem $S_1, \dots, S_p$ sends its status $s_k$ to the link solver $L$, which computes $s_0 = \max_{k=1,\dots,p} s_k$.)

If a change in any $Y_k$ is detected, the full Thévenin equivalents are sent by the subsystems and received by the link solver; otherwise, only the Thévenin voltages $e^b_k$ are sent and received. After sending the Thévenin equivalents, the subsystems wait for the link currents to be calculated by the link solver.

After the link solver has received the Thévenin equivalents, the multi-area Thévenin equivalent, formed by $Z_l$ and $e_l$, needs to be assembled according to the contingency statuses of the system. The assembling procedure is described by (2.22) and (2.25). If any of the contingency statuses is true, the full multi-area Thévenin equivalent needs to be rebuilt and the impedance matrix $Z_l$ refactorized; otherwise, only the multi-area Thévenin voltages $e_l$ are assembled. Afterwards, the link currents $i_l$ can be computed from $Z_l i_l = e_l$.

With the link currents $i_l$ available, the link solver scatters them among the subsystems, respecting the link-to-border mappings $R_k$. Each subsystem, having received its border node injections $i^b_k$ due to the link currents $i_l$, can finally update its local injections $i^m_k$ and compute its local node voltages $v_k$.

Figure 4.13. Network-based MATE parallel linear solver detail. The flow chart comprises the following steps.

Subsystem $S_k$:
  Thévenin equivalents computation: if $Y_k$ changed, $Z^b_k = Q_k (Y_k)^{-1} (Q_k)^T$; in all cases, $e^b_k = Q_k (Y_k)^{-1} i^m_k$.
  Thévenin equivalents gather: if $Y_k$ changed, send $Z^b_k$ and $e^b_k$; otherwise, send $e^b_k$ only.
  Link currents scatter: receive $i^b_k$ (wait until done).
  Update local injections: $i^m_k \leftarrow (Q_k)^T i^b_k + i^m_k$.
  Compute voltages: solve $Y_k v_k = i^m_k$.

Link solver:
  Thévenin equivalents gather: for $k = 1$ to $p$, receive $Z^b_k$ and $e^b_k$ if $Y_k$ changed, or $e^b_k$ only otherwise; wait for all Thévenin equivalents.
  MATE assembling: if any $Y_k$ changed, $Z_l = Z_{l0} + \sum_{S_k \in S} (R_k)^T Z^b_k R_k$; in all cases, $e_l = \sum_{S_k \in S} (R_k)^T e^b_k$.
  MATE solution: if any $Y_k$ changed, refactorize $Z_l$; solve $Z_l i_l = e_l$.
  Link currents scatter: for $k = 1$ to $p$, send $i^b_k = R_k i_l$.

Convergence Check

At this point, for each subsystem $S_k$, all the nodal voltages $v_k^{m+1}$ have been solved and all the dynamic devices' state variables $x_k^{m+1}$ updated during iteration $m$. Now, the convergence of the solution needs to be checked against the predicted values $v_k^m$ and $x_k^m$, which were used to compute the currents $i_k^m$.

In the convergence check stage, the convergence of each subsystem $S_k$ can be verified through the condition $\lVert v_k^{m+1} - v_k^m \rVert \le \varepsilon_v$, where $\varepsilon_v$ is the error tolerance applied to voltages. Although the convergence check can be performed by each subsystem independently, a global convergence status still needs to be issued to all subsystems and to the link solver, so that all processes know whether or not to start a new iteration.

Figure 4.14. Convergence check employed in the parallel MATE-based transient stability simulator. (Each subsystem $S_k$ sends its flag $c_k$ to the link solver $L$, which takes the maximum and returns $c_{global}$ to all subsystems.)

For this task, the convergence check is performed as illustrated in Figure 4.14. The procedure starts with each subsystem $S_k$ calculating a convergence flag $c_k$, as denoted by (4.34a), which is 1 if the subsystem has not converged and 0 otherwise. Afterwards, all local convergence flags are gathered in the link solver $L$. In the link solver, a global convergence flag $c_{global}$ is calculated, as in (4.34b), and broadcast back to all subsystems. This operation can be performed by the collective MPI_Allreduce routine applied with the MPI_MAX operator (MPI Standard, 2008; Gropp et al., 1999).
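The flag computation and max-reduction described above, formalized in (4.34), can be sketched in a few lines. The sketch is serial: Python's `max` stands in for the MPI_MAX reduction, the infinity norm is an assumed choice of norm, and `local_flag`, `global_flag` and the voltage values are illustrative only.

```python
def local_flag(v_new, v_old, eps_v):
    """Flag c_k: 0 if subsystem k converged, 1 otherwise (assumed infinity norm)."""
    mismatch = max(abs(a - b) for a, b in zip(v_new, v_old))
    return 0 if mismatch <= eps_v else 1


def global_flag(flags):
    """Flag c_global: 1 as soon as any subsystem has not converged."""
    return max(flags)


eps_v = 1e-4
# Hypothetical complex node voltages (p.u.) for two subsystems: (new, old)
iterates = [
    ([1.00 + 0.02j, 0.98 - 0.01j], [1.00 + 0.02j, 0.98 - 0.01j]),  # unchanged
    ([1.01 + 0.00j, 0.97 + 0.03j], [1.00 + 0.00j, 0.97 + 0.03j]),  # moved by 0.01
]
flags = [local_flag(new, old, eps_v) for new, old in iterates]
c_global = global_flag(flags)
print(flags, c_global)  # another iteration is needed while c_global == 1
```

Because the reduction is a maximum over 0/1 flags, a single unconverged subsystem is enough to keep every process iterating, which is the behaviour the global flag is meant to enforce.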
\[ c_k = \begin{cases} 0 & \text{if } \lVert v_k^{m+1} - v_k^m \rVert \le \varepsilon_v \\ 1 & \text{otherwise} \end{cases} \tag{4.34a} \]

\[ c_{global} = \max_{k=1,\dots,p} c_k \tag{4.34b} \]

Whether the program should keep track of the convergence of each subsystem individually is not clear, although it is possible. Under such conditions, different subsystems would perform different numbers of iterations. A direct consequence would be a further imbalance in the computations among the subsystems and, hence, longer idle periods in the faster-converging subsystems, which would have to wait for the slower-converging subsystems to conclude the iterative process. Further investigation of local convergence checks is needed, however, which goes beyond the scope of the present work.

Table 4.1. Summary of the SSBI system.

Element               Quantity
Buses                 1916
Transmission lines    2788
Generators            77
Loads                 1212

4.4 Performance Analysis

To analyze the performance of the parallel transient stability simulator, described in Section 4.3, with respect to its sequential counterpart, discussed in Section 4.2, a reduced version of the South-Southeastern Brazilian Interconnected (hereafter, SSBI) System will be simulated. An overview of the Brazilian Interconnected System in its entirety is shown in Figure 4.15, whereas a summary of the SSBI system is given in Table 4.1.

In addition, two other aspects associated with the parallel transient stability simulation will be studied in this section: the parallelization of the network solution with the network-based MATE algorithm and the parallelization of the integration of the dynamic models. As discussed in Section 4.1, these are the major steps constituting the alternating solution approach for transient stability simulations.

4.4.1 South-Southeastern Brazilian Interconnected System Partitioning

Employing the system partitioning tool described in Section 4.3.1, the SSBI system was torn into 2 up to 14 subsystems.
A summary of the generated partitions is presented in Table 4.4. Since the parallel transient stability simulator implementation relies on repeated solutions of the nodal equations and only a few factorizations due to topology changes, the performance of the network-based MATE algorithm for the present partitioning schemes can be qualitatively evaluated by means of the penalty factors defined in (3.31). Both link solver and communication penalty factors, calculated from the partitioning information given in Table 4.4, are shown in Figure 4.17.

Figure 4.15. Brazilian National Interconnected System (http://www.ons.org.br/conheca_sistema/mapas_sin.aspx#).

Figure 4.16. South-Southeastern Brazilian Interconnected System admittance matrix (1916 buses and 2788 branches; n = 1916, nnz = 6630).

Figure 4.17. Link solver and communication penalty factors relative to the SSBI subsystems summarized in Table 4.4. (Two panels, link solver penalty and communication penalty, each plotted against the number of partitions, 2 to 14, for the bisection and k-way techniques.)

From these graphs, both penalty factors remain below unity for only a very small number of partitions, regardless of the partitioning technique. Thus, a rapid degradation of the performance of the network-based MATE algorithm is expected as the number of partitions increases. In addition, the graphs show that the communication burden represents the biggest bottleneck for the network-based MATE algorithm applied to the present partitions, due to the high values of its associated penalty factor.
Because the penalty factors are quantities normalized with respect to the subsystems' workload, one can further conclude that such performance degradation occurs due to the reduced size of the subsystems in comparison with the link solver and communication workloads.

4.4.2 Timings for the SSBI System

For the subsequent evaluations, a 100 ms short-circuit at the 765 kV Foz do Iguaçu substation, followed by the opening of one of the four circuits that connect the Foz do Iguaçu and Itaipu Bi-national substations, was simulated for 10 s with a time step of 4 ms. The SSBI system was modelled by means of classical synchronous machine models, in addition to constant power loads for voltages above 0.8 p.u. and constant impedance loads otherwise. For more information on the modelling of the system, see Section 4.1.2.

Timings for the sequential transient stability simulation

During the sequential transient stability simulation, 2,501 time steps were solved, which required 7,313 system solutions, i.e., iterations. The average number of iterations per time step was 2.92, with a standard deviation of 0.65, which conforms with the literature (Dommel & Sato, 1972; Stott, 1979). These statistics are summarized in Table 4.2.

The timings recorded for the same simulation are given in Table 4.3. Ignoring the simulation setup and output handling, the network solution corresponded to about 25% of the computation time, while the dynamic models computations amounted to about 65% of the computation time. These proportions are also in agreement with those of production-grade transient stability simulators, as reported in the literature (Brasch et al., 1979; Wu et al., 1995).

Timings for the MATE-based parallel transient stability simulation

The same simulation of the SSBI system was performed using the parallel transient stability simulator described in Section 4.3.
The timings recorded from the simulations, which ignore simulation setup and output handling time, are shown graphically in Figure 4.18. These timings show that the multilevel recursive bisection and k-way partitioning techniques yield similar performance for the parallel simulator. The timings are drastically reduced, from about 8 s to about 3 s, when the number of subsystems varies from two to about eight, while saturation is observed for higher numbers of subsystems. This saturation is explained by the increase of the link solver computational burden and communication overhead, as predicted by the penalty factors shown in Figure 4.17, which presented values lower than unity for just a few subsystems. The convergence check timings also contribute to the communication overhead, which increases with the number of subsystems or, in this case, the number of processors. The other timings, namely, those of the network currents $i_k$ computations, the Thévenin equivalents $Z^b_k$ and $e^b_k$ computations, the solution of the subsystems' voltages $v_k$ and the update of the dynamic models' internal variables, decrease with the number of partitions, due to their close relationship with the size of each subsystem: solving smaller subsystems requires, in general, less computational effort.

Table 4.2. Statistics of the SSBI system simulation.

Item                                            Value
Simulation time                                 10 s
Time step                                       4 ms
Solution steps                                  2,501
System solutions                                7,313
Average of solutions per time step              2.92
Standard deviation of solutions per time step   0.65

Table 4.3. Summary of the sequential transient stability simulation.

Task                              Timing [s]   Participation [%]
Setup                             0.857        5.88
Fault check                       0.001        0.01
Convergence check                 1.277        8.76
History terms and extrapolation   0.305        2.10
Current injections                8.063        55.3
First time factorization          0.009        0.06
Same pattern factorization        0.005        0.04
Voltages solution                 3.290        22.6
Variables update                  0.085        0.59
Output handling                   0.677        4.65
Total                             14.57

The saturation of the performance of the parallel transient stability simulator can be further illustrated by the link solver timings, given in Figure 4.19. In these timings, the link currents $i_l$ computation and scattering increase almost linearly with the number of subsystems. This aspect is reflected in the almost constant participation factors associated with each of these two tasks, i.e., about 70 to 80% for communication and 20 to 30% for computation. These participation factors also reinforce the qualitative information embedded in the penalty factors (Figure 4.17), which predicted a much higher communication overhead than link solver computational burden.

Performance metrics of the MATE-based parallel transient stability simulation

In addition to the performance analysis of the MATE-based parallel transient stability simulator as a whole, the MATE-based parallel network solutions, the dynamic models computations and the convergence and contingency checks will be individually assessed for the simulations performed with the SSBI system. In this manner, the contribution of each parallel computation to the overall simulation can be evaluated.
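The task proportions quoted for the sequential run can be recovered from Table 4.3 with a few lines of arithmetic. The grouping of the table rows into network (NET), dynamic-model (DYN) and check (CCC) tasks below is an assumption made for illustration, since the text does not list it explicitly; with that grouping, and with setup and output handling excluded, the shares come out close to the 25% and 65% stated for the network solution and the dynamic models.

```python
# Timings in seconds, copied from Table 4.3.
timings = {
    "Setup": 0.857,
    "Fault check": 0.001,
    "Convergence check": 1.277,
    "History terms and extrapolation": 0.305,
    "Current injections": 8.063,
    "First time factorization": 0.009,
    "Same pattern factorization": 0.005,
    "Voltages solution": 3.290,
    "Variables update": 0.085,
    "Output handling": 0.677,
}

# Assumed grouping of the rows into the three task classes.
groups = {
    "NET": ["First time factorization", "Same pattern factorization",
            "Voltages solution"],
    "DYN": ["History terms and extrapolation", "Current injections",
            "Variables update"],
    "CCC": ["Fault check", "Convergence check"],
}

task_time = {g: sum(timings[r] for r in rows) for g, rows in groups.items()}
total = sum(task_time.values())  # excludes setup and output handling
fractions = {g: t / total for g, t in task_time.items()}

for g, f in fractions.items():
    print(f"{g}: {task_time[g]:.3f} s, {100 * f:.1f} %")
```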
Following the definitions of speedup and efficiency introduced in Section 1.1.2, the performance metrics of the MATE-based parallel transient stability simulation are given below:

\[ S_{TSS} = \frac{T^p_{TSS}}{T^s_{TSS}}, \qquad E_{TSS} = \frac{S_{TSS}}{p+1} \tag{4.35a} \]

\[ S_{NET} = \frac{T^p_{NET}}{T^s_{NET}}, \qquad E_{NET} = \frac{S_{NET}}{p+1} \tag{4.35b} \]

\[ S_{DYN} = \frac{T^p_{DYN}}{T^s_{DYN}}, \qquad E_{DYN} = \frac{S_{DYN}}{p} \tag{4.35c} \]

\[ S_{CCC} = \frac{T^p_{CCC}}{T^s_{CCC}}, \qquad E_{CCC} = \frac{S_{CCC}}{p+1} \tag{4.35d} \]

where $T^s_\Theta$ and $T^p_\Theta$ denote the time of the computation indicated by $\Theta$ when performed sequentially and in parallel, respectively, and $S_\Theta$ and $E_\Theta$ are the speedup and the efficiency of the parallel computation indicated by $\Theta$, which can be one of the following:

TSS = Transient stability simulation
NET = Network solution
DYN = Computation of the dynamic models
CCC = Convergence and contingency status check

Figure 4.18. Timings of the parallel transient stability simulator for different partitioning heuristics: (a) multilevel recursive bisection; (b) multilevel k-way partitioning. (Each panel shows the total time [s] and the participation [%] of the $i_k$ computation, $Z^b_k$ and $e^b_k$ computation, $v_k$ solution, update, convergence check, link solver, $Z^b_k$ and $e^b_k$ communication and contingency check, against the number of subsystems, 2 to 14.)

Figure 4.19. Timings of the link solver of the parallel transient stability simulator for different partitioning heuristics: (a) multilevel recursive bisection; (b) multilevel k-way partitioning. (Each panel shows the link solver time [s] and the participation [%] of the $i_l$ solution and the $i_l$ scatter, against the number of subsystems, 2 to 14.)

Moreover, the definition of the efficiency of the dynamic models' parallelization, $E_{DYN}$, given in (4.35c), considers only $p$ processes rather than $p + 1$, as assumed for the other tasks. That is because the link solver does not take part in the computations of the dynamic models.

Based on the timings obtained from both sequential and parallel transient stability simulators, the performance metrics defined in (4.35) were then calculated and plotted in Figure 4.21 as functions of the number of subsystems $p$.

Due to the intrinsically parallel nature of the computations of the dynamic models, partitioning this task among distinct processors yielded speedups as high as six times with eight subsystems. From two to four processors, the efficiency even remains higher than 100%, due to speedups higher than $p$, which characterize superlinear speedups.

For the MATE-based parallel network solutions, the achieved speedups saturate around two times for about six subsystems. That is explained by the fact that, as $p$ increases, the subsystems become small compared to the communication overhead and link solver computations, as predicted by the penalty factors previously discussed in Figure 4.17. The efficiency, however, presents values ranging from 30 to 40% for two to six processors, i.e., before the speedup saturation. Compared with other parallel sparse linear solvers reported in the literature (see Chapter 1), these values are still very competitive.

Unlike the previous two tasks, the convergence and contingency status checks present speedups that decrease with the number of subsystems $p$. That is because the computations associated with, mostly, the convergence check are much faster than the required communications.
Even though the computations for the convergence check decrease with the number of subsystems $p$, the communication increases with $O(p \log p)$ (Grama et al., 2003), which makes the speedups decrease for larger numbers of subsystems. The data fitting of the convergence and contingency status check time $T^p_{CCC}$, given in Figure 4.20, confirms the previous statement.

Figure 4.20. Convergence and contingency status check time $T^p_{CCC}$ [s], measured and fitted with a $p \log p$ function, against the number of subsystems.

To further the understanding of the impact of the gains of the individual tasks on the overall parallel transient stability simulation, one can expand the speedup $S_{TSS}$, given in (4.35a), in terms of the individual sequential and parallel timings, as denoted in (4.36).

\[ S_{TSS} = \frac{T^p_{TSS}}{T^s_{TSS}} = \frac{T^p_{NET} + T^p_{DYN} + T^p_{CCC}}{T^s_{NET} + T^s_{DYN} + T^s_{CCC}} \tag{4.36} \]

Rewriting the above expression in terms of the individual speedups defined in (4.35) yields:

\[ S_{TSS} = f^s_{NET} S_{NET} + f^s_{DYN} S_{DYN} + f^s_{CCC} S_{CCC} \tag{4.37} \]

where $f^s_\Theta$ stands for the fraction of the sequential transient stability timing $T^s_{TSS}$ that the task $\Theta \in \{NET, DYN, CCC\}$ consumes, defined as follows:

\[ f^s_\Theta = \frac{T^s_\Theta}{T^s_{TSS}} \tag{4.38} \]

In (4.37), the overall speedup of the transient stability simulation, $S_{TSS}$, is expressed as a combination of the individual speedups of the parallel network solution, $S_{NET}$, the parallel solution of the dynamic and non-linear devices, $S_{DYN}$, and the parallel convergence and contingency status check, $S_{CCC}$. These speedups are weighted according to their corresponding task's proportion in the sequential transient stability simulation, given by the coefficients $f^s_{NET}$, $f^s_{DYN}$ and $f^s_{CCC}$.
Moreover, this relationship expresses, mathematically, that any speedup in the most time-consuming task of the transient stability simulation will have a much stronger impact on the overall performance. From the sequential transient stability timings given in Table 4.3, one can obtain the proportions of each of the tasks, which result in:

\[ f^s_{NET} = 0.25, \qquad f^s_{DYN} = 0.65, \qquad f^s_{CCC} = 0.10 \]

From Figure 4.21, the overall parallel transient stability solution presented a speedup of about 4.3 times, which approximates the solution of (4.37) with $S_{NET} = 2$, $S_{DYN} = 5.5$ and $S_{CCC} = 3$, calculated for 8 subsystems, and the above coefficients $f^s_{NET}$, $f^s_{DYN}$ and $f^s_{CCC}$. These proportions also show that the speedup of the parallel transient stability simulation, $S_{TSS}$, is mostly influenced by the parallelization of the computations associated with the dynamic and non-linear devices, which occurs naturally when partitioning the system according to the network-based MATE algorithm. In the partitioning stage, however, only network topological information was taken into account, which causes imbalances among the subsystems in terms of the dynamic and non-linear devices connected to them, as Table 4.4 shows. Such imbalances also influence the saturation observed in the speedups associated with the devices' computations, $S_{DYN}$, and, consequently, in the speedup of the overall parallel transient stability simulation.

Although the network solutions correspond to only 25% of the sequential transient stability simulation, parallelizing the network solutions by means of the network-based MATE algorithm will definitely help speed up the transient stability solutions, especially for large systems. For instance, from Section 3.4.2, the WECC system (about 15,000 buses) was solved by the network-based MATE algorithm with a 6 times speedup with 14 subsystems.
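As a quick check of the figures above, the weighted sum in (4.37) can be evaluated directly. The sketch below uses only values quoted in the text (the fractions 0.25, 0.65 and 0.10, and the speedups read for 8 subsystems); the function names are hypothetical. For comparison, it also evaluates the harmonic combination $1/S = \sum_\Theta f_\Theta / S_\Theta$, which is the form obtained when every speedup is read as sequential time over parallel time; that second formula is an addition made here, not one taken from the text. The last line evaluates the network term alone for the WECC speedup of 6 mentioned above.

```python
def weighted_speedup(fractions, speedups):
    """Weighted sum of task speedups, as written in (4.37)."""
    return sum(fractions[t] * speedups[t] for t in fractions)


def harmonic_speedup(fractions, speedups):
    """Assumed alternative: total time ratio when S = T_seq / T_par per task."""
    return 1.0 / sum(fractions[t] / speedups[t] for t in fractions)


fractions = {"NET": 0.25, "DYN": 0.65, "CCC": 0.10}
speedups_8 = {"NET": 2.0, "DYN": 5.5, "CCC": 3.0}  # values quoted for 8 subsystems

weighted = weighted_speedup(fractions, speedups_8)
harmonic = harmonic_speedup(fractions, speedups_8)
print(f"(4.37) weighted sum: {weighted:.3f}")
print(f"harmonic form:       {harmonic:.3f}")

# Network term alone if S_NET were 6, as reported for the WECC system
net_term_wecc = fractions["NET"] * 6.0
print(f"network term with S_NET = 6: {net_term_wecc:.2f}")
```

The two combinations differ noticeably for these inputs, so which one applies depends on how the per-task speedups read from Figure 4.21 were defined.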
Now, assuming that the sequential tasks' proportions remain fixed, one may expect a direct contribution to the overall parallel transient stability simulation speedup of about 1.5 times from the parallel network solutions alone.

Figure 4.21. Performance metrics of the parallel transient stability simulator: (a) multilevel recursive bisection algorithm; (b) multilevel k-way partitioning algorithm. (Each panel shows the speedups $S_{TSS}$, $S_{NET}$, $S_{DYN}$ and $S_{CCC}$ and the corresponding efficiencies [%] against the number of subsystems, 2 to 14.)

Table 4.4. Partitioning of the reduced South-Southeastern Brazilian Interconnected system using the METIS library: (a) multi-level recursive bisection algorithm; (b) multi-level k-way algorithm. Here $\mu(x_k)$ is the average, $\sigma(x_k)$ the standard deviation, $\max x_k$ the maximum and $\min x_k$ the minimum value of $x_k$; $g_k$ represents the number of generators and $z_k$ the number of loads. Each quantity is listed for $p = 2, \dots, 14$, with one series per sub-table.

p           2      3      4      5      6      7      8      9      10     11     12     13     14
l           22     25     49     53     69     72     80     70     86     105    104    108    122
l           20     23     44     47     56     64     72     78     89     95     93     106    109
μ(n_k)      958.0  638.7  479.0  383.2  319.3  273.7  239.5  212.9  191.6  174.2  159.7  147.4  136.9   (both series)
σ(n_k)      0.0    0.6    1.6    0.4    0.8    1.0    1.1    0.9    0.8    1.1    0.8    0.8    0.7
σ(n_k)      33.9   19.1   12.8   9.1    7.6    6.7    6.3    5.9    4.6    6.6    6.7    3.5    6.3
min n_k     934    621    467    371    309    265    230    200    185    157    141    140    118
min n_k     958    638    477    383    318    272    238    211    190    172    159    146    136
max n_k     982    659    492    392    328    282    246    219    198    179    165    151    144
max n_k     958    639    481    384    320    275    241    214    193    176    161    149    138
μ(b_k)      18.0   14.0   20.8   18.4   20.0   17.4   16.1   13.0   14.0   15.6   14.0   13.9   13.9
μ(b_k)      16.5   15.0   17.5   16.8   15.7   15.3   14.1   14.1   14.3   14.8   12.4   13.5   13.1
σ(b_k)      1.0    3.6    10.6   7.2    5.8    8.4    7.3    5.2    5.2    6.9    6.5    6.7    7.3
σ(b_k)      1.5    4.5    6.1    5.7    5.5    6.9    5.1    5.5    4.3    5.9    4.2    5.8    5.6
min b_k     17     11     9      7      12     6      6      5      4      4      3      3      4
min b_k     15     10     11     7      8      6      7      5      5      6      4      4      5
max b_k     18     21     25     23     21     24     23     22     22     28     18     27     26
max b_k     19     19     36     26     26     32     27     19     21     28     24     24     31
μ(l_k)      20.0   15.3   22.0   18.8   18.7   18.3   18.0   17.3   17.8   17.3   15.5   16.3   15.6
μ(l_k)      22.0   16.7   24.5   21.2   23.0   20.6   20.0   15.6   17.2   19.1   17.3   16.6   17.4
σ(l_k)      0.0    4.2    9.0    6.6    7.8    7.6    7.6    6.9    6.3    7.6    6.0    7.3    6.5
σ(l_k)      0.0    4.9    11.9   8.7    6.8    10.3   8.6    6.5    7.4    10.1   8.7    8.8    9.7
min l_k     20     11     13     8      9      8      7      9      7      6      5      6      7
min l_k     22     11     10     7      14     8      9      6      5      4      3      5      6
max l_k     20     21     32     27     28     29     28     29     26     35     25     35     32
max l_k     22     23     39     32     31     39     33     27     30     37     32     33     42
μ(g_k)      38.5   25.7   19.2   15.4   12.8   11.0   9.6    8.6    7.7    7.0    6.4    5.9    5.5     (both series)
σ(g_k)      0.7    0.6    5.7    3.0    6.4    5.6    4.5    4.7    4.7    3.8    4.5    5.2    3.1
σ(g_k)      6.4    3.8    4.6    4.4    4.9    6.2    3.3    5.1    4.2    3.8    3.9    3.2    3.4
min g_k     38     25     14     11     8      5      3      3      4      3      2      0      1
min g_k     34     23     13     12     8      5      6      1      3      1      0      2      0
max g_k     39     26     26     19     25     21     17     15     17     15     18     16     11
max g_k     43     30     24     22     22     19     15     18     15     15     14     13     13
μ(z_k)      606.0  404.0  303.0  242.4  202.0  173.1  151.5  134.7  121.2  110.2  101.0  93.2   86.6    (both series)
σ(z_k)      19.8   44.0   18.6   30.5   28.4   18.8   26.6   25.6   19.3   17.2   19.3   18.8   14.7
σ(z_k)      17.0   47.3   35.9   35.8   37.5   26.2   24.6   22.9   22.5   15.0   19.6   18.4   18.9
min z_k     592    355    281    190    154    146    112    93     93     86     74     59     59
min z_k     594    350    266    192    153    133    119    94     80     84     66     62     57
max z_k     620    440    325    269    241    194    189    172    158    139    125    127    109
max z_k     618    438    339    282    245    202    187    169    152    138    134    120    115

4.5 Conclusion

A transient stability simulator based on the network-based MATE algorithm was discussed and implemented in its sequential and parallel versions.
The solution approach adopted for the transient stability problem was the alternating solution method, where differential equations and network equations are solved in an alternating fashion. Basic models for the most elementary power system devices, such as generators, transmission lines, transformers and composite loads, were also presented.

A reduced version of the Brazilian Interconnected System, with about 2,000 buses, was employed in the evaluations. The timings of the implemented sequential transient stability simulator presented a profile similar to those observed in industrial-grade programs: 65% of the computations associated with dynamic and non-linear devices, and 25% related to network solutions.

The performance metrics of the parallel transient stability simulator with respect to its sequential counterpart showed that the parallel network solutions and the distribution of the dynamic and non-linear devices among subsystems are fundamental in shaping the overall parallel transient stability simulation performance. For the simulated system, the parallel computation of the dynamic and non-linear elements achieved speedups as high as six with eight subsystems, while the parallel network solutions based on the network-based MATE algorithm achieved speedups of up to two times for the same number of subsystems. As a consequence, the overall parallel transient stability solution presented a speedup of approximately 4.3 times.

Chapter 5 Conclusion

In this thesis, the network-based Multi-Area Thévenin Equivalent (MATE) algorithm has been proposed and employed in the solution of the transient stability problem in a distributed parallel computing architecture. A theoretical performance analysis of the network-based MATE algorithm has also been developed.
The resultant performance model provides qualitative and quantitative performance measures of the various stages of the algorithm when applied to the solution of a specific network on a given hardware/software setup. A large power system network, the WECC system with 14,327 buses, was solved with the developed network-based MATE algorithm. A speedup of 6 times over conventional sparsity-oriented solutions was obtained with a 14-PC commodity cluster.

A parallel transient stability program was also developed, which employed the network-based MATE algorithm as the backend for the sparse linear solutions. The alternating algorithm for transient stability simulations was implemented on top of the practically intact structure of the network-based MATE algorithm, which provided a straightforward concurrent solution of groups of dynamic elements. Tests on the SSBI system with 1916 buses revealed speedups of up to 4.3 times over a sequential version of the same transient stability solution algorithm.

The main conclusions of the present work are summarized next, along with some possible further investigations and final remarks about the MATE algorithm applied to the solution of bulk power systems.

5.1 Summary of Contributions

The main contributions of this thesis are:

1. Introduction of the network-based MATE algorithm, which further optimizes the matrix-based MATE algorithm in terms of computation and communication overhead

The network-based MATE introduces new concepts, such as border nodes, multi-node and link Thévenin equivalents and their associated mappings (subsystem-to-border and link-to-border). Such concepts proved helpful in reducing the computations performed on the subsystems, as well as the communications required to form the link system in the link solver.

2. Implementation of the network-based MATE algorithm on a commodity cluster employing ready-to-use sparsity and communication libraries

The network-based MATE algorithm was implemented on a PC cluster consisting of several single-core PCs interconnected by a dedicated high-speed network. In order to minimize the code development time, generic and ready-to-use libraries, which implement all computational and communication kernels required by the network-based MATE, were employed. Sparse and dense linear solutions were provided by SuperLU (Li et al., 2003; Demmel et al., 1999) and GotoBLAS (Goto, 2006), respectively. More specifically, the GotoBLAS library provides implementations of BLAS (Basic Linear Algebra Subprograms) (Blackford et al., 2002) and LAPACK (Linear Algebra Package) (Anderson et al., 1999), which define a comprehensive standard interface for linear algebra routines. As for the interprocess communications, the Message Passing Interface (MPI) standard (MPI Standard, 2008) was adopted, whose implementation was provided by the NMPI library (NICEVT, 2005), which is a version of the well-known MPICH2 library (Argonne National Laboratory, 2007) over the Scalable Coherent Interface (SCI) (IEEE, 1993).

Because all the aforementioned libraries are implemented in accordance with widely accepted programming standards, employing such libraries not only minimized the development time but also enhanced the portability of the code, which can be easily compiled for many parallel computing architectures, ranging from PC clusters to supercomputers.

3. Development of a performance model for the network-based MATE algorithm

Developing a performance model for the network-based MATE algorithm enabled the establishment of a theoretical speedup limit for the method, with respect to traditional sequential sparsity-oriented linear solvers.
In addition, the developed performance model helps evaluate the performance of the proposed algorithm for a given hardware/software setup and a specific power system network to be solved. This is a key piece of information that also helps one make decisions regarding the most suitable algorithm and computational environment for solving a given problem.

4. Application of the parallel network-based MATE algorithm for the solution of the network equations associated with the transient stability simulation

The proposed network-based MATE algorithm was employed in the solution of sparse linear systems, common in transient stability programs. For the 14,327-bus system, extracted from the North American Western Electricity Coordinating Council (WECC), speedups closely followed the theoretical speedup of $p/2$, where $p$ is the number of partitions, reaching about 6 times with 14 partitions (Figure 3.23).

5. Implementation of a parallel transient stability simulator based on the parallel network-based MATE algorithm

The network-based MATE algorithm was employed as the backend for the sparse linear solutions required by transient stability simulations. The network-based MATE algorithm not only provided an alternative for solving the network equations in parallel, but also lent a structure suitable for solving groups of differential equations associated with dynamic elements concurrently. Because the integration of the differential equations is the most time-consuming task in most industrial-grade transient stability simulators (about 80% of the computational time), its parallelization becomes essential to yield maximum performance in any parallel transient stability simulator. Employing the alternating algorithm for solving the transient stability problem, tests on the SSBI system with 1916 buses revealed speedups of up to 4.3 times over a sequential version of the same transient stability solution algorithm.
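The comparison against the theoretical limit quoted in contribution 4 amounts to simple arithmetic, sketched below. The function name is hypothetical, and the measured value is the rounded figure of about 6 times taken from the text.

```python
def mate_speedup_limit(p: int) -> float:
    """Theoretical speedup limit p/2 of the network-based MATE algorithm."""
    return p / 2


p = 14
measured = 6.0  # "about 6 times" with 14 partitions
limit = mate_speedup_limit(p)
ratio = measured / limit
print(f"limit = {limit:.1f}, measured/limit = {ratio:.2f}")
```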
5.2 Future Work

The network-based MATE algorithm has proved to be competitive with other parallel algorithms for the solution of large linear systems in transient stability simulations using commodity off-the-shelf PC clusters. However, there are still many related aspects suitable for further investigation, such as:

1. Improved solution of the subsystems' equations

As shown in the thesis, the theoretical speedup limit of the network-based MATE algorithm is $p/2$, where $p$ is the number of partitions. This limit can be roughly explained by the fact that the subsystems need to be solved at least twice during each solution. Therefore, further optimizing the Thévenin equivalents computations may significantly increase this speedup limit. One technique that can be employed for such an optimization is the sparse vector method (Tinney et al., 1985), which can potentially reduce the number of non-zeros handled during the solution of systems with only a few injections. In this case, the theoretical speedup limit would approximate $p/(1+f)$, where $f \in (0, 1]$ represents the computational factor due to the Thévenin equivalent solutions. For instance, as reported by Tinney et al. (1985), sparse vector methods achieved up to 20 times speedup over regular sparse forward and backward substitutions. If sparse vector methods were used for the Thévenin equivalent computations, the computational factor $f = 1/20$ would yield a speedup limit of approximately $\frac{20}{21} p \approx 0.95 p$, which means a practically linear speedup (see footnote 10).

2. Implementation of the MATE algorithm in more advanced parallel architectures

In the present work, each subsystem and the system of equations associated with the interconnections among subsystems, i.e., the link current equations, were concurrently solved on single processors.
Further acceleration of the solutions could be achieved by means of more advanced parallel architectures, such as multi-core processors and graphics processing units (GPUs).

With the tightly-coupled multiprocessing architectures provided by off-the-shelf multi-core processors, the MATE algorithm could be used for the coarse-grain parallelization, i.e., defining subsystems and links, whereas finer-grain parallelism (Huang & Wing, 1979; Wing & Huang, 1980; Wu & Bose, 1995; Amestoy & Duff, 2000; Armstrong et al., 2006) could be further exploited inside each subsystem or link solver. As such, each subsystem and the link solver could be mapped onto different multi-processing units and solved locally in parallel. And, as long as the interprocessor communications, i.e., the number of links, are minimized at the MATE algorithm level, one may expect better scalability of the solutions on SMP machines.

Yet another piece of hardware that has great potential for accelerating floating-point operations is the GPU. Unlike what has been observed for CPU technology, the performance of GPUs is quickly increasing, due to the inherently parallel nature of graphics computations. GPUs can achieve much higher arithmetic intensity with the same transistor count as regular CPUs (Owens et al., 2007).

3. Performance analysis of the multi-level MATE applied to the solution of large power systems

Armstrong et al. (2006) show that multi-level MATE is capable of considerably improving the performance of the MATE algorithm when only dense matrices are handled on a single processor. Due to the relatively small size of the systems solved by the multi-level MATE, sparsity-oriented techniques were not considered. Hence, it seems worthwhile developing a performance model for a sparsity-oriented version of multi-level MATE in order to determine possible performance gains and bottlenecks when applied to solving large power systems.

Footnote 10. The theoretical speedup limit assumes that both communications and link solver computations are negligible, while only forward and backward substitutions are performed.

4. Development of partitioning algorithms specially tailored to meet both computational and communication requirements imposed by the MATE algorithm

Two different generic graph partitioning algorithms, namely, the multi-level recursive bisection and k-way algorithms (Karypis & Kumar, 1998b,a), were employed in this work for defining the subsystems solved by the MATE algorithm. From the results, the subsystems generated by both partitioning heuristics yielded similar performance for the parallel forward and backward solutions implemented through the MATE algorithm. For the factorization, however, the load imbalance among the subsystems was evident, due to the uneven number of border nodes in each subsystem. As for the parallel transient stability program, the computational costs involved in solving equations associated with generators and loads were not considered in the partitioning phase. Therefore, the MATE algorithm would certainly benefit from the development of a partitioning heuristic capable of balancing the load among many subsystems, considering the number of border nodes and the computational costs of generators and loads in each subsystem, while keeping the number of interconnections among the subsystems low.

5. Feasibility of geographically distributed power system simulations based on the MATE algorithm

As pointed out by Wang & Morison (2006), another challenging issue is the modeling of the external system not observable by the SE (state estimator). Inclusion of these models in the real-time system models may require the development of adequate offline equivalent models, which can then be merged with the real-time SE models.
Similarly to the method proposed by Esmaeili & Kouhsari (2007), the network-based MATE algorithm can also aid the deployment of geographically distributed transient stability simulations. In the MATE context, areas managed by distinct utilities can be represented by separate subsystems, whereas the tie lines interconnecting these areas are the links. In this manner, each utility can preserve the locality and the privacy of its own data, and needs only to make its system equivalents available for the sake of the whole system's solution.

5.3 Final Remarks

The MATE algorithm, originally proposed by Martí et al. (2002), has been reformulated from an electric network point of view, which yielded the network-based MATE algorithm introduced in this work. It has been shown that inherent characteristics of electrical networks, revealed by the present development, can reduce both the computational effort and the communication overhead of a parallel implementation of the MATE algorithm.

Results of previous related research and the present implementation of the MATE algorithm on a commodity cluster show that, in order to keep parallel computations efficient, one always has to keep in mind the nature of the problem to be solved and the target computing architecture. Some parallel algorithms may be more suitable to certain architectures than others. For example, SuperLU_DIST, whose goal is scalability when solving extremely large-scale unsymmetric problems on massive distributed-memory supercomputers, may not perform well on a commodity cluster solving systems of moderate size. For the sake of comparison, timings for the parallel network-based MATE and SuperLU_DIST when solving the two power systems studied in Section 3.3 and Section 4.4 on a commodity cluster are shown in Table 5.1.
As can be observed, the network-based MATE scales well up to 8 processors, while SuperLU_DIST suffers from its high communication overhead in comparison with the computational load of each processing node. The timings show, for instance, that the network-based MATE algorithm was able to solve the 14,327-bus WECC system up to 26 times faster than SuperLU_DIST.

In summary, the network-based MATE algorithm can be very efficient for solving the sparse linear systems commonly found in transient stability programs using commodity PC clusters built with off-the-shelf processors.

Table 5.1. Timings for 1 factorization and 1000 repeated solutions using the network-based MATE and SuperLU_DIST on a cluster of 16 AMD Athlon™ 64 2.5 GHz processors interconnected by an SCI-based network.
Processors 1 2 3 4 5 6 7 8
Solver MATE SuperLU MATE SuperLU MATE SuperLU MATE SuperLU MATE SuperLU MATE SuperLU MATE SuperLU MATE SuperLU
Seq. DIST DIST DIST DIST DIST DIST DIST
SSBI† (1,916 buses) 0.43 4.52 0.45 5.24 0.32 5.25 0.26 4.87 0.24 4.58 0.22 4.48 0.21 4.19
WECC† (14,327 buses) 4.6 37.6 4.3 39.3 2.6 38.4 1.8 36.0 1.4 32.6 1.2 30.3 1.1 N/A‡
† All timings are given in [s].
‡ N/A means that the program returned an unknown error.

Bibliography

Alexandrov, A., Ionescu, M. F., Schauser, K. E., & Scheiman, C. (1995). LogGP: Incorporating Long Messages into the LogP Model - One step closer towards a realistic model for parallel computation. Technical report, University of California at Santa Barbara.
Aloisio, G., Bochicchio, M., Scala, M. L., & Sbrizzai, R. (1997). A distributed computing approach for real-time transient stability analysis. Power Systems, IEEE Transactions on, 12(2), 981–987.
Alvarado, F. (1976). Computational complexity in power systems. IEEE Trans. Power App. Syst., 95(4), 1028–1037.
Alvarado, F. (1979). Parallel solution of transient problems by trapezoidal integration. IEEE Trans. Power App. Syst., PAS-98(3), 1080–1090.
Alvarado, F., Reitan, D., & Bahari-Kashani, M. (1977). Sparsity in diakoptic algorithms. IEEE Trans. Power App. Syst., 96(5), 1450–1459.
Alvarado, F., Yu, D., & Betancourt, R. (1990). Partitioned sparse A⁻¹ methods. IEEE Trans. Power Syst., 5(2), 452–459.
Amdahl, G. (1967). Validity of the single processor approach to achieving large-scale computing capabilities. In AFIPS Conference Proceedings (pp. 483–485).
Amestoy, P. R. & Duff, I. S. (2000). Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput. Methods Appl. Mech. Eng., 184, 501–520.
Amestoy, P. R., Duff, I. S., L'Excellent, J., & Li, X. S. (2001). Analysis and comparison of two general sparse solvers for distributed memory computers. ACM Trans. Math. Softw., 27(4), 388–421.
Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., & Sorensen, D. (1999). LAPACK Users' Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics, third edition.
Anderson, P. M. & Fouad, A. A. (2003). Power system control and stability. Institute of Electrical and Electronics Engineers.
Andersson, G., Donalek, P., Farmer, R., Hatziargyriou, N., Kamwa, I., Kundur, P., Martins, N., Paserba, J., Pourbeik, P., Sanchez-Gasca, J., Schulz, R., Stankovic, A., Taylor, C., & Vittal, V. (2005). Causes of the 2003 Major Grid Blackouts in North America and Europe, and Recommended Means to Improve System Dynamic Performance. IEEE Transactions on Power Systems, 20(4), 1922–1928.
Argonne National Laboratory (2007). MPICH2, an implementation of the Message-Passing Interface (MPI). http://www-unix.mcs.anl.gov/mpi/.
Armstrong, M., Martí, J. R., Linares, L. R., & Kundur, P. (2006). Multilevel MATE for efficient simultaneous solution of control systems and nonlinearities in the OVNI simulator. Power Systems, IEEE Transactions on, 21(3), 1250–1259.
Arrillaga, J. & Watson, N. R. (2001).
Computer Modelling of Electrical Power Systems. Wiley, 2nd edition.
Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., & der Vorst, H. V. (1994). Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. Philadelphia, PA: SIAM.
Beowulf.org (2009). Beowulf.org: Overview. http://www.beowulf.org/overview/index.html.
Blackford, L. S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufman, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K., & Whaley, R. C. (2002). An Updated Set of Basic Linear Algebra Subprograms (BLAS). ACM Transactions on Mathematical Software, 28(2), 135–151.
Brasch, F., Van Ness, J., & Kang, S.-C. (1979). The use of a multiprocessor network for the transient stability problem. In Proc. PICA-79 Power Industry Computer Applications Conference IEEE (pp. 337–344).
Bunch, J. R. & Rose, D. J. (1972). Partitioning, Tearing, and Modification of Sparse Linear Systems. Technical Report TR 72-149, Cornell University.
Chai, J. & Bose, A. (1993). Bottlenecks in parallel algorithms for power system stability analysis. IEEE Trans. Power Syst., 8(1), 9–15.
Chai, J., Zhu, N., Bose, A., & Tylavsky, D. (1991). Parallel Newton type methods for power system stability analysis using local and shared memory multiprocessors. IEEE Trans. Power Syst., 6(4), 1539–1545.
Crow, M. & Ilić, M. (1990). The parallel implementation of the waveform relaxation method for transient stability simulations. IEEE Trans. Power Syst., 5(3), 922–932.
Culler, D. E., Liu, L. T., Martin, R. P., & Yoshikawa, C. (1996). LogP performance assessment of fast network interfaces. IEEE Micro, (pp. 35–43).
Davis, T. A. (2004). A column pre-ordering strategy for the unsymmetric-pattern multifrontal method. ACM Trans. Math. Softw., 30(2), 165–195.
De Rybel, T., Tomim, M. A., Singh, A., & Martí, J. R. (2008).
An introduction to open-source linear algebra tools and parallelisation for power system applications. In Electrical Power & Energy Conference, Vancouver, Canada.
Decker, I., Falcão, D., & Kaszkurewicz, E. (1992). Parallel implementation of a power system dynamic simulation methodology using the conjugate gradient method. Power Systems, IEEE Transactions on, 7(1), 458–465.
Decker, I., Falcão, D., & Kaszkurewicz, E. (1996). Conjugate gradient methods for power system dynamic simulation on parallel computers. Power Systems, IEEE Transactions on, 11(3), 1218–1227.
Demmel, J. W., Eisenstat, S. C., Gilbert, J. R., Li, X. S., & Liu, J. W. H. (1999). A supernodal approach to sparse partial pivoting. SIAM Journal on Matrix Analysis and Applications, 20, 720–755.
Dommel, H. & Sato, N. (1972). Fast transient stability solutions. IEEE Transactions on Power Apparatus and Systems, PAS-91, 1643–1650.
Dommel, H. W. (1996). EMTP Theory Book. Vancouver, British Columbia, Canada: Microtran Power System Analysis Corporation, 2nd edition.
Enns, M., Tinney, W., & Alvarado, F. (1990). Sparse matrix inverse factors. Power Systems, IEEE Transactions on, 5(2), 466–473.
Ernst, D., Ruiz-Vega, D., Pavella, M., Hirsch, P., & Sobajic, D. (2001). A unified approach to transient stability contingency filtering, ranking and assessment. IEEE Trans. Power Syst., 16(3), 435–443.
Esmaeili, S. & Kouhsari, S. (2007). A distributed simulation based approach for detailed and decentralized power system transient stability analysis. Electric Power Systems Research, 77(5-6), 673–684.
Foster, I. (1995). Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley.
Goto, K. (2006). Optimized GotoBLAS Libraries. Retrieved in January, 2007.
Goto, K. & Van De Geijn, R. A. (2008a). Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw., 34, 1–25.
Goto, K. & Van De Geijn, R. A. (2008b).
High-performance implementation of the level-3 BLAS. ACM Trans. Math. Softw., 35, 1–14.
Grainger, J. J. & Stevenson, W. D. (1994). Power system analysis. McGraw-Hill, Inc.
Grama, A., Gupta, A., Kumar, V., & Karypis, G. (2003). Introduction to Parallel Computing.
Gropp, W., Lusk, E., & Skjellum, A. (1999). Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, Mass.: MIT Press.
Gropp, W., Lusk, E., & Sterling, T. (2003). Beowulf Cluster Computing with Linux, 2nd Edition. The MIT Press, 2nd edition.
Happ, H. H. (1970). Diakoptics and piecewise methods. IEEE Trans. Power App. Syst., (7), 1373–1382.
Happ, H. H. (1973). Gabriel Kron and Systems Theory. Schenectady, N.Y.: Union College Press.
Happ, H. H. (1974). Diakoptics-the solution of system problems by tearing. Proc. IEEE, 62(7), 930–940.
Hatcher, W., Brasch, F. M., Jr., & Van Ness, J. (1977). A feasibility study for the solution of transient stability problems by multiprocessor structures. IEEE Trans. Power App. Syst., 96(6), 1789–1797.
Ho, C.-W., Ruehli, A., & Brennan, P. (1975). The modified nodal approach to network analysis. IEEE Trans. Circuits Syst., 22(6), 504–509.
Hollman, J. & Martí, J. (2003). Real time network simulation with PC-cluster. IEEE Trans. Power Syst., 18(2), 563–569.
Hong, C. & Shen, C. (2000). Implementation of parallel algorithms for transient stability analysis on a message passing multicomputer. In Power Engineering Society Winter Meeting, 2000. IEEE, volume 2 (pp. 1410–1415 vol.2).
Huang, J. & Wing, O. (1979). Optimal parallel triangulation of a sparse matrix. Circuits and Systems, IEEE Transactions on, 26(9), 726–732.
IEEE (1993). IEEE standard for scalable coherent interface (SCI).
Ilić-Spong, M., Crow, M. L., & Pai, M. A. (1987). Transient stability simulation by waveform relaxation methods. IEEE Trans. Power Syst., 2(4), 943–949.
Juarez T., C., Castellanos, R., Messina, A., & Gonzalez, A. (2007).
A higher-order Newton method approach to computing transient stability margins. In R. Castellanos (Ed.), Proc. 39th North American Power Symposium NAPS '07 (pp. 360–367).
Karypis, G. & Kumar, V. (1998a). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20, 359–392.
Karypis, G. & Kumar, V. (1998b). METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices Version 4.0. University of Minnesota, Department of Computer Science / Army HPC Research Center. Retrieved in October, 2006.
Kassabalidis, I. N. (2002). Dynamic security border identification using enhanced particle swarm optimization. Power Systems, IEEE Transactions on, 17(3), 723–729.
Kielmann, T., Bal, H. E., & Verstoep, K. (2000). Fast measurement of LogP parameters for message passing platforms. Lecture Notes in Computer Science, 1800, 1176–1183.
Krause, P. C., Wasynczuk, O., & Sudhoff, S. D. (2002). Analysis of Electric Machinery and Drive Systems. Piscataway, NJ; New York: IEEE Press; Wiley-Interscience.
Kron, G. (1953). A method to solve very large physical systems in easy stages. Circuit Theory, IRE Transactions on, 2(1), 71–90.
Kron, G. (1963). Diakoptics: The Piecewise Solution of Large-Scale Systems. London: Macdonald & Co.
Kundur, P. (1994). Power System Stability and Control. New York: McGraw-Hill.
Kundur, P. & Dandeno, P. (1983). Implementation of advanced generator models into power system stability programs. Power Apparatus and Systems, IEEE Transactions on, PAS-102(7), 2047–2054.
Kundur, P., Morison, G., & Wang, L. (2000). Techniques for on-line transient stability assessment and control. In Proc. IEEE Power Engineering Society Winter Meeting, volume 1 (pp. 46–51).
La Scala, M., Bose, A., Tylavsky, D., & Chai, J. (1990a). A highly parallel method for transient stability analysis. Power Systems, IEEE Transactions on, 5(4), 1439–1446.
La Scala, M., Brucoli, M., Torelli, F., & Trovato, M. (1990b). A Gauss-Jacobi-block-Newton method for parallel transient stability analysis. IEEE Trans. Power Syst., 5(4), 1168–1177.
La Scala, M., Sblendorio, G., Bose, A., & Wu, J. (1996). Comparison of algorithms for transient stability simulations on shared and distributed memory multiprocessors. IEEE Trans. Power Syst., 11(4), 2045–2050.
La Scala, M., Sblendorio, G., & Sbrizzai, R. (1994). Parallel-in-time implementation of transient stability simulations on a transputer network. IEEE Trans. Power Syst., 9(2), 1117–1125.
La Scala, M., Sbrizzai, R., & Torelli, F. (1991). A pipelined-in-time parallel algorithm for transient stability analysis. IEEE Trans. Power Syst., 6(2), 715–722.
Lau, K., Tylavsky, D., & Bose, A. (1991). Coarse grain scheduling in parallel triangular factorization and solution of power system matrices. Power Systems, IEEE Transactions on, 6(2), 708–714.
Li, X. S. & Demmel, J. W. (2002). SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. Lawrence Berkeley National Laboratory. Paper LBNL-49388.
Li, X. S., Demmel, J. W., & Gilbert, J. R. (2003). SuperLU Users' Guide. The Regents of the University of California, through Lawrence Berkeley National Laboratory. Retrieved in October, 2006.
Mansour, Y., Vaahedi, E., Chang, A., Corns, B., Garrett, B., Demaree, K., Athay, T., & Cheung, K. (1995). BC Hydro's on-line transient stability assessment (TSA) model development, analysis and post-processing. IEEE Trans. Power Syst., 10(1), 241–253.
Marceau, R. & Soumare, S. (1999). A unified approach for estimating transient and long-term stability transfer limits. IEEE Trans. Power Syst., 14(2), 693–701.
Martí, J. R., Linares, L. R., Hollman, J. A., & Moreira, F. A. (2002). OVNI: Integrated software/hardware solution for real-time simulation of large power systems.
In Conference Proceedings of the 14th Power Systems Computation Conference, PSCC02, Sevilla, Spain.
MPI Standard (2008). A Message-Passing Interface Standard. Retrieved [August 8, 2008] from http://www-unix.mcs.anl.gov/mpi/.
NICEVT (2005). Message-passing interface MPI2 over high speed net SCI. Retrieved [October 1, 2006] from http://www.nicevt.ru/download/index.html?lang=en.
Ogbuobiri, E., Tinney, W., & Walker, J. (1970). Sparsity-directed decomposition for Gaussian elimination on matrices. IEEE Trans. Power App. Syst., PAS-89(1), 141–150.
Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A. E., & Purcell, T. J. (2007). A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1), 80–113.
Pai, M. & Dag, H. (1997). Iterative solver techniques in large scale power system computation. In Decision and Control, 1997., Proceedings of the 36th IEEE Conference on, volume 4 (pp. 3861–3866 vol.4).
Rescigno, T. N., Baertschy, M., Isaacs, W. A., & McCurdy, C. W. (1999). Collisional breakup in a quantum system of three charged particles. Science, 286(5449), 2474–2479.
Saad, Y. & Vorst, H. A. V. D. (2000). Iterative solution of linear systems in the 20th century. Journal of Computational and Applied Mathematics, 123, 1–33.
Sato, N. & Tinney, W. (1963). Techniques for exploiting the sparsity of the network admittance matrix. IEEE Trans. Power App. Syst., 82(69), 944–950.
Shu, J., Xue, W., & Zheng, W. (2005). A parallel transient stability simulation for power systems. Power Systems, IEEE Transactions on, 20(4), 1709–1717.
Stott, B. (1979). Power system dynamic response calculations. Proc. IEEE, 67(2), 219–241.
Tinney, W., Brandwajn, V., & Chan, S. (1985). Sparse vector methods. IEEE Trans. Power App. Syst., PAS-104(2), 295–301.
Tinney, W. & Walker, J. (1967). Direct solutions of sparse network equations by optimally ordered triangular factorization. Proc. IEEE, 55(11), 1801–1809.
Tomim, M., Martí, J., & Wang, L. (2009). Parallel solution of large power system networks using the Multi-Area Thévenin Equivalents (MATE) algorithm. International Journal of Electrical Power & Energy Systems, In Press, Corrected Proof. DOI: 10.1016/j.ijepes.2009.02.002.
Tomim, M. A., Martí, J. R., & Wang, L. (2008). Parallel computation of large power system network solutions using the Multi-Area Thévenin Equivalents (MATE) algorithm. In 16th Power Systems Computation Conference, PSCC2008, Glasgow, Scotland.
Tylavsky, D. J., Bose, A., Alvarado, F., Betancourt, R., Clements, K., Heydt, G. T., Huang, G., Ilic, M., Scala, M. L., Pai, M., Pottle, C., Talukdar, S., Ness, J. V., & Wu, F. (1992). Parallel processing in power systems computation. Power Systems, IEEE Transactions on, 7(2), 629–638.
Vorst, H. A. V. D. & Chan, T. F. (1997). Linear system solvers: Sparse iterative methods. Parallel Numerical Algorithms, ICASE/LaRC Interdisciplinary Series in Science and Engineering, 4, 167–202.
Wang, F. (1998). Parallel-in-time relaxed Newton method for transient stability analysis. Generation, Transmission and Distribution, IEE Proceedings-, 145(2), 155–159.
Wang, L. & Morison, K. (2006). Implementation of online security assessment. Power and Energy Magazine, IEEE, 4(5), 46–59.
Wing, O. & Huang, J. (1980). A computation model of parallel solution of linear equations. Computers, IEEE Transactions on, C-29(7), 632–638.
Wu, F. (1976). Solution of large-scale networks by tearing. IEEE Trans. Circuits Syst., 23(12), 706–713.
Wu, J. Q. & Bose, A. (1995). Parallel solution of large sparse matrix equations and parallel power flow. Power Systems, IEEE Transactions on, 10(3), 1343–1349.
Wu, J. Q., Bose, A., Huang, J. A., Valette, A., & Lafrance, F. (1995). Parallel implementation of power system transient stability analysis. IEEE Trans. Power Syst., 10(3), 1226–1233.
Xue, Y., Rousseaux, P., Gao, Z., Belhomme, R., Euxible, E., & Heilbronn, B. (1993).
Dynamic extended equal area criterion. In P. Rousseaux (Ed.), Proc. Joint International Power Conference Athens Power Tech APT 93, volume 2 (pp. 889–895).

Appendix A. LU Factorization

The LU factorization is mathematically equivalent to Gaussian elimination and allows one to express a permuted version of a given complex system matrix $A$ in terms of its lower and upper triangular factors, namely, $L$ and $U$. Once the factors $L$ and $U$ are formed, two new triangular linear systems are generated, which, in turn, can be solved by forward and backward substitutions.

A.1 Problem Formulation

Let (A.1) represent a generic complex linear system, where $A$ is a complex $n \times n$ matrix, and $B$ and $X$ are complex $n \times m$ matrices, the latter being unknown.

$$A X = B \tag{A.1}$$

In order to solve this system for $X$, one alternative is to factorize $A$ into two triangular matrices, namely $L$ and $U$, as stated in (A.2),

$$P_r A P_c = L U \tag{A.2}$$

where $P_r$ and $P_c$ are row and column permutation matrices, respectively. When $A$ is dense, $P_r$ alone may be employed in order to perform partial pivoting, while both $P_r$ and $P_c$ are needed when total pivoting is required. When the matrix $A$ is large and sparse, $P_r$ still performs partial pivoting during the factorization, whereas $P_c$ is selected in such a manner that it minimizes the number of added non-zeros (or simply fill-ins) in the structure of $P_r A P_c$ and, consequently, in the structure of $L + U$.

Pre-multiplying (A.1) by $P_r$ and making $X = P_c Z$ leads to the following expression

$$(P_r A P_c)\, Z = P_r B \tag{A.3}$$

where the product inside the brackets can be substituted by the product $LU$, according to (A.2), which in turn yields

$$L U Z = P_r B \tag{A.4}$$

The new linear system obtained above can then be split into two interdependent triangular linear systems, defined by $L$ and $U$, whose result is the matrix $Z$. Lastly, the solution of the original linear system, $X$, can be obtained by permuting the rows of $Z$ according to $P_c$.
Thus, introducing a new $n \times m$ matrix $W$, which holds the solution of the lower triangular system defined by $L$, this procedure is summarized as follows.

$$P_r A P_c = L U \quad \text{(LU factorization)} \tag{A.5a}$$
$$L W = P_r B \quad \text{(forward substitution)} \tag{A.5b}$$
$$U Z = W \quad \text{(backward substitution)} \tag{A.5c}$$
$$X = P_c Z \quad \text{(row permutation)} \tag{A.5d}$$

A.2 LU Factorization Process

Assuming that the complex matrix $A$ is already ordered, its $L$ and $U$ factors can be obtained in a manner very similar to the one summarized in (A.6). Firstly, one separates the first row and column of the matrix $A$, as shown in (A.6a). As a consequence, the partitioned matrix can be further rewritten, as in (A.6b), in the form of a multiplication of two other matrices, namely $L_1 U_1$. Note that $L_1$ is already lower triangular, while $U_1$ is the matrix $A$ with its first column eliminated by Gaussian elimination.

$$A = \begin{bmatrix} a_1 & (r_1)^T \\ c_1 & A_1 \end{bmatrix} \tag{A.6a}$$
$$A = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ & 1 & & \\ \frac{1}{a_1} c_1 & & \ddots & \\ & & & 1 \end{bmatrix} \begin{bmatrix} a_1 & (r_1)^T \\ 0 & A_1 - \frac{1}{a_1} c_1 (r_1)^T \end{bmatrix} \tag{A.6b}$$
$$A = L_1 U_1 \tag{A.6c}$$

One can then repeat this procedure on the matrices $A_k = A_{k-1} - \frac{1}{a_k} c_k (r_k)^T$ with $k = 2, \ldots, n-1$, which are in fact formed by means of recursive rank-1 updates. Such a procedure ultimately generates the following result,

$$A = L_1 L_2 \cdots L_{n-1} U_{n-1} = L U \tag{A.7}$$

where each $L_k$ is as follows.

$$L_k = \begin{bmatrix} 1 & & & & \\ & \ddots & & & \\ & & 1 & & \\ & & \frac{1}{a_k} c_k & \ddots & \\ & & & & 1 \end{bmatrix} \tag{A.8}$$

As a final result, $U_{n-1}$ is the upper triangular matrix $U$, whereas the lower triangular matrix $L$ equals the product of all $L_k$ with $k = 1, \ldots, n-1$, i.e., $L = \prod_{k=1}^{n-1} L_k$.

A.3 Computational Complexity of LU Factorization and Solution

As presented by Bunch & Rose (1972), assume the sparse matrix $M = P_r A P_c$ that also includes all fill-ins introduced in the structure of $P_r A P_c$, according to (A.2). This matrix $M$ is known as a perfect elimination matrix, since no fill-ins are introduced when it is decomposed into its $L$ and $U$ factors.
In this context, Table A.1 can be generated, based on some metrics defined by Alvarado (1976) for the perfect elimination matrix $M$. These metric definitions are repeated below for convenience.

$$\tau_u = \sum_{i=1}^{n-1} r_i \tag{A.9}$$
$$\alpha = \sum_{i=1}^{n-1} r_i c_i \tag{A.10}$$

In the definitions (A.9) and (A.10), $r_i$ is the number of non-zero elements in the $i$th row of the matrix $U$ above its main diagonal and $c_i$ is the number of non-zero elements in the $i$th column of the matrix $L$ below its main diagonal. One can readily verify that $\tau_u$ equals the total number of off-diagonal non-zeros of $U$. The metric $\alpha$ is the sum of all multiplications needed to perform all rank-1 updates on the LU structure, as shown in the previous section.

On a further note, for symmetric and topologically symmetric matrices, and consequently dense matrices, $r_i$ is the same as $c_i$. The operation count for dense systems is also included in Table A.1, which was obtained by making $r_i = c_i = n - i$.

Table A.1. Operation count for sparse and dense LU factorization and solution for unsymmetric, but topologically symmetric, systems.
Task | Sparse: add | Sparse: mult | Sparse: div | Dense: add | Dense: mult | Dense: div
LU fact. | $\alpha$ | $\alpha$ | $\tau_u$ | $\frac{n^3}{3} - \frac{n^2}{2} + \frac{n}{6}$ | $\frac{n^3}{3} - \frac{n^2}{2} + \frac{n}{6}$ | $\frac{n^2}{2} - \frac{n}{2}$
LU sol. | $2\tau_u$ | $2\tau_u$ | $n$ | $n^2 - n$ | $n^2 - n$ | $n$

Although the operation counts summarized in Table A.1 provide an exact means to determine the required computational effort for both sparse LU factorizations and solutions, they rely completely on the sparsity pattern of a specific perfect elimination matrix $M$. Therefore, a general way to, at least, predict the required computational effort of sparse operations is desirable. Bearing this need in mind, in the following subsection, one way to predict the number of operations will be studied.
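The metrics (A.9) and (A.10) and the sparse columns of Table A.1 can be illustrated with a short sketch. The code below is only an illustration under stated assumptions: the function names and the boolean `pattern` representation of the perfect elimination matrix are hypothetical, not part of the thesis software. The dense and tridiagonal patterns are chosen because neither produces fill-ins, so each is its own perfect elimination matrix.

```python
def structure_counts(pattern):
    """Return (r, c) for a boolean n x n sparsity pattern of M.

    r[i]: non-zeros to the right of the diagonal in row i.
    c[i]: non-zeros below the diagonal in column i.
    Both lists cover the first n-1 rows/columns (0-based).
    """
    n = len(pattern)
    r = [sum(1 for j in range(i + 1, n) if pattern[i][j]) for i in range(n - 1)]
    c = [sum(1 for j in range(i + 1, n) if pattern[j][i]) for i in range(n - 1)]
    return r, c


def lu_metrics(pattern):
    """Return (tau_u, alpha) as defined in (A.9) and (A.10)."""
    r, c = structure_counts(pattern)
    tau_u = sum(r)
    alpha = sum(ri * ci for ri, ci in zip(r, c))
    return tau_u, alpha


def operation_counts(pattern):
    """Sparse operation counts following the sparse columns of Table A.1."""
    n = len(pattern)
    tau_u, alpha = lu_metrics(pattern)
    return {
        "factorization": {"add": alpha, "mult": alpha, "div": tau_u},
        "solution": {"add": 2 * tau_u, "mult": 2 * tau_u, "div": n},
    }


# Hypothetical 5 x 5 patterns: fully dense and tridiagonal.
n = 5
dense = [[True] * n for _ in range(n)]
tridiagonal = [[abs(i - j) <= 1 for j in range(n)] for i in range(n)]

dense_counts = operation_counts(dense)
tridiagonal_counts = operation_counts(tridiagonal)
```

For the dense pattern, the sparse counts should coincide with the dense columns of Table A.1, which gives a quick consistency check of both sets of expressions.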
A.3.1 Expected Computational Complexity of Sparse LU Factorization

Suppose that the $n \times n$ matrix $M = \{m_{ij}\}$ is a perfect elimination matrix associated with the generic matrix $A$, which is assumed to be unsymmetric but topologically symmetric, and has no zeros on its main diagonal. Moreover, each upper triangular location is also assumed to have an equal probability of being non-zero, as given by (A.11).

$$p = p(m_{ij} \neq 0) = \frac{2\tau_u}{n^2 - n} \tag{A.11}$$

The fact that there are $n - i$ possible upper diagonal elements in row $i$ of $M$ with equal probability of being non-zero, given by (A.11), indicates that the number of non-zero elements $r_i$ in row $i$ above the diagonal of $M$ is a binomially distributed random variable. Therefore, according to basic statistics, the expected value and variance of $r_i$ are given as follows.

$$E(r_i) = (n - i)\,p \tag{A.12}$$
$$\sigma^2(r_i) = (n - i)(1 - p)\,p \tag{A.13}$$

Now, recalling that the operation counts given in Table A.1 depend on the metrics $\tau_u$ and $\alpha$, defined in (A.9) and (A.10), respectively, the expected values for these metrics can also be obtained as follows.

$$E(\tau_u) = \sum_{i=1}^{n-1} E(r_i) = \tau_u \tag{A.14}$$
$$E(\alpha) = \sum_{i=1}^{n-1} E\left(r_i^2\right) = \sum_{i=1}^{n-1} \left[\sigma^2(r_i) + E^2(r_i)\right] = \frac{4}{3}\,\frac{(n-2)}{(n-1)\,n}\,\tau_u^2 + \tau_u \tag{A.15}$$

Defining the new parameter $\rho$, given in (A.16), which represents the ratio of branches to buses¹¹, one can rewrite (A.14) and (A.15) as shown in (A.17) and (A.18), respectively.

$$\rho = \frac{\text{number of off-diagonal nonzeros}}{\text{number of nodes}} = \frac{\tau_u}{n} \tag{A.16}$$

which gives

$$E(\tau_u) = n\rho \tag{A.17}$$
$$E(\alpha) = \frac{4}{3}\,\frac{(n-2)}{(n-1)}\,n\rho^2 + n\rho \tag{A.18}$$

The choice of expressing the above expected values in terms of the branch-bus ratio $\rho$ instead of $\tau_u$ is due to the fact that buses in electric networks are usually connected to only a few other neighboring buses. Thus, expressing $\tau_u$ in terms of $\rho$ provides a measure of how interconnected the system under study is.
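As a sanity check of the derivation above, the sum of $\sigma^2(r_i) + E^2(r_i)$ can be evaluated numerically and compared with the closed form (A.18). The sketch below assumes a hypothetical system of 1000 buses with a branch-bus ratio of 2.5; these values and the function names are chosen for illustration only and are not taken from the systems studied in the thesis.

```python
import math


def nonzero_probability(n, tau_u):
    """Probability that an upper triangular location is non-zero, per (A.11)."""
    return 2.0 * tau_u / (n * n - n)


def expected_alpha_by_summation(n, tau_u):
    """Evaluate E(alpha) term by term, using (A.12) and (A.13) for each row."""
    p = nonzero_probability(n, tau_u)
    total = 0.0
    for i in range(1, n):
        candidates = n - i                      # upper triangular slots in row i
        mean = candidates * p                   # (A.12)
        variance = candidates * (1.0 - p) * p   # (A.13)
        total += variance + mean ** 2           # E(r_i^2)
    return total


def expected_alpha_closed_form(n, rho):
    """Closed-form E(alpha) in terms of the branch-bus ratio, per (A.18)."""
    return (4.0 / 3.0) * (n - 2) / (n - 1) * n * rho ** 2 + n * rho


# Hypothetical values, for illustration only.
n = 1000
rho = 2.5
tau_u = n * rho  # (A.16) rearranged

alpha_sum = expected_alpha_by_summation(n, tau_u)
alpha_closed = expected_alpha_closed_form(n, rho)
print(alpha_sum, alpha_closed)
```

Both evaluations should agree up to floating-point rounding, since (A.18) is an exact simplification of the sum under the equal-probability assumption.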
Combining Table A.1 and equations (A.17) and (A.18), expressions for predicting the floating-point operations required to perform sparse LU factorizations and forward and backward substitutions can be deduced, which are shown in Table A.2. Since only complex systems have been assumed so far, the number of floating-point operations needed for each basic complex operation, i.e., addition, subtraction, multiplication and division, used for obtaining Table A.2 is provided in Table A.3. Notice that the extra column in Table A.2, related to large power systems, takes into consideration that the term $\frac{(n-2)}{(n-1)} \approx 1$ when $n \gg 2$, which leads to a simpler expression to represent $E(\alpha)$.

¹¹ Note that the branch-bus ratio $\rho$ is associated with the perfect elimination matrix $M$ and not the system matrix $A$. Therefore, the number of branches is always dependent on the ordering scheme adopted to reduce fill-ins in the original matrix $A$ structure.

Table A.2. Floating-point operation count for sparse and dense LU factorization and solution for unsymmetric, but topologically symmetric, systems.
Task | Sparse | Sparse ($n \gg 2$) | Dense
LU factorization | $\left[\frac{32}{3}\,\frac{(n-2)}{(n-1)}\,\rho^2 + 19\rho\right] n$ | $\left(\frac{32}{3}\,\rho^2 + 19\rho\right) n$ | $\frac{8}{3}\,n^3 + \frac{3}{2}\,n^2 - \frac{25}{6}\,n$
LU solution | $(16\rho + 11)\, n$ | $(16\rho + 11)\, n$ | $8n^2 + 3n$

Table A.3. Complex basic operations requirements in terms of floating-point count.
Complex Operation | Mathematical Representation | Operations | Flop Count
addition/subtraction | $(a + jb) \pm (c + jd) = (a \pm c) + j\,(b \pm d)$ | 2 add/subtr. | 2 flops
multiplication | $(a + jb) \times (c + jd) = (ac - bd) + j\,(bc + ad)$ | 4 mult., 2 add. | 6 flops
division | $\frac{a + jb}{c + jd} = \frac{(ac + bd) + j\,(bc - ad)}{c^2 + d^2}$ | 6 mult., 3 add., 2 div. | 11 flops

Appendix B. Transient Stability Solution Techniques

As extensively described in the literature (Dommel & Sato, 1972; Stott, 1979; Kundur, 1994), the power system transient stability model can be summarized by the set of non-linear differential-algebraic equations shown in (B.1).
$$\dot{x} = f(x, v) \tag{B.1a}$$
$$Y v = i(x, v) \tag{B.1b}$$

where $x$ represents a vector with dynamic variables (or state variables), whose first derivatives $\dot{x}$ normally depend on $x$ itself and on the vector of nodal voltages $v$. In addition, $i$ represents a vector function that defines the nodal current injections, which also depend on the state variables $x$ and the nodal voltages $v$. Lastly, $Y$ represents the complex-valued nodal admittance matrix of the system under study.

According to the literature (Dommel & Sato, 1972; Stott, 1979; Kundur, 1994), the many possible alternatives for solving (B.1) simultaneously in time are categorized in terms of (a) the way in which (B.1a) and (B.1b) are interfaced; (b) the integration method, which can be explicit or implicit; (c) the technique used for solving the algebraic equations, which may be linear or non-linear. For cases when (B.1b) is linear, sparsity-based direct solutions should be employed, whereas for non-linear cases, either the Gauss-Seidel or the Newton-Raphson methods should be used.

B.1 Problem Discretization

In this method, (B.1a) is first discretized according to some integration rule. Due to its inherent numerical stability, the trapezoidal rule is chosen (Dommel, 1996; Dommel & Sato, 1972). This procedure is shown in (B.2).

$$x(t) = x(t - \Delta t) + \int_{t - \Delta t}^{t} f\big(x(\xi), v(\xi)\big)\, d\xi \approx x(t - \Delta t) + \frac{\Delta t}{2} \Big[ f\big(x(t), v(t)\big) + f\big(x(t - \Delta t), v(t - \Delta t)\big) \Big] \tag{B.2}$$

Collecting the present and past values yields (B.3), where $x_h(t)$ represents the history term of the state vector $x$, which is known at time $t$.

$$x(t) = \frac{\Delta t}{2}\, f\big(x(t), v(t)\big) + x_h(t) \tag{B.3a}$$
$$x_h(t) = x(t - \Delta t) + \frac{\Delta t}{2}\, f\big(x(t - \Delta t), v(t - \Delta t)\big) \tag{B.3b}$$

There are many solution variations described in the literature that combine different integration methods and algebraic equation solutions. A number of these methods are summarized by Stott (1979).
However, all variations still fall into two major categories: alternating (or partitioned) and simultaneous solution approaches. Since the alternating solution approach is described and discussed in Section 4.1, only the simultaneous solution is discussed next.

B.2 Simultaneous Solution Approach

One of the most prominent methods pertaining to this class is undoubtedly the Newton-Raphson method. The method is often formulated by means of the residual vector functions $g\big(x(t), v(t)\big)$ and $h\big(x(t), v(t)\big)$, given in (B.4), which are obtained by combining the discretized differential equations (B.3) and the network algebraic equations (B.1b).

$$g\big(x(t), v(t)\big) = x(t) - \frac{\Delta t}{2}\,f\big(x(t), v(t)\big) - x_h(t) = 0 \tag{B.4a}$$
$$h\big(x(t), v(t)\big) = Y\,v(t) - i\big(x(t), v(t)\big) = 0 \tag{B.4b}$$

In the Newton-Raphson method, the iterates for x(t) and v(t) can be obtained using (B.5),

$$x^{k+1} = x^k + \Delta x^k \tag{B.5a}$$
$$v^{k+1} = v^k + \Delta v^k \tag{B.5b}$$

where the mismatches $\Delta x^k$ and $\Delta v^k$ are computed from the linear system (B.6) (see footnote 12),

$$\begin{bmatrix} A_d & B_d \\ C_d & Y + Y_d \end{bmatrix} \begin{bmatrix} \Delta x^k \\ \Delta v^k \end{bmatrix} = -\begin{bmatrix} g(x^k, v^k) \\ h(x^k, v^k) \end{bmatrix} \tag{B.6}$$

and

$$A_d = \left.\frac{\partial g}{\partial x}\right|_k = U - \frac{\Delta t}{2}\,\frac{\partial f}{\partial x}(x^k, v^k) \qquad B_d = \left.\frac{\partial g}{\partial v}\right|_k = -\frac{\Delta t}{2}\,\frac{\partial f}{\partial v}(x^k, v^k)$$
$$C_d = \left.\frac{\partial h}{\partial x}\right|_k = -\frac{\partial i}{\partial x}(x^k, v^k) \qquad Y_d = -\frac{\partial i}{\partial v}(x^k, v^k)$$

Because dynamic devices are interconnected through the network and not directly to each other, the state variables associated with one device do not depend on the state variables of any other device. Therefore, $A_d$ is a block-diagonal matrix, denoted as follows for a power system with m dynamic devices.

$$A_d = \begin{bmatrix} A_{d1} & 0 & \cdots & 0 \\ 0 & A_{d2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_{dm} \end{bmatrix}$$

Applying Gaussian elimination to the Jacobian matrix in (B.6) yields (B.7a) and (B.7b), which enable the solution of $\Delta x^k$ and $\Delta v^k$, respectively.

$$\Delta x^k = -A_d^{-1}\Big[g(x^k, v^k) + B_d\,\Delta v^k\Big] \tag{B.7a}$$
$$\Big(Y + Y_d - C_d\,A_d^{-1} B_d\Big)\,\Delta v^k = -h(x^k, v^k) + C_d\,A_d^{-1}\,g(x^k, v^k) \tag{B.7b}$$

Once $\Delta x^k$ and $\Delta v^k$ are calculated, they can be used to find the new iterates $x^{k+1}$ and $v^{k+1}$ according to (B.5). Subsequently, $g(x^{k+1}, v^{k+1})$ and $h(x^{k+1}, v^{k+1})$ can be computed.
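The elimination in (B.7) can be illustrated on a scalar toy problem. The sketch below is not taken from the thesis: the functions `f` and `i_inj`, the values of `Y`, `dt` and `x_h`, and all Python names are illustrative assumptions. With one state and one voltage, every submatrix of (B.6) reduces to a scalar, so (B.7b) is solved first for the voltage mismatch and (B.7a) then gives the state mismatch.

```python
def f(x, v):
    """Hypothetical device dynamics, dx/dt = f(x, v)."""
    return -x + v


def i_inj(x, v):
    """Hypothetical non-linear current injection i(x, v)."""
    return x - 0.1 * v**3


def newton_step(x_h, dt=0.01, Y=2.0, tol=1e-10, max_iter=20):
    """Solve (B.4) for one time step with full Newton-Raphson iterations."""
    x, v = x_h, 0.0                          # initial guess
    for k in range(max_iter):
        g = x - 0.5 * dt * f(x, v) - x_h     # residual (B.4a)
        h = Y * v - i_inj(x, v)              # residual (B.4b)
        A_d = 1.0 + 0.5 * dt                 # U - (dt/2) df/dx, df/dx = -1
        B_d = -0.5 * dt                      # -(dt/2) df/dv, df/dv = +1
        C_d = -1.0                           # -di/dx, di/dx = +1
        Y_d = 0.3 * v**2                     # -di/dv, di/dv = -0.3 v^2
        J_R = Y + Y_d - C_d * B_d / A_d      # reduced Jacobian
        dv = (-h + C_d * g / A_d) / J_R      # (B.7b)
        dx = -(g + B_d * dv) / A_d           # (B.7a)
        x, v = x + dx, v + dv                # update (B.5)
        if max(abs(dx), abs(dv)) < tol:      # successive iterates agree
            return x, v, k + 1
    raise RuntimeError("Newton-Raphson did not converge")


x, v, iterations = newton_step(x_h=1.0)
```

Here the Jacobian terms are re-evaluated at every iteration, which corresponds to the full Newton-Raphson scheme; holding `J_R` fixed across iterations would give a dishonest variant instead.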
Analogously to the partitioned method, the iterative process continues until the difference between two successive solutions of x(t) and v(t) is within a certain tolerance. Although the full Newton-Raphson method is a powerful method for solving the non-linear transient stability equations, it is computationally very expensive, due to its dependency on the time-varying reduced Jacobian $J_R = Y + Y_d - C_d\,A_d^{-1} B_d$. Such dependency requires a sparse matrix inversion at every single step, which slows down the overall solution by at least an order of magnitude. In order to overcome this drawback, industrial-grade transient stability programs assemble the Jacobian matrix $J_R$ and factorize it into its LU factors only at the beginning of a simulation and whenever topology changes occur in the system or convergence problems are faced. Under such conditions, the algorithm is usually referred to as the Very DisHonest Newton (VDHN) method.

Footnote 12: The matrix U represents an identity matrix.

B.3 Simultaneous versus Alternating Solution Approach

The simultaneous solution approach just discussed can be compared with the alternating approach in various aspects, such as modeling flexibility, computational requirements and convergence characteristics. These are discussed below:

• Computational requirements: Since the simultaneous approach is usually related to Newton-Raphson methods for the solution of non-linear systems of equations, it demands that the dynamic models be expressed in terms of real and imaginary parts. As a direct consequence, the problem, which was originally complex-valued of order n, becomes a real problem of order 2n. Moreover, this newly formed real system of equations also has four times the number of non-zeros of its complex-valued counterpart. For instance, during the factorization process (Appendix A), the amount of floating-point operations can be expected to, at least, quadruple.
Other disadvantages of the real formulation lie in the increased number of memory references and the increased memory consumption.

• Modeling flexibility: In terms of modeling, the real form of the simultaneous solution yields greater flexibility than the complex form, which is inherent to the alternating solution. In the complex form, the complex derivatives required by the linearization process often depend on complex conjugates of the same variable, which demands the inclusion of redundant equations in the problem. This disadvantage can be exemplified by a common constant power load. In this case, the current $\bar{I}_L$ injected by the load $P_0 - jQ_0$ into the system relates to its voltage $\bar{V}_L$ according to (B.8). In order to include this model in a Newton-like solution procedure, one needs to find a relationship between the variations of current $\Delta\bar{I}_L$ and voltage $\Delta\bar{V}_L$, which is not possible from (B.8) alone. Since only (B.9) can be generated, the solution needs to add the equation associated with $\Delta\bar{V}_L^{*}$.

$$\bar{I}_L = -\frac{P_0 - jQ_0}{\bar{V}_L^{*}} \tag{B.8}$$
$$\Delta\bar{I}_L = \frac{P_0 - jQ_0}{\big(\bar{V}_L^{*}\big)^2}\,\Delta\bar{V}_L^{*} \tag{B.9}$$

• Convergence characteristics: Newton-like methods are known for their numerical robustness and quadratic convergence. However, these properties hold fully only when the Jacobian of the non-linear problem, expressed by the submatrices $A_d$, $B_d$, $C_d$ and $Y_d$ in (B.7), is updated at every iteration. In this case, factorization of the problem is also required at every iteration, which, for very large systems, may considerably degrade the overall performance of the algorithm. Dishonest variants, like the VDHN method, which keep the Jacobian constant unless topological changes or convergence difficulties occur, are often employed as a tradeoff between the computational burden and the convergence strength of full Newton methods. Therefore, for very large systems, the alternating solution approach may become comparable with the simultaneous solution approach in terms of convergence.
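The constant power load of (B.8) and (B.9) can be checked numerically. The sketch below is only an illustration: the operating point `V`, the perturbation `dV`, the load values and all function names are assumptions. It compares the exact change in the injected current with the first-order expression (B.9), and with the same expression evaluated using the voltage variation in place of its conjugate.

```python
def load_current(V, P0=1.0, Q0=0.5):
    """Current injected by a constant power load, as in (B.8)."""
    return -(P0 - 1j * Q0) / V.conjugate()


def load_current_increment(V, dV, P0=1.0, Q0=0.5):
    """First-order current variation (B.9), driven by conj(dV)."""
    return (P0 - 1j * Q0) / V.conjugate() ** 2 * dV.conjugate()


V = 1.0 + 0.1j            # hypothetical operating point, per unit
dV = 1e-6 * (1 + 2j)      # small hypothetical voltage perturbation

exact = load_current(V + dV) - load_current(V)
linear = load_current_increment(V, dV)
# Same coefficient applied to dV itself rather than to its conjugate.
not_conjugated = (1.0 - 0.5j) / V.conjugate() ** 2 * dV

error_with_conjugate = abs(exact - linear)
error_without_conjugate = abs(exact - not_conjugated)
```

Under these assumptions, `error_with_conjugate` is of second order in the perturbation, while `error_without_conjugate` is of first order, which is why the current variation cannot be written in terms of the voltage variation alone in the complex form.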
Appendix C Network Microbenchmarks

In order to optimize the performance of parallel applications, the underlying algorithms need to be expressed in terms of simpler tasks, for which the performance on a given computing system is known. In this manner, bottlenecks can be identified and the available resources properly allocated, aiming at the improvement of the overall application. Regarding communications in parallel computing systems, a few modeling approaches have been proposed in the literature. LogP (Culler et al., 1996), PLogP (Kielmann et al., 2000) and LogGP (Alexandrov et al., 1995) are examples of communication models for distributed computing systems. These models are founded on the knowledge of certain network parameters, such as the latency L, the overheads for sending, $o_s$, and receiving, $o_r$, and the inter-message gap g. In this context, network microbenchmarks provide the means of systematically acquiring these network parameters. In turn, microbenchmarks play a paramount role in assessing the performance of the hardware and software that bridge applications and network interfaces (Culler et al., 1996).

C.1 Procedure

As the starting point towards the microbenchmark suggested by Kielmann et al. (2000), the gap between zero-byte messages, g(0), is first measured, following Algorithm 1, given below. In this algorithm, each message consists of a send and receive pair. According to Culler et al. (1996), three stages can be observed during this procedure. For a small number of messages M, the issued sending requests receive no replies. Hence, the cost of each of these first requests equals the sending overhead $o_s$. As the number of messages M increases, the sender starts receiving replies and the message cost increases, since the receiver spends $o_r$ for each reply. During this stage, the replies are separated by the time for successive messages to pass through the bandwidth bottleneck, i.e., by g.
As M further increases, the number of messages eventually causes the network capacity limit to be reached. Under such a condition, any sending request will stall until a reply is drained from the network. During this stage, the cost of each message is determined by the gap g.

Algorithm 1 Inter-message gap and sending overhead timing
    t_start ← start timer
    for k = 1 to M do
        if Sender then
            Send message
        else
            Receive message
        end if
    end for
    t_stop ← stop timer
    t ← t_stop − t_start
    t_issue ← t/M

The results of the measurement of g(0) for the Gigabit Ethernet network available in UBC's Power Systems Engineering Laboratory are shown in Figure C.1. Next, the round trip time (RTT) for a given message is measured, following Algorithm 2. Notice that both the request message m1 and the reply message m2 also consist of paired send and receive operations. Since, in the PLogP model, RTT(0) is equivalent to 2[L + g(0)], one only needs to calculate the latency L, i.e., L = RTT(0)/2 − g(0). As suggested by Kielmann et al. (2000), the receiving overhead $o_r$ can be measured by Algorithm 3. In this case, a controlled waiting time ∆ is added, such that receiving the m-byte message occurs immediately, without further delay. Without this waiting time, extra idle time would be added to the measurement. Making ∆ > RTT(m) is usually conservative enough.

Algorithm 2 Round trip timing (RTT)
    t_start ← start timer
    for k = 1 to M do
        if Sender then
            Send message m1
            Receive message m2
        else
            Receive message m1
            Send message m2
        end if
    end for
    t_stop ← stop timer
    t ← t_stop − t_start
    RTT ← t/M
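The round trip timing of Algorithm 2 can be sketched with the Python standard library. This is only an illustration under stated assumptions: it uses a local socket pair and a thread inside one process instead of MPI routines over an Ethernet network, and a one-byte payload stands in for the zero-byte message, so the measured times say nothing about a real network. The names `round_trip_time` and `latency` are hypothetical.

```python
import socket
import threading
import time


def recv_exact(sock, n):
    """Read exactly n bytes, since stream sockets may return fewer."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf


def round_trip_time(num_messages=1000, size=1):
    """Average round trip time in seconds, following Algorithm 2."""
    sender, receiver = socket.socketpair()
    payload = b"x" * size

    def echo():
        for _ in range(num_messages):
            recv_exact(receiver, size)    # receive request m1
            receiver.sendall(payload)     # send reply m2

    worker = threading.Thread(target=echo)
    worker.start()
    t_start = time.perf_counter()
    for _ in range(num_messages):
        sender.sendall(payload)           # send request m1
        recv_exact(sender, size)          # receive reply m2
    t_stop = time.perf_counter()
    worker.join()
    sender.close()
    receiver.close()
    return (t_stop - t_start) / num_messages


def latency(rtt0, gap0):
    """PLogP latency L obtained from RTT(0) = 2 [L + g(0)]."""
    return rtt0 / 2.0 - gap0


rtt = round_trip_time(num_messages=200)
```

As a usage note, `latency` expects both arguments in the same time unit; with the measured g(0) = 4.76 µs, a round trip time expressed in microseconds would be passed as `latency(rtt_us, 4.76)`.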
Algorithm 3 Receiving overhead o_r timings
    o_r ← 0
    for k = 1 to M do
        if Sender then
            Send zero-byte message
            Wait for time ∆ > RTT(m)
            t_start ← start timer
            Receive m-byte message
            t_stop ← stop timer
            o_r ← o_r + t_stop − t_start
        else
            Receive zero-byte message
            Send m-byte message
        end if
    end for
    if Sender then
        o_r ← o_r/M
    end if

[Figure C.1 plot: gap g(0) in µs versus number of messages (0 to 160); annotated values g(0) = 4.76 µs and o_s(0) = 3.80 µs.]

Figure C.1. Average issue time, yielded by Algorithm 1, for zero-byte messages communicated by MPI routines over a Gigabit Ethernet network.
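The kind of sweep behind Figure C.1 can be sketched by running Algorithm 1 for several message counts M. This is a rough illustration under stated assumptions, not the thesis benchmark: a local stream socket pair and a thread replace MPI over Gigabit Ethernet, one-byte payloads replace zero-byte messages, and the stream may coalesce messages, so the three stages described by Culler et al. (1996) are not expected to be reproduced. The function name `issue_time` and the chosen values of M are hypothetical.

```python
import socket
import threading
import time


def issue_time(num_messages, size=1):
    """Average time per issued message in seconds, following Algorithm 1."""
    sender, receiver = socket.socketpair()
    payload = b"x" * size
    total = num_messages * size

    def drain():
        # Receiver side of Algorithm 1: consume everything that was sent.
        remaining = total
        while remaining > 0:
            chunk = receiver.recv(min(65536, remaining))
            if not chunk:
                break
            remaining -= len(chunk)

    worker = threading.Thread(target=drain)
    worker.start()
    t_start = time.perf_counter()
    for _ in range(num_messages):
        sender.sendall(payload)           # sender side of Algorithm 1
    t_stop = time.perf_counter()
    worker.join()
    sender.close()
    receiver.close()
    return (t_stop - t_start) / num_messages


# Issue time as a function of the number of messages M.
sweep = {m: issue_time(m) for m in (1, 10, 50, 100, 160)}
```

Plotting the values of `sweep` against their keys gives a curve of the same type as Figure C.1; on a real network, the plateau for small M would be read as the sending overhead and the plateau for large M as the gap.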
Item Metadata
Title | Parallel computation of large power system networks using the multi-area Thévenin equivalents |
Creator |
Tomim, Marcelo Aroca |
Publisher | University of British Columbia |
Date Issued | 2009 |
Description | The requirements of today's power systems are much different from the ones of the systems of the past. Among others, energy market deregulation, proliferation of independent power producers, unusual power transfers, increased complexity and sensitivity of the equipments demand from power systems operators and planners a thorough understanding of the dynamic behaviour of such systems in order to ensure a stable and reliable energy supply. In this context, on-line Dynamic Security Assessment (DSA) plays a fundamental role in helping operators to predict the security level of future operating conditions that the system may undergo. Amongst the tools that compound DSA is the Transient Stability Assessment (TSA) tools, which aim at determining the dynamic stability margins of present and future operating conditions. The systems employed in on-line TSA, however, are very much simplified versions of the actual systems, due to the time-consuming transient stability simulations that are still at the heart of TSA applications. Therefore, there is an increasing need for improved TSA software, which has the capability of simulating bigger and more complex systems in a shorter lapse of time. In order to achieve such a goal, a reformulation of the Multi-Area Thévenin Equivalents (MATE) algorithm is proposed. The intent of such an algorithm is parallelizing the solution of the large sparse linear systems associated with transient stability simulations and, therefore, speeding up the overall on-line TSA cycle. As part of the developed work, the matrix-based MATE algorithm was re-evaluated from an electric network standpoint, which yielded the network-based MATE presently introduced. In addition, a performance model of the proposed algorithm is developed, from which a theoretical speedup limit of p/2 was deduced, where p is the number of subsystems into which a system is torn apart. 
Applications of the network-based MATE algorithm onto solving actual power systems (about 2,000 and 15,000 buses) showed the attained speedup to closely follow the predictions made with the formulated performance model, even on a commodity cluster built out of inexpensive out-of-the-shelf computers. |
Extent | 5836906 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
File Format | application/pdf |
Language | eng |
Date Available | 2009-08-07 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivatives 4.0 International |
DOI | 10.14288/1.0067477 |
URI | http://hdl.handle.net/2429/11983 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Graduation Date | 2009-11 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/4.0/ |
Aggregated Source Repository | DSpace |