UBC Theses and Dissertations

Parallel computation of large power system networks using the multi-area Thévenin equivalents. Tomim, Marcelo Aroca. 2009-08-07.


Parallel Computation of Large Power System Networks Using the Multi-Area Thévenin Equivalents

by

Marcelo Aroca Tomim

B.Sc., Escola Federal de Engenharia de Itajubá, 2001
M.A.Sc., Universidade Federal de Itajubá, 2004

A THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in The Faculty of Graduate Studies (Electrical & Computer Engineering)

The University of British Columbia (Vancouver)

July 2009

© Marcelo Aroca Tomim 2009

Abstract

The requirements of today's power systems are much different from those of the systems of the past. Among other factors, energy market deregulation, the proliferation of independent power producers, unusual power transfers, and the increased complexity and sensitivity of equipment demand from power system operators and planners a thorough understanding of the dynamic behaviour of such systems, in order to ensure a stable and reliable energy supply. In this context, on-line Dynamic Security Assessment (DSA) plays a fundamental role in helping operators predict the security level of future operating conditions that the system may undergo. Among the tools that make up DSA are the Transient Stability Assessment (TSA) tools, which aim at determining the dynamic stability margins of present and future operating conditions.

The systems employed in on-line TSA, however, are very much simplified versions of the actual systems, due to the time-consuming transient stability simulations that are still at the heart of TSA applications. Therefore, there is an increasing need for improved TSA software, capable of simulating bigger and more complex systems in a shorter lapse of time.

In order to achieve such a goal, a reformulation of the Multi-Area Thévenin Equivalents (MATE) algorithm is proposed. The intent of this algorithm is to parallelize the solution of the large sparse linear systems associated with transient stability simulations and, therefore, to speed up the overall on-line TSA cycle.
As part of the developed work, the matrix-based MATE algorithm was re-evaluated from an electric network standpoint, which yielded the network-based MATE presently introduced. In addition, a performance model of the proposed algorithm is developed, from which a theoretical speedup limit of p/2 was deduced, where p is the number of subsystems into which a system is torn apart. Applications of the network-based MATE algorithm to solving actual power systems (about 2,000 and 15,000 buses) showed the attained speedup to closely follow the predictions made with the formulated performance model, even on a commodity cluster built out of inexpensive off-the-shelf computers.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Glossary
Acknowledgements
Dedication

1 Introduction
  1.1 Motivating Parallel Computing
    1.1.1 Commodity Off-The-Shelf Computer Systems
    1.1.2 Performance Metrics for Parallel Systems
  1.2 Parallel Transient Stability Solution
    1.2.1 Parallel Newton Methods
    1.2.2 Parallel Waveform Relaxation Methods
    1.2.3 Parallel Alternating Methods
  1.3 Parallel Network Solutions
    1.3.1 Fine Grain Schemes
    1.3.2 Coarse Grain Schemes
  1.4 Other Related Work
    1.4.1 Parallel Direct Methods
    1.4.2 Parallel Iterative Methods
  1.5 Thesis Motivation
  1.6 Thesis Contributions
  1.7 Publications

2 Network-based Multi-Area Thévenin Equivalents (MATE)
  2.1 Problem Statement
  2.2 MATE Original Formulation
  2.3 Network-based MATE Formulation
    2.3.1 MATE Algorithm Summary
    2.3.2 Multi-Node Thévenin Equivalents
    2.3.3 Multi-Area Thévenin Equivalents
    2.3.4 Subsystems Update
  2.4 MATE: Original versus Network-based
  2.5 Conclusion

3 Network-based MATE Algorithm Implementation
  3.1 MATE Algorithm Flow Chart
  3.2 MATE Performance Model
    3.2.1 Performance Model Preliminaries
    3.2.2 Computational Aspects of MATE
    3.2.3 Communication Aspects of MATE
    3.2.4 MATE Speedup and Efficiency
    3.2.5 MATE Performance Qualitative Analysis
  3.3 Hardware/Software Benchmarks
    3.3.1 Sparse Linear Solver Benchmark
    3.3.2 Dense Linear Solver Benchmark
    3.3.3 Communication Libraries Benchmark
  3.4 Western Electricity Coordinating Council System
    3.4.1 WECC System Partitioning
    3.4.2 Timings and Performance Predictions for the WECC System
  3.5 Conclusion

4 MATE-based Parallel Transient Stability
  4.1 Transient Stability Problem
    4.1.1 Transient Stability Solution Techniques
    4.1.2 Transient Stability Models
  4.2 Sequential Transient Stability Simulator
  4.3 MATE-based Parallel Transient Stability Simulator
    4.3.1 System Partitioning Stage
    4.3.2 Pre-processing Stage
    4.3.3 Solution Stage
  4.4 Performance Analysis
    4.4.1 South-Southeastern Brazilian Interconnected System Partitioning
    4.4.2 Timings for the SSBI System
  4.5 Conclusion

5 Conclusion
  5.1 Summary of Contributions
  5.2 Future Work
  5.3 Final Remarks

Bibliography

Appendices

A LU Factorization
  A.1 Problem Formulation
  A.2 LU Factorization Process
  A.3 Computational Complexity of LU Factorization and Solution
    A.3.1 Expected Computational Complexity of Sparse LU Factorization

B Transient Stability Solution Techniques
  B.1 Problem Discretization
  B.2 Simultaneous Solution Approach
  B.3 Simultaneous versus Alternating Solution Approach

C Network Microbenchmarks
  C.1 Procedure

List of Tables

1.1 SuperLU_DIST timings on a commodity PC cluster
3.1 Data fitting summary for the sparse operations
3.2 Data fitting summary for the dense operations
3.3 Parameter summary for Ethernet and SCI networks
3.4 Partitioning of the Western Electricity Coordinating Council system
4.1 Summary of the SSBI system
4.2 Statistics of the SSBI system simulation
4.3 Summary of the sequential transient stability simulation
4.4 Partitioning of the South-Southeastern Brazilian Interconnected system
5.1 Timings for the network-based MATE and SuperLU_DIST
A.1 Operation count for sparse and dense LU factorization
A.2 Floating-point operation count for sparse and dense LU factorization
A.3 Complex basic operations requirements in terms of floating-point count

List of Figures

1.1 Generic COTS computer system
1.2 Sparse matrix and its associated task graph
1.3 Partitioning techniques
1.4 SuperLU_DIST two-dimensional block-cyclic partitioning scheme
1.5 Decomposition of the elimination tree adopted within MUMPS
2.1 Generic electric network partitioned into subsystems
2.2 Thévenin equivalent voltages
2.3 Multi-node Thévenin equivalent impedances construction
2.4 Multi-node Thévenin equivalent of a generic subsystem
2.5 Link Thévenin equivalent of subsystem S2 in Figure 2.1 in detailed and compact forms, respectively
2.6 Multi-area Thévenin equivalent of the system in Figure 2.1 in its detailed and compact versions, respectively
2.7 Relationships among the various mappings Rk, Qk and Pk
3.1 MATE algorithm flow chart
3.2 MATE timeline for two subsystems, S1 and S2, and the link solver
3.3 Link currents scatter according to the PLogP model
3.4 Multi-node Thévenin equivalents gather according to the PLogP model
3.5 Sparse operations benchmark
3.6 Dense operations benchmark
3.7 Sparse operations throughput
3.8 Dense operations throughput
3.9 Benchmark results of two MPI implementations over SCI and Gigabit Ethernet networks
3.10 Sending and receiving overheads and inter-message gaps for MPI over SCI and Gigabit Ethernet networks
3.11 Bandwidth of MPI over SCI and Gigabit Ethernet networks
3.12 Comparison of Ethernet and SCI network timings for the network-based MATE communications
3.13 North American Electric Reliability Council (NERC) Regions
3.14 Western Electricity Coordinating Council System admittance matrix
3.15 Link solver and communication penalty factors relative to the WECC system
3.16 Comparison between MATE timings for multilevel recursive bisection and multilevel k-way partitioning algorithms
3.17 MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 1 factorization and different partitioning strategies
3.18 MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 100 factorizations and different partitioning strategies
3.19 MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 500 factorizations and different partitioning strategies
3.20 MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 1 factorization and different partitioning strategies
3.21 MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 100 factorizations and different partitioning strategies
3.22 MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 500 factorizations and different partitioning strategies
3.23 MATE performance metrics for the solution of the WECC system for 1000 steps and (a) 1, (b) 100 and (c) 500 factorizations
4.1 Basic structure of a power system model for transient stability analysis
4.2 Equivalent pi circuit of a transmission line
4.3 Transformer with off-nominal ratio
4.4 Equivalent circuit of the polynomial load model
4.5 Synchronous machine qd reference frame and system reference frame
4.6 Synchronous generator equivalent circuits
4.7 Steady-state equivalent circuit of the synchronous machine
4.8 Steady-state phasor diagram of the synchronous machine
4.9 Flow chart of a transient stability program based on the partitioned approach
4.10 MATE-based parallel transient stability program flow chart: Pre-Processing Stage
4.11 MATE-based parallel transient stability program flow chart: Solution Stage
4.12 Contingency status check employed in the parallel MATE-based transient stability simulator
4.13 Network-based MATE parallel linear solver detail
4.14 Convergence check employed in the parallel MATE-based transient stability simulator
4.15 Brazilian National Interconnected System
4.16 South-Southeastern Brazilian Interconnected System admittance matrix
4.17 Link solver and communication penalty factors relative to the SSBI system
4.18 Timings of the parallel transient stability simulator for different partitioning heuristics
4.19 Timings of the link solver of the parallel transient stability simulator for different partitioning heuristics
4.20 Convergence and contingency status check time TpCCC measured and fitted with the plogp function
4.21 Performance metrics of the parallel transient stability simulator
C.1 Average issue time, yielded by Algorithm 1, for zero-byte messages communicated by MPI routines over a Gigabit Ethernet network
Glossary

Acronyms

BBDF      Block-Bordered Diagonal Form
BLAS      Basic Linear Algebra Subprograms
CG        Conjugate Gradient
CGS       Conjugate Gradient Squared
COTS      Commodity (or Commercial) Off-The-Shelf
CPU       Central Processing Unit
CSMA/CD   Carrier Sense Multiple Access with Collision Detection
DSA       Dynamic Security Assessment
FACTS     Flexible Alternating Current Transmission System
flop      floating-point operation
GPU       Graphical Processing Unit
HVDC      High-Voltage Direct-Current transmission
IP        Internet Protocol
LAPACK    Linear Algebra PACKage
MATE      Multi-Area Thévenin Equivalents
Mflops    10^6 flops per second
NIC       Network Interface Card
PC        Personal Computer
PDE       Partial Differential Equations
RAM       Random-Access Memory
RMS       Root Mean Square
SCI       Scalable Coherent Interface
SCTP      Stream Control Transmission Protocol
SMP       Symmetric Multiprocessing
SSBI      South-Southeastern Brazilian Interconnected system
SVC       Static-Var Compensator
TCP       Transmission Control Protocol
TSA       Transient Stability Assessment
VDHN      Very DisHonest Newton
WECC      Western Electricity Coordinating Council system

Mathematical Terms

μ(·)      Arithmetic mean
σ(·)      Standard deviation
O(·)      Indicates the same order of complexity as the function it operates upon
(·)^-1    Inverse matrix operator
(·)^T     Transpose matrix operator
ℑ{·}      Imaginary part
ℜ{·}      Real part
j         Pure imaginary number, i.e., √-1
(·)*      Complex conjugate operator
Ē, V̄, Ī, Z̄   RMS phasor quantities
ρ         The ratio of branches to buses in S
bk        Number of border nodes contained in Bk
k         Subsystem index, where k = 1, ..., p
l         Number of global links contained in L
lk        Number of local links contained in Lk
n         Number of nodes in the untorn system S
Ni        Number of times the current injections associated with the untorn system S change
nk        Number of internal nodes contained in Nk
Nz        Number of times the untorn system S undergoes topological changes
p         Number of subsystems into which the system S is torn
Bk        Set of the bk border nodes, which are connected to the subsystem Sk ∈ S
Lk        Set of the lk local links, which are connected to the subsystem Sk ∈ S
L         Set of the l global links, which interconnect all subsystems Sk ∈ S
Nk        Set of the nk internal nodes, which belong to the subsystem Sk ∈ S
Sk        kth subsystem associated with S, i.e., Sk ∈ S
S         Original untorn system
ebk       Thévenin voltages vector at the bk border nodes of Sk, i.e., in Bk
elk       Vector of dimension l with the Thévenin voltages of the subsystem Sk, i.e., in Bk, which are connected to the global links in L
ibk       Vector of dimension bk with current injections into the border nodes in Bk
vbk       Vector of dimension bk with voltage drops across subsystem Sk, as seen from the border nodes in Bk
vlk       Vector of dimension l with voltage drops across subsystem Sk, as seen from Bk, due to the global link currents in L
ek        Thévenin voltages vector at all nk nodes in Nk
ik        Current injections vector of dimension nk due to the local links in Lk
jk        Internal current injections vector of dimension nk associated with the system Sk
vk        Nodal voltages vector of dimension nk associated with the system Sk
i         Current injections vector of dimension n associated with the system S
v         Nodal voltages vector of dimension n associated with the system S
el        Multi-area Thévenin voltages vector of dimension l exciting the global links L
il        Global link currents vector of dimension l associated with the global links in L
Zbk       Multi-node Thévenin impedance matrix of dimension bk×bk with respect to the border nodes in Bk
Zl0       Primitive impedance matrix of dimension l×l associated with the global links in L
Zlk       Link Thévenin impedance matrix of dimension l×l associated with the subsystem Sk
Zl        Multi-area Thévenin impedance matrix of dimension l×l associated with the global links L
Lk        Lower triangular factor of the admittance matrix Yk, obtained from the LU factorization
Pk        Link-to-subsystem transformation matrix of dimension nk×l, which maps L onto Nk. The subsystem-to-link transformation matrix is given by (Pk)^T
Qk        Subsystem-to-border transformation matrix of dimension bk×nk, which maps Nk onto Bk. The border-to-subsystem transformation matrix is given by (Qk)^T
Rk        Link-to-border transformation matrix of dimension bk×l, which maps L onto Bk. The border-to-link transformation matrix is given by (Rk)^T
Uk        Upper triangular factor of the admittance matrix Yk, obtained from the LU factorization
Yk        Admittance matrix of dimension nk×nk associated with each subsystem Sk
Y         Admittance matrix of dimension n×n associated with the system S
EMATE     Efficiency achieved by the parallel network-based MATE algorithm when solving the system S
SMATE     Speedup achieved by the parallel network-based MATE algorithm over the sequential sparse linear solver when solving the system S
g(m)      Minimum time, or gap, between consecutive message transmissions or receptions, which also considers all contributing factors, including os(m) and or(m)
L         End-to-end latency from process to process, which includes the time for copying the data to and from the network interfaces and transferring the data over the physical network
or(m)     Period of time in which the processor is engaged in receiving the message of size m
os(m)     Period of time in which the processor is engaged in sending the message of size m
TC        Timing associated with the communication overhead incurred by the network-based MATE algorithm
TL        Timing associated with the sequential tasks performed by the process handling the link solver
TMATE     Time spent by the parallel network-based MATE solver to solve the original system S
TP        Timing associated with parallel tasks performed by the processes handling subsystems
TSPARSE   Time spent by the sequential sparse linear solver to solve the original system S
T1fact(Sk)  Time spent performing the first-time factorization of Yk
T2fact(Sk)  Time spent performing the same-pattern factorization of Yk.
In this case, the symbolic factorization, required in the first-time factorization, is not repeated
Ticomm(L)   Time spent scattering the border node current injections ibk to the subsystems Sk ∈ S
Ticomm(Sk)  Time spent receiving the border node injections ibk from the link solver
TThvcomm(L)  Time spent receiving Zbk and ebk from the subsystems Sk ∈ S
TThvcomm(Sk) Time spent sending Zbk and ebk to the link solver process
Ticomp(L)   Time spent setting up and computing the link current equations
TThvcomp(Sk) Time spent computing the multi-node Thévenin equivalent Zbk and ebk for the subsystem Sk ∈ S
Tvcomp(Sk)  Time spent computing the nodal voltages in the subsystem Sk
Tvcomp(Sk)  Time spent solving the node voltages vk for the updated current injections ik
Tfact(L)    Time spent performing the dense factorization of Zl
Tlnkidle(Sk) Time spent waiting for the link solver process to compute the link currents il
Tsubsidle(L) Time spent waiting for the subsystems Sk ∈ S to compute Zbk, ebk and vk
Tsolv(L)    Time spent performing the forward/backward substitutions using the Ll and Ul factors associated with Zl
Tsolv(Sk)   Time spent performing the forward/backward substitutions using the Lk and Uk factors associated with Yk
T(L)        Time spent by the link solver process solving the link system L
T(Sk)       Time spent by the processes solving the subsystem Sk

Acknowledgements

This thesis arose, in part, out of years of research carried out by the University of British Columbia Power Engineering Group, led by Dr. José R. Martí. It is indeed a great pleasure to be part of one of the most prominent research groups on dynamic simulations of electric power systems.

In particular, I would like to thank Dr. José R. Martí for the supervision, advice and guidance throughout the Ph.D. program, which was vital to the conduct of the present work.

Since I started the Ph.D.
program, I had the opportunity to meet a great number of people of the most varied cultural and professional backgrounds, who contributed in assorted ways to my growth as a person and a researcher. To them, I would like to record my most sincere acknowledgements.

A special mention should be made of Mazana Armstrong, for introducing me to the previous works related to my thesis topic; of Michael Wrinch, for his valuable comments and advice in both the scientific and professional realms; and of Tom De Rybel, for introducing me to the Linux world and for his ceaseless technical support with the computing cluster.

I would also like to thank the CAPES Foundation (Fundação de Coordenação de Aperfeiçoamento de Pessoal de Nível Superior) of Brazil for funding my Ph.D. program. In addition, I would like to thank the British Columbia Transmission Corporation and Powertech Labs Inc. for extending the funding of this project and providing the data relative to the Western Electricity Coordinating Council system, which was used in the performance analysis presented in this work. I would also like to acknowledge Dr. Lei Wang of Powertech Labs Inc. for his crucial comments, which further motivated me to accomplish this work. I gratefully thank Dr. Paulo Nepomuceno Garcia of the Federal University of Juiz de Fora (UFJF), Brazil, for appreciating the results of this work and for helping me organize the body of the thesis.

I would also like to acknowledge and thank my research committee members, Dr. K. D. Srivastava, Dr. Hermann Dommel and Dr.
Juri Jatskevich, for their valuable time put into reading my thesis and for their equally important feedback, which definitely helped improve the content of this thesis.

Last, but far from least, I would like to thank my family for the unconditional support, patience and love that motivate me to be a better person every day.

Dedication

To Adriana and Nicholas,
You are my life.

Chapter 1

Introduction

Modern electric power systems are often less secure than the systems of the past. This is a result of reduced attention to the transmission infrastructure caused by deregulation, the proliferation of independent power producers, unusual power transfers driven by market activities, the use of complex controls and special protection schemes, and a general lack of system-wide oversight regarding reliable planning and operation (Wang & Morison, 2006).

In this context, the possible types and combinations of energy transactions occurring at any given time may grow enormously. Therefore, today's power systems can no longer be operated in a structured manner (Kundur et al., 2000). One solution to mitigate this uncertainty is the prediction of future operating conditions, for example through the use of on-line Dynamic Security Assessment (DSA). Such a tool takes a snapshot of the system operating condition, performs a comprehensive security assessment in near-real-time, and provides the operators with warnings of abnormal situations as well as remedial measure recommendations (Wang & Morison, 2006).

One of the most computationally costly tools of on-line DSA is undoubtedly Transient Stability Assessment (TSA). Many alternatives to tackle this problem have been proposed in the literature (Xue et al., 1993; Mansour et al., 1995; Marceau & Soumare, 1999; Ernst et al., 2001; Kassabalidis, 2002; Juarez T. et al., 2007), which include more efficient time-domain solutions, direct stability methods, pattern recognition and expert systems/neural networks, as well as hybrid methods.
Many of these methods, however, still rely on transient stability time-domain simulation tools, due to their high level of accuracy and their flexibility in terms of the complexity of the system models. Hence, speeding up TSA tools can significantly improve the overall performance of on-line DSA.

In the past two decades, dramatic improvements in microprocessor technology have been observed. According to Grama et al. (2003), the average number of cycles per instruction of high-end processors improved by roughly an order of magnitude during the 1990s. However, not only do the speed of light and the effectiveness of heat dissipation techniques impose physical limits on the speed of a single computer, but the cost of advanced single-processor computers also increases more rapidly than their power. As personal computer (PC) performance has increased and prices have fallen steeply, for both PCs and the networks employed to interconnect them, dedicated clusters of PC workstations have become an interesting alternative to traditional supercomputers (Gropp et al., 1999, 2003). This picture tends to shift high-performance computing even further towards clusters of PCs and parallelism, as multi-core processors have recently become the standard.

Many attempts have been made to parallelize transient stability simulations, but only a few have aimed at taking full advantage of commodity PC clusters. Therefore, further research on porting industrial-grade transient stability simulators onto today's parallel computing systems, with minimal programming effort and maximum returns in terms of computational throughput, is still of significant importance.
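At the core of the time-domain simulation tools mentioned above is the repeated discretization and solution of the machine and network equations at every time step. As a minimal illustration (this is not the thesis' simulator, and all parameter values are invented), the classical one-machine/infinite-bus swing equation can be integrated with the trapezoidal rule, the implicit discretization commonly used in transient stability programs:

```python
import math

# Hedged sketch: classical swing equation M*delta'' = Pm - Pmax*sin(delta),
# integrated with the trapezoidal rule. All numeric values are illustrative.

M = 0.1      # inertia constant (per unit) -- assumed value
Pm = 0.9     # mechanical input power (pu) -- assumed value
Pmax = 1.5   # peak electrical power transfer (pu) -- assumed value
dt = 0.01    # integration step (s)

def rhs(delta, omega):
    """Swing equation written as a first-order system (angle, speed)."""
    return omega, (Pm - Pmax * math.sin(delta)) / M

# Start at the steady-state rotor angle with no speed deviation.
delta = math.asin(Pm / Pmax)
omega = 0.0

for _ in range(1000):  # 10 s of simulated time
    d0, w0 = rhs(delta, omega)
    # The trapezoidal rule is implicit; resolve it with a few fixed-point sweeps.
    d_new, w_new = delta, omega
    for _ in range(5):
        d1, w1 = rhs(d_new, w_new)
        d_new = delta + 0.5 * dt * (d0 + d1)
        w_new = omega + 0.5 * dt * (w0 + w1)
    delta, omega = d_new, w_new

# Undisturbed, the machine remains at its equilibrium angle.
print(abs(delta - math.asin(Pm / Pmax)) < 1e-9)
```

A production simulator solves thousands of such machine equations coupled through a large sparse network admittance matrix at each step, which is precisely the linear-system solution that MATE parallelizes.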
Furthermore, inexpensive parallel computing systems present themselves as a strong alternative for providing electric utilities' operating centres with much-needed real-time DSA applications.

In the following, further reasons that motivate parallelism, and some useful performance metrics applied to analyzing parallel programs, will be presented. Afterwards, a literature review on parallel algorithms applied to solving the transient stability problem will be presented. Lastly, the motivation of the present work will be introduced, followed by the major achievements resulting from the present research project.

1.1 Motivating Parallel Computing

A brief survey of trends in applications, computer architecture, and networking, presented by Foster (1995), suggests a future in which parallelism plays a vital role not only in the development of supercomputers but also in workstations, personal computers, and networks. In this future, programs will be required to exploit the multiple processors located inside each computer and the additional processors available across a network. Moreover, because most existing algorithms are specialized for a single processor, this situation implies a need for new algorithms and program structures able to perform many operations at once.

As pointed out by Grama et al. (2003), the role of concurrency in accelerating computing elements has been recognized for several decades. In the past, however, when trends in hardware development were not clear and standardized parallel programming interfaces were not yet established, the development of parallel software was traditionally thought of as highly time- and effort-intensive. Nowadays, many arguments support parallelism, such as the computational power, memory and disk speed, and data communication requirements of today's applications.

On the computational power argument, it is recognized that increased computational performance will not come from faster processors, but from parallel ones.
This can be explained by the fact that processor clock cycle times (which ultimately limit the time to perform a basic operation) are decreasing at a much slower pace, as physical limits such as the speed of light are approached. Therefore, the only way to increase the number of operations that can be performed by a computer architecture is by executing them in parallel (Foster, 1995).

In conjunction with the speed of the processor, the latency and bandwidth of the memory system represent another factor that influences computational performance. Memory access times are usually much longer than the execution times of floating-point operations, which imposes a tremendous performance bottleneck. The use of hierarchical and faster memory devices, called caches, reduces the problem, but not entirely. Parallel platforms typically yield better memory system performance because they provide larger aggregate caches and higher aggregate bandwidth to the memory system (Grama et al., 2003).

As problems increase in size and complexity, larger databases are required, which may be infeasible to collect in a central location. In such situations, parallel computing appears as the only solution. An extra motivation for computational parallelism arises in cases where central data storage is also undesirable for technical and/or security reasons.

1.1.1 Commodity Off-The-Shelf Computer Systems

Advances in computer technology over the past decades have made general-purpose personal computers (PCs) a cost-effective alternative to traditional supercomputers for computing large problems quickly.

In the early 1990's, Dr. Thomas Sterling and Dr.
Donald Becker, two engineers at NASA, hypothesized that inexpensive, commodity off-the-shelf (COTS) computer systems hooked together with high-speed networking (even with speeds as low as 10 Mbit/s Ethernet) could duplicate the power of supercomputers, particularly for applications that could be converted into highly parallelized threads of execution. They theorized that the price/performance of these COTS systems would more than make up for the overhead of having to send data between the different nodes to have that additional computing done (Gropp et al., 2003). Eventually, this concept became known as Beowulf clusters (Beowulf.org, 2009), whose generic structure is illustrated in Figure 1.1.

Figure 1.1. Generic COTS computer system (http://upload.wikimedia.org/wikipedia/commons/4/40/Beowulf.png).

Another advantage of COTS systems over the early supercomputers is the ease of programming and maintenance. Supercomputers were often hand-made systems that required specific programming techniques compatible with only certain models. In the case of commodity clusters, common processor architectures are employed, which allows code reuse and, therefore, significantly reduces the development and deployment time of parallel applications. Regarding the network interconnects, many types are available in the market with different underlying hardware approaches.
However, parallel programming standards, such as the Message Passing Interface (MPI) and the Parallel Virtual Machine (PVM), helped abstract programming from the hardware, which further leveraged the success of commodity computing clusters.

Nowadays, low-cost commodity clusters built from individual PCs and interconnected by private low-latency and high-bandwidth networks provide users with unprecedented price/performance and configuration flexibility for high-performance parallel computing.

1.1.2 Performance Metrics for Parallel Systems

When analyzing the performance of parallel programs in order to find the best algorithm, a number of metrics have been used, such as speedup and efficiency.

Speedup

Speedup is a relative measure of the acceleration delivered by a parallel algorithm with respect to the best known equivalent sequential algorithm. Mathematically, speedup, given in (1.1), corresponds to the ratio of the time taken to solve a specific problem using a single processor, Ts, to the time required to solve the same problem in a parallel environment with p identical processors, Tp.

S = Ts / Tp    (1.1)

Moreover, the parallel timing Tp can be further split into three smaller components. Even though parallel programs aim at utilizing the p available processors as much as possible, there are always parts of the solution that remain sequential, which in turn incur a sequential timing, Tps.
Also included in the total parallel timing Tp is the overhead timing Tpo, which comprises extra computation and interprocessor communication timings. Last but not least, there is the actual time spent executing parallel work, Tpw. Hence,

S = Ts / (Tps + Tpw + Tpo) = 1 / (fps + fpw + fpo)    (1.2)

where fpk = Tpk / Ts, with k ∈ {s, w, o}, are the corresponding quantities normalized with respect to the best sequential timing Ts.

In fact, these normalized quantities are usually functions that depend on the problem to be solved, the number of available processors p, and characteristics intrinsic to the hardware/software setup. Therefore, finding such relationships requires a thorough understanding of the underlying parallel algorithm and of the costs involved in solving a given problem on a parallel architecture.

For example, assume that a problem of size N can be solved in parallel by any number of processors, hypothetically even infinitely many. Consider also that the parallel overhead fpo is kept negligible for any number of processors, and that the parallel work fraction fpw diminishes asymptotically with the number of processors, i.e., fpw tends to zero as p heads towards infinity. In such a case, one can observe that the speedup achieved with the parallel solution will never exceed 1/fps. This behavior of parallel programs was first observed by Amdahl (1967), who stated that the sequential portion of the program alone places an upper limit on the speedup, even if the sequential processing were done in a separate processor.

In an ideal parallel system, however, both the sequential processing required by the parallel algorithm and the overhead times are null, whereas the parallel work can be approximated by the original sequential time Ts split evenly among p processors. In such a case, the time required to solve the problem in parallel would be Ts/p, which incurs a p-fold speedup.
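The saturation implied by (1.2) can be checked numerically. The short Python sketch below evaluates the speedup model for a growing number of processors; the 5% sequential fraction and zero overhead are illustrative assumptions, not measured values.

```python
# Numerical illustration of the speedup model (1.2). The sequential
# fraction f_ps = 0.05 and zero overhead (f_po = 0) are illustrative
# assumptions; real fractions depend on the algorithm and hardware.

def speedup(p, f_ps, f_po=0.0):
    """Speedup with p processors: the non-sequential work (1 - f_ps)
    is assumed to parallelize perfectly, so f_pw = (1 - f_ps) / p."""
    f_pw = (1.0 - f_ps) / p
    return 1.0 / (f_ps + f_pw + f_po)

for p in (1, 4, 16, 64, 1024):
    s = speedup(p, f_ps=0.05)
    print(f"p = {p:4d}   S = {s:6.2f}   E = S/p = {s / p:.3f}")
```

As p grows, S approaches the Amdahl limit 1/fps = 20, while the efficiency E = S/p collapses towards zero.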
In practice, speedups greater than p (also known as superlinear speedups) can be observed when the work performed by the sequential program is greater than that of its parallel counterpart. This phenomenon is commonly observed when the data necessary to solve a problem is too large to fit into the cache of a single processor, but fits into several units that can be accessed concurrently (large aggregate cache).

Efficiency

Efficiency is a measure of the fraction of the time for which a processing unit is usefully employed (Grama et al., 2003). It is mathematically defined as the ratio of the speedup to the number of processing units used, as given below.

E = S / p = Ts / (p Tp)    (1.3)

From (1.3), it can be readily verified that the efficiency of an ideal parallel system equals unity. In practice, however, communication overhead tends to increase as the number of processors grows, which yields efficiencies always lower than unity.

1.2 Parallel Transient Stability Solution

Strategies for parallelizing traditional sequential transient stability solutions rely on the structure of the discretized differential-algebraic equations given below:

ẋ = f(x, v)    (1.4a)
Y v = i(x, v)    (1.4b)

where x represents a vector of dynamic variables (or state variables), whose first derivatives ẋ are defined by a vector function f, normally dependent on x and on the vector of nodal voltages v. In addition, i represents a vector function that defines the nodal current injections, which also depend on the state variables x and the nodal voltages v. Lastly, Y represents the complex-valued nodal admittance matrix of the system under study.

In general, the differential equations (1.4a) are associated with dynamic devices connected to the passive network. Usually, there is no direct connection between dynamic devices, and all interactions occur through the network.
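The structure of (1.4) can be illustrated with a toy example. In the Python sketch below, two hypothetical one-state devices interact only through a small network solve; the two-device system, the numbers and the explicit Euler update are all invented purely to show which computations are mutually independent.

```python
# Toy sketch of the structure of (1.4): per-device differential equations
# that couple only through the network solution G v = i(x). All values
# here are illustrative inventions, not a power system model.

def solve2x2(G, b):
    """Direct solution of a 2x2 linear system (stand-in for the sparse
    network solution Y v = i)."""
    det = G[0][0] * G[1][1] - G[0][1] * G[1][0]
    return [(b[0] * G[1][1] - b[1] * G[0][1]) / det,
            (G[0][0] * b[1] - G[1][0] * b[0]) / det]

G = [[2.0, -1.0], [-1.0, 2.0]]   # toy nodal "admittance" matrix
x = [1.0, 0.0]                   # one state per device
h = 0.01                         # integration step size

for _ in range(100):
    v = solve2x2(G, x)           # network equations: couple the devices
    # Device equations: each update uses only its own state and its own
    # node voltage, so these could run on separate processors.
    x = [xi + h * (-xi + vi) for xi, vi in zip(x, v)]

print(x)
```

The per-device updates inside the loop are mutually independent once v is known; only the network solve ties the devices together, which is exactly the part that is hard to parallelize.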
As a consequence, state variables are usually grouped by device, which keeps each group independent from the others and, hence, makes them suitable for concurrent computation. As for the network equations (1.4b), parallelization is not as straightforward as for the differential equations. Generally, electric networks are highly sparse with irregular connectivity patterns, a fact that makes parallel solutions for such networks very difficult to find (Tylavsky et al., 1992).

Parallelization of these two major tasks within the solution of a single time step is often referred to as the parallel-in-space approach, because it exploits the mathematical structure of the problem and the system topology. Other strategies adopt a parallel-in-time approach, where multiple time steps of the same transient stability simulation are solved simultaneously. In practice, parallel-in-time and parallel-in-space methods are combined in order to enhance the efficiency of the overall simulation.

Various parallel algorithms for transient stability computation have been proposed, but few have actually been tested on parallel machines. Moreover, many algorithmic combinations that take advantage of parallelization in both space and time have been reported in the literature. In what follows, a few of these parallelization techniques applied to the transient stability problem will be briefly reviewed.

1.2.1 Parallel Newton Methods

The first parallel-in-space variation of the Newton-Raphson method (for details, see Appendix B) was introduced by Hatcher et al. (1977).
The approach adopted in that work was to distribute sets of discretized differential equations (1.4a) among different processors, and to solve the network equations (1.4b) by means of two main algorithms: a sequential sparse LU factorization and a parallel Successive Over-Relaxation (SOR) method.

Later on, Alvarado (1979) proposed a parallel-in-time approach based on the Newton-Raphson method applied to differential equations discretized by the trapezoidal integration method. Network equations, however, were not considered at that time.

La Scala et al. (1990b, 1991) formalized and extended the mathematical formulation of the previous method, yielding a parallel-in-space-and-time method. The basic idea of this technique was to assign T × p processors to solve multiple time steps, where each group of p processors computes one of the T steps of a predefined time window. More specifically, blocked Gauss-Jacobi (La Scala et al., 1990b) and Gauss-Seidel (La Scala et al., 1991) methods were proposed for the iterations between time steps, while each time step was solved by the parallel-in-space Very DisHonest Newton (VDHN) method. Again, the network equations were computed sequentially.

Chai et al. (1991) describe the first reported implementations of a parallel transient stability simulator on two different computing architectures, a distributed-memory and a shared-memory system. In this study, the parallel VDHN and SOR-Newton methods were also compared. For the tested 662-bus, 91-generator system, the SOR-Newton method presented a speedup of 7.4 on the shared-memory machine and 3.5 on the distributed-memory system when 8 processors were employed. These metrics make evident that communication-intensive methods, such as SOR-Newton, are more efficient when implemented on shared-memory systems. As for the VDHN method, speedups of 3.9 and 4.6 were reported for the same 8 processors on the shared- and distributed-memory systems, respectively.
These results show that, on a shared-memory system, the performance of the VDHN method is much lower than that observed for the SOR-Newton method, due to the sequential solution of the network equations. On a distributed-memory system, however, the VDHN method performs slightly better than SOR-Newton due to its reduced communication overhead, and it is not higher only because of the sequential solution of the network equations. In addition, the parallel-in-space VDHN method was used to compute multiple time steps simultaneously, yielding a parallel-in-space-and-time VDHN method. When solving the same 662-bus system with 32 processors on the distributed-memory system, this combined method achieved a speedup of about 9 times in comparison with the sequential VDHN method. Later, Chai & Bose (1993) summarized practical experiences with parallel Gauss-type, VDHN, SOR-Newton, Maclaurin-Newton and Newton-W matrix methods, the latter of which is discussed in Section 1.3.

Parallel implementations of the VDHN method in a research-purpose and a production-grade transient stability simulator are presented in (Wu et al., 1995). The adopted approach was a step-by-step solution of the entire system, with differential and network equations split among several processors on a shared-memory system (Cray). The parallel network solution was based on the methodology introduced in (Huang & Wing, 1979; Wu & Bose, 1995). As part of the results of the paper, parallelization of the differential equations is shown to be practically linearly scalable with the number of processors on such an architecture (about 16 times speedup with 20 processors, for the machine equation computations alone). As for the parallel factorization and triangular solutions of the network equations, a strong speedup saturation is observed, with speedups of about 11 times for the factorization and 5 times for the triangular solutions on 20 processors.
The overall speedup of the parallel transient stability solution was about 7 times on the same 20 processors.

Decker et al. (1996) introduced a new class of parallel iterative solvers into Newton-based transient stability solutions: Conjugate Gradient (CG) based methods. Both parallel-in-time and parallel-in-space approaches were considered for solving the transient stability problem on a distributed-memory architecture. Moreover, the network equations were also parallelized as part of the CG-like iterative solutions. Despite the high level of parallelism provided by such methods, intensive communication overhead along with convergence issues considerably degraded the performance of the algorithm with respect to sequential solutions.

Other parallel implementations of the VDHN method, which employ the same parallel sparse factorization and triangular solution suggested by Huang & Wing (1979), are discussed in (La Scala et al., 1996; Hong & Shen, 2000). According to (La Scala et al., 1996), the parallel VDHN method implemented on a shared-memory system was able to achieve roughly 7.7 times speedup on 20 processors when solving a system with 662 buses and 91 generators. On an 8-processor distributed-memory system, Hong & Shen (2000) report speedups of about 3.6 and 5.5 times when solving systems with 1017 buses (150 generators) and 3021 buses (450 generators), respectively.

Hong & Shen (2000) also apply the Gauss-Seidel method, proposed in (La Scala et al., 1991), on a distributed-memory architecture.
Taking the sequential VDHN method as a reference, the parallel-in-time-and-space Gauss-Seidel method achieved 9 and 13 times speedup when solving the aforementioned 1017-bus and 3021-bus systems, respectively, on 32 processors.

1.2.2 Parallel Waveform Relaxation Methods

Waveform relaxation methods (WRM) had been shown to be very effective for the transient analysis of Very-Large-Scale-Integration (VLSI) circuits and were introduced to the transient stability analysis of large power systems by Ilić-Spong et al. (1987). The basic idea of the WRM is to solve for the waveform of one state variable while considering all other state variable waveforms fixed at their previous values. The same is then repeated for the other state variables using the updated waveforms, until convergence is observed. This method is similar to the Gauss-Jacobi method, but applied to whole state variable waveforms. Although predictions were made with respect to the linear scalability of the method when implemented in a truly parallel computing environment, no comparisons with the best available transient stability solutions, based on Newton-like algorithms, were presented. Improvements to the parallel WRM applied to transient stability analysis were proposed in (Crow & Ilić, 1990), which included aspects related to windowing and to partitioning of the state variables among distinct processors. Results showed that the method is linearly scalable, although the presented simulations did not consider communication overhead.

In another attempt at solving the transient stability problem by means of WRM-like methods, La Scala et al. (1994, 1996) proposed the Shifted-Picard method, also known as the WRM-Newton method. As synthesized in the article, the main idea of the algorithm consists in linearizing the nonlinear differential-algebraic equations (DAEs) about a given initial guess waveform. Then, the guess waveform is updated by solving the linear set of DAEs derived by the linearization.
The original equations are then re-linearized about the updated guess, in an iterative fashion. Experiments with a 662-bus, 91-generator system on a distributed system, taking a sequential VDHN-based simulator as the reference, show that this algorithm achieved about 7 times speedup with 32 processors, which represents an efficiency of 22%. For a system with 2583 buses and 511 generators, Aloisio et al. (1997) showed that the performance of the Shifted-Picard method decreases drastically due to intensive interprocessor communication on a distributed-memory architecture (IBM SP2). In that case, the observed speedup with respect to the serial VDHN algorithm was 0.5 on 8 processors. Moreover, the network equations were solved sequentially.

Wang (1998) proposed a parallel-in-time relaxed Newton method as an improvement to the SOR-Newton method. An implementation of the proposed method on a distributed-memory Cray T3D presented a speedup of about 9.4 times at 47% efficiency on 20 processors, with respect to a sequential VDHN-based transient stability program.

1.2.3 Parallel Alternating Methods

As observed in Section 1.2, in the alternating solution approach the differential equations can be straightforwardly solved concurrently in a parallel-in-space fashion. The network equations, however, still remain a major obstacle to efficient parallelization, due to their irregular connectivity pattern.

In order to overcome the sequentiality of the network equation solutions, La Scala et al. (1990a) exploited the in-space parallelism inherent to the SOR method for the network solutions given in (1.4b), at the expense of the quadratic convergence proper to Newton-based methods.

Another solution strategy, which relies on the network decomposition originally proposed in (Ogbuobiri et al., 1970; Huang & Wing, 1979), was used by Decker et al. (1992, 1996).
In this work, the network equations were solved iteratively by a parallel implementation of the Conjugate Gradient (CG) method on a distributed multiprocessor machine. In spite of the high level of parallelism inherent to the CG method, speedups of only up to 7 were observed when 32 processors were employed in the computations.

Also based on network decomposition (Ogbuobiri et al., 1970; Huang & Wing, 1979), Shu et al. (2005) adopted a node-tearing technique (discussed in Section 1.3) for parallelizing the network equations on a cluster of multi-core processors.

1.3 Parallel Network Solutions

The vast majority of problems encountered in engineering demand the solution of large sparse linear systems, expressed mathematically by (1.5).

A x = b    (1.5)

where x represents the vector of unknowns, b is a known vector, usually specifying boundary or contour conditions, and A is a large, problem-dependent, sparse square matrix.

In power systems engineering, more specifically, such linear systems are often associated with large electric networks, which can span entire continents. As such networks present irregular and sparse connectivity patterns, their associated nodal equations (represented by the matrix A in (1.5)) end up assuming the same topological pattern. The irregular sparsity is due to the fact that, in large power systems, nodes are directly connected to just a few other neighboring nodes in a variety of ways (Sato & Tinney, 1963). Since Tinney & Walker (1967), sparsity techniques applied to the solution of large sparse linear systems of equations have been heavily studied.
As a consequence, direct sparse LU factorization became the standard tool for solving the sparse linear systems which arise from many power system problems, such as fault analysis, power flow analysis and transient stability simulations.

Although sparsity techniques have been extremely successful on sequential computers, parallel algorithms that take full advantage of the sparsity of power system problems are more difficult to develop. The difficulty is mainly due to the aforementioned irregularity of power networks, in addition to the lack of evident parallelism in the required computational operations. In order to address the problem, a number of parallel algorithms have been proposed in the literature, which can be categorized into two major classes: fine-grain and coarse-grain parallelization schemes. Independently of the class, these algorithms often need a pre-processing stage that aims at properly scheduling independent tasks on different processors. The topology of the networks and their factorization paths, described by means of graphs, provide extremely useful information which helps exploit the parallelism inherent in the computations, even though it is not readily evident.

1.3.1 Fine Grain Schemes

In fine grain schemes, parallelism is achieved by dividing the factorization process and the forward/backward substitutions into a sequence of small elementary tasks and scheduling them on several processors.

In (Huang & Wing, 1979; Wing & Huang, 1980; Wu & Bose, 1995), these elementary tasks consist of one or two floating-point operations, which are scheduled based on precedence relationships defined by the system's topology and node ordering. An illustration of the method is given in Figure 1.2. In this example, the list of operations required to eliminate the sparse matrix (a perfect elimination matrix, as defined in Section A.3) is given, along with its associated task graph. The main idea is to take advantage of the independence of the tasks belonging to the same level of the task graph and to perform them in parallel.
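The level-scheduling idea can be sketched in a few lines of Python. The small task graph below is an invented example (not the one of the cited figure): each task lists its prerequisites, and tasks that end up in the same level have no dependencies among themselves, so they could be dispatched to different processors.

```python
# Minimal sketch of level scheduling for a task graph: tasks within one
# level are mutually independent and can run concurrently. The graph
# below is an invented toy example.

deps = {                 # task -> tasks that must finish first
    1: [], 2: [1], 3: [], 4: [3],
    5: [2, 4],           # task 5 needs both chains finished
}

def levels(deps):
    """Group tasks into levels: level k holds the tasks whose longest
    dependency chain has length k."""
    level = {}
    def lvl(t):
        if t not in level:
            level[t] = 1 + max((lvl(d) for d in deps[t]), default=0)
        return level[t]
    for t in deps:
        lvl(t)
    out = {}
    for t, k in level.items():
        out.setdefault(k, []).append(t)
    return out

print(levels(deps))   # tasks 1 and 3 share level 1; tasks 2 and 4 share level 2
```

In a real fine-grain scheme the tasks are the elementary update operations of the factorization and the precedence relations come from the system topology and node ordering, but the scheduling principle is the same.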
An optimal scheduling based on the levels of the task graph associated with the optimally-ordered system matrix was suggested in (Huang & Wing, 1979). Wu & Bose (1995) extended the previous work and presented an implementation of the algorithm on a shared-memory computer. For the parallel LU factorization, results show almost linearly-scalable speedups of up to 17 times on 20 processors, i.e., an efficiency of 85%, while for the parallel forward/backward substitutions, speedups strongly saturate at 8 processors and peak at about 5.5 times when 16 processors are used.

Employing a different approach, in (Alvarado et al., 1990; Enns et al., 1990) the available parallelism was exploited from the W matrix defined below:

W = L^-1 = (L1 L2 ... Lm)^-1 = Lm^-1 ... L2^-1 L1^-1

where

A = L D L^T

and D is a diagonal matrix and L is a lower triangular factor matrix of A.

Based on the fact that each matrix Lk^-1, for k = 1, ..., m, describes one update operation required in the Gaussian elimination (for further details on the LU factorization, see Appendix A), parallelism can be obtained by properly aggregating sets of Lk^-1, so that W = Wa Wb ... Wz. Each intermediate multiplication is then performed in parallel. Indirect performance measurements of this methodology, when applied to the transient stability problem on a shared-memory machine, are discussed in (Chai & Bose, 1993). A maximum speedup of about 5 times was observed on 16 processors, and it strongly saturated for higher numbers of processors.

1.3.2 Coarse Grain Schemes

Schemes of this nature rely on the fact that electric power systems can be partitioned into smaller subsystems, which are in turn interconnected or interfaced by a limited number of branches or nodes. As long as all subsystems are independent from one another, apart from the interconnection or interface system, each subsystem can be solved concurrently
Introduction× ×× ×× × ×× × × ◦× × ×× ◦ × × × - original nonzeros◦ - fill-ins(a) Sparse matrixTask Operation1 a14← a14/a112 a44←a44−a41a143 a25← a25/a224 a55←a55−a52a255 a34← a34/a336 a36← a36/a337 a44←a44−a43a348 a46←a46−a43a369 a64←a64−a63a3410 a66←a66−a63a3611 a46← a46/a4412 a66←a66−a64a4613 a56← a56/a5514 a66←a66−a65a56(b) Task list3 6 1 54 10 8 2 913 7111214Level 1Level 2Level 3Level 4Level 5Level 6(c) Task graphFigure 1.2. Sparse matrix and its associated task graph.on distinct processors. Therefore, partitioning algorithms of large systems into independentsubsystems is at the heart of any coarse grain parallel scheme.The basic partitioning schemes for electrical networks are based on branch and nodetearing algorithms, illustrated in Figure 1.3.Branch tearing techniquesIn branch tearing methods, subsystems are completely disconnected from each other whenthe interconnecting branches are removed from the system. This group of branches is alsoknow as the cut-set of the system, and is represented by the currents associated with theinterconnection system ZI in Figure 1.3a.The most well-known example of such technique is Diakoptics, developed by Gabriel Kronand his followers (Happ, 1970, 1973; Kron, 1953, 1963). As pointed out by Happ (1973, 1974),13Chapter 1. IntroductionY1 Y2ZIY1Y2-ZIP1IP2IPI1 PI2(a) Branch tearing.Y1 Y2YIY1Y2YIY1IY2IYI1 YI2(b) Node tearing.Figure 1.3. Partitioning techniques.the basic idea of diakoptics is to solve a large system by breaking, or tearing, it apart intosmaller subsystems; to first solve the individual parts, and then to combine and modifythe solutions of the torn parts to yield the solution of the original untorn problem. Thecombination or modification of the torn solutions so that they apply to the untorn originalproblem is the crux of the method. 
It was precisely this task that Kron accomplished through a series of contour transformations and link divisions.

Many authors attempted to clarify the concepts proposed by Kron and further developed by Happ. Among them, Wu (1976) explained the basic idea of diakoptics as merely the partitioning of the branches and the application of Kirchhoff's laws. He also observed that practical large networks are usually sparsely connected, and thus sparse matrix techniques, pioneered by Tinney & Walker (1967), could be used to make diakoptics more computationally efficient. Alvarado et al. (1977) also investigated the feasibility of employing diakoptics as a means of solving large networks, and concluded that sparsity techniques are fundamental to improving the performance of the diakoptics-based algorithms proposed at that time.

More recently, the Multi-Area Thévenin Equivalents (MATE) method was proposed by Martí et al. (2002), which extends the idea of diakoptics in terms of interconnected Thévenin equivalents. As summarized by the authors, the MATE algorithm embodies the concepts of multi-node Thévenin equivalents, diakoptics, and the modified nodal analysis of Ho et al. (1975) in a simple and general formulation. Another application of the MATE algorithm is demonstrated by Hollman & Martí (2003), who employed a distributed-memory PC cluster for calculating electromagnetic transients in real time. Although the work by Hollman & Martí (2003) has its foundations in the MATE algorithm, it further exploits computational parallelism from the time decoupling provided by the transmission lines existent in power systems (Dommel, 1996). This technique completely eliminates the need for combining the subsystems' solutions in order to obtain the solution of the untorn system and, therefore, only requires exchanges of past values of voltages (i.e., known values) between subsystems interconnected by a given transmission line.
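To give a flavour of the branch-tearing idea behind Thévenin-equivalent methods such as MATE, the Python sketch below solves two small subsystems joined by a single resistive link: each subsystem is reduced to a Thévenin equivalent at its border node, the link current is obtained from the interconnected equivalents, and the subsystem solutions are then corrected. The 2-node subsystems, the single real-valued link and all numbers are illustrative inventions, not the thesis's formulation.

```python
# Schematic two-subsystem, single-link branch-tearing sketch in the
# spirit of interconnected Thévenin equivalents. Everything here is an
# invented toy: real MATE handles many links and complex admittances.

def solve2(Y, b):
    """2x2 direct solve, standing in for each subsystem's sparse solver."""
    det = Y[0][0] * Y[1][1] - Y[0][1] * Y[1][0]
    return [(b[0] * Y[1][1] - b[1] * Y[0][1]) / det,
            (Y[0][0] * b[1] - Y[1][0] * b[0]) / det]

Y1 = [[3.0, -1.0], [-1.0, 2.0]]   # subsystem 1, link attaches at node 1
Y2 = [[2.0, -1.0], [-1.0, 3.0]]   # subsystem 2, link attaches at node 0
i1, i2 = [1.0, 0.0], [0.0, 0.0]   # nodal current injections
z_link = 0.5                      # tie-branch impedance

# Step 1 (one solve per subsystem, parallelizable): open-circuit border
# voltages and Thévenin impedances seen from the border nodes.
e1 = solve2(Y1, i1)[1]
e2 = solve2(Y2, i2)[0]
zt1 = solve2(Y1, [0.0, 1.0])[1]   # border entry of Y1^-1
zt2 = solve2(Y2, [1.0, 0.0])[0]   # border entry of Y2^-1

# Step 2 (small link problem, solved centrally): tie-branch current.
a = (e1 - e2) / (zt1 + zt2 + z_link)

# Step 3 (again one solve per subsystem): inject the link current
# and recompute the subsystem voltages.
v1 = solve2(Y1, [i1[0], i1[1] - a])
v2 = solve2(Y2, [i2[0] + a, i2[1]])

# Consistency with the untorn system: the border voltage difference
# must equal the drop across the link, v1[1] - v2[0] == z_link * a.
print(v1, v2, a)
```

The pattern worth noting is that only the tiny link problem (step 2) is global; the expensive subsystem solves (steps 1 and 3) are independent, which is what makes tearing-based schemes attractive for distributed-memory clusters.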
Timings for the solution of a 234-node/349-branch system show an achieved speedup of about 3.5 times using 5 PCs.

Before the present work, no parallel implementation of any branch-tearing-based algorithm applied to the solution of large power systems had been found in the literature. Performance predictions of proposed algorithms, however, were presented in (Wu, 1976; Alvarado et al., 1977).

Node tearing techniques

In node tearing methods, subsystems are completely disconnected from each other when the nodes that lie on a common interface (YI) are removed from the system, as depicted in Figure 1.3b. Node tearing, or network decomposition, was first proposed as a means to reorder sparse matrices in order to reduce fill-ins during the LU factorization, as suggested by Tinney & Walker (1967) and Ogbuobiri et al. (1970). Due to the matrix shape illustrated in Figure 1.3b, network decomposition is also often referred to as the Block-Bordered Diagonal Form (BBDF) method.

At first, parallel applications based on such partitioning techniques were only qualitatively evaluated. Brasch et al. (1979) showed that BBDF-based parallel algorithms would severely suffer from scalability issues, due to the strong speedup saturation predicted for increasing numbers of processors. Lau et al. (1991) presented a partitioning algorithm based on the computation of precedence relationships embedded in the factorization paths described by Tinney et al. (1985). The partitioned factorization path was then structured according to the BBDF and properly scheduled on distinct processors of a distributed-memory system. Results confirmed the previous predictions, with speedups saturating strongly for more than 4 processors. For a 662-bus system, the maximum observed speedups were slightly below 2 times on the same 4 processors, which leads to an efficiency of 50%.
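The solution pattern that BBDF-based schemes exploit can be sketched as follows. In the Python fragment below the diagonal "blocks" are scalars for readability (real implementations use one sparse block per subsystem), and all numbers are illustrative.

```python
# Minimal sketch of a BBDF (node-tearing) solve: the diagonal blocks are
# eliminated independently -- the step that parallelizes -- and only the
# small interface (Schur complement) system is solved jointly. Scalars
# stand in for sparse subsystem blocks; the numbers are arbitrary.

d = [4.0, 5.0]          # subsystem diagonal blocks
c = [1.0, 2.0]          # coupling columns: subsystem -> interface
b = [1.0, 2.0]          # coupling row: interface -> subsystems
dI = 6.0                # interface block
r = [8.0, 9.0]          # right-hand side, subsystem part
rI = 7.0                # right-hand side, interface part

# Step 1 (independent per subsystem): partial solves against each block.
y = [rk / dk for rk, dk in zip(r, d)]          # d_k^-1 * r_k
w = [ck / dk for ck, dk in zip(c, d)]          # d_k^-1 * c_k

# Step 2 (interface): assemble and solve the Schur-complement system.
s = dI - sum(bk * wk for bk, wk in zip(b, w))
xI = (rI - sum(bk * yk for bk, yk in zip(b, y))) / s

# Step 3 (independent per subsystem): back-substitute the interface value.
x = [yk - wk * xI for yk, wk in zip(y, w)]

print(x, xI)
```

Steps 1 and 3 are embarrassingly parallel across subsystems; the sequential interface solve of step 2 is exactly what causes the speedup saturation reported for BBDF methods as the number of processors (and hence the interface size) grows.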
Other implementations of parallel BBDF methods can be found in the literature, although their timings are not clearly separated from the global timings of the problem solution, such as that of the transient stability problem discussed previously.

1.4 Other Related Work

Parallel computing has been employed in solving scientific and engineering problems for decades. Structural analysis of materials, weather forecasting, simulation of astronomical phenomena, simulation of air flow about an aircraft and calculation of electromagnetic fields around electric equipment are just a few examples of problems that are extensively tackled by parallel computing due to their computational and data intensiveness.

A computational task shared by all the previous problems is the solution of large sparse linear systems of equations, often extracted from sets of discretized ordinary or partial differential equations. Many parallel algorithms have been proposed to solve such large sparse systems. These can be categorized into parallel direct methods and parallel iterative methods.

1.4.1 Parallel Direct Methods

Parallel direct solvers are based on classical Gaussian elimination combined with partial or total pivoting strategies. Amongst the most well-known parallel direct sparse solvers available are SuperLU_DIST (Li & Demmel, 2002) and MUMPS (Amestoy & Duff, 2000). The algorithms underlying these two solvers represent a vast class of solvers (Amestoy et al., 2001). Hence, an overview of these solvers, to some extent, summarizes two of the best algorithms employed in computer science for direct sparse linear solutions.

SuperLU_DIST partitions the matrix in a two-dimensional block-cyclic fashion. The blocks are formed by groups of submatrices and assigned to a two-dimensional process grid of shape r × c, as depicted in Figure 1.4. Each block is defined based on the notion of unsymmetric supernodes, suggested by Demmel et al. (1999).
As described in (Li & Demmel, 2002), a supernode is a range of columns of L with the triangular block just below the diagonal being full, and the same nonzero structure elsewhere (either full or zero). The off-diagonal block may be rectangular and need not be full. In the block-cyclic layout, block (i, j) is mapped onto the process at coordinate ((i−1) mod r, (j−1) mod c) of the process grid. During the factorization, the block L(i, j) is only needed by the process row ((i−1) mod r), which restricts the interprocess communication. Similarly, the block U(i, j) is only needed by the processes on the process column ((j−1) mod c).

As also reported in (Li & Demmel, 2002), SuperLU DIST was able to solve matrices of various sizes and patterns, originating from different fields of application, with decreasing timings for up to 256 processors of a Cray T3E-900.

Figure 1.4. SuperLU DIST two-dimensional block-cyclic partitioning scheme. (Extracted from (Li & Demmel, 2002))

On the other hand, MUMPS (MUltifrontal Massively Parallel Solver) aims at solving unsymmetric or symmetric positive definite linear systems of equations on distributed-memory computing architectures. There are two main levels of parallelism within MUMPS: tree and node parallelism. As explained by Amestoy & Duff (2000), in the factorization process, data is first assembled at a node, combining the Schur complements from the children nodes with data from the original matrix. The original matrix's data comprises rows and columns corresponding to variables that the analysis forecasts should be eliminated at this node. This data is usually supplied in so-called arrowhead format (i.e., BBDF format), with the matrix ordered according to the permutation from the analysis phase.
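The block-cyclic mapping just described, block (i, j) owned by process ((i−1) mod r, (j−1) mod c), can be sketched in a few lines of Python. This is only an illustration of the indexing rule; the function name `owner` and the example grid are not part of SuperLU DIST's API.

```python
# Sketch of SuperLU DIST's 2D block-cyclic block-to-process mapping.
# Block (i, j), with 1-based indices as in the text, is owned by the
# process at grid coordinate ((i-1) mod r, (j-1) mod c) on an r x c grid.

def owner(i, j, r, c):
    """Return the (row, col) process-grid coordinate that owns block (i, j)."""
    return ((i - 1) % r, (j - 1) % c)

grid_r, grid_c = 2, 3
for block in [(1, 1), (3, 4), (2, 5)]:
    print(block, "->", owner(*block, grid_r, grid_c))
```

Note how all blocks L(i, ·) of a given block row i land on the same process row, which is what limits the factorization's interprocess communication to one process row (or column) at a time.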
This data and the contribution blocks from the children nodes are then assembled (or summed) into a frontal matrix using indirect addressing (sometimes called an extended-add operation). Since the nodes belonging to the same level of the elimination tree and their children are independent (see Figure 1.5), the previous procedure can be performed concurrently for all nodes at the same level. This explains the tree parallelism. The node parallelism lies in the fact that the Schur complements of each node of the elimination tree are obtained from local blocked update operations, which can be performed by means of parallel implementations of the Level 3 BLAS (Blackford et al., 2002; Goto & Van De Geijn, 2008b).

Figure 1.5. Decomposition of the elimination tree adopted within MUMPS.

According to Amestoy et al. (2001), although the total volume of communication is comparable for both solvers, MUMPS requires many fewer messages, especially with large numbers of processors. The difference found was up to two orders of magnitude. This is partly intrinsic to the algorithms, and partly due to the 1D (MUMPS) versus 2D (SuperLU) matrix partitioning. Furthermore, MUMPS was, for most tested matrices, faster in both the factorization and solve phases. The speed penalty for SuperLU partly comes from the code complexity that is required to preserve the irregular sparsity pattern, and partly from the greater number of communication messages. With more processors, SuperLU showed better scalability, because its 2D partitioning scheme keeps all of the processors busier, despite the fact that it introduces more messages.

In order to illustrate the performance of these tools on a commodity cluster, a series of timings, obtained as part of this thesis with SuperLU DIST on a cluster of 16 AMD Athlon™ 64 2.5 GHz processors interconnected by a Dolphin SCI network, is presented in Table 1.1. Both factorization and forward and backward substitution (F/B Subst.)
procedures were timed for two complex admittance matrices extracted from the South-Southeastern Brazilian Interconnected (SSBI) system (see Section 4.4) and the North American Western Electricity Coordinating Council (WECC) system (see Section 3.4). As can be observed, regardless of the number of processors employed in the computations, the parallel timings are always greater than the timing for a single processor. This seemingly contradictory fact is explained by the additional overhead incurred by the interprocessor communications. The timings also show only a marginal improvement for the WECC system (14,327 nodes) in comparison to the timings recorded for the SSBI system (1,916 nodes). According to the obtained timings, the parallel factorization is roughly 3 to 25 times slower than its sequential counterpart for the SSBI system, and 3 to 20 times slower for the WECC system; the parallel F/B substitutions are about 10 to 45 times slower for the SSBI system, and 7 to 30 times slower for the WECC system. It can also be observed that the choice of the process grid topology greatly influences the timings. In this case, the factorization benefits from column-wise partitioning, while row-wise partitioning benefits the F/B substitutions.

Table 1.1. SuperLU DIST timings on a cluster of 16 AMD Athlon™ 64 2.5 GHz processors built on a single rack and interconnected by a dedicated network.

Process   SSBI (1,916 nodes)           WECC (14,327 nodes)
Grid      Fact. [s]   F/B Subst. [s]   Fact. [s]   F/B Subst. [s]
1         0.0030      0.0005           0.0265      0.0047
1×2       0.0100      0.0121           0.0794      0.0916
2×1       0.0274      0.0051           0.2039      0.0427
1×3       0.0105      0.0146           0.0816      0.1063
3×1       0.0457      0.0054           0.3387      0.0404
2×2       0.0307      0.0113           0.2197      0.0829
4×1       0.0170      0.0051           0.3688      0.0383
1×4       0.0355      0.0195           0.0796      0.1419
5×1       0.0264      0.0050           0.5256      0.0348
1×5       0.0578      0.0183           0.0781      0.1305
2×3       0.0186      0.0100           0.2120      0.0714
3×2       0.0299      0.0073           0.3374      0.0516
6×1       0.0768      0.0046           0.5701      0.0316
1×6       0.0098      0.0226           0.0750      0.1591

An explanation for such low performance is twofold.
Firstly, algorithms such as the one employed by SuperLU DIST have massive supercomputers as their target architectures, whose processors communicate by means of specialized low-latency and high-bandwidth network interconnects. Secondly, the problems aimed at by such tools are of much bigger dimension than power-system-related problems. One can understand bigger as meaning those problems for which the computational burden per processor is much greater than the communication overhead due to the parallelization. In this sense, the foregoing power system admittance matrices (of order 15,000), often considered big in the power systems field, turn out to be small in comparison to the problems often encountered in the computer science field, for which tools such as MUMPS and SuperLU DIST are developed.

1.4.2 Parallel Iterative Methods

According to Saad & Vorst (2000), the usage of iterative methods was originally restricted to systems related to elliptic partial differential equations, discretized with finite difference techniques⁵. For other problems, for instance those related to finite element modeling and large electric and electronic circuit calculations, direct solution techniques were preferred, because of the lack of robustness of iterative solvers for a large number of classes of matrices. Until the end of the 1980s, almost none of the commercial packages for finite element problems included iterative solution techniques. However, for many PDE-related problems, the complexity of the elimination process increases too much to make realistic 3D modeling feasible. For instance, irregularly structured finite element problems of order one million may be solved by direct methods, given a large enough computer (memory-wise), but at a tremendous cost and difficulty.
However, some of such problems can be solved with fewer resources and more efficiently by means of iterative solution techniques, if an adequate preconditioner can be constructed.

As reported by Pai & Dag (1997), iterative solution techniques applied to both static and dynamic simulation problems, typical of large scale power systems, still cannot compete with direct methods because of possible convergence problems. This statement agrees with Saad & Vorst (2000), who reported that large electric and electronic circuits are not easy to solve in an efficient and reliable manner by iterative methods.

The performance of direct methods, both for dense and sparse systems, is largely bound by the factorization of the matrix. This operation is absent in iterative methods, which also lack dense matrix suboperations. Since such operations can be executed at very high efficiency on most current computer architectures, a lower flop rate is expected for iterative than for direct methods. Furthermore, the basic operations in iterative methods often use indirect addressing, which depends on the data structure of the matrix. Such operations also have a relatively low efficiency of execution. On the other hand, iterative methods are usually simpler to implement than direct methods, and since no full factorization has to be stored, they can handle much larger systems than direct methods (Barrett et al., 1994).

Vorst & Chan (1997) compare an iterative with a direct method for a sparse system related to a finite element discretization of a second-order PDE over an irregular grid, where n is the order of the problem. It is shown that the cost of a direct method varies with n^(3/2) for a 2D grid and with n^(5/3) for a 3D grid. Comparing the flop counts for the direct method with those for the Conjugate Gradient (CG) method, it was concluded that the CG method might be preferable in case different systems have to be solved each time, with large n.
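The asymptotic counts quoted from Vorst & Chan (1997) can be made concrete with a short numerical illustration. Assuming the 3D figures discussed in this section, a direct method's triangular solves costing O(n^(5/3)) flops against O(n^(4/3)) for the iterative solver, the ratio between the two grows like n^(1/3), so the iterative method's advantage per solve widens with problem size:

```python
# Illustrative comparison of the 3D-grid flop scalings quoted from
# Vorst & Chan (1997): direct triangular solves ~ n^(5/3) versus an
# iterative solve ~ n^(4/3). The ratio grows like n^(1/3).
for n in (10**4, 10**6, 10**8):
    ratio = n ** (5 / 3) / n ** (4 / 3)
    print(f"n = {n:>9}: direct/iterative flop ratio ~ {ratio:.1f}")
```

For n of one million the ratio is already two orders of magnitude, which is the regime where the text concludes the direct choice is unlikely for 3D problems.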
Furthermore, in case many systems with several different right-hand sides have to be solved, it seems likely that direct methods will be more efficient for 2D problems, as long as the number of right-hand sides is large enough that the cost of constructing the LU factorization is relatively small. For 3D systems, however, it is concluded that such a choice is unlikely, because the flop count for the two triangular solves associated with a direct solution method is proportional to n^(5/3), whereas the number of flops for the iterative solver varies according to n^(4/3).

⁵ Such systems originate from oil reservoir engineering, weather forecasting, electronic device modeling, etc.

Following the argument presented in the previous section on parallel direct methods, it can be concluded that iterative solvers are recommended for solving even bigger problems than the ones tackled by parallel direct solvers, mainly because of memory constraints. For instance, as reported in (Li & Demmel, 2002), SuperLU DIST has played a critical role in the solution of a long-standing problem of scattering in a quantum system of three charged particles. This problem requires solving a complex, nonsymmetric and very ill-conditioned sparse linear system, whose dimension reached 8 million. The task performed by SuperLU DIST was building the block diagonal preconditioners for the Conjugate Gradient Squared (CGS) iterative solver. For a block of size 1 million, SuperLU DIST took 1209 seconds to factorize using 64 processors of an IBM SP and 26 seconds to perform the triangular solutions. The total execution time of the problem was about 1 hour.
This scientific breakthrough result was reported in a cover article of Science magazine (Rescigno et al., 1999).

1.5 Thesis Motivation

Based on the methodologies and experiences reported in the literature overviewed in the previous sections, one can readily identify the parallel solution of power networks as one of the major bottlenecks in achieving fully parallel transient stability simulations.

Although many have addressed the problem of parallelizing the solution of large sparse linear systems, no standard methodology has yet been agreed upon. This can be partially explained by the fact that some algorithms are more suitable for shared-memory systems than for distributed-memory ones, and vice-versa.

From the hardware standpoint, inexpensive commodity clusters, also known as Beowulf clusters (Gropp et al., 2003; Beowulf.org, 2009), provide a technically and economically viable alternative to the expensive vector-based supercomputers. Nowadays, high performance computing clusters can be easily built with off-the-shelf symmetric multiprocessing (SMP) machines interconnected through high-speed, high-bandwidth network cards. Moreover, due to limitations on increasing the number of cores on a single chip, the cluster setup is expected to become the standard in the computer industry.

Bearing in mind the aforementioned high performance computing architectures, the most successful parallel algorithms will likely combine coarse and fine grain approaches. In this sense, by means of coarse grain approaches, one can assign tasks to SMP nodes, which in turn can extract further parallelism by adopting fine grain approaches.

From a software point of view, much effort has been put into developing direct sparse linear solvers. Consequently, algorithms that require a complete redesign and re-implementation of existing sparse solvers are bound to be treated with a high level of scepticism, mainly for economical reasons.
Such an undertaking would certainly require re-educating developers in the use of the new algorithms, which in turn would require from developers time for programming and testing the new routines. The reusability of available and highly-tested sparse routines therefore becomes significant to reduce the costs and time of deployment of newly developed parallel tools.

In this context, the Multi-Area Thévenin Equivalents (MATE) approach, proposed by Martí et al. (2002) and extended by Armstrong et al. (2006), has the potential to provide both the coarse and fine granularities required by SMP clusters. Moreover, the MATE approach also allows one to employ ready-to-use sparse routines, as will be shown in this thesis. As a consequence, with a minimum programming effort, a performance boost is expected in the solution of the large sparse systems which are at the heart of essential power system analysis tools, such as on-line transient stability assessment.

1.6 Thesis Contributions

The main contributions of this thesis are:

• Introduction of the network-based MATE algorithm, which further optimizes the original matrix-based MATE algorithm formulation (Martí et al., 2002) in terms of computation and communication overhead;

• Implementation of the network-based MATE algorithm on a commodity cluster, built with single-core nodes interfaced by a dedicated high-speed network, employing ready-to-use sparsity libraries;

• Development of a performance model for the network-based MATE algorithm, which enabled the establishment of a theoretical speedup limit for the method with respect to traditional sequential sparsity-oriented linear solvers;

• Application of the parallel network-based MATE algorithm to the solution of the network equations associated with transient stability simulations.

1.7 Publications

• Tomim, M. A., De Rybel, T., & Martí, J. R. (2009). Multi-Area Thévenin Equivalents Method Applied To Large Power Systems Parallel Computations. IEEE Transactions on
IEEE Transactionson Power Systems (submitted).• Tomim, M. A., Martí, J. R., & Wang, L. (2009). Parallel solution of large power systemnetworksusing the Multi-Area Thévenin Equivalents (MATE) algorithm. InternationalJournal of Electrical Power & Energy Systems, In Press.• Tomim, M. A., Martí, J. R., & Wang, L. (2008). Parallel computation of large powersystem network solutions using the Multi-Area Thévenin Equivalents (MATE) algo-rithm. In 16th Power Systems Computation Conference, PSCC2008 Glasgow, Scot-land.• De Rybel, T., Tomim, M. A., Singh, A., & Martí, J. R. (2008). An introduction toopen-source linear algebra tools and parallelisation for power system applications. InElectrical Power & Energy Conference, Vancouver, Canada23Chapter 2Network-based Multi-Area ThéveninEquivalents (MATE)The Multi-Area Thévenin Equivalents (MATE) method was proposed by Martí et al. (2002),which extends the idea of diakoptics in terms of interconnected Thévenin equivalents. Assummarized by the authors, the MATE algorithm embodies the concepts of multi-nodeThévenin equivalents, diakoptics, and the modified nodal analysis by Ho et al. (1975), in asimple and general formulation.Although the possibility of parallel computations at the subsystems level has been con-ceived in (Martí et al., 2002; Armstrong et al., 2006), no implementation of the MATE algo-rithm in a distributed computer architecture had been realized until recently. This fact canbe attributed to the only recent release of economically viable commodity clusters, built fromout-of-the-self personal computers interconnected by dedicated local networks, and multi-core processors. Tomim et al. (2008, 2009) present an evaluation of the MATE algorithmapplicability to the solution of nodal equations of large electric power systems in a distributedcomputer architecture. 
The MATE algorithm was implemented using ready-to-use, highly optimized sparse and dense matrix routines on a 16-computer cluster interconnected by a dedicated network.

In this chapter, a formulation of the MATE algorithm from an electrical network perspective will be set forth. In comparison to the algorithm proposed by Martí et al. (2002), this network-based approach adds novel concepts to the algorithm, and further optimizes the amount of operations, memory usage, and data exchange among parallel processes during the actual solution. Before introducing the network-based MATE algorithm, however, the original problem will be restated along with the original MATE formulation. Lastly, the matrix and network-oriented approaches for formulating the MATE algorithm will be qualitatively and quantitatively compared.

2.1 Problem Statement

In order to solve an electric network, one needs to compute the voltages at all its nodes due to the excitation sources to which the system is subjected. A widely used technique for solving power systems is the well-known nodal approach (Grainger & Stevenson, 1994), where the system is modeled as an admittance matrix Y and all excitation sources are converted into current injections, which populate the vector i.

\[ Y v = i \qquad (2.1) \]

Then, taking into consideration the relationship between the nodal voltages v and the current injections i, given by the linear system of equations (2.1), one can finally compute v through Gaussian elimination, for instance.

A second approach, the one presently analyzed in greater detail, considers tearing the system apart into smaller subsystems, then solving the individual parts, and subsequently combining and modifying the solutions of the torn parts to yield the solution of the original untorn problem.
In this case, assume the system S is subdivided into p disconnected areas, or subsystems, such that S = {S1, S2, ..., Sp}, and that these subsystems are interconnected by a set of global links L = {α1, α2, ..., αl}, with l elements. The nodes that form the interface between the subsystem Sk, with k = 1, ..., p, and the interconnection system L will be called border nodes and form the set Bk with bk nodes. Finally, the group of links connected to the subsystem Sk are called local links and form the set Lk with lk elements.

For illustrative purposes, an example of such a system is depicted in Figure 2.1. In this example, there are three disconnected areas, which makes p = 3 and, consequently, S = {S1, S2, S3}. In addition, four global links interconnect the three subsystems, making l = 4 and L = {α1, α2, α3, α4}. As for the local quantities, subsystem S1 has b1 = 3 and l1 = 3; subsystem S2 has b2 = 2 and l2 = 2; and subsystem S3 has b3 = 2 and l3 = 3.

Consider now that each subsystem Sk ∈ S is modeled by nodal equations of the form (2.1), where Yk is the admittance matrix of the isolated subsystem Sk, vk is a vector with nodal voltages, jk a vector with internal injections, and ik a vector with the currents injected by the local links Lk. If the subsystem Sk has nk local buses, Yk is a square and, usually, sparse matrix of dimension nk, whereas vk, ik and jk are vectors of order nk. Additionally, the set of nodes in the subsystem Sk is defined as Nk.

\[ Y_k v_k = i_k + j_k \qquad (2.2) \]

Figure 2.1. Generic electric network partitioned into three subsystems interconnected by four tie lines.

Usually, during power system computations, e.g., dynamic simulations, the subsystem matrices Yk and their associated local injections jk are known, while the local nodal voltages vk and local link injections ik are unknown. Notice, however, that if the link contributions ik were known, vk would be straightforward to compute.
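The subsystem nodal equations (2.2) can be sketched on a toy case. The 3-node admittance matrix and the injection vectors below are made-up values, and a dense solve stands in for the sparse LU routines used in practice:

```python
import numpy as np

# Toy instance of the subsystem nodal equations (2.2): Yk vk = ik + jk,
# for a hypothetical 3-node subsystem Sk (all numbers are made up).
Yk = np.array([[ 3.0, -1.0, -1.0],
               [-1.0,  2.0, -1.0],
               [-1.0, -1.0,  3.0]])
jk = np.array([1.0, 0.0, 0.0])   # internal current injections
ik = np.array([0.0, 0.0, 0.5])   # currents injected by the local links

vk = np.linalg.solve(Yk, ik + jk)   # local nodal voltages of Sk
assert np.allclose(Yk @ vk, ik + jk)
print(vk)
```

The point made in the text is visible here: once the link contributions ik are known, recovering vk is a single (local) linear solve per subsystem.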
Therefore, in the MATE context, the link currents ik, which depend on how the subsystems are connected to one another, need to be computed first and then submitted to their corresponding subsystems, so that the local voltages vk can be obtained.

2.2 MATE Original Formulation

Following the same reasoning presented in (Martí et al., 2002), the three-area electric system depicted in Figure 2.1 can be mathematically represented by the hybrid modified nodal equations given in (2.3). In other words, the latter model comprises nodal equations for the subsystems S1, S2 and S3, and branch equations for the links α1, α2, α3 and α4.

\[
\begin{bmatrix}
Y_1 & & & P_1 \\
& Y_2 & & P_2 \\
& & Y_3 & P_3 \\
P_1^T & P_2^T & P_3^T & -Z_{l0}
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ i_l \end{bmatrix}
=
\begin{bmatrix} j_1 \\ j_2 \\ j_3 \\ 0 \end{bmatrix}
\qquad (2.3)
\]

The matrices Pk, defined in (2.4), capture which nodes in subsystem Sk are connected to links, and whether each link is injecting/drawing current into/from that specific node.

\[
P_k(i,j) =
\begin{cases}
1 & \text{if link } j \text{ injects current into bus } i, \\
-1 & \text{if link } j \text{ draws current from bus } i, \\
0 & \text{otherwise,}
\end{cases}
\qquad (2.4)
\]

where i refers to buses in the subsystem Sk and j to global links.

In order to solve (2.3), one can apply Gaussian elimination to the row corresponding to the link currents il, which yields the new set of equations given in (2.5).

\[
\begin{bmatrix}
Y_1 & & & P_1 \\
& Y_2 & & P_2 \\
& & Y_3 & P_3 \\
& & & Z_l
\end{bmatrix}
\begin{bmatrix} v_1 \\ v_2 \\ v_3 \\ i_l \end{bmatrix}
=
\begin{bmatrix} j_1 \\ j_2 \\ j_3 \\ e_l \end{bmatrix}
\qquad (2.5)
\]

Here, Zl and el are defined below, with k = 1, 2, 3.

\[
Z_l = Z_{l0} + \sum_{\forall k} Z_k^l \qquad e_l = \sum_{\forall k} e_k^l \qquad (2.6)
\]

\[
Z_k^l = P_k^T \, Y_k^{-1} P_k \qquad e_k^l = P_k^T \, Y_k^{-1} j_k \qquad (2.7)
\]

In the transformed system of equations (2.5), each pair Zlk and elk represents the Thévenin equivalent of the subsystem Sk with respect to the links αi, with i = 1, ..., 4, whereas Zl and el represent the reduced version of the original system from the links' perspective.

Ultimately, the original system can be solved by first solving the link currents system (at the bottom of (2.5)) and then appropriately injecting the currents il into
the subsystems Sk, with k = 1, 2, 3.

As pointed out by Martí et al. (2002), the main advantage of the MATE algorithm lies in the fact that the subsystems can be computed independently from each other, apart from the links system of equations. This fact, in turn, leads to other possibilities, such as: subsystems can be solved concurrently at different rates, with different integration methods, or even in different domains (like time and quasi-stationary phasor domains). The possibility of parallel computations at the subsystem level was conceived in (Martí et al., 2002; Armstrong et al., 2006), but only implemented on a distributed computer architecture for solving the nodal equations of large electric power systems in (Tomim et al., 2008).

Recent investigations of the MATE algorithm, however, indicated that the set of equations (2.5), (2.6) and (2.7) hides a few interesting features that help reduce the amount of operations performed in the subsystems and the data exchange between link solver and subsystems. Therefore, in the subsequent sections, the MATE algorithm will come under close scrutiny in order to make such features evident.

2.3 Network-based MATE Formulation

Even though the set of equations given in (2.5) is sufficient to describe the algorithm under study, it hides a few interesting features that help reduce the number of operations performed in the subsystems and the amount of data exchanged between link solver and subsystems. Therefore, in the next sections, the MATE algorithm will be restated from an electric network viewpoint in order to make these features evident.

2.3.1 MATE Algorithm Summary

For solving the problem stated in Section 2.1, one could start by extracting the links from the original system S and finding a multi-node Thévenin equivalent with respect to the border nodes, as suggested in (Dommel, 1996).
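The elimination step (2.5)-(2.7) of the original formulation can be sketched end-to-end on a toy case: two 2-node subsystems tied by a single link. All numerical values below are made up, and dense solves stand in for the sparse subsystem solvers:

```python
import numpy as np

# Toy sketch of the original MATE formulation (2.3)-(2.7): eliminating the
# subsystem blocks yields the link system Zl il = el, with
#   Zl = Zl0 + sum_k Pk^T Yk^{-1} Pk   and   el = sum_k Pk^T Yk^{-1} jk,
# after which vk = Yk^{-1} (jk - Pk il) recovers the subsystem voltages.
Y = [np.array([[2.0, -1.0], [-1.0, 2.0]]),    # Y1 (made-up admittances)
     np.array([[3.0, -1.0], [-1.0, 1.5]])]    # Y2
j = [np.array([1.0, 0.0]), np.array([0.0, 0.5])]
# One link, oriented per (2.4): +1 where it injects, -1 where it draws.
P = [np.array([[0.0], [1.0]]),                # link touches node 2 of S1
     np.array([[-1.0], [0.0]])]               # and node 1 of S2
Zl0 = np.array([[0.1]])                       # link branch impedance

Zl, el = Zl0.copy(), np.zeros(1)
for Yk, jk, Pk in zip(Y, j, P):
    Zl += Pk.T @ np.linalg.solve(Yk, Pk)      # Thevenin impedance term (2.7)
    el += Pk.T @ np.linalg.solve(Yk, jk)      # Thevenin voltage term (2.7)

il = np.linalg.solve(Zl, el)                  # link currents, bottom of (2.5)
v = [np.linalg.solve(Yk, jk - Pk @ il) for Yk, jk, Pk in zip(Y, j, P)]

# The recovered voltages satisfy the link row of (2.3):
assert np.allclose(sum(Pk.T @ vk for Pk, vk in zip(P, v)), Zl0 @ il)
```

Note that the loop body touches only one subsystem's data at a time, which is exactly the independence that makes the subsystem computations parallelizable.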
Since the subsystems are uncoupled a priori, the Thévenin equivalents computation can be done for each subsystem Sk separately. The multi-node Thévenin equivalent of each subsystem, as seen from its border nodes, can be constructed in two basic steps:

(a) Find the voltages at the border nodes due to the internal current and voltage sources;

(b) Find the self and mutual impedances seen from the border nodes (short-circuiting internal voltage sources and opening internal current sources).

As a result of the previous steps, reduced-order networks, completely isolated from each other, will be generated for each subsystem Sk. These equivalent networks, along with the interconnection system, form a reduced-order version of the original system, as seen from the links. The currents flowing in each link αi, with i = 1, ..., l, can then be computed employing the newly constructed equivalent system.

Once the currents flowing in the links are known, each vector ik can be formed according to the links' connections to the subsystems. Lastly, combining the ik vectors with the internal voltage and current sources represented by jk enables the computation of each subsystem's nodal voltages vk using (2.2).

In the next sections, each MATE algorithm building block will be explained in more detail.

2.3.2 Multi-Node Thévenin Equivalents

The first step in the MATE algorithm is constructing multi-node Thévenin equivalents for all subsystems Sk ∈ S with respect to their border node sets Bk. Such equivalent circuits fully describe the voltage-current characteristic of each subsystem at its border nodes, and will provide the fundamental structures for solving the link currents at a later stage.
In addition, as an aid to the construction of the multi-node Thévenin equivalents, a subsystem-to-border mapping will also be introduced.

Subsystem-to-Border Mapping

Given a specific subsystem Sk ∈ S, it will be necessary to map its bk border node quantities onto its nk local bus quantities, i.e., to map quantities defined in Bk onto Nk. One such situation occurs when one needs to express current injections defined in the border node set Bk in terms of the local node set Nk. The inverse mapping, i.e., the mapping of Nk onto Bk, is equally important, since it allows one to gather the border node voltages from the local nodal voltages.

Formally, the mapping of Nk onto Bk can be seen as a bk × nk matrix Qk, defined in (2.8). The matrix Qk is filled with zeros, except for bk ones, which indicate that a specific local node j (related to the columns of Qk) corresponds to the border node i (related to the rows of Qk).

\[
Q_k(i,j) =
\begin{cases}
0 & \text{if } B_k(i) \neq N_k(j), \\
1 & \text{if } B_k(i) = N_k(j),
\end{cases}
\qquad (2.8)
\]

where i = 1, ..., bk and j = 1, ..., nk.

As for the inverse direction of the mapping, it is achieved by (Qk)^T. This inverse mapping simply scatters the vectors defined in the border node set Bk so that they are expressed in terms of the local node set Nk. Although the vectors produced by (Qk)^T have nk elements, only bk of them are non-zero, corresponding to the border nodes.

Thus, if the vector vk contains voltages associated with the set of local nodes Nk, gathering the border node voltages vbk from vk can be summarized as the matrix-vector multiplication expressed by (2.9).

\[ v_k^b = Q_k v_k \qquad (2.9) \]

Figure 2.2. Thévenin equivalent voltages.

On the other hand, consider now a vector ibk that contains current injections associated with the set of border nodes Bk.
If one wants to scatter the injections ibk into another vector, say ik, defined in Nk, one only needs the matrix-vector multiplication expressed by (2.10).

\[ i_k = Q_k^T \, i_k^b \qquad (2.10) \]

For example, for the subsystem S1 of the system depicted in Figure 2.1, the matrix Q1 is shown in (2.11), assuming that the set of border nodes B1 contains b1 = 3 nodes.

\[
Q_1 =
\begin{bmatrix}
1 & \cdots & 0 & \cdots & 0 & \cdots \\
0 & \cdots & 0 & \cdots & 1 & \cdots \\
0 & \cdots & 1 & \cdots & 0 & \cdots
\end{bmatrix}
\qquad (2.11)
\]

where the columns are ordered according to the local nodes in N1 (the three columns shown correspond to the local nodes 1, 3 and 2), and the rows are associated with the border nodes B1(1), B1(2) and B1(3), respectively.

Multi-Node Thévenin Voltage

Multi-node Thévenin voltages are the nodal voltages detected at the border nodes of a specific subsystem due to internal voltage and current sources only. In such a situation, all links connected to the subsystem must be open and all subsystem internal sources active during the computation of the border nodal voltages. Figure 2.2 illustrates the concept of multi-node Thévenin voltages when applied to subsystem S1, depicted in Figure 2.1. Mathematically, the multi-node Thévenin voltage computation can be summarized by the two-step procedure shown in (2.12).

\[ Y_k e_k = j_k \qquad (2.12a) \]
\[ e_k^b = Q_k e_k \qquad (2.12b) \]

First, by solving the linear system (2.12a), the new voltage vector ek is obtained, which contains the Thévenin voltages at all nodes of the subsystem Sk described by Yk. However, since only the Thévenin voltages at the border nodes are needed, one gathers the corresponding voltages in ek and stores them in ebk by means of the subsystem-to-border mapping matrix Qk, according to (2.12b).

Multi-Node Thévenin Impedance

Multi-node Thévenin impedances are the set of self and mutual impedances seen from the border nodes of a specific subsystem Sk when all its internal voltage and current sources are turned off, i.e., the voltage sources become short circuits and the current sources open circuits.
For compactness, they can also be represented as a bk × bk matrix Zbk, which relates the currents injected at the border nodes, ibk, to the border node voltages vbk.

\[ v_k^b = Z_k^b \, i_k^b \qquad (2.13) \]

In the general case, the matrix Zbk can be obtained column by column, by injecting a unitary current into the corresponding border node and measuring the voltages at all border nodes. Thus, bearing in mind the fact that each subsystem Sk is represented by its nodal equations, the voltages at the border nodes can be gathered from repeated solutions of (2.1) for individual current injections at each border node. In turn, each unitary current injection can be represented by a single column vector u_k^{b_i} of size bk, which has a single 1 at position i, associated with the border node Bk(i), while all other components are zero. Now, taking into consideration that Yk is related to the local nodes in Nk, one first needs to map u_k^{b_i} onto Nk, using the border-to-subsystem mapping (Qk)^T. So, collecting all the vectors u_k^{b_i} associated with i = 1, ..., bk, in this order, and applying the transformation (Qk)^T to the result, leads to (2.14). This shows that the unitary current injections required for computing Zbk are readily provided by (Qk)^T.

\[
Q_k^T \begin{bmatrix} u_k^{b_1} & u_k^{b_2} & \cdots & u_k^{b_{b_k}} \end{bmatrix} = Q_k^T \qquad (2.14)
\]

\[
Q_k^T \begin{bmatrix} u_k^{b_1} & u_k^{b_2} & \cdots & u_k^{b_{b_k}} \end{bmatrix}
= Q_k^T
\begin{bmatrix}
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1
\end{bmatrix}
= Q_k^T
\qquad (2.15)
\]

Hence, a general procedure for computing Zbk with respect to the border nodes of Sk is summarized by (2.16). Since (Qk)^T has bk columns, equation (2.16a) is equivalent to solving a sparse linear system of order nk, defined by Yk, bk times, and storing each solution in a column of Zk.

Figure 2.3. Multi-node Thévenin equivalent impedances construction.
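The procedure just described, factoring Yk once and solving it against each column of (Qk)^T, can be sketched on a toy case. The 4-node subsystem and its choice of border nodes below are hypothetical, and a dense multi-right-hand-side solve stands in for the bk repeated sparse solves:

```python
import numpy as np

# Sketch of the border Thevenin impedance procedure (2.16) for a
# hypothetical 4-node subsystem whose border nodes are local nodes 1 and 4
# (0-based indices 0 and 3). All admittance values are made up.
Yk = np.array([[ 3.0, -1.0, -1.0,  0.0],
               [-1.0,  3.0, -1.0, -1.0],
               [-1.0, -1.0,  3.0, -1.0],
               [ 0.0, -1.0, -1.0,  3.0]])
border = [0, 3]
bk, nk = len(border), Yk.shape[0]

Qk = np.zeros((bk, nk))                 # subsystem-to-border mapping (2.8)
Qk[np.arange(bk), border] = 1.0

Zk = np.linalg.solve(Yk, Qk.T)          # Yk Zk = Qk^T: one column per border node (2.16a)
Zbk = Qk @ Zk                           # gather the bk x bk Thevenin impedances (2.16b)

assert Zbk.shape == (bk, bk)
assert np.allclose(Zbk, Zbk.T)          # reciprocal network => symmetric Zbk
print(Zbk)
```

The key cost observation carries over from the text: only one factorization of Yk is needed, reused for the bk right-hand sides supplied by (Qk)^T.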
Afterwards, one only needs to gather the impedances associated with the border nodes in Zk and store them in Zbk, as stated by (2.16b).

\[ Y_k Z_k = Q_k^T \qquad (2.16a) \]
\[ Z_k^b = Q_k Z_k \qquad (2.16b) \]

Figure 2.3 illustrates the concept of multi-node Thévenin impedances when applied to subsystem S1, depicted in Figure 2.1.

Multi-Node Thévenin Equivalents

After equations (2.12) and (2.16) have been solved for each subsystem Sk ∈ S, a series of circuits, synthesized as shown in Figure 2.4, is constructed. Algebraically, the multi-node Thévenin equivalent can be represented by (2.17), where ibk represents the current injections at the border nodes, vbk lumps all the voltage drops across the subsystem Sk, and ebk corresponds to the vector with the Thévenin voltages that lie behind the Thévenin impedance network Zbk.

\[ v_k^b = Z_k^b \, i_k^b + e_k^b \qquad (2.17) \]

\[
v_k^b = \begin{bmatrix} \bar V_k^1 - \bar V_k^{1'} \\ \vdots \\ \bar V_k^i - \bar V_k^{i'} \\ \vdots \\ \bar V_k^{b_k} - \bar V_k^{b_k'} \end{bmatrix}
\qquad
i_k^b = \begin{bmatrix} \bar I_k^1 \\ \vdots \\ \bar I_k^i \\ \vdots \\ \bar I_k^{b_k} \end{bmatrix}
\qquad
e_k^b = \begin{bmatrix} \bar E_k^1 \\ \vdots \\ \bar E_k^i \\ \vdots \\ \bar E_k^{b_k} \end{bmatrix}
\]

Figure 2.4. Multi-node Thévenin equivalent of a generic subsystem Sk with bk border nodes.

2.3.3 Multi-Area Thévenin Equivalents

Once all subsystems Sk ∈ S have been reduced to their multi-node Thévenin equivalents, one needs to interconnect them through the link system, so that the currents flowing between adjacent subsystems can be calculated. This final equivalent of the original system with respect to all links is then called the Multi-Area Thévenin Equivalent (MATE). Analogously to what has been done for the multi-node Thévenin equivalents construction, a link-to-border mapping will also be introduced.

Link-to-Border Mapping

The aim of the link-to-border mapping is to represent current injections at the border nodes Bk in terms of the currents flowing in the links defined in the set of global links L.
For instance, the subsystem S1 in Figure 2.1 has b1 = l1 = 3, and the current injection at each border node can be seen as an injection coming from a single link, respecting the proper link current orientation. Thus, if ¯Iik is the current injection at border node Bk(i), then ¯I11 = −¯Iα1, ¯I21 = ¯Iα3, and ¯I31 = −¯Iα2. In matrix form, the previous relationships yield (2.18), where ib1 is the vector of border node injections, il is the vector of link currents according to the orientation adopted in Figure 2.1, and R1 is a b1×l matrix that represents the mapping between the two vectors.

    ib1 = R1 il    (2.18)

where

    ib1 = [ ¯I11  ¯I21  ¯I31 ]T        il = [ ¯Iα1  ¯Iα2  ¯Iα3  ¯Iα4 ]T

    R1 = [ −1   0   0   0
            0   0   1   0
            0  −1   0   0 ]

For the subsystem S3 in the same example, b3 = 2, l3 = 3, and the current injections at each border node become combinations of one or more links. In this case, ¯I13 = ¯Iα4 and ¯I23 = ¯Iα2 − ¯Iα3, which leads to (2.19):

    ib3 = R3 il    (2.19)

where

    ib3 = [ ¯I13  ¯I23 ]T        R3 = [ 0  0  0  1
                                        0  1 −1  0 ]

So far, only the current injections have been related by means of the link-to-border mapping. In the case of the voltages, the mapping direction has to be reversed. This can be explained by the fact that every local link connected to the same border node senses the same nodal voltage, although the link orientation may change the polarity of that voltage. This is the case in subsystem S3 in Figure 2.1, where links α2 and α3 are joined at the border node B3(2). Thus, if ¯Vik is the nodal voltage at border node Bk(i), and ¯Vαj,k the voltage at link αj whose termination is in the subsystem Sk, equation (2.20) applies:

    vl3 = (R3)T vb3    (2.20)

where

    vb3 = [ ¯V13  ¯V23 ]T        vl3 = [ ¯Vα1,3  ¯Vα2,3  ¯Vα3,3  ¯Vα4,3 ]T

    (R3)T = [ 0  0
              0  1
              0 −1
              1  0 ]

In a general form, the mapping Rk, from L onto Bk, can be defined by (2.21).
Conversely, the border-to-link mapping is achieved by employing (Rk)T.

    Rk(i,j) = +1 if L(j) injects current into Bk(i),
              −1 if L(j) draws current from Bk(i),
               0 otherwise                              (2.21)

where i = 1, ..., bk and j = 1, ..., l.

Note that Rk is sparse and each row, associated with one border node of subsystem Sk, has as many non-zero entries as the number of links connected to that node. Therefore, the number of non-zeros in Rk always equals the number of local links in Lk, i.e., lk.

Link Thévenin Equivalents

At this point, all subsystems Sk ∈ S have been reduced to their sets of border nodes Bk by means of multi-node Thévenin equivalents and can finally be reconnected through the link system. Since equation (2.17) is expressed in terms of the border nodes in Bk, each multi-node Thévenin equivalent has to be referenced to the set of global links L, so that all equivalents can be connected to one another. The resulting equivalents are called link Thévenin equivalents, and will be discussed next.

Recalling that the vectors vbk and ebk, with border node voltages and Thévenin voltages, respectively, can be mapped onto L by means of (Rk)T, and that the link currents vector il can be mapped onto each Bk by means of Rk, one can pre-multiply (2.17) by (Rk)T and substitute ibk = Rk il, as follows:

    (Rk)T vbk = [ (Rk)T Zbk Rk ] il + (Rk)T ebk

which yields:

    vlk = Zlk il + elk         (2.22a)
    Zlk = (Rk)T Zbk Rk         (2.22b)
    elk = (Rk)T ebk            (2.22c)

In the above equations, elk represents the vector of Thévenin voltages associated with subsystem Sk that partially drive the global link currents il, while the vector vlk contains the total voltage drops across the subsystem Sk due to il and elk.
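Definition (2.21) and the transformation (2.22) translate directly into code. The sketch below reuses R1 from the example of (2.18) and applies it to an assumed multi-node Thévenin equivalent of S1 (the impedance and voltage values are hypothetical):

```python
import numpy as np

# Link-to-border mapping R1 of subsystem S1 from (2.18): rows are border
# nodes, columns are global links, entries are +/-1 per link orientation
R1 = np.array([[-1,  0, 0, 0],
               [ 0,  0, 1, 0],
               [ 0, -1, 0, 0]], dtype=float)

# Hypothetical multi-node Thévenin equivalent of S1 (b1 = 3 border nodes)
Zb1 = np.array([[0.30, 0.05, 0.02],
                [0.05, 0.25, 0.04],
                [0.02, 0.04, 0.20]])
eb1 = np.array([1.00, 0.98, 1.02])

# Link Thévenin equivalent per (2.22b)-(2.22c)
Zl1 = R1.T @ Zb1 @ R1      # l x l, here 4 x 4
el1 = R1.T @ eb1           # length l

print(Zl1.shape)           # (4, 4); row/column 4 is zero, since no local
                           # link of S1 is associated with link alpha4
```

The zero row and column illustrate the sparsity the text exploits: Zlk only carries lk² non-zero entries, those tied to the subsystem's local links.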
Lastly, the l×l matrix Zlk, given in (2.22b), is the set of impedances Zbk seen from the global links.

For example, applying the link-to-border mapping R2 to the multi-node Thévenin equivalent Zb2, obtained with respect to the border nodes of subsystem S2 in Figure 2.1 and given in (2.23), one obtains el2, vl2 and Zl2, given in (2.24), which form the link Thévenin equivalent of subsystem S2.

    Zb2 = [ ¯Z11,2  ¯Z12,2        R2 = [ 1  0  0  0
            ¯Z21,2  ¯Z22,2 ]             0  0  0 −1 ]                       (2.23)

    el2 = (R2)T eb2 = [ ¯Eα1,2  ¯Eα2,2  ¯Eα3,2  ¯Eα4,2 ]T = [ ¯E12  0  0  −¯E22 ]T    (2.24a)

    vl2 = (R2)T vb2 = [ ¯Vα1,2  ¯Vα2,2  ¯Vα3,2  ¯Vα4,2 ]T = [ ¯V12  0  0  −¯V22 ]T    (2.24b)

    Zl2 = (R2)T Zb2 R2 = [  ¯Z11,2  0  0  −¯Z12,2
                             0      0  0   0
                             0      0  0   0
                           −¯Z21,2  0  0   ¯Z22,2 ]                         (2.24c)

As a matter of fact, equation (2.22) can be seen as a polyphase voltage source elk in series with the polyphase impedance Zlk. Thus, the link Thévenin equivalent of the subsystem S2, described by (2.24), can be represented by the equivalent circuits shown in Figure 2.5.

[Figure 2.5. Link Thévenin equivalent of subsystem S2 in Figure 2.1 in detailed and compact forms, respectively.]

Multi-Area Thévenin Equivalents

Once the link Thévenin equivalents are found for all subsystems Sk ∈ S, there exists for each subsystem a polyphase representation similar to the one shown in Figure 2.5. Since each phase of each link Thévenin equivalent is associated with one single global link, they can all be connected in series, as shown in Figure 2.6.

[Figure 2.6. Multi-area Thévenin equivalent of the system in Figure 2.1 in its detailed and compact versions, respectively.]

In addition to the subsystems, one can also add the impedances of the interconnection system, represented by the matrix Zl0. In the case of the example system depicted in Figure 2.6, Zl0 is a diagonal matrix, because all links are considered uncoupled. In a more general case, however, couplings among links could be straightforwardly added as off-diagonal elements of Zl0.

Finally, based on the circuit in Figure 2.6b, one can combine all polyphase impedances and voltage sources accordingly, which results in the linear system (2.25), to be solved for the link currents il.

    Zl = Zl0 + Σ_{Sk∈S} Zlk    (2.25a)
    el = Σ_{Sk∈S} elk          (2.25b)
    Zl il = el                 (2.25c)

2.3.4 Subsystems Update

Since all link currents il are known at this stage, one only needs to build the proper ik from the link currents il and, together with the internal currents jk, solve the linear system (2.1) for the nodal voltages vk of each subsystem Sk.

For building ik, a two-step conversion is proposed: first from links to border nodes and then from border nodes to subsystem nodes. Mathematically, this task is achieved by means of the transformations Rk and (Qk)T, defined in (2.21) and (2.8), respectively, and expressed in (2.26):

    ik = (Qk)T Rk il    (2.26)

By the end of this stage, each subsystem Sk ∈ S will have its set of nodal voltages vk known, which completely solves the original untorn system.

2.4 MATE: Original versus Network-based

The difference between the original MATE formulation, introduced by Martí et al.
(2002), and the network-based formulation presently discussed lies in the definitions of the subsystem-to-border mapping Qk, given in (2.8), and the link-to-border mapping Rk, given in (2.21), and in how they relate to Pk, given in (2.3).

From (2.3), notice that Pk converts the global link currents il into currents injected into the subsystem Sk. Therefore, each matrix Pk can be seen as a link-to-subsystem mapping from L to Nk, as defined in (2.4) and emphasized in (2.27):

    ik = Pk il    (2.27)

Comparing (2.27) to (2.26), one can verify that the following equality applies:

    Pk = (Qk)T Rk    (2.28)

Similarly to the other mappings, the reverse mapping of Pk is also given by its transpose, (Pk)T, and can be used to convert the nodal voltages defined in Nk to voltages at the link terminals in L.

The relationships among all defined mappings are summarized in Figure 2.7, which includes the flow directions of node voltages and current injections among the subsystem nodes set Nk, the border nodes set Bk and the global links set L.

[Figure 2.7. Relationships among the various mappings Rk, Qk and Pk.]

The main benefits of using the Rk and Qk mappings instead of Pk are the following:

Less computational effort to obtain the multi-area Thévenin equivalent Zl: The multi-area Thévenin equivalent Zl can be interchangeably defined by (2.25a) and (2.6). However, if each link Thévenin equivalent Zlk, which will later form the multi-area Thévenin equivalent Zl, is computed according to (2.7), the nodal equations associated with the subsystem Sk need to be solved l times, where l is the number of global links. Even when the empty columns of Pk are skipped during the computation of Zlk, the subsystem Sk still needs to be solved at least lk times, where lk is the number of local links.
On the other hand, if each link Thévenin equivalent Zlk is computed according to (2.22b), the computational burden is concentrated in forming the multi-node Thévenin equivalents Zbk, which requires only bk solutions for each subsystem Sk. Taking into consideration that, for large systems, the number of border nodes bk ≤ lk ≪ l, a reasonable gain in performance should be expected.

Less data exchange between subsystem processes and the link solver: The link-to-border mapping Rk works as an indirect reference to the multi-node Thévenin equivalent Zbk, which has only bk² elements, to form the link Thévenin equivalent Zlk, which in turn has lk² non-zero elements. Without the information carried by the link-to-border mapping Rk, Zlk would have to be formed by the subsystems and sent to the link solver. Moreover, without Rk, the link solver would have no option but to send all link currents to all subsystems Sk ∈ S. With Rk, on the other hand, the link solver only needs to send the net currents injected at the border nodes of each subsystem. Therefore, the link-to-border mapping Rk plays an important role in minimizing the data exchanged between the subsystems and the link solver.

Global link information is confined to the link solver process: If the link Thévenin equivalents Zlk were formed by the subsystem processes, information regarding the global links would be required by all subsystems and would, therefore, need to be replicated in all processes participating in the solution. The multi-node Thévenin equivalents Zbk, however, only require information regarding the local links, reducing again the memory usage and the amount of global information that needs to be shared among the processes.

2.5 Conclusion

In this chapter, a step-by-step formulation of the MATE algorithm from a network standpoint has been developed.
The formulation revealed properties of the subsystems' interface nodes (or border nodes) and of the links that can be used to minimize both subsystem computations and communication overhead.

The distinct mappings among subsystem nodes, border nodes and links, introduced in this formulation, also make subsystem computations and communications rely on local information only. This characteristic also plays an important role in the aforementioned communication and computation reduction.

In the next chapter, a thorough assessment of the computation and communication effort required by the foregoing methodology, in terms of a performance model of the network-based MATE algorithm, will be presented.

Chapter 3

Network-based MATE Algorithm Implementation

Although diakoptics theory is very well established in the literature, not many implementations of such techniques are available in the power systems field. This fact is likely due to the great success of sparsity techniques in solving power systems problems on sequential computers, and to the high costs involved with early parallel machines.

However, with the advent of cheaper commodity computer clusters and multi-core processors, diakoptics-based as well as other parallel algorithms have been drawing more attention in the power systems field. Therefore, decision makers need appropriate tools to compare existing traditional software with newly developed parallel alternatives. This requires an understanding of the related computational costs, which can be developed and formalized in mathematical performance models.
These models can be used to compare the efficiency of different algorithms, to evaluate scalability, and to identify possible bottlenecks and other inefficiencies, all before investing substantial effort in an implementation (Foster, 1995).

Even though many parallel algorithms have been proposed in the computer science and power systems fields for solving large sparse linear systems (see Chapter 1 for references), their performance varies greatly according to the available computing architecture and to the type and size of the problem under analysis. Depending on the target architecture, the maximum communication overhead that can be tolerated due to parallelization, while still ensuring reasonably efficient parallel computations, may drastically decrease, which is the case for distributed systems. As a consequence, fine-grain parallel algorithms may not be recommendable or may even become unfeasible.

An implementation of the MATE algorithm suitable for distributed computing architectures will be set forth, according to the theoretical development presented in Chapter 2. One of the reasons supporting this choice comes from the fact that, in the MATE algorithm, the amount of communication remains limited, as long as the system under analysis is partitioned into subsystems in such a way that the interconnections among them also remain limited.

In the development, the proposed implementation will also be modeled in terms of performance metrics, such as computation and communication times, efficiency and speedup. For this task, the implementation will be assessed from the computational and communication standpoint, aiming at the timings required to compute the subsystem and link equations, and to exchange data between subsystem and link solver processes.
Afterwards, the aforementioned performance metrics for the MATE parallel algorithm, relative to sequential sparsity techniques, will be evaluated and analyzed.

3.1 MATE Algorithm Flow Chart

In Figure 3.1, the proposed MATE algorithm flow chart is depicted. Since all subsystems are similar in behavior, only one generic subsystem Sk is represented. Moreover, each communication point between the subsystem Sk and the link solver should actually be seen as a collective communication point, where all subsystems Sk ∈ S communicate with the link solver and vice-versa.

At the beginning, all processes need to acquire the data they rely upon. At this stage, all subsystems concurrently load their local admittance matrices Yk, local current injections jk, and the information regarding the links connected to each specific subsystem, summarized by Lk. As for the link solver, information about all links interconnecting all subsystems, denoted simply by L, needs to be loaded. Subsequently, the subsystems identify the border node sets Bk and form the subsystem-to-border mappings Qk and the link-to-border mappings Rk, according to the definitions in (2.8) and (2.21). Afterwards, the link-to-border mappings Rk are sent to the link solver, so that it becomes aware of which links are associated with each subsystem Sk, and of their respective border nodes. At this point all subsystems start computing their multi-node Thévenin equivalents, formed by Zbk and ebk, according to (2.12), (2.16) and (2.17), and send them to the link solver.
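The subsystem/link-solver cycle can be exercised end-to-end on a toy two-subsystem network and checked against a direct solution of the untorn system. All matrices, injections and mappings below are fabricated for illustration, the exchange is emulated sequentially in one process rather than over MPI, and the polarity chosen for el matches this example's link orientation:

```python
import numpy as np

# Toy network: two 2-node subsystems, one link alpha (impedance Zl0)
# joining S1's node 1 to S2's node 0. All values are invented.
Y1 = np.array([[2.0, -1.0], [-1.0, 1.5]]);  j1 = np.array([1.0, 0.0])
Y2 = np.array([[1.5, -1.0], [-1.0, 2.0]]);  j2 = np.array([0.0, 0.5])
Zl0 = np.array([[0.5]])          # link impedance matrix

Q1T = np.array([[0.0], [1.0]])   # border-to-subsystem mapping (Q1)^T
Q2T = np.array([[1.0], [0.0]])
R1 = np.array([[-1.0]])          # link draws current from S1's border node
R2 = np.array([[ 1.0]])          # ... and injects it into S2's border node

# Subsystem step: multi-node Thévenin equivalents per (2.12) and (2.16)
Zb1 = Q1T.T @ np.linalg.solve(Y1, Q1T)
Zb2 = Q2T.T @ np.linalg.solve(Y2, Q2T)
eb1 = Q1T.T @ np.linalg.solve(Y1, j1)
eb2 = Q2T.T @ np.linalg.solve(Y2, j2)

# Link solver step: multi-area Thévenin equivalent and link currents
Zl = Zl0 + R1.T @ Zb1 @ R1 + R2.T @ Zb2 @ R2
el = -(R1.T @ eb1 + R2.T @ eb2)  # polarity chosen for this orientation
il = np.linalg.solve(Zl, el)

# Subsystem update step per (2.26): map link currents back, solve voltages
v1 = np.linalg.solve(Y1, j1 + Q1T @ R1 @ il)
v2 = np.linalg.solve(Y2, j2 + Q2T @ R2 @ il)

# Reference: direct solution of the untorn 4-node system
yl = 1.0 / Zl0[0, 0]
Yfull = np.zeros((4, 4))
Yfull[:2, :2] = Y1;  Yfull[2:, 2:] = Y2
Yfull[1, 1] += yl;  Yfull[2, 2] += yl;  Yfull[1, 2] -= yl;  Yfull[2, 1] -= yl
vfull = np.linalg.solve(Yfull, np.concatenate([j1, j2]))

print(np.allclose(np.concatenate([v1, v2]), vfull))   # True
```

The torn and untorn solutions agree exactly, as diakoptics guarantees; a distributed implementation would replace the in-process variable passing with the gather/scatter exchanges of the flow chart.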
After receiving these equivalents, the link solver assembles the multi-area Thévenin equivalent, formed by Zl and el, and computes the link currents il, following the procedure described by (2.22) and (2.25). With the link currents il available, the link solver scatters them appropriately, i.e., respecting the link-to-border mappings Rk, among the subsystems, which can, finally, update their local injections jk with the link currents coming from the link solver and compute the local node voltages vk.

In the next sections, a more elaborate discussion of the foregoing procedure will be presented, along with performance aspects of each task involved in the network-based MATE implementation.

[Figure 3.1. MATE algorithm flow chart: each subsystem reads Yk, jk and Lk, finds Bk, forms Qk and Rk, computes Zbk and ebk, receives ibk, updates jk and solves Yk vk = jk; the link solver reads L, gathers the Rk, Zbk and ebk, solves Zl il = el, and scatters the border node injections ibk.]

3.2 MATE Performance Model

In this section, the MATE method for solving large linear electric networks on a distributed computer architecture will be modeled in terms of performance metrics, such as computation and communication times, efficiency and speedup. For this task, the algorithm presented earlier will be assessed from the computational and communication standpoint, aiming at the timings required to compute the subsystem and link equations, and to exchange data between subsystem and link solver processes.
Afterwards, the aforementioned performance metrics for the MATE parallel algorithm, relative to sequential sparsity techniques, will be evaluated and analyzed.

3.2.1 Performance Model Preliminaries

In order to model the performance of the MATE algorithm, one first needs to understand the computational and communication requirements of each of MATE's tasks, taking into perspective the type of problems intended to be solved.

For the processes performing tasks associated with subsystem Sk, the execution time T(Sk) is defined as follows:

    T(Sk) = T_comp^Thv(Sk) + T_comm^Thv(Sk) + T_idle^lnk(Sk) + T_comm^i(Sk) + T_comp^v(Sk)    (3.1)

where

    T_comp^Thv(Sk) = time spent computing the multi-node Thévenin equivalent Zbk and ebk of the subsystem Sk ∈ S;
    T_comm^Thv(Sk) = time spent sending Zbk and ebk to the link solver process;
    T_idle^lnk(Sk) = time spent waiting for the link solver process to compute the link currents il;
    T_comm^i(Sk) = time spent receiving the border node injections ibk from the link solver;
    T_comp^v(Sk) = time spent solving for the node voltages vk given the updated current injections ik.

For the link solver process, which solves the equations associated with the link system L, or multi-area Thévenin equivalent, the execution time is given by:

    T(L) = T_idle^subs(L) + T_comm^Thv(L) + T_comp^i(L) + T_comm^i(L)    (3.2)

where

    T_idle^subs(L) = time spent waiting for the subsystems Sk ∈ S to compute Zbk, ebk and vk;
    T_comm^Thv(L) = time spent receiving Zbk and ebk from the subsystems Sk ∈ S;
    T_comp^i(L) = time spent setting up and computing the link current equations;
    T_comm^i(L) = time spent scattering the border node current injections ibk to the subsystems Sk ∈ S.

All of the above timings are shown in the MATE algorithm timeline example illustrated in Figure 3.2. Three processes are shown: two subsystem processes, A and B, and the link solver process. At instant t0, all subsystem processes start computing their corresponding multi-node Thévenin equivalents. Once the first of them is ready, at instant t1, the link solver process can start receiving it, while the other processes finish their computation. Right after finishing sending their multi-node Thévenin equivalents to the link solver, the subsystems start waiting for the link solver feedback. At instant t2, the link solver has received all Thévenin equivalents and starts setting up and computing the link current equations. Once the link currents become available, the link solver process sends them to the subsystems at instant t3. As soon as the subsystems receive the border node current injections due to their local links, they are able to update their local current injections and solve for their local node voltages. After instant t4, the link solver process remains waiting for the next batch of Thévenin equivalents, while the subsystems finish updating their local variables.

[Figure 3.2. MATE timeline for two subsystems, S1 and S2, and the link solver.]

From the previous discussion on the MATE timeline, the total time of the network-based MATE algorithm, T_MATE, can be estimated by the sum of the terms related to the most time-consuming subsystem Smax, in addition to the data exchanges and computational tasks
Thus, the foregoing estimation can be denoted as follows:TMATE = TThvcomp(Smax) + TThvcomm(L) + Ticomp(L) + Ticomm(L) + Tvcomp(Smax) (3.3)Next, each time component, introduced above, will be formulated, based on the compu-tational complexity or communication requirements of each of the MATE’s tasks describedin Section 2.3. In summary, the computational and communication tasks covered in thefollowing sections are:• Computational Tasks:– Multi-node Thévenin equivalents computation– Link currents computation– Subsystems update• Communication Tasks:– Subsystems multi-node Thévenin equivalents gathering– Link currents scattering3.2.2 Computational Aspects of MATEIn general, the most computationally expensive tasks of the MATE algorithm are the com-putation of the multi-node Thévenin equivalents at the subsystems level, and the solutionof the link currents at the link solver level.Since the main purpose of the MATE algorithm is to tackle very large systems, the sub-systems, obtained by means of any partitioning heuristic, will be also large enough to renderthese subsystems very sparse. Therefore, the application of well-known sparsity techniqueswill be fundamental to guarantee efficiency of the subsystems tasks. As for the link solvertasks, traditional dense matrix operations are required.Multi-node Thévenin equivalentsBased on (2.12) and (2.16), notice that both ebk and Zbk are solutions of the same linearsystem (2.1) defined by Yk, with only different right-hand sides, i.e., jk for computing ebkand (Qk)Tfor computing Zbk. In this way, each subsystem’s local network, expressed by (2.1)has to be solved bk times for building Zbk, and one time for ebk. Since the matrix Yk is usuallyvery sparse, sparse techniques for solving this linear system should be employed. A widelyused technique for solving large sparse linear systems is LU factorization.46Chapter 3. 
The time necessary to compute the multi-node Thévenin equivalents can then be approximated by the following expression:

    T_comp^Thv(Sk) = T_1fact(Sk) + (Nz − 1) T_2fact(Sk) + (Nz bk + Ni) T_solv(Sk)    (3.4)

where T_1fact(Sk) represents the time for the first factorization of the admittance matrix Yk, which includes both the symbolic and the numerical factorization. In turn, T_2fact(Sk) is the time spent in a numerical factorization that reuses the structures of the factors Lk and Uk acquired during the symbolic factorization of Yk. Additionally, T_solv(Sk) is the time for solving the triangular systems defined by the factors Lk and Uk. The parameter Nz captures the number of topological changes in subsystem Sk, i.e., how often Zbk needs to be recalculated, whereas the parameter Ni represents the number of changes in the injected currents.

Results obtained by Alvarado (1976) with respect to the order of complexity of sparse factorizations and repeated solutions of triangular systems suggest that T_1fact(Sk), T_2fact(Sk) and T_solv(Sk) can be approximated by polynomial functions, as follows:

    T_1fact(Sk) = a3 ρk² nk + a2 ρk nk + a1 nk + a0    (3.5)
    T_2fact(Sk) = b3 ρk² nk + b2 ρk nk + b1 nk + b0    (3.6)
    T_solv(Sk) = c2 ρk nk + c1 nk + c0                 (3.7)

where nk is the number of buses in subsystem Sk and ρk represents the branch-to-bus ratio of the same subsystem. These relationships show that more interconnected systems, i.e., systems with higher branch-to-bus ratios ρk, are more computationally expensive. On the one hand, if a system has all its buses interconnected to one another, its branch-to-bus ratio ρ equals n−1, which makes its factorization and repeated solution times of order O(n³) and O(n²), respectively.
On theother hand, for loosely interconnected systems, as is the case of most power systems, whereeach bus connects to just a few other neighboring buses, the factorization and repeatedsolution times vary practically linearly with the number of buses n in the system, whichmeans an order of complexityO(n)6.In summary, the aspects that influence the computation of the multi-node Théveninequivalents are: subsystem size, topology, ordering of the nodes, as well as the number ofborder nodes. Both size and number of border nodes are highly dependent on the parti-tioning used to tear the original system apart. Ideally, the employed partitioning techniqueshould make subsystems as balanced as possible, while keeping the number of border nodes6For further details, see Appendix A47Chapter 3. Network-based MATE Algorithm Implementationreasonably small. This also helps in reducing the amount of data that needs to be exchangedwith the link solver.Subsystems updateOnce the local current injections jk + ik are available, the sparse linear system (2.2) needsto be solved. Since the subsystem matrix Yk is the same as the one used to compute themulti-node Thévenin equivalents, the factors Lk and Uk can be reused to solve (2.2) forthe node voltages vk, by means of sparse forward/backward substitutions. Hence, usingTsolv(Sk), which is defined in (3.7), the time required to calculate vk is as follows:Tvcomp(Sk) = Ni Tsolv(Sk) (3.8)Notice from (3.8) that each subsystem has to be solved as many times as their currentinjections change, i.e., Ni times.Link currents computationConsidering that all p subsystems multi-node Thévenin equivalents are available at the linksolver, one can build the multi-area Thévenin equivalent, formed by Zl and el, by meansof the relationships denoted in (2.25). 
In these equations, the link Thévenin equivalents, defined by (2.22), are built in place on top of the multi-area Thévenin equivalent, Zl and el, according to (3.9):

    Zl = Zl0 + Σ_{k=1}^{p} (Rk)T Zbk Rk    (3.9a)
    el = Σ_{k=1}^{p} (Rk)T ebk             (3.9b)

Once the multi-area Thévenin equivalent seen from the links is built, the link currents computation can proceed. In this case, the dense linear system denoted in (2.25c) needs to be solved, by first factorizing Zl and then solving the triangular systems associated with the obtained dense L and U factors, by means of forward and backward substitutions. Moreover, since the subsystems may suffer topological changes during simulations, their corresponding link Thévenin equivalents Zlk will also change. Therefore, if any subsystem Sk ∈ S has its topology changed Nz times, the matrix Zl also needs to be re-factorized Nz times. Similarly, if the subsystems' current injections change Ni times over a simulation period, the link Thévenin voltages el also have to be computed Ni times. So, based on the fact that the number of operations required to factorize Zl is of order O(l³), and forward/backward substitutions of order O(l²), one can conclude that:

    T_comp^i(L) = Nz T_fact(L) + Ni T_solv(L)    (3.10)

where T_fact(L) and T_solv(L) can be approximated by

    T_fact(L) = a3 l³ + a2 l² + a1 l    (3.11)
    T_solv(L) = b2 l² + b1 l            (3.12)

3.2.3 Communication Aspects of MATE

In the MATE algorithm, two main communication tasks are necessary: gathering the multi-node Thévenin equivalents in the link solver and scattering the link currents to the subsystems. For establishing the performance model of such routines, the PLogP model, introduced by Kielmann et al. (2000), was employed. Known as the parameterized LogP model, PLogP is an extension of the LogP model originally proposed by Culler et al. (1996).
Differently from the original LogP, which considers fixed parameters for modeling the performance of dedicated networks, PLogP includes the dependencies on the message sizes. The parameters which PLogP employs are described as follows:

• L: end-to-end latency from process to process, which includes the time for copying the data to and from the network interfaces and transferring the data over the physical network;
• os(m): period of time in which the processor is engaged in sending a message of size m;
• or(m): period of time in which the processor is engaged in receiving a message of size m;
• g(m): minimum time, or gap, between consecutive message transmissions or receptions, which also considers all contributing factors, including os(m) and or(m).

In the next sections, the link currents scattering will be explained first, followed by the multi-node Thévenin equivalents gathering.

Link currents scatter

At this stage of the solution, each of the p subsystems Sk ∈ S has to receive from the link solver process, L, the net currents ibk injected by the lk local links at the bk border nodes. These current injections are the result of a linear combination of the link currents il obtained by means of the link-to-border mapping Rk, as shown in (3.13):

    ibk = Rk il    (3.13)

The procedure described by (3.13) represents, in fact, a pre-processing step performed by the link solver on the link currents il just before sending them to the subsystems. As mentioned earlier, by sending the border node net current injections ibk, instead of the global link currents vector il, one reduces the communication overhead of the algorithm.

In addition, once the subsystems have received their net injections ibk, these are added to the local subsystem current injections jk by means of the border-to-subsystem mapping (Qk)T, as shown in (3.14).
The final vector jk is later used for the subsystem update, as discussed in Section 3.2.2.

    jk ← jk + (Qk)T ibk    (3.14)

In fact, the present communication task corresponds to a customized scatter function, according to the MPI Standard (2008). This is shown graphically in Figure 3.3. In this procedure, the parameters L, os(m), or(m) and g(m) characterize the behavior of a network, according to the PLogP model (Kielmann et al., 2000).

The link currents scatter routine starts in the link solver process, which sends p consecutive messages to the distinct processes handling the subsystems. As seen in Figure 3.3, the first message alone takes a time equal to os(m1) + L + or(m1) to be completely received by its process. However, a minimum gap of g(m1) seconds must elapse before the link solver can start sending the second message. The second message thus leaves the link solver process only at the g(m1) + os(m2) mark and is completely received by the respective subsystem at g(m1) + os(m2) + L + or(m2). Following the previous reasoning, one can find the time consumed by the link solver process to send p messages of sizes mk (k = 1, ..., p) to the subsystem processes. This time is given by L + os(mp) + or(mp) + Σ_{k=1}^{p−1} g(mk). Finally, the link currents scatter timing T_comm^i(L) is given in (3.15):

    T_comm^i(L) = L + os(mp) + or(mp) + Σ_{k=1}^{p−1} g(mk)    (3.15)

[Figure 3.3. Link currents scatter according to the PLogP model.]

Recalling that each message of size mk is associated with the currents being injected at the bk border nodes of subsystem Sk, there is a linear relationship between mk and bk. In fact, this relationship depends on the data type that represents each injection. If a single current injection takes c bytes of memory, the message size mk is defined by (3.16):

    mk = c bk    (3.16)

Considering that longer messages take longer periods of time to be transferred to another process, one can conclude that the function g(mk) is always positive and increases monotonically with the message size mk. Therefore, for any message size mk, g(mk) ≤ g(m*), where m* is the longest message amongst all mk, for k = 1, ..., p. Consequently, the following inequality is also true:

    Σ_{k=1}^{p−1} g(mk) ≤ Σ_{k=1}^{p−1} g(m*) = (p − 1) g(m*)    (3.17)

The fact that the gap g includes both overheads os and or leads to g(mk) ≥ os(mk) and g(mk) ≥ or(mk) (Kielmann et al., 2000). Combining these last pieces of information with (3.15) and (3.17) yields (3.18), which represents an upper bound for the link currents scatter time T_comm^i(L):

    T_comm^i(L) ≤ L + (p + 1) g(m*)    (3.18)

Multi-node Thévenin equivalents gather

In this case, each of the p subsystems Sk ∈ S has to send its previously computed multi-node Thévenin equivalent, formed by Zbk and ebk, to the link solver process. In fact, this task corresponds to a customized gather function (MPI Standard, 2008). Considering that each one of the multi-node Thévenin equivalents is associated with a message of size mk, the gather routine follows the procedure described in Figure 3.4.

[Figure 3.4. Multi-node Thévenin equivalents gather according to the PLogP model.]
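The p-message transfer time (3.15), which the gather routine shares, can be evaluated directly, together with the upper bound (3.18). In the sketch below, the PLogP parameter functions are hypothetical linear models, not measurements of any real network:

```python
def plogp_transfer_time(L, o_s, o_r, g, msg_sizes):
    """Time for one process to exchange p consecutive messages, per PLogP:
    L + os(m_p) + or(m_p) + sum of g(m_k) over the first p - 1 messages."""
    last = msg_sizes[-1]
    return L + o_s(last) + o_r(last) + sum(g(m) for m in msg_sizes[:-1])

# Hypothetical PLogP parameters (seconds as a function of bytes)
L = 50e-6
o_s = lambda m: 2e-6 + 0.4e-9 * m
o_r = lambda m: 2e-6 + 0.4e-9 * m
g   = lambda m: 5e-6 + 1.0e-9 * m       # gap dominates os and or, as assumed

# Border-node injection messages of p = 4 subsystems, c = 16 bytes each,
# so mk = c * bk per (3.16)
sizes = [16 * bk for bk in (12, 9, 15, 10)]
t = plogp_transfer_time(L, o_s, o_r, g, sizes)

# Upper bound (3.18): L + (p + 1) * g(m*), with m* the longest message
bound = L + (len(sizes) + 1) * g(max(sizes))
print(t <= bound)   # True
```

Because g(m) here is monotonically increasing and dominates os(m) and or(m), the bound (3.18) holds by construction, mirroring the derivation via (3.17).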
In light of the PLogP model (Kielmann et al., 2000), the time that the link solver takes to receive p consecutive messages of size m_k corresponds to L + o_s(m_p) + o_r(m_p) + Σ_{k=1}^{p−1} g(m_k) seconds.

Differently from the link currents scatter, for the multi-node Thévenin gather procedure two kinds of messages need to be considered: the full multi-node Thévenin equivalent, consisting of Z^b_k and e^b_k, and the multi-node Thévenin equivalent voltage e^b_k alone. While the first kind is sent to the link solver only when a subsystem topology changes, the second one is sent whenever the internal current injections vary. The sizes of the messages carrying the information of the multi-node Thévenin equivalents of the subsystem S_k are summarized by (3.19).

m_k = c·b_k·(b_k + 1)   if Z^b_k and e^b_k are sent,
m_k = c·b_k             if only e^b_k is sent.    (3.19)

where c represents the amount of memory, in bytes, that each element of either Z^b_k or e^b_k takes.

Finally, the communication time T^Thv_comm(L), perceived by the link solver process, can be calculated as follows:

T^Thv_comm(L) = L + o_s(m_p) + o_r(m_p) + Σ_{k=1}^{p−1} g(m_k)    (3.20)

Notice that when only e^b_k needs to be gathered in the link solver, T^Thv_comm(L) is equal to the link currents scatter time T^i_comm(L), defined in (3.15). Moreover, the upper bound (3.18) found for T^i_comm(L) is applicable to T^Thv_comm(L) as well, which leads to (3.21).

T^Thv_comm(L) ≤ L + (p + 1)·g(m_max)    (3.21)

3.2.4 MATE Speedup and Efficiency

According to Foster (1995), both speedup and efficiency are relative quantities, which are usually measured with respect to the best known algorithm available to perform the same task under analysis.
Hence, since sparsity-based network solvers are widely used in the power system industry (Tinney & Walker, 1967), they are a natural choice as a comparative baseline to estimate speedup and efficiency.

Sparsity-based solver computation time

When applying sparsity techniques alone to solve a large electric network, two basic tasks need to be performed: firstly, factorization of the system matrix Y and, secondly, solution of two triangular sparse systems for each current injection i. Thus, keeping in mind that the system topology may change N_z times and its current injections change N_i times, the sparsity-based network solver computation time T_SPARSE may be defined as denoted below:

T_SPARSE = T_1fact(S) + (N_z − 1)·T_2fact(S) + N_i·T_solv(S)    (3.22)

where T_1fact(S), T_2fact(S) and T_solv(S) follow the relationships (3.5), (3.6) and (3.7), respectively.

MATE versus Sparsity Alone

Plugging (3.3) and (3.22) into the definitions for speedup and efficiency yields the following expressions:

S_MATE = T_SPARSE / T_MATE    (3.23)

E_MATE = S_MATE / (p + 1)    (3.24)

The above expressions synthesize multivariable functions dependent on aspects which involve characteristics of the system to be solved, such as system topology and partitioning strategy⁷. Furthermore, software implementation and hardware capabilities also considerably impact metrics such as speedup and efficiency. Hence, the performance of the MATE algorithm needs to be evaluated on a case-by-case basis, considering the system under study in addition to the computer architecture and software available.
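Definitions (3.23) and (3.24) can be encoded directly; the timing values below are placeholders, meant only to illustrate how the efficiency accounts for the extra link solver processor:

```python
# Speedup (3.23) and efficiency (3.24) of MATE relative to a sequential
# sparse solver. Timing numbers are assumed placeholders, not measurements.

def mate_speedup(t_sparse, t_mate):
    return t_sparse / t_mate

def mate_efficiency(speedup, p):
    # p subsystem processes plus one link solver -> p + 1 processors
    return speedup / (p + 1)

p = 8                 # number of partitions (assumed)
t_sparse = 120.0      # sequential sparse solver time [s] (assumed)
t_mate = 30.0         # total MATE time on p + 1 processors [s] (assumed)

s = mate_speedup(t_sparse, t_mate)
e = mate_efficiency(s, p)   # speedup spread over p + 1 = 9 processors
```

Note that a fourfold speedup on nine processors yields an efficiency below 50%, which is why the penalty terms analyzed later matter.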
In the following section, the proposed MATE performance model will be evaluated for a test case recently presented in (Tomim et al., 2008).

3.2.5 MATE Performance Qualitative Analysis

In order to provide means to qualitatively understand the constraints on the network-based MATE algorithm, a qualitative analysis based on the computational and communication costs discussed previously will be presented.

For this analysis, one first splits the total MATE timing T_MATE, given by (3.3), into three distinct components, given by (3.25): T_P, T_L and T_C. The component T_P is associated with the parallel tasks performed by the processes handling the subsystems. The second component T_L refers to the sequential computations carried out by the link solver. Finally, the component T_C accumulates the communication overhead incurred by the MATE algorithm.

T_P = T^Thv_comp(S_max) + T^v_comp(S_max)    (3.25a)
T_L = T^i_comp(L)    (3.25b)
T_C = T^Thv_comm(L) + T^i_comm(L)    (3.25c)

Employing the expressions obtained in Section 3.2, the set of timings above can be explicitly written in terms of the variables that directly affect the performance of the MATE algorithm, as shown in (3.26).

T_P = [N_z·k_sf + (N_z·b_max + 2N_i)·k_ss]·n_max    (3.26a)
T_L = N_z·k_df·l³ + N_i·k_ds·l²    (3.26b)
T_C = (p + 1)·[2N_i·g(m_e) + N_z·g(m_t)]    (3.26c)

where m_e and m_t are relative to the memory associated with the Thévenin equivalent voltages and the full Thévenin equivalent, respectively.

⁷ Note that p refers to the number of partitions a given system is torn into, and the number of processors employed in the MATE algorithm equals p + 1. Therefore, p + 1 processors are used in order to compute the efficiency E_MATE.
Furthermore, n_max represents the order of the biggest subsystem S_max, which, for well balanced partitions, can be approximated by N/p, where p is the number of subsystems that form the original untorn system of order N.

Under similar circumstances, the sequential sparse solver timing T_SPARSE, defined in (3.22), assumes the following form:

T_SPARSE = (N_z·k_sf + N_i·k_ss)·N    (3.27)

Plugging the expressions defined above into the speedup definition (3.23) gives

S_MATE ≈ p / { [N_z·k_sf + (N_z·b_max + 2N_i)·k_ss] / (N_z·k_sf + N_i·k_ss)
  + [N_z·k_df·l³ + N_i·k_ds·l²] / [(N_z·k_sf + N_i·k_ss)·n_max]
  + (p + 1)·[2N_i·g(m_e) + N_z·g(m_t)] / [(N_z·k_sf + N_i·k_ss)·n_max] }    (3.28)

Next, two separate analyses of the speedup S_MATE introduced in (3.28) will be discussed: (a) for when the number of factorizations N_z approximates the number of repeated solutions N_i, and (b) for when the number of repeated solutions N_i is much greater than the number of factorizations N_z.

Case for N_z ≈ N_i

In such a situation, (3.28) becomes:

S_MATE |(N_z ≈ N_i) ≈ p / { [k_sf + (b_max + 2)·k_ss] / (k_sf + k_ss)
  + [k_df·l³ + k_ds·l²] / [(k_sf + k_ss)·n_max]
  + (p + 1)·[2g(m_e) + g(m_t)] / [(k_sf + k_ss)·n_max] }    (3.29)

In general, subsystems become more interconnected as the number of partitions increases, which makes the number of border nodes b_max, the number of links l, and the communication-related terms g(m_e) and g(m_t) increase as well, while only n_max decreases. Therefore, the denominator of (3.29) will always increase. Moreover, the term associated with the number of links l can be expected to present the fastest increase rate, due to its cubic power.

As a result, the network-based MATE algorithm will drastically lose performance whenever systems require frequent factorizations. Such a performance loss can only be alleviated in case the hardware/software-associated coefficients k_df, k_ds, g(m_e) and g(m_t) can be reduced. However, such a reduction can only be achieved up to some extent, by means of specialized hardware/software for dense matrix operations and data communication.

Case for N_z ≪ N_i

In this case, the speedup achieved with the MATE algorithm approximates the expression denoted in (3.30).

S_MATE |(N_z ≪ N_i) ≈ p / [ 2 + (k_ds·l²)/(k_ss·n_max) + 2(p + 1)·g(m_e)/(k_ss·n_max) ]    (3.30)

Following the trend of the previous case, the denominator in the speedup expression still tends to increase as the number of partitions grows. However, its increase occurs at a much slower rate, due to the quadratic term associated with the number of links l.

From the expression above, even for partitions that are very weakly connected, i.e., l ≪ n_max, the speedup would still be limited by p/2, which represents the theoretical speedup limit in case both the dense operations in the link solver and the data communications were instantaneous. This limit is explained by the fact that each subsystem is solved twice each time the original untorn system is solved.

In order to keep the speedup as close to p/2 as possible, the two varying denominator terms should be much less than unity. In this sense, these terms can be seen as penalty factors on the network-based MATE algorithm due to the sequential computations performed on the link solver and the data exchange among processes. These penalty factors are defined in (3.31), which also states the constraints that should be imposed on them in order to maximize the algorithm's performance.

λ_link = (k_ds·l²)/(k_ss·n_max) ≪ 1    (3.31a)

λ_comm = 2(p + 1)·g(m_e)/(k_ss·n_max) ≪ 1    (3.31b)

The inequality given in (3.31a) serves as a constraint on the number of links that a system of size N can have whenever it is torn into p subsystems. In other words, it reflects the fact that the workload in the subsystems has to be much heavier than the workload in the link solver. Therefore, the MATE algorithm has better chances of achieving higher speedups when solving larger systems.
Constraint (3.31b) provides a bound on the communication time associated with the Thévenin equivalent voltages, which depends on the inter-message gap g(m_e). Similarly to the previous term, it reflects the fact that the Thévenin equivalents communication overhead should be much lower than the computations. In other words, subsystems cannot be too small; otherwise, the Thévenin equivalents communication timings would eventually surpass their computation time, degrading considerably the algorithm's performance.

3.3 Hardware/Software Benchmarks

The parameters that describe the performance of the available hardware and software combination are of fundamental importance to predict the performance of the network-based MATE implementation without actually implementing it.

Sparse and dense operations modules as well as interprocess communication kernels were benchmarked for the distributed computing environment available in the UBC Power Engineering Group Laboratory. As detailed in (De Rybel et al., 2008), this environment consists of 16 AMD Athlon™ 64 2.5 GHz PC units built on a single rack, interconnected by two distinct private networks: one built with Dolphin SCI (Scalable Coherent Interface) network cards and another built with Gigabit Ethernet cards.

3.3.1 Sparse Linear Solver Benchmark

For the sparse routines, two specific kernels need to be benchmarked, namely, the LU factorization and the repeated triangular solutions.

In this study, the sparse linear solver kernels implemented in the SuperLU 3.0 library were employed (Demmel et al., 1999; Li et al., 2003). For the timing procedure, a number of WECC subsystems, generated in the partitioning stage (see Section 3.4.1, page 70), were fully solved, i.e., factorized and solved, 1000 times.
Computation time averages are plotted in Figure 3.5.

Another important aspect that has to be considered when solving large sparse linear systems is minimizing the fill-in of the factors L and U during the factorization process. Minimizing the fill-in has a twofold benefit: it reduces both the memory consumption and the number of required floating-point operations. Thus, choosing a good ordering technique for the sparse matrices of interest ensures the efficiency of their solutions. For the present work, the employed ordering technique is known as multilevel nested dissection, whose implementation is available in the METIS 4.0 library (Karypis & Kumar, 1998a,b), introduced in Section 3.4.1.

By means of linear regression, the timing data previously acquired were fitted to the modeling functions (3.5), (3.6) and (3.7). A summary of the results of the data fitting procedure is presented in Table 3.1, where the parameters associated with each component of the modeling functions are shown. The term associated with ρ²n is not observable in the fitted function, which shows that the computation effort required by the SuperLU routines to solve the WECC system increases linearly with the size of the subsystems. This is the same behavior expected from sparse band matrices (Alvarado, 1976).

Figure 3.5. Sparse operations benchmark: first factorization, same-pattern factorization and repeated solution times versus system dimension (measured and fitted).

The sparse routines' throughputs are given in Figure 3.7. The results show that the repeated solutions are more efficient than the factorizations in terms of floating-point operations per second.
As for the factorizations, it can be observed that the symbolic factorization required by the first-time factorization reduces considerably the efficiency of the routine when compared to the same-pattern factorization.

Table 3.1. Data fitting summary for the sparse operations.

First Factorization (R² = 0.99749)
Component    Value            C.I. (95%)*
ρn           1.4956×10⁻¹¹     ±9.7840×10⁻¹³
n            5.0778×10⁻⁶      ±8.1457×10⁻⁸
1            −1.3855×10⁻³     ±1.1733×10⁻⁴

Same Pattern Factorization (R² = 0.99803)
Component    Value            C.I. (95%)*
ρn           2.0521×10⁻¹²     ±2.4136×10⁻¹³
n            1.5777×10⁻⁶      ±2.0094×10⁻⁸
1            −4.7687×10⁻⁴     ±2.8943×10⁻⁵

Repeated Solution (R² = 0.99869)
Component    Value            C.I. (95%)*
ρn           5.6283×10⁻¹³     ±3.5275×10⁻¹⁴
n            2.6740×10⁻⁷      ±2.9368×10⁻⁹
1            −6.7353×10⁻⁵     ±4.2301×10⁻⁶

* 95% confidence interval for the parameter estimates.

3.3.2 Dense Linear Solver Benchmark

Following the same procedure adopted for timing the sparse solver kernels, both dense routines, namely, the LU factorization and the repeated triangular solutions, were benchmarked. The dense routines employed in this study are implemented in the GotoBLAS library (Goto, 2006; Goto & Van De Geijn, 2008a,b), which provides a large set of highly optimized BLAS (Basic Linear Algebra Subprograms) (Blackford et al., 2002) and LAPACK (Linear Algebra Package) (Anderson et al., 1999) routines.

For the benchmarking process, instead of using matrices extracted from the previous partitions (multi-node Thévenin equivalents, for instance), randomly generated dense matrices were fully solved 1000 times each. The measured timings are shown in Figure 3.6. The results of the data fitting procedure, using the functions defined in (3.11) and (3.12), are shown in Table 3.2.

The throughput for the dense factorization and repeated solution is shown in Figure 3.8. Differently from what was observed for the sparse routines, the factorization is more efficient than the repeated solutions in terms of floating-point operations per second.
This can be explained by the fact that, for dense matrices, the factorization can be performed in blocks. Such a characteristic allows implementations on modern processors, such as GotoBLAS, that considerably minimize RAM memory accesses and maximize cache memory usage.

Figure 3.6. Dense operations benchmark: factorization and repeated solution times versus matrix dimension (measured and fitted).

Such improved performance of dense matrix operations is often exploited in sparse-oriented algorithms. For instance, blocked strategies for handling sparse matrices are used in the aforementioned SuperLU (Demmel et al., 1999; Li et al., 2003) and UMFPACK (Davis, 2004) libraries.

Figure 3.7. Sparse operations throughput.

Figure 3.8. Dense operations throughput.

Table 3.2. Data fitting summary for the dense operations.

Factorization (R² = 0.99975)
Component    Value            C.I. (95%)*
l³           6.5066×10⁻¹⁰     ±5.2271×10⁻¹²
l²           2.7058×10⁻⁸      ±1.1134×10⁻⁹
l            8.8535×10⁻⁸      ±1.9420×10⁻⁸

Repeated Solution (R² = 0.99855)
Component    Value            C.I. (95%)*
l²           4.9757×10⁻⁹      ±1.9051×10⁻¹⁰
l            −2.2719×10⁻⁷     ±9.1490×10⁻⁸
1            2.8222×10⁻⁶      ±8.9642×10⁻⁷

* 95% confidence interval for the parameter estimates.

3.3.3 Communication Libraries Benchmark

As previously discussed in Section 3.2.3, in order to properly characterize a network according to the PLogP model (Kielmann et al., 2000), the latency L, sending overhead o_s(m), receiving overhead o_r(m) and inter-message gap g(m) are required.
The underlying hardware is not the only aspect that influences these parameters; the software layer responsible for handling the interprocess communications also plays a role.

The Message Passing Interface (MPI) was chosen for the implementation of the communication operations required by the network-based MATE algorithm. This choice is mainly due to the fact that, since its creation in 1994, the MPI Standard (2008) has become the de facto standard in the high-performance computing industry, specifying the interfaces to a number of point-to-point and collective interprocess communication kernels. As such, parallel MPI-based programs are portable and can be easily compiled against many hardware-specific MPI implementations.

Among the several MPI implementations publicly available, the well-known MPICH2 (Argonne National Laboratory, 2007) and NMPI (NICEVT, 2005) libraries were selected. MPICH2 is a widely used MPI implementation, developed by the Argonne National Laboratory, USA, which provides the entire MPI 2.1 standard (MPI Standard, 2008) for shared memory and a number of distributed network stacks, such as TCP, InfiniBand, SCTP and Myrinet. The NMPI library, based on MPICH2, implements the MPI standard over SCI networks. These two ready-to-use MPI libraries are examples of highly tested and hardware-specialized pieces of software, which abstract the programming from the hardware, significantly reducing the development time.

Since two network types are available in the employed parallel computing environment, the acquisition of the network parameters, latency L, sending overhead o_s, receiving overhead o_r and minimum inter-message gap g, not only provides the means to predict the network performance, but also enables comparing the two networks.
Results of the benchmark, summarized in Appendix C, are depicted in Figures 3.9 and 3.11 for the SCI and Ethernet-based networks available in the UBC Power System Engineering Group's cluster.

The minimum inter-message gaps g, shown in Figure 3.9 for both Ethernet and SCI networks, are inversely related to the maximum bandwidth of the network. Since the parameter g(m) denotes the minimum time necessary to send an m-byte message to another process, the bandwidth of the network can be calculated as the ratio between the message size m and its associated gap g(m), as shown in (3.32).

B(m) = m / g(m)    (3.32)

The inter-message gap g(m) plots, shown in Figure 3.11, suggest that, for the Ethernet and SCI networks, g(m) can be approximated by a piece-wise linear function, such as (3.33). In this expression, the coefficient G represents the gap per byte of a long message, according to the LogGP model proposed by Alexandrov et al. (1995), whereas g₀ coincides with the estimated minimum gap between zero-length messages, g(0).

g(m) = G·m + g₀    (3.33)

As a consequence, for very large messages, the bandwidth B(m) reaches a steady state, i.e., the maximum network bandwidth, which is identical to the reciprocal of G, i.e., B_max = 1/G. According to the fitted parameters for both Ethernet and SCI networks, reported in Table 3.3, the maximum bandwidth of the Gigabit Ethernet network was 122 MB/s, while the SCI network yielded 182 MB/s.

The other parameters, the latency L and the overheads o_s and o_r for sending and receiving messages, respectively, were also obtained, according to the approach summarized in Appendix C and depicted in Figures 3.9 and 3.10. One similarity observed for both networks is, in fact, related to some of MPICH2's design strategies (Argonne National Laboratory, 2007) regarding the handling of short and large messages. Two message protocols are usually employed by MPI implementations to differentiate short and large messages, namely, the eager and the rendezvous protocols⁸.
In both the MPICH2 and NMPI libraries, the default threshold for the message size is 128 kB, which is characterized by sharp discontinuities in both networks' benchmarks.

⁸ For more information about these protocols, see Appendix C.

For the Ethernet network, the average values of o_s and o_r vary practically linearly, as observed in Figures 3.9a and 3.10. In this case, the receiving overhead o_r is slightly less than the sending overhead o_s for message sizes smaller than 128 kB. For large messages, however, due to the switching between the eager and rendezvous message protocols, the receiving overhead o_r increases considerably, approaching the inter-message gap g, while the sending overhead o_s presents just a slight increase. Such an increase in o_r is related to the extra handshakes required between processes, in order to guarantee available memory for the transmission and readiness of the receiver.

For the SCI network, in addition to the eager and rendezvous message protocols, an extra message protocol is provided, which takes advantage of hardware-specific features for sending very short messages of less than 12 kB. In this short message protocol, the messages are encoded so that the receiver can interpret the message without communicating with the sender. The extra work required to decode short messages is captured by the faster increase rate verified for the receiving overhead o_r. As also observed in Figure 3.9b, the sending overhead o_s coincides with the inter-message gap g, due to the error checking included in the implementation provided by the NMPI library, which blocks the process until the transfer is complete.
For the rendezvous protocol region, similarly to the Gigabit Ethernet behavior, both overheads become practically equal because of the additional handshakes required between processes.

Other aspects regarding the performance of each network can be extracted by comparing the sending overhead o_s and the inter-message gap g, according to Culler et al. (1996). For the Ethernet network, for instance, o_s < g for all benchmarked message sizes, which shows that the hardware represents the bottleneck for the communications, since the CPU is able to process data at a much faster pace than the NIC is able to handle it. On the other hand, the SCI network presented o_s = g throughout the tested message size range, which shows that the software is the bottleneck of the communications.

Regarding the latencies, they cannot be measured directly from any process, but only inferred, because the latency captures the time that the message spends in the network. According to the procedure summarized in Appendix C, the latency L associated with the MPICH2 implementation over the Gigabit Ethernet network was 12.8 µs. As for the SCI network, because of the blocking behavior of the embedded error checking functions, the latency cannot be extracted as accurately as for the Gigabit Ethernet network; only estimates can be obtained. In this case, half of the round trip timing for zero-byte messages yields an upper limit for the latency. The measured round trip timing for zero-byte messages was about 7 µs, which limits the latency to 3.5 µs. Such a latency clearly shows the superior processing power of the SCI network.

Table 3.3. Parameter summary for Ethernet and SCI networks.

Ethernet (latency L = 12.8 µs)
Coefficient    Value                C.I. (95%)*
g₀ [µs]        0.0 (152.68)†        ±0.0 (±2020.7)†
G [µs/kB]      8.4944 (8.4055)†     ±0.0147 (±14.077)†

SCI (latency L = 3.5 µs)‡
Coefficient    Value                C.I. (95%)*
g₀ [µs]        6.6246               ±0.5435
G [µs/kB]      5.6415               ±0.0081

* 95% confidence interval for the parameter estimates.
† Parameters in brackets are fitted for messages of size greater than 128 kB.
‡ Latency upper bound (half of the round trip time of a zero-byte message).

One major difference that should also be noticed is the variability of both networks, which is remarkably lower for the SCI network than for the Ethernet network, as the error bars for the receiving overheads o_r show in Figure 3.9. Such variability depends very much on the loading conditions of the network and the number of connections made to a single network interface. As a consequence, in collective communications, which are necessary for implementing the network-based MATE algorithm, the Ethernet network is expected to present even higher variability if messages are sent to the same process from many others simultaneously.

Network-based MATE Communication Time

Now that all required network parameters are available, the performance of both Gigabit Ethernet and SCI networks can be compared in the context of the network-based MATE algorithm.

As previously analyzed in Section 3.2.3, the communications required by the network-based MATE algorithm involve gathering the subsystems' Thévenin equivalents in the link solver and scattering the link currents back to the subsystems. The timing estimates for these tasks are defined, respectively, as T^Thv_comm(L), denoted in (3.20), and T^i_comm(L), denoted in (3.15). Since these timings follow an identical structure, (3.34) will be used for comparing the two network interfaces.

T^MATE_comm(L) = L + o_r(m) + o_s(m) + (p − 1)·g(m)    (3.34)
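As a rough numerical check, (3.34) can be evaluated with the gap model (3.33) and the coefficients of Table 3.3. The sketch below neglects the overheads o_s and o_r, which (3.34) also includes, so it underestimates both timings and serves only to compare the two networks:

```python
# Rough estimate of the MATE communication time (3.34) for the Gigabit
# Ethernet and SCI networks, using g(m) = G*m + g0 from (3.33) with the
# fitted coefficients of Table 3.3. Send/receive overheads are neglected,
# so these are comparative lower bounds, not predictions.

def g_gap(m_kb, G, g0):
    """Inter-message gap in microseconds; message size in kB (3.33)."""
    return G * m_kb + g0

def t_comm(p, m_kb, L, G, g0):
    """Approximation of (3.34): L + (p - 1) * g(m), overheads omitted."""
    return L + (p - 1) * g_gap(m_kb, G, g0)

p = 14       # number of subsystems
m = 160.0    # full Thevenin equivalent message [kB]

# Table 3.3 coefficients (for Ethernet, the rendezvous-region fit, m > 128 kB)
t_eth = t_comm(p, m, L=12.8, G=8.4055, g0=152.68)
t_sci = t_comm(p, m, L=3.5, G=5.6415, g0=6.6246)
assert t_sci < t_eth   # SCI is expected to be faster for many subsystems
```

Even with the overheads omitted, the per-message gap dominates for 14 subsystems, which is why the SCI advantage grows with the number of partitions.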
Figure 3.9. Benchmark results of two MPI implementations over (a) Gigabit Ethernet and (b) SCI networks: sending overhead o_s(m), receiving overhead o_r(m) and inter-message gap g(m).

Figure 3.10. Sending and receiving overheads and inter-message gaps for MPI over SCI and Gigabit Ethernet networks.

The message sizes m depend on the communication task, the size of the subsystems' Thévenin equivalents and the number of link currents. For the Thévenin equivalents' gathering, for instance, there are two types of messages: full Thévenin equivalents, which consist of impedances and voltages, and the Thévenin voltages alone. The sizes of the Thévenin equivalents vary on a case-by-case basis and depend mostly on the system's topology and partitioning strategy. Although the partitioning of the system will be discussed in the next section, the subsystems' information presented in Table 3.4 shows that a real power system with about 15,000 buses, partitioned into 2 up to 14 subsystems, may present Thévenin equivalents of order ranging from 1 to 100.
Considering now that the Thévenin equivalents are complex-valued and each complex number takes 16 B of memory, for the previous system, message sizes ranging up to about 1.6 kB for the Thévenin voltages alone, and up to about 160 kB for the full Thévenin equivalents, may be expected.

In Figure 3.12, the estimated timings T^MATE_comm(L) are plotted for 2 and 14 subsystems. In the case of two subsystems, the estimated timings for both SCI and Ethernet networks are very close to one another for message sizes smaller than 128 kB, while the estimated timings for the Ethernet network exceed the ones for the SCI network by about 20%. For 14 subsystems, the SCI network is expected to present a better performance throughout the observed message size range. In this case, the estimated timings for the Gigabit Ethernet network can be up to 55% higher than the timings estimated for the SCI network.

The estimated timing T^MATE_comm(L), defined in (3.34), however, does not consider data collisions, an assumption that holds only for the SCI network (IEEE, 1993). In the case of Ethernet networks, data collisions are handled by the CSMA/CD algorithm specified in the IEEE 802.3 standard. In this algorithm, whenever the Ethernet card detects a data collision, the data is retransmitted after a time delay, known as the backoff delay, determined by the truncated binary exponential backoff algorithm⁹. In the MATE context, such a characteristic may seriously degrade the efficiency of the Thévenin equivalents gathering routine and may even lock up all subsequent computations. This undesirable behavior can be explained by the many sending requests the link solver process may receive from the subsystems' processes.
A condition that may even worsen this scenario is when the subsystems' computations are equally balanced and the sending requests are posted to the link solver almost simultaneously.

⁹ After i collisions, a random number of slot times between 0 and 2^i − 1 is chosen. The truncated aspect refers to the maximum number of acceptable increases before the exponentiation stops.

Figure 3.12. Comparison of Ethernet and SCI networks' timings for the network-based MATE communications, for 2 and 14 subsystems.

Due to its better performance and inherent determinism, the SCI network will be employed for the network-based MATE algorithm timings presented in the next section.

3.4 Western Electricity Coordinating Council System

The Western Electricity Coordinating Council (WECC) is responsible for coordinating the bulk electric system, including generation and transmission, serving all or part of the 14 Western American States, in addition to British Columbia and Alberta in Canada. The WECC region encompasses a vast area of nearly 1.8 million square miles, which makes this system the largest and most diverse of the eight regional councils of the North American Electric Reliability Council (NERC) depicted in Figure 3.13 (http://www.wecc.biz/wrap.php?file=wrap/about.html).

For the present analysis, the admittance matrix which represents the WECC electric network will be employed. The WECC system presented has 14,327 buses and 16,607 branches, resulting in an admittance matrix of order 14,327 with 47,541 non-zero elements. These numbers show that only about 0.023% of the WECC system matrix is actually filled with non-zeros, which characterizes the high level of sparsity of such large power systems.
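The sparsity level follows directly from the matrix order and the non-zero count:

```python
# Density of the WECC admittance matrix: non-zeros over total entries.
n, nnz = 14327, 47541
density = nnz / (n * n)   # ~2.3e-4, i.e. about 0.023% of entries are non-zero
```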
The WECC admittance matrix pattern is shown in Figure 3.14.

3.4.1 WECC System Partitioning

Load balancing and minimization of the communication volume between subsystems and of the number of links play a vital role when implementing the MATE algorithm. Ideally, subsystems need to be as equally sized and have as few links as possible. In such a case, the local computations (e.g., Y matrix factorization and v solution) should require about the same time to be performed for each subsystem, and the interface operations (e.g., multi-node Thévenin equivalents computation and communication) should be as fast and balanced as possible.

A supporting tool, METIS 4.0 (Karypis & Kumar, 1998a,b), intended for partitioning large unstructured graphs, was employed to assist in the partitioning procedure. This general-purpose graph partitioner takes an undirected graph (defined by the Y matrix topology) as input and splits it into the requested number of partitions, so that they have nearly the same number of nodes and a minimum number of links interconnecting them.

The WECC system was partitioned into from 2 to 14 subsystems using two different techniques available in the METIS package, namely:

(a) Multilevel recursive bisection method;
(b) Multilevel k-way method.

The results for each partitioning method are presented in Table 3.4 (see page 74). Based on these results, aspects that influence the quality of the partitions obtained with both methods will be evaluated, in light of the MATE algorithm requirements.

Load balance among subsystems

One of the first aspects that should be discussed is how balanced the subsystems' operations are.
From the subsystems performance model described by (3.4) and (3.8), and later in (3.25), it can be verified that the number of buses n_k and the number of border nodes b_k considerably affect the load balance among subsystems.

In this sense, for both partitioning heuristics, the subsystems' average size µ(n_k) decreases according to n/p, where n is the size of the original system and p the number of subsystems, which indicates a fair level of balance among subsystems. More specifically, the smaller standard deviations of n_k (given by σ(n_k) in Table 3.4) lead to the conclusion that the subsystems obtained with the recursive bisection method are generally better balanced in terms of n_k than those yielded by the k-way method.

Figure 3.13. North American Electric Reliability Council (NERC) Regions: Florida Reliability Coordinating Council (FRCC), Midwest Reliability Organization (MRO), Northeast Power Coordinating Council (NPCC), Reliability First Corporation (RFC), SERC Reliability Corporation (SERC), Southwest Power Pool (SPP), Texas Regional Entity (TRE), Western Electricity Coordinating Council (WECC). (http://www.nerc.com/page.php?cid=1%7C9%7C119)

[Figure: sparsity pattern of the matrix, n = 14327, nnz = 47541.]
Figure 3.14. Western Electricity Coordinating Council (WECC) System admittance matrix (14,327 buses and 16,607 branches).

In terms of border nodes b_k, however, the subsystems are not as balanced, regardless of the partitioning method, given the larger values of the standard deviations σ(b_k) relative to the averages µ(b_k).
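The balance statistics reported in Table 3.4 (average, standard deviation, minimum and maximum of a per-subsystem quantity) can be computed directly from a partition vector. A minimal sketch, with a hypothetical bus-to-subsystem assignment:

```python
from statistics import mean, pstdev

def partition_stats(part, n_parts):
    """Per-subsystem node counts n_k and their spread, given a partition
    vector mapping each bus to a subsystem index (0 .. n_parts-1)."""
    n_k = [sum(1 for s in part if s == k) for k in range(n_parts)]
    return {"n_k": n_k, "mu": mean(n_k), "sigma": pstdev(n_k),
            "min": min(n_k), "max": max(n_k)}

# Toy 10-bus system split into 3 subsystems (hypothetical assignment)
stats = partition_stats([0, 0, 0, 1, 1, 1, 1, 2, 2, 2], 3)
```

Applied to the per-subsystem border-node counts instead of node counts, the same routine yields the µ(b_k) and σ(b_k) columns of Table 3.4.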
The influence of an unbalanced number of border nodes b_k on the timings of different subsystems can be shown explicitly by combining T^Thv_comp(S_k) and T^v_comp(S_k), given by (3.4) and (3.8), into the total subsystem computation time T^subs_comp(S_k) given below.

T^subs_comp(S_k) = [ T^1_fact(S_k) + (N_z − 1) T^2_fact(S_k) + 2 N_i T_solv(S_k) ] + N_z b_k T_solv(S_k)    (3.35)

Bearing in mind that the timings T^1_fact(S_k), T^2_fact(S_k) and T_solv(S_k), given by (3.5), (3.6) and (3.7), respectively, depend on well-balanced quantities, such as the number of nodes n_k and the branch-to-node ratio ρ_k, one can deduce that the term between square brackets in (3.35) will also be well balanced across all subsystems. The term dependent on the number of border nodes b_k, however, will not. Moreover, this last component is also dependent on the number of factorizations N_z the subsystems need to undergo. Thus, the fewer factorizations required, the less unbalanced the subsystems' computations become.

Communication volume

The communication volume involved in the MATE algorithm is directly associated with the size of the multi-node Thévenin equivalents that need to be gathered in the link solver, i.e., the number of border nodes b_k. Such a statement becomes evident from the timings T^Thv_comm(L) and T^i_comm(L) defined in (3.15) and (3.20).

Both evaluated partitioning algorithms have the number of links l interconnecting all subsystems as a minimizing function. Indirectly, minimizing such an objective function also minimizes the total number of border nodes in the system, but not necessarily the local number of border nodes. Unbalances in the size of the subsystems' multi-node Thévenin equivalents are indicated by the relatively large standard deviations σ(b_k).

For large systems, the impact of the unbalanced number of border nodes b_k will be hardly comparable to the impact of the subsystems and link solver workload.
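Equation (3.35) can be evaluated numerically to see how the border-node term drives the unbalance. A sketch with purely hypothetical per-operation timings (in milliseconds):

```python
def subsystem_time(T1fact, T2fact, Tsolv, Nz, Ni, bk):
    """Total subsystem computation time, structured as in Eq. (3.35):
    a balanced bracketed term plus the b_k-dependent term."""
    balanced = T1fact + (Nz - 1) * T2fact + 2 * Ni * Tsolv
    return balanced + Nz * bk * Tsolv

# Two subsystems balanced in n_k (same base timings) but not in b_k;
# all timing constants below are invented for illustration.
common = dict(T1fact=50.0, T2fact=30.0, Tsolv=2.0, Ni=1000)
t_small = subsystem_time(Nz=100, bk=10, **common)
t_large = subsystem_time(Nz=100, bk=40, **common)
```

The gap between the two, N_z (40 − 10) T_solv, shrinks linearly as the number of factorizations N_z drops, matching the observation above.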
For instance, whenever systems have to be factorized often, the workloads in the subsystems and link solver are O(ρ² n) and O(l³), respectively, while the communication burden is O(c b_k²). Hence, as long as b_k² ≪ ρ² n and b_k² ≪ l³, the computational workload will be much more significant than the communication burden.

For systems that require only a few factorizations during a simulation, the minimization of the links that straddle the subsystems can be expected to yield reasonable performance, as long as the constraints introduced in (3.31) are satisfied. Figure 3.15 shows that the communication penalty factors remain below unity for all adopted partitioning schemes. This fact indicates that the subsystems' workload is sufficiently more significant than the communication overhead.

[Figure: link solver penalty factor (top) and communication penalty factor (bottom) versus number of partitions (2 to 14), bisection and k-way curves.]
Figure 3.15. Link solver and communication penalty factors relative to the WECC system.

Sequential Link Solver

According to the computation time for the link currents T^i_comp(L), defined in (3.10), the workload assigned to the link solver is minimal as long as the number of links l that straddle all subsystems is also minimal.

From Table 3.4, it can be observed that both tested partitioning algorithms are able to generate roughly similar partitions as far as the number of links l is concerned. Since neither of the partitioning heuristics results in a consistently smaller number of links, they can only be compared for specific numbers of partitions.

Additionally, given that the link solver penalty factors shown in Figure 3.15 remain below unity for all adopted partitioning schemes, one can conclude that the subsystems' workload is still sufficiently bigger than the link solver's.
Therefore, for systems that need to be factorized only a few times, reasonable performance should be expected for the chosen partitioning schemes.

Table 3.4. Partitioning of the WECC system using the METIS library. Notation: µ(x_k) is the average of x_k, σ(x_k) its standard deviation, and min-max its minimum and maximum values across subsystems.

(a) Multilevel recursive bisection algorithm

 p |   l | µ(n_k)  σ(n_k) | n_k min-max | µ(ρ_k) σ(ρ_k) | ρ_k min-max | µ(b_k) σ(b_k) | b_k min-max | µ(l_k) σ(l_k) | l_k min-max
 2 |  46 | 7163.50   0.50 |  7163-7164  |  3.74   0.05  |  3.69-3.78  |  41.00   3.00 |    38-44    |  46.00   0.00 |    46-46
 3 |  43 | 4775.67   0.47 |  4775-4776  |  3.71   0.35  |  3.25-4.08  |  26.67   3.77 |    24-32    |  28.67   3.30 |    25-33
 4 |  71 | 3581.75   0.43 |  3581-3582  |  3.68   0.37  |  3.21-4.25  |  32.50   7.02 |    25-40    |  35.50   8.20 |    27-46
 5 | 111 | 2865.40   0.49 |  2865-2866  |  3.53   0.18  |  3.30-3.78  |  39.40  14.32 |    23-63    |  44.40  17.06 |    25-74
 6 | 134 | 2387.83   1.34 |  2386-2390  |  3.51   0.24  |  3.27-3.91  |  39.83  18.13 |    20-62    |  44.67  18.28 |    21-66
 7 | 125 | 2046.71   0.45 |  2046-2047  |  3.54   0.29  |  3.09-3.98  |  32.57   6.90 |    24-43    |  35.71   8.14 |    27-48
 8 | 137 | 1790.88   0.33 |  1790-1791  |  3.51   0.28  |  3.11-3.95  |  29.63  11.64 |    11-53    |  34.25  13.27 |    12-60
 9 | 170 | 1591.89   0.31 |  1591-1592  |  3.46   0.30  |  3.09-4.02  |  32.11   7.05 |    24-44    |  37.78   8.70 |    28-56
10 | 152 | 1432.70   0.64 |  1432-1434  |  3.49   0.30  |  2.99-3.93  |  27.00   8.26 |     8-42    |  30.40  10.08 |     9-50
11 | 159 | 1302.45   0.50 |  1302-1303  |  3.49   0.32  |  2.98-3.95  |  25.18   7.81 |     9-37    |  28.91   8.32 |    12-43
12 | 189 | 1193.92   0.28 |  1193-1194  |  3.47   0.33  |  3.06-3.97  |  27.00   8.47 |     6-37    |  31.50  10.59 |     7-45
13 | 201 | 1102.08   0.92 |  1100-1104  |  3.44   0.32  |  2.95-4.15  |  26.15   9.91 |     7-43    |  30.92  12.77 |     8-50
14 | 199 | 1023.36   0.89 |  1022-1025  |  3.43   0.40  |  2.85-4.05  |  24.21   9.01 |     7-41    |  28.43  10.93 |     7-52

(b) Multilevel k-way algorithm

 p |   l | µ(n_k)  σ(n_k) | n_k min-max | µ(ρ_k) σ(ρ_k) | ρ_k min-max | µ(b_k) σ(b_k) | b_k min-max | µ(l_k) σ(l_k) | l_k min-max
 2 |  25 | 7163.50   9.50 |  7154-7173  |  3.63   0.07  |  3.56-3.69  |  23.00   0.00 |    23-23    |  25.00   0.00 |    25-25
 3 |  43 | 4775.67  47.25 |  4709-4813  |  3.67   0.06  |  3.58-3.73  |  25.00   9.42 |    13-36    |  28.67  11.26 |    13-39
 4 |  82 | 3581.75  91.46 |  3478-3682  |  3.62   0.12  |  3.45-3.77  |  37.25  10.47 |    28-55    |  41.00  11.25 |    31-60
 5 |  73 | 2865.40  49.83 |  2816-2943  |  3.70   0.20  |  3.49-4.05  |  25.40  15.21 |     5-49    |  29.20  17.45 |     5-56
 6 | 139 | 2387.83  96.07 |  2195-2455  |  3.54   0.13  |  3.37-3.73  |  41.00  21.35 |    17-78    |  46.33  26.00 |    18-92
 7 | 103 | 2046.71  35.62 |  1987-2092  |  3.55   0.19  |  3.30-3.94  |  25.86  10.51 |     5-40    |  29.43  12.00 |     6-47
 8 | 265 | 1790.88  59.00 |  1710-1840  |  3.45   0.19  |  3.19-3.84  |  55.38  31.00 |    20-111   |  66.25  38.37 |    23-133
 9 | 242 | 1591.89  79.90 |  1368-1631  |  3.47   0.14  |  3.24-3.63  |  45.89  18.50 |    10-68    |  53.78  21.09 |    10-78
10 | 151 | 1432.70  74.37 |  1212-1474  |  3.57   0.25  |  3.14-4.08  |  26.50  14.54 |     4-48    |  30.20  15.20 |     7-51
11 | 145 | 1302.45  36.25 |  1242-1344  |  3.56   0.25  |  3.13-3.89  |  22.73   6.84 |    11-34    |  26.36   7.67 |    14-38
12 | 187 | 1193.92  59.19 |  1014-1229  |  3.52   0.18  |  3.15-3.74  |  25.67  15.11 |     3-61    |  31.17  19.93 |     3-80
13 | 187 | 1102.08  18.66 |  1073-1132  |  3.47   0.26  |  3.08-3.91  |  24.08  11.87 |     3-48    |  28.77  15.67 |     3-58
14 | 199 | 1023.36  27.70 |   954-1049  |  3.47   0.21  |  3.16-4.02  |  24.29   9.84 |     4-40    |  28.43  11.95 |     5-48

3.4.2 Timings and Performance Predictions for the WECC System

To predict timings and performance, 1000 solutions of the WECC system were considered, i.e., N_i = 1000, together with three distinct numbers of factorizations, N_z = {1, 100, 500}.

The predicted and measured timings are shown in Figures 3.17, 3.18 and 3.19. As the number of partitions increases, so does the number of links interconnecting the subsystems, which, in turn, become smaller (see Table 3.4). Such behavior helps reduce the participation of the subsystems' computations in the total time, while boosting the workload assigned to the link solver. Due to the significantly increased workload in the link solver, the timing ceases to decrease for higher numbers of partitions. However, since dense matrix computations are more efficient, in terms of Mflops, than sparse computations on modern processors, the computational overhead incurred by the link solver increases at a much slower pace than the subsystems' workload decreases. This fact helps the present MATE implementation achieve up to 6 times speedup with respect to the sequential SuperLU algorithm, as shown in Figure 3.23.

Figures 3.17, 3.18 and 3.19 also show that, as the number of factorizations N_z increases, the link solver operations grow as O(N_z l³), i.e., at a much faster rate than the subsystems' sparse operations, which grow as O((n/p) N_z), where n represents the number of buses in the original untorn WECC system.
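These competing growth rates can be folded into a crude total-time model: sparse subsystem work scaling with (n/p) N_z plus the per-step solutions, and dense link-solver work scaling with N_z l³. All cost constants below are illustrative placeholders, not the measured coefficients of the performance model:

```python
def mate_time(n, p, l, Nz, Ni, c_fact=1e-6, c_solv=1e-7, c_link=1e-9):
    """Crude MATE cost model: sparse subsystem work on n/p buses plus
    dense link-solver work on l links (constants are hypothetical)."""
    t_sub = Nz * c_fact * (n / p) + Ni * c_solv * (n / p)
    t_link = Nz * c_link * l ** 3 + Ni * c_solv * l
    return t_sub + t_link

# WECC-like figures (n = 14327; l taken from Table 3.4): few
# factorizations favour many partitions, many factorizations do not.
few_2 = mate_time(14327, 2, 46, Nz=1, Ni=1000)
few_14 = mate_time(14327, 14, 199, Nz=1, Ni=1000)
many_2 = mate_time(14327, 2, 46, Nz=500, Ni=1000)
many_14 = mate_time(14327, 14, 199, Nz=500, Ni=1000)
```

Even this toy model reproduces the qualitative trend of Figures 3.17 to 3.19: for N_z = 1 the 14-partition case is fastest, while for N_z = 500 the l³ term erodes the advantage of finer partitioning.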
This explains the lower speedups achieved when the system needs to be factorized many times during the simulations, regardless of the number of partitions.

As mentioned before in Section 3.4.1, the communication between subsystems and link solver has little impact on the global timings, regardless of the number of factorizations, although more data is exchanged whenever a factorization at the subsystems level is required. This is because of the massive amount of computation required in comparison with the amount of data that needs to be exchanged among processes.

In Figures 3.20, 3.21 and 3.22, the MATE algorithm performance is illustrated for the case when the WECC system is torn into 14 subsystems by means of the multilevel recursive bisection partitioning method. In these graphs, the processes participating in the solution are represented on the x-axis, where process 0 is related to the link solver and the rest to the subsystems. Observe that, for fewer factorizations, the partitions obtained with the METIS library show good computational balance, which deteriorates as the number of required factorizations grows. This tendency is explained by the fact that the numbers of border nodes b_k are not well balanced across the subsystems (see Table 3.4 for details), which makes the multi-node Thévenin equivalents in distinct subsystems very different.

As for the performance metrics, speedup and efficiency were measured with respect to the SuperLU sparse solver timings. Results are depicted in Figure 3.23. It can be observed

[Figure: MATE total times [s] for 1, 100 and 500 factorizations versus number of subsystems (2 to 14), bisection and k-way curves.]
Figure 3.16.
Comparison between MATE timings for multilevel recursive bisection and multilevel k-way partitioning algorithms.

that, in the case when the WECC system required just a few factorizations (less than 5% of the steps), the network-based MATE algorithm achieved about 6 times speedup with respect to the sequential SuperLU solver, when 14 partitions were considered. For many factorizations (more than 10% of the steps), however, the algorithm speedup degrades considerably, to approximately 3 times for N_z = 100 and 2 times for N_z = 500. Although the maximum speedup happened for the 14-subsystems case, the maximum efficiency was observed for a lower number of partitions. For a small number of factorizations, an efficiency slightly higher than 50% was observed, while lower values were registered for the cases with many factorizations, i.e., less than 30% for N_z = 100 and less than 20% for N_z = 500.

Lastly, timings obtained with both partitioning algorithms are compared in Figure 3.16. Both multilevel recursive bisection and k-way deliver similar performance for the MATE algorithm when only a few factorizations are required. The differences between the employed partitioning heuristics become evident for higher numbers of factorizations. These differences in performance come as a result of larger numbers of links and local border nodes, as can be observed in Table 3.4, when the WECC system is partitioned into 2, 8 and 9 subsystems.

[Figure: participation [%] of T^i_comp(L), T^i_comm(L), T^Thv_comp(S), T^Thv_comm(S) and T^v_comp(S), and total time [s], versus number of partitions; panels (a) multilevel recursive bisection and (b) multilevel k-way partitioning.]
Figure 3.17.
MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 1 factorization and different partitioning strategies.

[Figure: same timing-participation breakdown as Figure 3.17; panels (a) multilevel recursive bisection and (b) multilevel k-way partitioning.]
Figure 3.18. MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 100 factorizations and different partitioning strategies.

[Figure: same timing-participation breakdown as Figure 3.17; panels (a) multilevel recursive bisection and (b) multilevel k-way partitioning.]
Figure 3.19. MATE predicted and measured timings for the solution of the WECC system for 1000 steps, 500 factorizations and different partitioning strategies.

[Figure: per-process breakdown of T^Thv_comm(L), T_fact(S), T^v_comp(S), T^i_comp(L), T^Thv_comp(S) and T^i_comm(L) for processes 0 to 14; panels (a) multilevel recursive bisection and (b) multilevel k-way partitioning.]
Figure 3.20. MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 1 factorization and different partitioning strategies. (Process 0 relates to L and processes 1-14 with S)
[Figure: per-process timing breakdown for processes 0 to 14; panels (a) multilevel recursive bisection and (b) multilevel k-way partitioning.]
Figure 3.21. MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 100 factorizations and different partitioning strategies. (Process 0 relates to L and processes 1-14 with S)

[Figure: per-process timing breakdown for processes 0 to 14; panels (a) multilevel recursive bisection and (b) multilevel k-way partitioning.]
Figure 3.22. MATE timings for the solution of the WECC system partitioned in 14 subsystems for 1000 steps, 500 factorizations and different partitioning strategies. (Process 0 relates to L and processes 1-14 with S)

[Figure: speedup and efficiency [%] versus number of partitions (2 to 14) for (a) 1, (b) 100 and (c) 500 factorizations; panels for multilevel recursive bisection and multilevel k-way partitioning.]
Figure 3.23. MATE performance metrics for the solution of the WECC system for 1000 steps and (a) 1, (b) 100 and (c) 500 factorizations.
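The metrics in Figure 3.23 follow the usual definitions: speedup is the sequential-solver time divided by the parallel time, and efficiency is speedup per process. A sketch with purely illustrative numbers (whether the link-solver process is counted alongside the 14 subsystem processes is a bookkeeping choice):

```python
def speedup_efficiency(t_seq, t_par, n_proc):
    """Speedup relative to a sequential solver and parallel efficiency."""
    s = t_seq / t_par
    return s, s / n_proc

# Illustrative: a 6x speedup on 15 processes (14 subsystems + 1 link
# solver) corresponds to 40% efficiency.
s, e = speedup_efficiency(t_seq=12.0, t_par=2.0, n_proc=15)
```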
3.5 Conclusion

In this chapter, a performance model for a specific implementation of the network-based MATE algorithm was discussed, along with an application to a real large power system, the Western Electricity Coordinating Council (WECC) system, with about 15,000 buses.

The approach adopted in the present implementation was to combine traditional sparsity techniques with dense operations, in order to obtain an optimized parallel MATE-based linear solver.

From the developed performance model, it was shown that, for systems which need to be repeatedly solved but factorized only a few times, the theoretical speedup of the network-based MATE algorithm with respect to traditional sparsity-oriented linear solvers is p/2, for a system partitioned into p subsystems. As a consequence, for the same type of systems, the efficiency cannot exceed 50%. Nonetheless, compared with the speedups and efficiencies reported in the literature (see Section 1.2), these values can be considered competitive with other parallel algorithms employed for solving sparse linear systems, even on massive supercomputers.

The concept of link solver and communication penalty factors, introduced in (3.31), provides a normalized measure of the influence of the link solver computations, as well as of the communication overhead, in the network-based MATE algorithm, with respect to the subsystems' workload. In the future, such factors could even be used as objective functions in the system partitioning phase.

For the WECC system, the speedup was shown to approximate the theoretical one fairly well, with obtained speedups slightly higher than 6 times when solving the WECC system partitioned into 14 subsystems (see Figure 3.23).
Concerning the efficiency of the algorithm, it was kept very close to the theoretical 50% for all tested partitioning schemes using the multilevel recursive bisection and multilevel k-way methods (Karypis & Kumar, 1998b).

Chapter 4
MATE-based Parallel Transient Stability

According to Kundur (1994), transient stability is the ability of the power system to maintain synchronism when subjected to a severe transient disturbance, such as a fault on transmission facilities, loss of generation or loss of load. As such, the main objective of transient stability simulations is to detect critical operating conditions under which power systems would lose synchronism due to a large disturbance.

In engineering applications, it is frequently desirable to make many system response simulations to calculate, for example, the effects of different fault locations and types, automatic switching, initial power system operating states, and different network, machine and control-system characteristics (Stott, 1979). Such studies provide vital information for system design, operation planning and, more recently, real-time assessment of large power systems. The sheer volume of computation required by such studies, however, imposes severe constraints on the size and complexity of the systems that can be analyzed by means of on-line Transient Stability Assessment (TSA) tools. Hence, speeding up transient stability calculations is one way of improving the reliability of operation and security of modern power systems (Andersson et al., 2005).

In order to address the need for faster TSA tools, a parallel transient stability simulator, which employs the network-based MATE algorithm as the back-end for network solutions, will be presented. This implementation differs from other parallel transient stability programs in the sense that it is inherently designed for distributed-memory computing systems. This is because the network-based MATE algorithm's features allow the complete separation of subsystems.
This approach, in addition to avoiding extraneous data movement between central databases and subsystem processes, also allows the system data to be partitioned according to the topology of the network under study. This characteristic also improves the parallelism of other tasks required by the transient stability simulator, such as the integration of the dynamic models.

For the sake of comparison, one acclaimed transient stability solution technique is discussed in its sequential and parallel implementations. Timing results of the actual implementations, along with the speedup and efficiency of the solution, are shown for a reduced version of the Brazilian National Interconnected System, with 1916 buses, 2788 branches and 79 generators.

[Figure: block diagram linking energy supply, prime mover, speed regulator, generator, automatic voltage regulator, power system stabilizer, transmission system and other dynamic devices (generators, motors), with signals Vt, Efd, Vs, ω and valve/gate control.]
Figure 4.1. Basic structure of a power system model for transient stability analysis.

4.1 Transient Stability Problem

According to the literature (Dommel & Sato, 1972; Stott, 1979; Kundur, 1994), the power system transient stability problem is modelled by means of a set of non-linear differential-algebraic equations (DAEs), which can be summarized as follows:

ẋ = f(x, v)    (4.1a)
Y v = i(x, v)    (4.1b)

where x represents a vector of dynamic variables (or state variables), whose first derivatives ẋ are defined by a vector function f, normally dependent on x itself and on the vector of nodal voltages v. In addition, i represents a vector function that defines the nodal current injections, which also depend on the state variables x and the nodal voltages v. Lastly, Y represents the complex-valued nodal admittance matrix of the system under study.
For illustration purposes, a basic power system structure for transient stability studies is also given in Figure 4.1.

In this formulation, the set of dynamic equations (4.1a) comprises all non-linear differential equations related to generators and their associated prime movers (e.g., hydro and steam turbines) and controllers (e.g., automatic voltage regulators, speed governors and power system stabilizers), motors, FACTS devices (e.g., HVDC systems and SVCs), as well as any other dynamic devices (e.g., dynamic loads). The structure of the dynamic equations strongly depends on the models used for each of the aforementioned components, which ultimately define the strength of the non-linearities included in the system, as well as its overall response time characteristics.

The set of static equations (4.1b), in turn, includes all algebraic complex-valued nodal equations associated with the passive transmission network. This network is mainly composed of transformers and transmission lines, the stators of generators and motors, and voltage-dependent loads. Depending on the models used for the loads, the algebraic equations may also become non-linear. The interface between the dynamic and static equations in (4.1) is accomplished by the voltage-dependent algebraic equations associated with each device connected to the passive network. The stator equations of the generators figure as the most important interface equations, as far as transient stability is concerned.

4.1.1 Transient Stability Solution Techniques

Many solution variations for the problem defined by (4.1) have been described in the literature. These variations combine different integration methods for the dynamic equations (4.1a), different solution methods for the algebraic equations (4.1b), and different manners of interfacing the dynamic and algebraic equations.
A number of these methods are summarized in (Stott, 1979).

All proposed variations, however, fall into two major categories, which are closely related to the interface between the dynamic and algebraic equations. These categories are the (a) alternating (or partitioned) and (b) simultaneous solution approaches.

As pointed out by Stott (1979), the alternating approach is the most traditional method, adopted by many industrial-grade transient stability programs. In addition, since the alternating method relies on repeated network solutions during a simulation, it presents itself as an ideal option for direct application of the network-based MATE algorithm analyzed in the previous chapters. Therefore, the alternating algorithm will be detailed in the sequel, while more information with respect to the simultaneous solution approach can be found in Appendix B.

Alternating Solution Approach

The alternating solution method is characterized by the fact that the dynamic equations (4.1a) are integrated separately from the solution of the algebraic equations (4.1b). In addition, the integration method considered will affect the way the interface between both sets of equations is realized.

In the case of an explicit integration method, such as the explicit Runge-Kutta methods, the derivatives ẋ calculated at previous time steps are used for computing new values of x. Subsequently, x is used for the network computation, which turns out to be iterative in case non-impedance loads are used. A nice feature provided by explicit integration formulas lies in the fact that both sets of dynamic and algebraic equations are truly isolated from one another, and their interface solution is non-iterative.
On the other hand, explicit integration methods are weakly stable and often require very small time steps.

In the case of an implicit integration rule, such as the Trapezoidal integration rule, extrapolation techniques are used for predicting the node voltages v, which are then employed during the integration procedure, which yields an approximate x. With the approximate values of x and v, the current injections i can be calculated and used for computing an updated set of voltages v, which are later employed in the updating of the dynamic equations (4.1a). In comparison with the explicit integration methods, their implicit counterparts present improved numerical robustness and convergence characteristics.

In order to properly choose an implicit integration method for the solution of the transient stability problem, the requirements of power systems need to be considered first. Power system problems usually accept numerical errors in the order of a few percent, which diminishes the importance of the order of the integration method. The implicit method, however, has to present strong numerical stability and, therefore, not accumulate errors during a simulation. Moreover, the implicit integration method should never produce stable solutions to unstable problems. Bearing these requirements in mind, the Trapezoidal integration rule has proved to be reliable, reasonably accurate and computationally inexpensive, in comparison to higher-order implicit formulas (Dommel & Sato, 1972; Stott, 1979).

As such, the set of ordinary differential equations (4.1a) is first discretized according to the Trapezoidal integration rule.
This procedure is shown in (4.2).

x(t) = x(t − ∆t) + ∫_{t−∆t}^{t} f(x(ξ), v(ξ)) dξ
     ≈ x(t − ∆t) + (∆t/2) [ f(x(t), v(t)) + f(x(t − ∆t), v(t − ∆t)) ]    (4.2)

Collecting the present and past values yields (4.3), where x_h(t) represents the history term of the state vector x, which is known at time t.

x(t) = (∆t/2) f(x(t), v(t)) + x_h(t)    (4.3a)
x_h(t) = x(t − ∆t) + (∆t/2) f(x(t − ∆t), v(t − ∆t))    (4.3b)

Combining the original set of algebraic equations defined in (4.1b) with the one produced by the integration procedure, given in (4.3), results in the alternating solution method given in (4.4).

x^m = (∆t/2) f(x^m, v^m) + x_h    (4.4a)
Y v^{m+1} = i(x^m, v^m)    (4.4b)

where x_h, defined in (4.3b), is computed from previous values of x and v, and m = 0, 1, 2, ... stands for the iteration count.

The equations in (4.4) describe the iterative process needed for the solution at a given time t. The process starts with the computation of v^m by extrapolation on a few immediate past values of v. Once v^m is computed, a new iterate x^m is obtained from (4.4a). Then, applying the previous x^m and v^m to calculate i(x^m, v^m), (4.4b) can be solved for v^{m+1}, which can be used to start a new iteration. The iterative process continues until the difference between v^{m+1} and v^m is negligible.

4.1.2 Transient Stability Models

The set of equations (4.1), which defines the transient stability problem, is highly dependent on the power system under investigation, as well as on its associated models, as discussed in Section 4.1.

However, before delving into the modelling of power system devices for transient stability studies, one needs to keep in mind that the choice of models has to be in agreement with the time scale of interest.
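Before moving to the models, the control flow of the alternating scheme (4.2) to (4.4) can be sketched on a scalar toy DAE (not a power network): a trapezoidal history term, an inner fixed-point update for x, and a "network" solve for v. All numbers are illustrative:

```python
def alternating_step(x_prev, v_prev, Y, f, i_inj, dt, tol=1e-9, max_it=50):
    """One time step of the alternating scheme (4.4) for scalar x and v."""
    x_h = x_prev + 0.5 * dt * f(x_prev, v_prev)  # history term (4.3b)
    x, v = x_prev, v_prev                        # v_prev as extrapolated start
    for _ in range(max_it):
        for _ in range(max_it):                  # discretized dynamics (4.4a)
            x_new = x_h + 0.5 * dt * f(x, v)
            done = abs(x_new - x) < tol
            x = x_new
            if done:
                break
        v_new = i_inj(x, v) / Y                  # scalar network solve (4.4b)
        if abs(v_new - v) < tol:
            return x, v_new
        v = v_new
    return x, v

# Toy DAE: x' = -x + v with "network" Y v = x, so x decays as exp(-t/2)
x, v, Y = 1.0, 0.5, 2.0
for _ in range(200):  # 200 steps of dt = 0.05, i.e. up to t = 10 s
    x, v = alternating_step(x, v, Y, lambda x, v: -x + v,
                            lambda x, v: x, dt=0.05)
```

After the loop, x sits close to exp(-5) and the algebraic constraint Y v = x is satisfied to the iteration tolerance, which is exactly the stopping criterion described above.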
As described by Kundur (1994), transient stability can be subdivided into three categories: short-term (up to 10 seconds), mid-term (from 10 seconds to a few minutes) and long-term stability (from a few minutes to tens of minutes). The modelling for different time scales may differ considerably. For instance, in short-term stability studies, the generators and their respective automatic controls (voltage and speed), in addition to the transmission system with the loads, are often enough; while for long-term stability, boiler dynamics of thermal power plants, penstock and conduit dynamics of hydro plants, automatic generation control, frequency-deviation effects on loads and network, and excitation limiters, among others, have to be included.

Moreover, bearing in mind that the present implementation aims at showing the feasibility of parallel transient stability computations employing the network-based MATE algorithm (see Chapters 2 and 3), the modelling requirements for short-term stability will be considered. Thus, the most basic models for short-term transient stability, the generators and the transmission system, will be introduced.

Transmission Network

The basic consideration in transient stability studies, as far as transmission system modelling goes, is that the frequency of the whole system is assumed to be nearly constant and close to the rated system frequency, such as 50 or 60 Hz. Moreover, the very fast decaying network transients are seldom of concern in the time scale of interest. Therefore, network dynamics are usually neglected in transient stability studies. Under such considerations, the basic transmission system is modelled by a large set of nodal algebraic equations which relate the voltages at all buses of the system and their current injections. This set of equations is commonly represented by a large, sparse and usually complex-valued admittance matrix Y, which is topologically symmetric but numerically unsymmetric.
Furthermore, in transient stability problems, the Y matrix is also constant between topological changes, such as short-circuits and line trippings.

The basic elements considered in the transmission network model are the transmission lines, the transformers and the buses.

(a) Transmission lines

A lumped pi circuit, depicted in Figure 4.2, is normally employed for the characterization of transmission lines in transient stability studies. According to Kundur (1994), the parameters of the transmission line pi circuit are given as follows:

Z̄_s = Z̄_C sinh(γl)    (4.5a)
Ȳ_sh = (2/Z̄_C) tanh(γl/2)    (4.5b)

and

Z̄_C = sqrt[ (R + jω_s L) / (G + jω_s C) ]    (4.6)
γ = sqrt[ (R + jω_s L)(G + jω_s C) ]    (4.7)

where
Z̄_C = characteristic impedance, in [Ω]
γ = propagation constant, in [m⁻¹]
l = length of the line, in [m]
R = series resistance, in [Ω/m]
L = series inductance, in [H/m]
G = shunt conductance, in [℧/m]
C = shunt capacitance, in [F/m]

[Figure: series impedance Z̄_s with shunt admittances Ȳ_sh/2 at each end.]
Figure 4.2. Equivalent pi circuit of a transmission line.

(b) Transformers

The single-line diagram of a 2-winding transformer is shown in Figure 4.3a. In this representation, the parameter a is the per-unit turns ratio and Ȳ_T = 1/Z̄_T the nominal per-unit series admittance of the transformer.

As presented in Kundur (1994), the associated pi circuit of the present 2-winding transformer is depicted in Figure 4.3b and defined by the parameters given in (4.8).

Ȳ_t = Ȳ_T / a    [p.u.]    (4.8a)
Ȳ_p = ((1 − a)/a²) Ȳ_T    [p.u.]    (4.8b)
Ȳ_s = ((a − 1)/a) Ȳ_T    [p.u.]    (4.8c)

(c) Load buses

Modelling a load bus is a very complicated task, due to the variety and seasonality of the loads that can be connected to a same bus, such as lamps, refrigerators, heaters, compressors, motors, computers and so forth. And, even if all load models were known, implementing the millions of loads often supplied by power systems would render the transient stability simulations infeasible.
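Returning briefly to the line model, the pi-circuit parameters (4.5) to (4.7) reduce to a few lines of complex arithmetic. A sketch with hypothetical per-metre data for a 100 km, 60 Hz overhead line (the parameter values are invented for illustration):

```python
import cmath
import math

def line_pi(R, L, G, C, length, ws):
    """Equivalent-pi parameters of a line, Eqs. (4.5)-(4.7).
    R, L, G, C are per-metre series/shunt quantities; length in metres."""
    z = R + 1j * ws * L                 # series impedance per metre
    y = G + 1j * ws * C                 # shunt admittance per metre
    Zc = cmath.sqrt(z / y)              # characteristic impedance (4.6)
    gamma = cmath.sqrt(z * y)           # propagation constant (4.7)
    Zs = Zc * cmath.sinh(gamma * length)             # (4.5a)
    Ysh = (2 / Zc) * cmath.tanh(gamma * length / 2)  # (4.5b)
    return Zs, Ysh

# Hypothetical line data; since the line is electrically short, the exact
# pi parameters stay close to the lumped values z*l and y*l.
Zs, Ysh = line_pi(R=3e-5, L=9e-7, G=0.0, C=1.3e-11,
                  length=1e5, ws=2 * math.pi * 60)
```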
Therefore, it is common practice to adopt composite models for the load buses as seen from the power system.

Figure 4.3. Transformer with off-nominal ratio and its π circuit representation.

One type of model that is widely used in the power system industry is the polynomial model, described in (4.9). The coefficients p and q represent the proportions of the components of constant power (p1 and q1), constant current (p2 and q2) and constant impedance (p3 and q3).

\[ P_L = P_0\left(p_1 + p_2 V + p_3 V^2\right) \;\text{[p.u.]} \tag{4.9a} \]
\[ Q_L = Q_0\left(q_1 + q_2 V + q_3 V^2\right) \;\text{[p.u.]} \tag{4.9b} \]

where \(p_1 + p_2 + p_3 = q_1 + q_2 + q_3 = 1\).

In order to interface the foregoing composite load model with the network in the alternating method, as summarized by (4.4b), each component of the load needs to be interfaced with the network through an impedance, a current injection, or both.

In the case of the constant current component, its associated injection \(\bar{I}^i_L\) is calculated from the steady-state condition, as shown in (4.10), and included in the right-hand side vector i of the system of equations defined in (4.4b).

\[ \bar{I}^i_L = \frac{p_2 P_0 - j\,q_2 Q_0}{\bar{V}^*_0} \;\text{[p.u.]} \tag{4.10} \]

For the constant impedance component, similarly to the constant current case, the load admittance \(\bar{Y}^z_L\) is calculated from the steady-state solution of the network by means of the expression in (4.11). This constant admittance is then included in the admittance matrix that describes the transmission system.

\[ \bar{Y}^z_L = \frac{p_3 P_0 - j\,q_3 Q_0}{V_0^2} \;\text{[p.u.]} \tag{4.11} \]

Figure 4.4. Equivalent circuit of the polynomial load model.

For the constant power component, a Norton equivalent is used (Arrillaga & Watson, 2001), where the admittance \(\bar{Y}^p_L\) is connected in parallel with a current injection \(\bar{I}^p_L\).
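Assuming the coefficient convention stated above (p1, q1 for constant power, p2, q2 for constant current, p3, q3 for constant impedance), the network-interface quantities (4.10)-(4.11) can be sketched as follows; names are illustrative:

```python
def zip_load_interface(P0, Q0, V0bar, p, q):
    """Network-interface quantities of a polynomial (ZIP) load.
    p = (p1, p2, p3) and q = (q1, q2, q3), ordered as constant power,
    constant current, constant impedance."""
    p1, p2, p3 = p
    q1, q2, q3 = q
    V0 = abs(V0bar)
    # Constant-current injection, from the steady-state voltage (4.10)
    IiL = (p2 * P0 - 1j * q2 * Q0) / V0bar.conjugate()
    # Constant-impedance admittance, stamped into the Y matrix (4.11)
    YzL = (p3 * P0 - 1j * q3 * Q0) / V0**2
    return IiL, YzL
```

At the initial voltage V̄0, the two components above plus the constant power share p1·P0 + j·q1·Q0 recover exactly P0 + jQ0 whenever the coefficients sum to one, which is a convenient sanity check.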
In this case, the current \(\bar{I}^p_L\) provides a power adjustment for changes around the power initially drawn by the admittance \(\bar{Y}^p_L\), so that the specified power is enforced. Similarly to the constant impedance case, \(\bar{Y}^p_L\) is also included in the admittance matrix of the system.

\[ \bar{I}^p_L = \left(\bar{Y}^p_L - \frac{\bar{S}^*}{V^2}\right)\bar{V} \;\text{[p.u.]} \tag{4.12a} \]
\[ \bar{Y}^p_L = p_1 P_0 - j\,q_1 Q_0 \;\text{[p.u.]} \tag{4.12b} \]

Synchronous Generator

Synchronous machine models are commonly developed in the machine rotor qd0 reference frame (Kundur, 1994; Krause et al., 2002; Anderson et al., 2003). The full synchronous machine model describes both stator and rotor dynamics, in addition to the mechanical dynamics. However, since the very fast decaying network transients are seldom of concern in the time scale of interest, both the network and the stator dynamics are usually neglected in transient stability studies. Under such considerations, the stator equations become algebraic, while the rotor equations are kept dynamic (Kundur, 1994).

Another important assumption made when modelling synchronous machines for transient stability studies is that the electrical frequency ωe of the machines only slightly deviates from the rated frequency of the system ωs. Under such conditions, the electromechanical torque Te and power Pe of the machine, when given in p.u. on the machine base quantities, are interchangeable, i.e., Te ≈ Pe.

There is a vast literature on synchronous generator models applied to the transient stability problem (Kundur & Dandeno, 1983; Kundur, 1994; Anderson et al., 2003); hence, only a summary of the classical and transient models will be presented.
The extension of the transient to the subtransient model is straightforward and is also covered in the referenced literature.

(a) Mechanical Equations

The mechanical equations, given in (4.13), describe the electrical frequency ωe of the machine and the angle between the net magnetic flux in the machine gap and the magnetic flux induced in the field winding, also known as the load angle δ of the machine.

\[ \dot{\delta} = \omega_s\,\omega \;\left[\tfrac{\text{rad}}{\text{s}}\right] \tag{4.13a} \]
\[ \dot{\omega} = \frac{1}{2H}\left[P_m - P_e - D\,\omega\right] \;\left[\tfrac{\text{p.u.}}{\text{s}}\right] \tag{4.13b} \]
\[ \omega = \frac{\omega_e - \omega_s}{\omega_s} \;\text{[p.u.]} \tag{4.13c} \]

where δ = load angle, in [rad]; ω = frequency deviation from the rated electrical frequency ωs, in [p.u.]; ωe = electrical speed or frequency, in [rad/s]; ωs = system rated frequency, in [rad/s]; Pm = mechanical power, in [p.u.]; Pe = electrical power, in [p.u.]; H = inertia constant, in [s]; D = damping factor, in [p.u.].

Applying the trapezoidal integration rule to (4.13) yields the new set of algebraic equations given in (4.14).

\[ \delta(t) = k_{\delta\omega}\,\omega(t) + \delta_h(t) \;\text{[rad]} \tag{4.14a} \]
\[ \omega(t) = k_{\omega P}\left[P_m(t) - P_e(t)\right] + \omega_h(t) \;\text{[p.u.]} \tag{4.14b} \]
\[ \delta_h(t) = \delta(t-\Delta t) + k_{\delta\omega}\,\omega(t-\Delta t) \;\text{[rad]} \tag{4.14c} \]
\[ \omega_h(t) = k_{\omega P}\left[P_m(t-\Delta t) - P_e(t-\Delta t)\right] + k_{\omega h}\,\omega(t-\Delta t) \;\text{[p.u.]} \tag{4.14d} \]

where

\[ k_{\delta\omega} = \frac{\omega_s\,\Delta t}{2} \qquad k_{\omega P} = \frac{\Delta t}{4H + D\,\Delta t} \qquad k_{\omega h} = \frac{4H - D\,\Delta t}{4H + D\,\Delta t} \]

(b) Electrical Equations

The algebraic transient equations associated with the stator are given in (4.15), while the dynamic equations which govern the transient voltages induced onto the stator are presented in (4.16). These equations are defined in the qd reference frame of the machine, depicted in Figure 4.5.

\[ e'_q - v_q = r_a i_q - x'_d i_d \;\text{[p.u.]} \tag{4.15a} \]
\[ e'_d - v_d = r_a i_d + x'_q i_q \;\text{[p.u.]} \tag{4.15b} \]
\[ \frac{de'_q}{dt} = \frac{1}{T'_{do}}\left[e_{fd} + (x_d - x'_d)\,i_d - e'_q\right] \;\left[\tfrac{\text{p.u.}}{\text{s}}\right] \tag{4.16a} \]
\[ \frac{de'_d}{dt} = \frac{1}{T'_{qo}}\left[-(x_q - x'_q)\,i_q - e'_d\right] \;\left[\tfrac{\text{p.u.}}{\text{s}}\right] \tag{4.16b} \]

and

\[ P_e = e'_d i_d + e'_q i_q + (x'_d - x'_q)\,i_q i_d \;\text{[p.u.]} \tag{4.17} \]

where vq, vd = q- and d-axis voltages at the terminal of the machine, in [p.u.]; iq, id = q- and d-axis stator currents, in [p.u.]; e′q, e′d = q- and d-axis transient voltages induced at the stator of the machine, in [p.u.]; efd = excitation field voltage referred to the stator of the machine, in [p.u.]; Pe = electrical power transferred to the machine shaft, in [p.u.]; x′q, x′d = q- and d-axis transient reactances of the machine, in [p.u.]; T′qo, T′do = q- and d-axis open-circuit transient time constants of the machine, in [s]; ra = stator resistance, in [p.u.].

Figure 4.5. Synchronous machine qd reference frame and system reference frame.

Again, applying the trapezoidal integration rule to (4.16) results in the set of algebraic equations given in (4.18).

\[ e'_q(t) = k_{q1}\,e_{fd}(t) + k_{q2}\,i_d(t) + e'_{qh}(t) \;\text{[p.u.]} \tag{4.18a} \]
\[ e'_d(t) = k_{d2}\,i_q(t) + e'_{dh}(t) \;\text{[p.u.]} \tag{4.18b} \]
\[ e'_{qh}(t) = k_{q1}\,e_{fd}(t-\Delta t) + k_{q2}\,i_d(t-\Delta t) + k_{q3}\,e'_q(t-\Delta t) \;\text{[p.u.]} \tag{4.18c} \]
\[ e'_{dh}(t) = k_{d2}\,i_q(t-\Delta t) + k_{d3}\,e'_d(t-\Delta t) \;\text{[p.u.]} \tag{4.18d} \]

where

\[ k_{q1} = \frac{\Delta t}{2T'_{do} + \Delta t} \qquad k_{q2} = k_{q1}\,(x_d - x'_d) \qquad k_{q3} = \frac{2T'_{do} - \Delta t}{2T'_{do} + \Delta t} \]
\[ k_{d1} = \frac{\Delta t}{2T'_{qo} + \Delta t} \qquad k_{d2} = -k_{d1}\,(x_q - x'_q) \qquad k_{d3} = \frac{2T'_{qo} - \Delta t}{2T'_{qo} + \Delta t} \]

In case the period of analysis is small compared with T′do and the effect of the amortisseurs is neglected, the machine model can be further simplified by assuming the voltage e′q constant and ignoring the equations associated with e′d. These assumptions eliminate all differential equations related to the electrical characteristics of the machine (Kundur, 1994).
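As a concrete illustration of the trapezoidal discretization, the sketch below advances the mechanical equations one step using the constants of (4.14); the function signature is illustrative, and Pe would normally come from the network iteration rather than being passed in directly:

```python
def swing_step(delta0, w0, Pm, Pe_prev, Pe_now, H, D, ws, dt):
    """One trapezoidal step of the mechanical equations (4.13), using the
    algebraized form (4.14). Angles in rad, powers in p.u., H in s."""
    kdw = ws * dt / 2.0
    kwp = dt / (4.0 * H + D * dt)
    kwh = (4.0 * H - D * dt) / (4.0 * H + D * dt)
    dh = delta0 + kdw * w0                    # history term (4.14c)
    wh = kwp * (Pm - Pe_prev) + kwh * w0      # history term (4.14d)
    w1 = kwp * (Pm - Pe_now) + wh             # (4.14b)
    d1 = kdw * w1 + dh                        # (4.14a)
    return d1, w1
```

At equilibrium (Pm = Pe, ω = 0) the step leaves δ and ω unchanged; with D = 0 and a constant power imbalance ΔP, the rule reproduces the exact linear speed ramp ω(t) = ΔP·t/(2H), since the integrand is then constant.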
These are the foundation of the classical synchronous machine model with constant flux linkages.

(c) Network Interface

From (4.4), it can be noticed that dynamic models are interfaced with the network by means of the current injections i, which are dependent on their state variables x and node voltages v. In the case of the synchronous machine, the first step in computing the current injections is solving (4.15) for iq and id, which results in (4.19).

\[ \begin{bmatrix} i_q \\ i_d \end{bmatrix} = \frac{1}{r_a^2 + x'_d x'_q} \begin{bmatrix} r_a & x'_d \\ -x'_q & r_a \end{bmatrix} \begin{bmatrix} e'_q - v_q \\ e'_d - v_d \end{bmatrix} \tag{4.19} \]

At this point, the stator current phasor Ī is required for interfacing the synchronous machine with the transmission network. And, since each machine has its own qd reference frame, the currents iq and id need to be converted to the synchronous reference frame of the system. For this task, consider the generic phasor Āqd illustrated in Figure 4.5, which is originally given in the qd reference frame. In such a case, the phasor Āqd can be converted to the system reference frame by means of (4.20b), which results in the phasor Ā.
The procedure denoted by (4.20b) represents a rotation of Āqd by δ radians.

\[ \bar{A}_{qd} = \bar{A}_q + \bar{A}_d \qquad \bar{A}_q = a_q \qquad \bar{A}_d = j\,a_d \tag{4.20a} \]
\[ \bar{A} = \bar{A}_{qd}\,e^{j\delta} \tag{4.20b} \]

Therefore, using (4.20) and (4.19), the stator current Ī in the system reference frame can be found, as given in (4.21).

\[ \bar{I} = \left[\left(\frac{r_a - j x'_q}{r_a^2 + x'_d x'_q}\right)(e'_q - v_q) + j\left(\frac{r_a - j x'_d}{r_a^2 + x'_d x'_q}\right)(e'_d - v_d)\right] e^{j\delta} \tag{4.21} \]

Although (4.21) gives the complex-valued synchronous machine current injection required by (4.4b), Dommel & Sato (1972) report that injecting the foregoing current straight into the transmission network often makes the alternating solution method non-convergent. In order to circumvent such an undesirable behaviour, the authors proposed describing Ī in terms of a voltage source behind a fictitious admittance ȲM, given in (4.22).

\[ \bar{Y}_M = \frac{r_a - j\frac{1}{2}\left(x'_d + x'_q\right)}{r_a^2 + x'_d x'_q} \tag{4.22} \]

Figure 4.6. Synchronous generator equivalent circuit.

In this way, one can rewrite (4.21) as follows:

\[ \bar{I} = -\bar{Y}_M \bar{V} + \bar{I}_M \tag{4.23a} \]
\[ \bar{I}_M = \bar{Y}_M \bar{E}' + \bar{I}_{saliency} \tag{4.23b} \]
\[ \bar{I}_{saliency} = j\,\frac{1}{2}\left(\frac{x'_d - x'_q}{r_a^2 + x'_d x'_q}\right)\left[\left(\bar{E}'\right)^* - \left(\bar{V}\right)^*\right] e^{j2\delta} \tag{4.23c} \]

where all phasor quantities are given in the system reference frame and \(\bar{E}' = (e'_q + j\,e'_d)\,e^{j\delta}\).

The equivalent circuit associated with the synchronous machine, represented by the algebraic equations (4.23), is depicted in Figure 4.6. For the classical synchronous machine model, a further approximation is often assumed, which ignores the transient saliency of the machine, i.e., x′d = x′q. In this way, the fictitious admittance ȲM becomes the inverse of ra + jx′d, which corresponds to the transient impedance of the machine.
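The consistency between the direct qd-frame computation, (4.19)-(4.20), and the Norton form, (4.22)-(4.23), can be checked numerically. The sketch below implements both under the phasor conventions above; function names are illustrative:

```python
import cmath

def machine_current_qd(eq_p, ed_p, vq, vd, ra, xd_p, xq_p, delta):
    """Stator current from (4.19), rotated to the system frame via (4.20b)."""
    det = ra**2 + xd_p * xq_p
    dq, dd = eq_p - vq, ed_p - vd
    iq = (ra * dq + xd_p * dd) / det
    id_ = (-xq_p * dq + ra * dd) / det
    return (iq + 1j * id_) * cmath.exp(1j * delta)

def machine_current_norton(eq_p, ed_p, Vbar, ra, xd_p, xq_p, delta):
    """The same current via the Norton form (4.22)-(4.23)."""
    det = ra**2 + xd_p * xq_p
    YM = (ra - 0.5j * (xd_p + xq_p)) / det                     # (4.22)
    Ep = (eq_p + 1j * ed_p) * cmath.exp(1j * delta)
    Isal = (0.5j * (xd_p - xq_p) / det
            * (Ep.conjugate() - Vbar.conjugate())
            * cmath.exp(2j * delta))                           # (4.23c)
    return -YM * Vbar + YM * Ep + Isal                         # (4.23a,b)
```

When x′d = x′q, ȲM reduces to 1/(ra + jx′d), the saliency injection vanishes, and the current collapses to ȲM(Ē′ − V̄), as stated above for the classical model.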
Moreover, since the flux linkages are also considered constant during the period of interest, the voltage Ē′ also remains constant.

(d) Initial Conditions

The initial conditions for the differential equations of the machine, given in (4.13) and (4.16), are usually computed from a power flow solution, previously calculated for a specific operating condition of interest. Therefore, before starting a transient stability simulation, each synchronous machine is assumed to be in steady state, i.e., all derivatives are made equal to zero.

Since all generated powers and terminal voltages in the system reference frame are available from the power flow solution, one can first make the derivatives in (4.16) zero and solve for e′q and e′d, which yields (4.24).

Figure 4.7. Steady-state equivalent circuit of the synchronous machine.

\[ e'_q = e_{fd} + (x_d - x'_d)\,i_d \tag{4.24a} \]
\[ e'_d = -(x_q - x'_q)\,i_q \tag{4.24b} \]

Now, combining (4.24) with the stator algebraic equations (4.15) results in (4.25).

\[ v_q = -r_a i_q + x_d i_d + e_{fd} \tag{4.25a} \]
\[ v_d = -r_a i_d - x_q i_q \tag{4.25b} \]

Rewriting (4.25) using phasors in the system reference frame yields (4.26), which can be seen as a voltage Ēq behind the impedance ra + jxq, as depicted in Figure 4.7. From (4.26b), it can be observed that the direction of the controlled voltage Ēq coincides with the q-axis of the machine, as illustrated in Figure 4.8. Hence, the angle of Ēq equals δ.

\[ \bar{V} = -(r_a + j x_q)\,\bar{I} + \bar{E}_q \tag{4.26a} \]
\[ \bar{E}_q = E_q\,e^{j\delta} \tag{4.26b} \]
\[ E_q = e_{fd} + (x_d - x_q)\,i_d \tag{4.26c} \]

Since the machine terminal voltage V̄ and the apparent power S̄ = P + jQ generated by the machine are known from the power flow calculations, one can compute the stator current Ī as shown below.

\[ \bar{I} = \left(\frac{P + j Q}{\bar{V}}\right)^* \tag{4.27} \]

Figure 4.8.
Steady-state phasor diagram of the synchronous machine, where \(\bar{E}_{fd} = e_{fd}\,e^{j\delta}\) and \(\bar{I}_d = j\,i_d\,e^{j\delta}\).

Now, plugging the previous Ī into (4.26a), one obtains Ēq, which, from (4.26b), gives the load angle δ. Once δ is known, both current components iq and id can also be found as follows:

\[ i_q = \Re\left\{\bar{I}\,e^{-j\delta}\right\} \tag{4.28a} \]
\[ i_d = \Im\left\{\bar{I}\,e^{-j\delta}\right\} \tag{4.28b} \]

For the initial value of the excitation field voltage efd, (4.26c) is used along with the voltage Eq = |Ēq| previously calculated from (4.26a). And, to finalize the initialization of the electrical differential equations, the transient voltages e′q and e′d are calculated from (4.24).

As for the steady state of the mechanical equations (4.13), the frequency deviation ω must be set to zero, which guarantees that the load angle δ remains constant. Such a requirement is only met when the mechanical and electrical powers match, i.e.,

\[ P_m = P_e = e'_d i_d + e'_q i_q + (x'_d - x'_q)\,i_q i_d \tag{4.29} \]

4.2 Sequential Transient Stability Simulator

The reasons for implementing a sequential transient stability simulator are twofold. First, it will serve as the basis for the parallel implementation, which is discussed in the next section. Second, the sequential simulator will provide a fair base for performance measurement of the parallel transient stability simulator, since both simulators share the same core functions and programming techniques.

The sequential transient stability simulator implemented complies with the alternating solution method, explained in Section 4.1.1. The flow chart of the present implementation is given in Figure 4.9.

The preparation of a simulation starts with a parsing routine that reads the system data file and the contingency information. The system data files include information regarding generators and their associated prime movers and controllers, loads, and any other dynamic device present in the system.
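The machine initialization of Section 4.1.2(d), which is part of this preparation phase, can be sketched as follows, following (4.24)-(4.29); names are illustrative:

```python
import cmath

def init_machine(P, Q, Vbar, ra, xd, xq, xd_p, xq_p):
    """Steady-state initialization of a synchronous machine from a power-flow
    operating point (P, Q, terminal voltage Vbar), per (4.24)-(4.29)."""
    I = ((P + 1j * Q) / Vbar).conjugate()        # stator current (4.27)
    Eq = Vbar + (ra + 1j * xq) * I               # voltage behind ra + jxq (4.26a)
    delta = cmath.phase(Eq)                      # load angle (4.26b)
    Iqd = I * cmath.exp(-1j * delta)             # rotate into the qd frame
    iq, id_ = Iqd.real, Iqd.imag                 # (4.28)
    efd = abs(Eq) - (xd - xq) * id_              # field voltage (4.26c)
    eq_p = efd + (xd - xd_p) * id_               # (4.24a)
    ed_p = -(xq - xq_p) * iq                     # (4.24b)
    Pm = ed_p * id_ + eq_p * iq + (xd_p - xq_p) * iq * id_   # (4.29)
    return delta, iq, id_, efd, eq_p, ed_p, Pm
```

A useful check is that the returned quantities satisfy the stator equations (4.15) exactly, and that Pm exceeds the terminal power P by precisely the stator copper loss ra·|Ī|².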
Subsequently, all data structures for dynamic devices, such as generators and loads, are allocated and initialized. Initial conditions for the specified operating condition are also computed for all dynamic devices connected to the system. Still during the pre-processing stage, the transmission system admittance matrix Y is formed, taking into consideration the models for transmission lines, transformers, loads and generators discussed in Section 4.1.2.

At this point the actual simulation starts. Before the computation of a time step proceeds, the contingency statuses are checked, so that any due switching operation can be performed. At the beginning of the simulation, or whenever topology changes occur, the admittance matrix Y is factorized for later use during the iterative network solution. Non-integrable variables, such as the voltages v at load and generation buses, are also predicted by extrapolation, and the history terms xh associated with the state variables are computed. The formula used for extrapolating the voltages is shown in (4.30) (Stott, 1979). At the end of this stage, the equations (4.4) are ready for computation, which is performed next.

\[ \bar{V}_{ext} = \frac{\bar{V}^2(t-\Delta t)}{\bar{V}(t-2\Delta t)} \tag{4.30} \]

Now, the inner loop responsible for solving the nonlinear network starts with the iteration counter m = 0. At the beginning of the network solution, the currents injected into the passive network are computed for each dynamic and algebraic device, such as generators and polynomial loads. This procedure encompasses the integration of the dynamic models, denoted by (4.4a), and the computation of the current injections i^m due to non-impedance loads, required by (4.4b). Afterwards, the voltages v^{m+1} across the system are solved from (4.4b), using the previously factorized Y matrix of the transmission network. With the first iterate v^{m+1}, the dynamic equipment can update their state variables.
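The predictor and inner network loop just described can be sketched on a one-bus example, where a constant-power load makes the network equation nonlinear and the extrapolation (4.30) supplies the starting guess. A sketch, not the thesis implementation:

```python
def solve_bus_alternating(Vs, y, S, V_prev, V_prev2, eps=1e-10, max_iter=50):
    """Alternating solution of a one-bus network fed from a source Vs through
    an admittance y and loaded by a constant power S:  y*(Vs - V) = conj(S/V).
    The starting guess is extrapolated from past time steps per (4.30)."""
    V = V_prev**2 / V_prev2                  # extrapolation predictor (4.30)
    for m in range(max_iter):
        i_inj = (S / V).conjugate()          # device injection from previous iterate
        V_new = Vs - i_inj / y               # linear "network" solution
        if abs(V_new - V) <= eps:            # convergence check on the voltages
            return V_new, m + 1
        V = V_new
    raise RuntimeError("alternating solution did not converge")
```

At convergence, the power delivered to the bus matches the specified constant power, mirroring how the full simulator alternates between device-side injection updates and linear network solutions with the factorized Y matrix.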
If the difference between the initially predicted and updated voltages is negligible, i.e., \(\|\mathbf{v}^{m+1} - \mathbf{v}^m\| \le \varepsilon_v\), the simulation can move on to the next time step; otherwise, the iteration counter m is incremented by one, the current injections i^m are recalculated and the process repeats.

Test cases simulated with an implementation of the present algorithm will be discussed in Section 4.4.

Figure 4.9. Flow chart of a transient stability program based on the partitioned approach.

4.3 MATE-based Parallel Transient Stability Simulator

A parallel transient stability simulator based on the network-based MATE algorithm will now be discussed. The objective of such an application is to further assess the potential of the network-based MATE algorithm in improving existing transient stability simulators with the least programming effort and investment.

The present parallel transient stability simulator was targeted to run on distributed computing environments, such as computer clusters built with off-the-shelf PCs and network cards.
Although distributed memory was the environment of choice, all the ideas discussed next can be directly applied to shared-memory environments.

This implementation design combines the previously presented sequential transient stability simulator, as the basis of the code, and the network-based MATE algorithm, introduced in Chapters 2 and 3, as the back-end sparse linear systems solver.

As such, the parallel alternating method for the transient stability solution is summarized by (4.31). In this set of equations, all variables are computed at time t, which is therefore kept implicit, except for the history term \(\mathbf{x}_k^h(t)\), which depends on previously computed values.

\[ \mathbf{x}_k^m = \frac{\Delta t}{2}\,\mathbf{f}_k\!\left(\mathbf{x}_k^m, \mathbf{v}_k^m\right) + \mathbf{x}_k^h \tag{4.31a} \]
\[ \mathbf{Y}\,\mathbf{v}^{m+1} = \mathbf{i}\!\left(\mathbf{x}_k^m, \mathbf{v}_k^m\right) \tag{4.31b} \]

where k = 1, ..., p indexes the subsystems into which the original system is torn, m = 0, 1, ... is the iteration counter and the history term \(\mathbf{x}_k^h(t)\) is given below.

\[ \mathbf{x}_k^h(t) = \mathbf{x}_k(t-\Delta t) + \frac{\Delta t}{2}\,\mathbf{f}_k\!\left(\mathbf{x}_k(t-\Delta t), \mathbf{v}_k(t-\Delta t)\right) \tag{4.32} \]

In (4.31), the state variables are presented in a partitioned manner, where each \(\mathbf{x}_k^m\), assigned to the subsystem Sk, is associated with only a portion of the state variables of the untorn system \(\mathbf{x}^m\). And, since each subsystem contains only a subset of the dynamic devices originally connected to the untorn system, each set of discretized dynamic equations, given in (4.31a), only depends on the voltages \(\mathbf{v}_k^m\) at buses belonging to the same subsystem Sk.

Transient stability simulators can benefit considerably from the partitioning of the dynamic equations alone, based on the fact that most production-grade transient stability simulators usually spend from 60 to 80% of the simulation time integrating the dynamic models (Brasch et al., 1979; Wu et al., 1995).

The actual MATE-based parallel transient stability simulator can be split into three main stages: the partitioning stage, the pre-processing stage and the solution stage.
Each of these stages is discussed in sequence.

4.3.1 System Partitioning Stage

Similarly to the network-based MATE algorithm, the parallel transient stability simulator needs a system partitioning phase, where subsystems and interconnection links are topologically identified prior to the simulation.

During the partitioning phase, the system under study is torn into a number of subsystems, namely p. Each of these subsystems will, therefore, contain a number of buses, branches and their associated dynamic devices. In turn, these subsystems are interconnected by the system of links.

Since actual power systems evolve incrementally, by the addition of new buses or lines into the system, their associated subsystems are expected to follow the same trend. Keeping this fact in mind, repartitioning of the system is also expected to be required mostly in cases where a different number of partitions is needed. Therefore, the approach adopted for this implementation was to split the system data files into the several files required by the subsystem and link solver processes. As a consequence, the partitioning routine does not need to be invoked in every simulation.

The subsystem files contain the data associated with their own transmission networks, generators, loads and local links. The link solver file, in turn, contains the global links information.
Each link is defined by the following parameters:

Global ID: identification of the link among the global links.
Local ID: identification of the link among the local links.
From Subsystem ID: subsystem which this link is leaving from.
From Bus ID: bus which this link is leaving from.
To Subsystem ID: subsystem which this link is arriving at.
To Bus ID: bus which this link is arriving at.
Orientation: 1 if the orientation matches the one defined for the global link; -1, otherwise.

This set of parameters allows the assembling of the subsystem-to-border mappings Qk, defined in (2.8), and the link-to-border mappings Rk, defined in (2.21), which, ultimately, interconnect all subsystems.

For the actual partitioning of the system, the METIS library (Karypis & Kumar, 1998a,b) was employed. This library provides a set of partitioning routines for undirected graphs based on multilevel algorithms, such as multilevel recursive bisection and multilevel k-way partitioning.
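For illustration, the link record described above could be represented by a simple data structure; the field names and the sign convention of the helper method are assumptions, not taken from the thesis:

```python
from dataclasses import dataclass

@dataclass
class LinkRecord:
    """One inter-subsystem link, as stored in the link solver file
    (illustrative field names)."""
    global_id: int
    local_id: int
    from_subsystem: int
    from_bus: int
    to_subsystem: int
    to_bus: int
    orientation: int   # +1 if it matches the global link orientation, -1 otherwise

    def injection_sign(self, subsystem_id):
        """Assumed sign of the link current injected at this subsystem's border
        node (current leaves the 'from' node and enters the 'to' node)."""
        if subsystem_id == self.from_subsystem:
            return -self.orientation
        if subsystem_id == self.to_subsystem:
            return self.orientation
        raise ValueError("link does not touch this subsystem")
```

Records of this kind carry enough information to assemble the Qk and Rk mappings, since each link identifies its border buses, its owning subsystems and its orientation.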
Among the objective functions minimized during the partitioning phase was the number of global links in the system.

The basic tasks in the system partitioning stage are:

(a) Parsing the system data and initializing the system structures.
(b) Undirected graph generation for the transmission network of the system under study. The graph data format is described in (Karypis & Kumar, 1998b).
(c) Undirected subgraph generation for each subsystem, including local and global links, according to the partitioning produced by METIS.
(d) Subsystem construction based on the constructed subgraphs and the original system data, such as generators and loads.
(e) Subsystem and link solver file generation.

4.3.2 Pre-processing Stage

In the pre-processing stage, the data structures required for storing and accessing the system data are allocated and initialized.

Following the network-based MATE algorithm idea, two types of processes are spawned at the beginning of the simulation: subsystem and link solver processes. According to the flow chart presented in Figure 4.10, each process starts by loading and parsing its appropriate data files. In addition to the system data associated with the transmission network, generators and loads, subsystems also require an extra file with the information regarding the event to be simulated, such as a fault followed by a line tripping.

During this stage, each subsystem Sk identifies its own set of border nodes Bk. The border nodes contained in Bk are then used, along with the local links information discussed in Section 4.3.1, to assemble the subsystem-to-border mappings Qk, defined in (2.8), and the link-to-border mappings Rk, defined in (2.21).

Moreover, during the iterative network solution, the link-to-border mappings Rk are required by the link solver to form the multi-area Thévenin equivalent from the subsystems' multi-node Thévenin equivalents, as observed from (2.25) and (2.22). Therefore, all mappings Rk need to be gathered at the link solver prior to the simulation.
This will be further discussed in the next section.

Figure 4.10. Flow chart of the pre-processing stage of the MATE-based parallel transient stability program (continues on Figure 4.11).

4.3.3 Solution Stage

The solution stage of the parallel transient stability simulator follows closely the procedure discussed in Section 4.2, and is summarized in the flow chart presented in Figure 4.11. Therefore, only the main differences between the two simulators will be analyzed next. These differences lie exactly at the points where interprocess communications are required, namely, the contingency statuses check, the network-based MATE solver and the convergence check.

Contingency Statuses Check

Contingencies are usually applied within a subsystem and usually incur the refactorization of the subsystem's local admittance matrix Yk, which, in turn, demands that the link solver rebuild and refactorize the multi-area Thévenin equivalent. Therefore, whenever a subsystem has its admittance matrix refactorized, the link solver needs to receive a signal, so it can proceed with the proper calculations.

Another aspect must be considered whenever any of the subsystems has to be refactorized.

Figure 4.11.
Flow chart of the solution stage of the MATE-based parallel transient stability program (continued from Figure 4.10).

This aspect is related to the manner in which the MPI-1 standard (MPI Standard, 2008; Gropp et al., 1999) defines the interface for sending and receiving messages. First, one needs to keep in mind that the most basic MPI routines, MPI_SEND for sending and MPI_RECV for receiving messages, require, as passed arguments, the amount of data of a given type to be sent or received and which processes are participating in the communication.

Now, recall that the multi-node Thévenin equivalents, consisting of Zbk and ebk, need to be sent from the subsystems to the link solver only when the admittance matrices Yk are (re)factorized, and only ebk otherwise. Hence, in the Thévenin equivalents gather (Section 3.2.3), the link solver has to keep track of not only the contingency status changes, but also which subsystems have changed.

Bearing these aspects in mind, the functionality of the contingency statuses check is described in Figure 4.12. In the adopted approach, each subsystem Sk sends to the link solver its own contingency status sk, calculated according to (4.33a), by means of the MPI_GATHER routine. Afterwards, the link solver contingency status s0 is computed by finding the maximum of all gathered contingency statuses, as indicated in (4.33b). In this way, s0
In this way, s0informs the link solver when a change occurred in the system, whereas sk’s identify thesubsystems that suffered the change.sk =1 if Yk was refactorized0 otherwise(4.33a)s0 = pmaxk=1ck (4.33b)Network-based MATE SolverIn the present parallel transient stability simulator, the parallel network-based MATE solver(Chapters 2 and 3) is employed in the sparse linear solutions required by the iterative networkcomputations (4.4b).The flow chart, presented in Figure 4.13, summarizes the tasks performed by the network-based MATE solver.At the beginning of a network solution, all subsystems Sk start computing their multi-node Thévenin equivalents, formed by Zbk and ebk, according to (2.12), (2.16) and (2.17). Asdiscussed previously, the recalculation of impedance matrix Zbk depends on the contingencystatus of the same subsystem, i.e., only when the admittance Yk changes.Once the Thévenin equivalents are computed, they can be gathered in the link solverprocess. This collective communication task depends on the contingency status of all subsys-tems. The contingency status are first checked on both subsystems and link solver processes.108Chapter 4. MATE-based Parallel Transient StabilitySp S3 S2 S1 Lsp s3 s2 s1s1s2s3sps0 = pmaxk=1skFigure 4.12. Contingency statuses check employed in the parallel MATE-based transientstability simulator.In case changes in any Yk are verified, the full Thévenin equivalents are sent(received) to(by)the link solver; otherwise, only the Thévenin voltages ebk are sent(received). After sendingthe Thévenin equivalents, the subsystems wait for the link currents to be calculated by thelink solver.After the link solver received the Thévenin equivalents, the multi-area Thévenin equiv-alent, formed by Zl and el, needs to be assembled according to the contingency statuses ofthe system. The assembling procedure is described by (2.22) and (2.25). 
However, if any of the contingency statuses is true, the full multi-area Thévenin equivalent needs to be rebuilt and the impedance matrix Zl factorized; otherwise, only the multi-area Thévenin voltages el are assembled. Afterwards, the link currents il can be computed from Zl il = el.

With the link currents il available, the link solver scatters them appropriately among the subsystems, i.e., respecting the link-to-border mappings Rk. The subsystems, in turn, having received their respective border node injections ibk due to the link currents il, can finally update the local injections imk and compute the local node voltages vk.

Figure 4.13. Network-based MATE parallel linear solver detail.

Convergence Check

At this point, all the nodal voltages vm+1k have been solved and all the dynamic devices' state variables xm+1k updated, during the iteration m and for each subsystem Sk. Now, the convergence of the solution needs to be checked against the predicted values of vmk and xmk, which were used to compute the currents imk.

In the convergence check stage, the convergence of each subsystem Sk can be verified

Figure 4.14.
Convergence check employed in the parallel MATE-based transient stability simulator.

through the condition \(\|\mathbf{v}^{m+1}_k - \mathbf{v}^m_k\| \le \varepsilon_v\), where εv is the error tolerance applied to the voltages. Although the convergence check can be performed by each subsystem independently, a global notification about the convergence status still needs to be issued to all subsystems and the link solver, so that all processes know whether or not to start a new iteration.

For this task, the convergence check is performed as illustrated in Figure 4.14. The procedure starts with each subsystem Sk calculating a convergence flag, as denoted by (4.34a), which is 1 if the subsystem has not converged and 0 otherwise. Afterwards, all local convergence flags are gathered in the link solver L. In the link solver, a global convergence flag cglobal is calculated, as in (4.34b), and broadcast back to all subsystems. This operation can be performed by the collective MPI_Allreduce routine applied with the MPI_MAX operator (MPI Standard, 2008; Gropp et al., 1999).

\[ c_k = \begin{cases} 0 & \text{if } \|\mathbf{v}^{m+1}_k - \mathbf{v}^m_k\| \le \varepsilon_v \\ 1 & \text{otherwise} \end{cases} \tag{4.34a} \]
\[ c_{global} = \max_{k=1,\dots,p} c_k \tag{4.34b} \]

Whether the program should keep track of the convergence of each subsystem individually is not clear, although it is possible. Different subsystems would perform more or fewer iterations under such conditions. A direct consequence of the distinct numbers of iterations would be a further imbalance in the computations across the subsystems and, hence, longer idle periods in the faster-converging subsystems, which would have to wait for the slower-converging subsystems to conclude the iterative process. However, further investigations are

Table 4.1.
Summary of the SSBI system.

    Element              Quantity
    Buses                   1,916
    Transmission lines      2,788
    Generators                 77
    Loads                   1,212

needed concerning the local convergence check, which goes beyond the scope of the present work.

4.4 Performance Analysis

For analyzing the performance of the parallel transient stability simulator, described in Section 4.3, with respect to its sequential counterpart, discussed in Section 4.2, a reduced version of the South-Southeastern Brazilian Interconnected (hereafter, SSBI) System will be simulated. An overview of the Brazilian Interconnected System in its entirety is shown in Figure 4.15, whereas a summary of the SSBI system is given in Table 4.1.

In addition, two other aspects associated with the parallel transient stability simulation will be studied in this section: the parallelization of the network solution with the network-based MATE algorithm and the parallelization of the integration of the dynamic models. As discussed in Section 4.1, these are the major steps constituting the alternating solution approach for transient stability simulations.

4.4.1 South-Southeastern Brazilian Interconnected System Partitioning

Employing the system partitioning tool described in Section 4.3.1, the SSBI system was torn into 2 up to 14 subsystems. A summary of the generated partitions is presented in Table 4.4.

Since the parallel transient stability simulator implementation relies on repeated solutions of the nodal equations and only a few factorizations due to topology changes, the network-based MATE algorithm performance for the present partitioning schemes can be qualitatively evaluated by means of the penalty factors defined in (3.31). Both link solver and communication penalty factors calculated from the partitioning information, given in Table 4.4, are shown in Figure 4.17.

Figure 4.15. Brazilian National Interconnected System. (http://www.ons.org.br/conheca_sistema/mapas_sin.aspx#)

[Admittance matrix sparsity plot: n = 1916, nnz = 6630]

Figure 4.16.
South-Southeastern Brazilian Interconnected System admittance matrix (1916 buses and 2788 branches).

[Figure 4.17 plots the link solver and communication penalty factors against the number of partitions (2 to 14), for both the bisection and k-way partitioning heuristics.]

Figure 4.17. Link solver and communication penalty factors relative to the SSBI subsystems summarized in Table 4.4.

From these graphs, both penalty factors remain below unity for only a very reduced number of partitions, regardless of the partitioning technique. Thus, a rapid degradation of the performance of the network-based MATE algorithm is expected as the number of partitions increases, because of the high penalty factors verified. In addition, the graphs also show that the communication burden represents the biggest bottleneck for the network-based MATE algorithm applied to the present partitions, due to the high values of its associated penalty factor. Based on the fact that the penalty factors are quantities normalized with respect to the subsystems' workload, one can further conclude that such performance degradation occurs due to the reduced size of the subsystems in comparison with the link solver and communication tasks.

4.4.2 Timings for the SSBI System

For the subsequent evaluations, a 100 ms short-circuit at the 765 kV Foz do Iguaçu substation, followed by the opening of one of the four circuits that connect the Foz do Iguaçu and Itaipu Binational substations, was simulated for 10 s, with a time step of 4 ms. The SSBI system was modelled by means of classical synchronous machine models, in addition to constant power loads for voltages above 0.8 p.u. and constant impedance loads otherwise.
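The voltage-dependent load switching described above can be sketched as follows. This is a minimal illustration, not the thesis code; in particular, referencing the constant-impedance branch to the 0.8 p.u. threshold, so that the characteristic is continuous there, is an assumption:

```python
def load_power(v_mag, p0, q0, v_min=0.8):
    """Composite load: constant power above v_min, constant impedance below.

    Below the threshold the load is converted to the impedance that draws
    (p0, q0) at exactly v_min, which keeps the characteristic continuous.
    """
    if v_mag >= v_min:
        return p0, q0
    scale = (v_mag / v_min) ** 2  # constant impedance: P, Q scale with |V|^2
    return p0 * scale, q0 * scale
```

For example, at 0.4 p.u., i.e., half the threshold voltage, such a load draws one quarter of its nominal power.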
For more information on the modelling of the system, see Section 4.1.2.

Timings for the sequential transient stability simulation

During the sequential transient stability simulation, 2,501 time steps were solved, which required 7,313 system solutions, i.e., iterations. The average number of iterations per time step was 2.92, with a standard deviation of 0.65, which conforms with the literature (Dommel & Sato, 1972; Stott, 1979). These statistics are summarized in Table 4.2. The timings recorded for the same simulation are given in Table 4.3. Ignoring the simulation setup and output handling, the network solution corresponded to about 25% of the computation time, while the dynamic models computations amounted to about 65%. These proportions are also in agreement with the ones associated with production-grade transient stability simulators, as reported in the literature (Brasch et al., 1979; Wu et al., 1995).

Timings for the MATE-based parallel transient stability simulation

The same simulation of the SSBI system was performed using the parallel transient stability simulator described in Section 4.3. The timings recorded from the simulations, which ignore simulation setup and output handling time, are graphically shown in Figure 4.18.

These timings show that both the multilevel recursive bisection and the k-way partitioning techniques yield similar performance for the parallel simulator. The timings are drastically reduced from about 8 s to about 3 s when the number of subsystems varies from two to about eight, while saturation is observed for higher numbers of subsystems. This saturation is justified by the increase of the link solver computational burden and communication overhead, as predicted by the penalty factors shown in Figure 4.17, which presented values lower than unity for just a few subsystems.
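The 25%/65% split quoted above can be checked directly from the task timings reported in Table 4.3, excluding setup and output handling. A small sketch follows; grouping the factorizations with the network solution, and the history/update tasks with the dynamic models, is an assumption about the accounting:

```python
# Timings from Table 4.3, in seconds (setup and output handling excluded)
timings = {
    "fault check": 0.001,
    "convergence check": 1.277,
    "history terms and extrapolation": 0.305,
    "current injections": 8.063,
    "first time factorization": 0.009,
    "same pattern factorization": 0.005,
    "voltages solution": 3.290,
    "variables update": 0.085,
}
total = sum(timings.values())

# Group the tasks into network solution, dynamic models, and checks
network = (timings["voltages solution"] + timings["first time factorization"]
           + timings["same pattern factorization"])
dynamic = (timings["current injections"] + timings["variables update"]
           + timings["history terms and extrapolation"])
checks = timings["convergence check"] + timings["fault check"]

print(round(network / total, 2), round(dynamic / total, 2),
      round(checks / total, 2))  # -> 0.25 0.65 0.1
```

These fractions reappear later as the coefficients fs_NET, fs_DYN and fs_CCC used in the speedup decomposition.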
The convergence check timings also contribute to the communication overhead, which likewise increases with the number of subsystems or, in this case, the number of processors. The other timings, namely, the network currents ik computations, the Thévenin equivalents Zbk and ebk computations, the solution of the subsystems' voltages vk and the update of the dynamic models' internal variables, decrease with the number of partitions, due to their close relationship with the size of each subsystem. That is because solving smaller subsystems requires, in general, less computational effort.

Table 4.2. Statistics of the SSBI system simulation.

    Discrimination                                  Value
    Simulation time                                 10 s
    Time step                                       4 ms
    Solution steps                                  2,501
    System solutions                                7,313
    Average of solutions per time step              2.92
    Standard deviation of solutions per time step   0.65

Table 4.3. Summary of the sequential transient stability simulation.

    Task                              Timing [s]   Participation [%]
    Setup                             0.857         5.88
    Fault check                       0.001         0.01
    Convergence check                 1.277         8.76
    History terms and extrapolation   0.305         2.10
    Current injections                8.063        55.3
    First time factorization          0.009         0.06
    Same pattern factorization        0.005         0.04
    Voltages solution                 3.290        22.6
    Variables update                  0.085         0.59
    Output handling                   0.677         4.65
    Total                            14.57 [s]

The saturation of the performance of the parallel transient stability simulator can be further illustrated by the link solver timings, given in Figure 4.19. In these timings, the link currents il computation and scattering increase almost linearly with the number of subsystems. This aspect is denoted by the almost constant participation factors associated with each of the latter tasks, i.e., about 70 to 80% of communication and 20 to 30% of computation.
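The link-current step timed above solves the dense link system of Figure 4.13: Zl = Zl0 + sum over k of (Rk)^T Zbk Rk and el = sum over k of (Rk)^T ebk are assembled, Zl il = el is solved, and the border injections ibk = Rk il are scattered back. A self-contained toy sketch of this algebra, with two subsystems, two links and purely hypothetical numerical values:

```python
# Toy sketch of the link-system assembly and solution of Figure 4.13.
# Rk maps link currents to border-node injections of subsystem k
# (entries are +/-1 according to link orientation); all values hypothetical.

def transpose(A):
    return [list(row) for row in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def solve(A, b):
    # Gaussian elimination with partial pivoting (small dense systems only)
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            M[r] = [mr - f * mi for mr, mi in zip(M[r], M[i])]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

# Two subsystems, two links; hypothetical Thévenin data
R = {1: [[1.0, 0.0], [0.0, 1.0]], 2: [[-1.0, 0.0], [0.0, -1.0]]}
Zb = {1: [[0.4, 0.1], [0.1, 0.3]], 2: [[0.2, 0.0], [0.0, 0.2]]}
eb = {1: [1.0, 0.9], 2: [0.2, 0.1]}
Zl0 = [[0.05, 0.0], [0.0, 0.05]]  # series impedances of the links

# Zl = Zl0 + sum_k Rk^T Zbk Rk ;  el = sum_k Rk^T ebk
Zl = [row[:] for row in Zl0]
el = [0.0, 0.0]
for k in (1, 2):
    RtZR = matmul(matmul(transpose(R[k]), Zb[k]), R[k])
    Zl = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(Zl, RtZR)]
    el = [a + b for a, b in zip(el, matvec(transpose(R[k]), eb[k]))]

il = solve(Zl, el)                      # link currents
ib = {k: matvec(R[k], il) for k in R}   # scattered border injections
```

With only a handful of links, this dense system is tiny compared to the subsystem solutions, which is why refactorizing Zl only on topology changes keeps the link solver cheap in computation, leaving communication as its dominant cost.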
These participation factors also reinforce the qualitative information embedded in the penalty factors (Figure 4.17), which predicted a much higher communication overhead than link solver computations.

Performance metrics of the MATE-based parallel transient stability simulation

In addition to the performance analysis of the MATE-based parallel transient stability simulator, the MATE-based parallel network solutions, the dynamic models computations and the convergence and contingency checks will be individually assessed for the simulations performed with the SSBI system. In this manner, the contribution of each parallel computation to the overall simulation can be evaluated.

Following the definitions of speedup and efficiency, introduced in Section 1.1.2, the performance metrics of the MATE-based parallel transient stability simulation are given below:

    S_TSS = Ts_TSS / Tp_TSS        E_TSS = S_TSS / (p + 1)    (4.35a)
    S_NET = Ts_NET / Tp_NET        E_NET = S_NET / (p + 1)    (4.35b)
    S_DYN = Ts_DYN / Tp_DYN        E_DYN = S_DYN / p          (4.35c)
    S_CCC = Ts_CCC / Tp_CCC        E_CCC = S_CCC / (p + 1)    (4.35d)

where Ts_Θ and Tp_Θ indicate whether the computation indicated by Θ is performed sequentially or in parallel, respectively, and S_Θ and E_Θ are the speedup and the efficiency of the parallel computations also indicated by Θ, which can be one of the following:

    TSS = transient stability simulation
    NET = network solution
    DYN = computation of the dynamic models
    CCC = convergence and contingency status check

[Figure 4.18 plots, for 2 to 14 subsystems, the total time and the participation of each task: ik computation; Zbk, ebk computation and communication; vk solution; variables update; convergence check; link solver; and contingency check. Panels: (a) multilevel recursive bisection; (b) multilevel k-way partitioning.]

Figure 4.18.
Timings of the parallel transient stability simulator for different partitioning heuristics.

[Figure 4.19 plots, for 2 to 14 subsystems, the link solver time split between the il solution and the il scatter. Panels: (a) multilevel recursive bisection; (b) multilevel k-way partitioning.]

Figure 4.19. Timings of the link solver of the parallel transient stability simulator for different partitioning heuristics.

Moreover, the definition of the efficiency of the dynamic models' parallelization, E_DYN, given in (4.35c), considers only p processes rather than p + 1, as assumed for the other tasks. That is because the link solver does not take part in the computations of the dynamic models.

Based on the timings obtained from both the sequential and the parallel transient stability simulators, the performance metrics defined in (4.35) were calculated and plotted in Figure 4.21 as functions of the number of subsystems p.

Due to the intrinsically parallel nature of the computations of the dynamic models, partitioning the latter task among distinct processors yielded speedups as high as six times with eight subsystems. From two to four processors, the efficiency even remains higher than 100%, due to speedups higher than p, which characterizes superlinear speedups.

For the MATE-based parallel network solutions, the achieved speedups saturate around two times for about six subsystems. That is explained by the fact that, as p increases, the subsystems become small compared to the communication overhead and link solver computations, as predicted by the penalty factors previously discussed in Figure 4.17. The efficiency, however, presents values ranging from 30 to 40% for two to six processors, i.e., before the speedup saturation.
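The metrics in (4.35) can be sketched as follows (illustrative timings only; the p + 1 divisor accounts for the extra link-solver process, as explained above):

```python
def speedup(t_seq, t_par):
    # speedup S = Ts / Tp, per (4.35)
    return t_seq / t_par

def efficiency(s, p, uses_link_solver=True):
    # the link solver adds one process to the p subsystem processes,
    # except for the dynamic models, which it does not take part in
    return s / (p + 1 if uses_link_solver else p)

# Illustrative values for p = 8 subsystems (hypothetical timings)
s_tss = speedup(12.0, 3.0)            # overall speedup: 4x
e_tss = efficiency(s_tss, 8)          # 4 / 9, i.e. about 44%
e_dyn = efficiency(speedup(12.0, 2.0), 8, uses_link_solver=False)  # 6 / 8
```

Note that E_DYN > 100% is possible for small p, which is exactly the superlinear behaviour reported for the dynamic models above.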
Compared with other parallel sparse linear solvers reported in the literature (see Chapter 1), these values are still very competitive.

Differently from the previous two tasks, the convergence and contingency status checks present speedups that decrease with the number of subsystems p. That is because the computations associated with, mostly, the convergence check are much faster than the required communications. Even though the computations for the convergence check decrease with the number of subsystems p, the communications increase with O(p log p) (Grama et al., 2003), which makes the speedups decrease for larger numbers of subsystems. The data fitting of the convergence and contingency status check time Tp_CCC, given in Figure 4.20, confirms the previous statement.

In order to further the understanding of the impact of the gains of the individual tasks on the overall parallel transient stability simulation, one can expand the speedup S_TSS, given in (4.35a), in terms of the individual sequential and parallel timings, as denoted in (4.36):

    S_TSS = Ts_TSS / Tp_TSS = (Ts_NET + Ts_DYN + Ts_CCC) / (Tp_NET + Tp_DYN + Tp_CCC)    (4.36)

Re-writing the above expression in terms of the individual speedups defined in (4.35) yields:

    S_TSS = fs_NET · S_NET + fs_DYN · S_DYN + fs_CCC · S_CCC    (4.37)

[Figure 4.20 plots the measured Tp_CCC against the number of subsystems, along with a p·log(p) fit.]

Figure 4.20.
The foregoing speedups are weighted according totheir corresponding task’s proportion in the sequential transient stability simulation, whichare given by the coefficients fsNET, fsDYN and fsCCC. Moreover, this relationship expresses,mathematically, that any speedup in the most time-consuming task of the transient stabilitysimulation will have a much stronger impact on the overall computations performance.From the sequential transient stability timings, given in Table 4.3, one can obtain theproportions of each of the tasks, which results:fsNET = 0.25 fsDYN = 0.65 fsCCC = 0.10From Figure 4.21, the overall parallel transient stability solution presented a speedupof about 4.3 times, which approximates the solution of (4.37), with SNET = 2, SDYN =5.5 and SCCC = 3, calculated for 8 subsystems, and the above coefficients fsNET, fsDYNand fsCCC. These proportions also show that the speedups of parallel transient stability121Chapter 4. MATE-based Parallel Transient StabilitysimulationSTSS are mostly influenced by the parallelization of the computations associatedwith the dynamic and non-linear devices, which occurs naturally when partitioning thesystem according to the network-based MATE algorithm. In the partitioning stage, however,only network topological information were taken into account, which causes imbalances inthe subsystems in terms of dynamic and non-linear devices connected to the systems, asTable 4.4 shows. Such imbalances also influence the saturation observed in the speedupsassociated with the devices’ computations,SDYN, and, consequently, in the speedup of theoverall parallel transient stability simulation.Although the network solutions correspond to only 25% of the sequential transient sta-bility simulation, parallelizing the network solutions by means of the network-based MATEalgorithm will definitely help speed up the transient stability solutions, specially for largesystems. 
For instance, from Section 3.4.2, the WECC system (about 15,000 buses) wassolved by the network-based MATE algorithm with a 6 times speedup with 14 subsystems.Now, assuming that the sequential tasks’ proportions remain fixed, one may expect a directcontribution to the overall parallel transient stability simulation speedup of about 1.5 timesfrom only the parallel network solutions.122Chapter 4. MATE-based Parallel Transient Stability2 3 4 5 6 7 8 9 10 11 12 13 140123456Speedup  2 3 4 5 6 7 8 9 10 11 12 13 14020406080100120140Efficiency[%]Number of Subsystems  STSSSNETSDYNSCCCSCCCETSSENETEDYN(a) Multilevel recursive bisection algorithm.2 3 4 5 6 7 8 9 10 11 12 13 140123456Speedup  2 3 4 5 6 7 8 9 10 11 12 13 14020406080100120140Efficiency[%]Number of Subsystems  STSSSNETSDYNSCCCSCCCETSSENETEDYN(b) Multilevel k-way partitioning algorithm.Figure 4.21. Performance metrics of the parallel transient stability simulator.123Chapter 4. MATE-based Parallel Transient StabilityTable4.4.PartitioningofthereducedSouth-SouthwesternBrazilianInterconnectedsystemusingMETISlibrary.(a)Multi-levelrecursivebisectionalgorithmplµ(nk)σ(nk)nknkµ(bk)σ(bk)bkbkµ(lk)σ(lk)lklkµ(gk)σ(gk)gkgkµ(zk)σ(zk)zkzk220958.00.095895816.51.5151820.00.0202038.56.43443606.017.0594618323638.70.663863915.04.5102115.34.2112125.73.82330404.047.3350438444479.01.647748117.56.1112522.09.0133219.24.61324303.035.9266339547383.20.438338416.85.772318.86.682715.44.41222242.435.8192282656319.30.831832015.75.582118.77.892812.84.9822202.037.5153245764273.71.027227515.36.962418.37.682911.06.2519173.126.2133202872239.51.123824114.15.172318.07.67289.63.3615151.524.6119187978212.90.921121414.15.552217.36.99298.65.1118134.722.9941691089191.60.819019314.34.352217.86.37267.74.2315121.222.5801521195174.21.117217614.85.962817.37.66357.03.8115110.215.0841381293159.70.815916112.44.241815.56.05256.43.9014101.019.66613413106147.40.814614913.55.842716.37.36355.93.221393.218.46212014109136.90.713613813.15.652615.66.57325.53.401386.618.957115(
b) Multilevel k-way algorithm

 p    l | µ(nk) σ(nk)  min  max | µ(bk) σ(bk) min max | µ(lk) σ(lk) min max | µ(gk) σ(gk) min max | µ(zk) σ(zk) min max
 2   22 | 958.0  33.9  934  982 |  18.0   1.0  17  19 |  22.0   0.0  22  22 |  38.5   0.7  38  39 | 606.0  19.8 592 620
 3   25 | 638.7  19.1  621  659 |  14.0   3.6  11  19 |  16.7   4.9  11  23 |  25.7   0.6  25  26 | 404.0  44.0 355 440
 4   49 | 479.0  12.8  467  492 |  20.8  10.6   9  36 |  24.5  11.9  10  39 |  19.2   5.7  14  26 | 303.0  18.6 281 325
 5   53 | 383.2   9.1  371  392 |  18.4   7.2   7  26 |  21.2   8.7   7  32 |  15.4   3.0  11  19 | 242.4  30.5 190 269
 6   69 | 319.3   7.6  309  328 |  20.0   5.8  12  26 |  23.0   6.8  14  31 |  12.8   6.4   8  25 | 202.0  28.4 154 241
 7   72 | 273.7   6.7  265  282 |  17.4   8.4   6  32 |  20.6  10.3   8  39 |  11.0   5.6   5  21 | 173.1  18.8 146 194
 8   80 | 239.5   6.3  230  246 |  16.1   7.3   6  27 |  20.0   8.6   9  33 |   9.6   4.5   3  17 | 151.5  26.6 112 189
 9   70 | 212.9   5.9  200  219 |  13.0   5.2   5  19 |  15.6   6.5   6  27 |   8.6   4.7   3  15 | 134.7  25.6  93 172
10   86 | 191.6   4.6  185  198 |  14.0   5.2   4  21 |  17.2   7.4   5  30 |   7.7   4.7   4  17 | 121.2  19.3  93 158
11  105 | 174.2   6.6  157  179 |  15.6   6.9   4  28 |  19.1  10.1   4  37 |   7.0   3.8   3  15 | 110.2  17.2  86 139
12  104 | 159.7   6.7  141  165 |  14.0   6.5   3  24 |  17.3   8.7   3  32 |   6.4   4.5   2  18 | 101.0  19.3  74 125
13  108 | 147.4   3.5  140  151 |  13.9   6.7   3  24 |  16.6   8.8   5  33 |   5.9   5.2   0  16 |  93.2  18.8  59 127
14  122 | 136.9   6.3  118  144 |  13.9   7.3   4  31 |  17.4   9.7   6  42 |   5.5   3.1   1  11 |  86.6  14.7  59 109

Notes: µ(xk) denotes the average of xk over the subsystems, σ(xk) its standard deviation, and min/max its extreme values; gk represents the number of generators and zk the number of loads in subsystem k.

4.5 Conclusion

A transient stability simulator based on the network-based MATE algorithm was discussed and implemented in its sequential and parallel versions. The solution approach adopted for the transient stability problem was the alternating solution method, where differential equations and network equations are solved in an alternate fashion. Basic models for the most elementary power system devices, such as generators, transmission lines, transformers and composite loads, were also presented.
A reduced version of the Brazilian Interconnected System, with about 2,000 buses, was employed in the subsequent evaluations.

The timings of the implemented sequential transient stability simulator presented a profile similar to those observed in industrial-grade programs: 65% of the computations were associated with dynamic and non-linear devices, and 25% with network solutions.

The performance metrics of the parallel transient stability simulator with respect to its sequential counterpart showed that the parallel network solutions and the distribution of the dynamic and non-linear devices among subsystems are fundamental in shaping the overall parallel transient stability simulation performance. For the simulated system, the parallel computation of the dynamic and non-linear elements achieved speedups as high as six with eight subsystems, while the parallel network solutions based on the network-based MATE algorithm achieved speedups of up to two times for the same number of subsystems. As a consequence, the overall parallel transient stability solution presented a speedup of approximately 4.5 times.

Chapter 5

Conclusion

In this thesis, the network-based Multi-Area Thévenin Equivalent (MATE) algorithm has been proposed and employed in the solution of the transient stability problem on a distributed parallel computing architecture. A theoretical performance analysis of the network-based MATE has also been developed. The resulting performance model provides qualitative and quantitative performance measures of the various stages of the algorithm when applied to the solution of a specific network on a given hardware/software setup. A large power system network, the WECC system with 14,327 buses, was solved with the developed network-based MATE algorithm.
A speedup of 6 times over conventional sparsity-oriented solutions was obtained with a 14-PC commodity cluster.

A parallel transient stability program was also developed, which employed the previous network-based MATE algorithm as the backend for the sparse linear solutions. The alternating algorithm for transient stability simulations was implemented on top of the practically intact structure of the network-based MATE algorithm, which provided straightforward concurrent solution of groups of dynamic elements. Tests on the SSBI system with 1916 buses revealed speedups of up to 4.3 times over a sequential version of the same transient stability solution algorithm.

The main conclusions of the present work are summarized in the sequence, along with some possible further investigations and final remarks about the MATE algorithm applied to the solution of bulk power systems.

5.1 Summary of Contributions

The main contributions of this thesis are:

1. Introduction of the network-based MATE algorithm, which further optimizes the matrix-based MATE algorithm in terms of computation and communication overhead

The network-based MATE introduces new concepts, such as border nodes, multi-node and link Thévenin equivalents and their associated mappings (subsystem-to-border and link-to-border). Such concepts were proven to be helpful in reducing the computations performed on the subsystems, as well as the communications required to form the link system in the link solver.

2. Implementation of the network-based MATE algorithm on a commodity cluster employing ready-to-use sparsity and communication libraries

The network-based MATE algorithm was implemented on a PC cluster, consisting of several single-core PCs interfaced by a dedicated high-speed network. In order to minimize the code development time, generic and ready-to-use libraries, which implement all computational and communication kernels required by the network-based MATE, were employed.
Sparse and dense linear solutions were provided by SuperLU (Li et al., 2003; Demmel et al., 1999) and GotoBLAS (Goto, 2006), respectively. More specifically, the GotoBLAS library provides implementations of BLAS (Basic Linear Algebra Subprograms) (Blackford et al., 2002) and LAPACK (Linear Algebra Package) (Anderson et al., 1999), which define a comprehensive standard interface for linear algebra routines. As for the interprocess communications, the Message Passing Interface (MPI) standard (MPI Standard, 2008) was adopted, whose implementation was provided by the NMPI library (NICEVT, 2005), a version of the well-known MPICH2 library (Argonne National Laboratory, 2007) over the Scalable Coherent Interface (SCI) (IEEE, 1993). Because all the aforementioned libraries are implemented in accordance with widely accepted programming standards, employing them not only minimized the development time but also enhanced the portability of the code, which can be easily compiled for many parallel computing architectures, ranging from PC clusters to supercomputers.

3. Development of a performance model for the network-based MATE algorithm

Developing a performance model for the network-based MATE algorithm enabled the establishment of a theoretical speedup limit for the method with respect to traditional sequential sparsity-oriented linear solvers. In addition, the developed performance model helps in evaluating the performance of the proposed algorithm for a given hardware/software setup and a specific power system network to be solved. This is a key piece of information that also helps one make decisions regarding the most suitable algorithm and computational environment for solving a given problem.

4. Application of the parallel network-based MATE algorithm to the solution of the network equations associated with transient stability simulation
The proposed network-based MATE algorithm was employed in the solution of sparse linear systems, common in transient stability programs. For the 14,327-bus system, extracted from the North American Western Electricity Coordinating Council (WECC), speedups closely followed the theoretical speedup of p/2, where p is the number of partitions, reaching about 6 times with 14 partitions (Figure 3.23).

5. Implementation of a parallel transient stability simulator based on the parallel network-based MATE algorithm

The network-based MATE algorithm was employed as the backend for the sparse linear solutions required by transient stability simulations. The network-based MATE algorithm not only provided an alternative for solving the network equations in parallel, but also lent a structure suitable for solving groups of differential equations associated with dynamic elements concurrently. Because the integration of the differential equations is the most time-consuming task in most industrial-grade transient stability simulators (about 80% of the computational time), its parallelization becomes essential to yield maximum performance in any parallel transient stability simulator. Employing the alternating algorithm for solving the transient stability problem, tests on the SSBI system with 1916 buses revealed speedups of up to 4.3 times over a sequential version of the same transient stability solution algorithm.

5.2 Future Work

The network-based MATE algorithm has proved to be competitive with other parallel algorithms for the solution of large linear systems in transient stability simulations using commodity off-the-shelf PC clusters. However, there are still many related aspects suitable for further investigation, such as:

1. Improved solution of the subsystem equations

As shown in the thesis, the theoretical speedup limit of the network-based MATE algorithm is p/2, where p is the number of partitions.
This limit can be roughly explained by the fact that the subsystems need to be solved, at least, twice during each solution. Therefore, further optimizing the Thévenin equivalents computations may significantly increase the foregoing speedup limit. One technique that can be employed for such an optimization is the sparse vector methods (Tinney et al., 1985), which can potentially reduce the number of non-zeros handled during the solution of systems with only a few injections. In this case, the theoretical speedup limit would approximate p/(1 + f), where f ∈ (0, 1] represents the computational factor due to the Thévenin equivalent solutions. For instance, as reported by Tinney et al. (1985), sparse vector methods achieved up to 20 times speedup over regular sparse forward and backward substitutions. If sparse vector methods were used for the Thévenin equivalent computations, the computational factor f = 1/20 would yield a speedup limit of approximately (20/21)p ≈ 0.95p, which means a practically linear speedup.[10]

2. Implementation of the MATE algorithm on more advanced parallel architectures

In the present work, each subsystem, as well as the system of equations associated with the interconnections among subsystems, i.e., the link current equations, was solved on a single processor. Further acceleration of the solutions could be achieved by means of more advanced parallel architectures, like multi-core processors and graphics processing units (GPUs). With the tightly-coupled multiprocessing architectures provided by off-the-shelf multi-core processors, the MATE algorithm could be used for the coarse-grain parallelization, i.e., defining subsystems and links, whereas finer-grain parallelism (Huang & Wing, 1979; Wing & Huang, 1980; Wu & Bose, 1995; Amestoy & Duff, 2000; Armstrong et al., 2006) could be further exploited inside each subsystem or link solver.
As such, each subsystem and the link solver could be mapped onto different multiprocessing units and solved locally in parallel. And, as long as the interprocessor communications, i.e., the number of links, are minimized at the MATE algorithm level, one may expect better scalability of the solutions on SMP machines. Yet another piece of hardware with great potential for accelerating floating-point operations is the GPU. Differently from what has been observed for CPU technology, the performance of GPUs is increasing quickly, due to the inherently parallel nature of graphics computations. GPUs can achieve much higher arithmetic intensity with the same transistor count as regular CPUs (Owens et al., 2007).

3. Performance analysis of the multi-level MATE applied to the solution of large power systems

Armstrong et al. (2006) show that multi-level MATE is capable of considerably improving the performance of the MATE algorithm when only dense matrices are handled on a single processor. Due to the relatively reduced size of the systems solved by the multi-level MATE, sparsity-oriented techniques were not considered. Hence, it seems worthwhile to develop a performance model for a sparsity-oriented version of multi-level MATE in order to determine possible performance gains and bottlenecks when applied to solving large power systems.

[10] The theoretical speedup limit assumes both communications and link solver computations to be negligible, while only forward and backward substitutions are performed.

4. Development of partitioning algorithms specially tailored to meet both the computational and communication requirements imposed by the MATE algorithm

Two different generic graph partitioning algorithms, namely, the multilevel recursive bisection and k-way algorithms (Karypis & Kumar, 1998b,a), were employed in this work for defining the subsystems solved by the MATE algorithm.
From the results, the subsystems generated by both aforementioned partitioning heuristics yielded similar performance for the parallel forward and backward solutions implemented through the MATE algorithm. For the factorization, however, the load unbalance among the subsystems was evident, due to the uneven number of border nodes in each subsystem. As for the parallel transient stability program, the computational costs involved in solving the equations associated with generators and loads were not considered in the partitioning phase. Therefore, the MATE algorithm would certainly benefit from the development of a partitioning heuristic capable of balancing the load among many subsystems, considering the number of border nodes and the computational costs of generators and loads in each subsystem, while keeping the number of interconnections among the subsystems low.

5. Feasibility of geographically distributed power system simulations based on the MATE algorithm

As pointed out by Wang & Morison (2006), another challenging issue is the modelling of the external system not observable by the SE (state estimator). Inclusion of these models in the real-time system models may require the development of adequate offline equivalent models, which can then be merged with the real-time SE models.

Similarly to the method proposed by Esmaeili & Kouhsari (2007), the network-based MATE algorithm can also aid the deployment of geographically distributed transient stability simulations. In the MATE context, areas managed by distinct utilities can be represented by separate subsystems, whereas the tie lines interconnecting these areas are the links. In this manner, each utility can preserve the locality and the privacy of its own data, and needs only to make its system equivalents available for the sake of the whole system's solution.
5.3 Final Remarks

The MATE algorithm, originally proposed by Martí et al. (2002), has been reformulated from an electric network point of view, which yielded the network-based MATE algorithm introduced in this work. It has been shown that inherent characteristics of electrical networks, revealed by the present development, can reduce both the computational effort and the communication overhead of a parallel implementation of the MATE algorithm.

Results of previous related research, and the present implementation of the MATE algorithm on a commodity cluster, show that, in order to keep parallel computations efficient, one always has to keep in mind the nature of the problem to be solved and the target computing architecture. Some parallel algorithms may be more suitable for certain architectures than others.

For example, SuperLU_DIST, whose goal is scalability when solving extremely large-scale unsymmetrical problems on massive distributed-memory supercomputers, may not perform well on a commodity cluster solving systems of moderate size. For the sake of comparison, timings for the parallel network-based MATE and SuperLU_DIST when solving the two power systems studied in Section 3.3 and Section 4.4 on a commodity cluster are shown in Table 5.1. As can be observed, the network-based MATE scales well up to 8 processors, while SuperLU_DIST suffers from its high communication overhead in comparison with the computational load of each processing node. The timings show, for instance, that the network-based MATE algorithm was able to solve the 14,327-bus WECC system up to 26 times faster than SuperLU_DIST.

In summary, the network-based MATE algorithm can be very efficient for solving the sparse linear systems commonly found in transient stability programs, using commodity PC clusters built with off-the-shelf processors.

Table 5.1.
Timings for 1 factorization and 1000 repeated solutions using the network-based MATE and SuperLU_DIST on a cluster of 16 AMD Athlon(TM) 64 2.5 GHz processors interconnected by an SCI-based network.

Processors | Solver        | SSBI (1,916 buses)† | WECC (14,327 buses)†
1          | MATE          | -                   | -
           | SuperLU Seq.  | 0.43                | 4.6
2          | MATE          | -                   | -
           | SuperLU_DIST  | 4.52                | 37.6
3          | MATE          | 0.45                | 4.3
           | SuperLU_DIST  | 5.24                | 39.3
4          | MATE          | 0.32                | 2.6
           | SuperLU_DIST  | 5.25                | 38.4
5          | MATE          | 0.26                | 1.8
           | SuperLU_DIST  | 4.87                | 36.0
6          | MATE          | 0.24                | 1.4
           | SuperLU_DIST  | 4.58                | 32.6
7          | MATE          | 0.22                | 1.2
           | SuperLU_DIST  | 4.48                | 30.3
8          | MATE          | 0.21                | 1.1
           | SuperLU_DIST  | 4.19                | N/A‡

† All timings are given in [s]. ‡ N/A means that the program returned an unknown error.

Bibliography

Alexandrov, A., Ionescu, M. F., Schauser, K. E., & Scheiman, C. (1995). LogGP: Incorporating Long Messages into the LogP Model — One step closer towards a realistic model for parallel computation. Technical report, University of California at Santa Barbara.

Aloisio, G., Bochicchio, M., Scala, M. L., & Sbrizzai, R. (1997). A distributed computing approach for real-time transient stability analysis. Power Systems, IEEE Transactions on, 12(2), 981–987.

Alvarado, F. (1976). Computational complexity in power systems. IEEE Trans. Power App. Syst., 95(4), 1028–1037.

Alvarado, F. (1979). Parallel solution of transient problems by trapezoidal integration. IEEE Trans. Power App. Syst., PAS-98(3), 1080–1090.

Alvarado, F., Reitan, D., & Bahari-Kashani, M. (1977). Sparsity in diakoptic algorithms. IEEE Trans. Power App. Syst., 96(5), 1450–1459.

Alvarado, F., Yu, D., & Betancourt, R. (1990). Partitioned sparse A⁻¹ methods. IEEE Trans. Power Syst., 5(2), 452–459.

Amdahl, G. (1967). Validity of the single processor approach to achieving large-scale computing capabilities. In AFIPS Conference Proceedings (pp. 483–485).

Amestoy, P. R. & Duff, I. S. (2000). Multifrontal parallel distributed symmetric and unsymmetric solvers. Comput. Methods Appl. Mech. Eng., 184, 501–520.

Amestoy, P. R., Duff, I. S., L'Excellent, J., & Li, X. S. (2001).
Analysis and comparison of two general sparse solvers for distributed memory computers. ACM Trans. Math. Softw., 27(4), 388–421.

Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., & Sorensen, D. (1999). LAPACK Users' Guide. Philadelphia, PA: Society for Industrial and Applied Mathematics, third edition.

Anderson, P. M. & Fouad, A. A. (2003). Power System Control and Stability. IEEE Press.

Andersson, G., Donalek, P., Farmer, R., Hatziargyriou, N., Kamwa, I., Kundur, P., Martins, N., Paserba, J., Pourbeik, P., Sanchez-Gasca, J., Schulz, R., Stankovic, A., Taylor, C., & Vittal, V. (2005). Causes of the 2003 Major Grid Blackouts in North America and Europe, and Recommended Means to Improve System Dynamic Performance. IEEE Transactions on Power Systems, 20(4), 1922–1928.

Argonne National Laboratory (2007). MPICH2, an implementation of the Message-Passing Interface (MPI). http://www-unix.mcs.anl.gov/mpi/.

Armstrong, M., Martí, J. R., Linares, L. R., & Kundur, P. (2006). Multilevel MATE for efficient simultaneous solution of control systems and nonlinearities in the OVNI simulator. Power Systems, IEEE Transactions on, 21(3), 1250–1259.

Arrillaga, J. & Watson, N. R. (2001). Computer Modelling of Electrical Power Systems. Wiley, 2nd edition.

Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., & Van der Vorst, H. (1994). Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. Philadelphia, PA: SIAM.

Beowulf.org (2009). Beowulf.org: Overview. http://www.beowulf.org/overview/index.html.

Blackford, L. S., Demmel, J., Dongarra, J., Duff, I., Hammarling, S., Henry, G., Heroux, M., Kaufman, L., Lumsdaine, A., Petitet, A., Pozo, R., Remington, K., & Whaley, R. C. (2002). An Updated Set of Basic Linear Algebra Subprograms (BLAS).
ACM Transactions on Mathematical Software, 28(2), 135–151.

Brasch, F., Van Ness, J., & Kang, S.-C. (1979). The use of a multiprocessor network for the transient stability problem. In Proc. PICA-79 Power Industry Computer Applications Conference, IEEE (pp. 337–344).

Bunch, J. R. & Rose, D. J. (1972). Partitioning, Tearing, and Modification of Sparse Linear Systems. Technical Report TR 72-149, Cornell University.

Chai, J. & Bose, A. (1993). Bottlenecks in parallel algorithms for power system stability analysis. IEEE Trans. Power Syst., 8(1), 9–15.

Chai, J., Zhu, N., Bose, A., & Tylavsky, D. (1991). Parallel Newton type methods for power system stability analysis using local and shared memory multiprocessors. IEEE Trans. Power Syst., 6(4), 1539–1545.

Crow, M. & Ilić, M. (1990). The parallel implementation of the waveform relaxation method for transient stability simulations. IEEE Trans. Power Syst., 5(3), 922–932.

Culler, D. E., Liu, L. T., Martin, R. P., & Yoshikawa, C. (1996). LogP performance assessment of fast network interfaces. IEEE Micro, (pp. 35–43).

Davis, T. A. (2004). A column pre-ordering strategy for the unsymmetric-pattern multifrontal method. ACM Trans. Math. Softw., 30(2), 165–195.

De Rybel, T., Tomim, M. A., Singh, A., & Martí, J. R. (2008). An introduction to open-source linear algebra tools and parallelisation for power system applications. In Electrical Power & Energy Conference, Vancouver, Canada.

Decker, I., Falcão, D., & Kaszkurewicz, E. (1992). Parallel implementation of a power system dynamic simulation methodology using the conjugate gradient method. Power Systems, IEEE Transactions on, 7(1), 458–465.

Decker, I., Falcão, D., & Kaszkurewicz, E. (1996). Conjugate gradient methods for power system dynamic simulation on parallel computers. Power Systems, IEEE Transactions on, 11(3), 1218–1227.

Demmel, J. W., Eisenstat, S. C., Gilbert, J. R., Li, X. S., & Liu, J. W. H. (1999). A supernodal approach to sparse partial pivoting.
SIAM Journal on Matrix Analysis and Applications, 20, 720–755.

Dommel, H. & Sato, N. (1972). Fast transient stability solutions. IEEE Transactions on Power Apparatus and Systems, PAS-91, 1643–1650.

Dommel, H. W. (1996). EMTP Theory Book. Vancouver, British Columbia, Canada: Microtran Power System Analysis Corporation, 2nd edition.

Enns, M., Tinney, W., & Alvarado, F. (1990). Sparse matrix inverse factors. Power Systems, IEEE Transactions on, 5(2), 466–473.

Ernst, D., Ruiz-Vega, D., Pavella, M., Hirsch, P., & Sobajic, D. (2001). A unified approach to transient stability contingency filtering, ranking and assessment. IEEE Trans. Power Syst., 16(3), 435–443.

Esmaeili, S. & Kouhsari, S. (2007). A distributed simulation-based approach for detailed and decentralized power system transient stability analysis. Electric Power Systems Research, 77(5-6), 673–684.

Foster, I. (1995). Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison-Wesley.

Goto, K. (2006). Optimized GotoBLAS Libraries. Retrieved in January, 2007.

Goto, K. & Van De Geijn, R. A. (2008a). Anatomy of high-performance matrix multiplication. ACM Trans. Math. Softw., 34, 1–25.

Goto, K. & Van De Geijn, R. A. (2008b). High-performance implementation of the level-3 BLAS. ACM Trans. Math. Softw., 35, 1–14.

Grainger, J. J. & Stevenson, W. D. (1994). Power System Analysis. McGraw-Hill, Inc.

Grama, A., Gupta, A., Kumar, V., & Karypis, G. (2003). Introduction to Parallel Computing.

Gropp, W., Lusk, E., & Skjellum, A. (1999). Using MPI: Portable Parallel Programming with the Message-Passing Interface. Cambridge, Mass.: MIT Press.

Gropp, W., Lusk, E., & Sterling, T. (2003). Beowulf Cluster Computing with Linux, 2nd Edition. The MIT Press, 2nd edition.

Happ, H. H. (1970). Diakoptics and piecewise methods. IEEE Trans. Power App. Syst., (7), 1373–1382.

Happ, H. H. (1973). Gabriel Kron and Systems Theory. Schenectady, N.Y.: Union College Press.

Happ, H. H. (1974).
Diakoptics - the solution of system problems by tearing. Proc. IEEE, 62(7), 930–940.

Hatcher, W., Brasch, F. M., Jr., & Van Ness, J. (1977). A feasibility study for the solution of transient stability problems by multiprocessor structures. IEEE Trans. Power App. Syst., 96(6), 1789–1797.

Ho, C.-W., Ruehli, A., & Brennan, P. (1975). The modified nodal approach to network analysis. IEEE Trans. Circuits Syst., 22(6), 504–509.

Hollman, J. & Martí, J. (2003). Real time network simulation with PC-cluster. IEEE Trans. Power Syst., 18(2), 563–569.

Hong, C. & Shen, C. (2000). Implementation of parallel algorithms for transient stability analysis on a message passing multicomputer. In Power Engineering Society Winter Meeting, 2000. IEEE, volume 2 (pp. 1410–1415).

Huang, J. & Wing, O. (1979). Optimal parallel triangulation of a sparse matrix. Circuits and Systems, IEEE Transactions on, 26(9), 726–732.

IEEE (1993). IEEE standard for scalable coherent interface (SCI).

Ilić-Spong, M., Crow, M. L., & Pai, M. A. (1987). Transient stability simulation by waveform relaxation methods. IEEE Trans. Power Syst., 2(4), 943–949.

Juarez T., C., Castellanos, R., Messina, A., & Gonzalez, A. (2007). A higher-order Newton method approach to computing transient stability margins. In R. Castellanos (Ed.), Proc. 39th North American Power Symposium NAPS '07 (pp. 360–367).

Karypis, G. & Kumar, V. (1998a). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM J. Sci. Comput., 20, 359–392.

Karypis, G. & Kumar, V. (1998b). METIS: A Software Package for Partitioning Unstructured Graphs, Partitioning Meshes, and Computing Fill-Reducing Orderings of Sparse Matrices, Version 4.0. University of Minnesota, Department of Computer Science / Army HPC Research Center. Retrieved in October, 2006.

Kassabalidis, I. N. (2002). Dynamic security border identification using enhanced particle swarm optimization. Power Systems, IEEE Transactions on, 17(3), 723–729.

Kielmann, T., Bal, H.
E., & Verstoep, K. (2000). Fast measurement of LogP parameters for message passing platforms. Lecture Notes in Computer Science, 1800, 1176–1183.

Krause, P. C., Wasynczuk, O., & Sudhoff, S. D. (2002). Analysis of Electric Machinery and Drive Systems. Piscataway, NJ; New York: IEEE Press; Wiley-Interscience.

Kron, G. (1953). A method to solve very large physical systems in easy stages. Circuit Theory, IRE Transactions on, 2(1), 71–90.

Kron, G. (1963). Diakoptics: The Piecewise Solution of Large-Scale Systems. London: Macdonald & Co.

Kundur, P. (1994). Power System Stability and Control. New York: McGraw-Hill.

Kundur, P. & Dandeno, P. (1983). Implementation of advanced generator models into power system stability programs. Power Apparatus and Systems, IEEE Transactions on, PAS-102(7), 2047–2054.

Kundur, P., Morison, G., & Wang, L. (2000). Techniques for on-line transient stability assessment and control. In Proc. IEEE Power Engineering Society Winter Meeting, volume 1 (pp. 46–51).

La Scala, M., Bose, A., Tylavsky, D., & Chai, J. (1990a). A highly parallel method for transient stability analysis. Power Systems, IEEE Transactions on, 5(4), 1439–1446.

La Scala, M., Brucoli, M., Torelli, F., & Trovato, M. (1990b). A Gauss-Jacobi-block-Newton method for parallel transient stability analysis. IEEE Trans. Power Syst., 5(4), 1168–1177.

La Scala, M., Sblendorio, G., Bose, A., & Wu, J. (1996). Comparison of algorithms for transient stability simulations on shared and distributed memory multiprocessors. IEEE Trans. Power Syst., 11(4), 2045–2050.

La Scala, M., Sblendorio, G., & Sbrizzai, R. (1994). Parallel-in-time implementation of transient stability simulations on a transputer network. IEEE Trans. Power Syst., 9(2), 1117–1125.

La Scala, M., Sbrizzai, R., & Torelli, F. (1991). A pipelined-in-time parallel algorithm for transient stability analysis. IEEE Trans. Power Syst., 6(2), 715–722.

Lau, K., Tylavsky, D., & Bose, A. (1991).
Coarse grain scheduling in parallel triangular factorization and solution of power system matrices. Power Systems, IEEE Transactions on, 6(2), 708–714.

Li, X. S. & Demmel, J. W. (2002). SuperLU_DIST: A scalable distributed-memory sparse direct solver for unsymmetric linear systems. Lawrence Berkeley National Laboratory. Paper LBNL-49388.

Li, X. S., Demmel, J. W., & Gilbert, J. R. (2003). SuperLU Users' Guide. The Regents of the University of California, through Lawrence Berkeley National Laboratory. Retrieved in October, 2006.

Mansour, Y., Vaahedi, E., Chang, A., Corns, B., Garrett, B., Demaree, K., Athay, T., & Cheung, K. (1995). BC Hydro's on-line transient stability assessment (TSA) model development, analysis and post-processing. IEEE Trans. Power Syst., 10(1), 241–253.

Marceau, R. & Soumare, S. (1999). A unified approach for estimating transient and long-term stability transfer limits. IEEE Trans. Power Syst., 14(2), 693–701.

Martí, J. R., Linares, L. R., Hollman, J. A., & Moreira, F. A. (2002). OVNI: Integrated software/hardware solution for real-time simulation of large power systems. In Conference Proceedings of the 14th Power Systems Computation Conference, PSCC02, Sevilla, Spain.

MPI Standard (2008). A Message-Passing Interface Standard. Retrieved [August 8, 2008] from http://www-unix.mcs.anl.gov/mpi/.

NICEVT (2005). Message-passing interface MPI2 over high speed net SCI. Retrieved [October 1, 2006] from http://www.nicevt.ru/download/index.html?lang=en.

Ogbuobiri, E., Tinney, W., & Walker, J. (1970). Sparsity-directed decomposition for Gaussian elimination on matrices. IEEE Trans. Power App. Syst., PAS-89(1), 141–150.

Owens, J. D., Luebke, D., Govindaraju, N., Harris, M., Krüger, J., Lefohn, A. E., & Purcell, T. J. (2007). A survey of general-purpose computation on graphics hardware. Computer Graphics Forum, 26(1), 80–113.

Pai, M. & Dag, H. (1997). Iterative solver techniques in large scale power system computation.
In Decision and Control, 1997. Proceedings of the 36th IEEE Conference on, volume 4 (pp. 3861–3866).

Rescigno, T. N., Baertschy, M., Isaacs, W. A., & McCurdy, C. W. (1999). Collisional breakup in a quantum system of three charged particles. Science, 286(5449), 2474–2479.

Saad, Y. & Van der Vorst, H. A. (2000). Iterative solution of linear systems in the 20th century. Journal of Computational and Applied Mathematics, 123, 1–33.

Sato, N. & Tinney, W. (1963). Techniques for exploiting the sparsity of the network admittance matrix. IEEE Trans. Power App. Syst., 82(69), 944–950.

Shu, J., Xue, W., & Zheng, W. (2005). A parallel transient stability simulation for power systems. Power Systems, IEEE Transactions on, 20(4), 1709–1717.

Stott, B. (1979). Power system dynamic response calculations. Proc. IEEE, 67(2), 219–241.

Tinney, W., Brandwajn, V., & Chan, S. (1985). Sparse vector methods. IEEE Trans. Power App. Syst., PAS-104(2), 295–301.

Tinney, W. & Walker, J. (1967). Direct solutions of sparse network equations by optimally ordered triangular factorization. Proc. IEEE, 55(11), 1801–1809.

Tomim, M., Martí, J., & Wang, L. (2009). Parallel solution of large power system networks using the Multi-Area Thévenin Equivalents (MATE) algorithm. International Journal of Electrical Power & Energy Systems, In Press, Corrected Proof. DOI: 10.1016/j.ijepes.2009.02.002.

Tomim, M. A., Martí, J. R., & Wang, L. (2008). Parallel computation of large power system network solutions using the Multi-Area Thévenin Equivalents (MATE) algorithm. In 16th Power Systems Computation Conference, PSCC2008, Glasgow, Scotland.

Tylavsky, D. J., Bose, A., Alvarado, F., Betancourt, R., Clements, K., Heydt, G. T., Huang, G., Ilic, M., Scala, M. L., Pai, M., Pottle, C., Talukdar, S., Ness, J. V., & Wu, F. (1992). Parallel processing in power systems computation. Power Systems, IEEE Transactions on, 7(2), 629–638.

Van der Vorst, H. A. & Chan, T. F. (1997). Linear system solvers: Sparse iterative methods.
Parallel Numerical Algorithms, ICASE/LaRC Interdisciplinary Series in Science and Engineering, 4, 167–202.

Wang, F. (1998). Parallel-in-time relaxed Newton method for transient stability analysis. Generation, Transmission and Distribution, IEE Proceedings-, 145(2), 155–159.

Wang, L. & Morison, K. (2006). Implementation of online security assessment. Power and Energy Magazine, IEEE, 4(5), 46–59.

Wing, O. & Huang, J. (1980). A computation model of parallel solution of linear equations. Computers, IEEE Transactions on, C-29(7), 632–638.

Wu, F. (1976). Solution of large-scale networks by tearing. IEEE Trans. Circuits Syst., 23(12), 706–713.

Wu, J. Q. & Bose, A. (1995). Parallel solution of large sparse matrix equations and parallel power flow. Power Systems, IEEE Transactions on, 10(3), 1343–1349.

Wu, J. Q., Bose, A., Huang, J. A., Valette, A., & Lafrance, F. (1995). Parallel implementation of power system transient stability analysis. IEEE Trans. Power Syst., 10(3), 1226–1233.

Xue, Y., Rousseaux, P., Gao, Z., Belhomme, R., Euxible, E., & Heilbronn, B. (1993). Dynamic extended equal area criterion. In P. Rousseaux (Ed.), Proc. Joint International Power Conference Athens Power Tech APT 93, volume 2 (pp. 889–895).

Appendix A

LU Factorization

The LU factorization is mathematically equivalent to Gaussian elimination and allows one to express a permuted version of a given complex system matrix A in terms of its lower- and upper-triangular factors, namely, L and U.
Once the factors L and U are formed, two new triangular linear systems are generated, which, in turn, can be solved by forward and backward substitutions.

A.1 Problem Formulation

Let (A.1) represent a generic complex linear system, where $A$ is a complex $n \times n$ matrix, and $B$ and $X$ are also complex $n \times m$ matrices, the latter being unknown.

$$A X = B \tag{A.1}$$

In order to solve this system for $X$, one alternative is to factorize $A$ into two triangular matrices, namely $L$ and $U$, as stated in (A.2):

$$P_r A P_c = L U \tag{A.2}$$

where $P_r$ and $P_c$ are row and column permutation matrices, respectively. In case $A$ is dense, only $P_r$ may be employed in order to perform partial pivoting, while both $P_r$ and $P_c$ are needed in case total pivoting is required. In case the matrix $A$ is large and sparse, $P_r$ still performs partial pivoting during the factorization, whereas $P_c$ is selected in such a manner that it minimizes the number of added non-zeros (or simply fill-ins) in the structure of $P_r A P_c$, and consequently in the structure of $L + U$.

Pre-multiplying (A.1) by $P_r$ and making $X = P_c Z$ leads to the following expression:

$$(P_r A P_c) Z = P_r B \tag{A.3}$$

where the product inside the brackets can be substituted by the product $L U$, according to (A.2), which in turn yields

$$L U Z = P_r B \tag{A.4}$$

The new linear system obtained above can then be split into two interdependent triangular linear systems, defined by $L$ and $U$, whose result is the matrix $Z$. Lastly, the solution of the original linear system, $X$, can be obtained by permuting the rows of $Z$ according to $P_c$.
Thus, introducing a new $n \times m$ matrix $W$, which will keep the solution of the lower-triangular system defined by $L$, this procedure is summarized as follows.

$$P_r A P_c = L U \quad \text{(LU factorization)} \tag{A.5a}$$
$$L W = P_r B \quad \text{(forward substitution)} \tag{A.5b}$$
$$U Z = W \quad \text{(backward substitution)} \tag{A.5c}$$
$$X = P_c Z \quad \text{(row permutation)} \tag{A.5d}$$

A.2 LU Factorization Process

Assuming that the complex matrix $A$ is already ordered, its $L$ and $U$ factors can be obtained in a manner very similar to the one synthesized in (A.6). Firstly, one separates the first row and column of the matrix $A$, as shown in (A.6a). As a consequence, the partitioned matrix can be further rewritten, as in (A.6b), in the form of a product of two other matrices, namely $L_1 U_1$. Note that $L_1$ is already lower triangular, while $U_1$ is the matrix $A$ with its first column eliminated by Gaussian elimination.

$$A = \begin{bmatrix} a_1 & (r^1)^T \\ c^1 & A_1 \end{bmatrix} \tag{A.6a}$$

$$= \begin{bmatrix} 1 & 0 \\ \frac{1}{a_1} c^1 & I \end{bmatrix} \begin{bmatrix} a_1 & (r^1)^T \\ 0 & A_1 - \frac{1}{a_1} c^1 (r^1)^T \end{bmatrix} \tag{A.6b}$$

$$= L_1 U_1 \tag{A.6c}$$

One can then repeat this procedure for the matrices $A_k = A_{k-1} - \frac{1}{a_k} c^k (r^k)^T$ with $k = 2, \ldots, n-1$, which are in fact formed by means of recursive rank-1 updates. Such a procedure ultimately generates the following result:

$$A = L_1 L_2 \cdots L_{n-1} U_{n-1} = L U \tag{A.7}$$

where each $L_k$ equals the identity matrix, except for the subvector $\frac{1}{a_k} c^k$ placed below the diagonal in column $k$:

$$L_k = \begin{bmatrix} 1 & & & \\ & \ddots & & \\ & & 1 & \\ & & \tfrac{1}{a_k} c^k & \ddots \\ & & & & 1 \end{bmatrix} \tag{A.8}$$

As a final result, $U_{n-1}$ is the upper-triangular matrix $U$, whereas the lower-triangular $L$ equals the product of all $L_k$ with $k = 1, \ldots, n-1$, i.e., $L = \prod_{k=1}^{n-1} L_k$.

A.3 Computational Complexity of LU Factorization and Solution

As presented by (Bunch & Rose, 1972), assume the sparse matrix $M = P_r A P_c$ that also includes all fill-ins introduced in the structure $P_r A P_c$, according to (A.2). This matrix $M$ is known as a perfect elimination matrix, since no fill-ins are introduced when it is decomposed into its $L$ and $U$ factors. In this context, Table A.1 can be generated, based on some metrics defined in (Alvarado, 1976) for the perfect elimination matrix $M$.
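The four-step procedure in (A.5) can be illustrated numerically. The fragment below is a minimal sketch, not the thesis implementation: it uses SciPy's dense `lu`, for which only the row permutation is employed (i.e., $P_c = I$, matching the dense case of Section A.1), and the small complex matrix and right-hand side are made-up values for the example only.

```python
# Minimal sketch of the four-step procedure (A.5) for a dense complex
# system (Pc = I, only row pivoting), using SciPy. The 3x3 matrix and
# right-hand side are illustrative values, not a real power system.
import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[ 4.0 + 1j, -1.0,        0.0],
              [-1.0,       4.0 + 1j,  -2.0],
              [ 0.0,      -2.0,        3.0 + 1j]])
B = np.array([1.0 + 0j, 0.0, 2.0])

P, L, U = lu(A)                               # A = P L U, so Pr = P^T   (A.5a)
W = solve_triangular(L, P.T @ B, lower=True)  # L W = Pr B               (A.5b)
Z = solve_triangular(U, W, lower=False)       # U Z = W                  (A.5c)
X = Z                                         # X = Pc Z, with Pc = I    (A.5d)

assert np.allclose(A @ X, B)                  # recovers the solution of A X = B
```

For the sparse case discussed in the text, a fill-reducing column permutation $P_c$ would additionally be applied, as done internally by sparse direct solvers.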
These metrics' definitions are repeated below for convenience.

$$\tau_u = \sum_{i=1}^{n-1} r_i \tag{A.9}$$

$$\alpha = \sum_{i=1}^{n-1} r_i c_i \tag{A.10}$$

In the definitions (A.9) and (A.10), $r_i$ is the number of non-zero elements in the $i$th row of the matrix $U$ above its main diagonal, and $c_i$ is the number of non-zero elements in the $i$th column of the matrix $L$ below its main diagonal. One can readily verify that $\tau_u$ equals the total number of off-diagonal non-zeros of $U$. The metric $\alpha$ is the sum of all multiplications needed to perform all rank-1 updates on the LU structure, as shown in the previous section.

On a further note, for symmetric and topologically symmetric matrices, and consequently dense matrices, $r_i$ is the same as $c_i$. The operation count for dense systems is also included in Table A.1, which was obtained by making $r_i = c_i = n - i$.

Table A.1. Operation count for sparse and dense LU factorization and solution for unsymmetric, but topologically symmetric, systems.

Task     | Sparse: add | mult | div | Dense: add        | mult              | div
LU fact. | α           | α    | τu  | n³/3 − n²/2 + n/6 | n³/3 − n²/2 + n/6 | n²/2 − n/2
LU sol.  | 2τu         | 2τu  | n   | n² − n            | n² − n            | n

Although the operation counts summarized in Table A.1 provide an exact means to determine the required computational effort for both sparse LU factorizations and solutions, they completely rely on the sparsity pattern of a specific perfect elimination matrix $M$. Therefore, a general way to, at least, predict the required computational effort of sparse operations is desirable. Bearing this need in mind, in the following subsection one way to predict the number of operations will be studied.

A.3.1 Expected Computational Complexity of Sparse LU Factorization

Suppose that the $n \times n$ matrix $M = \{m_{ij}\}$ is a perfect elimination matrix associated with the generic matrix $A$, which is assumed to be unsymmetric but topology-symmetric, and has no zeros in its main diagonal.
Moreover, each upper-triangular location is also assumed to have an equal probability of being non-zero, as given by (A.11).

$$p = P(m_{ij} \neq 0) = \frac{2\tau_u}{n^2 - n} \tag{A.11}$$

The fact that there are $n - i$ possible upper-diagonal elements in row $i$ of $M$, each with equal probability of being non-zero, given by (A.11), indicates that the number of non-zero elements $r_i$ in row $i$ above the diagonal of $M$ is a binomially distributed random variable. Therefore, by basic statistics, the expected value and variance of $r_i$ are given as follows.

$$E(r_i) = (n - i)\,p \tag{A.12}$$
$$\sigma^2(r_i) = (n - i)(1 - p)\,p \tag{A.13}$$

Now, recalling that the operation counts given in Table A.1 depend on the metrics $\tau_u$ and $\alpha$, defined in (A.9) and (A.10), respectively, the expected values of these metrics can also be obtained as follows.

$$E(\tau_u) = \sum_{i=1}^{n-1} E(r_i) = \tau_u \tag{A.14}$$

$$E(\alpha) = \sum_{i=1}^{n-1} E\left(r_i^2\right) = \sum_{i=1}^{n-1} \left[ \sigma^2(r_i) + E^2(r_i) \right] = \frac{4}{3}\,\frac{n-2}{n-1}\,\frac{\tau_u^2}{n} + \tau_u \tag{A.15}$$

Defining the new parameter $\rho$, given in (A.16), which represents the ratio of branches to buses[11], one can rewrite (A.14) and (A.15) as shown in (A.17) and (A.18), respectively.

$$\rho = \frac{\text{number of off-diagonal non-zeros}}{\text{number of nodes}} = \frac{\tau_u}{n} \tag{A.16}$$

which gives

$$E(\tau_u) = n\rho \tag{A.17}$$
$$E(\alpha) = \frac{4}{3}\,\frac{n-2}{n-1}\,n\rho^2 + n\rho \tag{A.18}$$

The choice of expressing the above expected values in terms of the branch-bus ratio $\rho$ instead of $\tau_u$ is due to the fact that buses in electric networks are usually connected to only a few other neighboring buses. Thus, expressing $\tau_u$ in terms of $\rho$ provides a measure of how interconnected the system under study is.

Combining Table A.1 and equations (A.17) and (A.18), expressions for predicting the floating-point operations required to perform sparse LU factorizations and backward and forward substitutions can be deduced, which are shown in Table A.2.
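To make the predictions concrete, the sketch below evaluates (A.17) and (A.18) and converts them into expected flop counts using the complex-arithmetic weights of Table A.3 (2 flops per addition, 6 per multiplication, 11 per division) applied to the per-operation counts of Table A.1. The function name and the branch-bus ratio used in the example call are assumed values for illustration, not figures from the thesis.

```python
# Expected flop counts for sparse complex LU, following (A.17)-(A.18)
# and the complex-arithmetic costs of Table A.3 (add = 2, mult = 6,
# div = 11 flops). Function name and the example rho are illustrative.
def expected_sparse_lu_flops(n, rho):
    """n: number of nodes; rho: branch-bus ratio of the perfect
    elimination matrix (off-diagonal non-zeros per node)."""
    e_tau = n * rho                                               # (A.17)
    e_alpha = 4.0 / 3.0 * (n - 2) / (n - 1) * n * rho**2 + e_tau  # (A.18)
    # Factorization: alpha adds (x2) + alpha mults (x6) + tau_u divs (x11)
    fact = 8 * e_alpha + 11 * e_tau
    # Forward/backward solution: 2*tau_u adds (x2) + 2*tau_u mults (x6) + n divs (x11)
    solve = 16 * e_tau + 11 * n
    return fact, solve

# Example: a 14,327-bus system with an assumed rho of 2.5 branches per bus.
fact, solve = expected_sparse_lu_flops(14327, 2.5)
```

These closed forms reproduce the sparse-case entries of Table A.2, namely $(\frac{32}{3}\frac{n-2}{n-1}\rho^2 + 19\rho)n$ flops for the factorization and $(16\rho + 11)n$ per repeated solution.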
Since only complex systems have been assumed so far, the number of floating-point operations needed for each basic complex operation, i.e., addition, subtraction, multiplication and division, used for obtaining Table A.2 is provided in Table A.3.

Notice that the extra column in Table A.2, related to large power systems, takes into consideration that the term (n−2)/(n−1) ≈ 1 when n ≫ 2, which leads to a simpler expression to represent E(α).

[11] Note that the branch-bus ratio ρ is associated with the perfect elimination matrix M and not the system matrix A. Therefore, the number of branches is always dependent on the ordering scheme adopted to reduce fill-ins in the original matrix A structure.

Table A.2. Floating-point operation count for sparse and dense LU factorization and solution for unsymmetric, but topologically symmetric, systems.

Task             | Sparse                          | Sparse (n ≫ 2)     | Dense
LU factorization | (32/3 · (n−2)/(n−1) ρ² + 19ρ) n | (32/3 ρ² + 19ρ) n  | 8n³/3 + 3n²/2 − 25n/6
LU solution      | (16ρ + 11) n                    | (16ρ + 11) n       | 8n² + 3n

Table A.3. Complex basic operations requirements in terms of floating-point count.

Complex Operation    | Mathematical Representation                            | Flop Count
addition/subtraction | (a + jb) ± (c + jd) = (a + c) ± j(b + d)               | 2 add/subtr. → 2 flops
multiplication       | (a + jb) × (c + jd) = (ac − bd) + j(bc + ad)           | 2 add., 4 mult. → 6 flops
division             | (a + jb)/(c + jd) = [(ac + bd) + j(bc − ad)]/(c² + d²) | 3 add., 6 mult., 2 div. → 11 flops

Appendix B

Transient Stability Solution Techniques

As extensively described in the literature (Dommel & Sato, 1972; Stott, 1979; Kundur, 1994), the power system transient stability model can be summarized by the set of non-linear differential-algebraic equations shown in (B.1).

$$\dot{x} = f(x, v) \tag{B.1a}$$
$$Y v = i(x, v) \tag{B.1b}$$

where $x$ represents a vector of dynamic variables (or state variables), whose first derivatives $\dot{x}$ normally depend on $x$ themselves and on the vector of nodal voltages $v$.
In addition, $i$ represents a vector function that defines the nodal current injections, which also depend on the state variables $x$ and the nodal voltages $v$. Lastly, $Y$ represents the complex-valued nodal admittance matrix of the system under study.

According to the literature (Dommel & Sato, 1972; Stott, 1979; Kundur, 1994), the many possible alternatives for solving (B.1) simultaneously in time are categorized in terms of:

(a) the way in which (B.1a) and (B.1b) are interfaced;
(b) the integration method, which can be explicit or implicit;
(c) the technique used for solving the algebraic equations, which may be linear or non-linear. For cases when (B.1b) is linear, sparsity-based direct solutions should be employed, whereas for non-linear cases either the Gauss-Seidel or the Newton-Raphson method should be used.

B.1 Problem Discretization

In this method, (B.1a) is first discretized according to some integration rule. Due to its inherent numerical stability, the trapezoidal rule is chosen (Dommel, 1996; Dommel & Sato, 1972). This procedure is shown in (B.2).

$$x(t) = x(t - \Delta t) + \int_{t-\Delta t}^{t} f\big(x(\xi), v(\xi)\big)\, d\xi \approx x(t - \Delta t) + \frac{\Delta t}{2} \Big[ f\big(x(t), v(t)\big) + f\big(x(t - \Delta t), v(t - \Delta t)\big) \Big] \tag{B.2}$$

Collecting the present and past values yields (B.3), where $x_h(t)$ represents the history term of the state vector $x$, which is known at time $t$.

$$x(t) = \frac{\Delta t}{2} f\big(x(t), v(t)\big) + x_h(t) \tag{B.3a}$$
$$x_h(t) = x(t - \Delta t) + \frac{\Delta t}{2} f\big(x(t - \Delta t), v(t - \Delta t)\big) \tag{B.3b}$$

There are many solution variations described in the literature that combine different integration methods and algebraic equation solutions. A number of these methods are summarized in (Stott, 1979). However, all variations still fall into two major categories: alternating (or partitioned) and simultaneous solution approaches.
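As a sanity check of the history-term formulation (B.3), the sketch below applies it to the scalar test equation ẋ = λx, an assumption made purely to keep the example self-contained: for this linear f, the implicit step (B.3a) can be solved in closed form, whereas a real multi-machine model requires the iterative schemes discussed next. The value of λ is illustrative only.

```python
# Trapezoidal discretization (B.2)-(B.3) applied to the scalar test
# equation dx/dt = lam * x (illustrative stand-in for f(x, v); a real
# transient stability model couples f to the network voltages v).
import math

lam = -2.0      # assumed test-system eigenvalue, for illustration only
dt = 0.01       # integration step [s]
x = 1.0         # initial condition x(0)
t = 0.0

for _ in range(100):                      # integrate up to t = 1 s
    xh = x + 0.5 * dt * lam * x           # history term (B.3b), known at t
    x = xh / (1.0 - 0.5 * dt * lam)       # closed-form solve of (B.3a)
    t += dt

# Second-order accuracy: the result tracks the analytic e^(lam*t) closely.
assert abs(x - math.exp(lam * t)) < 1e-3
```

The same history-term structure carries over unchanged to the vector case; only the solution of (B.3a) becomes iterative once f is non-linear.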
Since the alternating solution approach is described and discussed in Section 4.1, only the simultaneous solution will be discussed next.

B.2 Simultaneous Solution Approach

One of the most prominent methods pertaining to this class is undoubtedly the Newton-Raphson method. The method is often formulated by means of the residual vector functions $g\big(x(t), v(t)\big)$ and $h\big(x(t), v(t)\big)$, given in (B.4), which are obtained by combining the discretized differential equations (B.3) and the network algebraic equations (B.1b).

$$g\big(x(t), v(t)\big) = x(t) - \frac{\Delta t}{2} f\big(x(t), v(t)\big) - x_h(t) = 0 \tag{B.4a}$$
$$h\big(x(t), v(t)\big) = Y v(t) - i\big(x(t), v(t)\big) = 0 \tag{B.4b}$$

In the Newton-Raphson method, the iterates for $x(t)$ and $v(t)$ can be obtained using (B.5).

$$x^{k+1} = x^k + \Delta x^k \tag{B.5a}$$
$$v^{k+1} = v^k + \Delta v^k \tag{B.5b}$$

where the mismatches $\Delta x^k$ and $\Delta v^k$ are computed from the linear system (B.6).[12]

$$\begin{bmatrix} A_d & B_d \\ C_d & Y + Y_d \end{bmatrix} \begin{bmatrix} \Delta x^k \\ \Delta v^k \end{bmatrix} = - \begin{bmatrix} g\big(x^k, v^k\big) \\ h\big(x^k, v^k\big) \end{bmatrix} \tag{B.6}$$

and

$$A_d = \left.\frac{\partial g}{\partial x}\right|_k = U - \frac{\Delta t}{2} \frac{\partial f}{\partial x}\big(x^k, v^k\big) \qquad B_d = \left.\frac{\partial g}{\partial v}\right|_k = -\frac{\Delta t}{2} \frac{\partial f}{\partial v}\big(x^k, v^k\big)$$
$$C_d = \left.\frac{\partial h}{\partial x}\right|_k = -\frac{\partial i}{\partial x}\big(x^k, v^k\big) \qquad Y_d = -\frac{\partial i}{\partial v}\big(x^k, v^k\big)$$

Based on the fact that dynamic devices are interconnected through the network and not directly to each other, the state variables associated with one device do not depend on another's state variables. Therefore, $A_d$ is a block-diagonal matrix, denoted as follows for a power system with $m$ dynamic devices:

$$A_d = \begin{bmatrix} A_{d1} & 0 & \cdots & 0 \\ 0 & A_{d2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & A_{dm} \end{bmatrix}$$
Applying Gaussian elimination to the Jacobian matrix in (B.6) yields (B.7a) and (B.7b), which enable the solution of $\Delta x^k$ and $\Delta v^k$, respectively.

$$\Delta x^k = -A_d^{-1} \left[ g\big(x^k, v^k\big) + B_d\, \Delta v^k \right] \tag{B.7a}$$
$$\left( Y + Y_d - C_d A_d^{-1} B_d \right) \Delta v^k = -h\big(x^k, v^k\big) + C_d A_d^{-1} g\big(x^k, v^k\big) \tag{B.7b}$$

Once $\Delta x^k$ and $\Delta v^k$ are calculated, they can be used to find the new iterates $x^{k+1}$ and $v^{k+1}$ according to (B.5). In the sequence, $g\big(x^{k+1}, v^{k+1}\big)$ and $h\big(x^{k+1}, v^{k+1}\big)$ can be computed. Analogously to the partitioned method, the iterative process continues until the difference between two successive solutions of $x(t)$ and $v(t)$ is within a certain tolerance.

Although the full Newton-Raphson method is a powerful method for solving the transient stability non-linear equations, it is very computationally expensive, due to its dependency on the time-varying reduced Jacobian $J_R = Y + Y_d - C_d A_d^{-1} B_d$. Such dependency requires a new sparse matrix factorization at every single step, which slows down the overall solution by at least an order of magnitude. In order to overcome such a drawback, in industrial-grade transient stability programs the Jacobian matrix $J_R$ is assembled and factorized into its LU factors only at the beginning of a simulation and whenever topology changes occur in the system or convergence problems are faced. Under such conditions, the algorithm is usually referred to as the Very DisHonest Newton (VDHN) method.

[12] The matrix U represents an identity matrix.

B.3 Simultaneous versus Alternating Solution Approach

The just-discussed simultaneous solution approach can be compared with the alternating approach in various aspects, such as modeling flexibility, computational requirements and convergence characteristics.
These are discussed below:

• Computational requirements: Since the simultaneous approach is usually based on Newton-Raphson methods for the solution of non-linear systems of equations, it demands that the dynamic models be expressed in terms of real and imaginary parts. As a direct consequence, the problem, originally complex-valued and of order $n$, becomes a real-valued problem of order $2n$. Moreover, this newly formed real system of equations has four times the number of non-zeros of its complex-valued counterpart. During the factorization process (Appendix A), for instance, the amount of floating-point operations can therefore be expected to at least quadruple. Other disadvantages of the real formulation lie in the increased number of memory references and the higher memory consumption.

• Modeling flexibility: In terms of modeling, the real form used by the simultaneous solution yields greater flexibility than the complex form inherent to the alternating solution. In the complex form, the complex derivatives required by the linearization process often depend on complex conjugates of the same variable, which demands the inclusion of redundant equations in the problem. This disadvantage can be exemplified by a common constant-power load. In this case, the current $\bar I_L$ injected by the load $P_0 + jQ_0$ into the system relates to its voltage $\bar V_L$ according to (B.8). In order to include this model in a Newton-like solution procedure, one needs a relationship between the variations of current, $\Delta\bar I_L$, and of voltage, $\Delta\bar V_L$, which cannot be obtained from (B.8) alone. Since only (B.9) can be generated, the solution requires adding the equation associated with $\Delta\bar V_L^{*}$.

\[ \bar I_L = -\frac{P_0 - jQ_0}{\bar V_L^{*}} \tag{B.8} \]

\[ \Delta\bar I_L = \frac{P_0 - jQ_0}{\big(\bar V_L^{*}\big)^{2}}\, \Delta\bar V_L^{*} \tag{B.9} \]

• Convergence characteristics: Newton-like methods are known for their numerical robustness and quadratic convergence.
However, these properties hold entirely only when the Jacobian of the non-linear problem, expressed by the submatrices $A_d$, $B_d$, $C_d$ and $Y_d$ in (B.7), is updated at every iteration. In this case, a factorization of the problem is also required at every iteration, which, for very large systems, may considerably degrade the overall performance of the algorithm. Dishonest variants, like the VDHN method, which keep the Jacobian constant unless topological changes or convergence difficulties occur, are often employed as a tradeoff between the computational burden and the convergence strength of full Newton methods. Therefore, for very large systems, the alternating solution approach may become comparable with the simultaneous solution approach in terms of convergence.

Appendix C

Network Microbenchmarks

In order to optimize the performance of parallel applications, the underlying algorithms need to be translated into simpler tasks whose performance on a given computing system is known. In this manner, bottlenecks can be identified and the available resources properly allocated, aiming at improving the overall application.

Regarding communications in parallel computing systems, a few modeling approaches have been proposed in the literature. The LogP (Culler et al., 1996), PLogP (Kielmann et al., 2000) and LogGP (Alexandrov et al., 1995) models are examples of communication models for distributed computing systems. The basic foundations of such models lie in the knowledge of certain network parameters, such as the latency $L$, the overheads for sending, $o_s$, and receiving, $o_r$, and the inter-message gap $g$.

In this context, network microbenchmarks provide the means of systematically acquiring these network parameters. In turn, microbenchmarks play a paramount role in assessing the performance of the hardware and software that bridges applications and network interfaces (Culler et al., 1996).

C.1 Procedure

As the starting point towards the microbenchmark suggested by Kielmann et al.
(2000), the gap between zero-byte messages, $g(0)$, is first measured, following Algorithm 1, given below. In this algorithm, each message consists of a send and receive pair.

According to Culler et al. (1996), three stages can be observed during this procedure. For a small number of messages $M$, the issued sending requests receive no replies. Hence, the cost of each of these first requests equals the sending overhead $o_s$. As the number of messages $M$ increases, the sender starts receiving replies and the message cost increases, since the receiver spends $o_r$ on each reply. During this stage, the replies are separated by the time successive messages take to pass through the bandwidth bottleneck, i.e., by $g$. As $M$ increases further, the number of messages eventually causes the network capacity limit to be reached. Under such a condition, any sending request stalls until a reply is drained from the network. During this stage, the cost of each message is determined by the gap $g$.

Algorithm 1 Inter-message gap and sending overhead timing
    t_start ← start timer
    for k = 1 to M do
        if Sender then
            Send message
        else
            Receive message
        end if
    end for
    t_stop ← stop timer
    t ← t_stop − t_start
    t_issue ← t/M

The results of the measurement of $g(0)$ for the Gigabit Ethernet network available in UBC's Power Systems Engineering Laboratory are shown in Figure C.1.

Next, the round trip time (RTT) for a given message is measured, following Algorithm 2. Notice that both the request message $m_1$ and the reply message $m_2$ also consist of paired send and receive operations. Since, in the PLogP model, $RTT(0)$ is equivalent to $2\big(L + g(0)\big)$, one only needs to calculate the latency $L$.

As suggested by Kielmann et al. (2000), the receiving overhead $o_r$ can be measured by Algorithm 3. In this case, $o_r$ is measured by adding a controlled waiting time $\Delta$, such that receiving the $m$-byte message occurs immediately, without further delay.
Without this waiting time, extra idle time would be added to the measurement. Making $\Delta > RTT(m)$ is, therefore, usually conservative enough.

Algorithm 2 Round trip timing (RTT)
    t_start ← start timer
    for k = 1 to M do
        if Sender then
            Send message m1
            Receive message m2
        else
            Receive message m1
            Send message m2
        end if
    end for
    t_stop ← stop timer
    t ← t_stop − t_start
    RTT ← t/M

Algorithm 3 Receiving overhead o_r timing
    o_r ← 0
    for k = 1 to M do
        if Sender then
            Send zero-byte message
            Wait for time ∆ > RTT(m)
            t_start ← start timer
            Receive m-byte message
            t_stop ← stop timer
            o_r ← o_r + t_stop − t_start
        else
            Receive zero-byte message
            Send m-byte message
        end if
    end for
    if Sender then
        o_r ← o_r/M
    end if

[Figure C.1. Average issue time, yielded by Algorithm 1, for zero-byte messages communicated by MPI routines over a Gigabit Ethernet network. The plot shows the issue time in µs against the number of messages (0 to 160), with measured values $o_s(0) = 3.80\,\mu s$ and $g(0) = 4.76\,\mu s$.]
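Once Algorithms 1 and 2 have produced $g(0)$ and $RTT(0)$, the latency follows directly from the PLogP relation $RTT(0) = 2\big(L + g(0)\big)$. The short sketch below illustrates this extraction step; the $g(0)$ value is the one reported in Figure C.1, while the $RTT(0)$ value is hypothetical, standing in for a measurement from Algorithm 2.

```python
# Extracting the PLogP latency L from the quantities measured by
# Algorithms 1 and 2, using RTT(0) = 2 * (L + g(0)).
g0 = 4.76        # inter-message gap for zero-byte messages [us] (Figure C.1)
rtt0 = 60.0      # hypothetical round trip time RTT(0) [us], from Algorithm 2

latency = rtt0 / 2.0 - g0   # L = RTT(0)/2 - g(0)
print(f"L = {latency:.2f} us")   # -> L = 25.24 us
```

In the same spirit, the loop in Algorithm 3 averages the receiving overhead $o_r$ over $M$ repetitions, so all four PLogP parameters ($L$, $o_s$, $o_r$, $g$) become available from these three simple timing procedures.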

