Determining the complete spectrum for sparse symmetric matrices
Cavers, Ian Alfred, 1994


DETERMINING THE COMPLETE SPECTRUM FOR SPARSE SYMMETRIC MATRICES

By Ian A. Cavers
B.Sc., University of British Columbia
M.Sc. (Computer Science), University of British Columbia

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF COMPUTER SCIENCE

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
September 1994
(c) Ian A. Cavers, 1994

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Computer Science
The University of British Columbia
Vancouver, Canada

Abstract

This dissertation studies a restricted form of the fundamental algebraic eigenvalue problem. From the broad spectrum of eigenvalue problems and solution methods, it focuses upon sequential direct methods for determining moderately large subsets of eigenvalues or the complete spectrum of large sparse symmetric matrices. The thesis uses a combination of theoretical analysis and experimentation with symbolic and numeric implementations to develop generally applicable, reliable, efficient and accurate algorithms that are easily applied by novice and expert practitioners alike. This dissertation's approach is to reexamine eigenvalue methods based on the similarity reduction of matrices to tridiagonal form, developing algorithms that more fully exploit matrix sparsity.

Using specially developed sparse reduction tools, the thesis identifies the deficiencies and limitations of existing direct tridiagonalization methods, providing an improved understanding of the underlying fill characteristics of sparse reductions. The best previously published approach combines a bandwidth reducing preordering with Rutishauser and Schwarz's O(bn^2) band-preserving tridiagonalization algorithm. This approach places complete reliance upon the preordering to exploit sparsity, but it typically leaves the band of the matrix relatively sparse prior to reduction. The thesis presents several novel sparse reduction algorithms, including the hybrid tridiagonalization methods HYBBC and HYBSBC, that rearrange the elimination of nonzero entries to improve band sparsity utilization. HYBBC combines Bandwidth Contraction, a diagonally-oriented sparse reduction, with Rutishauser and Schwarz's column-oriented tridiagonalization. For a wide range of 70 practical sparse problems the new algorithm reduces CPU requirements by an average of 31%, with reductions as high as 63%. HYBSBC improves upon HYBBC's successful techniques by substituting the novel Split Bandwidth Contraction algorithm for Bandwidth Contraction. The Split Bandwidth Contraction algorithm takes additional advantage of band sparsity to significantly improve the efficiency of partial bandwidth contractions. In addition, HYBSBC employs the Δ-transition strategy to precisely regulate the transition between its two reduction stages, permitting tridiagonalization in as little as 1/5 the time of Rutishauser and Schwarz.
Finally, to demonstrate the relative efficiency of sparse tridiagonalization based eigenvalue methods, the thesis compares variants of the Lanczos algorithm to HYBSBC using theoretical analysis and experimentation with leading Lanczos codes.

Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Dedication

1 Introduction
  1.1 A General Introduction to Eigensolvers
  1.2 Thesis Statement
  1.3 Thesis Contributions
  1.4 Thesis Overview

2 Background Material and Sparse Tridiagonalization Tools
  2.1 Notation and Definitions
  2.2 Sparse Symmetric Test Problems
  2.3 Orthogonal Transformations and Sparse Tridiagonalization
  2.4 A Complexity Analysis Framework for Sparse Tridiagonalization Algorithms and Symbolic Reduction Tools
    2.4.1 Standard Givens Transformations
    2.4.2 Fast Givens Transformations
    2.4.3 An Enhanced Framework for Densely Banded Matrices
  2.5 Xmatrix
  2.6 Trisymb
  2.7 A Bipartite Graph Model for Sparse Tridiagonalization
    2.7.1 Bipartite Graphs and Sparse Matrices
    2.7.2 Modeling Givens Transformations
    2.7.3 Modeling Householder Transformations
    2.7.4 Concluding Remarks

3 Existing Algorithms and Their Extension for Sparse Tridiagonalization
  3.1 Sparse Model Problems
  3.2 A Naive Approach to Sparse Tridiagonalization
  3.3 Customized Sparse Givens Reductions
  3.4 The Tridiagonalization of Banded Matrices
    3.4.1 The Rutishauser-Schwarz Algorithm
    3.4.2 Lang's Algorithm
  3.5 Generalization of Band-Preserving Tridiagonalization Techniques
    3.5.1 Bandwidth Reducing Sparse Matrix Preorderings
    3.5.2 The Sparse Rutishauser-Schwarz Algorithm
    3.5.3 Sparse Band Extensions of Lang's Algorithm

4 Bandwidth Contraction Algorithms for Sparse Tridiagonalization
  4.1 Bandwidth Contraction
    4.1.1 Motivation
    4.1.2 The Sparse Bandwidth Contraction Algorithm
    4.1.3 A Demonstration of Bandwidth Contraction
  4.2 A Hybrid Tridiagonalization Algorithm
    4.2.1 Motivation
    4.2.2 The Hybrid Bandwidth Contraction Algorithm
    4.2.3 Performance of the Hybrid Tridiagonalization Algorithm
  4.3 Numerical Implementations of BC and HYBBC
    4.3.1 Implementation Basics
    4.3.2 Dense Versus Sparse Band Transformations
    4.3.3 A Densely Banded Data Structure
    4.3.4 Rescaling and Fast Givens Transformations
  4.4 Experimentation
    4.4.1 The Testing Environment
    4.4.2 Threshold Selection for HYBBC
    4.4.3 Numerical Testing Results
    4.4.4 Sparsity Structures, Preorderings and Bandwidth Contraction Performance

5 Split Bandwidth Contraction Algorithms for Sparse Symmetric Matrices
  5.1 The Split Bandwidth Contraction Algorithm
    5.1.1 Row and Column-Oriented Adjacent Transformations
    5.1.2 Bidirectional Elimination
    5.1.3 A Formal Presentation of SBC
    5.1.4 Split-Point Selection Strategies
      5.1.4.1 Minimum Displacement Split-Point Selection
      5.1.4.2 Unidirectional Split-Point Selection
      5.1.4.3 Damped Split-Point Selection
      5.1.4.4 Induced Split-Point Selection
      5.1.4.5 The Final Form of the Split-Point Selection Strategy
    5.1.5 Performance of the Split Bandwidth Contraction Algorithm
  5.2 The Hybrid Split Bandwidth Contraction Algorithm
    5.2.1 An Algorithmic Framework for HYBSBC
    5.2.2 The Δ-Transition Strategy
      5.2.2.1 A Theoretical Basis
      5.2.2.2 A Practical Δ-Transition Strategy
      5.2.2.3 Evaluation of the Δ-Transition Strategy
  5.3 Numerical Implementation of SBC and HYBSBC
    5.3.1 Implementation Basics
    5.3.2 Rescaling Techniques for SBC and HYBSBC
  5.4 Experimentation with Numerical Implementations
    5.4.1 SBC Numerical Testing Results
    5.4.2 HYBSBC Numerical Testing Results
    5.4.3 Δ-Transition Optimality

6 A Comparison of the Lanczos Algorithm with Direct Sparse Tridiagonalization
  6.1 Lanczos Basics
  6.2 Lanczos With Full Reorthogonalization
  6.3 Selective or Partial Reorthogonalization Techniques
  6.4 Lanczos With No Reorthogonalization
  6.5 Block Shift and Invert Lanczos
  6.6 Experimentation With Lanczos Codes
    6.6.1 Harwell's EA15
    6.6.2 BCS's Block Shift and Invert Lanczos

7 Conclusions and Future Research
  7.1 Summary and Conclusions
  7.2 Future Research

Bibliography

List of Tables

2.1 A Test Suite of Sparse Symmetric Matrices
3.1 Tridiagonalization Costs of Givens Reduction for a Densely Banded Matrix
3.2 Tridiagonalization Costs of Givens Reduction for a Dense Matrix
3.3 Fill Characteristics of Givens Reduction
3.4 Tridiagonalization Costs of the Rutishauser-Schwarz Algorithm for a Densely Banded Matrix
3.5 Comparing GPS and RCM Preorderings for the Harwell-Boeing Test Suite
3.6 Band Filling Characteristics of Sparse R-S
4.1 Tridiagonalization Costs of the Bandwidth Contraction Algorithm for a Densely Banded Matrix
4.2 Tridiagonalization Summary for a Small Sparse Example
4.3 A Comparison of Two Symbolic Implementations of HYBBC using Dense or Sparse Band Fast Givens Transformations
4.4 Selected Tridiagonalization Summaries
4.5 DWT 361 Tridiagonalization Summary
4.6 Sparse Tridiagonalization, GPS versus Nested Dissection
4.7 The Effects of a Poor Preordering on DWT 878
5.1 The Cost of SBC Eliminating the Outermost Nonzero Diagonal of a Banded Model Problem
5.2 The Cost of BC Eliminating the Outermost Nonzero Diagonal of a Densely Banded Problem
5.3 Test Problems for Split-Point Selection Experimentation
5.4 Tridiagonalization Summary
5.5 Selected Partial BC and SBC Contraction Summaries
5.6 (n, b', m) Evaluation Cost Summary
5.7 Sparse Problems with b_GPS > n/3
5.8 Sparse Problems with Widely Varying No-split and Δ-Transition Bandwidths
5.9 Selected Partial BC and SBC Reduction Timing Summaries
5.10 Selected HYBSBC, BANDR and HYBBC Tridiagonalization Summaries
6.1 The Computational Cost of Complete Spectrum Determination with HYBSBC+TQLRAT and EA15D (Full Accuracy)
6.2 The Computational Cost of Complete Spectrum Determination with HYBSBC+TQLRAT and EA15D (Reduced Accuracy Setting)
6.3 HYBSBC+TQLRAT Summary for PLAT362

List of Figures

2.1 Sparsity Structure Unioning by a Givens Rotation
2.2 A Transformation Length Example
2.3 The Xmatrix Interface
2.4 A Sparse Matrix and its Bipartite Graph
2.5 Updating B_A to Reflect G(i,j,θ)^T A
2.6 A_{4,1} is Eliminated by G(3,4,θ_1)^T
2.7 Completing the Nontrivial Transformation G(3,4,θ_1)^T A G(3,4,θ_1)
2.8 Updating B_A to Reflect a Trivial Transformation Exchanging Rows and Columns i and j
2.9 Completing the Reduction of Column and Row 1 to Tridiagonal Form with a Row and Column Exchange
2.10 Reducing Column 1 to Tridiagonal Form With a Householder Transformation
2.11 Completing the Reduction of Column and Row 1 with a Householder Similarity Transformation
3.1 The 5-Point Model Problem
3.2 Givens Reduction
3.3 The Rutishauser-Schwarz Algorithm
3.4 Bulge Chasing to Preserve Bandwidth
3.5 The Bulge Chasing Path
3.6 Lang's Algorithm Produces Triangular Bulges
3.7 Chasing One Column of a Triangular Bulge
3.8 A_{4,1}'s Truncated Bulge Chasing Path
3.9 A Partial Tridiagonalization, Sparse R-S versus Lang's Algorithm
4.1 Matrix Bandwidth and Spike Length
4.2 The Sparsity Structure of CAN 268 with a GPS Preordering
4.3 The Sparse Bandwidth Contraction Algorithm
4.4 A Nonadjacent Transformation Example
4.5 A Partial Tridiagonalization, Sparse R-S versus Bandwidth Contraction
4.6 Cascading Fill Entries
4.7 The Flop and Transformation Requirements of BC relative to R-S for a Densely Banded Matrix, n=1000
4.8 The Hybrid Bandwidth Contraction Tridiagonalization Algorithm
4.9 The Variance of PLAT362's Tridiagonalization Cost with Transition Bandwidth
4.10 The Variance of Tridiagonalization Cost with Transition Bandwidth for a 5-Point Problem, b = 30
4.11 The Distribution of HYBBC's Improved Reduction Performance for Problems with n > 400
4.12 Examples of Sparsity Structures Affecting Bandwidth Contraction Performance
5.1 Row and Column-Oriented Transformations
5.2 Bidirectional Elimination
5.3 Cyclic Dependencies of a Dense Diagonal
5.4 Split-Points and Independent Transformations
5.5 The Split Bandwidth Contraction Algorithm
5.6 The Flop and Transformation Requirements of SBC relative to BC for n=1000 and b=100
5.7 Reduction Performance and Split-Point Selection
5.8 Desirable DSBC Reduction Characteristics
5.9 Shrinking the Central Nonzero Block
5.10 Shifting a Split-Point with a Band Elimination
5.11 The Normal Reduction Shift of a Lower Triangular Split-Point
5.12 Multiple Split-Point Shifts
5.13 A Partial Bandwidth Contraction
5.14 The Sparsity Structure of LSHP 265 with a GPS Preordering
5.15 The Sparsity Structure of BCSSTK09 with a GPS Preordering
5.16 The Tridiagonalization Flop Requirements of SBC Relative to R-S for a Densely Banded Model Problem, n=1000
5.17 Split-Point Displacements of SBC's Partial Contraction of BCSPWR08
5.18 The Hybrid Split Bandwidth Contraction Tridiagonalization Algorithm
5.19 The Cost of Eliminating the Outermost Nonzero Diagonal of a Densely Banded Matrix (n=1000) Using a Split-Point of Displacement 100
5.20 R-S Tridiagonalization Costs of Densely Banded Matrices, n=1000
5.21 The Δ-Transition Function for 1 ≤ m ≤ b_c
5.22 The Δ-Transition Function for m > b_c
5.23 Variance of (n, b_c, m) with Split-Point Positioning, n = 1000, b_c = 10, 50, 150, and 300
5.24 HYBSBC's Δ-Transition Strategy
5.25 Distribution of the Δ-Transition Strategy's Flop Reductions
5.26 Variation in Tridiagonalization Flop Counts with the Transition Bandwidth Offset from the Δ-Transition Bandwidth, b_t
5.27 The Distribution of HYBSBC's Improved Reduction Performance Relative to BANDR for Problems with n > 400
5.28 Variation in Tridiagonalization Time with the Transition Bandwidth Offset from the Δ-Transition Bandwidth
5.29 Tridiagonalization Flop Counts for CAN 634
5.30 FORTRAN Tridiagonalization Timings for CAN 634 on a SPARCstation 2 with a 64 KByte External Cache
5.31 C Tridiagonalization Timings for CAN 634 on a SPARCstation 2 with a 64 KByte External Cache
5.32 C Tridiagonalization Timings for CAN 634 on a 50 MHz i486 with an 8 KByte On-Chip Cache and a 256 KByte External Cache
5.33 C Tridiagonalization Timings for CAN 634 on a 50 MHz i486 with Caching Disabled
6.1 CPU Requirements of EA15D Relative to HYBSBC+TQLRAT
6.2 The CPU Requirements of BCSLIB-EXT Lanczos, Relative to HYBSBC+TQLRAT, Extracting Subsets of Eigenvalues from Both the Left and Right Ends of a Problem's Spectrum

Acknowledgements

First and foremost, I would like to thank my supervisor, Jim Varah, for believing in and encouraging a young general sciences student so many years ago. Without your guidance, patience, friendship, and gentle prodding my graduate school years would not have been so memorable or as successful. Many thanks Jim.

I would also like to thank the other members of my supervisory committee: Uri Ascher, Maria Klawe, and Manfred Trummer for their guidance and support. Thanks as well go to Robert Schreiber, David Kirkpatrick, Brian Wetton and Larry Roberts for their participation on the examination committee.

Many thanks to Roger Grimes and John Lewis from Boeing Computer Services for providing me access to the BCS shift and invert Lanczos code. I would also like to express my appreciation to John for his interest in my research and his many valuable suggestions.

The many friends I have made during graduate school have helped to turn the endless days of hard work into a wonderful experience. To Murr, Don, Terry, Barry, Rick, Felix, George(s), Mike(s), Peter(s), Andrew, Andy, Pierre, John, Swamy, Dave(s), Alex, the CS and CICSR office staff, Karen, Frank, Ming, Glen, Carlin, Moyra, Luping, and their spouses, and the many, many others whom I've enjoyed getting to know, thanks for your friendship and help. Long live "Bald Guy"!

Outside the university I have a large extended family whose unfailing faith, love, and support has meant everything to me during this endeavor. A special thanks to my Mum (Gerry) for proofreading every single word of this dissertation. Thanks also go to Aunt Hope, to my brother and sisters (Drum, Joan and Shelagh), their spouses (Pat, Al and Mike), my in-laws (George, Zora, Irene and Ken), and my wonderful gaggle of nieces and nephews (Lindsay, Graham, Andy, Lauren, Gordon, Robert and Matthew). And to the other members of my extended family: the Harrisons, the Cavers, the Smith Clan, the Shearer-Mealings (beat ya Ayden), the Sperlings, the Andrews, Jim, Jon, ..., many thanks.

Finally, I have saved the most important thank-you for last. I would like to thank my wife Diana for her endless support, patience, understanding and love.
Start the e.b.o. Jenna, your mum is coming home.

This dissertation is dedicated to the memory of Dr. Stuart Donald Cavers, professor, teacher, friend and Dad.

Chapter 1

Introduction

This thesis studies a restricted form of the fundamental algebraic eigenvalue problem. Given an n x n square matrix A, values of the scalar λ, called eigenvalues, are sought for which the following set of homogeneous linear equations has a nontrivial solution.

    Ax = λx    (1.1)

The nontrivial solution, x, corresponding to a particular λ is referred to as its eigenvector.

Eigenvalue problems arise in many different areas of science and engineering. Applications range from more traditional problems found in structural analysis, quantum physics and chemistry, to problems in molecular dynamics, the analysis of electrical power systems, and the modeling of tides and currents in oceanographic research. Numerical methods for these eigenvalue problems are desired which are reliable, appropriately accurate, exhibit efficient execution and require low levels of storage. Despite the simple formulation of the basic eigenvalue problem, no single algorithm can provide an optimal balance of these goals for all applications. Instead, there is a wide range of different numerical eigenvalue methods. Each approach tries to exploit particular matrix properties, differing solution requirements or special computing environment characteristics.

Important matrix properties influencing algorithm design include symmetry, real or complex entries, the fraction of entries which are zero, and the presence of special sparsity structures. Among the many solution requirements affecting algorithm design and selection, answers to the following questions are fundamental.

- Are all eigenvalues required?
- Are a few of the largest or smallest eigenvalues needed?
- Is the identification of eigenvalue multiplicity required?
- Are all eigenvalues in a specified region required?
- Are eigenvectors needed?

Many characteristics of the computing environment also affect algorithm design and selection. Is the algorithm targeted for a sequential or parallel architecture? Do the processing units have scalar or vector capabilities? Finally, algorithm design is also affected by the alternatives of a shared or distributed memory architecture, and cache memory implementations.

From this broad spectrum of eigenvalue problems and solution methods, this thesis focuses upon sequential direct methods for determining moderately large subsets of eigenvalues or the complete spectrum, including eigenvalue multiplicities, of large sparse symmetric matrices. Other than symmetry, no special assumptions are made of a matrix's sparsity pattern. Although sequential eigenvalue calculations are the primary objective of this thesis, when appropriate we comment upon parallel implementation and eigenvector computations.

A precise definition of the term sparse matrix is difficult to formulate. For the purposes of this thesis, however, a matrix is considered sparse if few of its entries are nonzero. Acceptable ranges of nonzero density could be proposed, but J. H. Wilkinson [GMS92] suggested that a matrix is sparse if "it has enough zeros that it pays to take advantage of them". Similarly, classifying the size of a sparse matrix problem is also difficult. Typically, the size of a large sparse problem is taken to be of order 1000 or greater, but matrices of several hundred rows and columns may be included if the density of nonzero entries is sufficiently high. Although the thesis only considers matrices with real valued entries, our sparse matrix techniques are easily generalized to complex Hermitian matrices.
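Equation (1.1) for a small dense symmetric matrix can be checked directly with a standard library eigensolver. The following minimal sketch is not part of the thesis; it uses an arbitrary 3 x 3 example and verifies the defining relation for each computed eigenpair.

```python
import numpy as np

# A small symmetric test matrix (illustrative values only).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 2.0],
              [0.0, 2.0, 5.0]])

# eigh exploits symmetry and returns the eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(A)

# Verify A x = lambda x for every eigenpair.
for lam, x in zip(eigvals, eigvecs.T):
    residual = np.linalg.norm(A @ x - lam * x)
    print(f"lambda = {lam: .6f}, residual = {residual:.2e}")
```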
1.1 A General Introduction to Eigensolvers

As previously outlined, there are a wide range of eigenvalue methods. Each algorithm is designed to exploit particular matrix characteristics, solution requirements, or properties of the computing environment. In this section we provide a general overview of eigenvalue methods for symmetric matrices, identifying the existing algorithms best suited to the solution of our particular sparse eigenvalue problem.

In general, eigenvalue methods for dense symmetric matrices can be grouped into two broad classes of algorithms. Members of the first group are applied directly to the original matrix, while algorithms in the second group work with an equivalent intermediate matrix constructed from the original problem.

An example of the first algorithm class is the Jacobi algorithm [Wil65, GV89]. Historically, the Jacobi algorithm was the method of choice for determining all the eigenvalues of dense matrices, until superseded by more modern algorithms. (Recently, there has been renewed interest in variants of the algorithm that are inherently parallel.) The Jacobi algorithm applies sweeps of orthogonal similarity transformations, constructed from plane rotations, that systematically reduce the norm of a matrix's off-diagonal entries. Typically, the Jacobi algorithm requires O(n^3) flops to diagonalize a dense matrix.

(Footnotes: g(n) = O(f(n)) means that g(n) ≤ Const·f(n) for sufficiently large n. Following [GV89], a flop is defined to be any floating point arithmetic operation.)

When large subsets or all the eigenvalues of a dense symmetric matrix are desired the dominant method is the QR (or QL) algorithm. (See [GV89] or [Par80] for a list of references.) Although the QR algorithm could be applied directly to the original dense matrix, a costly QR factorization is required by each iteration of the algorithm. Consequently, the QR algorithm is in the second class of eigenvalue methods that first reduce a matrix to a simpler canonical form with the same eigenvalues. Typically, this reduction is effected by Householder or Givens reductions, which use sequences of orthogonal similarity transformations to reduce a matrix to tridiagonal form in O(n^3) flops. The QR algorithm requires O(n^2) flops to determine the complete spectrum of a tridiagonal matrix.

Many other eigenvalue methods also use a two stage approach to avoid working with the original matrix directly. For example, the Bisection and Multisectioning algorithms [Wil65, GV89] isolate eigenvalues of a symmetric tridiagonal matrix using Sturm sequencing. Sequentially, the Bisection algorithm is typically used to find a few eigenvalues in a specified region, but it can isolate all eigenvalues of a tridiagonal matrix in O(n^2) flops. In addition, successful variants of both algorithms have been implemented on distributed memory multiprocessors. (See [Jes89] for example.)

Another eigenvalue method for tridiagonal matrices is Cuppen's [Cup81] divide and conquer algorithm. This approach recursively splits a tridiagonal matrix into smaller and smaller tridiagonal subproblems using rank-1 tearings. The recursive tearing process stops when subproblems are on the order of a 2 x 2 matrix. The eigensystem of the original problem is reconstructed by gluing the eigensystems of the lowest level, and intermediate subproblems, back together using rank-1 modification techniques [Gol73, BNS78]. This divide and conquer approach requires O(n^2) flops to sequentially isolate all eigenvalues. Several researchers have developed parallel versions of Cuppen's algorithm. For example, Dongarra and Sorensen [DS87] successfully implemented Cuppen's algorithm on a shared memory multiprocessor.
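The two stage strategy described above, first reduce to tridiagonal form by orthogonal similarity transformations and then apply a fast tridiagonal eigensolver, can be sketched with standard SciPy routines. This is a hedged illustration of the idea only, not the thesis's implementation; for symmetric input, scipy.linalg.hessenberg returns the tridiagonal form.

```python
import numpy as np
from scipy.linalg import hessenberg, eigvalsh_tridiagonal, eigvalsh

rng = np.random.default_rng(0)
B = rng.standard_normal((200, 200))
A = (B + B.T) / 2.0                         # dense symmetric test matrix

# Stage 1: orthogonal similarity reduction; for symmetric A the Hessenberg
# form is tridiagonal (up to roundoff).
T = hessenberg(A)
d = np.diag(T)                              # main diagonal of T
e = np.diag(T, -1)                          # sub-diagonal of T

# Stage 2: tridiagonal eigensolver.
lam_two_stage = eigvalsh_tridiagonal(d, e)

# Agreement with a direct dense eigensolver.
print(np.max(np.abs(lam_two_stage - eigvalsh(A))))
```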
The four eigenvalue methods mentioned previously permit the efficient determination of a dense symmetric matrix's complete spectrum. Unfortunately, these approaches are unable to meaningfully exploit matrix sparsity. If the Jacobi algorithm or Givens or Householder reductions are applied to a sparse matrix, typically the orthogonal transformations used by these algorithms produce overwhelming numbers of fill entries (new nonzeros) that quickly destroy sparsity. In fact, for most sparse problems almost all zero entries experience fill during the course of the diagonalization or tridiagonalization. As a result, these approaches require O(n^3) flops and O(n^2) storage, and sparse matrices might as well be treated as dense.

For several decades the Lanczos algorithm [Par80, GV89] has attracted researchers interested in large sparse eigenvalue problems. This iterative algorithm performs an implicit partial tridiagonalization of a sparse matrix using a three term recurrence relation to directly compute entries of the tridiagonal matrix. Lanczos leaves the original sparse matrix intact, and for the simple Lanczos iteration the core of the computation is a matrix-vector product, which permits exploitation of matrix sparsity. When the iteration terminates, the Ritz values of the partial tridiagonal matrix approximate a subset of the original matrix's eigenvalues. Unfortunately, the loss of orthogonality amongst Lanczos vectors complicates the algorithm's application. Although variants of the Lanczos algorithm have been considered that attempt the computation of the complete spectrum of a sparse symmetric matrix (see [ELT79], [CW79], and [PR81] for example), in practice Lanczos is best suited to finding small subsets of a sparse matrix's spectrum. As shown in Chapter 6, it appears that none of the current Lanczos variants can reliably and economically find the complete spectrum of most large symmetric sparse matrices, or even moderately large subsets of their eigenvalues.
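The simple Lanczos iteration referred to above reduces to a short loop around a sparse matrix-vector product. The sketch below is a plain three term recurrence with no reorthogonalization (so the computed Ritz values eventually suffer from the loss of orthogonality mentioned in the text); it is illustrative only and is not one of the Lanczos codes examined in Chapter 6.

```python
import numpy as np
import scipy.sparse as sp

def lanczos(A, k, seed=0):
    """k steps of the simple Lanczos three term recurrence on symmetric A.
    Returns the diagonal alpha and off-diagonal beta of the k x k tridiagonal T_k."""
    n = A.shape[0]
    rng = np.random.default_rng(seed)
    q = rng.standard_normal(n)
    q /= np.linalg.norm(q)
    q_prev = np.zeros(n)
    alpha, beta = np.zeros(k), np.zeros(k - 1)
    beta_prev = 0.0
    for j in range(k):
        w = A @ q - beta_prev * q_prev      # sparse matrix-vector product
        alpha[j] = q @ w
        w -= alpha[j] * q                   # three term recurrence
        if j < k - 1:
            beta_prev = np.linalg.norm(w)
            beta[j] = beta_prev
            q_prev, q = q, w / beta_prev
    return alpha, beta

# Ritz values of the small tridiagonal T_k approximate eigenvalues of A.
A = sp.random(500, 500, density=0.01, random_state=1)
A = (A + A.T).tocsr()                       # symmetric sparse test matrix
alpha, beta = lanczos(A, 60)
T_k = np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)
print(np.linalg.eigvalsh(T_k)[-5:])         # estimates of the largest eigenvalues
```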
We complete our brief overview of symmetric eigenvalue methods with direct methods designed to take advantage of sparse matrices in banded form. The first algorithm [MRW71] is a variant of the QR algorithm intended for direct application to symmetric banded matrices. This method is implemented as routine BQR of EISPACK [GBDM77]. Because the QR iteration is band-preserving, this approach avoids the extreme fill difficulties of previous methods. Although BQR is intended for finding only a few eigenvalues, all eigenvalues of a bandwidth b matrix can be isolated using O(b^2 n) flops. When b is sufficiently small relative to n, the banded QR approach is theoretically faster than using an O(n^3) flop direct tridiagonalization and an O(n^2) tridiagonal eigenvalue method. The next algorithm we consider, however, dramatically improves the tridiagonalization of banded matrices.

(Footnote: Bandwidth, or semi-bandwidth, is defined as b = max |i - j| such that A_{i,j} ≠ 0.)

As previously mentioned, Givens or Householder reductions create many new nonzeros, quickly filling the entries outside the band of the unreduced portion of the matrix. Alternatively, the band-preserving algorithm of Rutishauser [Rut63] and Schwarz [Sch71] tridiagonalizes a symmetric banded matrix with O(bn^2) flops and O(bn) storage. The algorithm uses Givens transformations to zero band entries column by column. Between the elimination of each band nonzero, fill entries created outside the band are chased off the end of the matrix with bulge chasing transformations. The Rutishauser-Schwarz algorithm is implemented as EISPACK's BANDR routine [GBDM77]. LAPACK's [ABB92] version of this routine, SSBTRD, is based on Kaufman's [Kau84] vectorized version of the Rutishauser-Schwarz algorithm. Recent work by Lang [Lan92] and Bischof and Sun [BS92] presents parallel algorithms, closely related to Rutishauser-Schwarz, that investigate the use of multiple elimination and delayed bulge chasing techniques.

These algorithms are primarily designed for the tridiagonalization of densely banded matrices. There are many well-established heuristic algorithms [Cut72, GPS76a, CCDG82, Lew82], however, for identifying bandwidth reducing preorderings of general symmetric sparse matrices. (These preorderings are formulated as similarity transformations, avoiding changes to a matrix's eigenvalues.) Combining such a preordering scheme with the Rutishauser-Schwarz tridiagonalization and an O(n^2) tridiagonal eigenvalue method is the best published sequential direct method for finding the complete spectrum of a symmetric sparse matrix. The popular numerical software system MATLAB [Mat92] includes this approach for the solution of sparse eigenvalue problems. Although this scheme takes some advantage of zero entries, for many problems sparsity exploitation is limited. This method is almost completely dependent upon the bandwidth reduction algorithm to take maximum advantage of sparsity, but typically the selected preordering leaves the band of the permuted matrix relatively sparse. Unfortunately, the Rutishauser-Schwarz algorithm rapidly fills the sparse band and further opportunity to exploit band sparsity is lost. If matrix sparsity could be more fully exploited, the potential exists for dramatic improvements in the efficiency of sparse tridiagonalization techniques.
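The preorder-then-reduce strategy just described can be approximated with library routines: a reverse Cuthill-McKee ordering as the bandwidth reducing preordering, followed by a banded eigensolver that internally tridiagonalizes the band. This is a hedged sketch on a synthetic random pattern, not the thesis's code or MATLAB's implementation.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee
from scipy.linalg import eig_banded

def semi_bandwidth(S):
    r, c = S.nonzero()
    return int(np.max(np.abs(r - c)))

A = sp.random(300, 300, density=0.02, random_state=2, format="csr")
A = A + A.T                                    # symmetric sparse test matrix

perm = reverse_cuthill_mckee(A, symmetric_mode=True)
P = A[perm][:, perm]                           # symmetric permutation (preordering)
print(semi_bandwidth(A), "->", semi_bandwidth(P))

# Store the permuted band in lower banded layout and hand it to a banded
# eigensolver, which reduces the band to tridiagonal form internally.
b = semi_bandwidth(P)
Pd = P.toarray()
ab = np.zeros((b + 1, P.shape[0]))
for k in range(b + 1):
    ab[k, :P.shape[0] - k] = np.diag(Pd, -k)
eigenvalues = eig_banded(ab, lower=True, eigvals_only=True)
print(eigenvalues[:5])
```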
1.2 Thesis Statement

Existing algorithms for determining the complete spectrum of sparse matrices, or moderately large subsets of their eigenvalues, are well suited for densely banded problems, but are sub-optimal for symmetric matrices with more irregular sparsity structures. This dissertation develops sequential algorithms that more fully exploit the sparsity of symmetric matrices while determining their eigenvalues. The approach selected by this research is to reexamine the reduction of symmetric matrices to tridiagonal form, developing novel sparse tridiagonalization algorithms. The class of algorithms studied has the following generic form,

    A_0 := A
    FOR i := 1, 2, ..., k
        A_i := Q_i^T A_{i-1} Q_i

in which A is systematically reduced to tridiagonal form (A_k = T) using a sequence of k carefully selected orthogonal similarity transformations Q_i^T A_{i-1} Q_i.

The following points outline this dissertation's approach to the design and development of new sparse tridiagonalization techniques.

- Representative sparse model problems conducive to formal analysis are not available for many practical sparse problems. Consequently, we support algorithm development using a combination of experimental and theoretical complexity analysis.
- We develop symbolic tridiagonalization tools to facilitate the use of experimental evidence as a research tool.
- We characterize the underlying fill properties of sparse tridiagonalization and identify the difficulties and limitations associated with existing direct tridiagonalization methods extended for use with sparse matrices.
- Using formal complexity analysis, experimental results and insight gained from the analysis of existing algorithms, we design and develop novel tridiagonalization algorithms which take better advantage of matrix sparsity.

The novel algorithms developed in this dissertation are assessed using both formal analysis and experimental comparison to existing direct methods. Given the shortage of representative model problems, however, we emphasize experimentation with symbolic and numeric implementations of the new algorithms. An important goal of this work is to develop sparse tridiagonalization techniques that are generally applicable. Consequently, all experiments are conducted using a large test suite of sparse problems drawn from a broad range of applications.

To benchmark the success of our sparse tridiagonalization schemes relative to other eigenvalue approaches, we investigate the ability of Lanczos type algorithms to solve our sparse eigenvalue problem. Once again, comparisons with our sparse tridiagonalization techniques employ both theoretical and experimental analysis.

1.3 Thesis Contributions

The primary contributions of this thesis's research are:

- Development of a framework for the complexity analysis of algorithms employing sequences of Givens transformations.
- An improved understanding of the fill properties associated with the reduction of sparse symmetric matrices to tridiagonal form.
- An enhanced formal and experimental analysis of existing algorithms extended for the tridiagonalization of general sparse symmetric matrices.
- Development of the symbolic sparse reduction tools Xmatrix and Trisymb and a graph-theoretic model for symmetric sparse tridiagonalization and partial bandwidth contraction.
- Development of the Bandwidth Contraction algorithm, demonstrating that sparsity exploitation can be improved by rearranging the elimination sequence of nonzeros.
- Development of the Hybrid Bandwidth Contraction algorithm, which combines sparsely and densely banded tridiagonalization algorithms to improve efficiency.
- Development of the Split Bandwidth Contraction algorithm to reduce bulge chasing transformation costs and increase the efficiency of partial bandwidth contractions.
- Development of transition strategies, based on formal analysis, leading to the Hybrid Split Bandwidth Contraction algorithm.
- Extensive experimental analysis demonstrating the efficiency of the new sparse tridiagonalization algorithms relative to existing direct methods.
- A theoretical and experimental comparison of sparse tridiagonalization based eigensolvers to variants of the Lanczos algorithm, demonstrating the efficiency of the novel sparse tridiagonalization algorithms.

1.4 Thesis Overview

The remainder of the thesis is organized in the following manner.

Chapter 2 provides an overview of relevant background material and introduces a number of novel sparse tridiagonalization tools created to support the development and evaluation of sparse reduction algorithms. The chapter begins with general definitions and notations, followed by a brief description of a large test suite of sparse symmetric problems, and a review of Givens and Householder transformations and their fill properties. It then introduces a general analysis framework for algorithms employing sequences of Givens similarity transformations. The following two sections of the chapter describe the newly developed symbolic reduction tools Xmatrix and Trisymb that manipulate sparsity structures to predict the computational requirements of sparse reductions. Finally, we introduce a graph-theoretic model for the application of orthogonal transformations to sparse symmetric matrices.

Using these sparse reduction tools, Chapter 3 explores existing direct tridiagonalization approaches, analyzing their limitations and their potential extension to improve sparsity exploitation. Employing two model problems and practical sparse matrices, we first characterize the fill properties of both standard and customized Givens Reduction algorithms. As alternatives to these algorithms, we investigate the Rutishauser-Schwarz and Lang band-preserving tridiagonalization methods for banded matrices, which restrict the accumulation of fill entries to improve reduction efficiency. Chapter 3 concludes with the development of algorithms generalizing these band-preserving techniques for the tridiagonalization of general sparse symmetric matrices, and analysis of the relative merits of the resulting sparse reduction methods.

In response to the limitations of the previous sparse reduction algorithms, Chapter 4 develops alternative approaches to sparse tridiagonalization that also use bandwidth reducing preorderings and band-preserving reduction techniques, but rearrange the elimination of nonzero entries to more fully exploit internal band sparsity. Chapter 4 first describes the development of the Bandwidth Contraction algorithm, or BC, whose diagonally-oriented reduction allows it to significantly reduce the tridiagonalization costs of many practical sparse problems. Employing theoretical analyses of the Rutishauser-Schwarz and Bandwidth Contraction algorithms, we then guide the development of the effective hybrid tridiagonalization algorithm HYBBC, which exploits the best characteristics of both algorithms to form a versatile and efficient two stage sparse reduction process. Following a discussion of key aspects of the numerical implementation of BC and HYBBC, extensive experimentation with a large test suite of practical sparse problems is used to evaluate the success of both sparse tridiagonalization algorithms.
The chapter's final section investigates the relationship between sparsity structures, preorderings and the performance of the Bandwidth Contraction algorithm using additional experiments with practical sparse problems.

The algorithms developed in Chapter 4 demonstrate the ability of diagonally-oriented reductions to take advantage of band sparsity. Chapter 5 expands upon these successful techniques, developing second generation sparse algorithms for bandwidth contraction and tridiagonalization. The chapter begins with the development of the Split Bandwidth Contraction algorithm, or SBC, which employs bidirectional elimination techniques to enhance band sparsity exploitation and potentially halve the computational requirements of each diagonal's reduction. To evaluate the efficiency of performing partial bandwidth contractions with SBC, we provide an extensive experimental comparison of BC and SBC using both symbolic and numeric implementations.
The following two sections introduce Xmatrixand Trisymb: symbolic reduction tools for small sparse and sparsely banded symmetric matrices respectively. Both tools manipulate sparsity structures to predict the computationalrequirements of sparse reductions. Finally, Section 2.7 introduces a graph-theoretic modelfor the application of orthogonal transformations to symmetric sparse matrices.2.1 Notation and DefinitionsThis section outlines some general definitions and notation used in the remainder of thedissertation.12CHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 13Symbol Conventions: Unless mentioned otherwise, an uppercase Roman letter represents a matrix, while a lowercase Roman letter represents a vector or scalar value.A, : The entry in row i and column j of matrix A is denoted byMatrix Bandwidth: The bandwidth (sometimes called semi-bandwidth) of A is definedas b= maxi,{l.fl}, i—i such that 0.flop: Following [GV89] a flop is defined to be any floating point arithmetic operation.2.2 Sparse Symmetric Test ProblemsAs outlined in Section 1.2, experimentation with practical sparse problems plays a key role inthe development and evaluation of sparse tridiagonalization algorithms. Table 2.1 lists theidentifier and size of a suite of 115 test problems selected from the Harwell—Boeing sparsematrix collection [DGLP82, GLD92j. The originating applications for Table 2.1 sparseproblems include: air traffic control, structural engineering, oceanic modeling and powersystem networks. A description of each matrix’s characteristics is provided in [GLD92].When the Harwell—Boeing collection only specifies a matrix’s sparsity pattern, we assign arandom value in the range (0.0, 1.0] to each nonzero entry.By no means all problems in Table 2.1 originated as eigenvalue problems. In addition,Table 2.1 also contains many small matrices, even though a primary theme of this dissertation is the tridiagonalization of large sparse problems. The test suite includes theseadditional problems because we seek general reduction techniques that are suitable for abroad range of sparsity patterns.2.3 Orthogonal Transformations and Sparse TridiagonalizationAs discussed in Section 1.2, this dissertation studies the class of sparse tridiagonalizationand partial reduction algorithms that use a sequence of orthogonal similarity transformations to eliminate unwanted nonzero entries. These algorithms could construct a similarityCHAPTER 2. 
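The conventions of Sections 2.1 and 2.2, the semi-bandwidth definition and the assignment of random values in (0.0, 1.0] to pattern-only matrices, are easy to reproduce with standard tools. A small sketch follows; the Harwell-Boeing file name used here is hypothetical.

```python
import numpy as np
from scipy.io import hb_read                 # Harwell-Boeing reader

A = hb_read("bcsstk09.rsa").tocsr()          # hypothetical file name

# Pattern-only matrices: assign a random value in (0.0, 1.0] to each nonzero.
rng = np.random.default_rng(0)
A.data = 1.0 - rng.random(A.data.shape)      # uniform on (0.0, 1.0]

# Semi-bandwidth: b = max |i - j| over the nonzero entries A_ij.
rows, cols = A.nonzero()
b = int(np.max(np.abs(rows - cols)))
print("n =", A.shape[0], " semi-bandwidth b =", b)
```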
Table 2.1: A Test Suite of Sparse Symmetric Matrices

Problem      n    | Problem     n    | Problem     n    | Problem     n
1138 BUS    1138  | BCSSTK21   3600  | DWT 59      59   | DWT 2680   2680
494 BUS      494  | BCSSTK22    138  | DWT 66      66   | ERIS1176   1176
662 BUS      662  | BCSSTK23   3134  | DWT 72      72   | GR 30 30    900
685 BUS      685  | BCSSTK24   3562  | DWT 87      87   | LSHP 265    265
ASH292       292  | BCSSTK26   1922  | DWT 162    162   | LSHP 406    406
ASH85         85  | BCSSTK27   1224  | DWT 193    193   | LSHP 577    577
BCSPWR01      39  | BCSSTK28   4410  | DWT 198    198   | LSHP 778    778
BCSPWR02      49  | BCSSTM07    420  | DWT 209    209   | LSHP1009   1009
BCSPWR03     118  | BCSSTM10   1086  | DWT 221    221   | LSHP1270   1270
BCSPWR04     274  | BCSSTM27   1224  | DWT 234    234   | LSHP1561   1561
BCSPWR05     443  | BLCKHOLE   2132  | DWT 245    245   | LSHP1882   1882
BCSPWR06    1454  | CAN 24       24  | DWT 307    307   | LSHP2233   2233
BCSPWR07    1612  | CAN 61       61  | DWT 310    310   | LSHP2614   2614
BCSPWR08    1624  | CAN 62       62  | DWT 346    346   | LSHP3025   3025
BCSPWR09    1723  | CAN 73       73  | DWT 361    361   | LSHP3466   3466
BCSSTK01      48  | CAN 96       96  | DWT 419    419   | LUND A      147
BCSSTK02      66  | CAN 144     144  | DWT 492    492   | LUND B      147
BCSSTK03     112  | CAN 161     161  | DWT 503    503   | NOS1        100
BCSSTK04     132  | CAN 187     187  | DWT 512    512   | NOS2        237
BCSSTK05     153  | CAN 229     229  | DWT 592    592   | NOS3        957
BCSSTK06     420  | CAN 256     256  | DWT 607    607   | NOS4        960
BCSSTK07     420  | CAN 268     268  | DWT 758    758   | NOS5        468
BCSSTK08    1074  | CAN 292     292  | DWT 869    869   | NOS6        675
BCSSTK09    1083  | CAN 445     445  | DWT 878    878   | NOS7        729
BCSSTK10    1086  | CAN 634     634  | DWT 918    918   | PLAT1919   1919
BCSSTK11    1473  | CAN 715     715  | DWT 992    992   | PLAT362     362
BCSSTK12    1473  | CAN 838     838  | DWT 1005  1005   | SSTMODEL   3345
BCSSTK19     817  | CAN 1054   1054  | DWT 1007  1007   | ZENIOS     2873
BCSSTK20     485  | CAN 1072   1072  | DWT 1242  1242   |

2.3 Orthogonal Transformations and Sparse Tridiagonalization

As discussed in Section 1.2, this dissertation studies the class of sparse tridiagonalization and partial reduction algorithms that use a sequence of orthogonal similarity transformations to eliminate unwanted nonzero entries. These algorithms could construct a similarity transformation using one of many different orthogonal transformations. Givens rotations, G(i,j,θ), are often chosen for sparse matrix applications because they permit fine grained control over the elimination of nonzeros and the introduction of fill entries. In fact, rotations can be constructed to zero any entry of a matrix using any other entry in its row or column as the zeroing entry. (In Figure 2.1, x1 is the zeroing entry.)

[Figure 2.1: Sparsity Structure Unioning by a Givens Rotation]

The trivial example in Figure 2.1 illustrates an important property of rotations which impacts matrix sparsity. If cancellation is ignored and x1 is nonzero, the sparsity structure of both modified rows is the union of the structure of the two rows prior to the rotation's application. In the special cases of x2 or x1 zero, the rotation is either the identity or swaps the sparsity structure of the two rows, avoiding the introduction of fill entries. It is advantageous to have a zero zeroing entry. In addition to fill avoidance, swapping matrix rows to eliminate nonzero x2 requires no floating point computations.

Tridiagonalization algorithms can use Givens rotations to construct orthogonal similarity transformations of the form G(i,j,θ)^T A G(i,j,θ), which modify both rows and columns i and j of A. By exploiting the symmetry of sparse problems and of similarity transformations, algorithms need only consider transformation modifications to either the upper or lower triangular portion of each matrix. Without loss of generality we work with a matrix's lower triangular portion. From the wide selection of Givens rotations that can be constructed to eliminate a particular nonzero entry, many algorithms presented in later chapters use adjacent rotations exclusively. A rotation G(i,j,θ)^T is considered adjacent if |i - j| = 1.
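The row-union fill behaviour described above is easy to verify numerically. The following sketch uses illustrative values only (it does not reproduce Figure 2.1); it zeroes one entry of a two-row example and prints the sparsity structures before and after.

```python
import numpy as np

def left_givens(A, i, j, k):
    """Rotate rows i and j of A (from the left) so that A[j, k] becomes zero;
    A[i, k] plays the role of the zeroing entry."""
    r = np.hypot(A[i, k], A[j, k])
    c, s = A[i, k] / r, A[j, k] / r
    A[[i, j], :] = np.array([[c, s], [-s, c]]) @ A[[i, j], :]

# Two rows of a larger sparse matrix (illustrative values only).
A = np.array([[2.0, 1.0, 0.0, 3.0, 0.0],
              [4.0, 0.0, 5.0, 0.0, 6.0]])
print((A != 0).astype(int))                   # structures before the rotation
left_givens(A, 0, 1, 0)                       # eliminate A[1, 0]
print((np.abs(A) > 1e-12).astype(int))        # both rows now hold the union of the structures
```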
A rotation G(i, j, 8)T is considered adjacent ifli—il 1.To assess the stability of Givens similarity transformations, suppose k transformationsof the form GBG are applied to matrix B. If the updated matrix B is calculated usingfinite precision floating point operations, Wilkinson [Wi1651 shows that in the absence ofCHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 16roundoff(2.1)where E II2 cu B 112, constant c is dependent on n and k, and u is the machineprecision. In other words, B is exactly similar to a matrix close to B.Another commonly used orthogonal transformation is a Householder transformation; ann x n orthogonal matrix of the formH = I — 2vvT/vTv,where v E JR. In general, a Householder transformation can be constructed to reflect anygiven vector onto any other vector of the same dimension and equal2-norm. Consequently,we can construct a Householder transformation that simultaneously annihilates multipleentries from a matrix’s column, when applied from the left.Compared to Givens rotations, applying Householder transformations to sparse matricesgenerally creates higher levels of fill. A Givens rotation applied to the left of sparse matrixA to eliminate a single nonzero, unions the sparsity structure of two rows. In contrast, for aHouseholder transformation simultaneously eliminating k entries of a matrix’s column, thefinal structure of each modified row is the union of the sparsity structures of all modifiedrows prior to the transformation’s application. Duff and Reid [DR75] show there exists asequence of k Givens rotations, Gk,... , G1 able to eliminate the same group of nonzerossuch thatsparsity.structure(Gk . . . G21A) sparsity_structure(HA).2.4 A Complexity Analysis Framework for Sparse Tridiagonalization Algorithms and Symbolic Reduction ToolsThis section introduces a general framework for analyzing the complexity of algorithmsapplying sequences of Givens similarity transformations to sparse symmetric matrices. Thisframework guides the formal analyses of many partial reduction and tridiagonalizationalgorithms presented in subsequent chapters. Examples of detailed analyses performedCHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 17‘XX O[XXIX )( XX Oi’XI I0 X XIX XIX ‘X\rxx X 1ckJoikxio x x’x’x o X’XJt ‘XX X X ‘Xx x o X\A”nonzero )( X 0 X X0 X 0 *pair” XXXOXXO0 X 0 X X X X0 XFigure 2.2: A Transformation Length Example.using this framework are provided in [Cav93]. The general framework is also used by thesymbolic reduction tools Xmatrix and Trisymb (See Sections 2.5 and 2.6.) to coordinatethe gathering of flop count statistics.To simphfy the resolution of each algorithm’s computational requirements, the framework directs each analysis to be split into two sub-tasks.Task 1: Calculate the number of nontrivial transformations, Ttotai , employed by thereduction.Task 2: Calculate the total number of off-diagonal, lower triangular pairs of nonzeroentries modified by the reduction’s nontrivial transformations. We refer tothis value as the total transformation length or LtotaiThe first sub-task is self-evident but the second requires additional clarification. The lengthof a single transformation is the number of pairs of lower triangular nonzero entries itmodifies, excluding those entries updated by both rotations constituting the transformation.We consider a pair of modified entries nonzero if one or both entries are nonzero. As anexample, the length of the transformation modifying the highlighted entries of Figure 2.2’smatrix is 7. 
2.4 A Complexity Analysis Framework for Sparse Tridiagonalization Algorithms and Symbolic Reduction Tools

This section introduces a general framework for analyzing the complexity of algorithms applying sequences of Givens similarity transformations to sparse symmetric matrices. This framework guides the formal analyses of many partial reduction and tridiagonalization algorithms presented in subsequent chapters. Examples of detailed analyses performed using this framework are provided in [Cav93]. The general framework is also used by the symbolic reduction tools Xmatrix and Trisymb (see Sections 2.5 and 2.6) to coordinate the gathering of flop count statistics.

[Figure 2.2: A Transformation Length Example]

To simplify the resolution of each algorithm's computational requirements, the framework directs each analysis to be split into two sub-tasks.

Task 1: Calculate the number of nontrivial transformations, T_total, employed by the reduction.

Task 2: Calculate the total number of off-diagonal, lower triangular pairs of nonzero entries modified by the reduction's nontrivial transformations. We refer to this value as the total transformation length, or L_total.

The first sub-task is self-evident but the second requires additional clarification. The length of a single transformation is the number of pairs of lower triangular nonzero entries it modifies, excluding those entries updated by both rotations constituting the transformation. We consider a pair of modified entries nonzero if one or both entries are nonzero. As an example, the length of the transformation modifying the highlighted entries of Figure 2.2's matrix is 7. (Section 2.4.3 considers a specialized variant of the analysis framework, for densely banded matrices, that exploits the sparsity of a pair of entries creating a bulge.) We note that a transformation's length is equal to the total number of pairs of nonzero entries on both sides of the main diagonal affected by the application of G(i,j,θ)^T. As a result, it is often easier to consider the number of pairs of nonzero entries modified by a single rotation when symmetry is ignored, rather than apply the strict definition of transformation length.

Once an individual analysis has found T_total and L_total, we use the following general formula to calculate the algorithm's flop requirements.

    Total_flops = (F_trans)(T_total) + (F_pair)(L_total) + OTC    (2.2)

F_trans represents the number of flops required to construct a transformation and apply it to the entries modified by both the transformation's rotations. F_pair represents the number of flops required to apply a rotation to a single pair of nonzero entries. OTC represents one time costs that are not spread over individual transformations. The total flop count does not include the cost of square roots, which are separately accounted for by the analysis framework.

Section 2.3 introduced the standard Givens transformation. To reduce computational requirements many algorithms actually employ a "root free" variant of the transformation, with identical fill properties, often referred to as a fast Givens transformation [Gen73]. The specific values of F_trans, F_pair, and OTC are dependent upon the type of Givens transformation used by an algorithm. The following two subsections refine Equation 2.2 for each transformation type. Finally, Section 2.4.3 provides a specialized analysis framework for densely banded matrices reduced by band-preserving algorithms. (Band-preserving reductions are introduced in Chapter 3.)

2.4.1 Standard Givens Transformations

F_pair: A standard 2 x 2 Givens rotation has the generic form

    [  c  -s ]
    [  s   c ].

Applying this rotation to a typical pair of entries,

    [  c  -s ] [ y_1 ]
    [  s   c ] [ y_2 ],

requires

    F_pair = 6 flops.    (2.3)

F_trans: The calculation of c and s requires 5 flops and one square root [GV89]. The cost of updating the 3 lower triangular entries modified by both rotations making up the transformation requires more detailed consideration. By using the following scheme, we save 3 flops over the most obvious approach. Writing the twice-modified 2 x 2 block as

    [ â_ii  â_ij ]   [  c  s ] [ a_ii  a_ij ] [ c  -s ]
    [ â_ji  â_jj ] = [ -s  c ] [ a_ji  a_jj ] [ s   c ]

and using a_ij = a_ji, the product simplifies to

    [ c^2 a_ii + 2cs a_ij + s^2 a_jj          (c^2 - s^2) a_ij + cs (a_jj - a_ii) ]
    [ (c^2 - s^2) a_ij + cs (a_jj - a_ii)     s^2 a_ii - 2cs a_ij + c^2 a_jj      ].    (2.4)

The total number of flops required to compute the final value of the twice modified entries â_ii, â_ij, and â_jj is summarized in the following table. Each calculation is free to use those values appearing to the left of it in the table.

    Calculation | c^2 | cs | s^2 | 2cs a_ij | â_ii | â_jj | â_ij | Total
    Flops       |  1  |  1 |  1  |    2     |  4   |  4   |  5   |  18

Finally, it is not necessary to calculate the updated value of the eliminated entry, saving 3 flops for each transformation. Thus for standard Givens transformations

    F_trans = 5 + 18 - 3 = 20 flops.    (2.5)

OTC: There are no one time costs associated with tridiagonalization algorithms using standard Givens transformations.
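The economized symmetric update of Equation (2.4) can be checked against the explicit product G^T A G. A short verification sketch follows, using random values and the rotation convention adopted above; it is an illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
aii, aij, ajj = rng.standard_normal(3)
A2 = np.array([[aii, aij], [aij, ajj]])

theta = 0.7
c, s = np.cos(theta), np.sin(theta)
G = np.array([[c, -s], [s, c]])

# Explicit similarity transformation of the 2 x 2 block.
direct = G.T @ A2 @ G

# The economized update of Equation (2.4).
hat_ii = c*c*aii + 2*c*s*aij + s*s*ajj
hat_jj = s*s*aii - 2*c*s*aij + c*c*ajj
hat_ij = (c*c - s*s)*aij + c*s*(ajj - aii)
econ = np.array([[hat_ii, hat_ij], [hat_ij, hat_jj]])

print(np.max(np.abs(direct - econ)))   # agreement to roundoff
```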
Standard Givens Flop Formula

For standard Givens transformations Equation 2.2 becomes

    Total_flops_SG = 20 (T_total) + 6 (L_total).    (2.6)

In addition to this flop count, the reduction requires T_total square roots.
Thus for fast Givens transformationsFtrans = 16 + 4 + 5 — 2 = 23 flops. (2.14)OTCWhen A has been reduced to tridiagonal form, the fast Givens process is completed asshown by equation 2.8. The calculation of D’12 requires N square roots. The followingequation illustrates the modifications made to the tridiagonal matrix by entry d”2.d”2 b a b1 d112b+(2.15)By exploiting symmetry this update requires 3 flops. Generalizing this result to the cost ofthe entire updateOTC = 3n flops. (2.16)CHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 23Fast Givens Flop FormulaFor fast Givens transformations Equation 2.2 becomesTotal_fiopsYG = 23(Ttotai ) + 4(Ltotai ) + 3Ti. (2.17)In addition to this flop count, n square roots are required by the reduction.As discussed in [GV89], fast Givens transformations require periodic rescaling to avoidoverflow problems. Rescaling costs are difficult to predict and are not included in theanalysis leading to Equation 2.17. Fortunately, in Section 4.3.4 we demonstrate that rescaling costs are typically insignificant when sparse tridiagonalization algorithms are applied tolarge problems. Recently, Anda and Park [AP94] presented new fast plane rotations that donot require periodic rescaling. To facilitate direct comparison with existing code, however,the implementations developed for this thesis employ standard fast Givens transformations.2.4.3 An Enhanced Framework for Densely Banded MatricesIn the general framework described above we increment transformation length if one orboth entries iii a modified pair are nonzero. For densely banded matrices, those transformations creating a bulge modify a single entry pair with only one nonzero. (As described inChapter 3, a bulge is a fill entry created outside the current band.) The zero entry in thispair is filled by the bulge. If the sparsity of this modified pair is exploited, each fast Givenstransformation creating a bulge saves 3 flops, while a standard Givens transformation save4 flops. If CR is the total number of nontrivial bulge chasing transformations used bya band-preserving reduction, then the enhanced flop formulas are given by the followingequations.Total_flops_SG’ = 20(Ttotai ) + 6(Ltotai ) — 4CR (2.18)TotaLflopsFG’ 23(Ttotai ) + 4(Ltotai ) + 3fl — 3CR (2.19)CHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 242.5 XmatrixXmatrix is an interactive symbolic sparse matrix reduction tool. It assumes exact numericalcancellation does not occur and manipulates the sparsity structures of symmetric matricesto model the application of sequences of Givens similarity transformations.An example of Xmatrix’s mouse driven graphical interface, written for the X WindowSystem using the Athena widget set, is illustrated in Figure 2.3. The dominant feature ofthe interface is the matrix situated in the top left corner of the Xmatrix window. The sizeof this matrix is specified by the user at start-up. Due to the physical constraints of theinterface, however, the order of the largest matrix that can be easily manipulated is 50.Each entry of the matrix is a button that can be selected by different mouse buttons withvarying effects. The menu to the right of the matrix provides the user access to Xmatrix’sdifferent features. For example, the Load, Dump, Permute and Movie buttons activate popup windows, which are also displayed in Figure 2.3. Finally, the area below the matrixrecords statistics for the current reduction.Xmatrix assumes that all sparse matrices are symmetric and that their main diagonalis nonzero. 
Otherwise, a user is completely free to construct an arbitrary sparsity patternwith one of two mechanisms. First, a user can interactively construct a sparse matrix bytoggling the nonzero status of entries selected with the middle mouse button. Alternatively,existing sparsity structures can be loaded from files in Harwell—Boeing or Xmatrix formatusing the Load pop-up window. Generally, Xmatrix marks nonzero entries with an “X”, butin keeping with the band-preserving reduction algorithms of Chapters 3, 4 and 5, a nonzeroentry lying outside the current band is marked by a “B”.Once the desired sparsity structure has been establlshed, a user can model the effects ofapplying a sequence of standard or fast Givens similarity transformations. Each transformation is interactively specified with the mouse. First, the nonzero entry to be eliminatedis selected from the lower or upper triangular portion of the matrix. Using the mouse toselect the zeroing entry (see Section 2.3) completes the similarity transformation’s definition, causing Xmatrix to automatically update the sparsity of the modified pair of rows andCHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 25Figure 2.3: The Xmatrix InterfaceCHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 26columns, and zero the target entry. On completion of these updates another transformationmay be selected.The statistics area of the Xmatrix window displays the matrix’s current bandwidth andthe nonzero density of the outermost diagonal and band. In addition, Xmatrix summarizestrivial and nontrivial transformation totals, and estimates the flop requirements associatedwith the applied sequence of Givens transformations using the general analysis frameworkof Section 2.4.1 or 2.4.2.Additional features of Xmatrix include:• Xmatrix permits modeling of both sparse and dense band transformations. Xinatrixassumes a dense band when computing the lengths of dense transformations.• The user may select automatic bulge chasing up or down the band. (See Chapters 3,4 and 5.)• The current sparse matrix can be permuted using a user specified permutation vectoror by GPS, RCM, GK, MDA, ND or reverse reordering algorithms.• The current sparsity structure can be saved to a file in Harwell—Boeing, Xmatrix ormtxps format. Using the mtxps filter on the latter type of file creates a postscriptimage of the saved sparse matrix.• The “movie” feature permits the recording and playback of a series of Givens transformations.As shown in this brief introduction, Xmatrix is a versatile tool, useful for the design andevaluation of sparse algorithms employing sequences of Givens transformations. Xinatrixwas used extensively during the development of algorithms presented in Chapters 4 and 5.2.6 TrisymbTrisymb is a symbolic sparse matrix reduction tool for sparsely (or densely) banded symmetric matrices. Assuming exact cancellation does not occur, Trisymb manipulates sparsitystructures to model the application of a sequence of Givens similarity transformations tosparsely banded symmetric matrices.CHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 27Trisymb provides an efficient internal data structure for a banded matrix’s sparsitystructure and a set of basic routines for its manipulation. For example, Trisymb providesroutines for applying row and column exchanges or nontrivial transformations to the storedsparse matrix. Using this basic platform, sparse reduction algorithms are easily implemented in Trisymb. 
Currently, Trisymb provides the following five algorithms: sparse R—S,BC, SBC, HYBBC and HYBSBC. (The many unfamiliar terms and acronyms used in thissection will be fully explained in later chapters.)A user must supply Trisymb with a sparse symmetric matrix in Harwell—Boeing formatand a permutation vector, which typically specifies a bandwidth reducing preordering ofthe sparse problem. Unlike Xmatrix, Trisymb is suitable for the symbolic reduction of bothsmall and large symmetric sparse matrices. The availability of resources on a particularmachine places the only restrictions on problem size. Once appropriate data structures havebeen initialized with a sparse problem, Trisymb executes the requested symbolic reductionby calling routines to manipulate the stored sparsity structure.Throughout each reduction Trisymb gathers a wide range of statistics, producing a general reduction report and separate reports tailored to highlight special features of individualalgorithms. The general report summarizes the extent of the executed reduction, and provides separate transformation and flop counts for each stage of the algorithm. Trisymbestimates flop requirements using the general analysis framework of Section 2.4.1 or 2.4.2.To permit the application of different variants of the five implemented algorithms, auser may select from several command line arguments specifying the following modelingoptions.• Trisymb can model either standard or fast Givens transformations.• Trisymb permits modeling of both sparse and dense band transformations. Whencomputing the lengths of dense transformations Trisymb assumes a dense band, butdoes not modify the underlying manipulation of sparsity structures.• For appropriate algorithms, users may request either a partial bandwidth contractionor a complete tridiagonalization.CHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 28Trisymb provides fixed, , nosplit and density thresholded transition and terminationstrategies.Trisymb predicts the computational requirements and sparsity characteristics of reductions without the aid of a matrix’s nonzero entries. Despite working solely with sparsity structures, Trisymb’s no cancellation assumption provides relatively accurate analyses.Trisymb is especially useful when comparing a problem’s reduction using several differentalgorithms or when assessing the effectiveness of a single algorithm applied to many different problems. The development and experimental evaluation of algorithms in subsequentchapters uses Trisymb extensively.2.7 A Bipartite Graph Model for Sparse TridiagonalizationThis section introduces a graph-theoretic model for the tridiagonalization or partial reduction of symmetric sparse matrices. We begin with basic graph theory notation for bipartitegraphs and establish their connection to symmetric sparse matrices. Using bipartite graphs,we then model the application of orthogonal transformations used in sparse reductions. Thissection’s discussion assumes familiarity with basic graph-theoretic terminology. (A generalgraph theory reference is [Har69]).2.7.1 Bipartite Graphs and Sparse MatricesBipartite graphs can be used to describe the nonzero structure of sparse matrices. Thebipartite graph BA of matrix A’s sparsity structure consists of two sets of vertices or nodesV’ and V, and a set of edges E. For an n x m matrix both vertex sets consist of the first nintegers, with instances and variables of V’ differentiated from those of V using primes. 
Tomodel a sparse matrix we assume node i’ e V’ corresponds to row index i of A, while themembers of V correspond to A’s column indexes. Each edge in E consists of an orderedpair of vertices < v’, v >, satisfying v’ e V’ and v E V. Sparse matrix A’s bipartite graphBA has an edge <i’,j > for each entry 0.Figure 2.4 illustrates a small symmetric sparse matrix and its corresponding bipartiteCHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 29i XIX’ LJ’2 XX 3X 4 XX 5 X6X 7,X 8,XAFigure 2.4: A Sparse Matrix and its Bipartite Graphgraph. Assuming the matrix’s main diagonal is nonzero, there is a one-to-one correspondence between the edges of BA and the nonzero entries of A. If the labels of vertices in setsV’ and V of BA are arranged in sequence, and v’ and v are horizontally aligned, off-diagonalnonzero entries further from A’s main diagonal correspond to steeper edges in the bipartitegraph. In graphical terms, the ultimate goal of a tridiagonalization algorithm is to reduceBA to a bipartite graph whose edges <i’,j > (i’ E V’,j e V) satisfy I — ii 1.One of the most efficient and convenient schemes for mathematically representing abipartite graph uses adjacency sets. Two nodes x’ E V’ and y E V are adjacent ify >e E. The adjacency set of node x’ E V’ is defined byAdjBA(x’) = {z é V <x’,z >e E},and similarly the adjacency set of node y e V isAdjBA(y) = {z’ e V’I <z’,y >e E}.As examples, the adjacency sets of nodes 8’ and 4 in Figure 2.4 are {5, 8, 9} and {l’, 4’, 7’}respectively. The adjacency sets for all nodes in either V or V’ completely defines E. Fora symmetric matrix A,BAVv E V, AdjBA(v) = (AdjBA(v’))’.CHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 30(We interpret the “priming” of a set as applying prime to each of its members.)2.7.2 Modeling Givens TransformationsUsing bipartite graphs we can model the changes to matrix sparsity structures resultingfrom the application of orthogonal similarity transformations. Suppose the Givens transformation G(i,j, O)TAG(i,j, 0) eliminates nonzero entries A,k and Ak, (i z k), or equivalentlyeliminates edges < j’, k > and < k’, j > from BA. We will investigate the application ofthe two halves of this transformation independently.First, consider the application of rotation G(i, j, 0)T to the left of A, eliminating nonzeroor edge <j’, k >. As discussed in Section 2.3, the precise implementation of the Givensrotation depends on the nonzero status of the zeroing entry A,k. When A,k is nonzero (or<i1, k >E E), a nontrivial Givens rotation is applied. Except for the eliminated entry A,k,the sparsity structures of rows i and j both become the union of their sparsity structuresjust prior to modification. The corresponding modification of BA adds the minimal set ofedges to E that make AdjBA(i’) = AdjBA(j’), and then removes edge < j’, k > from E.Figure 2.5 more formally outlines the modifications to BA’S adjacency sets resulting fromG(i,j, 0)T application.1. (a) U := AdjBA(i’) U AdjBA(j’)(b) AdjBA(i’) := U(c) AdjBA (j’) U2. FOREACHzU(a) AdjBA(z) := AdjBA(z) U {i’,j’}3. (a) AdjBA(j’) := AdjBA(j’) — {k}(b) AdjBA(k) := AdjBA(k)—{j’}Figure 2.5: Updating BA to Reflect G(i,j,0)TA.As an example, Figure 2.6 illustrates the effects of eliminating entry A4,1 (or edgeCHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 312 x;0 F4 XX 5 X6xX 8XH___Figure 2.6: A411 is Eliminated by G(3,4,01)T.< 4’, 1 >) from Figure 2.4’s example, using the adjacent rotation G(3, 4, Oi)’ (or Gf)applied to the left of A. The zeroing entry is A31. 
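The update of Figure 2.5 translates almost directly into code. The sketch below, which is illustrative only and not the data structure used by Xmatrix or Trisymb, stores the bipartite graph as dictionaries of adjacency sets and applies the left half of a nontrivial Givens transformation symbolically under the no-cancellation assumption. For instance, the call apply_left_rotation(row_adj, col_adj, 3, 4, 1) models the elimination of A4,1 shown in Figure 2.6, whose fill is described next.

# Symbolic bipartite-graph model of G(i, j, theta)^T A (Figure 2.5), assuming
# no exact numerical cancellation.  row_adj[r] holds the column indices of the
# nonzeros in row r (the adjacency set of node r'), and col_adj[c] holds the
# row indices of the nonzeros in column c (the adjacency set of node c).
def build_adjacency(entries, n):
    """entries: (r, c) positions of the nonzeros of a symmetric n x n pattern."""
    row_adj = {v: set() for v in range(1, n + 1)}
    col_adj = {v: set() for v in range(1, n + 1)}
    for r, c in entries:
        row_adj[r].add(c); col_adj[c].add(r)
        row_adj[c].add(r); col_adj[r].add(c)     # symmetric counterpart
    return row_adj, col_adj

def apply_left_rotation(row_adj, col_adj, i, j, k):
    """Model G(i, j, theta)^T applied to the left of A to eliminate A[j, k],
    with zeroing entry A[i, k] assumed nonzero (a nontrivial rotation)."""
    union = row_adj[i] | row_adj[j]              # step 1: union the two row structures
    row_adj[i], row_adj[j] = set(union), set(union)
    for c in union:                              # step 2: both rows now touch column c
        col_adj[c].update((i, j))
    row_adj[j].discard(k)                        # step 3: remove the eliminated edge <j', k>
    col_adj[k].discard(j)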
Each fill entry is marked by an F andeach fill edge in BA is highlighted.To complete the similarity transformation, G(i, j, 61) is applied to the right of A, eliminating nonzero Ak and unioning the sparsity structnres of columns i and j. The corresponding modification of BA adds the minimal set of edges to E that make AdjBA (i) =AdjBA (j), and then removes edge < Ic’, j > from E. Exchanging V’ and V variables fortheir V and V’ counterparts in Figure 2.5 creates a more formal specification of Bs modifications. Figure 2.7 illustrates the effects of completing the Givens similarity transformationbegun in Figure 2.6, by eliminating entry A14 and edge < 11,4 > with adjacent rotationG(3, 4, 81). In this case the zeroing entry is A13.Once again, suppose a Givens transformation G(i, j, 8)TAG(i, j, 6) is constructed toeliminate nonzero entries A,k and Ak, (i 0 Ic) from the symmetric sparse matrix A, butnow assume A,k = Ak = 0. As discussed in Section 2.3, in the special case that the zeroingentries A,k and Ak, are zero, or edges < i’, Ic >, < Ic’, i >0 E, the transformation simplyswaps rows and columns i and j. In graphical terms, exchanging a pair of rows or columnscorresponds to relabeling a pair of nodes in V’ or V. In this case nodes i’ and j’ switchCHAPTER 2. BACKGRO UND AND TRIDIAGONALIZATION TOOLS 32Tjof2 XX 3F F0 F4 XX 5 X6FX 7X 8X__x 9GG1Figure 2.7: Completing the Nontrivial Transformation G(3, 4,01)TAG(3, 4, Ui).labels, as do nodes i and j. More formally, the row and column swaps precipitated by thetransformation require modifications to BA’S adjacency sets as outlined in Figure 2.8.Continuing our example of Figures 2.4, 2.6 and 2.7, Figure 2.9 completes the reductionof column 1 by eliminating nonzero entries A3,1 and A1,3 using an adjacent transformationswapping rows and columns 2 and 3. In terms of the bipartite model, we simply exchangethe labels of nodes 3’ and 4’, and nodes 3 and 4. Finally, we envisage moving the relabelednodes, with their edges attached, to reestablish sequential order of V’ and V. The affectednonzero entries and edges are highlighted in the figure.2.7.3 Modeling Householder TransformationsAs mentioned in Section 2.3, another commonly used orthogonal transformation is a Householder transformation. Unlike a Givens rotation, a Householder transformation can simultaneously eliminate multiple nonzero entries from a single row or column. Suppose theHouseholder transformation H is applied to the left of A as the first half of a similarity transformation. Further assume it modifies rows i, i + 1,... , + 1, eliminating entriesAi+1,k, Ai+2,k,.. . , Ajj. Except for these entries, the sparsity structure of each modifiedrow becomes the union of the sparsity structures of all modified rows just prior to theCHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 33Figure 2.8: Updating BA to Reflect a TrivialColumns i and j.Transformation Exchanging Rows and1xooX2 XO if30X0X04 Xox 5 xI-_6xox 7xjxTTG21AG12Figure 2.9: Completing the Reduction of Column and Row 1 to Tridiagonal Form with aRow and Column Exchange.1. FOR EACH z e AdjBA(i’)— AdjBA(j’)(a) AdjBA(z) := (AdjBA(z) — {i’}) U {j’}(b) AdjBA(z’) := (AdjBA(z’) — {i}) U {j}2. FOR EACH z e AdjBA(j’)— AdjBA(i’)(a) AdjBA(z) := (AdjBA(z)— {j’}) U {i’}(b) AdjBA(z”) : (AdjBA(z’)— {j}) U {i}3. /* Switch adjacency sets.*/(a) S : AdjBA (i’)(b) AdiBA(i’) := AdjBA(j’)(c) AdjBA(j’) := S(e) S := AdjBA(i)(f) AdjBA(i) := AdjBA(j)(g) AdjBA(j) := SCHAPTER 2. 
BACKGROUND AND TRIDIAGONALIZATION TOOLS 34HT-F2FFX FL!OFF4F Xx 5EEErjHAFigure 2.10: Reduciug Column 1 to Tridiagonal Form With a Householder Transformation.eliminations. The corresponding modification of BA adds the minimal set of edges to E sothatAdjBA(i’) = AdjBA((i + 1)’) = ... = AdjBA((i + 1)’),and removes edges < (i + 1)’, Ic >,. . . , < (i + 1)’, Ic > from E. Completing the similaritytransformation by applying H to the right of A, results in similar sets of nodes being addedand removed from the adjacency sets of nodes i, I + 1,... , I + 1.In two stages Figures 2.10 and 2.11 illustrate the application of a Honseholder similaritytransformation to the sparse example in Figure 2.4. The first figure illustrates the effect ofapplying a Householder transformation to the left of the sparse matrix to eliminate entriesA4,1 and A3,1 or edges < 4’, 1 > and <3’, 1 >. The second figure illustrates the modifications made to both the matrix and its associated bipartite graph by the complete similaritytransformation. The application of this transformation and the application of the twoGivens transformations leading to Figure 2.9 both result in the elimination of two nonzerosfrom both the first row and column. We note that the use of Givens transformations hasresulted in 12 fewer fill entries than for the Householder transformation.CHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 351 FOOF2FFX F—OF3FFFOFF4F XXFF5 XiZEkHAHFigure 2.11: Completing the Reduction of Column and Row 1 with a Householder SimilarityTransformation.2.7.4 Concluding RemarksThe bipartite graph model presented in this section is intended for describing, understandingand providing insight into the sparse reduction algorithms described in subsequent chapters.Often symmetric sparse matrices are modeled using undirected graphs with a single nodefor each entry of the main diagonal and a single edge for each symmetric pair of off-diagonalnonzero entries. Using undirected graphs, however, it is more difficult to model the nonzerostatus of main diagonal entries and the fill they may create. Undirected graphs could beaugmented with self-loop edges. Alternatively, bipartite graphs represent a diagonal entryas an edge between two distinct nodes, providing a more easily understood and naturalmodeling of diagonal entries. In addition, bipartite graphs can separately model the effectsof applying each half of an orthogonal similarity transformation. The bipartite graph modelis also easily extended to unsymmetric matrices or non-congruent transformations. Finally,bipartite graph models are conducive to describing sparse reduction algorithms using sequences of orthogonal similarity transformations. For example, the unioning of row andcolumn sparsity structures corresponds to the unioning of adjacency sets. As we will see inthe following chapter, the bipartite graph model provides a natural graphical interpretationCHAPTER 2. BACKGROUND AND TRIDIAGONALIZATION TOOLS 36of band-preserving reduction algorithms artifacts such as bulge chasing sequences.Chapter 3Existing Algorithms and TheirExtension for SparseTridiagonalizationThis chapter explores existing direct tridiagonalization approaches, analyzing their limitations and their potential extension to improve sparsity exploitation. For evaluating therelative success of presented algorithms, we first introduce two simple sparse model problemsthat lend themselves to formal analysis. 
Employing these model problems, as well as othersparse matrices, we then characterize the fill properties of Givens Reduction, identifying theinefficiencies associated with this direct dense matrix approach. If we attempt to improvesparsity exploitation by changing the elimination order of each column and the planes of zeroing rotations, the next section shows even these customized Givens Reductions experienceoverwhelming levels of fill. In the following section we explore alternative band-preservingtridiagonalization approaches for banded matrices, which restrict the accumulation of fillentries to improve reduction efficiency. The chapter’s final section considers the extensionof these algorithms for the tridiagonalization of general sparse symmetric matrices, anddemonstrates the relative efficiency of the sparse Rutishauser-Schwarz algorithm.37CHAPTER 3. EXTENDING EXISTING ALGORITHMSFigure 3.1: The 5-Point Model Problem3.1 Sparse Model Problems38In practice the density and nonzero structure of sparse symmetric eigenvalue problems varywidely. To provide a common basis for the formal comparison of reduction algorithms,this section presents two families of sparse model problems that lend themselves to formalanalysis.The first family of model problems consists of densely banded symmetric matrices ofbandwidth b and order n. We define the bandwidth b (or semi-bandwidth) of matrix A asmax,E{1...fl}, i—j such that 0.The second family of model problems is the 5-point problems. The symmetric sparsematrices associated with 5-point problems arise from discretizing partial differential equations on a square uniform grid of dimension b x b, using Dirichlet boundary conditionsand a 5-point differencing molecule. The resulting matrices are of order n = b2 and using a standard lexicographic ordering, consist of five nonzero diagonals, as illustrated inFigure 3.1.xxxxx-0x o.xx.x0xx0x• x•• x•x xx.0o •xx•x •x•.x•x0xCHAPTER 3. EXTENDING EXISTING ALGORITHMS 39These families of model problems were selected because they are scalable and exhibitimportant features also found in many practical sparse symmetric problems. In addition,the simple and highly regular nature of their sparsity structures makes formal algorithmanalysis feasible. Ultimately, however, we are interested in the tridiagonalization of generalsparse symmetric matrices. Unfortunately, our sparse model problems do not show all thesignificant characteristics exhibited by sparse matrices. For example, many sparse matricesare much more irregular than our model problems. Consequently, we often augment formal analysis results for our model problems with experimentation using sparse symmetricproblems from Section 2.2’s test suite.3.2 A Naive Approach to Sparse TridiagonalizationOne approach to sparse tridiagonalization is to simply enhance a successful dense matrixalgorithm, such as Givens reduction, with basic sparse matrix operations. Using 0(m3)flops and 0(m2) storage, the standard Givens reduction tridiagonalizes a dense symmetricmatrix column by column, as shown in Figure 3.2. Within each column nonzero entriesare eliminated from the bottom up using the column’s subdiagonal entry as the zeroingFOR col:r= 1 TO n-2 DOFOR row:= n DOWNTO col+2 DOA := G(col + 1, row, O)T A G(col + 1, row, 0)Figure 3.2: Givens Reductionentry. We suggest two modifications to this basic algorithm in an attempt to exploitmatrix sparsity. First, we can obviously avoid constructing and applying transformationsto eliminate entries that are already zero. 
In addition, we need to apply a transformationto only those lower triangular entries in the unioned sparsity structure of the two modifiedrows and columns.To evaluate the potential of this modified form of Givens reduction, we first consider itsCHAPTER 3. EXTENDING EXISTING ALGORITHMS 40application to densely banded model problems. Unfortunately, for nontrivial bandwidthsGivens reduction quickly overwhelms the banded nature of these sparse matrices with fillentries. In fact, after the reduction of the first[n—i 1 n—i Mod(n_l,b)*b H — bcolumns the remaining (n — r) x (n — r) submatrix is completely filled in. Table 3.1 providesestimates of the computational requirements for this densely banded reduction computedusing the analysis framework of Section 2.4. The analysis assumes the reduction employsstandard Givens transformations, and FgGR and TDBGR refer to flop and transformationcounts respectively. In addition to FaQR, the construction of each transformation requiresFaGR(2_+i)n3+O(n2)1Mod(n—1,b)— Mod(n—1,b)\ 2 1” bTable 3.1: Tridiagonalization Costs of Givens Reduction for a Densely Banded Matrixone square root. For comparison, Table 3.2 provides formula for the tridiagonalizationcosts of reducing a dense symmetric matrix with Givens reduction. FgGR and T00 areL I TGR___2n3+n2_17n+141c_y+iTable 3.2: Tridiagonalization Costs of Givens Reduction for a Dense Matrixvery close to their dense matrix counterparts and the densely banded reduction takes littleadvantage of matrix sparsity. In addition, due to overwhelming levels of fill, the sparsereduction also requires 0(n2) storage.The overwhelming levels of fill produced by Givens reduction are not restricted todensely banded model problems. Despite the additional sparsity of the 5-pt problems,their tridiagonalization by Givens reduction also suffers from high levels of fill. AfterModQc y) is the remainder from the division of integer x by integer y.CHAPTER 3. EXTENDING EXISTING ALGORITHMS 41Problem I’ Initial Off-Diagonal # of columns (rows) eliminatedNonzero Density before remaining matrix is dense.b(2n—b—1) n—j. Mod(n—1 b)Dense Band n— b5-Pt Problems b2 ‘- 4/ba 2b— 3PLAT1919 1919 0.83% 36NOS3 960 1.62% 23BCSSTKO8 1074 1.03% 7Table 3.3: Fill Characteristics of Givens Reductionr = 2b — 3 columns of a 5-pt problem have been reduced to tridiagonal form the remaining (n — r) x (n — r) submatrix is dense. As for the densely banded model problems,the tridiagonalization of this remaining submatrix dominates reduction costs and sparsityexploitation is limited.The speed with which Givens reduction fills our model problems is typical of mostsparse symmetric problems. Table 3.3 summarizes the fill characteristics for both familiesof model problems and also provides data for three additional sparse symmetric problemsfrom Section 2.2’s test suite. To determine the fill characteristics of each Harwell—Boeingproblem, we used the basic sparse reduction platform of Trisymb to implement the sparseGivens reduction algorithm. The resulting symbolic reduction code was applied to eachsparse problem. In each case, Givens reduction produces overwhelming numbers of newnonzeros, quickly filling the unreduced portion of each matrix despite extremely low initialnonzero densities.This section’s results clearly demonstrate that a naive approach to tridiagonalizationusing Givens reduction is unable to significantly exploit matrix sparsity. 
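The model problems of Section 3.1 are easy to regenerate, which makes experiments such as those summarized in Table 3.3 straightforward to repeat. The following sketch (illustrative Python, not thesis code) constructs the nonzero positions of a 5-point problem under the standard lexicographic ordering and evaluates the bandwidth definition max |i - j| over the nonzero entries.

# Sparsity pattern of the 5-point model problem: an order n = b^2 matrix with
# nonzeros on the main diagonal and at offsets +-1 and +-b, corresponding to a
# b x b grid ordered lexicographically (Section 3.1).
def five_point_pattern(b):
    n = b * b
    entries = set()
    for row in range(b):
        for col in range(b):
            p = row * b + col + 1                # 1-based lexicographic index
            entries.add((p, p))
            if col + 1 < b:                      # east neighbour, offset 1
                entries.add((p + 1, p)); entries.add((p, p + 1))
            if row + 1 < b:                      # south neighbour, offset b
                entries.add((p + b, p)); entries.add((p, p + b))
    return entries, n

def semi_bandwidth(entries):
    """b = max |i - j| such that A[i, j] != 0."""
    return max(abs(i - j) for i, j in entries)

entries, n = five_point_pattern(6)
print(n, semi_bandwidth(entries))                # prints 36 6: the bandwidth equals b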
The fill created bythese reductions so quickly overwhelms matrix sparsity that the original matrix might aswell have been treated as dense. There is no evidence to suggest that other dense matrixtridiagonalization algorithms, for example Householder’s reduction, offer any significantadvantage. In fact, as shown in Section 2.3, Householder transformations generally createhigher levels of fill.CHAPTER 3. EXTENDING EXISTING ALGORITHMS 423.3 Customized Sparse Givens ReductionsThe unsuccessful sparse tridiagonalization algorithm of Section 3.2 did not change the basicform of Givens reduction. It eliminates each column’s nonzeros in the same order as thestandard Givens reduction, using the same subdiagonal zeroing entry. The sparse algorithmplaces complete reliance on exploiting matrix sparsity at the transformation level, withoutregard for the position and number of fill entries produced. Moreover, no attempt is madeto utilize the fine grained control over the elimination of nonzeros, and the introduction offill entries, provided by Givens transformations.The flexibility of Givens rotations permits both the order in which a column is zeroed,and the plane of each zeroing rotation, to be modified. (See Section 2.3.) These variableelements of the basic algorithm can be used to construct customized sparse Givens reduction algorithms which attempt to further improve sparsity exploitation. Unfortunately, theexperimentation of Duff and Reid [DR75] shows that, independent of rotation plane selection and the elimination order of each column’s nonzeros, adaptations of Givens columnby column reduction for large sparse matrices generally experience overwhelming levels offill, and matrix sparsity is quickly destroyed. In fact Duff and Reid conclude that if aGivens reduction approach is taken, typically little advantage is made of sparsity and it ispreferable to treat the matrix as dense.3.4 The Tridiagonalization of Banded MatricesIf tridiagonalization algorithms are to effectively utilize matrix sparsity it appears essentialto restrict the accumulation of fill entries to some maintainable substructure of the matrix.Suppose that a symmetric matrix A of bandwidth b, is to be reduced to tridiagonal form.In addition, for the remainder of this subsection assume that the band is dense. We haveshown that applying a column by column Givens reduction to this model problem leads tooverwhelming levels of fill outside the band, quickly destroying matrix sparsity. Alternatively, the algorithm of Rutishauser [Rut63] and Schwarz [Sch7l] (subsequently referred toCHAPTER 3. EXTENDING EXISTING ALGORITHMS 43as the Rutishauser-Schwarz, or simply R-S, algorithm) controls the encumbering effects offill by actively preserving a matrix’s banded structure throughout tridiagonalization. After exploring this algorithm in detail, we briefly describe a closely related algorithm dueto Lang [Lan92]. This algorithm is primarily intended for parallel implementation andsequentially is less efficient than the Rutishauser-Schwarz approach.3.4.1 The Rutishauser- Schwarz AlgorithmThe pseudocode in Figure 3.3 outlines the Rutishauser-Schwarz algorithm. Once again,Givens transformations are used by this band-preserving tridiagonalization algorithm toprovide fine grained control over the elimination process. 
Globally the effect is that theFOR col:= 1 TO n-2 DOFOR diag:= min(b,n-col) DOWNTO 2 DO/*ZeroA := G(col + diag, col + diag — 1, O)T A G(col + diag, col + diag — 1,0)IF bandwidth(A) > b THENAnnihilate bulges with additional adjacent Givens transformations.Figure 3.3: The Rutishauser-Schwarz Algorithmbanded matrix is reduced column by column. Indeed, within each column adjacent transformations eliminate the nonzeros from the outside in. The key difference from Givensreduction is that R-S immediately removes fill created outside the band of the original matrix. The symmetric elimination of a band nonzero produces a pair of fill entries, or bulges,outside the band as illustrated by the Bi entries in Figure 3.4. Before eliminating the nextband nonzero R-S chases the bulges off the end of the matrix with an additional sequenceof adjacent transformations. (See Figure 3.4.) In this fashion the algorithm maintains thebanded structure of the unreduced portion. During the reduction of a typical column k, theelimination of band entry A,k requires E bbi+ll bulge chasing transformations. Adjacenttransformations are used exclusively by the algorithm because nonadjacent rotations createtriangular bulges, consisting of several nonzero entries, which require a larger number ofCHAPTER 3. EXTENDING EXISTING ALGORITHMS 44Hx xX2XXXX 3 X x x[j— BandZeroingX 4 x x x TransformationXX X5 X X XrLG(3,4,91)TAG(3,4,0XXX6XXX—_B1 X X X 7 X X X BulgeChasingX X X 8 X X X TransformationsXX X 9 X X x[f—_ G(6,7,02)TAG(67,9x x x ii x x x j, G(9,1O,93) AG(9,1O,03)\ X x X 12 X X XB4- G(12,13,8)TAG(12,13,011XXX13XXXx x x ii x x G(15,16,0& 3(15,16,05)X X X 15 XFigure 3.4: Bulge Chasing to Preserve Bandwidthbulge chasing transformations.The graph-theoretic model of Section 2.7 provides an alternative view of the bulgechasing process. In terms of our bipartite graph model, the steepest edge in the graphindicates the bandwidth of the corresponding sparse matrix.b= max li—il<i’,j)’EEConsequently, if a reduction algorithm creates a bulge lying outside the original matrix’sband, the corresponding bulge edge <x’, y> satisfies jx—y > b.As a working example, BA1 of Figure 3.5 is the bipartite graph corresponding to thedensely banded example of Figure 3.4. Without impacting subsequent discussion, we omitmany edges corresponding to interior band nonzeros to avoid cluttering the illustrations.BA1 highlights two paths of edges connecting nodes (4’, 7, 10’, 13, 16’) and (4,7’, 10, 13’, 16).Each pair of neighboring nodes in these lists are connected by an edge < i’, j > satisfyingb = i— ii. We refer to these edge sequences as bulge chasing paths for edges <4’, 1 > and< 1’, 4 >. The constituent edges of each path are ancestors of the bulge edges created bybulge chasing sequences accompanying R-S’s elimination of nonzeros A41 and A1,4.CHAPTER 3. EXTENDING EXISTING ALGORITHMS 45Figure 3.5: The Bulge Chasing PathPrior to the elimination of a band nonzero, identifying its associated bulge chasing pathin the bipartite graph is straightforward. More generally, suppose R-S will eliminate edge <i’,j > using an adjacent rotation that unions adjacency sets AdjBA1(i’) and AdjBA1 ((i—i)”).The first edge in the associated bulge chasing path connects node i’ with node k = (i + b) eV. Similarly, the next edge in the bulge chasing path is <1’, k >e E, where 1 = k + b. Theconstruction of the bulge chasing path continues in this fashion until the an edge in thissequence is absent from E. 
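This construction is mechanical and can be automated. The sketch below (an illustration in the style of the symbolic tools, not their code) enumerates the bulge chasing path associated with the elimination of a band nonzero in row i, given the row adjacency sets of the symmetric sparsity structure; one bulge chasing transformation accompanies each edge in the returned path, so a path that terminates early directly reduces the work of the elimination.

# Enumerate the bulge chasing path <i',(i+b)>, <(i+2b)',(i+b)>, <(i+2b)',(i+3b)>, ...
# for a matrix of order n and bandwidth b, stopping at the first edge absent
# from E or falling outside the matrix.  row_adj[r] is the set of column
# indices of nonzeros in row r of the symmetric structure; by symmetry the
# presence of each path edge can be tested from either endpoint.
def bulge_chasing_path(row_adj, i, b, n):
    path = []
    lo = i                                       # smaller endpoint of the next candidate edge
    while lo + b <= n and (lo + b) in row_adj[lo]:
        path.append((lo, lo + b))                # an edge at distance exactly b from the diagonal
        lo += b
    return path

For a dense band the loop only stops once lo + b exceeds n, reproducing the termination behaviour discussed below; for a sparse band the path, and with it the bulge chasing work, may be truncated much earlier.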
For a densely banded reduction, the node, x or x’, terminatingeach bulge chasing path satisfies x + b > m.Continuing our example in Figure 3.5, consider the effects of R-S’s elimination of edges< 4’s, 1 > and < 11,4 > from BA1. Using the modeling techniques of Section 2.7.2, theelimination of these edges unions the adjacency sets of nodes 4’ and 3’, and 4 and 3, creatingthe two bulge edges < 3’, 7 > and < 7’, 3 > shown in bipartite graph BA2. To preservethe bandwidth an adjacent transformation eliminates these bulge edges, creating two newbulge edges < 6’, 10 > and < 10’, 6 > in BA3. The bulge chasing process continues until the8A1 B B% BCHAPTER 3. EXTENDING EXISTING ALGORITHMS 46last bulge edges are eliminated from BA5. The net effect of the bulge chasing sequence is toduplicate all edges in the original bulge chasing paths and drag the endpoints of a copy ofeach edge up one node in the graph. We duplicate bulge chasing path edges, because R-Suses nontrivial transformations for densely banded reductions.Table 3.4 provides formula for the tridiagonalization costs of the Rutishauser-Schwarzalgorithm, computed using the analysis framework of Section 2.4. A comprehensive description of this analysis is presented in [Cav93]. The analysis assumes 2 < b (n/2 — 1)and CR_S is the nonanalytic term Mod(n — 1, b)(Mod(ri.— 1, b) — b), which typically can besafely ignored without incurring large errors. and TRS refer to flop and transformationcounts respectively. In addition to F5 , the construction of each transformation requiresFs TRSI (6b — + 2)n — (6b2 + 5b — + 5)nL+2b3 + 4b2 — b — +3 + ( + 3)CR_S (b-1)(n-1)2+CR_STable 3.4: Tridiagonalization Costs of the Rutishauser-Schwarz Algorithm for a DenselyBanded Matrixone square root. Although TRS is comparable to the number of transformations used byGivens reduction on the same densely banded problem, in general the band-preservingapproach modifies fewer nonzero entries with each transformation. As a result, for matrices of moderate bandwidth F5 is smaller than the floating point requirements of Givensreduction given by FGR.EISPACK’s [GBDM77] implementation of the R-S algorithm, BANDR, does not usestandard Givens transformations. Instead it employs a “root free” variant of the transformation, with identical fill properties, often referred to as a fast Givens transformatiori [Gen73j. (See Sections 2.3 and 2.4.2.) In this case the computational requirements ofthe band-preserving tridiagonalization are reduced to= (4b — + 6)n2 — (4b2 + 3b — + lO)n4b3 5b2 5b 10 10 (3.1)CHAPTER 3. EXTENDING EXISTING ALGORITHMS 47and n square roots. The analysis leading to FK?S ignores the cost of periodic rescaling andassumes that a reduction uses the two types of fast Givens transformations in equal proportion. For either variant of the band-preserving R-S algorithm O(bn) storage is required.We note that the Rutishauser-Schwarz algorithm will not be faster for all bandwidthsthan the best dense matrix algorithm, Householder’s reduction. Assuming that the Householder reduction requires n3 flops [GV89], while R-S requires 4bn2, the dense matrixalgorithm requires fewer floating point operations for bandwidths larger than . The selection of a tridiagonalization algorithm, however, is often strongly influenced by the lowerstorage requirements of the Rutishauser-Schwarz algorithm.3.4.2 Lang’s AlgorithmLang’s column-oriented tridiagonalization algorithm [Lan92J for symmetric banded matrices is closely related to the Rutishauser-Schwarz approach. 
Like the R-S algorithm, Lang's algorithm restricts the accumulation of fill to a maintainable substructure of the original matrix using bulge chasing. Unlike R-S, however, Lang's algorithm uses Householder similarity transformations to simultaneously eliminate multiple nonzero entries. Implementations of Lang's algorithm exploit the symmetry of sparse problems and similarity transformations, only considering transformation modifications to the lower triangular portion of each matrix.
Lang's algorithm begins by reducing the first column of our densely banded model problem to tridiagonal form, using a single Householder similarity transformation. This elimination creates a pair of (b-1) x (b-1) triangular bulges, as shown for the small example in Figure 3.6. Rutishauser [Rut63] points out that the cost of chasing the entire triangular bulge off the end of the matrix is unacceptably high. Alternatively, Lang eliminates only the first column of the lower triangular bulge, using an additional Householder similarity transformation. As shown by Figure 3.7, this elimination creates a second set of triangular bulges. Lang's algorithm continues the reduction by eliminating the first column of the second bulge, creating a third pair of bulges if n/b is sufficiently large. When the first column
In addition, Lang’s implementation is based onoptimized assembly-coded BLAS routines.More recently, work by Bischof and Sun [BS92] generalizes Lang’s algorithm, developinga novel framework for the tridiagonalization of symmetric banded matrices. Bischof andSun’s approach reduces the band of the matrix to tridiagonal form in one or more bandwidth reducing stages. Each stage of their algorithm eliminates the outer d (1 d < b)subdiagonals of the current band using a Lang type algorithm. The only difference betweena partial reduction and Lang’s algorithm is the number of nonzeros eliminated from eachcolumn of the band. By cleverly selecting sequences of partial band reductions, BischofCHAPTER 3. EXTENDING EXISTING ALGORITHMS 50and Sun are able to reduce tridiagonalization storage requirements to R-S levels or improve algorithm complexity by 10—25% relative to Lang. Despite these improvements, thecomplexity of the R-S algorithm remains minimal and of these algorithms R-S is the bestgenerally applicable sequential densely banded tridiagonalization algorithm.3.5 Generalization of Band-PreservingTridiagonalization TechniquesThe discussion of Section 3.4 explored the tridiagonalization of densely banded matrices.For general sparse symmetric matrices we can extend the band-preserving techniques ofRutishauser and Schwarz to form the following two stage sparse tridiagonalization algorithm.1. A : PTAP, where P is a bandwidth reducing permutation matrix.2. Tridiagonalize A using an enhanced form of the R-S algorithm.We begin this section with a short introduction to bandwidth reducing reorderings andsuccessful heuristics for their computation. The following subsection describes an enhancedvariant of the Rutishauser-Schwarz algorithm and explores the reduction characteristics ofthe two stage sparse tridiagonalization algorithm. Finally, we briefly explore a similar twostage sparse reduction based on Lang type algorithms.3.5.1 Bandwidth Reducing Sparse Matrix PreorderingsTo reduce the bandwidth of symmetric sparse matrices we use orthogonal similarity transformations of the form PTAP, where P is a carefully selected permutation matrix. Theproblem of minimizing the bandwidth of a matrix by permuting rows and columns isNP-hard [Pap76] and is known to remain so even for symmetric sparse matrices whoseassociated undirected graph is a tree with all vertices of degree 3 [GGJK78j. Manyheuristic algorithms have been suggested for the identification of bandwidth reducing preorderings [Cut72, CCDG82]. Two of the most widely accepted algorithms are the reverseCHAPTER 3. EXTENDING EXISTING ALGORITHMS 51GPS Bandwidth RCM Bandwidth GPS and RCMSmaller Smaller Bandwidths Equal# of Problems 91 11 13Table 3.5: Comparing GPS and RCM Preorderings for the Harwell—Boeing Test SuiteCuthill-McKee (or RCM) [GL78b] and Gibbs-Poole-Stockmeyer (or GPS) [GPS76a, Lew82]algorithms. The experience of Gibbs et al [GPS76b] suggests that GPS most frequentlyprovides the smallest bandwidth preorderings over a wide range of problems. Our experimentation with Section 2.2’s test suite of 115 Harwell—Boeing test problems, summarized inTable 3.5, supports this conclusion. Consequently, for experimentation we typically selectGPS and often reference it as the bandwidth reducing preordering in subsequent discussion.3.5.2 The Sparse Rutishauser-Schwarz AlgorithmThe two stage tridiagonalization algorithm takes limited advantage of matrix sparsity ifR-S treats the band of the preordered matrix as dense. 
In this case complete reliance isplaced upon the preordering algorithm to exploit sparsity, but for many problems the bandof the preordered matrix is relatively sparse prior to reduction. (See Table 3.6.) To takeadvantage of band zeros, three modifications could be made to the basic band-preservingtridiagonalization algorithm.1. Avoid constructing and applying transformations to eliminate band or bulge entriesthat are already zero.2. Exploit zeroing entries (see Section 2.3) that are zero by performing row and columnexchanges instead of using the general form of the Givens transformation.3. Apply each nontrivial transformation to only those lower triangular entries whosecolumn(row) index is in the unioned sparsity structure of the two modifiedrows (columns)CHAPTER 3. EXTENDING EXISTING ALGORITHMS 52Both Schwarz’s code [Sch7l] and EISPACK’s BANDR [GBDM77] check if the bulge orband entry is already zero before performing an elimination. These codes, however, areprimarily intended for densely banded matrices and neither incorporates the second or thirdmodification. Enhancing the two stage tridiagonalization algorithm by all three sparsitymodifications produces a new algorithm subsequently referred to as the sparse RutishauserSchwarz algorithm or simply sparse R-S.When the band of the permuted matrix is sparse, sparse R-S enjoys an additionalcomputational advantage. As shown in Section 3.4.1, we can identify the bulge chasingpath associated with a particular band nonzero Ak prior to its elimination. A,k’s bulgechasing path consists of edges in the sequence<i1, (i + b) >, < (i + 2b)’, (i + b) >, < (i + 2b)’, (i + 3b) >,...and terminates when the next edge in the series is absent from E. The RutishauserSchwarz algorithm uses one bulge chasing transformation for each edge in the path duringA,k ‘s elimination.For a densely banded reduction a bulge chasing path always terminates at a node, xor x’, satisfying x + b > n. When the band is sparse, however, the bulge chasing pathmay terminate before this condition is satisfied, reducing bulge chasing requirements. Forexample, consider R-S’s elimination of A4,1 from the small sparsely banded matrix (b = 3)in Figure 3.8. The bulge chasing path in the associated bipartite graph starts with edge<41,7>. Because < 101,7 > E, the bulge chasing path terminates at node 7 and A4,1’selimination requires a single bulge chasing transformation. In contrast, the elimination ofthe same entry from a densely banded matrix of equal dimension and bandwidth requiresfour bulge chasing transformations.Although the R-S algorithm removes fill created outside the band with bulge chasingtransformations, it allows fill entries within the band to accumulate. Unfortunately, theunreduced portion of a typical sparse matrix’s band fills quickly during band-preservingtridiagonalization and there is little opportunity for enhancements exploiting band sparsity.As an example, consider applying sparse R-S to the 5-point model problems of Section 3.1.CHAPTER 3. EXTENDING EXISTING ALGORITHMS 53:Ex!12 XX X13 XX14X 15XX X16AFigure 3.8: A4,1’s Truncated Bulge Chasing PathAfter b— 1 columns of a 5-point problem have been reduced to tridiagonal form, the remainder of the band is completely filled in. The preponderance of fill within the band is largelydue to the seqnences of bulge chasing transformations required by the elimination of bandnonzeros. Once the matrix’s band has been filled there is no further opportunity to exploitsparsity beyond the densely banded form of the remaining submatrix. 
Consequently, theflop and transformation requirements of a 5-point problem’s reduction are identical in thehighest order terms to FR°s and TR5 for a densely banded matrix of equivalent order andbandwidth.The speed with which the band of a 5-point problem fills is typical of most large symmetric problems. Table 3.6 provides fill data for three additional sparse problems, fromSection 2.2’s sparse matrix test suite, preordered by GPS and tridiagonalized by Trisymb’ssymbolic implementation of sparse R-S. Despite starting with relatively sparse bands, theunreduced portion of each problem’s band is completely filled well before b columns areBACHAPTER 3. EXTENDING EXISTING ALGORITHMS 54Problem n bandwidth Initial # of columns (rows)Off-Diagonal eliminated before band isBand Density full.5-Pt Problems b2 b ‘—‘ 2/b b-iPLAT1919 1919 80(GPS) 10.1% 31NOS3 960 65(GPS) 12.4% 19BCSSTKO9 1083 95(GPS) 8.8% 18Table 3.6: Band Filling Characteristics of Sparse R-Sreduced to tridiagonal form.Not all sparse symmetric problems can be permuted to obtain a relatively small bandwidth. As for densely banded matrices, a computational trade-off exists between the denseGivens or Householder and the band-preserving sparse R-S reduction algorithms. Given thespeed with which a typical sparse band fills, the transition point for most sparse problemsmust be close to the b = m/3 value observed for densely banded matrices. Once again,however, the lower storage requirements of sparse R-S may influence algorithm selection.In summary, typically high levels of fill severely limits the number of opportunities forsparse R-S to utilize band sparsity. In general, sparse R-S is almost completely reliant onbandwidth reducing preorderings to exploit sparsity and the second stage of the algorithmis only slightly superior to a densely banded R-S tridiagonalization approach.3.5.3 Sparse Band Extensions of Lang’s AlgorithmSection 3.4.2 describes Lang’s reduction algorithm for the tridiagonalization of denselybanded matrices. This section explores the potential of extending Lang’s techniques tocreate a sparse R-S like two stage reduction algorithm for general sparse symmetric matrices.Once again, the algorithm’s first stage preorders the sparse matrix to reduce bandwidth,while the second stage of the tridiagonalization employs a modified form of Lang’s algorithmthat attempts to exploit band sparsity. To improve the construction and application ofHouseholder transformations for a sparse band, we envisage modifying Lang’s algorithmwith changes analogous to the first and third enhancements of R-S made in Section 3.5.2.0D)_JcrrJDci)Cl)CccC0(1)0CocDbDz---xx:Ixxx><x2xxxx><xx><x><x><xccxxxxxXXr-xxx)<U)><xxx‘x‘)xx:±::::II /Ncx—j0)xxx><xj7x><><><><)<-xx0U)U)X><><xxxxxx0U)xxxXXxxx>.——cxXxx!xXxxU)<xxxx2xxcoo0U)U)U)cXXXXx>xxc-—-0U)U)xXXxx>(U)0><X><x)<><><x<mU)0%XXxxOcxxxxU)U)U),XXX><U)XX)<)<‘O000,xXxx-Xxxx0(;DXXX><U)XXU)XX)<X)mU)00X’XXXX’%U)U)U)X‘)XXX)<>(O000XcX00r:--±---C)X’><XXX-‘xxxXXXXXXX><)<XXX2xxx>(xxx2xxx.‘xxxXXX??xx/xxxxx‘xXXo’XXx)<)<XXU)XXX,øXc—xXXU)Xx/—0U)XX/0X’Xx,---XcX)<)<Xcx0p-CHAPTER 3. EXTENDING EXISTING ALGORITHMS 56To overcome its inherently higher algorithmic complexity, the sparse Lang algorithmmust significantly improve the exploitation of band sparsity relative to R-S. Discussion inSections 2.3 and 2.7.3, however, has shown that a Householder transformation generallyproduces more fill entries than equivalent sequences of Givens transformations. 
As a result,the sparse Lang algorithm typically produces fill in the unreduced portion of the bandeven more rapidly than sparse R-S. Figure 3.9 illustrates two partial reductions of a smallsparse matrix, A, which we assume has been preordered to reduce bandwidth. Matrix Billustrates the relatively sparse band remaining after R-S has reduced the first two columnsof the band to tridiagonal form. The unreduced portion of the band does not become denseuntil R-S eliminates the third band nonzero from column 5. In contrast, reducing the firsttwo columns of the band with Lang’s algorithm completely fills the remainder of the bandand there is no further opportunity to exploit band sparsity.The results of this small example generalize to large practical sparse problems. Althoughsparse R-S has difficulty exploiting band sparsity, sparse extensions of Lang’s algorithmexhibit more prolific fill characteristics and take even less advantage of band sparsity. Inthe following chapter we introduce an alternative to sparse R-S, which more successfullyutilizes band sparsity to significantly improve reduction efficiency.Chapter 4Bandwidth ContractionAlgorithms for SparseTridiagonalizationAs shown in the previous chapter, attempts to adapt Givens Reduction for the tridiagonalization of sparse symmetric matrices are confronted with overwhelming levels of fill entriesthat quickly destroy matrix sparsity. Alternatively, the sparse R-S algorithm accepts theinevitability of fill entries, but actively restricts their accumulation to the band of a permuted matrix. Unfortunately, the band typically fills quickly and sparse R-S places almostcomplete reliance on the preordering algorithm to exploit matrix sparsity. This chapterpresents alternative approaches to sparse tridiagonalization. They also use bandwidth reducing preorderings and band-preserving reduction techniques, but reorder the eliminationsequence to more fully exploit internal band sparsity.We begin this chapter with the development of our Bandwidth Contraction algorithm forthe tridiagonalization of symmetric sparse matrices. Combining this successful algorithmwith the sparse R-S algorithm, we then produce an effective hybrid tridiagonalization algorithm. The following section describes key aspects of the numerical implementations of bothalgorithms. Using these implementations, the chapter’s final section describes extensive experimentation with the Bandwidth Contraction and hybrid tridiagonalization algorithms.In comparison to the Rutisbauser-Schwarz approach, both algorithms are shown to dra57CHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 58matically reduce the computational requirements of sparse tridiagonalization.4.1 Bandwidth ContractionThe sparse Bandwidth Contraction algorithm, or BC, is similar in many respects to sparseR-S. Both algorithms employ bandwidth reducing preorderings and constrain the accumulation of fill entries to a maintainable substructure of the original matrix. The BandwidthContraction algorithm, however, improves the efficiency of sparse tridiagonalization by rearranging the elimination of nonzeros to exploit a commonly observed characteristic ofsparsely banded matrices.4.1.1 MotivationThe sparse tridiagonalization techniques explored in this section are motivated by the following observation. A bandwidth reducing preordering frequently produces a permutedmatrix whose profile consists of varying length spikes of nonzeros extending from the maindiagonal. 
The longest spike defines the bandwidth of the permuted matrix, as shown inFigure 4.1. For many practical problems the spikes of the permuted matrix are of dramatically different length. As an example, the black dots in Figure 4.2’s plot illustrate thenonzero sparsity structure of the Harwell—Boeing problem CAN 268 preordered by GPS.The new sparse tridiagonalization approach attempts to exploit variation in spike length.Although fill cannot be avoided, the bandwidth of the matrix could be significantly reducedwith relatively few transformations, if the ends of the longest spikes could be clipped off atlow cost before the contracted band becomes dense.4.1.2 The Sparse Bandwidth Contraction AlgorithmThere are a number of ways that the ends of the profile’s longest spikes could be clippedoff with Givens transformations. One way to approach the clipping process is to rearrangethe elimination order of a band-preserving tridiagonalization so that the matrix is reducedto tridiagonal form diagonal by diagonal (outermost diagonal first), rather than columnCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 59Figure 4.1: Matrix Bandwidth and Spike Length0 50 100 150 200 250nz = 3082Figure 4.2: The Sparsity Structure of CAN 268 with a GPS PreorderingCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 601. A : PTAP, where P is a bandwidth reducing permutation matrix.2. b := bandwidth(A)3. FOR b : b DOWNTO 2 DO /*Tridiagonalize A.*/FORcol:=lTOn-bDOIF A01+ 0 THEN /*ZeroA0i+.*/IFA =OTHENcol+b—1,colExchange rows/columns (col + b) and (col + b — 1) in A.ELSEA := G(col+,coi+_1,9)T AG(col+,col+b— 1,0)(Exploit band sparsity of modified rows and columns.)IF bandwidth(A) > b THENChase bulges with additional adjacent Givenstransformations or row/column exchanges.ENDIF /*Outermost IF*/Figure 4.3: The Sparse Bandwidth Contraction Algorithmby column as in sparse R-S. A diagonally-oriented band-preserving tridiagonalization fordensely banded symmetric matrices has been previously considered [Sch63, Wi165], but wassuperseded on sequential machines by the Rutishauser-Schwarz’s column-oriented reduction [Rut63, Sch7l]. (See Section 4.2.1 for a detailed comparison of algorithm complexity.)To our knowledge, however, no one has considered the relative merits of the two reductionparadigms extended for general application to sparse symmetric matrices. (For denselybanded matrices, the LAPACK [ABB92] project considered the relative merits of thesetwo algorithms implemented on vector machines.)As shown in Figure 4.3, our sparse Bandwidth Contraction algorithm (BC) uses thediagonally-oriented spike clipping process to completely tridiagonalize a sparse symmetricmatrix. To exploit band zeros, the three modifications made to the basic R-S algorithm inSection 3.5.2 are also incorporated into the Bandwidth Contraction algorithm.The Bandwidth Contraction algorithm begins by symmetrically permuting the matrixto reduce bandwidth. Then, starting with Ab+1,1, the band’s outermost diagonal is scannedCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 61for its first nonzero entry. This entry is eliminated using either an adjacent Givens transformation or a row/column exchange, depending upon the nonzero status of the zeroingentry. If a nonzero entry is created beyond the current bandwidth, the bulge is chased offthe end of the matrix as described in Section 3.4.1. The scanning and reduction of theoutermost diagonal continues until all nonzeros have been eliminated. 
At this point thecurrent bandwidth of the matrix is reduced by one and the reduction process continueswith the next diagonal.As for the sparse R-S algorithm, Bandwidth Contraction uses adjacent transformationsexclusively, avoiding the creation of multiple entry bulges and the concomitant extra bulgechasing transformations. The additional bulge chasing transformations not only increasecomputational costs, they also accelerate the introduction of fill entries into the band. Anadded complication for a diagonally-oriented tridiagonalization is that nonadjacent transformations may reintroduce fill entries in previously zeroed positions of the diagonal, severelyrestricting the utility of this transformation class.The small example in Figure 4.4 demonstrates some of the potential difficulties associated with nonadjacent transformations. When the Bandwidth Contraction algorithmeliminates entry A5,1 and chases the associated bulge, it uses one row/column exchange andtwo nontrivial adjacent transformations, producing matrix B. Alternatively, suppose wedecide to eliminate A5,1 with the nonadjacent transformation G(2, 5, O)TAG(2, 5, 0) to avoidhaving to eliminate a nonzero from column 1, and chase its associated bulge, during thenext diagonal’s reduction. Application of this transformation creates a three entry bulgeas shown in intermediate matrix C. Chasing this multiple entry bulge off the end of thematrix to regain the matrix’s original banded form requires 11 additional nontrivial transformations. As shown by matrix D, these transformations completely fill the band beyondrow and column 5. Consequently, using a nonadjacent transformation in this manner isdefinitely not cost-effective.This discussion is not intended to completely rule out the use of nonadjacent transformations. There are special sparsity patterns for which nonadjacent transformations areCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMSDtFigure 4.4: A Nonadjacent Transformation Example62beneficial. Future study will explore the potential role of nonadjacent transformations inmore sophisticated or special purpose bandwidth contraction algorithms.4.1.3 A Demonstration of Bandwidth ContractionTo illustrate the potential effectiveness of Bandwidth Contraction, we have manipulated thesmall contrived example in Figure 4.5 with the symbolic reduction tool Xmatrix. The topmatrix in Figure 4.5 shows the original sparse matrix, A, with its nonzero entries indicatedby “X”s and the numbered diagonal. It is assumed that A has already been permutedto reduce its bandwidth to 6. Two additional matrices, C and D, illustrate A after aNon-adjacenttransformationA=B=1X XX2 X3 X4- XX 5XXXXX X6 XX X 7 XXX 8 XX 9 XX 10 XX 11X 12X 13X 14One step of BC.,,1X X0X2 X3 XX 4 XXX0 5 XXX X 6 XXX 7 XXX 8X XX X9 XXX 10 XX 11XX 12XX X13X 141X 0X2 XXBBB3 X4 XOX 5XXXXX X6 XBX X 7 XB XX 8 XB X 9 XX 10 XX 11X 12X 13X 14Chase the multipleentry bulge.1X 0X2 XXXX4 XXXOX 5XXXXXXXX 6 XXXXXXXX 7 XXXXXXXX 8XXXXXXXX9 XXXXX X X X 10 X X X XX X X X 11 X X XX X X X 12 X XX X X X 13 XX X X X 14CHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 63partial reduction by sparse R-S or Bandwidth Contraction. In both C and D a “0” marksthe positions of eliminated band nonzeros. Finally, reported flop counts assume that bothreductions employ fast Givens transformations.Matrix C illustrates A after sparse R-S has reduced its first three columns to tridiagonalform. Despite the highly sparse nature of the original problem, the remainder of the bandis almost completely filled. 
The entire tridiagonalization uses 8 row/column exchanges and132 nontrivial transformations, requiring a total of 7232 flops.Matrix D illustrates A after the Bandwidth Contraction algorithm has eliminated thethree outermost nonzero diagonals and contracted the bandwidth to 3. Although the elimination of nonzeros once again produces fill entries within the band relatively quickly, thealgorithm is able to efficiently exploit the sparsity of the band away from the main diagonal.For example, the Bandwidth Contraction algorithm eliminates the entire 6th and 5th sub-diagonals of the band at the relatively low cost of 216 flops, using 7 row/column exchangesand 4 nontrivial transformations. The complete tridiagonalization uses 12 row/columnexchanges and 163 nontrivial transformations, requiring a total of 6537 flops. For this example, the computational requirements of tridiagonalization with Bandwidth Contraction,as measured by flop counts, are approximately 9.6% lower than for the sparse R-S approach.It is important to note that the number of nontrivial transformations used by a tridiagonalization is a misleading metric of algorithm performance. The Bandwidth Contractionalgorithm requires more transformations, but generally fewer nonzeros are modified by eachnontrivial transformation, permitting a lower total flop count.The key to the success of the Bandwidth Contraction algorithm is the elimination ofthe outermost diagonals at low cost. As the result of several contributing factors, Bandwidth Contraction is able to exploit and prolong the advantages of sparsity in the outermostsubdiagonals. First, when BC requires nontrivial adjacent transformations, they are oftenapplied to well separated pairs of rows and columns, insulating the effects of fill from onetransformation sequence to the next. In contrast, sparse R-S’s initial band zeroing transformations and associated bulge chasing transformations are applied to groups of neighboringL)CC(3Ccr1IC)CHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 65Figure 4.6: Cascading Fill Entriesrows and columns. These initial transformations produce a cascade of fill entries, typicallynot observed for BC, which quickly fills the band despite a matrix’s initial sparsity.As an example, bipartite graphs BA1 through BA5 in Figure 4.6 model the sparsitystructures of Figure 4.5’s example just before sparse R-S symmetrically eliminates nonzerosA8,2 A7,2 A6,2 A52, and A4,2 respectively from column 2. The final bipartite graph,BA6, models A’s sparsity structure after the completion of column 2’s reduction. GraphsBA1 to BA5 each highlight the edges corresponding to the band nonzero pair selected forelimination, their bulge chasing paths, and the pairs of nodes modified by the associatedband zeroing and bulge chasing transformations. We observe that in neighboring bipartitegraphs the pairs of highlighted nodes share a common node. Consequently, nonzeros areBA1 BA2 BA3 BA4 BA5 BCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 66carried from one transformation sequence to the next, either through adjacency set unionor exchange, producing a cascade of fill entries. Comparing BA1 and BA6, the eliminationof column 2 substantially reduces band sparsity, introducing 48 new band nonzeros.In addition to reducing instances of fill cascading, BC’s diagonally-oriented reductionrequires fewer bulge chasing transformations for initial band nonzero eliminations. 
As BC eliminates the outermost diagonal, bulge chasing sequences must shorten, while the length of the sequences used by sparse R-S remains relatively constant as it reduces the first few columns. Consequently, the initial stage of the Bandwidth Contraction algorithm produces fewer band fill entries.

Avoiding fill entries prolongs the survival of sparsity in the outermost diagonals and improves the efficiency of BC's reduction in several ways. First, BC must eliminate fewer band nonzeros to effect each diagonal's reduction. In addition, the sparsity of the outermost diagonals permits Bandwidth Contraction to cheaply eliminate many band nonzeros and associated bulges, without fill, using row/column exchanges. Finally, the sparsity of the outermost diagonals often produces truncated bulge chasing paths (see Section 3.5.2), reducing transformation requirements and fill production.

In summary, the advantages of BC's sparse reduction outlined in the preceding discussion often allow it to perform a partial tridiagonalization that significantly contracts a matrix's bandwidth at low cost before producing a densely banded intermediate matrix. In general, the relative success of the sparse tridiagonalization algorithms depends on problem-specific sparsity structures. The extensive experimental analysis of Section 4.4 confirms the relative advantage experienced by the Bandwidth Contraction algorithm for many practical sparse problems.

4.2 A Hybrid Tridiagonalization Algorithm

The goal of this dissertation is to produce generally applicable sparse eigenvalue routines. Unfortunately, the computational requirements of sparse Bandwidth Contraction are not smaller than those of sparse R-S for every problem's banded sparsity structure. In this section we combine the individual strengths of each reduction algorithm to produce the versatile Hybrid Bandwidth Contraction tridiagonalization algorithm, or HYBBC*, which is suitable for the reduction of a broad range of sparse symmetric matrices.

4.2.1 Motivation

Two metrics of tridiagonalization algorithm cost are the flop and transformation counts F and T. As demonstrated in the previous section, the Bandwidth Contraction algorithm may be able to significantly reduce the bandwidth of a sparsely banded matrix at relatively low cost. Consequently, for sparsely banded matrices Bandwidth Contraction flop counts are smaller than for sparse R-S, despite larger transformation counts. If the band of a matrix is dense, however, Rutishauser and Schwarz's column-oriented band-preserving tridiagonalization is superior in both measures of work.

Table 4.1 provides formulae for the tridiagonalization costs of the fast Givens Bandwidth Contraction variant applied to a densely banded, symmetric matrix of bandwidth b. Using Section 2.4's analysis framework, we provide a comprehensive analysis leading to these results [Cav93].

Table 4.1: Tridiagonalization Costs of the Bandwidth Contraction Algorithm for a Densely Banded Matrix.

The analysis assumes that b < (n + 1)/2 and that an equal proportion of the two types of fast Givens transformations are employed. The analysis ignores the potential cost of the periodic rescaling required by fast Givens transformations. CBC is the nonanalytic term Mod(n, k)(k - Mod(n, k)). When b << n the Mod(n, k) terms can be safely ignored without incurring significant errors.
Comparison of FK and FL,from Equation 3.1, shows that the flop reqnirements of the Bandwidth Contraction algorithm are larger than for the R-S algorithm applied to the same problem. To demonstrate*Jn [Cav94j we also refer to this algorithm with the acronym BANDHYB.CHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 68(BC flops)/(R-S flops) (BC trans.)/(R-S trans.)1.051100 200 300 400 1 100 200 300 400Bandwidth BandwidthFigure 4.7: The Flop and Transformation Requirements of BC relative to R-S for a DenselyBanded Matrix, n=1000.the potential difference in tridiagonalization costs, the first graph in Figure 4.7 plots theflop requirements of Bandwidth Contraction, normalized by F9 , against bandwidth for adensely banded matrix.While FB is typically 10 to 25% larger than F9 , for problems with nontrivial bandwidth the difference between TBC and TRS can be much greater, as shown by the secondgraph in Figure 4.7. At first glance there may seem to be an inconsistency in our analysis.As mentioned in Section 4.1.3, however, generally transformation counts are a misleadingmetric of tridiagonalization costs. F9 and FB are closer than predicted by TBC and TRSbecause as Bandwidth Contraction reduces a matrix’s bandwidth, the number of nonzerosmodified by each transformation generally declines. The computational effort of applyinglater transformations is reduced, while for the R-S algorithm the cost of each transformationremains relatively constant.As an aside to our discussion of sequential algorithms, vector machines complicate therelative efficiency analysis of tridiagonalization algorithms. In general, vector machinesput more weight on T relative to F. During the development of LAPACK’s [ABB92]replacement for BANDR, SSBTRD, several band-preserving tridiagonalization algorithmswere tested on vector machines [DC92J. In very general terms, for small n (less than 50) ormatrices with moderate bandwidth (20 < b < 50), it was found that vectorized code basedCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 69on a diagonally-oriented elimination is the fastest approach. For other densely bandedmatrices, variants of a column-oriented tridiagonalization are more efficient. Emphasizingthe importance of good performance for large n and small bandwidth, SSBTRD is basedon the column-oriented, vectorized algorithm of Kaufman [Kau84j.4.2.2 The Hybrid Bandwidth Contraction AlgorithmThe observations of the previous subsection suggest a hybrid tridiagonalization algorithm.While the band of the intermediate reduction matrix remains sufficiently sparse, the hybridalgorithm employs the sparse Bandwidth Contraction scheme. When it is found that BCcan no longer efficiently exploit band sparsity, the reduction switches to sparse R-S tocomplete the tridiagonalization. To avoid redundant elimination of band nonzeros, thehybrid algorithm always completes a nonzero diagonal’s reduction before switching to sparseR-S.The most sensitive design issue for this hybrid algorithm is the selection of a generallyapplicable strategy to govern the transition between the algorithm’s two stages. Ideally thehybrid algorithm should switch to sparse R-S at the transition bandwidth, bt, minimizingthe following problem dependent cost function.Cost_BC(b —* bt) + Cost_R-S(b —* 1)The matrix’s original bandwidth is b and functions Cost_BC() and CostR-S() representthe computational requirements of BC and sparse R-S performing the indicated reductionin bandwidth. 
Using fill level monotonicity arguments, it is not difficult to see that atransition strategy could minimize this function by comparing the following reduction costsbefore each diagonal’s elimination. (bC represents the bandwidth of the reduction’s currentintermediate matrix.)Cl = Cost_BC(bc (bC— 1)) + Cost_R_S((bc — 1) 1)C2 = Cost_RS(bc 1)As long as Cl < C2 the hybrid algorithm should continue the reduction with the sparseBandwidth Contraction algorithm.CHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 70When the band of the intermediate matrix is dense, we can determine the flop components of Cl and C2 exactly using the analysis results of Sections 3.4.1 and 4.2.1. While zeroentries remain in the band, however, these analyses do not apply and accurate resolutionof the cost comparison is difficult. The computational requirements of BC and sparse R-Sare influenced by both the nonzero density of the band and problem specific sparsity structures. The design of transition strategies that take into account a band’s particular sparsitystructure is hindered by a lack of model problems that exhibit all significant sparsity characteristics of practical problems. In addition, a formal analysis of BC or R-S’s reduction ofproblems with general sparsity patterns is not feasible. Alternatively, we consider transitionstrategies that provide an approximate solution to the comparison of costs Cl and C2 bythresholding some measure of band density or “fullness”.There are many possible metrics for measuring the fullness of the contracted band. Themost obvious choice is to directly monitor the number of nonzero entries. Maintaining suchdetailed knowledge of the band’s sparsity pattern, however, may require significant levelsof overhead. Instead, we suggest regulating the transition between Bandwidth Contractionand sparse R-S by a threshold on the number of nonzero entries in the outermost nonzerosubdiagonal. The transition is made when the number of nonzeros is greater than somefraction of the subdiagonal’s length. As shown below, monitoring the number of nonzerosin the next subdiagonal can be cheaply integrated into Bandwidth Contraction. Equallyimportant, this transition regulation technique provides a good approximation of measuringtrue band density. As shown in Section 4.3.2, sparsity within the band is best exploited atthe transformation level, which is controlled to a large extent by the number of zeros in theoutermost diagonal. In addition, if BC continues the reduction once the outermost diagonalis full, or nearly so, each successive subdiagonal eliminated will have a similar density andthe band will quickly fill.The pseudocode in Figure 4.8 describes the Hybrid Bandwidth Contraction tridiagonalization algorithm (HYBBC). The second stage of the hybrid algorithm is identical to thedescription of sparse R-S in Section 3.5.2, except it does not perform an addition preordering. Let threshold be the fraction of the outermost diagonal that must be nonzero before theCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 711. A : PTAP, where P is a bandwidth reducing permutation matrix.2. b bandwidth(A)3. /Initialize nzcnt for the outermost diagonal.*/mzcnt := 0FOR i:= 1 TO n-b DOIF A+b, 0 THEN nzcnt:== nzcnt+l4. (a) b’H=b(b) /*While the matrix is not tridiagonal and the threshold has not been*// *met, eliminate the outermost nonzero diagonal.*/WHILE ( (bC > 2) AND (nzcnt < (threshold * (n — bC))) ) DOi. rizent 0ii. 
FOR col := 1 TO n — bC DOIF AQl+bc,0j 0 THEN /*Zero Acoz+bc,col.*/IF Acol+bc_1,col = 0 THENExchange rows/columns (col + b’) and (col + bC — 1) in A.ELSEA := G(col+bc,col +bc — i,8)T A G(col+bc,col + if — 1,0)(Exploit band sparsity of modified rows and columns.)IF bandwidth(A) > bC THENChase bulges with additional adjacent Givenstransformations or row/column exchanges.ENDIF /*Outermost IF*/IF AQl+bc_1,01 0 THEN nzcnt:= nzcnt+1iii. IF A,_bc+1 0 THEN nzcnt:= nzcnt+1iv. bC bC — 1(c) IF bC> 1 THEN complete tridiagonalization with sparse R-S.Figure 4.8: The Hybrid Bandwidth Contraction Tridiagonalization AlgorithmCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 72transition to sparse R-S is made and let nzcnt be the number of nonzeros in the next sub-diagonal. The algorithm is able to check the nonzero status of entry Acol+bc_1,cQl after theelimination of entry AcQl+bc,coj because the entries of row (bC + col — 1) will not be modifiedagain during the reduction of the current outermost diagonal. The resource requirementsof the band density metric are minimal—one additional integer variable, and during eachdiagonal’s reduction n — bC comparisons and at most n — bC + 1 integer operations.4.2.3 Performance of the Hybrid Tridiagonalization AlgorithmConsider the application of the Hybrid Tridiagonalization algorithm to matrix A of Figure 4.5 with a threshold of 0.85. In the first stage of the tridiagonalization BC reducesthe three outermost nonzero subdiagonals, producing matrix D of Figure 4.5. The HybridTridiagonalization algorithm then transfers control to sparse R-S to complete the reduction to tridiagonal form. Table 4.2 summarizes the computational requirements of all threesparse tridiagonalization algorithms, assuming the use of fast Givens transformations. TheRow/ColumnMethod Nontrivial Transformations FlopsExchangesSparse R-S 8 132 7232Bandwidth Contraction 12 163 6537Hybrid Tridiagonalization 12 136 5880Table 4.2: Tridiagonalization Summary for a Small Sparse Examplehybrid algorithm requires approximately 19% and 10% fewer floating point operations thansparse R-S and Bandwidth Contraction respectively. It is interesting to note that the totalnumber of nontrivial transformations required by the Hybrid Tridiagonalization algorithmis significantly lower than for Bandwidth Contraction and only marginally higher than forsparse R-S.The test problem of Figure 4.5 is obviously a trivial example. The experiments describedin Section 4.4, however, show that the Hybrid Tridiagonalization algorithm dramaticallyreduces the computational requirements of sparse tridiagonalization for a wide range ofCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 73sparse problems.4.3 Numerical Implementations of BC and HYBBCThis section describes features of numerical implementations of the Bandwidth Contraction and Hybrid Bandwidth Contraction tridiagonalization algorithms. Following a generaldescription of both implementations, we study the ability of individual nontrivial Givenstransformations to exploit band sparsity. The outcome of this study influences the sparsitytechniques employed by both implementations and dictates the form of the banded datastructure used by the routines. 
Finally, we outline new rescaling techniques for implementations using fast Givens transformations to effect diagonally-oriented reductions.4.3.1 Implementation BasicsThe first stage of both BC and HYBBC preorders a sparse problem to reduce bandwidth.We rely upon existing preordering algorithm implementations to conduct this phase of thereduction and concentrate upon the implementation of each tridiagonalization algorithm’ssecond stage. The implementation of the Bandwidth Contraction algorithm was createdby rewriting EISPACK’s FORTRAN routine BANDR (an R-S code) to perform a sparse,diagonally oriented, band-preserving tridiagonalization. Using this new routine, we implemented the Hybrid Bandwidth Contraction algorithm by augmenting the BC routine withthe described thresholding strategy, and a transition to a modified version of BANDR thatomits initializations. The hybrid algorithm switches to a column-oriented scheme when theband is dense, or nearly so. Given the speed with which a typical sparse band fills duringan R-S reduction (see Section 3.5.2), using a sparse R-S code for this portion of HYBBCis not warranted. Otherwise the implementations of BC and HYBBC closely follow thealgorithms in Figures 4.3 and 4.8 with one exception. That is, based on the outcome of thefollowing section’s study, we implement the Bandwidth Contraction algorithm with oniytwo of the three sparse algorithm modifications listed in Section 3.5.2.CHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 744.3.2 Dense Versus Sparse Band TrmsformationsUnlike the first two sparsity modifications listed in Section 3.5.2, exploiting the sparsity ofa pair of rows or columns during the application of a transformation requires significantoverhead. To determine if the potential savings are worthy of the increased overhead, weconduct experiments with the symbolic reduction tool Trisymb. For 15 larger problems,Table 4.3 compares the flop requirements of a HYBBC variant that fully exploits the unionedsparsity structure of modified rows and columns, with a second HYBBC variant that treatsthe band as dense while applying a nontrivial transformation. Accounting procedures differbetween the two simulations, but sparsity structures of the intermediate reduction matricesare identical. The simulated reductions model fast Givens transformations exclusively. Thebandwidth of the original matrix preordered by GPS is given by bGPS and bt is the transitionbandwidth.Going from dense to sparse transformations, savings of 12—21.5% in the BandwidthContraction portion of the reduction are observed for 3 problems, but for the remainingmatrices savings are less than 5%. Considering the cost of the entire hybrid tridiagonalization, the potential savings of sparse transformations are very small. The largest reductionis 1.1% and for the remaining problems the potential savings are 1% or lower. If standard Givens transformations replace fast Givens transformations, the results are essentiallyunchanged.This study clearly shows band sparsity is best exploited at the transformation level byidentifying entries that are already zero or that the algorithm can eliminate with an adjacentrow/column exchange. Considering the storage and computational overhead required by asparse data structure, performing sparse transformations is not beneficial to BC or HYBBCperformance, and will not be pursued by the sparse tridiagonalization implementationsdescribed in this dissertation. 
In future work, however, sparse transformations will be reevaluated for the special case in which a partial bandwidth contraction is the end goal.

4.3.3 A Densely Banded Data Structure

As a consequence of the preceding section's study, our implementations of BC and HYBBC keep the densely banded data structure of BANDR. By exploiting the symmetry of sparse problems and similarity transformations, these algorithms need only consider transformation modifications to the lower triangular portion of each matrix. The main diagonal and each subdiagonal of the band's lower triangular portion is stored in a separate column of an n x (b + 1) double precision array. The storage of bulge entries does not necessitate an additional column of storage. Because the storage requirements of BANDR are essentially identical to those of the BC and HYBBC implementations, the experimental analysis of Section 4.4 concentrates on the CPU requirements of each routine.

4.3.4 Rescaling and Fast Givens Transformations

To improve efficiency BANDR and our implementations of BC and HYBBC use fast Givens transformations in place of classical Givens transformations. As shown in Equation 2.7, each fast Givens transformation applied to the matrix under reduction must also update a diagonal matrix, D, associated with the reduction. At the end of the reduction D updates the tridiagonal matrix to complete the sequence of fast Givens transformations. (See Equation 2.8.)

If the fast Givens transformation M^T A M modifies rows and columns i and j of A, the updated diagonal matrix, D̄, takes the following form. Assuming D = Diag(d1, ..., dn),

    D̄ = M^T D M,                                    (4.1)

where only the ith and jth diagonal entries of D are altered. As shown in Section 2.4.2, M can take one of two forms. By carefully selecting the appropriate fast Givens transformation type, we can ensure each transformation's "growth factor" (1 + γ) is bounded by 2. Of course, modifying one of D's diagonal entries by s transformations may result in its growth by a factor of 2^s. For reductions requiring large numbers of transformations, periodic rescaling of D is necessary to avoid overflow. The remainder
BANDWIDTH CONTRACTION ALGORITHMS 77of this subsection provides a general outline of the rescaling techniques employed by theR-S, BC and HYBBC implementations. As discussed in Section 2.4.2, to permit directcomparison of our codes to BANDR we do not employ the recently developed fast planerotations {AP94] that avoid explicit periodic rescaling. Implementing BC and HYBBC andalgorithms of subsequent chapters using Anda and Park’s fast plane rotations with dynamicscaling is an interesting task for future research.All three implementations avoid difficulties with overflow by carefully restricting theworst case growth between rescaling episodes and by selecting an appropriate rescaling constant for entries that experience too much growth. During R-S’s column-oriented reduction,monitoring the potential growth of D’s entries is straightforward. When R-S eliminates onecolumn from the band, at most two transformations modify each of D’s diagonal entries.Thus if the maximum permissible growth between rescalings is 264, we should schedulerescaling episodes after the elimination of each block of 32 columns. Suppose D’s entriesmust always remain less than 2128 to avoid overflow. If during each rescaling all entrieswith IDi,j > are rescaled by 264, D’s entries cannot overflow before the next rescaling.The diagonally-oriented reduction of Bandwidth Contraction requires a complete reformulation of BANDR’s rescaling techniques. To simplify the rescaling process we ignoreband sparsity and assume BC eliminates each entry with a nontrivial transformation. If bCis the current bandwidth, during the elimination of a contiguous block of bC entries fromthe outermost nonzero diagonal, at most 2 transformations modify each diagonal entry ofD. In a similar fashion to BANDR’s techniques, monitoring the number of blocks of bCentries eliminated permits appropriate scheduling of rescaling episodes. This monitoringprocedure is complicated by the completion of a diagonal’s reduction, but these aspectsof the implementation will not be detailed at this time. It is important to note, however,that as the reduction contracts the band and bC grows smaller, the frequency of rescalingepisodes increases. As a result, BC typically requires more rescaling episodes than BANDR.This result is not surprising considering BC uses significantly more transformations.The implementation of HYBBC inherits the rescaling techniques of its constituent algoCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 78rithms with one modification. Just before switching from BC to R-S, the hybrid algorithmapplies an extra rescaling episode to ensure a successful transition between the two rescalingprocedures.For all three implementations the cost of monitoring D’s growth and the scheduling ofrescaling episodes is insignificant. The cost of the rescaling procedures is dominated bythe actual rescaling itself. During a rescaling episode each implementation scans D andrescales those entries that could potentially overflow before the next rescaling. In addition,the corresponding row and column of the intermediate reduction matrix are also rescaledto maintain the integrity of the similarity reduction. Even though BC and HYBBC mayrequire significantly more rescaling than BANDR, Section 4.4’s extensive experimentationwith practical sparse problems shows rescaling accounts for a very small portion of thetotal computational requirements of tridiagonalization. 
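As a concrete illustration of one such rescaling episode, the short sketch below continues the illustrative Python/NumPy sketches of Section 4.1.2. It assumes the reduction carries its matrix in the factored form A = D^(1/2) Abar D^(1/2) commonly used with fast Givens transformations; that convention, the routine name, and the use of the limits 2^128 and 2^64 quoted above are illustrative assumptions rather than a description of the actual FORTRAN codes.

    import numpy as np

    def rescale_episode(Abar, d, limit=2.0**128, factor=2.0**64):
        # One rescaling episode for a fast Givens reduction.  The reduction is
        # assumed to represent its matrix as A = D^(1/2) Abar D^(1/2) with
        # D = diag(d).  Any diagonal entry that could overflow before the next
        # episode is divided by `factor`, and the corresponding row and column
        # of Abar are multiplied by sqrt(factor) so that the represented
        # matrix A is unchanged.
        grown = np.flatnonzero(np.abs(d) > limit / factor)
        for i in grown:
            d[i] /= factor
            Abar[i, :] *= np.sqrt(factor)
            Abar[:, i] *= np.sqrt(factor)
        return Abar, d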
For three problems HYBBC spendsbetween 1 and 2% of its total reduction time rescaling, but for the remaining 112 problemsthe relative cost of rescaling is less than 1%.4.4 ExperimentationThis section describes extensive experimentation with implementations of the BandwidthContraction and Hybrid Bandwidth Contraction algorithms. After briefly describing thetesting environment and the selection of HYBBC’s threshold value, we analyze test resultscomparing our implementations to EISPACK’s BANDR. The section concludes by investigating the correlation between preorderings and tridiagonalization performance.4.4.1 The Testing EnvironmentTo compare the computational requirements of our implementations to BANDR (an R-Scode), all three routines are applied to the 115 symmetric sparse problems in the Harwell—Boeing test suite of Section 2.2. Unless otherwise specified, each problem is preordered toreduce bandwidth using Lewis’s implementation of GPS [Lew82].All testing was conducted on a Sun SPARCstation 2 with 16 MBytes of main memory.CHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 79All routines were compiled by the standard Sun FORTRAN compiler with full object codeoptimization requested. The reported CPU second timings, produced using the systemroutine etime, include both the user and system time for tridiagonalization, excluding thepreordering stage.4.4.2 Threshold Selection for HYBBCHYBBC uses a transition strategy that thresholds the number of nonzeros in the outermost diagonal of the current band. For our experimentation we want to identify a value ofthe constant threshold (see Figure 4.8) that provides acceptable transition bandwidths fora broad spectrum of problems. Our research of effective threshold selection techniques isinconclusive, however, and for these experiments the transition to a column-oriented tridiagonalization is made when the outermost subdiagonal is full. With specialized knowledgeof a particular problem’s sparsity structure other values of threshold may be preferred.In our experience, threshold values less than 1.0 can improve the performance of HYBBCfor some sparse problems. For example, with thresholdrrrl.0 HYBBC chooses a transitionbandwidth of 11 for PLAT362. As shown by Figure 4.9, this problem’s tridiagonalizationwould profit from an earlier transition at bandwidth 24. (The flop requirements of Figure 4.9’s reductions were provided by the symbolic reduction tool Trisymb.) Despite thediscrepancy between these transition bandwidths, HYBBC still comes within 3.1% of theoptimal reduction using a threshold of 1.0.Even with threshold set to 1.0, the transition strategy may also direct HYBBC to switchto R-S when continued bandwidth contraction would benefit the reduction. For example,HYBBC makes an early transition for large 5-point problems. In some sense the reductionof 5-point problems provide a worst case example of an early reduction transition. With astandard lexicographic ordering, the outermost diagonal of a 5-point problem is dense, butthe remainder of the band is zero except for the main diagonal and its adjacent subdiagonal.Consider HYBBC’s reduction of the 5-point problem with b = 30 (m = 900). Despite arelatively sparse band, HYBBC immediately switches to R-S at bandwidth 30, independentCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 80C’)ci0IJFigure 4.9: The Variance of PLAT362’s Tridiagonalization Cost with Transition Bandwidthof the value of threshold. 
As shown by Figure 4.10, the ideal transition bandwidth is 23. HYBBC's transition strategy, however, still allows the reduction to be within 3% of the optimal tridiagonalization.

These two examples demonstrate that HYBBC may make both an early transition when the band remains relatively sparse or a late transition when the outermost diagonal is not full but the band is relatively dense. Choosing a threshold of 1.0 seems to be an appropriate compromise. As described in Section 4.2.2, formal analysis of HYBBC's transition for general sparse problems is very difficult. Chapter 5 introduces a second hybrid tridiagonalization algorithm that lends itself to formal transition analysis, permitting the development of more precise transition strategies. Despite HYBBC's rudimentary transition strategy, this chapter's extensive experimentation shows that in the worst case HYBBC is almost always comparable to BANDR, but typically it is significantly more efficient.

Figure 4.10: The Variance of Tridiagonalization Cost with Transition Bandwidth for a 5-Point Problem, b = 30

4.4.3 Numerical Testing Results

The implementations of BC and especially HYBBC are very successful relative to EISPACK's BANDR. For 98 of the 115 problems tested the hybrid tridiagonalization algorithm significantly reduces CPU requirements. For this group of problems, reductions in CPU time range from a low of 6.6% to a high of 63.3%. For the 70 test problems with more than 400 nodes, HYBBC requires on average 31.1% fewer CPU seconds than BANDR. The histogram in Figure 4.11 summarizes the distribution of CPU time reductions achieved by HYBBC for this group of 70 problems.

Table 4.4 summarizes the tridiagonalization of 20 test problems for which HYBBC is especially successful. The table provides tridiagonalization times for each implementation, with the CPU requirements of HYBBC's two reduction phases given in parentheses. For future reference the table also includes transformation totals. The two values in parentheses below each transformation total provide the number of bulge chasing transformations and row/column exchanges included in the total count. The symbols bGPS and bt refer to the
For this group of problems HYBBC exhibits a mean reduction in CPU time of44.2%.Of the 17 test problems for which HYBBC shows little or no improvement, 14 arevery small problems, but 3 matrices have between 400 and 1000 nodes. As shown byFigure 4.11, HYBBC is actually slower than BANDR for 2 problems with n > 400. Forone problem, DWT 992, HYBBC is only 1.3% slower than BANDR, but for NOS6 HYBBCrequires approximately 7% additional CPU seconds. Of all 115 test problems, NOS6 is theonly problem for which HYBBC significantly increases CPU requirements. The HYBBCreduction of a few very small matrices also shows a small increase in CPU requirementsrelative to BANDR. When the computational requirements of these reductions are checkedwith Trisymb, however, we find HYBBC requires fewer flops and timing discrepancies aredue to the clock’s granularity.CoE1a)aI00).0EDzCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 84Tridiagonalization Times (sec)r Name n bGPS BANDR BC HYBBC % CPU ReductionBANDR—*HYBBCDWT..361 361 14 3.29 3.94 3.14 4.6 1Table 4.5: DWT 361 Thdiagonalization SummaryAlthough the performance of BC is similar to HYBBC for most of the problems listedin Table 4.4, for some problems BC is substantially slower than BANDR. As an example, consider the tridiagonalization times of DWT 361 given in Table 4.5. BC requiressignificantly more CPU time than does BANDR. Such a performance degradation is understandable given the theoretical analysis of densely banded tridiagonalizations providedin Sections 3.4.1 and 4.2.1. Although initially 74% of the permuted problem’s band entriesare zero, BC’s reduction of the first nonzero diagonal dramatically fills the band, resultingin a nonzero band density of .95. As a result, the band is dense for most of the reduction bythe Bandwidth Contraction algorithm. In contrast, HYBBC only reduces 2 diagonals withBC before detecting the absence of band sparsity and switching to R-S to complete thereduction. Except for NOS6, the hybrid routine always has comparable CPU requirementsto BANDR in the worst case; for the majority of sparse problems it is significantly faster.For 40 of the 115 problems tested, HYBBC with a threshold of 1.0 uses the BandwidthContraction algorithm for the entire tridiagonalization process. The band of many of theseproblems becomes quite dense towards the end of the reduction and the tridiagonalizationmight have benefited by a earlier switch to BANDR. As an extreme example, once againconsider the reduction of NOS6 by HYBBC. The initial steps of the Bandwidth Contractionalgorithm reducing the bandwidth from 16 to 12 fills most of the permuted band’s entries.Due to idiosyncrasies of the sparsity structure of NOS6, however, the outermost diagonalnever completely fills. Consequently, HYBBC conducts the entire reduction using BC, eventhough the band is essentially dense for the majority of the reduction. These results confirmthe observations of Section 4.4.2. Although HYBBC’s transition regulating criterion workswell for most problems, there is room for improvement using problem specific thresholdCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 85Figure 4.12: Examples of Sparsity Structures Effecting Bandwidth Contraction Performancevalues or employing more sophisticated transition strategies.4.4.4 Sparsity Structures, Preorderings and Bandwidth Contraction PerformanceAs shown previously, the performance of BC and HYBBC is dependent on problem specificsparsity structures. 
In general, the class of sparsity structures ideally suited to BandwidthContraction concentrates nonzeros near the main diagonal and exhibit increased sparsitytowards the outermost diagonals. We have seen that BC and HYBBC can dramaticallyreduce tridiagonalization requirements by exploiting sparsity away from the main diagonal.For example, Figure 4.12 plots the sparsity structure of CAN 1054 with a GPS preordering.HYBBC is able to exploit this problem’s increased sparsity away from the main diagonal,and reduce tridiagonalization requirements by 37.8% relative to BANDR.In contrast, sparsity structures that concentrate nonzeros in the band’s outermost diagonals hamper the spike clipping techniques of Bandwidth Contraction. For this class ofproblems the outermost diagonals cannot be reduced at low cost and nontrivial transformations rapidly fill the band. The plot of sparse matrix DWT 307 in Figure 4.12 provides an0 200 400 600 800 1000 0 50 100 150 200 250 300nz=12196 nz=2523CAN 1054 DWT 307CHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 86example of this sparsity structure class. The initial band nonzero density for this problemis a relatively sparse 9.24%. Unfortunately, the GPS preordering concentrates nonzeros inthe band’s outermost diagonals, making it difficult for HYBBC to exploit band sparsity.For this problem HYBBC shows a more modest reduction in CPU requirements of 10%relative to BANDR.Of course, not all matrices can be permuted to simultaneously reduce bandwidth andproduce sparsity structures conducive to Bandwidth Contraction. Whenever possible, however, the ideal preordering algorithm would concentrate sparsity near the main diagonalwithout sacrificing bandwidth. As a consequence of its preordering techniques using rootedlevel structures, GPS has a propensity to position nonzeros away from the main diagonal.Barnard et al [BPS93] demonstrate that spectral preorderings can exhibit complementarycharacteristics, concentrating nonzeros tightly about the main diagonal. Currently thispreordering algorithm is designed to minimize matrix profile, not bandwidth. Pothen suggests, however, the possibility of using a local reordering strategy to refine preorderingsfor bandwidth reduction, without significantly altering the preordering characteristics beneficial to Bandwidth Contraction [Pot93]. Continuing the investigation of the relationshipbetween this ordering scheme, and others, and Bandwidth Contraction performance will bean interesting avenue for future research.Although HYBBC implements efficient techniques for clipping the longer spikes exTridiagonalization Times (sec)BANDR HYBBCName b’ &ND GPS ND GPS NDn (bt) (bt)BCSSTKO9 95 980 234.3 964.4 150.8 375.61083 (57) (33)DWT 1007 34 894 72.2 715.3 50.2 137.41007 (12) (22)PLAT1919 80 1891 706.7 6877.0 424.6 2729.01919 (10) (72)BCSSTK27 45 778 140.4 1145.7 113.7 264.61224 (25) (31)Table 4.6: Sparse Tridiagonalization, GPS versus Nested DissectionCHAPTER 4. BANDWIDTH CONTRACTION ALGORITHMS 87Preordering Method b BANDR Time HYBBC TimeGPS 27 43.1 34.4RCM 46 70.0 36.3GK 40 59.8 34.7Table 4.7: The Effects of a Poor Preordering on DWT 878tending from the main diagonal, it is important to reiterate that the primary objective ofpreordering must be to reduce bandwidth. It is not beneficial to search for preorderingswith long spikes or a wide variation in spike length as the first priority. 
For example, nesteddissection [GL78a] permutations of sparse matrices often produce spikes of widely varyinglength, with the longest spikes towards the bottom of the matrix. But the bandwidth ofsuch preorderings is much larger than for GPS. Although HYBBC takes good advantageof the increased sparsity away from the main diagonal for Nested Dissection preorderings,Table 4.6 demonstrates that it is much better to choose bandwidth reducing preorderings.More direct examples of banded-like ordering are given by considering the tridiagonalization of DWT 878 using GPS, RCM [GL78b], and Gibbs-King (or GK) [Gib76, Lew82]preorderings summarized in Table 4.7. GK and RCM are likely to have longer spikes (higherbandwidth) but better profile than GPS. They are exactly the kinds of orderings one mightuse to get longer spikes. Switching from a GPS ordering to either RCM or GK, the CPUrequirements of BANDR closely mirror the large increase in bandwidth. The “spike clipping” process of HYBBC, however, is able to efficiently exploit the increased band sparsitypresented by RCM and GK, resulting in only marginal increases in CPU requirements.Chapter 5Split Bandwidth ContractionAlgorithms for Sparse SymmetricMatricesIn Chapter 4 we demonstrated the ability of Bandwidth Contraction’s diagonally-orientedreduction to exploit band sparsity and dramatically reduce tridiagonalization costs for awide range of practical sparse problems. In addition, we combined BC with the RutishauserSchwarz algorithm to produce the versatile hybrid sparse tridiagonalization algorithmHYBBC. This hybrid algorithm exploits the differing reduction characteristics of BC andR-S to further improve the reduction of a typical sparse matrix and in the worst caseguarantee reduction costs comparable to R-S. This chapter expands upon these successfultechniques, developing second generation sparse algorithms for bandwidth contraction andtridiagonalization.We begin with the development of the Split Bandwidth Contraction algorithm (or SBC),which enhances band sparsity exploitation using bidirectional elimination techniques. Thenovel Hybrid Split Bandwidth Contraction algorithm (or HYBSBC), described in the following section, incorporates many aspects of HYBBC’s approach. To improve sparsity exploitation, however, HYBSBC replaces the BC stage with the Split Bandwidth Contractionalgorithm. As a result, the new hybrid algorithm lends itself to formal transition analysis,permitting the development of the -transition strategy for precise regulation of the algo88CHAPTER 5. SPLIT BANDWIDTH CONTRACTION 89rithm’s transition bandwidth. The chapter’s next section outlines implementations of boththe SBC and HYBSBC algorithms. Finally, extensive experimentation demonstrates thatSBC can dramatically reduce the cost of partial sparse bandwidth contractions, while HYBSBC substantially reduces sparse tridiagonalization costs relative to HYBBC and BANDR.5.1 The Split Bandwidth Contraction AlgorithmFor most sparse problems the requirements of bulge chasing transformations dominate thetridiagonalization costs of the Bandwidth Contraction algorithm. For example, on average 96.2% of the transformations employed by the 20 Bandwidth Contraction reductions in Table 4.4 eliminate bulge entries. This section introduces a novel band-preserving,diagonally-oriented reduction similar to BC. To improve reduction performance, however,the Split Bandwidth Contraction algorithm modifies the sequence of transformations usedto eliminate each sparse diagonal. 
These bidirectional elimination techniques dramaticallyreduce bulge chasing requirements by taking additional advantage of band sparsity.5.1.1 Row and Column-Oriented Adjacent TransformationsTwo types of adjacent transformations are necessary for SBC’s diagonally-oriented reduction. Consider the elimination of nonzero T from Figure 5.1’s sparse matrix. Assumingthe zeroing entry must be within the current band, only two adjacent transformations,G1(8, 9,01)TAG(8,9, 8) and 02(5,6,02)TAG(5,6,02), can be constructed for T’s elimination. The zeroing entries for these transformations are marked by Z1 and Z2 respectively.We choose to classify both similarity transformations according to which of their rotationseliminates T from the the lower triangular portion of the matrix. The first transformationeliminates T from the lower triangle by modifying rows 8 and 9 with rotation G(8, 9, 01)applied to the left of the matrix. Consequently, we classify this transformation as a roworiented adjacent transformation. In contrast, the second transformation eliminates T fromthe lower triangular portion by applying G2(5, 6,02) to the right of the matrix. Becausethis rotation modifies a pair of columns, we classify the second transformation as column-CHAPTER 5. SPLIT BANDWIDTH CONTRACTION 902X X’N j I I_tI__—4j NX J5 Z1’T\x J 6 Z2\1 7 X\\JZ1 8 X x\“TZ2: X9____\__ii__Z_\x ii x x.X 12::zz:zf:g:n:Figure 5.1: Row and Column-Oriented Transformationsoriented.Applying either transformation to Figure 5.1’s small example creates bulge entries. Asexpected, the bulge entry pair associated with the row-oriented transformation lies furtherdown the band from T in positions A1315 and A8113. The bulges created by the column-oriented transformation, however, lie above T in positions A6,1 and A1,6.5.1.2 Bidirectional EliminationThe Bandwidth Contraction algorithm eliminates each nonzero diagonal from top to bottomusing row-oriented adjacent transformations exclusively. Alternatively, we propose startingthe reduction in the middle of each diagonal, eliminating nonzeros in both directions, asdepicted by Figure 5.2, using both nontrivial Givens transformations and row/column exchanges. This bidirectional elimination annihilates band nonzeros below the midpoint withrow-oriented transformations and chases their bulge entries off the end of the matrix in theusual fashion. To eliminate the nonzeros of the diagonal’s top half, however, a bidirectionalelimination employs column-oriented transformations. The bulges created by these eliminations are chased off the top of the matrix with additional column-oriented transformations.CHAPTER 5. SPLIT BANDWIDTH CONTRACTION 91As a result, bulge chasing sequences associated with the nonzeros above the diagonal’s midpoint are significantly shortened. In fact, relative to BC these techniques potentially halvethe total number of bulge chasing transformations required for each diagonal’s reduction.Unfortunately, cyclic elimination dependencies prevent the use of bidirectional elimination techniques for dense diagonals. Rather than develop a formal proof of this intuitiveresult, we manipulate the fragment of a lower triangular dense diagonal in Figure 5.3 todemonstrate the insurmountable difficulties associated with initiating its bidirectional elimination. 
For this example we assume the exclusive use of nontrivial adjacent transformations,but similar demonstrations permitting row/column exchanges reach the same conclusions.Without loss of generality suppose the reduction begins above the midpoint by eliminatingT with a column-oriented transformation. This elimination creates a bulge as illustratedby the second band fragment in Figure 5.3. At this point the reduction could proceed inone of two ways. The elimination of the dense diagonal could continue above the midpoint,but first we must eliminate the bulge to avoid its growth. As shown in Figure 5.3, however,zeroing B with either a column or row-oriented transformation reestablishes the diagonal’sFigure 5.2: Bidirectional EliminationCHAPTER 5. SPLIT BANDWIDTH CONTRACTION 92EU!L1_ _i eliminate below LI_ __L_____________0 Xl the midpoint : x xLFigure 5.3: Cyclic Dependencies of a Dense Diagonaloriginal nonzero structure. Alternatively, we could attempt to initiate the elimination below the midpoint without removing the bulge. Although the row-oriented transformationeliminating X2 does not cause bulge growth, it recreates a nonzero in the previously zeroedentry above the midpoint. Continuing the reduction below the midpoint requires the elimination of the bulge, which once again reproduces the diagonal’s original sparsity structure.Although this is not an exhaustive example, it turns out there is no practical techniquefor starting a bidirectional elimination that circumvents the cyclic dependencies of a densediagonal.The bidirectional elimination of a sparse diagonal does not experience cyclic dependencies if the reduction appropriately exploits the diagonal’s zero entries. The key to asuccessful bidirectional elimination is a region of a sparse diagonal we refer to as a split-point. A split-point is a block of 1 or more zero entries in the outermost nonzero diagonalof the current band. If a bidirectional elimination begins above and below a split-point, thediagonal’s reduction can proceed in both directions with complete independence. The zerostart eliminationabove midpointeliminatethe bulgeChAPTER 5. SPLIT BANDWIDTH CONTRACTION 93ThB1X 7 Z10r -— -X 9 X0 Z2 11 0 B:\JX 13 X—L0 15 XEAB2XxJkJFigure 5.4: Split-points and Independent Transformationsentries of the selected split-point prevent the inappropriately positioned bulges that createdthe cyclic reduction dependencies for a dense diagonal. In addition, transformations appliedabove and below the split-point modify independent subsets of band entries. For example,in Figure 5.4 there is no overlap between the dashed rectangles outlining the extent of therow and column-oriented transformations eliminating T2 and T1.When a sparse diagonal contains a single contiguous block of zero entries the splitpoint is fixed. Other sparse diagonals may have many split-points to choose from. Theoptimum split-point that minimizes their reduction costs is dependent on problem specificsparsity structures and may be difficult to determine. Section 5.1.4 explores four spht-pointselection strategies that attempt to minimize the reduction costs associated with a sequenceof bidirectional eliminations.CHAPTER 5. SPLIT BANDWIDTH CONTRACTION 945.1.3 A Formal Presentation of SBCThe Split Bandwidth Contraction algorithm is similar in many respects to BC and inheritsall the advantages enjoyed by its predecessor for a sparse band. 
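Before the formal statement of the algorithm in Figure 5.5, a minimal sketch of a single bidirectional sweep may be helpful. It is written in the same illustrative Python/NumPy style as the Bandwidth Contraction sketch of Section 4.1.2, whose sym_givens helper it reuses; indices are 0-based, the names are ours, row/column exchanges are folded into general rotations, and the split-point column sp is assumed to satisfy A[sp + bc, sp] = 0.

    import numpy as np

    def eliminate_above(A, i, j, bc):
        # Zero A[i, j] (and A[j, i]) with a column-oriented adjacent
        # transformation mixing columns j and j+1, then chase the bulge, which
        # appears bc + 1 to the left in row j + 1, toward the top of the
        # matrix.  Assumes the outermost entry of column j + 1 has already been
        # eliminated, as it has been during the top-half sweep below.
        while j >= 0 and A[i, j] != 0.0:
            t, z = A[i, j], A[i, j + 1]        # target and zeroing entry
            r = np.hypot(t, z)
            sym_givens(A, j, j + 1, z / r, t / r)
            A[i, j] = A[j, i] = 0.0
            i, j = j + 1, j - bc               # position of the possible bulge

    def split_contract_one_diagonal(A, bc, sp):
        # Bidirectional elimination of the outermost diagonal (offset bc) about
        # the split-point column sp.  Entries above the split-point are removed
        # right to left with column-oriented transformations (bulges chased
        # toward the top); entries below it are removed top to bottom with
        # row-oriented transformations (bulges chased toward the bottom), as in
        # the Bandwidth Contraction sketch.
        n = A.shape[0]
        for col in range(sp - 1, -1, -1):              # above the split-point
            eliminate_above(A, col + bc, col, bc)
        for col in range(sp + 1, n - bc):              # below the split-point
            i, j = col + bc, col
            while i < n and A[i, j] != 0.0:
                a, b = A[i - 1, j], A[i, j]
                r = np.hypot(a, b)
                sym_givens(A, i - 1, i, a / r, -b / r)
                A[i, j] = A[j, i] = 0.0
                j = i - 1
                i = j + bc + 1
        return A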
To exploit band nonzeros,for example, SBC incorporates the three techniques for exploiting band sparsity included inSection 4.1.2’s description of BC. As detailed in Figure 5.5, however, SBC also employs thebidirectional elimination techniques of Section 5.1.2 to take additional advantage of bandsparsity and reduce the bulge chasing requirements of each diagonal’s reduction.Once again, SBC begins by symmetrically permuting the sparse matrix to reduce bandwidth. Beginning with the outermost nonzero diagonal of the lower triangular portion ofthe band, SBC then searches for an appropriate split-point with the routine find_split-point.If the outermost diagonal is dense this routine returns 0, causing SBC to skip step 3b andrevert to a standard BC reduction. Otherwise find_split-point returns the column index ofa zero in the selected block of zero entries. The specific split-point selection algorithm usedby this routine is detailed in Section 5.1.4, which explores split-point selection strategiesthat attempt to minimize reduction costs.With the split-point assigned, SBC scans the diagonal for its first nonzero entry abovethe split-point. This entry is eliminated using an adjacent column-oriented row/columnexchange or nontrivial transformation, depending upon the nonzero status of the zeroingentry AcQl+bccQj+1. If this transformation creates a nonzero beyond the limits of the currentband, the bulge is chased off the top of the matrix with additional column-oriented transformations, as described in Section 5.1.2. The scanning and reduction process continuesuntil the portion of the diagonal above the split-point has been eliminated. The algorithmthen switches to a more familiar diagonally-oriented reduction using row-oriented transformations to eliminate nonzeros below the split-point. Upon completion of the diagonal’selimination, SBC decrements the current bandwidth, bc, and begins the bidirectional elimination of the next diagonal. As defined in Figure 5.5, SBC is a tridiagonalization algorithm.Although it efficiently tridiagonalizes sparse matrices, SBC also effectively exploits bandsparsity while performing partial bandwidth contractions. The final bandwidth of SBC’sCHAPTER 5. SPLIT BANDWIDTH CONTRACTION 951. A PTAF, where P is a bandwidth reducing permutation matrix.2. b := bandwidth(A)3. FOR b’ b DOWNTO 2 DO /*Tridiagonalize A.*/(a) sp find_split-poirbt(A, bC)(b) FOR col— 1 DOWNTO 1 DO /*Eliminate above the splitpoint*/IF Ac01+bc, ol 0 THEN /*Zero Acoj+bccoz.*/IF Acoj+bc,col+1 = 0 THENExchange rows/columns (col) and (col + 1) in A.ELSEA := G(col,col +l,O)T A G(col,col+ 1,0)IF bandwidth(A) > bC THENChase bulges with additional column-oriented adjacentGivens transformations or row/column exchanges.ENDIF /*Outermost IF*/(c) FOR col := sp + 1 TO n — bC DO /*Eliminate below the splitpoint*/IF A0i+oc, 0 THEN /*Zero Aco1+bccQI.*/IF j+b_, = 0 THENExchange rows/columns (col + bC) and (col + bC — 1) in A.ELSEA := G(col+bc,col+bc — 1,9)T A G(col+bc,col +bc— 1,0)IF bandwidth(A) > bC THENChase bulges with additional row-oriented adjacentGivens transformations or row/column exchanges.ENDIF /*Outermost IF*/Figure 5.5: The Split Bandwidth Contraction AlgorithmCHAPTER 5. 
SPLIT BANDWIDTH CONTRACTION 96Fc1 TSBC1(4+%)n2—(+6b+8m+10)n+(1O+-)m±(6b-+8)m+2b3b %- +(1-)m+--if 1 < m < b+ 2CSBC(5 + 2b — CSBC(2 + )) +—± 2CSBC(5 + b — CSBC(1 + )) + --(1—(4 + )n2—( + 6b + 8m + 1O)nif b < m + (8 + )m + (8b — + 12)m + 2b + b + 3 Unchanged+ (% + 2)(Csc(b — CSBC) + CSBC(b — CSBC))Table 5.1: The Cost of SBC Eliminating the Outermost Nonzero Diagonal of a BandedModel Problemreduction can be freely selected from the range 1 b < b.Formal analysis of the SBC algorithm is confronted by the same difficulties as an analysis of Bandwidth Contraction when considering the reduction of matrices with generalsymmetric sparsity patterns. We can obtain useful complexity results, however, for a special family of bandwidth b, order n symmetric problems. The bands of the problems aredensely populated except for a single zero entry, a split-point, in both the upper and loweroutermost nonzero diagonals of the band. The third parameter defining each matrix in thisfamily of model problems is the column index, m, of the zero in the lower triangular portionof the band.After SBC reduces the outermost diagonal of a problem in this class, the contractedband is completely dense and inaccessible to bidirectional elimination techniques. The elimination of this single diagonal, however, provides a worst case analysis for SBC’s reductionof the outermost diagonal of identically sized sparse problems using a split-point in columnm. Table 5.1 provides flop and transformation count formulas for this reduction, F1C andTSBC1 computed using the enhanced analysis framework of Section 2.4.3. Without loss ofgenerality, the analysis assumes 1 m To apply the analysis to a model problem with its split-point in the diagonal’s bottom half, simply consider the computationallyequivalent problem found by reflecting the split-point’s position about the diagonal’s midpoint. Due to difficulties with operation count summations, we actually performed oneCHAPTER 5. SPLIT BANDWIDTH CONTRACTION 97TBC1zFG‘BC12M(b)1±)Md( b)) () (n(n - b) + Mod(n, b)(b - Mod(n, b)))Table 5.2: The Cost of BC Eliminating the Outermost Nonzero Diagonal of a DenselyBanded Problemanalysis when the split-point is in the first b columns of the diagonal and a separate analysis when b < m In addition, the analysis assumes b < n/3 and that an equalproportion of the two types of nontrivial fast Givens transformations are employed. Theanalysis also ignores the potential cost of the periodic rescaling required by fast Givenstransformations. Finally, CSBC1 and GSBC1 are the nonanalytic terms Mod(m — 1, b) andMod(n— m, b).For comparison Table 5.2 provides flop and transformation counts for the reductionof one diagonal of a densely banded symmetric matrix using the Bandwidth Contractionalgorithm. (These results were extracted from BC’s tridiagonalization analysis detailedin [Cav93].) It is difficult to compare these results to the complicated formulas for F° andTSBC1. Alternatively, the graphs in Figure 5.6 plot SBC’s flop and transformation counts,normalized by FBF’8 and TBC1, against m for a model problem with ii = 1000 and b = 100.When the split-point is close to the center of the diagonal (m=450) SBC dramatically reduces computational requirements and, as expected, BC requires almost twice as manyflops and transformations. As the split-point is moved towards the top of the diagonal,however, the relative advantage of SBC gradually diminishes and eventually the computational requirements of SBC converge to those of BC. 
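For readers who wish to experiment with this comparison, the banded model problem just described is easy to generate. The sketch below (again illustrative Python/NumPy with 0-based indices and a name of our choosing) fills the band with random symmetric entries and plants the single split-point zero in column m of the outermost subdiagonal. Applying the earlier contract_one_diagonal and split_contract_one_diagonal sketches to such a matrix and counting the rotations they perform illustrates the qualitative trend of Figure 5.6, although not the precise fast Givens flop counts of Table 5.1.

    import numpy as np

    def model_problem(n, b, m, seed=None):
        # Order-n symmetric matrix whose band of half-bandwidth b is dense
        # except for a single zero, the split-point, at position (m + b, m) of
        # the outermost subdiagonal (and its transpose); 0-based indices.
        rng = np.random.default_rng(seed)
        A = rng.standard_normal((n, n))
        A = A + A.T
        i, j = np.indices(A.shape)
        A[np.abs(i - j) > b] = 0.0
        A[m + b, m] = A[m, m + b] = 0.0
        return A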
For a dense diagonal the SBC and BC reductions are identical. The general trends exhibited by these plots are largely insensitive to changes in the relative size of m and b.

[Figure 5.6: The Flop and Transformation Requirements of SBC relative to BC for n = 1000 and b = 100; the two panels plot (SBC1 flops)/(BC1 flops) and (SBC1 trans.)/(BC1 trans.) against m.]

5.1.4 Split-Point Selection Strategies

The bidirectional elimination of a sparse diagonal may have many alternative split-points to choose from. Surprisingly, it is often difficult to choose the split-point that minimizes the cost of reducing a sparse problem's outermost nonzero diagonal. A simple selection strategy chooses the split-point with minimum displacement. A split-point's displacement is defined as |mid - sp|, where sp and mid are the column indexes of a split-point and the diagonal's midpoint respectively. As shown in the previous subsection, the bidirectional reduction costs of a diagonal with a single zero entry are minimal when the split-point is at the diagonal's midpoint. For diagonals with multiple split-points, however, choosing the split-point with minimal displacement may not always be optimal. For example, the outermost diagonal in the band of the small sparse problem in Figure 5.7 has two split-points. The minimum displacement selection strategy chooses the centered split-point in column 9. With this split-point the diagonal's bidirectional elimination uses 11 nontrivial fast Givens transformations, requiring 553 flops. In contrast, the bidirectional elimination using the split-point in column 4 requires 7 transformations and 345 flops.

[Figure 5.7: Reduction Performance and Split-Point Selection]

In general a diagonal's optimum split-point depends on several contributing factors.

1. The distribution of nonzeros in the diagonal.
2. The distribution of nonzeros in the remainder of the band.
3. Problem specific sparsity structures of the sparse band.

Unfortunately, these factors interact in a very complex manner, and without potentially expensive modeling of a diagonal's reduction, it is difficult to consistently improve upon the minimum displacement approach. For example, without modeling it is generally not known to what extent a diagonal will fill during its reduction. Choosing the minimum displacement split-point, however, minimizes reduction costs in the worst case when the diagonal under reduction experiences high levels of fill. In our experience Figure 5.7's example is an exceptional case. The minimum displacement split-point selection strategy is generally applicable and typically finds the optimal or near optimal split-point for a single diagonal's reduction.

The Split Bandwidth Contraction algorithm, of course, typically reduces a sequence of nonzero diagonals. In this case the split-point selected for one diagonal's elimination may impact the quality of split-points available for the reduction of subsequent diagonals. To perform an efficient bandwidth contraction, we seek a sequence of split-points that globally minimizes reduction costs. Many different factors affect the sparsity of reduction intermediates and the location of their split-points. The remainder of this subsection briefly investigates four global selection strategies, which explore trade-offs between the displacement of selected split-points, the position of future split-points, fill, and reduction costs.
Because it is difficult to theoretically determine the relative importance of factors affecting contraction performance, we evaluate the merits of each approach using Trisymb experimentation with the subset of 15 problems from the Harwell-Boeing test suite listed in Table 5.3.

Problem     n      b_GPS     Problem     n      b_GPS     Problem      n      b_GPS
DWT 361     361    14        CAN 1054    1054   112       BCSSTK12     1473   62
LSHP 577    577    25        BCSSTK09    1083   95        BCSPWR08     1624   108
GR 30 30    900    49        1138 BUS    1138   126       PLAT1919     1919   80
NOS3        960    65        ERIS1176    1176   100       SSTMODEL     3345   82
DWT 1007    1007   34        DWT 1242    1242   91        BCSSTK28     4410   323

Table 5.3: Test Problems for Split-Point Selection Experimentation

5.1.4.1 Minimum Displacement Split-Point Selection

Our first global split-point selection strategy simply applies the minimum displacement selection criterion to each intermediate matrix, relying upon its efficient elimination of each diagonal to approximate a global minimization of reduction costs. As a typical split bandwidth contraction proceeds, a block of nonzeros builds in the center of the outermost diagonal of successive intermediate matrices, forcing split-points away from the midpoint. For the moment when the two split-points bordering this central nonzero block have equal (minimal) displacement, the selection strategy breaks ties arbitrarily. We use this version of the Split Bandwidth Contraction algorithm, referred to simply as SBC, to benchmark each of the three remaining global selection strategies.
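As a concrete illustration of the minimum displacement criterion, the following sketch selects a split-point from the outermost nonzero diagonal of a dense-format symmetric matrix. It is our own illustrative code, not the thesis's Trisymb or numeric implementation: column indices are 0-based, a dense diagonal is signalled by None rather than 0, and ties are broken by the first minimum found (Section 5.1.4.5 refines the tie-breaking rule).

import numpy as np

def find_split_point(A, bc):
    # Return the 0-based column index of the minimum displacement zero
    # (split-point) in the outermost nonzero diagonal of A, assumed to have
    # semi-bandwidth bc.  Returns None when the diagonal is dense, in which
    # case SBC falls back to an ordinary BC elimination of the diagonal.
    n = A.shape[0]
    diag = np.array([A[j + bc, j] for j in range(n - bc)])   # lower diagonal bc
    zeros = np.flatnonzero(diag == 0.0)
    if zeros.size == 0:
        return None
    mid = (len(diag) - 1) / 2.0                              # diagonal midpoint
    return int(zeros[np.argmin(np.abs(zeros - mid))])        # minimum displacement

# Example: order 12, bandwidth 3, zeros in (0-based) columns 2 and 7 of the
# outermost diagonal; the zero closer to the midpoint (index 4) is chosen.
n, bc = 12, 3
A = np.zeros((n, n))
for d in range(bc + 1):
    A[np.arange(n - d) + d, np.arange(n - d)] = 1.0
    A[np.arange(n - d), np.arange(n - d) + d] = 1.0
A[2 + bc, 2] = A[2, 2 + bc] = 0.0
A[7 + bc, 7] = A[7, 7 + bc] = 0.0
print(find_split_point(A, bc))   # prints 2 (displacement 2, versus 3 for column 7)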
5.1.4.2 Unidirectional Split-Point Selection

While experimenting with a Trisymb implementation of SBC, we observed that a few problems experience higher levels of band fill with SBC than BC. The largest discrepancies in the band density of corresponding intermediate matrices are often associated with extended periods of split-point flipping. During an episode of split-point flipping SBC selects split-points from alternating sides of the growing block of nonzeros straddling the midpoint of the outermost diagonal of successive intermediate matrices. Without going into great detail, split-point flipping can increase intermediate fill levels by

1. changing the orientation of transformations eliminating nonzeros from particular regions of the band.
2. modifying the order in which the outermost diagonal's entries are eliminated.

If we curtail split-point flipping, perhaps fill could be reduced and reduction performance improved. To investigate the relationship between split-point flipping, fill and reduction performance, we propose a second global strategy, the unidirectional split-point selection strategy. (The corresponding version of the reduction algorithm is referred to as USBC.)

The unidirectional selection strategy chooses minimal displacement split-points subject to the following constraint. After selecting its first off-centre minimum displacement split-point, the strategy must choose split-points from the same side of the midpoint for the reduction of subsequent diagonals. When the selection process encounters a diagonal without a split-point in this region, it may consider split-points from the remainder of the diagonal.

To compare the relative merits of USBC and SBC, we performed partial bandwidth contractions for each of the 15 problems in Table 5.3, using Trisymb symbolic implementations of both algorithm variants. In isolation from other reduction factors, eliminating split-point flipping may reduce fill production, but the unidirectional selection strategy typically forces USBC to select split-points of significantly higher displacement, increasing bulge chasing requirements. The fill accompanying these extra transformations typically neutralizes the potential band fill savings of unidirectional split-point selection. Consequently, USBC does not reduce the contraction costs of any problem relative to SBC. In fact, for 5 problems (CAN 1054, DWT 1242, GR 30 30, SSTMODEL and 1138 BUS) USBC increases flop requirements by 11-76%. For these problems USBC frequently chooses split-points with 3 to 4 times the displacement of SBC's split-point selection for the corresponding reduction intermediate. Because fill and band nonzeros interior to the band of the original matrix reduce the sparsity of the contracted band's outermost diagonal, there is little chance of USBC exhausting the split-points on one side of the midpoint and finding small displacement split-points remaining in the other half of the diagonal. In conclusion, these experiments clearly demonstrate that selecting minimum displacement split-points is more cost effective than attempting to avoid flipping with a unidirectional scheme.

5.1.4.3 Damped Split-Point Selection

The difficulties of the unidirectional split-point selection strategy result from dramatic increases in split-point displacement. The unidirectional strategy, however, is one extreme of a range of selection strategies. This section reexamines the trade-offs between fill, flipping and split-point displacement using a generalization of the unidirectional strategy that damps the frequency of split-point flipping incidents to varying degrees. With this scheme we seek a compromise between the unidirectional and minimum displacement split-point selection strategies that improves reduction performance.

The damped split-point selection strategy controls the selection of each diagonal's split-point using the hysteretic process defined by the following conditional statement. (The corresponding version of the Split Bandwidth Contraction algorithm is referred to as DSBC.)

IF (DF * |D_SS|) < |D_OS| THEN
    select split-point SS
ELSE
    select split-point OS

SS and OS are the minimum displacement split-points on the same side and opposite side of the midpoint as the split-point chosen for the previous diagonal's reduction. The displacement of split-points SS and OS is given by D_SS and D_OS. The roles of opposite and same side split-point are undefined when

• SBC is reducing the outermost diagonal of the original band.
• a centred split-point is available in the current diagonal.
• the previous diagonal was eliminated using a centred split-point.
• split-points are unavailable in one half of the current diagonal.

In these special cases the diagonal's minimum displacement split-point is chosen.

The damping factor, DF, regulates the amount by which the displacement of the selected split-point may exceed the minimum displacement to avoid flipping (0 <= DF <= 1). For example, if DF = 0.5 then |D_SS| must be more than twice |D_OS| before split-point OS is selected.
When DF = 0 the damped split-point selection strategy degenerates into the unidirectional scheme, while at DF = 1 it is equivalent to the minimum displacement strategy.

Once again, we explored the relative merits of DSBC by performing partial bandwidth contractions of Table 5.3's 15 sparse problems with a Trisymb symbolic implementation of the algorithm. The goal of the DSBC algorithm is to damp split-point flipping, reducing or delaying the introduction of band fill to improve contraction efficiency. Although the unidirectional selection strategy increases flop requirements, we pursue the possibility that partially damping flipping results in flop savings, as depicted in Figure 5.8.

[Figure 5.8: Desirable DSBC Reduction Characteristics; hoped-for flop requirements plotted against DF, between DF = 0 (unidirectional) and DF = 1 (minimum displacement).]

Unfortunately, extensive experimentation did not reveal any consistent improvements in contraction performance independent of the damping factor. In fact, the contractions of most problems were unchanged or showed slight increases in flops for moderate damping factors (0.5 <= DF <= 1.0) relative to SBC. Further reduction of the damping factor causes significant increases in flop requirements for the four problems on which USBC performed the most poorly, while the remaining problems were largely unaffected. No problem exhibits a trend towards reduced contraction costs. In conclusion, these experiments clearly demonstrate that amongst the split-point selection alternatives presented by DSBC, choosing split-points with minimum displacement must be the paramount objective.

5.1.4.4 Induced Split-Point Selection

As discussed earlier in Section 5.1, to improve the efficiency of sparse bandwidth contraction, it is important to reduce bulge chasing requirements. The damped split-point selection strategy tried to reduce bulge chasing by restricting split-point flipping to control band fill. Alternatively, induced split-point selection techniques ignore potential increases in band density and focus directly upon the development and use of small displacement split-points. In fact, the induced split-point selection strategy encourages split-point flipping in the hope of producing a set of smaller displacement split-points for the reduction of subsequent diagonals.

There are several factors motivating the investigation of induced split-point selection techniques. First, we hope this scheme will distribute bulge chasing more evenly, encouraging a balanced distribution of nonzero entries and future split-points across the midpoints of a band's outermost diagonals. As observed for applications of the damped split-point selection strategy, unevenly distributed split-points can result in unidirectional split-point selection behavior.

In addition, while the bands of intermediate matrices remain relatively sparse, split-point flipping can reduce the displacement of split-points available for the reduction of subsequent diagonals. Consider the reduction of the outermost nonzero diagonal from a sparsely banded matrix of bandwidth b, and for the moment ignore the original nonzero structure of diagonal b - 1. As diagonal b is reduced using adjacent nontrivial transformations and row/column exchanges its nonzero entries, and possibly split-points, are generally mapped into neighboring positions in diagonal b - 1. In special circumstances split-points available in diagonal b - 1 will have a smaller displacement than their ancestors in diagonal b.
For example, an unused split-point in the current diagonal may also exist in the next diagonal, but with a smaller displacement, if a split-point from the opposite side of the midpoint is chosen for the current diagonal's reduction. The following discussion outlines several circumstances under which split-points are shifted by SBC's elimination of one diagonal.

As for previous selection strategies, we assume the split-point selected for each diagonal's reduction is best chosen from the pair of minimal displacement split-points straddling the diagonal's midpoint. We refer to this pair of split-points as:

SPA - The smallest displacement split-point above the midpoint of the outermost diagonal.
SPB - The smallest displacement split-point below the midpoint of the outermost diagonal.

The evolution of SPA and SPB's displacements is closely related to the following problem.

• As SBC contracts the bandwidth of a sparse matrix, does the length of the contiguous nonzero block encompassing the midpoint of the outermost diagonal grow monotonically?

Although growth in the central block of nonzeros is a dominant reduction trend, there are special situations in which the nonzero block can shrink slightly. Figure 5.9 illustrates a small example in which reducing the bandwidth from 3 to 2 with SBC decreases the length of the central nonzero block by one and moves SPA closer to the midpoint.

In this example, the reduced length of the nonzero block, and part of SPA's reduced displacement, results from the elimination of the second band nonzero above the split-point with a row/column exchange. As illustrated in Figure 5.10, application of this transformation switches diagonal 3's fourth nonzero above the split-point into position A_{5,3}, thus avoiding its own explicit reduction and separating it from what remains of the central nonzero block.

[Figures 5.9 and 5.10: sparsity patterns illustrating this split-point shift example.]

Let NBL be the length of the central nonzero block in the outermost diagonal of a bandwidth b sparse matrix. In general, when NBL < b there is no opportunity of reducing a split-point's displacement in this fashion. However, if NBL >= b then once the central block has been reduced to length b by the diagonal's partial reduction, a split-point shift can occur if the next nonzero in the central block is reduced with a row/column exchange. Of course, these conditions are not sufficient to guarantee a split-point shift. Fill entries or nonzero entries already existing in the (b - 1)th subdiagonal could disrupt the shifting process.

The previous discussion describes why SPA moves one row towards the midpoint in Figure 5.9, but what about the other half row of the reported shift? Consider a more general case in which SPB is used to reduce the current diagonal and SPA is at least two zero entries in length. In this case if SPA is not destroyed by fill entries, or existing nonzeros in the neighboring subdiagonal, it will move at least 0.5 rows closer to the midpoint in the adjacent diagonal, as shown in Figure 5.11, independent of the type of column-oriented transformations employed.
[Figure 5.11: The Normal Reduction Shift of a Lower Triangular Split-Point; band intermediates of bandwidth b and b - 1.]

Given this result it might be tempting to choose a series of split-points from one side of center to move split-points on the other side closer to the center. As shown in Sections 5.1.4.2 and 5.1.4.3, however, split-point selection schemes with a strong unidirectional flavor are unsuccessful.

When NBL > b bulge chasing row/column exchanges may also shift split-points in a similar fashion to shifts resulting from row/column exchanges eliminating band nonzeros. Figure 5.12 demonstrates the mechanism by which a bulge chasing rotation shifts a split-point. The elimination of the first nonzero above the split-point creates a bulge entry as shown in matrix A of Figure 5.12. When this entry is chased with a row/column exchange, the transformation switches A_{7,4} off the main diagonal and moves SPA closer to the midpoint, creating matrix B.

[Figure 5.12: Multiple Split-Point Shifts; sparsity patterns of intermediates A, B and C, linked by a bulge chasing row/column exchange and by completing the reduction to bandwidth 2.]

Continuing this reduction illustrates the potential of distinct shifting events combining to shift a split-point multiple positions during the reduction of a single diagonal. As the result of a bulge chasing shift, a band elimination shift and the normal half row shift of a diagonal's reduction, SPA's displacement reduces from 3 in matrix A to 0.5 in the bandwidth 2 intermediate matrix C.

It is important to reiterate, however, that the location in the neighboring diagonal to which a split-point is shifted may already have nonzero entries, or fill may destroy the split-point during the first diagonal's reduction. As the separation of SPA and SPB grows beyond the current bandwidth, there are more opportunities for fill to destroy split-points during shifting. In addition, appropriate positioning of zero entries is essential to permit the use of the row/column exchanges needed for bulge chasing and band elimination shifts.

Having demonstrated several circumstances under which split-point SPA (SPB) can be shifted closer to the midpoint by using SPB (SPA), we are ready to consider the form of the induced split-point selection strategy more closely. Given the difficulties encountered by unidirectional schemes, encouraging split-point flipping seems a reasonable way to exploit split-point shifting. For example, suppose SPA and SPB are single zero split-points of equal displacement, and SPA is used to reduce the current subdiagonal, b. In diagonal (b - 1) the new SPA must have a larger displacement, so it seems reasonable to use SPB for diagonal (b - 1)'s reduction. Assuming SPA is not destroyed during (b - 1)'s reduction, it will be available for the elimination of diagonal (b - 2) with a potentially reduced displacement. This example does not imply, however, that the induced strategy should select split-points from strictly alternating sides of the midpoints of a sequence of diagonals. Unlike this case, there may be a wide discrepancy between the displacements of SPA and SPB. If split-points are chosen in an alternating fashion, a split-point with an unacceptably large displacement might be chosen over a split-point of significantly smaller displacement.
A solution to these difficulties is to promote split-point flipping with careful regulation of the selection process to avoid split-points with excessively large displacements.

In a similar fashion to the damped selection strategy, the induced split-point selection strategy controls the selection of each diagonal's split-point with the following hysteretic process. (We refer to the corresponding version of the Split Bandwidth Contraction algorithm as ISBC.)

IF (IF * |D_OS|) <= |D_SS| THEN
    select split-point OS
ELSE
    select split-point SS

Once again, SS and OS are the minimum displacement split-points on the same side and opposite side of the midpoint as the split-point chosen for the previous diagonal's reduction. D_SS and D_OS represent the displacements of split-points SS and OS. In the special cases when the roles of opposite and same side are undefined (see Section 5.1.4.3), the diagonal's minimum displacement split-point is chosen.

The inducement factor, IF, regulates the amount by which the displacement of OS may exceed the minimum displacement and still permit split-point flipping (0 <= IF <= 1). When IF = 1 the induced split-point selection strategy is equivalent to the minimum displacement strategy, while if IF = 0 split-points are selected from alternating sides of the midpoint.

As for previous selection strategies, we explored the relative merits of ISBC by performing partial bandwidth contractions of Table 5.3's 15 sparse problems with a Trisymb symbolic implementation. Each problem was contracted to a predetermined bandwidth using a variety of inducement factors in the range 0 < IF <= 1 and compared to SBC's contraction of the same problem.

Independent of the inducement factor and test problem, however, we did not observe significant reductions in the computational requirements of Split Bandwidth Contraction using ISBC. In fact, even with reasonably large inducement factors, ISBC significantly increased the bandwidth contraction costs of some problems relative to SBC. For example, contracting NOS3 to bandwidth 40 using ISBC and IF = 0.7 increases the flop count by 36% relative to standard SBC. The flop count totals of ISBC, however, are typically within 1% of SBC's costs when 0.80 <= IF < 1. Large flop count increases are usually restricted to smaller inducement factors, for which ISBC permits the selection of much higher displacement split-points to enable flipping. Small inducement factors do not always result in dramatic increases in computational requirements. In fact, BCSSTK12, DWT 1007, LSHP 577 and PLAT1919 are relatively immune to inducement factor variation.

The goal of the induced split-point selection strategy is to reduce bulge chasing requirements by choosing a sequence of split-points with minimal total displacement. To meet this goal it encourages split-point flipping to exploit shifting and to balance bulge chasing across the midpoints of the outermost diagonals. Split-point shifting was observed during the reduction of a majority of the test problems. In most cases, however, displacement savings resulting from the selection of shifted split-points just managed to offset the extra displacement of split-points chosen to provide split-point flipping. In other cases the extra displacement sacrificed to induce a split-point flip far exceeds any reduction in the displacement of subsequent split-point selections. Similarly, induced split-point selection improved the balance of bulge chasing for many problems, but its impact on the contraction of practical problems was lower than anticipated.
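The damped and induced rules differ only in the inequality that gates a flip. The following sketch, our own illustrative code rather than the thesis's implementation, expresses both hysteretic tests over the same-side and opposite-side candidates; d_ss and d_os are the displacements |D_SS| and |D_OS| defined above.

def choose_split_point(ss, os, d_ss, d_os, factor, mode="damped"):
    # Hysteretic choice between the same-side candidate `ss` and the
    # opposite-side candidate `os`.  `factor` is DF for the damped rule or
    # IF for the induced rule; both lie in [0, 1].
    if mode == "damped":
        # DSBC: stay on the same side unless the opposite side is markedly better.
        return ss if factor * d_ss < d_os else os
    else:
        # ISBC: flip to the opposite side unless it is markedly worse.
        return os if factor * d_os <= d_ss else ss

# With factor = 1 both rules reduce to plain minimum displacement selection;
# with factor = 0 the damped rule never flips and the induced rule always flips.
print(choose_split_point("ss", "os", d_ss=6, d_os=4, factor=0.5, mode="damped"))    # ss
print(choose_split_point("ss", "os", d_ss=6, d_os=4, factor=0.5, mode="induced"))   # os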
In conclusion, exploring the induced split-point selection strategy has clearly demonstrated, once again, that the paramount concern of split-point selection must be to choose split-points of local minimum displacement. Although the induced split-point selection strategy was not successful, through these investigations we have learned much about the dynamics of the Split Bandwidth Contraction algorithm.

5.1.4.5 The Final Form of the Split-Point Selection Strategy

The introduction to Section 5.1.4 identified several factors affecting the performance of split bandwidth contractions. The four global selection strategies investigated in previous subsections explored trade-offs between these factors from the perspective of split-point selection. Among the presented alternatives, these studies have clearly shown that selecting a minimum displacement split-point for each diagonal's elimination is the best overall strategy. It might be possible to design special selection strategies to more fully exploit common sparsity structure characteristics of specific classes of matrices, but the minimum displacement split-point selection strategy results in an efficient, generally applicable method.

As described in Section 5.1.4.1, the minimum displacement strategy currently breaks ties between two split-points of equal (minimal) displacement arbitrarily. Alternatively, we propose resolving these equivocal selections with either damped or induced split-point techniques. Neither technique improves upon the minimum displacement strategy when employed as a global strategy, but the damped split-point selection strategy generally degrades the performance of a contraction less than the induced approach. Consequently, the final version of SBC's find_split-point routine (see Section 5.1.3) employs the minimum displacement selection strategy and in the case of ties selects the split-point on the same side of the midpoint as the split-point chosen for the previous diagonal.

5.1.5 Performance of the Split Bandwidth Contraction Algorithm

To illustrate the effectiveness of the Split Bandwidth Contraction algorithm we performed experiments with the symbolic reduction tools Xmatrix and Trisymb, comparing SBC's flop and transformation requirements to those of BC, as well as sparse R-S and HYBBC. All results assume the use of fast Givens transformations and sparsity exploitation as described by each algorithm's definition.

Consider SBC's reduction of the small sparse example discussed in Sections 4.1.3 and 4.2.3, and illustrated by matrix A of Figure 4.5. Figure 5.13 illustrates the intermediate reduction matrix resulting from SBC's elimination of the three outermost nonzero diagonals of the band using the lower triangular split-points in columns 7, 9 and 8, as chosen by SBC's routine find_split-point. For comparison, Figure 5.13 also illustrates the corresponding intermediate matrix from BC's reduction of the same problem.

[Figure 5.13: A Partial Bandwidth Contraction; intermediate sparsity patterns under Split Bandwidth Contraction and Bandwidth Contraction.]

For this partial contraction BC uses 12 row/column exchanges and 16 nontrivial transformations, requiring 736 flops. Bidirectional elimination techniques, however, enable SBC to more fully exploit the sparsity of the band away from the main diagonal.
SBC uses 10 row/column exchanges and 12 nontrivial transformations, reducing computational requirements by 24% to 560 flops. In addition, Table 5.4 compares the transformation and flop requirements of tridiagonalizing this small example with SBC, sparse R-S, BC and HYBBC. The Split Bandwidth Contraction algorithm requires 30.5%, 23.1% and 14.5% fewer floating point operations than sparse R-S, BC and HYBBC respectively. Finally, we note SBC requires even fewer nontrivial transformations than sparse R-S.

Method       Row/Column Exchanges   Nontrivial Transformations   Flops
Sparse R-S   8                      132                          7232
BC           12                     163                          6537
HYBBC        12                     136                          5880
SBC          12                     130                          5026

Table 5.4: Tridiagonalization Summary

As demonstrated by the previous example, SBC effectively exploits band sparsity to efficiently tridiagonalize sparse matrices. However, we envisage SBC's most important contribution will be as a partial bandwidth contraction technique used by the first half of a hybrid tridiagonalization algorithm (see Section 5.2), and as a preprocessing technique for sparse linear systems solvers and other banded eigenvalue routines such as BQR [GBDM77] or parallel implementations of Lang's densely banded tridiagonalization [Lan92]. To gauge the potential of SBC in this role, we conducted an extensive comparison of partial BC and SBC contractions using Trisymb implementations and the practical problems of Section 2.2's test suite of Harwell-Boeing sparse problems. For each problem, both partial bandwidth contractions continue until reaching the Delta-transition bandwidth of the Hybrid Split Bandwidth Contraction algorithm to be discussed in Section 5.2.2. All problems were preordered with GPS. (A short illustrative sketch of loading one of these problems and checking its preordered bandwidth follows Table 5.5.)

These symbolic experiments confirm that relative to BC the Split Bandwidth Contraction algorithm significantly reduces the flop requirements of partial reductions. For the 70 test problems of order 400 or greater, SBC requires on average 17% fewer floating point operations than BC. From this subset of problems SBC reduces the flop requirements of 20 sparse problems by more than 35% relative to BC, and reductions of 40-50% are common.

Name       n, b_GPS, b_f     BC transformations            SBC transformations          BC flops    SBC flops   % Flop
                             (bulge chasing, r/c exch.)    (bulge chasing, r/c exch.)   (x10^6)     (x10^6)     Reduction
NOS5       468, 88, 49       21031 (14121, 3308)           14360 (7694, 3146)           7.5415      4.5001      40.3%
BCSSTK19   817, 18, 3        365696 (358087, 1185)         197039 (189348, 248)         27.731      15.118      45.5%
DWT 1007   1007, 34, 19      76910 (71201, 2682)           21554 (17952, 70)            14.064      3.9477      71.9%
CAN 1072   1072, 156, 48     289032 (239795, 17051)        180585 (128192, 16221)       145.15      86.232      40.6%
1138 BUS   1138, 126, 11     733720 (679001, 71770)        482193 (422750, 60327)       135.36      90.61       33.1%
ERIS1176   1176, 100, 4      1304873 (1251204, 19457)      627337 (575732, 70193)       201.04      72.857      63.8%
BCSPWR06   1454, 100, 12     1319287 (1254062, 92591)      646685 (583013, 68994)       269.94      121.35      55.1%
BCSSTK11   1473, 62, 33      354180 (326338, 1397)         184820 (157859, 1074)        119.62      61.780      48.4%
PLAT1919   1919, 80, 30      889023 (828261, 30884)        540000 (480506, 8813)        309.76      193.76      37.5%
BCSSTK25   15439, 238, 162   16171900 (15611226, 484692)   10273284 (9668324, 519577)   22543       13838       38.6%

Table 5.5: Selected Partial BC and SBC Contraction Summaries
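For readers who want to reproduce the flavour of these measurements, the following sketch loads a Harwell-Boeing matrix and reports its bandwidth before and after a bandwidth reducing preordering. It is illustrative only: SciPy does not ship the GPS ordering used in the thesis, so reverse Cuthill-McKee is substituted, and the file name is a placeholder.

import numpy as np
from scipy.io import hb_read
from scipy.sparse.csgraph import reverse_cuthill_mckee

def semi_bandwidth(S):
    # Semi-bandwidth of a sparse symmetric matrix S.
    coo = S.tocoo()
    return int(np.abs(coo.row - coo.col).max())

S = hb_read("bcsstk11.rsa")                  # placeholder path to a Harwell-Boeing file
perm = reverse_cuthill_mckee(S.tocsr(), symmetric_mode=True)
S_perm = S.tocsr()[perm, :][:, perm]         # symmetric permutation P^T A P
print(S.shape[0], semi_bandwidth(S), semi_bandwidth(S_perm))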
These results are consistent with the theoretical analysis of Section 5.1.3, which predicted that a well centred split-point permitted SBC to eliminate a nonzero diagonal from the special model problem with approximately half the computational effort of BC. This theoretical result, however, does not bound the maximum flop count reductions for these practical problems, which range as high as 71.9% for DWT 1007. SBC's reduction of this problem is able to take extra advantage of band sparsity to further shorten the length of bulge chasing sequences and dramatically improve performance.

Table 5.5 summarizes the computational requirements of partial BC and SBC contractions for 10 select test problems. The symbols b_GPS and b_f refer to the bandwidth of the GPS preordering and the final bandwidth of the partial contraction respectively. The two values in parentheses below each transformation total provide the number of bulge chasing transformations and row/column exchanges included in the total count. The final column of the table reports the reduction in flop counts achieved by SBC as a percentage of BC's costs.

SBC does not improve the efficiency of partial bandwidth contractions for all sparse problems. For example, the BC and SBC reductions of an L-shaped problem (LSHP 265, LSHP 406, ...) are equivalent, because of their special sparsity structures under GPS preorderings. As demonstrated by the plot of LSHP 265's sparsity structure in Figure 5.14, the envelope encompassing the nonzeros of the band is bow shaped. Its outermost nonzero diagonal consists of a single contiguous block of nonzeros straddling the midpoint. Consequently, when SBC chooses the split-point at the top end of this nonzero block, all entries above the split-point are zero and the diagonal's reduction reverts to a unidirectional elimination. Similarly, SBC's reduction of subsequent diagonals cannot exploit bidirectional techniques and the SBC and BC contractions use identical sequences of transformations. With a different preordering, such as GK, SBC is able to take better advantage of band sparsity than BC.

[Figure 5.14: The Sparsity Structure of LSHP 265 with a GPS Preordering; nz = 1753.]

For 8 of the 115 test problems SBC actually requires more floating point operations than BC. For 7 problems from this group, SBC exhibits moderate increases in flop requirements ranging from 2-12%. Although SBC generally requires fewer transformations for these matrices, peculiar characteristics of their sparsity structures result in a lower proportion of row/column exchanges than in BC's reduction. As a result, SBC may actually require more nontrivial transformations than BC, causing higher flop requirements. The tendency of SBC to trade row/column exchanges for nontrivial transformations also affected problems for which SBC significantly reduces flop requirements. In these cases, SBC does not meet performance improvement expectations that are based solely on the displacement of each diagonal's split-point. Currently, we do not have a general understanding of this complex phenomenon, and must deal with each problem on an individual basis.

For the final problem of this group, BCSSTK09, SBC increases flop requirements by a factor of more than 2.5 relative to BC. To understand this dramatic increase we examine the distinctive sparsity structure of BCSSTK09 illustrated in Figure 5.15.
[Figure 5.15: The Sparsity Structure of BCSSTK09 with a GPS Preordering.]

As SBC reduces the first few nonzero diagonals from the band, split-points are available close to the midpoint of each diagonal. Consequently, band nonzeros in region B are eliminated with column-oriented transformations, producing band fill in region A. This increases the local bandwidth of region A, gradually forcing nonzeros outwards to meet the edge of the contracting band. During BC's elimination of the same diagonals, transformations do not impinge upon region A and fill in this region is avoided until the bandwidth has been further contracted. As the effects of SBC's fill in region A cascade into the remainder of the band, however, SBC exhibits significantly higher intermediate band densities. For example, once the bandwidth of BCSSTK09 has been reduced from 95 to 76, SBC's intermediate reduction matrix has 1.87 times as many band nonzeros as BC's corresponding reduction intermediate. Consequently, each diagonal requires more band zeroing transformations and longer bulge chasing sequences, employing a dramatically higher proportion of nontrivial transformations. When BCSSTK09 is reduced by the numeric implementation of SBC discussed in Section 5.3, the discrepancy between the SBC and BC contractions is less pronounced, because row/column exchanges are no longer without cost. In this case, SBC requires only 16.5% more CPU seconds than BC.

With the exception of the previous groups of problems, the bidirectional elimination techniques of SBC generally reduce transformation requirements significantly. As the result of fewer nontrivial transformations, we anticipated, but did not observe, comparable reductions in intermediate band densities. This apparent inconsistency is due to transformation saturation. During a typical reduction, SBC updates particular pairs of rows and columns in the band with many nontrivial transformations. Only a few of these transformations, however, may actually produce fill entries within a row/column pair. Suppose a nontrivial transformation unions the sparsity structure of a pair of rows and columns. Subsequent transformations applied to the same pair will not affect their sparsity structures unless between transformations the sparsity of one of the rows or columns in the pair has been unioned with a third row or column. As a result, significantly lower transformation levels may not be accompanied by similar reductions in the band density of intermediate matrices. Despite reduced numbers of transformations, there may still be a sufficient variety of transformations to saturate the band and produce similar levels of band fill. (A small structural sketch of this union effect appears at the end of this subsection.)

This section has evaluated the Split Bandwidth Contraction algorithm by manipulating sparsity structures with symbolic implementations. Discussion in Section 5.4.1 confirms the general characteristics of these results with numerical routines for SBC.
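To make the saturation argument concrete, the following sketch models the structural effect of a nontrivial transformation in the spirit of the symbolic experiments above: the sparsity patterns of the two affected rows (and, by symmetry, columns) are replaced by their union, so repeating a transformation on the same pair adds no new fill. The code is our own illustration, not the thesis's implementation, and ignores numerical cancellation.

def union_pair(pattern, i, k):
    # Symbolically apply a nontrivial transformation to rows/columns i and k:
    # both row patterns become the union of the two.  `pattern` maps a row
    # index to the set of column indices holding structural nonzeros.
    merged = pattern[i] | pattern[k]
    pattern[i] = set(merged)
    pattern[k] = set(merged)
    # Preserve structural symmetry: every column in the merged pattern now
    # holds nonzeros in rows i and k as well.
    for col in merged:
        pattern[col].update({i, k})
    return pattern

# Two transformations on the same pair: only the first produces fill.
pattern = {0: {0, 1}, 1: {0, 1, 2, 3}, 2: {1, 2}, 3: {1, 3}}
before = sum(len(s) for s in pattern.values())
union_pair(pattern, 0, 1)
after_first = sum(len(s) for s in pattern.values())
union_pair(pattern, 0, 1)
after_second = sum(len(s) for s in pattern.values())
print(before, after_first, after_second)   # 10 14 14: the pair has saturated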
5.2 The Hybrid Split Bandwidth Contraction Algorithm

Chapter 4's Hybrid Bandwidth Contraction algorithm, HYBBC, combines the BC and R-S algorithms to produce an effective two stage tridiagonalization. HYBBC's first stage employs BC, to exploit its efficient contraction of a sparse band, but switches to R-S once the contracted band becomes too dense and the costs of completing the tridiagonalization with BC overtake those of R-S. As shown in Section 5.1, however, partial bandwidth contractions using the bidirectional elimination techniques of the Split Bandwidth Contraction algorithm are often dramatically more efficient than BC's unidirectional elimination. This observation suggests replacing BC with SBC to create a second generation tridiagonalization algorithm, the Hybrid Split Bandwidth Contraction algorithm or HYBSBC.

To produce an efficient and versatile tridiagonalization algorithm, HYBSBC must also switch to R-S when band fill reduces the effectiveness of the SBC contraction and R-S can more efficiently complete the reduction. Consider SBC's reduction of a special family of symmetric model problems with n = 1000 and various bandwidths. Each matrix is densely banded, but we assume under an SBC reduction that a single split-point in column m of the lower triangular portion is available for each diagonal's elimination. Of course, this is an unlikely sequence of sparsity structures for SBC to encounter, because it is only possible if numerical cancellation provides the appropriately positioned split-points. Using the complexity results for R-S and SBC in Tables 3.4 and 5.1, the first graph in Figure 5.16 plots the tridiagonalization flop requirements of SBC with centred split-points, normalized by the cost of R-S, against the bandwidth of the model problem. As expected, this plot shows that as long as SBC finds a well centred split-point for each diagonal's reduction, it provides tridiagonalizations with significantly lower flop requirements than R-S. These densely banded model problems, however, provide a worst case analysis. If SBC finds a well centred split-point for each diagonal while tridiagonalizing a sparsely banded problem, it typically enjoys even lower flop requirements relative to R-S.

[Figure 5.16: The Tridiagonalization Flop Requirements of SBC Relative to R-S for a Densely Banded Model Problem, n = 1000. The first panel plots the ratio against bandwidth with centred split-points; the second plots it against the split-point column m, with each diagonal eliminated by a split-point in column m and b = 150.]

For the reduction of practical problems SBC cannot expect to always find a well centred split-point for the elimination of each diagonal. Like BC, SBC must also contend with fill entries inside the contracting band. As discussed in Sections 5.1.4.1 and 5.1.4.4, during SBC's reduction of a typical sparse problem an expanding block of nonzeros, straddling the centre of the outermost diagonal of successive intermediate matrices, forces split-points away from the midpoint. For example, Figure 5.17 plots the displacements of the split-points used by SBC during its reduction of BCSPWR08 from bandwidth 108 to 12. As predicted by the analysis of Section 5.1.4.4, split-point displacements decrease occasionally during the reduction, but the dominant trend is towards split-points of higher and higher displacement. The speed with which split-points are forced away from the midpoint is a measure of SBC's relative success.

[Figure 5.17: Split-Point Displacements of SBC's Partial Contraction of BCSPWR08; two panels plotted against intermediate bandwidth.]

If split-points are sufficiently far from the midpoint, the cost of using SBC to complete the tridiagonalization may rise above that of R-S.
For example, the second graph in Figure 5.16 plots the tridiagonalization flop requirements of SBC, normalized by the costs of R-S, as a function of split-point positioning for the special model problem with b = 150. Each SBC reduction assumes the lower triangular split-point for each diagonal's elimination is in the same column, m. It is not surprising that for sufficiently small m SBC's flop counts exceed those of R-S, considering SBC converges to the BC algorithm as split-points with larger and larger displacement are employed. In fact, Figure 4.7 demonstrates that when each diagonal's split-point is in column 1, and SBC is equivalent to BC, SBC may require more than 20% additional flops relative to R-S for select densely banded problems. Consequently, if the Hybrid Split Bandwidth Contraction algorithm is to be efficient and generally applicable, it must be able to detect SBC's inability to effectively reduce the current intermediate matrix and switch to R-S to complete the tridiagonalization.

The remainder of this section provides a formal description of the HYBSBC algorithm before developing the Delta-transition strategy, which precisely regulates the transition between the algorithm's SBC and R-S stages.

5.2.1 An Algorithmic Framework for HYBSBC

To formalize the previous discussion, Figure 5.18 provides a pseudocode framework defining the Hybrid Split Bandwidth Contraction algorithm. Once again, HYBSBC begins by symmetrically permuting the sparse matrix to reduce bandwidth. While the matrix is not in tridiagonal form and the transition condition has not been met, the algorithm then contracts the band diagonal by diagonal using the bidirectional elimination techniques of SBC. The column index, sp, of the lower triangular split-point used for a diagonal's elimination is returned by the function find_split-point defined in Section 5.1.4.5. Once the transition bandwidth b_t identified by the function transition has been reached, sparse R-S completes the tridiagonalization if b_t > 1. This second stage of the algorithm is identical to Section 3.5.2's description of sparse R-S, except HYBSBC does not perform an additional preordering.

As for the HYBBC algorithm, a sensitive design issue is the selection of an appropriate transition strategy. For example, suppose we apply HYBSBC to the small sparse problem in Figure 4.5 and use a transition strategy that chooses b_t = 3. In this case HYBSBC's tridiagonalization uses 11 row/column exchanges and 127 nontrivial transformations, increasing the cost of tridiagonalization relative to SBC by 9% to 5473 flops. To avoid such inefficiencies and to take full advantage of SBC, the following subsection develops the Delta-transition strategy, which is shown to select optimal transition bandwidths for practical sparse problems.

1. A := P^T A P, where P is a bandwidth reducing permutation matrix.
2. b := bandwidth(A)
3. (a) b^c := b
   (b) sp := find_split-point(A, b^c)
   (c) /* While the matrix is not tridiagonal and the transition condition has */
       /* not been met, eliminate the outermost nonzero diagonal with SBC. */
       WHILE ( (b^c >= 2) AND (NOT transition(n, b^c, sp)) ) DO
          i. FOR col := sp - 1 DOWNTO 1 DO   /* Eliminate above the split-point. */
                IF A_{col+b^c, col} != 0 THEN   /* Zero A_{col+b^c, col}. */
                   IF A_{col+b^c, col+1} = 0 THEN
                      Exchange rows/columns (col) and (col + 1) in A.
                   ELSE
                      A := G(col, col+1, theta)^T A G(col, col+1, theta)
                      IF bandwidth(A) > b^c THEN
                         Chase bulges with additional column-oriented adjacent
                         Givens transformations or row/column exchanges.
                ENDIF   /* Outermost IF */
         ii. FOR col := sp + 1 TO n - b^c DO   /* Eliminate below the split-point. */
                IF A_{col+b^c, col} != 0 THEN   /* Zero A_{col+b^c, col}. */
                   IF A_{col+b^c-1, col} = 0 THEN
                      Exchange rows/columns (col + b^c) and (col + b^c - 1) in A.
                   ELSE
                      A := G(col+b^c, col+b^c-1, theta)^T A G(col+b^c, col+b^c-1, theta)
                      IF bandwidth(A) > b^c THEN
                         Chase bulges with additional row-oriented adjacent
                         Givens transformations or row/column exchanges.
                ENDIF   /* Outermost IF */
        iii. b^c := b^c - 1
         iv. sp := find_split-point(A, b^c)
   (d) b_t := b^c   /* Record the transition bandwidth. */
   (e) IF b^c > 1 THEN complete the tridiagonalization with sparse R-S.

Figure 5.18: The Hybrid Split Bandwidth Contraction Tridiagonalization Algorithm
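Read as a program, Figure 5.18 is a simple two-stage driver. The skeleton below is our own paraphrase in Python, not the thesis's implementation; find_split_point, eliminate_diagonal_sbc, rutishauser_schwarz and transition stand for the corresponding routines and are assumed to be supplied by the caller.

import numpy as np

def bandwidth(A):
    rows, cols = np.nonzero(A)
    return int(np.abs(rows - cols).max()) if rows.size else 0

def hybsbc(A, find_split_point, eliminate_diagonal_sbc, rutishauser_schwarz, transition):
    # Stage one: SBC bandwidth contraction until the transition test fires.
    # Stage two: sparse Rutishauser-Schwarz on the contracted band.
    n = A.shape[0]
    bc = bandwidth(A)                       # current semi-bandwidth
    sp = find_split_point(A, bc)
    while bc >= 2 and not transition(n, bc, sp):
        A = eliminate_diagonal_sbc(A, bc, sp)   # bidirectional elimination of diagonal bc
        bc -= 1
        sp = find_split_point(A, bc)
    bt = bc                                 # transition bandwidth
    if bt > 1:
        A = rutishauser_schwarz(A, bt)      # band-preserving tridiagonalization
    return A, bt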
5.2.2 The Delta-Transition Strategy

The design of transition strategies for the HYBBC algorithm was hindered by our inability to cheaply estimate the cost of eliminating a single diagonal with BC from a general sparsely banded matrix. Consequently, HYBBC employs a relatively unsophisticated transition scheme based on density measures of the outermost nonzero diagonal. In contrast, HYBSBC lends itself to formal analysis. By exploiting characteristics of a typical SBC reduction, the Delta-transition strategy is able to use complexity analyses of SBC and R-S to precisely regulate HYBSBC's transition bandwidth and minimize computational requirements. After detailing the Delta-transition strategy, this section assesses its potential using a subset of the Harwell-Boeing test suite and symbolic reduction tools.

5.2.2.1 A Theoretical Basis

As demonstrated in the introduction to Section 5.2, the Rutishauser-Schwarz algorithm is clearly superior to bandwidth contraction schemes for the tridiagonalization of matrices with a dense band. When applying HYBSBC to a band with entries that are zero, however, Figure 5.16 demonstrates that continuing the SBC stage of the reduction is cost-effective while a split-point exists near enough to the midpoint of the outermost nonzero diagonal. Unfortunately, as a typical reduction proceeds the best split-points available for each diagonal's elimination generally exhibit larger and larger displacements. At a particular intermediate bandwidth, allowing SBC to continue one extra step using an off-centre split-point before switching to R-S may be more costly than switching to R-S immediately. The job of HYBSBC's transition strategy is to identify the optimal transition bandwidth b_t that minimizes the following cost function.

    Cost_SBC(b -> b_t) + Cost_R-S(b_t -> 1)    (5.1)

Cost_SBC() and Cost_R-S() are problem dependent functions representing the computational requirements of SBC and R-S performing the indicated reductions. To permit the development of a generally applicable transition scheme, we simplify Equation 5.1's minimization by assuming that the displacements of split-points selected by an SBC reduction increase monotonically with contracting bandwidth. Although SBC violates this assumption for several sparse problems, decreases in split-point displacements are typically small local anomalies, overwhelmed by the dominant trend towards progressively larger displacements. (See Figure 5.17.)

If we assume monotonic increases in split-point displacement, it is also reasonable to assume that the flop requirements of eliminating each successive diagonal increase as the reduction progresses. A dominant factor governing the cost of SBC's elimination of a diagonal from a sparse band is the displacement of its best split-point.
Increasing the split-point's displacement generally increases the cost of a diagonal's elimination. In addition, if split-point displacement is held constant as we lower the bandwidth of the intermediate matrix from which the diagonal is eliminated, elimination costs also generally rise. Figure 5.19 demonstrates this characteristic of SBC reductions for a family of densely banded matrices with a single split-point of displacement 100 in their outermost diagonal. Finally, we also assume the cost of completing a sparse tridiagonalization with R-S essentially decreases at a constant rate as the bandwidth contracts. This is a practical assumption, considering the speed with which R-S typically fills the band of a sparse matrix and that the dominant term of the densely banded flop analysis of R-S, F_R-S, is (4b + 6)n^2. Using the full F_R-S analysis, Figure 5.20 plots the flop requirements of R-S, as a function of b, for the tridiagonalization of densely banded matrices of order 1000.

[Figure 5.19: The Cost of Eliminating the Outermost Nonzero Diagonal of a Densely Banded Matrix (n = 1000) Using a Split-Point of Displacement 100.]

[Figure 5.20: R-S Tridiagonalization Costs of Densely Banded Matrices, n = 1000; flops plotted against the initial bandwidth.]

Given these assumptions, the transition strategy can approximate the minimization of Equation 5.1's cost function by comparing the following costs before the elimination of each diagonal. The bandwidth of the reduction's current intermediate matrix is given by b^c.

    C3 = Cost_SBC(b^c -> (b^c - 1)) + Cost_R-S((b^c - 1) -> 1)    (5.2)
    C4 = Cost_R-S(b^c -> 1)    (5.3)

As long as C3 < C4 HYBSBC should continue eliminating diagonals with SBC. Once C3 grows larger than C4, however, the reduction switches to R-S to complete the tridiagonalization.

Although our assumptions have greatly simplified the optimization of Equation 5.1's cost function, predicting C3 and C4 exactly for general sparsity structures is difficult without unacceptably expensive modeling. The Delta-transition strategy avoids the difficulties of exact analysis by employing the results of two theoretical analyses, F_R-S and F_SBC1 from Sections 3.4.1 and 5.1.3, to approximate the components Cost_SBC() and Cost_R-S() of C3 and C4. Throughout the following discussion we assume HYBSBC employs fast Givens transformations, but a completely analogous scheme could be developed for implementations based on standard Givens transformations.

F_R-S and F_SBC1 estimate the flop requirements of specific densely banded reductions, but the Delta-transition strategy will be applied during the tridiagonalization of matrices with a wide variety of sparsely banded structures. A number of factors make the use of F_R-S and F_SBC1 an acceptable compromise. First, without having detailed knowledge of a banded intermediate's sparsity structure, both analyses provide an upper bound on the flop requirements of a corresponding sparsely banded reduction. For example, given the triple of values (n, b^c, m) and no additional knowledge of the band's sparsity structure, F_SBC1 approximates the cost of eliminating the outermost nonzero diagonal by assuming the worst case in which all band entries except the split-point are nonzero. As the reduction proceeds the band suffers from fill and split-points are forced away from the midpoint.
This makes the predicted flop counts more and more accurate, because information about the outermost diagonal's sparsity structure is implicitly represented by the minimum displacement split-point selected by SBC. We know, for example, that there is a contiguous block of at least 2(ceil((n - b^c)/2) - m + 1) nonzeros straddling the midpoint. Once split-points are sufficiently far from the midpoint to perhaps warrant a transition, this nonzero block typically comprises a large majority of the sparse diagonal's entries. As discussed in Section 4.3.2, band sparsity is best exploited at the transformation level. Thus during the critical period of HYBSBC's reduction, F_SBC1 provides progressively more accurate approximations of the outermost diagonal's reduction costs. F_R-S also provides good approximations of the true cost of tridiagonalizing the contracted band of intermediate matrices, because of the speed with which R-S typically fills the band of a sparse matrix. Once again, the longer the reduction remains in the SBC stage of the algorithm producing fill within the band, the better F_R-S's approximation becomes.

Using the results of these analyses we define the Delta-transition function, Delta(n, b^c, m), as the difference between costs C4 and C3.

    Delta(n, b^c, m) = F_R-S(n, b^c) - [F_SBC1(n, b^c, m) + F_R-S(n, b^c - 1)]    (5.4)

Each of the three flop counts in this formula incorporates the one time costs, given by Equation 2.16, of applying the transformation D^{1/2} T D^{1/2} at the end of a reduction employing fast Givens transformations. The Delta-transition strategy should not consider these costs, and they are nullified by adding 3n to the right hand side of Equation 5.4.

    Delta(n, b^c, m) = F_R-S(n, b^c) - [F_SBC1(n, b^c, m) + F_R-S(n, b^c - 1)] + 3n    (5.5)

To employ the Delta-transition strategy, HYBSBC checks the value of Delta(n, b^c, m) before SBC eliminates each diagonal of the band. While Delta(n, b^c, m) > 0, the transition strategy confirms the cost-effectiveness of continuing the band's contraction with bidirectional elimination techniques. When Delta(n, b^c, m) drops below zero, however, HYBSBC switches to R-S to complete the tridiagonalization. Because the storage costs of HYBSBC are independent of the transition bandwidth, the Delta-transition strategy is based solely upon the prediction of computational requirements.

The following subsection presents explicit formulas for Delta(n, b^c, m) and investigates their practical application.
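In code, the test of Equation 5.5 is a one-line comparison once the two complexity formulas are available. The sketch below is a schematic of that check, not the thesis's implementation: F_rs(n, b) and F_sbc1(n, b, m) stand for the densely banded flop analyses of R-S and SBC (the thesis's explicit expressions appear in Figures 5.21 and 5.22 and are not reproduced here) and are passed in as callables; the toy cost models at the bottom are crude stand-ins for illustration only.

def delta(n, bc, m, F_rs, F_sbc1):
    # Equation 5.5: predicted advantage of one more SBC step over an
    # immediate switch to R-S (positive means keep contracting with SBC).
    return F_rs(n, bc) - (F_sbc1(n, bc, m) + F_rs(n, bc - 1)) + 3 * n

def continue_with_sbc(n, bc, m, F_rs, F_sbc1):
    # Delta-transition decision taken before each diagonal's elimination.
    return m is not None and delta(n, bc, m, F_rs, F_sbc1) > 0

# Toy usage with stand-in cost models (NOT the thesis's formulas): the R-S
# model keeps only the dominant (4b + 6)n^2 term quoted above.
F_rs_toy = lambda n, b: (4 * b + 6) * n ** 2
F_sbc1_toy = lambda n, b, m: 12 * (b + 2) * (n - b)   # rough per-diagonal estimate
print(continue_with_sbc(1000, 100, 450, F_rs_toy, F_sbc1_toy))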
5.2.2.2 A Practical Delta-Transition Strategy

We explicitly form the Delta-transition function Delta(n, b^c, m) by inserting the complexity results F_R-S and F_SBC1 from Sections 3.4.1 and 5.1.3 into Equation 5.5. Operation count summation difficulties encountered during SBC's analysis forced F_SBC1 to be separated into two independent results, each valid for a specific range of split-point displacements. Correspondingly, we construct the two formulas for Delta(n, b^c, m) shown in Figures 5.21 and 5.22. Both formulas assume that the column index m of the split-point refers to an entry in the top half of the lower triangular diagonal under elimination. In those cases when the split-point lies in the diagonal's bottom half, m is simply assigned the column index of the entry in the diagonal's upper half with identical displacement. When 1 <= m <= b^c, we use Delta_bc(n, b^c, m) to compute the value of the Delta-transition function, while if b^c < m we employ Delta_mid(n, b^c, m). Figure 5.23 plots the Delta-transition function for matrices with n = 1000 and bandwidths of 10, 50, 150 and 300.

[Figure 5.21: The Delta-Transition Function Delta_bc(n, b^c, m) for 1 <= m <= b^c; an explicit polynomial expression in n, b^c and m with Mod() correction terms.]

[Figure 5.22: The Delta-Transition Function Delta_mid(n, b^c, m) for b^c < m; an explicit polynomial expression in n, b^c and m with Mod() correction terms.]

[Figure 5.23: Variance of Delta(n, b^c, m) with Split-Point Positioning; n = 1000 and b^c = 10, 50, 150 and 300.]

Each formula has been simplified to reduce its evaluation cost, but both remain relatively complicated and lengthy. Many of the encumbering terms result from the nonanalytic correction terms involving Mod(). As discussed in Sections 3.4.1 and 5.1.3, these terms can be safely ignored under many circumstances. When the value of b^c is large relative to n, however, omitting these terms can shift the value of m at which Delta(n, b^c, m) is zero and significantly alter the behavior of the Delta-transition strategy. Fortunately, the cost of evaluating Delta(n, b^c, m) in its entirety before the reduction of each diagonal is minimal. Table 5.6 summarizes the total number of floating point and Mod() operations required to compute Delta_bc(n, b^c, m) or Delta_mid(n, b^c, m). Although many terms of the formulas could be evaluated using integer operations, we assume all calculations are conducted using floating point. Assuming a Mod() operation is equivalent to approximately 3 flops, Delta_bc and Delta_mid require 66 and 62 flops respectively. These costs are insignificant in comparison to the cost of reducing a typical diagonal from a moderately large practical problem. In fact, the cost of evaluating either formula is essentially equivalent to the cost of applying two fast Givens transformations to a penta-diagonal symmetric matrix.
As a result, the cost of a generic diagonal's reduction decreases as the current bandwidth gets larger. (For an example see Figure 5.19.) Thus if HYBSBC selects a poor transition bandwidth b_t > n/3 that is a few diagonals from the optimal transition, the penalty incurred is very small relative to the cost of HYBSBC's entire reduction.

The Δ-transition strategy can adequately regulate the reduction of sparse matrices with large bandwidths by simply adopting the density thresholding transition techniques of HYBBC while b_c > n/3. In this case, if (b_c > n/3) and (nzcnt < threshold * (n - b_c)), then HYBSBC eliminates the next diagonal with SBC. (nzcnt and threshold are defined in Section 4.2.2.) Rather than explicitly counting the number of nonzeros in the diagonal, however, we propose approximating nzcnt using the sparsity information implicit in the position of the chosen split-point. By definition all entries closer to the diagonal's midpoint than the selected split-point in column sp must be nonzero. Recall m is the column index of the split-point normalized to lie in the top half of the matrix's lower triangular diagonal. If we assume all entries in the first and last m columns of the diagonal are zero, nzcnt = n - b_c - 2m. Of course, in general n - b_c - 2m is a lower bound on nzcnt. Using n - b_c - 2m to approximate nzcnt, Figure 5.24 provides a pseudocode framework for the Δ-transition strategy by defining the boolean function transition used in Figure 5.18's description of the HYBSBC algorithm. We assume Δ(n, b_c, m) represents the form of the Δ-transition function, Δ_bc(n, b_c, m) or Δ_mid(n, b_c, m), appropriate for the specific values of m and b_c.

    FUNCTION transition(n, b_c, sp)
        IF (sp > (n - b_c)/2) THEN
            m = n - b_c - sp + 1
        ELSE
            m = sp
        IF (b_c > n/3) THEN
            IF ((no split-point) OR (n - b_c - 2m >= threshold * (n - b_c))) THEN
                return TRUE
            ELSE
                return FALSE
        ELSE IF ((split-point exists) AND (Δ(n, b_c, m) > 0)) THEN
            return FALSE
        ELSE
            return TRUE
    END

    Figure 5.24: HYBSBC's Δ-Transition Strategy
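To make the transition test concrete, the following short sketch restates Figure 5.24 in executable form. It is an illustrative Python rendering only, not the FORTRAN routine described in Section 5.3: the callables fk and fk1, the argument names, and the treatment of a missing split-point as None are assumptions made for this example, with fk(n, b) and fk1(n, b, m) standing in for the closed-form flop-count models F_K and F_K1 of Figures 5.21 and 5.22.

# Illustrative sketch of Figure 5.24's transition test (not the thesis code).
# fk(n, b) and fk1(n, b, m) are assumed callables returning the flop-count
# models F_K and F_K1; sp is the split-point column, or None if the
# outermost diagonal is full.

def delta(n, bc, m, fk, fk1):
    # Equation 5.5: predicted saving from eliminating one more diagonal with
    # SBC's bidirectional techniques instead of handing the band to R-S now.
    return fk(n, bc) - (fk1(n, bc, m) + fk(n, bc - 1)) + 3 * n

def transition(n, bc, sp, fk, fk1, threshold=1.0):
    """Return True when HYBSBC should switch from SBC to R-S."""
    if sp is None:                                # no split-point: diagonal is full
        return True
    # Normalize the split-point column to the top half of the diagonal.
    m = (n - bc - sp + 1) if sp > (n - bc) / 2 else sp
    if bc > n / 3:
        # Outside the range of validity of the Delta formulas: fall back to
        # HYBBC-style density thresholding, approximating nzcnt by n - bc - 2m.
        return (n - bc - 2 * m) >= threshold * (n - bc)
    return delta(n, bc, m, fk, fk1) <= 0

Such a test is invoked only once per diagonal, so, as Table 5.6 indicates, its cost is negligible next to the reduction itself.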
5.2.2.3 Evaluation of the Δ-Transition Strategy

A general formal analysis of the optimality of the Δ-transition strategy, suitable for all sparse symmetric matrices, is not possible. To gauge the success of the Δ-transition strategy for practical problems, we conduct two sets of experiments with Trisymb symbolic implementations of HYBSBC and the Harwell-Boeing test suite. For these experiments the Δ-transition strategy uses a threshold of 1.0.

To benchmark the success of the Δ-transition strategy, our first set of experiments compares a symbolic implementation of HYBSBC to an almost identical implementation that replaces the Δ-transition strategy with a much simpler approach. This alternative transition strategy directs the hybrid algorithm to continue the SBC phase of the reduction until the matrix is in tridiagonal form or its outermost nonzero diagonal is void of split-points. For 46 problems in the Harwell-Boeing test suite, the Δ-transition strategy chooses a higher transition bandwidth than the no-split scheme. Table 5.8 presents examples of problems for which the Δ-transition strategy suggests a dramatically large increase in the transition bandwidth.

    Name         n     b_GPS   no-split transition   Δ-transition   % flop reduction,
                               bandwidth             bandwidth      no-split to Δ
    CAN 445      445     74          7                    33              5.8%
    CAN 634      634    100         14                    57              6.7%
    BCSSTK08    1074    475          1                   159              8.1%
    BCSSTK11    1473     62          1                    33             13.0%
    BCSSTK23    3134    351         10                    66              2.0%
    BCSSTK25   15439    238          1                   162              5.8%

    Table 5.8: Sparse Problems with Widely Varying no-split and Δ-Transition Bandwidths

For a majority of the problems, however, the difference between the two schemes' transition bandwidths is much smaller relative to the problem's initial permuted bandwidth.

In comparison to the no-split approach, the Δ-transition strategy successfully reduces the flop requirements for each of the 46 problems with an earlier transition. Reductions in flop requirements range from 0.5 to 20%, but average 8.1% over all 46 problems. Figure 5.25 summarizes the distribution of flop reductions achieved by the Δ-transition strategy. As expected, the flop reductions attained by the Δ-transition strategy are comparable to the differences between the complexity of BC and R-S for densely banded problems observed in Section 4.2.1.

Figure 5.25: Distribution of the Δ-Transition Strategy's Flop Reductions

As predicted in Section 5.2.2.2, the cost of evaluating Δ(n, b_c, m) before SBC's reduction of each diagonal is an insignificant fraction of the entire tridiagonalization cost. For Harwell-Boeing problems with n > 100, the cost of evaluating the Δ-transition function is at most 0.2% of the total reduction costs. Typically, the fraction of the total reduction cost attributable to the Δ-transition strategy is at least an order of magnitude lower than this value.

The previous experiments clearly show that the Δ-transition strategy is superior to the rudimentary no-split transition bandwidth approach, but how well does it predict optimal transition bandwidths? To determine the precision with which the Δ-transition strategy regulates the transition bandwidth, we investigate more closely the tridiagonalization costs of two representatives from each of the following general classes of sparse Harwell-Boeing problems.

1. Problems with very little difference between their no-split and Δ-transition bandwidths, but for which HYBSBC provides a very efficient tridiagonalization.

2. Problems for which there is a large difference between the no-split and Δ-transition bandwidths.

ERIS1176 and 1138 BUS are chosen to represent the first class of problems, while BCSSTK11 and CAN 445 are selected from the second class.

Using Trisymb we examine the change in tridiagonalization flop requirements of each problem as transition bandwidths are perturbed from the Δ-transition bandwidth b_t. In addition to the HYBSBC reduction, we attempt 16 tridiagonalizations with fixed transition bandwidths of b_t + offset, where offset = -8, -7, ..., -1, +1, ..., +7, +8. For each problem, Figure 5.26 plots the cost of these reductions normalized by the flop requirements of HYBSBC (offset = 0). Offsets -8, -7, ..., and -4 produce invalid transition bandwidths for ERIS1176 (b_t = 4) and are not included in its plot. In addition, it is not possible to select a transition bandwidth for 1138 BUS lower than 11, the Δ-transition bandwidth, because the outermost nonzero diagonal of the corresponding intermediate matrix is full. Although these plots only consider a small subset of possible transition bandwidths, expanding the number of reductions conducted for each problem does not change the exhibited general trends in tridiagonalization cost.

The Δ-transition strategy clearly chooses the optimal transition bandwidth for both BCSSTK11 and 1138 BUS. During the reduction of ERIS1176, HYBSBC transfers to R-S one diagonal prematurely, but HYBSBC's cost is within 0.63% of the optimal symbolic reduction's flop requirements.
Figure 5.26: Variation in Tridiagonalization Flop Counts with the Transition Bandwidth Offset from the Δ-Transition Bandwidth, b_t

Technically, HYBSBC also transfers control of CAN 445's reduction to R-S one diagonal too soon. The difference between the cost of the HYBSBC and optimal reductions, however, is an insignificant 0.0015%. The general characteristics of the Δ-transition strategy exhibited by the plots of these four problems' reduction costs are typical of the hybrid algorithm's reduction of other Harwell-Boeing problems. With respect to flop requirements, these results demonstrate that in practice the Δ-transition strategy selects optimal transition bandwidths for practical sparse problems. The optimality of the Δ-transition strategy will be further investigated in Section 5.4.3 using numerical implementations of the Hybrid Split Bandwidth Contraction algorithm.

5.3 Numerical Implementation of SBC and HYBSBC

This section provides an overview of implementations of the Split Bandwidth Contraction and Hybrid Split Bandwidth Contraction algorithms. The following discussion relies on Section 4.3's description of aspects of these implementations inherited from the BC and HYBBC implementations, while concentrating upon the implementation of features unique to SBC and HYBSBC.

5.3.1 Implementation Basics

Once again, we rely upon existing preordering algorithm implementations to conduct the preordering phase of both SBC and HYBSBC, and direct our attention to the implementation of each reduction algorithm's second phase.

The numerical implementation of HYBSBC consists of two FORTRAN modules. The first module is a completely redesigned group of routines implementing SBC's bidirectional elimination techniques using fast Givens transformations. Within this module, the find_split-point routine identifies minimum displacement split-points by starting at the midpoint of each diagonal and searching in both directions simultaneously for the closest zero entry. When two split-points of equal minimal displacement are found, it uses a global variable indicating the region from which the last diagonal's split-point was selected to choose the best split-point with damped tiebreaking. The cost of conducting this sequential search before the elimination of each diagonal is insignificant relative to the cost of the entire reduction. To permit meaningful comparison of numeric and symbolic HYBSBC reductions, the find_split-point routine was carefully implemented to ensure the symbolic and numeric versions of SBC select the same split-point sequence for an identical series of sparsity structures. Of course, numerical cancellation is not modeled by Trisymb, and for some practical problems the structure of the intermediate matrices encountered by the symbolic and numeric routines may differ despite equivalent split-point selection routines.
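The bidirectional search with damped tiebreaking described above can be outlined as follows. This is an illustrative Python sketch, not the thesis FORTRAN routine; the representation of the outermost diagonal as a 0/1 array and the names find_split_point, diag and last_region are assumptions made only for the example.

# Sketch of the find_split-point search: scan outward from the midpoint of the
# outermost diagonal, in both directions at once, for the nearest zero entry;
# break ties using the region ('low' or 'high') chosen for the previous
# diagonal's split-point (damped tiebreaking).

def find_split_point(diag, last_region):
    """Return (index, region) of a minimum-displacement zero entry, or None."""
    length = len(diag)
    mid = (length - 1) // 2
    for d in range(mid + 1):                      # displacement from the midpoint
        lo, hi = mid - d, mid + d
        lo_zero = lo >= 0 and diag[lo] == 0
        hi_zero = hi < length and diag[hi] == 0
        if lo_zero and hi_zero:
            # Equal minimal displacement: stay in the previously used region.
            return (lo, 'low') if last_region == 'low' else (hi, 'high')
        if lo_zero:
            return (lo, 'low')
        if hi_zero:
            return (hi, 'high')
    return None                                    # the diagonal is full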
HYBSBC's Δ-transition strategy was implemented as a separate subroutine closely following Figure 5.24. It uses a threshold of 1 for problems whose permuted bandwidth is greater than n/3. As predicted in Sections 5.2.2.2 and 5.2.2.3, the fraction of the total tridiagonalization time spent in the Δ-transition routine is insignificant. As an example, ERIS1176 requires 96 calls to transition(n, b_c, sp), but the total number of CPU seconds used by these calls is below the precision of the built-in FORTRAN timing routine etime. Artificially increasing the number of calls by 500, transition(n, b_c, sp) requires a total of 0.01 seconds. This represents less than 0.02% of the total CPU requirements of ERIS1176's tridiagonalization.

Using its implementation of the Δ-transition strategy, HYBSBC regulates the transition to routines in the second module, which implements the column-oriented phase of the reduction. The central routine in this module is a modified version of EISPACK's FORTRAN routine BANDR (an R-S code) employed by HYBBC. Once again, the speed with which a typical sparse band fills during an R-S reduction makes it impractical to use a sparsely banded R-S code for this portion of HYBSBC.

We also developed a separate implementation of SBC capable of partial bandwidth contractions and complete tridiagonalizations. It is essentially identical to HYBSBC's SBC module, but eliminates the switch to R-S using the Δ-transition strategy. Instead, it continues the contraction until reaching the desired final bandwidth. If it encounters a dense outermost diagonal during the reduction, control transfers to a special implementation of BC, which omits initializations, to complete the reduction.

As defined in Sections 5.1.3 and 5.2.1, SBC and HYBSBC incorporate the three techniques listed in Section 3.5.2 for exploiting band sparsity during the elimination of individual nonzero entries. The experimental study described in Section 4.3.2, however, clearly demonstrates that band sparsity is best exploited at the transformation level by identifying entries that are already zero or that the algorithm can eliminate with an adjacent row/column exchange. Performing sparse transformations, in contrast, is shown to be without benefit for BC or HYBBC implementations, considering the storage and computational overhead required by a sparse data structure. This analysis applies equally well to both the SBC and HYBSBC algorithms, and neither algorithm's implementation pursues sparse transformations.

Without the need for a special data structure to accommodate sparse transformations, the SBC and HYBSBC implementations both use the densely banded data structure described in Section 4.3.3. As a result, the storage requirements of SBC and HYBSBC are essentially equivalent to those of the BANDR, BC, and HYBBC implementations, allowing the experimental analysis of Section 5.4 to concentrate upon the CPU requirements of SBC and HYBSBC.

5.3.2 Rescaling Techniques for SBC and HYBSBC

As discussed in Section 4.3.4, the use of fast Givens transformations necessitates periodic rescaling of the diagonal matrix, D, associated with the reduction. As for previous implementations, the SBC and HYBSBC implementations avoid difficulties with overflow by carefully restricting the worst case growth of D's entries between rescaling episodes and by selecting an appropriate rescaling constant for entries experiencing too much growth.
Once again, for accounting purposes the rescaling schemes assume all entries above and below the split-point are nonzero.

SBC's implementation modifies the rescaling procedures in BC's implementation to reduce the number and scope of rescaling episodes by exploiting the decreased bulge chasing requirements of bidirectional elimination techniques. Specifically, SBC's rescaling procedures attempt to exploit the isolation of the shortened bulge chasing sequences above and below the split-point. For example, suppose SBC's procedures monitoring rescaling record that a diagonal's elimination modifies individual main diagonal entries above and below the split-point with at most abv and blw nontrivial transformations respectively. At the start of the next diagonal's elimination, the monitoring procedure assumes that in the worst case a single entry of D has been modified by max(abv, blw) transformations since the last rescaling. In contrast, during a BC reduction its monitoring procedure would be forced to assume that in the worst case an entry of D had been modified by approximately abv + blw transformations. In addition, after the first rescaling during a diagonal's reduction by SBC, subsequent rescalings before the diagonal's elimination is complete are restricted to smaller and smaller regions above and/or below the split-point. In these cases SBC experiences shorter rescaling sequences during which fewer of D's entries are checked and possibly rescaled. Given these savings, the cost of SBC's rescaling procedures is never more than BC's and is typically substantially less.

The HYBSBC implementation inherits the rescaling techniques of its constituent algorithms with one modification. Just before transferring from SBC to R-S, the hybrid algorithm invokes a rescaling episode to ensure a successful transition between the rescaling procedures of the two modules.
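The contrast between the two bookkeeping schemes can be illustrated with a small sketch. It assumes, purely for illustration, that each nontrivial fast Givens transformation can grow an entry of D by at most a constant factor and that rescaling is triggered before a fixed safety bound could be exceeded; the actual growth bounds and rescaling constants used by the BC and SBC codes are those of their implementations, not the values below.

import math

# Illustrative sketch (not the thesis code) of the rescaling bookkeeping.
# GAMMA and SAFE_BOUND are assumed values chosen only for the example.

GAMMA = 2.0            # assumed worst-case growth factor per transformation
SAFE_BOUND = 1.0e100   # keep entries of D safely below this magnitude
MAX_COUNT = int(math.log(SAFE_BOUND) / math.log(GAMMA))  # transformations allowed between rescalings

def bc_needs_rescale(abv, blw):
    # BC bookkeeping: bulge chasing above and below is not isolated, so a single
    # entry of D may have seen roughly abv + blw transformations.
    return abv + blw > MAX_COUNT

def sbc_needs_rescale(abv, blw):
    # SBC bookkeeping: the shortened bulge-chasing sequences above and below the
    # split-point are isolated, so max(abv, blw) bounds the worst case for any entry of D.
    return max(abv, blw) > MAX_COUNT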
5.4 Experimentation with Numerical Implementations

This section describes extensive testing of Section 5.3's implementations of the Split Bandwidth Contraction and Hybrid Split Bandwidth Contraction algorithms. Unless otherwise specified, each problem is preordered to reduce bandwidth using Lewis's implementation of GPS [Lew82], and tridiagonalized in the same testing environment described in Section 4.4.1. For future reference we also note that the Sun workstation used for testing has a 64 KByte external cache in addition to 16 MBytes of main memory. For these experiments the Δ-transition strategy uses a threshold of 1.0. After analyzing test results comparing our numerical implementations of SBC and HYBSBC to BC, HYBBC, and BANDR, we conclude by investigating the precision with which the Δ-transition strategy regulates HYBSBC's transition bandwidth for practical numerical reductions.

5.4.1 SBC Numerical Testing Results

The ability of SBC to efficiently execute a partial bandwidth contraction of a sparsely banded problem is crucial to the success of HYBSBC. As discussed previously, efficient partial SBC reductions can also play an important role as a preprocessing technique for other banded eigenvalue routines, such as BQR, for parallel implementations of Lang type tridiagonalizations, or for sparse linear systems solvers. Consequently, we first analyze the partial bandwidth contraction performance of SBC relative to BC. The final bandwidth for each problem's contraction is the Δ-transition bandwidth chosen by the numerical implementation of HYBSBC.

Theoretical analysis predicts that SBC can at least halve the computational requirements of a single diagonal's elimination when split-points are well centred, but split-points in the outermost nonzero diagonal have increasingly higher displacements as a typical reduction proceeds. Despite this tendency the numerical SBC implementation executes partial bandwidth contractions very efficiently. In fact, the numerical SBC routine performs partial contractions even more successfully relative to BC than the substantial improvements predicted by similar experiments with symbolic implementations of BC and SBC in Section 5.1.5.

For the 70 problems from the Harwell-Boeing test suite with n > 400, SBC requires on average 24% fewer CPU seconds than BC to perform the partial reductions. From this subset of problems SBC reduces the CPU requirements of 22 problems by more than 35%, and reductions of 45 to 55% are common. Reductions ranged as high as 76% for the sparse problem DWT 1007.

Table 5.9 summarizes the transformation and CPU second requirements of 10 selected partial BC and SBC contractions. The table's columns provide similar information to Table 5.5's summaries of symbolic reductions, except the final three columns summarize CPU second requirements instead of flop counts. Comparison of Tables 5.5 and 5.9 reveals that the symbolic codes provide relatively accurate assessments of SBC's actual computational requirements relative to BC.

As the symbolic routines predict, however, SBC does not provide a faster partial contraction than BC for all test problems. For example, the numerical SBC and BC reductions of the "LSHP" problems are essentially identical. The symbolic analysis predicts that for 8 of the 115 test problems the computational requirements of SBC should be higher than BC. Using the numerical routines, however, SBC's performance degradation is lower than anticipated for these problems and fewer of them remain in this category. Although the symbolic implementations typically predict the actual computational requirements very closely, their flop counts do not encompass all reduction costs. For example, they do not include the cost of performing row/column exchanges. When these additional costs are taken into consideration by the timings of numerical routines, SBC is significantly slower than BC for only 3 problems: DWT 918, BCSSTK28 and BCSSTK09.
    Name       n, b_GPS, b_t    BC transformations        SBC transformations       BC     SBC    % CPU
                                (bulge chasing,           (bulge chasing,           (sec)  (sec)  reduction
                                 row/col. exch.)           row/col. exch.)
    NOS5       468, 88, 49        21006 (14109, 3319)       14285 (7663, 3204)       4.29   2.77   35.4%
    BCSSTK19   817, 18, 3        364569 (357001, 1255)     195902 (188242, 281)     13.3    6.41   51.8%
    DWT 1007   1007, 34, 19       76908 (71200, 2686)       21553 (17952, 70)         7.1    1.7   76.1%
    CAN 1072   1072, 156, 48     283858 (235492, 16888)    178412 (126802, 16194)   77.6   43.4   44.0%
    1138 BUS   1138, 126, 10     770172 (715203, 71691)    498302 (438850, 60228)   80.5   49.3   38.8%
    ERIS1176   1176, 100, 4     1295336 (1241984, 19140)   620025 (569094, 69630)  110.5   44.0   60.2%
    BCSPWR06   1454, 100, 12    1299517 (1234992, 90699)   636455 (573608, 67932)  154.4   62.2   59.7%
    BCSSTK11   1473, 62, 33      354157 (326321, 1402)     184806 (157849, 1078)    63.1   28.3   55.2%
    PLAT1919   1919, 80, 30      887007 (826386, 31693)    537775 (478445, 9523)   173.6   91.7   47.2%
    BCSSTK26   1922, 245, 12    3983410 (3730049, 151257) 3105826 (2871728, 91367) 907.6  579.3   36.2%

    Table 5.9: Selected Partial BC and SBC Reduction Timing Summaries

The symbolic routines predict that due to peculiar sparsity characteristics of DWT 918, SBC requires 13.8% more flops than BC. Although the symbolic analysis correctly predicts nontrivial transformations and row/column exchanges, SBC needs only 9.0% additional CPU seconds. Similarly, the symbolic routines correctly predict that special characteristics of BCSSTK28 result in higher transformation totals for SBC and a shift from row/column exchanges to nontrivial transformations, resulting in a 6.8% increase in CPU seconds for SBC relative to BC.

Finally, the symbolic analysis of BCSSTK09 predicts SBC requires more than 2.5 times as many flops as BC's partial contraction of the same problem. As discussed in Section 5.1.5, this large discrepancy in requirements is due to a moderate increase in transformation totals for SBC with a dramatically lower proportion of row/column exchanges. When BC's large number of row/column exchanges are no longer without cost in the numerical implementations, the discrepancy in CPU seconds is reduced to 16.5%.

5.4.2 HYBSBC Numerical Testing Results

The Δ-transition strategy permits numerical implementations of HYBSBC to effectively exploit SBC's bidirectional elimination techniques and significantly improve upon both HYBBC and BANDR tridiagonalizations.

Relative to HYBBC, HYBSBC provides additional savings in CPU time, ranging as high as 52%, for 98 of the 115 test problems. HYBSBC achieves the largest timing reductions on problems for which the SBC phase is especially efficient and comprises a significant proportion of the complete reduction cost. For the 70 problems with more than 400 nodes, HYBSBC requires on average 12.7% fewer CPU seconds than HYBBC. HYBSBC tridiagonalizations have significantly higher timings than HYBBC for only 5 sparse problems, which require between 2 and 10.6% additional CPU seconds. The worst offenders in this group are DWT 918, BCSSTK09 and BCSSTK28, which were previously identified as problems with peculiar sparsity structures resulting in sub-optimal partial SBC contractions. Thus, there remain problems for which HYBBC is the optimal reduction. At present, however, we have not discovered easily identifiable characteristics common to the few problems for which HYBBC is superior, and must analyze each problem's characteristics individually.

The additional gains enjoyed by HYBSBC relative to HYBBC make it an impressive alternative to EISPACK's BANDR. In fact, for every test problem of order greater than 100, HYBSBC's tridiagonalization requires fewer CPU seconds than BANDR. HYBSBC tridiagonalizes one problem, ERIS1176, in 1/5 of BANDR's time.
For the 70 problems in the test suite with n > 400, HYBSBC requires on average 38.7% fewer CPU seconds than BANDR. In comparison, HYBBC's mean reduction is 31.1% for the same problem set. The histogram in Figure 5.27 illustrates the wide distribution of CPU reductions HYBSBC achieves for this group of 70 problems.

Figure 5.27: The Distribution of HYBSBC's Improved Reduction Performance Relative to BANDR for Problems with n > 400

Table 4.4 in Chapter 4 summarizes the computational requirements of 20 test problems for which HYBBC is especially successful. For the same group of problems, Table 5.10 summarizes the tridiagonalization requirements of HYBSBC, as well as BANDR and HYBBC, to facilitate comparison. The value of b_t listed for each problem is the bandwidth selected by HYBSBC's Δ-transition strategy. Table 4.4 provides HYBBC's transition bandwidth. The final column of Table 5.10 reports the reduction in CPU time HYBSBC achieves relative to BANDR. For this group of 20 problems HYBSBC exhibits an impressive mean reduction in CPU time of 52.2%, or on average is 2.09 times faster than BANDR.

Table 5.10: Selected BANDR, HYBBC and HYBSBC Tridiagonalization Requirement Summaries

As discussed during this chapter, the performance of SBC, and consequently HYBSBC, is dependent upon problem-specific sparsity structures. As for BC, the class of sparsity structures particularly suited to the Split Bandwidth Contraction algorithm concentrates nonzeros near the main diagonal and exhibits increased sparsity towards the outermost diagonals. In addition, SBC ideally prefers nonzeros to be evenly distributed across the midpoint of each diagonal and concentrated towards its two ends.

As discussed in Section 4.4.4, GPS cannot always preorder a problem to simultaneously reduce bandwidth and produce desirable sparsity structure characteristics in the permuted matrix. In addition to the GPS-HYBSBC experimentation previously discussed, we also conducted HYBSBC tridiagonalizations using RCM and GK preorderings. Although SBC efficiently exploits the increased peripheral sparsity often presented by these preorderings, the typically higher initial bandwidths they produce do not permit HYBSBC's overall performance to improve. As for the Bandwidth Contraction algorithm, the primary preordering objective remains the reduction of bandwidth.
As outlined in Section 4.4.4 for BC, investigating the ability of preorderings to produce small bandwidth sparsity structures conducive to SBC reductions is an interesting avenue of future research.

In conclusion we note one additional advantage of the HYBSBC algorithm. The Δ-transition strategy and bidirectional elimination techniques both contribute towards the lower transformation totals observed for HYBSBC relative to HYBBC. On average HYBSBC requires 1.4 times as many transformations as R-S, while HYBBC requires 2.25. Reducing transformation totals is of interest if we replace fast Givens transformations with standard Givens transformations, which have higher fixed costs associated with each transformation. In addition, it is desirable to reduce transformation totals if we want to accumulate transformations to find eigenvectors. Future research will investigate transition strategies that attempt to provide efficient reductions while minimizing transformation usage.

5.4.3 Δ-Transition Optimality

Using Trisymb symbolic implementations, Section 5.2.2.3 thoroughly investigated the precision with which the Δ-transition strategy regulates a reduction's transition bandwidth. In particular, Figure 5.26 examines the change in tridiagonalization flop requirements of four representative problems as their transition bandwidths are perturbed from the Δ-transition strategy's selection. These experiments rely completely upon the symbolic analysis of flop requirements to predict tridiagonalization performance. In this section we confirm this favorable analysis by conducting similar experiments with a special numerical implementation of the Hybrid Split Bandwidth Contraction algorithm that permits the user to specify the transition bandwidth.

Once again we study the effects upon tridiagonalization timings of perturbing the transition bandwidth of the four Harwell-Boeing problems ERIS1176, 1138 BUS, CAN 445 and BCSSTK11. Figure 5.28 plots the cost of each problem's perturbed reductions normalized by the CPU second requirements of HYBSBC using the Δ-transition bandwidth (offset = 0). The reduction time of each problem with a particular transition bandwidth is actually the mean of 3 or more reduction timings.

Figure 5.28: Variation in Tridiagonalization Time with the Transition Bandwidth Offset from the Δ-Transition Bandwidth

These plots clearly show that the Δ-transition strategy selects the optimal transition bandwidth for ERIS1176 and 1138 BUS. As expected, the plots of these problems are relatively smooth and follow the same general trends as their symbolic counterparts. The plots of the other two problems, however, are not smooth, and tridiagonalization times often fluctuate wildly from one transition bandwidth to the next. Despite these irregularities, the Δ-transition strategy chooses a near optimal transition bandwidth for BCSSTK11, while for CAN 445 the Δ-transition bandwidth is within 5.6% of the optimal reduction time.
If SBC had been allowed to eliminate one additional diagonal from CAN 445, the optimal transition bandwidth would have been used. These results demonstrate that the Δ-transition strategy successfully picks optimal or near optimal transition bandwidths, but the irregular nature of CAN 445 and BCSSTK11's plots warrants further investigation.

To understand the source of the irregularities in Figure 5.28's plots, we chose to study more closely the tridiagonalization costs of a fifth sparse problem, CAN 634, as a function of transition bandwidth. Figure 5.29 plots the relative flop requirements of CAN 634's tridiagonalization against a large range of transition bandwidth offsets. As expected, the plot is smooth, and the almost linear curves emanating from the origin indicate the Δ-transition bandwidth is computationally optimal.

Figure 5.29: Tridiagonalization Flop Counts for CAN 634

The nature of this plot is radically different from its numeric counterpart in Figure 5.30. Although this plot is smoother than the plot for CAN 445, the magnitude of the timing fluctuations is even larger. According to this plot the Δ-transition bandwidth results in a tridiagonalization whose reduction time is 10% higher than optimal. Although the Δ-transition strategy is foiled by the timing irregularities of CAN 634 and CAN 445, it does not usually have this much difficulty picking the optimal transition bandwidth. These two pathological cases are representative of fewer than 7 problems in the Harwell-Boeing test suite.

Figure 5.30: FORTRAN Tridiagonalization Timings for CAN 634 on a SPARCstation 2 with a 64 KByte External Cache

Fluctuations in integer and floating point operation counts do not change rapidly enough with transition bandwidth to account for these wild timing oscillations. We hypothesize that data access patterns are influencing reduction performance and producing the dramatic changes in CPU requirements. The densely banded data structure of HYBSBC stores each subdiagonal of the band's lower triangular portion in a separate column of a two dimensional double precision array. As we reduce a problem's transition bandwidth, the portion of the banded data structure referenced by the R-S phase shrinks. The elimination of one extra diagonal before switching to R-S means that R-S accesses 8n fewer bytes of memory. If eliminating one column of storage results in more of the data remaining in the cache during the R-S stage of the reduction, the performance of the tridiagonalization could dramatically increase as cache hits rise. As lower and lower transition bandwidths are considered, data access patterns may change yet again, resulting in more cache misses and higher reduction times. In this fashion the effectiveness of the cache may change with different transition bandwidths, leading to the observed oscillatory nature of tridiagonalization timings.

To confirm this hypothesis we conducted different experiments with CAN 634 on machines with different caching capabilities.
For example, when the same reductions are performed on a SUN SPARCstation 10 with a 36 KByte on-chip cache and a 1 MByte external cache, the general characteristics of CAN 634's relative timings plot do not change, but the period of the oscillations is larger than for the smaller cache SPARCstation 2. This result is consistent with our hypothesis, but to be certain of its validity we need to eliminate the effects of caching on timing altogether. Unfortunately, the SUN hardware does not permit users to turn off the cache. Alternatively, we conducted a series of experiments on a 50 MHz i486, with both an 8 KByte on-chip cache and a 256 KByte external cache, running release 2.5 of the Mach operating system. Due to the unavailability of a FORTRAN compiler on this machine, we created a C version of HYBSBC using the program f2c [FGMS93]. To ensure this conversion process did not change the general characteristics of tridiagonalization performance, we experimented with the C code version on a SUN SPARCstation 2, producing Figure 5.31. Although the magnitude of the oscillations has changed, the general trends of Figure 5.31's plot are identical to its FORTRAN counterpart.

Figure 5.31: C Tridiagonalization Timings for CAN 634 on a SPARCstation 2 with a 64 KByte External Cache

Figure 5.32 plots normalized tridiagonalization times against transition bandwidth offsets for the C version of the code run on the i486 machine with caching enabled. The plot differs from Figure 5.31 significantly, but still exhibits the familiar erratic nature. Finally, Figure 5.33 plots normalized timings from the same machine with caching disabled. This plot is very close to the plot of normalized symbolic flop counts in Figure 5.29. It clearly shows that the erratic tridiagonalization performance is due to caching effects and that the Δ-transition strategy selects the computationally optimal transition bandwidth.

Figure 5.32: C Tridiagonalization Timings for CAN 634 on a 50 MHz i486 with an 8 KByte On-Chip Cache and a 256 KByte External Cache

Figure 5.33: C Tridiagonalization Timings for CAN 634 on a 50 MHz i486 with Caching Disabled

If a user desires to have the fastest possible tridiagonalization of a particular problem on a specific sequential architecture, the Δ-transition strategy may not always provide the optimal transition bandwidth. One might be able to fine-tune the Δ-transition strategy by creating a detailed cache model and analyzing the data access patterns of the Hybrid Split Bandwidth Contraction algorithm. Even if this difficult task is tractable, the approach is both machine and problem dependent. In addition, cache effects are sensitive to machine load. In fact, with more than one compute intensive process running during tridiagonalization testing, the erratic performance behavior will largely disappear as each plot approaches Figure 5.33. Thus, although fine-tuning the Δ-transition strategy is possible, it is not generally applicable. The Δ-transition strategy, however, provides a general scheme which always selects an optimal or near optimal transition bandwidth for use in any sequential computing environment.
It is the best possible transition strategy with general applicability and, as the testing results of Section 5.4.2 demonstrate, it effectively contributes to HYBSBC's highly efficient tridiagonalizations.

Chapter 6

A Comparison of the Lanczos Algorithm with Direct Sparse Tridiagonalization

There continues to be widespread interest in the use of Lanczos-type algorithms [Lan50] for the solution of sparse eigenvalue problems. Lanczos algorithms are certainly well suited to finding a few eigenvalues from either end of the spectrum of a sparse symmetric matrix. This chapter, however, explores the ability of Lanczos algorithms to economically find moderately large subsets of eigenvalues or identify the complete spectrum, including eigenvalue multiplicities, of a sparse symmetric problem. To evaluate the relative success of applying Lanczos to these tasks we compare its resource requirements to the cost of isolating a sparse matrix's complete spectrum with HYBSBC and EISPACK's tridiagonal eigenvalue routine TQLRAT [SBD+76].

We begin the chapter with a brief overview of the mathematics underlying the basic Lanczos iteration and identify the difficulties associated with the implementation of Lanczos in this simple form. (For a rigorous discussion of the Lanczos method consult [GV89] or [Par80].) With a general understanding of the Lanczos iteration, we then survey the different techniques employed by practical Lanczos implementations. Although this survey is far from exhaustive, it attempts to touch upon the techniques of the dominant Lanczos methods. The presentation of each Lanczos approach includes a theoretical discussion of its ability to compute efficiently the complete spectrum of a sparse symmetric problem. Finally, the chapter's last section summarizes experiments applying two well regarded Lanczos codes to practical sparse problems from the Harwell-Boeing test suite. We analyze the resource requirements of these codes to compute a sparse problem's complete spectrum, or moderately large subsets of its eigenvalues. For these sparse eigenvalue problems, direct methods based on HYBSBC tridiagonalizations compare very favorably to iterative Lanczos techniques.

6.1 Lanczos Basics

Using a three term recurrence, the Lanczos algorithm directly computes tridiagonal matrices whose eigenvalues approximate some of the original matrix's eigenvalues. Essentially following the notation of Parlett [Par80], the simple form of the Lanczos algorithm for an n x n matrix A, and arbitrary starting vector r_0, is described by the following iteration.

Let q_0 be the zero vector and β_0 = ||r_0||. All vectors are of length n.
For j = 1, 2, ...
    1. q_j = r_{j-1} / β_{j-1}
    2. u_j = A q_j - β_{j-1} q_{j-1}
    3. α_j = q_j^T u_j
    4. r_j = u_j - α_j q_j
    5. β_j = ||r_j||

Each iteration of the Lanczos algorithm produces a new Lanczos vector q_j and scalars β_j and α_j. It is not difficult to understand why the Lanczos iteration is attractive to sparse matrix researchers. Lanczos does not modify the original matrix, and sparsity can be easily exploited during the formation of the matrix-vector products A q_j.

Assume that all Lanczos vectors are collected into a single matrix

    Q_j = [q_1, q_2, ..., q_j]                                                      (6.1)

and that the scalars α_i and β_i are used to form the symmetric tridiagonal matrix T_j, whose diagonal entries are α_1, ..., α_j and whose off-diagonal entries are β_1, ..., β_{j-1}:

          [ α_1   β_1                     ]
          [ β_1   α_2    .                ]
    T_j = [        .     .      .         ]
          [              .      .  β_{j-1}]
          [           β_{j-1}   α_j       ]

It can be shown that

    A Q_j - Q_j T_j = r_j e_j^T,                                                    (6.2)

where e_j^T = (0, ..., 0, 1). If exact arithmetic is used to perform the computations of each Lanczos iteration, the Lanczos vectors are by definition orthogonal. Thus Q_j^T Q_j = I_j and Q_j^T r_j = 0, and it can be shown that

    T_j = Q_j^T A Q_j.                                                              (6.3)
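The recurrence above translates almost line for line into code. The following NumPy sketch is included only to make the iteration concrete; it is not one of the Lanczos codes tested later in this chapter, and the function and variable names are ours. Sparsity enters solely through the matrix-vector product A @ q.

import numpy as np

# Simple Lanczos iteration (illustrative sketch).  A may be any object that
# supports A @ q, e.g. a dense array or a sparse matrix representation.

def simple_lanczos(A, r0, steps):
    """Run `steps` Lanczos iterations starting from r0.

    Returns (alphas, betas): alphas are the diagonal of T_j, betas[:-1] its
    off-diagonal entries, and betas[-1] = ||r_j||, the final residual norm.
    """
    n = r0.shape[0]
    q_prev = np.zeros(n)
    beta = [np.linalg.norm(r0)]          # beta_0
    alpha = []
    r = r0.copy()
    for j in range(steps):
        q = r / beta[-1]                 # step 1: next Lanczos vector
        u = A @ q - beta[-1] * q_prev    # step 2
        a = q @ u                        # step 3
        r = u - a * q                    # step 4
        alpha.append(a)
        beta.append(np.linalg.norm(r))   # step 5
        if beta[-1] == 0.0:              # an invariant subspace has been found
            break
        q_prev = q
    return np.array(alpha), np.array(beta[1:len(alpha) + 1])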
Using the Rayleigh-Ritz procedure, the tridiagonal matrix T_j can be used to find approximations to the eigenvalue-eigenvector pairs of A from the subspace spanned by {q_1, ..., q_j}. Suppose that the eigenvalue-eigenvector pairs of T_j are (θ_i, s_i), i = 1, ..., j. The Ritz vectors y_i = Q_j s_i and corresponding Ritz values θ_i approximate eigenpairs of A. T_j's eigenvalues tend to exhibit the quickest convergence to the extremal eigenvalues of A, making the Lanczos algorithm particularly well suited to finding eigenvalues at either end of A's spectrum.

With exact arithmetic, the Lanczos algorithm continues until β_k = 0 for some k ≤ n and the orthogonal Lanczos vectors {q_1, ..., q_k} span the invariant subspace of A containing the starting vector. Equivalently, the iteration continues until k = rank(K_n(A, q_1)).* At this stage the Ritz pairs (θ_i, y_i), i = 1, ..., k, form an exact subset of A's eigenpairs. If k < n, one can restart the iterative process by choosing a starting vector orthogonal to all previous Lanczos vectors.

* K_n(A, q_1) is the Krylov subspace spanned by {q_1, A q_1, ..., A^{n-1} q_1}.

In practice, when using finite precision arithmetic Lanczos encounters difficulties related to the loss of orthogonality between Lanczos vectors, resulting from numerical cancellation during the computation of r_j. This phenomenon complicates the algorithm's termination and the relationship between Ritz pairs and the eigenvalue-eigenvector pairs of A. For example, in practice a straightforward implementation of the basic Lanczos algorithm does not terminate. It keeps producing duplicates of previously converged Ritz values indefinitely. In fact, the algorithm may even duplicate eigenvalues before all other eigenvalues of A have been identified by a converged Ritz value. In addition, without modifying the simple iteration, Lanczos is theoretically able to find at most a single eigenvector for each distinct eigenvalue. The eigenvector computed by Lanczos is the projection of the starting vector onto the invariant subspace of the corresponding eigenvalue. As a result, in its simplest form Lanczos is unable to determine the multiplicity of eigenvalues.

Many researchers have suggested modifications of the Lanczos algorithm to overcome these difficulties. The following subsections provide brief overviews of the most popular of these techniques. We evaluate each variant of the Lanczos algorithm in terms of its ability to isolate efficiently the complete spectrum of a sparse problem.

6.2 Lanczos With Full Reorthogonalization

The first technique resolving the difficulties of the simple Lanczos algorithm actively maintains strict orthogonality among the Lanczos vectors. As each new Lanczos vector is constructed, it is orthogonalized with respect to all previously computed Lanczos vectors.

    r_j = r_j - Σ_{i=1}^{j} (q_i^T r_j) q_i                                          (6.4)

Using n iterations of this modified Lanczos method creates an n x n tridiagonal matrix whose n Ritz values provide good approximations to all of A's eigenvalues with correct multiplicity. Of course, to successfully complete n iterations the Lanczos algorithm must be restarted with an orthogonal starting vector whenever β_k = 0.

Unfortunately, the costs of employing full reorthogonalization are very high. Independent of the sparsity of A, the reorthogonalization of Equation 6.4 alone requires 2n^3 + O(n^2) flops over n iterations and O(n^2) storage. Although in practice full reorthogonalization is not often implemented precisely as shown in Equation 6.4 (see [GV89]), the economies of these alternative schemes do not change the leading term of the operation count. Clearly this variant of the Lanczos algorithm cannot isolate the complete spectrum of a sparse problem with the efficiency of a HYBSBC+TQLRAT combination.
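For comparison with the simple iteration, the following sketch adds the full reorthogonalization of Equation 6.4 to the same loop. Again this is an illustrative NumPy rendering with names of our choosing, not a production code; it makes explicit the O(n^2) storage for Q and the roughly 2n^3 flops spent in the Gram-Schmidt sweep over n iterations.

import numpy as np

# Lanczos with full reorthogonalization (illustrative sketch).

def lanczos_full_reorth(A, r0, steps):
    n = r0.shape[0]
    Q = np.zeros((n, 0))                 # all Lanczos vectors must be retained
    alpha, beta = [], [np.linalg.norm(r0)]
    r = r0.copy()
    q_prev = np.zeros(n)
    for j in range(steps):
        q = r / beta[-1]
        Q = np.column_stack([Q, q])
        u = A @ q - beta[-1] * q_prev
        a = q @ u
        r = u - a * q
        r = r - Q @ (Q.T @ r)            # Equation 6.4: sweep over all previous vectors
        alpha.append(a)
        beta.append(np.linalg.norm(r))
        if beta[-1] == 0.0:              # restart with an orthogonal vector would be needed
            break
        q_prev = q
    return np.array(alpha), np.array(beta[1:len(alpha) + 1]), Q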
6.3 Selective or Partial Reorthogonalization Techniques

In response to the overwhelming cost of full reorthogonalization, researchers have proposed augmenting the simple Lanczos iteration with selective reorthogonalization [PS79] or partial reorthogonalization [Sim84]. Rather than insisting upon strict preservation of orthogonality, these schemes attempt to maintain semi-orthogonality amongst the Lanczos vectors. Both the selective and partial reorthogonalization schemes use clever techniques to monitor the loss of orthogonality and, when necessary, reorthogonalize Lanczos vectors against either converged Ritz vectors or a subset of the Lanczos vectors produced by previous iterations. Parlett and Scott [PS79] claim that an eigenvalue's multiplicity can be found by Lanczos schemes with selective reorthogonalization, but they appear to be relying upon rounding errors to sufficiently perturb an eigenvalue problem.

Either reorthogonalization technique allows the Lanczos iteration to successfully isolate a few of a problem's eigenvalues without incurring the excessive computational requirements of a complete reorthogonalization approach. If we use these variants of the Lanczos algorithm to isolate all the eigenvalues of a sparse symmetric matrix, however, the reorthogonalization costs of either scheme could approach those of full reorthogonalization. As an example, we consider the potential costs of selective reorthogonalization in more detail.

The selective orthogonalization scheme reorthogonalizes each new Lanczos vector against all Ritz vectors that have converged to an eigenvector of A, because the loss of orthogonality between Lanczos vectors goes hand in hand with convergence [PS79]. As a result, once this Lanczos algorithm detects the convergence of a Ritz vector, it must compute and store the Ritz vector even if the eigenvectors of A are not desired. During the algorithm's identification of a few eigenvalues, there are usually far fewer converged Ritz vectors than Lanczos vectors. This allows the algorithm to enjoy substantial savings in comparison to full reorthogonalization approaches. If all of A's eigenvalues are sought using n Lanczos iterations, however, the number of converged Ritz vectors may eventually approach n. Suppose on average a newly converged Ritz vector must be computed after each Lanczos iteration. During the j-th iteration, the cost of computing the matrix-vector product Q_j s_i to form Ritz vector y_i is 2jn flops. Thus the calculation of all n converged Ritz vectors requires

    Σ_{j=1}^{n} 2jn = 2n ( n(n+1)/2 ) = n^3 + n^2 flops.                             (6.5)

This result is a lower bound on the cost of computing the n Ritz vectors. Typically, the convergence of many Ritz vectors is delayed until much later in the Lanczos process than assumed by this analysis, forcing their computation from matrix-vector products of higher dimension.

Independent of A's sparsity structure, just the cost of calculating the Ritz vectors is potentially very high, but this operation count does not even include the significant computational requirements of the following portions of the Lanczos algorithm.

1. The cost of computing T_j's eigensystem when the selective reorthogonalization variant of the Lanczos algorithm pauses to assess the convergence of Ritz vectors.

2. The cost of the actual reorthogonalization process.

3. The cost of forming the matrix-vector products for the basic Lanczos iteration.
Even though the partial reorthogonalization approach does not require the computation of Ritz vectors, Simon [Sim84] reports that the costs of selective and partial reorthogonalization are comparable. Thus, it is unlikely that either of these Lanczos algorithms can compete with HYBSBC+TQLRAT when computing the complete spectrum of sparse matrices of moderate bandwidth. In addition to reorthogonalization costs that may approach O(n^3), both methods potentially require O(n^2) storage to record Lanczos vectors and, in the case of selective orthogonalization, converged Ritz vectors.

Because the rate of convergence of Ritz vectors is problem dependent, this section's analysis could not be as precise as the discussion of full reorthogonalization. Section 6.6.2 reports on experiments with a Lanczos code employing variants of both reorthogonalization techniques while identifying the complete spectrum of sparse problems.

6.4 Lanczos With No Reorthogonalization

As mentioned in Section 6.1's discussion of the simple Lanczos iteration, without the reorthogonalization of Lanczos vectors, T_j may have duplicate Ritz values corresponding to many of A's simple eigenvalues. At any particular stage of the Lanczos process, duplicate Ritz values have converged with varying degrees of success to one of A's eigenvalues. Much effort has gone into the development of Lanczos type algorithms using no reorthogonalization, which can identify these ghost or spurious eigenvalues and deal with their existence. This section briefly outlines two spurious eigenvalue approaches by Cullum and Willoughby [CW79, CW80] and Parlett and Reid [PR81]. These algorithms are intended for the isolation of many or all distinct eigenvalues of a large sparse matrix. When A's eigenvectors are not required, both algorithms enjoy the distinct advantage of being able to discard Lanczos vectors once they are no longer required by the three term recurrence of the simple Lanczos iteration.

The two algorithms use different techniques to identify and manage spurious eigenvalues. Cullum and Willoughby's approach executes the simple Lanczos algorithm for a user specified number of iterations m (> n), creating a large tridiagonal matrix T_m. They identify spurious or unconverged Ritz values by comparing the eigenvalues of T_m to those of a second tridiagonal matrix of order (m - 1), produced by deleting the first row and column of T_m. If the number and accuracy of the good eigenvalues identified is satisfactory, the routine exits. Otherwise, a preset increment is added to m and the Lanczos iteration continues until the larger tridiagonal matrix T_m is found. Once again, the algorithm compares the eigenvalues of T_m and its order (m - 1) submatrix to identify eigenvalues of A. The algorithm continues in this fashion until satisfying the eigenvalue request or reaching the maximum number of Lanczos iterations preset by the user. Cullum and Willoughby report that some problems may require more than 10n iterations to find all distinct eigenvalues.

Parlett and Reid's algorithm approaches the isolation of nonspurious eigenvalues using different techniques, which do not require the direct computation of the eigenvalues of large tridiagonal matrices. Once again, the algorithm executes the simple Lanczos iteration, but it uses a heuristic test for termination.
Parlett and Reid’s algorithm monitors the convergence of eigenvalues using recurrences on sets of points within and near the requestedportion of the spectrum. To determine the actual positions of A’s eigenvalues it employsrational interpolation procedures at these points. When all distinct eigenvalues are soughtthis technique typically requires far more than n simple Lanczos iterations. In the limited experimentation presented by Parlett and Reid, one problem required more than Sniterations to identify all eigeuvalues.A number of difficulties are represented by both algorithms. First, there is no wayto predict the final number of iterations required to find a problem’s complete spectrumwithout executing the algorithms. In both cases, the computational requirements dependupon the distribution of eigenvalues within the spectrum of a particular sparse problem. Ingeneral, without the use of reorthogonalization both algorithms have difficulty with regionsof a problem’s spectrum in which the eigenvalues are not well separated. Secondly, neither algorithm can guarantee to find all distinct eigenvalues, because the algorithms mayterminate before all eigenvalues are identified. In the Cullum and Willoughby approachthere is no mechanism for deciding the number of iterations to use unless n nonspuriouseigenvalues are identified. Despite encouraging practical experience, the heuristic termination techniques of Parlett and Reid cannot be guaranteed to succeed. In addition, bothalgorithms rely upon the assumption that the user-selected starting vector has a nontrivialprojection onto the invariant subspace of all requested eigenvalues. If one uses a startingvector deficient in a particular eigenvalue’s invariant subspace, the algorithms may neverfind this eigenvalue independent of the number of iterations taken.Finally, both algorithms are designed to find eigenvalues without regard to multiplicity.Augmenting these methods with multiplicity checking would be unrealistic. For example, inthe Cullum and Willoughby algorithm the multiplicity of eigenvalues could be determinedby looking at the dimension of the subspace spanned by the eigenvectors corresponding toCHAPTER 6. THE LANCZOS ALGORITHM 160converged eigenvalues. If the approach uses m Lanczos iterations it would require O(mn)storage for the Lanczos vectors and at least 0(n3) flops to compute the Ritz vectors. Alternatively, the multiplicity of each eigenvalue could be checked by using an LDLT factorization of A — Al [BK77]. The cost of these factorizations are prohibitive considering the filllevels experienced during a typical sparse LDLT factorization. Even if all A— Al matriceswere banded and positive definite, each Cholesky factorization would require O(b2n) flops,which translates into O(b2n)flops for all factorizations in the worst case.In summary, these spurious eigenvalue approaches cannot economically determine eigenvalue multiplicity and guarantee to find all distinct eigenvalues, and only very rough estimates of their resource requirements are possible prior to execution. Despite these difficulties, many researchers suggest the use of these algorithms for the practical computation ofa problem’s complete set of distinct eigenvalues. 
6.5 Block Shift and Invert Lanczos

The spurious eigenvalue approaches discussed in the previous section are unable to provide the multiplicity of eigenvalues. The block Lanczos algorithm is a generalization of the single vector Lanczos iteration that overcomes these limitations. Rather than computing a single Lanczos vector, each iteration of block Lanczos produces a block of Lanczos vectors. Similarly, in place of tridiagonal matrices, the block Lanczos algorithm computes a sequence of block tridiagonal matrices whose eigenvalues approximate eigenvalues of A. The block Lanczos algorithm is able to compute the multiplicity of eigenvalues, as long as the block size is larger than the dimension of the invariant subspace of each requested eigenvalue. Without prior knowledge of a matrix's spectrum, however, the selection of an appropriate block size is difficult. The efficiency of block Lanczos algorithms may suffer from the selection of an overly large block size, because the algorithm's overhead is quadratic in the block size [Sco83, GLS94].

Another difficulty with algorithms based on the simple Lanczos iteration is their potentially slow convergence for poorly separated eigenvalues. For many eigenvalue applications the simple Lanczos iteration isolates eigenvalues efficiently from only one end of the matrix's spectrum. To overcome this difficulty Grimes, Lewis, and Simon [GLS94] combine the block Lanczos algorithm with the spectral transformation Lanczos method of Ericsson and Ruhe [ER80]. The goal of this sophisticated block shift and invert Lanczos algorithm is to provide a "black box" extraction routine for the generalized symmetric eigenvalue problem,

    Hx = λMx.                                                                       (6.6)

Spectral transformations permit the block Lanczos algorithm to quickly isolate eigenvalues in any part of the spectrum of a generalized eigenvalue problem. Of course, this approach applies equally well to the ordinary eigenvalue problem, Ax = λx, examined by this dissertation. In terms of this simpler problem, the spectral transformation works in the following manner. Suppose we want to find a group of A's eigenvalues near σ, which are poorly separated. If we invert the shifted matrix (A - σI) using a sparse factorization, then the eigenvalues of interest are transformed into the largest and most well-separated eigenvalues of the following eigenvalue problem.

    (A - σI)^{-1} x = μx                                                            (6.7)

The shift and invert transformation does not change the eigenvectors, but the eigenvalues μ associated with Equation 6.7 are related to those of A by μ = 1/(λ - σ).

When more than a few eigenvalues close to a single shift are sought, a sequence of shifts can be used to speed up the convergence of the isolation process. Grimes et al. provide a robust strategy to regulate the selection and use of shifts. With the selection of each new shift σ, the algorithm performs a sparse factorization of (A - σI) and initiates a new block Lanczos iteration.

In general, finite precision arithmetic implementations of block Lanczos algorithms suffer from the same difficulties as the simple Lanczos iteration, resulting from the loss of orthogonality amongst Lanczos vectors. To cope with the loss of orthogonality, Grimes, Lewis and Simon's algorithm employs a novel combination of three reorthogonalization schemes.
In general, finite precision arithmetic implementations of block Lanczos algorithms suffer from the same difficulties as the simple Lanczos iteration, resulting from the loss of orthogonality amongst Lanczos vectors. To cope with the loss of orthogonality, Grimes, Lewis and Simon's algorithm employs a novel combination of three reorthogonalization schemes.

1. Local Reorthogonalization. At each iteration of the block shift and invert Lanczos algorithm, a local reorthogonalization is performed between the newest block of Lanczos vectors and the set of Lanczos vectors created by the previous step of Lanczos.

2. Partial Reorthogonalization. To correct the global loss of orthogonality amongst Lanczos vectors, the algorithm uses a block version of Simon's partial reorthogonalization [Sim84].

3. External Selective Reorthogonalization. External selective reorthogonalization keeps the current sequence of Lanczos vectors orthogonal to eigenvectors computed with previous shifts, preventing the recomputation of eigenvalues already identified. This novel scheme is motivated by the standard selective orthogonalization approach of Parlett and Scott [PS79].

Unlike local reorthogonalization, the second and third reorthogonalization techniques are used only when the algorithm's models of orthogonality loss indicate that their application is warranted. As discussed in Section 6.3, when seeking moderately large numbers of a problem's eigenvalues the introduction of reorthogonalization schemes may degrade algorithm efficiency to unacceptable levels.
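To make the first of these schemes concrete, the following fragment sketches one simple way a local reorthogonalization step could be written. The dense NumPy arrays, the single classical Gram-Schmidt pass and the block sizes are all assumptions made for the illustration; they are not details of the BCSLIB-EXT code.

    import numpy as np

    def local_reorthogonalize(Q_new, Q_prev):
        # Remove from the newest block of Lanczos vectors (columns of Q_new)
        # any components in the span of the previous block Q_prev, then
        # re-orthonormalize the new block itself.
        Q_new = Q_new - Q_prev @ (Q_prev.T @ Q_new)
        Q_new, _ = np.linalg.qr(Q_new)
        return Q_new

    rng = np.random.default_rng(0)
    Q_prev, _ = np.linalg.qr(rng.standard_normal((100, 3)))    # previous block
    Q_new = rng.standard_normal((100, 3)) + 1.0e-6 * Q_prev    # new block, slightly contaminated
    Q_new = local_reorthogonalize(Q_new, Q_prev)
    print(np.max(np.abs(Q_prev.T @ Q_new)))    # orthogonal to working precision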
The techniques outlined in this section form the basis of a robust Lanczos algorithm, which is capable of isolating groups of eigenvalues from any portion of a matrix's spectrum or computing a problem's complete spectrum, including eigenvalue multiplicities. Section 6.6.2 reports on experiments comparing the efficiency of HYBSBC+TQLRAT to an implementation of Grimes, Lewis and Simon's shift and invert block Lanczos algorithm.

6.6 Experimentation With Lanczos Codes

To complement the previous discussion, this section presents experimentation with two well-known Lanczos codes. EA15D is a double precision implementation of Parlett and Reid's spurious eigenvalue Lanczos algorithm [PR81] from the Harwell Subroutine Library [Har90]. Boeing Computer Services provided access to an implementation of the block shift and invert Lanczos algorithm [GLS94] from the Boeing Extended Mathematical Subprogram Library (BCSLIB-EXT) [Boe89]. We selected these implementations because their capabilities appropriately match the eigenvalue problems addressed by this thesis, and because the codes use contrasting techniques to cope with the difficulties resulting from the loss of orthogonality amongst Lanczos vectors.

Applying these codes to practical sparse symmetric problems from the Harwell-Boeing test suite, we conduct two types of experiments. First, we investigate the costs of using Lanczos to isolate the complete spectrum of sparse problems. In addition, we explore the resource requirements of identifying subsets of eigenvalues, ranging in size from 12.5% to 100% of a sparse problem's complete spectrum. In all cases, we compare the resource requirements of the Lanczos codes to the costs of using HYBSBC and TQLRAT to compute a problem's entire spectrum, including eigenvalue multiplicities.

As for the majority of experimentation in Chapters 4 and 5, testing was conducted on a Sun SPARCstation 2 with 16 MBytes of main memory. All routines were compiled by the standard Sun FORTRAN compiler with full object code optimization requested. Prior to Hybrid Split Bandwidth Contraction tridiagonalizations, each problem was preordered to reduce bandwidth using Lewis's implementation of GPS [Lew82]. The CPU second timings of HYBSBC+TQLRAT and EA15D were produced using the built-in system routine etime, while the reported timings of the block shift and invert Lanczos code are provided by BCSLIB-EXT's own statistic gathering routines. The timings of HYBSBC+TQLRAT do not include the requirements of GPS, which are typically insignificant relative to the CPU second totals for the entire process.

6.6.1 Harwell's EA15

EA15D provides the user with several variable parameters to control the Lanczos code's execution. Using these parameters we requested EA15D to provide the starting vector for the Lanczos iteration and to skip eigenvector computations. EA15D's parameter ACC permits the user to request the precision, relative to the matrix's largest eigenvalue, to which the routine computes all eigenvalues. The user can assign ACC a specific relative accuracy or a negative value, which directs EA15D to find all requested eigenvalues to as much accuracy as the working precision reasonably allows. We chose to perform all experiments with EA15D twice, once requesting full accuracy and a second time with the relatively low accuracy of i0.

The user must provide EA15D with a routine for the computation of u + Av, where A is the sparse matrix under analysis and u and v are vectors. To maximize the efficiency of computing the matrix-vector product Av, we chose to store A's nonzero entries in a row-oriented data structure using full adjacency lists. During a typical application of EA15D, we found that on average the computation of u + Av at each iteration makes up 10-15% of the total CPU requirements, but ranges as high as 40% for specific eigenvalue problems and levels of accuracy. If storage is a paramount concern, we could easily exploit the symmetry of A by representing it with a lower adjacency data structure, requiring approximately half the storage. Consequently, when computing the storage requirements of EA15D we assume a lower adjacency structure is in use.
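As a rough sketch of the kind of user-supplied product routine involved (the EA15D calling interface and the thesis data structures themselves are not reproduced here; the Python fragment below merely illustrates a lower adjacency, row-oriented scheme), u + Av can be formed from the strict lower triangle plus diagonal of A, with each stored off-diagonal entry applied to two rows.

    import numpy as np
    from scipy.sparse import csr_matrix, random as sprandom

    def u_plus_Av_lower(L, u, v):
        # L holds the lower triangle (including the diagonal) of a symmetric A
        # in compressed sparse row form.  Each stored off-diagonal entry a_ij
        # (i > j) contributes to both row i and row j of the product.
        w = u.copy()
        for i in range(L.shape[0]):
            for k in range(L.indptr[i], L.indptr[i + 1]):
                j, a = L.indices[k], L.data[k]
                w[i] += a * v[j]
                if i != j:
                    w[j] += a * v[i]
        return w

    # Small random symmetric test problem (illustrative only).
    B = sprandom(50, 50, density=0.05, format='csr', random_state=1)
    A = B + B.T
    L = csr_matrix(np.tril(A.toarray()))
    u, v = np.ones(50), np.arange(50, dtype=float)
    print(np.max(np.abs(u_plus_Av_lower(L, u, v) - (u + A @ v))))    # ~1e-16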
Unfortunately, the user must predict the total storage requirements of EA15D prior to its execution of an eigenvalue task. If allocations are too small, EA15D exits without reaching its goal and must be restarted with higher allocations of storage. In practice we found it very difficult to assess EA15D's storage requirements. In contrast, HYBSBC has the distinct advantage of knowing its storage requirements prior to commencing a reduction. In subsequent discussion of EA15D experimentation, we report the minimum storage needed by EA15D to successfully complete an eigenvalue task.

The EA15D code is designed for the isolation of a symmetric problem's complete spectrum, without regard to multiplicity. For 11 sparse problems from the Harwell-Boeing collection, Table 6.1 summarizes the costs of complete spectrum determination using HYBSBC+TQLRAT and EA15D with full accuracy requested. The table's first column records the problem's name, order and preordered bandwidth, as well as indicating whether the Harwell-Boeing collection supplies the problem's nonzero values. As discussed in Section 2.2, if only the matrix's sparsity pattern is available, we assign a random value in the range (0.0, 1.0] to each nonzero entry. During our testing we did not observe a correlation between EA15D performance and the use of randomly assigned or provided values. The next two columns of the table summarize the CPU seconds and storage requirements (8n(bGPS + 3) bytes) of HYBSBC+TQLRAT. The table's final columns report EA15D's CPU and storage usage, the number of eigenvalues found, and the total number of Lanczos iterations executed.

                              HYBSBC + TQLRAT          EA15D
    Name (n, bGPS, val/noval)   CPU sec   KBytes       CPU sec    KBytes    Eigenvalues found  Lanczos iterations
    ERIS1176 (1176,100,noval)     65.9    969.02        4335.1    647.52      1176               23071
    BCSPWR06 (1454,100,noval)    130.9   1198.1         6413.9    676.94      1455               27412
    BCSSTK05 (153,23,val)          0.74    31.824         17.4     47.472      153                 887
    BCSSTK09 (1083,95,val)       159.9    849.07       >8181.6   >677.68       812              >25825
    CAN.1054 (1054,112,noval)    121.3    969.68        1134.2    339.41      1054                8598
    DWT.1005 (1005,106,noval)     88.8    876.36        1278.2    329.00      1005                9662
    NOS3 (960,65,val)             87.6    522.24       >3039.5   >538.68       695              >19201
    BCSSTK19 (817,18,val)         14.7    137.26       >3101.2   >271.60        56               >8171
    1138.BUS (1138,126,val)       84     1174.4       >12234     >429.98        69              >15249
    BCSSTK26 (1922,245,val)      705.0   3813.3      >109022    >1070.1         15              >38441
    PLAT1919 (1919,80,val)       361.9   1274.2       >21571.6   >889.16        10              >28786

Table 6.1: The Computational Cost of Complete Spectrum Determination with HYBSBC+TQLRAT and EA15D (Full Accuracy)

The convergence of EA15D is very slow for many problems. In fact, for some problems we gave up restarting EA15D with more storage before the isolation of all eigenvalues was complete. For these problems EA15D's resource requirements are preceded by the ">" symbol, indicating that additional Lanczos iterations are required to successfully solve the eigenvalue problem.

EA15D successfully isolates all the distinct eigenvalues of 5 problems, but in doing so requires dramatically higher CPU second totals than HYBSBC+TQLRAT. For these problems the CPU requirements of EA15D range from 9.4 to 66 times those of HYBSBC+TQLRAT, while the storage requirements of the two approaches are comparable. For the remaining 6 problems EA15D requires between 34 and 212 times the CPU seconds of HYBSBC+TQLRAT and substantially higher levels of storage, without successfully completing the eigenvalue tasks. Even with these extraordinary timings, EA15D isolates very few eigenvalues for some of these problems.

Table 6.2 summarizes similar experiments with ACC set to i0.

                              HYBSBC + TQLRAT          EA15D
    Name (n, bGPS, val/noval)   CPU sec   KBytes       CPU sec    KBytes    Eigenvalues found  Lanczos iterations
    ERIS1176 (1176,100,noval)     65.9    969.02         589.6    397.72      1080                7459
    BCSPWR06 (1454,100,noval)    130.9   1198.1          444.1    333.34      1446                5937
    BCSSTK05 (153,23,val)          0.74    31.824          3.8     39.376      152                 381
    BCSSTK09 (1083,95,val)       159.9    849.07         417.1    401.94       711                8591
    CAN.1054 (1054,112,noval)    121.3    969.68         251.9    258.8       1054                3560
    DWT.1005 (1005,106,noval)     88.8    876.36         193.1    222.7       1004                3018
    NOS3 (960,65,val)             87.6    522.24         247.6    299.92       953                4279
    BCSSTK19 (817,18,val)         14.7    137.26          71.9    192.92       155                3253
    1138.BUS (1138,126,val)       84     1174.4         1505.5    442.75       452               16047
    BCSSTK26 (1922,245,val)      705.0   3813.3         4532.6   1054.8        492               37487
    PLAT1919 (1919,80,val)       361.9   1274.2          146.3    461.71       684                2070

Table 6.2: The Computational Cost of Complete Spectrum Determination with HYBSBC+TQLRAT and EA15D (ACC = i0)
In this case, EA15D successfully isolates all the distinct eigenvalues of each problem to an accuracy of i0 relative to its largest eigenvalue. Despite the dramatically reduced requested accuracy, for 10 of the 11 problems EA15D still uses substantially more CPU seconds than HYBSBC+TQLRAT, which finds all eigenvalues to full accuracy and identifies their multiplicities as well. For PLAT1919, EA15D's running time actually drops below that of HYBSBC+TQLRAT. The efficiency of EA15D in this case is a consequence of the low accuracy requested. Except for a single eigenvalue at zero, all of PLAT1919's eigenvalues have a multiplicity of 2, and the magnitudes of the smallest and largest eigenvalues differ by a factor of 10^13. Consequently, to satisfy the eigenvalue request EA15D needs to isolate only a small number of eigenvalues relative to the matrix's order. Finally, the storage requirements of EA15D and HYBSBC+TQLRAT for the table's 11 problems are comparable. For some problems EA15D requires significantly less storage, while for others its requirements exceed those of HYBSBC+TQLRAT.

We are also interested in the potential use of HYBSBC based methods for the isolation of moderately large subsets of eigenvalues. To investigate the efficiency of this approach relative to EA15D, we requested EA15D to find all the eigenvalues of BCSSTK05 and ERIS1176 in smaller and smaller intervals including the right end point of their spectra. The fraction of each problem's spectrum EA15D isolates ranges from 0.25 to 1.0. The graphs in Figure 6.1 plot EA15D timings, normalized by the time for complete spectrum determination with HYBSBC+TQLRAT, against the fraction of the spectrum isolated by EA15D. HYBSBC+TQLRAT costs could be reduced if TQLRAT sought the subset of eigenvalues found by EA15D, but HYBSBC dominates the computational requirements and the timings would only change marginally. Once again, for both problems we conduct EA15D experiments with full accuracy and with ACC = i0. As shown by these plots, EA15D has dramatically higher CPU requirements than HYBSBC+TQLRAT for the identification of all but the smallest subsets of eigenvalues. Even when the requested accuracy is significantly lowered, EA15D still requires much more CPU time for the isolation of all subsets of ERIS1176's eigenvalues. For BCSSTK05, EA15D timings only manage to drop below those of HYBSBC+TQLRAT during the isolation of the smallest subset of eigenvalues with ACC = i0.

[Figure 6.1: CPU Requirements of EA15D Relative to HYBSBC+TQLRAT. Four plots (ERIS1176 and BCSSTK05, each at full accuracy and at ACC = i0) of EA15D CPU seconds, normalized by the time for complete spectrum determination with HYBSBC+TQLRAT, against the fraction of the spectrum isolated.]

These experiments clearly demonstrate the efficiency of our novel hybrid tridiagonalization techniques. Although it may be possible to improve the efficiency of EA15D, such improvements are unlikely to significantly change the dramatic advantage enjoyed by HYBSBC+TQLRAT. For many sparse problems, EA15D's difficulties appear to arise from the isolation of eigenvalues in regions of the spectrum that are poorly separated. The shift and invert Lanczos code of the following subsection is especially suited to isolating eigenvalues from such poorly separated regions of the spectrum.
A distinct advantage of our efficient hybrid tridiagonalization approach, however, is that it can isolate blocks of eigenvalues from well separated and poorly separated regions equally well.

6.6.2 BCS's Block Shift and Invert Lanczos

We conducted all experimentation with the block shift and invert Lanczos algorithm using a test program, HLANCZOS_TEST, provided by John Lewis of Boeing Computer Services. This program uses BCSLIB-EXT routines to input the sparse matrix's structure and values, preorder it with the minimum degree algorithm [TW67, GL89] and perform a symbolic factorization, before calling the double precision eigenvalue extraction routine HDSEEX to compute the requested eigenvalues.

As for Harwell's EA15D, the user must predict the maximum level of storage required by HDSEEX, and its accompanying BCSLIB-EXT routines, before the eigenvalue extraction begins. When isolating the eigenvalues of an ordinary eigenvalue problem, Ax = lambda*x, with NZLA nonzeros in the strictly lower triangular portion of A, the BCSLIB-EXT documentation suggests allocating the main working storage as a double precision array of length (200 + (3NZLA + 20m)/2). If storage levels are underestimated, the code halts with an error message and must be restarted with a longer array. To reduce the in-memory storage requirements of intermediate results such as eigenvectors, the BCSLIB-EXT code also uses input/output files for additional storage. The BCSLIB-EXT code may use many MBytes of file space for large eigenvalue problems.

HDSEEX provides a variety of mechanisms for the user to specify which eigenvalues are desired. For each experiment we chose to provide an interval within which the routine is requested to extract all eigenvalues. The user must also provide HDSEEX with the maximum block size for its Lanczos iterations. The block size used for our experiments will be specified shortly. Finally, the user must specify an accuracy tolerance for HDSEEX's eigenvalue computations. This value is provided to HDSEEX by HLANCZOS_TEST and evaluates to 2.31 x 10 in our testing environment.

All reported timings of the BCSLIB-EXT Lanczos code include the requirements of symbolic factorization and HDSEEX's eigenvalue extraction. To be consistent with the HYBSBC experiments, however, they do not include the time to conduct the preordering, to input the sparse problem, and to check the computed eigenvalues. The CPU second timing of each extraction is actually the average timing for 2 identical extractions. HLANCZOS_TEST performs the timings using the BCSLIB routines HDSLT1 and HDSLT2. Based on comparisons with shell level timing routines, it appears these BCSLIB routines report only user time, and do not account for system time or any idle time. Consequently, the high levels of file I/O HDSEEX often requires may cause the reported timings to differ significantly from elapsed time. The system time and I/O latency are highly dependent upon the computing environment. In our testing environment, the system time for some extractions comprises more than 10% additional CPU seconds, and the elapsed time may differ from user time by more than a factor of 1.4. In contrast, HYBSBC+TQLRAT does not require high levels of file I/O, and its timings include both user and system time.

The focus of this thesis is the computation of eigenvalues, but HDSEEX always computes the eigenvector(s) corresponding to each requested eigenvalue.
The cost of constructing eigenvectors cannot be removed completely from our timings, because the code's external selective reorthogonalization techniques reorthogonalize Lanczos vectors against subsets of the converged eigenvectors. By subtracting the time for computing eigenvectors, however, we get a range into which the performance of a block shift and invert Lanczos code tuned for eigenvalue extraction might fall.

To investigate the relative efficiency of the block shift and invert Lanczos code we experimented with three Harwell-Boeing problems: PLAT362, BCSSTK19 and NOS3. The Harwell-Boeing collection provides both the sparsity pattern and values of each problem. The eigenvalues of these problems have a maximum multiplicity of 2, and we chose to use a block size of 3 for all HDSEEX extractions. Because the eigenvalues of each problem are all positive, the eigenvalues at the left end of each spectrum are generally more poorly separated, relative to the difference between the largest and smallest eigenvalues, than those at the right end. We performed two sets of experiments to explore the ability of the BCSLIB-EXT code's shift and invert techniques to isolate eigenvalues from both regions of the spectrum. These experiments request HDSEEX to extract contiguous subsets of eigenvalues from either the left or the right extreme of each problem's spectrum. The fraction of a problem's eigenvalues isolated by each extraction ranges from 0.125 to 1.0. The graphs in Figure 6.2 plot Lanczos timings, normalized by the time for complete spectrum determination with HYBSBC+TQLRAT, against the fraction of the eigenvalues isolated. The solid and broken lines in each plot connect data points whose Lanczos timings include or exclude the computation of eigenvectors, respectively. As previously discussed, the performance curve of a shift and invert Lanczos code tuned for eigenvalue extraction would lie between this pair of lines. Table 6.2 of Section 6.6.1 summarizes the cost of determining the complete spectrum of BCSSTK19 and NOS3 with HYBSBC+TQLRAT, while Table 6.3 provides similar information for PLAT362.

    bGPS    Time (CPU sec.)
    54      8.96

Table 6.3: HYBSBC+TQLRAT Summary for PLAT362

[Figure 6.2: The CPU requirements of BCSLIB-EXT Lanczos, relative to HYBSBC+TQLRAT, extracting subsets of eigenvalues from both the left and right ends of a problem's spectrum. Six plots: left end and right end extractions for PLAT362, BCSSTK19 and NOS3, against the fraction of the spectrum isolated.]

As shown by the three pairs of plots in Figure 6.2, the Lanczos code exhibits very similar performance characteristics while extracting eigenvalues from opposite ends of a problem's spectrum. The shift and invert Lanczos code, however, has substantially higher CPU requirements than HYBSBC+TQLRAT for the identification of all but the smallest subsets of eigenvalues. Because the Lanczos timings do not account for the potentially substantial cost of performing file I/O, it is difficult to pinpoint the size of the eigenvalue subset for which computing the complete spectrum with HYBSBC+TQLRAT becomes slower than with the Lanczos code.
For PLAT362 the crossover point is close to 10 or 12%, while for BCSSTK19 it appears to be significantly lower. For NOS3 the Lanczos code is relatively more efficient, moving the crossover point slightly higher. Finally, we note that the magnitude of the relative advantage enjoyed by HYBSBC+TQLRAT in these experiments is comparable to that in the EA15D experiments with ACC = i0.

This chapter's experiments with BCSLIB-EXT Lanczos and EA15D clearly demonstrate the efficiency and versatility of eigenvalue algorithms based on our novel hybrid tridiagonalization techniques relative to Lanczos-type algorithms. HYBSBC+TQLRAT significantly outperforms either Lanczos code when all eigenvalues are requested, and maintains its advantage until the requested number of eigenvalues is reduced to surprisingly small subsets of the entire spectrum.

Chapter 7

Conclusions and Future Research

This dissertation has studied a restricted form of the fundamental algebraic eigenvalue problem, focusing upon generally applicable methods for determining a significant fraction of the eigenvalues of a large sparse symmetric matrix or its complete spectrum. The approach taken by this research was to reexamine techniques for reducing symmetric matrices to tridiagonal form with carefully selected sequences of orthogonal similarity transformations. Using a combination of theoretical analysis and extensive testing with symbolic and numerical implementations, we have developed novel sequential eigenvalue methods that more fully exploit the sparsity of symmetric matrices than previously published methods.

This chapter concludes the thesis by providing a summary of its major results and conclusions, and briefly outlining directions of future research.

7.1 Summary and Conclusions

Without sparse model problems that are both conducive to formal analysis and representative of the characteristics of a broad spectrum of practical sparse problems, we chose to support the development and evaluation of sparse reduction algorithms using a combination of theoretical and experimental analysis. This successful approach was greatly assisted by the group of novel sparse reduction tools described in Chapter 2. A key role in the research of this thesis was played by our general framework for analyzing the complexity of algorithms applying sequences of Givens similarity transformations to sparse symmetric matrices. This framework was used by our research to facilitate the precise analysis of the BC, SBC and R-S algorithms for special model problems. The new symbolic reduction tools Xmatrix and Trisymb also use this analysis framework to predict the computational requirements of their simulated reductions. Xmatrix provides an interactive graphical X Windows interface for modeling the reduction of small sparse matrices with sequences of Givens transformations. In contrast, Trisymb provides a platform for the symbolic implementation of sparse reduction algorithms using sequences of Givens transformations to reduce large sparsely or densely banded symmetric matrices.
Both symbolic reduction tools were indispensable while conducting the research of this thesis, but Trisymb, free of the encumbering influence of the computing environment, was especially important for the accurate analysis of algorithms and the confirmation of conclusions drawn from experiments with numerical codes. Finally, we developed a bipartite graph model for the application of orthogonal transformations to sparse symmetric matrices, which is useful for the description and analysis of sparse reduction algorithms.

Using these tools we explored the deficiencies and limitations of previously published tridiagonalization methods and considered their potential extension to improve sparsity exploitation. By characterizing, with formal and experimental analysis, the high levels of fill associated with standard and customized Givens reduction algorithms, we demonstrated that it is essential to restrict the accumulation of fill entries to some maintainable substructure of the matrix in order to effectively utilize matrix sparsity. Alternatively, we extended the band-preserving tridiagonalization algorithm of Rutishauser and Schwarz for application to general sparse symmetric matrices by introducing an initial bandwidth reducing preordering stage and techniques to exploit band sparsity during the application of transformations. Although the sparse R-S algorithm is a significant improvement, we demonstrated that it suffers from instances of fill cascading that quickly fill the band of a permuted matrix, placing almost complete reliance on the preordering to exploit matrix sparsity. The densely banded algorithm of Lang can be similarly extended for use with general sparse matrices, but its Householder transformations exhibit more prolific fill characteristics.
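The elementary operation underlying all of these band-preserving reductions can be illustrated with a few lines of NumPy (dense storage and invented entries; this is a sketch of the basic step, not of any of the implementations discussed in this thesis): an adjacent Givens similarity transformation annihilates one entry of the outermost diagonal, and the same transformation creates a bulge entry just outside the band that must subsequently be chased away or, in the sparse algorithms above, exploited.

    import numpy as np

    def adjacent_givens_similarity(A, i, j):
        # Rotate in the (i-1, i) plane to annihilate A[i, j] (and, by symmetry,
        # A[j, i]); applied as a similarity transformation, so the eigenvalues
        # of the returned matrix are those of A.
        r = np.hypot(A[i - 1, j], A[i, j])
        c, s = A[i - 1, j] / r, A[i, j] / r
        G = np.eye(A.shape[0])
        G[i - 1, i - 1] = c; G[i - 1, i] = s
        G[i, i - 1] = -s;    G[i, i] = c
        return G @ A @ G.T

    # A 6 x 6 symmetric matrix of bandwidth 2 (entries are arbitrary).
    A = np.zeros((6, 6))
    for d, val in [(0, 4.0), (1, 1.0), (2, 0.5)]:
        A += np.diag(np.full(6 - d, val), k=d)
        if d:
            A += np.diag(np.full(6 - d, val), k=-d)

    B = adjacent_givens_similarity(A, 2, 0)            # eliminate the band entry A[2, 0]
    print(abs(B[2, 0]) < 1e-14, abs(B[4, 1]) > 1e-14)  # True True: entry gone, bulge at (4, 1)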
In response to the limitations of these algorithms we proposed the Bandwidth Contraction algorithm. It also employs bandwidth reducing preorderings, band-preserving reduction techniques, and adjacent Givens transformations, but reorders the elimination sequence to prolong and more fully exploit internal band sparsity. Several contributing factors permit the diagonally-oriented reduction of BC to perform a partial reduction that significantly contracts the bandwidth of a permuted sparse matrix at low cost before producing a densely banded intermediate matrix. This allows BC to reduce the tridiagonalization costs of many practical sparse problems relative to R-S. Using detailed complexity analyses of the Rutishauser-Schwarz and Bandwidth Contraction algorithms, however, we showed that the tridiagonalization costs of BC are typically 10-25% more than those of R-S for a densely banded matrix. These observations led to the development of the Hybrid Bandwidth Contraction algorithm. Using a transition strategy thresholding the density of the outermost diagonal of the intermediate band, HYBBC combines a partial bandwidth contraction and R-S to exploit their specific reduction characteristics and improve reduction efficiency.

Extensive experiments were conducted with implementations of the Bandwidth Contraction and Hybrid Bandwidth Contraction algorithms, and a large test suite of practical sparse problems. Both implementations were shown to dramatically reduce the computational requirements of tridiagonalization relative to the EISPACK BANDR routine, an R-S implementation. In fact, for a wide range of 70 practical sparse problems HYBBC reduces CPU requirements by an average of 31%, with reductions as high as 63% for certain problems, but unlike BC it is always comparable to BANDR in the worst case. The improved efficiency of these algorithms was accomplished without increasing storage requirements. With additional experimentation we also investigated the relationship between preorderings, sparsity structures and HYBBC performance. The class of sparsity structures ideally suited to bandwidth contraction techniques concentrates nonzeros near the main diagonal and exhibits increased sparsity towards the outermost diagonals of the band. These experiments clearly demonstrated, however, that the primary objective of a preordering algorithm must be to reduce bandwidth.

Typically, bulge chasing requirements dominate the costs of applying the Bandwidth Contraction algorithm to most sparse problems. During a sparse tridiagonalization with BC, more than 96% of the transformations may eliminate bulge entries. In response to this observation we developed the diagonally-oriented Split Bandwidth Contraction algorithm, which inherits all the advantages enjoyed by its predecessor for a sparse band. The bidirectional elimination techniques of SBC, however, take additional advantage of band sparsity to dramatically shorten bulge chasing transformation sequences and potentially halve the computational requirements of partial bandwidth contractions.

The success of SBC is dependent upon the position of the split-point at which a sparse diagonal's bidirectional elimination commences. We demonstrated that for many sparse problems the split-point selected for a diagonal's elimination may significantly influence the efficiency of the remainder of the reduction. Using the symbolic reduction tools Xmatrix and Trisymb, we investigated four global split-point selection strategies. Experimental evidence demonstrated that the minimum displacement split-point selection strategy with damped tiebreaking is preferable.

To evaluate the efficiency of performing partial bandwidth contractions with SBC, we conducted an extensive experimental comparison of BC and SBC using both symbolic and numeric implementations. SBC was found to substantially reduce the computational requirements of partial bandwidth contractions for most problems. Reductions of 45-55% were common, but ranged as high as 76% for one sparse problem. SBC's efficient partial contractions make it an ideal replacement for HYBBC's Bandwidth Contraction stage in a second generation hybrid tridiagonalization algorithm, and a valuable preprocessing technique for other banded eigenvalue routines and sparse linear systems solvers.

Chapter 5's novel Hybrid Split Bandwidth Contraction algorithm incorporates many aspects of HYBBC, but replaces its BC stage with SBC and improves upon the techniques for transition bandwidth selection. Using formal analyses of SBC and BC, we showed that as long as SBC finds a well centred split-point for each diagonal's reduction, it provides tridiagonalizations with significantly lower flop requirements than R-S. If split-points are forced sufficiently far from the midpoint of a diagonal, however, the costs of completing a tridiagonalization with SBC may rise above those of R-S. To select the transition bandwidth at which HYBSBC switches between its SBC and R-S stages, we developed the Δ-transition strategy. By exploiting characteristics of a typical SBC reduction, the Δ-transition strategy is able to use complexity analyses of SBC and R-S to precisely regulate HYBSBC's transition, selecting close to optimal transition bandwidths that minimize computational requirements for practical problems.
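A tiny sketch of the simpler, density-thresholding style of transition test used by HYBBC may help fix ideas (the Δ-transition strategy itself relies on the complexity analyses of Chapter 5, which are not reproduced here); the 0.75 threshold and the random test band below are invented for the illustration.

    import numpy as np

    def outermost_diagonal_density(A, b):
        # Fraction of entries on diagonal b (the current outermost diagonal of a
        # symmetric matrix of bandwidth b, held densely) that are nonzero.
        d = np.diag(A, k=-b)
        return np.count_nonzero(d) / d.size

    def switch_to_dense_stage(A, b, threshold=0.75):
        # Hand the reduction over to the dense, column-oriented stage once the
        # outermost diagonal has filled in; the threshold is a placeholder value.
        return outermost_diagonal_density(A, b) >= threshold

    rng = np.random.default_rng(3)
    n, b = 12, 5
    A = np.tril(rng.random((n, n)) * (rng.random((n, n)) < 0.4))
    A = A + np.tril(A, -1).T                                             # symmetric
    A[np.abs(np.subtract.outer(np.arange(n), np.arange(n))) > b] = 0.0   # bandwidth b
    print(outermost_diagonal_density(A, b), switch_to_dense_stage(A, b))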
Once again, extensive experimentation was conducted with a numeric implementation of the Hybrid Split Bandwidth Contraction algorithm. The Δ-transition strategy permits HYBSBC to effectively exploit SBC's bidirectional elimination techniques and significantly improve upon both HYBBC and BANDR tridiagonalizations. Relative to HYBBC, HYBSBC substantially reduces transformation totals and provides additional savings in CPU time of 20-40% for many sparse problems. Savings of more than 50% were observed for select problems. These additional gains make HYBSBC a very impressive alternative to BANDR, with one problem tridiagonalized in 1/5 of BANDR's time. For the 70 test problems with n > 400, HYBSBC requires on average 39% fewer CPU seconds than BANDR.

To demonstrate the relative efficiency of sparse tridiagonalization based eigenvalue methods, we compared variants of the Lanczos algorithm with the Hybrid Split Bandwidth Contraction algorithm. We used both theoretical and experimental analysis to explore the ability of Lanczos algorithms to economically extract moderately large subsets of eigenvalues or identify the complete spectrum of sparse symmetric problems. Experiments with the BCSLIB-EXT Lanczos code and Harwell's EA15D clearly demonstrate the efficiency of HYBSBC based eigenvalue methods relative to Lanczos-type algorithms. In conjunction with TQLRAT, HYBSBC significantly outperforms either Lanczos code when all eigenvalues are requested, and maintains its advantage for the typical sparse problem until the requested number of eigenvalues is reduced to a surprisingly small subset consisting of 10-20% of the entire spectrum's eigenvalues.

From these results we conclude that eigenvalue methods based on the novel hybrid tridiagonalization algorithms introduced in this dissertation provide the most reliable, versatile and efficient techniques for isolating subsets of a sparse symmetric matrix's eigenvalues ranging in size from moderate fractions of the entire spectrum to all eigenvalues, including their multiplicities. Although HYBSBC is slower than HYBBC for a small number of our test matrices, HYBSBC significantly outperforms HYBBC for the majority of problems and is recommended as the best algorithm for general implementation.

7.2 Future Research

This section briefly outlines some areas of possible future research resulting from the work of this thesis. Each topic belongs to one or more general categories of research, including symbolic reduction tools, the enhancement of sparse bandwidth contraction techniques, the generalization of sparse tridiagonalization techniques, and parallel sparse tridiagonalization algorithms.

• Continue the development of Xmatrix to create an interactive teaching tool, permitting the symbolic modeling, manipulation and demonstration of a wide variety of sparse matrix techniques and algorithms.

• Extend our sparse reduction methods with new techniques facilitating eigenvector computation. If one wants to compute eigenvectors by accumulating the transformations applied during a sparse tridiagonalization, it is important to carefully regulate transformation totals to minimize flop requirements.
The development of new transition strategies founded in complexity analysis is needed to optimize HYBSBC's reduction in this situation. Alternatively, when only a few eigenvectors are desired, we propose investigating the trade-offs associated with applying a banded inverse iteration algorithm to intermediate matrices from the SBC stage of HYBSBC tridiagonalizations.

• Continue to investigate the relationship between preorderings and the performance of sparse bandwidth contraction techniques. One approach of particular interest is the possibility of using a local reordering scheme to refocus spectral preordering methods [BPS93] for bandwidth reduction without significantly altering the characteristics of spectral preorderings beneficial to BC and SBC reductions.

• For the special case in which a partial bandwidth contraction is the end goal of a reduction, develop a sparsely banded data structure and reevaluate the implementation of SBC with sparse band transformations.

• Explore the potential role of nonadjacent transformations in special variants of our sparse bandwidth contraction algorithms designed for problems with specific sparsity pattern characteristics.

• Investigate the trade-offs associated with the use of rank-1 tearing and modification techniques [Gol73, BNS78] to create split-points and prolong the efficient Split Bandwidth Contraction stage of HYBSBC's reduction.

• Extend our successful sparse tridiagonalization techniques to develop sparse bidiagonalization algorithms for both symmetric and unsymmetric sparse matrices.

• Explore the potential of creating a sparse, symmetric generalized eigenvalue method by extending Crawford's banded algorithm [Cra73, Kau84] with our sparse reduction techniques.

• Use the knowledge of sparse tridiagonalization gained during the investigations of this thesis to guide the development of novel parallel sparse eigenvalue algorithms. A primary focus of this research would be parallel hybrid tridiagonalization algorithms combining the bidirectional elimination of SBC with densely banded Lang-type algorithms [Lan92, BS92], which employ multiple elimination and delayed bulge chasing techniques to provide efficient parallel implementations. Although SBC's sparse bidirectional elimination enhances the parallelism of sparse bandwidth contractions, one could also consider the introduction of other techniques to improve the parallel implementation of SBC, including split-point creation with rank-1 tearings and modifications, delayed bulge chasing, and modified split-point selection methods permitting the simultaneous elimination of multiple sparse diagonals in a pipelined fashion.

Bibliography

[ABB+92] E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen. LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, 1992.

[AP94] Andrew A. Anda and Haesun Park. Fast plane rotations with dynamic scaling. SIAM Journal on Matrix Analysis and Applications, 15(1):162-174, 1994.

[BK77] James R. Bunch and Linda Kaufman. Some stable methods for calculating inertia and solving symmetric linear systems. Mathematics of Computation, 31(137):163-179, 1977.

[BNS78] James R. Bunch, Christopher P. Nielsen, and Danny C. Sorensen. Rank-one modification of the symmetric eigenproblem. Numerische Mathematik, 31:31-48, 1978.

[Boe89] Boeing Computer Services, Seattle, WA.
The Boeing Extended MathematicalSubprogram Library, 1989.[BPS93] Stephen T. Barnard, Alex Pothen, and Horst D. Simon. A spectral algorithm forenvelope reduction of sparse matrices. Technical Report CS-93-49, Departmentof Computer Science, University of Waterloo, Waterloo, Ontario, 1993.[BS92] Christian H. Bischof and Xiaobai Sun. A framework for symmetric band reduction and tridiagonalization. Preprint MCS-P298-0392, Mathematics andComputer Science Division, Argonne National Laboratory, Argonne, IL, July1992.[Cav93j Ian A. Cavers. Tridiagonalization costs of the Bandwidth Contraction andRutishauser-Schwarz algorithms. Technical Report 93—39, Department of Computer Science, University of British Columbia, Vancouver, B.C., November1993.182BIBLIOGRAPHY 183[Cav94] Ian A. Cavers. A hybrid tridiagonalization algorithm for symmetric sparsematrices. SIAM Journal on Matrix Analysis and Applications, 15(4), October1994. To appear.[CCDG82] P. Z. Chinn, J. Chvatalova, A. K. Dewdney, and N. E. Gibbs. The bandwidthproblem for graphs and matrices— a survey. Journal of Graph Theory, 6:223—254, 1982.[Cra73] C. R. Crawford. Reduction of a band-symmetric generalized eigenvalue problem. Communications of the ACM, 16(1):41—44, 1973.[Cup8l] J. J. M. Cuppen. A divide and conquer method for the symmetric tridiagonaleigenproblem. Numerische Mathematik, 36:177—195, 1981.[Cut72j Elizabeth Cuthill. Several strategies for reducing the bandwidth of matrices. InDonald J. Rose and Ralph A. Willoughby, editors, Sparse Matrices and TheirApplications, pages 157—166. Plenum Press, New York, 1972.[CW79J Jane Cullum and Ralph A. Willoughby. Lanczos and the computation in specified intervals of the spectrum of large, sparse real symmetric matrices. Inlain S. Duff and G. W. Stewart, editors, Sparse Matrix Proceedings 1978, pages220—255. Society for Industrial and Applied Mathematics, 1979.[CW8Oj Jane Cullum and Ralph A. Willoughby. Computing eigenvectors (and eigenvalues) of large, symmetric matrices using Lanczos tridiagonalization. In G. A.Watson, editor, Numerical Analysis, Proceedings of the 8th Biennial Conference, Dundee, June 26-29, 1979, pages 46—63. Springer-Verlag, 1980.[DC921 J. Du Croz, Personal communication, October 1992.[DDHD9OJ Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and lain Duff. A set oflevel 3 basic linear algebra subprograms. ACM Transactions on MathematicalSoftware, 16(1):1—17, 1990.[DDHH88] Jack J. Dongarra, Jeremy Du Croz, Sven Hammarling, and Richard J. Hanson. An extended set of FORTRAN basic linear algebra subprograms. ACMTransactions on Mathematical Software, 14(1):1—17, 1988.[DGLP82I I. S. Duff, Roger Grimes, John Lewis, and Bill Poole. Sparse matrix testproblems. ACM SIGNUM Newsletter, 17:22, June 1982.BIBLIOGRAPHY 184[DR75] I. S. Duff and J. K. Reid. On the reduction of sparse matrices to condensedforms by similarity transformations. Journal of the Institute of Mathematicsand Its Applications, 15:217—224, 1975.[DS87] Jack J. Dongarra and D. C. Sorensen. A fully parallel algorithm for the symmetric eigenvalue problem. SIAM Journal on Scientific and Statistical Computing,8(2):s139—s154, 1987.[ELT79J J. T. Edwards, D. C. Licciardello, and D. J. Thouless. Use of the lanczos methodfor finding complete sets of eigenvalues of large sparse symmetric matrices.Journal of the Institute of Mathematics and Its Applications, 23:277—283, 1979.[ER80J Thomas Ericsson and Axel Ruhe. 
The spectral transformation Lanczos methodfor the numerical solution of large sparse generalized symmetric eigenvalue problems. Mathematics of Computation, 35(152):1251—1268, October 1980.[FGMS93J S. I. Feldman, David M. Gay, Mark W. Maimone, and N. L. Schryer. AFORTRAN—to—C converter. Computing Science Technical Report 149, AT&TBell Laboratories, 1993.[GBDM77J B. S. Garbow, J. M. Boyle, J. J. Dongarra, and C. B. Moler. Matrix EigensystemRoutines- EISPACK Guide Extension, volume 51 of Lecture Notes in ComputerScience. Springer-Verlag, New York, 1977.[Gen73] W. Morven Gentleman. Least squares computations by Givens transformations without square roots. Journal of the Institute of Mathematics and ItsApplications, 12:329—336, 1973.[GGJK78] M. R. Garey, R. L Graham, D. S. Johnson, and D. E. Knuth. Complexityresults for bandwidth minimization. SIAM Journal on Applied Mathematics,34(3):477—495, 1978.[Gib76j N. E. Gibbs. A hybrid profile reduction algorithm. ACM Transactions onMathematical Software, 2(4):378—387, 1976.[GL78a] Alan George and Joseph W. H. Liu. An automatic nested dissection algorithmfor irregular finite element problems. SIAM Journal on Numerical Analysis,15(5):1053—1069, October 1978.[GL78b] J. A. George and J. W. H. Liu. User guide for SPARSPAK: Waterloo sparselinear equations package. Research Report CS-78-30, Department of ComputerScience, University of Waterloo, Waterloo, Ontario, 1978.BIBLIOGRAPHY 185[GL89] Alan George and Joseph W. H. Liu. The evolution of the minimum degreeordering algorithm. SIAM Review, 31(1):1—19, 1989.[GLD92] Roger G. Grimes, John G. Lewis, and lain S. Duff. Users’ guide for the HarwellBoeing sparse matrix collection (release I). Technical Report MEA-TR-202 Ri,Mathematics and Engineering Analysis, Boeing Computer Services, Seattle,WA, October 1992.[GLS94] Roger G. Grimes, John G. Lewis, and Horst D. Simon. A shifted block Lanczos algorithm for solving sparse symmetric generalized eigenproblems. SiamJournal on Matrix Analysis and Applications, 1(15):228—272, January 1994.[GMS92] John R. Gilbert, Cleve Moler, and Robert Schreiber. Sparse matrices in MAT-LAB: Design and implementation. SIAM Journal on Matrix Analysis and Applications, 13(1) :333—356, January 1992.[Gol73j Gene H. Golub. Some modified matrix eigenvalue problems. SIAM Review,15(2):318—334, 1973.{GPS76aJ Norman E. Gibbs, William G. Poole Jr., and Paul K. Stockmeyer. An algorithmfor reducing the bandwidth and profile of a sparse matrix. SIAM Journal onNumerical Analysis, 13(2) :236—250, 1976.[GPS76b] Norman E. Gibbs, William G. Poole Jr., and Paul K. Stockmeyer. A comparisonof several bandwidth and profile reduction algorithms. ACM Transactions onMathematical Software, 2(4) :322—330, 1976.[GV89] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns HopkinsUniversity Press, Baltimore, second edition, 1989.[Har69] Frank Harary. Graph Theory. Addison-Wesley Publishing Company, Inc., 1969.[Har9O] Harwell Laboratory, Oxfordshire, England. Harwell Subroutine Library, Release10, 1990.[Jes89] Elizabeth R. Jessup. Parallel Solution of the Symmetric Tridiagonal Eigenproblem. PhD thesis, Department of Computer Science, Yale University, October1989.{Kau84] Linda Kaufman. Banded eigenvalue solvers on vector machines. ACM Transactions on Mathematical Software, 10(1):’73—86, 1984.BIBLIOGRAPHY 186[Lan5Oj Cornelius Lanczos. An iteration method for the solution of the eigenvalueproblem of linear differential and integral operators. 
Journal of Research of theNational Bureau of Standards, 45(4):255—282, October 1950.[Lan92J Bruno Lang. Reducing symmetric banded matrices to tridiagonal form — A comparison of a new parallel algorithm with two serial algorithms on the iPSC/860.Technical report, Universität Karlsruhe, January 1992.[Lew82j John G. Lewis. Implementation of the Gibbs—Poole—Stockmeyer and Gibbs—King algorithms. Transactions on Mathematical Software, 8(2):180—189, 1982.[LHKK79] C. L. Lawson, R. J. Hanson, D. R. Kincaid, and F. T. Krogh. Basic linear algebra subprograms for FORTRAN usage. ACM Transactions on MathematicalSoftware, 5(3):308—323, 1979.{Mat921 The MathWorks, Natick, MA. MATLAB User’s Guide, 1992.[MRW71] R. S. Martin, C. Reinsch, and J. H. Wilkinson. The QR algorithm for bandsymmetric matrices. In J. H. Wilkinson and C. Reinsch, editors, Linear Algebra,volume II of Handbook for Automatic Computation, pages 266—272. Springer-Verlag, 1971.[Pap76j C. H. Papadimitriou. The NP-completeness of the bandwidth minimizationproblem. Computing, 16:263—27, 1976.[Par8O) Beresford N. Parlett. The Symmetric Eigenvalue Problem. Prentice—Hall, 1980.{Pot93] Alex Pothen, Personal communication, May 1993.[PR81] B. N. Parlett and J. K. Reid. Tracking the progress of the Lanczos algorithmfor large symmetric eigenproblems. IMA Journal of Numerical Analysis, 1:135—155, 1981.[PS79] B. N. Parlett and D. S. Scott. The Lanczos algorithm with selective orthogonalization. Mathematics of Computation, 33(145):217—238, January 1979.[Rut63] H. Rutishauser. On Jacobi rotation patterns. In Experimental Arithmetic, HighSpeed Computing and Mathematics, volume 15 of Proceedings of Symposia inApplied Mathematics, pages 219—239, Providence, RI, April 1963. AmericanMathematical Society.[SBD76] B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema,and C. B. Moler. Matrix Eigensystem Routines - EISPACK Guide, volume 6 ofBIBLIOGRAPHY 187Lecture Notes in Computer Science. Springer-Verlag, New York, second edition,1976.[Sch63] H. R. Schwarz. Reduction of a symmetric bandmatrix to triple diagonal form.Communications of the ACM, 6(6):315—316, June 1963.[Sch7l] H. R. Schwarz. Tridiagonalization of a symmetric band matrix. In J. H. Wilkinson and C. Reinsch, editors, Linear Algebra, volume II of Handbook for Automatic Computation, pages 273—283. Springer-Verlag, New York, 1971.[Sco83] David S. Scott. LASO2 Documentation. Computer Science Department, University of Texas at Austin, 1983.[Sim84] Horst D. Simon. The Lanczos algorithm with partial reorthogonalization. Mathematics of Computation, 42(165): 115—142, 1984.[TW67] W. F. Tinney and J. W. Walker. Direct solutions of sparse network equations by optimally ordered triangular factorization. Proceedings of the IEEE,55(11):1801—1809, 1967.[Wi165J J. H. Wilkinson. The Algebraic Eigenvalue Problem. Claredon Press, Oxford,second edition, 1965.
