ANALYSIS OF CYCLIC REDUCTION FOR THE NUMERICAL SOLUTION OF THREE-DIMENSIONAL CONVECTION-DIFFUSION EQUATIONS

by

CHEN GREIF

B.Sc. (Applied Mathematics), Tel Aviv University, 1991
M.Sc. (Applied Mathematics), Tel Aviv University, 1994

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES, Department of Mathematics, Institute of Applied Mathematics

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
April 1998
(c) Chen Greif, 1998

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Mathematics
The University of British Columbia
Vancouver, Canada

Abstract

This thesis deals with the numerical solution of convection-diffusion equations. In particular, the focus is on the analysis of applying one step of cyclic reduction to linear systems of equations which arise from finite difference discretization of steady-state three-dimensional convection-diffusion equations. The method is based on decoupling the unknowns and solving the resulting smaller linear systems using iterative methods. In three dimensions this procedure results in some loss of sparsity, compared to lower dimensions. Nevertheless, the resulting linear system has excellent numerical properties, is generally better conditioned than the original system, and gives rise to faster convergence of iterative solvers, and to convergence in cases where solvers of the original system of equations fail to converge.

The thesis starts with an overview of the equations that are solved and general properties of the resulting linear systems. Then, the unsymmetric discrete operator is derived and the structure of the cyclically reduced linear system is described. Several important aspects are analyzed in detail. The issue of orderings is addressed and a highly effective ordering strategy is presented. The complicated sparsity pattern of the matrix requires careful analysis; comprehensive convergence analysis for block stationary methods is provided, and the bounds on convergence rates are shown to be very tight. The computational work required to perform cyclic reduction and compute the solution of the linear system is discussed at length. Preconditioning techniques and various iterative solvers are considered.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
Acknowledgements
Chapter 1. Introduction
  1.1 Background
    1.1.1 The Convection-Diffusion Equation
    1.1.2 Finite Difference Methods
    1.1.3 Solving the Linear System
  1.2 One Step of Cyclic Reduction
    1.2.1 The One-Dimensional Case
    1.2.2 More Space Dimensions
  1.3 Thesis Outline and Notation
Chapter 2. Cyclic Reduction
  2.1 Complete Cyclic Reduction
  2.2 The Three-Dimensional Cyclically Reduced Operator
    2.2.1 The Constant Coefficient Model Problem
    2.2.2 The Variable Coefficient Case
  2.3 Properties of the Reduced Matrix
Chapter 3. Ordering Strategies
  3.1 Block Ordering Strategies for 3D Grids
  3.2 The Family of Two-Plane Orderings
  3.3 The Family of Two-Line Orderings
  3.4 Comparison Results
Chapter 4. Convergence Analysis for Block Stationary Methods
  4.1 Symmetrization of the Reduced System
    4.1.1 The Constant Coefficient Case
    4.1.2 The Variable Coefficient Case
  4.2 Bounds on Convergence Rates
  4.3 "Near-Property A" for 1D Partitioning of the Two-Plane Matrix
  4.4 Computational Work
  4.5 Comparison with the Unreduced System
  4.6 Fourier Analysis
  4.7 Comparisons
Chapter 5. Solvers, Preconditioners, Implementation and Performance
  5.1 Krylov Subspace Solvers and Preconditioners - Overview
  5.2 Incomplete Factorizations for the Reduced System
  5.3 The Overall Cost of Construction of the Reduced System
  5.4 Numerical Experiments
  5.5 Vector and Parallel Implementation
Chapter 6. Summary, Conclusions and Future Research
  6.1 Summary and Conclusions
  6.2 Future Research
Bibliography

List of Tables

4.1  comparison between the computed spectral radius and the bound for the 1D splitting
4.2  comparison between the computed spectral radius and the bound for the 2D splitting
4.3  comparison of computed spectral radii of the 1D Jacobi iteration matrix with matrices which are consistently ordered
4.4  comparison between iteration counts for the reduced and unreduced systems
4.5  iteration counts for different iterative schemes
4.6  iteration counts for one nonzero convective term
4.7  iteration counts for two nonzero convective terms
5.1  number of nonzero elements in the ILU(1) factorization
5.2  computational work involved in the construction of the reduced matrix
5.3  comparison between the computed spectral radii and the bounds
5.4  comparison of solving work/time for various mesh sizes
5.5  comparison between estimated condition numbers
5.6  performance of block stationary methods for Test Problem 1
5.7  construction time/work of ILU preconditioners
5.8  performance of QMR for Test Problem 1
5.9  performance of BiCG for Test Problem 1
5.10 performance of Bi-CGSTAB for Test Problem 1
5.11 performance of CGS for Test Problem 1
5.12 overall flop counts and average computed constants for Test Problem 2
5.13 performance of various incomplete factorizations for Test Problem 2
5.14 norm of the error for Test Problem 3
5.15 comparison of performance for Test Problem 3
5.16 comparison of iteration counts of natural and multicolor orderings

List of Figures

1.1  computational molecules
1.2  numerical solution of Eq. (1.24)
1.3  red/black ordering of the one-dimensional grid
1.4  the matrix associated with the red/black ordering for the one-dimensional case
1.5  eigenvalues of the reduced matrix (solid line) and the unreduced matrix (broken line) for the one-dimensional model problem with n = 65 and β = 0.3
1.6  natural lexicographic ordering of the tensor-product grids
1.7  sparsity patterns of the matrices
1.8  red/black ordering in the two-dimensional case
2.1  a three-dimensional checkerboard
2.2  red/black ordering of the 3D grid
2.3  points that are affected by the block Gaussian elimination
2.4  structure of the computational molecule associated with the reduced operator
2.5  sparsity pattern of the lexicographically ordered reduced matrix
2.6  eigenvalues of both systems for Poisson's equation
2.7  singular values of both matrices for Eq. (2.24)
3.1  orderings of the 2D block grid
3.2  three members in the family of natural two-plane orderings
3.3  red/black and toroidal two-plane ordering corresponding to x-y oriented 2D blocks
3.4  sparsity patterns of two members of the two-plane family of orderings
3.5  block computational molecule corresponding to the family of orderings 2PN
3.6  four-color 1D block ordering
3.7  ordering and sparsity pattern of the matrix associated with 2LNxy ordering
3.8  block computational molecule corresponding to the ordering strategy 2LNxy
3.9  possible block partitionings of the two-plane matrix
3.10 a zoom on 2D blocks
3.11 comparison of the spectral radii of block Jacobi iteration matrices for cross-sections of the mesh Reynolds numbers
3.12 spectral radii of iteration matrices vs. mesh Reynolds numbers
3.13 symmetric reverse Cuthill-McKee ordering of the reduced matrix
4.1  sparsity patterns of the matrices involved in the proof of Lemma 4.8
4.2  "Near Property A" for the 1D splitting
4.3  sparsity pattern of the matrix C_J^(1)
4.4  the function h_s(α)
4.5  spectral radius of the SOR iteration matrix vs. the relaxation parameter
4.6  comparison of the spectral radii of the Gauss-Seidel iteration matrices
5.1  a two-dimensional slice of the stencil associated with U
5.2  fill-in in the construction of ILU(1) in the plane containing the gridpoint for which the discretization was done
5.3  fill-in in the construction of ILU(1) in the plane adjacent to the gridpoint for which the discretization was done
5.4  sparsity patterns of the factors of ILU(1) and ILU(2)
5.5  sparsity patterns of factors for ILU with drop tolerance 10^-2
5.6  l2-norm of relative residual for preconditioned Bi-CGSTAB
5.7  a 2D slice of the numerical solution of Test Problem 3
5.8  2D mesh architecture (3 x 3)
5.9  a 2D slice of gridpoints associated with one processor
5.10 the part of the reduced matrix that contains the gridpoints associated with a certain subcube, and the rectangular matrix before local ordering
5.11 the "local part" of the rectangular matrix after re-ordering
5.12 a 2D slice of processors that hold entries of 1D or 2D sets of unknowns

Acknowledgements

I would like to thank my supervisor, Jim Varah, for his devoted guidance. Jim has been a great inspiration to me, always supportive, enthusiastic, insightful and helpful.
I learned a lot from Jim about matrix analysis and linear algebra, and am deeply grateful to him for the many hours he devoted to me, and for making these years of work on my thesis a most pleasant, enjoyable and satisfying experience.

I am indebted to Gene Golub. This work started after I had read Gene's joint papers with Howard Elman, and was fascinated by the results and the analysis. I am grateful to Gene for his interest in my work, for pointing out useful references and important aspects of the problem, and for his kind hospitality during my visits at Stanford University.

I would also like to thank Howard Elman for being the external examiner of this thesis, and in particular for his valuable comments and suggestions concerning Section 4.3, which substantially improved the exposition.

Many thanks to Uri Ascher. Uri has been very helpful, and I am grateful to him for numerous pleasant and fruitful discussions and for teaching me many interesting and important aspects of numerical computation.

Thanks to Michael Ward and Anthony Peirce for their interest and input in our committee meetings; thanks to Michael for providing interesting problems and references. Thanks also to Eldad Haber, for many stimulating discussions on everything from global politics to cyclic reduction. I have benefited from several discussions with Xiaodi Sun and Xiao-Wen Chang and thank them for their helpful suggestions.

Finally, very special thanks to my wife Roni and my daughter Shakaed, who have been a wonderful source of motivation and support, and have given me all the love and encouragement I could hope for.

Chapter 1

Introduction

Mathematical models for fluid mechanics are of major interest in science and engineering. The important task of modelling physical phenomena (such as the motion of particles in a fluid or the relative motion between an object and a fluid) is complicated and challenging, not only from the point of view of the scientists who design the model for a given problem, but also from the point of view of the numerical analysts who are interested in finding accurate numerical solutions for the model. Concentration of a pollutant (for example radioactive waste or atmospheric pollution), flow of fluid past an obstacle, the motion of airplanes and submarines, the wind that blows past a bridge, semi-conductor modelling, and even some financial models of share options - all these important problems have in common the type of equations that are used to describe them. These are partial differential equations whose associated unknown quantities are typically functions of space and time; the equations are typically second order in space and first order in time; they could be nonlinear, and an analytical solution is usually not available, due to their complicated nature.

One such family of equations, which we will focus on throughout this thesis, is known by the name convection-diffusion equations. A great deal of investigation and analysis has been performed on these equations in the last few decades. They encapsulate many mathematical difficulties, and are of major importance in several areas of application. In the preface to Morton's book [81] on the numerical solution of this type of equations, the first sentence reads: "Accurate modelling of the interaction between convective and diffusive processes is the most ubiquitous and challenging task in the numerical approximation of partial differential equations".
In this thesis the focus is on the particular problem of how to solve linear systems that arise from finite difference discretization of three-dimensional steady-state convection-diffusion equations. In particular, we analyze the technique of one step of cyclic reduction, which proves to be numerically stable and effective. The description of the method and its analysis are to come later. We first give a general description of the equations we deal with, and the linear systems that arise when the equations are discretized.

1.1 Background

1.1.1 The Convection-Diffusion Equation

We consider steady-state convection-diffusion equations of the following type:

    -∇·(D∇u) + ∇·(Eu) = F    on Ω ;                    (1.1a)
    νu + ψ ∂u/∂n = G         on ∂Ω .                   (1.1b)

Here u is an unknown quantity. We focus on the three-dimensional problem, in which case u, D, E, F and G are trivariate functions, and Ω ⊂ R³.

Consider for example the equations associated with fluid motion. Newton's second law of motion for fluids, according to which the rate of change of momentum of a fluid particle is equal to the net force acting on it, is represented by the Navier-Stokes equations (see, for example, [46]). For the particular class of equations for viscous, incompressible fluids, the equations are:

    ρ(u_t + u·∇u) = -∇p + μΔu + f ;                    (1.2a)
    ∇·u = 0 ,                                          (1.2b)

where:

• ρ is the density of the fluid.

• The unknown quantity u is the velocity field: a vector function, each of whose elements depends on the spatial variables x, y and z, and on the time variable t.

• p is the pressure: a scalar quantity considered to be unknown, which depends on the spatial variables and the time.

• μ is the coefficient of viscosity, which is a physical property of the material.

• f denotes the external force acting on the fluid, which is assumed to be known. Such a force could be, for example, gravity.

Eq. (1.2a) is called the momentum equation. Eq. (1.2b) is the continuity equation, and expresses the incompressibility of the flow, which means that pressure variations do not produce any significant density variations.

The most obvious difficulty associated with the Navier-Stokes equations is the nonlinearity. One can deal with it in several ways. For example, using Picard-like iteration [73] for the momentum equation leads to the linear Oseen equation [44], which for steady state is a linear equation of the form (1.1). Other common ways to handle the nonlinearity are Newton's method or quasi-Newton methods [86]. Typically, in a nonlinear iterative solver, a linear system of equations has to be solved at each iterate. It is therefore of much importance to derive efficient methods for numerically solving these linear systems.

Equations of the type (1.1) arise when considering, for example, temperature variation, or varying concentration of some substance (e.g. a chemical or radioactive pollutant) mixed with the fluid. Related examples are thermal processes in meteorology, salt concentration in oceans, and air and water pollution [109]. An important term here is "convection", which describes heat transfer in the case of temperature variation, or mass transfer when concentration variation is considered. The distribution of the concentration c of a substance in a fluid is determined by its advection by moving particles and by its diffusion between fluid particles. The equation is

    c_t + u·∇c = ∇·(D∇c) + f .                         (1.3)
Here u is the velocity and D is a coefficient of diffusivity. If these quantities do not depend on c, Eq. (1.3) is linear. This is a typical convection-diffusion model problem.

In order to "quantify" difficulties that arise in physical problems such as the ones described above, a dimensionless quantity that is typical of the problem is defined. In the problem describing concentration of a pollutant, it is the Peclet number, which is the velocity times the length scale of the domain, divided by the diffusivity coefficient. In the Navier-Stokes equations, the dimensionless Reynolds number is the product of velocity, density and length, divided by the viscosity of the fluid. The Reynolds number is denoted by Re; the Peclet number is denoted by Pe. These numbers vary significantly from one problem to another. As it turns out, as long as these numbers are small, the problem is relatively simple to solve and theory can provide an accurate description of the physical phenomenon. For example, according to [109], the flow past a cylinder can be quite accurately modelled for Reynolds numbers up to about 40; the flow is called laminar in this case. As the Reynolds number passes this limit, there is separation to what is called turbulent flow; instability develops, and it is much more difficult to predict or model the behavior of the flow. The physical problems scientists and engineers are interested in are obviously not limited to low Reynolds or Peclet numbers. When these numbers are high, the difficulties that arise are not limited to the physical sense; there are also difficulties in the numerical solution of the equations.

1.1.2 Finite Difference Methods

There are various approaches to the numerical solution of convection-diffusion equations. Among them we mention finite differences [80],[91],[101], finite elements [17],[66],[72], and finite volumes [70]. The approach used in this thesis is finite differences. Here the idea is to approximate the differential operator by a difference quotient which can be thought of as a discrete difference operator. For example, consider the ordinary differential equation

    Lu = f                                             (1.4)

on the interval (0,1), where L is a differential operator of the type L = a(x) d²/dx² + b(x) d/dx, and either u or u' is given at x = 0 and x = 1. First, the problem is discretized as follows: a grid consisting of n gridpoints is defined, each of which is a point where an approximate solution of u is to be computed. Suppose the grid is uniform, denote its size by h = 1/(n+1), and let u_i denote the numerical solution at x_i = ih, i = 1,...,n. If h is small enough and u is sufficiently smooth [101], the first few terms of the Taylor expansion can be used to derive an effective approximation; given i, u(x_{i+1}) can be expressed as:

    u(x_{i+1}) = u(x_i) + h u'(x_i) + (h²/2) u''(x_i) + (h³/6) u'''(x_i) + O(h⁴) .    (1.5)

Similarly, for u(x_{i-1}):

    u(x_{i-1}) = u(x_i) - h u'(x_i) + (h²/2) u''(x_i) - (h³/6) u'''(x_i) + O(h⁴) .    (1.6)

Now, simple combinations of Eq. (1.5) and (1.6) provide a few possible approximation formulas for u'(x_i) and u''(x_i), expressed in terms of u(x_{i-1}), u(x_i) and u(x_{i+1}). For the first derivative it is straightforward to verify that:

    u'(x_i) = [u(x_{i+1}) - u(x_i)] / h + O(h) ;
    u'(x_i) = [u(x_{i+1}) - u(x_{i-1})] / (2h) + O(h²) .

The following approximation schemes can therefore be defined:

1. First order accurate "forward scheme":

    (u')_i = (u_{i+1} - u_i) / h .                     (1.7)

2. First order accurate "backward scheme":

    (u')_i = (u_i - u_{i-1}) / h .                     (1.8)

3. Second order accurate "centered scheme":

    (u')_i = (u_{i+1} - u_{i-1}) / (2h) .              (1.9)
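The orders of accuracy quoted above are easy to confirm numerically. The following is a minimal Python/NumPy sketch (an illustration of ours; the helper name diff_orders and the choice of test function are our own): on u(x) = e^x, reducing h by a factor of 10 should reduce the forward and backward errors by about a factor of 10, and the centered error by about a factor of 100.

    import numpy as np

    def diff_orders(u, du, x0=0.4):
        """Compare forward, backward and centered differences at x0 as h
        is refined; the error ratios expose the order of accuracy."""
        for h in [1e-1, 1e-2, 1e-3]:
            fwd = (u(x0 + h) - u(x0)) / h           # O(h)
            bwd = (u(x0) - u(x0 - h)) / h           # O(h)
            ctr = (u(x0 + h) - u(x0 - h)) / (2*h)   # O(h^2)
            print(f"h={h:.0e}  fwd err={abs(fwd - du(x0)):.2e}  "
                  f"bwd err={abs(bwd - du(x0)):.2e}  ctr err={abs(ctr - du(x0)):.2e}")

    diff_orders(np.exp, np.exp)   # u = u' = exp, so the errors are easy to read off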
For the second derivative, one usually uses a second order accurate centered scheme, which can be obtained by summing Eq. (1.5) and Eq. (1.6):

    (u'')_i = (u_{i+1} - 2u_i + u_{i-1}) / h² .        (1.10)

What all the approximation schemes presented here have in common is that the derivatives of u at the point x_i are approximated by three values only, namely u_{i-1}, u_i and u_{i+1}. The difference operator is thus a three-point operator.

The generalization to more space dimensions is straightforward. In the two-dimensional case, if n is the number of unknowns in each horizontal or vertical line of the mesh and h = 1/(n+1) is the grid size, we denote by u_{i,j} the numerical approximations to u(ih,jh), 1 ≤ i,j ≤ n, and discretize the differential equation as follows [101]:

• The first order derivatives are approximated by either a second order accurate centered difference scheme, or by a first order accurate one-sided scheme. For example, for the derivative in the x-direction, in the case of centered differences we have

    ∂u/∂x (ih,jh) ≈ (u_{i+1,j} - u_{i-1,j}) / (2h) ;   (1.11)

a backward scheme would be

    ∂u/∂x (ih,jh) ≈ (u_{i,j} - u_{i-1,j}) / h ;        (1.12)

and a forward scheme is

    ∂u/∂x (ih,jh) ≈ (u_{i+1,j} - u_{i,j}) / h .        (1.13)

• The second derivatives are approximated using a second order accurate centered difference scheme. For example:

    ∂²u/∂x² (ih,jh) ≈ (u_{i+1,j} - 2u_{i,j} + u_{i-1,j}) / h² ,    (1.14)

and analogously for the y-derivative.

At this point the pattern (for any number of space dimensions) is clear: each gridpoint is associated with its two neighbors in each direction, and thus in the two-dimensional case five points are involved in the difference equation, and in the three-dimensional case seven points are involved. Indeed, for the 2D and 3D case, the operators are termed the five-point operator and the seven-point operator respectively [67]. The associated computational molecule is defined as the set of values that are associated with the discretization of a certain gridpoint. In the (general) variable coefficient case the values of the computational molecule are gridpoint-dependent. On the other hand, in the constant coefficient case the computational molecule has the same form for all gridpoints that are not next to the boundary.

Throughout the thesis, we shall use the term "the model problem" in reference to the following constant coefficient equation:

    -Δu + v^T ∇u = w ,                                 (1.15)

on Ω = (0,1)^s, subject to Dirichlet boundary conditions on ∂Ω. Here s is the number of space dimensions, and v^T is an s-dimensional vector of constants: v = (σ) if s = 1, v = (σ,τ) if s = 2, and v = (σ,τ,ρ) if s = 3. Let

    β = σh/2 ,   γ = τh/2 ,   δ = ρh/2 .               (1.16)

Clearly, in two dimensions only β and γ are defined, and in one dimension only β is defined. β, γ and δ are the mesh Reynolds numbers.¹ In order to examine how effective a numerical solution would be, both the magnitude of the PDE coefficients and the size of the grid are considered, and a mesh Reynolds number encapsulates both of them in one mathematical quantity.

¹ We note that we are following the definition used by Elman & Golub in [40],[41],[42]. In some texts, e.g. [46],[81], there is no division by 2 in (1.16). The words "cell" instead of "mesh", and "Peclet" rather than "Reynolds", are used interchangeably in the literature and are equivalent.

The discrete operators can be expressed in terms of the mesh Reynolds numbers. For the model problem, the computational molecules for interior gridpoints in one, two and three dimensions are graphically illustrated in Fig. 1.1. If centered differences are used for approximating the first order derivatives, the values of the computational molecules are as follows:

    a = 6 ;  b = -1-γ ;  c = -1-β ;  d = -1+β ;  e = -1+γ ;  f = -1-δ ;  g = -1+δ .    (1.17)

For backward approximations we have:

    a = 6 + 2(β+γ+δ) ;  b = -1-2γ ;  c = -1-2β ;  d = -1 ;  e = -1 ;  f = -1-2δ ;  g = -1 .    (1.18)

For forward approximations we have:

    a = 6 - 2(β+γ+δ) ;  b = -1 ;  c = -1 ;  d = -1+2β ;  e = -1+2γ ;  f = -1 ;  g = -1+2δ .    (1.19)

Figure 1.1: computational molecules associated with the three-point, five-point and seven-point operators for the constant coefficient problem.

The difference operators for the 1D, 2D and 3D problems, after scaling by h², are given respectively by

    F₁ u_i = a u_i + c u_{i-1} + d u_{i+1} ;                                           (1.20a)
    F₂ u_{i,j} = a u_{i,j} + b u_{i,j-1} + c u_{i-1,j} + d u_{i+1,j} + e u_{i,j+1} ;   (1.20b)
    F₃ u_{i,j,k} = a u_{i,j,k} + b u_{i,j-1,k} + c u_{i-1,j,k} + d u_{i+1,j,k} + e u_{i,j+1,k} + f u_{i,j,k-1} + g u_{i,j,k+1} .    (1.20c)
1.1.3 Solving the Linear System

Once the finite difference scheme has been determined, we have a large sparse linear system of equations for the unknowns. The sparsity pattern of the associated matrix depends on the ordering of the unknowns. For many commonly used orderings the matrix is narrow-banded. In more than one dimension the band is necessarily sparse, regardless of the ordering used.

Suppose the matrix is of size N × N, and the linear system is given by

    Ax = b .                                           (1.21)

The direct solution approach of Gaussian elimination [54] is equivalent to transforming Eq. (1.21) into a system Ux = y, where U is upper triangular, by solving a system Ly = b where L is a unit lower triangular matrix. A = LU is known as the LU decomposition of A, and storing L and U is advantageous when one needs to solve a system with the same matrix but with multiple right-hand-side vectors. Unfortunately, for sparse matrices this approach suffers the major drawback of fill-in in the factors L and U, which occurs during the factorization: the bandwidth of the matrix is preserved, but the sparsity inside the band is destroyed. The issue here is not only waste of storage; fill-in causes an unacceptable increase in computational work, and in fact, where finite difference discretization of partial differential equations is concerned, the amount of work increases with the number of space dimensions.

It is thus advisable to use iterative methods, which exploit the sparsity pattern of the matrix and typically use only a very small number of extra column vectors in addition to the storage required for the linear system. These are methods which generate a sequence of vectors {x^(k)} that converges to the solution x. The amount of computational work involved in solving the system is modest when efficient iterative techniques are used. The multigrid method [16],[115], for example, can solve the system with only O(N) floating point operations. A typical iterative method involves essentially only matrix-vector products (as opposed to direct methods). Throughout the thesis we shall consider stationary methods and Krylov subspace methods. Derivation and analysis of these methods can be found in many books. In particular, we mention the books of Varga [113] and Young [117] on stationary methods, the book of Greenbaum [61] on Krylov subspace solvers and preconditioners, and the books of Golub & Van Loan [55] and Saad [95], which present a variety of topics, including stationary and Krylov subspace methods. In Chap. 5 a short overview of Krylov subspace solvers will be given. Below we briefly describe stationary methods.

This is a family of fixed point schemes, which have been known and used for a long time. Consider the system Ax = b and denote by x^(k) the approximation to the solution x at the kth iterate. The idea is: given an initial guess x^(0), iterate as follows:

    M x^(k+1) = (M - A) x^(k) + b ,   k = 0,1,...      (1.22)

where M is a matrix associated in some way with the matrix A. Ideally, M should approximate A (or rather, M⁻¹ should be an effective approximation to A⁻¹), and at the same time solving a system involving M should be inexpensive - much cheaper than solving a system involving A. Clearly, these two requirements are somewhat contradictory, and here lies the main difficulty in picking an efficient scheme. Standard schemes are based on using a splitting [113] of the matrix A, which is the operation of writing it as

    A = M - N .                                        (1.23)

The matrix M⁻¹N is the iteration matrix. A necessary condition for convergence for any initial guess is ρ(M⁻¹N) < 1, where ρ denotes the spectral radius (the maximal absolute value among the eigenvalues). Denoting the diagonal part of A by D and its strict lower and upper triangular parts by E and F, respectively, classical schemes are point Jacobi, which corresponds to picking M = D, and point Gauss-Seidel, which corresponds to picking M = D + E. The point Successive Over-Relaxation scheme (SOR) is obtained by picking M = (1/ω)[D + ωE], where ω is a scalar between 0 and 2, and can be considered as a technique for accelerating the convergence of the Gauss-Seidel scheme. When ω is close to the optimal value (that is, the value that would give the smallest possible spectral radius of M⁻¹N), the convergence of the SOR scheme is substantially faster than the convergence of Jacobi and Gauss-Seidel. A great deal of convergence analysis for the SOR method (mainly for symmetric positive definite matrices) was done by Young in the 1950s and 1960s. In particular, Young defined a class of "consistently ordered" matrices, for which strong connections between the eigenvalues of the Jacobi, Gauss-Seidel and SOR iteration matrices exist. Some of these results will be discussed in Chapter 4. Another technique for accelerating the convergence of iterative schemes is the Chebyshev method (see Golub & Varga [59]).

Other splittings, which are more relevant to this work, are block splittings. The idea is to work with square submatrices of the system's matrix, rather than single matrix elements, as the basic building blocks for the splitting. For example, the block Jacobi scheme corresponds to taking M = D, where D now is no longer the diagonal part of A; rather, it is the block-diagonal part of A.
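A minimal sketch of iteration (1.22) for the three classical point splittings just described (for dense matrices; the helper name stationary_solve and the defaults are ours):

    import numpy as np

    def stationary_solve(A, b, which="jacobi", omega=1.5, tol=1e-8, maxit=500):
        """Iterate M x_{k+1} = (M - A) x_k + b for a classical choice of M."""
        D = np.diag(np.diag(A))
        E = np.tril(A, -1)                       # strict lower triangular part
        M = {"jacobi": D,
             "gauss-seidel": D + E,
             "sor": D / omega + E}[which]        # M = (1/w)[D + wE]
        x = np.zeros_like(b, dtype=float)
        for k in range(maxit):
            x_new = np.linalg.solve(M, (M - A) @ x + b)
            if np.linalg.norm(x_new - x) < tol * np.linalg.norm(b):
                return x_new, k + 1
            x = x_new
        return x, maxit

In practice one would of course exploit the triangular or (block) diagonal structure of M rather than call a general solver, but the sketch makes the role of the splitting explicit.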
General convergence results for stationary methods are given in Varga's book [113]. Varga presents many interesting and powerful results, and below we mention the ones which will be useful for us later. See [113] for proofs.

Definition 1.1. A matrix B = (b_ij) is reducible if there exists a permutation matrix P such that P^T B P is a 2 × 2 block upper triangular matrix.

If a matrix does not satisfy the conditions of Defn. 1.1, it is termed irreducible.

Definition 1.2. A matrix B is called positive (nonnegative) if all its elements are positive (nonnegative); this is denoted by B > 0 (B ≥ 0).

An important class of matrices is the following:

Definition 1.3. A matrix B = (b_ij) is an M-matrix if it is nonsingular, b_ij ≤ 0 for all i ≠ j, and B⁻¹ ≥ 0.

Obviously, one does not want to compute the inverse of a matrix in order to determine whether or not it is an M-matrix. A useful way of determining whether a matrix is an M-matrix is:

Theorem 1.1. [113, Cor. 1, p. 85] If a matrix B = (b_ij) is a real, irreducibly diagonally dominant n × n matrix with b_ij ≤ 0 for all i ≠ j and b_ii > 0 for all 1 ≤ i ≤ n, then B⁻¹ > 0.

As it turns out, the above-defined properties can in some circumstances guarantee convergence of stationary schemes:

Definition 1.4. A = M - N is a regular splitting of A if M is nonsingular with M⁻¹ ≥ 0 and N ≥ 0.

Theorem 1.2. [113, Thm. 3.13, p. 89] If A = M - N is a regular splitting of A and A⁻¹ ≥ 0, then the iterative scheme associated with this splitting converges for any initial guess.

When attempting to solve the linear systems that arise from discretizing PDEs, potential difficulties that might arise are:

1. If any of the mesh Reynolds numbers is nonzero, the matrix is nonsymmetric. The nonsymmetry complicates the analysis significantly. The numerical properties are difficult to assess, compared to the symmetric case, for which a great deal of analysis is available. The matrix is not necessarily diagonally dominant.

2. If the Reynolds numbers are large the matrix can be ill-conditioned, leading to reduced accuracy in the numerical solution and possible divergence of iterative methods.

A good example of these numerical difficulties is when the original (continuous) problem has ill-conditioning which the discretized linear system inherits. This occurs, for example, when considering the singularly perturbed exit problem [77] of Brownian motion of particles confined by a finite potential well. In this case the continuous eigenvalue problem has an eigenvalue which is exponentially small, and a standard finite difference method would result in a matrix which is close to singular. For discussion and examples, see Sun [103] and references therein.

It is important to distinguish between numerical difficulties that arise as a result of the discretization scheme itself, and ones that arise when coming to solve the underlying linear system. As an illustration of this, consider the following classical problem (see, for example, [46],[81],[97]):

    -ε u''(x) + σ u' = 0 ,   u(0) = 0 ,   u(1) = 1 .   (1.24)

This equation has the exact solution

    u(x) = (e^{σx/ε} - 1) / (e^{σ/ε} - 1) .            (1.25)

Here the mesh Reynolds number is β = σh/(2ε). Using centered difference schemes for discretizing u'' and u', the corresponding difference equation is given by

    (-1-β) u_{j-1} + 2 u_j + (-1+β) u_{j+1} = 0 .      (1.26)

In this simple case, the difference equation can be solved directly by seeking a solution of the form u_j = φ^j, and we get a quadratic equation whose solutions for φ are [46]:

    φ₊ = (1+β)/(1-β) ;   φ₋ = 1 .                      (1.27)

After incorporating the boundary conditions, we obtain:

    u_j = (φ₊^j - 1) / (φ₊^{n+1} - 1) .                (1.28)

When β > 1 the numerical solution is oscillatory (u_j changes sign in dependence on the parity of j), whereas the analytical solution of the differential equation is smoothly monotonically increasing. The larger β is, the more oscillatory and less accurate (as an approximation to the analytical solution) the numerical solution is. In Fig. 1.2(a) the oscillations are illustrated. In this graph, σ was chosen to be equal to 1, ε = 0.01, and a uniform 16-point grid was used. The solid line represents the analytical solution, and the broken line represents the numerical solution. Here the oscillations are not a result of numerical instability in the solution process; rather, they are a result of the finite difference scheme that is used. The natural remedy in this case would be to use a different numerical scheme. For example, if upwind differences are used, then the numerical solution is smooth - see Fig. 1.2(b). However, in this case the scheme is only first order accurate, and in particular, the numerical solution is not accurate in the vicinity of the boundary layer.

Figure 1.2: numerical solution of Eq. (1.24); (a) centered scheme, (b) backward scheme.
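The threshold β = 1 is easy to observe numerically. The following sketch of ours solves the centered difference equations (1.26) directly; our reading of the setting of Fig. 1.2(a) is σ = 1, ε = 0.01 and n = 15 interior points (so h = 1/16 and β ≈ 3.1), for which the computed values alternate in sign:

    import numpy as np

    def centered_solve(n, eps=0.01, sigma=1.0):
        """Centered-difference solution of -eps*u'' + sigma*u' = 0 with
        u(0) = 0, u(1) = 1, via the difference equation (1.26)."""
        h = 1.0 / (n + 1)
        beta = sigma * h / (2 * eps)             # mesh Reynolds number
        A = (2.0 * np.eye(n)
             + (-1 - beta) * np.eye(n, k=-1)
             + (-1 + beta) * np.eye(n, k=1))
        rhs = np.zeros(n)
        rhs[-1] = 1 - beta                       # boundary value u(1) = 1 moved to rhs
        return np.linalg.solve(A, rhs)

    u = centered_solve(15)    # beta = 3.125 > 1: the entries of u alternate in sign
    print(u[:6])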
1.2 One Step of Cyclic Reduction

At this point we present the technique that this thesis is concerned with: (one step of) cyclic reduction. The idea of the method is to take the linear system arising from a discretization scheme, decouple the unknowns, and then compute the solution by solving smaller systems. The smaller systems generally have better numerical properties compared to the original system (including a smaller condition number), and thus by performing this procedure the difficulties described in the previous section become easier to handle.

1.2.1 The One-Dimensional Case

In order to illustrate cyclic reduction and its advantages, we first consider a simple one-dimensional model problem:

    -u''(x) + σ u' = 0 ,                               (1.29)

subject to Dirichlet boundary conditions. The most common ordering of the unknowns is the natural lexicographic ordering. In one dimension this ordering simply means that for each i, i = 1,...,n, the ith unknown corresponds to the numerical solution approximating u(ih). Let β = σh/2 be the mesh Reynolds number. The matrix associated with the linear system, using centered difference discretization, is [40]:

    A = tridiag( -1-β , 2 , -1+β ) ,                   (1.30)

the n × n tridiagonal matrix with -1-β on the subdiagonal, 2 on the main diagonal, and -1+β on the superdiagonal.

Lexicographic ordering is only one alternative. Another common ordering strategy is the red/black ordering: referring to the grid as a checkerboard, each gridpoint is assigned a color, either red or black, and points of one of the colors, say red, are numbered first. In the one-dimensional case red/black ordering means that we take the indices used in the lexicographic ordering and re-order them, so that the odd-numbered indices are ordered first. This is illustrated in Fig. 1.3.

    R B R B R B R B
    1 5 2 6 3 7 4 8

Figure 1.3: red/black ordering of the one-dimensional grid.

The sparsity pattern of the matrix associated with the red/black ordering is depicted in Fig. 1.4. It is evident by looking at the matrix in this figure that the linear system has the form:

    [ B  C ] [ u^(r) ]   [ w^(r) ]
    [ D  E ] [ u^(b) ] = [ w^(b) ] ,                   (1.31)

where both B and E are diagonal. In Eq. (1.31) the superscripts (r) and (b) have been introduced to illustrate the dependence on the color of the gridpoint.

Figure 1.4: the matrix associated with the red/black ordering for the one-dimensional case.
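The block structure (1.31) is easy to expose programmatically: permuting the rows and columns of (1.30) so that the red (odd-numbered) points come first makes the diagonal blocks B and E visible. A small sketch of ours (helper name and test values are our own) that also verifies the claim that B and E are diagonal:

    import numpy as np

    def redblack_permute(A):
        """Reorder a lexicographically ordered 1D system red/black:
        odd-numbered (red) points first, then even-numbered (black) ones."""
        n = A.shape[0]
        p = np.r_[np.arange(0, n, 2), np.arange(1, n, 2)]   # red, then black
        return A[np.ix_(p, p)], p

    n, beta = 7, 0.3
    A = (2 * np.eye(n) + (-1 - beta) * np.eye(n, k=-1)
         + (-1 + beta) * np.eye(n, k=1))                    # Eq. (1.30)
    P, p = redblack_permute(A)
    nr = (n + 1) // 2
    B, E = P[:nr, :nr], P[nr:, nr:]
    print(np.count_nonzero(B - np.diag(np.diag(B))))        # 0: B is diagonal
    print(np.count_nonzero(E - np.diag(np.diag(E))))        # 0: E is diagonal

The block elimination described next operates on exactly this permuted form.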
(1.31) leads to a system whose second block-row is a (smaller) system for the black points, which is called the reduced system [40]: [E-DB-lC}u^ = w^-DB~lw^ . (1.32) Since B~x is diagonal and has entries whose typical values are not zero or close to zero (when an appropriate discretization scheme is used), the inversion is numerically stable. The new system in the ID case is tridiagonal, just like the original system, but is half the size. Once the solution for the black points is computed, the solution for the red points corresponds to solving a diagonal system. The procedure of moving from system (1.31) to system (1.32) amounts to performing one step of cyclic reduction. In order to illustrate the advantages of one step of cyclic reduction in the nonsymmetric case, let us refer back to Eq. (1.30). The region of numerical stability for centered difference discretization is < 1 [65]. For these values of (3 the matrix is irreducibly diagonally dominant [40]. Also, Elman & Golub showed in [40], that for these values of (3, there exists a real diagonal matrix Q such that Q~lAQ is symmetric positive definite. For solving a tridiagonal system, Gaussian elimination is clearly the preferred numerical technique, and from either one of the above two observations it follows that if \f3\ < 1 the algorithm is stable even without pivoting. If \(3\ > 1 the matrix A is not diagonally dominant, and pivoting needs to be used to ensure 16 Chapter 1. Introduction stabili ty. Alternat ively, consider applying the point-Jacobi iterative scheme; the Jacobi i teration mat r ix is tr idiagonal wi th zeros along its diagonal, and convergence results can be readily obtained by using the following (see, for example [40], or [95]): L e m m a 1.1. The eigenvalues of an n x n tridiagonal matrix with b, a and c along its subdi-agonal, main diagonal, and superdiagonal respectively, are given by \i = a + sign(c)2\/&ccos f , i = l , . . . , n . (1.33) B y L e m m a 1.1 the Jacobi scheme is convergent only if (3 < 2 [40]. Using the Jacobi scheme for solving a tr idiagonal system is generally not efficient, but the fact that the scheme converges for such a narrow range of mesh Reynolds numbers is a primary indication of the difficulties that arise (and in fact are magnified) when considering two-dimensional or three-dimensional problems (where, as opposed to I D problems, iterative methods are efficient). We now compare the above wi th the analogous properties of the cyclically reduced matr ix . T h e reduced mat r ix (for odd n), after scaling by 2, is [40]: / 2 + 2(32 - ( l - / ? ) 2 N -(l + f3)2 2 + 2(32 -(1-/3)2 - ( 1 + /3) 2 2 + 2f32 -(l-(3)2 S = V (1-34) - ( 1 + /?) 2 2 + 2/? 2 -(l-(3)2 -(1 + P)2 2 + 2P2 J It can now be observed that the reduced matr ix is diagonally dominant for all values of (3. In addit ion, it is symmetrizable by a real diagonal matr ix for all (3 and the symmetrized matr ix is positive definite [40]. Compar ing that to the original system, we see that the restriction \/3\ < 1 17 Chapter 1. Introduction has disappeared. (We note, however, that the symmetrizat ion operation itself, for both the reduced and unreduced matr ix , may be numerically unstable [40]). A s far as the point-Jacobi scheme is concerned, it has been shown in [40] that it is convergent for /3 ^ 1. In comparison to the condit ion of convergence for the unreduced solver (/? < 2), this is a substantial improvement. 
A d d i t i o n a l observations regarding the eigenvalues and the spectral condit ion numbers of the unreduced and the reduced matrices are presented below: Proposit ion 1.1. Suppose |/3| < 1, n is odd, and h is sufficiently small. Denote the eigenvalues of the unreduced and the unsealed reduced matrices by A ^ ' and A ^ ' respectively. Then min \f] < min A ^ < max A ^ ' < m a x A ^ . (1.35) Proof. B y L e m m a 1.1, the eigenvalues of the unreduced matr ix are: \f] = 2 - 2 • v 7 ! - / ? 2 • cos(7rjh) , j = 1,. . . , n . (1.36) The eigenvalues of the unsealed reduced matr ix (namely \S) are A f ) = l + / 3 2 - ( l - / ? 2 ) - c o s ( 7 r ^ ) , j = l , . . . , i ( n - l ) , (1.37) where h = . Since m a x A f = 2 + 2 • yfl - (32 • COS(TT/*) and m a x \ \ R ) = 1 + (32 + (1 -^2—hi i j P2) -cos(7r^), and since 1 + fi2 < 2, 1-/32 < 2^/l - fi2 and cos(-jrh) < cos(7r/i), it readily follows that max A ^ ' < m a x A ^ 7 ' . A s for the smallest eigenvalues, min A ; ' 7 ' = 2 — 2 • \ / l — /32 -cos(7r/i) 3 3 3 and min \ { R ) = 1 + (32 - (1 - (32) • cos(nh). Denote x = y/l -j32. Then min \ { R ) - min \f) = 3 3 3 2x • (1 — x) + o(h); this expression is nonnegative for h sufficiently small . • F i g . 1.5 illustrates P rop . 1.1 for a particular example: n = 65 and (5 = 0.3. Here the smallest and largest eigenvalues are 0.1841 and 1.9959 for the reduced matr ix , and 0.0943 and 3.9057 for the unreduced matr ix . The condition numbers are 72.58 and 262.27 - a ratio of about 3.6 in favor of the reduced matr ix . Numerical experiments that we have performed, indicate 18 Chapter 1. Introduction that for \(3\ < 1 the reduced matr ix is better conditioned than the unreduced matr ix by a factor of 2 to 4. In fact, for the symmetric case (3 = 0 it can be shown that: Proposit ion 1.2. Fork sufficiently small, K2{A) ^4K2(S). In other words, the reduced matrix is better conditioned, and the improvement is by a factor of approximately 4-Proof The two matrices in this case are identical (as far as the values on each diagonal are concerned), except the reduced matr ix is about half the size of the unreduced matr ix . B y the eigenvalues specified in Prop . 1.1 it follows that Ko(A) = ]+cos\nlfl and no(S) = 1 + c o s ( 7 r ^ ) Using ° V ' • l-COS(7T/l) ZV I l-COs(7r/l) b Taylor expansions, for h sufficiently small K2(A) = ^2 + °{h2) a n d K2{S) = + o(h2), and since h ~ 2/i , the result readily follows. • Figure 1.5: eigenvalues of the reduced matr ix (solid line) and the unreduced mat r ix (broken line) for the one-dimensional model problem wi th n = 65 and (3 = 0.3. We have seen, then, by examining a one-dimensional model problem, that one step of cyclic reduction improves the numerical properties of the linear system. In addi t ion, since it involves solving a tr idiagonal system for only half of the unknowns and a diagonal system for the rest of the unknowns, the cyclic reduction step gives rise to a faster solution procedure. In finite ar i thmetic and in situations of very large linear systems, the improvement can be very significant. 19 Chapter 1. Introduction 1.2.2 More Space Dimensions In more than one space dimension the equations are considerably more complicated and the difficulties directly affect the level of difficulty in numerically solving the underlying discretized system of equations. 
From the linear algebra point of view, a major difference between the one-dimensional problem and higher dimensions, is that for the latter there is no way to turn the matrix into one with a dense band, regardless of the ordering strategy used. The natural lexicographic ordering for 2D and 3D is illustrated in Fig. 1.6; the corresponding sparsity patterns are depicted in Fig. 1.7. The red/black ordering in two dimensions is depicted in Fig. 1.8. Another major difference is that more than one convective term is involved; thus, we can now face a situation of small and large Reynolds numbers simultaneously. When a discretization gives rise to a matrix which is not diagonally dominant, there might be a need for block methods which are performed in accordance with how weakly or strongly coupled the unknowns are in each of the directions [6]. Finite difference schemes and orderings that take into account the direction of the flow or attempt to minimize the truncation error might perform better than standard schemes [92],[98],[39]. In [92], Roe & Sidilkover derive multidimensional equivalents of the upwinding scheme in a direct fashion (that is, not by generalizing the ID scheme in a dimension-by-dimension fashion). In particular, they examine linear schemes and determine the range of values of coefficients for which the truncation error is minimized over all positive schemes. The formulas which were derived depend on the geometry of the characteristics in the time-dependent problem. As is shown in [92], the cross-stream diffusion is smaller for the suggested scheme, compared to the dimensional upwind scheme, and these schemes have typically narrow stencils. See [99] for an overview of several narrow schemes with small truncation error for convection-dominated equations. The procedure of one step of cyclic reduction can be performed for any two-dimensional and three-dimensional problem originally discretized by a finite difference scheme. When the five-point and seven-point operators are used, the difference equation for a red point depends 20 Chapter 1. Introduction only on the point itself and its neighboring black points; similarly for the difference equations corresponding to the black points, and therefore the matrix B is diagonal, and the cost of performing one step of cyclic reduction is low. However, the matrix B in Eq. (1.31) is not always diagonal. In higher dimensions cyclic reduction causes some loss of sparsity (as opposed to the ID case). Another important difference from the ID case is that the elimination must be ac-companied by re-ordering of the unknowns (equivalently, permutation of the matrix). The re-ordering is essential because orderings which are effective for the standard tensor-product grid are not necessarily effective for the reduced grid, which has "holes" that correspond to the now-eliminated red points. 13 14 15 16 9 10 11 12 5 6 7 8 1 2 3 4 (a) 2D (b) 3D Figure 1.6: natural lexicographic ordering of the tensor-product grids. In the early 1990s, Elman & Golub conducted analysis and an extensive set of numerical experiments, examining the properties of one-dimensional and two-dimensional non-self-adjoint problems to which one step of cyclic reduction is applied. Their findings were published in a series of papers [40],[41],[42]. 
They showed that in the two-dimensional case the cyclically reduced operator is a 9-point operator (different from the classical compact 9-point operator), and a matrix-vector product with the reduced matrix is slightly less costly than a matrix-vector product with the unreduced, original matrix. One-line and two-line orderings were considered, and bounds on convergence rates were derived for both orderings. Symmetrization results were 21 Chapter 1. Introduction (a) 2D fn = (b) 3D (n = 4) Figure 1.7: sparsity patterns of the matrices. 15 7 16 8 5 13 6 14 11 3 12 4 1 9 2 10 \ V \ \ v . \'t, *•. *. tm : W: ':':. *•. W - . ' . 1. •. 1. \>!.\ W \ \ •! \ \ t : \ \ \ 0 10 20 30 «0 50 (a) ordering (b) matrix Figure 1.8: red/black ordering in the two-dimensional case. given, which showed that the reduced matrix can be symmetrized by a real diagonal similarity transformation for a larger region of mesh Reynolds numbers, compared to the unreduced matrix: both matrices can be symmetrized for any mesh Reynolds number if upwind differences are used, or if centered differences are used and the mesh Reynolds numbers are smaller than 1 in magnitude. However, only reduced matrix can be symmetrized if both mesh Reynolds numbers are larger than 1 in magnitude and centered differences are used. The symmetrization was used to derive tight bounds on convergence rates of block stationary methods [40],[41]. Earlier observations of Thompson, Ferziger & Golub indicate convergence of the reduced SOR 22 Chapter 1. Introduction solver for any mesh Reynolds number, and convergence which is faster than the convergence of unreduced solvers. Elman & Golub showed analytically as well as experimentally, that reduced solvers converge faster. In the constant coefficient case it was shown in [40] that the reduced matrix is a diagonally dominant M-matrix for any value of the P D E coefficients when upwind differencing is used, or for P D E coefficients in the diffusion-dominated region when centered differencing is used. The variable coefficient case was also addressed [42], and an extensive set of numerical experiments demonstrating performance and the connection between orderings and the direction of the flow, was given. The work of Elman & Golub is very relevant to the current work. Even though the matrices derived and analyzed in this work are very different from the ones in [40],[41],[42], some of the same techniques are used. Throughout the thesis we will point out similarities and differences between the 3D case discussed here and the findings of Elman & Golub for the two-dimensional case. The step of cyclic reduction could be repeated several times. The complete cyclic reduction approach involves repeating the decoupling until the reduced systems are small enough that they can be solved directly. This procedure was originally presented by Hockney [71], based on Golub's ideas. Buneman [18] pointed out that computing the right-hand-side vector in finite precision arithmetic could result in severe round-off errors, and suggested a numerically stable algorithm (mathematically equivalent to Hockney's algorithm). Buzbee, Golub and Nielson [20] presented Buneman's algorithm and a variation of it, and provided careful analysis of the stability issue. 1.3 Thesis Outline and Notation This work is the first to derive and analyze one step of cyclic reduction in conjunction with the important class of three-dimensional non-self-adjoint elliptic problems. 
In the succeeding chapters, we demonstrate the superiority of the reduced linear system of equations, in particular the convergence rate and amount of computational work involved. 23 Chapter 1. Introduction In Chapter 2 we describe cyclic reduction and derive the cyclically reduced operator for the three-dimensional problem. In Chapter 3 we present block ordering strategies which can be applied to any three-dimensional problem, and demonstrate how they are applied to the reduced grid. Full details on the structures of the matrices associated with the various orderings are given. In Chapter 4 a comprehensive convergence analysis is performed for block stationary methods, and it is shown that the new system of equations is in general better conditioned than the original. Symmetrization conditions and tight bounds on convergence rates are derived. In Chapter 5 we discuss ways of solving the reduced system using Krylov subspace methods in conjunction with several preconditioning techniques. We give details on implementation, and present numerical experiments which include comparisons with the unreduced system. Finally, in Chapter 6 we draw conclusions and suggest future directions of investigation. For notational convenience, the following rules are used throughout the thesis: • For narrow banded matrices the terms 'diag', ' tri ' , 'penta' etc. are used in the same manner as in the Matlab command 'spdiags', that is: if x, y and z are vectors of size n, then tri[a:,y, z] will denote a tridiagonal matrix whose main diagonal consists of entries of the vector y, the subdiagonal consists of x\ to £ n _ i , and the superdiagonal consists of Z2 to zn. xn and z\ do not appear in the matrix in this case, x, y and z could be also matrices, in which case the same rules (as for vectors) apply. • The index of a given diagonal in a matrix is: 0 for the main diagonal, positive numbers for superdiagonals and negative numbers for subdiagonals. • In stands for the identity matrix of order n. • p(T) stands for the spectral radius of the matrix T. • If A and B are matrices, then B > A or B > A refer to inequalities elementwise. • -E's1,...,sk denotes a vector whose entries are s i , . . . , S k , repeated. For example EQ\ = (0,1, 0,1, 0 ,1 , . . . ) , -Eiooi = (1) 0) 0) 1) 1) 0,0,1, . . . ) and so on. The size of the vector will be clear from the context where it appears. 24 Chapter 2 C y c l i c Reduc t ion In the Introduction the term "one step of reduction" was explained and its advantages in the one-dimensional and two-dimensional cases were briefly described. As was mentioned, several steps of cyclic reduction can be performed, until the system is small enough to be solved directly [71],[18],[20]. In Sec. 2.1 we present this procedure, namely complete cyclic reduction. The purposes of presenting it are: first, to clarify how cyclic reduction can be carried out for general block tridiagonal systems, and how the step can be iterated. Secondly, we mention the issues that arise when seeking to compute the solution accurately. It was shown in [20] that complete cyclic reduction can be performed in a stable fashion if the matrix associated with the linear system arises from Poisson's equation, i.e. it is symmetric positive definite. However, in the nonsymmetric case, where typically the matrix is not diag-onally dominant, and does not necessarily have real eigenvalues, the stability of the complete cyclic reduction algorithm cannot be guaranteed. 
Therefore, in non-self-adjoint problems, the focus is on applying a single step of cyclic reduction. In Section 2.2 we derive the cyclically reduced operator for the three-dimensional case, and in Section 2.3 we present some of its properties. 25 Chapter 2. Cyclic Reduction 2.1 Complete Cyclic Reduction Consider the linear system Ax = b , where A is an n 2 X n 2 block tridiagonal matrix of the form: A = B C C B C 0 C B C \ V C B C 0 C B (2.1) and B and C are nxn matrices. 1 Suppose also that n = 2k — 1, where k is some positive integer (see Sweet [105], [106] for an explanation on how to perform cyclic reduction if n is arbitrary). In the following it is assumed that when the index exceeds the dimensions of the matrix the corresponding value is considered as being identically zero. For any even 2 < j < n — 1, we write three consecutive b l o c k equations of this system, as follows: C X J _ 2 + BXJ-I + CXJ = b j - i C X J _ I + BXJ + CXJ+I = bj C x j + BXJ+I + CXJ+2 = b j + 1 If we now multiply the first and third block equations by C , the second block equation by —B, and add up the three block equations, we get: C2Xj-2 + (2C 2 - B2)XJ + C 2 x j + 2 = C6j_i + C V i - B b j . (2.2) We thus obtain a new system of equations, for the vector (x2, ..., z n _ i ) T , where the corre-In general there is no need to assume equality between the number of blocks and the size of each of the blocks. It is done here merely for the purpose of presentation. 26 Chapter 2. Cyclic Reduction sponding matrix is ( 2 C 2 - B 2 C 2 0 C 2 2C2 - B 2 C 2 C 2 2 C 2 - B 2 C 2 V and the right-hand-side is C 2 2 C 2 - B 2 C 2 0 C 2 2C2 - B 2 (2.3) ( C b x + C b 3 - B b 2 ^ C b 3 + C b 5 - B b 4 \ C6„_ 2 + C 6 n - B 6 n _ i J (2.4) We can proceed recursively, as follows: define B ^ = B , C ^ = C , 6 = b j , and for m > 0, B ( m + i ) = 2 (C< m ) ) 2 - (5<m)) 2; C < m + 1 ) = ( C ( m ) ) 2 , 6 5 m + 1 ) = C ( m ) ( 6 ^ m + b^\m) - £ ( m ) & J m ) . We now have a system of equations of the form A ^ x ^ = c ^ m \ where / i?(TO) C ^ ^ C ( m ) #(m) c(m) 0 C*m) B ' m ' c(m) V (2.5) c(m) JB(m) c ( m ' 0 C < m ' 5 ( m ' J and the elements of c ( m ' are given by c^"1' = b ^ l . Once this system is solved, it is straight-forward to compute the solution for the unknowns that were previously eliminated throughout the process. 27 Chapter 2. Cyclic Reduction Note that even if B and C are sparse, B ^ are C ' m ' are not necessarily sparse. For solving the reduced system, Fourier type algorithms based on diagonalizing the matrix (by the matrix whose columns are the eigenvectors) could be used. This option is very attractive when B and C commute (BC = CB), as in this case these two matrices can be diagonalized simultaneously. Since B ^ and are polynomials in B and C , it follows that their eigenvalues can be easily computed, provided that the eigenvalues of B and C are known. Indeed, for Poisson's equation the spectrum of the matrix is known, and stability analysis is available [20]. The problematic term in the computation of the right-hand-side is B ^ b ^ . There is more than one way to employ a recurrence relation for computing it. For example [20]: i>o = - 2 6 j m ) ; u 1 = B b ^ - v k = - C 2 v k - 2 . Using this formula, we get ^("OfcH = u 2 m . Suppose Xj and LOJ are the eigenvalues of B and C respectively. 
It is shown in [20] that if $|\lambda_j| > 2|\omega_j|$, then the computation of $B^{(m)} b_j^{(m)}$ by such a recurrence involves terms that grow like $\cosh(j z_j)$, where $z_j = \cosh^{-1}\bigl(|\lambda_j| / (2|\omega_j|)\bigr)$. Thus there can be a significant difference in magnitude between $v_{2^m} = B^{(m)} b_j^{(m)}$ and $b_j^{(m)}$, and the latter could be lost in rounding errors if $n$ is large. (This situation occurs frequently; for example, it occurs in the finite difference discretization of Poisson's equation.)

Buneman's algorithm [18] for overcoming the difficulty involves "backward"-type computations. For Poisson's equation, where $C$ is the identity matrix, the technique uses the equality $B = (2I_n - B^{(1)})B^{-1}$, and in general $(B^{(m)})^2 = 2I_n - B^{(m+1)}$. Substituting this for $j = 1, \dots, m$, after some algebraic manipulation, a two-step recurrence relation is obtained, in which the term $(B^{(m-1)} B^{(m-2)} \cdots B^{(0)})^{-1} B^{(m)}$ replaces the problematic term in the original algorithm. It is then shown that the 2-norm of the matrix $(B^{(m-1)} B^{(m-2)} \cdots B^{(0)})^{-1}$ is bounded from above by $e^{-c\theta_1}$, where $c$ is some constant and $\theta_1 = \cosh^{-1}(-\lambda_1/2)$. For Poisson's equation $\theta_1 > 1$, and thus the upper bound is small enough to ensure stability. The full proof of stability of this procedure can be found in [20, pp. 648-655].

Throughout the years, many extensions to the original papers and algorithms [71],[18],[20] have been presented. Buzbee, Dorr, George & Golub [19] use cyclic reduction for solving Poisson's equation on irregular regions; Concus & Golub [25] discuss two-dimensional nonseparable cases; Bauer & Reiss [12] generalize the procedure to block pentadiagonal systems arising from the discrete biharmonic operator. Heller [68] compares the iteration count of block cyclic reduction to that of block Gaussian elimination [111], shows that when the matrix is diagonally dominant the norms of the off-diagonal blocks decrease quadratically relative to the diagonal blocks at each step of reduction, and, based on this observation, suggests criteria for early termination of the process, when the matrix becomes essentially block diagonal. Detyna [33] suggests an algorithm based on introducing two sets of equations, each with a different stencil, whose solutions are identical, and using the deliberately created large number of degrees of freedom to eliminate half of the equations in each of these two sets, in a manner that preserves the structure of the computational molecules of each of the sets. This step can be repeated, and leads to a procedure with $O(N^2)$ operations, thus a faster algorithm than the classical algorithm [71],[18],[20] described in detail above; on the other hand, the algorithm gives rise to larger errors in the numerical solution. The error can be reduced by applying the suggested algorithm on the finer grids, and proceeding at the higher grids in the hierarchy with the more stable cyclic reduction algorithm (or any other accurate fast solver). Bondeli & Gander discuss the application of cyclic reduction to special tridiagonal systems [15]; Swarztrauber & Sweet [104], Gallopoulos & Saad [51], and Amodio & Mastronardi [2] discuss aspects of parallelization; Amodio & Paprzycki [3] present a parallel algorithm for solving boundary value problems using cyclic reduction, and apply it on a distributed memory machine. A description of the cyclic reduction idea and a list of references can be found in Golub & Van Loan [55].
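To make the recursion of Section 2.1 concrete, the following sketch performs complete block cyclic reduction and back-substitution for the constant-block system (2.1). It is an illustration only (our own Python/NumPy transcription, not the stabilized algorithms of [18],[20]): it uses dense blocks, assumes $n = 2^k - 1$, and implements the recursion exactly as written above, without the Buneman-type stabilization just discussed.

```python
import numpy as np

def cyclic_reduction(B, C, b):
    """Solve the block tridiagonal system tri[C, B, C] x = b by complete
    cyclic reduction; b is a list of n right-hand-side blocks, n = 2^k - 1.
    Dense blocks and no stabilization: for illustration only."""
    n = len(b)
    k = round(np.log2(n + 1))
    assert n == 2**k - 1
    b = [bi.astype(float).copy() for bi in b]
    Bs, Cs = [B.astype(float)], [C.astype(float)]
    # Forward reduction: after step m, only indices divisible by 2^(m+1) remain.
    for m in range(k - 1):
        Bm, Cm, h = Bs[m], Cs[m], 2**m
        Bs.append(2.0 * Cm @ Cm - Bm @ Bm)   # B^(m+1) = 2 (C^(m))^2 - (B^(m))^2
        Cs.append(Cm @ Cm)                   # C^(m+1) = (C^(m))^2
        for j in range(2 * h, n + 1, 2 * h):
            right = b[j + h - 1] if j + h <= n else np.zeros_like(b[j - 1])
            b[j - 1] = Cm @ (b[j - h - 1] + right) - Bm @ b[j - 1]
    # Back-substitution: the odd multiples of 2^m at level m use the already
    # computed neighbors from the higher levels.
    x = [None] * n
    for m in range(k - 1, -1, -1):
        h = 2**m
        for j in range(h, n + 1, 2 * h):
            rhs = b[j - 1].copy()
            if j - h >= 1:
                rhs -= Cs[m] @ x[j - h - 1]
            if j + h <= n:
                rhs -= Cs[m] @ x[j + h - 1]
            x[j - 1] = np.linalg.solve(Bs[m], rhs)
    return x
```

The fill-in of $B^{(m)}$ and $C^{(m)}$ is visible already after one or two steps even for sparse $B$ and $C$; this is one of the reasons why, for the problems considered in this thesis, only a single reduction step is attractive.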
2.2 The Three-Dimensional Cyclically Reduced Operator

Even though the complete cyclic reduction algorithm can be carried out even when $B$ and $C$ do not commute [20], for nonsymmetric matrices the spectrum is frequently not known and numerical stability cannot be ensured in general. For example, Heller's observations in [68] are not applicable, as no diagonal dominance is guaranteed. In three dimensions the issues of fill-in and stability are magnified, and could be very problematic. Our purpose, then, is to focus on a single step of cyclic reduction and perform a comprehensive analysis. As a first step we derive the cyclically reduced operator.

The derivation is done by performing the block Gaussian elimination step described in the Introduction [see Eq. (1.31) and (1.32)]. Once the points corresponding to one of the colors (say red) are eliminated, the resulting reduced grid has a somewhat irregular structure: there are "holes" in the grid, which correspond to the previously present red points. A difficulty which is specific to three dimensions is that the parity of the planes has to be taken into consideration: points of the same color are not located in the same spots in even-indexed planes and odd-indexed planes. This can be illustrated by looking at the three-dimensional checkerboard depicted in Fig. 2.1.

Figure 2.1: a three-dimensional checkerboard.

Our starting point is the original, unreduced problem in three dimensions. The red/black ordering is depicted in Fig. 2.2; the points involved in the block elimination are numbered in Fig. 2.3. Here the point for which the discretization is done is in the center, indexed by #13. If this point is black, then points #4, #9, #12, #14, #17 and #22 are red and are to be eliminated. The other points in the figure are black, but their corresponding entries in the matrix at row #13 are to be changed after eliminating the red points.

Figure 2.2: red/black ordering, and the corresponding sparsity pattern of the associated matrix, for the three-dimensional problem.

The construction of the reduced computational molecule is done by eliminating half of the rows and the columns of the matrix that corresponds to red/black ordering. The constant coefficient case is easier to derive, as the computational molecule is independent of the coordinates of the gridpoints. We thus illustrate how to construct the cyclically reduced operator for this case first.

2.2.1 The Constant Coefficient Model Problem

Consider the model problem (1.15). In terms of the matrix elements, the entries which are affected by the block elimination step for the model problem are the following (for each row, each stencil value is followed, in parentheses, by the column in which it appears, using the point numbering of Fig. 2.3):

row 4:  a (4),  g (13), f (1), e (2),  c (3),  d (5),  b (6)
row 9:  a (9),  b (13), f (2), e (7),  c (8),  d (10), g (20)
row 12: a (12), d (13), f (3), e (8),  c (11), b (16), g (21)
row 14: a (14), c (13), f (5), e (10), d (15), b (18), g (23)
row 17: a (17), e (13), f (6), c (16), d (18), b (19), g (24)
row 22: a (22), f (13), e (20), c (21), d (23), b (24), g (25)
row 13: a (13), f (4), e (9), c (12), d (14), b (17), g (22)
Figure 2.3: points that are affected by block Gaussian elimination for point #13 (center).

Here are the contributions added to row #13 during the block elimination (for each eliminated red point, each product is followed, in parentheses, by the column it affects):

eliminate 4:  $-fa^{-1}g$ (13), $-fa^{-1}f$ (1), $-fa^{-1}e$ (2), $-fa^{-1}c$ (3), $-fa^{-1}d$ (5), $-fa^{-1}b$ (6)
eliminate 9:  $-ea^{-1}b$ (13), $-ea^{-1}f$ (2), $-ea^{-1}e$ (7), $-ea^{-1}c$ (8), $-ea^{-1}d$ (10), $-ea^{-1}g$ (20)
eliminate 12: $-ca^{-1}d$ (13), $-ca^{-1}f$ (3), $-ca^{-1}e$ (8), $-ca^{-1}c$ (11), $-ca^{-1}b$ (16), $-ca^{-1}g$ (21)
eliminate 14: $-da^{-1}c$ (13), $-da^{-1}f$ (5), $-da^{-1}e$ (10), $-da^{-1}d$ (15), $-da^{-1}b$ (18), $-da^{-1}g$ (23)
eliminate 17: $-ba^{-1}e$ (13), $-ba^{-1}f$ (6), $-ba^{-1}c$ (16), $-ba^{-1}d$ (18), $-ba^{-1}b$ (19), $-ba^{-1}g$ (24)
eliminate 22: $-ga^{-1}f$ (13), $-ga^{-1}e$ (20), $-ga^{-1}c$ (21), $-ga^{-1}d$ (23), $-ga^{-1}b$ (24), $-ga^{-1}g$ (25)

From this, we can see that the typical value on the diagonal of the reduced matrix is

$$a^{-1}\,(a^2 - 2be - 2cd - 2fg). \qquad (2.6)$$

By 'typical' we mean an entry that is associated with an interior point. For non-interior points fewer elimination operations are required, and the value of their associated diagonal entry changes with respect to (2.6) in the following manner:

• add $a^{-1}cd$ in case the x-coordinate of the associated point $(ih, jh, kh)$ is $i = 1$ or $i = n$;
• add $a^{-1}be$ in case the y-coordinate of the associated point $(ih, jh, kh)$ is $j = 1$ or $j = n$;
• add $a^{-1}fg$ in case the z-coordinate of the associated point $(ih, jh, kh)$ is $k = 1$ or $k = n$.

Let $R$ be the reduced operator for the model problem, after scaling by $ah^2$. Then for an interior gridpoint $(ih, jh, kh)$, the difference operator is given by

$$R\,u_{i,j,k} = (a^2 - 2be - 2cd - 2fg)\,u_{i,j,k} - f^2 u_{i,j,k-2} - 2ef\,u_{i,j+1,k-1} - 2cf\,u_{i-1,j,k-1} - 2df\,u_{i+1,j,k-1} - 2bf\,u_{i,j-1,k-1} - e^2 u_{i,j+2,k} - 2de\,u_{i+1,j+1,k} - c^2 u_{i-2,j,k} - d^2 u_{i+2,j,k} - 2bc\,u_{i-1,j-1,k} - b^2 u_{i,j-2,k} - 2eg\,u_{i,j+1,k+1} - 2cg\,u_{i-1,j,k+1} - 2ce\,u_{i-1,j+1,k} - 2bd\,u_{i+1,j-1,k} - 2dg\,u_{i+1,j,k+1} - 2bg\,u_{i,j-1,k+1} - g^2 u_{i,j,k+2}. \qquad (2.7)$$

Figure 2.4: structure of the computational molecule associated with the reduced operator. The gray squares correspond to the gridpoints which have been eliminated throughout the elimination process.
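For concreteness, the action of the reduced operator (2.7) is easy to express in code. The following Python/NumPy sketch is ours; the array layout and the homogeneous Dirichlet treatment (values outside the grid taken as zero) are assumptions made purely for illustration. The grid function is stored on the full $n \times n \times n$ grid; since all offsets in (2.7) map black points to black points, only the entries at black points are meaningful.

```python
import numpy as np

def apply_reduced_operator(u, a, b, c, d, e, f, g):
    """Apply the 19-point cyclically reduced operator (2.7) to u,
    an n x n x n array; points outside the grid are treated as zero."""
    n = u.shape[0]
    v = (a**2 - 2*b*e - 2*c*d - 2*f*g) * u        # center, point 13
    stencil = [  # (coefficient, offset (di, dj, dk)), as in Eq. (2.7)
        (-f*f, (0, 0, -2)), (-2*e*f, (0, 1, -1)), (-2*c*f, (-1, 0, -1)),
        (-2*d*f, (1, 0, -1)), (-2*b*f, (0, -1, -1)), (-e*e, (0, 2, 0)),
        (-2*c*e, (-1, 1, 0)), (-2*d*e, (1, 1, 0)), (-c*c, (-2, 0, 0)),
        (-d*d, (2, 0, 0)), (-2*b*c, (-1, -1, 0)), (-2*b*d, (1, -1, 0)),
        (-b*b, (0, -2, 0)), (-2*e*g, (0, 1, 1)), (-2*c*g, (-1, 0, 1)),
        (-2*d*g, (1, 0, 1)), (-2*b*g, (0, -1, 1)), (-g*g, (0, 0, 2)),
    ]
    for coef, off in stencil:
        shifted = np.zeros_like(u)
        dst = tuple(slice(max(0, -o), n - max(0, o)) for o in off)
        src = tuple(slice(max(0, o), n - max(0, -o)) for o in off)
        shifted[dst] = u[src]   # shifted[i,j,k] = u[i+di, j+dj, k+dk]
        v += coef * shifted
    return v
```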
Finally, notice that the main block diagonal has "holes" within itself, which get "larger" when the size of the matrix is larger. This is because no re-ordering which better fits the structure of the reduced grid has been performed here. This aspect is to be addressed in Chapter 3. In the discussion that follows, we refer to centered schemes. Similar observations can be made for upwind schemes. Denote the continuous operator corresponding to the model problem by L = - A + ( c r , r , M ) T V . (2.8) Expanding (2.7) in a multivariate Taylor expansion about the gridpoint (ih,jh,kh) yields: 1 1 1 1 1 2RU = LU - Uyyyy ~ ~ h} UyyZZ ~ ~ h} T fj, Uy Z + ~ h} \X UyyZ + ~ h} OUXZZ 1 1 1 1 1 1 + -h2Tuxxy + - h2auxyy + — h2 Guxxx - h2 a fiuxz - - h2aruxy + .- h2 [inzzz 1 - 2 2 1~L2 1 ,2 j l"i.2 1"L2 l"i.2 Y2 • ^  ~^ 3 "^ ^yyy Y2 ^ ^yy 6 ^ ^ ^ z z 6 ^xxyy g ^xxzz l i l l + - h 2 p, uxxz - —h2a2 uxx - - h2 uxxxx - - h2 uzzzz + 0(h3) . (2.9) o 12 o o The above computation was carried out using Maple V . The right-hand-side in Eq. (2.9) can be referred to as an introduction of a new continuous operator, based on a combination 35 Chapter 2. Cyclic Reduction of the original equation, and 0(h2) terms associated wi th second or higher derivatives of the solution. Th is continuous operator can be discretized directly, using the points that belong to the computat ional molecule of the cyclically reduced operator. For example, Uj+2,j,k — 2ui,j,k + Uj-2,j,k u x x ^ h 2 and similar ly for uyy and uzz; and for the first derivatives ^ ui+l,j+l,k + i^+lj-l.fc ^  Ui-l,j+l,k — Ui-l,j-l,k Ah and similar ly for uy and uz. The result is a discretization scheme wi th an 0(h2) discretization error. Th is result is similar to the result in [57] for the two-dimensional case. We also state that the reduced right-hand-side is equal to wt',j,fc wi th an 0(h2) error. Indeed, Gaussian el imination yields the following right hand side: b c d e f g Wi,j,k wi,j-l,k wi-l,j,k wi+l,j,k wi,j+l,k wi,j,k-l wi,j,k+l ' (2.10) '•" a a a a a a whose Taylor expansion about the gridpoint (ih, jh, kh), after scaling by 2ah2 [that is, after doing the same scaling as for E q . (2.9)] is: w — y^[ — Aw + (awx + rwy + pwz) ] , where al l the derivatives of w are evaluated at the point (ih, jh, kh). Th is result is similar to the result in [41] for the two-dimensional case. 2.2.2 T h e V a r i a b l e Coeff icient Case For the variable coefficient case performing cyclic reduction is considerably more difficult, as now the components of the computat ional molecule depend on each gridpoint 's coordinates. We first i l lustrate how the seven-point operator is derived. Th is derivation is standard; see for example [41] for details on the two-dimensional case. The discrete difference operator is derived by working on a fine grid whose size is | rather than h. Equat ion (1.1) is rewritten in the following manner: ~[{pux)x + (quy)y + [ruz)z] + sux + tuy + vuz = w . (2-11) Assume that p, q, r > 0 on Q,. The seven-point discretization is done as follows: 36 Chapter 2. Cyclic Reduction • For the convective terms a centered finite difference approximation or an upwind dis-cretization scheme is used. For example, if s > 0 on Q, the term involving the deriva-tive in the x-direction is discretized either as follows (centered differencing): sux « si,j,kU'+1'J'k2h'~l':''k i o r as follows (upwind backward differencing): sux ' « S i j , f c " ' ' J l ' i ~^ '~ 1 ' J , ' c . The terms tuy and vuz are discretized in a similar fashion. 
• For the diffusive terms a centered difference scheme is applied: a mesh of size $\frac h2$ is used and the terms $pu_x$, $qu_y$ and $ru_z$ are discretized, and then the derivatives of these terms in the $x$, $y$ and $z$ directions respectively are computed. In doing so, a scheme that defines $u$ on a grid whose mesh size is $h$ is obtained, using values of the coefficient functions on the finer grid whose size is $\frac h2$. For example, for the first term in Eq. (1.1),

$$(pu_x)_x \approx \frac{p_{i+\frac12,j,k}\,u_{i+1,j,k} - \bigl(p_{i+\frac12,j,k} + p_{i-\frac12,j,k}\bigr)\,u_{i,j,k} + p_{i-\frac12,j,k}\,u_{i-1,j,k}}{h^2}.$$

The terms $(qu_y)_y$ and $(ru_z)_z$ are discretized by applying a similar procedure.

The difference operator for an interior gridpoint is given by

$$F u_{i,j,k} = a_{i,j,k}\,u_{i,j,k} + b_{i,j,k}\,u_{i,j-1,k} + c_{i,j,k}\,u_{i-1,j,k} + d_{i,j,k}\,u_{i+1,j,k} + e_{i,j,k}\,u_{i,j+1,k} + f_{i,j,k}\,u_{i,j,k-1} + g_{i,j,k}\,u_{i,j,k+1}. \qquad (2.12)$$

If $s_{i,j,k}$, $t_{i,j,k}$ and $v_{i,j,k}$ are all positive and centered differences are used, the values of the computational molecule are given by

$$a_{i,j,k} = p_{i+\frac12,j,k} + p_{i-\frac12,j,k} + q_{i,j+\frac12,k} + q_{i,j-\frac12,k} + r_{i,j,k+\frac12} + r_{i,j,k-\frac12};$$
$$b_{i,j,k} = -q_{i,j-\frac12,k} - \frac{t_{i,j,k}h}{2}; \quad e_{i,j,k} = -q_{i,j+\frac12,k} + \frac{t_{i,j,k}h}{2};$$
$$c_{i,j,k} = -p_{i-\frac12,j,k} - \frac{s_{i,j,k}h}{2}; \quad d_{i,j,k} = -p_{i+\frac12,j,k} + \frac{s_{i,j,k}h}{2};$$
$$f_{i,j,k} = -r_{i,j,k-\frac12} - \frac{v_{i,j,k}h}{2}; \quad g_{i,j,k} = -r_{i,j,k+\frac12} + \frac{v_{i,j,k}h}{2}. \qquad (2.13)$$

If one uses upwind schemes, then the type of scheme depends on the sign of the convective terms. Assuming that $s$, $t$ and $v$ do not change sign in the domain, for each of them which is positive one would use the backward scheme, and the forward scheme would be used in the negative case. Discretizing using backward differences yields

$$a_{i,j,k} = p_{i+\frac12,j,k} + p_{i-\frac12,j,k} + q_{i,j+\frac12,k} + q_{i,j-\frac12,k} + r_{i,j,k+\frac12} + r_{i,j,k-\frac12} + s_{i,j,k}h + t_{i,j,k}h + v_{i,j,k}h;$$
$$b_{i,j,k} = -q_{i,j-\frac12,k} - t_{i,j,k}h; \quad e_{i,j,k} = -q_{i,j+\frac12,k};$$
$$c_{i,j,k} = -p_{i-\frac12,j,k} - s_{i,j,k}h; \quad d_{i,j,k} = -p_{i+\frac12,j,k};$$
$$f_{i,j,k} = -r_{i,j,k-\frac12} - v_{i,j,k}h; \quad g_{i,j,k} = -r_{i,j,k+\frac12}, \qquad (2.14)$$

and for forward differences the above needs to be modified in an obvious manner.

As opposed to the constant coefficient model problem [Eq. (1.20c)], the components of the computational molecule now depend on the coordinates of the associated gridpoints. Nevertheless, the sparsity structure of the reduced matrix is identical in both cases. The entries of the matrix which are affected by the elimination are the following (for each row, each entry is followed, in parentheses, by the column in which it appears):

row $(i,j,k-1)$: $a_{i,j,k-1}$ on the diagonal; $g_{i,j,k-1}$ $(i,j,k)$; $f$, $e$, $c$, $d$, $b$ (all evaluated at $(i,j,k-1)$) in columns $(i,j,k-2)$, $(i,j+1,k-1)$, $(i-1,j,k-1)$, $(i+1,j,k-1)$, $(i,j-1,k-1)$;
row $(i,j+1,k)$: $a_{i,j+1,k}$ on the diagonal; $b_{i,j+1,k}$ $(i,j,k)$; $f$, $e$, $c$, $d$, $g$ (at $(i,j+1,k)$) in columns $(i,j+1,k-1)$, $(i,j+2,k)$, $(i-1,j+1,k)$, $(i+1,j+1,k)$, $(i,j+1,k+1)$;
row $(i-1,j,k)$: $a_{i-1,j,k}$ on the diagonal; $d_{i-1,j,k}$ $(i,j,k)$; $f$, $e$, $c$, $b$, $g$ (at $(i-1,j,k)$) in columns $(i-1,j,k-1)$, $(i-1,j+1,k)$, $(i-2,j,k)$, $(i-1,j-1,k)$, $(i-1,j,k+1)$;
row $(i+1,j,k)$: $a_{i+1,j,k}$ on the diagonal; $c_{i+1,j,k}$ $(i,j,k)$; $f$, $e$, $d$, $b$, $g$ (at $(i+1,j,k)$) in columns $(i+1,j,k-1)$, $(i+1,j+1,k)$, $(i+2,j,k)$, $(i+1,j-1,k)$, $(i+1,j,k+1)$;
row $(i,j-1,k)$: $a_{i,j-1,k}$ on the diagonal; $e_{i,j-1,k}$ $(i,j,k)$; $f$, $c$, $d$, $b$, $g$ (at $(i,j-1,k)$) in columns $(i,j-1,k-1)$, $(i-1,j-1,k)$, $(i+1,j-1,k)$, $(i,j-2,k)$, $(i,j-1,k+1)$;
row $(i,j,k+1)$: $a_{i,j,k+1}$ on the diagonal; $f_{i,j,k+1}$ $(i,j,k)$; $e$, $c$, $d$, $b$, $g$ (at $(i,j,k+1)$) in columns $(i,j+1,k+1)$, $(i-1,j,k+1)$, $(i+1,j,k+1)$, $(i,j-1,k+1)$, $(i,j,k+2)$;
row $(i,j,k)$: $b$, $c$, $d$, $e$, $f$, $g$ (at $(i,j,k)$) in columns $(i,j-1,k)$, $(i-1,j,k)$, $(i+1,j,k)$, $(i,j+1,k)$, $(i,j,k-1)$, $(i,j,k+1)$.
Let $R$ denote the reduced discrete operator, after scaling by $h^2$. For an interior gridpoint $(ih, jh, kh)$, the operation of $R$ is given by the difference equation

$$R u_{i,j,k} = \Bigl(a_{i,j,k} - \frac{f_{i,j,k}\,g_{i,j,k-1}}{a_{i,j,k-1}} - \frac{c_{i,j,k}\,d_{i-1,j,k}}{a_{i-1,j,k}} - \frac{e_{i,j,k}\,b_{i,j+1,k}}{a_{i,j+1,k}} - \frac{d_{i,j,k}\,c_{i+1,j,k}}{a_{i+1,j,k}} - \frac{b_{i,j,k}\,e_{i,j-1,k}}{a_{i,j-1,k}} - \frac{g_{i,j,k}\,f_{i,j,k+1}}{a_{i,j,k+1}}\Bigr)\,u_{i,j,k}$$
$$- \Bigl(\frac{f_{i,j,k}\,e_{i,j,k-1}}{a_{i,j,k-1}} + \frac{e_{i,j,k}\,f_{i,j+1,k}}{a_{i,j+1,k}}\Bigr)\,u_{i,j+1,k-1} - \Bigl(\frac{f_{i,j,k}\,c_{i,j,k-1}}{a_{i,j,k-1}} + \frac{c_{i,j,k}\,f_{i-1,j,k}}{a_{i-1,j,k}}\Bigr)\,u_{i-1,j,k-1}$$
$$- \Bigl(\frac{f_{i,j,k}\,d_{i,j,k-1}}{a_{i,j,k-1}} + \frac{d_{i,j,k}\,f_{i+1,j,k}}{a_{i+1,j,k}}\Bigr)\,u_{i+1,j,k-1} - \Bigl(\frac{f_{i,j,k}\,b_{i,j,k-1}}{a_{i,j,k-1}} + \frac{b_{i,j,k}\,f_{i,j-1,k}}{a_{i,j-1,k}}\Bigr)\,u_{i,j-1,k-1}$$
$$- \Bigl(\frac{c_{i,j,k}\,e_{i-1,j,k}}{a_{i-1,j,k}} + \frac{e_{i,j,k}\,c_{i,j+1,k}}{a_{i,j+1,k}}\Bigr)\,u_{i-1,j+1,k} - \Bigl(\frac{d_{i,j,k}\,e_{i+1,j,k}}{a_{i+1,j,k}} + \frac{e_{i,j,k}\,d_{i,j+1,k}}{a_{i,j+1,k}}\Bigr)\,u_{i+1,j+1,k}$$
$$- \Bigl(\frac{c_{i,j,k}\,b_{i-1,j,k}}{a_{i-1,j,k}} + \frac{b_{i,j,k}\,c_{i,j-1,k}}{a_{i,j-1,k}}\Bigr)\,u_{i-1,j-1,k} - \Bigl(\frac{d_{i,j,k}\,b_{i+1,j,k}}{a_{i+1,j,k}} + \frac{b_{i,j,k}\,d_{i,j-1,k}}{a_{i,j-1,k}}\Bigr)\,u_{i+1,j-1,k}$$
$$- \Bigl(\frac{e_{i,j,k}\,g_{i,j+1,k}}{a_{i,j+1,k}} + \frac{g_{i,j,k}\,e_{i,j,k+1}}{a_{i,j,k+1}}\Bigr)\,u_{i,j+1,k+1} - \Bigl(\frac{c_{i,j,k}\,g_{i-1,j,k}}{a_{i-1,j,k}} + \frac{g_{i,j,k}\,c_{i,j,k+1}}{a_{i,j,k+1}}\Bigr)\,u_{i-1,j,k+1}$$
$$- \Bigl(\frac{d_{i,j,k}\,g_{i+1,j,k}}{a_{i+1,j,k}} + \frac{g_{i,j,k}\,d_{i,j,k+1}}{a_{i,j,k+1}}\Bigr)\,u_{i+1,j,k+1} - \Bigl(\frac{b_{i,j,k}\,g_{i,j-1,k}}{a_{i,j-1,k}} + \frac{g_{i,j,k}\,b_{i,j,k+1}}{a_{i,j,k+1}}\Bigr)\,u_{i,j-1,k+1}$$
$$- \frac{f_{i,j,k}\,f_{i,j,k-1}}{a_{i,j,k-1}}\,u_{i,j,k-2} - \frac{e_{i,j,k}\,e_{i,j+1,k}}{a_{i,j+1,k}}\,u_{i,j+2,k} - \frac{c_{i,j,k}\,c_{i-1,j,k}}{a_{i-1,j,k}}\,u_{i-2,j,k} - \frac{d_{i,j,k}\,d_{i+1,j,k}}{a_{i+1,j,k}}\,u_{i+2,j,k} - \frac{b_{i,j,k}\,b_{i,j-1,k}}{a_{i,j-1,k}}\,u_{i,j-2,k} - \frac{g_{i,j,k}\,g_{i,j,k+1}}{a_{i,j,k+1}}\,u_{i,j,k+2}. \qquad (2.15)$$

Notice that Eq. (2.15) involves significantly more floating point operations than the seven-point operator. The computational work involved in constructing the reduced matrix is discussed in Chapter 5.

2.3 Properties of the Reduced Matrix

The reduced matrix $S = E - DB^{-1}C$ is the Schur complement matrix of the original matrix $A$. Schur complements arise in many applications. In particular, we mention the class of domain decomposition techniques [22]. These are techniques which are suitable for irregular (e.g. L-shaped) domains: the domain is divided into "simple" (e.g. rectangular) subdomains, where the partial differential equation can be easily solved. The difficulty is that for some partitionings, between each two subdomains there is an "interface" layer of unknowns that belong to both subdomains (see e.g. [95]). The submatrix associated with the interface unknowns is the Schur complement, and is typically a much smaller matrix than the submatrices associated with unknowns that belong to the subdomains. If the solution for the interface unknowns is obtained first, by solving for the Schur complement, then it is easy to find a solution for the rest of the unknowns. This approach is used in several areas of application, for example in cut-loop techniques for multibody systems simulation (see [7] and references therein).

Schur complements have many useful properties (see, for example, [27]). In the context of cyclic reduction, we are interested in seeing whether numerical properties of the unreduced matrix are preserved, or improved, after one step of cyclic reduction is performed.
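Since $B$ is diagonal for the seven-point operator, the reduced matrix can be assembled directly from the Schur complement formula. The sketch below (Python/SciPy; the function names and interface are ours, for illustration only) forms $S = E - DB^{-1}C$ and the matching reduced right-hand side from a matrix $A$ given in red/black ordering; the eliminated unknowns are recovered afterwards from $x_{\mathrm{red}} = B^{-1}(w_{\mathrm{red}} - C\,x_{\mathrm{black}})$.

```python
import scipy.sparse as sp

def schur_reduce(A, red, blk):
    """One step of cyclic reduction: S = E - D B^{-1} C, where B = A[red, red]
    is diagonal for the seven-point operator, so B^{-1} is formed exactly."""
    A = A.tocsr()
    B = A[red][:, red]
    C = A[red][:, blk]
    D = A[blk][:, red]
    E = A[blk][:, blk]
    Binv = sp.diags(1.0 / B.diagonal())
    return (E - D @ Binv @ C).tocsr()

def reduce_rhs(A, w, red, blk):
    """Matching reduced right-hand side: w_blk - D B^{-1} w_red."""
    A = A.tocsr()
    B = A[red][:, red]
    D = A[blk][:, red]
    return w[blk] - D @ (w[red] / B.diagonal())
```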
The results in Lemmas 2.1-2.4 below are not specific to the particular class of problems discussed in this thesis and are well known (see, for example, [95], for some of the results presented below); the discussion and results following these lemmas are specific to our reduced system.

Lemma 2.1. If the original matrix $A$ is nonsingular, then the Schur complement matrix $S$ is nonsingular as well.

Proof. The block-LU decomposition of $A$ is

$$A = \begin{pmatrix} B & C \\ D & E \end{pmatrix} = \begin{pmatrix} I & 0 \\ DB^{-1} & I \end{pmatrix} \begin{pmatrix} B & C \\ 0 & S \end{pmatrix}, \qquad (2.16)$$

thus $S$ must be nonsingular. □

Lemma 2.2. If the matrix $A$ is symmetric positive definite, then so is $S$.

Proof. It is straightforward to show that (with $D = C^T$)

$$\begin{pmatrix} I & 0 \\ -C^T B^{-1} & I \end{pmatrix} A \begin{pmatrix} I & -B^{-1}C \\ 0 & I \end{pmatrix} = \begin{pmatrix} B & 0 \\ 0 & S \end{pmatrix}, \qquad (2.17)$$

and thus, if $A$ is symmetric positive definite, so is $S$. □

For M-matrices (see Defn. 1.3), we have the following result (the proof can be found in [45] or [9, Thm. 6.10]):

Lemma 2.3. If a matrix is an M-matrix, then so is the associated Schur complement matrix.

As far as conditioning is concerned, we have:

Lemma 2.4. If $A$ is symmetric positive definite, then

$$\kappa_2(S) \le \kappa_2(A). \qquad (2.18)$$

In other words, the matrix $S$ is better conditioned than $A$.

Proof. From the proof of Lemma 2.2, $B$ is symmetric positive definite. Hence so are $B^{-1}$ and $C^T B^{-1} C$. Since $S = E - C^T B^{-1} C$, $E$ must be symmetric positive definite, and thus

$$\|S\|_2 \le \|E\|_2 \le \|A\|_2. \qquad (2.19)$$

On the other hand,

$$A^{-1} = \begin{pmatrix} B^{-1} + B^{-1} C S^{-1} D B^{-1} & -B^{-1} C S^{-1} \\ -S^{-1} D B^{-1} & S^{-1} \end{pmatrix}, \qquad (2.20)$$

so we have

$$\|S^{-1}\|_2 \le \|A^{-1}\|_2. \qquad (2.21)$$

From Eq. (2.19) and (2.21) the result stated in the lemma readily follows. □

Moving to the nonsymmetric case, we first note that, as already mentioned in the Introduction, the first step of cyclic reduction is attractive mainly because the matrix to be inverted is diagonal. In fact, matrices that arise from discretization of differential equations using three-point, five-point or seven-point operators all belong to a special class of matrices, defined as follows [67, §9.2]:

Definition 2.1. Let $T$ be partitioned into $q^2$ submatrices:

$$T = \begin{pmatrix} T_{1,1} & \cdots & T_{1,q} \\ \vdots & & \vdots \\ T_{q,1} & \cdots & T_{q,q} \end{pmatrix}. \qquad (2.22)$$

$T$ is said to have block Property A with respect to this partitioning if there exist two disjoint nonempty subsets $S_R$ and $S_B$ of $\{1, 2, \dots, q\}$ such that $S_R \cup S_B = \{1, \dots, q\}$, and such that if $T_{i,j} \ne 0$ and $i \ne j$, then $i \in S_R$ and $j \in S_B$, or $i \in S_B$ and $j \in S_R$.

Property A was first defined by Young [116], and was later generalized by Arms, Gates & Zondek [4], and by Varga [112]. The matrices that are associated with the three-point, five-point and seven-point operators all have Property A with respect to $1 \times 1$ matrices; showing it is easy: the two subsets $S_R$ and $S_B$ mentioned in Definition 2.1 correspond to the set of red gridpoints and the set of black gridpoints when red/black ordering is used. A matrix that corresponds to an arbitrary ordering of the unknowns can be transformed by a symmetric permutation to the matrix corresponding to red/black ordering, and thus it has Property A. The question whether a matrix has Property A or not is very significant in the convergence analysis of stationary methods, in particular the SOR scheme [117]. We will discuss this in Chapter 4, where we will also address the question in what circumstances the reduced matrix has Property A, and relative to which partitionings.

In Fig. 2.6, plots and histograms of the eigenvalues of both the unreduced and the reduced systems are depicted, for the symmetric positive definite case corresponding to Poisson's equation,

$$-\Delta u = f, \qquad (2.23)$$

discretized on a $12 \times 12 \times 12$ grid. In the histograms, the x-axis shows the eigenvalues and the y-axis their count.
The eigenvalues of the unreduced system are explicitly known in analytical terms and are distributed between 0 and 12. The maximal eigenvalue is close to 12, and the minimal eigenvalue is close to 0. On the other hand, the reduced matrix has all its eigenvalues between 0 and 6, and it is evident, either from the histogram or from the moderate slope of the graph of the eigenvalues (in the region of the largest ones), that the majority of the eigenvalues are between 5 and 6. Also, the smallest eigenvalues of the reduced matrix are more isolated than those of the unreduced matrix, and are larger than the smallest eigenvalues of the unreduced matrix. In this particular case the minimal eigenvalues of the unreduced and reduced matrices are 0.174 and 0.344 respectively. The condition number of the unreduced matrix is larger by a factor of 3.885. Since the rate of convergence of Krylov space methods strongly depends on the condition number of the matrix and on the clustering of the eigenvalues, the above observations lead to the conclusion that convergence of the reduced solver will be faster.

Figure 2.6: eigenvalues of both systems for Poisson's equation on a $12 \times 12 \times 12$ grid; (a) unreduced, (b) reduced.

Similar observations apply to the general nonsymmetric case; when convection is moderate in size, the structure of the eigenvalues is very similar to the structure presented in Fig. 2.6. When convection dominates, the situation is somewhat different, but still the reduced matrix is better conditioned. In Fig. 2.7 we refer to a more difficult problem: the singularly perturbed equation

$$-\epsilon\,\Delta u + \nabla v \cdot \nabla u = f, \qquad (2.24)$$

where $v = \frac12\bigl[(x - \frac12)^2 + (y - \frac12)^2 + (z - \frac12)^2\bigr]$ and the domain is the unit cube $(0,1) \times (0,1) \times (0,1)$. Upwind differencing is used to discretize the problem on a uniform grid. Taking $\epsilon = 0.02$ on an $8 \times 8 \times 8$ grid, the spectral condition number of the unreduced matrix is $\kappa_2 = 2292$, with $\sigma_{\min} = 0.0098$ and $\sigma_{\max} = 22.46$, while the condition number of the reduced matrix is $\kappa_2(R) = 759$, with $\sigma_{\min} = 0.0195$ and $\sigma_{\max} = 14.80$. In Fig. 2.7 we give the histograms of the singular values of the matrices for this particular problem.

Figure 2.7: singular values of both matrices for Eq. (2.24) ($8 \times 8 \times 8$ grid); (a) unreduced, (b) reduced.

Another point of interest is the circumstances in which the reduced matrix is an irreducibly diagonally dominant M-matrix. Here, in a way similar to [40, Cor. 3], we can obtain the following useful result:

Lemma 2.5. If $be, cd, fg > 0$, then both the unreduced and the reduced matrices are diagonally dominant M-matrices.

Proof. Consider the unreduced matrix. For the matrix to be an M-matrix, a necessary condition is that the off-diagonal elements be nonpositive, which readily holds if $be, cd, fg > 0$. It is straightforward to see, by substituting the values of the computational molecule in (1.17), (1.18) or (1.19), that the unreduced matrix is diagonally dominant and irreducible. Thus by Theorem 1.1 the unreduced matrix is an M-matrix.
As for the reduced matrix, consider a row associated with an interior point: diagonal dominance translates to the requirement that

$$a^2 - 2be - 2cd - 2fg \ge b^2 + c^2 + d^2 + e^2 + f^2 + g^2 + 2|bf| + 2|cf| + 2|df| + 2|ef| + 2|bc| + 2|bd| + 2|ce| + 2|de| + 2|bg| + 2|cg| + 2|dg| + 2|eg|. \qquad (2.25)$$

If (2.25) holds, then diagonal dominance of rows which correspond to points close to the boundary also holds, as in this case the associated diagonal entry is larger. Eq. (2.25) is identical to $a^2 \ge (|b| + |c| + |d| + |e| + |f| + |g|)^2$, which holds because the unreduced matrix is diagonally dominant; by Lemma 2.3 the reduced matrix is also an M-matrix. □

The results mentioned in this section are important in the sense that if the original matrix has valuable numerical properties, then so does the reduced matrix; the procedure of cyclic reduction does not damage properties such as diagonal dominance or positive definiteness.

Chapter 3
Ordering Strategies

The question of the ordering of the unknowns is of importance for both iterative and direct solvers, as a good ordering strategy can lead to significant savings in computational work. Among books that address this subject we mention George & Liu [52] and Duff, Erisman & Reid [35]. Among the many papers that deal with the effects of orderings on preconditioned iterative methods (in particular Krylov subspace solvers) we mention [14],[23],[29],[31],[36],[37],[89].

There are several possible "guidelines" for picking orderings. Popular ones are orderings for small bandwidth, which attempt to minimize the profile of the matrix (e.g. Cuthill-McKee [28] or Reverse Cuthill-McKee [52]), and orderings which attempt to minimize the amount of fill-in, i.e. the number of entries that were initially zero and become nonzero in the process of elimination (e.g. Minimum Degree [108]). A convenient way to obtain such orderings is by working with the graph of the matrix [52]. Ordering strategies which consider the graph of the matrix as well as its values are MDF [29] and TPABLO [83].

In seeking an effective ordering technique for our reduced matrix, our purpose in general is to choose a strategy that results in a well-structured narrow-banded matrix. In addition, we are interested in ordering the unknowns so that the main diagonal blocks of the matrix are as dense as possible. Some motivation for this can be given, for example, by considering the following result due to Varga [113, Thm 3.15]:

Let $A = M_1 - N_1 = M_2 - N_2$ be two regular splittings of $A$, where $A^{-1} \ge 0$. If $N_2 \ge N_1 \ge 0$, equality excluded, then

$$0 \le \rho(M_1^{-1} N_1) \le \rho(M_2^{-1} N_2) < 1. \qquad (3.1)$$

On the other hand, dense blocks mean more computational work when block iterative solvers are considered. For some of these solvers the block diagonal part of the matrix needs to be inverted in each iteration. There is a clear tradeoff between the number of nonzero diagonals that belong to the block diagonal part of the matrix and the amount of work required to invert this submatrix. Of course, the term "block diagonal part" is very loose, as it is determined by the user (taking things to the extreme, the user could decide that the bandwidth of the "block diagonal part" of the matrix is equal to the matrix bandwidth).
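Varga's comparison theorem is easy to observe numerically. The following sketch is ours, and runs on a small 2D Poisson matrix rather than our reduced matrices, purely to illustrate (3.1): moving more of the matrix into $M$ (line Jacobi instead of point Jacobi) shrinks $N$ and hence the spectral radius.

```python
import numpy as np

n = 20
I = np.eye(n)
T = 4 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
A = np.kron(I, T) - np.kron(np.eye(n, k=1) + np.eye(n, k=-1), I)  # 2D Poisson

def rho(M):
    """Spectral radius of M^{-1} N for the splitting A = M - N."""
    return max(abs(np.linalg.eigvals(np.linalg.solve(M, M - A))))

print(rho(np.diag(np.diag(A))))  # point Jacobi: larger N, rho closer to 1
print(rho(np.kron(I, T)))        # line Jacobi: smaller N, smaller rho, as in (3.1)
```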
Here we shall adopt a strategy of referring only to the block diagonal part of the matrix whose bandwidth does not depend on the matrix size, and of attempting to group as many nonzero diagonals as possible in this part of the matrix. Inversion of the diagonal block then still requires only a number of floating point operations which is proportional to the number of unknowns, while at the same time a substantial part of the matrix is included in its block diagonal part.

Examining the sparsity pattern of the lexicographically ordered matrix (Fig. 2.5) reveals the flaws of this ordering: it is not suitable for the reduced grid, as it does not take into account the structure of the grid, that is, the fact that the red points are now missing. In addition, it does not take into account the special structure of the computational molecule of the reduced operator. Since the stencil is not compact, and contains gridpoints from five parallel planes, an ordering strategy which numbers unknowns from more than one plane at a time might be required.

Another major point of consideration should be the ability to parallelize the solution procedures: the matrices considered here are very large due to the number of space dimensions, and their block structure might lend itself to parallelism.

Below we present ordering strategies which can be applied to any three-dimensional problem. The idea is to divide the unknowns into one-dimensional and two-dimensional sets, and to analyze the ordering not only relative to the three-dimensional grid but also relative to the lower-dimensional grid of sets of gridpoints.

3.1 Block Ordering Strategies for 3D Grids

Denote the number of unknowns by $N$ and suppose that a certain gridpoint's index $\ell$, $1 \le \ell \le N$, is associated with coordinate values $x$, $y$, $z$. We denote its coordinate indices by $i = \frac xh$, $j = \frac yh$, $k = \frac zh$, and refer to sets of coordinate indices simply as "coordinate sets". The coordinate indices are integers assuming values from 1 to $n$. Here $n$ is the same as in the previous chapters.

Definition 3.1. Let $\{S_m\}$ be $O(n)$-item disjoint sets of clustered indices, such that the coordinate set of the gridpoints contained in each set $S_m$ contains all integers from 1 to $n$ for exactly one of the independent variables, and $\bigcup S_m = \{1, \dots, N\}$. We call each of the sets $S_m$ a 1D block of gridpoints, and we say that this block is x-oriented, y-oriented or z-oriented, according to which independent variable is associated with all integers from 1 to $n$.

Definition 3.2. Let $\{T_m\}$ be $O(n^2)$-item disjoint sets of clustered indices, such that the coordinate set of the gridpoints contained in each set $T_m$ contains all integers from 1 to $n$ for two of the independent variables, and $\bigcup T_m = \{1, \dots, N\}$. Each set $T_m$ is a 2D block of gridpoints, and we say that it is either x-y oriented, x-z oriented, or y-z oriented, according to which two independent variables are associated with all integers from 1 to $n$ in the coordinate set.

Notice that from the above definitions it follows that the 1D blocks do not necessarily form single lines, and the 2D blocks do not necessarily form single planes. Below we define a 2D block grid and a block computational molecule.

Definition 3.3. Suppose the one-dimensional blocks associated with a certain ordering are $s_1$-oriented, where $s_1$ is one of $\{x, y, z\}$. Then a three-dimensional grid of points can be redefined as a 2D block grid of $s_1$-oriented 1D blocks.
The variables associated with this grid are $\{s_2, s_3\} = \{x, y, z\} \setminus \{s_1\}$.

Definition 3.4. For a given gridpoint, its associated block computational molecule is defined as the computational molecule in the corresponding 2D block grid. That is, its components are all the sets of 1D blocks which contain at least one gridpoint that belongs to the point computational molecule associated with the 3D problem.

Using the above definitions, we can now easily define variants of block orderings:

Definition 3.5. A 3D ordering strategy is called natural block ordering if the one-dimensional blocks are ordered in the corresponding 2D block grid using natural lexicographic ordering.

Definition 3.6. A 3D ordering strategy is called red/black block ordering if the one-dimensional blocks are ordered in the corresponding 2D block grid using red/black ordering.

Definition 3.7. A 3D ordering strategy is called toroidal block ordering if the one-dimensional blocks are ordered in the corresponding 2D block grid using the toroidal ordering strategy [41].

In Fig. 3.1 we illustrate the above-mentioned orderings. See [41] for an explanation of toroidal ordering and its advantages.

(a) natural:     (b) red/black:   (c) toroidal:
13 14 15 16      15  7 16  8      13  2  7 12
 9 10 11 12       5 13  6 14       9 14  3  8
 5  6  7  8      11  3 12  4       5 10 15  4
 1  2  3  4       1  9  2 10       1  6 11 16

Figure 3.1: orderings of the 2D block grid. Each point in these grids corresponds to a 1D block of gridpoints in the underlying 3D grid.

Having established the general idea, we now consider particular families of orderings for the reduced grid. Various families can be determined according to the way their 1D and 2D sets of gridpoints are picked.

3.2 The Family of Two-Plane Orderings

This ordering corresponds to ordering the unknowns by gathering blocks of $2n$ gridpoints from two horizontal lines and two adjacent planes. In Figure 3.2 three different members of the natural two-plane ordering are depicted. For notational convenience, we label this family "2PN" ("2P" stands for two-plane, and "N" stands for natural). Similarly, we name the red/black family and the toroidal family "2PRB" and "2PT" respectively. Two additional letters are added in order to distinguish between members of the family: for example, 2PNxy is the ordering associated with 1D blocks of gridpoints in the x-direction and 2D blocks in the x-y direction, and so on.

Figure 3.2: three members in the family of natural two-plane orderings; (a) 2PNxy, (b) 2PNxz, (c) 2PNyz.

Let us illustrate the definitions presented in the previous section by examining the ordering 2PNxy: the basic one-dimensional sets are $2n$-item sets; for the specific case of Fig. 3.2(a), which corresponds to $n = 4$, indices 1-8, 9-16, 17-24 and 25-32 are each a one-dimensional set which forms a "stripe" of size $n \times 2 \times 2$. Thus the 1D sets are x-oriented. Next, the sets 1-16 and 17-32 form an $n \times n \times 2$ shape, thus the 2D sets are x-y oriented. The block grid is thus a 2D grid of x-oriented 1D sets of gridpoints, associated with the independent variables $y$ and $z$; if we associate $y$ with rows and $z$ with columns, then the 1D sets are ordered in a natural lexicographic rowwise fashion. In Fig. 3.3 we depict the two-plane red/black and toroidal families of orderings. The combination of Fig. 3.1, Fig. 3.2 and 3.3 clarifies how the ordering for a larger grid is done.
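The block-grid orderings of Fig. 3.1 are easy to generate programmatically. The following sketch is ours: it reproduces the natural ordering exactly, while for red/black only the red-first/black-second structure of Fig. 3.1(b) is reproduced (the two colors may be traversed in a different order), and the toroidal ordering of [41] is omitted.

```python
import numpy as np

def natural_block_ordering(m):
    """Natural lexicographic rowwise numbering of an m x m block grid;
    entry (r, c) is the 1-based number of the 1D block in row r, column c."""
    return np.arange(1, m * m + 1).reshape(m, m)

def red_black_block_ordering(m):
    """Red/black numbering: blocks with (r + c) even are numbered first,
    then the remaining ('black') blocks."""
    order = np.zeros((m, m), dtype=int)
    cells = sorted(((r + c) % 2, r, c) for r in range(m) for c in range(m))
    for num, (_, r, c) in enumerate(cells, start=1):
        order[r, c] = num
    return order

print(natural_block_ordering(4))    # numbering of Fig. 3.1(a), row 1 printed first
print(red_black_block_ordering(4))  # same two-color structure as Fig. 3.1(b)
```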
Figure 3.3: red/black and toroidal two-plane ordering corresponding to x-y oriented 2D blocks; (a) 2PRBxy, (b) 2PTxy.

We now specify the entries of the matrices for the two-plane family of orderings. We arbitrarily pick one member of this family, 2PNxz; any other family member's specific entries can be deduced simply by interchanging the roles of $x$, $y$ and $z$, or by changing the indexing of the blocks in an obvious manner. The matrix can be written as an $\frac n2$-order block tridiagonal matrix

$$S = \mathrm{tri}[S_{j,j-1},\ S_{j,j},\ S_{j,j+1}]. \qquad (3.2)$$

Each of the above components of $S$ is an $n^2 \times n^2$ matrix, block tridiagonal relative to $2n \times 2n$ blocks. Let us use superscripts $(-1)$, $(0)$ and $(1)$ to denote the subdiagonal, diagonal, and superdiagonal blocks of each of these block matrices, respectively. Denote by 'cntr' the center of the computational molecule, specified by (2.6) and the explanation that follows it. The diagonal submatrices $S_{j,j}^{(0)}$ have nonzero entries in diagonals $-4$ to $4$, and their diagonals are given as follows:

$$[\,-c^2,\ -2cf \cdot E_{10},\ -2ce \cdot E_{1001} - 2bc \cdot E_{0110},\ -2cg \cdot E_{01} - 2ef \cdot E_{1000} - 2bf \cdot E_{0010},\ \mathrm{cntr},\ -2bg \cdot E_{0100} - 2df \cdot E_{10} - 2eg \cdot E_{0001},\ -2bd \cdot E_{0110} - 2de \cdot E_{1001},\ -2dg \cdot E_{01},\ -d^2\,]. \qquad (3.3)$$

The other submatrices contained in $S_{j,j}$ have an irregular structure. The superdiagonal submatrix $S_{j,j}^{(1)}$ has nonzero entries in diagonals $-3$ to $1$, as follows:

$$[\,-2cg \cdot E_{10},\ 0,\ -2eg \cdot E_{1000} - 2bg \cdot E_{0010},\ -g^2,\ -2dg \cdot E_{10}\,], \qquad (3.4)$$

and the subdiagonal submatrix $S_{j,j}^{(-1)}$ has nonzero entries in diagonals $-1$ to $3$, as follows:

$$[\,-2cf \cdot E_{01},\ -f^2,\ -2bf \cdot E_{0100} - 2ef \cdot E_{0001},\ 0,\ -2df \cdot E_{01}\,]. \qquad (3.5)$$

The components of $S_{j,j+1}$ and $S_{j,j-1}$ are given by:

$$S_{j,j+1}^{(-1)} = \mathrm{diag}[-2ef \cdot E_{0100}]; \qquad (3.6)$$
$$S_{j,j+1}^{(0)} = \mathrm{penta}[-2ce \cdot E_{0110},\ -2ef \cdot E_{0010},\ -e^2,\ -2eg \cdot E_{0100},\ -2de \cdot E_{0110}]; \qquad (3.7)$$
$$S_{j,j+1}^{(1)} = \mathrm{diag}[-2eg \cdot E_{0010}]; \qquad (3.8)$$
$$S_{j,j-1}^{(-1)} = \mathrm{diag}[-2bf \cdot E_{0001}]; \qquad (3.9)$$
$$S_{j,j-1}^{(0)} = \mathrm{penta}[-2bc \cdot E_{1001},\ -2bf \cdot E_{1000},\ -b^2,\ -2bg \cdot E_{0001},\ -2bd \cdot E_{1001}]; \qquad (3.10)$$
$$S_{j,j-1}^{(1)} = \mathrm{diag}[-2bg \cdot E_{1000}]. \qquad (3.11)$$

The sparsity pattern of the matrix is depicted in Fig. 3.4(a).

Figure 3.4: sparsity patterns of two members of the two-plane family of orderings; (a) natural, (b) 2D red/black.

Having specified the entries of the matrix, we can now find the block computational molecule corresponding to the two-plane ordering. Refer to the natural ordering depicted in Fig. 3.2. Examining the $2n$-item blocks of the ordering, we get a block computational molecule of the form depicted in Fig. 3.5(c). Fig. 3.5(a) and 3.5(b) illustrate which sets are associated with a single gridpoint. As is evident, the structure depends upon the parity of one of the gridpoint's indices. The block computational molecule is obtained by taking the union of the $2n$-item sets associated with each of the gridpoints in the block, and is identical to the computational molecule of the classical nine-point operator. Revealing this structure allows one to learn about the cyclically reduced operator (as a block operator) from what is known about the 9-point operator. For example, it is clear that the reduced matrix does not have block Property A relative to partitioning into $2n \times 2n$ blocks (this is discussed in Sec. 3.4).
Figure 3.5: block computational molecule corresponding to the family of orderings 2PN; (a) even $k$, (b) odd $k$, (c) the block computational molecule.

The 9-point operator gives rise to matrices for which parallelization can be obtained by using four-color schemes [57],[95]. In general, multicoloring techniques are useful for parallel computation [95]. Here we consider four-color ordering of 1D sets of gridpoints; the resulting matrix for the two-plane ordering is illustrated in Fig. 3.6. Notice that the main blocks are narrow-banded block diagonal matrices, and thus are easy to invert.

Figure 3.6: four-color 1D block version of the two-plane ordering (the matrix is of size $2048 \times 2048$, obtained by discretization on a $16 \times 16 \times 16$ grid).

The same ideas can be applied to 2D blocks of gridpoints. The block grid in this case is one-dimensional. In Fig. 3.4(b) the sparsity pattern of a $256 \times 256$ matrix corresponding to red/black ordering based on 2D blocks is depicted. As in the common cases of red/black ordering for problems with a smaller number of space dimensions, the bandwidth of the red/black matrix is larger than the one corresponding to natural ordering.

3.3 The Family of Two-Line Orderings

The two-plane ordering, presented in the previous section, is unique to three-dimensional problems. We now consider an alternative: a straightforward generalization to 3D of the two-line ordering for 2D problems used by Elman & Golub in [41]. In this ordering the numbering is done in pairs of horizontal lines that lie in a single plane. The ordering and sparsity structure of the matrix corresponding to the natural two-line ordering 2LNxy are depicted in Fig. 3.7. Other members of the family can be obtained in a straightforward manner. Here the 1D blocks are sets of $n$ gridpoints, and each of the 2D blocks corresponds to a single x-y plane.

Figure 3.7: ordering and sparsity pattern of the matrix associated with the 2LNxy ordering strategy; (a) ordering, (b) sparsity pattern.

The reduced matrix for this ordering strategy can be written as a block pentadiagonal matrix:

$$S = \mathrm{penta}[S_{j,j-2},\ S_{j,j-1},\ S_{j,j},\ S_{j,j+1},\ S_{j,j+2}]. \qquad (3.12)$$

Each $S_{i,j}$ is $(n^2/2) \times (n^2/2)$, and is a combination of $\frac n2$ uncoupled matrices, each of size $n \times n$. The diagonal matrices $\{S_{j,j}\}$ are themselves block tridiagonal. Each submatrix is of size $n \times n$, and the diagonal block is

$$S_{j,j}^{(0)} = \begin{cases} \mathrm{penta}(-c^2,\ -2bc \cdot E_{01} - 2ce \cdot E_{10},\ \mathrm{cntr},\ -2bd \cdot E_{01} - 2de \cdot E_{10},\ -d^2) & j \text{ odd} \\ \mathrm{penta}(-c^2,\ -2bc \cdot E_{10} - 2ce \cdot E_{01},\ \mathrm{cntr},\ -2bd \cdot E_{10} - 2de \cdot E_{01},\ -d^2) & j \text{ even} \end{cases}$$

For the superdiagonal and subdiagonal blocks of the matrices $S_{j,j}$ we have the following irregular tridiagonal structure, which depends on whether $j$ is even or odd. The superdiagonal matrices are given by

$$S_{j,j}^{(1)} = \mathrm{tri}[-2ce \cdot E_{01},\ -e^2,\ -2de \cdot E_{01}] \qquad (3.13)$$

if $j$ is odd, or

$$S_{j,j}^{(1)} = \mathrm{tri}[-2ce \cdot E_{10},\ -e^2,\ -2de \cdot E_{10}] \qquad (3.14)$$

if $j$ is even. The subdiagonal matrices are

$$S_{j,j}^{(-1)} = \mathrm{tri}[-2bc \cdot E_{10},\ -b^2,\ -2bd \cdot E_{10}] \qquad (3.15)$$

if $j$ is odd, or

$$S_{j,j}^{(-1)} = \mathrm{tri}[-2bc \cdot E_{01},\ -b^2,\ -2bd \cdot E_{01}] \qquad (3.16)$$

if $j$ is even.
The superdiagonal and subdiagonal blocks of $S$, $S_{j,j\pm1}$, are block tridiagonal:

$$S_{j,j-1}^{(-1)} = \begin{cases} \mathrm{diag}(-2bf \cdot E_{01}) & j \text{ odd} \\ \mathrm{diag}(-2bf \cdot E_{10}) & j \text{ even} \end{cases} \qquad (3.17)$$

$$S_{j,j-1}^{(0)} = \begin{cases} \mathrm{tri}(-2cf,\ -2bf \cdot E_{10} - 2ef \cdot E_{01},\ -2df) & j \text{ odd} \\ \mathrm{tri}(-2cf,\ -2bf \cdot E_{01} - 2ef \cdot E_{10},\ -2df) & j \text{ even} \end{cases} \qquad (3.18)$$

$$S_{j,j-1}^{(1)} = \begin{cases} \mathrm{diag}(-2ef \cdot E_{10}) & j \text{ odd} \\ \mathrm{diag}(-2ef \cdot E_{01}) & j \text{ even} \end{cases} \qquad (3.19)$$

$$S_{j,j+1}^{(-1)} = \begin{cases} \mathrm{diag}(-2bg \cdot E_{01}) & j \text{ odd} \\ \mathrm{diag}(-2bg \cdot E_{10}) & j \text{ even} \end{cases} \qquad (3.20)$$

$$S_{j,j+1}^{(0)} = \begin{cases} \mathrm{tri}(-2cg,\ -2bg \cdot E_{10} - 2eg \cdot E_{01},\ -2dg) & j \text{ odd} \\ \mathrm{tri}(-2cg,\ -2bg \cdot E_{01} - 2eg \cdot E_{10},\ -2dg) & j \text{ even} \end{cases} \qquad (3.21)$$

$$S_{j,j+1}^{(1)} = \begin{cases} \mathrm{diag}(-2eg \cdot E_{10}) & j \text{ odd} \\ \mathrm{diag}(-2eg \cdot E_{01}) & j \text{ even} \end{cases} \qquad (3.22)$$

Finally, the matrices $S_{j,j-2}$ and $S_{j,j+2}$ are diagonal:

$$S_{j,j-2} = \mathrm{diag}(-f^2), \quad j = 3, \dots, n; \qquad (3.23)$$
$$S_{j,j+2} = \mathrm{diag}(-g^2), \quad j = 1, \dots, n-2. \qquad (3.24)$$

The block computational molecule corresponding to the two-line ordering relative to splitting into 1D blocks is presented in Fig. 3.8.

Figure 3.8: block computational molecule corresponding to the ordering strategy 2LNxy; (a) even $k$, (b) odd $k$, (c) the block computational molecule.

3.4 Comparison Results

In order to compare the various orderings, we consider stationary methods, for which there are some useful general results [113] that make it possible to relate the orderings to the rate of convergence. Various splittings are possible. We consider two obvious ones, based on dimension: the term "one-dimensional splitting" is used for a splitting based on partitioning the matrix into $O(n)$ blocks ($2n \times 2n$ blocks for the two-plane ordering and $n \times n$ blocks for the two-line ordering). A "two-dimensional splitting" is one based on partitioning the matrix into $O(n^2)$ blocks ($n^2 \times n^2$ blocks for the two-plane ordering and $(n^2/2) \times (n^2/2)$ blocks for the two-line ordering). In other words, the 1D (2D) splitting for both ordering strategies is essentially associated with 1D (2D) sets of gridpoints. The partitionings for the two-plane matrix are illustrated in Fig. 3.9.

Figure 3.9: possible block partitionings of the two-plane matrix; (a) 1D partitioning, (b) 2D partitioning.

An illustration of the importance of using ordering strategies which fit the structure of the reduced grid is given in Fig. 3.10: the sparsity patterns of a single $n^2 \times n^2$ diagonal block of the two-plane matrix and of the matrix that corresponds to point natural lexicographic ordering are depicted. The matrices in the figure arise from discretization on a $12 \times 12 \times 12$ grid. If the two matrices are regarded as block tridiagonal, the bandwidth of the block diagonal part of the two-plane matrix is significantly smaller. At the same time, if only diagonals whose indices do not depend on $n$ are considered, then the two-plane matrix has more of them grouped next to the main diagonal.

Figure 3.10: a zoom on 2D blocks of the matrices corresponding to two ordering strategies of the reduced grid; (a) lexicographic, (b) two-plane.

In Fig. 3.11 the spectral radii of the one-dimensional Jacobi iteration matrices for both the two-plane ordering and the natural lexicographic ordering are plotted. The graphs correspond to a constant coefficient case, and two cross-sections of the mesh Reynolds numbers are given. As is evident, the two-plane iteration matrix has a smaller spectral radius. This results in faster convergence.

Figure 3.11: comparison of the spectral radii of block Jacobi iteration matrices for certain cross-sections of the mesh Reynolds numbers; (a) $\beta = \gamma = \delta$, (b) $\gamma = \delta = 0$. The upper curves correspond to lexicographic ordering; the lower curves correspond to two-plane ordering. The matrices were obtained using centered difference discretization.
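Spectral radii such as those plotted in Fig. 3.11 can be estimated directly from an assembled reduced matrix. The sketch below is ours: it assumes $S$ is given as a SciPy sparse matrix whose unknowns are already grouped into equal-sized diagonal blocks ($2n \times 2n$ for the 1D two-plane splitting), and it uses plain power iteration, which is adequate when the dominant eigenvalue of the iteration matrix is real and simple.

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def block_jacobi_rho(S, block_size, tol=1e-8, maxit=5000, seed=0):
    """Estimate rho(M^{-1} N) for the block Jacobi splitting S = M - N,
    where M is the block diagonal part of S with blocks of size block_size."""
    S = S.tocsc()
    n = S.shape[0]
    assert n % block_size == 0
    blocks = [S[i:i + block_size, i:i + block_size].toarray()
              for i in range(0, n, block_size)]
    M = sp.block_diag(blocks, format='csc')
    N = M - S
    solve_M = spla.factorized(M)          # sparse LU of M, reused every step
    x = np.random.default_rng(seed).standard_normal(n)
    x /= np.linalg.norm(x)
    rho_old = 0.0
    for _ in range(maxit):
        x = solve_M(N @ x)                # one application of M^{-1} N
        rho = np.linalg.norm(x)
        x /= rho
        if abs(rho - rho_old) < tol * max(rho, 1.0):
            break
        rho_old = rho
    return rho
```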
It is clear, then, that block ordering strategies such as the two-plane ordering can be more effective than natural lexicographic ordering of the reduced grid. At this point, we compare the two families of block orderings that have been introduced in the previous sections: the two-plane ordering and the two-line ordering.

Theorem 3.1. If the reduced matrix is an M-matrix, then for both the 1D and the 2D partitionings the spectral radius of the Jacobi iteration matrix associated with two-plane ordering is smaller than the spectral radius of the iteration matrix associated with the two-line ordering.

Proof. Each of the two ordering strategies generates a matrix which is merely a symmetric permutation of a matrix associated with the other ordering. Suppose $S_1 = M_1 - N_1$ is either a 1D splitting or a 2D splitting of the two-plane ordering matrix, and $S_2 = M_2 - N_2$ is the analogous splitting for the two-line ordering, with 1D and 2D blocks of gridpoints oriented in the same direction. There exists a permutation matrix $P$ such that $P^T S_2 P = S_1$. Consider the splitting $P^T S_2 P = P^T M_2 P - P^T N_2 P$. It is straightforward to show, by examining the matrix entries, that $P^T N_2 P \ge N_1$. The latter are both nonnegative matrices, thus by [113, Thm 3.15] it follows that $0 \le \rho(M_1^{-1} N_1) \le \rho\bigl((P^T M_2 P)^{-1}(P^T N_2 P)\bigr) = \rho(M_2^{-1} N_2) < 1$. □

Some of our numerical experiments, presented in Fig. 3.12, validate the results indicated in Thm. 3.1. In the figure, spectral radii of the block Jacobi iteration matrices are presented. It is interesting to observe that the spectral radius of the two-plane iteration matrix is smaller also in the case $be, cd, fg < 0$, which corresponds to the region of mesh Reynolds numbers larger than 1, for which the theorem does not apply. A few cross-sections of mesh Reynolds numbers are examined. For example, graph (a) corresponds to flow with the same velocity in the x, y and z directions; graph (b) corresponds to flow only in the x and y directions, with no convection in the z direction; and so on [see Eq. (1.16) for the definition of $\beta$, $\gamma$ and $\delta$].

Figure 3.12: spectral radii of iteration matrices vs. mesh Reynolds numbers for the block Jacobi scheme, using 1D splitting and centered difference discretization; (a) $\beta = \gamma = \delta$, (b) $\beta = \gamma$, $\delta = 0$, (c) $\gamma = \delta = 0$, (d) $\beta = 0$, $\gamma = \delta$, (e) $\beta = \gamma = 0$, (f) $\beta = \delta = 0$. The broken lines correspond to two-plane ordering; the solid lines correspond to two-line ordering.

We note that the bandwidth of the two-plane matrix is larger than the bandwidths of the natural lexicographically ordered matrix and the two-line matrix: $n^2 + 2n$ vs. $n^2$. However, the difference is negligible, as typically $2n \ll n^2$. A smaller bandwidth can be obtained by using direct solver orderings such as RCM (see Fig. 3.13). However, numerical experiments that have been performed with this ordering show slower convergence of block stationary methods, compared to the two-plane ordering.
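The bandwidth comparison above is easy to reproduce. The sketch below is ours; `S_two_plane` is a hypothetical placeholder name for an assembled reduced matrix. It computes the bandwidth of a given ordering and of its symmetric reverse Cuthill-McKee reordering, using SciPy's reverse_cuthill_mckee on the symmetrized sparsity pattern.

```python
import numpy as np
from scipy.sparse.csgraph import reverse_cuthill_mckee

def bandwidth(S):
    """Largest |i - j| over the nonzero entries of S."""
    coo = S.tocoo()
    return int(np.abs(coo.row - coo.col).max())

def rcm_bandwidth(S):
    """Bandwidth after a symmetric reverse Cuthill-McKee reordering."""
    pattern = (abs(S) + abs(S.T)).tocsr()
    p = reverse_cuthill_mckee(pattern, symmetric_mode=True)
    return bandwidth(S.tocsr()[p][:, p])

# bandwidth(S_two_plane) is n^2 + 2n; rcm_bandwidth(S_two_plane) is typically
# smaller, yet, as noted above, block stationary methods converged more
# slowly with the RCM ordering in our experiments.
```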
Next, we consider the issue of consistent ordering. Recall the definition of consistently ordered matrices (see [67],[95],[117]):

Definition 3.8. A matrix is said to be consistently ordered if the vertices of its adjacency graph can be partitioned into $p$ sets $S_1, S_2, \dots, S_p$ with the property that any two adjacent vertices $i$ and $j$ in the graph belong to two consecutive partitions $S_k$ and $S_{k'}$, with $k' = k - 1$ if $j < i$, and $k' = k + 1$ if $j > i$.

A matrix that is consistently ordered has Property A; conversely, a matrix with Property A can be permuted so that it is consistently ordered [95]. It should be noted that not all consistently ordered matrices for a given problem have the same Jordan canonical form [47]. This is of importance, because the size of the Jordan block associated with the largest eigenvalue (in magnitude) affects the magnitude of the spectral norm of the iteration matrix; the latter affects the speed of convergence [47],[113].

It was mentioned in the Introduction that the unreduced matrix has Property A relative to partitioning into $1 \times 1$ matrices. For the reduced system, we have the following observations:

Proposition 3.1. The reduced matrix associated with two-line ordering, $S_L$, does not have Property A relative to 1D or 2D partitionings.

Proof. This can be deduced directly from the structure of the block computational molecules of the 1D and 2D partitionings. Formally, let $S_{i,j}$ denote the $(i,j)$th $n \times n$ block of $S_L$, and let $Q$ be an $(n^2/2) \times (n^2/2)$ matrix whose entries satisfy $q_{i,j} = 1$ if $S_{i,j} \ne 0$ and $q_{i,j} = 0$ otherwise. Let $T$ be an $n \times n$ matrix such that $t_{i,j} = 1$ if the $(i,j)$th $(n^2/2) \times (n^2/2)$ block submatrix of $S_L$ is nonzero, and $t_{i,j} = 0$ otherwise. $T$ is a pentadiagonal matrix, thus does not have Property A. Since $Q$ could have Property A only if $T$ did, it does not have Property A either. □

Proposition 3.2. The reduced matrix associated with two-plane ordering, $S_P$, does not have Property A relative to 1D partitioning.

Proof. Let $S_{i,j}$ denote the $(i,j)$th $2n \times 2n$ block of $S_P$, and let $Q$ be an $(n^2/4) \times (n^2/4)$ matrix
In Chapter 4 we wil l address this point and provide some analysis which overcomes the loss of Proper ty A for the two-plane matr ix , relative to I D part i t ioning. 63 Chapter 4 Convergence Analys is for B lock Stat ionary Methods In this chapter bounds on the convergence rates of the Jacobi , Gauss-Seidel, and S O R schemes are derived. F i r s t , the constant coefficient case is analyzed, and then the results are generalized to the variable coefficient case. We consider block splittings, which well suit the block structure of the reduced matr ix . In particular, the I D and the 2D splittings discussed in Chapter 3 are considered. Since the two-plane ordering was shown in Chapter 3 to be more effective than other orderings, we focus on matrices associated with this ordering. We analyze the block Jacobi scheme, derive upper bounds on convergence rates, and demonstrate their tightness. We then discuss consistently ordered matrices [113],[117] in the context of the reduced matr ix , and use our observations to derive bounds for the block Gauss-Seidel and S O R schemes. Final ly , convergence results are derived for the unreduced (original) system, and are compared to the bounds on convergence rates for the reduced system. 4.1 Symmetrization of the Reduced System In order to estimate the spectral radii of the iteration matrices involved in solving the reduced system, the following idea of E l m a n & Golub [40],[41] is used: denoting the reduced matr ix by S, suppose there exists a real diagonal nonsingular matr ix Q, such that S = Q~1SQ is symmetr ic . The symmetry can then be used as follows: suppose S = D — C is a spl i t t ing, such that D = Q~lDQ is symmetric positive definite, and C — Q~lCQ is symmetr ic . Since D~lC 64 Chapter 4. Convergence Analysis for Block Stationary Methods and D~XC are similar, it follows that: p(D-xC) = p{D-xC) < \\D-%\\C\\2 = - 4 £ L . (4.1) ^minyl-J j The latter is useful, since the eigenvalues of a symmetric matr ix are real, are positive if the mat r ix is positive definite, and it is easier to compute them (compared to nonsymmetric ma-trices) . 4.1.1 T h e C o n s t a n t Coefficient Case Consider the model problem (1.15). Let a, &, c, d, e, /, g be the components of the computat ional molecule [see E q . (1.17), (1.18) and (1.19)]. We have: T h e o r e m 4 .1. The reduced matrix S can be symmetrized with a real diagonal similarity trans-formation if and only if the products bcde, befg and cdfg are positive. Proof. For a certain ordering, the corresponding matr ix is merely a symmetric permutat ion of a mat r ix which is associated wi th another ordering. Thus we can examine the symmetr izat ion conditions for the matr ix associated with a particular ordering without loss of generality. We thus refer to the particular ordering whose associated matr ix was described in detail in Chapter 3, namely 2 P N x z . Our aim is to find a real diagonal matr ix Q , so that Q~lSQ is symmetr ic . Suppose Q — diag[<3i,i, Q 1 > 2 , Q i , f , Q2.i1 Q2,2, Q2.f1 Q f , f ] , where each matr ix Qj . / i 1 < j-, I < f 1 is a diagonal 2n X 2n matr ix whose entries are denoted by: Q i ^ d i a g ^ , . . . , ^ ] . (4.2) Consider first the diagonal block matrices 5,-°-'. We require that Qj}Si0}Qj J be symmetric . 
Notice that for any j, S_{j,j}^{(0)} is a notation that corresponds to n/2 submatrices, but in order to avoid additional notation we do not add a script indicating the location of these matrices in each block, as it can be understood from the double index used for the submatrices of Q. A straightforward computation for the entries of S_{j,j}^{(0)}, 1 ≤ j ≤ n/2, and for all the n/2 submatrices in each block S_{j,j}, leads to the following conditions:

    ( q_i^{(j,l)} / q_{i−4}^{(j,l)} )² = c²/d² ,   i = 5, ..., 2n ,   (4.3)
    ( q_i^{(j,l)} / q_{i−3}^{(j,l)} )² = cf/dg ,   i = 4, 6, 8, ..., 2n ,   (4.4)
    ( q_i^{(j,l)} / q_{i−2}^{(j,l)} )² = ce/bd ,   1 ≤ i ≤ 2n ,  i mod 4 = 2 or 3 ,   (4.5)
    ( q_i^{(j,l)} / q_{i−2}^{(j,l)} )² = bc/de ,   1 ≤ i ≤ 2n ,  i mod 4 = 0 or 1 ,   (4.6)
    ( q_i^{(j,l)} / q_{i−1}^{(j,l)} )² = cg/df ,   i = 3, 5, 7, ..., 2n−1 ,   (4.7)
    ( q_i^{(j,l)} / q_{i−1}^{(j,l)} )² = ef/bg ,   1 ≤ i ≤ 2n ,  i mod 4 = 2 ,   (4.8)
    ( q_i^{(j,l)} / q_{i−1}^{(j,l)} )² = bf/eg ,   1 ≤ i ≤ 2n ,  i mod 4 = 0 .   (4.9)

The restriction that the diagonal similarity transformation be real leads to the following conditions:

- Equations (4.4) and (4.7) imply cdfg > 0.
- Equations (4.5) and (4.6) imply bcde > 0.
- Equations (4.8) and (4.9) imply befg > 0.

We choose q_1^{(j,l)}, 1 ≤ j, l ≤ n/2, arbitrarily, and use Equations (4.7), (4.8) and (4.9) to determine q_i^{(j,l)}, 2 ≤ i ≤ 2n, 1 ≤ j, l ≤ n/2. In so doing, we must make sure that Equations (4.3)-(4.6) are consistent with Equations (4.7)-(4.9). Indeed there is full consistency. Below we show this for entries whose index i satisfies i mod 4 = 0; the same procedure applies to the other values of i. Since (i−1) mod 4 = 3, applying (4.7) for i−1 (rather than i) and multiplying it by (4.9) we obtain

    ( q_i^{(j,l)} / q_{i−2}^{(j,l)} )² = (cg/df)·(bf/eg) = bc/de ,

which is exactly (4.6). Next, since (i−2) mod 4 = 2, we can use (4.8) to conclude that ( q_{i−2}^{(j,l)} / q_{i−3}^{(j,l)} )² = ef/bg, and combining this equation with (4.6) we obtain

    ( q_i^{(j,l)} / q_{i−3}^{(j,l)} )² = (bc/de)·(ef/bg) = cf/dg ,

which is identical to Equation (4.4). Finally, (i−3) mod 4 = 1, therefore (4.7) implies ( q_{i−3}^{(j,l)} / q_{i−4}^{(j,l)} )² = cg/df. Multiplying this by Equation (4.4) yields

    ( q_i^{(j,l)} / q_{i−4}^{(j,l)} )² = (cf/dg)·(cg/df) = c²/d² ,

which is identical to (4.3). This completes the proof of consistency for indices which correspond to Equation (4.9). For indices which satisfy either Eq. (4.7) or Eq. (4.8) the process is completely analogous, and the algebraic details are omitted.

The same procedure repeats when considering the off-diagonal matrices of the main block diagonals, namely S_{j,j}^{(±1)}, and the off-diagonal block matrices S_{j,j±1}. For the former we have the following equation, which determines the ratio between q_i^{(j,l+1)} and q_i^{(j,l)}, and thus defines the values of q_i^{(j,l+1)} using q_i^{(j,l)}:

    ( q_i^{(j,l+1)} / q_i^{(j,l)} )² = g²/f² ,   1 ≤ i ≤ 2n .   (4.10)

Notice that (4.10) establishes conditions for values that were previously considered arbitrary, namely q_1^{(j,l)}, l > 1. At this stage only q_1^{(j,1)}, 1 ≤ j ≤ n/2, are left arbitrary. In addition, we have the following (in the equations that appear below, l goes from 1 to n/2 − 1, and j goes from 1 to n/2):

    ( q_{i−1}^{(j,l−1)} / q_i^{(j,l)} )² = bf/eg ,   1 ≤ i ≤ 2n ,  i mod 4 = 2 ,   (4.11)
    ( q_i^{(j,l−1)} / q_{i−1}^{(j,l)} )² = cf/dg ,   i = 3, 5, ..., 2n−1 ,   (4.12)
    ( q_i^{(j,l−1)} / q_{i−1}^{(j,l)} )² = df/cg ,   i = 4, 6, ..., 2n ,   (4.13)
    ( q_{i−1}^{(j,l−1)} / q_i^{(j,l)} )² = ef/bg ,   1 ≤ i ≤ 2n ,  i mod 4 = 0 .   (4.14)

Equations (4.11)-(4.14) can all be obtained from combinations of the group of Equations (4.3)-(4.10); thus they are consistent and do not impose any additional restrictions.

Next, the off-diagonal block matrices S_{j,j±1} are examined. We start with S_{j,j±1}^{(0)}. The following conditions, for 2 ≤ j ≤ n/2 and 1 ≤ l ≤ n/2, need to be satisfied:

    ( q_i^{(j+1,l)} / q_i^{(j,l)} )² = b²/e² ,   1 ≤ i ≤ 2n ,   (4.15)
    ( q_i^{(j+1,l)} / q_{i−1}^{(j,l)} )² = bf/eg ,   1 ≤ i ≤ 2n ,  i mod 4 = 2 ,   (4.16)
    ( q_i^{(j+1,l)} / q_{i−2}^{(j,l)} )² = bc/de ,   1 ≤ i ≤ 2n ,  i mod 4 = 2 or 3 ,   (4.17)
    ( q_i^{(j+1,l)} / q_{i−1}^{(j,l)} )² = bg/ef ,   1 ≤ i ≤ 2n ,  i mod 4 = 0 ,   (4.18)
    ( q_i^{(j+1,l)} / q_{i−2}^{(j,l)} )² = bd/ce ,   1 ≤ i ≤ 2n ,  i mod 4 = 0 or 1 .   (4.19)

Equation (4.15) defines the values of q_i^{(j+1,l)}, 1 ≤ j ≤ n/2 − 1, given q_i^{(j,l)}. Imposing this condition leaves only q_1^{(1,1)} arbitrary. Eq. (4.16)-(4.19) are consistent with the previous equations:

- Equations (4.15) + (4.8) imply (4.16).
- Equations (4.15) + (4.5) imply (4.17).
- Equations (4.15) + (4.9) imply (4.18).
- Equations (4.15) + (4.6) imply (4.19).

In order for Equations (4.15)-(4.19) to hold with a real Q, we require bcde, befg > 0. These conditions are contained in the three conditions that were imposed when Equations (4.3)-(4.9) were discussed.

Finally, the matrices S_{j,j±1}^{(±1)} are considered. For these we require that for 1 ≤ j ≤ n/2:

    ( q_i^{(j+1,l−1)} / q_{i−1}^{(j,l)} )² = bf/eg ,   1 ≤ i ≤ 2n ,  i mod 4 = 0 ,  2 ≤ l ≤ n/2 ,   (4.20)
    ( q_i^{(j+1,l+1)} / q_{i−1}^{(j,l)} )² = bg/ef ,   1 ≤ i ≤ 2n ,  i mod 4 = 2 ,  1 ≤ l ≤ n/2 − 1 .   (4.21)

Equations (4.18)+(4.10) imply (4.20), whereas Equations (4.15)+(4.11) imply (4.21):

    ( q_i^{(j+1,l−1)} / q_{i−1}^{(j,l)} )² = ( q_i^{(j+1,l)} / q_{i−1}^{(j,l)} )² · ( q_i^{(j+1,l−1)} / q_i^{(j+1,l)} )² = (bg/ef)·(f²/g²) = bf/eg ;   (4.22)

    ( q_i^{(j+1,l+1)} / q_{i−1}^{(j,l)} )² = ( q_i^{(j,l+1)} / q_{i−1}^{(j,l)} )² · ( q_i^{(j+1,l+1)} / q_i^{(j,l+1)} )² = (eg/bf)·(b²/e²) = bg/ef .   (4.23)

Equations (4.20) and (4.21) can hold for a real Q only if befg > 0, which is a condition that has already been imposed. •

In terms of the mesh Reynolds numbers, Theorem 4.1 leads to:

Corollary 4.1. The reduced matrix S can be symmetrized with a real diagonal similarity transformation for any β, γ, δ > 0 if one uses the upwind (backward) schemes, and for either |β|, |γ|, |δ| < 1 or |β|, |γ|, |δ| > 1 if one uses centered difference schemes.

Proof. For upwind schemes cdfg = (1+2β)(1+2δ), which is always positive for positive values of β, γ and δ. The same is true for befg = (1+2γ)(1+2δ) and bcde = (1+2β)(1+2γ). For centered difference schemes cdfg = (1−β²)(1−δ²), befg = (1−γ²)(1−δ²), and bcde = (1−β²)(1−γ²). For cdfg to be positive, we require that either |β| < 1 and |δ| < 1, or |β| > 1 and |δ| > 1. If |β| < 1 and |δ| < 1, then befg > 0 implies |γ| < 1, and then bcde > 0 holds as well. If |β|, |δ| > 1, then the same argument yields |γ| > 1. •

In order to construct the matrix Q, Equations (4.7)-(4.15) are used. However, from the proof it is clear how the symmetrized matrix Q⁻¹SQ looks, and there is no need to construct Q and actually perform the similarity transformation. Moreover, the symmetrizer might contain very large values, and using it might therefore cause numerical difficulties.
In [40] it is shown that in the one-dimensional case, for the equation −u″ + σu′ = f with Dirichlet boundary conditions, if the first entry of the symmetrizer is 1 and the mesh Reynolds number is between 0 and 1, then the last entry of the symmetrizer tends to e^σ as n goes to infinity, and thus is very large for large values of the underlying PDE coefficient. Instability occurs in the three-dimensional problem as well, as is now shown:

Theorem 4.2. Solving the symmetrized linear system obtained by using the diagonal symmetrizer Q is numerically unstable.

Proof. Symmetrizing by Q means that instead of solving the system Sx = w, a linear system of the form

    S̃x̃ = w̃   (4.24)

is solved, where S̃ = Q⁻¹SQ is symmetric, x̃ = Q⁻¹x and w̃ = Q⁻¹w. The matrix Q is constructed as follows:

    choose any value for q_1^{(1,1)}
    for i = 2 to 2n
        if i mod 2 = 1 :  q_i^{(1,1)} = sqrt(cg/df) · q_{i−1}^{(1,1)}
        if i mod 4 = 0 :  q_i^{(1,1)} = sqrt(bf/eg) · q_{i−1}^{(1,1)}
        if i mod 4 = 2 :  q_i^{(1,1)} = sqrt(ef/bg) · q_{i−1}^{(1,1)}
    for l = 2 to n/2
        for i = 1 to 2n
            q_i^{(1,l)} = (g/f) · q_i^{(1,l−1)}
    for j = 2 to n/2
        for l = 1 to n/2
            for i = 1 to 2n
                q_i^{(j,l)} = (b/e) · q_i^{(j−1,l)}

From the algorithm it follows that the entries of Q are not bounded as n tends to infinity. For example, the last line of the algorithm implies that if |b/e| ≫ 1, then even if q_1^{(j,1)} is small, there exists i such that computing q_i^{(j,l)} will cause overflow. Such a situation occurs when γ ≈ 1 and centered differences are used. In this situation, if the other two mesh Reynolds numbers are small, the original matrix is well conditioned; hence the instability arises from the symmetrization, not from the conditioning of the matrix. •

From Theorem 4.2 we deduce, then, that the symmetrized matrix should be used merely for the convergence analysis; in actual numerical computation, the nonsymmetric system is used.

The entries of the symmetrizer can be determined up to sign. For example, for be, cd, fg > 0, a symmetrization operator that preserves the signs of the matrix entries transforms the entries as follows:

    −b² → −be ;   −c² → −cd ;   −d² → −cd ;   −e² → −be ;   −f² → −fg ;   −g² → −fg ;
    −2bf → −2√(befg) ;  −2cf → −2√(cdfg) ;  −2df → −2√(cdfg) ;  −2ef → −2√(befg) ;
    −2bg → −2√(befg) ;  −2cg → −2√(cdfg) ;  −2dg → −2√(cdfg) ;  −2eg → −2√(befg) ;
    −2bc → −2√(bcde) ;  −2bd → −2√(bcde) ;  −2ce → −2√(bcde) ;  −2de → −2√(bcde) .

The value of the computational molecule corresponding to the center point is unchanged under the symmetrization operation.

4.1.2 The Variable Coefficient Case

Consider Equation (2.11). As in the constant coefficient case, we denote the diagonal symmetrizer by Q, and require symmetry element by element to derive the conditions on the entries of Q. Let q_l denote the lth diagonal entry of Q; the requirement that the (l,m) and (m,l) entries of Q⁻¹SQ coincide translates into

    ( q_l / q_m )² = s_{l,m} / s_{m,l} .   (4.25)

It is sufficient to look at 2n×2n blocks. We start with the main diagonal blocks, and examine all the entries of the main block that appear in the lth row of the matrix, namely s_{l,l−4}, s_{l,l−3}, ..., s_{l,l}, ..., s_{l,l+4}. For s_{l,l−4}, if l mod 2n ≥ 5 or is equal to 0, then l−4 corresponds to the (i−2, j, k) mesh point.
Thus

    s_{l−4,l} = − c_{i−1,j,k} c_{i,j,k} / a_{i−1,j,k} ,   (4.26)

and from this it follows that

    ( q_l / q_{l−4} )² = ( c_{i−1,j,k} c_{i,j,k} ) / ( d_{i−2,j,k} d_{i−1,j,k} ) .   (4.27)

In this case the values associated with the center of the computational molecule (namely the a_{i,j,k}) cancel, but this happens only for rows that involve the (i±2, j, k), (i, j±2, k) and (i, j, k±2) gridpoints. Applying the same procedure to the rest of the entries of the main diagonal block we obtain

    ( q_l / q_{l−3} )² = [ c_{i,j,k−1} f_{i,j,k} / a_{i,j,k−1} + c_{i,j,k} f_{i−1,j,k} / a_{i−1,j,k} ]
                       / [ g_{i,j,k−1} d_{i−1,j,k−1} / a_{i,j,k−1} + d_{i−1,j,k} g_{i−1,j,k−1} / a_{i−1,j,k} ] ,
        l mod 2n = 0, 4, 6, ..., 2n−2 ;   (4.28)

    ( q_l / q_{l−2} )² = [ c_{i,j+1,k} e_{i,j,k} / a_{i,j+1,k} + c_{i,j,k} e_{i−1,j,k} / a_{i−1,j,k} ]
                       / [ b_{i,j+1,k} d_{i−1,j+1,k} / a_{i,j+1,k} + d_{i−1,j,k} b_{i−1,j+1,k} / a_{i−1,j,k} ] ,
        l mod 4 = 2 or 3 ;   (4.29)

    ( q_l / q_{l−2} )² = [ b_{i−1,j,k} c_{i,j,k} / a_{i−1,j,k} + c_{i,j−1,k} b_{i,j,k} / a_{i,j−1,k} ]
                       / [ d_{i−1,j,k} e_{i−1,j−1,k} / a_{i−1,j,k} + e_{i,j−1,k} d_{i−1,j−1,k} / a_{i,j−1,k} ] ,
        l mod 4 = 0 or 1 ;   (4.30)

    ( q_l / q_{l−1} )² = [ c_{i,j,k} g_{i−1,j,k} / a_{i−1,j,k} + c_{i,j,k+1} g_{i,j,k} / a_{i,j,k+1} ]
                       / [ d_{i−1,j,k} f_{i−1,j,k+1} / a_{i−1,j,k} + f_{i,j,k+1} d_{i−1,j,k+1} / a_{i,j,k+1} ] ,
        l mod 2n = 3, ..., 2n−1 ;   (4.31)

    ( q_l / q_{l−1} )² = [ e_{i,j,k−1} f_{i,j,k} / a_{i,j,k−1} + f_{i,j+1,k} e_{i,j,k} / a_{i,j+1,k} ]
                       / [ g_{i,j,k−1} b_{i,j+1,k−1} / a_{i,j,k−1} + b_{i,j+1,k} g_{i,j+1,k−1} / a_{i,j+1,k} ] ,
        l mod 4 = 2 ;   (4.32)

    ( q_l / q_{l−1} )² = [ b_{i,j,k−1} f_{i,j,k} / a_{i,j,k−1} + f_{i,j−1,k} b_{i,j,k} / a_{i,j−1,k} ]
                       / [ g_{i,j,k−1} e_{i,j−1,k−1} / a_{i,j,k−1} + e_{i,j−1,k} g_{i,j−1,k−1} / a_{i,j−1,k} ] ,
        l mod 4 = 0 .   (4.33)

As is evident, Equations (4.31)-(4.33) are sufficient to determine all the diagonal entries, except the first entry in each 2n×2n block, which at this stage can be chosen arbitrarily. We must then make sure that Eq. (4.27)-(4.30) are consistent with these three equations, and this requirement imposes some additional conditions. In the constant coefficient case there is unconditional consistency. The problematic nature of the variable coefficient case can be demonstrated simply by looking at one of the consistency conditions. Consider the (i,j,k) gridpoint whose associated index l satisfies l mod 4 = 0. Applying (4.31) to l−1 means looking at the row corresponding to the (i, j−1, k−1) gridpoint, and multiplying (4.31),
(4.35) The analysis for off-diagonal blocks is identical, and the following addit ional conditions are obtained: ,9l-2n, fi,j,kfi,j,k—l 9i,j,k-l9i,j,k-2 2 n < l < (4.36) qi V_ b h J ^ k b h h k n 2 < l < ^ . (4.37) <7/_n2 / e , - i j _ i i f c e j j _ 2 , f c The above two equations determine the rest of the entries of the matr ix , and only the first entry in the symmetrizer can be determined arbitrarily. Last , in order for the symmetrizer to be real, the products c ;d ;_ i , bjej-i and fkgk-\ must have the same sign. The actual meaning of the conditions stated above is that the 74 Chapter 4. Convergence Analysis for Block Stationary Methods continuous problem has to be separable: in (2.11) the P D E variable coefficients should satisfy p — p(x), q = q(y), r — r(z), s = s(x), t = t(y) and w = w{z). In this sense the three-dimensional problem has the same behavior as the two-dimensional problem [42]. We can now summarize all that has been said in the following symmetrizat ion theorem: T h e o r e m 4.3. Suppose the operator of (2.11) is separable. J/c,-d,-_i, bjej_i and fkOk-i are all nonzero and have the same sign for all i, j and k, then there exists a real nonsingular diagonal matrix Q such that Q _ 1 S Q is symmetric. A s in the constant coefficient case, the symmetrized computat ional molecule should be de-rived without actually performing the similari ty transformation. For example, the symmetrized value corresponding to - c , ' J J f c C | ~ 1 ^ , f c is — ^  c,-l<hkChiMi-'*,hkd>-i,hk • the symmetrized value cor-responding to - ( c , J ' t - ' / i j ' t + £Li*Zi=LL*) i s _ 1 { C k _ l J t ] h k c, _ ^gt]hk 1 r f t _ 1 , J > _ 1 d i _ 1 j , f c g i _ L i l j i _ i , d ff j b j e m j V > ai,j,k-l ai-l,j,k ' v ai,j,k-l <lt-l,j,fc ; ' K separable the value at the center of the computat ional molecule is given by a,- ,• k — J'k9k~1— ai,j,k — l B j b j + l c , ( j ' ~ 1 d , c ' + 1 6 J e j - i _ 9kfk4-\ a n c j j g u n c h a n r r e c l u n d e r the s imilar i ty transformation. ai,j+l,k ° t - l , j ' , f c O i + l , j , f c ai,j-l,k a t , j ' , / c + l ° J 4.2 Bounds on Convergence Rates Below we attach the subscripts ' 1 ' and '2 ' to matrices associated wi th the I D and 2D parti-tionings respectively. Let S =• D\ — C\ = Di — C2 be the corresponding split t ings, and let Di = Q~lDiQ and C{ = Q_1CiQ, where Q is the diagonal symmetrizer. Below we refer to the part icular mat r ix 2 P N x z (convergence results for other matrices in this family can be obtained by appropriate interchange of indices or roles of x, y and z). Consider the constant coefficient case, wi th be, cd, fg > 0. Definition 4.1. An interior block is a 2n x 2n block sf*} whose associated set of grid points consists of points whose y and z coordinates are not 1 / (T I+ 1) or n / ( n + 1). No restriction is imposed on x coordinates. We can focus on interior blocks, because the minimal eigenvalues of Di are sought; since non-interior blocks differ only on their diagonals, and these are larger algebraically than the 75 . Chapter 4. Convergence Analysis for Block Stationary Methods diagonals of interior blocks if be, cd, fg > 0, the latter have larger minimal eigenvalues, compared to interior blocks. Let us define a few auxiliary matrices and constants: r = -2^/befg ; s = -2^cdfg ; Rn = tvi[E01, 0, Ew] ; Sn = tvi[E10,0, En] ; Tn = t r i [ l , 0,1] ; Un = penta[l , 0, 0,0,1] ; Vn = s e p t a [ £ 1 0 , 0, 0, 0,0, 0, E01] ; Zn = tri[s, r, s] . (4.38) The subscript n stands for the order of the matrices. 
Notice that all the above matrices are symmetric (see Sec. 1.3 for an explanation of the notation). The matrix U_n² has 1's on its fourth superdiagonal and subdiagonal, 2's on its main diagonal except for the first two entries and last two entries, where the values are 1, and zeros elsewhere. If we define

    W_n = (a² − 2be − 2fg)·I_n − 2√(bcde)·U_n − cd·U_n²   (4.39)

and

    X_n = −2√(cdfg)·(R_n + V_n) − 2√(befg)·S_n ,   (4.40)

then an interior block of D̃₁ is given by

    S_{j,j}^{(0)} = W_{2n} + X_{2n} .   (4.41)

We now examine the eigenvalues of W_{2n} and X_{2n}. In the following we shall make use of Kronecker products (see, for example, Barnett [10]), denoted by the symbol ⊗. Recall that for any two matrices A and B of sizes m×n and p×q respectively,

    A ⊗ B = [ a_{1,1}B  a_{1,2}B  ···  a_{1,n}B ; ... ; a_{m,1}B  a_{m,2}B  ···  a_{m,n}B ] .   (4.42)

In (4.42) each of the blocks is of size p×q, and A ⊗ B is of size (mp)×(nq). Kronecker products have a few useful properties [10], of which we mention in particular the ones we will use for deriving bounds:

- Any four matrices A, B, C, D of appropriate sizes satisfy

    (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD) .   (4.43)

- If {λᵢ, uᵢ} and {μᵢ, vᵢ} are the eigenvalue/eigenvector pairs of A and B respectively, then

    (A ⊗ B)(uᵢ ⊗ vⱼ) = (λᵢ μⱼ)(uᵢ ⊗ vⱼ) .   (4.44)

Using these properties, we have:

Lemma 4.1. The eigenvalues of W_{2n} are given by

    a² − 2be − 2fg − 4√(bcde)·cos(πjh) − 4cd·cos²(πjh) ,   j = 1, ..., n ,   (4.45)

each with algebraic multiplicity 2.

Proof. Since

    U_{2n} = T_n ⊗ I₂ ,   (4.46)

the eigenvalues of U_{2n} are {2cos(πjh)}, j = 1, ..., n, each of algebraic multiplicity 2. W_{2n} is a polynomial in U_{2n}, therefore it has the eigenvalues stated in (4.45). •

Lemma 4.2. The matrix X_{2n} has the following eigenvalues:

    λⱼ^± = ±[ 2√(befg) + 4√(cdfg)·cos(πjh) ] ,   j = 1, ..., n .   (4.47)

Proof. Let Y₂ = [ 0 1 ; 1 0 ]. Then

    X_{2n} = Z_n ⊗ Y₂ ,   (4.48)

and thus by (4.44) the eigenvalues of X_{2n} are all the products of eigenvalues of Z_n and Y₂, given by [r + 2s·cos(πjh)]·(±1), 1 ≤ j ≤ n. •

Lemmas 4.1 and 4.2 can now be used to establish the following:

Theorem 4.4. The eigenvalues of interior blocks are

    μⱼ^± = a² − 2be − 2fg − 4√(bcde)·cos(πjh) − 4cd·cos²(πjh) ± [ 2√(befg) + 4√(cdfg)·cos(πjh) ] ,   j = 1, ..., n .   (4.49)

Proof. Applying (4.43), using (4.46) and (4.48), we have

    U_{2n}X_{2n} = (T_n Z_n) ⊗ (I₂ Y₂) ,   (4.50a)
    X_{2n}U_{2n} = (Z_n T_n) ⊗ (Y₂ I₂) .   (4.50b)

Since I₂Y₂ = Y₂I₂ = Y₂ and T_nZ_n = Z_nT_n, we conclude that

    X_{2n}U_{2n} = U_{2n}X_{2n} ,   (4.51)

hence X_{2n} and U_{2n} commute, which means that X_{2n} and W_{2n} have common eigenvectors and can be simultaneously diagonalized; the eigenvalues of S_{j,j}^{(0)} are the sums of the eigenvalues of X_{2n} and W_{2n}, given in (4.49). •

We remark that another way of analyzing the spectrum of X_{2n} is by using the relation (X_{2n})² = (Z_{2n})².

We are now ready to prove:

Theorem 4.5. The symmetric matrix D̃₁ is positive definite if be, cd, fg > 0. The eigenvalues given in (4.49) are also eigenvalues of D̃₁, each of multiplicity (n/2 − 2)². The rest of the eigenvalues of D̃₁ all lie in the interval [min_j(μⱼ^±), max_j(μⱼ^±) + be + cd + fg]. The minimal eigenvalue of D̃₁, min_j(μⱼ^±), is given by

    η = a² − 2be − 2fg − 2√(befg) − 4(√(bcde) + √(cdfg))·cos(πh) − 4cd·cos²(πh) .   (4.52)

Proof. The eigenvalues of D̃₁ are positive by combining Lemma 2.5, [113, Cor. 2] and [113, Thm. 3.12].
For the eigenvalues (4.49), the multiplicity is determined by counting the number of interior blocks. Since non-interior blocks differ from interior blocks only in their diagonals, and only by between 0 and be + cd + fg in each diagonal entry, the minimal eigenvalue is attained in interior blocks and is thus given exactly by (4.52). Using the equality between the 2-norm and the spectral radius of symmetric matrices, together with the triangle inequality, yields the upper bound max_j(μⱼ^±) + be + cd + fg for the eigenvalues. •

Next, D₂ − D₁ is considered. For a given value of j (1 ≤ j ≤ n/2), denote the jth block of D₂ − D₁ by S_{j,j} = tri[S_{j,j}^{(−1)}, 0, S_{j,j}^{(−1)T}]; S_{j,j} is an n²×n² block tridiagonal matrix. All the blocks S_{j,j}, 1 ≤ j ≤ n/2, are identical to each other, so the analysis can be carried out for a single block. Define Y_{2n}(r,s) to be the 2n×2n matrix whose even-numbered rows are zero and whose odd-numbered rows carry, in the even-numbered columns, the tridiagonal pattern tri[s, r, s]; for example, for 2n = 8,

                 [ 0 r 0 s 0 0 0 0 ]
                 [ 0 0 0 0 0 0 0 0 ]
                 [ 0 s 0 r 0 s 0 0 ]
    Y₈(r, s)  =  [ 0 0 0 0 0 0 0 0 ]   (4.53)
                 [ 0 0 0 s 0 r 0 s ]
                 [ 0 0 0 0 0 0 0 0 ]
                 [ 0 0 0 0 0 s 0 r ]
                 [ 0 0 0 0 0 0 0 0 ]

and let W(r,s) = tri[Y_{2n}(r,s), 0, Y_{2n}(r,s)ᵀ] be an n²×n² matrix, consisting of n/2 × n/2 blocks, each of size 2n×2n. Then

    S_{j,j} = tri[S_{j,j}^{(−1)}, 0, S_{j,j}^{(−1)T}] = tri[−fg·I_{2n}, 0, −fg·I_{2n}] + W(−2√(befg), −2√(cdfg)) .   (4.54)

Since S_{j,j} is symmetric, its spectral radius can be estimated using

    ρ(S_{j,j}) = ‖tri[−fg·I_{2n}, 0, −fg·I_{2n}] + W(−2√(befg), −2√(cdfg))‖₂
              ≤ ‖tri[−fg·I_{2n}, 0, −fg·I_{2n}]‖₂ + ‖W(−2√(befg), −2√(cdfg))‖₂ .   (4.55)

The eigenvalues of the n²×n² matrix tri[−fg·I_{2n}, 0, −fg·I_{2n}] are −2fg·cos(iπ/(n/2+1)), 1 ≤ i ≤ n/2, each with algebraic multiplicity 2n. Since it is a symmetric matrix, we obtain

    ‖tri[−fg·I_{2n}, 0, −fg·I_{2n}]‖₂ = ρ(tri[−fg·I_{2n}, 0, −fg·I_{2n}]) = 2fg·cos(π/(n/2+1)) .   (4.56)

To estimate ‖W‖₂ in (4.55), it is easier to work with W², and use ‖W‖₂ = ‖W²‖₂^{1/2}. We have:

Lemma 4.3. The matrix W² has the eigenvalue 0, of multiplicity 2n, and the eigenvalues

    4befg + 16cdfg·cos²(πjh) + 16√(bcde)·fg·cos(πjh) ,   1 ≤ j ≤ n ,   (4.57)

each of multiplicity n − 2.

Proof. Forming W² in terms of Y_{2n} and Y_{2n}ᵀ, and using the fact that the off-diagonal blocks of W² involve only Y² and (Yᵀ)², we get

    W² = diag[ YᵀY, YᵀY + YYᵀ, ..., YᵀY + YYᵀ, YYᵀ ]   (4.58)

[in Eq. (4.58) the subscripts 2n were dropped]. A straightforward computation shows that (Y_{2n})² = (Y_{2n}ᵀ)² = 0; from that it follows that W² is indeed block diagonal. The matrices Y_{2n}Y_{2n}ᵀ and Y_{2n}ᵀY_{2n}, which appear in the first and last diagonal block entries of W², are given by

    Y_{2n}Y_{2n}ᵀ : zero except on the odd-numbered rows and columns, which carry the pentadiagonal pattern penta[s², 2rs, r² + 2s², 2rs, s²], with r² + s² in the first and last nonzero diagonal entries ;   (4.59)

    Y_{2n}ᵀY_{2n} : the same pattern, carried on the even-numbered rows and columns .   (4.60)

Combining (4.59) and (4.60), the matrix Y_{2n}Y_{2n}ᵀ + Y_{2n}ᵀY_{2n}, which appears in all the remaining block diagonal entries of W², has diagonal entries r² + 2s² (with r² + s² in the first two and last two positions), 2rs on the second super- and subdiagonals, and s² on the fourth super- and subdiagonals; it is evident that, in terms of U_n [see Eq. (4.38)], we have
    Y_{2n}Y_{2n}ᵀ + Y_{2n}ᵀY_{2n} = r²·I_{2n} + 2rs·U_{2n} + s²·U_{2n}² ,   (4.62)

and thus the eigenvalues of this matrix, with r and s as in (4.38), are the ones given in Eq. (4.57), each with algebraic multiplicity 2.

The matrix Y_{2n}ᵀY_{2n} can be permuted so that the n rows containing nonzero entries come first, followed by the zero rows. Doing so, we obtain a matrix of the form [ X 0 ; 0 0 ], where X is a symmetric pentadiagonal n×n matrix given by penta[s², 2rs, r² + 2s², 2rs, s²], except that the first and last entries of the main diagonal are r² + s². Again, using an auxiliary matrix introduced in Eq. (4.38), we have

    X = r²·I_n + 2rs·T_n + s²·T_n² .   (4.63)

From this it follows that the eigenvalues of Y_{2n}ᵀY_{2n} and Y_{2n}Y_{2n}ᵀ are exactly the ones given by (4.57), each of multiplicity 1, plus the eigenvalue 0 of multiplicity n. We can now find the eigenvalues of W² by assembling the eigenvalues of all the blocks. Since there are n/2 − 2 blocks in W² that are equal to Y_{2n}Y_{2n}ᵀ + Y_{2n}ᵀY_{2n}, the result in the statement of the lemma follows. •

Having the eigenvalues of W² at hand, we can now use Equations (4.56) and (4.57) to obtain:

Lemma 4.4. The spectral radius of D₂ − D₁ can be bounded by

    ξ = 2fg·cos(π/(n/2+1)) + √( 4befg + 16cdfg·cos²(πh) + 16√(bcde)·fg·cos(πh) ) .   (4.64)

Combining Thm. 4.5 and Lemma 4.4, and applying Rayleigh quotients to the matrices D̃₁ and D̃₂ − D̃₁, we obtain:

Lemma 4.5. The minimal eigenvalue of D̃₂ is bounded from below by η − ξ, where η and ξ are the expressions given in (4.52) and (4.64) respectively.

As a last step, we estimate the spectral radius of

    −C̃₂ = tri[S_{j,j−1}^{(−1)}, 0, S_{j,j+1}^{(−1)}] + tri[S_{j,j−1}^{(0)}, 0, S_{j,j+1}^{(0)}] + tri[S_{j,j−1}^{(1)}, 0, S_{j,j+1}^{(1)}] .   (4.65)

Lemma 4.6. The spectral radius of tri[S_{j,j−1}^{(−1)}, 0, S_{j,j−1}^{(−1)T}] + tri[S_{j,j+1}^{(1)}, 0, S_{j,j+1}^{(1)T}] is 2√(befg). Moreover, its eigenvalues are either 0, +2√(befg) or −2√(befg).

Proof. The square of the matrix is a diagonal matrix whose entries are either zeros or 4befg. •

Lemma 4.7. The spectral radius of the (n³/2)×(n³/2) matrix tri[−be·I_{n²}, 0, −be·I_{n²}] is 2be·cos(π/(n/2+1)).

Proof. Let Z denote the (n/2)×(n/2) matrix tri[1, 0, 1]. Then tri[−be·I_{n²}, 0, −be·I_{n²}] = −be·(Z ⊗ I_{n²}). •

Lemma 4.8. The spectral radius of tri[S_{j,j−1}^{(0)}, 0, S_{j,j+1}^{(0)}] − tri[−be·I_{n²}, 0, −be·I_{n²}] is given by

    2√(befg) + 4√(bcde)·cos(πh) .   (4.66)

Proof. Let C₁ denote the matrix that results from permuting the rows of the matrix given in the statement of the lemma so that indices whose values mod 4 are 0 or 1 come first, in increasing order, and indices whose values mod 4 are 2 or 3 come later. Let C₂ be a permutation of C₁ such that the rows and columns indexed n³/4 − n²/2 + 1 to n³/4 + n²/2 (n² such rows and columns) become rows/columns n³/2 − n² + 1 to n³/2, and the rest of the rows/columns are shifted accordingly. The last n² rows and columns of C₂ are zeros. If C₃ denotes the upper left (n³/2 − n²)×(n³/2 − n²) submatrix of C₂, then it is a block matrix of the form [ 0 U ; Uᵀ 0 ], and U is a block diagonal matrix consisting of n×n tridiagonal matrices given by −2·tri[√(bcde), √(befg), √(bcde)]. In Fig. 4.1 the sparsity patterns of the above matrices are depicted.
Figure 4.1: sparsity patterns of the matrices (a) C₁, (b) C₂, (c) C₃ involved in the proof of Lemma 4.8.

By Lemma 1.1 the spectral radius of C₃ is the one given in (4.66). It is also the spectral radius of the matrix referred to in the statement of the lemma, as the rest of its eigenvalues are zero. •

We can now combine the results obtained in Lemmas and Theorems 4.3-4.8 to obtain the following result:

Theorem 4.6. The spectral radius of C̃₂ is bounded by

    φ = 4√(befg) + 4√(bcde)·cos(πh) + 2be·cos(π/(n/2+1)) ;   (4.67)

the spectral radius of C̃₁ is bounded by ξ + φ [ξ is defined in Eq. (4.64)].

We are now ready to prove the main convergence result.

Theorem 4.7. The spectral radii of the iteration matrices D₁⁻¹C₁ and D₂⁻¹C₂ are bounded by (ξ+φ)/η and φ/(η−ξ) respectively, where η, ξ and φ are defined in Eq. (4.52), (4.64) and (4.67), respectively.

From Theorem 4.7 we can draw the following conclusion with regard to the convergence of the block Jacobi scheme:

Corollary 4.2. If be, cd, fg > 0 then the block Jacobi iteration converges for both the 1D and 2D splittings.

Proof. The Taylor expansions of the bounds given in Theorem 4.7 are

    ρ(D₁⁻¹C₁) ≤ (ξ+φ)/η = 1 − ( (10/9)π² + (1/6)μ² + (1/6)τ² + (1/6)σ² )h² + o(h²)   (4.68)

and

    ρ(D₂⁻¹C₂) ≤ φ/(η−ξ) = 1 − ( 2π² + (3/10)μ² + (3/10)τ² + (3/10)σ² )h² + o(h²) ,   (4.69)

and both are thus smaller than 1. •

We remark that convergence of the scheme can also be deduced from [113, Thm. 3.15]. In Tables 4.1 and 4.2 we give some indication of the quality of the bounds. As can be observed, the bounds are tight and become tighter as n increases, which suggests that they tend to the actual spectral radii as h → 0. Another important point is that the actual spectral radii are bounded far below 1.

Table 4.1: comparison between the computed spectral radius and the bound, for the 1D splitting, with β = 0.4, γ = 0.5, δ = 0.6.

              upwind                    centered
    n      ρ      bound   ratio     ρ      bound   ratio
    8    0.641   0.717    1.12    0.475   0.525    1.11
   12    0.720   0.758    1.05    0.529   0.554    1.05
   16    0.752   0.775    1.03    0.551   0.566    1.03
   20    0.768   0.783    1.02    0.562   0.571    1.02
   24    0.778   0.788    1.01    0.568   0.575    1.01

Table 4.2: comparison between the computed spectral radius and the bound, for the 2D splitting, with β = 0.4, γ = 0.5, δ = 0.6.

              upwind                    centered
    n      ρ      bound   ratio     ρ      bound   ratio
    8    0.497   0.583    1.17    0.343   0.388    1.13
   12    0.585   0.632    1.08    0.392   0.415    1.06
   16    0.625   0.653    1.05    0.412   0.427    1.04
   20    0.645   0.664    1.03    0.423   0.432    1.02
   24    0.657   0.671    1.02    0.429   0.436    1.02

For the variable coefficient case, recall from the previous section that if the problem is separable the matrix can be symmetrized by a real diagonal similarity transformation. Inspired by Elman & Golub's strategy [42, Cor. 2], and the technique that is used to prove it, Theorem 4.7 can be generalized as follows:

Theorem 4.8. Suppose the continuous problem is separable, and c_{i+1}d_i, b_{j+1}e_j and f_{k+1}g_k are all positive and bounded by β_x, β_y and β_z respectively. Suppose also that a_{i,j,k} ≥ a for all i, j and k. Denote h = 1/(n+1). Then the spectral radii of the iteration matrices associated with the block Jacobi scheme which correspond to the 1D splitting and the 2D splitting are bounded by (ξ+φ)/η
and φ/(η−ξ) respectively, where:

    η = a² − 2β_y − 2β_z − 2√(β_y β_z) − 4( √(β_x β_y) + √(β_x β_z) )·cos(πh) − 4β_x·cos²(πh) ;   (4.70a)

    ξ = 2β_z·cos(πh) + √( 4β_y β_z + 16β_x β_z·cos²(πh) + 16β_z·√(β_x β_y)·cos(πh) ) ;   (4.70b)

    φ = 4√(β_y β_z) + 4√(β_x β_y)·cos(πh) + 2β_y·cos(πh) .   (4.70c)

Proof. The conditions stated in the theorem guarantee that the matrix is symmetrizable by a real diagonal nonsingular matrix. Denote the reduced matrix by S and the symmetrized matrix by S̃, and suppose S* is obtained by modifying S̃ in the following manner: replace each occurrence of cᵢ and dᵢ by √(β_x), replace each occurrence of bⱼ and eⱼ by √(β_y), replace each occurrence of f_k and g_k by √(β_z), and replace each occurrence of a_{i,j,k} by a. Denote by S* = D* − C* the splitting which is analogous (as far as sparsity pattern is concerned) to the splitting S̃ = D̃ − C̃. For the 1D splitting, the matrix D* is block diagonal with semibandwidth 4; its sparsity pattern is identical to that of D̃, and its entries are componentwise smaller than or equal to those of D̃. By Cor. 2.5, D* is a diagonally dominant M-matrix. Clearly, C* ≥ C̃ ≥ 0. Thus the Perron-Frobenius theorem [113, p. 30] can be used to obtain an upper bound on the convergence rate. Since the matrix S* can now be regarded as the symmetrized version of a matrix associated with a constant coefficient problem, the bounds on the convergence rates are readily obtained from Thm. 4.7. •

4.3 "Near-Property A" for 1D Partitioning of the Two-Plane Matrix

The block Jacobi scheme converges slowly, and thus is not satisfactory. We are interested in performing convergence analysis for the block SOR solver, which is typically very efficient if the optimal relaxation parameter (or a good approximation to it) is known. Considering the two-plane ordering, the reduced matrix is consistently ordered relative to the 2D partitioning (by Prop. 3.3), and thus in this case Young's analysis can be applied straightforwardly. On the other hand, analyzing the 1D partitioning is more challenging: by Prop. 3.2 the reduced matrix does not have Property A relative to this partitioning. However, in Section 4.4 it will be shown that the 1D partitioning gives rise to a more efficient solution process. It is thus important to provide some analysis for this case.

Our analysis and experimental observations are for the constant coefficient case and refer mainly to centered difference discretization. First, we consider the (easier) case of 2D partitioning and provide an approximation to the optimal relaxation parameter. We then consider the 1D partitioning: the spectral radius of the Jacobi iteration matrix is bounded in terms of the spectral radii of two iteration matrices, one of which is associated with a consistently ordered matrix, and one whose spectral radius is small. We then use Varga's extensions of the theory of p-cyclic matrices [112],[113, Sec. 4.4] (we are concerned with p = 2) to show that, relative to 1D partitioning, the reduced matrix behaves like a consistently ordered matrix.

We begin with the 2D partitioning. For this case we have the following:

Theorem 4.9. Let L_ω denote the block SOR operator associated with the 2D splitting and two-plane ordering. If be, cd, fg > 0 or < 0, then the choice

    ω* = 2 / ( 1 + √( 1 − ρ²(D₂⁻¹C₂) ) )

minimizes ρ(L_ω) with respect to ω, and ρ(L_{ω*}) = ω* − 1.
The proof of this theorem follows directly from [117, §5.2 and §14.3] and is essentially identical to the proof of [41, Thm. 4]. The algebraic details on how to pick the signs of the diagonal symmetrizer so that the symmetrized block diagonal part of the splitting is a diagonally dominant M-matrix are omitted. By Cor. 4.2 it is known that ρ(D₂⁻¹C₂) < 1. The reduced matrix is consistently ordered by Prop. 3.3. A way to approximately determine the optimal relaxation parameter for be, cd, fg > 0 is to replace ρ(D₂⁻¹C₂) by the bound for it (given in Theorem 4.7) in the expression for ω* in Theorem 4.9. If the bound for the block Jacobi scheme is tight, then the estimate of ω* is fairly accurate:

Corollary 4.3. Suppose be, cd, fg > 0. For the system associated with the 2D splitting and for h sufficiently small, the choice

    ω̃* = 2 / ( 1 + √( 1 − ( φ/(η−ξ) )² ) )   (4.71)

approximately minimizes ρ(L_ω). The spectral radius of the iteration matrix is approximately ω̃* − 1.

It is straightforward to verify that if ω̃* is an accurate approximation to the optimal relaxation parameter, the asymptotic convergence rate of the SOR scheme is O(h).

Moving to the 1D partitioning: as mentioned above, it is fundamentally different from the 2D partitioning, and a result analogous to the one stated in Theorem 4.9 cannot be obtained by using Young's analysis. However, for be, cd, fg > 0 we have noticed that in many senses the reduced matrix does behave like a consistently ordered matrix relative to 1D partitioning. For example, we have observed that the spectral radius of the block Jacobi iteration matrix satisfies ρ_J² ≈ ρ_GS, and the graphs in Fig. 4.2 illustrate this phenomenon numerically. The broken lines correspond to the square of the spectral radius of the iteration matrix associated with block Jacobi, for an 8×8×8 grid and β = γ = δ = 0.5. The solid lines correspond to the spectral radius of the block Gauss-Seidel iteration matrix. As can be seen, the curves are almost indistinguishable. This phenomenon becomes more pronounced as the systems become larger.

Figure 4.2: "Near-Property A" for the 1D splitting: (a) ρ_GS vs. ρ_J², centered differences; (b) ρ_GS vs. ρ_J², upwind differences.

Let {S_{i,j}} denote the n²×n² blocks of the reduced matrix. Each block S_{i,j} is a block tridiagonal matrix relative to 2n×2n blocks. Define

    C₁^{(1)} = −( tri[S_{j,j−1}^{(−1)}, 0, S_{j,j−1}^{(−1)T}] + tri[S_{j,j+1}^{(1)}, 0, S_{j,j+1}^{(1)T}] ) ,   (4.72)

where the matrices on the right-hand side are defined in Sec. 3.2. As before, let S = D₁ − C₁ be the 1D Jacobi splitting of the matrix, and define C₁^{(2)} so that S = D₁ − (C₁^{(1)} + C₁^{(2)}). The matrix D₁ − C₁^{(2)} is consistently ordered, but S is not, and does not have Property A either. It is convenient to regard the very sparse matrix C₁^{(1)} as the one which prevents S from having Property A. In this matrix the magnitudes of the nonzero values are bounded by 2 if be, cd, fg > 0 and centered difference discretization is used for the convective terms. The nonzero pattern of C₁^{(1)} is depicted in Fig. 4.3.

Figure 4.3: sparsity pattern of the matrix C₁^{(1)}.

When be, cd, fg > 0 the reduced matrix S is a diagonally dominant M-matrix (by Lemma 2.5), which can be symmetrized by a real diagonal similarity transformation (by Theorem 4.1). By Theorem 4.5 the symmetrized matrix D̃₁ is positive definite, thus D̃₁^{1/2} is well defined. This leads to:
Lemma 4.9. The spectral radius of the 1D Jacobi iteration matrix associated with the reduced matrix S is bounded from above and below as follows:¹

    ρ(D₁⁻¹C₁^{(2)}) ≤ ρ(D₁⁻¹C₁) ≤ ρ(D₁⁻¹C₁^{(2)}) + ρ(D₁⁻¹C₁^{(1)}) .   (4.73)

¹ I would like to thank Howard Elman for pointing out and providing the proof of the right-hand inequality in (4.73).

Proof. Let C̃₁^{(1)} and C̃₁^{(2)} denote the symmetrized versions of C₁^{(1)} and C₁^{(2)} respectively. Since D₁⁻¹C₁ is similar to the symmetric matrix D̃₁^{1/2}(D̃₁⁻¹C̃₁)D̃₁^{−1/2} = D̃₁^{−1/2}C̃₁D̃₁^{−1/2}, we have

    ρ(D₁⁻¹C₁) = ρ(D̃₁⁻¹C̃₁) = ρ(D̃₁^{−1/2}C̃₁D̃₁^{−1/2}) = ‖D̃₁^{−1/2}C̃₁D̃₁^{−1/2}‖₂
              ≤ ‖D̃₁^{−1/2}C̃₁^{(1)}D̃₁^{−1/2}‖₂ + ‖D̃₁^{−1/2}C̃₁^{(2)}D̃₁^{−1/2}‖₂
              = ρ(D̃₁^{−1/2}C̃₁^{(1)}D̃₁^{−1/2}) + ρ(D̃₁^{−1/2}C̃₁^{(2)}D̃₁^{−1/2})
              = ρ(D₁⁻¹C₁^{(1)}) + ρ(D₁⁻¹C₁^{(2)}) .   (4.74)

This completes the proof of the right-hand inequality. On the other hand, both C₁^{(2)} and C₁ are nonnegative matrices. Since D₁ is an M-matrix, and since C₁^{(2)} ≤ C₁^{(1)} + C₁^{(2)} = C₁, we can use [95, Prop. 1.7], which states that if A, B and C are nonnegative matrices with A ≤ B, then CA ≤ CB (and AC ≤ BC). Hence D₁⁻¹C₁^{(2)} ≤ D₁⁻¹C₁. Now, using [95, Thm. 1.14], which states that for two matrices 0 ≤ A ≤ B we have ρ(A) ≤ ρ(B) (this actually follows from the Perron-Frobenius theorem [113, p. 30]), the left-hand inequality in (4.73) follows. •

Table 4.3 illustrates numerically the inequalities given in Lemma 4.9 for a particular example. In the table, spectral radii of the iteration matrices defined above are given for several cross-sections of mesh Reynolds numbers; the matrix is one arising from discretization on a 12×12×12 grid. As is evident, the bounds are tight (by the analysis, the values in the second column of the table are bounded from above by the values in the third column, and from below by the values in the fourth column). In particular, the right-hand inequality is a very tight upper bound.

Table 4.3: comparison of computed spectral radii of the 1D Jacobi iteration matrix and matrices which are consistently ordered. Here ρ^{(1)} ≡ ρ(D₁⁻¹C₁^{(1)}) and ρ^{(2)} ≡ ρ(D₁⁻¹C₁^{(2)}).

    β = γ = δ    ρ(D₁⁻¹C₁)    ρ^{(1)} + ρ^{(2)}    ρ^{(2)}     ρ^{(1)}
      0.1          0.8734          0.8762          0.7759      0.1002
      0.3          0.7487          0.7511          0.6648      0.0862
      0.5          0.5441          0.5458          0.4827      0.0631
      0.7          0.3151          0.3156          0.2788      0.0368
      0.9          0.0979          0.0981          0.0866      0.0115

Another interesting point is that the spectral radius of the iteration matrix associated with D₁ − C₁^{(1)} is fairly small. This can be quantified analytically, as follows:

Proposition 4.1. For h sufficiently small, the spectral radius of the Jacobi iteration matrix associated with the splitting D₁ − C₁^{(1)} is bounded from above by

    ρ(D₁⁻¹C₁^{(1)}) ≤ 1/9 − ( (4/81)π² + (1/81)σ² + (7/324)τ² + (7/324)μ² )h² + o(h²) .   (4.75)

Proof. In Lemma 4.6 the spectral radius of a matrix which is exactly equal to −C₁^{(1)} is proved to be equal to 2√(befg). Since ρ(D₁⁻¹C₁^{(1)}) = ρ(D̃₁⁻¹C̃₁^{(1)}) ≤ ρ(C̃₁^{(1)})/λ_min(D̃₁), it follows that ρ(D₁⁻¹C₁^{(1)}) ≤ 2√(befg)/η, where η is as in Eq. (4.52). The leading terms of the Taylor expansion of this expression are given by Eq. (4.75). •

We note that the matrix C₁^{(1)} has terms which are associated with only two independent variables (y and z in this case), and thus the value in the Taylor expansion (4.75) which multiplies τ² and μ² (namely −(7/324)h²) is not equal to the value multiplying σ².
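To make the quantities of this section concrete, the bounds of Theorem 4.7 and the resulting SOR estimate (4.71) can be evaluated numerically. The following sketch (Python; the helper name jacobi_bounds is ours, and the formulas transcribe Eq. (4.52), (4.64), (4.67) and (4.71) for the constant coefficient case with be, cd, fg > 0) is an illustration, not thesis code.

    # Evaluate eta, xi, phi of Eq. (4.52), (4.64), (4.67), the two Jacobi
    # bounds of Theorem 4.7, and the SOR parameter estimate (4.71).
    import numpy as np

    def jacobi_bounds(a, be, cd, fg, n):
        h = 1.0 / (n + 1)
        c1 = np.cos(np.pi * h)
        c2 = np.cos(np.pi / (n / 2 + 1))
        bcde, befg, cdfg = be * cd, be * fg, cd * fg
        eta = (a**2 - 2*be - 2*fg - 2*np.sqrt(befg)
               - 4*(np.sqrt(bcde) + np.sqrt(cdfg))*c1 - 4*cd*c1**2)
        xi = 2*fg*c2 + np.sqrt(4*befg + 16*cdfg*c1**2
                               + 16*np.sqrt(bcde)*fg*c1)
        phi = 4*np.sqrt(befg) + 4*np.sqrt(bcde)*c1 + 2*be*c2
        return (xi + phi) / eta, phi / (eta - xi)   # 1D and 2D bounds

    b1, b2 = jacobi_bounds(a=6.0, be=1.0, cd=1.0, fg=1.0, n=16)
    omega = 2.0 / (1.0 + np.sqrt(1.0 - b2**2))      # Eq. (4.71)
    print(b1, b2, omega, omega - 1.0)               # cf. Theorem 4.9

For pure diffusion (be = cd = fg = 1, a = 6) both bounds approach 1 as n grows, at the rates given by the expansions (4.68) and (4.69).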
The above analysis and experimental observations focus on the Jacobi iteration matrix. However, it is still not clear at this point whether the Gauss-Seidel (or SOR) eigenvalues behave in a way which resembles consistently ordered matrices. Here we can use Varga's extensions of the theory of p-cyclic matrices [112],[113, Sec. 4.4]. Below we briefly describe some of Varga's results, and use them for the reduced matrix. In [113, Defn. 4.2], a set V of matrices is defined as follows:² the square matrix B ∈ V if B satisfies the following properties:

1. B ≥ 0 with zero diagonal entries.
2. B is irreducible and convergent.
3. B is symmetric.

² In [113] the set is denoted by S. Here, since S denotes the reduced matrix, the letter V is used.

Define S̄ = D̃₁^{−1/2}S̃D̃₁^{−1/2} = I − D̃₁^{−1/2}C̃₁D̃₁^{−1/2}. Applying block Jacobi to the original reduced system is analogous to applying point Jacobi to S̄, in the sense that the spectra of the iteration matrices associated with both systems are identical to each other. The iteration matrix associated with S̄ is B = D̃₁^{−1/2}C̃₁D̃₁^{−1/2}. Showing that the matrix B belongs to the set V defined above is easy and is omitted. Let L be the strictly lower triangular part of B. Define M_B(θ) = θL + θ⁻¹Lᵀ, θ > 0, and m_B(θ) = ρ(M_B(θ)). Let

    h_B(ln θ) = m_B(θ)/ρ(B) ,   θ > 0 .   (4.76)

Then we have, by [113, Theorem 4.7], that h_B(ω) ≡ 1 if and only if B is consistently ordered. In some sense h_B(ln θ) measures the departure of the matrix B from having block Property A. Fig. 4.4 demonstrates how close the function h_B is to 1 for the reduced matrix when 1D partitioning is used, and is another way to illustrate the "near-Property A" of the matrix. In the figure, the function h_B is computed for the same matrix for which Fig. 4.2(a) is given.

Figure 4.4: the function h_B(α).

Let L_{B,ω} denote the SOR iteration matrix. [113, Thm. 4.8] reads:

Theorem 4.10. If B ∈ V then

    ρ²(B) ≤ ρ(L_{B,1}) ≤ ρ(B)/(2 − ρ(B)) ,   (4.77)

with equality possible only if B is a consistently ordered cyclic matrix of index 2.

This is a sharpened form of the Stein-Rosenberg Theorem [113, p. 70]. Applying this theorem to our reduced matrix, we have:

Theorem 4.11. If the bound for the block Jacobi iteration matrix tends to the actual spectral radius as h → 0, then the spectral radius of the block Gauss-Seidel iteration matrix coincides with the square of the bound for the spectral radius of the block Jacobi iteration matrix up to and including terms of order h².

Proof. Since the iteration matrix B has the same spectral radius as D₁⁻¹C₁, we can use the bound of Thm. 4.7; for simplicity of notation, denote it by Φ. Clearly, since 0 < ρ(B) ≤ Φ,

    ρ(B)/(2 − ρ(B)) ≤ Φ/(2 − Φ) .   (4.78)

Since Φ has a Taylor expansion of the form 1 − ch² + o(h²), by Eq. (4.68) it follows that Φ/(2−Φ) and Φ² have the same Taylor expansion up to O(h²) terms, of the form 1 − 2ch² + o(h²). Indeed, in terms of the PDE coefficients,

    Φ/(2−Φ) = 1 − ( (20/9)π² + (1/3)μ² + (1/3)τ² + (1/3)σ² )h² + O(h³) ,   (4.79)

and the same holds for Φ². •

For the block SOR scheme, the upper bound for the spectral radius is given in [113, Theorem 4.9] as √(ω* − 1) and is not tight. However, it is numerically evident that the bound for the Jacobi iteration matrix can be effectively used to estimate the optimal SOR parameter. In Fig.
4.5 we can observe that the behavior for the 1D splitting is qualitatively identical to the behavior of 2-cyclic consistently ordered matrices [the graph is given for the matrix that was used in Fig. 4.2(a)]. In the figure we also present the behavior of the SOR iteration matrix for the unreduced system.

Figure 4.5: spectral radius of the SOR iteration matrix vs. the relaxation parameter. The uppermost curve corresponds to 1D splitting for the unreduced matrix, and then, in order, 2D splitting for the unreduced matrix, 1D splitting for the reduced matrix, and 2D splitting for the reduced matrix.

4.4 Computational Work

Having performed some convergence analysis, in this section we examine the question of which of the solvers is more efficient overall. If be, cd, fg > 0, then by Eq. (4.68) and (4.69) (or by [113, Thm. 3.15]) it is evident that the spectral radius of the iteration matrix associated with the 2D splitting is smaller than that of the 1D iteration matrix. However, inverting D₁ involves less computational work than inverting D₂.

Consider the block Jacobi scheme. Asymptotically, in the constant coefficient case there is a fixed ratio of 1.8 between the rates of convergence of the two splittings [see Eq. (4.68) and (4.69)]. In rough terms, this number characterizes the ratio between the numbers of iterations until convergence for the two solvers. As far as the computational work per iteration is concerned, if D₁ = L₁U₁ and D₂ = L₂U₂ are the LU decompositions of the matrices of the systems that are to be solved in each iteration, we can assume that the number of operations per iteration is approximately the number of nonzeros in Lᵢ + Uᵢ plus the number of nonzeros in Cᵢ. In order to avoid costly fill-in when applying Gaussian elimination to D₂ (whose band is sparse), a technique of inner-outer iterations is used instead. Let k₁ and k₂ denote the numbers of iterations for the schemes associated with the 1D splitting and the 2D splitting respectively. Let us also define cost functions, as follows: c₁(n) and c₂(n) represent the overall number of floating point operations for each of the solvers, and c_in(n) represents the cost of the inner solve. Then

    c₁(n) ≈ [ nz(L₁ + U₁) + nz(S − D₁) ]·k₁ = [ 10n³ − 19n² + 4n ]·k₁ ;   (4.80a)
    c₂(n) ≈ [ c_in(n) + nz(S − D₂) ]·k₂ = [ c_in(n) + 3n³ − 8n² + 4n ]·k₂ .   (4.80b)

In Eq. (4.80) the term nz(X) stands for the number of nonzeros of a matrix X, and S stands for the reduced matrix.

Proposition 4.2. For n large enough, the scheme associated with the 2D splitting gives rise to a less costly solution process than the one associated with the 1D splitting only if c_in(n) < 15n³.

Proof. If n is large enough we can use the relation k₁/k₂ = 1.8 and refer only to the leading power of n in the expressions for c₁(n) and c₂(n). Doing so, it follows that

    c₁(n)/c₂(n) ≈ 18n³ / ( c_in(n) + 3n³ ) ,   (4.81)

and the result stated in the proposition readily follows. •

Next, the amount of work involved in solving the inner system of equations is determined. A natural choice of a splitting for this system is D₂ = D₁ − (D₁ − D₂). It is straightforward to show, by Lemma 4.4 and Theorem 4.5, that:

Proposition 4.3.
If block Jacobi based on the splitting D₂ = D₁ − (D₁ − D₂) is used, then the spectral radius of the inner iteration matrix, namely I − D₁⁻¹D₂, is bounded by ξ/η, where η and ξ are as in Eq. (4.52) and (4.64).

For stationary methods that are faster than block Jacobi for the inner system, we have:

Proposition 4.4. The inner matrix is block consistently ordered relative to 1D partitioning.

Proof. The inner matrix is block tridiagonal relative to this partitioning. •

We now have:

Theorem 4.12. If be, cd, fg > 0, then for n large enough, and for the stationary methods considered in this chapter, the 1D solver is faster than the 2D solver.

Proof. Below the result is shown for the block Jacobi outer solver; for Gauss-Seidel and SOR it is done in a similar way. The Taylor expansion of the bound in Prop. 4.3 is

    ξ/η = 4/9 − ( (43/81)π² + (65/648)μ² + (29/648)τ² + (25/324)σ² )h² + o(h²) .   (4.82)

For h small enough, we can simply examine the leading term: the bound is approximately 4/9 if block Jacobi is used; since by Prop. 4.4 the matrix is consistently ordered, by Young's analysis the spectral radius is approximately 16/81 if block Gauss-Seidel is used, and approximately 0.055 if block SOR with the optimal relaxation parameter is used. For these schemes each iteration costs about 7n³ floating point operations. Since reducing the initial error by a factor of 10^m takes roughly −m/log₁₀(ρ) iterations, where ρ is the spectral radius of the associated iteration matrix, it follows that even for the block SOR scheme with the optimal relaxation parameter, which is the fastest scheme considered here, after two iterations the error is reduced only by a factor of approximately 10^{2.5}, which is far from enough. Thus the iteration count is larger than two, and the cost of the inner solve is larger than 15n³ floating point operations. •

We remark that an inexact inner solve could also be considered (see, for example, [43]), but we do not analyze this type of solver here.

4.5 Comparison with the Unreduced System

The unreduced matrix is an nth-order block tridiagonal matrix with respect to n²×n² blocks:

    A = tri[A^{(−1)}, A^{(0)}, A^{(1)}] ,   (4.83)

where A^{(−1)} = f·I_{n²}, A^{(1)} = g·I_{n²}, and the A^{(0)} are themselves block tridiagonal matrices,

    A^{(0)} = tri[B^{(−1)}, B^{(0)}, B^{(1)}] ,   (4.84)

consisting of n×n matrices given by

    B^{(−1)} = b·I_n ;   B^{(0)} = tri[c, a, d] ;   B^{(1)} = e·I_n .   (4.85)

Consider the one-dimensional splitting A = D − C, where

    D = diag[B^{(0)}, ..., B^{(0)}] .   (4.86)

Theorem 4.13. If cd > 0, be > 0, and fg > 0, then there exists a real nonsingular diagonal matrix Q such that the matrix Ã = Q⁻¹AQ is symmetric.

Proof. This can be shown by direct substitution, by requiring that the symmetrized matrix be equal to its transpose. Denote the ith entry in the main diagonal of Q by qᵢ. Then q₁ ≠ 0 can be an arbitrary value, and the following algorithm generates the desired diagonal matrix:

    for i = 2 to n
        q_i = sqrt(c/d) · q_{i−1}
    for l = 2 to n
        for i = 1 to n
            q_{(l−1)n+i} = sqrt(b/e) · q_{(l−2)n+i}
    for j = 2 to n
        for l = 1 to n
            for i = 1 to n
                q_{(j−1)n²+(l−1)n+i} = sqrt(f/g) · q_{(j−2)n²+(l−1)n+i}

From the above it is clear that the similarity transformation is real and nonsingular only if cd, fg and be are positive. •
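The algorithm in the proof of Theorem 4.13 translates directly into code. The sketch below (Python; our own illustration, not thesis code) generates the diagonal of Q for the lexicographic ordering and makes the geometric growth of the symmetrizer entries visible; the helper name unreduced_symmetrizer is ours.

    # A sketch of the symmetrizer construction of Theorem 4.13 for the
    # unreduced 7-point matrix (lines in x, then y, then planes in z).
    # Entries grow geometrically with the ratios c/d, b/e, f/g, which is
    # why the symmetrized system is used for analysis only.
    import numpy as np

    def unreduced_symmetrizer(n, b, c, d, e, f, g):
        q = np.empty(n**3)
        q[0] = 1.0
        rx, ry, rz = np.sqrt(c / d), np.sqrt(b / e), np.sqrt(f / g)
        for idx in range(1, n**3):
            i, l = idx % n, (idx // n) % n
            if i > 0:                 # step along a line (x-direction)
                q[idx] = rx * q[idx - 1]
            elif l > 0:               # step to the next line (y-direction)
                q[idx] = ry * q[idx - n]
            else:                     # step to the next plane (z-direction)
                q[idx] = rz * q[idx - n**2]
        return q

    q = unreduced_symmetrizer(4, b=-1.5, c=-1.2, d=-0.8, e=-0.5,
                              f=-1.1, g=-0.9)
    print(q.min(), q.max())           # growth governed by sqrt(b/e) etc.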
The resulting symmetrized matrix is

    Ã = tri[Ã^{(−1)}, Ã^{(0)}, Ã^{(1)}] ,   (4.87)

where

    Ã^{(−1)} = Ã^{(1)} = −√(fg)·I_{n²} ;   Ã^{(0)} = tri[B̃^{(−1)}, B̃^{(0)}, B̃^{(1)}] ,   (4.88)

with

    B̃^{(−1)} = B̃^{(1)} = −√(be)·I_n ;   B̃^{(0)} = tri[√(cd), a, √(cd)] .   (4.89)

Theorem 4.13 leads to the following symmetrization result, obtained by expressing the conditions in terms of β, γ and δ:

Corollary 4.4. If |β|, |γ|, |δ| < 1, the coefficient matrix for the centered difference scheme is symmetrizable by a real diagonal similarity transformation. For upwind (backward) schemes, the coefficient matrix is symmetrizable for all β, γ, δ > 0.

Cor. 4.4 demonstrates the similarity of the symmetrization conditions between the 2D [40] and 3D problems: in both cases the unreduced matrix can be symmetrized by a diagonal matrix only in the diffusion-dominated case if centered differences are used, and in both cases the reduced matrix can be symmetrized for a larger range of PDE coefficients, including the case where all mesh Reynolds numbers are larger than 1 in magnitude.

We now find the spectrum of the iteration matrix. Let D̃ and C̃ be the symmetrized versions of D and C respectively. Then:

Theorem 4.14. The eigenvalues of the iteration matrix D̃⁻¹C̃ are given by

    λ_{j,k,t} = ( 2√(be)·cos(πjh) + 2√(fg)·cos(πkh) ) / ( a + 2√(cd)·cos(πth) ) ,   1 ≤ j, k, t ≤ n .   (4.90)

Proof. Inspired by the techniques used by Elman & Golub in [40, p. 677], suppose V_n is an n×n matrix whose columns are the eigenvectors of the tridiagonal matrix tri[√(cd), a, √(cd)], and let V = diag[V_n, ..., V_n] consist of n² copies of V_n. Then D̂ = V⁻¹D̃V is diagonal. Denote Ĉ = V⁻¹C̃V. Since V⁻¹D̃⁻¹C̃V = D̂⁻¹Ĉ, we can find the spectrum of D̃⁻¹C̃ by examining D̂⁻¹Ĉ. The latter is easier to form explicitly, as D̂ is diagonal (as opposed to D̃, which is tridiagonal). D̂⁻¹Ĉ has the same nonzero pattern as Ĉ. Let P₁ denote the permutation matrix that transforms rowwise ordering into columnwise ordering, leaving the ordering of planes (z direction) unchanged. F₁ = P₁⁻¹D̂⁻¹ĈP₁ is block tridiagonal with respect to n²×n² blocks, and its superdiagonal and subdiagonal blocks are diagonal matrices, all equal to each other. Each of these n²×n² matrices looks as follows:

    diag[ √(fg)/(a + 2√(cd)cos(πh)), ..., √(fg)/(a + 2√(cd)cos(2πh)), ..., √(fg)/(a + 2√(cd)cos(πnh)), ... ] ,   (4.91)

where each of the n distinct values is repeated in n consecutive diagonal entries. The main diagonal block consists of n identical n²×n² matrices, each being an nth-order block diagonal matrix whose jth component is the tridiagonal matrix

    G_j = tri[ √(be)/(a + 2√(cd)cos(πjh)), 0, √(be)/(a + 2√(cd)cos(πjh)) ] .   (4.92)

Let V_{n²} denote a matrix whose columns are the eigenvectors of diag[G₁, ..., G_n], and let V₂ be the block diagonal matrix consisting of n uncoupled copies of V_{n²}; then F₂ = V₂⁻¹F₁V₂ is still block tridiagonal with respect to n²×n² blocks, but now the main diagonal block is a diagonal matrix: it contains n identical n²×n² blocks, each being a diagonal matrix whose entries are 2√(be)cos(πjh)/(a + 2√(cd)cos(πkh)), 1 ≤ j, k ≤ n. Finally, let P₂ be a permutation matrix which transforms rowwise ordering to planewise ordering, leaving the orientation of columns unchanged. Then F₃ = P₂⁻¹F₂P₂ is an n²-order block diagonal matrix whose components are n×n symmetric tridiagonal matrices. Their eigenvalues can be found using Lemma 1.1, and are given by (4.90). •

Theorem 4.14 leads to the following useful result:

Corollary 4.5.
For cd, be, fg > 0 the spectral radius of the line Jacobi iteration matrix for the unreduced system, using the preconditioner defined in (4.86), is

    ( 2√(be)·cos(πh) + 2√(fg)·cos(πh) ) / ( a + 2√(cd)·cos(πnh) ) .   (4.93)

The Taylor expansion of Eq. (4.93) about h = 0 is given by

    1 − ( (3/4)π² + (1/16)σ² + (1/16)τ² + (1/16)μ² )h² + o(h²) .   (4.94)

For plane Jacobi a procedure similar to the one used in the proof of Thm. 4.14 can be applied (the algebraic details are omitted):

Theorem 4.15. The spectral radius of the plane Jacobi iteration matrix for the unreduced system is given by

    2√(fg)·cos(πh) / ( a + 2√(cd)·cos(πnh) + 2√(be)·cos(πnh) ) ,   (4.95)

and its Taylor expansion about h = 0 is

    1 − ( (3/2)π² + (1/8)σ² + (1/8)τ² + (1/8)μ² )h² + o(h²) .   (4.96)

The same type of analysis that was done in Section 4.4, comparing the 1D splitting to the 2D splitting for the reduced system, is possible for the unreduced system. Below we sketch the main details. Suppose inner-outer iterations are used in solving the scheme associated with the 2D splitting. Denote, again, the splitting for the inner system by D₂ = D₁ − (D₁ − D₂) (D₁ and D₂ are now different from the ones defined in the previous section). Then we have:

Proposition 4.5. Consider the unreduced system. Suppose be, cd, fg > 0, n is sufficiently large, and the 1D splitting is used in solving the inner system. Then for the stationary methods considered in this chapter, the 1D solver is faster than the 2D solver.

Proof. The ratio between the asymptotic rates of convergence of the 1D solver and the 2D solver is 2. The number of nonzeros of the whole matrix is approximately 7n³, the number of nonzeros of D₁ is approximately 3n³, and the number of nonzeros of D₂ is approximately 5n³. Since the spectral radii for the two splittings are available, we can find the spectral radius of the iteration matrix of the inner system; its Taylor expansion is given by

    1/2 − ( (3/8)π² + (1/32)σ² + (1/16)τ² )h² + o(h²) .

Defining cost functions analogous to the ones defined in Section 4.4 for the reduced system, and using the same line of argument, we have

    c₁(n)/c₂(n) ≈ 14n³ / ( c_in(n) + 2n³ ) ,   (4.97)

and from this it follows that the 2D solver is more efficient only if c_in(n) < 12n³. However, as in Thm. 4.12, this means that at most two iterations of the inner solve can be performed, which is not enough for the required accuracy. •

Since the 1D splitting for both the reduced and the unreduced systems gives rise to a more efficient solve, we compare these two systems, focusing on this splitting. The LU decomposition for the solution of the system in each iteration is done once and for all (see [54] for an operation count), and its cost is negligible in comparison with the amount of work done in the iterative process. Each iteration for the reduced system costs about 10n³ floating point operations, whereas each iteration for the unreduced system costs approximately 7n³ floating point operations. Hence the computational work per iteration is cheaper for the unreduced system by a factor of about 10/7. However, using the asymptotic formulas (4.68) and (4.94), it is evident that the number of iterations required for the unreduced system is larger than that required for the reduced system; in the worst case, the ratio between the work required for solving the reduced system vs.
If the convective terms are nonzero, this ratio becomes smaller still, and in practice we have observed substantial savings, as illustrated by the test problems in Sec. 4.7.

In Fig. 4.5 the superiority of the reduced system over the unreduced system for the Gauss-Seidel scheme is illustrated. The graphs were created for a 512-point grid. It is interesting to notice that the reduced Gauss-Seidel iteration matrices for both the 1D and the 2D splittings have spectral radii that are significantly smaller than 1 even in the convection-dominated case, when centered difference discretization is applied. The 2D splitting gives rise to a matrix with Property A, and in all the numerical experiments that have been performed for this splitting with Gauss-Seidel and SOR, the rate of convergence was much faster than for the unreduced system; in several cases the solver for the unreduced system did not converge. These observations are similar to the results in [107] for lower dimensions.

The superiority of the reduced system is evident also for the SOR scheme - see Fig. 4.5. Notice that for the SOR scheme it is difficult to determine the optimal relaxation parameter when be, cd and fg are negative.

Figure 4.6: comparison of the spectral radii of the Gauss-Seidel iteration matrices of the reduced and unreduced systems; (a) ρ_GS, centered, (b) ρ_GS, upwind. Axes: x - mesh Reynolds numbers (β = γ = δ); y - spectral radius. The uppermost curve corresponds to the 1D splitting for the unreduced system, and then, in order, the 2D splitting for the unreduced system, the 1D splitting for the reduced system, and the 2D splitting for the reduced system.

4.6 Fourier Analysis

We now perform Fourier analysis in the spirit of Chan & Elman [21] and Elman & Golub [40] for convection-dominated equations. We consider centered difference discretization with be, cd, fg < 0. The boundary conditions are assumed to be periodic and the equation has constant coefficients. For the results presented below, we denote the periodic symmetrized reduced operator, scaled by ah², by S = D - C, where it will be clear from the context which matrices D and C are meant.

Theorem 4.16. Given be, cd, fg < 0, suppose 8√|fg|·(√|cd| + √|be|) < a². If the two-line ordering with x-y oriented 2D blocks of gridpoints is used, then the eigenvalues of D⁻¹C associated with the symmetrized periodic operator using 2D preconditioning are all smaller than 1 in magnitude.

Proof. Suppose v^(α,β,γ) are vectors whose components are

$$u_{j,k,l}^{(\alpha,\beta,\gamma)} = e^{2\pi i j\alpha h}\, e^{2\pi i k\beta h}\, e^{2\pi i l\gamma h}\ ,$$

where i = √-1 and 1 ≤ j, k, l, α, β, γ ≤ n. For a grid point u_{i,j,k} we have

$$\begin{aligned} D\,u_{i,j,k} ={}& (a^2 - 2be - 2cd - 2fg)\,u_{i,j,k} - be\,u_{i,j+2,k} - 2\sqrt{bcde}\,u_{i+1,j+1,k} - cd\,u_{i-2,j,k} - cd\,u_{i+2,j,k} \\ &{} - 2\sqrt{bcde}\,u_{i-1,j-1,k} - be\,u_{i,j-2,k} - 2\sqrt{bcde}\,u_{i-1,j+1,k} - 2\sqrt{bcde}\,u_{i+1,j-1,k} \end{aligned} \qquad (4.98)$$

and

$$\begin{aligned} C\,u_{i,j,k} ={}& -fg\,u_{i,j,k-2} - 2\sqrt{befg}\,u_{i,j+1,k-1} - 2\sqrt{cdfg}\,u_{i-1,j,k-1} - 2\sqrt{cdfg}\,u_{i+1,j,k-1} - 2\sqrt{befg}\,u_{i,j-1,k-1} \\ &{} - 2\sqrt{befg}\,u_{i,j+1,k+1} - 2\sqrt{cdfg}\,u_{i-1,j,k+1} - 2\sqrt{cdfg}\,u_{i+1,j,k+1} - 2\sqrt{befg}\,u_{i,j-1,k+1} - fg\,u_{i,j,k+2}\ . \end{aligned} \qquad (4.99)$$
Substituting the eigenvectors presented above, some algebra shows that the eigenvalues of D are

$$\lambda_{\alpha,\beta,\gamma} = a^2 - 2be - 2cd - 2fg - 4\sqrt{bcde}\,[\cos(\theta+\phi) + \cos(\theta-\phi)] - 2be\,\cos 2\phi - 2cd\,\cos 2\theta\ , \qquad (4.100)$$

where θ = 2παh, φ = 2πβh, and ξ = 2πγh. The eigenvalues of C are given by

$$\mu_{\alpha,\beta,\gamma} = 4\sqrt{befg}\,[\cos(\phi+\xi) + \cos(\phi-\xi)] + 4\sqrt{cdfg}\,[\cos(\theta+\xi) + \cos(\theta-\xi)] + 2fg\,\cos 2\xi\ . \qquad (4.101)$$

The eigenvalues of D⁻¹C are all the combinations μ/λ. Denote p = √|cd|, q = √|be|, r = √|fg|. Note that cd = -p², be = -q², fg = -r². Then

$$\frac{\mu}{\lambda} = \frac{8pr\cos\theta\cos\xi + 8qr\cos\phi\cos\xi - 4r^2\cos^2\xi + 2r^2}{a^2 + 2r^2 - 8pq\cos\theta\cos\phi + 4p^2\cos^2\theta + 4q^2\cos^2\phi}\ . \qquad (4.102)$$

For the denominator in (4.102) we have

$$\lambda = a^2 + 2r^2 + 4\,(p\cos\theta - q\cos\phi)^2 \ge a^2 + 2r^2\ . \qquad (4.103)$$

cos ξ appears only in the numerator, which is a quadratic function of it:

$$\mu(\cos\xi) = 2r^2\,(1 + s\cos\xi - 2\cos^2\xi)\ , \qquad (4.104)$$

where

$$s = 4\,\frac{p\cos\theta + q\cos\phi}{r}\ . \qquad (4.105)$$

Let s_max = max |s| = 4(p+q)/r. If p + q < r then μ has a local maximum at -1 < cos ξ = s/4 < 1, and there

$$\frac{\mu_{\max}}{2r^2} = 1 + \frac{s^2}{8} < 1 + s_{\max}\ . \qquad (4.106)$$

The minimal value of μ occurs at either cos ξ = 1 or cos ξ = -1, where its value is

$$\frac{\mu}{2r^2} = -1 \pm s\ , \qquad (4.107)$$

thus μ_min/(2r²) = -1 - |s|, and from this it follows that the maximal absolute value of μ/(2r²) is bounded by 1 + |s|. We conclude, then, that in this case |μ/(2r²)| < 1 + s_max. It remains to check what happens if p + q ≥ r. In this case there are no local extrema for μ, and we have to look at the end points. Substituting cos ξ = ±1, as above, we have μ/(2r²) = -1 ± s, which yields |μ/(2r²)| ≤ 1 + s_max. We thus have

$$|\mu| \le 2r^2\,(1 + s_{\max})\ . \qquad (4.108)$$

From (4.103) and (4.108) it follows that |μ/λ| < 1 if 8r(p+q) < a², which in terms of the components of the computational molecule is exactly the condition stated in the theorem. ∎
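As a sanity check on Theorem 4.16 one can simply tabulate μ/λ over a fine grid of Fourier angles. The script below is ours; the values of p, q, r and a are arbitrary choices satisfying the hypothesis of the theorem.

    import numpy as np

    p, q, r, a = 0.8, 0.7, 0.5, 6.0
    assert 8 * r * (p + q) < a**2          # hypothesis of Theorem 4.16

    t = np.linspace(0.0, 2 * np.pi, 61)
    th, ph, xi = np.meshgrid(t, t, t, indexing='ij')
    mu = (8*p*r*np.cos(th)*np.cos(xi) + 8*q*r*np.cos(ph)*np.cos(xi)
          - 4*r**2*np.cos(xi)**2 + 2*r**2)
    lam = (a**2 + 2*r**2 - 8*p*q*np.cos(th)*np.cos(ph)
           + 4*p**2*np.cos(th)**2 + 4*q**2*np.cos(ph)**2)
    print(np.abs(mu / lam).max())          # comfortably below 1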
For the two-plane ordering, since the matrix entries were given in the previous section for 2PN_xz, we refer to this particular ordering strategy below, and we have the following:

Theorem 4.17. Given be, cd, fg < 0, suppose 4√|be|·(2√|cd| + √|be| + 2√|fg|) < a². Then the eigenvalues of D⁻¹C associated with the periodic operator using 2D preconditioning for the two-plane ordering are all smaller than 1 in magnitude.

Proof. For a grid point u_{i,j,k}, the structure of D and C depends on the parity of the index j, associated with the independent variable y. If j is even,

$$\begin{aligned} D\,u_{i,j,k} ={}& (a^2 - 2be - 2cd - 2fg)\,u_{i,j,k} - fg\,u_{i,j,k-2} - 2\sqrt{cdfg}\,u_{i-1,j,k-1} - 2\sqrt{cdfg}\,u_{i+1,j,k-1} \\ &{} - 2\sqrt{befg}\,u_{i,j-1,k-1} - cd\,u_{i-2,j,k} - cd\,u_{i+2,j,k} - 2\sqrt{bcde}\,u_{i-1,j-1,k} - 2\sqrt{cdfg}\,u_{i-1,j,k+1} \\ &{} - 2\sqrt{bcde}\,u_{i+1,j-1,k} - 2\sqrt{cdfg}\,u_{i+1,j,k+1} - 2\sqrt{befg}\,u_{i,j-1,k+1} - fg\,u_{i,j,k+2} \end{aligned} \qquad (4.109)$$

and

$$C\,u_{i,j,k} = -2\sqrt{befg}\,u_{i,j+1,k-1} - be\,u_{i,j+2,k} - 2\sqrt{bcde}\,u_{i+1,j+1,k} - be\,u_{i,j-2,k} - 2\sqrt{befg}\,u_{i,j+1,k+1} - 2\sqrt{bcde}\,u_{i-1,j+1,k}\ . \qquad (4.110)$$

If j is odd, then the roles of j+1 and j-1 in D and C are interchanged. In both cases the eigenvalues are identical, and after some algebra we get

$$\frac{\mu}{\lambda} = \frac{4qr\cos\phi\cos\xi + 4pq\cos\theta\cos\phi + 2q^2\cos 2\phi}{a^2 + 2q^2 - 8pr\cos\theta\cos\xi + 4p^2\cos^2\theta + 4r^2\cos^2\xi - 4qr\cos\phi\cos\xi - 4pq\cos\theta\cos\phi}\ . \qquad (4.111)$$

We can write

$$\lambda = a^2 + 2q^2 + 4\,(p\cos\theta - r\cos\xi)^2 - 4q\cos\phi\,(p\cos\theta + r\cos\xi)\ ,$$

and setting s = 2(p cos θ + r cos ξ)/q, s_max = 2(p+r)/q, we have

$$\lambda \ge a^2 + 2q^2 - 2 s_{\max} q^2\ . \qquad (4.112)$$

For the numerator μ we use a technique similar to the one used in Theorem 4.16: μ is quadratic in cos φ, and we have

$$\mu = 2q^2\,(-1 + s\cos\phi - 2\cos^2\phi)\ . \qquad (4.113)$$

If |s| < 4 then μ/(2q²) attains its maximum at cos φ = s/4, where μ/(2q²) = -1 + s²/8 < 1. The minimum is attained at cos φ = ±1, and we have

$$\frac{\mu_{\min}}{2q^2} = -3 - |s|\ . \qquad (4.114)$$

So |μ/(2q²)| ≤ 3 + s_max. Hence it follows that

$$\left|\frac{\mu}{\lambda}\right| \le \frac{2\,(3 + s_{\max})\,q^2}{a^2 + 2q^2 - 2 s_{\max} q^2}\ . \qquad (4.115)$$

The latter is smaller than 1 if 4q(2p + q + 2r) < a². ∎

4.7 Comparisons

In this section some indication of the performance of block stationary methods for the reduced system is given, and the effectiveness of the cyclic reduction step is illustrated.

Comparison between the reduced and the unreduced systems

Consider Eq. (1.15), where the right hand side is such that the solution of the continuous problem is u(x,y,z) = sin(πx)·sin(πy)·sin(πz), and the domain is the unit cube. The zero vector was taken as an initial guess, and ‖r^(i)‖₂/‖r^(0)‖₂ < 10⁻¹⁰ was used as a stopping criterion (r^(i) denotes, as usual, the residual at the ith iterate). The program stopped if the stopping criterion was not satisfied after 2,000 iterations. In the experiments presented, the 1D solver is used.

In Table 4.4 the grid is of size 32×32×32; the matrix of the underlying system of equations is of size 32,768×32,768. The table gives iteration counts for the Jacobi, Gauss-Seidel and SOR schemes for four values of the PDE coefficients and for two discretization schemes. It is straightforward to make the connection between iterations and computational work: since the problem has constant coefficients, the construction of both the reduced and the unreduced systems is straightforward and inexpensive, and is negligible in the overall cost of the computation (see Sec. 5.3 for a discussion of the cost of constructing the reduced matrix). The number of floating point operations per iteration is approximately twice the number of nonzeros in each matrix. The number of nonzero entries in the reduced matrix is higher by a factor of about 35% than the number of nonzeros of the unreduced matrix. For the block Jacobi method, in the LU decomposition of the preconditioner (which is done once and for all), there is no fill-in for the unreduced matrix, and only a modest amount of fill-in for the reduced matrix. As a result, each iteration for the reduced system costs approximately 20n³ floating point operations, and each iteration for the unreduced system costs approximately 14n³ floating point operations. The iterations of the reduced solver are thus approximately 42% more expensive.
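The setup just described is the standard manufactured-solution device; a minimal sketch of it (our code, not the thesis' Matlab implementation) for producing the right-hand side and testing the relative residual:

    import numpy as np

    def manufactured_rhs(A, n):
        """Right-hand side so that the discrete solution of A u = b is the
        sampled u(x,y,z) = sin(pi x) sin(pi y) sin(pi z); A is any discrete
        operator on the interior of an n x n x n unit-cube grid."""
        h = 1.0 / (n + 1)
        x = h * np.arange(1, n + 1)
        X, Y, Z = np.meshgrid(x, x, x, indexing='ij')
        u = (np.sin(np.pi*X) * np.sin(np.pi*Y) * np.sin(np.pi*Z)).ravel()
        return A @ u, u

    def converged(r, r0, tol=1e-10):
        # stopping criterion ||r^(i)||_2 / ||r^(0)||_2 < tol
        return np.linalg.norm(r) / np.linalg.norm(r0) < tol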
                              reduced                        unreduced
    σ = τ = μ            10     20    100   1000      10       20    100   1000
    Jacobi   centered   393    173     53    N/C    1,030     444    N/C    N/C
    GS       centered   188     77     14    322      492     198    N/C    N/C
    SOR      centered    36     25      -      -       61      38      -      -
    Jacobi   upwind     455    239     75     43    1,194     620    179     89
    GS       upwind     219    111     27     10      574     287     63     16
    SOR      upwind      39     27     18      9       66      45     24     11

Table 4.4: comparison between iteration counts for the reduced and unreduced systems, for different values of the mesh Reynolds numbers. 'N/C' marks no convergence after 2,000 iterations. The experiments were performed on a 32×32×32 grid.

For the values of the PDE coefficients in the table, namely 10, 20, 100 and 1000, the corresponding values of the mesh Reynolds numbers are 0.1515, 0.3030, 1.515 and 15.15. Notice that the last two are larger than 1; for these values we do not know the optimal relaxation parameter, and the SOR experiments were therefore not performed. The following observations can be made:

1. Overall, the reduced solver is substantially faster than the unreduced solver. Even though each iteration of the reduced solver is more expensive (as explained above), the savings in iteration counts are much more significant, and in regions where both solvers converge the reduced solver is more efficient by a factor of at least 50% (except in one case), and by much more in many cases. Moreover, a significant fact here is that there are cases where the reduced solver converges whereas the unreduced solver does not.

2. For the reduced solver, the Gauss-Seidel scheme outperforms the Jacobi scheme by a factor of approximately 2 in diffusion-dominated regions, which illustrates the "near Property A" that the matrix has. In convection-dominated regions with centered differencing the Gauss-Seidel scheme is significantly better than Jacobi. (In this case the matrix is not necessarily nearly consistently ordered.)

3. If centered differences are used, convergence for σ = 20 is faster than for σ = 10. This illustrates a phenomenon which is supported by the analysis and holds also in the two-dimensional case [41]: for sufficiently small mesh Reynolds numbers, the "more nonsymmetric" systems converge faster than the "close to symmetric" ones.

4. The upwind difference scheme converges more slowly than the centered difference scheme when the mesh Reynolds numbers are small in magnitude, but convergence is fast for large mesh Reynolds numbers. This applies to both the reduced and the unreduced systems, and follows from the fact that as the PDE coefficients grow larger, the matrix is more diagonally dominant when upwind schemes are used.

Next, we consider an example of a nonseparable problem, for which our convergence analysis does not apply. Consider the problem

$$-0.1\,\Delta u + yz\,u_x + xz\,u_y + xy\,u_z = w\ , \qquad (4.116)$$

with Neumann boundary conditions u_z = 0 at z = 0, 1, and zero Dirichlet conditions at x = 0, 1 and y = 0, 1, on the unit cube. Note that there is a turning point, but it is on the boundary. Here w was constructed so that the exact solution is u(x,y,z) = sin(πx) sin(πy) cos(πz). The ordering strategy used is 2PN_xz. In this case it is impossible to know the optimal SOR parameter; we examine the iteration counts for block Jacobi and block Gauss-Seidel. The results in Table 4.5 are for a 20×20×20 grid (8,000 gridpoints in the tensor-product grid), using centered difference discretization.

    method      reduced   unreduced
    BJ, 1D          655       1,787
    BJ, 2D          445         904
    BGS, 1D         323         847
    BGS, 2D         220         466

Table 4.5: iteration counts for different iterative schemes.

The following observations can be made:
1. Even though the problem is nonseparable, the solvers behave in a way similar to the convergence results for the constant coefficient case. Refer back to Eq. (4.68), (4.69), (4.94) and (4.96): the difference in iteration counts matches the ratios predicted by the analysis - the improvement after the cyclic reduction step is performed is by a factor of 100% to 200% in iteration counts. As predicted by the analysis, the improvement for the 1D solver is more dramatic than the improvement for the 2D solver.

2. For the reduced system the difference in iteration counts between the 1D solvers and the 2D solvers is approximately a factor of 1.5; for the constant coefficient case the convergence analysis provides a factor of 1.8. Also, we notice an improvement by a factor of approximately 2 between the 1D and the 2D solvers (in iterations) for the unreduced system; the predicted factor of improvement for the constant coefficient case is 2.

3. The improvement in rate of convergence between Jacobi and Gauss-Seidel is approximately 2 for the 1D partitioning, relative to which the matrix does not have Property A.

Comparison of variants of the orderings

Tables 4.6 and 4.7 present selected results of applying block Gauss-Seidel based on various splittings. These results were obtained as part of an extensive set of numerical experiments performed for all possible combinations of signs of the PDE coefficients, for various magnitudes of the convective terms, and for various test problems. The results presented in the tables are for the model problem (1.15), where the right-hand-side vector is such that the analytical solution is u = 100xyz(1-x)(1-y)(1-z)exp(x+y+z). Centered difference discretization was done on a 12×12×12 grid. The results give a good indication of the situation for finer grids. The stopping criterion was ‖r^(i)‖₂/‖r^(0)‖₂ < 10⁻¹³. The two-plane matrix was used, with x-z oriented sets of gridpoints. The following partitionings were considered:

1. The natural two-plane ordering with 1D preconditioning (termed '1D' in the tables).
2. The four-color block ordering with 1D preconditioning ('4C').
3. The natural two-plane ordering with 2D preconditioning ('2D').
4. The 2D red/black block ordering with 2D preconditioning ('RB').

      σ     τ     μ    1D   4C   2D   RB       σ     τ     μ     1D    4C    2D    RB
    100     0     0    17   15   12   11     500     0     0     14    14    10    10
   -100     0     0    17   15   12   11    -500     0     0     14    14    10    10
      0   100     0    24   26   28   28       0   500     0    197   201   192   194
      0  -100     0    28   25   30   28       0  -500     0    203   202   194   195
      0     0   100    25   24   12   11       0     0   500    197   201    10    10
      0     0  -100    28   24   12   11       0     0  -500    202   201    10    10

Table 4.6: iteration counts for the block Gauss-Seidel scheme, for one nonzero convective term.

      σ     τ     μ    1D   4C   2D   RB       σ     τ     μ     1D    4C    2D    RB
    100   100     0    28   26   26   26     500   500     0    260   273   254   275
   -100   100     0    28   26   25   26    -500   500     0    253   278   256   279
   -100  -100     0    31   26   28   26    -500  -500     0    263   278   263   280
    100  -100     0    31   26   27   26     500  -500     0    264   274   258   275

Table 4.7: iteration counts for the block Gauss-Seidel scheme, for two nonzero convective terms.

The results can be interpreted in two respects: the performance of natural block orderings vs. multicolor block orderings, and the relation to the direction of the velocity vectors.
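For concreteness, here is one way such block orderings can be generated as index permutations. The construction below is our own illustrative sketch (the thesis' data structures are not shown); it produces a red/black ordering of two-plane blocks.

    import numpy as np

    def two_plane_red_black(n):
        """Permutation of the n^3 gridpoint indices that groups the z-planes
        into consecutive pairs (two-plane blocks) and lists the even-numbered
        pairs ("red") before the odd-numbered ones ("black"). Illustrative
        only; n is assumed even."""
        idx = np.arange(n**3).reshape(n, n, n)   # ordered (z, y, x)
        pairs = idx.reshape(n // 2, 2 * n * n)   # one row per two-plane block
        return np.concatenate([pairs[0::2].ravel(), pairs[1::2].ravel()])

    # Applying the permutation p to a matrix A: A[p][:, p] reorders the system.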
In [39], Elman & Chernesky have studied the one-dimensional convection-diffusion equation in conjunction with the Gauss-Seidel solver, and concluded that the solver is most effective when the sweeps 111 Chapter 4. Convergence Analysis for Block Stationary Methods follow the flow; when the sweeps are opposite to the flow, red/black orderings perform better than natural orderings, but not as well as natural orderings in the case where the sweeps follow the flow. In [42], E l m a n & Golub conducted numerical experiments for the two-dimensional convection-diffusion problem, using one-line and two-line orderings in conjunction wi th block stat ionary methods as well as I L U preconditioned G M R E S , and for stat ionary methods con-cluded that red/black orderings have essentially the same iteration counts in average as natural orderings have, but they are less sensitive to the direction of the flow. For the three-dimensional problem with the orderings examined here, we observe that in general multicolor and natural orderings have essentially the same iteration counts for moderate convection ( P D E coefficients of magnitude 100) but on the other hand, the performance of natural orderings is better when convection dominates. (The differences are bigger when all convective terms are nonzero.) Mul t i co lor orderings are less sensitive to change of sign of the P D E coefficients. Loosely speaking, if good performance is related to following the direction of the flow, then since the I D blocks are x-z oriented and the ordering of the blocks in y direction is done from bot tom to top, it is expected that when the signs of r or p, are positive, the scheme would converge faster (the sign of a plays a smaller role, as it is associated wi th the variable in the direction of which the preconditioning is done). A s can be observed in Table 4.6, the natural orderings are slightly more sensitive to the change of sign in r (orientation of 2D sets of gridpoints), and least sensitive to changes in a (the direction associated wi th the preconditioning). Since there is alternating direction in the ordering of the gridpoints in each of the I D sets of gridpoints, the sensitivity to sign is in general not part icularly high. The fact that the Gauss-Seidel solver wi th multicolor orderings converges in a reasonable rate is meaningful: these ordering strategies are parallelizable (see Sec. 5.5). 112 Chapter 5 Solvers, Precondit ioners, Implementat ion and Performance In this chapter solving techniques, implementation and performance are addressed. For solving the reduced system, block stationary methods have been considered in Chap. 4, and in this chapter we consider preconditioned Krylov subspace solvers. This is followed by discussion on implementation. We then present numerical results which demonstrate the effectiveness of solvers of the cyclically reduced system. 5.1 Krylov Subspace Solvers and Preconditioners - Overview Krylov subspace solvers are a family of projection methods, which in the context of linear systems can be described as methods that compute an approximation to the solution x = A_1b in a particular subspace of 1Zn. Requiring that the approximate solution belong to a specific subspace introduces a series of constraints. These could be, for example, orthogonality conditions on the residual. A general introduction to projection methods can be found in [95, Chap. 5]. Krylov subspace methods have not been used as long as stationary methods. 
Golub and van der Vorst [58] remark that the first Krylov subspace methods were developed at roughly the same time as the SOR method, but the latter was more popular, due to its modest storage requirements. The available convergence analysis of Krylov subspace solvers is less comprehensive than the convergence analysis of stationary methods, particularly for nonsymmetric systems like ours. Nevertheless, their performance is generally very good, particularly when the linear system is effectively preconditioned first. Much research work has been done in this area in the last ten to fifteen years; many new variants and convergence results have been derived, and numerical difficulties (in particular, situations of breakdown) have been addressed. There are several excellent references for these methods. Among them we mention the books of Saad [95] and Greenbaum [61], and the survey papers of Freund, Golub & Nachtigal [49] and Golub & van der Vorst [58].

As before, consider the linear system Ax = b, and let r^(k) = b - Ax^(k) be the residual at the kth iterate. The k-dimensional Krylov subspace associated with A and r^(0) is defined as

$$\mathcal{K}_k(A, r^{(0)}) = \mathrm{span}\{r^{(0)}, Ar^{(0)}, \ldots, A^{k-1}r^{(0)}\}\ . \qquad (5.1)$$

The basic principle is: at the kth iterate, seek an approximate solution x^(k) in the shifted Krylov subspace x^(0) + 𝒦_k(A, r^(0)). The criteria for computing x^(k) fall into three main categories [58]:

1. The Ritz-Galerkin approach: compute x^(k) such that r^(k) is orthogonal to 𝒦_k(A, r^(0)). For general nonsymmetric systems this is known as FOM (full orthogonalization method) [96]. When the matrix is symmetric positive definite we get the well known conjugate gradient method [69].

2. The minimum residual approach: compute x^(k) such that ‖r^(k)‖₂ is minimal over 𝒦_k(A, r^(0)). In this category MINRES [85] and GMRES [96] are well known methods.

3. The Petrov-Galerkin approach: compute x^(k) so that r^(k) is orthogonal to some other k-dimensional subspace. BiCG [90] and QMR [50] are methods based on this approach.

Highly effective methods that are hybrids of the above three approaches are CGS [100], Bi-CGSTAB [110], TFQMR [48], FGMRES [93], and others - see [58] for descriptions.

The first concern when computing optimal solutions in the Krylov subspace is the basis that should be used. Using the obvious basis {A^j r^(0)} would cause numerical difficulties in finite precision arithmetic, as the vectors A^j r^(0), even for relatively small j, might be dominated (as far as their direction goes) by the eigenvector corresponding to the largest eigenvalue; instead, an orthonormal basis is formed. A widely used technique for generating this basis is Arnoldi's method [5], which was originally introduced as a technique for reducing a dense matrix to upper Hessenberg form (useful for computing the eigenvalues). At each step, the jth vector v_j of the basis for 𝒦_k is multiplied by the matrix A and orthogonalized against all previous vectors in the basis, namely v₁, ..., v_{j-1}. The starting vector is

$$v_1 = \frac{r^{(0)}}{\|r^{(0)}\|_2}\ . \qquad (5.2)$$

The upper Hessenberg matrix is generated throughout the process. Originally, the technique for orthogonalizing the vectors in [5] was the standard Gram-Schmidt algorithm.
Later versions of Arnoldi's method use modified Gram-Schmidt, which is mathematically equivalent to Gram-Schmidt but more stable numerically, and Householder transformations (see [55] for details on the computational work and accuracy of these methods). When the matrix A is nonsymmetric, the process of orthogonalization becomes increasingly expensive as the dimension of the subspace increases. Denoting the (k+1)×k upper Hessenberg matrix generated in Arnoldi's algorithm by H_{k+1,k}, we have [95]

$$V_k^T A V_k = H_{k,k}\ , \qquad (5.3)$$

where V_k is the matrix whose k columns are the vectors of the orthonormal basis, and H_{k,k} is the matrix obtained by taking the first k rows of H_{k+1,k}. From here, it depends which approach is taken. The Ritz-Galerkin approach is equivalent to requiring V_k^T r^(k) = 0, and it can be shown [95] that x^(k) = x^(0) + V_k y^(k), where y^(k) = H_{k,k}⁻¹ ‖r^(0)‖₂ e₁, and e₁ denotes the first vector of the standard basis of ℝ^k. If A is symmetric positive definite, the Hessenberg matrix reduces to tridiagonal form, and the above-described procedure is (essentially) the conjugate gradient method [69] (see also [56]).

If the above approach is modified so that instead of orthogonality one requires that ‖r^(k)‖₂ be minimal over the shifted Krylov space, the result is the GMRES method of Saad & Schultz [96]. This is a most important contribution, which actually marked the beginning of a wave of enormous popularity of Krylov subspace methods. Practical implementation of the algorithm is discussed in detail by Saad in [95, Sec. 6.5.3].
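A compact implementation of the Arnoldi process with modified Gram-Schmidt looks as follows. This is our own illustrative sketch in dense NumPy, not the thesis' code; it returns V_{k+1} and the (k+1)×k Hessenberg matrix H satisfying A V_k = V_{k+1} H.

    import numpy as np

    def arnoldi(A, r0, k):
        n = len(r0)
        V = np.zeros((n, k + 1))
        H = np.zeros((k + 1, k))
        V[:, 0] = r0 / np.linalg.norm(r0)       # starting vector (5.2)
        for j in range(k):
            w = A @ V[:, j]
            for i in range(j + 1):              # modified Gram-Schmidt
                H[i, j] = V[:, i] @ w
                w = w - H[i, j] * V[:, i]
            H[j + 1, j] = np.linalg.norm(w)
            if H[j + 1, j] < 1e-14:             # "happy breakdown"
                return V[:, :j + 2], H[:j + 2, :j + 1]
            V[:, j + 1] = w / H[j + 1, j]
        return V, H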
A n i l lustrat ion of that can be given by referring to the following convergence result for the conjugate gradient method [61]: | | x - x(*)|U< 2.f^^fiy . | | x-^)|U l (5.4) where \\.\\A is the energy norm, defined by \\V\\A — VVTAy. A s is evident, convergence is fast when A is well conditioned; when it is not, it is worthwhile to find a preconditioning matrix, say M , which has the property that M~XA is better conditioned than A, and then replace the 116 Chapter 5. Solvers, Preconditioners, Implementation and Performance original system Ax = b by M-1Ax = M~1b. (5.5) Effective preconditioning is a crucial issue. Saad [95, p. 245] states that "In general, the reliabili ty of iterative techniques, when dealing wi th various applications, depends much more on the quali ty of the preconditioner than on the particular K r y l o v subspace accelerators used". In fact, we have already discussed preconditioned systems in Chap . 4; indeed, for fixed point schemes based on the spli t t ing A = M — N we have x^k+1^ = g(x^), where g : TZn 7Zn, and g(x) = M~1Nx + M'H . (5.6) Convergence is to the fixed point of g: x = g(xY, (5.7) and thus by substi tut ing (5.7) and N = M — A in (5.6) we are actually solving the linear system (5.5). T w o obvious ways to perform preconditioning are [95]: 1. left preconditioning, which means solving (5.5). 2. right preconditioning, which means solving AM~ly = b, y — Mx. The difference between these two approaches is discussed in [95, Sec. 9.3.4]. Combina t ion of left and right preconditioning is also widely used (split preconditioning). A m o n g popular preconditioning techniques we mention preconditionings based on incomplete factorizations, preconditionings based on approximate inverses, and block preconditioners. See [95] for an overview of several of these techniques. One of the most popular techniques for preconditioning a linear system is the class of incomplete factorizations. These methods are based on computing a factorization LU which is a fairly good approximation to the matr ix , and at the same time the factors are sparse, so that the the preconditioner solve is inexpensive. Here the simplest forms are (point or block) Jacobi 117 Chapter 5. Solvers, Preconditioners, Implementation and Performance and S S O R . These factorizations are based on the coefficient matr ix itself: Jacobi corresponds to diagonal preconditioning, and S S O R corresponds to taking L = I + uED~l and U = D + u>F, where D, E and F are the diagonal, lower triangular and upper tr iangular parts of the mat r ix respectively. If we set u = 1 we get the S G S scheme. These preconditioning techniques are set-up in min imal cost (SSOR) or no cost at all (Jacobi). No addit ional storage is required for both techniques. O n the other hand, these techniques (and in particular the Jacobi preconditioner) are not always effective (see numerical examples in [87]). Incomplete LU (ILU) factorizations, which are based on computing the L U decomposition, but without allowing much fill-in to occur, form a popular class of methods. In 1977, Meijer ink and van der Vorst [78] showed that when the matr ix for which the factorization is performed is a symmetr ic M - m a t r i x , there exists an I L U factorization and it is highly effective when applied to conjugate gradient as a preconditioner. This paper was very important in understanding the merits of incomplete L U factorizations and in directing interest of researchers to these techniques. 
A well known member in the class of incomplete L U factorizations is ILU(O) : here L and U are constructed so that the matr ix A = LU satisfies a;j = a;j whenever a,-j ^ 0. ILU(O) is easy to implement. Generally, the cost of setting it up is approximately C = € - (5.8) n floating point operations [87], where m is the number of nonzeros in the mat r ix and n is the number of unknowns. ILU(O) could behave poorly, for several reasons [14]. One possible reason: the pivots can be very small . In nonsymmetric matrices arising from discretization of P D E s , diagonal dominance is not at all guaranteed. In fact, even when the coefficient mat r ix is very well behaved, for example symmetric positive definite, the pivots of the incomplete factorization are not guaranteed to be necessarily positive [11], or far from zero in magnitude. Another source of trouble could be that the triangular solvers themselves are severely i l l -condit ioned. E l m a n [38] studied this aspect in detail . Ways of detecting the i l l-condit ioning are suggested in [24]. 118 Chapter 5. Solvers, Preconditioners, Implementation and Performance The condit ioning of the factors for certain two-dimensional convection-diffusion equations has been experimentally studied in [14]. One way to improve the stabili ty is by allowing more nonzeros in the factors L and U. For example, denote the factors of the ILU(O) factorization by L0 and UQ. Then , the I L U ( l ) factorization is obtained by allowing L and U have the nonzero structure of L0U0. The I L U ( l ) factorization is typical ly more accurate compared to ILU(O). O n the other hand, the number of nonzeros can be substantially higher, so that the process is more costly (a system associated wi th the preconditioning matr ix has to be solved once or twice at every i teration). The same principle leads to I L U ( 2 ) , ILU(3 ) , and so on. The number of nonzeros in the factors is important , because is can serve as a good indication for: 1. The storage required. 2. The set-up time for the preconditioner. 3. The t ime a preconditioner solve requires. Obviously, the number of nonzeros in the factors of the ILU(O) factorization is equal to the number of nonzeros in the sum of the triangular parts of the original matr ix . In order to find the nonzeros of the matrices associated wi th the more accurate I L U ( l ) factorization, there is no need to actually perform any matr ix product [95]. Instead, one can determine the fill-in by examining the computat ional molecules associated wi th the triangular parts L and U. ILU(p) factorizations exist if the matr ix is an M - m a t r i x [78], and in this case their associated spl i t t ing is a regular spl i t t ing [95, T h m . 10.2]. I L U ( p ) factorizations do not take into account the magnitude of the mat r ix elements. One might instead consider using a threshold: look at the magnitude of each element during the factorization, and drop it if it is below a certain (prespecified) threshold. Such a technique is I L U T [94], which first applies a rule based on threshold, and after the factorization is completed takes only a certain number of elements (the largest) in each row. W h e n the drop tolerance 119 Chapter 5. Solvers, Preconditioners, Implementation and Performance is sufficiently small these factorizations are very effective (see, for example, Pommerell [87]); however, the amount of nonzeros in the factors could be substantially higher than that of the coefficient matrix. 
A l l the above-mentioned incomplete factorizations have block versions. The idea of these factorizations is to treat the matrix as consisting of blocks (submatrices) rather than elements. The existence of incomplete block factorizations is guaranteed if the matrix is an M-matrix (Axelsson [8]). An important paper which contains a comprehensive testing of various block preconditioners for the symmetric positive definite case is by Concus, Golub & Meurant [26]. When using block factorizations, inverting the pivotal blocks is costly and might cause fill-in, as the inverse of a sparse matrix is not necessarily sparse (see Meurant [79] for a review of the topic of inverses of symmetric tridiagonal and block tridiagonal matrices). There is fill-in also in the off-diagonal blocks, due to a repetitive computation of Schur complements of the blocks. In order to overcome the problem of extra work and storage, approximate inverses which preserve sparsity are considered. Possible strategies are suggested, for example, in [9],[34],[114]. If the matrix is referred to as block tridiagonal then only the diagonal block elements are modified and fill-in can be avoided. Modifying only the diagonal elements can be done also for general matrices (not necessarily tridiagonal), and is known as the D- ILU factorization (Pommerell [87]). In general, block incomplete L U factorizations are less effective for three-dimensional prob-lems than for two-dimensional problems [11]. Magolu & Polman [76] discuss the effectiveness of block ILU factorizations for three-dimensional problems, and show that line partitionings could in some cases be superior to point incomplete factorizations, but not always; plane partitionings are less effective. 5.2 Incomplete Factorizations for the Reduced System For our reduced system, recall that the computational molecule consists of 19 points. The ith row of LU is obtained by multiplying each of the ten components of the computational molecule 120 Chapter 5. Solvers, Preconditioners, Implementation and Performance which are associated with the matr ix L, by the corresponding rows in U. Suppose the row of the mat r ix for which fill-in is examined is associated with a gridpoint whose coordinates are (i,j,k), and numbering of the unknowns is done in x-y plane oriented natural lexicographic fashion. In F i g . 5.1 we present a two-dimensional slice of the stencil associated wi th the factor U; we pick the plane where the associated gridpoint for which discretization was done lies. The stencil in the figure is also the fill-in pattern corresponding to the planes which contain gridpoints (i, j, k±2). X X X ® x Figure 5.1: a two-dimensional slice of the stencil associated with U (the value on the diagonal is circled) In F i g . 5.2 we demonstrate the fill-in (a 2D slice) that occurs by combining the stencils in F i g . 5.1 associated with all the components in the computat ional molecule. In F i g . 5.3 the stencils associated wi th the neighboring planes are presented. X X X X X (x) (x) X ® ~ ® x ® x Figure 5.2: fi l l- in in the construction of I L U ( l ) in the plane that contains the gridpoint for which the discretization was done (dashed circle). Circ led are the points that belong to the original computat ional molecule. Having determined the nonzero patterns in al l planes which belong to the lower triangular 121 Chapter 5. 
Figure 5.3: fill-in in the construction of ILU(1) in the plane adjacent to the gridpoint for which the discretization was done (circled are the points that belong to the original computational molecule).

Having determined the nonzero patterns in all planes belonging to the lower triangular matrix L, where fill-in occurs, the above can be summarized as follows:

Proposition 5.1. Denote by L₁ and U₁ the matrices associated with the ILU(1) factorization of the reduced matrix associated with lexicographic ordering. Then the number of nonzeros of L₁ + U₁ satisfies

$$nz(L_1 + U_1) \approx \tfrac{45}{19}\, nz(S)\ , \qquad (5.9)$$

where nz(S) stands for the number of nonzeros in the reduced matrix.

Proof. Figures 5.1-5.3 clarify how many nonzeros are generated in each 2D slice when going plane by plane in the z direction. (There are other 2D slices that need to be considered, but these do not generate any nonzeros that have not already been presented in the three figures; the details are omitted.) Thus in each row associated with an interior gridpoint there are 13 + 11·2 + 5·2 = 45 nonzeros. The two factors of 2 here come from the fact that there are two neighboring planes, and two separate planes contain the points (i, j, k±2). For points next to the boundary the fill-in level is smaller, and thus the factor 45/19 is only an estimate. ∎

In order to check the estimate in Prop. 5.1, the ILU(1) factorizations of reduced matrices corresponding to 8×8×8, 16×16×16 and 24×24×24 grids have been generated; the actual numbers of nonzeros and the estimates are presented in Table 5.1.

    n                 nz(S)    nz(L₁+U₁)   estimate
    8 x 8 x 8         3,760       7,522      8,905
    16 x 16 x 16     34,400      75,218     81,474
    24 x 24 x 24    121,104     272,194    286,825

Table 5.1: number of nonzero elements in the ILU(1) factorization: actual numbers and estimates.

The same can be done for any other ordering, and for higher allowed levels of fill. In Fig. 5.4 the sparsity patterns of the factors corresponding to the ILU(1) and the ILU(2) factorizations for the two-plane matrix are presented.

Figure 5.4: sparsity patterns of the factors of ILU(1) and ILU(2), associated with the two-plane matrix; (a) ILU(1), nz = 7,510, (b) ILU(2), nz = 11,380.

Multicolor block versions of orderings, discussed in Chap. 3, give rise to a larger amount of fill-in in the incomplete factorization. For the four-color two-plane ordering the matrix has 8,700 nonzeros - compare to 7,510 for the natural two-plane ordering. As a result, more work per iteration is required. The following result is due to Beauwens [13]: suppose A is a nonsingular M-matrix, A = D₁ - C₁ = D₂ - C₂ are splittings of A, and D₁ = L₁U₁ and D₂ = L₂U₂ are incomplete factorizations of A such that the set of nonzeros of L₁ + U₁ is contained in the set of nonzeros of L₂ + U₂. Then

$$\rho(D_2^{-1}C_2) \le \rho(D_1^{-1}C_1)\ . \qquad (5.10)$$

This result was used by Elman & Golub in [42, Thm. 2], and can also be used directly for our reduced system:

Proposition 5.2. If the reduced matrix is an M-matrix, the spectral radius of the iteration matrix associated with ILU(p) factorizations is smaller than or equal to the spectral radius of the iteration matrix associated with block Jacobi with the 2D splitting.

Prop. 5.2 applies, for example, to the constant coefficient case where be, cd, fg > 0.
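Returning for a moment to Proposition 5.1: the estimate (5.9) indeed reproduces the last column of Table 5.1 (a one-line check; the script is ours):

    # estimate column of Table 5.1 from (5.9): nz(L1+U1) ~ (45/19) nz(S)
    for nz_S in (3760, 34400, 121104):
        print(round(nz_S * 45 / 19))     # 8905, 81474, 286825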
For the model problem we can also obtain an existence result for the ILUT factorization. To show this, we need the following [95, p. 290]:

Definition 5.1. An N×N matrix H is an M-matrix if its last entry on the diagonal is nonnegative, the rest of the diagonal values are all positive, the off-diagonal values are nonpositive, and

$$\sum_{j=i+1}^{N} h_{ij} < 0\ , \qquad 1 \le i < N\ . \qquad (5.11)$$

Proposition 5.3. Consider the model problem and suppose that either upwind differences are used, or centered differences with be, cd, fg > 0. Then the ILUT factorization exists for the reduced matrix (for all the orderings that have been considered in Chap. 3).

Proof. The diagonal values of the matrix are a² - 2be - 2cd - 2fg in rows associated with interior points, and larger in rows associated with non-interior points. Since the matrix is an M-matrix in the cases specified in the proposition, all its off-diagonal elements are nonpositive. The matrix has at least one negative value to the right of the main diagonal in each row except the last, so Eq. (5.11) holds and the matrix satisfies Definition 5.1. In addition, since the matrix is diagonally dominant with positive diagonal entries, all its row sums are guaranteed to be nonnegative. This special property is termed in [95] a diagonally dominant M-matrix. The matrix thus satisfies all the conditions required for the existence of the ILUT factorization, by [95, Thm. 10.13]. ∎

In Fig. 5.5 the nonzero pattern of the factors corresponding to an ILU factorization with drop tolerance 10⁻² is presented. Here the factorization was obtained using Matlab's command 'luinc', where the dropping rule is: for each entry, the threshold is the drop tolerance specified by the user, multiplied by the norm of the column containing the entry (with the exception that diagonal values of the matrix U are never dropped). The matrix shown in the figure comes from centered difference discretization of the equation -Δu + 10(x,y,z)ᵀ∇u = f. Note that ILU factorizations based on thresholds have a flexibility which does not exist in ILU(p) factorizations: more fill-in can be allowed for more ill-conditioned problems.

Figure 5.5: sparsity pattern of the factors for ILU with drop tolerance 10⁻² (nz = 9,146).

Block factorizations based on 1D partitioning require the following amount of computational work. The number of block rows is n²/4 and the semibandwidth of the matrix is n + 1. The number of nonzero blocks in the lower part, as well as in the upper part, of this matrix is n² + O(n). Typically there are 9 nonzero block elements in each row, each element being a 2n×2n matrix. In total there are approximately 5n² operations involving 2n×2n matrix products, and n² inversions or approximate inversions of 2n×2n matrices. Taking the exact inverses of the diagonal blocks, and considering the fill-in in off-diagonal block elements, the total cost is O(n⁵) operations, which is not satisfactory. Variants which restrict the fill-in in the off-diagonal blocks are possible but have not been tested. Some tests with incomplete LU factorizations based on 2D partitioning with approximate inverses of the pivotal blocks have been performed, and they have not been found particularly effective.
5.3 The Overall Cost of Construction of the Reduced System

One important issue in the implementation of one step of cyclic reduction is the cost of performing it. It will now be shown that the construction of the reduced system is an inexpensive component compared to the work involved in computing the solution; nevertheless, it is not negligible.

Consider first the constant coefficient case. Here the reduced matrix can be constructed in a straightforward manner, and there is no need to actually construct the unreduced matrix first and perform the Gaussian elimination. By the difference equation (2.7) it is evident that, since the computational molecule is gridpoint-independent, there are merely 2-3 floating point operations for each of the 19 components of the computational molecule (and fewer floating point operations in total for points next to the boundary). The overall computational work is thus negligible; in fact most of the construction time is spent on assigning the values of the computational molecule to the matrix. Since the reduced matrix has a block structure similar to that of the unreduced matrix, the overall construction work is almost equal to the work involved in constructing the latter.

The variable coefficient case requires more careful attention. Here each entry of the matrix is computed separately, as the values of the computational molecule are gridpoint-dependent. For both the reduced and the unreduced systems, the components of the computational molecule must be set up throughout the grid. First, the PDE coefficients for every gridpoint need to be computed. There are six trivariate functions, each of which needs to be evaluated at n³ gridpoints (or approximately this number). The cost of this step depends on the cost of evaluating the PDE coefficients and could be high. In addition, the components of the computational molecule need to be constructed. By Eq. (2.13) and (2.14), this takes approximately 125n³ operations for centered differences, and 105n³ operations for upwind differences.

As for the step of cyclic reduction, counting operations in Eq. (2.15), there is a total of about 125 floating point operations for computing the entries of the matrix in a row that corresponds to an interior gridpoint. However, certain values in the difference equation appear several times: each of the six quotients of the form b_{i,j,k}/a_{i,j-1,k}, c_{i,j,k}/a_{i-1,j,k}, d_{i,j,k}/a_{i+1,j,k}, e_{i,j,k}/a_{i,j+1,k}, f_{i,j,k}/a_{i,j,k-1}, g_{i,j,k}/a_{i,j,k+1} appears six times in the difference equation, and careful implementation reduces the number of floating point operations by 20%-25%. Even if the construction is done without such optimization (which is the case if block Gaussian elimination is applied directly), since the matrix is of size n³/2, the overall amount of floating point operations for constructing the reduced matrix is approximately 60n³. The construction of the right-hand-side vector takes approximately 3n³ operations. In Table 5.2 we present the actual amount of work that was required for the overhead of constructing the reduced matrix for a few grids, using Gaussian elimination.

    n       n³       Mflops   flops per gridpoint
    8       512      0.025          48.8
   12     1,728      0.090          52.1
   16     4,096      0.222          54.2
   20     8,000      0.442          55.3
   24    13,824      0.775          56.1
   28    21,952      1.243          56.6
   32    32,768      1.869          57.0

Table 5.2: computational work involved in the construction of the reduced matrix.
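The last column of Table 5.2 is just the per-gridpoint cost, Mflops·10⁶/n³; a one-line check (ours) recovers it from the other two columns, up to rounding of the reported Mflops:

    # flops per gridpoint = Mflops * 1e6 / n^3 (last column of Table 5.2)
    for n, mflops in [(8, 0.025), (12, 0.090), (16, 0.222), (20, 0.442),
                      (24, 0.775), (28, 1.243), (32, 1.869)]:
        print(n, mflops * 1e6 / n**3)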
The reduced system is constructed once and for all, and the construction is clearly an inexpensive component compared to setting up a preconditioner (approximately 200n³ floating point operations for setting up the ILU(0) factorization) or solving the system (at least 80n³ operations per iteration for Krylov subspace solvers like Bi-CGSTAB or CGS). Finally, it is noted that the construction can be almost fully parallelized or vectorized, as illustrated in Sec. 5.5.

5.4 Numerical Experiments

This section provides details on numerical experiments. Results of some numerical experiments for stationary methods have been given in Chap. 4. The computations have been performed on an SGI Origin 2000 machine, with four parallel 195 MHz processors, 512 MB RAM and 4 MB cache. There is no parallel component in the implementation; the actual computations are performed by a single processor on the machine. The computer programs are written in Matlab 5.

Test Problem 1

Consider the separable problem

$$-\Delta u + p_1\,x\,u_x + p_2\,y\,u_y + p_3\,z\,u_z = w(x,y,z) \qquad (5.12)$$

on Ω = (0,1)×(0,1)×(0,1), with zero Dirichlet boundary conditions, where w(x,y,z) is constructed so that the solution is

$$u(x,y,z) = x\,y\,z\,(1-x)(1-y)(1-z)\,\exp(x+y+z)\ . \qquad (5.13)$$

First, some convergence analysis for block stationary methods is provided. For notational convenience, denote γ = p₁h/2, δ = p₂h/2 and μ = p₃h/2. Suppose h is sufficiently small and centered difference discretization is performed. Below we refer to the components of the computational molecule, as explained in Chap. 4. Since the problem is separable, each of the components of the computational molecule depends only on one variable, and thus can be written with a single subscript. The components in the x-direction satisfy

$$c_{i+1}\,d_i = (1 + \gamma x_{i+1})(1 - \gamma x_i) = 1 + \gamma h - \gamma^2 h^2 (i^2 + i)\ ,$$

which leads to

$$1 + \gamma h - \gamma^2(1-h) \le c_{i+1}d_i \le 1 + \gamma h - 2\gamma^2 h^2\ .$$

If -1 < γ < 1/(1-h) then c_{i+1}d_i > 0. For b_{j+1}e_j and f_{k+1}g_k the bounds are obtained in an identical manner. The center of the computational molecule is exactly a = 6. In terms of the PDE coefficients, the condition on γ means that the convergence analysis performed in Chap. 4 for centered difference discretization of the convective terms is applicable if the PDE coefficients are O(n). In this case the matrix is symmetrizable by a real diagonal matrix. Using the notation of Chap. 4, let S be the symmetrized matrix, let β_x = 1 + γh - 2γ²h², β_y = 1 + δh - 2δ²h², β_z = 1 + μh - 2μ²h², and let S* be a modified version of S, such that each occurrence of c_{i+1}d_i, b_{j+1}e_j or f_{k+1}g_k in S is replaced by the corresponding bound, namely β_x, β_y or β_z respectively. Since S* ≥ S, S* is a symmetrized version of a matrix corresponding to the constant coefficient case, and by Cor. 2.5 it is a diagonally dominant M-matrix. Using Theorem 4.8, bounds on the convergence rates of block stationary methods are available.
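The bounds on c_{i+1}d_i derived above are immediate to confirm numerically (our script; p₁ = 1 and n = 24 are arbitrary choices):

    import numpy as np

    n = 24; h = 1 / (n + 1); p1 = 1.0
    g = p1 * h / 2                              # gamma
    i = np.arange(1, n + 1)
    prod = (1 + g*(i + 1)*h) * (1 - g*i*h)      # c_{i+1} d_i
    print(np.isclose(prod.min(), 1 + g*h - g**2*(1 - h)),
          np.isclose(prod.max(), 1 + g*h - 2*g**2*h**2))   # True True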
    splitting               1D                         2D
    n      n³        ρ      bound   ratio      ρ      bound   ratio
    8     512      0.793    0.894   1.13     0.682    0.826   1.21
   12   1,728      0.895    0.946   1.06     0.825    0.908   1.10
   16   4,096      0.937    0.968   1.03     0.892    0.944   1.06
   20   8,000      0.958    0.979   1.02     0.927    0.962   1.04
   24  13,824      0.970    0.985   1.02     0.948    0.973   1.03

Table 5.3: comparison between the computed spectral radii of the block Jacobi iteration matrices and the bounds, using centered differences, for the two splittings, with p₁ = p₂ = p₃ = 1.

As opposed to the constant coefficient case, here the bounds are sensitive to sign, and in general we find that they are tighter if the convective coefficients are negative. An explanation for this is that the values β_x, β_y and β_z are larger than 1 if p₁, p₂ and p₃ are positive. If, on the other hand, the latter are negative, then we obtain values of the type 1 - c₁h² - c₂h⁴ for β_x, β_y and β_z, with c₁, c₂ > 0. These values are closer to the values that would be obtained if the analogous constant coefficient problem were considered, for which the bounds have been shown in Chap. 4 to be very tight. Table 5.3 demonstrates the tightness of the bounds; in this case their quality is as good as that of the bounds for the constant coefficient case. In fairness we remark that for larger values of the PDE coefficients the tightness of the bounds deteriorates. An explanation for this is that the values β_x, β_y and β_z, which are attained for x = h, y = h and z = h, are very loose bounds for gridpoints whose coordinate values are close to 1.

In Table 5.4 the performance of solvers for the reduced and the unreduced systems for p₁ = 50, p₂ = 20, p₃ = 10 is compared. ‖r^(i)‖₂/‖r^(0)‖₂ < 10⁻¹⁰ was used as a stopping criterion, where r^(i) is the residual at the ith iterate. The method used is Bi-CGSTAB, preconditioned by ILU(0); in this example Bi-CGSTAB was implemented using Netlib's Matlab routine. The rate of increase of the iteration count as the grid is refined is in agreement with theory, at least if one assumes that for this well conditioned and mildly nonsymmetric system the convergence behavior is qualitatively the same as that of the conjugate gradient method for symmetric problems of the same type. The preconditioned matrix in this example has a condition number whose square root is of magnitude O(h⁻¹); thus it is expected that refining the mesh by a factor of 2 doubles the iteration count. The factors in time as well as in the number of floating point operations should be at least 16 as the mesh is refined by a factor of 2, since the iteration count is doubled and the size of the matrix is larger by a factor of 8.

    n        n³         iterations           time (sec.)             Mflops
                    unreduced  reduced   unreduced  reduced   unreduced  reduced
    8       512         11        6          0.26     0.13       0.59      0.34
   16     4,096         22       11          4.49     2.07       12.5       7.2
   32    32,768         45       23          97.3     47.1      300.2     177.8
   64   262,144         99       44         2,499    1,138      8,559     4,594

Table 5.4: comparison of solution work/time of the unreduced and the reduced solvers for various mesh sizes, using preconditioned Bi-CGSTAB, for p₁ = 50, p₂ = 20, p₃ = 10.

The solver for close-to-symmetric systems converges more slowly than the solver for more strongly nonsymmetric systems: a solver applied to p₁ = p₂ = p₃ = 1 on a 64×64×64 grid (262,144 unknowns) converges in 173 iterations, and the CPU time is 4,390 seconds.
The convergence is thus slower than the convergence for larger magnitudes of the PDE coefficients on the same grid size - compare to the last row of Table 5.4. Fig. 5.6 presents the norm of the relative residual throughout the iteration for the case which is closer to symmetric.

Figure 5.6: l₂-norm of the relative residual for preconditioned Bi-CGSTAB applied to a 262,144×262,144 system corresponding to discretization of the test problem with p₁ = p₂ = p₃ = 1, on a 64×64×64 grid.

In Table 5.5, estimates of the condition numbers of the reduced and unreduced matrices for p₁ = 500, p₂ = 200 and p₃ = 100 with upwind difference discretization are presented. The estimates were obtained using Matlab's command 'condest'.

    n      κ₂(U)       κ₂(R)
    8       420.85      195.04
   12     1,049.0      489.83
   16     1,859.4      866.53
   20     2,693.7     1,258.6
   24     3,578.4     1,657.1

Table 5.5: comparison between estimates of condition numbers of the unreduced matrix (denoted by 'U') vs. the reduced matrix ('R'), for p₁ = 500, p₂ = 200, p₃ = 100.

Next, we discuss the question of which method should be used. As far as Krylov subspace solvers are concerned, it has been shown in [82] for three methods (CGN, CGS, GMRES) that for each of them there can be found an example where it fails and an example where it is superior to the other methods. For our reduced system we have no knowledge of how to give an analytical justification for preferring one method over another, and we have to settle for numerical experiments for particular cases and/or follow "recipe" recommendations (e.g. the "flowchart" given in [11] for picking a solver).

Consider the case p₁ = p₂ = p₃ = 10. That is,

$$-\varepsilon\Delta u + (x,y,z)^T\,\nabla u = f\ , \qquad (5.14)$$

with ε = 1/10. Zero is used as an initial guess, and centered difference discretization is performed on a 16×16×16 grid. In Table 5.6 we examine the performance of block stationary methods. The convergence criterion was ‖r^(i)‖₂/‖r^(0)‖₂ < 10⁻¹⁰. The values for 'Mflops' have been obtained by using Matlab's command 'flops'. In all cases Matlab's built-in functions have been used.

    method            time (sec.)   iterations
    Jacobi               16.02          305
    Gauss-Seidel         11.56          152
    SOR (ω = 1.2)         7.98           97
    SOR (ω = 1.4)         4.38           53
    SOR (ω = 1.5)         2.81           34
    SOR (ω = 1.6)         3.80           46
    SOR (ω = 1.8)         8.76          106

Table 5.6: performance of block stationary methods for Test Problem 1.

From Table 5.6 we can conclude that, as expected, Jacobi and Gauss-Seidel are slow to converge. As for SOR, the optimal relaxation parameter is approximately ω* = 1.5, and for this value the performance of the scheme is very good. However, in this case our bound fails to provide an effective approximation to the optimal relaxation parameter: using our bound for the block Jacobi scheme, the estimate obtained for ω* was 1.851, which is obviously far from the actual optimal relaxation parameter.

Moving on to preconditioned Krylov subspace solvers, the preconditioners used were ILU(0), ILU with numerical dropping (ND), block (1D) Jacobi, and SSOR.

    preconditioner   time (sec.)   Mflops   nz(L+U)/nz(S)
    ILU(0)              14.1         0.42        1
    ND(0.1)              0.12        0.074       0.07
    ND(0.01)             2.04        5.47        1.28
    ND(0.001)            5.94       22.29        3.68

Table 5.7: construction time/work of ILU preconditioners.
The work involved in setting up block Jacobi and SSOR is negligible. For the ILU factorizations, Table 5.7 presents the construction time and work (in megaflops), as well as the ratio between the number of nonzeros generated in the factors and the number of nonzeros of the coefficient matrix. As mentioned earlier, the latter is useful for estimating the amount of work involved in the preconditioning solve. The results in Table 5.7 are difficult to interpret. In particular, the construction time of the ILU(0) factorization is extremely long, but the flop count is very small. This apparent contradiction seems to have to do with Matlab's implementation of the 'luinc' command, and for the purpose of computing overall work it is reasonable to take the flop count as the more reliable datum - the number of flops is in accordance with the estimate specified in Eq. (5.8).

In Tables 5.8-5.11 we present our results for the various preconditioners combined with various Krylov subspace solvers. The Krylov subspace methods considered (BiCG, QMR, Bi-CGSTAB, CGS) require two matrix-vector products and two preconditioner solves in each iteration. For Bi-CGSTAB, convergence can occur after "half an iteration" [11]. We also remark that restarted GMRES was tested as well and was generally slower to converge.

  preconditioner   time (sec.)   Mflops   iterations
  ILU(0)           3.77          9.8      22
  ND(0.1)          6.05          15.28    49
  ND(0.01)         3.28          8.77     18
  ND(0.001)        3.22          9.23     11
  BJ               7.08          18.22    49
  SGS              4.55          11.99    27

Table 5.8: performance of QMR for Test Problem 1.

  preconditioner   time (sec.)   Mflops   iterations
  ILU(0)           3.51          8.9      22
  ND(0.1)          5.82          13.99    51
  ND(0.01)         3.06          7.99     18
  ND(0.001)        3.03          8.55     11
  BJ               6.56          16.35    49
  SGS              4.26          10.91    27

Table 5.9: performance of BiCG for Test Problem 1.

In general, Bi-CGSTAB and CGS are more efficient than BiCG and QMR; in fact, the latter are slower by a significant factor. We have no simple explanation for this; note that BiCG and QMR involve products with the transpose of the matrix, which in our implementation and for this problem is available, but might not always be. CGS and Bi-CGSTAB are on the same level of efficiency for this problem, with marginal differences in performance (except for an unclear difference between their performance when block Jacobi preconditioning is used).

  preconditioner   time (sec.)   Mflops   iterations
  ILU(0)           1.8           6.01     12
  ND(0.1)          3.53          11.07    30
  ND(0.01)         1.51          5.15     9½
  ND(0.001)        1.26          4.83     6½
  BJ               3.61          11.79    27½
  SGS              2.17          7.25     14½

Table 5.10: performance of Bi-CGSTAB for Test Problem 1.

  preconditioner   time (sec.)   Mflops   iterations
  ILU(0)           1.74          5.30     13
  ND(0.1)          3.56          10.33    29
  ND(0.01)         1.51          4.91     11
  ND(0.001)        1.23          4.66     6
  BJ               4.31          13.20    37
  SGS              1.90          6.11     15

Table 5.11: performance of CGS for Test Problem 1.

Of the preconditioners, ILU(0) seems to be the most effective in this case. We remark that when testing problems in strongly convection-dominated regimes, we did find examples where ILU(0) preconditioned solvers performed poorly. SGS is also competitive, in particular because it involves a small set-up time.

Finally, we mention that in cases where the optimal relaxation parameter (or a reasonable approximation to it) is available for the SOR scheme, it performs very well and is highly competitive with preconditioned Krylov subspace solvers. If the bounds presented in Chap. 4 are tight, a good approximation to the optimal relaxation parameter can be obtained. SOR requires only one matrix-vector product per iteration, and an iteration is thus significantly cheaper than an iteration of a preconditioned Krylov subspace solver. As an example, consider the same problem with PDE coefficients p_1 = p_2 = p_3 = 1, for which the optimal relaxation parameter is approximately ω* = 1.5. For this value the SOR solver converges within 37 iterations. However, the convergence analysis provided the approximation ω = 1.599; for this value the SOR scheme converges within 46 iterations, and for a fair comparison this is the datum that should be used. In comparison, ILU(0) preconditioned CGS converges in this case within 11 iterations and Bi-CGSTAB converges within 10½ iterations. Thus SOR and these two Krylov subspace solvers converge within almost the same number of matrix-vector products. If the construction work of the preconditioner is added, the conclusion is that SOR is faster. When the grid is finer, the estimate of the optimal relaxation parameter is more accurate and SOR performs even better. In addition, SOR requires much less storage.
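The approximation ω = 1.599 quoted above is what Young's formula yields when a bound on the spectral radius of the block Jacobi iteration matrix is plugged in; assuming the 1D bound 0.968 for n = 16 from Table 5.3 is the one being used, a sketch of the computation is

  ω = 2 / (1 + √(1 - ρ²)) = 2 / (1 + √(1 - 0.968²)) ≈ 1.599 .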
Test Problem 2

Consider the following singularly perturbed problem:

  -ε Δu + ∇v · ∇u = f ,                                             (5.15)

where

  v = ½ (x² + y² + z²) ,                                            (5.16)

subject to nonzero Dirichlet-type boundary conditions on a box domain (a_x, b_x) × (a_y, b_y) × (a_z, b_z) which contains the origin. This class of problems is currently being studied by Sun & Ward; see [103] and the references therein. Eq. (5.15) is similar to the equation considered in Test Problem 1, but now there are turning points inside the domain, which make the problem a much more difficult one. The problem arises in a large variety of applications, including semiconductors, population growth [75], the exit problem [77], and more. One interesting phenomenon here is that for any number of space dimensions the continuous problem has an isolated, exponentially small eigenvalue. The eigenvalue problem for a class of one-dimensional problems which contains the one-dimensional version of Eq. (5.15) has been studied by De Groen [30]. Asymptotic analysis (of various aspects) has been done by Ludwig [75], Grasman & Matkowsky [60], Matkowsky & Schuss [77], and others (see [103] for a list of references). The analysis in these papers shows that the analytical solution of (5.15) has boundary layers near the edges and is constant in the interior of the domain. In [60] a variational approach is used to determine the constant as a weighted average of the boundary data.

Finite difference schemes have truncation errors which might be larger than ε, and it is not clear how accurate the computed solution is. Adjerid, Aiffa & Flaherty [1] have experimented with a similar two-dimensional problem (with different boundary conditions), using a finite element code. They observed that iterative solvers converge quickly to a constant in the interior of the domain, but not necessarily to the exact constant predicted by Grasman & Matkowsky's asymptotic analysis.

Following Sun's suggestion [102], the Il'in scheme was used [74]. Essentially it is a centered difference scheme, modified by using fitting parameters associated with the functions coth x, coth y, coth z applied at the gridpoints. When ε is very small, the scheme behaves like the upwind scheme; when ε is large, the scheme is second order accurate [74] (see also [97] for useful analysis and results on the accuracy of finite difference schemes applied to singularly perturbed problems).
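A sketch of the fitting in one dimension (for -ε u'' + p u' with mesh width h; the precise multidimensional fitting parameters used in the implementation may differ in detail): the diffusion coefficient ε in the centered scheme is replaced by

  ε̃ = (p h / 2) coth( p h / (2ε) ) .

As ε → 0 with h fixed, ε̃ → |p| h / 2, which is precisely the artificial diffusion introduced by upwinding; when p h / (2ε) is small, the expansion coth t = 1/t + t/3 + O(t³) gives ε̃ = ε + O(h²), consistent with second order accuracy.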
Eq. (5.15) has been considered with boundary conditions u = x + y + z. The leading order of the asymptotic expansion requires, in general, computing a multidimensional integral (see [60, p. 594] for the closed form). For the particular PDE coefficients in (5.15), the constant can be obtained by a weighted sum over the contact points of the function v with the boundary that are closest to the origin. There are six contact points in this case, all of them within distance 1 from the origin; the constant is expected to be zero in this case.

The cyclically reduced operator inherits the properties of the continuous problem: for ε = 0.03 and discretization on a 16 × 16 × 16 grid, Matlab's command 'eigs' gives the three smallest eigenvalues of the reduced matrix (multiplicity excluded) as 2.64 · 10^{-6}, 1.93 and 3.59. For this case the condition number estimated by Matlab's 'condest' command is approximately 10^9.

Table 5.12 presents flop counts and the numerical solution for a problem discretized on a 16 × 16 × 16 grid, using ILU(0) preconditioned Bi-CGSTAB. The numbers represent averages over 5 tests, each started with a random initial guess. The sum of the absolute values of the solution at the interior points of the grid, averaged over the (n-2)³ interior points and presented in the third column, can be considered an approximation to the constant in the interior. For ε = 0.03 and ε = 0.02 the accuracy is satisfactory. The average (third column) for ε = 0.03 is larger because the boundary layer is less "sharp" in this case. For ε = 0.01 there is an error in the solution; here ε is already significantly smaller than h, and in the numerical tests there were fluctuations in the constants obtained, as well as in the computational work. Differences in accuracy between the reduced and the unreduced systems have been observed to be negligible. For this class of problems the reduced solver does not converge significantly faster than the unreduced solver, in particular as ε gets smaller.

  ε      Mflops   (n-2)^{-3} Σ|u|
  0.03   10.0     7.54e-4
  0.02   11.1     2.71e-5
  0.01   14.6     0.53

Table 5.12: Test Problem 2: overall flop counts and average computed constants for three values of ε, when ILU(0) preconditioned Bi-CGSTAB is used.
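The spectral computation quoted above can be reproduced in a few lines (a sketch; R is assumed to hold the reduced matrix for ε = 0.03 on the 16 × 16 × 16 grid):

% Sketch: a few smallest-magnitude eigenvalues of the reduced matrix R,
% and a 1-norm condition number estimate.
d = eigs(R, 6, 'sm');    % exposes the isolated, exponentially small eigenvalue
c = condest(R);          % approximately 1e9 in this case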
For severely ill-conditioned problems, ILU(0) preconditioned iterations do not converge at all. Here the only preconditioners that have been found effective are incomplete LU factorizations with a very small drop threshold. Table 5.13 presents details of numerical experiments for ε = 0.005. Notice the large amount of computational work that is required. The ratio between the number of nonzeros in the factors and the number of nonzeros of the matrix is given in the column titled 'nz(L+U)/nz(S)'; for comparison, the same ratio for the complete LU factorization is 27.93 in this case. Since ε is very small, the constant in the interior cannot be expected to be computed accurately with this discretization. We note, though, that the solution for numerical dropping with both thresholds 10^{-4} and 10^{-6} has been observed to be constant in the interior, and thus behaves qualitatively like the leading order of the analytical solution. This example demonstrates the robustness of incomplete factorizations based on dropping. The computational work that is required is large, but these are the only preconditioners for which the solver converges. Notice also that once these preconditioners attain convergence, the tradeoff between the larger work per iteration and the small iteration count translates into almost the same amount of overall computational work (last two rows of the table).

  method        nz(L+U)/nz(S)   Mflops
  ILU(0)        1               N/C
  ND(10^{-2})   1.43            N/C
  ND(10^{-4})   3.25            36.2
  ND(10^{-6})   5.09            40.4

Table 5.13: comparison of the performance of various incomplete factorizations in conjunction with CGS, for Test Problem 2. 'N/C' stands for no convergence.

Test Problem 3

Consider the following problem:

  -Δu + v · ∇u = w ,                                                (5.17)

on the unit cube, where

  v = (sin(πy), sin(πx), 1)^T ,                                     (5.18)

subject to Neumann boundary conditions on z = 0, 1 and Dirichlet boundary conditions on the rest of the boundary. This is an example of a nonseparable problem, for which the convergence analysis of Chap. 4 does not apply. The Neumann conditions in the z-direction are discretized using centered difference schemes, by adding two artificial planes of variables (whose z indices are -1 and n + 1). In order to examine the accuracy of the scheme, w has been constructed so that the analytical solution is u(x, y, z) = sin(πx) · sin(πy) · cos(πz). Table 5.14 shows the well-behaved decay of the norm of the error and illustrates the second order accuracy of the scheme.

  n     ||e||_2 / √N
  8     0.004346
  16    0.001123
  32    0.000285

Table 5.14: norm of the error for Test Problem 3, for various sizes of the grid. In the header, e stands for the error and N stands for the number of unknowns.

Figure 5.7: a 2D slice of the numerical solution of Test Problem 3, at a fixed value of z.

Table 5.15 compares the performance of the reduced solver to the performance of the unreduced solver, using ILU(0) preconditioned Bi-CGSTAB. The criterion used here is the number of flops. All the components are included: construction of the system, construction of the preconditioners, solution, and back substitution for the gridpoints which were previously eliminated by the reduction. The performance of the reduced solver is better, and by a larger margin as the grid gets finer. This reflects the fact that the overhead of constructing the cyclically reduced matrix becomes less significant compared to the solve time as the number of unknowns increases. The numbers in the last row (for a 32 × 32 × 32 grid) are typical of the factor of improvement; here the solving time is by far longer than the construction, and the ratio of improvement is expected to be similar for larger grids.

  n     n^3       Mflops
                  unreduced   reduced
  8     512       1.17        0.91
  16    4,096     14.03       10.13
  32    32,768    237.83      139.17

Table 5.15: comparison between the performance of the unreduced and the reduced solvers for various grids, using ILU(0) preconditioned Bi-CGSTAB, for Test Problem 3.
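For reference, the right-hand side w that produces the manufactured solution above can be written out explicitly (a sketch; the thesis does not list w, so this is simply -Δu + v·∇u computed by hand):

% Sketch: manufactured right-hand side for u = sin(pi*x)sin(pi*y)cos(pi*z)
% with v = (sin(pi*y), sin(pi*x), 1); here -Laplacian(u) = 3*pi^2*u.
u = @(x,y,z) sin(pi*x).*sin(pi*y).*cos(pi*z);
w = @(x,y,z) 3*pi^2*u(x,y,z) ...
    + pi*sin(pi*y).^2.*cos(pi*x).*cos(pi*z) ...   % v_1 * u_x
    + pi*sin(pi*x).^2.*cos(pi*y).*cos(pi*z) ...   % v_2 * u_y
    - pi*sin(pi*x).*sin(pi*y).*sin(pi*z);         % v_3 * u_z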
5.5 Vector and Parallel Implementation

Since the reduced matrix has a clear block structure, parallel implementation is an attractive option both for constructing the matrix and for computing the solution. The need for parallelism is especially important because systems of equations associated with three-dimensional problems are typically very large. For example, a grid of size 64 × 64 × 64 on the unit cube, which might be too coarse for difficult applications, already requires solving a system of more than 250,000 unknowns. The problem here is storage as well as time. If the matrix is too big to fit in the physical memory of the computer, swapping (paging) takes place, which substantially slows down the execution (the computation time could slow down by a factor of 50 [87]). This section addresses a few issues associated with parallel and vector implementation. It should be viewed only as a general overview; in practice our implementation is vectorized, written in Matlab, and includes no parallel components.

Overview

General overviews of the parallel solution of linear systems are given, for example, by Demmel, Heath & van der Vorst [32], and in an earlier review by Ortega & Voigt [84]. The earliest forms of parallelism (or rather, vectorization) were based on having a single processor with multiple functional units, such as multipliers and adders. During compilation, dependence analysis was done, and so, if for example a certain computation involved four additions and two multiplications which were independent of one another, these operations were performed simultaneously using four adders and two multipliers.

The next stage was pipelining. Golub and Van Loan [55] illustrate pipelining by an analogy to an assembly line in car manufacturing. Suppose a certain operation can be divided into m stages, each of which takes t seconds to complete. If we have n such operations to perform, then performing all of them sequentially takes m · n · t seconds overall. If, instead of waiting for the first operation to be fully completed before performing the second, we perform the first stage of operation 2 while the second stage of operation 1 is taking place, and so on, the overall time is the necessary time m · t, plus the overhead of waiting for operation n to start, which is (n - 1) · t. This is a substantial improvement over the time it takes to perform the computation sequentially. An important term here is speed-up: the time it takes to perform an algorithm sequentially, divided by the time it takes to perform it in the parallel environment the user has. In this particular example the speed-up is

  m n t / ((m + n - 1) t) = m n / (m + n - 1) ,

which is close to m for n sufficiently large.
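As a numerical illustration (with hypothetical numbers): m = 4 pipeline stages and n = 1000 operations give a speed-up of 4 · 1000 / 1003 ≈ 3.99, already close to the asymptotic factor m = 4.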
The first vector computers used pipelined functional units, and had the capability to work very efficiently on vectors and arrays by using vector instructions as part of their instruction sets. Such instructions included loading and storing vectors, adding or subtracting vectors, performing dot products, and so on. Users were not required to specify their desire to perform these operations in a vectorized manner; rather, the object code for pipelining was typically generated at compilation time, and it was the role of the compiler to identify components that could be pipelined.

Today's parallel computations are based on multiprocessor systems, with several processors, each having a CPU, local memory, and the capability to pass messages to other computers in the network. Loosely speaking, there are two main philosophies as to the architecture of a parallel network [32]:

1. Shared memory architecture.
2. Distributed memory architecture.

Shared memory architectures are ones where each processor has access to a global, common memory area. The advantage of this architecture is that, as far as the user is concerned, any processor can access any data, and very little work has to be done in order to manage data access. This makes the programming considerably easier. On the other hand, locality of data (as is the case with discretized partial differential equations) cannot be exploited as efficiently as in other architectures [95].

In a distributed memory environment a large number of processors are connected in some topological form; each of them has its own local memory and executes its own programs. Data is sent to other processors in the network in the form of messages. The network topology defines how the processors are interconnected. Examples of possible topologies are rings, meshes, hypercubes and trees. Descriptions of these architectures can be found, e.g., in [95],[55]. Meshes of processors are well suited to the solution of discretized partial differential equations, as groups of gridpoints can be mapped straightforwardly into each of the processors. A 2D mesh is illustrated in Fig. 5.8.

Figure 5.8: 2D mesh architecture (3 × 3).

When considering the solution of a linear system on a parallel machine using preconditioned Krylov subspace methods, different aspects need to be addressed: preconditioner set-up, matrix-vector multiplication, vector updates, dot products and the preconditioner solve. For a general overview, see [87]. The preconditioner set-up and the preconditioner solve are usually the bottleneck. Dot products are easy to handle, but require a fairly large amount of communication, as each processor performs the computation for its assigned variables while the final result is required by all the processors.

Construction of the reduced system

For the reduced system, when considering a distributed architecture, it is convenient for simplicity to refer to a 3D mesh of p³ processors, and to assume that we have a uniform mesh with n³ unknowns and that p divides n (if this is not so, we can proceed by mapping an unequal number of gridpoints into each of the processors). One important advantage of the ordering strategies discussed in Chap. 3 is that all the block rows have the same size, and thus the issue of load balancing [32],[55],[87] in this particular aspect is resolved. Our working assumption is standard [95]: pairs of rows and unknowns are handled by the same processor; that is, row number i and unknown number i are both mapped into the same processor. Usually for PDEs, due to the locality of the operators, the geometry of the gridpoints is the source of the mapping of the unknowns into the processors in the network. We can thus assign subcubes of the original domain to each processor, as sketched below.
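A minimal sketch of such a subcube mapping (a hypothetical helper, assuming p divides n): gridpoint (i, j, k) of the n × n × n grid is assigned to processor (⌈ip/n⌉, ⌈jp/n⌉, ⌈kp/n⌉) in the p × p × p mesh.

% Sketch: map gridpoint (i,j,k) to its processor in a p x p x p mesh,
% assuming an n x n x n grid with p dividing n.
proc = @(i,j,k,n,p) [ceil(i*p/n), ceil(j*p/n), ceil(k*p/n)];
proc(5, 12, 16, 16, 4)     % gridpoint (5,12,16) -> processor (2,3,4)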
As a result, each processor constructs a part of the reduced matrix: a rectangular matrix whose number of rows is the number of unknowns associated with that processor, and whose number of columns is the size of the reduced matrix. If only a 2D mesh of processors is available, the mapping of the unknowns can be done by assigning "stripes", for example of size n × 2 × 2 (equivalent to the 1D sets of the two-plane ordering described in Chap. 3), to each of the processors. The drawback here is that this allows less flexibility than a 3D mesh of processors in dynamically changing the orientation of the blocks relative to the independent variables x, y and z.

The construction of the reduced matrix entries is divided into two stages:

1. Compute the components of the seven-point operator.
2. Perform the algebraic reduction.

The first stage can be parallelized in a straightforward manner. For the reduction step, constructing the row associated with a certain gridpoint requires the values of the computational molecules of its neighbors. If (i, j, k) is the point being considered, then its furthest neighbors are (i ± 2, j, k), (i, j ± 2, k), (i, j, k ± 2); the rest of the neighbors are within the smaller distance √2 (in units of h). Each processor must therefore have data on an external interface layer of thickness 2 on each of the six sides of its subcube (see Fig. 5.9). There are two ways to obtain the values of the computational molecule associated with the gridpoints belonging to the external interface layer: either by communicating with neighboring processors, or by computing these values as part of the first stage, at the cost of some overhead in arithmetic. Since typically the number of gridpoints in the mesh is very large, and since communication would require message exchange with as many as 26 processors, the second option (overlap in arithmetic) seems the better alternative, as it eliminates communication.

Figure 5.9: a 2D slice of the gridpoints associated with one processor. The local and external interface layers for the reduced system have a "thickness" of 2 unknowns.

The computation of the entries of the reduced matrix is thus almost fully parallelizable, with minimal communication, and with overhead caused by the additional computation of components of the computational molecule for the external interface variables. Suppose computing a value of the computational molecule of the seven-point operator takes f_1 flops and computing each component of the reduced computational molecule takes f_2 flops. Then, for stage 1, the amount of computation per processor (referring to processors associated with an interior subdomain) is (n/p)³ · f_1 flops, and for stage 2 the work involved is ½ · (n/p)³ · f_2 flops. Thus the speed-up in arithmetic is nearly p³ if n ≫ p.

In a vector environment the computation of the components of the matrix is vectorizable; both construction diagonal by diagonal and construction block by block can be vectorized efficiently. When constructing diagonal by diagonal, the idea is to reshape the three-dimensional arrays which contain the previously computed values of the computational molecule of the cyclically reduced operator (which can clearly be done in a vectorized manner) into a one-dimensional array, and to correct to zero those values of the computational molecule which vanish due to the associated gridpoint being close to the boundary.
In an environment which has a parallel as well as a vector component, the construction of all the diagonals of the matrix can be done in parallel, provided that all the processors have copies of the values of the computational molecule of the whole domain (here, a shared memory environment might be appropriate). As an example, below we present a piece of code, written in Matlab syntax, which assigns two (arbitrarily picked) diagonals of the reduced matrix corresponding to natural lexicographic ordering.¹

% diagonal coupling to the gridpoints indexed (i, j-1, k-1):
dzmm = -( b(i,j,k-1).*f(i,j,k)./a(i,j,k-1) ...
        + b(i,j,k).*f(i,j-1,k)./a(i,j-1,k) );
dzmm(:,1,:) = 0;                 % zero the entries whose neighbor falls
dzmm(:,:,1) = 0;                 % outside the domain (j = 1 or k = 1)
diag_zmm = reshape(dzmm, sys_size, 1);
izmm = lny + lnx*lny;            % offset of this subdiagonal
diag_zmm(1:sys_size-izmm) = diag_zmm(izmm+1:sys_size);

% diagonal coupling to the gridpoints indexed (i+1, j+1, k):
dppz = -( d(i,j+1,k).*e(i,j,k)./a(i,j+1,k) ...
        + d(i,j,k).*e(i+1,j,k)./a(i+1,j,k) );
dppz(lnx,:,:) = 0;               % zero the entries at the far x and y
dppz(:,lny,:) = 0;               % boundaries (i = lnx or j = lny)
diag_ppz = reshape(dppz, sys_size, 1);
ippz = lny + 1;                  % offset of this superdiagonal
diag_ppz(ippz+1:sys_size) = diag_ppz(1:sys_size-ippz);

¹ In the actual Matlab program the implementation is different. The code presented here is only for the purpose of illustrating the general idea.

In the above code, a, b, d, e, f are three-dimensional arrays which hold the previously computed components of the computational molecule, and lnx, lny, lnz are the numbers of components in each direction. diag_zmm is the diagonal that contains the couplings to the gridpoints indexed (i, j-1, k-1), and diag_ppz corresponds to the gridpoints indexed (i+1, j+1, k). The integers izmm and ippz are the offsets of the diagonals; finally, the values are shifted so as to fit into the matrix. The shift is different for superdiagonals and subdiagonals.

Reordering is done by using vectors which map coordinates to row numbers in the matrix and vice versa. For example, for the two-plane ordering, the connection between a gridpoint's location ℓ in the matrix and its coordinate values is given by

  i = fix( ((ℓ-1) mod 2n) / 2 ) + 1 ,                                      (5.19a)

  j = 2·[ fix((ℓ-1)/(2n)) mod (n/2) ] + 2   if ℓ mod 4 = 0 or 1,
  j = 2·[ fix((ℓ-1)/(2n)) mod (n/2) ] + 1   if ℓ mod 4 = 2 or 3,           (5.19b)

  k = 2·fix( (ℓ-1)/n² ) + 1   if ℓ is odd,
  k = 2·fix( (ℓ-1)/n² ) + 2   if ℓ is even.                                (5.19c)

For the two-line ordering, the connection is

  i = ( (ℓ-1) mod n ) + 1 ,                                                (5.20a)

  k = fix( (ℓ-1)/(n²/2) ) + 1 ,                                            (5.20b)

  j = 2·fix( (ℓ - (k-1)·(n²/2) - 1)/n ) + 1   if k is odd,
  j = 2·fix( (ℓ - (k-1)·(n²/2) - 1)/n ) + 2   if k is even.                (5.20c)
Matrix-vector products for the reduced matrix

One drawback of the reduced matrix, compared to the unreduced matrix, is that its associated computational molecule is non-compact. As a result, when performing matrix-vector products each of the processors in the network must communicate with all the processors surrounding it, which amounts to 26 for "interior" processors. This can be observed by examining the computational molecule; see Fig. 2.4. However, the messages that need to be exchanged are very small in size, and the communication can be done simultaneously with other computations, as will be clarified below.

For the purpose of illustration, it is convenient to assign two colors to the processors in the network: suppose the processors are either "red" or "black", by the usual red/black ordering of a 3D mesh (see Fig. 2.1). Also, suppose the processor being considered is an interior processor in the mesh of processors, and let P stand for its "id number". Then the sizes of the messages exchanged by processor P and its neighboring processors in the process of computing the product of the reduced matrix with a vector are:

- (n/p)² numbers for each of the six processors of the opposite color. Here a layer of size (n/p) × (n/p) × 2 of external interface variables is required. (It contains (n/p)² unknowns, not 2(n/p)², because only the black gridpoints appear in the reduced grid.)

- n/p numbers or fewer for the processors of the same color as processor P.

Thus six processors transfer most of the data required. This is different from the unreduced system, where these same six processors transfer all the data required by processor P. The sizes of the messages are small; as an illustration, consider a network of 1,000 processors with 1,000,000 unknowns. Then six processors exchange 100 numbers each with processor P, and the rest exchange 10 or fewer numbers.

The local variables that cause the need to exchange messages are the ones close to the boundary: the local interface points [95]. For the reduced system, the thickness of this layer in terms of unknowns is two, just like the thickness of the external interface layer; this follows from the structure of the reduced computational molecule. For the unreduced system the thickness of these layers is one. The local points (see Fig. 5.9) are the points which form the dense "block diagonal" part of the rectangular matrix. The actual matrix-vector product consists of three separate stages, the first two of which can be done in parallel [95]: multiplication of the local variables, message exchange, and multiplication of the external variables.

The size of each subcube in the reduced grid is (n/p)³, and it contains ½(n/p)³ unknowns. In order to illustrate how this works, we refer below to a particular example: we take n = 16, p = 4 and refer to natural lexicographic ordering. The interior subcube is indexed (2, 2, 2) in the mesh of processors (the equality of the indices here has no special meaning whatsoever). The subcube indices in this case, in terms of the index of the original ordering of the reduced grid, are

  (547, 548, 555, 556, 563, 564, 571, 572) + i · 128 ,   i = 0, 1, 2, 3 ;

there are 32 = ½(n/p)³ gridpoints in total. The sparsity pattern of the 32 × 2048 rectangular matrix which is the part of the reduced matrix handled by processor P is depicted in Fig. 5.10, before the unknowns are locally ordered. In the top figure, the sparsity pattern of the original reduced (square) matrix illustrates the (two) 2D sets of gridpoints (see Chap. 3 for the definition) in which the subcube we are considering is contained; these are 32 of the rows between the two horizontal lines, namely between rows 512 + 1 and 1024 (indeed, 1024 - 512 = 2n²). The bottom figure shows the sparsity pattern of the rectangular matrix of size 32 × 2048 which contains all the rows associated with the gridpoints of the subcube.
As is evident, the matrix contains several nonzero blocks, denser in some places (which correspond to the locations of the local gridpoints).

Figure 5.10: the part of the reduced matrix that contains the gridpoints associated with subcube (2, 2, 2), and the rectangular matrix before local ordering in processor P has been done.

After local ordering of the unknowns is performed (which is done using the same strategy as for the whole reduced grid), the result is a rectangular matrix whose "diagonal block" is fairly dense and corresponds to the local variables. Fig. 5.11 illustrates this for our particular example. Here the band or the sparsity pattern of the matrix is not important; what is important is the density. In this example the number of nonzeros in the main block is 344, compared to 608 in the whole rectangular matrix (see Fig. 5.10). In practice, for large n, the situation is much better than in this example (where n is very small), in the sense that most of the nonzeros of the rectangular matrix appear in the main diagonal block and most of the rows have 19 nonzero elements (equal to the number of components in the reduced computational molecule). For these local points no communication is required in order to perform the matrix-vector product. In order to perform the multiplication of the local part more efficiently, a re-ordering of the "block diagonal" matrix could be done, so that all the local points are ordered first, followed by the local interface points.

Figure 5.11: the "local part" of the rectangular matrix depicted in Fig. 5.10, after re-ordering.

If n ≫ p, for the reduced system there are approximately ½(n/p)³ local points, only approximately 6(n/p)² local interface points, and about the same number of external points. That is, the external matrix (the rectangular matrix with the "block diagonal" excluded) has a number of nonzeros which is smaller by an order of magnitude than the number of nonzeros in the square matrix that corresponds to the "block diagonal". This is important, because it means that the communication required for exchanging the values of the external interface variables does not form a bottleneck - it can be done while a massive O(n³) computation is being carried out. The communication with the neighboring processors also includes transmitting the numbers corresponding to the local interface points; these are considered external interface points by the neighboring processors. Once the stage of simultaneous computation of the local matrix-vector product and exchange of local and external interface variables has been completed, what is left is to perform and add the inexpensive component of the external matrix-vector product. For the reduced matrix this operation takes approximately 12n² floating point operations.
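Schematically, on one processor the three stages might look as follows (a pseudocode-style sketch with hypothetical names: S_loc and S_ext denote the "block diagonal" and external parts of the processor's rectangular matrix, and the exchange step stands for whatever message-passing primitives the network provides):

% Sketch of the three-stage product y = S*x on one processor.
% Stage 1 and the message exchange (stage 2) proceed concurrently.
y = S_loc * x_loc;          % stage 1: local unknowns, no communication
% stage 2 (overlapped): send the local-interface values of x to the up to
% 26 neighboring processors, and receive the external values x_ext
y = y + S_ext * x_ext;      % stage 3: cheap external contribution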
Preconditioner set-up and solve

Figure 5.12: a 2D slice of processors in a 7 × 7 × 7 mesh that hold entries of 1D or 2D sets of unknowns associated with the unknowns mapped into the processor marked by 'X', or that communicate with this processor when matrix-vector products are performed.

It has already been mentioned that the solve is the costly part of the computation. One of the most important aspects here is the ordering strategy. In fact, the four-color block ordering (for 1D partitioning) and the red/black block ordering (for 2D partitioning) were introduced in Chap. 3, and their performance tested in Chap. 4, mainly for the purpose of vectorization and parallelization. These ordering strategies are less efficient than natural ordering. Intuitively, this can be explained by observing that the number of nonzeros in the (complete) LU factorization is larger for these orderings, and thus an incomplete LU factorization is less accurate. On the other hand, multicolor orderings are easily parallelizable. An experimental investigation of the effects of coloring on the performance of parallel solvers is provided, for example, in [88].

For the reduced system, the four-color scheme (Fig. 3.6) gives rise to solution in parallel for four blocks of unknowns, according to their color. It should be noted, however, that the diagonal blocks are block diagonal matrices. Fig. 5.12 depicts the processors which hold unknowns associated with a 1D or 2D block, or which communicate with the processor marked by 'X' when matrix-vector products are performed. In a network of p³ processors, approximately 3p processors will hold the entries of a 1D or a 2D set of unknowns of the reduced grid.

Table 5.16 presents iteration counts for ILU(0) preconditioned Bi-CGSTAB, applied to Test Problem 1 (Sec. 5.4) with p_1 = p_2 = p_3 = 2. The natural ordering is more efficient than the multicolor versions (the difference in percentage is not negligible) but would attain much smaller speed-up rates. Parallel solution techniques for the reduced system should be the subject of further investigation and are left as a possible future project.

  ordering        iterations
  Natural         16
  1D four-color   19½
  2D red/black    20½

Table 5.16: comparison of iteration counts of natural and multicolor orderings for Test Problem 1 with p_1 = p_2 = p_3 = 2.

Chapter 6

Summary, Conclusions and Future Research

In this dissertation a three-dimensional cyclically reduced linear system of equations arising from the steady-state convection-diffusion equation has been derived and analyzed. Analysis of cyclic reduction for non-self-adjoint problems was first carried out only in the early 1990s, when Elman & Golub considered the two-dimensional problem. The derivation and analysis of the three-dimensional non-self-adjoint problem has been done in this thesis for the first time. A summary of the main results, conclusions and suggestions for future research on related topics is given below. The main results of this thesis are also presented in [62],[63],[64].

6.1 Summary and Conclusions

The following issues have been addressed:

1. Derivation of the linear system for both the constant and the variable coefficient problems. The step of cyclic reduction for three-dimensional problems has been described in detail, and general numerical properties of the reduced matrix have been established. It has been shown that the reduced matrix is diagonally dominant, symmetrizable or an M-matrix whenever the original matrix is; moreover, there are ranges of PDE coefficients for which only the reduced matrix has such properties. The reduced matrix is generally better conditioned than the original matrix.

2. Orderings. The reduced grid is different from a standard tensor-product grid, and reordering of the unknowns is an important issue.
A general block ordering strategy for three-dimensional grids has been presented. The approach is based on referring to one-dimensional or two-dimensional grids of sets of gridpoints and their associated block computational molecules. A highly effective family of two-plane orderings has been presented, and comparisons with other ordering strategies have been performed. Natural as well as multicolor block versions have been considered.

3. Symmetrization. In order to obtain bounds on the convergence rates of stationary methods, Elman & Golub's strategy of finding a diagonal symmetrizer has been adopted. Here it turns out that the three-dimensional case is similar to the two-dimensional case: in both cases the reduced matrix can be symmetrized for a substantially larger range of PDE coefficients than the original matrix. In particular, the reduced matrices for both the 2D and 3D problems can be symmetrized when all the mesh Reynolds numbers are larger than 1 in magnitude.

4. Bounds on convergence rates. A comprehensive convergence analysis has been performed for block stationary methods. First, tight bounds on the convergence rates of the block Jacobi scheme have been derived. The convergence analysis applies to upwind difference discretization with any values of the PDE coefficients, or to centered difference discretization with mesh Reynolds numbers smaller than 1. Two partitionings based on dimension (1D and 2D) have been considered. The reduced matrix is not consistently ordered relative to the 1D partitioning, and in order to use Young's analysis for Gauss-Seidel and successive over-relaxation, it has been shown, by extensions of the theory of p-cyclic matrices due to Varga, that the two-plane matrix is nearly consistently ordered relative to this partitioning for the range of PDE coefficients for which the convergence analysis applies. The bounds on the convergence rates for block Gauss-Seidel are tight, and for SOR the estimate of the optimal relaxation parameter, obtained using Young's formula, is accurate. Fourier analysis extends the convergence results to the case of centered difference discretization of convection-dominated equations.

5. Computational work. A careful analysis of the work involved in each of the solvers has been performed. In particular, when the PDE coefficients are such that the numerical solution is stable and convergence is guaranteed, it has been shown that block solvers based
Compar i son with the unreduced system has been performed, and it has been shown that reduced solvers typically converge faster. Th is has been proven analyt ical ly for block stat ionary methods, and has been demonstrated empirically for a variety of K r y l o v subspace solvers. A l so , it has been shown that there is a range of P D E coefficients for which the unreduced solver diverges whereas the reduced solver converges. 9. Exper imenta l comparison between various solvers has been performed. For pre-conditioned K r y l o v subspace solvers B i - C G S T A B and C G S seem to be more efficient than other K r y l o v subspace methods that have been considered. They seem more appropriate to use for other reasons as well - due to the high memory requirements for the class of problems considered in this thesis G M R E S might not be the first choice, and due to the need to have the transpose matr ix , Q M R and B i C G are less preferable. W h e n the opt imal relaxation parameter is known, which is the case for constant coefficient problems and certain variable coefficient problems, S O R is highly competi t ive wi th K r y l o v subspace solvers. Th is underlines the importance of the convergence analysis that has been performed in Chapter 4. 10. Implementation has been discussed. Some preliminary observations on aspects of vec-155 Chapter 6. Summary, Conclusions and Future Research torization and parallelism have been provided. The construction of the reduced system is fully parallelizable and highly vectorizable; four-color ID block orderings and red/black 2D block orderings are appropriate for parallel implementation of solvers. 11. Numerical results which validate the analysis and illustrate various aspects have been presented. The issue of ordering is especially important: in particular, the derivation of an ordering strategy which fits the three-dimensional reduced grid. Such orderings are superior to natural lexicographic orderings or general ordering strategies which do not take the special structure of the reduced grid and the reduced computational molecule into account. The most important advantage of the cyclically reduced system is that there is typically an improvement in the condition number by a factor of 2 to 3. This directly affects the rate of convergence. As has been shown, for block stationary methods the bounds on convergence rates can be referred to as a reliable indication of the gains, and here the ratio between the bounds on convergence rates of the reduced system and the unreduced system goes up to approximately | for the ID partitioning, and slightly less for the 2D partitioning. On the other hand, the 3D cyclically reduced system has some disadvantages; the most serious one is: more significant loss of sparsity compared to lower dimensions, which causes an increase in computational work involved in a single matrix-vector product, and in construction of preconditioners. In fact, the three-dimensional problem is different from the ID and 2D problems, in that it is the only case where the number of nonzeros of the reduced matrix is higher than the number of nonzeros of the unreduced matrix: about 35% more nonzeros in the 3D case, as opposed to nearly 50% and 10% fewer nonzeros in the ID and 2D cases respectively. Another disadvantage is that in general, setting up the reduced system is more difficult than using other finite difference schemes, due to the structure of the computational molecule and the structure of the grid. 156 Chapter 6. 
Despite these disadvantages, one step of cyclic reduction is highly effective: the significant gain in iteration counts compensates for the more expensive matrix-vector products, and in practice the solution time has been observed to be typically faster by a factor of two or more. The difficulty in implementation is minor; the implementation can be done by performing the step of block Gaussian elimination directly, which is easy. Alternatively, the difference operator, which is given explicitly in Chap. 2, can be used. The above discussion leads to the conclusion that one step of cyclic reduction for three-dimensional problems is an effective preconditioning technique, which leads to faster solution of linear systems associated with three-dimensional elliptic problems.

6.2 Future Research

A few possible future research directions are:

1. Application of the cyclic reduction step to other discretization schemes. The cyclic reduction step is effective for matrices which satisfy Property A. When the PDE has rough or discontinuous coefficients there are more appropriate schemes, in which there is strong coupling between the red and the black points. When considering cyclic reduction for schemes whose associated matrix does not satisfy Property A, one potential difficulty is that the matrix to be inverted is not diagonal. As a result:

• The step of cyclic reduction requires more construction time and more computational work per iteration.
• It is difficult to derive a difference operator and perform convergence analysis.

On the other hand, the significant gains in computational work that have been presented in this thesis lead one to believe that cyclic reduction might still be highly effective. Some numerical experiments that have been performed as part of this research on Roe & Sidilkover's optimum positive residual scheme [92] for three-dimensional problems with constant coefficients demonstrate significant gains in iterations for the reduced system but
As already mentioned, parallelism is important in particular for three-dimensional problems, due to the size of the discretized systems. The cyclically reduced operator has some features that are different from standard finite difference operators. In particular, the structure of the computational molecule is more complicated. A possible idea is to map unknowns into processors not necessarily in the standard way; rather, assign subcubes whose sides form a 45° angle in all of x, y and z directions, relative to the axes. This fits the structure of the reduced computational molecule, and might reduce the communication time required between processors in the network. On the other hand, it requires a nontrivial treatment of the grid, for example close to the boundaries. 158 Chapter 6. Summary, Conclusions and Future Research Implementation in a parallel environment requires further investigation. 5. Analysis of the cyclically reduced operator as a smoother for mult igrid algo-rithms. Preliminary smoothing analysis suggests that the cyclically reduced operator is a good smoother for multigrid, and generates smaller smoothing factors than the anal-ogous seven-point operator. In multigrid, since it involves several grids, and since on coarser grids the mesh Reynolds numbers grow larger, the fact that the cyclically reduced difference equation corresponds to adding artificial viscosity to the original equation sug-gests that in convection-dominated PDEs the gain of applying cyclic reduction might be significant. An indication for this is the stability of the block Gauss-Seidel scheme, when applied to the cyclically reduced linear system in convection-dominated regions. Research on this topic is currently underway. 159 Bibliography S. Adjer id , M . Aiffa , and J . E . Flaherty. High-order finite element methods for singularly perturbed parabolic problems. Technical report, Scientific Computation Research Center, Rensselaer Polytechnic Institute, New York, 1993. P. A m o d i o and N . A . Mast ronard i . A parallel version of the cyclic reduction algori thm on a hypercube. Parallel Comput., 19(11):1273-1281, 1993. P. A m o d i o and M . Paprzycki . A cyclic reduction approach to the numerical solution of boundary value O D E s . SIAM J. Sci. Comput., 18(l) :56-68, 1997. R . J . A r m s , L . D . Gates, and B . Zondek. A method of block i teration. J. SIAM, 4:220-229, 1956. W . E . A r n o l d i . The principle of minimized iteration in the solution of the mat r ix eigen-value problem. Quart. Appl. Math., 9:17-29, 1951. U . M . Ascher. Introduction to mult igr id . lecture notes, University of British Columbia, 1997. U . M . Ascher and P. L i n . Sequential regularization methods for s imulat ing mechanical systems wi th many closed loops. SIAM J. Sci. Comput., to appear, 1998. O. Axelsson. A general incomplete block-matrix factorization method. Linear Algebra Appl., 74:179-190, 1986. O . Axelsson. Iterative Solution Methods. Cambridge Universi ty Press, 1994. S. Barnet t . Matrices - Methods and Applications. Clarendon Press, Oxford , 1990. Richard Barret t , Michae l Berry, Tony F . Chan , James Demmel , June Donato , Jack D o n -garra, V i c t o r E ikhout , Roldan Pozo, Charles Romine, and Henk van der Vors t . Tem-plates for the Solution of Linear Systems: Building Blocks for Iterative Methods. S I A M , Phi ladelphia , 1994. L . Bauer and L . Reiss. Block five diagonal matrices and the fast numerical solution of the biharmonic equation. Math. Comp., 26:311-326, 1972. R . Beauwens. 
Factorization iterative methods, M-operators and H-operators. Numer. Math., 31:335-357, 1979.
[14] M. Benzi, D. B. Szyld, and A. van Duin. Orderings for incomplete factorization preconditioning of nonsymmetric problems. Technical Report 97-91, Temple University, Philadelphia, 1997.
[15] S. Bondeli and W. Gander. Cyclic reduction for special tridiagonal systems. SIAM J. Matrix Anal. Appl., 15:321-330, 1994.
[16] A. Brandt. Multi-level adaptive solutions to boundary-value problems. Math. Comp., 31:333-390, 1977.
[17] F. Brezzi and M. Fortin. Mixed and Hybrid Finite Element Methods. Springer-Verlag, New York, 1991.
[18] O. Buneman. A compact non-iterative Poisson solver. Report 294, Stanford University Institute for Plasma Research, Stanford, California, 1969.
[19] B. L. Buzbee, F. W. Dorr, J. A. George, and G. H. Golub. The direct solution of the discrete Poisson equation on irregular regions. SIAM J. Num. Anal., 8:722-736, 1971.
[20] B. L. Buzbee, G. H. Golub, and C. W. Nielson. On direct methods for solving Poisson's equations. SIAM J. Num. Anal., 7:627-656, 1970.
[21] T. F. Chan and H. C. Elman. Fourier analysis of iterative methods for elliptic problems. SIAM Rev., 31:20-49, 1989.
[22] T. F. Chan and T. P. Mathew. Domain decomposition algorithms. In Acta Numerica 1994, Cambridge University Press, Cambridge, 1994.
[23] M. P. Chernesky. On preconditioned Krylov subspace methods for discrete convection-diffusion problems. Numerical Methods for Partial Differential Equations, 13:321-330, 1997.
[24] E. Chow and Y. Saad. Experimental study of ILU preconditioners for indefinite matrices. Minnesota Supercomputer Institute Technical Report TR 97/95, University of Minnesota, 1997.
[25] P. Concus and G. H. Golub. Use of fast direct methods for the efficient numerical solution of nonseparable elliptic equations. SIAM J. Num. Anal., 10:1103-1120, 1973.
[26] P. Concus, G. H. Golub, and G. Meurant. Block preconditioning for the conjugate gradient method. SIAM J. Sci. Statis. Comput., 6:543-561, 1985.
[27] R. W. Cottle. Manifestations of the Schur complement. Linear Algebra Appl., 8:120-211, 1974.
[28] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. Brandon Press, New Jersey, 1969.
[29] E. F. d'Azevedo, P. A. Forsyth, and W.-P. Tang. Ordering methods for preconditioned conjugate gradient methods applied to unstructured grid problems. SIAM J. Matrix Anal., 13(3):944-961, 1992.
[30] P. P. N. de Groen. The nature of resonance in a singular perturbation problem of turning point type. SIAM J. Appl. Math., 11(1):1-22, 1980.
[31] M. A. DeLong and J. M. Ortega. SOR as a preconditioner. Applied Numerical Mathematics, 18:431-440, 1995.
[32] J. W. Demmel, M. T. Heath, and H. van der Vorst. Parallel numerical linear algebra. In Acta Numerica 1993, Cambridge University Press, Cambridge, 1993.
[33] E. Detyna. Point cyclic reductions for elliptic boundary-value problems I. The constant coefficient case. J. Comp. Phys., 33:204-216, 1979.
[34] J. C. Diaz and C. G. Macedo. Fully vectorizable block preconditionings with approximate inverses for non-symmetric systems of equations. Int. J. Num. Meth. Eng., 27:501-522, 1989.
[35] I. S. Duff, A. M. Erisman, and J. K. Reid. Direct Methods for Sparse Matrices. Clarendon Press, Oxford, 1986.
[36] I. S. Duff and G. A. Meurant. The effects of ordering on preconditioned conjugate gradients. BIT, 29:635-657, 1989.
[37] L. C.
Dutto. The effect of ordering on the preconditioned GMRES algorithm for solving the compressible Navier-Stokes equations. Int. J. Num. Meth. Eng., 36:457-497, 1993.
[38] H. C. Elman. Stability analysis of incomplete LU factorizations. Math. Comp., 47:191-217, 1986.
[39] H. C. Elman and M. P. Chernesky. Ordering effects on relaxation methods applied to the discrete one-dimensional convection-diffusion equation. SIAM J. Numer. Anal., 30:1268-1290, 1993.
[40] H. C. Elman and G. H. Golub. Iterative methods for cyclically reduced non-self-adjoint linear systems. Math. Comp., 54:671-700, 1990.
[41] H. C. Elman and G. H. Golub. Iterative methods for cyclically reduced non-self-adjoint linear systems II. Math. Comp., 56:215-242, 1991.
[42] H. C. Elman and G. H. Golub. Line iterative methods for cyclically reduced discrete convection-diffusion problems. SIAM J. Sci. Stat. Comput., 13:339-363, 1992.
[43] H. C. Elman and G. H. Golub. Inexact and preconditioned Uzawa algorithms for saddle point problems. SIAM J. Numer. Anal., 31(6):1645-1661, 1994.
[44] H. C. Elman and D. Silvester. Fast nonsymmetric iterations and preconditioning for Navier-Stokes equations. SIAM J. Sci. Comput., 17:33-46, 1996.
[45] K. Fan. Note on M-matrices. Quart. J. Math. Oxford Ser. (2), 11:43-49, 1960.
[46] C. A. J. Fletcher. Computational Techniques for Fluid Dynamics 1. Springer Series in Computational Physics. Springer-Verlag, 2nd edition, 1991.
[47] G. E. Forsythe and W. R. Wasow. Finite-Difference Methods for Partial Differential Equations. John Wiley and Sons, New York - London, 1960.
[48] R. W. Freund. A transpose-free quasi-minimal residual algorithm for non-Hermitian linear systems. SIAM J. Sci. Comput., 14:470-482, 1993.
[49] R. W. Freund, G. H. Golub, and N. M. Nachtigal. Iterative solution of linear systems. In Acta Numerica 1992, Cambridge University Press, Cambridge, 1992.
[50] R. W. Freund and N. M. Nachtigal. QMR: a quasi-minimal residual method for non-Hermitian linear systems. Numer. Math., 60:315-339, 1991.
[51] E. Gallopoulos and Y. Saad. A parallel block cyclic reduction algorithm for the fast solution of elliptic equations. Parallel Comput., 10(2):143-159, 1989.
[52] J. A. George and J. W. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, New Jersey, 1981.
[53] G. H. Golub. Personal communication.
[54] G. H. Golub and C. F. Van Loan. Matrix Computations. Second Edition, Johns Hopkins University Press, Baltimore, MD, 1989.
[55] G. H. Golub and C. F. Van Loan. Matrix Computations. Third Edition, Johns Hopkins University Press, Baltimore, MD, 1996.
[56] G. H. Golub and D. P. O'Leary. Some history of the conjugate gradient and Lanczos algorithms. SIAM Review, 31:50-102, 1989.
[57] G. H. Golub and R. S. Tuminaro. Cyclic reduction/multigrid. Technical report, Stanford University, CA, 1993.
[58] G. H. Golub and H. A. van der Vorst. Closer to the solution: iterative linear solvers. Technical report, Stanford University, 1996.
[59] G. H. Golub and R. S. Varga. Chebyshev semi-iterative methods, successive over-relaxation methods, and second-order Richardson iterative methods. Numer. Math., 3:147-168, 1961.
[60] J. Grasman and B. J. Matkowsky. A variational approach to singularly perturbed boundary value problems for ordinary and partial differential equations with turning points. SIAM J. Appl. Math., 32(3):588-597, 1977.
[61] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM Frontiers in Applied Mathematics, 1997.
[62] C. Greif. Cyclic reduction for three-dimensional elliptic equations with variable coefficients. SIAM J. Matrix Anal. Appl., accepted for publication, 1998.
[63] C. Greif and J. M. Varah. Block stationary methods for nonsymmetric cyclically reduced systems arising from three-dimensional elliptic equations. SIAM J. Matrix Anal., accepted for publication, 1998.
[64] C. Greif and J. M. Varah. Iterative solution of cyclically reduced systems arising from discretization of the three-dimensional convection-diffusion equation. SIAM J. Sci. Comp., to appear, 1998.
[65] P. M. Gresho and R. L. Lee. Don't suppress the wiggles - they're telling you something. Comput. Fluids, 9:223-253, 1981.
[61] A. Greenbaum. Iterative Methods for Solving Linear Systems. SIAM Frontiers in Applied Mathematics, 1997.
[62] C. Greif. Cyclic reduction for three-dimensional elliptic equations with variable coefficients. SIAM J. Matrix Anal. Appl., accepted for publication, 1998.
[63] C. Greif and J. M. Varah. Block stationary methods for nonsymmetric cyclically reduced systems arising from three-dimensional elliptic equations. SIAM J. Matrix Anal. Appl., accepted for publication, 1998.
[64] C. Greif and J. M. Varah. Iterative solution of cyclically reduced systems arising from discretization of the three-dimensional convection-diffusion equation. SIAM J. Sci. Comp., to appear, 1998.
[65] P. M. Gresho and R. L. Lee. Don't suppress the wiggles - they're telling you something. Comput. Fluids, 9:223-253, 1981.
[66] M. Gunzburger. Finite Element Methods for Viscous Incompressible Flows. Academic Press, San Diego, 1989.
[67] L. A. Hageman and D. M. Young. Applied Iterative Methods. Academic Press, New York, 1981.
[68] D. Heller. Some aspects of the cyclic reduction algorithm for block tridiagonal linear systems. SIAM J. Numer. Anal., 13(4):484-496, 1976.
[69] M. R. Hestenes and E. Stiefel. Methods of conjugate gradients for solving linear systems. J. Res. Nat. Bur. Standards, 49:409-436, 1952.
[70] C. Hirsch. Numerical Computation of Internal and External Flows. John Wiley and Sons, New York, 1988.
[71] R. W. Hockney. A fast direct solution of Poisson's equation using Fourier analysis. J. Assoc. Comput. Mach., pages 95-113, 1965.
[72] C. Johnson. Numerical Solution of Partial Differential Equations by the Finite Element Method. Cambridge University Press, New York, 1987.
[73] O. A. Karakashian. On a Galerkin-Lagrange multiplier method for the stationary Navier-Stokes equations. SIAM J. Numer. Anal., 19:909-923, 1982.
[74] R. B. Kellogg and A. Tsan. Analysis of some difference approximations for a singular perturbation problem without turning points. Math. Comp., 32(144):1025-1039, 1978.
[75] D. Ludwig. Persistence of dynamical systems under random perturbations. SIAM Review, 17:605-640, 1975.
[76] M. M. Magolu and B. Polman. Theoretical comparison of pointwise, linewise and planewise preconditioners for solving three-dimensional problems. Technical Report No. 9609, University of Nijmegen, The Netherlands, 1993.
[77] B. J. Matkowsky and Z. Schuss. The exit problem for randomly perturbed dynamical systems. SIAM J. Appl. Math., 33(2):365-382, 1977.
[78] J. A. Meijerink and H. A. van der Vorst. An iterative solution method for linear systems of which the coefficient matrix is a symmetric M-matrix. Math. Comp., 31:148-162, 1977.
[79] G. Meurant. A review on the inverse of symmetric tridiagonal and block tridiagonal matrices. SIAM J. Matrix Anal. Appl., 13:707-728, 1992.
[80] A. R. Mitchell and D. F. Griffiths. The Finite Difference Method in Partial Differential Equations. John Wiley and Sons, Chichester - New York - Brisbane - Toronto, 1967.
[81] K. W. Morton. Numerical Solution of Convection-Diffusion Problems. Chapman and Hall, 1st edition, 1996.
[82] N. M. Nachtigal, S. C. Reddy, and L. N. Trefethen. How fast are nonsymmetric matrix iterations? SIAM J. Matrix Anal. Appl., 13(3):778-795, 1992.
[83] J. O'Neil and D. Szyld. A block ordering method for sparse matrices. SIAM J. Sci. Stat. Comp., 11(5):811-823, 1990.
[84] J. M. Ortega and R. G. Voigt. Solution of partial differential equations on vector and parallel computers. SIAM Review, 27:149-240, 1985.
[85] C. C. Paige and M. A. Saunders. Solution of sparse indefinite systems of linear equations. SIAM J. Numer. Anal., 12:617-629, 1975.
[86] M. Papadrakakis and G. Pantazopoulos. A survey of quasi-Newton methods with reduced storage. Int. J. Num. Methods in Eng., 36:1573-1596, 1993.
[87] C. Pommerell. Solution of Large Unsymmetric Systems of Linear Equations, volume 17. Hartung-Gorre Verlag, Konstanz, 1992.
[88] C. Pommerell, M. Annaratone, and W. Fichtner. A new set of mapping and coloring heuristics for distributed-memory parallel processors. SIAM J. Sci. Comput., 13(1):194-226, 1992.
[89] E. L. Poole and J. M. Ortega. Multicolor ICCG methods for vector computers. SIAM J. Numer. Anal., 24:1394-1418, 1987.
[90] R. Fletcher. Conjugate gradient methods for indefinite systems. In Lecture Notes in Math., volume 506. Springer-Verlag, Berlin-Heidelberg-New York, 1976.
[91] R. D. Richtmyer and K. W. Morton. Difference Methods for Initial-Value Problems. Interscience Publishers (John Wiley and Sons), 2nd edition, New York - London - Sydney, 1967.
[92] P. L. Roe and D. Sidilkover. Optimum positive linear schemes for advection in two and three dimensions. SIAM J. Numer. Anal., 29(6):1542-1568, 1992.
[93] Y. Saad. A flexible inner-outer preconditioned GMRES algorithm. SIAM J. Sci. Comp., 14:461-469, 1993.
[94] Y. Saad. ILUT: a dual threshold incomplete LU factorization. Numerical Linear Algebra with Applications, 1:387-402, 1994.
[95] Y. Saad. Iterative Methods for Sparse Linear Systems. PWS Publishing Company, 1996.
[96] Y. Saad and M. H. Schultz. GMRES: a generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comp., 7:856-869, 1986.
[97] A. Segal. Aspects of numerical methods for elliptic singular perturbation problems. SIAM J. Sci. Statis. Comput., 3:327-349, 1982.
[98] D. Sidilkover. A genuinely multidimensional upwind scheme for the compressible Euler equations. World Sci. Publishing, Stony Brook, New York, 1994.
[99] D. Sidilkover and U. M. Ascher. A multigrid solver for the steady state Navier-Stokes equations using the pressure-Poisson formulation. Comp. Appl. Math., 14(1):21-35, 1995.
[100] P. Sonneveld. CGS: a fast Lanczos-type solver for nonsymmetric linear systems. SIAM J. Sci. Stat. Comput., 10:35-52, 1989.
[101] J. C. Strikwerda. Finite Difference Schemes and Partial Differential Equations. Wadsworth and Brooks/Cole, 1989.
[102] X. Sun. Personal communication.
[103] X. Sun. Ph.D. thesis proposal. University of British Columbia, 1997.
[104] P. N. Swarztrauber and R. A. Sweet. Vector and parallel methods for the direct solution of Poisson's equation. J. Comp. Appl. Math., 27:241-263, 1989.
[105] R. A. Sweet. A generalized cyclic reduction algorithm. SIAM J. Num. Anal., 11:506-520, 1974.
[106] R. A. Sweet. A cyclic reduction algorithm for solving block tridiagonal systems of arbitrary dimension. SIAM J. Num. Anal., 14:706-720, 1977.
[107] M. C. Thompson, J. H. Ferziger, and G. H. Golub. Block SOR applied to the cyclically reduced equations as an efficient solution technique for convection-diffusion equations. In Computational Techniques & Applications, edited by J. Noye & C. Fletcher, pages 637-646, 1988.
[108] W. F. Tinney and J. W. Walker. Direct solutions of sparse network equations by optimally ordered triangular factorization. Proc. IEEE, 55:1801-1809, 1967.
[109] D. J. Tritton. Physical Fluid Dynamics, second edition. Clarendon Press, Oxford, 1988.
[110] H. van der Vorst. Bi-CGSTAB: a fast and smoothly converging variant of Bi-CG for the solution of nonsymmetric linear systems. SIAM J. Sci. Stat. Comp., 13:631-644, 1992.
[111] J. M. Varah. On the solution of block-tridiagonal systems arising from certain finite-difference equations. Math. Comp., 26:859-868, 1972.
[112] R. S. Varga. p-cyclic matrices: a generalization of the Young-Frankel successive over-relaxation scheme. Pacific J. Math., 9:617-628, 1959.
[113] R. S. Varga. Matrix Iterative Analysis. Prentice-Hall, 1962.
[114] P. S. Vassilevski. Algorithms for construction of preconditioners based on incomplete block-factorizations of the matrix. Int. J. Num. Meth. Eng., 27:609-622, 1989.
[115] P. Wesseling. An Introduction to Multigrid Methods. John Wiley and Sons, New York, 1992.
[116] D. M. Young. Iterative methods for solving partial difference equations of elliptic type. Trans. Amer. Math. Soc., 76:92-111, 1954.
[117] D. M. Young. Iterative Solution of Large Linear Systems. Academic Press, New York, 1971.
