inTHE FACULTY OF GRADUATE STUDIESCOMPUTER SCIENCEWe accept this thesis as conformingto the required standardTHE IMPLEMENTATION OF SEQUENTIAL ANDPARALLEL ALGORITHMS FOR SOLVING ALMOSTBLOCK BIDIAGONAL SYSTEMSByYan SongB. Sc. Fudan University, 1984M. Sc. Fudan University, 1987A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF SCIENCETHE UNIVERSITY OF BRITISH COLUMBIAMarch 1993© Yan Song, 1993(Signature) In presenting this thesis in partial fulfilment of the requirements for an advanceddegree at the University of British Columbia, I agree that the Library shall make itfreely available for reference and study. I further agree that permission for extensivecopying of this thesis for scholarly purposes may be granted by the head of mydepartment or by his or her representatives. It is understood that copying orpublication of this thesis for financial gain shall not be allowed without my writtenpermission.Department ofThe UniversityUniversity of British ColumbiaVancouver, CanadaDate ^41 DE-6 (2/88)AbstractIn this thesis, we mainly concentrate on the implementation of sequential and parallelalgorithms for solving block bidiagonal linear system arising from the discretization ofboundary value problems for linear ordinary differential equations.It consists of two parts. In Part I, several sequential algorithms, the Alternate Row andColumn Pivoting Algorithm, the Structured QR Algorithm, the Structured LU Algorithmand the Least Square Algorithm, are implemented and some stability issues of these al-gorithms are considered. Numerical results are presented. In Part II, the correspondingparallel algorithms, except the Parallel Alternate Row and Column Pivoting Algorithm,are implemented on INMOS T800 Transputers, a microcomputer with multiprocessors.To abstract the application from the system function calls, we use a template based on asimple software system with master-slave technique as the interface with the application.The description of those parallel algorithms and numerical results are presented.Only the case with separate boundary conditions is considered here. Among all thesequential algorithms, we can say that the Alternate Row and Column Pivoting Algorithmis the most efficient algorithm with no fill-ins, while the Structured QR Algorithm isthe most stable algorithm. Though the Structured LU Algorithm is much faster thanStructured QR Algorithm in the sequential case if the block size is greater than 3, theabsolute run times of both parallel algorithms are almost the same because the ParallelStructured QR Algorithm obtains better speedup.iiKeywords: almost block bidiagonal linear system, ordinary differential equations,structured QR, alternate row and column pivoting, structured LU, least square, Trans-puters.iiiTable of ContentsAbstract^ iiList of Tables viList of Figures^ viiiAcknowledgement ixIIIIntroductionSequential Algorithms and Their Implementation191 Formation of the Linear System 102 Sequential Algorithms and their Properties 152.1 Alternate Column and Row Pivoting Algorithm (ARCP) ^ 152.2 Structured QR Algorithm(SQR) ^ 192.3 Structured LU Algorithm (SLU) 232.4 Least Squares Algorithm(LS) ^ 253 Numerical Comparison of Algorithms 303.1 Test Examples^ 303.2 Implementation and Numerical Results ^ 33ivIII Parallel Algorithms and their Implementation on IN-MOS Transputers^ 414 Background and Related Issues^ 434.1 Why Parallel Processing? ^434.2 Parallel Architectures ^ 444.3 Performance Analysis 465 The Transputer Network and a Performance Prediction Model^495.1 Transputer Architecture ^ 495.2 A Performance Prediction Model ^ 515.3 Paradigm Implementation ^ 556 Parallel Algorithms and their Implementations^ 586.1 Parallel LS Algorithm ^ 586.2 Parallel Structured QR Algorithm ^ 656.3 Implementation Considerations 707 Numerical Results and Performance Analysis^ 727.1 Parallel LS Algorithm ^ 757.2 Parallel SQR and SLU Algorithm ^ 787.3 The Comparison of Numerical Results 80IV Summary and Future Work^88Bibliography^ 92vList of Tables3.1 Accuracy of solutions of Example I with N = 7 ^ 343.2 Accuracy of solutions of Example I with N = 15 353.3 Accuracy of solutions of Example II with N = 7 ^ 363.4 Accuracy of solutions of Example II with N = 15 373.5 The stability of elimination in SLU with example I ^ 383.6 The complexity comparison of algorithms ^ 393.7 Performance of algorithms with Example I (A = 100) ^ 393.8 Performance of algorithms with Example II (E = 0.01) 394.9 Family of parallel computers ^ 456.10 Job complexity of B = AT A and f = AT g ^ 616.11 Job complexity in the Reduction^ 636.12 Job complexity in the Backward Substitution ^ 656.13 Job complexity of the parallel SQR algorithm 716.14 Communication cost in parallel SQR algorithm ^ 717.15 Formation of the Normal Equation (Phase I) 737.16 The Odd-Even Reduction (Phase II) ^ 747.17 The Backward Substitution (Phase III) 747.18 Optimal Parameters at Three Phases ^ 747.19 The Parallel SQR/SLU Elimination (Phase I) ^ 79vi7.20 The Parallel SQR/SLU Backward Substitution (Phase II) ^ 807.21 The Parallel Structured QR/LU Algorithm (Phase I) 817.22 The Parallel Structured QR/LU Algorithm (Phase II) ^ 817.23 The Speedups of PLS Algorithm with 1023 Mesh Points 867.24 The Parallel Structured QR/LU Algorithm (Ex.I) ^ 867.25 The Parallel Structured QR/LU Algorithm (Ex.II) 867.26 The Speedups of SQR/SLU with 8 transputers ^ 87viiList of Figures5.1 The family of execution times ^ 525.2 A simple software system 535.3 Processes running on a worker node ^ 565.4 Task distribution management 576.5 The structure of the formation of normal equation ^ 606.6 The structure of Odd-Even reduction ^ 626.7 The structure of LS backward substitution 646.8 The structure of structured QR elimination^ 697.9 The performance analysis in Phase I of PLS algorithm ^ 777.10 The performance analysis in Phase II of PLS algorithm 777.11 The performance analysis in Phase III of PLS algorithm ^ 787.12 The timing of each job in SQR and SLU algorithms (Ex.I) ^ 827.13 The timing of each job in SQR and SLU algorithms (Ex.II) ^ 827.14 The predicted speedups of SQR and SLU algorithms (Ex.I) ^ 837.15 The predicted speedups of SQR and SLU algorithms (Ex.II) ^ 83viiiAcknowledgementThis thesis work is under the guidance of Dr.James Varah. It is my great pleasure to havehim as my supervisor. He is always very patient to discuss problems with me, though heis very busy on other things. Not only he helps me a lot in acdemic, but he also concernsabout me in other aspects of my life.It is Dr.Uri Ascher who introduced me the problem that I am working on as my thesistopic. He spent his precious time to tell me the background and application of the prob-lem. Also, he is the second reader of my thesis.Many thanks go to the transputer group. Dr.Alan Wagner provided me a chance to workon the Transputer and helped me to get familiar with it. I would very much appreci-ate the impressive cooperation with H.Sreekantaswamy. Thanks Kunhua Lin, MandeepDhami, David Feldcamp for helping me deal with the transputers.I would also thank Jeff Joyce, Feng Gao, Robert Miura for their help during my stay inUBC. Sijian Zhang, Ying Li, Runping Qi, Ying Zhang and other graduate fellows havegiven me lots of valuable helps during my stay in the computer science department.I am very grateful to my mother-in-law Ning Zheng. Without her generous help, it isimpossible that I can finish my thesis in such a short period after I have my baby. Ihave already owed my grandmother too much, who brought me up and left me withoutixany request. I would like to express my many thanks to my parents and brother whoconsistently give me great support.It is my husband, Hua Yang (Henry), who supports me throughout my graduate program.Without his love, none of these work can be done. In the last, I really like to expressmy endless love to my 5 month old baby Stanley. His joyful face always motivates me somuch to work hard.xPart IIntroduction1Introduction^ 2Massive computing technology has been paid more and more attention recently. Withthe introduction of vector-, array- and multi-processors into scientific computing in thepast few years, the invention of new languages and algorithms has been required to ensurerealization of the potential of these new architectures. Already the use of parallel com-puters has given rise to studies in concurrency factors, vectorization and asynchronousprocedures. These have led to increases in speed over conventional computing to take ad-vantage of specific hardware. The increasing market demand for high-speed and low-costcomputing also makes it more worthwhile to explore parallel algorithms for the availableparallel computers. The intent of this thesis is to perform some numerical experiments insolving an Almost Block Bidiagonal (ABD) system on a particular multiprocessor system— the Transputer.This particular linear system, in which the coefficient matrix has a block bidiagonalform, arises in solving ordinary differential equations (ODE), partial differential equations(PDE), integral equations, etc. ([17]). A system of n ordinary differential equations, forexample, may be expressed asy/ = f (t, y),^a < t < bwith y/ =...E dy/dt subject to a system of n independent boundary conditions,g(y(a),y(b)) = 0.Generally, the functions f and g are nonlinear and smooth.The Orthogonal Spline Collocation Method, in which the Gaussian points are used ascollocation points, has been widely used for solving this kind of boundary value prob-lem since the nineteen seventies. With the rapid development in both theoretical andIntroduction^ 3computational aspects, software packages such as COLSYS and COLNEW [2, 5] havebeen implemented and shown to be very effective for solving a variety of boundary valueproblems arising in science and engineering [10].A special case of this boundary value problem (BVP) is an initial value problem (IVP),where y(a) is given, e.g.,y(a) = a,with known constant a. There is a close theoretical relationship between BVPs andIVPs because the basic theorems of solution existence and representation for a BVP aregiven in terms of fundamental solutions which are the solutions of associated IVPs [3,Ch3,4]. The simplest initial value method for BVPs is the single shooting method. Thisis a useful and easy-to-understand method; however, it suffers from a stability problem.Several alternatives such as the Multiple Shooting, the Stabilized March and the RiccatiMethod attempt to avoid the stability problem. However, they lose the simplicity ofsingle shooting method as a result.The Collocation Method, the Multiple Shooting Method and the One -Step Finite Differ-ence Method, all end up with a linear system with the same sparseness structure — anAlmost Block Bidiagonal system. As well, it is easy to see that these methods also have ahighly parallel nature [3, 17], i.e., the system blocks can be assembled in parallel, whichin turn makes the discovery of parallel algorithms for solving the underlying linear systemvery important. As pointed out in [17], in the sequential case, typically the runtime in thediscretization step is more than 50% of the total and the portion for solving the resultinglinear algebra system, in which standard Gaussian elimination with row partial pivotingis used, needs only about 15% of the runtime. By comparison, however, the runtime ofIntroduction^ 4both the decomposition routine and the solve routine exceeds that of the discretizationroutine if there are more than two processors used in the shared-memory machine IBM3090-600J in most testing cases shown in [17]. To make the code more efficient in theparallel case, some improvements have been made in [17] by using another linear systemsolver — the Alternate Row and Column Pivoting, which we will see later is a very effi-cient sequential algorithm for solving such special systems. Unfortunately, this does notchange the situation very much because the total saving is very little. This convinces usthat the sequential approach to solve the linear system becomes the bottleneck in solvingthe ODEs and it should be replaced by a parallel approach to achieve more speedup andmake it meaningful to use parallel computers to solve this sort of problem.Meanwhile, stability is a very important issue for both sequential and parallel algorithms.Some theoretical analysis of the stability has been done already, such as [22, 23]. An ODEyields an underlying varying propagation of information, which may be fast for an ill-conditioned problem. Therefore, a stable algorithm must decouple rapidly increasingand decreasing fundamental solution modes, which may exist in a well-conditioned BVP.Several sequential algorithms such as the Reorthogonalization, the Stabilized March, andthe Riccati Method attempt to achieve this decoupling explicitly. However, they appearto be inherently sequential in t. Thus they become very inefficient in parallel becausethe information required by the current subinterval depends on that of the previoussubinterval.From the other hand, since we have noticed that the Multiple Shooting Method and theCollocation Method both end up with a linear system, the task of finding a stable parallelalgorithm for BVPs now converts to finding a stable parallel algorithm for the resultinglinear system. This narrows our attention into a linear algebra problem. Generally, thereIntroduction^ 5are two approaches: the straightforward way is to exploit parallelism in the existing se-quential algorithm; the other is to invent new parallel algorithms.There are several sequential algorithms used for solving this special form of linear sys-tem, including Gaussian Elimination with Partial Pivoting (GEPP), Alternate Row andColumn Pivoting (ARCP) and Compactification. Unlike the GEPP algorithm, a generalapproach for solving linear systems, the ARCP algorithm takes the special form of theresulting linear system into account. This algorithm was invented by Varah [36] whobased it on the idea of alternate row and column elimination in [20] and made an essen-tial improvement of the original algorithm in [20]. The advantages of this algorithm areflexibility, simplicity and efficiency.It is natural to try to find parallelism for this algorithm. Indeed, several attempts havebeen made recently. Paprzycki and Gladwell[27] described a tearing algorithm in whichthe original ABD is partitioned into several smaller matrices, each of which is also anABD. They perform the ARCP algorithm in each smaller ABD and connect them withsome global technique. This achieves parallelization to some extent because the work ineach submatrix can be done independently. However, the speed-up is very poor due tothe global factor in the algorithm. Besides, there is also a stability problem. Although itis always possible to select intermediate boundary conditions and construct the smallerABDs, as pointed by the authors, there is no guarantee that the sub-problems will bewell-conditioned.Meanwhile, Lentini [21] suggests an algorithm based on Gaussian elimination, in whichGaussian elimination with partial pivoting is performed simultaneously from both endsof the matrix. This ensures the algorithm is as stable as the Gaussian elimination withIntroduction^ 6partial pivoting. However, it can only make use of two processors synchronously, and nomore parallelism can be exploited in this algorithm.Since there was no satisfactory parallel algorithms found based on the sequential algo-rithms, Ascher and Chan [4] suggested a stable algorithm called the Least Squares (LS)algorithm, in which the block Odd-Even Reduction is performed on a set of normalequations — a positive definite block tridiagonal linear system. It is well-known that theOdd-Even Reduction algorithm proposed in [11, 13] is stable for positive definite tridi-agonal system and has parallel characteristics. Several implementations have been donein both shared-memory machine and distributed memory machine, e.g., [15, 16], etc. Itneeds O(log M) block-steps with M the size of the matrix to solve the system. It is agood idea to convert this almost block bidiagonal system into a positive definite blocktridiagonal system, for which there are several stable algorithms. However, there is stillsome unstable factor in this algorithm because the condition number with respect tothe 2-norm of the original matrix has been squared after the formation of the normalequations. Therefore, this algorithm may not be very robust for some linear systems withrelatively large condition number.Wright [38] presented another stable parallel algorithm called Stabilized QR Algorithm,in which he divided the system into several pieces. Each piece has the same structure,and a new smaller system with identical structure as the original one is formed. Sinceorthogonal matrices are used in the elimination, the condition number of the coefficientmatrix stays the same and therefore it is a stable algorithm in principle. The algorithmneeds O(log M) steps to finish and MI log M processors may be used simultaneously. Infact, Wright's algorithm can be interpreted as a QR solution for the same system thatAscher and Chan solved earlier using normal equations. Several recent papers [17, 14]Introduction^ 7comment that this is the first stable parallel algorithm for solving ABDs.A straightforward extension of this algorithm is Stabilized LU Elimination [40], in whichGaussian elimination with partial pivoting is used instead of orthogonal elimination, orQ R decomposition. The Stabilized LU Elimination algorithm can also be finished withinO(log M) steps and M/ log M processors may be used simultaneously. This is a practicalstable algorithm, although it fails in some cases [41].Another essential part of parallel processing is the implementation of algorithms, whichverifies the effectiveness of an algorithm based on the theoretical analysis of the algo-rithm such as complexity and performance. In contrast to the sequential case, where theprograms are basically the same running on different machines. Parallel implementationshave additional difficulties because the architecture of a parallel machine plays a veryimportant role in the implementation. There is no general program suitable for machinesin different categories. Even when machines are in the same category, the implementationof an algorithm can still be different, as mentioned by Bennett [17].Most of the implementation of the proposed parallel ABD algorithms so far has beendone in shared-memory machines, such as CRAY, Sequent Symmetry, Alliant FX/8 andIBM3090, etc.. Due to the difficulty and complexity in dealing with inter-processor com-munication, there hasn't been any work seen in the literature which implements parallelalgorithms of the ABD solver in distributed memory machines. However, there has beensome work done on such machines using other parallel algorithms, such as the parallelodd-even reduction algorithm as a tridiagonal linear system solver[15].An INMOS T800 transputer is a 32 bit 10 MIPS microprocessor(CPU). Our existingIntroduction^ 8transputer network consists of 74 INMOS transputers, each of which has its own mem-ory, 4 serial data communication links and a floating point unit on a single chip. It is akind of distributed memory machine and inter-processor communication is required fordata exchange.This thesis will concentrate on the studies of sequential and parallel algorithms for ABDsystems and the implementation of those algorithms in the INMOS Transputer network,or transputer for short. It is basically divided into two parts. In Part one, we use theMultiple Shooting method to illustrate how the ODEs are discretized and end up withan ABD linear system. Several sequential algorithms such as Alternate Row and ColumnPivoting Algorithm, Structured QR Algorithm, Structured LU Algorithm and Least SquareAlgorithm are then implemented and their stabilities and efficiencies are compared anddiscussed. Based on the characteristics of Alternate Row and Column Pivoting Algorithm,we here consider it as a representative of state-of-the-art algorithm in the sequential case,and as such we compare it to the other algorithms which are designed with parallelismin mind. In the second part, the parallel implementation of the above algorithms (exceptthe Alternate Row and Column Algorithm) has been done on the INMOS T800 Trans-puter. Some performance comparisons are finally made.Part IISequential Algorithms and TheirImplementation9Chapter 1Formation of the Linear SystemAs we mentioned earlier the Almost Block Bidiagonal (ABD) system has significant ap-plications. In this chapter, we will describe how the multiple shooting scheme convertsan n ordinary differential equations into an irreducible block bidiagonal linear system.As is well known [3, CM], Newton's method can be applied to a system of n ordinarydifferential equations (ODEs)y' = f(t, y),^a < t < b^ (1.1)(y = dy/dt) subject to a system of n independent boundary conditionsg(y(a), y(b)) = 0,^ (1.2)which leads to n linear ODE equationsy' = A(t)y + q(t),^a < t < b^(1.3)at each iteration, subject to linear boundary conditionsBay(a) + Bby(b) = #^(1.4)with A(t), B a , Bb E Rn", # E Rh.10Chapter 1. Formation of the Linear System^ 11The 'standard' multiple shooting method was first introduced as a means of reducing theinstability of the single shooting method for solving BVPs. The single shooting methodattempts to solve BVPs by solving corresponding IVPs and matching the boundary con-ditions, but it becomes unstable essentially when these IVPs are unstable even thoughthe given BVP is well-conditioned. This instability of the single shooting method maygrow exponentially with the size of the interval of integration, so in the multiple shootingmethod this interval is partitioned into a set of smaller intervals, and on each subintervalan IVP solution scheme is carried out.We define a partition r in the interval [a, b]a = <t 2 < t3 < • < tN < tN+1 = b,^(1.5)and let yi denote our approximation of y(t i ), 1 < i < N +1. On each subinterval wenow solve IVPs to approximate a fundamental solution Y(t; t2) and a particular solutionv(t; t i) of (1.3) on it, which in turn allows us to express y(t i+i ) in terms of y(t i ), i.e.,y(ti+i) = Y(t,.4. 1 ; t2)y(t,) + v(t2+1 ; t i ).^(1.6)The approximated relation thus becomes= riyi + ri7^1 < i <N,^ (1.7)where Ili is the approximation of the fundamental solution Y(t; t i ) at t = ti+i , which canbe obtained in parallel for each i. The approximated boundary conditions areBay], BbYN+1^ 9• (1.8)If we denote-(„T Y2Yir^kJ 1 , J. • • • 3/1/41)T , (1.9)Chapter 1. Formation of the Linear System^ 12then the resulting linear algebra system can simply be expressed asAy, = g, ,^ (1.10)where matrix A isF1 IF2 IA=FN IBa^BbIf the boundary conditions are separated, then, Ba and Bb can be further decomposedas:[Ba 1 0Ba = and^Bb -=-[ ^,0^ Bbwhere Ba E Rq" and B, E R(n-q)" with q the number of boundary conditions at leftboundary a, and A thus written as: BaI(1.12)A=F1 IF2FN IBbAccording to [38, 17], we know that the linear systems discretized from the one-step finitedifference scheme, the Runge-Kutta method and the collocation method are of the samesparse bidiagonal structure as that from multiple shooting (1.11) or (1.12).Chapter 1. Formation of the Linear System^ 13So, it is of interest to focus our work on the kind of block matrix system which we cangeneralize as follows:A l^C1A2^C2x1x2flf2AN CNBa^BbX3XN+1=fNfN-1-1(1.13)for nonseparated boundary conditions and_Ba-x1faAl^CiA2^C2x2IIf2x3 = (1.14)AN CN INBbXN+1fb_for separated boundary conditions, where fa E R9x1 and fb E R( 71-q)X1. Thus, we cangeneralize the above two cases with a single matrix notation:Ax = f. (1.15)It is obvious that the case with separable boundary conditions is easier to cope with, andthe case with inseparable boundary conditions can be converted into a separable one atthe cost of doubling its number of equations. Some algorithms was used to deal withthe case with separable boundary conditions [36, 4], while others deal with the latter[38, 40]. The Gaussian Elimination with Partial Pivoting method can deal with bothcases, however, it suffers from large amount of fill-ins if cross block pivoting is used. ThisChapter 1. Formation of the Linear System^ 14makes it very inefficient in solving such systems [3, Ch.4].Note: The exact values of the block matrix and right hand side vector will be differentwith different boundary condition cases and also with different discretization schemes,though the same notation is employed here.Chapter 2Sequential Algorithms and their Prop-erties2.1 Alternate Column and Row Pivoting Algorithm(ARCP)The Alternate Column and Row Pivoting algorithm (ARCP) proposed by [36] in 1976is used for solving the linear algebra system (1.14), which results from the n linear ODEequations (1.3) subjected to separated linear boundary conditions.This algorithm has two significant characteristics: stability and efficiency. It has beenpointed out by [36, 38] that this algorithm is as stable as Gaussian elimination with par-tial pivoting because all the multipliers are bounded by 1. During the whole eliminationprocess, there is no fill-in introduced in the original matrix by this algorithm. This, asa matter of fact, is a significant feature when the system we deal with is very large. Inother words, it needs only the minimum number of operations and memory.However, this algorithm works in a more complicated way than the Gaussian Elimina-tion with Partial Pivoting method because both fill-ins and the magnitude of multipliers15Chapter 2. Sequential Algorithms and their Properties^ 16during the elimination have to be controlled. To achieve this, a tricky mixed row andcolumn pivoting strategy is used to implement this algorithm It starts by applying qcolumn elimination steps with column pivoting, and proceeds by n — q row eliminationwith row pivoting. These two strategies are used alternatively.We use a small example to illustrate this algorithm here with N = 2, 71 = 4 and q = 2.The elements in the matrix is represented by x:XXXxXXXxXXXxxxXxXxXxXxXxx x x x x x x xx x x x x x x x(2.16)x x x x x xx xx x x x x x x xx x x x x x x xx x x x xxxxxxxxxxxxAfter performing the alternate column and row eliminations with corresponding columnChapter 2. Sequential Algorithms and their Properties^ 17and row pivoting from top-left to bottom-right, we obtainx 0 0 0xx00xxxxxxxxxxexxxxxxxeex000xxEDexx00xxxxxxxxxxexxxxxxxeex000xxeexx00xxxxxxexwhere ® stands for the element eliminated by column elimination with column pivoting,and ED stands for the element eliminated by row elimination with row pivoting.To see clearly the structure of the reduced matrix above, we denote X E R272 as a fullmatrix, L E R2" and U E R2x2 as a lower triangular and an upper triangular matrixrespectively, and then (2.17) can be rewritten as follows:(2.17)L 0X R X XXOL 0XRXXXOLOX R(2.18)Chapter 2. Sequential Algorithms and their Properties^ 18To generalize this procedure, we may denote X as a full matrix of order either (n — q) x(n — q), q x q, (n — q) x q, or q x (n — q), L E Rqxq and U E R(n-q)x(n-q) as a lowertriangular and an upper triangular matrix respectively. Hence, the alternate row andcolumn pivoting process can be described in the following way according to the matrixnotation in (1.14):• Step 1: Eliminate Ba into[ L 0 ,by column elimination with column pivoting;For i = 1, • • • ,N• Step 2i: Eliminate Ai intoXX 0by row elimination with row pivoting;• Step 2i-F1: Eliminate Ci into[X XLby column elimination with column pivoting;End For i • Step Last: Eliminate Bb into{ X R]by row elimination with row pivoting.Chapter 2. Sequential Algorithms and their Properties^ 19The resulting matrix isL 0XRX XX 0 L 0(2. 19 )X RXXX 0 L 0X RIt is easy to verify that the linear systemAx= f,with A as in (2.19), can be simplified by splitting it into two sublinear systems: a lowerblock bidiagonal one and an upper block bidiagonal one. The corresponding coefficientmatrices have the forms as in the following:R XR2.2 Structured QR Algorithm(SQR)Unlike the ARCP algorithm, the Structured QR (SQR) algorithm works for the generalcase of boundary conditions (1.13). It was proposed originally by [38] as a stable parallelsolution of (1.13).L^R XX L R XandX LX L(2.20)n =1f2_ [ Al 0Q1 0 C2 [ G1 ElG2 C2 f2Chapter 2. Sequential Algorithms and their Properties^ 20The idea of this algorithm is that it decomposes the coefficient matrix into an almostdiagonal form with a sequence of orthogonal matrices, which do not increase the conditionnumber of the original matrix during this process. The complete algorithm may bedescribed as follows:• Elimination step: Considering the augmented system of the coefficient matrix (1.13) with its right-hand sideAl CiA2 C2AN CNnf2^(2.21)fNwe, first of all, are to find an orthogonal matrix Q i E R2nX 2n such thatR1Q_ 1 Ci. = [ 'A2^0 1 where R1 E R' is upper triangular and Q i is a product of n Householder orGivens transformations. This row transformation also has effect on the same rowsas those of Ci. and A2 but different columns which do not consist of C1These changes can be seen clearly fromand A2.If we denote1 '0Q1 = [ 010 IChapter 2. Sequential Algorithms and their Properties^ 21where I is an identity matrix of order (N — 1) x n, thenAl ClA2 C2AN CNGi R1 ElG2 0 C2A3 GAN CNQifif2fNglf213^. (2.22)INThis elimination step will be carried on until all the lower diagonal blocks {have been eliminated. Generally, if we denoteC1 = C1 , Gl = A1, fl =then the ith step of elimination with any integer i (1 < i < N — 1), is to find anorthogonal Q i E R2nx 2n such thatCi = RiQi Ai+i^0and the other parts of the augmented matrix remains the same except Gi 00 Ci+1(2.23)which becomesi^0Qi0 Ci-FiGifif+i iLetChapter 2. Sequential Algorithms and their Properties^ 22Qi =with the identity matrix h E Rinx in (i = 1, • • • , N— 1), then the augmented systemchanges fromG1 Ri gl Ri_1 E1-1^Gi-i^ gi- 1^Gi Ci^ iiA1+1 Ci+1^fi-Fi(2.24)(2.25)G1 R1 E1^Gi^Ri^Gi+1 Ci+iAi+2 Ci+2AN CN fNAN CNgigifi+2fNtoafter left-transformation by Q i .This elimination step completes when i = N — 1, and the resulting linear systemis:Chapter 2.^Sequential Algorithms and their PropertiesG1 Ri X1GN-1 RN-1 EN-1 xN-.1GN CN XNBa Bb X N+123 glgN-1fNfN+1(2.26)• Backward substitution:From (2.26), we can easily see that x 1 and xN4. 1 can be solved from[GN CNBa Bb^XN-F1 fN+1(2.27)a 2n x 2n matrix, and the rest {x i }'s with any integer i(2 < i < N) is determinedrecursively byGi_i xi + Ri_ixi^= gi-i,orx i = Ri (gi--1 — Gi_ixi — Ei_ i x i+i )^i = N,^, 2.^(2.28)2.3 Structured LU Algorithm (SLU)The SQR algorithm has high stability but low efficiency since it is well-known that or-thogonal QR decomposition of a dense matrix requires about twice as much work asGaussian elimination with partial pivoting. The way to increase the efficiency, and keepthe stability property as much as possible, is to introduce a new algorithm, StructuredLU (SLU) algorithm.Chapter 2. Sequential Algorithms and their Properties^ 24The SLU algorithm inherits the basic idea of elimination strategy from SQR, but witha different elimination tool — Gaussian elimination with partial pivoting. This has beenwidely accepted as a stable algorithm in most practical cases. However, it can achievemuch higher efficiency than the SQR algorithmLet Li E R2" 2"i (i = 1, -,N — 1) be the product of n permutation matrices andelementary elimination matrices, such thatCi^U1Li ^,Ai+1^0where U1 E Rn" is an upper triangular matrix. LetE RNnxNnwith identity matrix Ii E Rin x.n, be the elimination matrix of A at the i-th step; thenthe elimination process of SLU isLA= U,withL = LN-1 • • Ll,and U has the identity form as the coefficient matrix in (2.26).Since all the elements in L i are controlled in the same way as in the Gaussian Elimina-tion with Partial Pivoting algorithm, it is expected that the condition number of L willnot increase dramatically in practice. At least, we can predict here that this algorithmshould have the same stability behavior as the ARCP algorithm. Our numerical resultsL2 =^L ,Chapter 2. Sequential Algorithms and their Properties^ 25next will show this true.The backward substitution of SLU is exactly the same as that of SQR.2.4 Least Squares Algorithm(LS)The motivation of [4] to propose this Least Squares (LS) algorithm was to bring thestability issue to a satisfactory stage in parallel computation.Considering the separated boundary condition case, or linear system (1.14), this algo-rithm can be decomposed into two major phases:• Form a normal equation, orAT Ax = AT f,^ (2.29)where ATA is a positive definite block tridiagonal matrix provided A is a nonsin-gular matrix;• Use Odd-Even reduction strategy to solve the underlying positive definite blocktridiagonal system (2.29).The Odd-Even reduction algorithm was originally proposed by [11] for solving a band lin-ear algebra system. It was then extended to the block tridiagonal case in [13]. Also, [13]analyzed convergence and stability issues of this block odd-even reduction algorithm, inwhich it was shown that if the coefficient matrix is diagonal dominant, the errors incurredare not dangerously propagated relative to the entire solution, and an incomplete cyclicreduction would also generate a good approximation because norms describing the off-diagonal blocks relative to the diagonal blocks decrease quadratically with each reduction.- E2 jB2i+1 D2 j+1 jE2jB2j+1(2.31)(2.32)(2.33)(2.34)-^D2 j_ jB2 j - D2 j^E2j-1-E2jB2i+1 E2j+i j= a2j - D 2 j^ig2i_i —Chapter 2. Sequential Algorithms and their Properties^ 26Similar to the parallel algorithm in the tridiagonal case [32], the Odd-Even reductionapplied to the block tridiagonal case is also well-suited for parallel computers, as manyof the quantities involved may be computed independently of the others [15],[16].The block odd-even reduction algorithm consists of the operations of elimination of odd-indexed block unknowns and their eventual recovery through backward substitution. LetB ATA beB1 E1D2 B2 E2(2.30)DN-1 BN-1 EN-1DN BNand g = AT f, then (2.29) becomesBx = g,where the order of the block system N = 2' — 1 with nonzero integer m for the balanceof the cyclic reduction. If we multiply equation 2j — 1 by —D 23B2j1 1 , equation 2j + 1by —E2iB2j1+1 , and add these to equation 2j with 1 < j < 2' — 1, the 2jth equationbecomesn(1)^+ D(1)^(1 )^(1)ifj X22 -2 X2j^1-:/j X23+2 - gj ,whereChapter 2. Sequential Algorithms and their Properties^ 27These new equations form again a block tridiagonal system, involving only the evenindexed block unknowns of x. This process can be described graphically using a smallexample m = 3, in which the reduction phase will be finished after two elimination steps.Let the original matrix beX XX X XX X XX Xxxxxxxxxx,^ (2.35)then it will have the following forms respectively after two elimination steps: X X(8) x 0 xX X Xx 0 x 0 xX X Xx 0 x 0X X(2.36)andChapter 2. Sequential Algorithms and their Properties^ 28X^Xxx xxxxxx0xxx x(2.37)where ® is the element eliminated at current step. The block unknown which can besolved first is x4 , followed by x2 and x6 , then x i , x3 , x 5 and x 7 .Using superscripts to denote reduction and backward substitution steps, the odd-evenreduction is defined by the following set of equations:• Reduction:Let initial conditions be D? = D i , B? = Bi , E? = Ea , and g? = gi ,For j = 1,2,• • • ,m — 1 For i = 2j,2x2j,3x2a,•••,2m- 2j Fi = —M -1 * (Bt 21;_ 1 ) -1 ,Gi =^* (B-ii+-21;_ 1 ) -1 ,Di =^* M_ 21;_„Gi * E;+-21; _2 ,B = Bt 1^Et-21;_i Gi M±21;_,*^Gigi =(2.38)Chapter 2. Sequential Algorithms and their Properties^ 29End For iEnd For j.The only remaining equation isX0^X2m-i E2rnmili X2m g2mm-from which x 2m-1 can be solved since x o and xN+1 should be zero, i.e.,Dm-1 \ -1 m-1X2m-1 = 11 2m-1 ) g2m-1.• Backward substitution:For j = m — 1, m — 2,•• • , 1For i = 2j -1 , 3 x 2j -1 , 5 x 2j -1 , • • • , 2m — 2j -1xi = (m-1)-1^— D2 -1 xi_2;-1 — Et -1 x2+2J-1)End For iEnd For j (2.39)(2.40)(2.41)with1 0 0 01Bb = [0 0 '1Ba =(3.43)Chapter 3Numerical Comparison of Algorithms3.1 Test ExamplesIn order to examine the stability and performance of sequential and parallel algorithms,we choose two testing examples from [3, 7], which are frequently used as testing examplesin recent literatures (see [4, 38, 40, 14]). These examples are:Example I. The boundary value problem[ 1 — A cos 2t 1 + A sin 2ty'(t) =^ y(t) + f (t)—1 + A sin 2t A cos 2t + 1(3.42)where a = 0, b = 1 and the exact solution of this problem isy(t ) _ ( et , et)T .^ (3.44)Though the exact solution of this problem is independent of the value of A, we will seefrom the fundamental solution given by30y(t) = erf(t/21i)/erf(1/2fi) + z(t) + cos 7rtz(t) = e- (t+i)/VJ{(3.48)Chapter 3. Numerical Comparison of Algorithms^ 31Y(t) =(cos t^sin t^e(A+1)t^0(3.45)— sin t^cos t^0^e(i-A)tthat the parameter ). controls the growth and decay of modes, or the condition numberof the coefficient matrix of the linear algebra system (1.14) if the mesh points are given.{Example II. A system of second order differential equations—Ey" — 1y' + ..z' + z =1or2 cos irt + -2 7rt sin 7rtEZ II = zwith the separated boundary conditions(3.46)1 y(-1) = —1, y(1)z(-1) = 1, z(1)= 6 -2/vrj21- ,/= e(3.47)where a = —1, b = 1 and the exact solution of this problem isThis problem has a singular turning point if E is very small, and if 6 = 0, the secondorder equations (3.46) will reduce to a first order system. The condition number of thecoefficient matrix of the resulting linear system (1.14) is proportional to the reciprocalof f provided the mesh points are given.To be able to use the multiple shooting method, it is necessary to convert this secondorder system to first order, which will in turn double the number of equations of theoriginal system.Chapter 3. Numerical Comparison of Algorithms^ 32Let{u = Y'V = Z 1(3.49)then the new first order differential system isy' = u-Eu' - l2 Iu +v2 4- z =icr2 COS rt + -2 7rt sin 7rtz' = vEvi = zIf we introduce a fourth order vector variable ss = [y, z, u, yr(3.50)(3.51)then a set of linear ODE equations and boundary conditions with respect to .s is consistentwith (1.3) and (1.4), where0^0^1^00^0^0^1A(t) = (3.52)0^1/c^- t/2c^t/2c0^1/c^0^0q(t) = [0, 0, - qo/c, Or^ (3.53)with1qo = C7 2 cos irt + 2-irt sin rtand1^0^0^0 0^0^0^00^1^0^0 0^0^0^0Ba = Bb = (3.54)0^0^0^0 1^0^0^00^0^0^0 _ 0^1^0^0Chapter 3. Numerical Comparison of Algorithms^ 33and# = [-1, 1, C2M, e-2/17'. (3.55)Since only some of the algorithms described above work with nonseparated boundaryconditions and all of them successfully work in separated boundary conditions, we herecompare these algorithms by some examples in the separated case.3.2 Implementation and Numerical ResultsThe implementation of all algorithms to solve the linear ODE (1.3) with its boundarycondition (1.4) includes two steps:• The first step, or discretization step, which is common among the algorithms, is touse the multiple shooting method introduced in Chap.1 combined with an accurateIVP integrator. The Runge-Kutta method of order four [8, Ch 5] is used as theintegrator here;• The second step is the implementation of solving the underlying linear system (1.13)or (1.14) resulting from the first step. Five algorithms, including ACR, SQR, SLUand LS, which were introduced in Chap.2, and Gaussian Elimination with PartialPivoting (GEPP) will be compared numerically here.Therefore, the stability issue related to solve a BVP includes both the stability of themultiple shooting method and the stability of the resulting linear system slover. Letus denote , as the conditioning constant of the BVP (1.3) and (1.4) [3, pp.111], N thenumber of shooting points, N the number of integration steps used in the numericalintegration, then the condition number of matrix A resulting from multiple shootingsatisfies:cond(A) = MAIIIIA -1 11 5_ HAWK ry_ KN,Chapter 3. Numerical Comparison of Algorithms^ 34Table 3.1: Accuracy of solutions of Example I with N = 7A cond (A) EARCP ESQR ESLR ELS EGEPP10 6.2e+00 7.2e-9 7.2e-9 7.2e-9 7.2e-9 7.2e-950 2.0e +03 1.4e — 7 1.4e — 7 1.4e — 7 1.6e — 7 1.4e — 7100 2.5e-1-06 4.6e-7 4.6e-7 4.6e-7 1.6e+2 4.6e-7200 2.9e+12 5.4e-4 8.3e-4 6.6e-4 1.6e+2 6.6e-4and the condition number of the multiple shooting xh [3, pp.150] is:Xh '= Ncond(A) < KNNK = If RIZwith N = NN.It has been taken for granted that both stabilities are very important to the solutions ofthe system. However, it is our interest here to look into the stability of those algorithmssolving the linear system. The multiple shooting method is only a tool used to form thelinear system so that the numerical comparison of all linear system solvers can be donebased on the same problem.The accuracy of a numerical solution of an ordinary differential system depends on bothtruncation error and roundoff error. The truncation error is introduced when the dif-ferential equations are discretized by some scheme, such as the Runge-Kutta numericalintegrator used here, and the roundoff error, which largely depends on the conditioningof the problem, machine accuracy, the stability of numerical algorithms, and the numberof significant figures used in the implementation.Chapter 3. Numerical Comparison of Algorithms^ 35Table 3.2: Accuracy of solutions of Example I with N = 15A cond (A) EARCP ESQR ESLR ELS EGEPP50 3.1e+01 7.2e-9 7.2e-9 7.2e-9 7.2e-9 7.2e-9150 3.2e+04 6.2e-8 6.2e-8 6.2e-8 7.3e-4 6.2e-8200 9.0e+05 1.1e-7 1.1e-7 1.1e-7 1.7e+1 1.1e-7400 4.5e+11 8.7e-5 1.2e-4 5.8e-5 Nan 3.9e-5The programs are implemented on Sun Sparc 2 with double precision using the C lan-guage. All the routines are self-generated. As mentioned earlier, a Runge-Kutta inte-grator of order four is used in the numerical integration. The condition number of eachcoefficient matrix is estimated in the way as mentioned in [12].Tables 3.1-3.4 show the condition number of the coefficient matrix and numerical com-parisons of the accuracy of solutions resulting from ARCP, SQR, SLU, LS and GEPPwith respect to the parameter A. Two examples collected in §3.1 are tested with differentnumber of mesh points. In all the tables, the absolute errors of solutions, notated by Ewith the names of algorithms as its subscripts, are basically controlled by the order ofthe truncation errors, which is consistent with the theoretical prediction and generally,no better results can be expected.It is shown in Tables 3.1 and 3.2 that if the number of mesh points is fixed, then theparameter A which controls the increasing and decreasing modes, will determine theconditioning of the linear system. For instance, the linear system becomes more ill-conditioned as A increases. On the other hand, for a fixed parameter A, we can see thatthe condition number of the linear system is controlled by the number of mesh points ifa BVP is given. In particular, the conditioning of the linear system will improve withthe increase in the number of mesh points, therefore the algorithms will produce moreChapter 3. Numerical Comparison of Algorithms^ 36Table 3.3: Accuracy of solutions of Example II with N = 7E cond (A) EARCP ESQR ESLR ELS EGEPP0.1 4.2e-1-03 1.3e-8 1.3e-8 1.3e-8 1.3e-8 1.3e-80.05 1.0e+05 5.3e-8 5.3e-8 5.3e-8 5.2e-4 5.3e-80.02 8.7e+07 2.5e-7 2.5e-7 2.5e-7 7.6e-4 2.5e-70.01 1.2e+11 6.4e-7 6.4e-7 6.4e-7 1.16+1 6.4e-7accurate solutions. Unlike the estimation given by [4]cond(A) ,,, N exp(A/N),our numerical results, however, bear out the relationcond(A) , exp(X/N).If we letE = 1/A,then a similar argument can be applied to Example II with respect to A according to thenumerical results in Table 3.3-3.4.The accuracy of the numerical solutions is displayed in Tables 3.1-3.4, which show theclose relationship between the predicted and the computed results. We notice that allalgorithms produce almost the same results for both examples with slightly inaccuracyoccurred in the results of LS algorithm However, it seems that LS algorithm can notobtain accurate results when condition number is higher than 10 7 . This is because of thesquared condition number of the normal equations [4, 17, 38, 14], though the odd-evenreduction process of a positive definite linear system is of high stability.Chapter 3. Numerical Comparison of Algorithms^ 37Table 3.4: Accuracy of solutions of Example II with N = 15E cond (A) EARCP ESQR ESLR ELS EGEPP0.1 4.6e+03 6.3e-10 6.3e-10 6.3e-10 6.3e-10 6.3e-100.05 8.0e+04 2.5e-9 2.5e-9 2.5e-9 2.5e-9 2.5e-90.02 4.1e+07 1.6e-8 1.6e-8 1.6e-8 1.4e-5 1.6e-80.01 2.8e+10 6.1e-8 6.1e-8 6.1e-8 2.9e+0 6.1e-8The advantage of the SQR algorithm is rarely seen compared to those based on Gaus-sian elimination with partial pivoting, such as ACR, SLU and GEPP. One explanationwe can make is that this is because the conditioning of the elimination matrix of thosealgorithms is not large, as shown in Table 3.5. Therefore, the condition number of theremaining matrix is of the same order as the original one.In addition to the stability issue, we also examine and compare the efficiency of algo-rithms ARCP, SQR, SLU and LS, since it is expected that an ideal algorithm should haveboth reliable stability and high efficiency. The efficiency of an algorithm here directlyrelates to the simplicity of a computational method and restraint of fill-in with respectto the special form of linear algebra system (1.13) or (1.14). The algorithm GEPP is notof interest here since it considers the matrix in a dense form, thus introduces too manyfill-ins during the elimination.To transform an arbitary matrix into upper triangular form, we know that there willbe more arithmetic work needed if orthogonal transformations are used instead of ele-mentary transformations. With our first criterion of efficiency, we may prefer algorithmsARCP, SLU and LS to SQR. Table 3.6 shows the comparisons of operation counts ofthese algorithms and the number of fill-ins introduced by each algorithm up to order ofNn 2 or Nnq for ARCP algorithm, where q is the number of boundary conditions at leftChapter 3. Numerical Comparison of Algorithms^ 38Table 3.5: The stability of elimination in SLU with example IA cond (A) cond (L) cond (U)50 3.1e + 01 2.7e + 00 5.1e + 01150 3.2e + 04 2.6e +0000 2.8e + 04200 9.0e + 05 2.6e + 00 7.8e + 05400 4.5e + 11 2.6e + 00 3.8e + 11end and q < n. It becomes an IVP if q = n. The operation counts usually provide theorder of complexity of each algorithm, though in [40] it is claimed that they are verypoor predictors of runtime, which depends on the speed of the particular computer.Except for the ARCP algorithm, the others have (N — 1)n2 fill-ins shown in Table 3.6,which not only increases the storage space demanded by each algorithm, but also re-stricts the sizes of problems to be solved practically with fixed memory space. Tables 3.7and 3.8 show the runtime in seconds for each algorithm. The ACR algorithm proves itsefficiency and flexibility there because it is much faster than the others, though LS andSLU use the same elimination tool. This reflects how the fill-ins affect the efficiency ofan algorithm. In fact, once fill-ins are introduced, they have to be carried on from stepto step, and more other fill-ins will be introduced by them. This dramatically increasesthe computation time, and in turn affects the efficiency of an algorithm very much evenif we neglect the memory problem.There is no surprise that the performance of algorithms SQR and SLU are almost thesame with Example I because the size of the block is 2. The advantage of the simplicityof the elementary elimination is not demonstrated in the computation. However, withthe increase of block size, such as in Example II, the SLU algorithm is almost 1.7 timesas fast as SQR when N = 511.Chapter 3. Numerical Comparison of Algorithms^ 39Table 3.6: The complexity comparison of algorithmsAlgorithms Fill-insOperation CountMultiplications AdditionsARCPSQRSLULS0(N-1) n 2(N-1) n2(N-1) n2Nn (sn 2+2nq—Aq2—n+-q)N (TO^10n2 )N (e-n3^4n 2 )N (29 n3^11n2 )Nn^n2+2nq-3N (7-n3 + 4n2 )N(In3^4n 2 )N (-i7 n3^2n 2 )q)Table 3.7: Performance of algorithms with Example I (A = 100)N TsQR TSLU TARCP TLS31 0.063 0.063 0.013 0.0863 0.153 0.14 0.023 0.173255 0.59 0.583 0.103 0.793511 1.093 1.033 0.197 1.41Summary:In this chapter, we have examined the numerical stability and efficiency of several sequen-tial algorithms such as ARCP, SQR, SLU and LS algorithms. The numerical experimentsshow us that ACR, SQR and SLU achieve as the same order of accuracy of solutions asthat of the Gaussian Elimination with Partial Pivoting. The algorithm LS fails for muchTable 3.8: Performance of algorithms with Example II (c = 0.01)N TsQR TSLU TARCP TLS31 0.387 0.237 0.06 0.38363 0.747 0.557 0.09 0.757255 2.633 1.677 0.353 2.737511 5.613 3.427 0.73 5.653Chapter 3. Numerical Comparison of Algorithms^ 40smaller condition numbers due to the formation of the normal equations, in which thecoefficient matrix has a squared condition number. The numerical results confirm thetheoretical prediction about the stability issues of all the algorithms.Meanwhile, we also obtain information about the performance of the algorithms fromexperiments. The ACR is the most efficient sequential algorithm, with 4.5 times fasterthan SQR, 4.24 faster than SLU, and 6.1 faster than LS when N = 511 in Example I.It is even faster in Example II shown in Table 3.8. The SLU is preferred to SQR inperformance when the size of block is relatively large, say M > 3.Part IIIParallel Algorithms and theirImplementation on INMOSTransputers42In previous chapters, we have studied and implemented several sequential algorithms forsolving a block bidiagonal linear system resulting from discretization of a set of linearBVPs (1.3) and (1.4). Some numerical results related with stability and performance is-sues, which are also extremely important for any parallel algorithm, are compared amongthose algorithms It has been pointed out in [17] that the new bottleneck in parallel im-plementation of solving a set of linear BVPs, is the solution of the resulting linear blockbidiagonal system (1.13) or (1.14). This fact becomes the motivation of this thesis toexplore the potential parallelism of existing algorithms or to invent some new parallelalgorithms.Chapter 4Background and Related Issues4.1 Why Parallel Processing?Parallel processing is defined by contrasting it with normal serial processing, as the useof many processors together in a single machine to solve a single problem. As arguedin [35], all of the major actors in supercomputing today acknowledge that parallelism ofone kind or another is the only way that they'll be able to meet the market's increasingdemand for high-speed, low-cost computing.It is well known that to put some of the new ideas for computer applications into effect,users must have much greater performance available to them at much lower cost, and itis suggested that parallel processing is the best way of meeting their requirements.Since the invention of von Neumann architecture in 1940s, parallel computers have gaineda rapid influence in the computer world. Most of the changes began in the 1960s when theswitch from vacuum tube to solid state components was obviating Grosch's Law on thehardware front. Meanwhile, programmers were inventing ways of handling concurrency.There are two arguments to show that the future of high-performance computing belongs43Chapter 4. Background and Related Issues^ 44to them. The first is economic: parallel computers tend to be much more cost effectivethan their serial counterparts. This is primarily because of the economics of scale ofVLSI technology — with any given technology, ten small processors with the same totalperformance of one large processor almost invariably cost less than that one large pro-cessor.The second argument is based on a fundamental physical law. One can trade individualprocessor performance against the number of processors to achieve the same absoluteperformance in a variety of ways. However, to achieve teraFLOPS performance it is clearthat the only practical approach is to use a very large number of very fast processors.4.2 Parallel ArchitecturesBased on Flynn's Taxonomy [35] of computer architectures, computers are divided ac-cording to whether they use single or multiple "streams" of data, and (orthogonally)single or multiple "streams" of instructions, into four categories: SISD, SIMD, MISDand MIMD.An SISD computer carries out one instruction on one datum at a time. This is theconventional von Neumann architecture. An MISD computer, on the other hand, wouldapply several instructions to each datum. No computers conforming to this model haveyet been constructed.Compared with SISD and MISD, processors in an SIMD computer, are able to simulta-neously execute the same instructions, but on different data, while an MIMD computerSingle Instruction Multiple InstructionSingle DatumMultiple DataSISD (von Neumann)SIMD (CM,CRAY)MISDMIMD (Transputer, Alliant)Chapter 4. Background and Related Issues^ 45Table 4.9: Family of parallel computerscontains several independent processors, each of which executes its individual program.Table 4.9 shows this classification and examples related to each model, where CM standsfor the Connect Machine.It is desirable for processors to be able to communicate among themselves during thecomputation in order to exchange data or intermediate results in parallel computers.Basically, there are two ways to achieve this, i.e., the communication is performed eitherthrough a shared memory or an interconnection network. This leads us another classi-fication of parallel computers, namely shared-memory machine and distributed-memorymachine.In general, a shared-memory machine has a small number of powerful processors, in whichcase each processor is able to access the entire memory of the machine with high speedif there is no confliction. Overhead in such a system is governed by the synchronizationrequired to prevent conflicts in memory access. As the number of processors on such asystem increases, so does the cost of memory access. There is, therefore, a practical limiton the number of processors which may share global memory efficiently.This model is in contrast to the distributed memory model, consisting of a large num-ber of small processors, where each processor maintains its own local memory space,and there is no global memory available. The inputs needed for computation by eachChapter 4. Background and Related Issues^ 46processor are obtained through communication between processors achieved by "mes-sage passing", that is, by sending data packets from one processor to another via aninterconnection network. The speed of communication in this model, which is machinedependent, plays a very important role in performance. There is no real limit on thenumber of processors and it is expected that more jobs can be done simultaneously iflarger number of processors are used. However, this will also cause the increase of thecost of communication.With this classification, MIMD computers sharing a common memory is often referredto as multiprocessors (or tightly coupled machines) while those with an interconnectionnetwork are known as multicomputers (or loosely coupled machines) [29], which is wherethe transputer belongs.4.3 Performance AnalysisTraditionally, one of the most important tools of the numerical analyst for evaluatingalgorithms has been computational complexity analysis, i.e., operation counts. Thisarithmetic complexity remains important for parallel computation, but several other fac-tors become equally significant, such as speedup (or performance) and the degree ofparallelism.Most parallel algorithms follow one or both of two related principles which we refer toas divide and conquer and reordering [26]. The divide and conquer approach involvesbreaking a problem up into smaller subproblems which may be treated independently.Frequently, the degree of independence is a measure of the effectiveness of the algorithmChapter 4. Background and Related Issues^ 47for it determines the amount and frequency of communication and synchronization. Theconcept of reordering may be viewed as restructuring the computational domain and/orthe sequence of operations in order to increase the percentage of the computation thatcan be done in parallel.The most commonly used definition of speedup is:execution time using one processorSp =^execution time using p processors 'however, it is too general to see any particular feature of the algorithm from the abovedefinition. Ware ([37]) suggested another definition of speedup in order to reflect moreclearly the degree of parallelism in a parallel algorithm.SP = [(1 - a) + par,where a is the fraction of work in the algorithm that can be processed in parallel, andthe execution time using a single processor has been normalized to unity.Buzbee in [9] noted that a weakness of the Ware model of speedup is that S,, = p foran algorithm that is completely parallel (a = 1), which is unlikely because of variousoverheads associated with parallel computation. To find a more realistic estimation, hesuggests the following change to the Ware model:Sp = [(I — ce) + pa + cr(P)] - 1 ,where a(p) reflects the overhead of using p processors.It has been widely accepted that efficiency is defined as the ratio of speedup and thenumber of processors, i.e.,SpEp = - ,PChapter 4. Background and Related Issues^ 48which shows that in the ideal situation of a speedup of p for p processors, the efficiencymeasure is unity.A new performance model will be used to measure all algorithms implemented here. Themodel take all the hardware and software factors into account, and make a good pre-diction of the performance of algorithms running on any MIMD architecture. It will beexplained in the next section.There are several types of parallelism. One of the simplest is the task farm. In thisapproach, a master process produces a number of independent tasks which are farmedout to slave processes. Once these have been processed, the results are collected by athird process, which writes them to disk or displays them on a graphics board. Typically,the master and each slave reside on different processors. This arrangement provides goodload balancing for many types of problem — if one task takes a long time to complete,the other processors in the system can get on with processing other tasks in parallel.This type of parallelism is very effective for a problem which can be broken down intoa large number of tasks which can be executed independently of each other. In orderto maximize the throughput of a task farm the number of tasks should be much greaterthan the number of processors available.Another effective form of parallelism is grid decomposition, in which case the applicationmust be based an underlying grid. A third is algorithmic parallelism, in which differentfunctions within the application are put on different processors. The performance willbe restricted in this case when one of the functions is more compute-intensive than theothers.Chapter 5The Transputer Network and a Per-formance Prediction Model5.1 Transputer ArchitectureThe existing transputer network here consists of 74 INMOS (model T800) transputers[33], which is the latest member of the INMOS transputer family [34].As described in [35], each IMS T800 transputer is a 32 bit 10 MIPS microprocessor (CPU)that has been specifically designed for use in concurrent message-passing system. Thegeneral philosophy behind the transputer is to provide a family of compatible compo-nents which are able to communicate with one another using a minimum of external logic.In addition to the VLSI CPU, the transputer has its own either 1 or 2 Meg memory, 4serial data communication links, and a floating point unit(FPU) on a single chip. Thememory is not shared amongst the transputers; inter-node communication is effectedsolely through the serial data links, which may be wired directly to each other. Eachserial link provides a data rate of 1.7 Mbytes per second unidirectionally, and the max-imum data rate into an IMS T800 is 6.8 Mbytes per second since there are four links49Chapter 5. The Transputer Network and a Performance Prediction Model^50receiving simultaneously.In the network, there are 10 C004 programmable 32 x 32 crossbar switches, each of whichhas 32 data links and 1 control link and is capable of creating, under software control,16 bidirectional connections. Pairs of data links on a C004 may be connected by sendingappropriate software commands to the control link. Other transputer links are used forI/O with the external world. In this way, transputerlink interconnections may be variedthrough software commands, as requested by a user.The transputer family has been designed for the efficient implementation of high levellanguage compilers. Transputers can be programmed in sequential languages such as C,PASCAL and FORTRAN (compilers for which are available from INMOS). However theoccam language allows the programmer to fully exploit the facilities for concurrency andcommunication provided by the transputer architecture.The Trollius operating system [33] was developed out of dissatisfaction with the standardsoftware environment for transputer-based multicomputers. It was developed at CornellUniversity and Ohio State University to run on distributed memory multiprocessors. Itis a dynamic programming environment that is easy to use, and one that uses standardFORTRAN and C programming languages. It consists of two parts, running on thefront-end workstation and transputer respectively. Trollius operates over UNIX on thehost, providing a user command interface to boot the multicomputer nodes, load parallelprograms to transputers, and kill processes, among other facilities.Trollius is designed in layers of functionality, starting with message passing with a node,Chapter 5. The Transputer Network and a Performance Prediction Model^51extending to nearest neighbor communication, then to arbitrary node to node communi-cation at the higher level. Along with standard programming languages, Trollius providesfor the inclusion of standard UNIX libraries, access to standard I/O from all processes,and dynamic memory allocation. It also provides mechanisms for message passing be-tween processes in addition to routines for process creation, process destruction andaccess to remote file systems.5.2 A Performance Prediction ModelAs pointed out earlier, the measurements of speed up discussed previously are too generalto accurately predict the programs running on a particular system, such as a transputernetwork. As opposed to shared-memory machine and super distributed memory machinesuch as Hypercube, systems like the transputer need a high portion of communicationduring the computation, which fails most of the existing performance prediction models.To effectively program on these systems, a more accurate theoretical model for perfor-mance prediction was proposed in [30]. It is good in the sense that it takes all thehardware and software information into account.The performance prediction model proposed in [30] can be used for two important parallelstrategies: processor farm and divide and conquer. It based on a system which is simple,efficient and general enough to implement processor farm and tree-structured computa-tion. In addition to modelling both the hardware and software of the system we know,it also considers both parallelism and the communication cost involved in exploiting thisparallelism.TcommCTcompChapter 5. The Transputer Network and a Performance Prediction Model^52According to the model, the execution time (Texe ) is basically divided into computation time(Tcomp) and communication time (Tcomm ) which consists of both non-overlapped (0)and overlapped time (r) with computation. The non-overlapped time which includesboth the overhead for receiving and executing a task locally /3 e and the overhead forreceiving and forwarding a task /3/, is the time for startup which is expressed as a fixedtime unit (microseconds) and it has nothing to do with the number of processors used inthe computation, while the others is expressed as bytes/second. The family of executiontime is shown in Fig.5.1.Figure 5.1: The family of execution timesIt is of our interest here to review briefly the performance model of [30] in the caseof a processor farm. The model is based on a software system in which only a singleroot processor is allowed to take tasks entering the system. The remaining processorsrequest tasks from its parent processor which is assigned statically at compile time. Onreceipt of a task, a processor either processes the task locally, or forward it to its childrentask^ request/result"......._task^ request/resultChapter 5. The Transputer Network and a Performance Prediction Model^53processors depending on whether there is a request from its children or not. The resultswill be returned from children to parent. It can be seen easily from paradigm Fig.5.2.Figure 5.2: A simple software systemBased on the software system described above, we now consider the time of executionwhich consists of two parts: time for steady state computation and for startup cost. Inthe steady state, we assume an ideal task distribution in which processors are never idleas long as there are unprocessed tasks in the system, and the structure of processors is abalanced k-ary tree with k any positive integer (1-ary tree is a chain) and all processorson the same level of the tree will execute the same number of tasks. Thus, the throughputof a balanced k-ary tree in the steady state is:SL = 1^ \ [1^(k (Te + Oe — 0.0 )1,1,(Tie + 13e ) —k^(Te + Oe — 0 f i Te + 0,where L is the level of the tree, Te is the processing time of the task, Of is the startupoverhead for receiving and forwarding a task and p e is the startup overhead for receivingChapter 5. The Transputer Network and a Performance Prediction Model^54and executing a task locally.The startup cost is defined as the time it takes for the system to return the first result.For the case with an L level of the tree,Tstartup = (L — 1) (2T- + 13f ) + Te + /3e.Therefore, the execution time T of an L level balanced tree system isJ — 1T = Tstartup + si,with J the total number of tasks, and the speedup SP", is defined asJT,SPE., = T 'SP', ==whereJT,(1 — aL )(J — 1) [(T, + 13,) — k (T, + /3,) — k (T, + /3, — ,fif)1+ Tstartup (1 — aL )JTe (1 — al')(J — 1) [(1 — 2k) (Te + fie ) + /Of] + Tstartup (1 — aL)or(5.56)k (T, + ,3, — i3f )a = T, + 0,It is easy to get the speedup prediction for an application if the total number of tasksand the execution time of each task are given. The two overheads for communication O fand i3e , which depend on the underlying hardware and software system, are assumed tobe constant and can be measured experimentally for a given implementation.Chapter 5. The Transputer Network and a Performance Prediction Model^555.3 Paradigm ImplementationTo implement the underlying software system we described above, [31] proposed an im-plementation based on a processor farm, where little or no communication is requiredbetween processors except that the data has to be distributed before execution and theresults from all the processors need to be collected after execution. Each processor willrun the same program with a different set of data. This type of implementation lies inthe first category of parallelism described before and also similar to a "Master-workers"situation, where a controller or master issues tasks to a network of processors, withoutspecifying which processor should execute a particular task. This processor farm imple-mentation is done under two common system environments, i.e., linear chain and binarytree.On the master node, there is a task manager process and a result manager process, wherethe task manager process controls task distribution on the whole system and the resultmanager process controls the collection and forwarding of results. Both of them run athigh priority.Meanwhile, there are four communication processes at high priority and one computa-tional process at low priority running on each worker node. As shown in Fig.5.3, thosefour communication processes include receiving the tasks from the predecessor node,forwarding the tasks to successor node, receiving the results from successor node andforwarding the results to the predecessor node. There are two buffers with fixed sizequeues used for storing unprocessed tasks and resutls to be forwarded. The computa-tional process, which runs independently with the communication processes, executes thetasks allocated to the local node.P task_fwdP rslt_recvP task recvPrsltfwdPredecessorChapter 5. The Transputer Network and a Performance Prediction Model^56Worker NodeIFigure 5.3: Processes running on a worker nodeA "flood-fill" technique is used to distribute tasks from the master node to the workernodes. The task receiver process, task forwarding process and local computational processare able to request tasks from task manager if the task buffers on a node is empty, or anew task is available, or there is any unprocessed task in the buffer, respectively. Thetask manager will block the corresponding process if it is not ready to request the task,as shown in Fig.5.4.The purpose of using small size of task buffers on each node is to balance the load onall the worker nodes by keeping them busy as long as there are any unprocessed tasks inthe system. It has been pointed out in [31] that buffers of size one are good enough ifthe communication time required to transmit a task from one node to the next is verysmall compared to the processing time of a task.Parallel Algorithms and their Implementations^ 57Figure 5.4: Task distribution managementChapter 6Parallel Algorithms and their Imple-mentationsThere are two things important in designing a parallel algorithm in our templated-basedimplementation with a processor farm strategy. One is to design a parallel algorithm the-oretically, the second is to define a job unit, which is executed by the local computationalprocess. A parallel algorithm can therefore be decomposed into a number of identicaljobs with different data flow. The speedup largely depends on the number of jobs andthe size of each job, according to the performance prediction model described before.The parallel LS algorithm and parallel SQR algorithm and their implementation con-siderations will be described here. The performance and complexity of the parallel SLUalgorithm will be shown later on because the strategy of parallelization for SLU is iden-tical to that of SQR.6.1 Parallel LS AlgorithmBased on the description of the sequential LS algorithm in Part II, and the idea ofparallelization mentioned in [4], we know that the parallel LS algorithm actually consists58Chapter 6. Parallel Algorithms and their Implementations^ 59of three phases of parallelization. The first phase of parallelization is to form a blocktridiagonal linear system in parallel, which includes the formation of its coefficient matrixand its right hand side, or B = ATA and f =ATg. The other two phases are to parallelizethe cyclic odd-even reduction and backward substitution processes.1. Formation of the block tridiagonal linear systemAt this phase, the more time consuming work is to form the coefficient matrix, ablock tridiagonal matrix B = ATA comparing with the work for f = ATg, whichhas a lower order of complexity 0(0N). For the ease of analysis, we denote B'and A' as the the augmented matrices of B and A, respectively. In other words,B' = (B f) and A' = (A I g).The parallelization of this part of the algorithm is somehow straightforward becauseit is easy to check that the (i 1)th row of the matrix B' depends only on the ithand (i 1)th rows of A' for 1 < i < N — 1, and the first row of B' depends onthat of A' only. The structure of this relation can be depicted by an example ofN = 9, as in Fig.6.5, where each white ball and black ball stand for a block rowof the matrices A' and B', respectively. Two balls in the same vertical line havethe same row number, which is marked on the corresponding white ball. Each linewith arrow shows the dependency among the rows of A' and B'.If we use the notation A = (A1,i ) and B = (Bi,j) with the full matrices Ai,j andBi ,i of order n, g = (gi ) and f = (fi) with the vectors gi and fi of order n, thenB = (Bi,j) can be expressed as follows:Bi,i-1 =^*Chapter 6. Parallel Algorithms and their Implementations^ 60BAFigure 6.5: The structure of the formation of normal equationBi , 1 = AT 1 , i * AT Li + AT * Ai ,j,Bi,i-Fi = AT * A.1,i+17^ (6.57)fi =^+ Angi , 2 < i < Nwhere 2 < i < N. Also, we use the notation " *" to represent the multiplication oftwo matrices of order n.Let a positive integer Sjob be the number of rows in B to be formed in each job;then we are able to estimate the highest order of complexity at this step, as shownin the Table 6.10, where n is the number of equations to be solved, or the size ofeach block in the coefficient matrix. The complexity at this phase is^ (8n3N b) = —1 8n3NPS.job^Sjob) pwith P the number of processors. In each job, the lengths of the input and outputdata are 2 (n 2 n) x (Sj ob + 1) and 2n 2 x Sj ob, respectively.2. The Odd-Even ReductionOperation Bi,i-1 Bi,i Bi,i+i each jobMultiAddn3^2n3n3^2n32n 2 4n3 x Sj ob2n2 4n3 x Sj obn3Chapter 6. Parallel Algorithms and their Implementations^ 61Table 6.10: Job complexity of B = AT A and f = AT gSince a basic idea of the odd-even reduction is that all the off-diagonal blocks ineven-numbered block rows can be eliminated independently by their adjacent odd-numbered block rows, i.e., the off-diagonal blocks in the 2ith row can be eliminatedby the (2i — 1) th and (2i + 1) th rows using elementary operations. This providesus with the idea of parallelization at this reduction phase. The elimination processcontinues until the last even-numbered row has been done.In fact, according to the sequential odd-even reduction described before, we knowthat for a fixed j (1 < j < m — 1), k in (2.38) can be computed for all i indepen-dently. Fig.6.6 shows the tree structure of this reduction process with N = 15 orm = 4, where each circle represents a block row of B with the row number insideit. Each line with arrow shows the dependency of the rows between the adjacentsteps of the elimination. It takes three steps (or m — 1 in general) to finish thereduction phase in this example.Let Sjob be any positive integer, and we can define the job unit in terms of theparameter Sj ob. Each job will eliminate all the off-diagonal blocks in Sjob con-secutive even-numbered block rows. The complexity of each job can be estimatedChapter 6. Parallel Algorithms and their Implementations^ 62Figure 6.6: The structure of Odd-Even reductionaccording to the formula in (2.38), as shown in Table 6.11. This compares with102 i" - 1E max n^ 1 1) n3Sj ob^(6.58)i=2job ras the highest order of complexity of the complete reduction process, where m =[log N] and P is the number of processors. The complexity of this reduction processbecomes 0 (log N) if there are an infinite number of processors available, as in [4].It is simplified to be in the sequential case if Sj ob = 1 and P = 1, or5^2' - 1max ([^ ], 1) n3 Sj obi=2 °job r- E(2 i-1 - 1)n33 i=22'n335-3n3N.The lengths of the input data and output data in each job are:(6.59)(3n 2 n) x (2Sj ob + 1) and (3n 2 n) x Sj ob,Operation Fi/Gt DI /El .13iMulti 5n3 n3 2n3Add 3 n3 n3 4n3each job10 n3 x Sjob 310 n3 X S. 1_,3 '^SobChapter 6. Parallel Algorithms and their Implementations^ 63Table 6.11: Job complexity in the Reductionrespectively.3. Backward SubstitutionThe solutions of the linear system will be solved at this stage. According to theformula (2.41), the complexity for computing each x i is a n3 multiplications andln3 additions. It is not significant compared with that of the reduction phase inthe sequential case because the complexity for this process is 3n 3N, which is about1/10 as much as in the reduction phase. However, it could also become a bottleneckin the parallel case if it is not well-parallelized. The procedure of this backwardsubstitution is described in Fig.6.7, where each circle represents a block unknownwith the number in it.From (2.41) again, we know that with a fixed j, x i can be computed independentlyfor a set of i, i.e., those x i 's are able to be processed in parallel. The number of x i 'sto be solved in each processor at each time determines the size of each job assignedto the processor. Therefore, we may define a positive integer Sj ob as the numberof x i 's to be computed in each job, which needs about n 3S. b multiplications and3^J on3Sj ob additions, or 372,3Siderive the complexity for the complete backward substitution process based on theob in total, as shown in Table 6.12. Hence, it is easy toChapter 6. Parallel Algorithms and their Implementations^ 64definition of each job and the number of processors available in processing. That is32 m -1^217z-jE max ([^ ,n} 1) n3 Sjob.=1^°job/-With the same argument as in the reduction phase, the complexity becomesFigure 6.7: The structure of LS backward substitution51 n3 log N multiplications and 3 n3 log N additions if there are an infinite numberof processors available, and becomes -13- TON multiplications and 3 77, 3 1V additions inthe sequential case when Sjob = 1 and P = 1.To compute an x i based on the formula (2.41), we need to input three matricesand three vectors in total. They are M -1 , Et i , M -1 , and xi+2;-i, x i_v -i,Therefore, the lengths of the input and output data are (30+3n)xSjob and nxSjob,respectively.(6.60)Operation xt each jobMultiAdd51 n3 x SobJOn3 x s. b3^JOIn33in33Chapter 6. Parallel Algorithms and their Implementations^ 65Table 6.12: Job complexity in the Backward Substitution6.2 Parallel Structured QR AlgorithmUnlike the method used for the parallelization of the LS algorithm, which very easilyexploited its parallelism because most of the computation can be expressed by a formula,there is another way to exploit the parallelism of the SQR algorithm Based on thesequential SQR algorithm and the idea proposed in [40], we may describe the parallelSQR algorithm as follows:• Divide the coefficient matrix (1.14) except the last block row containing Ba andBb into several divisions, say p divisions. Each division consists of (N — 1)/p blockTOWS;• Perform sequential SQR on each division, which produces the same structure as(2.26) excluding its last block row. This step can be done independently withineach division;• Select the last row from each division and the row consisting of 13,, and Bb to forma new block matrix of order (p 1)M. It is easy to see that the new matrix hasthe same structure as (1.14);• Repeat this operation until the linear system with the new block matrix as itscoefficient matrix is small enough to be solved efficiently;• Solve the last small system sequentially;Chapter 6. Parallel Algorithms and their Implementations^ 66• Reverse the course of the operation and use the recursion to compute the corre-sponding components of the solution.To be more clear, we use an example N = 7 and p = 2 to illustrate the above description.First of all, we write the coefficient matrix asAl CiA2 C2A3 GA4 C4A5 Cs(6.61)A6 C6Ba Bband assign three partitions with two block rows in each to this matrix. After the firstelimination step, the matrix A becomesA =Gi Ri ElC21)G3 R3 E3G (41)^di)Gs R5 Es4 )^di)(6.62)A( 1 ) =Ba BbA new block matrix A( 1 ), which consists of all the rows with superscript (1), i.e., the lastrows in all partitions, is then formed according to the third item of the description. ItChapter 6. Parallel Algorithms and their Implementations^ 67can be expressed as follows: - G1) d1)OP d1)4)A( 1) . C61)Bb(6. 63)BaSince A( 1 ) is small enough and its related linear system can be solved efficiently sequen-tially, we thus stop repeating the elimination procedure. Instead, the sequential SQRalgorithm is applied to the underlying system. The corresponding unknowns x 1 , x3 , x 5and x 7 are solved at this stage and the rest of the unknowns can be computed by usingthe recursion formula (2.28) with selected i because only those unknowns xi with knownxi+1 can be computed.In general, this parallel process contains two phases, structured QR elimination andbackward substitution. We will consider them separately:1. Parallel SQR EliminationAt this phase, the number of partitions plays an important role in the paralleliza-tion because the sequential structured QR elimination can be applied to each ofthem independently. This, in turn, relates to the size of each partition which isindependent of the order of the underlying matrix. Therefore, we may define aninteger Sjobjob (Sjob > 2) as the size of each partition or the number of block rowsinvolved in each partition. A job is then defined as performing the sequential struc-tured QR elimination on a partition.Chapter 6. Parallel Algorithms and their Implementations^ 68It needs about 29-^ 2n3 multiplications and n3 additions for each block row elimination. A partition with Sj ob block rows then requires aboutQR58—2n3 x (Sjob —1)^multiplications and additionsto accomplish each job, shown in Table 6.13. The total amount of work to completethe elimination phase is58 m^N- E max ([^J, 1) n32 (Sjob —1),^(6.64)joi.i.^(Sb)iPwhere integer m = log 2 (N-1)/ log 2 Sj ob and P is the number of processors available.The expression (6.64) shows us the highest order of the complexity in the generalcase. Particularly, it can be simplified as- Em58^ Jn3S. b = —58 n3rn S.1 b = -58 723 log 2 N2 i=1^°^2^°^2 (6 .65)if there is an infinite number of processors available and the integer Sj ob is inde-pendent of N, whereas in the sequential case, it becomesbecause Sj ob = N and m = 1.8 3 0^8 3 it rn aibb = --2- n iv. (6. 66)The structure of this elimination process can be shown by an example with N = 9and Sj ob = 3, as in Fig.6.8, where each circle represents a block row of the matrixA with the row number in it and each line with arrow represents the dependency ofrows between the adjacent steps in the elimination. It takes two steps to finish thiselimination process. The lengths of input and output data in each job are listed inTable 6.14.Chapter 6. Parallel Algorithms and their Implementations^ 69Figure 6.8: The structure of structured QR elimination2. Backward SubstitutionThe backward substitution is largely based on the parallel strategy used in thestructured QR elimination phase, although we still have the freedom to define thesize of each job. For the sake of simplicity, we use the same integer Sjob as inthe elimination phase. Each job here can be defined in a way such that there are(Sjob- 1) unknowns to be solved in it. The complexity for each job in the backwardsubstitution phase is listed in Table 6.13, which is based on the workn3 multiplications and additionsfor computing each xi in the backward substitution according to the formula (2.28).The complexity for the whole process is of the same order as in the elimination phaseexcept the coefficient with 1/2 instead. The lengths of input and output data ineach job are also listed in Table 6.14.Chapter 6. Parallel Algorithms and their Implementations^ 706.3 Implementation ConsiderationsSo far, we have studied and analyzed both the parallel LS algorithm and the parallelstructured QR algorithm and their complexities. According to the classification of theparallel strategy described in §2, it is interesting to notice that these two parallel algo-rithms belong to different categories. In other words, the parallel LS algorithm is in thefamily of divide and conquer, while the other is in the family of reordering. We will seefrom the next section the performance of these two parallel implementations by usingthe same processor farm strategy.However, these two parallel algorithms share some common features in the design. Bothof them introduce an integer parameter to control the size of each job, which is the com-putational work executed by each processor without data exchange. Each algorithm isthen divided based on different strategies into several independent portions which canbe treated in parallel. It is clear that the size of each job increases with the increase ofparameter Sjob and there is an optimal timing for each job, when the maximum speedupcan be achieved. However, this also increases the size of input and output data, whichin turn increases the cost of communication. As we observed earlier, the communicationcost is one of the key factors to the speedup in the implementation on the transputers.Therefore, we have to consider the trade-off between the computation efficiency and com-munication cost.There are three phases in the parallel LS algorithm and two phases in the parallel Struc-tured QR algorithm Due to the sequential feature among different phases in each algo-rithm, a global flag has to be used to control the processing. The complexities of thesetwo parallel algorithms have the same highest order of 0 (n3 log N) if Sjob is independentOperation SQR Elim. Backward Subst.-2-29 0 x (Sjob — 1)2 7-t3 x (Sjob — 1)An3 x (Sjob — 1)1723 X (Sj ob — 1)MultiAddPhase Inputs OutputsSQR Elim.Backward Subs.(2n2 +n) x Sj ob(3n2 -1-n) x (Siob —1)-I-2n(3n 2 +n) x Sjob —n2nx(Siob-1)Chapter 6. Parallel Algorithms and their Implementations^ 71Table 6.13: Job complexity of the parallel SQR algorithmTable 6.14: Communication cost in parallel SQR algorithmof N.Chapter 7Numerical Results and PerformanceAnalysisThe parallel LS algorithm, the parallel SQR algorithm and the parallel SLU algorithmhave been implemented on the INMOS transputer with two testing examples introducedin §4. As discussed in the earlier sections, the communication and computation costsare two important keys to the performance. However, some trade-offs have to be consid-ered between the increase of computation time (or the size of each task) and the decreaseof communication time (or the size of input and output data) to achieve optimal speedup.There is another factor which affects the performance also. It is often noticed that all thejobs in a tree structure are not equivalent and independent since those parent jobs needthe data from their children. An algorithm which tries to solve a block bidiagonal linearsystem always contains jobs with tree-like structure, and therefore consists of severalsteps. This factor can not be neglected in the performance analysis because the depen-dencies among jobs restrict the concurrent execution of the program. However, for theease of using the performance prediction model, we assume that all the jobs we considereach time are independent. We will see later that the speedups obtained based on theperformance prediction model in most cases are better than that in the real situation,72Chapter 7. Numerical Results and Performance Analysis^ 73Table 7.15: Formation of the Normal Equation (Phase I)Sl obInputs Outputs No.ofJobsEx.I Ex.II Ex.I Ex.II2 144 480 64 256 5123 192 640 96 384 3414 240 800 128 512 2565 288 960 160 640 2056 336 1120 192 768 1717 384 1280 224 896 1478 432 1440 256 1024 1289 480 1600 288 1152 11410 528 1760 320 1280 103which is to be expected.To obtain a clear picture about the performance of an algorithm, we suggest to ana-lyze it phase by phase in the way we described in Chap.6. This is because the job sizevaries with each phase and also the parameters should be chosen differently to achievebest performance. Generally, we would expect nearly maximum speedup of an algorithmachieved if the maximum speedups have been achieved at each phase of the algorithm.According to the performance prediction model described in §6, the speedup, from theuser point of view, depends on the total number of jobs, input and output data of eachjob, and the timing in microseconds for accomplishing each job (or the size of each job).The job size and the input and output data which depend on the number of equationsto be solved are independent of the number of mesh points. However, the total numberof tasks does depend on the number of equations.In the following, we will show some data with tables and figures about the speedups atChapter 7. Numerical Results and Performance Analysis^ 74Table 7.16: The Odd-Even Reduction (Phase II)Sl obInputs Outputs No.ofJobsEx.I Ex.II Ex.I Ex.II2 280 1040 112 416 5113 392 1456 168 624 3414 504 1872 224 832 2565 616 2288 280 1040 2076 728 2704 336 1248 1757 840 3120 392 1456 1498 952 3536 448 1664 1299 1064 3952 504 1872 11710 1176 4368 560 2080 107Table 7.17: The Backward Substitution (Phase III)S.Jo bInputs Outputs No.ofJobsEx.I Ex.II Ex.I Ex.II2 144 480 16 32 5123 216 720 24 48 3464 288 960 32 64 2575 360 1200 40 80 2106 432 1440 48 96 1767 504 1680 56 112 1538 576 1920 64 128 1309 648 2160 72 144 11910 720 2400 80 160 108Table 7.18: Optimal Parameters at Three PhasesPhaseEx.I Ex.IISjob Nodes Speedup Siob Nodes SpeedupIIIIII6461319134.56.94.53242024178.211.16.9Chapter 7. Numerical Results and Performance Analysis^ 75phases of three parallel algorithms and the predicted optimal parameters with which thebest performance is achieved theoretically at phases will be verified in the experiment.Some numerical comparison of the performance of three algorithms will also be given.7.1 Parallel LS AlgorithmRecall what we have described in the last section: there are three phases which are theformation of a normal equation, the Odd-Even reduction and the backward substitutionin this algorithm Tables 7.15-7.17 show that the sizes of input and output floating pointdata in bytes increase with the increase of parameter Sj ob, and the number of jobs withparticular mesh points decreases respectively. They also show that the sizes of input andoutput data in Phase I increase more slowly than in the other phases. Phase II requiresmore data than the others, which restricts the use of large values of the parameter toincrease the job size.Figures 7.9-7.11 show the time to finish each job obtained from real execution of the pro-gram and the speedups of each phase based on the performance prediction model with1023 mesh points. In Phase I, the maximum speedup achieved is 4.64 in Example I and8.2 in Example II. The maxima occur at Sj ob = 8 and Sj ob = 4, respectively. We canclearly see from 7.9-7.11, with those Sjob bigger than 8 or 4, the speedups decrease eventhough the time of each job increases. This reflects the fact that the communication costdominates the computation cost in the execution with the dramatic increase of the inputand output data.Similarly, the maximum speedups in Phases II and III are 6.5 and 4.31 in Example I,Chapter 7. Numerical Results and Performance Analysis^ 7611.1 and 6.9 in Example II. The speedups of Example II are better than those in Ex-ample I at all three phases because it takes longer to finish each job, which means moreparallelism is available though the communication time increases as well. However, dueto the fact of the increase of communication time, the maximum speedups are achievedat smaller values of Sj ob in Example II compared with those in Example I. For instance,the maximum speedup in Phase II of Example II is obtained when Sj ob = 2. Table 7.18shows the optimal choices of the number of transputers and the best speedup we canexpect at each phase. The related Sj ob values are also listed in the table.Generally, we could expect better speedups if more mesh points are used according tothe prediction model. The total number of jobs increase with the same lengths of inputand output data. However, the height of the tree will increase also. This will restrict thespeedups in the real execution.It is known that there is only one step in the first phase, which allows us to believe thatthe speedup shown in Figure 7.9 is close to the real speedup in the execution. However,the speedups predicted by the model in Phases II and III will be better than what wecan actually get in the experiment.Figure 7.9: The performance analysis in Phase I of PLS algorithm806040200'0 6Sjob421.--.1--1-18^10^12-0- Time(Ex.1)-•- Time(Ex.II)-0- SP(Ex.I)-0- SP(Ex.II)Chapter 7. Numerical Results and Performance Analysis^ 77Figure 7.10: The performance analysis in Phase II of PLS algorithmChapter 7. Numerical Results and Performance Analysis^ 78Figure 7.11: The performance analysis in Phase III of PLS algorithm7.2 Parallel SQR and SLU AlgorithmThere are two phases in the parallel structured QR/LU algorithms, that is, the Elim-ination and the Backward Substitution. The programs are exact the same in Phase IIfor both algorithms. The size of input and output data are also the same between twoalgorithms in Phase I, though the elimination tools they used are different.Table 7.19 and 7.20 show the length of input and output data in bytes and the timingof each job in microseconds with respect to the parameter Sjob in two phases of bothSQR and SLU algorithms, where Ex.I and Ex.II stand for Example I and II. It is easyto see from these two tables that the parameter Sjo b controls the size of each job (the"Time" columns in the tables), and the size of input and output data in both algorithms.In other words, we can say that both the communication and computation time increasewith the increase of parameter SJ.o Jb. The effect of AS-o b becomes more clearly in ExampleChapter 7. Numerical Results and Performance Analysis^ 79Table 7.19: The Parallel SQR/SLU Elimination (Phase I)Sjo b Inputs OutputsTimeSQR SLUEx.I Ex.II Ex.I Ex.II Ex.I Ex.II Ex.I Ex.II2 80 288 96 352 2.3 11.3 2.3 7.53 120 432 152 560 4.2 21.9 3.9 13.74 160 576 208 768 6.1 32.5 5.5 20.05 200 720 264 976 8.1 43.1 7.2 26.66 240 864 320 1184 9.9 53.7 8.8 32.87 280 1008 376 1392 11.8 64.1 10.5 38.88 320 1152 432 1600 13.6 75.7 12.2 45.59 360 1296 488 1808 15.5 85.9 13.8 52.210 400 1440 544 2016 17.5 96.0 15.5 57.3II.Besides, we may also notice that more data are used in Phase I than in Phase II, whichmeans the transputers will spend more time in the communication in Phase I. Table7.19 shows that it takes longer to finish a job in the SQR algorithm than in the SLUalgorithm, and therefore more speedup can be expected on the SQR's than SLU's, whichis true according to Table 7.20. Meanwhile, Figure 7.12 and 7.13 show the timing of eachjob for Phase I of the SQR and SLU algorithms, and Phase II of SQR/SLU algorithmswith two examples respectively.With a particular parameter, the number of mesh points has to be chosen in a way sothat it is the power of the parameter. Table 7.21 and 7.22 show the total number of jobsas well as the predicted speedups of both algorithms when the corresponding number ofmesh points is used. The speedups are also compared in Figure 7.14 and 7.15 with twoexamples. Three lines in each figure stand for the speedup in Phase I of SQR and SLUChapter 7. Numerical Results and Performance Analysis^ 80Table 7.20: The Parallel SQR/SLU Backward Substitution (Phase II)SlobInputs Outputs TimeEx.I Ex.II Ex.I Ex.II Ex.I Ex.II2 72 240 8 16 1.0 2.53 128 448 16 32 1.7 4.74 184 656 24 48 2.4 6.95 240 864 32 64 3.2 9.16 296 1072 40 80 3.8 11.37 352 1280 48 96 4.5 13.58 408 1488 56 112 5.2 15.79 464 1696 64 128 5.9 17.810 520 1904 72 144 6.6 20.0algorithms, and Phase II of SQR/SLU algorithms.The optimal number of transputers for different parameters is listed in the columns Nodesin Table 7.21 and 7.22. Based on the information about the number of mesh points andthe speedups, we may conclude that parameters Sj ob = 6 and Sjob = 7 in ExampleI, and parameters Sjob = 3 and Sjob = 4 in Example II are more attractive than theothers. Due to the restriction of prediction model with respect to the job dependency,the less number of mesh points, the closer the speedups showed in tables to those in thereal execution.7.3 The Comparison of Numerical ResultsThree programs have been implemented on the INMOS Transputers. In general, theparallel SQR algorithm shows the best results among all, especially in Example II. How-ever, the absolute timing for executing the program is still slightly less with parallel SLUChapter 7. Numerical Results and Performance Analysis^ 81Table 7.21: The Parallel Structured QR/LU Algorithm (Phase I)Sl ob MeshPointsNo.ofJobsSQR SLUSpeedup Nodes Speedup NodesEx.I Ex.II Ex.I Ex.II Ex.I Ex.II Ex.I Ex.II2 2048 2047 4.3 15.2 12 30 4.3 10.5 12 213 2187 1093 7.0 20.1 18 44 6.5 14.4 17 284 1024 341 7.6 13.6 22 53 7.1 11.4 20 335 3125 781 10.8 22.3 26 59 9.9 17.6 23 376 1296 259 8.6 11.6 28 64 8.1 10.9 26 647 2401 400 11.1 16.1 31 64 10.3 14.4 28 418 4096 585 13.1 20.8 33 64 12.4 17.5 30 439 6561 820 15.4 25.4 35 64 14.3 21.0 31 4510 1000 111 5.7 5.8 36 64 5.6 5.7 38 45Table 7.22: The Parallel Structured QR/LU Algorithm (Phase II)S obJ MeshPointsNo.ofJobsSQR/SLUSpeedup NodesEx.I Ex.II Ex.I Ex.II2 2048 2046 2.0 4.4 8 123 2187 1092 3.2 7.0 10 164 1024 340 4.0 7.3 12 185 3125 780 5.3 10.0 14 206 1296 258 5.2 8.2 15 227 2401 399 6.2 10.1 16 238 4096 584 7.3 11.6 17 249 6561 819 8.1 13.2 18 2410 1000 110 4.7 5.4 19 25-13- SQR(Ph.I)-*- SLU(Ph.I)-in- SQR/SLU(Ph.II)1008060402082 4 6Sjob1210Chapter 7. Numerical Results and Performance Analysis^ 82Figure 7.12: The timing of each job in SQR and SLU algorithms (Ex.I)Figure 7.13: The timing of each job in SQR and SLU algorithms (Ex.II)Chapter 7. Numerical Results and Performance Analysis^ 83Chapter 7. Numerical Results and Performance Analysis^ 84algorithm than with SQR algorithm.To test the performance of the parallel LS algorithm, several sets of parameters are cho-sen as the testing cases. The speedups for different set of parameters are basically thesame, which are between 2 and 3 in Example I, and between 3 and 4 in Example IIwith 1023 mesh points. Table 7.23 shows the results for two examples. The columns"1-Node" show the execution time with one transputer, while columns "n-Node" showthe best execution time we have obtained with any number of transputers. There are nomore speedups achieved even if more mesh points are used in the execution.One of the reasons for the poor speedup of this parallel algorithm is that each job is toosmall compared with the input and output data. As the computation time increases, thecommunication time increases even more, thus constraining the size of each job. Anotherreason is that there are three steps in the algorithm and thus it takes longer to finish dueto the dependencies among phases.Table 7.24 and 7.25 show the speedups of the parallel SQR and SLU algorithms withtwo examples. With the same explanation of the columns "1-Node" and "n-Node" asin Table 7.23, we see that the speedups of the SQR and SLU algorithms are relativelyclose with Example I. However, there is a big difference in speedups between these twoalgorithms with Example II. The parallel SQR algorithm shows its great potential in theparallelism. The best parallel execution time is very close between the two algorithms,though the SQR algorithm needs much more time with one transputer than the SLUalgorithm.It is important and interesting as well to compare the speedups and execution timesChapter 7. Numerical Results and Performance Analysis^ 85shown in [38] and [40] with what we have obtained here. The implementation of SQRand SLU in [38] and [40] are on the Alliant FX/8 — a shared-memory machine with eightprocessors. Normally, one would expect better results there because there is no com-munication cost, while the communication cost within transputers is very considerable.To compare both results reasonably, we obtain the speedups of both algorithms for ourimplementation with 8 transputers, shown in Table 7.26.After comparing the results shown in Table 9 in [38] and Table 11 in [40] with Table 7.26here, we can clearly see that the speedups of our implementation with 1000 mesh pointsare around 4.7 for both algorithms, much better than 3.2 for SQR and 4.2 for SLU in [38]and [40], respectively. If 1296 mesh points are used, then the best speedup of exampleII for both algorithms is above 5, almost one and half times faster than that of SQR in [38].It is also interesting that the parallel SLU algorithm achieved better speedups than thoseof the parallel SQR algorithm according to [38] and [40]. This is opposite to the resultswe have obtained here. The reason is probably due to the different roles of data commu-nication in different categories of parallel machines.Chapter 7. Numerical Results and Performance Analysis^ 86Table 7.23: The Speedups of PLS Algorithm with 1023 Mesh PointsParameters(Sjl, Sj2, Sj3)Ex.I Ex.IITimeSpeedupTimeSpeedup1-Node n-Node 1-Node n-Node(8,5,9) 3108 1080 2.88 11246 3074 3.66(3,3,3) 3750 1764 2.13 12064 3443 3.50(3,5,9) 3378 1404 2.41 11651 3316 3.51(8,5,3) 3294 1192 2.76 11440 3129 3.66(9,9,9) 2974 1048 2.84 11164 3163 3.53(6,4,6) 3249 1149 2.83 11394 3107 3.67Table 7.24: The Parallel Structured QR/LU Algorithm (Ex.I)Slob MeshSQR SLUTiming TimingPoints 1-Node n-Node Speedup 1-Node n-Node Speedup3 2187 11610 2770 4.19 9178 3003 3.064 1024 3516 787 4.47 3452 770 4.486 1296 4206 767 5.48 4083 739 5.537 2401 7809 1242 6.29 7579 1216 6.2310 1000 3065 447 6.86 2936 427 6.88Table 7.25: The Parallel Structured QR/LU Algorithm (Ex.II)Slob MeshSQR SLUTiming TimingPoints 1-Node n-Node Speedup 1-Node n-Node Speedup3 2187 27054 4275 6.33 19361 4187 4.624 1024 12370 1685 7.34 8873 1633 5.436 1296 15208 1825 8.33 10672 1738 6.147 2401 28240 3098 9.12 19514 3019 6.4610 1000 11302 1345 8.40 7719 1134 6.81Chapter 7. Numerical Results and Performance Analysis^ 87Table 7.26: The Speedups of SQR/SLU with 8 transputersSlobSQR SLUEx.I Ex.II Ex.I Ex.II3 3.18 4.75 3.62 4.214 3.92 4.83 3.57 4.656 4.35 5.02 4.88 4.947 4.42 5.41 4.15 5.2510 4.69 4.69 4.72 4.58Part IVSummary and Future Work88Summary and Future Work^ 89• SummaryIn this thesis, we have concentrated on the implementation of sequential and paral-lel algorithms for solving block bidiagonal matrices arising from the discretizationof a set of linear ordinary differential equations. These sequential algorithms arethe Alternate Row and Column Pivoting Algorithm, the Structured QR Algorithm,the Structured LU Algorithm and the Least Square Algorithm. Except for the Al-ternate Row and Column Pivoting Algorithm, the other three algorithms have beengiven parallel versions and have been implemented on the INMOS Transputers.The comparison of numerical results of these sequential and parallel algorithms arein Part II and Part III, respectively. We can conclude now after surveying andexperimenting on these algorithms that the Structured QR Algorithm is the moststable algorithm and have achieved best speedups in parallel, while the AlternateRow and Column Pivoting Algorithm is the most efficient and flexible sequentialalgorithm.The numerical results have reconfirmed the importance of parallelism. It saves us alot of time in the computation even when there is some cost on the communicationintroduced. As we discussed in the last section, the results we obtained are shownbetter than those in [38] and [40], and the absolute run times of the algorithms onthe transputers and Alliant FX/8 are very close. Furthermore, we can have muchbetter speedups when we deal with the whole problem of solving the ODEs becausethe portion of solving the linear system is only a very small part compared withthe others and it is also the hardest part in the parallelization.Summary and Future Work^ 90Since we implemented the algorithms on the INMOS Transputers, the communica-tion plays a very important role in the execution. The speedups may be expectedto be better if the program runs on a parallel machine with faster data transfer.The unsatisfactory speedups can be seen from the distribution of jobs if there islarge number of transputers involved in the execution. This limits the use of thenumber of transputers, and more transputers can not increase the speedups, thuscan not be efficiently used.• Future WorkThis thesis work relates two areas: One is the linear algebra issue which is to findor determine the most stable and efficient parallel algorithm; The other is a parallelimplementation issue which relates to the architecture of the transputer and thefeatures of the underlying software system, and details of implementation as well.We here pointed out the future work from these two aspects.1. Is the ARCP algorithm parallelizable?In fact, throughout this thesis work, we have been trying to find an efficient andstable parallel version of the Alternate Row and Column Pivoting Algorithm.This, however, has not been successfully done and it is left to future research.It is still an open problem whether it is worthwhile to parallelize this efficientsequential algorithm and how to exploit more the parallelism of this algorithm.Since the basic idea of the ARCP algorithm works in a sequential way, weSummary and Future Work^ 91may believe that the parallel ARCP algorithm would not be as attractive asits sequential version. The efficiency of the sequential ARCP algorithm will bediminished by the decrease of execution time in parallel, i.e., the speedup canmake up for the slow execution in the sequential case. This can been clearlyseen from the comparison of parallel results of the SQR and SLU algorithmsthough the SLU algorithm is much more efficient than SQR in sequential form.2. Machine architecture and parallel implementationSince the transputer configuration, which determines the transputer intercon-nections, can be changed through software commands by user, there are severalconfigurations available such as linear chain, binary tree, etc.. Here, for thesimplicity of explanation, we only collected our results in the case of a linearchain. Meanwhile, we also have to admit that the software model ProcessorFarm which the implementation was based on, might not be the best choicefor our problem. These two facts may affect the speedup of each algorithm.A further investigation is encouraged on other transputer configurations andsoftware models.The implementation we have done is only an experiment about the applica-tion of scientific parallel computation on the transputers. During the periodof our experiments, we have known some of the pros and cons about the trans-puters. The numerical results do confirm some predictions and exploit moreuseful information. This is very encouraging for more research in the parallelcomputing on transputers.Bibliography[1] W.C.ATHAS AND C.L.SEITz, Multicomputers: Message-passing concurrent com-puters, IEEE computer, 1988.[2] U. ASCHER, J. CHRISTIANSEN, AND R.D.RUSSELL, A collocation code forboundary-value problems, B. Childs et al., eds., Lecture Notes in Computer Sci-ence 76, Springer-Verlag, New York, pp. 164-185(1979).[3] U. ASCHER, R.M.M.MATTHEIJ AND R.D.RUSSELL, Numerical solution of bound-ary value problems for ordinary differential equations, Prentice-Hall(1988).[4] U. ASCHER AND S.Y.P.CHAN, On parallel methods for boundary value ODEs,Computing 46, pp.1-17(1991).[5] G. BADER AND U. ASCHER, A new basis implementation for a mixed order bound-ary value ODE solver, SIAM J. Sci. Stat. Comput. 8, pp. 483-500(1987).[6] C.DE BOOR AND R.WEISS, SOLVEBLOK: A package for solving almost blockdiagonal linear systems, ACM Trans. Math. Softw., 6, pp.80-87(1980).[7] D.L.BROWN AND J.LORENZ, A high-order method for stiff boundary value prob-lems with turning points, SIAM J. Sci. Stat. Comput. 8, pp. 790-805(1987).[8] R.L.BURDEN AND J.D.FAIRES, Numerical Analysis (Fourth Edition), PWS-KENT(1989).[9] B.BUZBEE, Remarks for the IFIP congress '83 panel on how to obtain high perfor-mance for high-speed processors, Los Alamos National Laboratory Report LA- UR-83-1392, Los Alamos, NM.[10] M.DAVIS AND G.FAIRWEATHER, On the use of spline collocation methods forboundary problems arising in chemical engineering, Comput. Methods Appl. Mech.Engrg., 28, pp. 179-189(1981).[11] B.L.BuzBEE, G.H.GOLUB AND C.W.NIELSON, On direct methods for solvingPoisson's equations, SIAM J. Numer. Anal.8, pp. 627-656(1970).[12] G.H.GOLUB AND C.F.VAN LOAN, Matrix Computations, 2nd ed. Baltimore: JohnsHopkins Press,1989.92Bibliography^ 93[13] D.HELLER, Some aspects of the cyclic reduction algorithm for block tridiagonallinear systems, SIAM J. Numer. Anal.13, pp. 484-496(1976).[14] K.R.JACKSON AND R.N.PANCER, The parallel solution of ABD systems arising innumerical methods for BVPs for ODEs, Tech. Report No. 255/91, Computer ScienceDepartment, University of Toronto, 1992.[15] LENNART S. JOHNSSON, Solving tridiagonal systems on ensemble architectures.SIAM J. SCI. STAT. COMPUT 8, pp. 354-392(1987).[16] R.N.KAPUR, J.C.BROWNE, Techniques for solving block tridiagonal systems on re-configurable array computers. SIAM J. SCI. STAT. COMPUT 8, pp. 701-719(1984).[17] KARIN R. BENNETT, Parallel collocation methods for boundary value problems.Ph.D thesis, the University of Kentucky, Lexington, Kentucky, 1991.[18] R.M.KARP AND V.RAMACHANDRAN, A surrey of parallel algorithms for shared-memory machines, Tech. Rep. UCB/CSD8B/408,UC Berkeley(1988).[19] H.T.KUNG, Synchronized and asynchronous parallel algorithms for multiprocessors.Algorithms and Complexity, pp. 153-200(1976).[20] D.C.L.LAM, Implementation of the box scheme and model analysis of diffusion-correction equations, Ph.D thesis, University of Waterloo, Waterloo, Ontario, 1974.[21] M. LENTINI„ Parallel solution of special large block tridiagonal systems: TPBVP,manuscript (1989).[22] R.M.MATTHEJJ, Stability of block LU-Decompositions of matrices arising fromBVP, SIAM J. Alg. Disc. Math. 5, pp. 314-331(1984).[23] R.M.MATTHEJJ, Decoupling and stability of algorithms for boundary value prob-lems, SIAM Review 27, pp. 1-44(1985).[24] M.MINSKY, Form and content in computer science, J.ACM,17, pp. 197-215(1985).[25] F.MAJAESS,ETC, FORTRAN packages for the solution of almost block diagonallinear systems arising in spline collocation at gaussian points with monomial basisfunctions, Tech. Report CCS-90-3, University of Kentucky, 1990.[26] J.ORTEGA AND R. VOIGT, Solution of partial differential equations on vector andparallel computers, SIAM Review 27(2), pp. 149-240(1985).[27] M.PAPRZYCK AND I.GLADWELL, Solving almost block diagonal systems on parallelcomputers, Parallel Computing 17, pp. 133-153(1991).Bibliography^ 94[28] A.SAMEH, Numerical parallel algorithms — a survey, High Speed Computer and Al-gorithm Organization, pp. 207-228(1977).[29] S. AKI, The design and analysis of parallel algorithms, Prentice Hall, EnglewoodCliffs(1989).[30] H.V.SREEKANTASWAMY,ETC, Performance Prediction Modelling of Multicomput-ers, Tech. Report 91-27, University of British Columbia, 1991.[31] H.V.SREEKANTASWAMY, Ph.D thesis, University of British Columbia, 1991.[32] H.S.SToNE Parallel tridiagonal equation solvers, Jbid. 1, pp. 289-307(1975).[33] Transputer technical notes. INMOS Limited. Prentice Hall 1989.[34] Transputer reference manual. Prentice Hall 1988.[35] A.TREw AND G.WiLsoN(Eds.), Past, Present, Parallel — A survey of availableparallel computing systems, Springer-Verlag.[36] J.M. VARAH, Alternate row and column elimination for solving certain linear sys-tems, SIAM J. Numer. Anal. 13, pp.71-75(1976).[37] W.WARE, The ultimate computer, IEEE Spect.,10,3, pp.89-91(1973).[38] S.J.WRIGHT, Stable parallel algorithms for two-point boundary value problems,Tech. Report MCS-P178-0990, Argonne National laboratory, 1990.[39] S.J.WRIGHT AND V. PEREYRA, Adaptation of a two-point boundary value problemsolver to a vector-multiprocessor enviroonment, SIAM J. Sci. Stat. Comput. 13,pp.425-449(1990).[40] S.J.WRIGHT, Stable parallel elimination for boundary value ODEs, Tech. ReportMCS-P229-0491, Argonne National laboratory, 1991.[41] S.J.WRIGHT, A collection of problems for which Gaussian elimination with par-tial pivoting is unstable, Manuscript, Mathematics and Computer Science Division,Argonne National laboratory(1991).
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- The implementation of sequential and parallel algorithms...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
The implementation of sequential and parallel algorithms for solving almost block bidiagonal systems Song, Yan 1993
pdf
Page Metadata
Item Metadata
Title | The implementation of sequential and parallel algorithms for solving almost block bidiagonal systems |
Creator |
Song, Yan |
Date Issued | 1993 |
Description | In this thesis, we mainly concentrate on the implementation of sequential and parallel algorithms for solving block bidiagonal linear system arising from the discretization of boundary value problems for linear ordinary differential equations. It consists of two parts. In Part I, several sequential algorithms, the Alternate Row and Column Pivoting Algorithm, the Structured QR Algorithm, the Structured LU Algorithm and the Least Square Algorithm, are implemented and some stability issues of these algorithms are considered. Numerical results are presented. In Part II, the corresponding parallel algorithms, except the Parallel Alternate Row and Column Pivoting Algorithm, are implemented on INMOS T800 Transputers, a microcomputer with multi processors.To abstract the application from the system function calls, we use a template based on a simple software system with master-slave technique as the interface with the application. The description of those parallel algorithms and numerical results are presented. Only the case with separate boundary conditions is considered here. Among all the sequential algorithms, we can say that the Alternate Row and Column Pivoting Algorithm is the most efficient algorithm with no fill-ins, while the Structured QR Algorithm is the most stable algorithm. Though the Structured LU Algorithm is much faster than Structured QR Algorithm in the sequential case if the block size is greater than 3, the absolute run times of both parallel algorithms are almost the same because the Parallel Structured QR Algorithm obtains better speedup. |
Extent | 3754887 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2008-10-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
IsShownAt | 10.14288/1.0051422 |
URI | http://hdl.handle.net/2429/2584 |
Degree |
Master of Science - MSc |
Program |
Computer Science |
Affiliation |
Science, Faculty of Computer Science, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1993-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1993_spring_song_yan.pdf [ 3.58MB ]
- Metadata
- JSON: 831-1.0051422.json
- JSON-LD: 831-1.0051422-ld.json
- RDF/XML (Pretty): 831-1.0051422-rdf.xml
- RDF/JSON: 831-1.0051422-rdf.json
- Turtle: 831-1.0051422-turtle.txt
- N-Triples: 831-1.0051422-rdf-ntriples.txt
- Original Record: 831-1.0051422-source.json
- Full Text
- 831-1.0051422-fulltext.txt
- Citation
- 831-1.0051422.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0051422/manifest