TRANSFORMATIONS AND EFFICIENT PARALLELEXECUTION OF LOOPS WITH DEPENDENCIESbyAbderrazek ZaafraniM. A. Sc., Purdue University, West Lafayette, iN, U.S.A, 1990B. A. Sc., Pennsylvania State University, State College, PA, U.S.A, 1988A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFDOCTOR OF PHILOSOPHYinTHE FACULTY OF GRADUATE STUDIESDEPARTMENT OF ELECTRICAL ENGINEERINGWe accept this thesis as conformingto the sandardThe UNIVERSITY OF BRITISH COLUMBIASeptember 1994© Abderrazek Zaafrani, 1994 ‘In presenting this thesis in partial fulfilment of the requirements for an advanceddegree at the University of British Columbia, I agree that the Library shall make itfreely available for reference and study. I further agree that permission for extensivecopying of this thesis for scholarly purposes may be granted by the head of mydepartment or by his or her representatives. It is understood that copying orpublication of this thesis for financial gain shall not be allowed without my writtenpermission.(Signature)Department of Ein. OQAJ’/The University of British ColumbiaVancouver, CanadaDate________DE-6 (2188)AbstractLoops are the main source of parallelism in scientific programs. Hence, severaltechniques were developed to detect parallelism in these loops and transform them intoparallel forms. In this dissertation, compile time transformations and efficient parallelexecution of loops with various type of dependencies are investigated. First, Doacrossloops with uniform dependencies are considered for execution on distributed memoryparallel machines (multicomputers). Most known Doacross loop execution techniquescan be applied efficiently only to shared memory parallel machines. In this thesis,code reordering technique, improvements to partitioning strategies, and finding a balancebetween communication and parallelism are presented to reduce the execution time ofDoacross loops on multicomputers. As with most paralleizing transformation techniques,only single loopnests are considered in the first part of this dissertation. However,parallelizing each loopnest in a program separately, even if an optimal execution canbe obtained for each loopnest, may not result in an efficient execution of all the code inthe program because of communication overhead across the loops in a multicomputerenvironment. Hence, across loop data dependence representation and analysis areintroduced in this work to improve the parallel execution of the whole code. Ourcontribution consists of finding and representing data dependencies whose sources anddestinations are subspaces of the iteration space mainly common across the loops. Thistype of dependence information is used in this thesis to improve global iteration spacepartitioning, automatic generation of communication statements across ioops, and indexalignment. The final part of this dissertation presents new parallelizing techniques11for loops with irregular and complex dependencies. Various data dependence analysisalgorithms can be found in the literature even for loops with complex array indexing.However, the improvement in data dependence testing has not been followed by similaramelioration in restructuring transformations for loops with complex dependencies. Suchloops are mostly executed in serial mode. Our parallelizing techniques for these loopsconsists of identifying regions of the iteration space where all iterations can be executedin parallel. The advantages of all the transformations presented in this dissertation are: 1)they significantly reduce the execution time of ioops with various types of dependenciesas shown in this work using the MasPar machine 2) they can be implemented at compiletime which makes the task of parallel programming easier.111AbstractList of FiguresAcknowledgmentChapter 11.11.21.3Chapter 22.12.22.32.42.4.12.4.22.4.3XIII1267ContentsList of TablesIIixIntroductionCompiling for Parallel ArchitectureMotivation and Scope of this WorkOrganization of Thesis and ContributionEfficient Execution of Doacross Loops with SingleSynchronizationsIntroductionLoop ModelChain-Based ExecutionSubchain Execution of Loops with DelayedSynchronizationsCode ReorderingCompiler Estimation of Execution TimesSubchain Size Computation11.1112.131616.18.19iv2.5 Chain Combining Execution of the Loop 262.6 Increasing Processor Utilization 312.7 Loops with Non-delayed Synchronizations 332.8 Conclusion 33Chapter 3 Partitioning and Efficient Execution of Loops with UniformDependencies 353.1 Introduction 353.2 Loop Model 353.3 Partitioning for Loops with Multiple Synchronizations. . . 373.3.1 Independent Set Partitioning 373.3.2 Supernode Partitioning 393.3.3 Independent Tile Partitioning 393.3.4 Pseudo-node Computation 433.4 Communication Patterns 453.5 Code Reordering 483.6 Computing the Partition Size 533.7 Static Transformation of Loops 593.8 Conclusion 61VChapter 4 Partitioning the Global Iteration Space 644.1 Introduction 644.2 Dependence Types in Partitioning 654.2.1 Flow Dependence 654.2.2 Anti-Dependence 654.2.3 Input Dependence 694.2.4 Output Dependence 724.2.5 Control Dependence 734.3 Forming the Global Iteration Space 734.3.1 Finding the Domain 744.3.2 Finding Across Loop Data Dependencies 754.3.3 Mapping Loop Indexes 764.4 Global Iteration Space Partitioning 764.5 Conclusion 81Chapter 5 Across Loop Data Dependence Analysis 845.1 Introduction 845.2 Global Iteration Space 855.3 Hyperplane Data Dependence Analysis 89vi5.45.4.16.36.46.4.16.4.26.56.66.76.8• 103108109114115.115.117• 125• 130130133• 13814114414699995.4.1.15.4.1.25.4.25.5Chapter 66.16.2Benefits of Hyperplane Dependence AnalysisAutomatic Generation of Communication Statementsacross LoopsExperimental ResultsRelated WorkIndex AlignmentConclusionParallelizing Loops with Irregular Dependencies. .IntroductionThree Region Execution of Loops with SingleDependenciesStatic Transformation of the LoopGeneralizationParallel Region Execution of Loops with MultipleDependenciesParallel Region Execution in n-Dimensional SpaceFour Region Execution of LoopsRelated and Future WorkExperimental ResultsConclusionviiChapter 7 Conclusion and Future Directions 1507.1 Primary Research Contributions 1507.2 Future Research Directions 152References 154yinList of TablesTable 3.1 Communication Patterns for Due to 47/1 —3\Table 3.2 Data Communication for D’= ( 2 1 ) Due to 48\2 —3)Table 5.1 Resulting Hyperplane Dependence for Every PossibleMapping 111Table 6.1 Conditions on the Non-Invalidating Uniform DependenceVectors for our Three Region Execution 134ixList of FiguresThesis Contribution to Various Tasks of Paralleizing compilers 9Loop Model (Loopi) 14Example of a Doacross Loop with a Single Synchronization. . 15Iteration Space with D=( 1,1) 16Benefits of Code Reordering on Subchain Execution 17Scatter Plot of Iterations Execution Times of the Loop inFigure 2.2 20Compile Time Transformation of Loop1 23Execution Time vs Subchain Size for the Example in Fig 2.b(L=50) 25Chain Combining of the Iteration Space of Figure 2.3 28Index Computation for Obtaining the Chain CombiningExecution 29Comparison between Loop2 and Loop3 Execution Time 30Increasing Processor Utilization by Subchain Grouping 31Code for Increasing Processors Utilization (Loop4) 32An Example of the Loop Model 36Loop Partitioning for D’= ( ) 38Independent Tile Partitioning for D’= ( ) 41Algorithm for Computing the Pseudo-nodes 44FigureFigureFigureFigureFigureFigure1.12.12.22.32.42.5Figure 2.6Figure 2.7Figure 2.8Figure 2.9Figure 2.10Figure 2.11Figure 2.12Figure 3.1Figure 3.2Figure 3.3Figure 3.4xFigure 3.5 Algorithm for Finding the Independent Groups ofSynchronizations 51Figure 3.6 Reduced Iteration Space of Figure 3.3 56Figure 3.7 Execution time vs Partition size on the MasPar 60Figure 3.8 Compile Time Transformation of Doacross Loops 62Figure 4.1 Importance of Flow and Anti-Dependence in Partitioning. . . . 67Figure 4.2 Independent Tiles Partitioning for Input Dependencies 70Figure 4.3 Execution Time Improvement of Zigzag Partitioning overColumn and Block Partitioning on the MasPar Machine(N=48) 71Figure 4.4 Benefits of Using Hyperplane Dependence Information onPartitioning 79Figure 4.5 Independent Tile Partitioning Using Hyperplane DependenceInformation 83Figure 5.1 Loops with 2D Local Iteration Spaces 88Figure 5.2 Code with Hyperplane Data Dependence 92Figure Communication Cost on the MasPar for DifferentCommunication Mechanisms 106Figure 5.4 Algorithm for Automatic Generation of CommunicationStatements 108Figure 5.5 Example illustrating the Importance of Index Alignment onHyperplane Dependence 111xiFigure 6.1 Loop Model 118Figure 6.2 Examples of a Compile Time Restructuring of Loops withfrregular Dependencies 128Figure 6.3 Graphical Representation of Non-invalidating UniformDependence Vectors for the Execution Ordering of area1 andarea2 when i3i > 0 134Figure 6.4 Inequalities Characterizing the Different Regions of ann-Dimensional Iteration Space 136Figure 6.5 Decreasing the Number of Loopnests in the Three RegionExecution Method 137Figure 6.6 Inclusion of area Iterations into a Single Loopnest 139Figure 6.7 Examples from Related Papers 142Figure 6.8 Three Region Execution Performance for the Code in Figure6.2.c 145XIIAcknowledgmentI would like to thank my supervisor, Professor M.R. Ito, for his continual encouragement and highly constructive comments. I also wish to express my sincere thanksfor Professor A. Wagner for his guidance throughout this research. I would like alsoto express my gratitude to Professors G. Neufeld, H. Elnuweiri, and G. Schrack formany inspiring discussions and the anonymous reviewers for their valuable commentsand suggestions.x1uChapter 1IntroductionParallel computers have started to gain acceptance as a cost-effective alternative tosequential computers when solving large compute-intensive problems. This emergenceof parallel machines is due to two major driving forces. First, the processing speedsof sequential computers is bounded by the fundamental physical limitations of thetechnology. Second, current science and engineering application are demanding moreand more computational power and memory space. As a result, the current and futuregeneration of high performance computing environment will eventually rely largely onexploiting the inherent parallelism in problems.Early work in parallel processing concentrated on the hardware side of the machines[1]. This has resulted in the design and manufacturing of various types of parallelcomputers [2]. However, the hardware advances were not matched by similar progressesin software design and development. This has made parallel machines complex to useas only highly trained programmers were able to write code for them. Recent work inthe parallel processing field is addressing the problem of the unavailability of efficientand simple software tools for parallel environments. In fact, efficient and machine-independent parallel languages are being introduced [3, 4]. Powerful parallelizing andoptimizing compilers are being built for different types of languages and machines [5—8].User-friendly debugging tools are becoming available. However, given that efficientsoftware development for parallel machines is still in its early stages, a lot of effort is1Chapter 1still needed in order to make parallel computers more widely used among scientists andengineers.In this thesis, new compile time parallelization techniques of loops with various typeof dependencies are presented. The first and major part of this thesis concentrates onrestructuring code for parallel distributed memory machines (multicomputers) while thesecond part investigates parallelizing techniques for shared memory parallel machines(or multicomputers with low communication overhead). Even though multicomputersare becoming more popular because of their flexibility (can be built from off-the-shelfcomponents) and scalability, they are complex to program. In fact, the programmeris responsible for distributing the data and the tasks among the processors, and addingcommunication statements. In addition, the user has to establish in his/her code a balancebetween parallelism and communication. Hence, providing compile time support for theprogrammer is crucial in order to increase the popularity of multicomputers. This thesisfirst presents compile time transformations of Doacross loops with uniform dependenciesfor efficient execution on multicomputers. It then investigates and improves iterationspace partitioning for single loopnest and for multiple loopnests. In addition, a newform of expressing and finding dependencies across loops is introduced. These acrossloop dependencies are used in this dissertation to ameliorate the automatic generationof communication statements and index alignments. Finally, parallelization methodsof loops with irregular dependencies (traditionally executed in serial mode) are alsopresented.1.1 Compiling for Parallel Architecture In the mid-eighties, two opposing views2chapter 1about compiling for parallel machines emerged. The first opinion found that powerfulparallelizing compilers should be built and used for parallel coding. Not only oldprograms could be automatically transformed into parallel code, but also users couldkeep programming in the serial and simple mode [9, 10]. Supporters of the second viewargued that optimizing and parallelizing compilers can never be efficient enough to berelied upon. As a result, users should start to think parallel in their early experiencewith programming. After gaining more experience with parallel computing, it becameclear that combining these two views is the best approach for the advances of parallelcomputing. In fact, the programmer and the compiler should cooperate to write efficientparallel code. High Performance Fortran (HPF), which is designed by a consortium ofvendors, users, research laboratories, and universities to standardize the next version ofFortran, constitutes one step toward this direction [3]. For example, users of HPF areresponsible for distributing the data onto the processors while the compiler automaticallygenerates communication statements for the code.Compile time support for improving the execution of the code is not specific toparallel machines. In fact, compiler optimization techniques for uniprocessor machineshave been in use for a long time [11, 12]. In order for a compiler to understandand transform a program, data flow analysis techniques were introduced. These dataflow analysis techniques augmented with control flow analysis attempt to provide thecompiler with an understanding of the algorithm underlying a program and hence permitthe compiler to re-code the algorithm in a more efficient form. Even though data flowanalysis can be used to parallelize serial programs, they are not adequate enough for this3Chapter 1task [9, 131. In fact, most of the inherit parallelism in scientific code can be extracted fromloops where most of the execution time is spent. Therefore, techniques that concentrate onthe understanding of loop execution were introduced and called data dependence analysis[9, 10, 14]. The main function of data dependence analysis is to assert the existenceor non-existence of data dependencies between the iterations of a loopnest. Numerousdata dependence analysis algorithms varying in the degree of complexity and precisenesswere developed [9, 15, 14, 8, 16—19, 6, 20, 211 and are still being improved [22, 23].Once data dependence analysis is performed and if no dependence is found betweenthe iterations of a loop, then the loop can be fully executed in parallel (i.e. all of itsiterations can be executed simultaneously without the need of synchronization statements).On the other hand, when a data dependence exists in a loopnest, then various looptransformations should be investigated to extract some parallelism form the loop. Theserestructuring techniques can be carried out automatically by the compiler or done by theuser manually [24, 141. For example, loop interchange consists of switching inner andouter ioops for various reasons (such as to find vector operations for vector machines,or to bring the parallel loops to the outermost positions for multiprocessor machines)[14, 25]. Most of the widely known transformation techniques are efficient only forcode to be executed on shared memory parallel machines or multicomputers with lowcommunication overhead such as SJMD (Single Instruction Multiple Data) computers.This is due to the fact that the main concern of these transformations is to expose asmuch parallelism as possible. In addition to these parallelizing transformations, othercompile time support for parallel computing includes the scheduling of the iterations4chapter 1on the processors [26]. This static scheduling is important given that the number ofprocessors available is usually smaller than the number of iterations in a loop.Parallelizing compilers were initially provided only for shared memory parallelmachines. However, with the increasing demand for computing power and data space, andthe difficulty in scaling shared memory machines, multicomputers are gaining popularityin the high performance computing community. As a result, compile time supportfor multicomputers is becoming available. However, providing parallelizing compilersfor these machines is challenging given that compilers may worry not only aboutfinding parallelism in the code but also finding a balance between this parallelism andcommunication [27—34]. In addition, among other tasks that can be done by the compiler(and are still being investigated) include generation and optimization of communicationstatements, and data distribution [35—37, 29].In the previous paragraphs, an introduction to parallelizing techniques is presentedbut only at the iteration level. Finding a more fine-grain type of parallelism has also beeninvestigated and is called Instruction Level Parallelism (IPL) [38—40]. The advantage ofIPL is that it can be applied to a more general class of problems (as opposed to the iterationlevel parallelism which is efficient only for scientific code). However, IPL techniquescan not be used with all type of parallel machines. As a result, IPL work generated newtypes of parallel architectures such as datafiow machines, superscalar architecture, andVery Long Instruction Word (VL1W) computers [38]. For example, an instruction in VLIWmachines consists of several operations to be executed simultaneously. It is the compilerwhich is responsible of packing several operations into a single word (instruction).5Chapter 11.2 Motivation and Scope of this Work Loops’ are the main source of parallelismin scientific programs. Hence, several techniques were developed to detect parallelismin these loops and transform them into parallel forms [14, 221. The transformationsthat can be performed at compile time are automated and collected to form parallelizingcompilers. The purpose of most of these restructuring transformations is to extract asmuch parallelism from the loop as possible [10, 14]. Therefore, they are more suitable forshared memory parallel machines as the trade-off between parallelism and communicationis not usually considered. For a multicomputer environment, compile time restructuring ofloopnests is usually limited to scheduling loops with no dependencies (i.e Doall loops) orpartitioning the iteration spaces of loops with dependencies (Doacross loops). In the nexttwo chapters, compile time transformations of Doacross loopnests for efficient executionon multicomputers is presented in. These transformations are based on code reorderingtechniques and improvement to partitioning strategies.In order for parallelizing compilers to restructure ioops, the compilers perform datadependence analysis by inspecting the pattern of data usage in programs (especially arrayusage in loops). Various data dependence analysis algorithms can be used to find dependence or independence between iterations of a loopnest but none addresses the issue ofdependencies across loops. Such information may not be useful to parallelizing loops onshared memory parallel machines given that these loops can be efficiently parallelized independently of each others. However, collecting across loop data dependence informationcan be helpful for multicomputer parallelizing compilers because of the communicationUnless Otherwise stated, loop and loopnest are used interchangeably in this dissertation to mean a nested loop.6chapter 1overhead across loops. In this thesis, a new form of expressing data dependence (calledhyperplane data dependence) that efficiently represents the dependencies across loops ispresented. In addition, this dissertation shows how across loop data dependence information can improve iteration space partitioning, automatic generation of communicationstatements, and index alignment.The importance of parallelizing compilers to parallel machines has led to the development of numerous data dependence tests to detect dependencies in loops. These tests varywidely in terms of preciseness and complexity. The improvement in data dependencetesting has not been followed by a similar amelioration in restructuring transformations.In fact, most of the existing transformations can be applied to loops either without dependencies or with simple dependence patterns (ex. loop interchange is allowed onlyfor loops with certain types of dependence direction vectors while loop skewing can beuseful only for loops with uniform dependence distance vectors). On the other hand,loops with complex dependencies are most of the time classified as serial loops withoutany attempt to parallelize them. In the last part of the thesis, a novel approach for efficiently parallelizing loops with complex dependencies is presented. This parallelizationtechnique can be applied to shared memory parallel machines or multicomputers withlow communication overhead.1.3 Organization of Thesis and Contribution This thesis is composed of seven chapters. The contribution of each chapter to the different functions of parallelizing compilers is shown in Figure 1.1 Chapter 2 discusses efficient transformation and execution ofDoacross loops on multicomputers. Only loops with single and uniform synchronizations7Chapter 1are considered in Chapter 2 so that elaborate restructuring ideas are applied to the loopswhile keeping their compile time transformations simple. Chapter 3 generalizes Chapter2 by considering loops with multiple and uniform synchronizations. The main contribution in both chapters includes a code reordering technique, improvements to partitioningstrategies, and a simplified approach to find suitable partition sizes.As most partitioning and parallelization techniques, only single loopnests are considered in both Chapter 2 and 3. However, partitioning each loopnest in a programseparately (even if an optimal partition can be obtained for each loopnest) may not result in an efficient execution of all the code in the program because of communicationoverhead across the loops. The main contribution of Chapter 4 is the use of across ioopdata dependence information to improve the partitioning of the global iteration space.Chapter 5 improves data dependence testing and representation by presenting hyperplanedata dependence analysis. A hyperplane data dependence is a data dependence whosesource and destination are subspaces of the iteration space. This dependence form canbe useful mainly for expressing dependencies across loops. In addition to iteration spacepartitioning, hyperplane dependence information is used in Chapter 5 to improve theautomatic generation of communication statements across the loops and index alignment.Various data dependence algorithms can be found in the literature even for loopswith complex array indexing. However, the improvement in data dependence testing hasnot been followed by similar amelioration in restructuring transformations for loops withirregular dependencies. Chapter 6 introduces new parallelizing techniques for loops withirregular dependencies. Such loops are most of the time executed in serial mode. Our8rd, CD IParallelizingCompHersInstructionLevelParallelismSharedmemoryZNSchedulingSynchron.ParallelizingtransformationsMulticomputers ANDataCommunicationComp.Decompositiondistribution(chap.5)[3;D (chap. 5)withsimpledep.withirregulardep.(chap.6)ProcesslevelPartitioning(Iterationlevel)(chap.2, 3,4)C)Chapter 1techniques consist of identifying regions of the iteration space where all iterations canbe executed in parallel. All the parallelizing transformations presented in this thesis aretested on the MasPar machine and are shown to significantly reduce the execution timeof the code. In addition, these techniques can either be used manually by the programmeror integrated in a parallelizing compiler which, in this case, simplifies the task of parallelprogramming.10Chapter 2Efficient Execution of Doacross Loopswith Single Synchronizations2.1 Introduction The major source of parallelism in scientific programs are loops.However, the cross-iteration dependences that sometimes exist between different iterations of a loop limit the degree of parallelism. Doacross loops are generally used toexploit any of the parallelism in these loops. On shared memory machines, Doacrossexecution usually improves the execution time over serial code. This is not the case withdistributed memory systems (multicomputers) where communication overhead can outweigh the benefits of parallelism. As a result, transformation of the Doacross loop mightbe needed to achieve any acceptable speedup. These transformations should examine thetrade-off between parallelism and communication and find a balance between them.In this chapter, we present compile time transformations of Doacross loops with singleand uniform synchronizations for efficient execution on multicomputers. These transformations consist of not only grouping the iterations into processors (based on expectedexecution time of iterations [41-43], or data dependence [32, 44]), but also changing theexecution order of the code inside the groups of iterations (code reordering). Only loopswith single synchronization and uniform synchronization distances are considered in thischapter. These loops are already in a parallel form (but using the simple shared memoryparadigm to be described in section 2.2). The compiler transforms these parallel loopsinto distributed memory parallel loops while achieving 1) a lower execution time than11Chapter 2traditional execution, 2) identical processes so that static scheduling can be efficientlyused, and 3) minimum number of processors used (i.e maximal processors utilization).Note that loops with multiple synchronizations are discussed in the next chapter. Thisdissertation presents the transformations of Doacroos loops with single synchronizationsand with multiple synchronizations in two different chapters so that 1) the former loops(which are more frequent) are investigated in more depth while retaining the simplicityof the transformations and 2) to isolate and investigate in details the partitioning processfor the latter ioops.2.2 Loop Model In this chapter, transformations of Doacross loops for efficient execution on multicomputers are presented. Only nested Doacross loops with single synchronizations (a single synchronization can satisfy many data dependencies) and uniformsynchronization distances are considered. The loops to be transformed are already in aparallel form. That can be done by either a parallelizing compiler or by the user himself. In both cases, the simple shared memory model where data is viewed to belongto one global space is used2. Cross-iteration dependences are enforced by the simpleevent synchronization (post/wait). The operation post (‘ 12, ..., I) sets the vector(‘1’2’ ...,i) to True, while the operation wait(J1,J2...,J) waits until the vector(J1 , J2 ..., J) becomes True. A data dependence in a parallel loop is satisfied byhaving a post statement after the dependence source and a wait statement before thedependence sink. It should be noted that one synchronization can satisfy more than one2 We could have taken the traditional serial loop as our loop model. However, we chose the shared memory paradigm parallelloop model for two reasons: 1) the transformation of serial loops to parallel loop satisfying the shared memory paradigm is simpleand well studied 2) our transformations can be better understood and are simpler when applied to parallel loops (we go back to thetraditional serial loop model in chapter 4 and 5).12chapter 2data dependence. For example, two data dependencies can be honored using only onesynchronization if the dependence distance of one is multiple of the other. In [45], theauthors describe techniques that can be used to minimize the number of synchronizationsneeded in a parallel loop.Figure 2.1 shows the ioop model being considered (called Loopi). We define the vectorLB=(l,...,l)astheLowerBoundindexvector,UB=(u..,u)astheUpper Bound index vector, and D = (d1 , d2 ..., d) as the synchronization dependenceDistance vector. In this chapter, only loops with uniform vector are considered (a vectoris called uniform if it does not depend on the iteration indexes). Note that if d > 0, thenu > l. If d = 0, then the Doacross loop in dimension i can be changed to a Doall loop.If d <0, then u2 <li. The previous inequalities are needed in order to avoid deadlockthat may result because of the Doacross loop definition. We call Ti (region 1) the codebetween the Doacrosses and the wait statement, T2 (region 2) the code between the waitand post statement, and T3 (region 3) the rest of the code. Perfectly nested loops are usedin our model to simplify the description of the transformation. Only few adjustments areneeded to accommodate non-perfectly nested ioops. Figure 2.2 shows an example takenfrom the Perfect Club Benchmark [46, 47] and satisfying our loop model.2.3 Chain-Based Execution A simple way to partition the iteration space of the ioopis to have every independent chain (path) execute on one processor [48, 441. A chainis a sequence of data dependent nodes (iterations). Figure 2.3 shows an iteration spacewith D = (1, 1). Every chain is characterized by an early node which is the iteration inthat chain that does not need to satisfy any cross-iteration dependence. The early nodes13Chapter 2Doacross Ii =11, UiDoacross 12=12, u2Doacross In=ln, Unriwait (Il—di, 12—d2, . . ., In—dn)r2post (I1,12,...,In)r3EndoacrossesFigure 2.1. Loop Model (Loop1)nodes in Figure 2.3 are the nodes in the first row, and the first column. Finding all theindependent paths require the computation of the set E of early nodes. E is computedin the following equation:(ii,i2,. .. •,j) /min(1i,ui)+f(di)<ii<max(li,ui)—f—d , andmin(12,u)+f(d<i2 < max(l,u)—f( d,.., andE=U (2.1)i=’ fi(l,d) <i2 <f1(13,d)+ IdI, andmm (lj+1, uj+i) i+i < max(li, andmin(ln,un) <in <max(ln,un)where fi (ii, d)= { i + d + 1 and f2 (d) = { . Note that (2.1)can be considered as an algorithm for finding E which is expressed as a union of non-overlapping sets, can be made simpler if overlapping sets are used inside the union14chapter 2Doacross 10 I1=2,JLDoacross 10 12 = 2,ILQQ = (W(I1,12,2)**2+W(I1,12,3)**2)/W(I1,12);S = min(1.,QQ/GAMMA*P(I1,12); IH = (W(I1,12,4)+P(I1,12H/W(I1,12,1)—HO;H = H**2; I riDT = CFL*((1._VT).*DTL(I1,12)+VT*DTMIN); IRT = DW (I1,12,1)/DT; IRT = RT**2wait (I1—dl,12—d2)NSUP = NSUP + SHRNS =HRMS +H; I r2RTRMS= RTRMS + RT;Post (11,12)EndoacrossesFigure 2.2. Example of a Doacross Loop with a Single Synchronization.operator. The number of independent chains is:n /j—l \ / nlEI=IdiI(jI(luk_1k+1l_dkI),)( llluk-_ik+il) (2.2)j=1 k=1 \k=j+1 /where D is different from the zero vector. In [271, the authors present an algorithm tofind the set of early nodes for the general case of iteration space with linearly independentdependence vectors. In our case, finding the early nodes using equation (2.1) is simplerthan using the algorithm in [27] because only one synchronization exists in the ioop. Inaddition, section 2.5 derives a subset of the early nodes based on the set E defined above.The Doacross loop in Figure 2.1 can be easily modified into Doall loop such thatevery chain is executed on its own processor. This known type of partitioning andscheduling has the advantage of satisfying data flow dependencies without inserting anycommunication statement. However, the execution time of the loop can be improved.15Chapter 2Based on the execution time estimation of ri, T2, and r3 in Loopi, efficient execution ofthe loop is presented in the next sections.\.\.\\\\\\\\.\.\ \\ \ \•\\.\.Figure 2.3. Iteration Space with D=(1,1)2.4 Subchain Execution of Loops with Delayed Synchronizations This section considers Doacross loops with delayed synchronizations. A delayed synchronization occurswhen the wait statement comes lexically before the post statement (because of lexicallybackward dependencies [41]) in the absence of goto statements. Otherwise, it is calleda non-delayed synchronization.2.4.1 Code Reordering Every iteration of Loop1 can be executed on a differentprocessor in which each synchronization is satisfied by a communication statement. Analternative partitioning is the chain-based method described above (section 2.3). Thesetwo methods represent two extremes. A better approach is to divide every chain intosubchains (groups of successive iterations in a chain) and allocate one processor to everysubchain. This can improve the execution time especially if the code is reordered. Thecode is reordered such that, inside every subchain, ri of all the iterations is executed16chapter 2first, followed by r of all the iterations, followed by r of all the iterations. This canimprove the execution time of the loop in two ways: execution delay of r to after thetime of sending all data needed in the next subchain, and execution advancement of allr while waiting for the data.iiIIi1 :1 ji 11] 112 2 1 1.3j 3 111 221 2 23 21 1 32 2 33 3 J 3222J- 2 23f 3 31 32 2 33j 3121 231 3T=18 T=18 T=14(a) Every node (b) Subchain exec. (C) Subch. exec.executes on its without code with codeown processor reordering reorderingFigure 2.4. Benefits of Code Reordering on Subchain Execution.Figure 2.4 illustrates how subchain execution combined with code reordering improves the overall execution of the loop. The execution time of region r1, r2, and r3are taken to be equal while the communication time is twice as costly as the execution17Chapter 2time of a region. The dotted segments represent the communication time while the solidsegments denote the execution time of the regions. The number next to each segmentindicates the region number (i.e 1, 2 or 3). Figure 2.4.a shows the execution of thelongest chain of the iteration space in Figure 2.3 when every iteration executes on itsown processor. In this case, the execution time (I) of the chain is 18 units. The executiontime of the chain is also 18 if the whole chain is executed on 1 processor (this case isnot represented in Figure 2.4). Both approaches happen to have the same execution timein this example; but this is not generally the case. Figure 2.4.b shows the subchain execution without code reordering when the longest chain is divided into two subchains. Noimprovement over Figure 2.4.a is obtained because the decrease in communication timeis cancelled by the lack of parallelism between the two subchains. Figure 2.4.c showsthe subchain execution with code reordering. In this case, there is a 22% improvementover Figure 2.4.a and 2.4.b. This improvement depends on the execution time of r1, r2,r, the communication time, the chain size, and the subchain size.2.4.2 Compiler Estimation of Execution Times The new subchain-based partitioningof Loop1 described above requires the estimation of the execution time of the code insidethe loop in order to determine an optimal chunk size. We call r the percentage of codein r with respect to all the code in the loop (excluding the synchronization statements),r the percentage of code in r2, and r the percentage of code in r3.Various studies concerning the approximation of the execution time of programs areavailable. For example, Sarkar uses in [49] an efficient counter-based execution profileto determine frequencies of ioops, and branches. He then uses the execution time of18chapter 2primitive operations to estimate the execution time of the code and its variance. In ourcase, we need to approximate r, r, and r. If their variance is small, then efficienttransformations of loop 1 can be performed as described in the remainder of this chapter.The ioop in Figure 2.2 has three flow dependencies because of the reduction operations.These dependencies are satisfied by one synchronization statement. Figure 2.5 is scatterplot of 500 points showing the execution time of the code in r1 (x-axis) vs the code inr2 (y-axis) of the ioop in Figure 2.2. Each point represents an iteration executing on itsown processor where it has all of its corresponding data. The variance of the executiontime of the code is small. For example, r usually takes between 96% and 98% of thetotal computation time of the iteration.2.4.3 Subchain Size Computation The subchain execution described above can onlybe efficient if an appropriate subchain size can be found. For this purpose, we define L asthe number of nodes in the longest chain. L can be calculated using the following formula:L = ‘in (FIu- lI/d1) (d 0)j=1If every iteration of the loop executes on a different processor (i.e subchain size is 1),then the total execution time of loop is:T(i)= (2.3)where c represents the communication start-up time, t the transmission time for 1 unit ofdata, and s is the size of data units to be sent. We assume that the start-up and transmissiontime are constant. c, and t are given as a fraction of the normalized code r + r + r19Chapter 2Region 2 Exec. time (%) x 10 -aI I I I I I70.00 ——65.00 --60.00 -—55.00--50.00 --45.00 --40.00 --35.00- \-\30.00 --25.00--20.00 --15.00 --10.00--5.00--0.00--I I I I Region 1 Exec. Time (%)0.90 0.92 0.94 0.96 0.98 1.00Figure 2.5. Scatter Plot of Iterations Execution Times of the Loop in Figure 2.2.inside the ioop. It is reasonable to assume constant communication time if a ring canbe embedded in the topology of the available system3 because every process created(I iteration/process for equation (2.3), or n iterations/process in remaining parts of thissection) communicates with at most two processes: receives data from only one process,and sends data to only another one. Therefore, problems that contribute considerablyto the variance of communication time such as network congestion and routing are notrelevant in our case.Most topologies satisfy this requirement.20chapter 2Grouping consecutive iterations of a chain into a subchain results in the followingtotal execution time of the loop ( L > 1 is assumed. Otherwise, no dependence existsand the loop is fully parallel):T(size) = size. (r + r) + (LL/sizej — 1).(size.r + c + s.t)+(2.4)(L — size. L/sizej ).r + c + s.t + size.rwhere size is the subchain size. if size divides L, then we haveT(size) = size. (r-i- r) -f- ((L/size) — 1).(size.r + c+ .s.t) + size.r (2.5)Note that in Doacross execution, every iteration is composed of a parallel part and aserial part (called delay in [41, 421). The subchain execution inherits this property wheresize.r + c + s.t is the serial part of a subchain. In order to simplify solving for sizethat minimizes the execution time of the ioop, T(size) is approximated to (2.5). Wehavedsize(T(5i)) = size= L+t) (2.6)As a result, the following 4 cases summarize the optimal partition size:• if L = 1, then size = 1.• else, if 4 = 1, then size = L (i.e every chain should be assigned to 1 processor).• else if L.(c + s.t) — 1 + 4 > 0, then the optimal chunk size is mm (Lsizei , L)or mm (Fsizel , L), where size is expressed by equation (2.6). Both values can besubstituted in (2.5) to determine the exact optimal chunk size.• else size = 1 (every iteration executes on 1 processor).21Chapter 2After computing the subchain size, the compiler can automatically transform Loopi(Figure 2.1) to Loop2 in Figure 2.6.a. The send and receive statements have n+2arguments. The first n+1 arguments identify the process on the other side of thecommunication. The last argument identifies the data buffer. The code in r1, r, and ris slightly changed to account for the new reordering of the code. For example, scalarvariables becomes array references with index K in rj and r, and with indexes K or K—iin r2 (ex. the first line of r2 of Figure 2.2 becomes NSUP [K] = NSUF[K — 1] + S[K]in Figure 2.6.a).In addition to its optimal execution time, Loop2 can be scheduled efficiently atcompile time. This can be done by replacing the first two lines of Figure 2.6.a) by thecode in Figure 2.6.b) to obtain an SPMD (single program multiple data) code [36]. MaxPis the maximum number of processors available, and ID is the processor number (rangingfrom 0 to MaxF — 1). Note that our subchain execution of Doacross loops shown inFigure 2.6 can be considered as a simultaneous combination of loop splitting [14][wolfe](the initial loop is divided into three loops corresponding to ri, r, and r3), iteration spacepartitioning, and determination of a near-optimal partition size. It is essential to considerthese tasks simultaneously to avoid losing loop independent dependence information.In fact, if traditional loop splitting techniques are applied first on the code of Figure2.2, then two loops are obtained (r3 is empty). Partitioning the iteration space of theseloops and determining a near-optimal partition size then become complex given that loopindependent dependencies in the original loop are now cross-loop dependencies.When the number of processors available for the execution of Loop2 is less than22chapter 2Doali (Ii’,12’,...,In’) in EDoacross J= i,L,sizeDoserial I = l,nLi’ = Ii’ + (J—l)diendoserialBuffer = emptyDoseriai K = 0, size—iif ( ((IiLi’+K*di>ul)&&(di))iI ((Ii<ul)&&(!dl)) ) quit doif ( ((12=L2’+K*d2>u2)&&(d2)) H ((12<u2)&&(!d2)) ) quit_doif ( ((In=Ln’+K*dn>un)&&(dn)) I ((In<un)&&(ldn)) ) quit_doriEndoserialif ( J#i) receive (Il’,12’ In’,J, Buffer)store buffer date into memory.Doserial K 0, size—isame index checks as Doserial above.• r2Endoseriaistore data to be sent into buffer.if ( J+size <= L) send (Ii’,12’, . . .,In’,J+size, Buffer)Doseriai K = 0, size—isame index checks as Doserial above.• r3EndoseriaiEndoacrossEndoali(a) Loop2: Transformation of Loop1 for Mukicomputers.Do IDi = ID, IEI*IL/sizel, MaxPset vector (Ii’,12’,...) to (Idl+i)/lL/size Ith element of B.j = i + (IDi mod L/sizel)*sizeEndo(b) Static Scheduling of Loop2.Figure 2.6. Compile Time Transformation of Loopi.I El x (which is the total number of subchains created), then the static schedulingof Loop2 can introduce a small loss in performance given that few of the processes createdare not of equal size. In fact, Loop2 was partitioned on the assumption that all chains23Chapter 2are of length L. The chain length is much shorter when it is close to some of the cornersof the iteration space. Consequently, the previous loop created processes that performnothing but check for loop indexes, and send/receive empty messages. This inefficiencyis eliminated in the next section by creating identical processes.The grouping of iterations in Loop2 introduces some code overhead such as havingmore loop index calculations, checking that these indexes are in the specified range,and changing scalar accesses to array accesses. In addition, system overhead such as anincrease in the cache miss ratio and page fault is probable because the new code is longerand accesses more data. Most of the program overhead distributes evenly between rj,r2, and r3 and can be incorporated in the analysis. If 6r, 6r, and 6r are respectivelythe code overhead fraction of rtl, rt2, rt3, then the optimal subchain size of equation(2.6) becomes/ L(c+s.t).szze = 1V 1 — r + +Figure 2.7 shows the execution time of the loop in terms of the subchain size forthe example in Figure 2.2 The first two graphs are actual execution on the MasPar Mp-1machine4.The third plot is a simulation of the loop execution on a MIMD machine whilethe fourth plot represents a graphical representation of equation (2.5) having the sameparameter as the simulated MIMD machine of graph 3. It should be noted that, for eachgraph, the execution time is normalized with respect to the execution time of the loop‘ The MasPar MP-l is a massively parallel 2D mesh SIMD machine. The examples presented in this dissertation use the MPLlanguage (which is an extended dialect for C on the MasPar) on a 8192 processing element (PE) machine. The MasPar providestwo types of communication mechanisms: xnet and router mechanism. Xnet communication uses a high-speed local routing to allowcommunication between PEs in the mesh that lie in a straight line from the sending PEs in one of 8 possible directions. Routercommunication uses the global router to allow any arbitrarily type of communication [50]. Global router is usually slower than xnet.However, both mechanisms provide a low communication cost especially when compared to MIMD machines.24chapter 2Figure 2.7. Execution Time vs Subchain Size for the Example in Fig 2.b (L = 50)when size is 1. Therefore, Figure 2.7 should not be used to compare the execution timesbetween the curves. However, it can be used to compare the shapes and the optimalsubchain size between the four plots. The following observations can be deduced fromFigure 2.7:Subchain execution of Doacross loops can result in a considerable improvement overI I I INormalized Execution Time4.60 —4.40—4.20 —4.00—3.80 —3.60 -3.40 -3.20 -3.00 -2.80 -2.60 -2.40 -2.20 -2.00 -1.80-1.60-1.40 -1.20 -1.00—0.80 -0.60 -0.40 -0.20 -I———V—I——. _ ‘+1a----10 10 20 30 40 50Size (in iterations)25Chapter 2traditional techniques (i.e size = 1 or size = L). For example, the optimal subchainexecution improves the execution time over the fully parallel loop (i.e size = 1)by 35% for the router communication on the MasPar, and by 60% for a simulatedMIMD machine.• As communication overhead increases, the optimal partition size increases and theimprovement of the optimal execution over the fully parallel loop also increases.Note that graph 3 has a higher communication cost than graph 2 which in turn has ahigher communication cost than graph 1 (xnet communication is faster than the routercommunication)• The graphical representation of equation (2.5) closely matches the actual execution ofthe loop (compare graph 3 and 4). In addition, the theoretical partition size definedin equation (2.6) nearly predicts the actual partition size. In fact, the optimal partitionis exactly anticipated for graph 1 and graph 2 and is closely predicted for graph 4.This is consistent with other examples.• Graph 4 has local minima at partition size values that divides evenly the number ofiterations in the chain. This is expected because one communication statement issaved for these points. This feature could not be observed in the MasPar becauseof the low communication cost.2.5 Chain Combining Execution of the Loop The previous transformation of theloop has the disadvantage of creating processes of unequal sizes. This reduces the staticscheduling performance when the number of processors is not large enough. This problem can be solved by combining all chains with less than L length (short chains) into full26chapter 2chains of length L. Instead of including in E all early iterations, only the earl’,’ nodes ofthe critical dimension are included (as described in equation (2.7)). The critical dimension is the dimension that determines the length of the longest chains. This dimension isthe index j that determines L in section 2.4.3. If there are many such dimensions, thenanyone of them can be chosen as the critical dimension. The set E becomes:/ 11 i1 <Ui,E = and 12 i2 u2,..., and fi(l,u) i, <f1(i,,u3)+ (2.7)and Ij+i i,i Uj+i, and i, i uwhere j is the critical dimension. The new set E of Figure 2.3 is composed of the firstrow only.All the early nodes that have been removed from E are short chain early nodes.Some of the nodes remaining in E are also short chain early nodes. The iteration spacecan be viewed as divided into three parts (Figure 2.8). The middle part, part2, containsthe full chains. It separates parti and part3 which contain the short chains, part] canbe combined with part3 to obtain full chains. Joining chains together might add oneunnecessary communication statement. However, this does not affect the execution timeof the whole ioop because all the chains execute the same number of iterations andcommunication as the longest chains. An example of a combined chain contains thenodes (1,6), (2,7), (3,1), (4,2), and (5,3). Previously this path was distributed into twochains: One with (1,6), and (2,7). The other with (3,1), (4,2), and (5,3). Note that onceall the chains in the iteration space are transformed into full chains, these full chains arepartitioned into subchains as described in section 2.4.27Chapter 2In order to implement the chain combining method, the index computation of the threeDoserial loops (Figure 2.6.a) that contain the actual code is changed to the one indicatedin Figure 2.9. The modification is done inside the if statement where the starting indexL of a subchain is set to the lower bound of dimension i in case an iteration indexsurpasses the upper bound of dimension i (kd is subtracted from the starting index toaccount for the fact that the process might be in the middlle of the subchain). Note that, inevery if statement, the branch is taken at most once. Therefore, the new ioop (Loop3)hasalmost the same overhead as Loop2. The dotted curve of Figure 2.10 shows simulationresults of Loop3 execution time which is very close to the execution time of Loop2. Onthe other hand, Loop3 is composed of identical subchains, and has less subchains thanLoop. Therefore, this loop can be scheduled more efficiently than Loop2 especially ifthe number of processors available is less than IE x Felijpartipart3...-Figure 2.8. Chain Combining of the Iteration Space of Figure 2.3.28chapter 2Figure 2.9. Index Computation for Obtaining the Chain Combining Execution.We now show that the chain combining technique divides the iteration spaceinto equal chains of length L. This is done by proving that every node I =(i1,2 is included in a unique path of length L.Proof: Every path is characterized by its early iteration. Therefore, finding a path is thesame as finding its corresponding early iteration. Since j is the critical dimension, thento find the early node coordinates, we need to subtract (1d, 1d2, . . , ld, ld+i, . . ld,)from Ito get ‘e = (iej, e2 , i) where 1 [ij. For the dimension j. we havei = i — Id,. Thereforef1(li, u) < fi (li, u) + d,.For the other dimensions,we have two cases:Doserial K = 1,if ( (Ii= Li’Ii = (IiLi’ = Iielse if ( (IiIi = (iiLi’ = Iisize+ k*di > ul) && (di) ) then— ui) + (ii— 1)- k*dl< ul) && (Idi) ) then+ 1) — (Ui — Ii)- k*dlendifif ( (ln=In =Ln’ =else ifIn =Ln’ =Ln’(InIn(In(inIn+ k*dn > un) && (dn) ) then— un) + (in— 1)- k*dn< Un) && (!dn) ) then+ i)— (Un — In)- k*dnendif1Endoserial29Chapter 2Figure 2.10. Comparison between Loop2 and Loop3 Execution Time.1. if (i — ld l and cl 0) or (i — ld I and d 0) for all i, then the earlyiteration‘e is equal to(i1—ld,2•••,i,—ld31—ld,+1,...i—ld), andI E B2. Else, we define the vector B = (b1,b2,.. . , b_10, b,1 ... , b.,?,) where b = 1 if i—ld < i and b = 0 otherwise. We also define the vector C = (Cl, C2, , c) wherec=uif(i—ld2landdj0),and c=—li if(i1—ld<l and d<0). Wehave Je = I — ID + BC. This node satisfies equation (2.7). Therefore, I E B.I INormalized Execution Size2.00 -I1.90 -1.80 -1.70 -1.60 -1.60 -1.40 -1.30 -1.20 -1.10 -1.00 -0.90 -0.80 -0.70 -0.60 -I iteratiouz/chunk0 20 40 60 80 100•1‘II I30chapter 21 2 34 56 1 232y :a) Chain Execution (Composed 016 Subchains) b) Grouping of SubchainsFigure 2.11. Increasing Processor Utilization by Subchain Grouping.We found a unique earl)’ node for each iteration. Thus, every iteration is included in onlyone chain. We need also to prove that every chain is of length L. This can be easilyseen because of the critical dimension j.2.6 Increasing Processor Utilization L oop3 reduces the expected execution time aswell as creates identical processes suitable for static analysis. However, the utilizationof processors might be low. If every process created by Loop3 is executed on adifferent processor, then every processor (except the few ones that do not receive anydata) stays idle for y = ar + c + dt where y is given as a fraction of the normalized codea (r1 + r2 + r3). Note that y represents the delay part of the subchains as explained insection 2.4.3 If a Doacross loop (with a sequential part, and a parallel part) is executed inparallel, and the sequential part represents y percent of the total code, then the maximum31Chapter 2Doall M = 1, ceiling (l/y)Doacross (Ii’ , 12’, . . . , In’ ) EDoserial J = M,L, (ceiling (l/yH*sizeDoserial i = l,nLi’ = Ii’ + (J—l)diendoserialbuffer = emptyLoop index in fi gure_81endloopif ( J#l and M#l) receive (Il,12’,...,In’,M—l, Buffer)store buffer into memory.Loopindexinfigure82endloopadd data to be sent into buffer.if ( J +size <= L) send (Il’,12’,...,In’,M+l, Buffer)Loop index in figure 83endloopen do serialEndoacrossEndoal 1Figure 2.12. Code for Increasing Processors Utilization (Loop4)speedup that can be obtained is [41, 42]. Therefore, executing all the tasks in parallelrequires only processors for each chain which can be achieved by merging someprocesses to get rid of the idle time. This can be done by round robin grouping theEei subchains of each chain into processes of a (1p) iterations. Figure 2.11illustrates the increase in processors utilization without increasing the execution timeof the loop. Figure 2.11 .a shows the execution of six subchains with every subchainexecuting on its own processor while Figure 2.1 1.b shows the grouping of subchains (1with 4, 2 with 5, and 3 with 6). Both Figures achieve the same execution time but thenumber of processors needed in Figure 2.1 1.b is half the number of Processors needed32chapter 2in Figure 2.1 l.a. The new ioop (Loop4) is shown in Figure 2.12. The outer Doall ofFigure 2.12 is special is in a sense that all of its iterations should start before any of themterminates. Otherwise, a deadlock might occur. Therefore, we assume that the numberprocessors is larger or equal to2.7 Loops with Non-delayed Synchronizations For a non-Delayed synchronization,the post statement comes lexically before the wait statement (in the absence of gotosstatements). Non-delayed synchronizations are caused by lexically forward dependencies.This makes the loop more parallelizable than the loop with delayed synchronization [411.We call rI the fraction of code before the post statement, r the fraction of code betweenpost and wait, and r the fraction of code after the wait statement. If computation andcommunication can be overlapped, then we have:T(1) =r +max(c+s.t,r) +rT(size) = size.r + max (c + s.t, .size.r) + .size.r if size L=L.(r+r+r)=L if size=LAnalysis of the above equations shows that T(size) is minimal for size = 1 except when1— r + c -- s.t L. In that case, size = L. This is an expected result for a loopwith a non-delayed synchronization because when every iteration is executed on its ownprocessor, all communication can be done in parallel (data to be sent is approximatelyready at the same time). That is not the case with a delayed synchronization where thedata to be sent is ready only after a delay.2.8 Conclusion In this chapter, we presented automatic transformations of Doacrossioops with single and uniform synchronizations for efficient execution on distributed33Chapter 2memory parallel machines. Based on the expected execution time of the iterations,these transformations achieve minimum expected execution time, balanced load, and highprocessor utilization rate. This work can be integrated in the final phase of parallelizingcompilers or used manually by the programmer to write parallel code. In the formercase, the programmer is left with an easier job as most multicomputer issues such aspartitioning, scheduling, and processor utilization are performed by the compiler. Giventhat Doacross loops with single synchronizations occur more frequently in real applicationthan those with multiple synchronizations (to be discussed in the next chapter), we expectthis chapter to be more useful than the next one. For this reason, this thesis separatesboth kind of ioops so that the former loops are investigated in more depth while keepingtheir compile time transformations simple.34Chapter 3Partitioning and Efficient Execution ofLoops with Uniform Dependencies3.1 Introduction In the previous chapter, transformations of Doacross loops withsingle synchronizations for efficient execution on multicomputers are presented. Inorder to generalize these transformations to Doacross loops with multiple and uniformsynchronizations, this chapter presents improvement to iteration space partitioning. Infact, an improved version of supernode partitioning that increases the parallelism exploitedin the loop without adding any communication overhead is introduced. In addition, anapproach for computing the optimal partition size is also shown. Based on our improvedpartitioning and generalized code reordering, compile time transformation of Doacrossloops with multiple synchronization is presented. In real applications, the number ofsynchronizations in a Doacross Loop is usually small and the synchronizations vectorsare usually simple. As a result, this chapter may look theoretical at some points as someof the generalizations do not occur quite often.3.2 Loop Model In this chapter, efficient execution of Doacross loops on multicomputers is presented5. The loops to be transformed are already in a parallel form. Asin the previous chapter, the shared memory paradigm is used. Figure 3.1 shows anexample of a loop satisfying this model. It should be reminded that the synchronizationoperation post (ev(Ji, J2, ..., J)) sets the event variable ev(Ji, J2, ..., J,) to True, whileFor the sake of simplicity, only perfectly nested loops are used in our model to describe the transfonriations. The techniquespresented in this paper can be applied to non-perfectly nested loops with minor adjustments.35Chapter 3Doacross Ii =11, UiDoacross 12=12, u2Doacross In=ln, Un• riwait (evl(I1—dli, 12—d12, . . ., In—din))wait (ev2(I1—d21, 12—d22, . . ., In—d2n))• 1r2post (ev2(I1,12,...,In)) Ipost (evl (11, 12, .. .,In)• r3post (eV3(I1,12,...,In))• r4wait (ev3(Ii—d31, 13—d32, .. .,13—d3n))• r5EndoacrossesFigure 3.1. An Example of the Loop Modelthe operation wait (ev (ii’ 12, ..., ia)) waits until the event ev (11,12, ..., I,) becomesTrue [51, 45]. The vector (J1,J2,• , J)—(Ii,‘2, , I,) is called the synchronizationvector. This chapter only considers loops with uniform synchronization vectors. Weassume that compiler optimization techniques have been performed to improve the codeand to minimize the number of synchronizations (i.e. minimize data movement) [45, 27].We also assume that read-only data is distributed perfectly before ioop execution. Notethat the code inside the loop of Figure 3.1 (represented by dots in the figure) is dividedinto five regions of code labeled ri to T5. The justification and the approach for dividingthe code inside the loop into regions is presented in section 3.5.36chapter 33.3 Partitioning for Loops with Multiple Synchronizations In the previous chapter,only Doacross loops with single synchronization are considered. The subchain execution(section 2.4) consists of dividing the iteration space into chains. The chains are furtherdivided into subchains. Code reordering is done inside each subchain to improve theexecution time of the loop. In order to generalize the subchain partitioning and executionto Doacross loops with multiple synchronizations, an equivalent partitioning calledindependent tile partitioning is presented in this section. It is an improved partitioningthat combines two existing ideas: independent set partitioning [28, 27, 52] and supernodepartitioning [30, 32]. For an n dimensional loop with m synchronizations (Figure 3.1), if(D D,••• , D) are the synchronization vectors where D = (d1,d2,••• , d) for 1i <m, then D’ = (D, , D) is the synchronization matrix6. Only ioops withuniform synchronization matrix are considered.3.3.1 Independent Set Partitioning The presence of cross-iteration dependencies, expressed by event synchronizations in our model, divides the iteration space into independent sets of iterations (any two nodes from two different sets are independent).An iteration I = (i1i2 . . i) is independent of iteration J= (jij • in), with respect to the matrix D’, if and only if there is no (ai, ct2,••• , am) e zm such thatI = J + (aia .. am) x D’. All independent sets can be executed in parallel withoutany communication overhead. Various approaches for identifying these sets are proposedby D’Hollander, Peir, and Cytron in [28, 27].6 It is equivalent to the dependence matrix [27] if every synchronization satisfies a unique data dependence.37C •1 CD .t’J•........C.,••••••0:3;!•••••••••••COC•••••••••••CD••••••••••••-t•ot-lo••S••••••••C009••S•••SSSSftSCD••••SSSSSS555U)005555•SSS••SCi2t••••••••••••SSSSSS••..•..•.•0-tSSSSSSS•SS•II—•SSSSSS•SS•SIItsDf—’•55555555555•••S••••••••S55555555t%)•••••••••••C)f-I. 1 CD COU)C-chapter 3the iteration space into 7 independent sets. The figure shows the dependence between thenodes in only one of these 7 sets. The independent set partitioning is efficient when thenumber of available processors is close to the number of independent sets. For a largenumber of processors, using the wavefront method [53, 14, 54] inside each independentset is proposed in [27]. The dotted lines in Figure 3.2.a shows the wavefronts of anindependent set. All iterations inside one front are executed concurrently while the frontsare executed sequentially. As indicated in [27], this method is best suited for a systemwith a small number of processors with vector processing capability. For multicomputers,the communication overhead can easily nullify the advantages of executing nodes of afront in parallel.3.3.2 Supernode Partitioning Supemode partitioning (also called tile partitioning in[321) consists of dividing the iteration space into tiles according to the dependence vectors.Each tile can not proceed until it receives all the data needed from other tiles. A methodfor dividing the iteration space into deadlock-free tiles is presented in [30, 32]. Figure3.2.b shows the tiles for the same iteration space and same matrix D’ as in the previoussubsection. In addition to considerably reducing communication time, code generationfor this partitioned loop is simple. However, the drawback of this approach is thatindependent iterations can be found in a same tile. For example, in Figure 3.2.b, eachtile contains iterations from all 7 independent sets described above.3.3.3 Independent Tile Partitioning The independent tile partitioning proposed in thischapter is a hybrid approach between independent set partitioning and tile partitioning. It39Chapter 3consists of dividing the iteration space into tiles. Each tile is further divided into subsetseach of which includes only nodes from one independent set (the word subset is generallyused in this chapter to denote the independent tiles). Figure 3.3 shows one independentset divided into subsets. Each of the 7 independent sets is partitioned in the same manneras the one represented in the figure. This approach extracts more parallelism from theloop than supernode partitioning without adding any communication overhead.In order to partition the iteration space into tiles of independent iterations (subsets)as briefly described in the previous paragraph, we need to find a partitioning matrix Dcomposed of 1 linearly independent vectors (Di,D2,• , D1) such that every vector inD’ can be written as a linear combination of the vectors in D. When D is used as thepartitioning matrix, then it has the advantage that every node in an independent set canbe expressed as a unique integer linear combination of the vectors in D (this allowsa simple compile time transformation to be shown in section 3.7 since every iterationis uniquely defined from a starting node and the vectors in D). In addition, given theassumption that a subset can not start until it gets all the data needed from other subsets,then a sufficient condition for the deadlock-free execution of the loop is that all vectorsin D’ are positive (or all negative) linear combination of the vectors in D [30, 32] (thisdeadlock free condition is relaxed in section 2.4.1 once code reordering is performed). InFigure 3.3, we have D = D’ because the vectors in D’ satisfy the above two conditions.Note that it is essential to include in D as many vectors as possible from D’ so thatcommunication between subsets is reduced.After finding a partitioning matrix D, the iteration space is then divided into indepen40chapter 35Q......:..,SZ,.• • • • 6• • • • • • • • • . . • • ••7.• I • • • I • I • I • • I • • I• . • I I I I I I I • • I I • • I I• I I I • I • • • I I I I I • •Figure 3.3. Independent Tile Partitioning for D= ( )dent sets according to the vectors in D. The number of such independent sets is called I,and set i is called F2. For each set F2, where 0 i < I — 1, a node a2 from that set isfound. Partitioning the iteration space into independent sets and finding a representativenode from each set can be accomplished using the minimum distance algorithm [27]. Forevery node p F2, we have p2 = a+(1cr...)x D, where (ci,a2,”.,i) e Z1.For each node a2 (0 i I— 1), a pseudo-node a can be found such that V p2 E F2,we have p2 = a + (c4c4.. •o4) x D where ,Q4) e N1 (N is the set ofpositive integers). These pseudo-nodes are computed in section 3.3.4 and are used in thecompile time transformation of the Doacross loops. Depending on D, a can be insideor outside the iteration space.Once a is found for 0 i I—i, each independent set P2 is partitioned into subsets41Chapter 3such that subset includes all the nodes a + (/31132 i3t) x D where/3, [i, x sizej,(i + 1) x .size[ where 1 j 1sizes = length of partition in the direction of D,The size of each subset (except the subsets at the boundaries of the iteration space) isfi size. Figure 3.3 shows one independent set divided into subsets of size 3 in thedirection of D1, and of size 2 in the direction of D2.Partitioning the iteration space according to D makes the number of independentsets larger than the actual number of independent sets had D’ been used (I’ I, whereI’is the number of independent sets if D’ is used to partition the iteration space intoindependent sets). We have two cases:1. Every vector in D’ can be written as an integer linear combination of the vectors inD. In this case, I’ = I.2. There exist at least one vector in D’ that can not be written as an integer linearcombination of the vectors in D. We have I’ < I. Given that it is desirable thatonly subsets in the same independent set communicate with each other, then theabove partitioning can be improved by combining some of the subsets such that=O I = i mod for 0 i I’ — 1.Our partitioning method improves on independent set partitioning [28, 27, 52] bydividing the sets into subsets so that the wavefront method can be used on the subsets.It also improves on supernode partitioning [30, 32] by dividing a supernode into subsets42chapter 3that include dependent nodes only so that the subsets that previously formed a supernodeare executed concurrently. This results in a speedup factor of I’ (when I’ I) oversupernode partitioning for grid topologies with large number of processors. Note thatthe compile time transformation of the Doacross loop with multiple synchronization canbe found in section 3.73.3.4 Pseudo-node Computation In the previous section, pseudo-nodes are introducedsuch that (a, D) is a basis for the set P. This basis has the property that every nodein P can be written as a positive integer combination of the vectors in D. For every setP, there are an infinite number of pseudo-nodes a that satisfy the previous requirement.These pseudo-nodes can be computed from any node a2 E P2. Using (a, D) as a basisto partition the iterations space into subsets as described previously may result in somedummy subsets. For example, in Figure 3.3, node (—20, 15) can be used as a pseudo-node for the set P1 (Sub0, is an example of an empty subset). Since the Doacross loopcan be statically scheduled (section 3.7), the inconvenience of these empty subsets is thatsome resources are wasted by creating these dummy processes. Therefore, the optimalpseudo-node a is the one closest to a2 (i.e the one that minimizes a — ajW) so that itgenerates the smallest number of empty subsets. Even though the expected running timeto solve for the optimal a using integer programming techniques is usually reasonable(usually 1, and n are less or equal to 3), the running time is potentially exponential. InFigure 3.4, a cost effective algorithm to find a reasonable a is given. Note that E’is the set of iterations in the iteration space that can start immediately (early nodes).This set can be computed by finding the set of early nodes for each synchronization43Chapter 3a = aDone = Falsewhile (!Done)for (j = 1;j < l;j + +)if ( e N such that (a + 7D2) (is — E’)) thena=a—stepx(Di+D2+•••+Dz)quit for loopelse Done = TrueendforendwhileFigure 3.4. Algorithm for Computing the Pseudo-nodes.considered separately and using equation (2.1) of the previous chapter. In this case,B’ is the intersection of all these sets: B’= fl E (E is the set of early nodes forsynchronization vector D:). A necessary and sufficient condition for the desired a isthat any node which is a positive multiple of a vector in D and having a as a basis isnot in the iteration space (called IS in Figure 3.4) minus the set E’. The variable step isused to control the cost of the algorithm. A step of 1 gives the best a the algorithm canfind. Increasing step reduces the cost of computing the pseudo-nodes at the expense ofthe latter being closer to the optimal ones. The algorithm can be improved by having adynamic step that changes inside the loop or by giving weights to the vectors in D. The44chapter 3complexity of the code inside the while loop is 0 (ri x 1). In fact, the if statement takes0(n) steps because it only involves comparing indexes in n dimensions.Note that the previous algorithm is executed only once to find the pseudo-node ofone independent set. All other pseudo-nodes can be deduced from that node. In fact,if a is the pseudo-node of set i, then pseudo-node a of any set j can be computedusing the equation a; = a2 + (a — ai) for 1 I provided that a3 has the samecoordinates in P, as a has in P.3.4 Communication Patterns In the previous section, Doacross loops are partitionedsuch that the iteration space is divided into independent sets and each set is furtherdivided into subsets according to the linearly independent vectors in D. In this section,the communication patterns between the subsets are established.For every vector D, (1 i rn) in D’, we have D. = /3D1 + + . .. + 13Diwhere (i /2,• , 3;) R1 because every vector in matrix D’ can be written as linearcombination of the vectors in D (section 3.3.3). In addition, it can be shown that if avector V E Z’, then V a E 13, for 1 p I (a is an iteration in the independent set13,), we have (a-I- V) E 13, (1 < q I), where q—p is a constant (i.e the function+V maps one independent set into a unique independent set: adding vector V to thecoordinates of any node in P results in a node in 14). Therefore, D. results in datacommunication from one independent set to a unique independent set. In addition, adifference vector between any two independent sets can be found. This vector is definedto be the difference vector between node (i 12, ,j) in the first independent setand its corresponding node in the second independent set. For example, node (1,1)45Chapter 3in Figure 3.3 represents the first node in P1 while iteration (1, 2) represents the firstnode in P2. Therefore, Diff(P2,Pi) = (2,1) — (1,1) (0,1), where Diff(P,,P:)represents the difference vector between independent set j and i. When partitioningthe iteration space into independent sets according to the vectors in D, all the vectorsDiff(Pk, Fi) for 1 < Ic < I can be found (note that Diff(Pi, F1) = 0). From theabove results, for any vector there exists Ic such that can be uniquely written asD;. = Diff(1,k)-f-3iDi+/92+•..+3jDiwhere(/3i./32,...,/3j) C Z1, and 1 k I.Based on this new expression for D, we can derive Table 3.1 which illustrates the datato be sent from the producer subset sub,,to the consumer subsets because ofsynchronization D. The first colunm shows all the consumer subsets while the secondcolumn represents the size of the data to be sent from sub,1 to the correspondingsubset in the first column7. This size is given in d bytes where d represents the size of thedata to be sent between any two iterations in the unpartitioned space because of D. Notethat for the case when is also in D, there exists Ic (1 Ic 1) such that D = Dk.Hence, D results in a data communication of size fT size from subseti—_i (if=k)to subset sub only.For every synchronization D, there are 21 consumer subsets. However some of thesesubsets have a zero data size (i.e no communication is needed). It should be noticed thatthe sum of all entries in the data size column in Table 3.1 is equal to the size of the/1 —3\subsets. In Figure 3.3, when the synchronization matrix D’= ( 2 1 ), a possible\2 —3)The data size presented in the second column of Table 3.1 are only valid for regular subsets (the subsets at the boundaries ofthe iteration space are inegu1ar in the sense they have different sizes).46Consumer Subset Data Size (in d bytes)(i+k) mod z xi x size1—X z2 X size2— /92 xI••• x xsizei—f3z(i--k)mOdI I’sub, L8i — X1 X sizel I > i2 X size — /92 I x1 +j ,t2+v2 ti +i‘I••• x z1 xsizej—/3j(i+k)modI I I Isub x1 x sze—I x I/9 — x2 X size2l x1 +l 2+ ti +i I II,• x z xsizej—/3z(i--k)mOd i 113i —zi x sizeil x 1/92 —X2 X size2l XsubI, I••• x xsizej—/3z(i+k) mod i I x1 x size1—x Ifl — 2 X size2l xx I/9i — Xl x sizei(i+k) mod i I/9i — xi x size1 x 1,9 — X2 )< size2I Xsubt1+Xl,22+X2,,ii+Xz•.. x— Xj x sizeilwherextxsize</3<xxsize for/9>Owhere O<i<l andx=x+1x sizej < /3 < Xj x size for /9 < 0where 0ii andxx—1Table 3.1 Communication Patterns for Due to47chapter 3Receiver Subset Data Size(i+3) mod 7 2sub.Zi(i+3) mod 7 4sub.zi+1,z2b(+3) mod 7 0_ili2+1mod 7 0suChapter 3j’l —3\Table 3.2 Data Communication for D’= ( 2 1 ) Due to\2 —3)partitioning matrix is D= ( ). We have D = D1 = Diff(Pi, F1) + D1,= = Diff(Fi,Pi) + D2, and D3 = = (0,3) + 2D1 = Diff(P4,F)+ 2D1(8i 2, and /32 = 0). If sizei = 3, and size2 = 2, then for D, we have= 0, x1 = 1, X2 = 0, and x = 1. D results in a conmiunication of size 2 fromany Sub2 to Sub1+i,2.D’2 results in a communication of size 3 from any Sub,2 toSub12+i, while the communication due to D is shown in Table 3.2. On the other hand,if = (3,—2) mD’, thenwehaveD3= (0,0)+D1+D2= Diff(Fi,Fi)-i-Di+DIn this case all communication occurs in the same independent set (i.e k = 0) and istherefore simple. This simple communication pattern occurs when 1 = rn (i.e the vectorsin D’ are linearly independent).3.5 Code Reordering8 In addition to the efficient partitioning proposed in the previoussection, the Doacross loop execution can be further improved by reordering the codeThis section is a generalization of the code reordering section (section 2.4.1 in the previous chapter (2.4.1).48chapter 3inside each subset. The code inside the ioop is divided into regions of code such that,for every subset, one region is executed for all the iterations (in that subset) beforeproceeding to the next region. The purpose of this reordering is to first make the datato be sent ready as soon as possible, and second to make every process (subset) ableto execute useful code while waiting for data. This results in a faster execution of theDoacross loop because the delay between the dependent subsets decreases. Consequently,the maximum speedup that can be obtained increases [55]. For example, in Figure 3.1,the loop is divided into five regions such that, for every subset, ri of all the iterations areexecuted first, followed by r2 of all the iterations, followed by r3 of all the iterations, andso on. In addition to the improved execution time, code reordering relaxes the conditionsimposed on the partitioning vectors as explained in the last paragraph of this section.In order to improve the execution time of the ioop and keep its static transformation simple, the code inside the loop is divided into regions based on disjoint groups ofsynchronizations. A group of synchronization is a set of minimal number of synchronization (wait and post) statements satisfying the following two rules: 1) A wait and itscorresponding post are in the same group, 2) Any synchronization statement that comeslexically between a wait and its corresponding post is in the same group as that wait andpost9. We denote by G the ith group of synchronizations, by s the first synchronizationstatement in G2, and by L the last synchronization statement in G, where 1 i <f andf is the number of synchronization groups. A stack-based algorithm is used in Figure3.5 to find all the synchronization groups in a ioop.We assume that goto like statements are not allowed in the program.49Chapter 3Once the groups of synchronizations are identified, the code inside the ioop is dividedinto regions such that the code between s and s forms region r2j while the code betweens and s1 forms region r2,+1. The code before 4 forms r1 and the code after 4 formsr2f+1. It should be noticed that an even region rj (1 j 2f + 1 and j even) receivesdata from rj of the producer subsets before starting execution while an odd region r(1 k <2f + 1 and k even) does not receive any data and proceed immediately onceri in the same subset is executed. The loop in Figure 3.1 has two synchronization groups:one containing synchronization 1 and 2 (identified by evi, and ev2) while the othercontaining synchronization 3 (ev3). We have 4 = wait(evl(. •)), s, = post(evl(...)),4 = post(ev3(’..)), s wait(ev3(. •)). As a results, the code is divided intofive regions as indicated in the figure. Note that section 3.6 shows the improvementin execution time for one example but for different subset sizes when code reorderingtechnique is used. The average decrease in execution time obtained there is 54%.It should be noted that when the vectors in D’ are linearly independent, then 1 = m(recall that 1 is the number of synchronization vectors in D while ni is the numberof synchronization in D’). In this case, every subset communicates with 1 differentneighboring subsets. Hence, it is efficient to make data ready as soon as possible bydividing the code as indicated above. However, when 1 < m, every subset sends data tom subsets which are not all different. In this case, after dividing the code into regions,combining some regions together may improve the overall execution of the loop byreducing the communication time. Given that only consecutive regions may be adjoinedtogether, then there are ([j — i) x ([j —2) x ... x 1 = ways to50chapter 3i=O 1* i is the number of groups in the ioop */while (N=getsynchO)/* get_synch () returns the event I of the next synchr. (returns 0when no more synchr. are left) . Every synchr. is composed of twoparts: a wait part and post part. get_synch() returns thesame I for both parts if they represent the same Synchr. *1if ( is_in_stack (N) ) thenif ( N == top_of_stack () ) thenpop () / * pops the first element and any elementin the stack that is marked. */if ( stack_is_empty () ) theni = i + 1;form_a_group ()/* all previously poped elements constitute a group ofindependent synchr. called G[iJset SF[i] to the first synchr. with event N andset SL[i] to the other corresponding synchr. */else mark (N)else push (N)Figure 3.5. Algorithm for Finding the Independent Groups of Synchronizationscombine the regions.As explained in the next section, the serial part of a subset (called delay in [41, 55]) isdelay = mx (f(r) + t(r)) (3.1)i=li evenThis is the time a processor executing the iterations of a subset stays idle while waitingfor data. f(r) (1 i f) evaluates the communication time for region r of anysubset while t(r) estimates the execution time of region r of any subset. When1 = m, the delay in equation (3.1) becomes smaller as the code inside the loop isdivided into more regions (based on independent synchronization as explained above)51Chapter 3because t(r) and ffr) are smaller. However, when 1 < m, dividing the regions intomore regions decreases t(r) but might increase f(r) because the start-up time increasesdue to the fact that two different regions may communicate with the same subset (thecommunication for the same subsets could have been merged if these two regions arecombined into a single region). To illustrate the improvement obtained by dividing thecode into regions, we consider the loop in Figure 3.1 with t(r2) = a for (1 <i <5),f(r) = b (i = 2, or i 4), and 1 = m = 3. If the code inside the loop forms one regionthen delay (in % of total code)== 1, and the maximum speedup when executingthe subsets in parallel is 1. If the code is divided into 5 regions, then delay= 5b’ andthe maximum speedup for the concurrent execution of the subsets is Therefore,reordering the code inside the loop by dividing it into 5 regions runs faster thanwithout code reordering. Note that in this analysis, we ignored the overhead introducedbecause of code reordering such as more complex indexing of the loop.Once the code inside the loop is divided into regions, the vectors in Dshould be found such that the execution of the subsets is deadlock-free. Partitioning the iteration space according to D may cause a deadlock if and only ifi,j (1i<rn,1<jm,anditj),and k (1kl)suchthat=(; . . ...Dand sign of ak is opposite to the sign of cTherefore, a sufficient condition for finding the partitioning matrix D is that all vectors inD’ are positive (or all negative) linear combination of the vectors in D [32] (the vectors in52chapter 3D are called extreme vectors in [32]). This condition for finding extreme vectors is basedon the requirement that a subset can not start until it receives all the data needed for theentire code by that subset (this is only needed if the whole ioop forms a single region).This requirement can be relaxed if the code inside the loop is divided into regions asdescribed above. In that case, all subsets can start but a subset can not proceed past aneven region until it receives all the data needed for that region only. Therefore, the newcondition for deadlock-free partitioning is to find the extreme vectors (D1,D2,. . . , D1)such that for every even region rp (taken separately), only the synchronization vectorsthat are lexically in that region are all positive (or all negative) linear combination ofthe vectors in D. Given that it is more efficient to choose extreme vectors from D’ inorder to reduce inter-subset communication, the new condition on the extreme vectorsmakes it more likely to be able to choose the extreme vectors from D’. For example, if/1 —3’\D’= ( 2 1 ) in the loop of Figure 3.1, then we have D’3 = —D — D. Using\—3 2)the more flexible condition for deadlock-free code, the vectors D, and D can be usedas the extreme vectors. No extreme vectors from D’ could be found if the deadlock-freerequirement in [30, 32] had been used.3.6 Computing the Partition Size In sections 3.3 and 3.5, new partitioning and codereordering techniques are presented for efficient execution of Doacross loops. However,this can only be achieved if an adequate partition size is found. In this section, asimplified approach for computing the optimal partition size is presented. For thisreason, the reduced iteration space where each subset is represented by a node needs53Chapter 3to be analyzed. Figure 3.6 shows the reduced iteration space of Figure 3.3. The verticalarrows represent the communication between the subsets while the horizental lines showthe wavefronts of the reduced iteration space (all subsets in a same wavefront can beexecuted concurrently). The number assigned to each node in Figure 3.6 is the samenumber given to its corresponding subset in Figure 3.3. Note that the nodes in thereduced iteration space inherit the execution property of Doacross loops. In fact, theyare each composed of a serial part (delay) and a parallel part. It is the serial part thathas the most effect in determining the overall execution time of the loop.It was mentioned in section 3.4 that each synchronization pair (wait and post) result incommunication statements when the loop is transformed to a distributed memory parallelcode. In fact, because of synchronization vector D (i <j <m), every subsetsends data to one or more consumer subsets (results from section 3.4 indicates that ifthe partitioning matrix is equal to the synchronization matrix, then every synchronizationvector results in a communication from a producer subset to a unique consumer subset;Otherwise, a synchronization vector results in communication from a producer subsetto several consumer subsets). As a result, we define the communication functionC(r) (1 <i <2f + 1) to be the function that evaluates the time to send the datacollected in region r from the producer subset to the consumer subsets. Note that it isassumed in this section that data is sent only at the end of a region even if it is ready,ri Iearlier. We have C(r) = Comk(S, s, t) where rl is the number of consumer subsetsk=1due to all synchronizations in region r and Cornk is a function that determines the timeit takes to send data from to the ktul receiver subset processor. Gomk is54chapter 3expressed as a function of S the data size, s the start-up time, and t the propagation time.Note that Comk depends on the architecture and the mapping of subsets onto processors.It should also be noticed that C(r) = 0 when i is odd because (as indicated in section3.5) no communication is needed for odd regions.We denote by y the size of the partition (we have y= II size, where size is thelength of the partition in the direction of vector D2), by x the number of wavefronts inthe reduced iteration space (x = 4 in Figure 3.6), and by r the execution time estimation(for one iteration) of the code in region r. If T(sizei, size2, ,size1) represents theoverall execution time of the Doacross loop in terms of the length of the partition ineach direction of the partitioning vectors, then we have.fT(size, size2, size1) =y. r +(x — 1). mx ((C(r) + y.r)) (3.2)i=1i evenThe serial part (delay between two wavefronts fronts) of the Doacross loop is representedby the expression inside the last parentheses of equation (3.2). This expression indicatesthat the delay between two subsets in two consecutive fronts is equal to the longest delaybetween any two same regions in these two subsets. In fact, if the longest delay issatisfied, then any smaller delay between any other two same regions is also satisfied. Atiming figure similar to Figure 2.4 can clearly illustrate the last point.55Chapter 3Figure 3.6. Reduced Iteration Space of Figure 3.3In order to demonstrate the use of equation (3.2), we consider the following example:• The code in the ioop is divided into three regions (i.e there is only one group ofsynchronizations).• There are 1 synchronization vectors in the loop. These vectors are linearly independent.• An i-dimensional grid processor target machine is available.• A constant communication time between neighboring processors which is s + d.t,where s is the start-up time, d is the data size, and t is the transmission time.In this case, every subsets sends data to l neighboring node. Given that the loop isdivided into three regions, then all the communication occurs at the end of r2. Note thatthe number of iterations in the producer subset that sends data along the D3 directionis JJ sizer. If we assume that the data to be sent by every iteration in a subset is ofi=1 ijuniform size d along all the directions and that data transmission can be done concurrently56chapter 3in all the direction, then the data transmission time is z.d, where z = mx ( ?i sizei). If.i1 j=1 iwe also assume that a processor serially establishes the communication channels with the1 neighboring consumer processors, then we C(r2) = l.s + z.d.t. As a result, equation(3.2) becomesT(sizei,size2,•. ,size1)= y.(r + r + r) + (x — l).(y.r + l.s + z.d.t) (3.3)In order to have an optimal partition, the values of size1,size2,•, and size1 thatminimize the right-hand side of equation (3.3) should be found. These values are subjectto the following conditionn 1Lsize<min(——1 wherel<i<l— j=i jdJ——which guaranties that the partition to be found fits into the iteration space. Nonlinear programming techniques can be used to find the optimal partition size. However,given that faster solutions are essential for compile-time transformations of loops, somerestrictions on the partition size can be imposed. such as considering only the partitionswith size1 = size2 = = size1 = size. In this case, the number of fronts in the reducediteration space is z = , where x0 is the number of fronts in the initial iteration space(i.e size = 1). We have xo = (F1 + i). where L is the th dimension’s lengthin the initial iteration space and R1 is the coordinate of the progress vector R10 [12].We also have y = (size)’ and z = size1. Thus, Equation (3.3) becomesT(size) =(r-j- r).size1+ (xo.r — d.t).size’+The progress vector is the vector perpendicular to the wavefronts and whose distance is equal to the distance between any twosuccessive fronts. For example, in Figure 3.2.a, we have R = (1.41, —0.35)57Chapter 3(xo.d.t).size2— (l.s) + (34)sizeequation (3.4) is a rational function whose minimum values can be found faster thanthose of equation (3.3).Figure 3.7.b and 3.7.c shows the execution time of the loop in Figure 3.7.a in termsof the partition size. Only the synchronization statements are shown in Figure 3.7.a. Theregions of code represented by ri and r in Figure 3.7.a are both similar in length to ri inFigure 2.2 while r and r4 in Figure 3.7.a are similar in length to r2 of Figure 2.2. Figure3.7.b shows the execution time of the loop using code reordering technique (i.e the codeinside is divided into 4 regions). Figure 3.7.c displays the execution time of the sameloop but without code reordering (i.e the code inside the loop forms one single regionand therefore, a subset of iterations can not start execution until it receives all the dataneeded). Note that the execution time in both graphs is normalized to the execution timeof the loop with code reordering when size1 = size2 = 1. It should also be noticed thatboth graphs represent the execution time of a simulated MIMD machine on the MasPar(it is not possible to execute code in SIMD mode (without increasing considerably theexecution time) if the loop is divided into more than three regions because the processorsneed to execute different regions of code concurrently. However, when the loop isdivided into three regions only, it is possible to execute it efficiently in SJMD modebecause there is only one serial region (the even region r2). In this case, r1 (a parallelregion) is executed concurrently on all processors, then r2 is executed serially across theprocessors, and finally r3 is executed concurrently on all processors. No improvementcan be obtained if a three-region loop is executed in a MIMD mode).58chapter 3It can be deduced from Figure 3.7 that code reordering improves significantly theexecution time. In fact, the second graph is clearly shifted upward when compared tothe first graph. Depending on the partition size, there is up to 72% decrease in executiontime. The average decrease is 41% if negative decrease’1is counted and is 54% otherwise.Finally, we can also conclude that equation (3.4) can closely predict the actual optimalpartition. In fact, various examples show that the execution time of the optimal partitionobtained using (3.4) and the execution time of the actual optimal partition are very closeeven though sometimes the sizes of both partitions are not close.3.7 Static Transformation of Loops In this section, Doacross loops are staticallytransformed based on partitioning method of section 3.3.3, communication patterns insection 3.4, code reordering of section 3.5, and finding a partition size of section3.6. The new parallel ioop is shown in Figure 3.8.a. Note that the set A in theloop contains the pseudo-nodes a of the independent set described in section 3.3.4.The Doall loop refers to the independent sets while the Doacross loops refer to thesubsets. The send, and receive functions need three arguments to establish the appropriatecommunication: the independent set number identified by the pseudo-node a, the subsetnumber identified by its origin I, and the region number. In Figure 3.8.a, Ij representsthe node origin of the subset, while 1 is the node origin of a region inside the subset.1 is based on I and the synchronization vectors lexically in that region. For regionrp (1 p f and p is even), if all synchronization vectors in that region have negativeFor large partition sizes, code reordering increases the execution time because the ovethead of index changes, the high cachemiss ratio and page fault cancel the advantages of code reoulering described previously. The upper right corner of Figure 3.7.b and3.7.c illustrate this point.59Chapter 3Figure 3.7. Execution time vs Partition size on the MasPar.coordinates with respect to D(1 i 1) when expressed as linear combination of D, then= 1. Otherwise (all synchronization vectors in that region have positive coordinateswith respect to Di), then e2 = 0. Note that when p is odd, we have 1 = Ij. A slightmodification in code of any region might be needed for variables with intra-iterationdependence that crosses regions. For example, scalar variable with intra-iteration, andcross region dependence should be changed to array references with index (i1,i2•. , i1).Doacross 1=1,30Doacross J=1,30riwait (evl(I—1,J)r2post (evl(I,J)r3wait (ev2(I,J—1)r4post (ev2(I,J)Endoacrosses(a) Loop Example.(b) Execution of the Example With Code Reordering. Cc) Execution of the Example Without Code Reordering.Sze160chapter 3Note also that number of available processors should be larger than IA I in case 1 <m (seesection 3.3.3). Otherwise, the Doall loop should be changed to a Doacross loop and theordering of elements in A should follow the order of execution of the independent sets.The loop in Figure 3.8.a can be statically scheduled. In fact, if we assume that an 1+1dimensional grid machine of size N0 x N1 x x N1 processors is available (such a targetmachine can be embedded in many available parallel systems), then every processor canbe identified by the tuple (p0 , p) where 0 p N for (0 i 1). One way toschedule the parallel loop at compile time assuming N0 IAI is shown in Figure 3.8.b.The Doall and Doacross loops in Figure 3.8.a are changed to the loops in Figure 3.8.b.The rest of the code is unchanged. N represents the number of subsets in the directionof D for (1 i 1). The ioop in Figure 3.8.b is an SPMD (Single Program Multipledata) [361, where every processor executes the same code but on different data.3.8 Conclusion In this chapter, transformations of Doacross loops for efficient execution on distributed memory parallel machines are presented. These transformationsare based on an improved partitioning and code reordering. The performance of thesetransformations shows a considerable improvement for Doacross loops execution. Thetransformations presented in this chapter can be integrated in the final phase of parallelizing compilers, or can be used by the programmer to efficiently write parallel code. Inthe former case, it may be too complex to implement all the techniques presented here.However, simplifications such as applying our Doacross transformations when I = mcan make the implementation easier. It is assumed in this chapter that read-only datais perfectly distributed onto the processors. This can be done when executing a single61Chapter 3Doall I’ in A /* A is set of pseudo—nodes a’i of section 3.3.4 /Doacross Ii = I’, B ,sizel X Dl 1* B=(max(ll,ul),max(l2,u2), ..., max(ll,ul))/Doacross 12 = Ii, B ,size2 X D2Doacross 13 = 12, B ,size3 X D3Doacross Ii = 1(1—1), B ,sizel X Dlif ( subset (Ii) != empty/* subset () function that checks whether the subset is empty ornot. Ii is the origin of the subset *7/* serial loop for ri *71’l = Ii;Do il l,sizelDo i2 = l,size2Do ii = l,sizelri (with indexes based on I’l, i’s, and D)endos7* serial loop for r2 *7receive (ai’,Il,r2) f*receive data according to sect. 3.4 and 35*/Same serial loop as for ri above but for r2 and withI’l= Il + el.sizel.Dl + e2.size2.D2 + . . . + el.sizel.Dl;send (ai’,Il,r2) /*sends data according to sect. 3 and 4. *77* serial loop for r3 (similar to ri) ./* serial loop for r4 (similar to r2) *77* serial loop for r5 (similar to rl) . *7end ifendoacrossesendoall(a) Transformation of Doacross Loops.Do JO = 1, Aa = JO th element in ADo jl 0 , ceil (Nl’/Nl)—lDo j2 0 , ceil (N2’/N2)—lDo jl = 0 , ceil (Nl’/Nl)—lIi = a + ( Jl X sizel. (Nl,O, . . .,O) +J2 X size2. (O,N2,O, .. .,O) + . . . +Jl X sizel.(O,O,...,Nl) )xDEnd Do(b) Static Scheduling of the Loop in (a).Figure 3.8. Compile Time Transformation of Doacross Loops.62chapter 3loop, but not feasible when having a program with several loops that are dependent oneach others because of conflicting data distribution. Therefore, merging data placementtechniques [29] with our techniques needs also to be addressed. The next chapter appliesthe independent tile partitioning presented here to multiple loopnests simultaneously.63Chapter 4Partitioning the Global Iteration Space4.1 Introduction In the previous chapter, independent tile partitioning is presented.It is an improvement of supernode partitioning for loops with multiple and uniformdependencies. Similarly to most partitioning techniques, our independent tile partitioninguses flow dependence information only to partition the iteration space. However, itis shown in this chapter that other types of dependencies (such as input and anti-dependencies) should also be considered in the partitioning process. In addition, andsimilarly to many other works [30, 56, 27, 32, 28, 52], the partitioning in the previouschapter is performed for single loopnests. However, in multicomputer environment,improving the execution time of every loopnest in a program separately by finding anefficient partition for that loopnest is not sufficient because of communication overheadacross loops. As a result, this chapter introduces the formation of a global iterationspace for the whole program so that all dependencies can be considered simultaneouslywhen partitioning the global space. Across loop data dependence information is used topartition the global space. In contrast to other techniques where partitioning the globalspace is performed using few predefined partitioning patterns only (such as row, column,and block partitioning) [57, 58, 29, 59—61], our partitioning is more general and allowsmore complex patterns that are based on data dependence information (traditional datadependence as well as our across loop data dependence).64chapter 44.2 Dependence Types in Partitioning In the previous chapter, an efficient partitioning of the iteration space based on the dependence matrix D’ is introduced. Amongthe various type of dependencies (flow, anti, and input dependencies), only flow dependencies result in synchronization statements for Doacross loops to be executed on amulticomputer. As a result, it is implied in the previous chapter that D’ includes dataflow dependence information only. In this section, we describe the type of data dependence vectors that should be included in D’ and consequently be used in partitioning theiteration space. The results presented in this section should apply to most partitioningtechniques and not to our independent tile partitioning only. We assume that a dependence should affect the partitioning of the iteration space only if it limits parallelismbetween iterations or introduces communication overhead between iterations. There arefive known types of dependencies.4.2.1 Flow Dependence A flow dependence from iteration i = (i1,i2• . , i)iteration j = (ii , J2, , in) exists in an n-nested loop if i j and i defines avariable that is used later in j. As a result, these two iterations can not be doneconcurrently. In addition, if the two iterations are executed in two different processors,then a communication statement is needed between these two processors. Therefore,flow dependence should be included in D’.4.2.2 Anti-Dependence An anti-dependence from iteration i to iteration j exists in an nnested loop if i j and i uses a variable that is defined later in j. For multicomputers,anti-dependence does not limit the parallelism between iterations nor does it introduce65Chapter 4any communication overhead provided that data is allowed to be replicated. In this case,anti-dependence vectors are only used to enforce data correctness (i.e which processorholds the correct value of a variable after the execution of the loop). For the case whendata replication is prohibited, then anti-dependence introduces communication overheadif iterationi and j are not executed on the same processor. As a result, anti-dependenciesshould be taken into consideration when partitioning the iteration space and be includedin D’. However, it should have a lesser weight than flow dependence given that it doesnot limit parallelism.In order to compare the use of flow and anti-dependence in the partitioning process,the example in Figure 4.1 .a is considered . The partltiomng matnx is D=Note that D1 = (0, 1) corresponds to the flow dependence in the example whileD2 = (1,0) corresponds to the anti-dependence. Partitioning the iteration space accordingto D results in a block partitioning. However, the size of the tiles is yet to be determinedand depends considerably on the type of dependencies found in the loopnest. For theexample in Figure 4.1 .a, we define the dependence factor df to be equal to (recallthat size3 is the length of the tile in the direction of D3). This factor expresses therelative importance of flow and anti-dependence. In fact, a df larger than one means thatthe flow dependence is more important given that more flow dependencies are satisfiedin a same tile than anti-dependencies. The execution time of the loopnest in Figure 4.1 .a12 Tn the previous chapter, the Doacross loop model of section 3.2 is used so the static Iransfonnations of the Doacross loopusing our independent tile partitioning, and code reordering can be shown. Transforming a serial loop to a Doacross satisfying theloop model in section 3.2 can be easily performed automatically or by hand using any of the available paralleizing techniques [14,62, 51]. In this chapter, no transformation is presented. Thus, our loopnest examples are shown in their serial (and more readable)format.66chapter 4Do I=1,N1Do J = 1,N2A[I,J] = 4*B[I+1,J]/A[I,J]B[I,J] = 2*A[I,J_1]*B[I,JjEndos(a) Example with 1 Flow Dependence and 1 Anti-dependence.size2 = 1size2 = 2size2 =3(b) Execution Time vs Dependence Factor for Different Values of size2 (N1=N2=256).Figure 4.1. Importance of Flow and Anti-Dependence in Partitioning.—3Execution Timex 10260.0—I I I I/////240.0—220.0200.0—180.0(160.0(—//I‘I’,‘I0 50 100 150 200 250Dependence Factor67Chapter 4can be approximated tosizei.C+ [ N2 1).size2.C+ ( N2 )sizei.size.E (4.1)\ szzewhere C is the communication time to send one data element from the source tile tothe destination tile while E is the execution time of computing one iteration. Note thatthe first and second terms represent the communication time of respectively the anti-and flow dependence while the third term represents the computation time. In order tominimize (4.1), size2 should be set to 1. However, this may require a target machinewith a considerably large number of processors. As a result, the optimal dependencefactor is given in terms of size2. In this case, minimizing (4.1) is similar to minimizingthe expression (sze1 + — size2) which is equivalent to the expression(df.size2+ — .szze) (4.2)For N1 = N2 = 256, finding the minimum of (4.2) results in a df = 16 when size2 = 1,df = 11.3 for size2 = 2, and df = 9.2 for ssize2 3. The plots in Figure 4.1.bshow the execution time on the MasPar vs the dependence factor for three differentvalues of size2. It can deduced from these plots that the execution time of the computedoptimal dependence factors closely matches the actual optimal execution time (less than1% difference). We can also conclude from Figure 4.1.b that ignoring anti-dependence(i.e size2 = 1) when partitioning the iteration space could slow down the execution ofthe loop. In fact, the optimal execution time of the plot corresponding to size2 = 1 is10% slower than the optimal execution time of the curve corresponding to size2 = 2.Finally, it can be deduced that the optimal dependence factor for different values of68chapter 4size2 is usually high. As a result, flow dependence should be given more importancethan anti-dependence in partitioning the iteration space of the example in Figure 4.1 .a4.2.3 Input Dependence An input dependence from iteration i to iteration j exists inan n-nested loop if i j and i uses a variable that is also used in j. If replication ofdata is allowed, then input dependence should be ignored while partitioning the iterationspace because it does not limit parallelism nor does it add any communication overhead.On the other hand, if replication of data is prohibited, then input dependence introducescommunication overhead if both iterations i and j are executed on different processors.In this case, input dependence vectors should be included in D’. Similarly to anti-dependence, input dependence should have a lesser weight than flow dependence whenpartitioning the iteration space.We consider the example in Figure 4.2.a taken from [29]. There are six inputdependences in the loopnest. If replication of data is allowed, then the loop can befully executed in parallel without any communication overhead. If data replication is not/1 iNpermitted, then the partitiomng matrix can be chosen to be D= c\l if Note thatthe other four input dependencies can be written as linear combination of the vectorsin D and therefore are not included in the partitioning matrix. We have Idet (D) I = 2.Hence, the iteration space is divided into two independent sets [27]. Each independentset is further partitioned into tiles as described in section 3.2.3. Figure 4.2.b showsthe independent tile partitioning for one of the two independent sets (solid arrows).Only one subset from the other independent set is shown (dotted arrows). In [29], theproposed partitions are row, column, and block partitions which are all different from69Chapter 4Do I=1,NDo J = 1,NA[I, J] =B[I—1, J]+B[I, J—1] +B[I+1, J]B[I, J+1]Endos(a) Example with Input Dependencies Only.(b) Independent Tile Partitioning for D= ).(C) Zigzag Partitioning: an Equivalent of Independent Tile Partitioning for the Example in (a)Figure 4.2. Independent Tiles Partitioning for Input Dependencies.70chapter 4Figure 4.3. Execution Time Improvement of Zigzag Partitioning overColumn and Block Partitioning on the MasPar Machine (N=48).the one proposed here. In order to compare the performance of column partitioning andindependent tile partitioning, the independent tile partitioning of the loopnest in Figure4.2.a is done according to Figure 4.2.c (solid lines represent one independent set whilethe dotted lines represent the other set). This zigzag partitioning is chosen over theoriginal independent tile partitioning represented in Figure 4.2.b because it allows a moreobjective comparison between column partitioning and independent tiles partitioning. Infact, every process in both partitions have the same number of iterations (i.e a columnis equivalent to a zigzag line).——IExecution Time Improvement (%)I I Izigzag/colI I;iza,block1.00-0.95 —0.90 —0.85 —0.80—0.75—0.70—I I4 10 20 30I II40 48# of PBs used71Chapter 4Figure 4.3 shows the execution time improvement of the zigzag partitioning vs thenumber of processors used. The y-axis for the solid plot represents the zigzag partitioningexecution time divided by the column partitioning execution time while the y-axis forthe dashed plot represents the zigzag partitioning execution time divided by the blockpartitioning execution time. The figure shows that the execution time improvement due tozigzag partitioning varies between 6% and 25%. Note that the zigzag partitioning resultsin only half the communication needed by column partitioning. However, it requires amore complex code generation. It can be deduced from Figure 4.3 that, on the MasParmachine, the benefits of reducing communication outweigh the disadvantages of complexcode generation. We expect an even higher improvement for other distributed memorymachines given that firstly the MasPar system has a low communication overhead andsecondly the penalty of complex code generation is much higher for SIMD machinesthan it is for MIMD machines. Note that when the number of PEs used goes from 24 to48, the execution time improvement decreases instead of continuing to increase. WhenN = 48, every column is executed on one processor, resulting in a quite simple codegeneration for column partitioning that cancels most of the execution time improvementof the zigzag partitioning due to reducing communication.4.2.4 Output Dependence An output dependence from iteration i to iteration j existsin an n-nested loop if i j and i defines a variable that is also defined in j. Wedistinguish between two cases of output dependencies:1. Indirect output dependence: The output dependence is due to a combination of flowf1ow. anti —and/or anti-dependence (ex: i —* j because of variable a, j k because of the72chapter 4same variable a, then i ouut k). In this case, the output dependence can be ignoredwhen partitioning the iteration space since the other dependencies that caused it areincluded in D’.2. Direct output dependence: A variable in a loop is defined in iteration i, then definedagain in a later iteration j without being used in any intermediate iteration betweeni and j. Uniform dependencies of this type are atypical in scientific programmingand consequently are ignored.4.2.5 Control Dependence A control dependence from iteration i to iteration j existsin an n-nested loop if i j and i defines a variable that controls the execution ofj. or some code in j For simplicity, control dependence is considered as equivalentto flow dependence.4.3 Forming the Global Iteration Space In the previous chapter, an efficient partitioning of single loopnests with uniform dependencies is presented. However, partitioningeach loopnest in a program separately may not result in an efficient execution of the program, especially for multicomputers, where communication overhead due to across ioopdata dependence can be costly. In this section, a new method for partitioning the iterationspace of a segment is presented, where a segment is defined to be an interval of codefrom the program with multiple ioops (scalar code can be considered as a loop witha single iteration). Partitioning the iteration space of a segment into tiles of iterationsand distributing the data into the subsets is performed only once before the start of thesegment execution and can not be redone inside the segment. Dividing a program into73Chapter 4segments of code to minimize the overall execution of the program is beyond the scope ofthis chapter and needs to be investigated. In the meantime, a segment can be consideredto include the whole program. Before presenting segment partitioning in the followingsection, there are three issues that need to be addressed: 1) forming the global iterationspace for the segment, 2) finding across loops data dependence, and 3) mapping localindexes of each ioop into the global iteration space. These issues are briefly introducedin this section but are discussed in details in Chapter 5.4.3.1 Finding the Domain When every loopnest is considered separately, minimizingthe execution time of every loopnest in a segment does not necessarily lead to an efficientexecution of the segment because of communication across the loops. Therefore, a globaliteration space for the segment called the segment domain needs to be found so thatpartitioning of iteration space and data distribution is done for the whole segment ratherthan for each ioop separately.For a segment with k loopnests, the corresponding domain is the union of the localkiteration space of each loopnest considered separately: IS= U IS1, where IS is thedomain and IS1 is the iteration space of lOOj, the ill? loopnest in the segment. We definethe iteration space dimension of a loopnest to be the maximal rank of any array definedinside the loop (section 5.2). As an example, if A[I, J] is the only array defined in a3—nested loop, then the iteration space dimension of that loop is 2. If A[I, J, I + J} isthe only array defined in a 3—nested ioop, then the dimension of iteration space of thatloop is also 2.74chapter 44.3.2 Finding Across Loop Data Dependencies In the previous subsection, a domain(global iteration space) for the segment is introduced. Given that one iteration from IScan include iterations from any IS, then across ioop data dependence should also befound and be used in partitioning the domain.Traditional data dependence analysis can be used firstly to investigate the existenceof across ioop dependencies, and secondly to identify their type by finding the directionvectors or the distance vector [9, 14]. However, this is usually achieved for a dependencewhere both the source and destination are single iterations. In this subsection, a moregeneral form of data dependence called hyperplane dependence is introduced. It is adependence where the source and destination are subspaces of the iteration space. Itshould be noticed that data dependencies whose source or destination are subspaces ofdimension larger than zero are common mainly across loops.In an n-dimensional domain, a hyperplane dependence is defined as a dependencefrom (s(11,12, ,‘m),Sdir) to (d(11,12 . ,-[m),1dir) where (11,12, ,-[m) andd(11,12,, -Im) are respectively the origins of the subspace source and destination while3dir = (S1,S2,• , S8) and ddir = (Di, D2,. . , D) are the set of direction vectors(basis) that generate the subspace source and destination (rn n, s n and d n).It should be noticed that a traditional uniform dependence corresponds to a hyperplanedependence, where 3djr = ddir = (O, ..., ) and d( Ii, ‘2, , im) s(1i, ‘2, , [rn)is a constant. In the code of Figure 4.4.a, two hyperplane dependencies can be found.The one from S1 to S3 is a dependence from ((1,0), ((0, 1))) to ((1,0), ((0, 1))) (i.e.there a dependence from column I to column I) while the other, from S2 to S3, is a75Chapter 4dependence from ((0, J), ((1,0))) to ((0, J), ((1,0))) (i.e. there a dependence from rowJ to row J). An algorithm that identifies the source and destination of a hyperplanedependence is included in Chapter 5.4.3.3 Mapping Loop Indexes Forming the domain is highly dependent on mappingthe indexes of the loopnests onto the dimensions of the domain. A lexical mappingcan be initially used to compute the domain and find across loop data dependencies.A lexical mapping implies that the th dimension of any loopnest is mapped to the i’dimension of the domain. Given that the cost associated with across loop dependenciesvaries considerably with the index mapping, an improvement over lexical index mappingis investigated in the next chapter to reduce the overall cost of data dependence in thesegment.4.4 Global Iteration Space Partitioning In the previous partitioning methods such assupernode partitioning [30, 32] and the independent tile partitioning presented in theprevious chapter, only uniform dependence distance vectors are used to partition theiteration space. In addition, iteration space partitioning is usually presented for singleloopnests. Determining an efficient partitioning for each loopnest separately does notguaranty an efficient execution of the segment of code especially for distributed memorymachines, where across loop data dependence may result in a considerable communicationcost. Other partitioning and data distribution methods that analyze the whole programsuch as the approaches in [57, 58, 63, 29, 59—61, 31] are not general enough becauseonly a few predetermined partitions (such as column, row, and block partitions) are76chapter 4allowed. With the introduction of hyperplane dependence in section 4.3 as a new formof expressing data dependencies, across loop as well as intra loop data dependence areused to partition the domain of a segment.A traditional uniform dependence is represented by its distance vector. This vector isused in the partitioning by virtue of being included in the dependence matrix D’. However, a hyperplane dependence from (s(Ii, ‘2,” ,Im), 9djr) to (d(I1,12, . . , Im), ddir)can be represented by three vectors namely:i. 0 = d(Ii, I2, , Im) — 8(-Ti, ‘2, ,im), the distance vector between the subspacedestination origin and subspace source origin.11. Si = (Si, 52,— . , Se), the set of direction vectors of the subspace source.iii. dd = (D1,D2,• . . , D), the set of direction vectors of the subspace destination.Among the vectors (0, S1,S2,” , S, D1,D2,—. , Dj), only the uniform ones should beused when partitioning the iteration space and are therefore included in the dependencematrix D’. Note that the partitioning vectors due to an across loop dependence shouldbe given less weight than the partitioning vectors of an intra loop dependence becausethe latter dependence limits parallelism more than the former.In the remaining of this section, two examples are considered to show the benefitsof identifying hyperplane data dependencies on partitioning the global space. The matrixmultiplication code of Figure 4.4.a is investigated first. Without hyperplane dependenceanalysis, partitioning methods in [30, 32] and our independent tile partitioning in theprevious chapter may use a column partition, a row partition, or any other partitiongiven that no traditional uniform dependence can be found in both loopnests of the77Chapter 4example. However, with hyperplane dependence analysis, two hyperplane dependenciescan be recognized. The one from S1 to S3 is a dependence from ((1, 0), ((0, 1))) to((I, 0), ((0, 1))) (a dependence from column I to colunm I) while the other, from S2 toS3, is a dependence from ((0, J), ((1,0))) to ((0, J), ((1,0))) (a dependence from row Jto row J). As a result, the vectors (0, 1) and (1,0) are used to guide the partitioningprocess. Note that the 0 vector for both dependences is the zero vector and hence isignored. We have D = D’= ( ). Partitioning the iteration space according toD using independent tile partitioning (which is equivalent to supemode partitioning [30,32] in this case) leads to a block partitioning.Figure 4.4.b shows the execution time on the MasPar machine versus the machinesize for block partitioning as well as column partitioning for the code in Figure 4.4.a. Theplots clearly indicate that block partitioning is superior to column partitioning especiallywhen a large number of PEs are used (ex. for 48 PEs, the execution time of thecode using a block partition is 3 times faster than the one using a column partition).Given the low communication cost on the MasPar, we would expect an even betterimprovement on distributed memory MJMD machines, where communication overhead isusually expensive. In fact, for an N2 iteration space and p PBs available is assumedto be an integer), the execution of the code in Figure 4.4.a using block partitioningresults in every PE receiving data of size N2/p elements (N2/p is also assumed to bean integer) from 2 (/ — 1) different PBs. On the other hand, in column partitioningexecution, every PB receives data of size N2/p from p — 1 PBs. Given that 2(/— i)is much smaller than p — 1 (especially for large p), then block partitioning outperforms78chapter 4Do I =1,NDo J=1,NSi: A[I,J] 4*1 + 5*(J+i)S2: B[I,J] = 2*1 — 3*(J+i)EndosDo I =i,NDo J=i,NDo K=i,NS3: C{I,J]Endos18.0016.0014.0012.0010.008.006.004.002.000.00=C[I,J] ±A[I,K]*B[K,J](a) A Matrix Multiplication Example.Execution TimeI Column Partitioningftiti12-— Number of PEs0 10 20 30 40 50(b) Execution Time vs Machine Size for Block and Column Partitioning (N=48).-IFigure 4.4. Benefits of Using Hyperplane Dependence hiformation on Partitioning.79Chapter 4column partitioning for the code in Figure 4.4.a.The second example of this section can be found in Figure 4.5.a, where twohyperplane dependencies exist in the code. The dependence from Si to 54 is a dependencefrom ((I — 2,0), ((0, 1))) to ((1,0), ((0, 1))) while the one from S2 to 54 is a dependencefrom ((I + 2,0), ((0, 1))) to ((1,0), ((0, 1))) (i.e. there is a dependence from everycolumn to its second neighbor on the left and its second neighbor on the right). Hence,the first dependence is represented by the vectors (2,0) and (0, 1) while the seconddependence is represented by the vectors (—2,0) and (0, 1). This results in a partitioning/‘20’\ .matrix D equal to Figure 4.5.b shows mdependent tile partitiomng for theabove partitioning matrix D when size1 = 2 and size = 3. Note that the partitioningof only one of the two independent sets is shown in the figure. Figure 4.5.c shows theexecution time of the segment on the MasPar machine vs the dependence factor, wherethe dependence factor df is again defined to be equal to The first two plots in Figure4.5.c respectively shows the dependence factor when size2 = 1 and size2 = 2. The latterplot lies entirely above the former one. As a result, the plot with size2 = 1 should beused to determine the size of the optimal partition provided that enough processors areavailable. For a smaller machine size, another appropriate value of size2 should be used.The third plot in Figure 4.5.c shows the block partition (i.e D= ( )) executiontime when size2 = 2. This partition is selected to compare its performance withindependent tiles partitioning given that it is one of the popular predetermined partitionsallowed by parallelizing compilers [57, 29, 64, 59-6 1]. It can be deduced from thesecond and third plot that independent tile partitioning (enhanced by across loop data80chapter 4dependence analysis) outperforms block partitioning for the example in Figure 4.5.a.The improvement of our method over block partitioning is between 30% and 40% whena large number of processors are used. Both plots get closer to each other when smallernumber of processors are used because, in that case, the execution time is dominated bythe computation part which is the same for both methods.4.5 Conclusion In this chapter, the type of dependencies that should be used whenpartitioning the iteration space are discussed. While less important than flow dependence,anti and input dependence information should be used in the partitioning process. Inaddition, this chapter applies our independent tile partitioning (of the previous chapter)to the partitioning of the global iteration space. Because of across loop data dependencies,finding the best partition for each ioop separately is not sufficient. In our approach, aglobal iteration space for a segment of code is formed so that the partitioning is performedfor the global space using the traditional uniform dependencies as well as the cross ioopdata dependencies. This has been made possible by the new expression of dependenciesas hyperplane data dependencies. As shown through some examples, our partitioningoutperforms other partitioning of multiple loopnests which allow only some predefinedpartitioning strategies such as block, column, and row partitioning. A future work forthis chapter includes heuristically dividing a program into segments of code so that theoverall execution time of the program is minimized.81Chapter 4Do I =i,N1Do J=i,N2Si: A[I,J] =S2: B[I,J] =S3: C[I,J] =EndosDo I =3,N3Do J=1,N4—2temp = C[I,J1Do K=i,N5S4 : C[I,J] = C[I,JJ+temp/(A[I_2,K]*B[I+2,K])Endos(a) Example with Two Hyperplane Dependencies.• fl.(b) Independent Tile Partitioning for the Segment in (a).Figure 4.5. Independent Tile Partitioning Using HyperplaneDependence Information. (Continued) .82chapter 4Figure 4.5. Independent Tile Partitioning Using Hyperplane Dependence Information.size2 12size2 = 23size2 — 2, co’umn Part.I,ii1,I,‘I‘II,(1I,I,I,I,Execution Time1.000.90—0.80—0.70 —0.60 —0.50—0.40—0.30 —0.20 —0.10—I I I I0 10 20 30 40 50(c) Execution Time vs Dependence Factor for the Partitioning in Figure 5.b (N1=N2=...=N5=48).Dependence Factor83Chapter 5Across Loop Data Dependence Analysis5.1 Introduction Traditional dependence analysis techniques usually attempt to recognize the existence of dependencies between iterations of a loop and, in some cases,characterize these dependencies by finding direction vectors or distance vectors [10, 9,15, 14]. However, this characterization is usually achieved for a dependence where boththe source and destination are single iterations. In the previous chapter, we expressedthe need for a more general form of data dependence (called hyperplane dependence)whose source and destination are subspaces of the iteration space. This dependence formcan be useful mainly for expressing dependencies across loopnests, and consequently tobetter understand the interaction between loopnests. Hyperplane dependencies are calledso because it is fairly common in a 2D iteration space to have dependencies where thesource and/or the destination are hyperplanes (i.e. lines). Hyperplane dependence is usedin the previous chapter to improve the partitioning of the global iteration space. However,chapter 4 does not indicate how these dependencies can be found which is the concernof this chapter. In addition to its use in partitioning the iteration space, this chaptershows how hyperplane dependence analysis can be used to improve automatic generationof communication statements across ioops and index aligmnent for n-dimensional gridtarget machines.Identifying across ioop data dependencies can be beneficial especially for parallelizing code on multicomputers where a dependence may result in costly communication84chapter 5between processors. In this chapter, automatic generation of communication statementsacross loops and index alignment are improved by using hyperplane dependence analysis.Generating communication statements can be simple if asynchronous communication isused. However, recognizing hyperplane dependencies in a program can reveal communication patterns that lead to organized communication mechanisms which, as shown inthis paper, outperforms the asynchronous communication mechanism. Li and Chen [37]generate communication by matching the desired communication with the best availablecommunication routine provided by the system. However, a perfect match can not alwaysbe found and unnecessary communication may be introduced. Our approach consists ofgenerating more customized communication statements based on the precise informationobtained from hyperplane dependencies. In addition to communication generation, hyperplane dependence analysis can be used to efficiently map the indices of a loopnest intothe target machine dimensions (index alignment). An appropriate mapping can decreasethe communication cost in a program given that hyperplane dependencies, the cause ofcommunication in a program, are dependent on index alignment.5.2 Global Iteration Space Advances in compiling technology for parallel systems isshifting the complexity of parallel programming from the user to the compiler. In fact,tasks such as automatic generation of communication statements, loop partitioning, anddata distribution can be performed by the compiler. Most of these compile time supportsare quite effective when a single loopnest is considered [35, 9, 17—19, 14]. However,applying single loop techniques to every loopnest in program separately may not generatean efficient execution of the program because of the communication overhead due to85Chapter 5across ioop data dependencies. In order to be able to represent these dependencies andbetter understand the interaction between the loops, we propose to build a global iterationspace for the segment’3of code to be analyzed, where every iteration in the iteration spaceof a loopnest from the segment is mapped to an iteration in the global space. Once thisis achieved, then data dependencies across the loops can be found and examined in orderto improve the execution of the segment. The benefits of using across ioop dependenceinformation on automatic generation of communication statements and index alignmentcan be found in the remaining sections (in addition to its use in partitioning the iterationspace shown in the previous chapter).Before considering the global iteration space (GIS), we define the local iterationspace (LIS) dimension of a loopnest to be the maximal rank of any array defined insidethe loop. The rank of an array A is defined to be the rank of its corresponding matrixMA. For an rn-dimensional array A defined inside an n-nested loop, M (i.e. the entryin row i and column j of matrix MA) contains the coefficient of the fth loop index in theth dimension of array A. Assuming that one iteration constitutes one indivisible process(i.e instruction level parallelism is of no concern in this dissertation), then the abovedefinition of the US dimension sets a reasonable upper bound on the parallelism thatcan be extracted from the loop by limiting the number of iterations (i.e processes) in theLIS. This is done by ignoring some of the loop indices when forming the LIS (seeexamples below). Note that if m indices (called Ii, , -Tm) are used to form the LIS13 As mentioned in Chapter 4, instead of analyzing the whole program at once, it is sometimes more efficient to divide the programinto segments of code (also called phase [65, 66]), where each segment is analyzed separately. Efficiently dividing a program intosegments is beyond the scope of this dissertation and is an area of future research. In the meantime, a segment can be considered toinclude the whole program.86chapter 5of an n-nested ioop (rn n), then the code in iteration (Ii, 12,”, -Irn) is executed byiteration (or process) (I,, 12,” , Im). Using the owner-computes rule [36], it assumedthat any variable defined in iteration (‘i ‘2,’ , im) is owned by iteration (1,, J2,” , Im)(or more precisely by the processor that executes (Ii,-[2,” , Im)).In order to justify the above LIS dimension definition and show how an US can becomputed, we consider the examples14in Figure 5.1. Using the US dimension definitionto the code in Figure 5.1.a, it can be concluded that the LIS dimension of the ioop istwo (corresponding matrix of the defined array A is 11A= ( Index I and Jare used as subscripts to define array A. Hence, they should be used to form the USof the ioop and ultimately be mapped to the target machine dimensions (process/iteration(i, i) defines and owns iteration A[i, j]). However, index K is only used on the righthand side of statement Si to compute serially an instance of A. As a result, there isno benefit in creating another dimension of processes if these processes would be serialalong that dimension’5.Therefore, we have US = [1..N,] x [l..N2] In Figure 5.l.b,/0 1\the LIS dimension of the loop is also two (MA = (1 0 1) Even though A is a 3D\1 1)array, the maximal number of processes that can be created depends on two indexes only(I and J). Thus, we have LIS = [1..N,j x [1..N2] Note that A[i,j, k] is executed andowned by iteration (j, i) provided that k = i + j.Applying the iteration space definition to Figure 5.1.c, it can be deduced that theUS dimension is two. Even though three indexes are used to define the arrays inside14 For simplicity, we assume that all loops in this chapter are normalized (i.e they have a step of 1).15 Traditional uniform dependencies such as the one in the example Do 1,3 {A[I,3j=A[I,J-1]+B[1,J]} are not used to simplify theformation of the US given that they can be used subsequently in partitioning the GIS once all dependencies in the code are found.As a result, the US dimension of this footnote example is also 2.87Chapter 5Do I = 1,NDo J = 1,N2Do K = 1,N3Si: A[I,J]=A[I,J]+B[I,K]*C[K,J]Endos(a) ExamplelDo I = 1,NDo J = 1N2Do K = 1,N3Si: A[I,J]S2: B[I,K] =Endos(c) Example3Do 1= 1,NDo J = 1,N2Si: A[J,I,I+J]Endos(b) Example2Dol = 1,NDo J = 1,N2Do K = 1,N3Si: A[I + K, J + K] =Endos(d) Example4Figure 5.1. Loops with 2D Local Iteration Spaces.the ioop, these defined arrays are 2D arrays. If a 3D US is formed and mapped toa 3D target machine, then that may result in a high communication overhead betweenthe iterations. In fact, real examples resembling the code in Figure 5.1 .c have either Susing various values computed by S2 and/or the opposite. Given that A[i,j] and B[i, k]88chapter 5are respectively defined in (i, i) and (i, k), then index I forms one dimension of the 2DLIS while J and K are both used to form the second dimension. In this case, we haveUS = [1..N] x [1.. max (N2,N3)]. Finally, the LIS of the loop in Figure 5.1.d isalso of dimension two (MA= ( This example differs from the previousone by the fact that all three indexes are used to define A which is a 2D array. Giventhat A needs only two indexes to be defined, one of the three indexes is ignored byvirtue of not being used as a dimension in US. We choose to ignore the innermostindex. However, this index is used to adjust the size of dimension I and dimension Jwhich results in LIS = [2..N1 + N3] x [2. .N2 + N3]. Note that a traditional iteration(i, j, k) from the code in Figure 5.l.d is mapped to iteration (i1,i2) in our LIS, where(i1,i2) = (i + i i + k). This iteration (i1,i2) defines and owns A[ii, i2].Four examples were used in this section to illustrate the computation of the US.Once the LISs of all loopnests in a segment are found, then the global iteration space(GIS) of the segment is defined as the union of all UISs. For a segment with k loopnests,kwe have GIS= U LIS, where US2 is the local iteration space of lOOj, the ill? loopnesti=1in the segment. Note that the definition of GIS implies that the dimensions of an LIS aremapped lexically to the dimensions of the GIS (i.e thejt’ dimension in an US is mappedto the jt dimension of the GIS). However, this lexical mapping is only temporary sincesection 5.4.2 discusses remapping the dimensions of loopnests to decrease the cost ofhyperplane dependencies.5.3 Hyperplane Data Dependence Analysis Across loop dependencies can not berepresented unless a common iteration space for all loops in a segment is found. For this89Chapter 5reason, the previous section introduces the global iteration space as the union of localiteration spaces. As a result, traditional data dependence analysis can be used firstly toinvestigate the existence of across loop dependencies, and secondly to characterize themby finding the distance vectors [9, 17, 14]. However, traditional dependence analysisonly characterizes dependencies from a single source iteration to a single destinationiteration. In this section, a more general form of expressing data dependence calledhyperplane dependence is introduced. It is a dependence whose source and destinationare subspaces’6of the iteration space. These data dependencies with source or destinationthat are subspaces of dimensions larger than zero are common mainly across loops andin non-perfectly nested ioops. Only the former loops are considered in the remainderof this chapter.In an n-dimensional global iteration space, a hyperplane dependence is definedas a dependence from [So, .s(1,.] to [d0,dd], where .s0 and d0 are respectively theorigins of the subspace source and destination while 8djr (Si, S2, , S) and ddjr =(D1,D2,• , D) are the set of direction vectors (basis) that generate the subspace sourceand destination (s n and d n). Note that a dependence from all iterations in row Ito all iterations in column I can be expressed as a dependence from [se, Sdjr] to [d0, ddir],where ,s = (Ii,O), 3dir = (S = (0,1)), d0 = (0,Ii), and dd = (D1 = (1,0))17. Notethat a traditional uniform dependence corresponds to a hyperplane dependence, where3dir = ddir= ( , , ) and s — d0 is a constant. It should also be noticed that our16 in this chapter, we define a subspace to be a set of iterations from the iteration space characterized by an origin o and n linearlyindependent vectors (b1,2. ,b,) such that every iteration in this subspace can be written as o + a1b + n2b + + anbnwhere (ai, n2,.,n,) E Z” (n is the dimension of the subspace).17 in this chapter, the first coordinate in a vector refers to the vertical index while the second coordinate indicates the horizontalindex.90chapter 5hyperplane data dependence analysis does not improve over recent works in assertingthe existence of dependencies in the code [16, 19, 5, 6] but improves the representationof dependencies across the loops so that the interaction between loops can be betterunderstood. In fact, it is used in the following section to improve automatic generationof communication statements across loops and index alignment for n-dimensional gridmulticomputer. In addition, it has been used in Chapter 4 to improve computationdecomposition.We consider the loop model in Figure 5.2.a, where f and g are affine functions (f(I1,I2,”.,I,) = a0 + a:iIi + a21 + + andg3(I1,I2,. ,iq) =b,0 + b1I + b221 + + bjqlq for 1 i v, where v is the dimension of arrayA, p is the number of indexes in lOOPs (the source loop), and q is the number of indexes in 1QQPd (the destination loop)). We call p’ the dimension of US3 and q’ thedimension of UISd. Without any loss of generality, it is assumed that the outer p’ loopsof lOOPs constitute LIS3, whereas the outer q’ loops of loopd form LISd. Dependencies are found on the basis that the code whose index is (Ii, 12,•’ .,I, 1)in lOO and (i1, 12,” ,1rq’1q’+1’” , Iq) in 1OPd is executed respectively by iterations (li, 12, . . , ii) and (hi 12,” , Iqi). Moreover, using the owner-computes rule,any variable defined in iteration (ii,I2,• . , i,i) is owned by iteration (Ii, ‘2, , in,).Therefore, the definition of A[fi(11,12,”,In),f2(I1,1 ..,1,),. ., f(Ji, ‘2, Jp)]in lOOPs of Figure 5.2.a is executed and owned by iteration (Ii, ‘2,” . , while thevariable A[gi(Ii,12,.. ,Iq),g2(I1,I2, ,g(I1I2... ,Iq)] is used in iteration(i1, 12, • , Iqi). It should be remembered that in order to determine the data depen91Chapter 5Do 4 — Ls1,Us1Do12Ls,U82DoILdUdS1 Afg1(j,2‘P)g2(J1,I2 ,J) g(J1j2 ,J)JEndosDO rLd1,UDoI2Ld2,udDoJpLdudS2•..Afg1(j, 4),g2(11,i ‘c) (Ii, 4,Endos(a) Loop Model with a Depende<Do I 1, N1Do J 1 N2S1DOI1,N3DoDO It1,NS2 BfIKJAfj,÷J(b) Exanp1e of a Rypeipi DepenFigure 5.2. Code with Hyperp1 Data Dependence92chapter 5dencies, loop indexes used to form an LIS are mapped lexically to the GIS (which inturn will ultimately be mapped to target machine). This mapping is only temporary andis needed to find across loop dependencies.An algorithm that identifies the source and destination of a hyperplane dependencebetween lOOPs and loopd is presented in this section. However, before doing so and inorder to ease the understanding of the algorithm, we sketch its description by applying itto the code in Figure 5.2.b. Index I and J in the first loopnest (source loopnest) form LIS8and are respectively mapped to Ii and ‘2. I and K in the second loopnest (destinationloopnest) forms LISd and are respectively mapped to I. and 12 while J is not mapped toany dimension. So far, our algorithm does not include loop bound information. However,techniques from [16, 19, 5, 6] can be used to include this boundary information in ourhyperplane data dependence representation. The following steps determine the source anddestination of the hyperplane dependence in Figure 5.2.b (a general and brief descriptionof each step is presented first before applying it to the example in Figure 5.2.b):• Determine the iterations that own the instances of the arrays defined in loop8.In our example, A[ij, i2] is owned by iteration (i2,i1).• Find any equality relationship between the indices used as subscripts of the sourcearray (such as first subscript + second subscript = third subscript). Then, apply theserelationships to subscripts of the destination array.In the code of Figure 5.2.b, no such relationship is found. However, if S1 in ourexample consists of defining A [J, Ii] instead of A [‘2, Ii], then a dependence fromS1 to 52 exist only if destination indices satisfies J = I + J — 2 (i.e I = 2).93Chapter 5• Find the destination iteration and express it as a sum of tuples, where each tupleincludes one index only.In the second ioop of our example, A[J, I + J — 2] is needed in iteration (I1, ‘2)which is the destination. It is then written as (Li, 12) = (Ii, 0) + (0, ‘2)• Based on the first step, determine the source iteration and express it as a sum oftuples, where each tuple includes one index only.In our example, A[J, Ii + J — 2] is defined by the first ioop in iteration (Ii + J — 2, J) which is the source iteration. It is then written as(Ii + J— 2,J) = (I,O) + (J,J) + (—2,0)• Classify the indices in the source and destination iterations into common indicesand free indices. Common indices are used in both the source and destinationiterations while free indices are used to express either the source iterations onlyor the destination iterations only. Common indices are used to form the origin of thesource and destination subspaces while the free indices are used to form the directionvectors of the source and destination subspaces.In our example, there is a dependence from (Ii + J — 2, J) to (Ii, 12). I appearsin both tuples and is therefore a common index while 12 and J are free indices.• Determine o and d0 by respectively adding all the tuples with common indices in thesource and adding all the tuples with common indices in the destination. Determinethe subspace source and destination direction vectors by removing the free indicesfrom their corresponding tuples. Express the hyperplane dependence as a dependencefrom [8o,3d] tO [dO,ddir].94chapter 5In our example, we have s0= (Ii — 2, 0), Sd:r = ((1, 1)), d0 = (Ii, 0), afld dd =((0, 1)). This is a dependence from diagonal lines to horizontal lines (Figure 3 .c).In the remainder of this section, an algorithm that identifies the source and destination of a hyperplane flow dependence between loops and loOpd’8 is presented (looptransformation techniques such as index changes, loop fission, etc. should be used tomake the subscript functions of defined arrays as simple as possible):1. Determine the maximum number of linearly independent vectors from the set of vectors v such that vj = (aii,ai2,... ,ai) for 1 v.The ft functions whose corresponding v vector is included in the linearly independent set are called deterministic functions while the remaining function are calledrestrictive functions. The former functions are used to identify the iteration(s)(ii, 12, . . . that define A[i1,2•, iv]. The restrictive functions (if any) restricts the element involved in the dependence by establishing a relationship betweenthe indexes II, ‘2,’ , I, (see step 3). As examples, the deterministic functions inA[11,12,11 + 12] arefi andf2, in A[12,12, I] arefi andf3, and in A[11 + 12,11 + 12]isfi. We assume that there are v” deterministic functions and v-v1 restrictive functions.Without loss of generality, we assume that the first v’ f functions are deterministic.2. From the deterministic functions, derive the f functions such thatf(ai,a2,... ,a’) = I for 1 i p’, where a3 = f3(I1,12,•.,I,) forall 1 j < v’.18 1oop comes lexically before loopd. In addition, it is assumed that the execution order of both loops is equivalent to theirlexical order (i.e there is no jump statements that makes some iterations of lOOpd execute before some iterations of loops). Only flowdependencies are considered here. Other type of dependencies can be found similarly. In addition, and for sake of simplicity, weignore any control flow except for the loops themselves. We also assume the definition of any other array inside both loops does notchange the formation of LISA and LISd.95Chapter 5The f functions are used in step 5 to determine the iteration(s) that own a definitionof an instance of array A. Finding the f functions is equivalent to solving anon-homogeneous system with v’ equations of p unknowns (note that the indexesto I, are considered as constant when solving the equations because they arenot part of IS). Therefore, there are p’ — v’ free variables (‘‘+ to in’). The firstv’ functions (f) result from solving the equations and are expressed in terms of a3(i <j and the p’ — v’ free variables. The remaining (‘ — functions f(unrelated to the non-homogeneous system of equations) are expressed as a functionof one of the free variables (i.e f$’(ai, a2,• , ai) = I for v’ + 1 i p’)19•3. From the restrictive functions, determine a linear relationship (if any) between theindexes (ii, 12, . , Iqi) of the destination loop by solving the equationsf;(g1(I1,I2,...,Iq),g2(Ia,I2,...,Iq),...,gi(I,l2,...,Iq)),f1(gl(I1,12,...,Iq),g2(I1,12,...,Iq),...,gI(I1,12,...,Iq)) )= gj(Ii, I, Iq)for all V1 + 1 < i < v.Similarly to the previous step, indexes to‘q are considered as constant whensolving for the above equations. These equations result in two cases:19 It is easier to describe step 2 if p’ = v’ (eveiy iteration defines one element of A). However, given that the iteration spaceshould accommodate any array defined inside the loopnest, one element of A can be similarly defined by many iterations (u’ <p’).96chapter 5• Unsolvable equations —> There is no dependence. Exit.• Solvable Equations —> Find a linear relationship between the indexes. Withoutloss of generality, we assume that the last q’—q” indexes can be expressedas linear combinations of the first q” indexes (I liz (ii, 12, , Iqi) forq” + 1 i q1).4. Write the destination tuple(I1,I2,•,IqI)=(I1,I2,_,Iqn,hqn+l(I1,I2,,Iqfl),hqIl+9 (ii, 12, Iqii), . hqi (ii, 12,• Ii))asI,. + (c1c2,.. •,cqi) (5.1)where the second term in (5.1) is the constant tuple.5. write the source tuple(f(g1(11,I2,.,Iq),g2(I1,I2,..,Iq),...,g1(Ia,I2,...,Iq)),f(g1(li,I2,,Iq),g2(I1,I2,.,Iq),,g1(I1,I2,,Iq)),f, (ga(Ii,I,. . , Iq), g2(I1, 12, ,Iq), . ,g’(Ii, 12,• , Iq))) (5.2)asm=mx (p,q)( + .. (5.3)97Chapter 5after substituting I in (5.2) with h (I1 ‘2,• , for q” + 1 i q’. The lastterm in (5.3) represents the constant tuple.6. Separate the indexes used in (5.1) and (5.3) into common indexes and free indexes.An index that exists in both (5.1) and (5.3) is called a common index. Otherwise, itis called a free index. Without loss of generality, we assume that the first r indexes(Ii, 12,” , Ir) in both loops are the common indexes.7. Create the tuplesS(I1,I2,”,Ir) ( + (clc2,...,cl)d(Ii,I2,,Ir) — (E(di1Iidi2I•dgIIi)) +sar (Sr+1, S+2 Sm) where S = (si, S2, .. for r + 1 < i qddjr = (Dr+i, Dr+2, Dm) where Th (d1,d2,•• ., dii) for r + 1 iS is included in Sdir if S (0, 0,.. . , 0)D is included in ddtr 1ff DThere is a data dependence from the hyperplane (subspace) sourceSj1,I2,• Ir characterized by (s (Ii, ‘2,” , I,.), ‘ dir) to the hyperplane destination D11,2.. .,-,. identifiedby (d(11,12.• where s(I1,12.“,Jr) and d(11,12.. ,I) respectivelyrepresent the origin of the hyperplane source and hyperplane destination while Sdjrand ddjr are the set of direction (basis) vectors of the subspace source and subspacedestination. Given that the origins of the source and destination vary with the indexes 11,12, ir, then mm IUs1 — Ls1, Ud — Ld1() is an upper boundon the number of hyperplane dependencies in the ioop. The last expression does not98chapter 5represent the exact number of dependencies because some of the hyperplanes (eitherthe source or the destination) may be empty.The new expression of dependencies as hyperplanes is a generalization of traditionaldata dependence expressions. For example, a traditional uniform dependence correspondsin our case to a dependence from S11,2..., to D11,2..., where sdjr ddir =and d(11,I2, .,I) — s (Ii, I2, .,I) is independent of any index. This new form ofexpressing dependencies is best suitable for expressing the interaction between differentloopnests in a program which is important especially for multicomputers. Even thoughfinding the hyperplane source and destination may seem complex, using the hyperplanedependence algorithm on real applications is normally straightforward (see exampleshown before the algorithm). The complexity of the above algorithm is determinedby the first step which is the most expensive step and is 0 (p’v2).5.4 Benefits of Hyperplane Dependence Analysis In the previous section, a hyperplane dependence is defined as a dependence where the source and the destination aresubspaces of the global iteration space. Recognizing such dependencies can be usedby either a parallelizing compiler or by the user to transform serial code into parallelcode. In this section, we show the benefits of identifying hyperplane dependencies onautomatic generation of communication statements across loops and index alignment forn-dimensional grid multicomputers.5.4.1 Automatic Generation of Communication Statements across Loops Eventhough the use of distributed memory machines is increasing, generating communi99Chapter 5cation statements for these machines remains one of the tedious tasks usually performedby the user. Shifting this task from the user to the compiler can make these machineseven more popular [35, 36, 64, 37, 671. In this section, we present an efficient methodfor automatic generation of communication statements across loops upon detection of ahyperplane data dependence. For this problem, a hyperplane dependence from [se, dir]to [d0, ddjr] in an n-dimensional GIS iteration space is considered, where s, and d0 arethe origins of the hyperplane source and destination while 5djr = (S1,S2, . , S8) anddd = (D1,D2, .. , D,j) are the set of direction vectors of the hyperplane source anddestination C s n and d n). Note that, out of the n indexes that form the GIS,the tuples s and d0 are expressed in function of only m indexes (Ii, ‘2,” , ‘m) calledcommon indexes in the previous section.This new form of expressing dependencies not only provides precise data dependenceinformation, but also unveils communication patterns across the loopnests that can beefficiently implemented as will be shown below. Assuming that an n-dimensional gridtarget machine is available, then communication can be statically generated according tothe following cases20:1. (S,2...,S3)= and (Dl,D2,.,Dd)=a. If so — d0 is a constant tuple, then a traditional uniform dependence exists.Generating communication statements for this case is straightforward.20 For simplicity, communication is described in this section based on the assumption that every iteration is executed on its ownprocessor. For smaller maclime size and other partitioning sirategies, this section can be adjusted by replacing every reference to aniteration by the processor that the iteration is mapped to. This can be done by using a computation decomposition function (similarto the ones in [2]) that returns the corresponding processor for every iteration.100chapter 5b. Else, there exists an irregular communication (i.e a simple communication patterncould not be found). Hence, asynchronous communication can be used here[35—37, 671. This is an expensive operation because of the overhead of trafficcongestion and message coffision.2. If(Si,S,...,)(O, ..,)or (Dl,D2,...,Dd)=(.”,)The intersection set between the subspace source and subspace destination should bedetermined. This can be done by finding A = (ai,a2,.’. ,cS,cS+1,aS+2,” ,CES+d)in terms of (Ii, ‘2,’’’ ,Jm) such thats + = d0 +d(5.4)Note that every value of (Ii, ‘2,’ . , Im) represents one instance of the hyperplanedependence. Based on the outcome of solving (5.4), we have the following threesubcases:a. For every value of (-li, ‘2,’’ , Im). A is unique (i.e unique intersection point).In this case, an instance of the hyperplane dependence is satisfied by a communication procedure that consists of a reduction operation form all the iterations inthe source to the intersection iteration followed by a broadcast operation from theintersection iteration to all the iterations in the destination. Note that for the reduction and broadcast operations, communication occurs in both the positive andnegative direction of the hyperplanes direction vectors (i.e +S, — S (1 i s)for the reduction operation, and +D3,—D(1 j d) for the broadcast operation). It should also be noticed that the reduction for all instances of a hy101Chapter 5perplane dependence can be done fully in parallel (and without any collision)because they all have the same direction vectors. The same can also be saidabout the broadcast operation. In Figure 3.a, there is a hyperplane dependencefrom [SO = (I1,0),Sdir = ((0,1))] to [d0 = (0,I1),dd = ((1,0))] (i.e a dependence from row I to column Ii). This dependence has 6 instances because1 < I 6. For this case, equation (5.4) becomes (Ii, al) = (cr2, Ii) whichresults in A = (ai, cr2) = (Ii, Ii). Hence, A is unique for all 6 instances of thehyperplane dependence (intersection iterations are marked with an x in Figure 3).As a result, communication is achieved for this example by a reduction operationfrom row I to iteration (Ii, J) followed by a broadcast operation from (Ii, Ii)to column I. In Figure 3, the reduction operations are represented by the dottedarrows while the broadcast operations are expressed by the solid arrows. In Figure 3.c, there is a dependence from ((Ii — 2,O),((1, 1))) to ((I,O), ((0,1))) (thisdependence corresponds to the code in Figure 5.2.b). In this example too, thereare 6 instances of the hyperplane dependence. Solving (5.4) for this dependencealso results in a unique intersection point which is (Ii, 2). Figure 3.c illustratesthe communication pattern for this example.b. For every value of (Ii, ‘2,’ , -[m), there is no A that satisfies (5.4). In this case,for every instance of the hyperplane dependence, an intermediate iteration s0from the subspace source and an intermediate iteration d0 from the subspacedestination should be selected. The communication procedure consists of areduction operation from the iterations in the source to s0, a send operation102chapter 5from s0 to d0, and finally a broadcast operation from d0 to the iterations inthe destination. Note that choosing the intermediate iterations that minimizes— s0 is likely to result in an efficient communication procedure. In Figure3.e, there is a hyperplane dependence from ((Ii, 0), (0, 1)) to ((Ii + 2, 0), (0, 1)).Solving (5.4) results in no intersection point between the source and destination.The intermediate points are selected such that so = (Ii, 0) and d0 (-[i + 2,0).The hyperplane dependence of Figure 3.e has 4 instances.c. For every value of (Ii, 12, . , Im), A exists but is not unique. In this case, anyiteration from the intersection set can be selected as the intersection iteration.Then the communication procedure described in case 2.a can be used.Figure 5.4 presents an algorithm for automatic generation of communication statementsfor all the cases described above. The functions reduce and broadcast are dependent onthe system and should take advantage of any of the communication features availablein the machine. Two versions of the algorithm may be needed: one is for the sendingprocessors while the other is for the receiving processors. For other architectures suchas the MasPar machine, only one version is needed because the sending processor(or the receiving processor) can by itself achieve the desired communication. Notethat various communication optimization techniques (such as communication merging,communication placement) [35, 37] are not discussed here but can be integrated in ouralgorithm.4.1.1 Experimental Results The graphs in Figure 3 (3.b, 3.d, and 3.0 show thecommunication time on the MasPar machine vs the problem size (the x-axis represents103Communication Time x 10150.00-140.00-130.00 -120.00 -110.00-100.00-90.00 -80.00 -70.00 -60.00 -50.00-40.00 -30.00-20.00-10.00 -0.00-4 •444 —.--—_Hyperplane Dependence fromChapter 5—04-4-• jarn,(a) Communication Pattern for a.3((Ii,O),(O,1)) to ((0, Ii), (1,0))Aaync5ironousRouterXnetI//I.I/—I//I IIter. Space Size0 10 20 30 40 50 60(b) Communication Cost of the Dependence in (a)Figure 3. Communication Cost on the MasPar for Different Communication Mechanisms.the square root of the number of iterations in the GIS) for the corresponding hyperplane104..4—4 .—..—\4•c p.-. • i.•‘ ‘o•cCommunication Time a 10360.00340.00320.00300.00280.00260.00240.00220,00200.00180.00160.00140.00120.00100.0080.0060.0040.0020.000.00Communication Mechanisms. (Continued)chapter 5\\\\\ \ \\ S.\.4 . ii(c) Communication Pattern for a Hyperplane Dependence from ((Ii — 2,0), (1, 1)) to ((I,o), (0,1))AsynthronousRouterXnetIter. Space Size0 10 20 30 40 50 60(d) Communication Cost of the Dependence in (c)Figure 3. Communication Cost on the MasPar for Different105150.00140.00130.00120.00110.00100.0090.0080.0070.0060.0010.0040.0030.0020.0010.000.00Chapter 5. . .. . . .Communication Timex 10 -3I I..(e) Communication Pattern for a Hyperplane Dependence from ((Ii, 0), (0,1)) to ((Ii + 2,0), (0,1))AsyncbronouoRouterXnetI TIer. Space SireCost of the Dependence in (e)/‘I/0 20 40(f) Communication60Figure Communication Cost on the MasPar for Different Communication Mechanisms.dependencies represented by the figures on the left side (Figure 3.a, 3.c, and 3.e). Three106chapter 5plots are included in every graph. The first plot is for asynchronous communication (at thesame time, every iteration requests to fetch the needed data from its corresponding sourceiteration. This requires using the router communication mechanism). In the second plot,communication is performed according to the algorithm in Figure 5.4 and using the routercommunication mechanism of the MasPar. Finally, in the third plot communication isalso achieved according to the algorithm in Figure 5.4 but while taking advantage of thefast xnet communication mechanism made possible by finding communication patternsexpressed through hyperplane dependencies.The algorithm in Figure 5.4 introduces some serialization in the communication procedure by establishing a communication pattern, but it eliminates traffic congestions andmessage collisions of the asynchronous communication. In addition, finding a communication pattern usually allows using some of the communication features available incertain machines (such as xnet for the MasPar). Consequently, in all three graphs asynchronous communication is highly outperformed by the communication of the algorithmin Figure 5.4 that uses xnet communication. Even when the algorithm of Figure 5.4uses the slow router communication mechanism, it improves over asynchronous communication in two of the three examples of Figure 3 (3.b and 3.f). For Figure 3.d,our algorithm using the router mechanism is slower than asynchronous communicationbecause the corresponding hyperplane dependence represented by Figure 3.c results inless communication between the iteration than the dependencies of Figure 3.b and 3.e.As a result, the elimination of coffisions and traffic congestions could not offset theserialization introduced by the algorithm.107Chapter 57* Communication generation for the hyperplane dependence from(S (11,12, ...,Im),Sdir) to (D(Il,12, ...,Im),Ddir) .A Find_intersection_points fl;(sO,dO) Find_intermediate_points (A);/* case 1 sO and dO are respectively the source and destination.They are expressed in terms of (Il,12,...,Im).case 2.a or 2.c: sO do the intersection iteration.case 2.b : sO and dO are the intermediate iterations as describedin case 2.b. *1Forall (sO,dO)7* Reduction operation in the source /for (il,i<s,i++) / s is I of direction vectors in hyperplane source /reduction+ (sO,Si); /* reduction in positive direction of Si. *7reduction— (sO,Sj); 7* reduction in negative direction of Si.endforIf ( sO <> dO ) then send (sO,dO); / or receive for the receiving version ofthe algorithm.I Broadcast operation in the destination *1for (i=l,i<=d,i++) 1* d is I of direction vectors in hyperplane dest. *1broadcast+ (dO,Di);broadcast— (dO,Di);endforendoallFigure 5.4. Algorithm for Automatic Generation of Communication Statements.5.4.1.2 Related Work In this section, an efficient method for automatic generation ofcommunication statements across ioops is presented. Even though the algorithm presentedhere is general enough to handle any across loop communication, improvement over108chapter 5previous methods is only made for dependencies from a source to a destination eitherof which is a subspace of dimension larger than zero (i.e. case 2 of section 5.4.1).In this case, the new form of expressing dependencies allows not only to avoid anyunnecessary communication but also to efficiently orchestrate communication in sucha way that collisions and traffic congestion are reduced. Li and Chen [37] assumea set of available communication routines (such as spread, reduce, multispread, etc).Communication is generated by matching the desired communication with the closestroutine from the set of available routines. Given that a perfect match can not always befound, the chosen communication procedure may result in unnecessary data movement.For example, a dependence from [(Ii, 0), ((0, 1))] to [(0, Ii), ((1, 0))] would result inevery iteration receiving data of every other iteration. Our algorithm avoids this type ofsituation by generating customized communication statements. Amarasinghe and Lamalso generate exact communication statements (based on exact data flow informationinstead of data dependence). Optimizing communication in [35] is more suitable forsame loopnest communication while our approach optimizes cross-loops communicationby efficiently organizing communication to reduce collision and congestions.5.4.2 Index Alignment In section 5.2, we described the formation of the GIS of asegment of code from the LISs of all loopnests in the segment. Every iteration in anUS is mapped to an iteration in GIS. In order to keep the mapping problem simple,the US dimensions are mapped onto the GIS dimensions (index alignment) which inturn will be mapped later onto the machine dimensions. In order to find the dependenciesacross the loops, a lexical mapping was used (i.e. the th index used in forming the US109Chapter 5of a loopnest is mapped to i dimension of GIS). Given that these dependencies andtheir costs (i.e cost of the corresponding data movement) depend considerably on indexalignment, the mapping is improved in this section in order to minimize the overall costof these dependencies.The cost associated with hyperplane data dependencies varies with the subspacesource and destination. In fact, if the cost of a dependence is considered to be the communication cost due to that dependence, then hyperplane data flow dependencies can beranked in ascending order of their cost according to the following types of dependencies 1) dependence whose subspace source is the same as the subspace destination 2)dependence with parallel subspacc source and destination (but not equal) 3) Others. Notethat an actual ranking of dependencies and their cost estimation depend on the availablearchitecture, communication mechanism provided by the system, the subspace source anddestination of the dependencies and their sizes.Using a lexical mapping and applying the hyperplane dependence algorithm tothe example in Figure 5.5 result in a dependence from [so (0, 1i), 3dir ((1,0))] to[d0 = (Ii, 0), dd = ((0, 1))] (i.e there is a data dependence from all iterations in colunmI of GIS to all iterations in row I of GIS). However, index alignment can be improvedin order to minimize the cost associated with the hyperplane dependence. Table 5.1 showsthe resulting dependence of every possible mapping for the example in Figure 5.5. Jj and‘2 in the table refer respectively to the first and second dimension of the GI S. Index Iand J in both loopnests are used to form the LISs while index K of the second loopnestis not used in forming the corresponding. US and consequently does not need to be110chapter 5mapped to any dimension of the GIS. It can be deduced from the last column of Table5.1 that the second and the third mapping produce the best mapping (they result in adependence of type 1) while the first and fourth mapping cause a dependence of type 3).Mapping Loops Loopd Resulting DependenceNumber I J I JMapping 1 I 12 Il .12 ((O,Ii),(1,o)) ! ((Ii,O), (0, 1))Mapping2 11 ‘2 ‘2 Il ((0,12),(1,0)) ((0,12),(1,0))Mapping 3 12 Ii I 12 ((Ii, 0, (0, 1)) ! ((Ii, 0), (0, 1))Mappimg 4 12 1 ‘ ((1, 0), (0, 1)) ! ((0, 12), (1, 0))By examining the formation of US and GIS in section 5.2, the assumption that aniteration owns the instances of an array defined in that iteration, and the expression ofhyperplane dependencies (section 5.3), we can deduce the following two rules concerningthe effect of changing the index mapping on hyperplane dependencies:1. Switching the mapping of two indices in the source loopnest (loop5) results inswitching the coordinates of so and the coordinates of the vectors inDo I = 1,NDo J = 1,NA[J,11 =EndosDo I = l,NDo J = 1,NDo K = 1,NB[I,J] = A[I,K]EndosFigure 5.5. Example illustrating the Importanceof Index Alignment on Hyperplane Dependence.Table 5.1 Resulting Hyperplane Dependence for Every Possible Mapping.111Chapter 52. Switching the mapping of two indices in the destination loopnest (loopd) results in i)switching the coordinates of the vectors in dd:r, ii) changing d0 indices and switchingits coordinates, and iii) changing the indices of SO.As an example, Mapping 2 in Table 5.1 can be obtained from Mapping 1 by switchingthe mapping of I and J in the destination loopnest. Applying rule 2), we have i) thedestination direction vector is changed from (0, 1) to (1, 0), ii) index I in do is changedto ‘2 and then the coordinates are switched to obtain a destination origin equal to (0,12),and iii) index Ti in so is changed to ‘2 to obtain a source origin equal to (0,12). Acomplete set of rules can be used to transform a hyperplane dependence of type 3)to a hyperplane dependence of type 2) or type 1). Note that this is only possible ifthe destination direction vectors can be obtained from the source direction vectors byswitching around the coordinates of the latter a number of times. Finding an indexmapping that minimizes the cost of a single dependence can be achieved as discussedabove. However, minimizing the cost of every dependence in a program with multiplehyperplane dependencies may result in conflicting mappings.We consider a segment of code with k loopnests. The dimension of GIS is calledDim while dim is the dimension of LIS (local iteration space of lOOp1 1 i k). Thenumber of ways to map the dimensions of US2 to those of GIS is (Dirn—dim1)! Hence,the total number of possible mappings for the entire segment is II (Dirn—dim)! lii orderto find the segment optimal mapping, then every mapping should be considered by findingthe overall cost of its corresponding dependencies. However, the hyperplane dependencealgorithm is only used once (using the lexical mapping) to find the dependencies in the112chapter 5code. The dependencies for any other mapping can be deduced from the lexical mappingdependencies using the switching index rules of the previous paragraph in 0(N) time,where N is the number of dependencies in the segment (note that it is faster to use theswitching index rules and obtain the dependencies in a new mapping from the lexicalmapping than to apply our hyperplane dependence algorithm to compute the dependenciesin a new mapping). Given that in real applications Dim and dims can be consideredas constants because they rarely exceed 3, then the optimal mapping can be found in(0 (c’N)) time, where c is a constant. Hence, optimal mapping can be obtained in areasonable time only for a segment with a small k. For large k, heuristic mapping can beused. A possible heuristic consists of ranking all dependencies according to the size ofdata to be moved. Then at each step (starting from the dependence with the most data tobe moved to the dependence with the least data to be moved), the indices of the sourceand destination loops of the currently selected hyperplane dependence are mapped so thatthe cost of that dependence is minimized. Once all dependencies are analyzed and thereare still indices that are not mapped yet, the remaining indices are mapped arbitrarily.Index alignment is also investigated [65, 661 but for aligning array dimensions. Eventhough our alignment is applicable for loopnest indices, it is equivalent to the indexalignment in [65, 661 given the assumption that an iteration owns the data defined bythat iteration. Li and Chen uses a graph based approach for index alignment. Findingthe optimal mapping is an NP-complete problem. However, even for a program with asmall number of arrays, using the graph based approach to find the optimal mapping istime consuming. As a result, a heuristic algorithm is presented [65, 66]. In our case,113Chapter 5obtaining an optimal mapping for a problem with small k can be reasonably obtained usingan exhaustive approach because the new form of expressing dependencies (hyperplanedependencies) allows to deduce the cost of dependencies in a new mapping from theircosts in the old mapping in only 0(N) time.5.5 Conclusion In this chapter, a new form of data dependence called hyperplanedependence is introduced. It is a dependence whose source and destination are subspacesof the iteration space. These type of dependencies are common mainly across loops. Inorder to be able to express these dependencies, a global iteration space for a segment ofcode is formed. The concept of global iteration space and hyperplane data dependenceallows the compiler to analyze a whole segment of code at once instead of analyzingeach loopnest in a segment separately. This has resulted in improvement in the globaliteration space partitioning, automatic generation of communication statements acrossloops, and index alignment. Future research related to this chapter includes enhancementto hyperplane data dependence analysis such as checking for integrality of solutions whencomputing the subspace source and destination, including loop bounds in the expressionof hyperplane data dependence, analyze conditional statements within the loopnests, andothers.114Chapter 6Parallelizing Loops with Irregular Dependencies1 Introduction Chapter 2 and 3 analyze and restructure loops with uniform dependencies. Chapter 4 and 5 diverts from the traditional uniform dependence analysis and parallelization by introducing the concept of hyperplane data dependences analysis where thesource and the destination are subspaces of the iteration space. However, the improvement made by the use of hyperplane dependence information in partitioning the globaliteration space, automatic generation of communication statements, and index alignmentis obtained in the presence of simple hyperplane dependencies (with uniform origin distance vectors, uniform source direction vectors, or uniform destination direction vectors).This chapter goes beyond uniform dependence analysis by presenting parallelizing transformations for loops with irregular (complex) dependencies. The techniques presentedhere can be applied to shared memory parallel machines as well as multicomputers butwith low communication overhead given that the main concern of this chapter is to expose as much parallelism as possible from loops with complex dependencies traditionallyexecuted in serial mode.The importance of parallelizing compilers to parallel machines has led to the development of numerous data dependence tests to detect dependencies in loops [9, 68, 69, 14,54, 70, 71]. These tests vary widely in terms of preciseness and complexity. The GCDand the Banerjee test [9, 14] are two inexpensive tests that can find simple dependenciesbut perform poorly in the presence of arrays with complex subscripts such as coupled115Chapter 6subscripts in multidimensional arrays (two subscripts of an array are called coupled ifthey contain the same index [17, 18]). This ineffectiveness is due to the array dimensionsbeing analyzed separately instead of simultaneously. The power test [7], the \ test [181,and the Omega test [6, 5] improve data dependence testing by providing a more precise(but more expensive) data dependence information.The improvement in data dependence testing has not been followed by a similaramelioration in restructuring transformations. In fact, most of the existing transformationscan be applied to loops either without dependencies or with simple dependence patterns(a simple data dependence is used in this chapter to refer to a dependence with atraditional uniform dependence distance vector or uniform dependence direction vector).For example, loop interchange is allowed only for loops with certain types of dependencedirection vectors [25] while loop skewing can be useful only for loops with uniformdependence distance vectors [54]. On the other hand, loops with complex dependenciesare automatically classified as serial ioops without any attempt to parallelize them. Inthis chapter, a novel approach for efficiently parallelizing the latter loops is presented.According to an empirical study in [72], 44% of two-dimensional array references withlinear subscripts have coupled subscripts. It is these type of array references that mayresult in irregular dependencies. In addition, if the serial part of a program representsx % of the code, then the maximum speedup that can be obtained is (Amdahl’s Law)[73]. As a result, even a single loop in a program can severely limit the speedup obtainedif this loop is executed serially. Therefore, every loop in a program (including loops withirregular dependencies) should be analyzed to extract useful parallelism from it.116chapter 6The only recent work we are aware of that statically restructures loops with irregulardependencies can be found in reference [68]. Tzen and Ni use a dependence uniformization technique that consists of decomposing an irregular dependence into two uniformdependencies such that if these two uniform dependencies are satisfied when executingthe loop, then the initial irregular dependence is also satisfied. The loop can then beeasily parallelized according to the two uniform dependence vectors and usually resultsin a Doacross type of execution. Our parallel region execution and loop uniformization complement each other. In fact, our method consists of dividing the iteration spaceinto parallel regions and serial regions. All the nodes in the parallel region can be executed fully in parallel (i.e Doall type of execution) while the nodes in the serial regionare executed serially. Loop uniformization can then be applied to our serial region tofind parallelism in this region only. Given that parallelizing loops via dependence uniformization technique can only achieve a Doacross type of execution, the advantage ofour approach is its ability to identify the iterations that can be executed fully in parallelbefore applying any dependence uniformization.6.2 Three Region Execution of Loops with Single Dependencies Various data dependence analysis techniques have been developed to detect dependencies in a ioop.Based on the existence of dependencies in a loop and their characteristics, standard unimodular transformations (such as loop interchange [14], loop skewing [54, 14], loopreversal) can be applied in order to expose possible parallelism in the loop [15]. Thesetransformations usually consider loops with simple dependencies while loops with complex dependencies are automatically classified as serial loops. In this section, the exe117Chapter 6Do I = 1, N1DoJ—1o+liI uo+uil81: A[fj(I, J), f2(I. J),... f(I, J)]S2: =A[gi(I,J),g(I,J),...,gn(I,J)]EndosFigure 6.1. Loop Model.cution of loops with irregular dependencies is improved by dividing the iteration spaceinto two parallel regions and one serial region. The iterations in the parallel regions canbe fully executed in parallel while the iterations in the serial region can only be executedin serial mode. Given that most loops with complex array subscripts are 2—dimensionalloops [18, 72] and in order to simplify the description of our transformations, only doublynested loops are considered in this section. In addition, this section assumes the existenceof only a single dependence in the ioop. These conditions are relaxed in section 6.4 byallowing multiple dependencies and an n-dimensional iteration space.In this section, we consider the loop model in Figure 6.1 with a convex polygonshaped iteration space21. All the indexing functions are assumed to be linear (i.e.f(I, J) = fio + fiI + f2J and g(I, J) = 9i0 + giI + g2J for 1 < i < n). Inthis loop, there is a dependence from S1 to S2 which can be expressed as a dependencefrom a source iteration s to a destination iteration d. Our three region execution approach21 In case the coefficients of the inner loop index bounds are not integers, the floor or the ceiling functions can be used to assurethat these bounds are integers.118chapter 6to be described requires that the dependence in the ioop is expressed as a dependence forms = (a’i I + ,6 J + 71, 2I + /32 J + 72) to d = (I, J). In order to find such a dependence,we need to find the tuple (i1 ji, j2) satisfying the following system of Diophantineequations and system of inequalities [9]:f1(i,ji) = gi(i2,j2)f2(ii,ji) = g2(i2,j2)f(i1,j)= gn(i2,j2)1<i9<N0 + 1i ji U0 + tLlill0+11i2 i2 UO+Ul2Note that s = (i1 ii) and d (i2, j2). The solution of the above Diophantine equationscan be expressed in terms of free variables. The number of free variables is equal to thenumber of unknowns (4 in this case) minus the number of non-redundant equations (anequation f (ii, i) = gi (i2, j2) is redundant if it can be written as a linear combinationof the non-redundant equations) [7]. Based on the number of free variables, we candifferentiate between the following solutions for the above system:1. No solution —* the loop can be fully executed in parallel.2. The number of free variables is 0 — there is a dependence from only one iterationin the iteration space to another iteration. Parallelizing such ioop can be simple(all iterations except the destination iteration can be executed concurrently; then thedestination iteration is executed).119Chapter 63. The number of free variables is 1. Hence, the dependence in the loop can be expressedas a dependence from s = (o1I+71,o2I+72) to d = (I,c) or d = (c,I) where cis a constant —* parallel region execution can be applied here.4. The number of free variables is 2. Hence, there is a dependence from s(au + 13J + 71, a21 + 132J + 72) to d = (I, J) — parallel region execution canbe applied here.5. The number of free variables is 3. Given that the system of Diophantine equations ofthe doubly nested loop degenerates into one equation, elements of array A in S arewritten to more than once. This results in an output dependence from S1 to itself (inaddition to the dependence from Si to 52). Our parallel region execution does notconsider yet loops with output dependencies and thus this case is not considered inthis thesis. However, various papers [19, 74, 751 discuss removing false dependencies(anti and output dependencies) through array expansion techniques. In this case, arrayexpansion transformation should be used first before applying our parallel regionexecution.Our parallel region execution can only be applied to loops with a dependence froms = (aiI+S1J71,a2+j32)tod= (J,J) (i.e case 3 and4 above. Note thatfor case 3, either I or J is a constant). Note that this dependence can be obtained byreplacing i2 and j2 in the Diophantine equations by I and J and expressing ii and jiin terms of I and J. As an example, we consider the loop in Figure 6.1 with n = 2,fi(I,J) = ao +aiI-f- a2J,f2(I,J) =-f- b1+ b2J, gi(I,J) = cj +cuI+c2J,andg2(I, J) = d0 H- d1 I + d2J. In this case, the following steps can be used to determine120chapter 6s and d:1. From statement S1. it can be noticed that iteration (I, J) definesA[ao + a1 + a2J, b0 + b1 + b2J]. Hence, A[x, y] is defined in iteration(b2z—ay+bo aob —bix+aiy+aobi —ajboaib2—abi 1b2—a2. From statement S2, it can be seen that iteration (I, J) uses the van-able A[co + cI + c2J, d0 + d1 -- d2Jj. From step 1, it can be deducedthatA[co + cI + c2J, d0 + d1 + d2J] is defined in iteration1(b2—adi)I+(b)Jbo tco—a06o(aidi—bicj)I+(aid— jc)J+aobi+aido—aibo—b coa12abj a12—a3. It can be concluded from step 2 that there is data dependence from .s =(&—adj)I+(b)Jboc —a (aidi—bjci)I+(ajd— ic)J+aobi+aido—aibo—bicoaib2—ab1 a12—abjto d = (I,J)22.We define the dependence distance vector D to be the difference between thedestination iteration and its corresponding source iteration (d — s). This is equivalent to the traditional definition of distance vectors [9, 17, 14]. We have D((1—a)I 3J—-y, a2I+(1—J -.Based on D, we propose to dividethe iteration space of a loop into the following three regions:1. areai represents the part of the iteration space where the destination iteration comeslexically before the source iteration. In this region of the iteration space, S2 useselements of array A defined before the loop (i.e. an anti-dependence from 32 toSi may exist). As a result, the iterations in area1 can be fully executed in parallelprovided that variable renaming is perfonned. area1 corresponds to the region where22 For brevity, the case wherealb2— a2bl = 0 is not included here. However, a1b2 — a2bl = 0 results in case 3 in the aboveclassification of solutions to the Diophantine equations.121Chapter 6the direction vector Dir, is equal to (<,*) or equal to (, <)23. The region satisfyingDir = (<,*) can be described by(1—ai)I—/3iJ—-i<O (6.1)While the region satisfying Dir (=, <) is characterized by_a2fll+(1_)(1_al)I+7l(l57<0 and(62)_1(6.1) and (6.2) can be used to generate area1. In addition, we call area the part ofareai described by (6.2). Note that, for simplicity, special cases in the equations ofthis chapter (such as = 0 in (6.2)) are not discussed.2. area2 represents the part of the iteration space, where the destination iteration comeslexically after the source iteration and the source iteration is in area i. If area1 isexecuted first, then the nodes in area2 can be fully executed in parallel. The iterationsin a possible area2 can be expressed by the following inequalities:(1—ai)I---8J71>0 and(1—ai)(aiI+8iJ+’yi)—.8i(aI99) 1<Oand!in (area)We define the direction vector to be the sign of the distance vector, where < refers to negative, = to zero, and> to positive.Note that the conventional direction vector definition uses < for positive and> for negative [9, 17, 14]. We find it more expressiveto use the non-traditional definition in this paper because it directly reflects the sign of the distance vectors.122chapter 6which are equivalent to:(1—ai)I—/3J7i>O and—— /31c2)1 + (j3 — c1/3 — /312)J—(ci7i + /3172) < 0 and!in (area) (6.3)3. areas represents the rest of the iteration space (iteration space —(area1 U area2)).Once areal and area2 are executed, then the nodes in area3 should be executedserially.In the above description of the three region execution of loops with single irregulardata dependence, line L and L’ are used to divide the iterations space (L: (1——= OandL’: (elI— —131a2)I+(i —el1/3 —/3i/32)J—(a171 +/3172) = 0).If we assume that, on average, a line divides a convex polygon into two convex polygonsof equal size, then we have S(areai) N/2, S(area2) N/4 and S(area3) N/4,where N is the number of nodes in the iteration space, while S(a) represents the averagenumber of iterations in region a. If t denotes the execution time of one iteration, thenthe average execution time of the loop on a shared memory machine using the threeregion execution mechanism is (1 + 1 + N/4)t. Hence, the average speedup obtained isAs a result, the average speedup approaches 4 for large values of N. However, thefollowing sections will show that the three region execution leads to a large speedup formany examples taken from other papers.123Chapter 6It should be noticed that before finding L and L’, we could first determine theregion of the iteration space where a dependence can exist. This part of the iteration space is called the dependence convex hull (DCH) [681 which is determinedby imposing conditions to ensure that the source and destination iterations are withinthe loop bounds. Given that a dependence is expressed here as a dependence froms = (au + 13J + 71, a21 + 32J+ 72) to d = (I, J), then the following inequalitiesshould be used to find the DCH1<aI+j3J+<N1max(lio+liiI,••,1o+liI)<c2I+/3J 7<Our three region execution should be applied to the DCH only instead of the wholeiteration space. The nodes that are not in the DCH can be executed fully in parallelbecause of the nonexistence of dependencies for these nodes. This is equivalent todividing the iteration space into four regions (areai, area2, area3, and non-DCl-I).However, in the remainder of this chapter we conservatively assume that the DCHincludes the whole iteration space so that the description of the static transformationcan be kept simple.Figure 6.2 illustrates the three region execution on two examples. Figure 6.2.a and6.2.d show the two initial loops to be analyzed. Using the steps presented above todetermine dependencies in our ioop model when n = 2, it can be concluded that the loop124chapter 6in Figure 6.2.a has a dependence from s (0.5J + 5,0.51 + 0.5J + 3) to d = (I, J).We have D = (1 — 0.5J — 5, —0.51 + 0.5J — 3). Hence, areai is the region of theiteration space satisfying the relation J > 21 — 10 (equation (6.1)) or J = 21 — 10 andI < 16 (equation (6.2)). area2 is the region satisfying the relation J 21 — 10 andJ < I + 6 (minus area’1). area3 is the rest of the iteration space (i.e J 21 — 10and J I + 6 minus area’1). Similar inequalities and regions can also be found forExample 2 . Figure 6.2.b and 6.2.e respectively show the iteration space of Example 1and Example 2 divided into the different regions. It should be noticed that Example 2can be executed in two steps only because area3 is empty24 while Li, Yew, and Zhu [18]propose a serial execution of the outer loop and a Doall execution of the inner loop (i.eN1 steps are needed to execute the loop).6.3 Static Transformation of the Loop In the previous section, the iteration space ofloops with single irregular dependencies is divided into three regions according to thelines L and L’. The three region execution can be simply obtained by transforming theinitial loop into three loops (a loop for each region) with the same loop index bounds24 Actually, area3 includes the iterations located on the line J = I if the inequalities presented in this section are used. However.it can be seen that the coffesponding direction vector for these nodes is (=, =). Such iterations can be included in area2. However,we conservatively did not check for the (=, r) direction vector in section 2 since it usually results in one iteration satisfying thisvector.125Chapter 6Do I=1,N1E ndo SDo J=1,N2A[21,2J] == A[J+1O,I+J+6](a) Example 1.1=21-10(b) The Three Regions of Example IFigure 6.2. Examples of a Compile Time Restructuring ofLoops with Irregular Dependencies. (Continued)126chapter 6/* areal, parallel loop.*/copy A into A’Do I=l,NlDo J= max (1,21—10), N27* Only part of J=21—l0 is in areal.*/if ( (21—J—l0!=0) II (1<16)A[21,2J] == A’ [J+l0,I+J+6]EndifEndos/* area2, parallel loop.*/Do I=l,NlDo J=l,min(21—10, I+6—l,N2)if ( (21—J—lO!=0) II (I>=l6)A[21,2J] == A[J+lO,I+J+6JEndifEndos/* area3, serial loop./Do I=l,N1Do J=I+6,min(21—l0,N2)if ( (21—J—lO!=0) (I>=16)A[21,2J1 == A[J+l0,I+J+6]Endos(c) Static transformation of Examplel.Figure 6.2. Examples of a Compile Time Restructuring ofLoops with Irregular Dependencies. (Continued)127Chapter 6Figure 6.2. Examples of a Compile TimeRestructuring of Loops with Irregular Dependencies.as the initial loop but with a body included within an if statement whose conditionssatisfy the equations of the corresponding region. However, this type of transformationcreates many iterations with no useful code. This would hurt the execution time of theDo 1=1,N1Do J=1,N2A{I,Jj == A[J,I]Endos(d) Example 2.J=I(e) The Three Regions of Example 2128chapter 6serial loop (because of serially checking the conditions of the if statement) and the staticscheduling of the parallel loops (iteration space is as big as the iteration space of theinitial loop, but many of the iterations have no useful code). A better alternative tothis compile time transformation is to find the exact bound of the three loops and is theconcern of this section.Figure 6.2.c shows the loop in Example 1 transfonned into three loops. Given thatarray A in the first loop uses values of A computed before the loop, variable renamingis needed. The if statements in the three loops are used to place area’1 in the appropriateloop. In fact, area is composed of a part of line L (this line is also used as a bound inall regions). A better static transformation (not shown here) would create a fourth loopfor area’1 (fully parallel too) and avoid having any if statement in all loops.Algorithm 1 (at the end of the paper) shows the compile time transformation of theloop model in Figure 6.1 to the three region execution described in section 6.2. The mainfunctionality of the algorithm is to set the bounds of the three loops. In the algorithm,the expressions J = A0 + A1 and J = B0 + B1 (where A0 = —, A1 =B0 = and B1 used in the loop index boundscorrespond respectively to L and L’. In addition, the expression C1 + C0 < 0 usedin the if statements corresponds to the first equation in (6.2) (C1 = 2h1211)and C0 =— /32)— 72). Note that the lines L and L’ can be upper bounds orlower bounds in the restructured loops based on firstly the corresponding region of theloop, secondly the sign of /3i which determines the direction of the inequality in (6.1),and finally the sign of (/3i — cii3i— /31/32) which determines the direction of the second129Chapter 6inequality in (6.3). As an example, in the ioop corresponding to area i, Li is an upperbound if < 0 (because (6.1) becomes J < — ) and is a lower bound ifi > 0. Since the syntax of some programming languages may not allow non-integerindex bounds, the ceiling and the floor functions are used with the expressions A0 + A1 Iand B0 + B1, which may not necessarily have integer values.6.4 Generalization In this section, a generalization of the parallel region executionof loops with irregular dependencies is presented. This generalization consists of firstlyconsidering doubly nested loops with multiple dependencies, and secondly analyzingloops with n dimensional iteration spaces.6.4.1 Parallel Region Execution of Loops with Multiple Dependencies In the previous sections, loops with single and complex dependencies are statically transformed toallow parallel execution of these loops. The iteration space is divided into three regionstwo of which are parallel regions. In this section, a generalization of the parallel regionexecution is presented by considering doubly nested loops with multiple dependencies.First, we assume the existence of m irregular dependencies in a doubly nested loop.It was mentioned in section 6.2 that the presence of an irregular dependence results inpartitioning the iteration space based on L and L’. The existence of multiple dependenciesrequires finding the corresponding L and L’ of every dependence. This results in finding2m lines which can divide the iteration space into at most 2m regions. A region can beidentified by an m-tuple vector (r1, r2,’ . ,rm) such that r (1 i rn) describes theregion number with respect to dependence i. Note that r can take one of three values (1130chapter 6for areai, 2 for area2, and 3 for area3). Once all regions are identified, the executionof the loopnest can be done by executing the region identified by the vector (1, 1, , 1)first and fully in parallel. All other regions should be combined and executed in serialmode. Given that the actual number of regions is usually smaller than 2m (i.e. manypossible regions are empty), several improvements can be made to increase the number ofparallel regions. For example, region (1, , 1) can be executed first, then all regionsdescribed by (xi, X2,”’ , Xm) such that at least one x is 1 should be combined andexecuted serially. Then, region (2,2,... , 2) can be executed fully in parallel providedthat no region whose corresponding tuple includes a mixture of l’s and 3s’ exists. Asexpected, only a limited amount of parallelism is extracted from a ioop with multiplecomplex dependencies.In the previous paragraph, loops with only irregular dependencies are considered.However, if a loop includes a mixture of irregular dependencies and uniform dependencies, then the uniform dependencies should be examined to assure that they do notinvalidate the execution ordering of the regions. For this problem, we consider forsimplicity the existence of one complex dependence and n uniform dependencies. Thecomplex dependence divides the iteration space into three regions such that area1 isexecuted first followed by area2 and then area3. All n uniform dependencies should notviolate this ordering which means that any uniform dependence vector must not crossthe regions in any opposing order (i.e. there should be no uniform dependence distancevector that crosses the regions from area3 to area2 or to area1 or from area2 to area1).For a uniform distance vector v, we call c the angle between v and the i-axis. The slope131Chapter 6of the lines L and L’, and c can be used to check for the validity of our region ordering.For the example in Figure 6.2.e, only the existence of dependence distance vectors v suchthat—c does not violate our region ordering. After checking the validityof the region ordering, the restructured code corresponding to areai and area2 shouldbe further transformed based on the uniform dependencies. For example, if a uniformdependence with a distance vector equal to (1,0) exists in the loopnest corresponding toFigure 6.2.e (in addition to the irregular dependence), then areai and area2 are not fullyparallel regions anymore. In fact, the I loop for both regions should be executed seriallywhile the J loop can be executed in parallel.In order to express the conditions necessary for a uniform dependence not to invalidatethe region ordering by crossing the regions from area2 to area1,we have the followingtwo cases:1. area1 is described by the inequality J > A1 + A0 (i.e > 0). A possible vector on line L1 identified by J = A1 + A0 is vo = (1, A1). Using a graphicalanalysis, it can be concluded that the vectors that do not cross the iteration spacefrom area2 to area1 are the vectors that are positive multiple of v1 = (1, x1)where— < x1 < Ai or multiple of (—1, x2) where —oc < x2 < —A1.Figure ? shows an iteration space with L1 identified by J A1J + A0 andarea1 described by J < A1J + A0. v0 is represented by the solid arrow whilesome of the possible v1 vectors that do not violate the region execution ordering are shown with dotted arrows. The set of valid vectors for this case is{(v,x.T1) s.t. xE R, x1 ER, and(((T> 0) and (—cc < r1 <Ai))or((r < 0) and (—cc < <—A1)))} which132chapter 6is equivalent to{(x, y) s.t. a E R, yE R, and (((x> 0) and (—oo < < Ai)) or((r < 0)and (—c < — < —Ai)))}.2. area1 is described by the inequality J < A1J + A0 (i.e th < 0). Using asimilar graphical analysis to the previous case, it can be deduced that the vectors that do not cross the regions from area2 to area1 are the vectors that arepositive multiple of (1, x3) where A1 < x3 < + or multiple of (—1, x4)where —A1 < X4 < +00. Hence, the set of valid vectors for this case is{(,y) s.t. ER, y ER, and (((x>O) and (A1 < < +co))or((a < 0)and (—A < — < +oo)))}.The above constraints on uniform dependence vectors only guaranty that the latter do notcross the regions from area2 to area1. Similar constraints should also be found to securethat the dependence vectors do not cross the regions from area3 to area2. This can bedone similarly to the above analysis but based on L2 identified by J = B1I + B9 andthe sign of (!3i— — /1/32) which determines whether the inequality J > B1 + B0or the inequality J < B1 + B0 refers area2. Note that it is not necessary to check forthe execution ordering of area1 and area3 because it would be a redundant check if theexecution ordering of area1 and area2 and the execution ordering of area2 and area3 aresatisfied. Table 6.1 summarizes all the results of possible uniform dependence vectors.6.4.2 Parallel Region Execution in n-Dimensional Space In previous sections, onlyloops with 2D iteration spaces are considered. This section generalizes the parallel regionexecution to loops with n-dimensional iteration spaces and an irregular dependence froms = to d =(1,I2,,1n)133Chapter 6is assumed, where ‘k (1 k n) represents the kth index of the ioop. Note thatdetermining the source and destination of such data dependence can be obtained using theextended GCD algorithm [9] or the power test algorithm [7]. For example, the extendedGCD algorithm finds whether a set of linear equations with integer coefficients have anysolution and gives a way to enumerate all those solutions.Region Description Non-invalidating Uniform Dependencies4i >0 {(x,y) s.t. z ER, Y ER, and(((r> 0) and (—co <- < A1)){(or ((r < 0)and (— < < A1)))}<0 {(r,y) s.t. x ER, yE R, and (((er> 0) and (A1 < - < +oo)){(or((x<0)and(—oo< L<A1)))}13—o1131—i/2>O ((i,y)s.t.rER,yER,and(((r>0) and (_co<2L<Bi)){(or((T<0)and(Bj <13i — ai/3i— 131132 < 0 {(x, y) st. r E R, y E R, and (((s> 0) and (B1 < < +oc)){(or((r<0)and(—co < .<Bi)))}J = A,I +0Figure 6.3. Graphical of Non-invalidating Uniform DependenceVectors for the Execution Ordering of areal and area2 when /3i > 0.Table 6.1 Conditions on the Non-Invalidating UniformDependence Vectors for our Three Region Execution.134chapter 6The distance vectors of the above dependence is D = d — s = (Hi, H2, ,where H = —ado + (1 — c)I — cI for 1 j n. In addition to thek=1 (kf=j)distance vector, we compute another n-tuple vector D’ equal to (H, , H) suchthat H = —co + (1 — c,,)S3 — a2jSj, where S1 refers to the 1th coordinatesk=1(kl=,)of the source tuple s (1 1 n). Note that H is obtained from H3 by replacing Ij(1 < 1 < n) in H3 by S1. We define the following hyperplanes: H3 = 0 and H 0for (1 j n). The former hyperplanes correspond to L in section 6.2 while the latterones correspond to L’ .As in section 6.2, the iteration space is divided into the followingthree regions: 1) area1 characterized by the equation D < 0 2) area2 characterized bythe equations D 0 and D’ < 0, and 3) area3 which is the rest of the iteration space(D 0 and D’ 0). Given that these inequalities involve n-tuple vectors, then eachindex generates its own set of inequalities. A tree (called index tree) is used in Figure 6.4to show the different regions and their corresponding characteristic inequalities. Level i inthe index tree corresponds to the th index in vector D and D’. The index tree can be usedto automatically transform an n-dimensional loopnest with a complex dependence into aseries of loopnests that satisfy the parallel region execution. Every path from the startingmode in the index tree to a terminal node results in a loopnest. Thus, the total numberof loopnests used in the parallel region execution of the initial loopnest is 3n (there aren loopnests for each of the three regions). The constraints on H and H found along apath should be used to have the appropriate indexes for the corresponding loopnest.Even though . of the 3n loopnests obtained are parallel loops (area1 and area2),the overhead of the loopnests indexing can be large. There are two techniques implicitly135Chapter 61H =Oand H<O H1>OI (area2) (area3)(area ) H2 Oand H <0 or H1’> oorH;=Oancl-1<O H’.OandH2(area2) (area3)(area1) H1’< Oor H OorH=I-t’...=HO and I-j <0 and f(area2) (area3)Figure 6.4. Inequalities Characterizing the DifferentRegions of an n-Dimensional Iteration Space.used in section 2 to reduce the number of loopnests from 6 (=3*2) to 3:1. Some of the iterations that could have been included in area2 are conservativelyplaced in area3. In Figure 6.5, the box around three terminal nodes corresponds tothree loopnests that could have been created in the three region execution of doublynested loops (section 6.2). Two of these loopnests correspond to area2 while theother corresponds to area3. However, these two loopnests were merged (in section6.2) into a single loopnest by including the iterations of the area2 loopnest into area3loopnest. It is important to notice that the only possible way to conservatively moveiterations from a region described by the index tree of Figure 6.4 to another regionis to move iterations from area2 to area3. Given that the iterations in area2 (whichis a parallel region) use values computed in iterations from area1, it is necessary notto miss any iteration that is in areai according to the index tree. Even though some136chapter 6iterations in area3 use values computed in iterations from area2, it is possible tomove iteration from area2 to area3 because area3 is executed serially and thereforepreserves the initial ordering of its iterations.2. Loopnests corresponding to the same region can be merged into a single loopnest.However, if statements should be added inside the latter loopnest to assure having noextra iterations. In Figure 6.5, there are two paths shown in dotted lines that shouldcorrespond to two different loopnests (one with the constraint H1 <0; the other withthe constraints H1 = 0 and H2 < 0). These two loopnests are combined in section6.2 into a single loopnest (with the constraint H1 0). An if statements was addedin the merged loopnest to account for the constraint [2 < 0. For an n-dimensionalloopnest with a single dependence, the code in Figure 6.6 shows how all iterationscorresponding to area1 can be included in a single loopnest.Figure 6.5. Decreasing the Number of Loopnests in the Three Region Execution MethodNote that even though the index tree can be used in the general case to restructurethe initial loopnest using our parallel region execution, it is not efficient to use ourH1 <0-< 0(area3)(area1)‘H = 0 and (area2)2_H2 <0(area)\ 113>0 H2>OH3=Oaxid H<0or 1>Oor(area) H0and-I’2<O H’=0andH’0(area) (area3)137Chapter 6parallelizing techniques if areai is empty. areai is usually empty when H1 > 0 (arealis also empty when H1 = 0 and H2 > 0, or H1 = 0, H2 = 0 and H3 > 0, etc). In thiscase (i.e. H1 > 0), our parallel region execution should not be used given that the initialloopnest can be simply and efficiently executed by running the outer loop serially whileexecuting all other loops in parallel (Doall loops).6.5 Four Region Execution of Loops In previous sections, the iteration space of aioop is divided into two parallel regions (areai and area2)and one serial region (area3).In this section, the serial region is further divided to extract more parallelism from theloop. For simplicity, we again assume a doubly nested loop with a dependence froms = (au + i3J H- -y’, a2’ + j32J + 72) to d = (I, J). In order to decrease the numberof iterations included in the serial region, we propose to divide area3 into two regionsareas and area such that the sources of iterations in area are located in areas andthe sources of the iterations in area are also located in area. The iterations in areacan be fully executed in parallel provided that the nodes in areas (the new and reducedserial region) are executed first.A simple approach for reducing the size of the serial region is to find a line L”defined by D1I— J + D0 = 0 such that the iterations located on the negative (positive)side of the iteration space with respect to L”25have their source iterations on the positive(negative) side while the iterations located on the positive (negative) side have theirsource iterations also on the positive (negative) side. This is equivalent to determining23 They are the iterations satisfying the inequality D1 I — J + Dc, <0.138chapter 6which is equivalent toD1—J + D < 0(2 — Diai)I + (/32 — D1/3)J + (-2 — D1y — Do) 0D1— J+Do >0Dl(alI+/3lJ+yi)_(a2I+/3J) De 0Do Ii = 1, N1Do12 = 1,N2io + (ci — 1)I + 12I2 + +ai7_il_iDo I,.=1,1rc10 + (c11 — 1)I + 12I2 + . . + t1n_:1I70;l/ * should be I = N,-if(((H1==O)&&(H2<13)) 1((H==O)&&(H3<0)) II((H_i==o)&&(H<—o)) )loop bodyEndosFigure 6.6. Inclusion of area1 Iterations into a Single Loopnest.D1 and D0 such that26(Dii — J + D0 <0andt Dl(li+/3lJ+7l)_(2I+J2) D>OandD1—J + D0 > 0(Diaj — a2)i + (D1i3 — 132)i +(Do + D0y1—72) 0(6.4)26 This section does not analyze nor show the inequalities corresponding to the tenns in parenthesis of the last sentence. Suchinequalities would allow us to find other values of D0 and D1.139Chapter 6In order to optimize the execution of the ioop, values of D0 and D1 that maximizeareas (and thus minimizes areas) should be found. However, determining such aand D1 is a time consuming integer programming problem that can not be solved in areasonable compile time. As a result, we only consider determining possible values ofD0 and D1.If we assume that the ioop to be analyzed has positive indexes (i.e I 0 and J 0),then sufficient (but not necessary) conditions on D0 and D1 that satisfy (6.4) can beexpressed by the following inequalities— D1a D1— 2 D1— D1/3 —1 DJ31— /32 1 (6.5)y2—Diyi—DoDo y+Do—y2oEquation (6.5) allows finding a possible line L” that can be used to reduce the numberof iterations in the serial region. It should be noticed that (6.5) did not have any solutionfor several examples that were analyzed because of the restrictions made throughoutthis section to find fast solutions. One improvement made to increase the possibility ofreducing the size of the serial region is to use two lines (L’ and L) to divide area3instead of a single line (L”).Even though L” is introduced here to divide area3 into two regions, L” can be usedto divide the whole iteration space into two regions. In fact, if we call area the part ofthe iteration space that is on one side of L” and call area8 the part of the iteration spacethat is on the other side (using similar conditions on the sources and destinations to theones used in the last paragraphs), then area8 should be executed first and in serial modefollowed by a parallel execution of area. In this case, the initial loop is transformedinto two loops only instead of the four loops that result if L, L’, and L” had been used140chapter 6to divide the iteration space. This can only be efficient if the number of nodes in area(when only L” is used to divide the iteration space) is close to the number of nodesin area1 U area2 U area (when L, L’, and L” are used to divide the iteration space).Finally, it should be noticed that L” can sometimes be used to prove the non-existenceof data dependence in a loop. In fact, if we can find L” such that area3 = 0, then nodependence exists in the loop. It is sometimes possible to find such an L” because ofthe choices that equation (6.5) allows in selecting D0, and D1.6.6 Related and Future Work In this section, we discuss the relevance of our parallelregion execution to other work that investigate loops with complex dependencies. Tzenand Ni [681 use a dependence uniformization technique to transform doubly nested loopswith irregular dependencies into parallel ioops. This is achieved by decomposing thedependence distance vector of the complex dependence existing in the loop into twouniform dependence vectors such that honoring the two uniform dependencies duringthe execution of the loop results in satisfying the irregular dependence. As a result,transformations for parallelizing loops with uniform dependencies becomes applicable.Our parallel region execution and the above dependence uniformization can be integratedto take advantage of the speedup allowed by both methods. In fact, our approach shouldbe used first to divide the iteration space into regions. Then, dependence uniformizationtechniques should be applied but only to our serial region (area3) instead of the wholeiteration space given that area1 and area2 can both be fully executed in parallel. It isinteresting to notice that for all three examples used in reference [68] to illustrate theuniformization techniques, area3 is always empty. As a result, all the examples there can141Chapter 6Do 1=1,100Do J= 1,100A[31+2J,2J1 A[I—J+6,I+J]Figure 6.7. Examples from Related Papers.be executed in at most two steps. This significantly outperforms the speedup obtained in[681. Figure 6.7.a shows one of the examples in [68], where area3 is empty. If we ignorethe small overhead introduced by our technique (real experimental results can be foundin the next section), then the execution time of the ioop in Figure 6.7.a using the parallelregion execution (on a shared memory machine with an infinite number of processors)is 2t where t represents the execution time of one iteration. This results in a theoreticalspeedup of ‘°°° (the theoretical speedup obtained in [68] isIn [71 and [18], two data dependence decision algorithms called the power testand the ). test are introduced. Both tests improve data dependence analysis for arrayswith complex references by being more successful than previous methods in findingindependence in loops. Given that our parallel region execution should be only usedwhen irregular dependencies are found in the loop, the power test or the ). test should beapplied first to investigate the existence of such dependencies. Even though our parallelDo 1=1,1000Do J=1,1000A[I+J,31+J+3] ==A[I+J+1,I+2J+4]Endo(a) Example 1 from [14].Endos(b) Example 2 from [18].142chapter 6region execution is not intended in this dissertation to find independence in a ioop, it canbe enhanced with some other observations to find independence in a loop. For example,if we consider the loop in Figure 6.7.b used in [7] to illustrate the superiority of thepower test over the )‘ test (the power test finds no dependence in the example while the\ test fails to do so), it can be concluded from section 6.2 that there is a dependence for= (—J+ 2,41+ .J) to d= (I,J). The first index ofs requires that —J+ 2 1which results in J . Hence, a dependence may exist only if J = 1. However, forJ = 1 the first index of the source tuple is not an integer. Therefore, no true dependenceexists in the loop. It would be interesting to compose a list of conditions on L, L’, ands that need to be met in order for a dependence to exist. Such a list can be used as datadependence test to complement the power or \ test.It was implied throughout this chapter that memory-based data dependence analysisis used to obtain the source and the destination. Recent dependence analysis work [41,16, 19, 21] use a more precise value-based data dependence analysis and is still beingimproved [23, 22]. For example, in the codeDo I=1,N1Do J=1,N2Do K=1,N3A[J] ==A[K]Endosthe memory-based analysis produces a dependence from (*, K, *) to (I, J, K) (* refersto any value within the loop bound) while the value-based dependence analysis finds adependence form (I— 1,K,N3) to (I,J,K) if J < K, from (I,K,N3) to (I,J,K)143Chapter 6if J > K, and from (I, J, J) to (I, J, J) if J = K, Generalizing our parallel regionexecution to value-based data dependence analysis is part of our future work.6.7 Experimental Results In previous sections, the execution of loopnests with complex dependencies is improved by dividing the iteration space into parallel and serialregions. The execution time of the ioop depends primarily on the number of nodes in theserial region. In order to measure and analyze the performance of our transformations,the compile time restructuring of doubly nested loops has been implemented. Algorithm4 is used to automatically transform loops with single dependencies (section 6.2) intothe three loop format shown in the example of Figure 6.2.c. The restructured loops arethen slightly changed (by hand) in order to execute them on the MasPar 1 machine usingthe MPL language.Figure 6.8 shows the speedup obtained for Example 1 in Figure 6.2. The serialexecution version of the loop used to find the speedup consists of executing the wholeloop (Figure 6.2.a) on the MasPar host which is faster than the individual processingelements of the Data Parallel Unit (DPU). The parallel region execution (Figure 6.2.c)consists of running the parallel regions (first two loops) on the DPU and the serial region(third loop) on the host. The elements of A that were defined by the serial loop are thenstored back in their appropriate processing elements of the DPU. Figure 6.8.a shows thetheoretical and experimental speedup for different problem sizes. In this figure, the squareroot of the number of nodes in the iteration space is displayed on the x-axis. Note the loopbody consists of two simple assignment statements (i.e. the loop has a small granularityof around 200 its). Note also that the theoretical (expected) speedup is defined to be equal144chapter 6Speed up300.00250.00200.00150.00100.0050.000.0030 40 50 60(a) Comparison between Theoretical and Experimental Speedup.20Figure 6.8. Three Region Execution Performance for the Code in Figure 6.2.cto 1+1 Q/°rea3• It can be deduced from Figure 6.8.a that when the theoreticalspeedup decreases, the actual speedup approaches the theoretical one. This divergence atlarge speedups is due to the large degree of parallelism found in the ioop which results inExperimentalTheoreticalProblem SizeSpeedup9.509.008.508.007.50ExperimentaltheoreticalEZZ0 10Granularity (in ms)30(b) Speedup vs Granularity (Ni = N2 = 64).145Chapter 6a large communication overhead for the parallel region execution that was not accountedfor when computing the theoretical speedup (the MasPar is a SIMD machine with everyprocessing element having its own memory space). It should be noticed that there isno relationship between the expected speedup and the problem size. The fact that thetheoretical speedup decreases when the problem size increases is only specific to thisexample (the expected speedup depends on L, L’, and the iteration space size).Figure 6.8.b shows the speedup of Example 1 in Figure 6.2 vs. the granularityof the loop. As expected, the experimental speedup increases when the granularity ofthe loop increases because the communication overhead becomes small compared to thecomputation part for large values of ioop granularity. It can also be deduced from thegraph that the experimental speedup reaches a constant level. This upper limit is obtainedwhen communication becomes negligeable with respect to computation. It is interesting tosee the experimental speedup exceeding the theoretical speedup after a certain granularity.This “superlinear” speedup is known to be due to the memory management overhead ofthe serial version. In fact, there is a large overhead associated with a large problemrunning on a uniprocessor machine (memory paging, cache misses, etc.). This particularoverhead is usually reduced for parallel machines because the problem is partitionedamong the processing elements.6.8 Conclusion In this chapter, restructuring transformations for loops with irregulardependencies are presented. These transformations consist of dividing the iterationspace into parallel and serial regions, where the iterations in a parallel region can befully executed in parallel. The execution of various examples on the MasPar machine146chapter 6shows a considerable improvement over the serial execution and the loop uniformizationexecution.147Chapter 6copy A into A’ I * Actually only elements of A area1 should be copied. * /if (AreaiUpperbound (L)) / * i.e /3 < 0 //* area1 */Do I = 1. N1Do J = max(110 + ljJ,... -f ljI), mm (uio +u11, . . . u0 + u1I. LAo + AiIJ)if((AI—J--A!=o)II(cjj_co<o)) /* OnlypartofAJ—J+AOjsjnareai */A[ao +aiI+a2J,bo +biI+b2J]=.=A’[co+cil+c2J,d+diJ+d]EndifEndosif (Area2Upperbound (L’)) / * :.e — — /3162 > 0 * // * area2 */Do 1 1, N1Do J = maxl10 + liiI,... ,lo + liI, lAo + A.fl), min(uio + uiil, . + ujI, IB0 + B1 —if((AI—J+A!=o)II (C1+C0>0))same loop body as Figure 1.EndifEndos/ area3 */Do I = 1. N1Do J mac (l + .. + hi1,A0 + AiI1, [B0 + B19), mm (u10 + ‘-iiI, to + u1I)if((A1—J+A’_—o)II (C01+ oo))same locq, body as Figure 1.Endo3Else /* i.e /31—1.31—j3j/32<O */area2Do I 1, N1Do J = max(lio + liii,.. + 1111, lAo + Ail1, fB0 + 13l + 1]),min(ujo + viiI, ,uo + Ujsame loop body as loop for are a2 above.Endosarea3Do I = 1, N1Do J = max (l .f lJ . . ., 1o + l 1, lAo + A1]), mm (uiO + ujjI, . . . u8o + u1, [B1+ BoJ)same loop body as loop for area3 above.En dogEnd:fAlgorithm 1. Compile Time Transformation of the LoopModel Example In Section 6.2. (Continued)1481)chapter 6Else / * i.e th > 0 * //* area1 */Do I = 1, NDo J = max(lio + l11I,•• lo + Itil, fAo + Aifl), mm (Ulo + uiil, . , uo + uiI)same loop body as loop for areal above.Endosif (Area2upperbound (L’)) / * i.e — a1/i — 131/92 > 0 * /area2 */Do I = 1, N1Do J = max(ljo +ljiI,.. ‘,lo +lI), min(uio + uiiI,..• ao + uiil, LAo + AiIj, B0 - B1— )same loop body as loop for area2 above.Endos/* area3 */Do 1 = 1, N1Do J = max(110 + 1111,..., 1o + 1:11, [B0 + BiI]),min (Uio + .... ,u0 + uiI, [A1+ A0Jsame loop body as loop for area3 above.EndosElse / * — o1/31 — /982 < 0 * /area2 */Do I = 1, N1Do J max(lio + l11I,••, hO + hill, [Bil + B0 + 1j),min(uio + uijl,. .. UO +u1I, [Au + A )same loop body as loop for are al above,Endos/* area3 */Do I = 1, N1Do J = max(lio + l11J,’’, +4i1), mm (u10 +u11,. ,u0 + u,iI, [A0 + A1J, [B1 + B0Jsame ioop body as loop for area3 above.EndosEndifEndifAlgorithm 1. Compile Time Transformation of the Loop Model Example In Section 6.2.149Chapter 7Conclusion and Future DirectionsIn this thesis, compile time transformations and efficient execution of loops withvarious type of dependencies are presented. This dissertation started by investigatingloops with uniform dependencies (Doacross loops) [76—781. Previous parallelizingtechniques for Doacross loops concentrate on shared memory parallel machines while thisthesis presents compile time transformations of Doacross loops for efficient executionon multicomputers. Similarly to other parallelizing transformations, only single loopnestsare initially investigated [76-78]. However, finding the best execution for each loopnestin a program separately may not necessarily result in an efficient execution of all the codein the program because of communication overhead across the loops [79]. As a result,we introduced the concept of hype rplane data dependence which is a dependence from asubset of iterations to another subset of iterations [801. This type of data dependence canbe useful for expressing dependencies across loopnests and understanding the interactionbetween them. Finally, this thesis proposes parallelizing transformations for ioops withirregular dependencies [81, 82]. Such loops are rarely investigated in the literature andno attempt is usually made to extract parallelism from them.7.1 Primary Research Contributions In the following, we summarize the primaryresearch contributions of this work:For Doacross loops with single and uniform synchronizations, we presented compiletime transformations of the loops for efficient execution on multicomputers. Iterations150chapter 7grouping and code reordering techniques are used to improve the execution time ofthe code, obtain a load balanced execution, and increase the processors utilization.For Doacross loops with multiple and uniform dependencies, we presented independent tile partitioning which improves over supernode partitioning by finding moreparallelism in the code without adding any communication overhead. A simplifiedapproach to find a near-optimal partition size for grid-topology machines is alsopresented.• Using some code examples, it was shown that even though flow dependence vectorsshould be given more weight when partitioning the iteration space, input and anti-dependence vectors should not be ignored.• We introduced the concept of hyperplane data dependence analysis which is adependence from a subset of iterations to another subset of iterations. Hyperplanedata dependence is useful for expressing dependencies across the loops. An algorithmfor computing and expressing these dependencies can be found in this dissertation.• We improved the global iteration space partitioning by using the hyperplane datadependence information. This resulted in a more methodical approach for partitioningthe global iteration space because only vector information is used in the partitioningprocess. In addition, it allows more general partitioning strategies than the fewpredetermined ones (row, column, block) used in the literature to partition the globaliteration space.• We used hyperplane data dependence information to automatically generate communication statements across the loops and reduce the execution time of this type151Chapter 7of communication overhead. Hyperplane dependence information is also used tosimplify the index alignment process.We presented the parallel region execution of ioops with irregular dependenciestraditionally executed in serial mode. Significant improvement in execution timeis obtained for various loop examples. Given that no parallelism is traditionallyextracted from such loops, we hope that our method will motivate other researchersto work on problems currently classified as non-parallelizable.7.2 Future Research Directions There are many tasks that still need to be improvedin order to produce efficient and complete parallelizing compilers especially for mu!ticomputers. As for the future direction closely related to this thesis, we propose thefollowing few points:• Enhance hyperplane data dependence analysis such as to include information concerning the integrality of solutions when computing the subspace source and destination.In addition, analyzing conditional statements within loopnests and using this information to express hyperplane dependencies less conservatively can be the next stepin improving across ioop dependence analysis.• Our parallel region execution showed that the iteration space can be easily dividedinto parallel regions after manipulating equations taken from the dependence distancevectors. It may be possible to generalize and formalize our approach so that it canhandle any type of dependencies in such a way that our parallel region execution(described in Chapter 6) is obtained for loops with irregular dependencies while thewavefront loop execution is obtained for ioops with uniform dependencies. Such152chapter 7an approach would make known transformations such as the wavefront method asubcase in our parallel region execution.The transformations and analysis in this work such as the parallel region execution usememory-based data dependence analysis. Recent work in data dependence analysisuses a more precise value-based data dependence analysis [16, 6, 19] and is stillbeing improved [22, 23]. Improving our transformations based on the additionalinformation provided by the more precise value-based dependence analysis is an areaof future research. However, this may result in considerably complex transformationsgiven that value-based dependence analysis is expressed in a conditional form.• Use hyperplane data dependence information towards other goals such optimizingserial code, program documentation and verification, etc.153Chapter 7References[1] M. Flynn, “Very High-Speed Computing Systems,” IEEE Proceedings, vol. 54,pp. 1901—1909, Oct. 1966.[2] H.S. Stone, High-performance Computer Architecture. Reading, Massachusetts:Addison-Wesley, 1990.[3] High Performance Fortran Forum, “High Performance Fortran Language Specificationversion 1.0,” Technical Report CRPC-TR92225, Rice University, May 1993.[4] P.J. Hatcher and M.J. Quinn, Data Parallel Programming on MIMD Computers.Cambridge, Massachusetts: MIT Press, 1991.[5] W. Pugh, “The Omega Test: a Fast and Practical Integer Programming Algorithm forDependence Analysis,” Supercomputing 91, pp. 4—13, Nov. 1991.[6] W. Pugh, “A Practical Algorithm for Exact Array Dependence Analysis,” Communication of the ACM, pp. 102—114, Aug. 1992.[7] C. T. M. Wolfe, “The Power Test for Data Dependence,” IEEE Transactions onParallel and Distributed Systems, pp. 591—601, Sep. 1992.[8] H. Zima and B. Chapman, Supercompilers for Parallel and Vector Computers.Reading, Massachusetts: Addison-Wesley, 1990.[9] U. Banerjee, Dependence Analysis for Supercomputing. Nowell, Massachusetts:Kluwer Academic, 1988.[101 C.D. Polychronopoulos, Parallel Programming and Compilers. Nowell, Massachusetts: Kiuwer Academic Press, 1988.[11] A.V. Aho, R. Sethi, and J.D. Uliman, Compilers. Principles, Techniques, and Tools.Reading, Massachusetts: Addison-Wesley, 1986.[12] D.A. Padua and M.J. Wolfe, “Advanced Compiler Optimizations for Supercomputers,” Communication of the ACM, vol. 29, pp. 1184-1201, 1986.[13] E. Duesterwald, R. Gupta, and M.L. Soffa, “A Practical Data Flow Framework forArray Reference Analysis and its Use in Optimization,” ACM SIGPLAN Conferenceon Programming Language Design and Implementation, Jun. 1993.154chapter 7[14] M.J. Wolfe, Optimizing Supercompilersfor Supercomputers. Cambridge, Massachusettsl: MIT Press, 1989.[15] U. Banerjee, Loop Transformations for Restructuring Compilers. Nowell, Massachusetts: Kiuwer Academic, 1993.[16] P. Feautrier, “Dataflow Analysis of Array and Scalar References,” InternationalJournal of Parallel Programming, pp. 23-53, Aug. 1991.[17] G. Goff, K. Kennedy, and C. Tseng, “Practical Dependence Testing,” ACM SIGPLANConference on Programming Language Design and Implementation, pp. 15—29, Jun.1991.[181 Z. Li, P. Yew, and C. Zhu, “An Efficient Data Dependence Analysis for ParallelizingCompilers,” IEEE Transactions on Parallel and Distributed Systems, pp. 26-34,Jan. 1990.[19] D.E. Maydan, S.P. Amarasinghe, and M.S. Lam, “Array Data-Flow Analysis andits Use in Array Privatization,” ACM Conference on Principle of ProgrammingLanguages, pp. 2—15, Jan. 1993.[20] Z. Li, P. -C Yew, and C.-Q Zhu, “Data Dependence Analysis on Multi-dimensionalArray References,” International Conference on Supercomputing, pp. 2 15—224, Jul.1989.[21] D.E. Maydan, J.L. Hennessy, and M.S. Lam, “Efficient and Exact Data DependenceAnalysis,” ACM SIGPLAN Conference on Programming Language Design andImplementation, pp. 1—14, Jun. 1991.[221 W. Pugh and D. Wonnacott, “Static Analysis of Upper and Lower Bounds onDependences and Parallelism,” Technical Report CS-TR 2994.2 Department ofComputer Science, University of Maryland, 1994.[23] V. Maslov, “Lazy Array Data-flow Dependence Analysis,” ACM Conference onPrinciple of Programming Languages, Jan. 1994.[24] J.R. Allen and K. Kennedy, “Automatic Translation of Fortran Programs to VectorForm,” ACM Transaction on Programming Languages and Systems, vol. 9, pp. 491-542, Oct. 1987.155Chapter 7[25] M.J. Wolfe, “The Tiny Loop Restructuring Research Tool,” International Conferenceon Parallel Processing, vol. 2, pp. 46-53, Aug. 1991.[26] C.D. Polychronopoulos and D.J. Kuck, “Guided Self-scheduling: a Practical Self-scheduling Scheme for Parallel Supercomputers,” IEEE Transaction on Computers,pp. 1425—1439, Dec. 1987.[27] J.K. Peir and R. Cytron, “Minimum Distance: A Method for Partitioning Recurrencesfor Multiprocessors,” iEEE Transaction on Computers, pp. 1203-1211, Aug. 1990.[281 E. H. D’Hollander, “Partitioning and Labeling of Index Sets in Do Loops withConstant Dependence Vectors,” International Conference on Parallel Processing,vol. 2, pp. 139-144, Aug. 1990.[291 M. Gupta and P. Banerjee, “Demonstration of Automatic Data Partitioning Techniques for Parallelizing Compilers on Multicomputers,” IEEE Transactions on Parallel and Distributed Systems, pp. 179—193, Mar. 1992.[30] F. Irigoin and R. Triolet, “Supernode Partitioning,” ACM Symposium on Principlesof Programming Languages, pp. 319—329, 1988.[31] J. Ramanujam and P. Sadayappan, “A Methodology for Parallelizing Programsfor Multicomputers and Complex Memory Multiprocessors,” Supercomputing 89,pp. 637-646, Nov. 1989.[32] J. Ramanujam and P. Sadayappan, “Tiling of Iteration Spaces for Multicomputers,”International Conference on Parallel Processing, vol. 2, pp. 179-186, Aug. 1990.[33] K. Pingali and A. Rogers, “Process Decomposition through Locality of Reference,”ACM SIGPLAJY Conference on Programming Language Design and Implementation,pp. 69-80, Jun. 1989.[34] J.M Anderson and M.S. Lam, “Global Optimization for Parallelism and Locality onScalable Parallel Machines,” ACM SIGPLAN Conference on Programming LanguageDesign and Implementation, pp. 112—125, Jun. 1993.[35] S. P. Amarasinghe and M. S. Lam, “Communication Optimization and CodeGeneration for Distributed Memory Machines,” ACM SIGPLAN Conference onProgramming Language Design and Implementation, pp. 126—138, Jun. 1993.156chapter 7[36] D. Callahan, and K. Kennedy, “Compiling Programs for Distributed MemoryMultiprocessors,” Journal of Supercomputing, vol. 2, PP. 15 1-169, 1989.[37] J. Li and M. Chen, “Compiling Communication-efficient Programs for MassivelyParallel Machines,” IEEE Transactions on Parallel and Distributed Systems,pp. 361—376, Jul. 1991.[38] J.A. Fisher, “Trace Scheduling: A Technique for Global Microcode Campaction,”IEEE Transactions on Computers, vol. 30, pp. 478-490, Jul. 1981.[39] M.S. Lam, “Software Pipelining: An Effective Scheduling Technique for VLIWMachines,” ACM SIGPLAN Conference on Programming Languages Design andImplementation, pp. 3 18—328, Jun. 1988.[40] A. Zaafrani, H. G. Dietz, and M. T. O’Keefe, “ Static Scheduling for Barrier MIMDArchitectures,” International Conference on Parallel Processing, pp. 187—194, Aug.1990.[411 R. Cytron, “Doacross: Beyond Vectorization for Multiprocessors,” InternationalConference on Parallel Processing, pp. 836—844, Aug. 1986.[42] C.D. Polychronopoulos and U. Banerjee, “Speedup Bounds and Processor Allocationfor Parallel Programs on Multiprocessors,” International Conference on ParallelProcessing, pp. 96 1—968, Aug. 1986.[43] H.M. Su, and P.C. Yew, “Efficient Doacross Execution on Distributed Shared-Memory Multiprocessors,” International Conference on Parallel Processing, vol. 1,pp. 45-48, Aug. 1991.[44] P. Tang and G. Michael, “Chain-Based Partitioning and Scheduling of NestedLoops for Multicomputers,” International Conference on Parallel Processing, vol. 2,pp. 243-250, Aug. 1990.[45] S. Midkiff and D. Padua, “A Comparison of Four Synchronization OptimizationTechniques,” International Conference on Parallel Processing, vol. 2, pp. 9—16,Aug. 1990.[46] M. Berry et al., “The PERFECT Club Benchmarks: Effective Performance Evaluation of Supercomputers,” International Journal of Supercomputing Applications,vol. 3, pp. 5-40, Mar. 1989.157Chapter 7[471 W. Blume and R. Eigenmann, “Performance Analysis of Parallelizing Compilers onthe Perfect Benchmarks Programs.,” IEEE Transactions on Parallel and DistributedSystems, vol. 3, pp. 643-656, Nov. 1992.[481 I. Gottlieb, “The Partitioning of QSDF Computation Graphs,” Parallel Computing,vol. 9, pp. 347-358, Feb. 1989.[49] V. Sarkar, “Determining Average Program Execution Times and their Variance,”Programming Language Design and Implementation, pp. 298-3 12, 1989.[50] DEC, “ANSI User’s Guide,” DECmpp SX Programming Language, Feb. 1993.[51] Z. Li, “Compiler Algorithms for Event Variable Synchronization,” Universityof Illinois at Urbana-Champaign, Center for Supercomputing Research andDeveloment, CSRD Report No. 1082, Jun. 1991.[521 W. Shang and J.A.B. Fortes, “Independent Partitioning of Algorithms with UniformDependences,” Inter,wtional Conference on Parallel Processing, pp. 26—33, Aug.1988.[53] L. Lamport, “The Parallel Execution of DO loops,” Communication of the ACM,vol. 17, pp. 83—93, Feb. 1974.[54] M. Wolfe, “Loop Skewing: The Wavefront Method Revisited,” International Journalof Parallel Programming, vol. 15, pp. 279-293, Aug. 1986.[55] C.D. Polychronopoulos, D.J. Kuck, and D.A. Padua, “Execution of Parallel Loopson Parallel Processors Systems,” International Conference on Parallel Processing,pp. 5 19-527, Aug. 1986.[56] C. Ancourt and F. Triolet, “Scanning Polyhedra with DO Loops,” ACM SIGPLANConference on Principles and Practice of Parallel Programming, pp. 39—50, 1991.[57] V. Balasundaram, G. Fox, K. Kennedy, and U. Kremer, “A Static PerformanceEstimator to Guide Data Partitioning Decisions,” Third ACM SIGPLAN Symposiumon Principle and Practices of Parallel Programming, pp. 213—223, Apr. 1991.[58] V. Balasundaram and K. Kennedy, “A Technique for Summarizing Data Access andits Use in Parallelism Enhancing Transformations,” ACM SIGPLAN Conference onProgramming Language Design and Implementation, Jun. 1989.158chapter 7[591 H.M. Gemdt and H.P. Zima, “MIMD Parallelization for SUPERNUM,” International Conference on Supercomputing, Jun. 1987.[601 H.P. Zima, H.-J. Bast, and H.M. Gerndt, “SUPERB - a Tool for Semi-automaticMIMD/SIMD Parallelization,” Parallel Computing, vol. 6, PP. 1-18, 1988.[611 C. Koelbel, P. Mehrotra, and J. Van Rosendale, “Semi-automatic Domain Decomposition in BLAZE,” International Conference on Parallel Processing, pp. 521—525,Aug. 1987.[62] S.P. Midiciff and D.A. Padua, “Compiler Algorithms for Synchronization,” IEEETransactions on Computers, pp. 1485—1495, Dec. 1987.[63] M. Gupta, and P. Banerjee, “Automatic Data Partitioning on Distributed MemoryMultiprocessors,” The Sixth Distributed Memory Computing Conference Proceedings, pp. 43—50, Apr. 1991.[64] 5. Hiranandiani, K. Kennedy, and C. Tsen, “Compiling Fortran D for MIMDDistributed-Memory Machines.,” Communication of ACM, pp. 66—80, Aug. 1992.[65] J. Li and M. Chen, “The Data Alignment Phase in Compiling Programs forDistributed Memory Machines,” Journal of Parallel and Distributed Computing,1991.[661 J. Li and M. Chen, “Index Domain Alignment: Minimizing Cost of Cross-ReferenceBetween Distributed Arrays,” 3rd Symposium on Frontiers of Massively ParallelComputation, pp. 424-433, Oct. 1991.[67] R. Gupta, “Synchronization and Communication Cost of Loop Partitioning onShared-Memory Multiprocessor Systems,” International Conference on ParallelProcessing, vol. 2, pp. 23-30, Aug. 1989.[681 T.H. Tzen, and L.M. Ni, “Dependence Uniformization: A Loop ParallelizationTechnique,” IEEE Transactions on Parallel and Distributed Systems, pp. 547—558,May 1993.[691 M.E. Wolf and M.S. Lam, “Maximizing Parallelism via Loop Transformations,”Third Workshop on Languages and Compilers for Parallel Computing, Aug. 1990.159Chapter 7[70] K.C. Kim and A. Nicolau, “Parallelizing Non-Vectorizable Loops for MIMDMachines.,” International Conference on Parallel Processing,, vol. 2, pp. 114—118,Aug. 1990.[711 X. Kong, D. Klappholz, and K. Psarris, “The I Test: A New Test for Subscript DataDependence,” International Conference on Parallel Processing, Aug. 1990.[72] Z. Shen, Z. Li, and P.C. Yew, “An Empirical Study on Array Subscripts and DataDependencies,” International Conference on Parallel Processing, vol. 2, pp. 145—152, Aug. 1989.[73] G. Amdahl, “The Validity of the Single Processor Approach to Achieving LargeScale Computing Capabilities,” AFIPS Conference Proceedings, vol. 30, pp. 483-485, Oct. 1967.[74] P. Feautrier, “Array Expansion,” International Confereiwe on Supercomputing,pp. 429-441, Jul. 1988.[75] Z. Li, “Array Privatization for Parallel Execution of Loops,” International Conference on Supercomputing, Jul. 1992.[76] A. Zaafrani and M.R. Ito, “Efficient Execution of Doacross Loops on DistributedMemory Machines,” IFIP Conference On Architectures and Compilation Techniquesfor Fine and Medium Grain Parallelism, pp. 27-36, Jan. 1993.[77] A. Zaafrani and M.R. Ito, “Transformation of Doacross Loops on DistributedMemory Systems,” 7th International Parallel Processing Symposium, pp. 856—863,Apr. 1993.[78] A. Zaafrani and M.R. Ito, “Efficient Execution of Doacroos Loops with UniformDependencies on Multicomputers,” Accepted in the Journal of Parallel andDistributed Computing, 1993.[79] A. Zaafrani and M.R. Ito, “Partitioning the Global Space for Distributed MemorySystems,” Supercomputing 93, pp. 327—336, Nov. 1993.[801 A. Zaafrani and M.R. Ito, “Expressing Cross Loop Dependencies through Hyperplane Data Dependence Analysis,” Supercomputing 94, Nov. 1994.160chapter 7[811 A. Zaafrani and M.R. Ito, “Parallel Region Execution of Loops with IrregularDependencies,” International Confernce on Parallel Processing, pp. 11—19, Aug.1994.[821 A. Zaafrani and M.R. Ito, “Parallelizing Loops with Complex and IrregularDependencies,” Sent to the IEEE Transactions on Parallel and Distributed Systems,1994.161
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Transportations and efficient parallel execution of...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Transportations and efficient parallel execution of loops with dependencies Zaafrani, Abderrazek 1994
pdf
Page Metadata
Item Metadata
Title | Transportations and efficient parallel execution of loops with dependencies |
Creator |
Zaafrani, Abderrazek |
Date Issued | 1994 |
Description | Loops are the main source of parallelism in scientific programs. Hence, several techniques were developed to detect parallelism in these loops and transform them into parallel forms. In this dissertation, compile time transformations and efficient parallel execution of loops with various type of dependencies are investigated. First, Doacross loops with uniform dependencies are considered for execution on distributed memory parallel machines (multicomputers). Most known Doacross loop execution techniques can be applied efficiently only to shared memory parallel machines. In this thesis, code reordering technique, improvements to partitioning strategies, and finding a balance between communication and parallelism are presented to reduce the execution time of Doacross loops on multicomputers. As with most paralleizing transformation techniques, only single loopnests are considered in the first part of this dissertation. However, parallelizing each loopnest in a program separately, even if an optimal execution can be obtained for each loopnest, may not result in an efficient execution of all the code in the program because of communication overhead across the loops in a multicomputer environment. Hence, across loop data dependence representation and analysis are introduced in this work to improve the parallel execution of the whole code. Our contribution consists of finding and representing data dependencies whose sources and destinations are subspaces of the iteration space mainly common across the loops. This type of dependence information is used in this thesis to improve global iteration space partitioning, automatic generation of communication statements across ioops, and index alignment. The final part of this dissertation presents new parallelizing techniques for loops with irregular and complex dependencies. Various data dependence analysis algorithms can be found in the literature even for loops with complex array indexing. However, the improvement in data dependence testing has not been followed by similar amelioration in restructuring transformations for loops with complex dependencies. Such loops are mostly executed in serial mode. Our parallelizing techniques for these loops consists of identifying regions of the iteration space where all iterations can be executed in parallel. The advantages of all the transformations presented in this dissertation are: 1) they significantly reduce the execution time of ioops with various types of dependencies as shown in this work using the MasPar machine 2) they can be implemented at compile time which makes the task of parallel programming easier. |
Extent | 4833225 bytes |
Genre |
Thesis/Dissertation |
Type |
Text |
FileFormat | application/pdf |
Language | eng |
Date Available | 2009-06-05 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0065118 |
URI | http://hdl.handle.net/2429/8827 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 1995-05 |
Campus |
UBCV |
Scholarly Level | Graduate |
AggregatedSourceRepository | DSpace |
Download
- Media
- 831-ubc_1995-983736.pdf [ 4.61MB ]
- Metadata
- JSON: 831-1.0065118.json
- JSON-LD: 831-1.0065118-ld.json
- RDF/XML (Pretty): 831-1.0065118-rdf.xml
- RDF/JSON: 831-1.0065118-rdf.json
- Turtle: 831-1.0065118-turtle.txt
- N-Triples: 831-1.0065118-rdf-ntriples.txt
- Original Record: 831-1.0065118-source.json
- Full Text
- 831-1.0065118-fulltext.txt
- Citation
- 831-1.0065118.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
https://iiif.library.ubc.ca/presentation/dsp.831.1-0065118/manifest