Thread-Parallel Mesh Generation andImprovement Using Face-EdgeSwapping and Vertex InsertionbyReza ZangenehBASc., Mechanical Engineering, University of Tehran, 2012A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF APPLIED SCIENCEinThe Faculty of Graduate and Postdoctoral Studies(Mechanical Engineering)THE UNIVERSITY OF BRITISH COLUMBIA(Vancouver)November 2014© Reza Zangeneh 2014AbstractThe purpose of this thesis is three-fold. First, we devise a memory model forunstructured mesh data for efficient use of memory on parallel shared mem-ory architectures with the purpose of lowering the synchronization overheadbetween threads and also excluding the probability of occurring race con-ditions. Second, we present a new thread-parallel edge and face swappingalgorithm for two and three dimensional meshes using OpenMP for sharedmemory architectures. We show how removing the conflicts from the recon-figuration procedure by applying a vertex locking strategy can result in anear linear speed-up with parallel efficiency of close to one on two threadsand 70% with sixteen threads on shared-memory processors. Finally, wederive a parallel mesh generation and refinement module for shared memoryarchitectures based on pre-existing serial modules — GRUMMP — by im-plementing Chernikov and Chrisochoides’ parallel insertion algorithm alongwith the two above tools. Experiments show a worst case parallel efficiencyof 60% for parallel refinement with 16 threads.iiPrefaceThe research ideas and methods explored in this thesis are the fruits of a closeworking relationship between Dr. Carl Ollivier-Gooch and Reza Zangeneh.The implementation of the methods, the data analysis, and the manuscriptpreparation were done by Reza Zangeneh with invaluable guidance from Dr.Carl Ollivier-Gooch throughout the process.iiiTable of ContentsAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiPreface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iiiTable of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . ivList of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viList of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viiList of Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . ixAcknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . x1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11.2 Platform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41.3 Objectives and Layout of the Thesis . . . . . . . . . . . . . . 42 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.1 Delaunay Triangulation . . . . . . . . . . . . . . . . . . . . . 72.2 Mesh Topology Improvement . . . . . . . . . . . . . . . . . 112.2.1 Face Swapping . . . . . . . . . . . . . . . . . . . . . . 112.2.2 Edge Swapping . . . . . . . . . . . . . . . . . . . . . 122.3 Vertex Insertion . . . . . . . . . . . . . . . . . . . . . . . . . 172.3.1 Watson’s Algorithm . . . . . . . . . . . . . . . . . . . 172.3.2 Bowyer’s Algorithm . . . . . . . . . . . . . . . . . . . 182.3.3 Green and Sibson’s Algorithm . . . . . . . . . . . . . 20ivTable of Contents2.4 Parallel Mesh Generation and Improvement . . . . . . . . . . 202.4.1 Distributed Memory Parallel Programming . . . . . . 232.4.2 Shared Memory Parallel Programming . . . . . . . . 263 Memory Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 313.1 Race Condition . . . . . . . . . . . . . . . . . . . . . . . . . 313.2 False Sharing . . . . . . . . . . . . . . . . . . . . . . . . . . . 333.3 Serial Memory Model . . . . . . . . . . . . . . . . . . . . . . 343.4 Parallel Memory Model . . . . . . . . . . . . . . . . . . . . . 364 Parallel Edge and Face Swapping . . . . . . . . . . . . . . . . 424.1 2D Swapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 424.2 3D Swapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 454.3 Parallel Performance and Speed-up Results . . . . . . . . . . 484.3.1 Speed-up . . . . . . . . . . . . . . . . . . . . . . . . 514.3.2 Load Balance . . . . . . . . . . . . . . . . . . . . . . 545 Parallel Vertex Insertion . . . . . . . . . . . . . . . . . . . . . 615.1 Chernikov and Chrisochoides’ Algorithm . . . . . . . . . . . 635.2 Parallel Performance and Speed-up Results . . . . . . . . . . 665.2.1 Speed-up . . . . . . . . . . . . . . . . . . . . . . . . . 685.2.2 Load Balance . . . . . . . . . . . . . . . . . . . . . . 736 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . 766.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 766.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 77Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79Appendices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85vList of Tables2.1 Possible triangulation in edge swapping. . . . . . . . . . . . . 165.1 Breakdown of the refinement time. . . . . . . . . . . . . . . . 715.2 Comparison of 2D speedup results generated by GRUMMPand Chernikov’s algorithm. . . . . . . . . . . . . . . . . . . . 735.3 Comparison of 3D speedup results generated by GRUMMPand Chernikov’s algorithm. . . . . . . . . . . . . . . . . . . . 73viList of Figures2.1 In-circle test. . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.2 The polyhedron undergoing face-swapping. . . . . . . . . . . 132.3 Eligibility of face swapping. . . . . . . . . . . . . . . . . . . . 132.4 Edge swapping from 5 tetrahedra to 6 . . . . . . . . . . . . . 142.5 All the possible canonical configurations after edge swapping. 152.6 Delaunay triangulation with new point Q added . . . . . . . 192.7 Inserting a new vertex in Green and Sibson. . . . . . . . . . . 202.8 Forward propagation of edge swapping . . . . . . . . . . . . . 212.9 Fork-Join model for shared memory parallel programing. . . 222.10 MPI parallel model . . . . . . . . . . . . . . . . . . . . . . . . 222.11 N-way partitioning. . . . . . . . . . . . . . . . . . . . . . . . . 242.12 Multilevel recursive bisection. . . . . . . . . . . . . . . . . . . 252.13 Domain Decomposition. . . . . . . . . . . . . . . . . . . . . . 252.14 Mesh partitioning and boundary entities . . . . . . . . . . . . 262.15 Mesh migration. . . . . . . . . . . . . . . . . . . . . . . . . . 273.1 Occurrence of false sharing, Courtesy of Intel . . . . . . . . . 343.2 Memory model of GRUMMP . . . . . . . . . . . . . . . . . . 374.1 2D swapping independency . . . . . . . . . . . . . . . . . . . 444.2 Vertices to be locked in face swapping . . . . . . . . . . . . . 474.3 Vertices to be locked in edge swapping . . . . . . . . . . . . . 484.4 2D boundary domain sample. . . . . . . . . . . . . . . . . . . 494.5 3D boundary domain sample. . . . . . . . . . . . . . . . . . . 504.6 Quality of the mesh prior to swapping operations. . . . . . . 51viiList of Figures4.7 Quality comparison of meshes generated by serial and parallelface swapping algorithms in 2D. . . . . . . . . . . . . . . . . . 524.8 Quality comparison of meshes generated by serial and parallelface swapping algorithms in 3D. . . . . . . . . . . . . . . . . . 524.9 The running time for 2D face swapping. . . . . . . . . . . . . 544.10 The running time for 3D face swapping. . . . . . . . . . . . . 554.11 Comparison of our parallel swapping algorithm and Rokos’on a 2D mesh. . . . . . . . . . . . . . . . . . . . . . . . . . . 564.12 Load-balance for 2D face swapping. . . . . . . . . . . . . . . . 574.13 Load-balance for 3D face and edge swapping . . . . . . . . . 584.14 Effects of keeping load balance in different phases of swappingalgorithm in 2D. . . . . . . . . . . . . . . . . . . . . . . . . . 605.1 Non-conformal mesh. . . . . . . . . . . . . . . . . . . . . . . . 625.2 Non-Delaunay mesh. . . . . . . . . . . . . . . . . . . . . . . . 625.3 An example of using independent set of leaves for safely in-serting vertices in parallel . . . . . . . . . . . . . . . . . . . . 655.4 Quality comparison of 2D meshes generated by serial andparallel algorithms. . . . . . . . . . . . . . . . . . . . . . . . . 685.5 Quality comparison of 3D meshes generated by serial andparallel algorithms. . . . . . . . . . . . . . . . . . . . . . . . . 695.6 The running time for 2D mesh refinement. . . . . . . . . . . . 705.7 The running time for 3D mesh refinement. . . . . . . . . . . . 705.8 Sensitivity of refinement and swapping time with the size ofthread private containers. . . . . . . . . . . . . . . . . . . . . 725.9 Load balance in 2D refinement. . . . . . . . . . . . . . . . . . 745.10 Load balance in 3D refinement. . . . . . . . . . . . . . . . . . 75viiiList of Algorithms2.1 Ruppert’s Delaunay refinement . . . . . . . . . . . . . . . . . 102.2 Watson’s insertion algorithm . . . . . . . . . . . . . . . . . . 182.3 Algorithm 1 on dealing with the boundary entities. . . . . . . 272.4 Algorithm 2 on dealing with the boundary entities. . . . . . . 283.1 Occurrence of data race condition problem . . . . . . . . . . . 323.2 Occurrence of false sharing in a shared array . . . . . . . . . 333.3 Corrected algorithm by adopting private variables. . . . . . . 353.4 Corrected algorithm by aligning arrays with the cache line. . 353.5 Getting a new entity . . . . . . . . . . . . . . . . . . . . . . . 383.6 Allocating pointers of unused entities to the threads on the fly 403.7 Synchronization at the entrance and exit of the parallel section 414.1 Serial face swapping algorithm . . . . . . . . . . . . . . . . . 434.2 Vertex locking strategy. . . . . . . . . . . . . . . . . . . . . . 464.3 Parallel face swapping algorithm . . . . . . . . . . . . . . . . 475.1 Parallel Delaunay Insertion . . . . . . . . . . . . . . . . . . . 67ixAcknowledgmentsFirst and foremost, I would like to express my sincere appreciation to mywonderful family, especially my mother, Halimeh, and my father, Hossein,for their love and encouragement all through my life.I would also like to acknowledge my research supervisor, Dr. CarlOllivier-Gooch, whose constant support, patience, and guidance has beeninvaluable throughout the course of this research.I am also grateful to my dear colleagues at ANSLab research group,Gary Yan, and Chandan Sejekan who patiently helped with proofreadingthis thesis.xChapter 1Introduction1.1 MotivationMesh generation and adaptation are part of the fundation for finite vol-ume and finite element analysis. Automatic unstructured mesh generationhas been investigated extensively. However, in the era of ever developingcomputer technology, serial mesh generation algorithms have become a bot-tleneck in the overall performance of many finite volume and finite elementanalyses. For many cases resolution in time and space is a limiting factorwhich is highly dependent on the available computational resources. Forexample, a three-dimensional direct numerical simulation of turbulent flowrequires N3 mesh points satisfyingN3 ≥ Re9/4 = Re2.25,which means that with the increase of Re number, the number of meshpoints increases as Re9/4, [27]. Thus, the mesh generation and adaptationtime increases accordingly. In another example, if a serial CFD simulationis comprised of 90% of the time spent in a well-scalable finite volume orfinite element solver, and 10% of the time is consumed in a non-scalablemesh adaptation module, the latter will eventually become a bottleneckin the overall performance. According to Amdahl’s law (Gustafson [24]),the maximum speed-up will not exceed 10 no matter how many CPUs weare using. If P is the portion of the code which is parallelized and N isthe number of processors, the maximum speed-up that can be achievedaccording to Amdahl’s law is:11.1. MotivationS =[PN + (1− P )]−1=[0.9∞ + (1− 0.9)]−1≈ 10Moreover, it is generally observed that the microprocessor industry hasreached its limit on significantly increasing the performance of single coreswhile the number of cores in each processors is increasing. Today, it is com-mon to see laptops and personal computers with 4-8 cores. Also, we can rea-sonably anticipate supercomputers exploiting hundreds or thousands of coresin intra-node parallelization sharing a single memory. These facts motivateus to develop a shared-memory parallel mesh generation and improvementmodule which could be used in conjunction with finite volume and finite el-ement solvers. To accomplish this, we have looked deeply into the problemof threaded mesh generation and refinement algorithms on shared memoryarchitectures. This specifically excludes parallelization on distributed mem-ory architectures, where domain decomposition and the Message PassingInterface (MPI) are the dominant approach.The field of sequential Delaunay mesh refinement and generation hasbeen developed extensively (See [12, 13, 21, 47, 41, 42] and the referencestherein). Amongst these algorithms, Ruppert’s 2D Delaunay refinementalgorithm [42] and Shewchuk’s 3D counterpart [46] are implemented in off-the-shelf mesh generation and refinement modules such as Triangle [45] andGRUMMP [36]. As the need for larger and more complicated meshes in fi-nite volume and finite element analyses grows, attempts to develop parallelmesh generation modules are accelerating. One of the first parallel meshgeneration algorithms was Chow’s 1980 PhD thesis [14]. However, this al-gorithm was not practical. Since then it has been a common practice todecompose the mesh and allocate each subsection to one processor to meshit sequentially [16, 25, 26, 32, 39, 31, 3, 43]. These attempts brought updifferent methods to decompose a given domain. Despite their differences,they all try to construct separators such that the domain is decomposed intosub-domains, while creating no new features like small angles. A thoroughsurvey of these approaches has been carried out by Chrisochoides [15]. Naveet al. [35] presented a guaranteed-quality parallel mesh generation algorithm21.1. Motivationfor polyhedral domains. However, their algorithm lacks good performancedue to the imposed rollbacks1 whenever concurrent point insertions lead toinvalid meshes. In another work, Edelsbrunner and Guoy [19] presented acondition for independent insertion of two vertices. Their approach checkswhether the cavities of two candidate Steiner points are disjoint. This con-dition ensures the mesh will be conformal. However, it is work intensiveas it requires checking every pair of vertices for independence before inser-tion. Spielman et al. [49] presented theoretical polylog parallel time Delau-nay refinement meshing algorithms with provable termination and qualitybound; however they did not present any practical results. Chernikov andChrisochoides [6, 7, 8, 9, 10] proposed distributing the data rather thandecomposing the domain. In this way, mesh remains intact while indepen-dent sets of data are distributed between processors. In their work, theyproved the required theoretical and practical conditions to use Quad/Octtree structures to guide parallel Delaunay triangulation on both shared anddistributed memory architectures.Once mesh generation is complete, mesh improvement via edge and faceswapping can take up to 20% of runtime depending on the refinement andswapping criteria. This fact drove us to multi-thread the swapping process.Surprisingly, there is not much research on parallel swapping; it seems thatresearchers have mostly focused on the refinement process. Recently, Rokoset al. [40] proposed a parallel edge swapping algorithm for anisotropic meshgeneration and adaptation which suffers from poor scalability. They haveachieved parallel efficiency of 60% on 8 cores and 40% on 16 cores. Althoughtheir algorithm uses a vertex locking strategy, they have not specified theinherent data dependencies in advance. Consequently they may end upswapping two adjacent edges and introducing data structure hazards. Toavoid these data structure hazards, they have opted to use deferred updatesto the adjacency information which requires setbacks2 and threads leaving1Rollback is an operation which for example removes a vertex after it has been inserteddue to problems posed by itself or other processors.2Setback refers to a case when a thread postpones its current task until an anothertask is done.31.2. Platformthe task of swapping independent sets to update the data connectivitieswhich kills the performance. Also, there is no extension of their 2D algorithmto 3D. Their algorithm, along with its advantages and disadvantages, isexplained more thoroughly in Chapter 2. In our parallel swapping algorithmwe are aiming for greater parallel efficiency and approximately linear speed-up.1.2 PlatformFor shared memory architectures, parallelization is usually implemented bythreads. Multi-threading is mostly managed by either POSIX threads [1]or OpenMP [2] on shared memory architectures. The former is a low levelprogramming paradigm which provides extremely fine-grained control overthread management but also more complexity issues. Thus it would be morechallenging and work-intensive to get the same speed-up and scalability. Onthe other hand, OpenMP is much higher level. OpenMP, a standard jointlydefined by a group of major computer hardware and software vendors, iswell supported in various compilers such as GCC, IBM, Oracle, and Intelon many operating systems. Also, its continually evolving features suchas work-sharing, data-sharing, and synchronization constructs enable thecode developer to divide workload across multiple threads, impose synchro-nization, and manage parallelization with relative ease. For these reasons,OpenMP — specifically OpenMP 3.1 and higher which supports some newfeatures such as tasks and extended atomic constructs — is the choice ofinterest. All the code associated with this project is written in C++ andembedded in an object oriented manner in the existing in-house sequentialmesh generation and refinement modules of GRUMMP. Compiling is donewith GCC 4.9.1 which supports OpenMP up to version 4.0.1.1.3 Objectives and Layout of the ThesisIn this thesis, we develop a new thread-parallel edge and face swapping ap-proach for shared memory architectures which to the best of our knowledge,41.3. Objectives and Layout of the Thesisis the first of its kind. This parallel swapping algorithm offers good scalabil-ity and speedup; also it exploits the maximum code re-use. Parallelization isbased on data distribution and work-sharing rather than well-known domaindecomposition and message passing algorithms which divide the mesh intosub-domains and then mesh each sub-domain sequentially. Data dependencyin both 2D and 3D is identified. A vertex locking strategy is introduced toremove dependencies in edge and face swapping. This rejects some of theproblematic candidate faces and edges to be swapped so that the remainingbucket of faces and edges are reconfigured without conflict in parallel.Then, Ruppert’s [42] and Shewchuk’s [44] insertion algorithms are par-allelized on shared memory architectures using the theoretical frameworkdeveloped by Chernikov et al [6, 7, 8, 9, 10]. This parallel vertex insertionis based on construction of Quad/Oct tree structures which guide threadsthrough the mesh to keep them at a safe conflict-free distance without de-composing the mesh.There are three main requirements which need to be met in every goodimplementation of parallel meshing algorithms: stability, scalability, andcode re-use. Stability determines the quality bounds of meshes generatedby the parallel algorithm compared to its sequential counterpart. Scalabil-ity refers to whether the speed-up is proportional to the number of threads,which is of utmost importance in most endeavors to parallelize. Finally, thegoal is to achieve the maximum code reuse of highly optimum and continu-ously developing sequential meshing libraries. In this thesis, we also seek togain maximum speed-up while presenting meshes whose qualities are com-parable to those that result from the sequential algorithm. Throughoutthis work the practical aspects of algorithms have been investigated whileaddressing all these three mentioned criteria on up-to-date multi-core hard-wares.The remainder of this work is laid out as follows: in Chapter 2, the nec-essary background on Delaunay triangulation, face and edge swapping andthe general procedures for mesh refinement and improvement are presented.In Chapter 3, data storage and memory access patterns in GRUMMP areexplained. Also, some non-trivial drawbacks of naively accessing memory in51.3. Objectives and Layout of the Thesisparallel are pointed out along with their workarounds. Finally, the propermemory model for thread-parallel meshing, on shared memory architecturesto preclude data race conditions and false sharing problems is explained. InChapter 4, a new thread-parallel algorithm for edge and face swapping ispresented. In Chapter 5, practical implementation aspects of Chernikov’sparallel insertion algorithm are discussed. A brief guide to parallel pro-gramming with OpenMP is appended for the readers not acquainted withOpenMP.6Chapter 2BackgroundIn this chapter, a brief review of triangulation, specifically Delaunay tri-angulation, along with mesh topology improvement techniques is presented.First, Delaunay triangulation and its most important features are explained.Second, relevant mesh modification techniques to this thesis such as face andedge swapping, and vertex insertion algorithms are summarized. Finally,some leading works on different parallel mesh generation and adaptation al-gorithms with their advantages and disadvantages are presented. This willillustrate the existing stance on the parallel mesh generation and adaptationproblem and will pave the way for our parallel algorithms explained in thenext chapters.2.1 Delaunay TriangulationA triangulation of a set of vertices V refers to any set of triangles or tetrahe-dra whose vertices are V that covers the whole interior of the given bound-ary. Delaunay triangulation which is named after Boris Delaunay [18] is asuitable technique to find a triangulation for a given specific domain. Delau-nay triangulation according to Shewchuk [46] is defined as a triangulationin which the circumball of every triangle or tetrahedron is empty of othermesh vertices. A circumball of a triangle or tetrahedron is the unique circleor sphere which passes through all the vertices of the triangle or tetrahe-dron respectively. Because no other points lie inside the circumball, all thevertices are far from each other.Numerous properties of Delaunay triangulation or Delaunay refinementalgorithms have made them popular amongst the algorithms for triangula-tion constructions. Some of the important properties and characterization72.1. Delaunay Triangulation(a) Delaunay (b) Not DelaunayFigure 2.1: In-circle test for △ ABC and point Dof these triangulations are as follows. Unfortunately not all of them couldbe extended to 3D.• Uniqueness. The Delaunay triangulation of a vertex set is uniqueas long as no four points (or five points in 3D) are co-circular (co-spherical).• The circumball criterion. A triangulation T of a set of points P is aDelaunay triangulation DT (P ) if the circumball of any triangle of Tdoes not contain a point of P in its interior. The circumball test whichchecks for the validity of the Delaunay triangulation of four points isshown in Figure 2.1.• Edge circle/sphere property. Two points pi, pj ∈ P form an edge ofDT (P ) if there is a closed disc/ball C that contains pi and pj on itsboundary and does not contain any other point of P . This propertyis used mainly to create meshes conforming with a boundary.• Equiangularity property. In the plane, the Delaunay triangulationmaximizes the minimum angle of the triangulation in compared to anyother triangulation of the points. Hence, the Delaunay triangulationis also called the MaxMin triangulation. Consequently, if anisotropic82.1. Delaunay Triangulationfeatures are not sought, the Delaunay Triangulation, which attemptsto equate the size of angles, is the optimal choice of triangulation.Finite element and finite volume methods typically require meshes with spe-cific quality and size bounds. The angles should not be too small or too largeand the size of elements is important. The Delaunay triangulation can guar-anty such constraints. Delaunay refinement meshing algorithms begin withan initial mesh overlaying the input vertices and conforming to the inputboundary. This initial mesh is not required to be well-shaped. Ruppert [42]was the first to show that proper application of Delaunay triangulation canlead to a well-shaped and quality bounded mesh with good grading. Lateron Shewchuk [44] extended the Delaunay refinement algorithm to 3D. Rup-pert’s Delaunay refinement algorithm chooses two sets of vertices to insert.The first category of Steiner points in Ruppert’s algorithm are the circum-centers of poor quality triangles. Thereby, Delaunay refinement algorithmswhich insert new vertices at the circumcenters of poor triangles or tetraherdaare guaranteed to have new vertices far from existing ones. The second cate-gory is the mid-points of boundary segments which are encroached upon. Aboundary segment is said to be encroached upon, if the circle which passesthrough its two end-points contains other vertices in the mesh. Then thisedge is split before any other vertices are inserted. The pseudo-code inAlgorithm 2.1 shows the algorithm which Ruppert developed for Delaunayrefinement. This incremental insertion algorithm itself can employ severalvertex insertion algorithms such as Bowyer’s [5] and Watson’s [52] which areexplained in Section 2.2.2. Also, Shewchuk’s scheme is similar to Ruppert’swith the only difference that there is one additional type of encroachmentconsidered in the Shewchuk’s algorithm.There are other algorithms to construct a Delaunay triangulation. Forexample, Lawson’s algorithm [30] solely employs edge swapping to constructDelaunay triangulation. This algorithm assumes that a non-Delaunay meshalready exists, then makes it Delaunay with edge swapping. In Lawson’salgorithm all the interior edges are examined to be swapped. If they are as-sociated with a triangle which fails the in-circle test, then they are swapped.92.1. Delaunay TriangulationAlgorithm 2.1 Ruppert’s Delaunay refinement with a desired minimumangle αInput: A planar straight line graph (PSLG) is the input.Output: A Delaunay triangulation with the desired minimum angle α.1. Let S be the union of boundary segments in PSLG.2. Let V be the union of the circumcenters of the poor quality triangles.3. Until there is no segment encroached upon and no poor quality trian-gle:(a) While any segment s is encroached upon:i. Split segment s.(b) Let v ∈ V be the circumcenter of a poor quality triangle(c) If v encroaches upon segment s1 · · · sk ∈ S, theni. for i = 1 · · · k:A. Split segment si.ii. elseA. Insert circumcenter v.4. Update S and V .102.2. Mesh Topology ImprovementAlso, in 2D the equiangularity property of Delaunay triangulation can beused for the swapping criterion with the condition that the newly createdtriangles have angles in a closer range. A thorough survey of these Delaunaytriangulation methods is presented in Barth’s notes [4] and in more detailin Shewchuk’s book [47].2.2 Mesh Topology ImprovementAlgorithms for mesh improvement are normally divided into four categories:face or edge swapping, vertex insertion, vertex deletion, and smoothing. Inthis thesis, the focus is solely on the first two techniques. Face or edgeswapping improves the mesh connectivity. In our in-house code, GRUMMP,we apply swapping in two places. First, after mesh refinement, swappingis applied to all faces as an initial procedure to remove the poor qualitytriangles or tetrahedra. Second, swapping is applied as part of a compoundprocedure with smoothing and point insertion to remove the poorest qualitytriangles or tetrahedra from the mesh in a targeted way. The poor qualitytriangles or tetrahedra are typically defined as the ones with either too smallor too large dihedral angles3 (See Freitag and Ollivier-Gooch [20] for moredetails) but other geometric criteria can be used. In an isotropic mesh,swapping is done if the quality of every cell in the the new configurationis better than the worst of the previous configuration. In the case of ananisotropic mesh, a swap operation is done if the average circumradius ofthe new triangles is smaller than the average circumradius of the old onesin the metric space (See Pagnutti [38]).2.2.1 Face SwappingAccording to Lawson[30], given a convex set of d+ 2 points in Rd there areexactly two valid ways of triangulating it. In 2D for each given set of 4vertices forming a convex quadrilateral, there are exactly two configurationsto choose between. By quick comparison of these two configurations the3The angle between two planes.112.2. Mesh Topology Improvementbest can be picked by alternating the diagonal.Face swapping in 3D reconnects a polyhedron consisting of two tetra-hedra, made up of a total of five points, separated by a common face; thisis an extension of 2D diagonal swapping in a quadrilateral. Face swappingremoves the shared face from the mesh by means of creating another edge.The so-called “2-3” swapping as shown in Figure 2.2a swaps the commonface of two tetrahedra for an edge common to three new tetrahedra makingup the same polyhedron. In terms of validity, a face is swappable if and onlyif its polyhedron is convex which means for example in Figure 2.3 that seg-ment αβ passes through the triangle M1M2M3. Hence, before swapping thepolyhedron αM1M2M3β is comprised of tets αM1M2M3, and βM1M2M3,while afterwards it is repartitioned into αM1M2β, αM2M3β, and αM1M3β.Moreover, there is another configuration which is only possible when thegenerating polygon is planar. For example in the Figure 2.2b the two co-planar triangles alternate diagonal to form a new configuration. The twoco-planar faces must either be at boundary (“2-2” swapping) or be backedby another pair of tetrahedra (“4-4” swapping). It is notable to mentionwhen choosing between two and three tet configurations, we choose twotets, provided they have the same quality. Also, we opt not to swap in thetwo for two case when the quality does not change after swapping. Thesechoices prevent infinite recursion.2.2.2 Edge SwappingGiven an edge αβ as the generating edge of a shell withN tetrahedra incidenton it, edge swapping removes the edge and reconnects the shell with 2N − 4new tetrahedra provided that the every new tetrahedron has better qualitythan the worst of the N original tetrahedra.Definition 1. Given an edge, a shell is the polyhedron made up of tetrahe-dra sharing this edge. The corresponding edge is called the generating edge.All the vertices of the polyhedron other than the ones at the end-points ofthe generating edge form the generating polyhedron.For example, consider 5 tetrahedra incident on the edge αβ, perpendic-122.2. Mesh Topology Improvement(a) “2-3” face swapping (b) “2-2” co-planar face swappingFigure 2.2: The polyhedron made up of two tets undergoing face-swapping.Figure 2.3: Eligibility of face swapping. Left, segment αβ cuts throughgenerating polygon M1M2M3. Right, The polyhedron is not convex so face-swapping is not allowed132.2. Mesh Topology Improvement(a) The original shell with 5 tets (b) Final shell with 6 tetsFigure 2.4: Edge swapping from 5 tetrahedra to 6ular to the page forming a shell with 7 vertices as shown in Figure 2.4a.Edge swapping removes the edge αβ and then reconfigures the shell by re-connecting the two end-points α and β to the newly created triangles on theequatorial polygon as sketched in Figure 2.4b. From the topological pointof view all the possible remeshings of the shell are categorized into differenttriangulations of the generating equatorial polygon. Figure 2.5 shows all thepossible triangulations of the equatorial polygons along with their uniquenumber of rotations. This approach to access and store the possible config-urations for swapping reduces the cost of choosing the best configuration asonly canonical configurations are stored and then all the other rotations arederived from them.George[22] showed that the number of possible triangulations for n tetra-hedra incident on an edge can be calculated easily byNn = Cat(n − 1 ) wherethe Catalan number is Cat(n) = (2n−2 )!n!(n−1 )! . On the other hand the number ofdifferent triangles on the equatorial plane is measured readily byTrn = C3n =n(n − 1 )(n − 2 )3 ! .As each triangle on the generating plane is attached to two endpoints,the different number of tetrahedra is measured by142.2. Mesh Topology ImprovementFigure 2.5: All the possible canonical configurations after edge swappingwith 4 ≤ N ≤ 7 number of faces incident on the generating edge along withthe number of unique rotations for each configuration.152.2. Mesh Topology ImprovementN 2N − 4 Nn Trn Tets×Nn3 2 1 1 24 4 2 4 85 6 5 10 306 8 14 20 1127 10 42 35 4208 12 132 56 15849 14 429 84 600610 16 1430 120 22880Table 2.1: Number of different tetrahedra, and possible triangulation versusthe number of sides of the generating polygonTetn = 2 × C3n =n(n − 1 )(n − 2 )3 .The number of unique tetrahedra and possible configurations are tabu-lated in Table 2.1 with different numbers of incident tets.Theoretical and Computational ComplexityThe number of candidate triangulations increases exponentially with n butthis is not the only complexity as the number of different triangles increasescubically. Therefore, as the quality of a set of elements is determined by thelowest quality of one of its elements, to determine which configuration hasthe best quality for a given shell, the quality of each tetrahedron for eachconfiguration needs to be computed, which is an enormous amount of workto do. Instead as is shown on the Table 2.1, the number of unique tetrahe-dra is far less than the number of possible configurations multiplied by thenumber of tetrahedra in each. Hence, an efficient approach is to evaluateeach unique tetrahedron once and store the value to use when evaluatingthe configuration quality. Then, as the quality of a shell is that of its worsttetrahedron, the trick to find the best configuration is to find the worsttetrahedron and pinpoint all the configurations that the noted tetrahedronappears on and remove the corresponding tetrahedron from the list of pos-sible configurations. Following this procedure [17], the only configuration162.3. Vertex Insertionretained is the one that does not have any of the worst tetrahedra. Sinceafter edge-swapping each configuration can be identified by an equatorialplane connected to two endpoints, a further efficiency gain is possible if wechoose to specify all the configurations by their canonical configurations asFreitag and Gooch [20] represented. In this manner, for the practical cases,3 ≤ n ≤ 7, canonical configurations along with the number of rotations aregiven in the Figure 2.5. This categorization of the possible post configura-tions dramatically reduces the amount of connectivity information to storeas we only need to store face-to-vertex, face-to-cell, and cell-to-face data forcanonical configurations. The connectivity information for other rotationsis calculated as needed.2.3 Vertex InsertionAlthough there are many incremental algorithms to construct a triangulationusing vertex insertion, the focus in this research is only on the Watsoninsertion algorithm[52], which constructs a Delaunay triangulation based onrefining a pre-existing Delaunay triangulation as it is the choice of vertexinsertion for the GRUMMP. However, some other incremental algorithmswill be briefly explained. In all incremental algorithms, existence of an initialtriangulation to include the new site is required. For creating the originaldomain, a triangular (tetrahedral in 3D) or quadrilateral (hexahedral in3D) initial domain can be created as a background. We shall assume theexistence of a valid mesh of the domain, and that all points to be insertedfull inside the domain or on its boundary.2.3.1 Watson’s AlgorithmWatson’s algorithm starts with locating the triangle that contains the newsite to be inserted. Then, all the triangles whose circumcircle encloses thenew site are found recursively. Removal of all these triangles introduces acavity in the mesh. This cavity is a polygon with edges visible to the newsite. Finally, all the vertices on the polygon are connected to the newly172.3. Vertex InsertionAlgorithm 2.2 Watson’s insertion algorithm1. Insert the new point associated with the bad triangle into the trian-gulation.2. Traverse mesh locally to find all the cells whose circumcircles containthis new point.3. List all the edges of the influenced zone.4. Remove all the edges from the list that appear more than once. (Theseare the interior edges).5. Reconnect the remaining edges to the new point.inserted (Steiner) point to form new triangles.Figure 2.6a shows the triangulation before inserting the new point, whileFigure 2.6b shows how the newly inserted point is connected to the verticesconstructing its corresponding cavity. The procedure starts with a Delaunaytriangulation, one terminates with Delaunay triangulation too. Hence, bypreserving the Delaunay nature of mesh, no swapping is needed after theinsertion. The Watson’s insertion is explained more in algorithm 2.2.2.3.2 Bowyer’s AlgorithmBowyer’s algorithm [5] inserts new vertices exactly in the same fashion asWatson’s does, to the point that many people confuse it with the Watsonalgorithm. The fundamental difference is that in Bowyer’s algorithm ev-erything works with the Voronoi graph while the Watson’s algorithm dealswith the triangulation plane. Similar to its counterpart, Bowyer’s algorithminserts a new point, finds its affected neighborhood, deletes all the edgescompletely in the effected territory, and then reconnects the remaining hullto the newly inserted point. Due to the similarities of the two algorithms,Algorithm 2.2 is referred to as the Bowyer-Watson algorithm.182.3. Vertex Insertion(a) Performing the incircle test and creating cavity(b) Reconnecting the new point to the vertices on the polygonFigure 2.6: Delaunay triangulation with new point Q added192.4. Parallel Mesh Generation and ImprovementFigure 2.7: Inserting of a new vertex and swapping the suspected edges inGreen and Sibson algorithm.2.3.3 Green and Sibson’s AlgorithmIn the Green and Sibson’s algorithm [23] when a point is inserted into themesh, instead of removing all the triangles with circumcircles encirclingthe new point, the new point is just connected to the three vertices of thecontaining cell (see Figure 2.7). If the new point happens to fall on anedge, the edge is removed and the point is connected to the newly createdquadrilateral. Nearby edges that fail to be Delaunay are reconfigured withswapping. All the new edges incident on the new point are proved Delaunay,so they do not need to be checked. If any edge is swapped, its neighboringedges must also be checked (see Figure 2.8).2.4 Parallel Mesh Generation and ImprovementOne of the first considerations in parallel programming is the relationshipbetween the cores and the machine’s memory. One parallel computing archi-tecture uses a single memory address space which is called shared memoryparallel computing. In the shared memory parallel programming, the com-202.4. Parallel Mesh Generation and ImprovementFigure 2.8: Forward propagation of edge swappingmunication between processors is done by the variables stored in sharedmemory. Multi-threading is an ideal choice for shared memory parallel pro-graming. In this model a master thread runs the program serially until itreaches a point of possible parallelization. Then the master thread spawnsdescendant threads to run the ongoing task in different processors. Thismodel which is called the fork-join model is illustrated in Figure 2.9. Thereare some standards to provide parallelization on shared memory multiproces-sors by enforcing this fork-join model such as OpenMP and POSIX Threads.In the other parallel programming paradigm, each processor has its ownmemory. Such a distributed memory parallel architecture is constructedby connecting each component by a high speed communication network.In this paradigm a task is divided into smaller parts either in terms ofdata or instructions. Then each part is allocated to one processor to beexecuted serially. These processes communicate with each other by sendingmessages. MPI is the dominant standard for inter-process communicationon distributed memory architectures. A simple schematic of this parallelmodel is sketched in Figure 2.10.212.4. Parallel Mesh Generation and ImprovementFigure 2.9: Fork-Join model for shared memory parallel programing.Figure 2.10: MPI parallel model222.4. Parallel Mesh Generation and Improvement2.4.1 Distributed Memory Parallel ProgrammingAs mentioned earlier, typically domain decomposition through the mes-sage passing interface (MPI) is adopted for parallel mesh generation ondistributed memory systems. Domain decomposition methods divide themesh into sub-domains (see Figure 2.13) and allocate each to one proces-sor to mesh it sequentially. One of the earliest attempts in decomposingthe mesh was to first construct an initial coarse mesh conforming to theboundaries. Then the neighboring cells coalesce to form sub-domains whichare of roughly equal size as is done by Said [43]. Soon other algorithmsevolved to perform the decomposition more efficiently. Many decomposi-tion algorithms employ 2-way or its extension k-way partitioning. In 2-waypartitioning, the domain is first split into two sub-domains of equal size byalgorithms such as the greedy algorithm or the medial axis domain decom-position (MADD) [33]. In the greedy algorithm the volume of the cells iscompared and then the neighbor cells are integrated to split the domain intotwo regions. In medial axis domain decomposition, first a boundary Delau-nay triangulation is constructed. Then the circumcenters of these triangleswhich are the Voronoi points of the domain are connected to each other.Finally two of these points are connected to the boundary vertices of theassociated triangles to decompose the mesh into two sub-domains. K-waypartitioning which is imperative to decompose the mesh into more than twosub-domains is implemented normally by two approaches. First, a recur-sive 2-way partitioning of a mesh could result in a domain with roughly kequal size sub-domains such as Figure 2.11. The second approach for k-waypartitioning, which is called multilevel recursive bisection (MLRB) [28], isa highly efficient method for computing the k-way partition of a domain.The basic idea of this algorithm is to first successively coarsen the meshto approximately P = cK number of vertices where K is the number ofpartitions and c is an user-defined constant. Then, the coarsened mesh ispartitioned into K sub-domains either directly or by recursive bisection. Fi-nally, the partitioned coarse mesh is uncoarsened to the original fine mesh.The various phases and schematic of this algorithm is illustrated in Figure232.4. Parallel Mesh Generation and ImprovementFigure 2.11: N-way partitions, where N=2,4,8,16, by the MADD. Imagecourtesy of Linardakis and Chrisochoides [33]. Copyright ©2006 Society forIndustrial and Applied Mathematics. Reprinted with permission. All rightsreserved.2.12.The main challenge after the domain is decomposed is always how to dealwith the mesh entities on the boundary of the domain separators. Lepage etal. [31] opted to freeze the vertices and edges on the boundary of separators.Interior domains undergo one iteration of refinement while the boundarycells are kept frozen. Then, all the processors exit the refinement cycle,and the mesh is repartitioned again so that the previously frozen cells fallinside of the new domains. The repartitioning process makes sure that anyfrozen cells on either sides of the separator boundaries are assigned to oneprocessor. The cost of repartitioning in this algorithm is 20% of the costof refinement cycles. Although there is no need for message passing andinter-partition communication, the repartitioning is expensive.To alleviate the repartitioning cost, Alauzet et al [3] employs two algo-rithms for different mesh modification operations which instead of freezingthe boundary entities, sends large packs of data to neighbor partitions. Sincethe inter-partition communication is costly, small messages of mesh modi-fication data are packed together into large messages by a parallel messagepacking library, Autopack, so that the number of messages passed decreases.The key idea to their algorithms is that common entities to multiple parti-tions are duplicated between them so that each entity on one domain bound-ary is either the main boundary entity or its remote copy from the nearby242.4. Parallel Mesh Generation and ImprovementFigure 2.12: The various phases of the multilevel recursive bisection. Imagecourtesy of Karypis and Kumar [29]. Copyright ©1999 Society for Industrialand Applied Mathematics. Reprinted with permission. All rights reserved.Figure 2.13: Decomposing the domain and allocating each to one processor252.4. Parallel Mesh Generation and ImprovementFigure 2.14: Mesh partitioning and boundary entitiespartition as shown in Figure 2.14. In the first Algorithm 2.3, whenever amesh modification operation is going to occur, it checks whether the wholeassociated polyhedron is owned by the current partition. If so, it proceedswith the operation. If not, a message containing this polyhedron is storedin a local buffer which is designed for bookkeeping all the messages of thiscurrent partition. After mesh modification of interior entities is complete,all the messages of the local buffer are sent to nearby partitions and partsof the locked polyhedron migrate (see Figure 2.15) so that the whole poly-hedron resides in the same partition. In the second Algorithm 2.4, theboundary entities are dealt with first. Each partition first performs meshmodifications on the boundary entities which owns. Then data is sent toother partitions to fix their remote copies as well. The partition also fixes itsremote copy entities of other partitions before proceeding with performingmesh modification in its interior region.2.4.2 Shared Memory Parallel ProgrammingIn another programming paradigm, the parallelization of mesh generationand improvement modules on shared memory architectures evolved as thetrend towards mutli-core processors increased. Because in this paradigm allthe communication between cores is done by means of shared memory, man-262.4. Parallel Mesh Generation and ImprovementFigure 2.15: Example of mesh migration between partitions. Image courtesyof Alauzet et al. [3]. Copyright © 2006, Springer-Verlag London Limited.Reprinted with permission. All rights reserved.Algorithm 2.3 Algorithm 1 on dealing with the boundary entities Alauzetet al. [3].1. Create and initialize a local buffer on each partition2. For each mesh operation on the interior nodes:(a) Determine whether the whole corresponding polyhedron is ownedby the partition.(b) If owned:i. apply the operation.(c) elsei. put the desired mesh operation along with its polyhedron inthe local buffer.3. Put together all the the requests on the local buffer and send them topartitions.4. Perform mesh migration so that the locked polyhedrons reside entirelyin the same partition.5. Repeat 2-4 until the mesh modification criteria are met.272.4. Parallel Mesh Generation and ImprovementAlgorithm 2.4 Algorithm 2 on dealing with the boundary entities Alauzetet al. [3].1. Create and initialize a local buffer on each partition.2. Determine all the mesh entities which are on the boundary partitionand owned by it.3. Perform the mesh operation on these mesh entities.4. Pack the connectivity data and mesh entities that have been changedfor their remote copies.5. Insert all these remote data in the local buffer.6. Send out messages in local buffer.7. Fix the remote copies in other partitions.8. If all the boundary entities are modified to the desired criterion.(a) modify mesh in the interior of all partitions9. Else(a) goto line 2.10. End.282.4. Parallel Mesh Generation and Improvementaging the shared data structures is of foremost importance. In the sharedmemory paradigm, processors can communicate by the variables which arestored in the shared global memory which are visible to all processors; how-ever they each can have private memory. The common approach for parallelmesh refinement in shared memory is to find a “maximal set of independent”[49, 34] vertices to insert. In the earliest attempts, Luby [34] introduced thenotion of “maximal independent sets” as the union of all the vertices thatcan be inserted simultaneously without posing any problem. Later Spielmanet al. [49] proposed a length scale to construct independent sets of verticesto insert and update the Delaunay triangulation. Although their algorithmfinds the independent sets of vertices in parallel, it requires checking eachpair of vertices before insertion which introduces more overhead in the per-formance. In the process of finding the maximal independent sets efficiently,Chernikov and Chrisochoides [6, 7, 8, 9] proposed to use Quad/Oct treestructures to guide the processors in the mesh. In their algorithm, theyprove theoretically that in a properly constructed Quad/Oct tree, leavescan be divided into independent sets, and all cells in such leaves can berefined in parallel. This algorithm is explained thoroughly in Chapter 5 asour refinement algorithm is built upon it.In recent work to parallelize mesh generation algorithms for shared mem-ory architectures, Rokos et al. [40] presented a thread-parallel algorithm foranisotropic mesh adaptation. In their yet unpublished paper, they haveexploited four optimization modules — refinement, coarsening, smoothingand edge swapping — to construct anisotropic meshes in parallel withoutthe typical coloring schemes. To deal with the data structure and memoryhazards, they have introduced a deferred update scheme which stores thedata structure and connectivities in parallel. When the threads modify themesh, they need to update data connectivities such as edges connected toa vertex. Doing this simultaneously by more than one thread may result indata race conditions. In their deferred update algorithm, each thread hasa private list of all data structure of threads including its own. During themesh modification process, threads only write to their own list of data struc-ture. When threads exit the mesh modification process, they each traverse292.4. Parallel Mesh Generation and Improvementthe private data structure list of other threads looking for entities allocatedto each corresponding private list and commit any updates found. Theirpaper is unclear about how this commit process updates the global sharedconnectivity data without race conditions. The other arguable point in theiralgorithm is the fact that two adjacent edges or vertices may be modifiedsimultaneously by different threads leading to contradicting updates fromdifferent threads. Because updates to the data connectivities are deferreduntil one sweep of mesh modification is complete, the data arriving fromdifferent threads may invalidate each other.Rokos’ element refinement algorithm employs the notion of maximal in-dependent sets. However, they have not gone about explaining their re-finement algorithm further than this. Also, they do not describe how theirrefinement algorithm works in conjunction with the deferred updates to themesh data structure. Similarly, the swapping algorithm constructs a maxi-mal independent set of vertices to swap the edges outbound from these ver-tices. The algorithm starts with marking all edges and then going throughthe whole mesh and updating the maximal independent set of active vertices.A vertex is called active if it has an outbound edge which is marked. Thenthreads swap all possible edges outbound from the maximal independentset of active vertices. After each edge is swapped, the four edges surround-ing it are marked for swapping too. They claim that edges outbound fromindependent vertices are independent, therefore they can be swapped with-out any conflict. However, the deferred updates require locking some of theunswapped edges as their data connectivity has not yet updated. Moreover,it is not clear how they construct the maximal independent set of verticesin the first place to make sure the vertices are safely away from each other.30Chapter 3Memory ModelA key step in enabling multithreaded parallelism for mesh operations is mak-ing the mesh database thread-safe. Since GRUMMP was originally designedfor serial operation, this required significant changes to the mesh databaseto eliminate two common problems — race conditions and false sharing —which prevent robust, efficient execution on parallel shared memory archi-tectures. A race condition problem occurs when two or more threads bothtry to update a single shared memory location, thereby corrupting the previ-ously saved data and crashing the program. False sharing occurs when two ormore threads write to different memory locations at the same cache line forc-ing an unnecessary cache update. Although false sharing does not corruptthe execution of a program, it can degrade its performance greatly. In thischapter, these common problems are described. Then, the way GRUMMPstores mesh data and its internal container class are explained. Finally, I de-scribe a new parallel memory extension to this container developed to storeand access mesh data in a thread-safe manner, without race conditions orfalse sharing.3.1 Race ConditionA race condition occurs when two or more threads in a shared memoryparallel program can access shared data and they try to change it at thesame time. Data race condition problems can easily occur in shared memoryparallel programming models such as Pthread and OpenMP programs if aspecial care is not taken. Because the thread scheduling algorithm can switchbetween threads at any time, the order in which the threads will attemptto access the shared data can vary from one execution of the program to313.1. Race ConditionAlgorithm 3.1 Occurrence of data race condition problemif (x == 5) // the check{y = x+1; // the act}another. Therefore, the result of the change in data is dependent on thethread scheduling algorithm, i.e. both threads are racing to access/changethe data.Problems often occur when one thread does a "check-then-act" (e.g. checkif the value is X, then act to do something that depends on the value beingX) and another thread does something to the value in between the checkand the act. For example in code snippet 3.1 y could be 6, or it could beanything, depending on whether another thread changed x in between thecheck and act. This is because, in order for each thread to increment thevalue of x, they have to first retrieve the value of x, increment it, and thenstore it. Any thread can be at any step in this process at any time.The effect of a data race problem is quite subtle and hard to locate,because the problem symptom may show up only once in many runs in onecase or in multiple places in a single run in another. Therefore, great careneeds to be taken when dealing with shared data to protect them from racecondition problems. There are two main ways of avoiding race-conditionproblems: one can employ mutually exclusive schemes such as locks andatomic operations to protect shared memory by allowing only one thread ata time to access the memory, or one can use non-blocking memory models[50, 51] which allow multiple tasks to access a shared object while ensuringthat threads never try to simultaneously read and write the same part ofthat object. Locking models suffer from performance loss due to contentionswhich can occur when a thread owns a lock on some part of memory, andforcing other threads wait for the lock to be freed. Non-blocking algorithmsare lock-free hence allowing better performance at the expense of algorithmiccomplexity. In this work, we emphasize non-blocking algorithms to store and323.2. False SharingAlgorithm 3.2 Occurrence of false sharing in a shared arraydouble Sum = 0.0, Sum_Thread[NUM_THREADS];omp_set_num_threads(NUM_THREADS)#pragma omp parallel shared(Sum,Sum_Thread){int ID = omp_get_thread_num();Sum_Thread[ID] = 0.0;#pragma omp forfor (i = 0; i < N , i++)Sum_Thread[ID] += X[i];#pragma omp atomic updateSum += Sum_Thread[ID];}access mesh data structures which constitute most of the shared memoryhandling.3.2 False SharingFalse sharing is a well-known performance issue on symmetric multiprocessor(SMP) systems, where each processor has a local cache. The memory sys-tem hardware must guarantee cache coherence. False sharing occurs whenthreads on different processors modify variables that reside on the samecache line as illustrated in Algorithm 3.1. This invalidates the cache lineand forces an update, which hurts performance.For example for the code snippet shown in Algorithm 3.2, there is apossibility for false sharing on the array Sum_Thread. This array is smallenough to fit in a cache line. Therefore, when threads are accessing differententities in this line, they are accessing the same line of memory as otherthreads which invalidates the cache line for all processors. False sharing canbe addressed by reducing the use of shared or global variables which maybe accessed frequently by different threads and instead increasing local or333.3. Serial Memory ModelFigure 3.1: Occurrence of false sharing, Courtesy of Intelprivate variables. For cases, where use of global variables is inevitable, itis recommended to align them with the cache line size to fully occupy it.This ensures that the whole line residing in one cache line does not residein any other caches. This in turn guarantees that no two memory slots inone cache line are modified by two processors. The corrected algorithm topreclude the false sharing problem is presented in Algorithms 3.3 and 3.4.3.3 Serial Memory ModelAny mesh database must store mesh entities such as vertices, faces, andcells. Efficient storage and iteration over these elements is of paramountimportance in all mesh generation and adaptation modules. In GRUMMPwe choose not to use the standard C++ containers such as vectors or dequesas the base containers to store the mesh data structures. Some containers –including sets and linked lists – impose significant storage overhead. Also,since GRUMMP uses pointers to entities to store connectivity data after343.3. Serial Memory ModelAlgorithm 3.3 Corrected algorithm by adopting private variables.double Sum = 0.0, Sum_Thread = 0.0;omp_set_num_threads(NUM_THREADS)#pragma omp parallel private(Sum_Thread) shared(Sum){Sum_Thread = 0.0;#pragma omp forfor (i = 0; i < N , i++)Sum_Thread += X[i];#pragma omp atomic updateSum += Sum_Thread;}Algorithm 3.4 Corrected algorithm by aligning arrays with the cache line.double Sum = 0.0, Sum_Thread[NUM_THREADS][cache_line_size];// The Sum_Thread variable is padded to ensure each element// is aligned on a boundary of cache line.omp_set_num_threads(NUM_THREADS)#pragma omp parallel shared(Sum,Sum_Thread){int ID = omp_get_thread_num();Sum_Thread[ID][cache_line_size] = 0;#pragma omp forfor (i = 0; i < N , i++)Sum_Thread[ID][cache_line_size] += X[i];#pragma omp atomic updateSum += Sum_Thread[ID][cache_line_size];}353.4. Parallel Memory Modelgenerating a mesh entity and storing it in the mesh data structure contain-ers, it is important that the location of the newly inserted entity remainsthe same throughout the execution of the program. Hence, use of C++ con-tainers with reallocation — notably the vector template — must be avoidedfor the use in storing mesh data structures. Instead, GRUMMP implementsa container template with functionality similar to the standard vector tem-plate, except that our container allocates additional memory when neededrather than reallocating and moving existing data. This provides contiguousblocks of memory while preserving the existing pointers as shown in Figure3.2. Also, deleted entities are recycled by adding the freed memory slotsto the list of free entities. This procedure enables us to re-use the samememory slots to minimize the growth of container size. As shown in Figure3.2, free-list container which stores the pointers to free blocks of memoryworks in a LIFO4 manner which means that the freed entities are re-usedbefore other entities. This procedure ensures that we fully occupy availablememory blocks.Other features of GRUMMP memory model such as resize, an internaliterator, getNewEntry, and deleteEntry functions help to mimic the featuresof C++ STL libraries along with some new user-defined functions such askeeping track of all empty entities at all time to re-use and also allocatingnew chunks of contiguous blocks of memory without reallocation.3.4 Parallel Memory ModelFor efficient use of parallel shared memory architectures, it is important thatmesh data structures are visible to all processors while allowing insertion andremoval operations to occur concurrently. Otherwise, either synchronizationbetween threads to insert or remove mesh entities compromises the overallspeedup, or race condition problems arise which corrupt the mesh databaseand consequently the execution of the program. The former occurs whennaively using some sort of synchronization to let threads atomically modifythe memory. The latter emerges due to the concurrent update of the same4Last In First Out363.4. Parallel Memory ModelFigure 3.2: Memory model of GRUMMP373.4. Parallel Memory ModelAlgorithm 3.5 Getting a new entitytemplate<typename T> getNewEntry(){if (!In_Parallel){if (empties.empty())resize(empties);T* Entry = empties.back();empties.pop_back();return (Entry);}else{int ID = omp_get_thread_num();if (empties_pa[ID].empty())resize(empties_pa,ID);T* Entry = empties_pa[ID].back();empties_pa[ID].pop_back();return (Entry);}}slot of memory by two or more threads. We have resolved this by introduc-ing a free-list of memory local and private to each thread. At the entrance toeach parallel region in the code, pools of pointers to the unused entities areallocated to each thread to use for new mesh entities. However, after inser-tion the memory is visible to every other thread, meaning the other threadscan remove it without any synchronization. Also, every removed entity isrecycled by the thread which removed it. In this way not only the need forallocating new memory blocks from the shared memory structure is reducedbut also the serial recycling feature is preserved by threads. The processin which a new entity is inserted both in serial and parallel is illustrated inAlgorithm 3.5.383.4. Parallel Memory ModelThe old versions of GRUMMP only kept track of recycled entities, thoughthe new version records all the available entities. In the old version, a thor-ough search in the shared global data was required to find pools of unusedentities to allocate to the threads either on the fly or at the beginning. Yet,there was no guaranty that these entities will not be used by other threads.On the contrast, this new version effectively allocates unused entities to thethread-local containers from pools of available entities. When one threadruns out of pointers to available entities, it is assigned pointers from theshared container. Once the shared container runs out too, the containerallocates a new block of memory and distributes pointers to some portionof the block to the thread in need. This allocation routine is well illustratedin Algorithm 3.6. At the completion of parallel execution a synchronizationfunction appends the thread-local free lists to the shared global list to keepthe memory consistent with the serial sections of program. This synchro-nization function along with the one at the entrance of parallel sections,which only amount to less than 2% of runtime, is illustrated in Algorithm3.7. Experimentally, it is observed that the cost of this synchronization op-eration is negligible (less than 2%) compared to other parts of the runtimeroutines. Implementing this memory model excludes the need to directlyuse atomic or critical operations to store mesh data structures in parallel.We eliminate situations where threads might need to modify the same entityalgorithmically, as described in the next chapters.393.4. Parallel Memory ModelAlgorithm 3.6 Allocating pointers of unused entities to the threads on theflyuint32_t Global_Size;template<typename T> voidresize(std::list<T*> em_pa, int ID){uint32_t Local_Size;uint32_t Where_To_Copy;typename std::list<T*>::iterator itC , itE;int ID = omp_get_thread_num();#pragma omp critical(resize) {Local_Size = Global_Size ;Where_To_Copy = 0;if (Global_Size > Chunk_Size)Where_To_Copy = Global_Size - Chunk_Size);Global_Size = Where_To_Copy;}if (Local_Size != 0){itC = empties.begin();std::advance(itC,Where_To_Copy);itE = itC;std::advance(itE,Local_Size - Where_To_Copy);em_pa[ID].insert(em_pa[ID].begin(),itC,itE);}else{#pragma omp critical(resize){allocate_new_data(Data[n],kn);//updating global list of unused entities.for (uint32 i = 0 ; i < kn ; i++)empties.push_front(&(m_data[n][i]));Global_size = empties.size();}std::advance(itC,Where_To_Copy);itE = itC;std::advance(itE,Local_Size - Where_To_Copy);em_pa[ID].insert(em_pa[ID].begin(),itC,itE);}}403.4. Parallel Memory ModelAlgorithm 3.7 Synchronization at the entrance and exit of the parallelsectionsync_entry(){unit32_t chunk = empties.size();while (chunk < ThreadContainerSize){Allocate_new_memory(empties);chunk = empties.size();}unit32_t chunkSize=p_firstListSize;unit32_t chunkLeft=chunk-chunkSize*NUM_PROCS;typename std::list<T*>::iterator itS;typename std::list<T*>::iterator itE;#pragma omp parallel for default(shared)private (itS,itE){int ID = omp_get_thread_num();itS = empties.begin();itE = empties.begin();std::advance(itS,chunkLeft+chunkSize*ID);std::advance(itE,chunkLeft+chunkSize*(ID+1));em_pa[id].insert(em_pa[id].begin(),itS,itE);}empties.erase(chunkLeft,chunkSize);Global_Size = empties.size();}sync_exit(){for (int ID = 0 ; ID < num_threads ; ID++)empties.splice(empties.end(),empties_pa[ID]);}41Chapter 4Parallel Edge and FaceSwapping4.1 2D SwappingFor swapping a face in 2D, no matter what the swapping criterion is, theprocess of reconfiguration is the same. In serial 2D swapping, the processbegins with a loop over all possible faces. For each face, the algorithmidentifies the face information and its neighbors. Then, it decides whetherswapping the face is possible. If so, it finds the best configuration, andthen based on the swapping criterion performs the reconfiguration. Thereconfiguration process itself is comprised of two steps: first deleting theold cells overlying a quadrilateral for which the corresponding face is thediagonal; second, creating new cells which contain the other diagonal of thequadrilateral. This process is repeated for each face. The pseudo-code 4.1explains abstractly the serial face swapping algorithm.However the standard serial algorithm cannot be parallelized by simplyhaving each thread work on separate faces simultaneously, because doingthis carelessly can result in a non-conformal mesh which arises when twoor more threads swap faces that share a cell. For example, if two facesconnected to each other are swapped simultaneously by different threads,the reconfiguration of one invalidates the quadrilateral (or polyhedron in3D) of the other. This happens as both faces attempt to delete one cellwhile creating different cells which leads to race condition problems. More-over, if two disconnected faces which share a quadrilateral (polyhedron in3D) are swapped concurrently by different threads, there is a high chance424.1. 2D SwappingAlgorithm 4.1 Serial face swapping algorithm1. Queue all faces to be swapped.2. While face queue is non-empty, do:(a) For all the faces in the queue:i. Calculate the face information and its neighborhood.ii. Check the swapping criteria.iii. If face is swappable:A. Do the reconfiguration.iv. Remove the face from the list.(b) Remove other deleted faces from the queue.(c) Add the newly created faces to the queue.of running into race condition problems when updating the data connec-tivity of mesh. As explained earlier, when a face is swapped, old cells aredeleted and new ones are created to form new configurations. Connectivitydata for faces needs to be updated to account for these newly created cells.Thus, two threads can modify the same data concurrently which leads tothe race condition problems. A third possibility of running into race condi-tion problems arises when two faces in different quadrilaterals are swappedin parallel but yet their quadrilaterals share a vertex. Updating the faceconnectivity for this shared vertex by two different threads concurrentlycauses race condition problems as two threads attempt to modify the sameslot of memory. Another drawback of the serial algorithm with respect toparallelization is that a face identification process occurs before each face isreconfigured. This identification process is a read only operation whereasthe reconfiguration process has both read and write operations. Thus, thefrequent occurrences of read and write operations in parallel can degradethe performance by increasing the chances of running into false sharing orrace condition problems.To rectify these problems, we break the swapping loop into two loops:434.1. 2D SwappingLockedVertices:Independent Faces:Figure 4.1: 2D swapping independencyone that identifies faces that should be reconfigured, and one that does thereconfiguration. The first loop acts on all faces that are marked as swapcandidates and is a read-only operation, so it can safely be parallelized.Also, the data dependencies in the swapping process need to be identifiedand removed so that updating the data connectivities concurrently does notcorrupt the mesh data connectivities. To remove the adjacency problem wehave employed a scheduling method other than decomposing the mesh. Inthe scheduling process we reject some of the candidate faces until a laterpass so that the bucket of remaining faces are all conflict-free which meansafter this, the actual reconfiguration process is guaranteed to be conflict-free regardless of the order of operations. Finally, the second loop swap allthe conflict free faces in parallel. Each of the faces selected for swappingis the diagonal of a quadrilateral which is going to be updated. If thesequadrilaterals do not share any vertices, the edges are independent and canbe swapped simultaneously5 as shown in Figure 4.1.Hence, the key criterion for two faces to be independent is if their quadri-lateral do not share any vertices. We enforce this criterion in an identifi-cation loop, which is a read-only operation, by applying a vertex lockingstrategy to get a conflict-free queue of faces ready to be reconfigured. Allfour vertices constructing each face are checked first to be free. If even one5Our data structure stores upward adjacency data for vertices. Without this, we couldallow shared vertices but not shared edges.444.2. 3D Swappingis locked the face is postponed to be swapped in a later pass. If they areavailable, all four of them are flagged atomically to be locked so that noother threads grab a face involving these vertices. This process goes on un-til there are no more faces with vertices unlocked. The subroutine whichmanages this vertex locking strategy is explained briefly in the Algorithm4.2. This code snippet only shows a part of parallel algorithm responsiblefor finding the dependencies in the swapping process and removing them.After this, all the selected faces are guaranteed to be conflict free. Theyare equally allocated to the available threads to swap. Then, with the termi-nation of one swapping pass, the locked vertices are unlocked so the schedul-ing for the next pass determines the new candidate faces. The Algorithm 4.3shows how the parallel swapping works in general including the processesof calculating face information, checking swapping criteria, locking vertices,and performing the reconfigurations.4.2 3D SwappingIn 3D, parallel swapping is much the same as in 2D with the only differencethat there are two different types of swapping, face and edge swapping, withdifferent connectivities, and consequently different dependencies to handle.While identifying configurations to swap, we also record — for both faceand edge swapping — the desired configuration and all vertices affected.We again apply a vertex locking strategy — same as Algorithms 4.2 and4.3 — and select a maximal conflict-free set of faces and edges on whichparallel reconfiguration can be performed safely. In face swapping, all thevertices located on the polyhedron composed of two tetrahedra sharing theassociated face must be locked as shown in Figure 4.2. In edge swapping,all the vertices on the equatorial plane and the end points of the generatingedge must be locked as illustrated in Figure 4.3.454.2. 3D SwappingAlgorithm 4.2 Vertex locking strategy and finding the maximal indepen-dent set of vertices// CandidateFacesVec[ID]// The bucket of faces private to ID thread.// FacesToBeSwapped[ID]// The bucket of conflit-free faces private to ID thread.#pragma omp parallel shared(CandidateFacesVec){int ID = omp_get_thread_num();int QueueSize = CandidateFacesVec[ID].size;Vert * VertsInQuad[4];for (int fN = 0 ; fN < QueueSize ; fN++){IsInUse = false;Face * pF = CandidateFacesVec[ID][fN];Find4VertsInQuadrilateral(pF,VertsInQuad);#pragma omp critical{for (int i = 0; i < 4 && !IsInUse; i++)IsInUse = VertsInQuad[i]->isLocked();if (!IsInUse) {for (int i = 0; i < 4; i++)VertsInQuad[i]->LockThisVert;}}if (IsInUse)FacesToBeSwapped[ID].push_back(pF);}}464.2. 3D SwappingAlgorithm 4.3 Parallel face swapping algorithm1. Initialize a private face queue for each thread.2. Divide faces equally in each private face queue.3. While there are faces in the face queues(a) For each thread, do:i. For all the faces in the queue:A. Calculate the face information and its neighborhood.B. If face is not swappable, remove it from the list, andcontinue to the next face.C. If face is not independent, skip to the next face.D. Add the face to the independent set of faces to beswapped.(b) Wait for all the threads to arrive.(c) Perform face migration between threads to balance the work-loadin each thread.(d) For each thread, do:i. Loop on the independent faces to perform reconfiguration.ii. Add newly created faces to the local face queue of each threadto check for swapping.iii. Remove deleted faces from the local face queues.(e) Wait for all the threads to arrive.LockedVertices:Face to be swapped:Figure 4.2: Vertices to be locked in face swapping474.3. Parallel Performance and Speed-up ResultsLockedVertices:E to be swapped:Figure 4.3: Vertices to be locked in edge swapping4.3 Parallel Performance and Speed-up ResultsWe analyzed the parallel swapping algorithms (if not stated otherwise) onthe Westgrid Breezy computer with nodes featuring 24 AMD cores at 2.4GHz and up to 256 GB of memory. The domains which have been used tostudy the speed-up and stability of the parallel swapping algorithm are pre-sented in Figures 4.4 and 4.5. In both 2D and 3D, after the initial boundaryconforming mesh is constructed, the meshes are refined by a Delaunay re-finement algorithm. In each case, the Delaunay meshes have been swappedby a uniform degree swapping criterion so that a reasonably large number ofswap operations will be performed. This swapping criterion simply attemptsto equalize the degree6 of each vertex in a mesh by breaking the mesh afterit is refined by a Delaunay refinement module.The number of threads in each experiment varied from 1 to 16 to il-lustrate the speed-up performance in a strong scaling experiment. Also,the effects of workload are studied by experimenting with parallel swappingalgorithm with a different number of face and edge swapping operations.However before proceeding with speed-up results, it is of paramount impor-tance to illustrate the stability of the parallel algorithm. Figure 4.6 showsthe quality of the meshes prior to the swapping operation. On the otherhand, Figures 4.7 and 4.8 show respectively in 2D and 3D the quality of themeshes after swapping by the serial and parallel algorithms. These figures6The degree of a vertex in mesh terminology is the number of incident edges on thevertex.484.3. Parallel Performance and Speed-up ResultsFigure 4.4: 2D boundary domain which has been used for the purpose ofstability and speed-up results.494.3. Parallel Performance and Speed-up ResultsFigure 4.5: 3D boundary domain which has been used for the purpose ofstability and speed-up results.504.3. Parallel Performance and Speed-up ResultsAngelDistribution (%)0 30 60 90 120 150 180051015(a) This 2D mesh contains 1,210,000 ver-tices, 3,500,000 faces, and 2,400,000 trian-gles.Dihedral angelDistribution (%)30 60 90 120 150 180051015(b) This 3D mesh contains 320,000 verticesand 1,800,000 tets.Figure 4.6: Quality of the mesh prior to swapping operations.prove that the parallel swapping algorithm creates a mesh with the samequality as the mesh produced by the serial counterpart. Clearly, the dis-tribution of dihedral angle and also the minimum and maximum dihedralangle in the mesh have not changed, suggesting that the parallel algorithmis stable and meets with the required mesh qualities. Also, a comparison ofFigures 4.6, 4.7, and 4.8 emphasizes that quality of the mesh changes con-siderably after swapping operation. Yet, both serial and parallel swappingalgorithms change the quality to the same extent which again proves thestability of the parallel swapping algorithm.4.3.1 Speed-upExperiments on 2D and 3D swapping show approximately linear speed-upup to 16 threads in Figures 4.9 and 4.10 respectively. However, the speedupline departs from the ideal linear speedup line as the number of threadsincreases to 16. To better analyze the performance of the algorithm, aparallel efficiency has to be defined. The efficiency of an algorithm in parallelparadigm is defined as514.3. Parallel Performance and Speed-up ResultsAngleDistribution (%)0 30 60 90 120 150 1800246810121416Serial AlgorithmParallel Algorithm(a) Distribution of angle.AngelDifference in distribution (%)0 30 60 90 120 150 18000.020.040.060.080.10.120.140.160.180.2(b) Differences in distribution of angle.Figure 4.7: Quality comparison of meshes generated by serial and parallelface swapping algorithms in 2D for a sample mesh with 185,000 face swap-ping operations. This mesh contains 1,210,000 vertices, 3,500,000 faces, and2,400,000 triangles.Dihedral angleDistribution (%)0 30 60 90 120 150 180024681012 Serial AlgorithmParallel Algorithm(a) Distribution of dihedral angle.Dihedral angleDifference in distribution (%)0 30 60 90 120 150 18000.0050.010.0150.020.0250.03(b) Differences in distribution of dihedralangle.Figure 4.8: Quality comparison of meshes generated by serial and parallelface-edge swapping algorithms in 3D with 200,000 faces and edges swapped.This mesh contains 320,000 vertices, 6,040,000 faces, and 1,800,000 tets.524.3. Parallel Performance and Speed-up ResultsEp =T1pTp× 100,where T1 is the serial execution time, and Tp is the execution time of parallelalgorithm with p processors. The parallel efficiency is a measure of how wellthe parallel algorithm scales with the increase in the number of threads.The parallel efficiency of the swapping algorithm in the worst case for 3Dswapping is 60% with high workload on 16 threads (Figure 4.10b) whilethe efficiency improves considerably as the number of threads decreases to2 or 4. The same trend in parallel efficiency is observed in 2D where againthe increase in the number of threads and also the workload leads to adegradation of the parallel efficiency (the workload increases by refiningmore the same mesh so that there are more faces to swap). Nevertheless, theparallel efficiency in 2D for the worst case is 64% while this increases to 95%with 2 threads and 81% with 4 threads. In either case, the results show goodspeedup up to 16 cores due to the well balanced workload amongst threadsand also removing the conflicts from the actual reconfiguration process whichcan otherwise compromises the performance. However, the process of vertexlocking and removing the conflicts can become a point of contention whenthe number of threads or the workload increases. This becomes a bottlenecksince the vertex locking process is a mutually exclusive operation whichrequires threads waiting idle when modifying the same vertices. Altogether,there are mainly two reasons for the deviation from full linear speedup inour algorithm. First, when the number of faces to be swapped increases, thenumber of vertices to be locked increases accordingly which in turn causesmore overhead in removing the faces which have conflict. Thus, the parallelperformance degrades as the workload increases. Second, the inherent loadimbalance in the process in which swappable faces are identified prevents thealgorithm to scale linearly. Especially the fact that this problem intensifiesas the number of threads increases suggest where the scalability problem lies.For example in 2D, for the case shown in Figure 4.9a, checking the swappingcriteria and removing conflicts takes up 79% of the whole swapping runtimeby two threads whereas this increases to 93% with 16 threads. This trend534.3. Parallel Performance and Speed-up ResultsNumber of ThreadsTime (s)4 8 12 16100101Ideal Speedup Total SwappingSwapping criteria and removing conflictsReconfiguration(a) 320,000 faces are swapped.Number of ThreadsTime (s)4 8 12 16100101Ideal Speedup Total SwappingSwapping criteria and removing conflictsReconfiguration(b) 700,000 faces are swapped.Figure 4.9: The running time for 2D face swapping with different numberof swaps, while the number of threads increases from 1 to 16suggests that checking the swapping criteria and removing conflicts processdoes not scale as well as the reconfiguration process. This inherent loadimbalance problem is explained more in Section 4.3.2.Our new parallel swapping algorithm is working better in comparison toa recent work by Rokos et. al. [40] which is explained in Section 2.4.2. Theyhave only presented speedup results for a 2D mesh with approximately 250Kvertices and 500K cells on 2, 4, 8, and 16 threads on Intel(R) Xeon(R) E5-2650 CPU. To compare our speedup results with theirs, we have generateda mesh of similar size. With this mesh size, our algorithm swaps nearly46,250 faces, however it is not clear how many swap operations have beenconducted by their algorithm. Nevertheless, Figure 4.11 shows that on everynumber of threads our algorithm produces better speedup and also scalesbetter on higher number of threads.4.3.2 Load BalanceLoad-balance in parallel computing terminology refers to how well the work-load is distributed between available computing resources. Specifically inshared-memory parallel programing, load-balance is determined by howmuch threads are waiting idle for one thread to complete its task. Hence,544.3. Parallel Performance and Speed-up ResultsNumber of ThreadsTime (s)4 8 12 16100101Ideal Speedup Total SwappingSwapping criteria and removing conflictsReconfiguration(a) 120,000 faces and edges are swappedNumber of ThreadsTime (s)4 8 12 1650100150200Ideal Speedup Total SwappingSwapping criteria and removing conflictsReconfiguration(b) 600,000 faces and edges are swappedFigure 4.10: The running time for 3D face and edge swapping with differentnumber of swaps, while the number of threads increases from 1 to 16performance of a parallel task is driven by its slowest thread. Here, load-balance for face and edge swapping is addressed in two ways as illustratedin algorithm 4.3. First, an equal number of faces and edges are allocated toeach thread as the maximal set of independent faces or edges is constructed.Then, faces (and edges in 3D) are migrated between threads so that eachthread ends up with an equal number of faces to swap. These two proceduresensure good load-balance between threads. Figures 4.12 and 4.13 show theload balance distributed between 4 threads in 2D and 3D swapping oper-ations respectively in terms of faces (and edges in 3D) swapped and timespent in each thread doing so. The face migration process in Algorithm 4.3is crucial to preserve the load balance between threads. For example, Fig-ures 4.13a and 4.13b illustrate how one thread may end up doing more workif this synchronization process which balances the workload is removed.We also studied effects of face migration in three different parts of swap-ping operation to make sure that load balance is preserved throughout theswapping operations. As Figure 4.14 shows, running time for the swappingprocess is divided into three parts: reconfiguration, checking swapping cri-teria and obtaining the face or edge information, and finally removing theconflicts from the list of faces to be swapped. Face migration has improved554.3. Parallel Performance and Speed-up ResultsNumber of threadsSpeedup2 4 6 8 10 12 14 16246810121416Our algorithmRokos’ algorithmFigure 4.11: Comparison of our parallel swapping algorithm and Rokos’ ona 2D mesh.564.3. Parallel Performance and Speed-up ResultsThreadsTime (s)Number of faces0 1 2 30.511.522.501500030000450006000075000TimeNumber of faces(a) With face migration to preserve the work-load as shown in Al-gorithm 4.3ThreadsTime (s)Number of faces0 1 2 30.511.522.501500030000450006000075000TimeNumber of faces(b) Without performing face migration.Figure 4.12: Load-balance for 2D face swapping574.3. Parallel Performance and Speed-up ResultsThreadsTime (s)Number of swaps0 1 2 3246810020000400006000080000TimeNumber of swaps(a) With face migration to preserve the work-load as shown in Al-gorithm 4.3.ThreadsTime (s)Number of swaps0 1 2 324681012020000400006000080000100000TimeNumber of swaps(b) Without performing face migration.Figure 4.13: Load-balance for 3D face and edge swapping584.3. Parallel Performance and Speed-up Resultsthe load balance in all three parts however the part that is responsible forchecking the swapping criteria is showing still unequal load balance in dif-ferent threads. The reason behind this is that even though all threads haveequal number of faces, they may spend different amount of time checkingthe swapping criteria due to the different number of swappable faces. Itis noteworthy to mention that in both serial and parallel swapping algo-rithm, a face or edge is first checked if it is geometrically7 swappable. Thenit is checked if it is swappable considering the swapping criterion. There-fore, different number of faces or edges may pass the first test which makesthreads spend different amounts of time calculating the swapping criterionfor the remaining faces. This inherent load imbalance can compromise thewhole swapping speedup. This also intensifies when the number of threadsincreases.7For example, Its quadrilateral or polyhedron is not convex; or it is a boundary facewhich is not swappable.594.3. Parallel Performance and Speed-up ResultsThreadsTime (s)0 1 2 30.20.40.60.811.21.4 Calculating Swapping InfoReconfigurationRemoving Conflicts(a) With face migration to balance load.ThreadsTime (s)0 1 2 30.20.40.60.811.21.4 Calculating Swapping InfoReconfigurationRemoving Conflicts(b) Without face migration.Figure 4.14: Effects of keeping load balance in different phases of swappingalgorithm in 2D.60Chapter 5Parallel Vertex InsertionThe refinement module of GRUMMP is based on Ruppert’s8 algorithmwhich uses B-W algorithm to insert circumcenters of the bad quality cells.Typically a mesh generation and refinement module starts with an initialcoarse mesh which conforms to a given boundary. Then, the application re-fines the mesh to the desired level by imposing some constraints such as anupper bound on the element area or an upper bound on the circumradiusto shortest edge ratio. In any case, the circumcenters of the bad qualitytriangles or tetrahedra are inserted.The parallelization of the Delaunay refinement algorithms described inChapter 2 is achievable if the concurrent insertion of Steiner points by multi-ple threads does not lead to a non-conformal or non-Delaunay mesh. A meshis non-conformal if mesh entities overlap or if there are cavities in the meshwith no triangulation. A mesh is non-Delaunay if there is a d-dimensionalsimplex in the mesh whose circumball contains other vertices. Chernikov etal. [7] proved that simultaneous insertion into a triangulation could lead tonon-conformity if the two cavities C(m) and C(n) share a triangle as shownin Figure 5.1. Also, if the two cavities C(m) and C(n) have a common edge,the new triangulation after insertion of vertices m,n can violate the Delau-nay condition. This in turn results in triangles whose circumcircles are notempty of other vertices as shown in Figure 5.2.From these two facts one can say that if the two cavities C(m) and C(n)do not share any triangle or edge, vertices m,n can be inserted concurrentlyresulting in a mesh which is both conformal and Delaunay. However, inpractice this is not easy to apply. Chernikov also showed that if two Steinerm,n points are far enough from each other, then their cavities do not share8This algorithm is already explained in Algorithm 2.1.61Chapter 5. Parallel Vertex InsertionFigure 5.1: Concurrent insertion of m,n can result in a non-conformal meshif △p1p2p3 ∈ C(m) ∩ C(n)Figure 5.2: Concurrent insertion of m,n can result in a non-Delaunay meshif edge p1p2 is shared by ∂C(m) and ∂C(n)625.1. Chernikov and Chrisochoides’ Algorithmany edge or triangle. The points must be separated by ‖m− n‖ > 4r, wherer¯ is an upper bound on the maximum circumradius in the mesh.if △p1p2p3 ∈ C(m) ∩C(n) =⇒ ‖m− n‖ < 2r(△p1p2p3) < 2r¯if p1p2 ∈ ∂C(m) ∩ ∂C(n) =⇒ ‖m− n‖ < 2r(△p1p2p3) + 2r(△p1p2p7) < 4r¯To parallelize Ruppert/Shewchuk Delaunay refinement algorithm, wehave exploited the theoretical framework developed by Chernikov and Chriso-choides [6, 7, 8, 9, 10] to develop a parallel insertion module on shared mem-ory architectures. As pre-existing GRUMMP modules perform the actualpoint insertion in our implementation, we have met the highest possible levelof code re-use. Our goal is to gain maximum speed-up on shared memoryarchitectures and good scalability on nodes with high numbers of cores.5.1 Chernikov and Chrisochoides’ AlgorithmChernikov et al. [7] showed that the necessary condition for the concurrentinsertion of two points m and n to result in a conformal Delaunay triangu-lation is if their cavities do not have any common triangles and do not shareany triangle edges. From a practical point of view, another lemma whichwas proved in the same paper is of more use: if m and n are two Steinerpoints, and if ‖m− n‖ > 4r where r is the upper bound on the trianglecircumradius, vertices m and n can be inserted simultaneously. However,checking two vertices on the fly to determine whether they are far enoughfrom each other is an expensive operation which requires extensive locking.Thus, based on this lemma a Quad/Oct tree structure is constructed to over-lay the whole mesh ensuring the above distance criterion. The Quad/Octtree initially is comprised of axis-aligned square leaves with the side size of2r; then it will be refined as the mesh gets finer. Every triangle whose cir-635.1. Chernikov and Chrisochoides’ Algorithmcumcircle is located inside a leaf will be assigned to that leaf. In Chernikovand Chrisochoides’ parallel refinement algorithm, independent leaves aredistributed to threads for refinement of cells contained in these leaves. Toensure that insertion in these leaves are independent, the buffer zone of aleaf is defined as:Definition 2. (Buffer Zone). Buffer zone of leaf Li which is the locus of allthe points closer than the threshold 4r to an arbitrary point in the leaf L,is mathematically defined as:N (Li) =⋃α{Nα(Li)⋃{⋃L∈Nα(Li)NORT(α)(L)}},BUF(Li) = N (Li)⋃{⋃L∈N(Li)N (L)}in which α = {North,South,East,West} and Nα(Leaf ) are the leaves adja-cent to the Leaf in the α direction while ORT (East,West) is {North,South}and vice versa.If leaf L is chosen for refinement by one thread, none of the leaves in itsbuffer zone BUF(L), can be processed simultaneously by other threads. Tomanage this concurrency, we create an independent set of leaves where noleaf in the set is in the buffer zone of any other leaf in the set:IndSet =⋃{L ∈ LeafList/⋃L∈IndSetBUF(L)}Figure 5.3 shows how the definition of buffer zone is implemented toconstruct a maximal independent set in each parallel pass. Given an initialmesh conforming to the boundary, an initial layer of quadtree is constructedthat encloses the whole mesh. Then, a first set of independent leaves whichhave the maximum number of cells-to-be-refined is scheduled for refinementin parallel. After the first set is refined, the next sets of independent leavesare scheduled similarly. Figure 5.3 shows the independent leaves on thefirst layer of quadtree. After these eight refinement passes on quadtree645.1. Chernikov and Chrisochoides’ Algorithm(a) (b)(c) (d)(e) (f)(g) (h)Figure 5.3: An example of using independent set of leaves for safely insertingvertices in parallelleaves, next level of quadtree is constructed to allow for more concurrencyin parallel.A filtering function has been used to refine only the large cells. In thisway, we can ensure that most cells are created in the finer layers of theQuad/Oct tree to facilitate load balancing between threads. The filteringfunction, which was inspired by Chernikov and Chrisochoides’ [10] is definedas follows:F(triangle, leaf ) = log2( l(leaf )r(triangle)),Where l(leaf) is the side size of the leaf and r(triangle) is the circumradius655.2. Parallel Performance and Speed-up Resultsof the triangle. The parallel refinement algorithm conducts multiple passesto refine a mesh by refining the overlying Quad/Oct tree on the fly too.In each pass, the algorithm refines all cells which meet F < fi criterion inwhich:fi = fi−1 + gHere i is the number of refinement pass, g is the granularity9 size, and f0 isthe initial filter size which can be any arbitrary number, albeit one in ourimplementation.When constructing the subsequent layers of quadtree, each leaf gets splitto four children leaves. Hence it is necessary that each cell with filter sizeof F provided that F < fi and g >= 1 undergoes refinement; otherwisethe two-leaf distance criterion for independency is not sufficient anymore,since the threshold 4r between two independent Steiner points will not beobserved. The granularity greater than one ensures that in each quadtreelevel, all the large size cells are refined so that the maximum circumradius inthe mesh gets smaller by a factor of two. Thus in the next level of quadtreein which each leaf has divided into four children leaves with the side size ofLnew = Lold2 , the ‖p− q‖ > 4rnew = 4(rold2)condition still holds. Also eachleaf has a data structure which stores the bad quality triangles. So, when aleaf from the independent set is allocated to a thread for insertion, its poorquality triangles which meet the filtering criteria, undergo refinement. Theparallel refinement is presented in Algorithm 5.1.5.2 Parallel Performance and Speed-up ResultsBefore proceeding with the speed-up results, it is crucial to see whether ourparallel algorithm is stable. To check the stability criterion, the quality ofmeshes generated by both serial and parallel algorithm is compared. Figures5.4 and 5.5 show respectively in 2D and 3D that the parallel refinement9Granularity is the extent in which a system or a set of data is broken down into smallparts.665.2. Parallel Performance and Speed-up ResultsAlgorithm 5.1 Parallel Delaunay Insertion1. Construct initial quadtree layer.2. Queue all cells into their proper leaf.3. Initialize the LeafList with all the leaves.4. Set Filter Level (f) to 1.5. (The overall refinement step) While there are still cells to be refined,do:(a) If LeafList is empty:i. Create the next level of Quad/Oct tree.ii. Update the LeafList .(b) Find an independent set of leaves IndSet.(c) LeafList = LeafList/IndSet(d) For all the leaves in the independent set, do:i. Spawn a thread and create a task:ii. Allocate a leaf from the independent set to the thread.iii. Refine the mesh inside the allocated leaf until all triangles inthe leaf have F > f + 1 .(e) Synchronize threads.(f) Increment filter level (f) by 1.6. Done.675.2. Parallel Performance and Speed-up ResultsAngleDistribution (%)0 30 60 90 120 150 180051015Serial AlgorithmParallel AlgorithmFigure 5.4: Quality comparison of meshes generated by serial and parallelvertex insertion algorithms in 2D with 1,210,000 vertices inserted.algorithm does not degrade the mesh quality despite the fact that the orderof operations is changed. Although there are slight differences in the dihedralangle distribution, all these differences are bounded to the intermediateangles. For example, in 2D the minimum and maximum angles generatedby serial and parallel algorithms are bounded between 18.5◦ and 120◦. Forthe case shown in Figure 5.5 the minimum and maximum angle for both theserial and parallel algorithms are 15.5◦ and 153◦ respectively.5.2.1 Speed-upThis section presents speed-up results obtained on the Westgrid Breezy com-puter with nodes featuring 24 AMD cores at 2.4 GHz up to 256 GB of mem-ory for up to 16 threads. Some quality and load balance results have alsobeen run on a desktop computer in our lab featuring Intel core i7 quad-coreat 3.60 GHz and 16 GB of RAM. However, the results presented in thischapter have been conducted on Breezy if not stated otherwise. The Breezy685.2. Parallel Performance and Speed-up ResultsDihedral angleDistribution (%)0 30 60 90 120 150 180024681012 Serial AlgorithmParallel AlgorithmFigure 5.5: Quality comparison of meshes generated by serial and parallelvertex insertion algorithms in 3D with having 520,000 vertices inserted.nodes are slower in performance than the desktop computers in our lab andthey are solely used to investigate the speedup on higher number of threads.Parallel efficiency and the effects of work-load are investigated similar tosection 4.3.1. Figures 5.6 and 5.7 show the approximately linear speed-upresults up to 16 threads respectively in 2D and 3D.One can see that similar to the swapping algorithm, the speedup linedeparts from the ideal line as the number of threads and the workload in-creases. The process in which independent sets of leaves are calculated andscheduled to be refined is done in serial. Similarly, creating the subsequentlayers of Quad/Oct tree is currently performed in serial. However, the devi-ation from the full linear speed-up cannot be attributed to these two serialprocesses as they take up a small portion of the runtime (not more than1− 2 % of the whole refinement time as shown in Table 5.1). Thereby, par-allelizing these two serial sections could not lead to a full parallel speedup.Table 5.1 illustrates that the slight deviation from the full linear speed-up695.2. Parallel Performance and Speed-up ResultsNumber of ThreadsTime (s)4 8 12 1620406080 SpeedupIdeal Speedup(a) 2,100,000 vertices inserted.Number of ThreadsTime (s)4 8 12 1650100150200 SpeedupIdeal Speedup(b) 4,800,000 vertices inserted.Figure 5.6: The running time for 2D mesh refinement with different numberof vertices inserted, while the number of threads increases from 1 to 16Number of ThreadsTime (s)4 8 12 1620406080100120 SpeedupIdeal Speedup(a) 300,000 vertices are inserted.Number of ThreadsTime (s)4 8 12 16100200300400500600700SpeedupIdeal Speedup(b) 1,500,000 vertices are inserted.Figure 5.7: The running time for 3D mesh refinement with different numberof vertices inserted, while the number of threads increases from 1 to 16705.2. Parallel Performance and Speed-up ResultsThreads 2 4Run Time (s)ParallelpartFindingIndSetQuad/Octconstruc-tionParallelpartFindingIndSetQuad/Octconstruc-tion2D550,000Verts5.935 0.187 0.016 3.151 0.183 0.0162,150,00Verts40.110 0.362 0.019 21.714 0.367 0.0193D200,000Verts19.100 0.120 0.060 10.508 0.108 0.048780,000Verts108.668 0.181 0.066 63.967 0.206 0.066Table 5.1: Breakdown of the refinement time to its parallel and non-parallelized subsection on a node featuring Intel core i7 quad-core at 3.60GHz and 16 GB of RAM.is preserved in the threaded-refinement stage too, despite the fact that itshould be conflict-free by implementing Quad/Oct trees. Thorough perfor-mance analyses by the HPCToolKit showed that the problem arises fromthe memory allocation part.In the memory allocation part of the insertion algorithm as described inchapter 3, new chunks of unused entities are allocated to each thread-localcontainer from the shared global container. This process as illustrated earlierin chapter 3 contains some atomic and critical sections which inevitablycompromise the overall performance. Nevertheless, not implementing thismemory model would have posed more challenging problems such as raceconditions, and higher contentions at runtime. Also, it is noteworthy tomention that this memory problem is not encountered in parallel swappingas the swapping process does not increase the number of entities in themesh, so the initial thread-local containers do not run out of unused entitiesand they do not need to allocate new ones. Instead, they simply recyclethe deleted entities. To further investigate this memory problem with theparallel refinement, we conducted an experiment on the size of thread privatecontainers. Figure 5.8 shows that by increasing the size of thread private715.2. Parallel Performance and Speed-up ResultsThread-private container size (number of entities)Time (s)1024 8192 15360510152025303540 RefinementSwappingFigure 5.8: Sensitivity of refinement and swapping time with the size ofthread private containers.containers, refinement time decreases to an asymptotic threshold while theswapping runtime remains constant. Although by choosing a large enoughsize for thread private containers we can minimize the parallel refinementtime, a contention point still resides in the memory model. This contentionpoint in the memory allocation is responsible for the loss of linear scalabilityon the higher number of threads where there are higher chances of multiplethreads running out of available entities at the same time.Our implementation of Chernikov and Chrisochoides’ algorithm presentsbetter speedup results for 2D meshes implemented in GRUMMP [37] com-pared to the original work [10] by Chernikov et al. implemented in Triangle[45]. Similarity, we are gaining better speedup for up to 4 threads in 3Din compared to the original work [11] by Chernikov et al implemented in725.2. Parallel Performance and Speed-up ResultsNumber of threads 2 3 4GRUMMP 1.6 2.4 3.1Triangle 1.9 2.3 2.8Table 5.2: Comparison of speedup results generated by GRUMMP andChernikov’s algorithm [10] on Triangle for a 2D mesh with 17 million trian-gles. Parallel granularity level in both cases is 1. Also, we have used 5 levelof quadtrees in our refinement.Number of threads 2 3 4GRUMMP 1.5 2.3 2.9Tetgen 1.6 1.9 2.1Table 5.3: Comparison of speedup results generated by GRUMMP andChernikov’s algorithm [11] on Triangle for a 3D mesh with 5.8 million tetra-hedra, with 5 level of octrees refinement in each case. Granularity level inour algorithm is set to 1 however it is not stated what their granularity levelfor this example is.Tetgen [48]. The reason for better speedup and scalability in our algo-rithm can be attributed to better managing of concurrency by introducingthe filter function to allow only a certain size cells to be refined in eachQuad/Oct tree level in parallel. On the other hand, in their algorithm thereis a serial preprocessing refinement overhead which is responsible for satis-fying the relation between the upper bound on triangle circumradius andside size of leaves. Hence, there is a trade off between concurrency (thenumber of leaves) and the preprocessing serial overhead. Moreover, it isnot clear how thread-safety in their parallel memory management has beenaddressed. Thus, better scalability could also stem from a more efficientparallel memory model in our implementation.5.2.2 Load BalanceAs emphasized in Sections 1.3 and 4.3.2, distributing the work-load betweenavailable computing resources (threads in this thesis’s context) is a crucialyet not trivial task in gaining the maximum speed-up. In the parallel refine-735.2. Parallel Performance and Speed-up ResultsThreadsTime (s)Number of vertices0 1 2 351015.0E+001.0E+052.0E+053.0E+054.0E+055.0E+05TimeNumber of VerticesFigure 5.9: Load balance in 2D refinement.ment algorithm, the issue of load-balance is addressed in two places. First,the Quad/Oct tree structure gets finer as mesh goes under refinement. Byrefining the Quad/Oct tree, the number of leaves overlying the mesh in-creases, thereby the number of independent leaves increases which in turnhelps faster threads take on more available leaves. Also, a larger numberof leaves means less number of cells residing in each and consequently finerlevel of granularity. Second, a filtering function allows only a restricted num-ber of cells, bounded to a certain size, to be refined in each Quad/Oct treestructure. In this way, only large cells are refined in the coarser levels ofQuad/Oct tree with less number of independent leaves in each pass. As theQuad/Oct tree gets finer, smaller cells are also refined. This approach helpsto balance the number of independent leaves with the number of cells to berefined in each leaf so that no thread has to wait too long for other threadsto complete their tasks. Figures 5.9 and 5.10 show the well distributed loadbalance in mesh refinement both in 2D and 3D.745.2. Parallel Performance and Speed-up ResultsThreadsTime (s)Number of insertions0 1 2 310203040506070050000100000150000200000TimeNumber of insertionsFigure 5.10: Load balance in 3D refinement.75Chapter 6Concluding Remarks6.1 ConclusionFirst, I modified the serial memory model of GRUMMP to support concur-rent insertion and removal of entities in parallel. To do this, I introducedthread-private containers to store empty entities to be used by threads inparallel run time. This not only precludes race condition problems whileupdating memory in parallel but also decreases the occurrence of other mu-tually exclusive operations considerably.Second, I developed a new thread-parallel face and edge swapping al-gorithm for shared memory architectures with a vertex locking strategy toremove conflicts from the reconfiguration process. The algorithm is dividedinto two major parts: first data dependencies are identified and removed toform a conflict free set of faces or edges from a candidate list; then, the con-flict free list of faces or edges is reconfigured. The experiments showed thatthe locking operations, which need to be performed mutually exclusively,in the identifying and removing process are negligible to the other swap-ping operations. Furthermore, the load balance was preserved throughoutthe swapping algorithm by the means of face and edge migration betweenthreads. However, the inherent load imbalances in the first stage of swap-ping which checks the swapping criterion is responsible for a small slowdownin the parallel swapping algorithm. Nevertheless, the numerical experimentsshowed a near linear speedup up to 16 cores with a parallel efficiency of morethan 90% on 2 threads to 70% on 16 threads. To the best of our knowledge,this thread-parallel swapping algorithm is the first of its kind.Third, I parallelized the preexisting mesh generation and refinementmodules of GRUMMP with Chernikov and Chrisochoides’ parallel Delau-766.2. Future Worknay refinement algorithm. Different layers of Quad/Oct tree structures alongwith a proper filter function have been used to well distribute data (insteadof decomposing meshes) between threads while preserving the load balance.These auxiliary structures automatically resolve the conflicts on concurrentDelaunay refinement. Experiments show that this implementation of par-allel vertex insertion has 80% efficiency on 2 threads and close to 60% on16 threads. Also, the deviation from the full linear speedup is attributedto the memory allocation of refinement algorithm in parallel runtime asthreads run out of available entities. Yet, the very existence of this parallelmemory model is crucial to preclude race condition problems while concur-rently storing and updating mesh entities. Nevertheless, I am gaining betterspeedup and scalability for up to 4 threads compared to the original work byChernkov and Chrisochoides, which examined no more than four threads.Finally, well scalable and stable parallel mesh refinement and swap-ping modification algorithms for shared memory architectures are achieved.These algorithms are well implemented in the GRUMMP modules with thehighest possible code reuse.6.2 Future WorkThe parallel algorithms for both swapping and refinement modules wereexamined on shared-memory architectures featuring up to 24 processors.Although the usual number of processors embedded in shared memory ar-chitectures is of the same order, this number will increase sooner or later.Thus, a thorough performance analysis of the algorithm to find the hotspots and contention points which compromise the scalability on computerswith higher number of processors is inevitable. For example in the parallelswapping algorithm, the inherent load balance problem in the process whichstudies the swapping criteria must be addressed. Similarly, the possible con-tention points for the refinement algorithm which prevent the algorithm toscale well on higher number of threads must be reconsidered. One of thesepoints in refinement algorithm is where new memory slots are allocated tothreads on the fly as threads run out of available memory. Empirically, it776.2. Future Workhas been observed that the size of these memory slots plays an importantrole in determining the performance of the refinement algorithm especiallywhen the number of threads increases.The parallel memory model for this work is based on uniform memoryaccess (UMA) architectures where all the processors have the same accesstime to the whole memory. If this algorithm is to be implemented on non-uniform memory access (NUMA) architectures where processors have pref-erences in accessing different parts of memory, the memory model must bereconsidered. Otherwise, the different time accesses to the different parts ofmemory will kill the performance of both parallel swapping and refinementalgorithms.Finally, an extension of this work to the parallel distributed memorysystems by means of MPI can be envisaged. Although that the removingconflicts from the swapping candidate faces or edges is not applicable byitself in an MPI implementation, identifying exact types of dependenciesin face or edge swapping in advance can help deal better with the partitionboundaries. Also, Quad/Oct tree structures with a similar buffer zone can beused to distribute the mesh between processors while having each processorcommunicating only with its buffer leaves.78Bibliography[1] Posix threads programming. https://computing.llnl.gov/tutorials/pthreads/.[2] OpenMP 3.1 specification. http://openmp.org/wp/openmp-specifications/, 2011.[3] Frédéric Alauzet, Xiangrong Li, E. Seegyoung Seol, and Mark S. Shep-hard. Parallel anisotropic 3D mesh adaptation by mesh modification.Engineering with Computers, 21:247–258, 2006.[4] Timothy J. Barth. Aspects of unstructured grids and finite-volumesolvers for the Euler and Navier-Stokes equations. In Unstructured GridMethods for Advection-Dominated Flows, pages 6–1 – 6–61. AGARD,Neuilly sur Seine, France, 1992. AGARD-R-787.[5] A. Bowyer. Computing Dirichelet Tessellation. The Computer Journal,24:167–171, 1981.[6] A. N. Chernikov and N. P. Chrisochoides. Parallel guaranteed qualityplanar Delaunay mesh generation by concurrent point insertion. In 14thAnnual Fall Workshop on Computational Geometry, 2004.[7] A. N. Chernikov and N. P. Chrisochoides. Practical and efficient pointinsertion scheduling method for parallel guaranteed quality Delaunayrefinement. In Proceedings of the 18th Annual International Conferenceon Supercomputing, 2004.[8] A. N. Chernikov and N. P. Chrisochoides. Parallel 2D graded guar-anteed quality Delaunay mesh refinement. In Proceedings of the 14thInternational Meshing Roundtable, 2005.79Bibliography[9] A. N. Chernikov and N. P. Chrisochoides. Parallel guaranteed qual-ity Delaunay uniform mesh refinement. SIAM Journal on ScientificComputing, 28:1907–1926, 2006.[10] A. N. Chernikov and N. P. Chrisochoides. Generalized Delaunay meshrefinement from scalar to parallel. In Proceedings of the 15th Interna-tional Meshing Roundtable, 2006.[11] A. N. Chernikov and N. P. Chrisochoides. Three-dimensional Delaunayrefinement for multi-core processors. In Proceedings of the 22nd annualinternational conference on supercomputing, 2008.[12] L. Paul Chew. Guaranteed-quality triangular meshes. Technical ReportTR-89-983, Dept. of Computer Science, Cornell University, 1989.[13] L. Paul Chew. Guaranteed-quality mesh generation for curved surfaces.In Proceedings of the Ninth Annual Symposium on Computational Ge-ometry, pages 274–280. Association for Computing Machinery, May1993.[14] A. L. Chow. Parallel algorithms for geometric problems. PhD thesis,University of Illinoise, Urbana-Champain, USA, 1980.[15] N. Chrisochoides. A survey of parallel mesh generation methods. InProceedings of the 14th International Meshing Roundtable, 2005.[16] N. P. Chrisochoides. A new approach to parallel mesh generation andpartitioning problems. Computational Science, Mathematics and Soft-ware, pages 335–359, 2002.[17] Eric Brière de l’Isle and Paul-Louis George. Optimization of tetrahe-dral meshes. In Ivo Babushka, William D. Henshaw, Joseph E. Oliger,Joseph E. Flaherty, John E. Hopcroft, and Tayfun Tezduyar, editors,Modeling, Mesh Generation, and Adaptive Numerical Methods for Par-tial Differential Equations, pages 97–127. Springer-Verlag, 1995.[18] B. Delaunay. Sur la sphere vide. a la memoire de georges voronoi. Classedes sciences mathematiques et na, 6:793–800, 1934.80Bibliography[19] H. Edelsbrunner and D. Guoy. Sink-insertion for mesh improvement. InProceedings of the Seventeenth Annual Symposium on ComputationalGeometry, pages 115–123. ACM Press, 2001.[20] Lori A. Freitag and Carl F. Ollivier-Gooch. Tetrahedral mesh improve-ment using swapping and smoothing. International Journal for Numer-ical Methods in Engineering, 40(21):3979–4002, 1997.[21] Paul-Louis George and Houman Borouchaki. Delaunay Triangulationand Meshing: Application to Finite Elements. Hermes, Paris, 1998.[22] Paul-Louis George and Houman Borouchaki. Back to edge flips in threedimensions. In 12th International Meshing Roundtable, pages 393–402,2003.[23] P. J. Green and R. Sibson. Computing the Dirichlet Tessellation. TheComputer Journal, 21:178–173, 1977.[24] John L. Gustafson. Reevaluating Amdahl’s law. Communications ofthe ACM, 31:532–533, 1988.[25] M. T. Jones and P. E. Plassmann. Parallel algorithms for the adptiverefinement and partitioning of unstructured meshes. In Proceedings ofthe Scalable High-Perfromance Computing Conference, 1994.[26] C. Kadow. Adaptive dynamic projection-based partitioning for parallelDelaunay mesh generation algorithms. In SIAM Wrokshop on Combi-natorial Scientific Computing, 2004.[27] G. Karnriadakis and S. Orszag. Nodes, modes, and flow codes. PhysicsToday, 46:34–42, 1993.[28] G. Karypis and V. Kumar. Multilevel k-way partitioning scheme forirregular graphs. Journal of parallel and distributed computing, 48:96–129, 1998.81Bibliography[29] G. Karypis and V. Kumar. Parallel multilevel k-way partitioningscheme for irregular graphs. Society for Industrial and Applied Mathe-matics, 41:278–300, 1999.[30] C. L. Lawson. Properties of n-dimensional triangulations. ComputerAided Geometric Design, 3:231–246, 1986.[31] C. Lepage, A. St-Cyr, and W. Habashi. MPI parallelization of unstruc-tured mesh adaptation. Computational Fluid Dynamics, Part XVIII:727–732, 2004.[32] L. Linardakis. Parallel domain decoupling Delaunay method. SIAMJournal of Scientific Computing, 2004.[33] L. Linardakis and N. Chrisochoides. Delaunay decoupling method forparallel guaranteed quality planar mesh refinement. SIAM Journal ofScientific Computing, 27:1394–1423, 2006.[34] Michael Luby. A simple parallel algorithm for the maximal independentset problem. 4:1036–1053, 1986.[35] D. Nave, N. Chrisochoides, and L. P. Chew. Gauaranteed-quality par-allel Delaunay refinement for restricted polyhedral domains. Computa-tional Geometry: Theory and Applications, 28:191–215, 2004.[36] Carl Ollivier-Gooch. GRUMMP version 0.6.4 user’s guide. Technical re-port, Department of Mechanical Engineering, The University of BritishColumbia, 2012.[37] Carl F. Ollivier-Gooch. GRUMMP — Generation and Re-finement of Unstructured, Mixed-element Meshes in Parallel.http://tetra.mech.ubc.ca/GRUMMP, 1998–2010.[38] Doug Pagnutti. Anisotropic adaptation: Metrics and meshes. Master’sthesis, The University of British Columbia, Department of MechanicalEngineering, 2008. URL http://hdl.handle.net/2429/415.82Bibliography[39] N. M. Patrikalakis and H. N. Gursoy. Shape interogation by medialaxis transform. In Design Automation Conference (ASME), 1990.[40] G. Rokos, G. J. Gorman, J. Southern, and P. H. J. Kelly. A thread-parallel algorithm for anisotropic mesh adaptation. Parallel Computing,1308.2480, 2013.[41] J. Ruppert. A new and simple algorithm for quality 2-dimensionalmesh generation. In Proceedings of the Fourth ACM-SIAM Symposiumon Discrete Algorithms, pages 83–92, 1993.[42] J. Ruppert. A Delaunay refinement algorithm for quality 2-dimensionalmesh generation. Journal of Algorithms, 18(3):548–585, May 1995.[43] R. Said, N.P. Weatherill, K. Morgan, and N.A. Verhoeven. Distributedparallel Delaunay mesh generation. Computer Methods in Applied Me-chanics and Engineering, 177:109–125, 1999.[44] J. R. Shewchuk. Tetrahedral mesh generation by Delaunay refinement.In Proceedings of the fourteenth annual symposium on Computationalgeometry, pages 86–95, 1998.[45] Jonathan Shewchuk. Triangle: Engineering a 2D quality mesh genera-tor and Delaunay triangulator. In Proceedings of the First Workshop onApplied Computational Geometry, pages 124–133, Philadelphia, Penn-sylvania, May 1996. ACM.[46] Jonathan R. Shewchuk. Delaunay Refinement Mesh Generation. PhDthesis, School of Computer Science, Carnegie Mellon University, May1997.[47] Jonathan R. Shewchuk. Delaunay refinement algorithms for triangularmesh generation. Computational Geometry: Theory and Applications,22(1–3):21–74, May 2002.[48] H. Si. Tetgen version 1.4.1. http://tetgen.berlios.de/, 10 2006.83Bibliography[49] D. A. Spielman, S. H. Teng, and A. Ungor. Parallel Delaunay Refine-ment: Algorithms and Analyses. International Journal of Computa-tional Geometry & Applications, 17, 2007.[50] Philippas Tsigas and Yi Zhang. A simple, fast and scalable non-blockingconcurrent FIFO queue for shared memory multiprocessor systems. InProceedings of the thirteenth annual ACM symposium on parallel algo-rithms and architectures, pages 134–143, 2001.[51] Philippas Tsigas and Yi Zhang. Evaluating the performance of non-blocking synchronization on shared-memory multiprocessors. In Pro-ceedings of the joint international conference on Measurements andmodeling of computer systems, 2001.[52] David F. Watson. Computing the n-dimensional Delaunay tessellationwith application to Voronoi polytopes. Computer Journal, 24(2):167–172, 1981.84AppendicesAppendix A: OpenMP Programming GuideOpenMP, which is an abbreviation for Open specification for Multi-Processingis an application program interface (API) which is used to explicitly directmulti-threaded, shared-memory parallelism. OpenMP is comprised of threeprimary API components:1. Compiler directives.2. Runtime library routines.3. Environment variables.There are plenty of tutorials and handbooks associated with OpenMP andshared memory parallelism. Instead of explaining OpenMP all over again,we will guide the reader to the best tutorials available. The best startingtutorials can be found at:• B. Barney, OpenMP, https://computing.llnl.gov/tutorials/openMP/• T. Mattson and L. Meadows, A hands-on introduction to OpenMP,openmp.org/mp-documents/omp-hands-on-SC08.pdfHowever, as a programming guide, it is recommended to use the OpenMPspecification which includes a comprehensive explanation of all the API com-ponents of OpenMP along with ample examples to cover all the aspects ofOpenMP compiler directives and runtime library routines. The OpenMPspecification can be found at:• http://openmp.org/wp/openmp-specifications/85
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Thread-parallel mesh generation and improvement using...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Thread-parallel mesh generation and improvement using face-edge swapping and vertex insertion Zangeneh, Reza 2014
pdf
Page Metadata
Item Metadata
Title | Thread-parallel mesh generation and improvement using face-edge swapping and vertex insertion |
Creator |
Zangeneh, Reza |
Publisher | University of British Columbia |
Date Issued | 2014 |
Description | The purpose of this thesis is three-fold. First, we devise a memory model for unstructured mesh data for efficient use of memory on parallel shared memory architectures with the purpose of lowering the synchronization overhead between threads and also excluding the probability of occurring race conditions. Second, we present a new thread-parallel edge and face swapping algorithm for two and three dimensional meshes using OpenMP for shared memory architectures. We show how removing the conflicts from the reconfiguration procedure by applying a vertex locking strategy can result in a near linear speed-up with parallel efficiency of close to one on two threads and 70% with sixteen threads on shared-memory processors. Finally, we derive a parallel mesh generation and refinement module for shared memory architectures based on pre-existing serial modules — GRUMMP — by implementing Chernikov and Chrisochoides’ parallel insertion algorithm along with the two above tools. Experiments show a worst case parallel efficiency of 60% for parallel refinement with 16 threads. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Date Available | 2014-11-10 |
Provider | Vancouver : University of British Columbia Library |
Rights | Attribution-NonCommercial-NoDerivs 2.5 Canada |
IsShownAt | 10.14288/1.0167037 |
URI | http://hdl.handle.net/2429/51007 |
Degree |
Master of Applied Science - MASc |
Program |
Mechanical Engineering |
Affiliation |
Applied Science, Faculty of Mechanical Engineering, Department of |
Degree Grantor | University of British Columbia |
GraduationDate | 2015-02 |
Campus |
UBCV |
Scholarly Level | Graduate |
Rights URI | http://creativecommons.org/licenses/by-nc-nd/2.5/ca/ |
AggregatedSourceRepository | DSpace |
Download
- Media
- 24-ubc_2015_february_zangeneh_reza.pdf [ 2MB ]
- Metadata
- JSON: 24-1.0167037.json
- JSON-LD: 24-1.0167037-ld.json
- RDF/XML (Pretty): 24-1.0167037-rdf.xml
- RDF/JSON: 24-1.0167037-rdf.json
- Turtle: 24-1.0167037-turtle.txt
- N-Triples: 24-1.0167037-rdf-ntriples.txt
- Original Record: 24-1.0167037-source.json
- Full Text
- 24-1.0167037-fulltext.txt
- Citation
- 24-1.0167037.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.24.1-0167037/manifest