A Parallel Active-Set Method for Solving Frictional Contact Problems

by Joshua Alexander Litven
B.Sc., The University of Waterloo, 2010

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty of Graduate Studies (Computer Science)

THE UNIVERSITY OF BRITISH COLUMBIA (Vancouver)
February 18, 2013
© Joshua Alexander Litven 2012

Abstract

Simulating frictional contact is a challenging computational task and there exist a variety of techniques to do so. One such technique, the staggered projections algorithm, requires the solution of two convex quadratic program (QP) subproblems at each iteration. We introduce a method, SCHURPA, which employs a primal-dual active-set strategy to efficiently solve these QPs based on a Schur-complement method. A single factorization of the initial saddle point system and a smaller dense Schur-complement are maintained to solve subsequent saddle point systems. Exploiting the parallelizability and warm-starting capabilities of the active-set method as well as the problem structure of the QPs yields a novel approach to the problem of frictional contact. Numerical results of a parallel GPU implementation using NVIDIA's CUDA applied to a physical simulator of highly deformable bodies are presented.

Table of Contents

Abstract
Table of Contents
List of Tables
List of Figures
List of Algorithms
Acknowledgments
1 Introduction
  1.1 Problem Statement
  1.2 Solution Methodology
    1.2.1 The Primal-Dual Active-Set Method
    1.2.2 SCHURPA
  1.3 Contributions of Thesis
    1.3.1 Notation
2 Active-Set Methods for Quadratic Programming
  2.1 Quadratic Programming
  2.2 Overview of Active-Set Methods
  2.3 Classical ASMs
  2.4 The Primal-Dual ASM
    2.4.1 Derivation via the Semismooth Newton Method
    2.4.2 Comparison to Classical ASMs
3 SCHURPA: a Method for Solving Saddle Point Systems Arising from PDASM
  3.1 Limitations of Updating Factorizations
  3.2 The Schur-Complement Method
  3.3 Details of the SCHURPA Algorithm
4 The Eulerian Solids Simulator
  4.1 Overview
    4.1.1 Simulating Objects Without Contact
    4.1.2 Simulating Objects With Contact
  4.2 Adding Friction Using Staggered Projections
    4.2.1 Friction Model
    4.2.2 The Staggered Projections Algorithm
  4.3 Applying SCHURPA to Frictional Contact
  4.4 The SCHURPA-SP Algorithm
5 Solving the Contact and Friction QPs
  5.1 Saddle Point Systems Arising in Contact and Friction
    5.1.1 Contact
    5.1.2 Friction
  5.2 Banded SPD Systems
  5.3 Block Solver for Banded SPD Systems
6 GPU Implementation
  6.1 CUDA Programming on the GPU
    6.1.1 CUDA Programming Model
  6.2 Custom Kernels for SCHURPA
  6.3 CUDA Libraries
7 Results
  7.1 Randomly Generated QPs
  7.2 Frictional Contact in the Simulator
8 Conclusion
Bibliography
Appendices
A Data Storage Formats
B Additional Code

List of Tables

6.1 CPU vs. GPU comparison.
7.1 Statistics of the GPU-based implementations of SCHURPA and DIRECT.
A.1 Data storage formats for dense and symmetric banded matrices.

List of Figures

4.1 Nodal shape functions for a contact cell Ωc on a 2D grid.
5.1 Example of cell adjacency.
5.2 Typical banded structure of the contact Hessian.
5.3 Comparison of the banded Cholesky factorization and solver runtimes.
6.1 Computational grid of threads executed in parallel.
6.2 Runtimes of the symbgmm operation.
7.1 Runtimes of DIRECT and SCHURPA algorithms.
7.2 Decomposition of the GPU-based SCHURPA runtimes.
7.3 Simulation: Cylinder collision.
7.4 Simulation: Proportion of correct indices in A0c and A0f.
7.5 Simulation: Number of contacts, dimensions of GC and GF.
7.6 Simulation: Number of SCHURPA-SP solves.
7.7 Simulation: DIRECT and SCHURPA runtimes.
7.8 Simulation: Speedup ratios of SCHURPA to DIRECT.

List of Algorithms

2.1 Generic Active-set Method
2.2 Primal-dual Active-set Method
3.1 SCHURPA-INIT
3.2 SCHURPA
4.1 Staggered Projections
4.2 SCHURPA-SP
5.1 C0_solve
5.2 F0_solve
5.3 Block Substitution

Acknowledgments

First and foremost, I would like to thank both of my supervisors, Chen Greif and Dinesh K. Pai, for their outstanding guidance. Their curiosity, insights, and passion are truly inspiring. I greatly appreciate the thoughtfulness and understanding that they have shown me. I would like to thank David I.W. Levin for teaching me many important things about life. Special thanks go to Ye Fan, who coded the Eulerian solids simulator. I am also grateful to Uri Ascher for reading this thesis and providing excellent feedback. Finally, I would like to thank my family for their unconditional love and support throughout the years.

Chapter 1

Introduction

1.1 Problem Statement

Physical simulation, the computational process of simulating the physical objects we observe in reality, is a profoundly important and wide-ranging area of modern research. Efficient and scalable methods for simulating physical phenomena of ever-increasing size and complexity are therefore highly desirable. In particular, simulating realistic frictional contact has applications ranging from computer graphics to biomechanics.

An Eulerian solids simulator, presented in (Levin et al., 2011), simulates highly deformable solid bodies undergoing frictionless contact. The simulator utilizes the finite volume method, permitting a parallel implementation on the GPU using CUDA (NVIDIA, 2012b). However, the simulator is currently incapable of simulating frictional contact. Without friction, the simulator cannot capture important phenomena such as hands grasping objects or fingers sliding across a surface. Incorporating frictional contact into the simulator is therefore highly desirable.

Resolving frictional contact dynamics is a formidable task, and several techniques have been proposed to do so (Jean, 1999; Stewart and Trinkle, 1996; Kaufman et al., 2008). The staggered projections (SP) algorithm, introduced in (Kaufman et al., 2008), is an attractive method because it is robust and often converges rapidly. As the simulator implementation is parallel and scalable, the implementation of the frictional contact solution procedure should share these properties. The work presented in this thesis details an efficient parallel solution framework which incorporates the staggered projections algorithm to resolve frictional contact dynamics in the Eulerian solids simulator. We demonstrate the effectiveness and scalability of our method with a GPU implementation.

1.2 Solution Methodology

The staggered projections algorithm resolves frictional contact by iteratively solving a pair of convex quadratic program (QP) subproblems. A general QP may be given as

    minimize    (1/2) x^T H x − c^T x
    subject to  A^T x ≤ b                                        (1.1)
                E^T x = d,

where c, x ∈ R^n, b ∈ R^m, d ∈ R^p, A ∈ R^(n×m), E ∈ R^(n×p) and H ∈ R^(n×n). The QP (1.1) is convex if H is symmetric and positive semi-definite.
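To make the data in (1.1) concrete, here is a minimal Python/NumPy sketch of a small convex QP instance. It is an illustration of my own (the names H, A, E, b, c, d follow the text), not code from the thesis:

    import numpy as np

    n, m, p = 3, 2, 1
    H = np.array([[4.0, 1.0, 0.0],
                  [1.0, 3.0, 0.0],
                  [0.0, 0.0, 2.0]])   # symmetric positive definite Hessian
    c = np.array([1.0, -1.0, 0.5])
    A = np.random.randn(n, m)         # columns are inequality constraint normals
    b = np.ones(m)
    E = np.random.randn(n, p)         # columns are equality constraint normals
    d = np.zeros(p)

    def objective(x):
        # Quadratic objective of (1.1)
        return 0.5 * x @ H @ x - c @ x

    # (1.1) is convex iff H is symmetric positive semi-definite:
    assert np.allclose(H, H.T) and np.linalg.eigvalsh(H).min() >= 0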
QPs are ubiquitous in a diverse array of fields, and arise as subproblems of general-purpose optimization schemes such as the sequential quadratic programming method (Nocedal and Wright, 1999, Chapter 18). As such, solving these problems efficiently has been the subject of decades of research and remains an active area to this day. The fruit of these efforts is a plethora of algorithmic methodologies, each with strengths and weaknesses in its ability to handle sparse data, scalability, robustness, and accuracy.

One such methodology is the class of algorithms known as active-set methods (ASMs). A particularly attractive trait of these methods is their ability to warm start: a good initial guess of the solution to the QP being solved significantly improves the convergence rate of these algorithms. In physical simulation, a good guess of an optimal solution arises naturally due to temporal coherence. More precisely, at each time step of the simulation loop, the induced QPs are in some sense a small perturbation of the QP solved at the previous time step, so the previous solution provides a natural initial estimate. The size of this perturbation depends on the specific implementation, the time step size, and other factors. Thus ASMs are a natural approach to solving QPs arising in physical simulation. However, there are two fundamental limitations to the classical ASMs:

• they cannot handle sparse data, which limits their applicability to large-scale problems, and
• they cannot be parallelized easily, often requiring many cheap but sequential iterations.

In contrast, the class of interior point methods (IPs) can handle sparse data, and has become the more popular choice for solving large problems. However, IPs lack the warm-starting capabilities ASMs enjoy (Wright, 1987, Chapter 11). To overcome the limitations of the classical ASM approaches, we leverage the primal-dual active-set method (PDASM) (Ito and Kunisch, 2008) to retain the benefits of ASMs whilst being able to solve large, sparse QPs efficiently. PDASM is central to the work of this thesis.

1.2.1 The Primal-Dual Active-Set Method

Typically, each iteration of an ASM selects a working set which consists of a subset of the constraints given in (1.1). This working set serves as a guess of the optimal active set and yields an equality-constrained subproblem to be solved, or equivalently, a saddle point linear system. Unlike classical ASMs, which ensure primal or dual feasibility over iterates, PDASM can violate both primal and dual feasibility. Instead, PDASM induces the subproblem by enforcing complementarity to hold. One can show that PDASM is equivalent to a semismooth Newton method applied to the KKT conditions of (1.1) (Hintermüller et al., 2002) (see Section 2.4). Two fundamental differences between classical ASMs and PDASM are:

• PDASM allows multiple changes to the working set per iteration, so it can quickly identify the optimal active set, and
• being a semismooth Newton method, PDASM converges super-linearly; typically very few iterations are required.

These properties imply that PDASM overcomes the previously mentioned limitations of classical ASMs. Super-linear convergence allows the method to scale well with problem size. In Chapter 3, we utilize the multiple changes to the working set in our parallel implementation.
Recently, PDASM has been successfully applied to constrained optimal control (Bergounioux et al., 1999; Kunisch and Rösch, 2002), contact and friction (Brunssen et al., 2007; Hüeber and Wohlmuth, 2005), and other problems (Hintermüller, 2004). The convergence properties of PDASM are an active area of research. Ito and Kunisch showed global convergence for certain classes of Hessians (Ito and Kunisch, 2008). In (Kunisch and Rendl, 2003), it was observed that PDASM quickly eliminates the inactive constraints from the current estimate of the active set at a given iteration. This served as a motivation for our work.

1.2.2 SCHURPA

The efficiency of an ASM is largely determined by how the saddle point systems are solved. In classical methods, a factorization of the initial saddle point system K0 is produced, and the factors are then updated, e.g., by orthogonal factorizations. This updating process is what can potentially destroy sparsity. Another approach uses a factorization of K0 and, rather than updating factors, solves subsequent saddle point systems in an expanded form, given as

    [ K0    U ] [ y ]   [ c ]
    [ U^T   V ] [ π ] = [ w ].                                   (1.2)

Equation (1.2) is solved via the factorization of K0 and the Schur-complement matrix

    C = V − U^T K0^{-1} U.

This Schur-complement method was used in (Gill et al., 1990) and (Bartlett and Biegler, 2006), which describe primal and dual methods, respectively. Assuming the initial working set is sufficiently close to the optimal active set, the Schur-complement matrix C remains small. This assumption depends on the ability of the warm-starting strategy to accurately reflect the changes of the active set over time.

The motivation of the previously mentioned work was to solve the quadratic programming subproblems arising in the sequential quadratic programming method (SQP) for nonlinear optimization. In SQP, the nonlinear inequality constraints of the original problem are linearized, and as the solution converges the linearized constraints' active sets also converge. Thus, employing warm starts of the QPs results in rapid convergence of the ASMs. The primal Schur-complement approach was successfully implemented in (Betts, 1994) to solve sparse nonlinear problems.

In classical ASMs, a single change of the working set occurs per iteration. This corresponds to a single solve of the form K0^{-1} u to form the Schur-complement matrix C. In contrast, PDASM makes multiple changes to the working set per iteration, modifying the Schur-complement method to solves of the form K0^{-1} U, where U is a matrix encoding the changes of the working set. If warm-starting is used, one expects that the number of columns of U will not be large. Computing K0^{-1} U amounts to l solves of the form

    K0 x_i = u_i,   i = 1, ..., l,

which direct methods can solve efficiently once K0 is factorized. Furthermore, a parallel implementation of the solve with K0 can significantly improve the performance of the Schur-complement method. Ideally, if the solves were fully parallelized, the duration of an iteration of PDASM and of a classical ASM would be roughly equal. Since PDASM requires far fewer iterations than classical ASMs to make the equivalent changes to an initial working set, PDASM is expected to perform much better. We demonstrate this improvement via a parallel implementation, called SCHURPA (for SCHUR-complement method in PArallel), on the GPU using NVIDIA's CUDA.
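As a small illustration of this building block, the following Python sketch (my own; not the thesis code) performs the solves K0 x_i = u_i with a single prefactorization of K0. Each column solve is independent, which is what a parallel implementation exploits:

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def factorize(K0):
        return lu_factor(K0)               # one-time factorization of K0

    def solve_many(factors, U):
        # One call solves every column of U; the l columns are
        # independent and could be dispatched to parallel workers.
        return lu_solve(factors, U)

    K0 = np.array([[4.0, 1.0], [1.0, 3.0]])
    U = np.random.randn(2, 5)              # l = 5 working-set changes
    R = solve_many(factorize(K0), U)       # R = K0^{-1} U
    assert np.allclose(K0 @ R, U)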
1.3 Contributions of Thesis

Our contribution in this thesis is a solution framework which utilizes SCHURPA to solve the QP subproblems of SP. More specifically, we:

• introduce SCHURPA, a Schur-complement approach to solving the saddle point systems arising in PDASM;
• integrate SCHURPA into SP: we reformulate the problem of friction to be amenable to SCHURPA, and show that in the case of the simulator the saddle point systems may be phrased as banded SPD systems; and
• demonstrate the effectiveness of SCHURPA with a GPU implementation.

The thesis is organized as follows. Chapter 2 overviews the general class of active-set methods and introduces the primal-dual active-set method, showing its equivalence to a semismooth Newton method. Chapter 3 introduces SCHURPA, our method for solving the saddle point systems in PDASM, which is conducive to a parallel implementation. We then describe the Eulerian solids simulation framework in Chapter 4. The problem of adding friction to the simulator is discussed, along with the staggered projections algorithm. Chapter 5 describes how we exploit the problem structure of the QP subproblems in SP to solve them efficiently. Our parallel GPU implementation is discussed in Chapter 6. Results of SCHURPA on synthetic data and in the simulator are presented in Chapter 7. Conclusions and future work are given in Chapter 8.

1.3.1 Notation

We use upper case for matrices, lower case for vectors, and Greek letters generally for scalars. We denote the standard basis of R^n by {e_1, ..., e_n}, i.e. e_i = [δ_ij]_j ∈ R^n. The notation V = {v_i}_{i∈S} defines the matrix V with column vectors v_i indexed by the set S, and V = {v_i^T}_{i∈S} is defined analogously with row vectors v_i^T. Quantities with bars denote values at the next iteration (e.g. x̄), while initial quantities are subscripted by 0 (e.g. x_0). The active set of a primal variable x, denoted by A(x), is the subset of constraints which hold with equality at x. We denote an optimal solution to an optimization problem by x∗.

Chapter 2

Active-Set Methods for Quadratic Programming

Active-set methods (ASMs) are a class of algorithmic methodologies for solving QPs. A particularly attractive trait of ASMs is their ability to warm-start with an estimate of the solution. In many applications an estimate of A(x∗), the active set at the optimal solution, is known. As discussed in Section 1.2, ASMs can effectively utilize this estimate to drastically reduce the number of iterations required for convergence.

In this chapter we outline ASMs in the context of solving convex QPs, which are described in Section 2.1. Section 2.2 outlines the background and derivation of ASMs, and discusses the Karush-Kuhn-Tucker (KKT) conditions, which are necessary and sufficient for optimality of a convex optimization problem. We then briefly review the classical ASMs in Section 2.3, the primal ASM and the dual ASM, and describe the method of updating factorizations common to ASMs. Unsatisfied with the limitations of these approaches, we turn our eyes towards greener pastures in Section 2.4: the primal-dual ASM. Details of the algorithm can be found in (Ito and Kunisch, 2008).
2.1 Quadratic Programming

Convex quadratic programs (QPs) are an important class of optimization problems which may be stated as

    minimize    (1/2) x^T H x − c^T x
    subject to  a_i^T x ≤ b_i,   i ∈ I,                          (P)

where c ∈ R^n, x ∈ R^n, I = {1, 2, ..., m}, b = {b_i}_{i∈I} ∈ R^m, A = {a_i}_{i∈I} ∈ R^(n×m) is the constraint Jacobian, and H ∈ R^(n×n), the Hessian, is a symmetric positive semidefinite matrix so that the problem is convex. The problem could also include equality constraints, but for the purposes of this thesis, and without loss of generality, they are left out. In this work we assume strict convexity (i.e. H is positive definite) and feasibility of (P), so that the optimal solution x∗ exists and is unique.

2.2 Overview of Active-Set Methods

The class of ASMs (as well as most optimization algorithms) attempts to satisfy the Karush-Kuhn-Tucker conditions (Nocedal and Wright, 1999), which for (P) are given by

    H x∗ − c + Σ_{i∈I} λ_i∗ a_i = 0                              (2.1a)
    a_i^T x∗ ≤ b_i,             i ∈ I                            (2.1b)
    λ_i∗ ≥ 0,                   i ∈ I                            (2.1c)
    (a_i^T x∗ − b_i) λ_i∗ = 0,  i ∈ I.                           (2.1d)

For convex problems these are necessary and sufficient conditions for the primal and dual variables x∗ and λ∗, respectively, to be an optimal primal-dual pair, assuming some constraint qualification is satisfied. For example, the linear independence constraint qualification (LICQ) assumes the set {a_i}_{i∈A∗} is linearly independent. When the inequality and equality constraint functions are affine (as in (P)), no constraint qualification is needed. The KKT conditions can equivalently be written as

    H x∗ − c + Σ_{i∈A∗} λ_i∗ a_i = 0                             (2.2a)
    a_i^T x∗ = b_i,   i ∈ A∗                                     (2.2b)
    a_i^T x∗ < b_i,   i ∈ I\A∗                                   (2.2c)
    λ_i∗ ≥ 0,         i ∈ A∗                                     (2.2d)
    λ_i∗ = 0,         i ∈ I\A∗.                                  (2.2e)

Here we have used the fact that λ_i∗ = 0 for i ∈ I\A∗, which follows from Equation (2.1b) and Equation (2.1d). Consider the equality-constrained subproblem

    minimize    (1/2) x^T H x − c^T x
    subject to  a_i^T x = b_i,   i ∈ A∗.                         (2.3)

The KKT conditions of (2.3) are

    H x̂ − c + Σ_{i∈A∗} λ̂_i a_i = 0                              (2.4a)
    a_i^T x̂ = b_i,   i ∈ A∗.                                    (2.4b)

By strict convexity x̂ is unique. Since Equation (2.4) is a subset of the KKT conditions Equation (2.2), the optimal solution of (P) is also the optimal solution to (2.3), and since the solutions are unique, solving the subproblem also solves the full QP. Thus if A∗ were known a priori, one could simply solve (2.3) to solve the original problem.

Active-set methods employ strategies for finding A∗. Beginning with a primal-dual pair (x_0, λ_0), each iteration k of an ASM updates a working set A^k ⊆ I of constraints which represents a guess of A∗. The subproblem

    minimize    (1/2) x^T H x − c^T x
    subject to  a_i^T x = b_i,   i ∈ A^k                         (2.5)

is solved, resulting in the optimal primal-dual pair (x∗, λ∗) of the subproblem. The step directions

    p = x∗ − x^k,
    µ = λ∗ − λ^k

are then used to update the primal and dual iterates. A generic ASM is given in Algorithm 2.1.

Algorithm 2.1 Generic Active-set Method
Input: x_0, λ_0, A^0
for k = 0, 1, 2, ... do
    Solve (2.5) defined by A^k for the step directions p and µ
    Step x^{k+1} := x^k + αp,  α ∈ [0, 1]
         λ^{k+1} := λ^k + βµ,  β ∈ [0, 1]
    if the KKT conditions of (P) are satisfied then
        DONE; x∗ = x^{k+1} and λ∗ = λ^{k+1}
    else
        Choose A^{k+1}
    end if
end for
Output: x∗, λ∗
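As a concrete companion to Algorithm 2.1's termination test, the following Python helper (a hypothetical sketch of my own, not the thesis code) checks the KKT conditions (2.1a)-(2.1d) numerically for a candidate pair (x, λ), with A storing the constraint normals a_i as columns:

    import numpy as np

    def kkt_satisfied(H, c, A, b, x, lam, tol=1e-8):
        r = A.T @ x - b                          # residuals; r <= 0 means feasible
        stat = H @ x - c + A @ lam               # stationarity (2.1a)
        return (np.linalg.norm(stat) <= tol
                and np.all(r <= tol)             # primal feasibility (2.1b)
                and np.all(lam >= -tol)          # dual feasibility (2.1c)
                and np.all(np.abs(r * lam) <= tol))   # complementarity (2.1d)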
The computational "meat" of Algorithm 2.1 lies in solving (2.5), which may be rewritten as the symmetric indefinite linear system (called the saddle point system)

    [ H      A_k ] [ x ]   [ c   ]
    [ A_k^T  0   ] [ λ ] = [ b_k ],                              (2.6)

where A_k = {a_i}_{i∈A^k} and b_k = {b_i}_{i∈A^k}. The saddle point matrix

    K_k = [ H      A_k ]
          [ A_k^T  0   ]

is a special case of the symmetric indefinite matrix

    M = [ A    B ]
        [ B^T  0 ].

The following theorem is stated without proof (see, e.g., (Nocedal and Wright, 1999)).

Theorem 2.1. M is nonsingular iff A is positive definite on the nullspace of B^T.

Due to the assumption that H is positive definite we get the following corollary.

Corollary 2.1. K_k is nonsingular iff A_k^T has full row rank.

From Corollary 2.1 it follows that A^0 must prescribe a linearly independent set of constraints to ensure Equation (2.6) is solvable.

2.3 Classical ASMs

The two classical variants of ASMs are the primal and dual algorithms. The primal ASM, described in (Nocedal and Wright, 1999), chooses the primal step length α such that x^{k+1} is always feasible. The dual ASM, described in (Goldfarb and Idnani, 1983), chooses the dual step length β ensuring λ^{k+1} is feasible for the dual problem. To ensure convergence, the primal ASM assumes the active constraints are linearly independent at each feasible vertex; the dual requires that H be positive definite. Convergence proofs for these ASMs are based on showing that f(x^{k+1}) < f(x^k) in the primal version and f(x^{k+1}) > f(x^k) in the dual version; thus cycling can never occur. Since there is a finite number of active-set subsets, the algorithms converge (Nocedal and Wright, 1999). While this theoretical argument allows for a possibly exponential number of steps in the size of the input data, in practice this is rarely observed. ASMs typically add correct and remove incorrect constraints to and from the working set; thus their running time is proportional to the number of changes required to transform A^0 into A∗. This is why ASMs benefit from warm-starting with a good estimate of A∗.

In the primal and dual algorithms, one constraint is added to or removed from the working set A^k per iteration. Each saddle point system is not solved from scratch; rather, factorizations are updated as constraints are added and removed. In typical implementations these factors are dense, as maintaining sparse factorizations is difficult (Gill et al., 1987). We now give an example demonstrating the factorization update process. Consider the QP

    minimize    (1/2) x^T H x − c^T x
    subject to  x ≥ 0,                                           (2.7)

which is a special case of (P) where A = −I and b = 0. At iteration k of Algorithm 2.1, A^k defines the sets of bound and free components of x, which we denote by x_B and x_F, respectively, i.e. B = A^k and F = I\A^k. The resulting saddle point system is then

    [ H_{F,F}  H_{B,F}^T ] [ x_F ]   [ λ_F ]   [ c_F ]
    [ H_{B,F}  H_{B,B}   ] [ x_B ] − [ λ_B ] = [ c_B ].          (2.8)

Since x_B = 0 and λ_F = 0, we get

    H_{F,F} x_F = c_F,                                           (2.9)

which can then be used to solve for λ_B via

    λ_B = −c_B + H_{B,F} x_F.                                    (2.10)

Solving the saddle point system this way reduces the problem to a symmetric positive definite system of size |F|, the number of free variables.
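In code, the reduction (2.9)-(2.10) looks as follows; this is a minimal NumPy sketch of my own under the notation above (free and bound index sets F and B):

    import numpy as np

    def solve_reduced(H, c, free):
        n = H.shape[0]
        bound = np.setdiff1d(np.arange(n), free)
        x, lam = np.zeros(n), np.zeros(n)
        x[free] = np.linalg.solve(H[np.ix_(free, free)], c[free])    # (2.9)
        lam[bound] = -c[bound] + H[np.ix_(bound, free)] @ x[free]    # (2.10)
        return x, lam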
Suppose iteration k+1 binds the i-th variable of x_F. Dropping the F subscript for simplicity, we can write the current saddle point system Equation (2.9) as

    [ H11    h21   H31^T ] [ x1 ]   [ c1 ]
    [ h21^T  h22   h32^T ] [ xi ] = [ ci ]
    [ H31    h32   H33   ] [ x3 ]   [ c3 ].

Then we can solve the next saddle point system by solving H̄ x̄ = c̄, where

    H̄ = [ H11   0   H31^T ]        c̄ = [ c1 ]
         [ 0^T   1   0^T   ]             [ 0  ]
         [ H31   0   H33   ],            [ c3 ].

Suppose we have a Cholesky factorization of H_{F,F}, which we write as

    L L^T = [ L11              ] [ L11^T  l21  L31^T ]   [ H11    h21   H31^T ]
            [ l21^T  α         ] [        α    l32^T ] = [ h21^T  h22   h32^T ]   (2.11)
            [ L31    l32   L33 ] [             L33^T ]   [ H31    h32   H33   ].

We now wish to determine the Cholesky factorization of H̄:

    L̄ L̄^T = [ L̄11              ] [ L̄11^T  l̄21  L̄31^T ]   [ H11   0   H31^T ]
             [ l̄21^T  ᾱ         ] [        ᾱ    l̄32^T ] = [ 0^T   1   0^T   ]
             [ L̄31    l̄32   L̄33 ] [             L̄33^T ]   [ H31   0   H33   ].

Writing out the relevant equations of Equation (2.11) gives

    L11 L11^T = H11,
    L31 L31^T + l32 l32^T + L33 L33^T = H33.

We can now solve for the components of the updated Cholesky factors:

    L̄11 L̄11^T = H11 = L11 L11^T                            ⇒  L̄11 = L11
    L̄31 L̄11^T = H31 = L31 L11^T                            ⇒  L̄31 = L31
    L̄11 l̄21 = 0                                            ⇒  l̄21 = 0
    l̄21^T l̄21 + ᾱ² = 1                                     ⇒  ᾱ = 1
    L̄31 l̄21 + ᾱ l̄32 = 0                                    ⇒  l̄32 = 0
    L̄31 L̄31^T + l̄32 l̄32^T + L̄33 L̄33^T = H33
        = L31 L31^T + l32 l32^T + L33 L33^T                 ⇒  L̄33 L̄33^T = L33 L33^T + l32 l32^T.

Thus, as can be seen from the last equation above, binding a variable amounts to a rank-one update of a Cholesky factorization. It can be shown that freeing a variable corresponds to a rank-one downdate (Davis and Hager, 2006). Updating Cholesky factorizations after rank-one changes is detailed in the seminal work (Gill et al., 1974). All updating procedures require O(n²) floating point operations, as opposed to the O(n³) required for a full factorization.
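For reference, a rank-one Cholesky update of the kind just derived takes only a few lines. The following is the standard textbook algorithm (cf. Gill et al., 1974) transcribed to NumPy by me, not the thesis code: given lower-triangular L with A = L L^T, it returns L' with L' L'^T = A + v v^T, one O(n − k) pass per column, giving the O(n²) total mentioned above:

    import numpy as np

    def chol_update(L, v):
        L, v = L.copy(), v.copy()
        n = L.shape[0]
        for k in range(n):
            r = np.hypot(L[k, k], v[k])            # updated diagonal entry
            c, s = r / L[k, k], v[k] / L[k, k]
            L[k, k] = r
            if k + 1 < n:
                L[k+1:, k] = (L[k+1:, k] + s * v[k+1:]) / c
                v[k+1:] = c * v[k+1:] - s * L[k+1:, k]
        return L

    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    L2 = chol_update(np.linalg.cholesky(A), np.array([0.5, -1.0]))
    assert np.allclose(L2 @ L2.T, A + np.outer([0.5, -1.0], [0.5, -1.0]))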
In summary, we see that classical ASMs admit several advantages over competing methods such as the interior point method (IP). As mentioned, in practice ASMs can take advantage of warm-starting, whereas IPs have difficulty doing so (Wright, 1987, Chapter 11). Also, updating factors is a cheap operation compared to the full factorizations required in the iterates of IPs. The tradeoff is the large number of iterations ASMs require when the working set needs many changes. We are therefore limited by classical ASMs because

• they do not exploit sparsity, which limits their applicability to large-scale problems, and
• the iterations are inherently sequential, and therefore they cannot be parallelized.

As we will show in Chapter 3, the Schur-complement approach ameliorates the first limitation by solving Equation (2.6) in a different way: rather than updating factors, an initial saddle point system is factorized, and subsequent saddle point systems are expressed via solutions with the original saddle point matrix and a smaller Schur-complement matrix. We now address the second issue.

2.4 The Primal-Dual ASM

In an attempt to retain the advantages of ASMs, we turn to a more recent approach: the primal-dual active-set method (PDASM). A key divergence from its classical counterparts is the ability to make multiple changes to the active set per iteration. We will see that combining this with a linear algebra technique for solving the saddle point systems results in the parallel ASM we have been searching for (see Chapter 3).

2.4.1 Derivation via the Semismooth Newton Method

We follow (Hintermüller et al., 2002) in deriving PDASM. In a nutshell, PDASM is Algorithm 2.1 with α = β = 1 and the next working set selected as

    A^{k+1} = { i ∈ I | b_i − a_i^T x^{k+1} − λ_i^{k+1} < 0 }.   (2.12)

PDASM can be derived from a generalized Newton method applied to a set of semismooth equations. A comprehensive overview of semismooth equations and algorithms can be found in (Ulbrich, 2011).

We define a function to be semismooth if it is differentiable almost everywhere, mathematically speaking (that is, everywhere except for a set of measure zero). Consider the function

    φ(a, b) = min(a, b),

which is differentiable everywhere except along the line a = b and thus is a semismooth function. Other examples of semismooth functions include convex functions and piecewise differentiable functions. We have the following useful property.

Lemma 2.4.1. φ(a, b) = 0  ⇔  ab = 0, a ≥ 0, b ≥ 0.

Proof. min(a, b) = 0 holds if and only if a = 0 or b = 0, with a ≥ 0 and b ≥ 0, which holds if and only if ab = 0, a ≥ 0, b ≥ 0, as required.

Corollary 2.4.2. The KKT conditions of (P) may be reformulated as the solution of the equations

    F(x, λ) = [ Hx + Aλ − c ] = 0,                               (2.13)
              [ Φ(x, λ)     ]

where

    Φ_i(x, λ) = φ(b_i − a_i^T x, λ_i).                           (2.14)

Applying a generalized Newton method to semismooth equations was introduced in (Qi and Sun, 1993), and we briefly discuss the approach here. Let F : R^n → R^n and let D_F ⊆ R^n be the set of points where F is differentiable. The set

    ∂_B F(x) ≡ { lim_{x_k → x} ∇F(x_k) | {x_k} ⊆ D_F }

is called the B-subdifferential of F at x, where ∇F denotes the Jacobian of F. Intuitively, this is the set of limiting Jacobians near the point x. If F is differentiable at x, it is clear that ∂_B F(x) = {∇F(x)}. We may now formulate a generalized Newton iteration as

    x^{k+1} = x^k − V_k^{-1} F(x^k),                             (2.15)

where V_k ∈ ∂_B F(x^k). If F is semismooth at a solution x∗ of F(x) = 0 and all V ∈ ∂_B F(x∗) are nonsingular, local super-linear convergence can be proved (Ito and Kunisch, 2008). Recall that in the differentiable case, quadratic convergence occurs if ∇F(x∗) is nonsingular.

We now apply the generalized Newton method to Equation (2.13). The subdifferential of Φ is given by

    ∂_B Φ_i(x, λ) = ∂_B φ(b_i − a_i^T x, λ_i)
                  = { (−a_i, 0)^T }                   if b_i − a_i^T x < λ_i
                    { (0, e_i)^T }                    if b_i − a_i^T x > λ_i
                    { (−a_i, 0)^T, (0, e_i)^T }       if b_i − a_i^T x = λ_i.

We arbitrarily choose V_i = (0, e_i)^T ∈ ∂_B Φ_i(x, λ) in the final case. Let

    A  = { i ∈ I | b_i − a_i^T x^k − λ_i^k < 0 },
    Ā  = { i ∈ I | b_i − a_i^T x^k − λ_i^k ≥ 0 }.

Assume without loss of generality that the equations are ordered first by A and subsequently by Ā. Then, by taking the partial subdifferentials of F, the generalized Newton iteration Equation (2.15) applied to Equation (2.13) gives

    [ H       A    ] [ x^{k+1} − x^k ]     [ H x^k + A λ^k − c ]
    [ −A_A^T  0    ] [ λ^{k+1} − λ^k ] = − [ b_A − A_A^T x^k   ]
    [ 0       I_Ā  ]                       [ λ_Ā^k             ],

where A_A = {a_i}_{i∈A}, I_Ā = {e_i^T}_{i∈Ā} and b_A = {b_i}_{i∈A}. Equivalently,

    H x^{k+1} + A λ^{k+1} = c;                                   (2.16a)
    A_A^T x^{k+1} = b_A;                                         (2.16b)
    λ_Ā^{k+1} = 0,                                               (2.16c)

and after substituting Equation (2.16c) into Equation (2.16a) we get

    H x^{k+1} − c + Σ_{i∈A} λ_i^{k+1} a_i = 0,
    a_i^T x^{k+1} = b_i,   i ∈ A.

Notice that these are exactly the KKT conditions of the equality-constrained QP

    minimize    (1/2) x^T H x − c^T x
    subject to  a_i^T x = b_i,   i ∈ A.                          (2.18)

Thus we get an active-set procedure with the working set determined by A, which is how we defined the update in Equation (2.12). Intuitively, rather than satisfying primal or dual feasibility, PDASM satisfies the complementarity condition

    (a_i^T x^{k+1} − b_i) λ_i^{k+1} = 0,   i ∈ I,

which follows from Equation (2.16b) and Equation (2.16c).
Since either b_i − a_i^T x^{k+1} = 0 or λ_i^{k+1} = 0, the working set A chooses the former to hold if b_i − a_i^T x^k < λ_i^k; otherwise it ensures the latter holds. The super-linear convergence of the generalized Newton method applies to PDASM, and in practice very few iterations are required to obtain convergence (see Chapter 7). Combined with being an active-set method, and therefore amenable to warm-starting, this makes PDASM a very attractive approach for solving QPs. Algorithm 2.2 describes this stunningly simple yet powerful algorithm.

Algorithm 2.2 Primal-dual Active-set Method
Input: x_0, λ_0, A^0
for k = 0, 1, 2, ... do
    Solve (2.5) defined by A^k to compute the solution pair (x^{k+1}, λ^{k+1})
    if the KKT conditions of (P) are satisfied then
        DONE; x∗ = x^{k+1} and λ∗ = λ^{k+1}
    else
        A^{k+1} = { i ∈ I | b_i − a_i^T x^{k+1} − λ_i^{k+1} < 0 }
    end if
end for
Output: x∗, λ∗
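To make the iteration concrete, here is a minimal dense sketch of Algorithm 2.2 (my own illustration, not the thesis implementation) for the bound-constrained special case (2.7), i.e. x ≥ 0 with a_i = −e_i and b_i = 0, using the reduced solves (2.9)-(2.10). The update (2.12) then reads: bind index i whenever x_i − λ_i < 0.

    import numpy as np

    def pdasm_bound(H, c, active0, max_iter=50):
        n = len(c)
        active = np.zeros(n, dtype=bool)      # working set: {i : x_i = 0}
        active[list(active0)] = True
        for _ in range(max_iter):
            x, lam = np.zeros(n), np.zeros(n)
            F, B = ~active, active
            if F.any():
                x[F] = np.linalg.solve(H[np.ix_(F, F)], c[F])     # (2.9)
            lam[B] = H[np.ix_(B, F)] @ x[F] - c[B]                # (2.10)
            new_active = (x - lam) < 0        # update (2.12) for this case
            if np.array_equal(new_active, active):
                return x, lam                 # fixed point: KKT conditions hold
            active = new_active
        raise RuntimeError("PDASM did not converge (possible cycling)")

    H = np.array([[4.0, 1.0], [1.0, 3.0]])
    x, lam = pdasm_bound(H, np.array([-1.0, 2.0]), active0=[])
    # converges in two iterations to x = (0, 2/3), lam = (5/3, 0)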
Global convergence of PDASM has been proven for certain classes of problems in (Ito and Kunisch, 2008), e.g., problems with M-matrix Hessians. The two pitfalls for the successful convergence of PDASM are cycling and a potentially singular saddle point system. Assuming that all saddle point systems are uniquely solvable (which is true, for example, if the constraint Jacobian has full rank), the algorithm converges if and only if it is free of cycles. Typically, global convergence proofs rely on a merit function which measures the progress of the algorithm towards the solution and decreases monotonically, ensuring cycling cannot occur (Nocedal and Wright, 1999). The quantity ||F(x^k, λ^k)|| is a natural choice, but unfortunately is inadequate without additional assumptions on the input data. In practice, convergence occurs for well-conditioned problems. We observe unconditional global convergence of PDASM for our application (see Chapter 7).

2.4.2 Comparison to Classical ASMs

Here we recap the similarities and differences of PDASM with the classical ASMs discussed in Section 2.3.

Similarities:
• being an ASM, PDASM yields the benefits of warm-starting and accurate solutions,
• the standard techniques for computing and updating factorizations may be applied to PDASM, and
• under certain conditions global convergence can be proved. PDASM is observed to converge as long as the saddle point systems are nonsingular.

Differences:
• there is no longer any guarantee of primal or dual feasibility,
• multiple changes to the working set are made each iteration, and
• due to super-linear convergence, few iterations are required and thus PDASM scales well to large problems.

In the next chapter, we exploit these unique features of PDASM to develop a parallel algorithm for solving QPs.

Chapter 3

SCHURPA: a Method for Solving Saddle Point Systems Arising from PDASM

The central ingredient of our parallel solver is the method used to solve the saddle point system Equation (2.6). In this chapter we first show, in Section 3.1, that the standard approach of updating factorizations cannot be parallelized in PDASM. We then introduce the Schur-complement method in Section 3.2, originally developed in (Gill et al., 1990), whose work used the method in the primal ASM to solve large, sparse problems. We show that with PDASM, the Schur-complement method not only permits the use of sparse data, but also produces an efficient parallel solver, which we call SCHURPA. Accuracy and complexity of SCHURPA are discussed in Section 3.3.

3.1 Limitations of Updating Factorizations

Our goal is to utilize the information PDASM provides each iteration, a set of l changes to the working set, to develop a parallel algorithm. Recall that in the classical regime of ASMs, a single change to the working set corresponds to an update-factorization procedure. In principle, one could apply l such updates sequentially, but this would squander the use of parallelism and revert the algorithm's behaviour to that of the classical methods: each update would be cheap, but the updates could be numerous and are inherently sequential.

Let us return to the example discussed in Section 2.3, where Cholesky factors of the reduced Hessian were updated via a rank-one change, corresponding to a single change of the working set. A natural approach in the case of multiple changes would be to update the Cholesky factors in parallel, accounting for all changes to the working set simultaneously. For example, suppose we bind two variables, so that

    L L^T = [ L11                            ] [ L11^T  l21  L31^T  l41  L51^T ]
            [ l21^T  α                       ] [        α    l32^T  l42  l52^T ]
            [ L31    l32  L33                ] [             L33^T  l43  L53^T ]
            [ l41^T  l42  l43^T  β           ] [                    β    l54^T ]
            [ L51    l52  L53    l54   L55   ] [                         L55^T ]

          = [ H11    h21  H31^T  h41  H51^T ]
            [ h21^T  h22  h32^T  h42  h52^T ]
            [ H31    h32  H33    h43  H53^T ]
            [ h41^T  h42  h43^T  h44  h54^T ]
            [ H51    h52  H53    h54  H55   ].

The updated factors for the next iteration are

    L̄ L̄^T = [ L̄11                          ] [ L̄11^T  0  L̄31^T  0  L̄51^T ]
             [ 0^T   1                       ] [        1  0^T    0  0^T   ]
             [ L̄31   0   L̄33                ] [           L̄33^T  0  L̄53^T ]
             [ 0^T   0   0^T   1             ] [                  1  0^T   ]
             [ L̄51   0   L̄53   0    L̄55     ] [                     L̄55^T ]

           = [ H11   0  H31^T  0  H51^T ]
             [ 0^T   1  0^T    0  0^T   ]
             [ H31   0  H33    0  H53^T ]
             [ 0^T   0  0^T    1  0^T   ]
             [ H51   0  H53    0  H55   ].

Working out the equations as in the previous example yields the same rank-one update

    L̄33 L̄33^T = L33 L33^T + l32 l32^T,

as well as

    L̄53^T = L̄33^{-1} (l32 l52^T + L33 L53^T),
    L̄55 L̄55^T = L55 L55^T + l52 l52^T + l54 l54^T + L53 L53^T − L̄53 L̄53^T.

The last equation requires two rank-one updates as well as two corrections of rank up to the width of L53. If the variables being bound are close together, or the second variable is near the end, this is a low-rank update, but these positions are unknown a priori. As a second attempt, one could re-order the factors to produce a single column-row block change and thereby induce a rank-l update. For example, suppose we bind l variables so that

    P H P^T = [ H̄11   H̄21^T ],    P H̄ P^T = [ H̄11       ]       (3.1)
              [ H̄21   H̄22   ]               [       I_l  ]

for some permutation matrix

    P = [ P11  P12 ]
        [ P21  P22 ].

We wish to find the Cholesky factor L̄11 with H̄11 = L̄11 L̄11^T by updating the Cholesky factors

    H = [ L11       ] [ L11^T  L21^T ]                           (3.2)
        [ L21  L22  ] [        L22^T ].

Substituting Equation (3.2) into Equation (3.1) yields

    P11^T L̄11 L̄11^T P11 = L11 L11^T − P11^T H̄21^T P21 − P21^T H̄21 P11 − P21^T H̄22 P21.   (3.3)

Unfortunately, by construction P11 is not invertible unless P = I. Therefore Equation (3.3) cannot be used as a factorization-update equation. We abandon the dream of updating factors and instead turn to another idea: the Schur-complement method.
λ bk 27  3.2. The Schur-Complement Method A Schur-complement approach, developed in (Gill et al., 1990) for solving saddle point systems arising in a primal active-set method, computes an initial (sparse) factorization of the saddle point system    H K0 =  AT0  A0  (n+p)×(n+p) , ∈R  (3.4)  where p = |A0 | and A0 = {ai }i∈A0 is given by the initial working set A0 . The method relies on the following key observation: Adding or removing a constraint to the working set is equivalent to symmetrically appending a column and a row to the current saddle point system. To see this, consider the first iteration after the initial factorization of K0 . Suppose we add a constraint i. Then the new saddle point system is  H   AT  0  aTi  A0      ai  x  c        λ  = b  ,     0     bi µ  where µ is the Lagrange multiplier corresponding to the added constraint. Now suppose an initially active constraint i is removed from the working set. The new saddle point system can be written as  H  AT  0   A0  eTi       x  c          ei   λ = b0  .     γ 0  This adds the final constraint λi = 0, which is required for a non-active constraint, as well as modifies the equation aTi x = bi to aTi x + γ = bi , which can be satisfied for any x. Therefore, the constraint is “ignored” and the dummy 28  3.2. The Schur-Complement Method variable γ may be discarded. In general, for a given iteration k suppose there are l changes made between Ak−1 and Ak . The system to be solved is given by the form   K0  UT      U  y   c    =  , w π  (3.5)  where U ∈ R(n+p)×l and w ∈ Rl . The variables xk and λk can be obtained by “unscrambling” y k and π k . More precisely,  xki = yik ,  i = 1, ..., n  and  λki =      πsk     0       k yn+i  if constraint iwas the sth constraint added if constraint iwas the sth constraint removed otherwise  Although the system grows in size, by using the Schur-complement matrix defined as  C ≡ −U T K0−1 U, solving Equation (3.5) is equivalent to solving the following systems:  K0 v  = c  Cπ  = w − UT v  K0 y  = c − U π.  (3.6a)  29  3.2. The Schur-Complement Method Thus, the work required at each iteration involves two solves with K0 (and updating/solving with C). However, by solving for R = K0−1 U and noticing that v = K0−1 c is fixed throughout the algorithm we can write Equation (3.6) as  C  =  −U T R  Cπ  =  w − UT v  y  =  v − Rπ  and we see that l solves with K0 is required to form C and solve for y and π. As mentioned, an advantage of the Schur-complement approach is its abstraction of the solve with K0 in Equation (3.5). More precisely, we black-box the solution of K0 via a function K0 _solve so that  R = K0 _solve (U) .  The function K0 _solve is application-dependent and implementations can take advantage of specific hardware such as the GPU. In general computing the function requires a matrix factorization and solving requires solving with the computed factors (see Section 3.3). We emphasize that, in contrast to updating factors, the solves ri = K0 _solve (ui )  i = 1, ..., l  are naturally parallelizable. There is one final ingredient needed to make the algorithm above more efficient and ensure non-singularity of the Schur-complement matrix. Suppose a constraint i ∈ A0 is removed from Ak at some iteration k, adding the lth col-  30  3.3. Details of the SCHURPA Algorithm umn to U as described above. 
There is one final ingredient needed to make the algorithm above more efficient and to ensure non-singularity of the Schur-complement matrix. Suppose a constraint i ∈ A^0 is removed from A^k at some iteration k, adding the l-th column to U as described above. If at some later iteration the constraint is added back to A^k, appending another column to U would yield a rank-deficient matrix, and thus C would be singular. Instead of adding another column to U, we may simply remove the l-th column from U, as well as the l-th column and l-th row of C. Each modification to the working set of this form decreases the number of solves required by one. A similar argument holds for a constraint i ∉ A^0 which is added and subsequently removed from A^k. In addition to avoiding a solve, removing a column also reduces the memory cost of the algorithm. In our GPU implementation, U is stored as a dense matrix, and minimizing the memory footprint on the GPU is crucial.

We are now ready to combine PDASM with the Schur-complement method, which we call SCHURPA (SCHUR-complement method in PArallel), given in Algorithm 3.2. An initialization step, given in Algorithm 3.1, computes the function K0_solve and solves the initial saddle point system. We describe application-specific implementations of K0_solve for the solids simulator in Chapter 5. The properties of Algorithm 3.2 are the topic of the next section.

Algorithm 3.1 SCHURPA-INIT
Input: A^0
    Factorize K0 given by Equation (3.4) to produce the function K0_solve
    Solve the initial saddle point system: (x∗, λ∗) = K0_solve((c, b_0))
Output: x∗, λ∗, K0_solve

Algorithm 3.2 SCHURPA
Input: A^0, x_0, λ_0, K0_solve
for k = 0, 1, 2, ... do
    if the KKT conditions of (P) are satisfied then
        DONE; x∗ = x^k and λ∗ = λ^k
    else
        A^{k+1} = { i ∈ I | b_i − a_i^T x^k − λ_i^k < 0 }
    end if
    Compute U based on the changes between A^k and A^{k+1}
    Compute C = −U^T R, where R = K0_solve(U)
    Solve Equation (3.5) via Equation (3.7) to obtain (y^{k+1}, π^{k+1})
    Use y^{k+1} and π^{k+1} to obtain x^{k+1}, λ^{k+1}
end for
Output: x∗, λ∗
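For illustration, the step "Compute U based on the changes between A^k and A^{k+1}" of Algorithm 3.2 can be sketched as follows (my own code, under the bordering conventions of Section 3.2): an added constraint i contributes the column (a_i; 0) with right-hand-side entry b_i, while a constraint removed from A^0 at position s contributes (0; e_s) with entry 0:

    import numpy as np

    def build_U_w(A, b, A0, added, removed):
        n, p = A.shape[0], len(A0)
        cols, w = [], []
        for i in added:                     # constraints newly in the working set
            cols.append(np.concatenate([A[:, i], np.zeros(p)]))
            w.append(b[i])
        pos = {i: s for s, i in enumerate(A0)}
        for i in removed:                   # constraints of A0 no longer active
            e = np.zeros(n + p)
            e[n + pos[i]] = 1.0
            cols.append(e)
            w.append(0.0)
        return np.column_stack(cols), np.array(w)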
3.3 Details of the SCHURPA Algorithm

By solving the saddle point systems via the Schur-complement method, SCHURPA avoids the need for updating factorizations. The approach works well when the cost of computing K0_solve is significantly greater than the cost of solving with it. This was the case in (Gill et al., 1990; Bartlett and Biegler, 2006), which employed the primal and dual methods, respectively, to solve large sparse QPs arising in the sequential quadratic programming (SQP) method. The Schur-complement method computes an initial sparse factorization, and updating factors is avoided. As the QPs in SQP converge, few active-set changes are required to find the optimal solution. Our motivation, in contrast, is physical simulation, where due to temporal coherence the change in active sets between time steps may be small but never settles. Thus we expect a nontrivial number of active-set changes at each iteration, which suits the parallel linear solve of Algorithm 3.2. We also emphasize that their motivation was solving sparse problems sequentially, while our motivation is solving structured dense problems in parallel (see Chapter 4). This exemplifies the utility of the black-box abstraction of K0_solve.

The computational complexity of SCHURPA depends on the number of iterations, the cost of computing K0_solve, and the cost of solving with K0_solve. Due to the super-linear convergence of PDASM, Algorithm 3.2 typically converges in O(1) iterations; in practice, typically fewer than 10 iterations are required. For a generic implementation where a sparse matrix K0 ∈ R^(n×n) is factorized, Algorithm 3.1 has a complexity of O(n³), and solving with the resulting factors has complexity O(n²). The advantage is that only one factorization is required throughout the entire algorithm.

As the number of changes from A^0 to A^k grows, so does the size of the Schur-complement matrix C. If C grows too large, the Schur-complement method becomes inefficient, and a re-factorization of the current saddle point system should be computed to create a new K0_solve. Additionally, since U is stored as a dense matrix, memory limitations must be considered in a GPU implementation. Therefore, if either C grows sufficiently large or U can no longer be stored in memory, we compute a new K0_solve from the current iteration. In our implementation, C is re-computed and factored each iteration, as in (Bartlett and Biegler, 2006).

Finally, we note that the Schur-complement method is less robust than methods updating factors via orthogonal transformations. If the initial saddle point system is ill-conditioned, all subsequent saddle point solves will suffer a loss of accuracy through the Schur-complement approach. One possible amelioration is fixed-precision iterative refinement, applied when constraint drift occurs. However, as our QPs are found to be reasonably well-conditioned and non-degenerate in practice, iterative refinement was not required to identify the optimal active set (see Chapter 7).

Chapter 4

The Eulerian Solids Simulator

Equipped with SCHURPA, a parallel PDASM for solving QPs, we describe our solution procedure for the frictional contact problem arising in an Eulerian deformable solid bodies simulator, the subject of this chapter.

Deformable body simulations typically work by keeping track of material points associated with the object and updating their positions in space. This approach is called Lagrangian simulation. In contrast, Eulerian simulation fixes points in space and simulates objects moving through this space by advecting them through the object's velocity field. Introduced in (Levin et al., 2011), the Eulerian solids simulator handles complex, highly deformable solid objects undergoing frictionless contact. By discretizing the spatial domain into a regular grid, the simulator utilizes the finite volume method (Versteeg and Malalasekera, 2007). A parallel implementation of the simulator was developed on the GPU using CUDA (NVIDIA, 2012b). However, contact was assumed to be frictionless. Simulating friction is crucial for many interesting and complex phenomena: consider hands grasping an object, fingers sliding across a surface, hair blowing in the wind, or cloth sliding off a surface. Our goal is to imbue the simulator with the ability to simulate friction.

Section 4.1 provides a description of the simulator and a derivation of the contact QP solved at each time step of the simulation. We then introduce the frictional contact problem, and the staggered projections algorithm of (Kaufman et al., 2008) as our solution methodology, in Section 4.2. By appropriately reformulating the linearized friction cone in Section 4.3, we show that SCHURPA can be applied to the QPs that result from SP. The final SCHURPA-SP algorithm is outlined in Section 4.4.

4.1 Overview

4.1.1 Simulating Objects Without Contact

Here we briefly describe the method of simulating a single solid body in the simulator; the details can be found in (Levin et al., 2011). Suppose we are interested in simulating an object starting at time t0.
The object's configuration at time t0 is said to be in its reference configuration. A mapping between the object's reference configuration and its configuration at time t is given by

    x : R^3 × R^+ → R^3,
    x(X, t) = X + u(X, t).

The reference coordinate X is the coordinate of a point on the object in its reference configuration, and u(X, t) is the change, or deformation, of that point at time t. In typical Lagrangian solids simulations, the object is discretized into particles. The reference coordinates X of these particles act as degrees of freedom (DOFs) of the system, completely determining the configuration of the object. The spatial positions of these particles are computed by integrating the momentum equation

    ρ dv(X)/dt = ∇·σ + ρ b,

where v is the velocity of the particle, ρ is the density, σ is the Cauchy stress, and b are the body forces acting on the particle. In the Eulerian framework, however, the degrees of freedom are no longer reference coordinates, but rather spatial coordinates fixed on a grid in space. More precisely, we discretize space by a uniform grid G = (∆x, ∆y, ∆z) of dimensions (Lx ∆x) × (Ly ∆y) × (Lz ∆z) and define our degrees of freedom by

    x_ijk = (x0 + i∆x, y0 + j∆y, z0 + k∆z),

where i ∈ {0, ..., Lx}, j ∈ {0, ..., Ly}, k ∈ {0, ..., Lz}. By writing the velocity v as a function of x, the material derivative begets the modified momentum equation

    ρ (∂v/∂t + v·∇v) = ∇·σ + ρ b.                                (4.1)

Integrating Equation (4.1) yields a velocity field through which the object may be advected. For an integration region Ω centered around a node N = (i, j, k), we assume a constant density ρ and integrate Equation (4.1) to obtain

    m_N (∂v_N/∂t + v_N ∇v_N) = f_N,

where m_N is the integrated mass and f_N are the integrated stress and body forces inside Ω. To compute the velocities at time t+1, we use v_N^t in the advection term v_N ∇v_N and apply a first-order discretization to ∂v_N/∂t, yielding

    m_N v_N^{t+1} = m_N v_N^t + ∆t (f_N^t − m_N D(v_N^t) v_N^t),

where the matrix D is the discrete Jacobian matrix using the first-order upwind differencing scheme. Assembling the equations over the entire spatial grid for all objects yields

    M v^{t+1} = f∗,                                              (4.2)

where M ∈ R^(n×n) is the globally assembled diagonal mass matrix, v^{t+1} ∈ R^n is the global velocity vector and f∗ ∈ R^n is the global impulse vector. Once Equation (4.2) is solved, the reference coordinates are advected through the velocity field via the material derivative:

    DX/Dt = 0
    ⇒ ∂X/∂t + v^{t+1}·∇X = 0,

yielding the update equation

    X^{t+1} = X^t − ∆t D(v^{t+1}) X^t.

By tracking the reference coordinates, strain and stress can be properly computed, and an elastic object will always return to its undeformed state.
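A one-dimensional caricature of these two updates (an explicit velocity update with first-order upwind advection, followed by advection of the reference coordinates) is sketched below; this is my own illustration, whereas the simulator operates on a 3D grid with trilinear elements:

    import numpy as np

    def upwind_gradient(q, v, dx):
        # First-order upwind difference of q, biased against the flow
        # direction; boundary entries are left at zero for simplicity.
        g = np.zeros_like(q)
        g[1:-1] = np.where(v[1:-1] > 0,
                           (q[1:-1] - q[:-2]) / dx,    # backward difference
                           (q[2:] - q[1:-1]) / dx)     # forward difference
        return g

    def step(v, X, m, f, dx, dt):
        # Explicit velocity update, then advection of reference coordinates X.
        v_new = v + dt * (f / m - v * upwind_gradient(v, v, dx))
        X_new = X - dt * v_new * upwind_gradient(X, v_new, dx)
        return v_new, X_new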
4.1.2 Simulating Objects With Contact

Contact between bodies in the simulation is resolved by invoking Gauss' principle of least constraint (Lanczos, 1986), a variational formulation of classical mechanics, which states that the velocity is the solution to the following optimization problem:

    v^{t+1} = argmin_v  (1/2) v^T M v − v^T f∗                   (4.3)
    subject to  v ∈ V,

where V is the constraint set. In the case of the simulator, the constraints enforce non-interpenetration between objects. Consider a colliding pair of objects A and B with contact surface Γ_AB. Non-interpenetration constraints are formulated on the velocity level, so that the relative velocity along the unit normal is always nonnegative along the contact surface:

    v_rel · n ≥ 0,                                               (4.4)

where v_rel = v_A − v_B is the relative velocity and n is the unit normal to the surface. In typical FEM fashion, we integrate Equation (4.4) over each grid cell Ω_c where contact occurs to obtain

    ∫_{Γc} v_rel · n dΓ = ∫_{Γc} (v_A − v_B) · n dΓ ≥ 0,        (4.5)

where Γ_c = Ω_c ∩ Γ_AB. Expressing velocities in terms of nodal shape functions, we get

    v(x) = Σ_{N=1}^{L} φ_N(x) v_N,                               (4.6)

where v_N ∈ R^3, L = Lx Ly Lz and φ_N is a scalar shape function. In the implementation of the simulator we use trilinear shape functions, as in (Levin et al., 2011). Substituting Equation (4.6) into Equation (4.5) yields

    Σ_{N=1}^{L} (∫_{Γc} φ_N n dΓ) · v_N^A − Σ_{N=1}^{L} (∫_{Γc} φ_N n dΓ) · v_N^B = j_c^T v,

where

    j_c = { ( ∫_{Γc} φ_N n_x dΓ,  ∫_{Γc} φ_N n_y dΓ,  ∫_{Γc} φ_N n_z dΓ ) }_{N=1,...,L} ∈ R^n.

The constraints are assembled into a global constraint Jacobian matrix

    J = {j_c}_{c=1,...,m} ∈ R^(n×m),

and the resulting QP solved at each time step is given by

    v^{t+1} = argmin_v  (1/2) v^T M v − v^T f∗                   (4.7)
    subject to  J^T v ≥ 0.

Because contact only occurs on the surfaces of objects, the number of constraints is much smaller than the total number of degrees of freedom, i.e., m ≪ n. In such cases it is useful to transform (4.7) into its dual formulation, given by

    α^{t+1} = argmin_α  (1/2) α^T J^T M^{-1} J α + α^T (J^T M^{-1} f∗)    (4.8)
    subject to  α ≥ 0,

where α ∈ R^m. The velocities can then be recovered via

    v^{t+1} = M^{-1} (f∗ + c^{t+1}),                             (4.9)

where c^{t+1} ≡ J α^{t+1} can be interpreted as the generalized contact impulse, with α giving the magnitudes of those impulses. Reducing to the dual results in a problem of size m. Due to the diagonal structure of M, forming (4.8) is computationally tractable. We now turn to the more challenging case of simulating frictional contact.
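Before continuing, note how cheaply the dual data of (4.8) can be formed when M is diagonal; the sketch below (my own) also recovers velocities via (4.9). The resulting problem is exactly the bound-constrained form (2.7) with H = J^T M^{-1} J and c = −J^T M^{-1} f∗, so a PDASM such as the sketch in Chapter 2 applies directly:

    import numpy as np

    def dual_contact_qp(J, m_diag, f_star):
        Minv_J = J / m_diag[:, None]     # M^{-1} J via row scaling (M diagonal)
        H = J.T @ Minv_J                 # dual Hessian J^T M^{-1} J
        g = Minv_J.T @ f_star            # linear term J^T M^{-1} f*
        return H, g                      # minimize (1/2)a^T H a + g^T a, a >= 0

    def recover_velocity(m_diag, f_star, J, alpha):
        return (f_star + J @ alpha) / m_diag    # (4.9)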
The linearized Coulomb law may then be described by the set of feasible points

F_c = { f_p = T β_c : e^T β_c ≤ μ_c α_c, β_c ≥ 0 },   (4.11)

where e = (1, ..., 1)^T, μ_c ≥ 0 is the coefficient of friction and α_c ≥ 0 is the contact magnitude. F_c represents the convex hull of the columns of T. The feasible set over all contacts can now be written in terms of the tangent magnitudes as

{ β | E^T β ≤ diag(μ) α, β ≥ 0 },   (4.12)

where β ∈ R^{2lm} is the global tangent magnitude vector, E ∈ R^{2lm×m} is globally assembled from e, and diag(μ) constructs the diagonal matrix of the globally assembled vector μ. Turning from feasibility to optimality, we rephrase Coulomb's law as a variational principle, the maximal dissipation principle (Moreau, 1973): for a contact point p, the frictional force f maximizes the dissipation among all feasible forces. The dissipation is the rate of negative work, −f_p^T v^rel. We integrate over the contact surface Γ_c of the contact cell c to obtain the dissipation for a given contact:

∫_{Γc} −f_p^T v^rel dΓ = −( ∫_{Γc} f_p^T v^A dΓ − ∫_{Γc} f_p^T v^B dΓ )
  = −β_c^T ( Σ_{N=1}^{L} ∫_{Γc} T^T φ_N v^A_N dΓ − Σ_{N=1}^{L} ∫_{Γc} T^T φ_N v^B_N dΓ )   (4.13)
  = −β_c^T T_c^T v,   (4.14)

where

T_c = { ( ∫_{Γc} φ_N T_x^T dΓ, ∫_{Γc} φ_N T_y^T dΓ, ∫_{Γc} φ_N T_z^T dΓ )^T }_{N=1,...,L} ∈ R^{n×2l}.

Equation (4.13) follows from expressing the velocities as a linear combination of the shape functions, Equation (4.6). Summing over all contacts yields the total dissipation:

− Σ_{c=1}^{m} β_c^T T_c^T v = −β^T T^T v,

where

T = {T_c}_{c=1,..,m} ∈ R^{n×2lm}   (4.15)

is the globally assembled subspace of generalized friction impulses, and f := T β is the generalized friction impulse. Maximizing the dissipation over the feasible set, Equation (4.12), yields the optimization problem

β^{t+1} = argmax_β −β^T T^T v^{t+1}  subject to β ≥ 0, E^T β ≤ diag(μ) α.   (4.16)

We can transform Equation (4.16) into a QP in β by adding the generalized frictional impulse to the momentum update, Equation (4.9):

v^{t+1} = M^{-1}(f* + c^{t+1} + f^{t+1}) = M^{-1}(f* + c^{t+1} + T β^{t+1}),

yielding the friction QP

β^{t+1} = argmin_β (1/2) β^T T^T M^{-1} T β + β^T (T^T M^{-1}(c^{t+1} + f*))  subject to β ≥ 0, E^T β ≤ diag(μ) α^{t+1}.   (4.17)

Notice that (4.17) depends on the solution of the contact QP (4.8). Conversely, taking the generalized frictional impulse f^{t+1} into account modifies (4.8) to

α^{t+1} = argmin_α (1/2) α^T J^T M^{-1} J α + α^T (J^T M^{-1}(f^{t+1} + f*))  subject to α ≥ 0.   (4.18)

We now have two coupled convex QPs: (4.17), yielding the optimal friction impulse magnitudes β^{t+1} given the contact impulse c^{t+1}, and (4.18), yielding the optimal contact impulse magnitudes α^{t+1} given the friction impulse f^{t+1}. This intrinsic coupling is what makes frictional contact a challenging problem: finding the solution of the coupled problem is equivalent to globally minimizing a non-convex QP, which is in general NP-hard (Kaufman et al., 2008). We adopt the method of staggered projections to solve the frictional contact problem.

4.2.2 The Staggered Projections Algorithm

Staggered projections (SP), introduced in (Kaufman et al., 2008), is a robust, computationally efficient method for simulating discrete frictional contact response for a class of systems with finite degrees of freedom described in generalized coordinates.
We briefly outline the algorithm now. Recall the coupled friction and contact QPs (4.17) and (4.18), respectively. Beginning with an initial estimate f^0 = T β^0 of the friction impulse, SP solves the following pair of QPs at iteration s:

α^{s+1} = argmin_α (1/2) α^T J^T M^{-1} J α + α^T (J^T M^{-1}(f^s + f*))  subject to α ≥ 0,   (C)

β^{s+1} = argmin_β (1/2) β^T T^T M^{-1} T β + β^T (T^T M^{-1}(c^{s+1} + f*))  subject to β ≥ 0, E^T β ≤ diag(μ) α^{s+1}.   (F)

Note that if β^0 = 0 this reduces to the frictionless case discussed in Section 4.1.2. The staggered projections algorithm is given in Algorithm 4.1. The convergence criterion is

rel_error = [ (f_c^{s+1} − f_c^s)^T M^{-1} (f_c^{s+1} − f_c^s) ] / [ f^{sT} M^{-1} f^s ] ≤ tol,   (4.19)

where rel_error is the relative error in the kinetic metric and tol is a user-specified tolerance. Like ASMs, SP benefits greatly from warm starts: speed-ups of up to two orders of magnitude were observed in Kaufman (2009). Since each QP subproblem is strictly convex and feasible, the algorithm can never fail on a given iteration. While global convergence is not guaranteed, in practice few iterations are required, especially if warm-starting is employed. SP can be interpreted as a fixed-point predictor-corrector scheme (Kaufman, 2009). The predicted impulse f* is corrected by the coupled projections

f^{s+1} = P_{F(α^s)}(f* − c^s),
c^{s+1} = P_C(f* − f^{s+1}),

where P is the projection operator and

F(α) := { T β | E^T β ≤ diag(μ) α, β ≥ 0 },
C := { J α : α ≥ 0 }

are the sets of admissible friction and contact impulses, respectively. We are now in a position to incorporate SCHURPA into SP. Assuming for the moment that SCHURPA can solve (C) and (F), notice that only the linear terms of the objective functions and the right-hand sides of the contact and friction QPs change between iterations of Algorithm 4.1. In other words, the Hessians and constraint matrices remain constant. Thus the initial factorization computed in SCHURPA can be used not only within a single QP solve, but throughout the entire staggered projections algorithm! This idea is formalized in Section 4.4. However, we must first investigate the applicability of SCHURPA to the contact and friction QPs. In particular, the friction QP (F) must be reformulated to ensure that the saddle point systems occurring in SCHURPA are solvable, which we do in the next section.

Algorithm 4.1 Staggered Projections
Given β^0, A^0_c, A^0_f, tol
rel_error = ∞
for s = 0, 1, 2, ... do
  Solve the contact QP (C) for α^{s+1}
  Solve the friction QP (F) for β^{s+1}
  Compute rel_error given by Equation (4.19)
  if rel_error < tol then
    BREAK; α^{t+1} = α^{s+1} and β^{t+1} = β^{s+1}
  end if
end for
f^{t+1} = T β^{t+1}
c^{t+1} = J α^{t+1}
v^{t+1} = M^{-1}(f* + f^{t+1} + c^{t+1})
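As a minimal illustration, the SP driver loop can be sketched as follows, with hypothetical callbacks standing in for the QP solves (C) and (F) and for the products with J, T and M^{-1}; this mirrors Algorithm 4.1 rather than the actual implementation.

#include <functional>
#include <vector>

using Vec = std::vector<float>;

struct SPCallbacks {                                       // hypothetical hooks
    std::function<Vec(const Vec&)> solveContact;           // f -> alpha, via (C)
    std::function<Vec(const Vec&, const Vec&)> solveFriction; // (c, alpha) -> beta, via (F)
    std::function<Vec(const Vec&)> contactImpulse;         // alpha -> J * alpha
    std::function<Vec(const Vec&)> frictionImpulse;        // beta  -> T * beta
    std::function<float(const Vec&, const Vec&)> kineticDot; // a^T M^{-1} b
};

void staggeredProjections(const SPCallbacks& s, Vec beta, float tol,
                          Vec& alphaOut, Vec& betaOut) {
    Vec f = s.frictionImpulse(beta);
    for (;;) {
        Vec alpha    = s.solveContact(f);
        Vec c        = s.contactImpulse(alpha);
        Vec betaNext = s.solveFriction(c, alpha);
        Vec fNext    = s.frictionImpulse(betaNext);
        // Relative kinetic-metric error, Equation (4.19).
        Vec d(fNext.size());
        for (size_t i = 0; i < d.size(); ++i) d[i] = fNext[i] - f[i];
        float denom = s.kineticDot(f, f);
        if (denom == 0.0f) denom = 1.0f;     // guard the f^0 = 0 case
        float relError = s.kineticDot(d, d) / denom;
        f = fNext;
        beta = betaNext;
        if (relError <= tol) { alphaOut = alpha; betaOut = beta; return; }
    }
}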
4.3 Applying SCHURPA to Frictional Contact

Recall from Section 2.4 that local convergence of PDASM is guaranteed under the assumption that the saddle point systems are all nonsingular. In the case of the contact QP (C) this assumption holds, which we prove in Theorem 4.1 using the following lemma.

Lemma 4.1. The Jacobian matrix J of (4.7) is full rank.

Proof. We prove the lemma on a 2D grid using bilinear shape functions by assuming, for a contradiction, that the Jacobian has linearly dependent columns; the extension to 3D using trilinear shape functions as in the simulator is straightforward. Consider the four shape functions φ_N for N ∈ N_c := {N1, N2, N3, N4} providing local support to a contact cell Ω_c; see Figure 4.1.

Figure 4.1: Nodal shape functions for a contact cell Ω_c on a 2D grid.

Assume the associated contact surface Γ_c has a point strictly in its interior, so that

j_N = ( ∫_{Γc} φ_N n_x dΓ, ∫_{Γc} φ_N n_y dΓ, ∫_{Γc} φ_N n_z dΓ )^T ≠ (0, 0, 0)^T,   (4.20)

where j = j_c is the contact constraint and N ∈ N_c. Let C ⊆ {1, ..., m} be a subset of the constraints containing a linearly dependent set, so that

Σ_{c∈C} α_c j_c = 0   (4.21)

for scalars α_c ≠ 0, c ∈ C. We define (x_c, y_c) to be the center of the grid cell Ω_c. The idea now is to find a constraint c̄ ∈ C for which j̄ = j_c̄ has positions with nonzero elements that no other constraint in C shares, thus contradicting Equation (4.21). One can always find such a constraint c̄ by choosing the grid cell Ω_c̄ ∈ {Ω_c}_{c∈C} such that

x_c̄ ≥ x_c for all c ∈ C,  and  y_c̄ ≥ y_c for all c ∈ K,

where K = {c ∈ C | x_c = x_c̄}. In other words, c̄ selects the constraint cell with maximum x coordinate and, among those, maximal y coordinate. Notice that this cell is the only cell in C with support from the shape function φ_{N2}, where N2 ∈ N_c̄ (see Figure 4.1). This implies the entry of j̄ corresponding to j̄_{N2} (which is nonzero by Equation (4.20)) is nonzero in a position where every other constraint vanishes. But Equation (4.21) gives

j̄ = −(1/α_c̄) Σ_{c∈C\{c̄}} α_c j_c,

which is impossible, since the right-hand side vanishes at the position corresponding to N2 while j̄ does not. This contradicts Equation (4.21), and we conclude that the columns of J are linearly independent: J is full rank. ∎

Theorem 4.1. The saddle point systems resulting from PDASM applied to (C) are always nonsingular, and PDASM converges locally super-linearly.

Proof. From Lemma 4.1, J is full rank, and thus the Hessian J^T M^{-1} J of (C) is positive definite, hence nonsingular. Since the constraint matrix of (C) is simply the identity, any subset of constraints is linearly independent. Therefore, by Theorem 2.1, the saddle point systems are nonsingular, as required. ∎

Because SCHURPA is a PDASM, the following corollary is immediate.

Corollary 4.1. SCHURPA applied to the contact QP (C) converges locally super-linearly.

We now tackle the friction QP. Unfortunately, in the form given by (F) the saddle point systems can become singular because of the Hessian matrix. Recall the tangent matrix T = {T_c}_{c=1,..,m}, where

T_c = { ( ∫_{Γc} φ_N T_x^T dΓ, ∫_{Γc} φ_N T_y^T dΓ, ∫_{Γc} φ_N T_z^T dΓ )^T }_{N=1,...,L} ∈ R^{n×2l}.

As previously mentioned, the symmetry of the unit-length vectors T_i used to construct the linearized friction cone causes the matrices T_c, and hence T, to be rank deficient. This implies (F) is a semidefinite QP with a singular Hessian. The infinitely many solutions arise from the nullspace of T, which induces infinitely many representations of a single frictional impulse. More precisely, the symmetry implies

T_{i+1} = −T_i,  i = 1, 3, ..., 2ml − 1,   (4.22)

and thus

f = T β = Σ_{i=1}^{2ml} β_i T_i = Σ_{i=1,3,...,2ml−1} (β_i − β_{i+1}) T_i.   (4.23)

Even if we assume the set {T_i}_{i=1,3,...,2ml−1} is linearly independent, only the differences β_i − β_{i+1} uniquely define a friction impulse, and the Hessian T^T M^{-1} T ∈ R^{2lm×2lm} of (F) has rank lm. To make matters worse, we have 2ml + m constraints versus 2ml unknowns. Both the rank deficiency of the Hessian and the possible linear dependence of working-set constraints can cause a singular saddle point system in the course of PDASM, prohibiting the direct application of the method to (F).
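To see the redundancy in (4.23) concretely (a small worked step, not in the original): for any scalar c and odd i, shifting both magnitudes of a symmetric pair leaves the impulse unchanged, since T_{i+1} = −T_i gives

(β_i + c) T_i + (β_{i+1} + c) T_{i+1} = β_i T_i + β_{i+1} T_{i+1} + c (T_i + T_{i+1}) = β_i T_i + β_{i+1} T_{i+1}.

Hence each coordinate direction e_i + e_{i+1} lies in the nullspace of T, and these lm independent directions account for the rank lm of T^T M^{-1} T.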
These deficiencies are not just theoretical. In (Daryina and Izmailov, 2009), it was observed that PDASM fails a majority of the time when the number of constraints exceeds the number of variables. Their experiments on generic QPs also showed that PDASM becomes unreliable when the Hessian is appreciably rank deficient. Therefore, we must reformulate (F) in order to apply PDASM.

We can reformulate the friction QP as an equivalent convex QP with a positive definite Hessian in the case l = 2 (we henceforth assume l = 2). Let T̂ be an orthonormal basis for the tangent plane T_p of a contact point p. We can then rewrite the tangent impulse as

f_p = T̂ β_c,   (4.24)

where β_c ∈ R² is taken as constant in a contact cell c, as before. In the case l = 2, the linearized friction cone is simply expressed by the l1-norm of the coordinates of f_p in the basis T̂:

F_c = { T̂ β_c | ||β_c||_1 ≤ μ_c α_c },   (4.25)

which can be rewritten as

F_c = { T̂ β_c | Ê_c^T β_c ≤ diag(μ̂_c) α̂_c },   (4.26)

where

Ê_c = [ 1  −1   1  −1
        1  −1  −1   1 ] ∈ R^{2×4},
μ̂_c = (μ_c, μ_c, μ_c, μ_c)^T,
α̂_c = (α_c, α_c, α_c, α_c)^T.   (4.27)

As before, we assemble the global quantities Ê ∈ R^{2m×4m}, μ̂ ∈ R^{4m}, and α̂ ∈ R^{4m}. The global feasible set is

{ β | Ê^T β ≤ diag(μ̂) α̂ },   (4.28)

and the generalized tangent matrix is

T̂ = {T̂_c}_{c=1,..,m} ∈ R^{n×2m},   (4.29)

where

T̂_c = { ( ∫_{Γc} φ_N T̂_x^T dΓ, ∫_{Γc} φ_N T̂_y^T dΓ, ∫_{Γc} φ_N T̂_z^T dΓ )^T }_{N=1,...,L} ∈ R^{n×2}.

The modified friction problem is now

β^{s+1} = argmin_β (1/2) β^T T̂^T M^{-1} T̂ β + β^T (T̂^T M^{-1}(c^{s+1} + f*))  subject to Ê^T β ≤ diag(μ̂) α̂^{s+1}.   (F2)

The advantage of the formulation (F2) is that T̂ has full rank.

Lemma 4.2. The tangent matrix T̂ of (F2) is full rank.

Proof. The proof is analogous to that of Lemma 4.1, with the quantities corresponding to the constraint j_c replaced by those corresponding to T̂_c. ∎

Since T̂ is full rank, the Hessian T̂^T M^{-1} T̂ of (F2) is positive definite, and the problem size has been halved, from 4m in (F) to 2m. Although we have 4m constraints, the following theorem guarantees that PDASM never selects a linearly dependent working set, and therefore the saddle point systems remain nonsingular.

Theorem 4.2. Assuming the initial working set A^0 yields a linearly independent set of constraints, the saddle point systems resulting from PDASM applied to (F2) are always nonsingular, and PDASM converges locally super-linearly.

Proof. We show by induction that each step of Algorithm 2.2 selects a set of linearly independent constraints, which by Theorem 2.1 implies that the resulting saddle point system is nonsingular.

Base case: By assumption, A^0 is a linearly independent set of constraints, so the initial saddle point system is nonsingular.

Induction step: Assume at step k of Algorithm 2.2 that A^k defines a linearly independent set of constraints from (F2), and assume for a contradiction that A^{k+1} selects a linearly dependent set. The form of Ê_c implies that A^{k+1} must then contain a pair of constraints of the form

β_i + β_{i+1} = γ_i  and  −β_i − β_{i+1} = γ_i,   (4.30)

or

β_i − β_{i+1} = γ_i  and  −β_i + β_{i+1} = γ_i,   (4.31)

for some i = 1, 3, ..., 2m − 1 and γ_i := μ_{(i+1)/2} α_{(i+1)/2} > 0. Assume WLOG that the pair in Equation (4.30) is chosen, and denote these constraints by j and j + 1. Then, by the definition of Algorithm 2.2,

λ_j^{k+1} > γ_i − (β_i^{k+1} + β_{i+1}^{k+1}),   (4.32)
λ_{j+1}^{k+1} > γ_i + β_i^{k+1} + β_{i+1}^{k+1}.   (4.33)

Case 1: λ_j^{k+1} ≠ 0 or λ_{j+1}^{k+1} ≠ 0. Assume WLOG that λ_j^{k+1} ≠ 0. Since multipliers of constraints outside the working set are zero, j ∈ A^k; and since A^k defines a linearly independent set, j + 1 ∉ A^k. Thus

λ_{j+1}^{k+1} = 0  (since j + 1 was not active),
γ_i = β_i^{k+1} + β_{i+1}^{k+1}  (since j was active),
⇒  0 = λ_{j+1}^{k+1} > γ_i + β_i^{k+1} + β_{i+1}^{k+1} = 2γ_i > 0,   (4.34)

a contradiction.

Case 2: λ_j^{k+1} = λ_{j+1}^{k+1} = 0. Adding Equation (4.32) and Equation (4.33) yields

0 = λ_j^{k+1} + λ_{j+1}^{k+1} > 2γ_i > 0,   (4.35)

again a contradiction. Thus A^{k+1} cannot select a linearly dependent set of constraints. Therefore, by induction, Algorithm 2.2 induces a nonsingular saddle point system at every iteration, and converges locally super-linearly. ∎

Again, as in the case of contact, the following corollary is an immediate consequence of Theorem 4.2.

Corollary 4.2. SCHURPA applied to the friction QP (F2) converges locally super-linearly.

Corollary 4.1 and Corollary 4.2 give us the theoretical confidence to integrate SCHURPA into SP, solving the contact and friction QPs arising at each iteration.

4.4 The SCHURPA-SP Algorithm

We now summarize the entire frictional contact algorithm, which employs SCHURPA to solve the QP subproblems induced by SP. Recall that SP benefits from warm-starting when provided an initial guess of the working sets A^0_c and A^0_f close to the optimal active sets

A^{t+1}_c := { i | α_i^{t+1} = 0 },
A^{t+1}_f := { i | Ê_i^T β^{t+1} = μ̂_i α̂_i^{t+1} }.

In such a case, A^0_c and A^0_f will also be close to the optimal active sets of all subproblem QPs arising in SP. We are therefore motivated to share SCHURPA's initial factorization K_0 across all SP iterations, so that only one factorization for contact and one for friction is needed. As discussed in Section 3.2, SCHURPA performs a single factorization of the initial saddle point system K_0, and subsequent iterations use this factorization together with the Schur-complement method to solve the later saddle point systems. As mentioned in Section 4.2, the Hessians and constraint matrices of the contact and friction QPs do not change within the SP algorithm. Therefore, we only need to call SCHURPA-INIT during the first SP iteration to compute the initial solutions via the functions C0_solve and F0_solve for contact and friction, respectively. Subsequent saddle point systems across all SP iterations are solved using these functions together with the Schur-complement method, as in the SCHURPA algorithm. The combined algorithm of SCHURPA and SP is given in Algorithm 4.2 below. Notice that at each SP iteration, initial active sets must be given to SCHURPA. These are determined by applying exactly the same prediction scheme, Equation (2.12), of PDASM to the current SP iterates, i.e.,

A^{s+1}_c = { i | −α_i^{s+1} − λ_i^{s+1} < 0 },   (4.36)
A^{s+1}_f = { i | μ̂_i α̂_i^{s+1} − Ê_i^T β^{s+1} − ν_i^{s+1} < 0 },   (4.37)

where λ and ν are the Lagrange multipliers of the contact and friction QPs, respectively. SCHURPA-SP defines our solution procedure for solving frictional contact.
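As a concrete illustration of the prediction step, the following sketch computes the contact working set (4.36) from the current SP iterates; the friction set (4.37) is analogous, with the residual μ̂_i α̂_i − Ê_i^T β in place of −α_i. All names here are illustrative.

#include <vector>

// Prediction of the contact working set, Equation (4.36): index i is
// included whenever -alpha_i - lambda_i < 0.
std::vector<int> predictContactWorkingSet(const std::vector<float>& alpha,
                                          const std::vector<float>& lambda) {
    std::vector<int> active;
    for (size_t i = 0; i < alpha.size(); ++i)
        if (-alpha[i] - lambda[i] < 0.0f)
            active.push_back(static_cast<int>(i));
    return active;
}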
Algorithm 4.2 SCHURPA-SP
Input: A^0_c, A^0_f, tol
rel_error = ∞
Run SCHURPA-INIT(A^0_c) to compute α^0, λ^0 and C0_solve
Run SCHURPA-INIT(A^0_f) to compute β^0, ν^0 and F0_solve
for s = 0, 1, 2, ... do
  Solve the contact QP: (α^{s+1}, λ^{s+1}) = SCHURPA(A^s_c, α^s, λ^s, C0_solve)
  Solve the friction QP: (β^{s+1}, ν^{s+1}) = SCHURPA(A^s_f, β^s, ν^s, F0_solve)
  Compute rel_error given by Equation (4.19)
  if rel_error < tol then
    BREAK; α^{t+1} = α^{s+1} and β^{t+1} = β^{s+1}
  end if
  A^{s+1}_c = { i | −α_i^{s+1} − λ_i^{s+1} < 0 }
  A^{s+1}_f = { i | μ̂_i α̂_i^{s+1} − Ê_i^T β^{s+1} − ν_i^{s+1} < 0 }
end for
f^{t+1} = T̂ β^{t+1}
c^{t+1} = J α^{t+1}
v^{t+1} = M^{-1}(f* + f^{t+1} + c^{t+1})
Output: v^{t+1}

Now that the high-level description has been given, we move to the efficient implementation of the C0_solve and F0_solve functions in the next chapter.

Chapter 5

Solving the Contact and Friction QPs

In this chapter we describe the implementation details of Algorithm 4.2, our method for solving frictional contact in the simulator described in Chapter 4. Crucial to the efficiency of SCHURPA is the method for solving systems with the initial saddle point matrix, which in the case of staggered projections corresponds to the implementation of the functions that solve the initial contact and friction saddle point systems. We will show that, due to the structure of the contact and friction QPs, both functions require the factorization of a banded SPD matrix, and that applying these functions amounts to solving with the resulting factors. Section 5.1 describes the initial saddle point systems for the contact and friction quadratic programs; these may be formulated as SPD systems. We then show in Section 5.2 that the aforementioned SPD matrices exhibit a banded structure due to the Hessian matrices of the QPs. In particular, we utilize the nullspace method to solve the friction QP, which results in a banded reduced Hessian. Since SCHURPA requires multiple simultaneous solves, we are motivated to design an efficient, specialized parallel solver for dense SPD banded systems with multiple right-hand sides. A parallel solver based on block substitution is outlined in Section 5.3.

5.1 Saddle Point Systems Arising in Contact and Friction

The SCHURPA-SP loop given by Algorithm 4.2 requires the solution of the initial saddle point systems for the contact and friction QPs, which is performed in the functions C0_solve and F0_solve, respectively. In both cases the system can be rephrased as the solution of an SPD system. Let us first investigate the case of contact.

5.1.1 Contact

Consider the contact QP solved at each iteration of SCHURPA-SP, restated here for convenience:

α^{s+1} = argmin_α (1/2) α^T H^C α − α^T c  subject to α ≥ 0,   (C)

where

c = −J^T M^{-1}(T̂ β^s + f*),
H^C = J^T M^{-1} J.   (5.1)

Recall that A^0 defines the initial working set and hence the initial saddle point system. We define

F = I \ A^0,  B = A^0,

where I is the full index set, to be the initial sets of free and bound indices of α, respectively, so that the initial saddle point system for (C) is

[ H^C_{F,F}  H^C_{F,B}    0   ] [ α_F ]   [ c_F ]
[ H^C_{B,F}  H^C_{B,B}  −I_B  ] [ α_B ] = [ c_B ]
[     0        −I_B       0   ] [  λ  ]   [  0  ].   (5.2)

SCHURPA requires the solution of Equation (5.2) with arbitrary right-hand sides, i.e.,

[ H^C_{F,F}  H^C_{F,B}    0   ] [ α_F ]   [ c_F ]
[ H^C_{B,F}  H^C_{B,B}  −I_B  ] [ α_B ] = [ c_B ]
[     0        −I_B       0   ] [  λ  ]   [  b  ].   (5.3)

Simplifying Equation (5.3) yields

α_B = −b,   (5.4)
H^C_{F,F} α_F = c_F − H^C_{F,B} α_B,   (5.5)
λ = −c_B + H^C_{B,F} α_F + H^C_{B,B} α_B.   (5.6)

Thus solving the initial saddle point system of the contact QP amounts to solving an SPD system of size |F| × |F|, given by Equation (5.5). Algorithm 5.1 gives C0_solve.
Note that a factorization of H^C_{F,F} must be computed in order to solve Equation (5.5); this is performed in SCHURPA-INIT of Algorithm 4.2.

Algorithm 5.1 C0_solve
Input: (c, b)
Set α_B = −b
Solve the SPD system H^C_{F,F} α_F = c_F − H^C_{F,B} α_B
Set λ = −c_B + H^C_{B,F} α_F + H^C_{B,B} α_B
Output: (α, λ)

5.1.2 Friction

Due to the simple structure of the reformulated friction constraints, a sparse nullspace basis can be computed explicitly; this allows us to form the reduced Hessian without destroying sparsity. We can then reformulate the initial saddle point system of the friction QP as an SPD system via the nullspace method (Nocedal and Wright, 1999). Recall the friction QP of SCHURPA-SP, given by

β^{s+1} = argmin_β (1/2) β^T H^F β − β^T d  subject to Ê^T β ≤ γ,   (F2)

where

d = −T̂^T M^{-1}(c^{s+1} + f*),
γ = diag(μ̂) α̂^{s+1},
H^F = T̂^T M^{-1} T̂.   (5.7)

Let p = |A^0| denote the number of initially active constraints and

Ê_{A^0} = blkdiag( Ê_{1A^0}, Ê_{2A^0}, ..., Ê_{mA^0} ) ∈ R^{p×2m},   (5.8)

where Ê_{cA^0} is the (possibly empty) sub-matrix of

Ê_c^T = [  1   1
          −1  −1
           1  −1
          −1   1 ]

defined by A^0. Theorem 4.2 ensures that SCHURPA selects at most two linearly independent constraints of Ê_c to be active. If exactly two are selected, this completely determines the unknowns β_{2c−1} and β_{2c}, and they can be removed from the problem. Therefore we may assume

Ê_{cA^0} ∈ R^{q×2}, where 0 ≤ q ≤ 1.   (5.9)

If q = 1 we let

Z_c = (1, −1)^T if Ê_{cA^0} = ±(1, 1),  or  Z_c = (1, 1)^T if Ê_{cA^0} = ±(1, −1).   (5.10)

Notice that Ê_{cA^0} Z_c = 0. If q = 0 there is no reduction, and we let

Z_c = [ 1 0
        0 1 ].   (5.11)

Proposition 5.1.1. The matrix

Z = blkdiag( Z_1, Z_2, ..., Z_m ) ∈ R^{2m×(2m−p)}   (5.12)

forms a basis for the nullspace of Ê_{A^0}.

Proof. Z is full rank by construction, since each Z_c is full rank. Computing the product blockwise,

Ê_{A^0} Z = blkdiag( Ê_{1A^0} Z_1, ..., Ê_{mA^0} Z_m ) = 0,   (5.13)

where we define the product Ê_{cA^0} Z_c to be the empty matrix if Ê_{cA^0} is empty. Thus Z is a basis for the nullspace of Ê_{A^0}. ∎

Equipped with a nullspace basis of Ê_{A^0}, we can solve the saddle point system via the nullspace method, which we briefly outline here. The initial saddle point system induced by A^0 is

[ H^F      Ê_{A^0}^T ] [ β ]   [ d       ]
[ Ê_{A^0}     0      ] [ ν ] = [ γ_{A^0} ].   (5.14)

SCHURPA requires the solution of Equation (5.14) with arbitrary right-hand sides, i.e.,

[ H^F      Ê_{A^0}^T ] [ β ]   [ d ]
[ Ê_{A^0}     0      ] [ ν ] = [ b ].   (5.15)

Any solution to Ê_{A^0} β = b can be written as

β = β_p + Z β_z,   (5.16)

where β_p is a particular solution. Substituting Equation (5.16) into Equation (5.15) and multiplying the first block equation by Z^T (which annihilates the multiplier term, since Z^T Ê_{A^0}^T = 0) yields the reduced Hessian equation

(Z^T H^F Z) β_z = Z^T (d − H^F β_p).   (5.17)

Note that we can easily compute β_p by choosing

β_p = Ê_{A^0}^T b / 2,   (5.18)

since

Ê_{A^0} β_p = Ê_{A^0} Ê_{A^0}^T b / 2 = b,   (5.19)

because Ê_{A^0} Ê_{A^0}^T = 2 I_p: each row of Ê_{A^0} has exactly two nonzero entries ±1, and distinct rows have disjoint support. Upon solving for β_p and β_z, β can be computed from Equation (5.16), and the corresponding Lagrange multipliers are

ν = Ê_{A^0} (d − H^F β) / 2.   (5.20)

As in the contact QP, we have reduced the problem to an SPD system, Equation (5.17), of size (2m − p) × (2m − p).
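Because Z is block diagonal with 2×1 or 2×2 blocks, products with Z and Z^T can in principle be applied cell by cell without ever storing Z; the following sketch (with illustrative names, not the actual implementation) shows the idea for y = Z x.

#include <vector>

// Per-cell nullspace block: either the 2x2 identity (q = 0) or a single
// 2-vector (1,-1) or (1,1) (q = 1), per Equations (5.10)-(5.11).
struct CellBlock {
    bool reduced;     // true when one constraint of the cell is active
    float z0, z1;     // the 2-vector Z_c when reduced == true
};

// y = Z x: x has one entry per reduced cell and two entries per
// unreduced cell; y has two entries per cell.
std::vector<float> applyZ(const std::vector<CellBlock>& cells,
                          const std::vector<float>& x) {
    std::vector<float> y;
    size_t k = 0;
    for (const CellBlock& c : cells) {
        if (c.reduced) {            // Z_c is a 2x1 block
            y.push_back(c.z0 * x[k]);
            y.push_back(c.z1 * x[k]);
            k += 1;
        } else {                    // Z_c is the 2x2 identity
            y.push_back(x[k]);
            y.push_back(x[k + 1]);
            k += 2;
        }
    }
    return y;
}
// Z^T x is analogous, accumulating z0*x[2c] + z1*x[2c+1] per reduced cell.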
The function F0_solve is given in Algorithm 5.2 and, as in the case of contact, requires a factorization of Z^T H^F Z, which is computed in SCHURPA-INIT of Algorithm 4.2.

Algorithm 5.2 F0_solve
Input: (d, b)
Set β_p = Ê_{A^0}^T b / 2
Solve the SPD system (Z^T H^F Z) β_z = Z^T (d − H^F β_p)
Set β = β_p + Z β_z
Set ν = Ê_{A^0} (d − H^F β) / 2
Output: (β, ν)

While nullspace matrices are typically dense, and the reduced Hessian usually loses the sparsity structure of the original Hessian, this is not the case here. We show in the next section that, due to the structure of H^C and H^F, the SPD matrices in Equation (5.5) and Equation (5.17) are banded. This banded structure is critical to the efficient implementation of SCHURPA-SP.

5.2 Banded SPD Systems

The computational "meat" of SCHURPA-SP is the factorization of, and subsequent solving with, the SPD matrices

G^C = H^C_{F,F},   (5.21)
G^F = Z^T H^F Z.   (5.22)

These matrices are in fact banded. We first prove that H^C and H^F exhibit a banded structure.

Proposition 5.1. Upon discretization using trilinear shape functions, H^C is a banded matrix.

Proof. We prove the proposition on a 2D grid using bilinear shape functions; the extension to 3D using trilinear shape functions as in the simulator is straightforward. Consider the four shape functions φ_N for N ∈ N_i := {N1, N2, N3, N4} providing local support to a given contact cell Ω_i, as pictured in Figure 4.1. The locality of the shape functions implies

J_{iN} ≠ (0, 0, 0)^T ⟺ N ∈ N_i.

We define grid cells Ω_i and Ω_j to be adjacent iff they share a common shape function with local support, as illustrated in Figure 5.1.

Figure 5.1: Example of cell adjacency on a 2D grid. Grid cells Ω_i and Ω_j are adjacent as they share a common shape function, while Ω_i and Ω_k are not adjacent.

Thus, if Ω_i and Ω_j are not adjacent,

J_{iN} J_{jN} = 0,  N = 1, ..., L  ⇒  H^C_{ij} = Σ_k J_{ik} J_{jk} / M_{kk} = 0.

In other words, H^C has a bandwidth of at most ρ_max, where

ρ_max = max_{i,j} { |i − j| : grid cells Ω_i and Ω_j are adjacent },   (5.23)

as required. ∎

We denote the bandwidth of H^C by ρ. The constraint cells are ordered by the standard triplet ordering (x, y, z), i.e., first by x, then by y, then by z coordinates. Assuming this ordering roughly groups adjacent constraint cells together, ρ should be small relative to m. Figure 5.2 shows a typical sparsity pattern of H^C, exhibiting a banded structure. A bandwidth-minimizing procedure such as Cuthill-McKee reordering could also be applied to decrease ρ.

Figure 5.2: Typical banded sparsity pattern of the contact Hessian H^C arising in the simulator.

Proposition 5.2. The matrix H^F is banded, with a bandwidth of at most 2ρ.

Proof. The proof is analogous to that of Proposition 5.1, with the quantities corresponding to J replaced by those corresponding to T̂. ∎

The following lemma is an immediate consequence of Proposition 5.1.

Lemma 5.1. The matrix G^C is banded and SPD, with bandwidth at most ρ.

Proof. Since G^C is a principal sub-matrix of H^C, its bandwidth is at most that of H^C, which is at most ρ. ∎
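For illustration, the bound (5.23) can be evaluated directly from the ordered list of contact cells; the sketch below uses a hypothetical adjacency test based on the cells' integer grid coordinates (cells share a supporting shape function when their coordinates differ by at most one in each dimension).

#include <algorithm>
#include <cmath>
#include <vector>

struct Cell { int x, y, z; };   // integer grid coordinates of a contact cell

// Adjacency per Section 5.2: the cells share a supporting shape function.
bool adjacent(const Cell& a, const Cell& b) {
    return std::abs(a.x - b.x) <= 1 &&
           std::abs(a.y - b.y) <= 1 &&
           std::abs(a.z - b.z) <= 1;
}

// Direct evaluation of rho_max in Equation (5.23) for a given cell ordering.
int bandwidth(const std::vector<Cell>& cells) {
    int rho = 0;
    for (size_t i = 0; i < cells.size(); ++i)
        for (size_t j = i + 1; j < cells.size(); ++j)
            if (adjacent(cells[i], cells[j]))
                rho = std::max(rho, static_cast<int>(j - i));
    return rho;
}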
Finally, we show that G^F is also banded.

Lemma 5.2. The matrix G^F is banded and SPD, with bandwidth at most 2ρ + 1.

Proof. Recall that Z = blkdiag(Z_1, ..., Z_m), where Z_c ∈ R^{2×q} with q = 1 or q = 2. We decompose H^F into 2 × 2 blocks,

H^F = [ H^F_{c,c'} ]_{c,c'=1,...,m},  H^F_{c,c'} ∈ R^{2×2},   (5.24)

which captures the 2ρ bandwidth of H^F within the block form: H^F_{c,c'} = 0 whenever |c − c'| > ρ. Now we simply compute the product blockwise,

G^F = Z^T H^F Z = [ Z_c^T H^F_{c,c'} Z_{c'} ]_{c,c'=1,...,m},   (5.25)

so the (c, c') block of G^F vanishes whenever H^F_{c,c'} does. From Equation (5.25) we see that G^F has a bandwidth of at most 2(ρ + 1) − 1 = 2ρ + 1, as required. ∎

We emphasize that, while the reduced Hessian typically loses sparsity, this is not the case here, owing to the particular structure of the nullspace matrix Z. The final section of this chapter describes a parallel solver for the banded SPD systems arising in the contact and friction QPs.

5.3 Block Solver for Banded SPD Systems

Both the contact and the friction QP require the solution of a banded SPD system. In the case of contact we have

G^C α_F = c_F,   (5.26)

and for friction we have

G^F β_z = Z^T (d − H^F β_p).   (5.27)

Recall that SCHURPA requires solving systems with multiple right-hand sides within the functions C0_solve and F0_solve. Within these functions we must solve Equation (5.26) and Equation (5.27), respectively, implying that c_F, β_p, and d may be matrices. The dimensionality of these matrices depends on the number of changes to the working set. Thus we are interested in developing a parallel solver for systems of the form

H X = B,   (5.28)

where H ∈ R^{n×n} is a banded SPD matrix with bandwidth ρ, and X, B ∈ R^{n×k}, where k ≪ n is the number of right-hand sides. We write the Cholesky factorization L L^T = H; note that L is also banded. A solution to Equation (5.28) can then be computed via

L Y = B,  L^T X = Y.

Unfortunately, backward/forward substitution is inherently sequential. Recently, approaches to parallelizing sparse triangular solves have been investigated (Naumov, 2011), wherein one reorders the matrix by grouping equations that may be solved independently. However, the results offer only a modest speedup (e.g., 2× on average) and do not apply to the dense banded case discussed here. For large, very narrow banded systems, the SPIKE algorithm (Polizzi and Sameh, 2006) can work extremely well. However, our bandwidths are not fixed and may be small to medium sized, depending on the simulation data (see Figure 5.2). We opt for a simple yet effective parallel approach based on block substitution. We decompose the Cholesky factor into blocks of size s as follows:

L = [ L_11
      L_21  L_22
             ⋱     ⋱
             L_{p,p−1}  L_pp ],   (5.29)

where L_ij ∈ R^{s×s} and p = ⌈n/s⌉. Notice that we require s ≥ ρ to ensure the two-block-banded structure of Equation (5.29) contains all nonzero elements of L. Extracting the blocks into the matrices

C = [ L_21; ...; L_{p,p−1} ] ∈ R^{(p−1)s×s},  D = [ L_11; ...; L_pp ] ∈ R^{ps×s},

we solve Equation (5.28) via Algorithm 5.3, which is simply a block version of substitution. We implement the algorithm on the GPU using CUDA, as described in the next chapter.
Algorithm 5.3 Block Substitution
Input: Block matrices C, D from the Cholesky factorization L L^T = H
Y_1 = D_1^{-1} B_1
for i = 2, ..., p do
  Y_i = D_i^{-1}(B_i − C_{i−1} Y_{i−1})
end for
X_p = D_p^{-T} Y_p
for i = p − 1, ..., 1 do
  X_i = D_i^{-T}(Y_i − C_i^T X_{i+1})
end for
Output: Solution X to H X = B

Upon computing D_i^{-1} for i = 1, ..., p in parallel as a pre-processing step, Algorithm 5.3 requires 4p − 2 sequential matrix-matrix multiplications, each of which can be performed efficiently in parallel. By choosing appropriately sized blocks, the solving phase of SCHURPA can be significantly improved. The only additional cost is the pre-processing step of computing the inverse diagonal blocks D_i^{-1}. This cost is small relative to the Cholesky factorization and need only be paid once during the initialization step of SCHURPA, after the initial factorization phase. Explicit inversion of the diagonal blocks was also used in (Tomov et al., 2010) to perform triangular solves on the GPU, where performance gains exceeding 50× over NVIDIA's CUBLAS triangular solvers were observed.
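To make the structure of Algorithm 5.3 concrete, the following is a sketch of the forward sweep expressed as cuBLAS matrix-matrix multiplications. It assumes (our assumption, for illustration only) that Y initially holds B in column-major device memory, that the subdiagonal blocks C_i and the precomputed inverses D_i^{-1} are packed contiguously, and it omits error checking.

#include <cublas_v2.h>

// Forward sweep of Algorithm 5.3: Y_1 = Dinv_1 B_1,
// Y_i = Dinv_i (B_i - C_{i-1} Y_{i-1}). Y is n x k with leading dimension
// ldy; block i occupies rows i*s .. i*s + s - 1. tmp is s x k scratch.
void blockForwardSweep(cublasHandle_t h, int p, int s, int k,
                       const float* Dinv, const float* C,
                       float* Y, int ldy, float* tmp) {
    const float one = 1.0f, minusOne = -1.0f, zero = 0.0f;
    for (int i = 0; i < p; ++i) {
        float* Yi = Y + i * s;   // top row of block i within Y
        if (i > 0) {
            // Y_i <- Y_i - C_{i-1} Y_{i-1}   (one parallel GEMM)
            cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, s, k, s,
                        &minusOne, C + (i - 1) * s * s, s,
                        Y + (i - 1) * s, ldy, &one, Yi, ldy);
        }
        // tmp <- Dinv_i * Y_i (GEMM output must not alias its input) ...
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, s, k, s,
                    &one, Dinv + i * s * s, s, Yi, ldy, &zero, tmp, s);
        // ... then copy tmp back into block i of Y.
        cublasSgeam(h, CUBLAS_OP_N, CUBLAS_OP_N, s, k,
                    &one, tmp, s, &zero, tmp, s, Yi, ldy);
    }
}

The backward sweep is analogous, using the transposed blocks D_i^{-T} and C_i^T.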
Figure 5.3 shows runtimes for the factorization and solve phases of the block banded solver implemented on the GPU, compared against sequential substitution implemented on the CPU, for a fixed bandwidth ρ = 1000 and k = 10 right-hand sides, as a function of the matrix dimension. We used Matlab to perform the banded Cholesky factorization on the CPU. For the GPU version we utilized the CULA library (Humphrey et al., 2010) for CUDA. For matrices of size n ≈ 8000 we observe that the GPU performs the factorization twice as fast as the CPU. Solving a banded system sequentially runs in O(nρ²), whereas our block banded solver (in an ideally parallelized setting) runs in O(ps) = O(n). Thus, while both the CPU and GPU solving runtimes are linear in n, the GPU solver is independent of the bandwidth. Figure 5.3 demonstrates the scalability of the GPU block substitution solver. The gains in the solving phase come at the cost of the pre-processing step during factorization, which, as mentioned, is small relative to the cost of the factorization itself.

Figure 5.3: Comparison of (a) the banded Cholesky factorization and (b) the block banded solver CPU and GPU runtimes, applied to banded matrices as a function of matrix size. The bandwidth ρ = 1000 and number of right-hand sides k = 10 are fixed across matrix dimensions.

Chapter 6

GPU Implementation

General-purpose computing on Graphics Processing Units (GPUs), or GPGPU, is a relatively recent endeavor taken up by the scientific computing community. Before common API frameworks existed, customizing GPU functionality was arduous and left primarily to specialists. However, two important developments have made GPGPU popularity explode:

• the advent of programming interfaces such as OpenCL (Stone et al., 2010) and NVIDIA's CUDA (NVIDIA, 2012b), and
• double-precision capability on GPUs.

No longer limited to computer graphics, GPUs are fast becoming a staple of scientific computing. This chapter outlines a parallel implementation of SCHURPA applied to the simulator described in Chapter 4. We implement SCHURPA in the GPGPU framework using CUDA. Our motivation for a GPU implementation is twofold: first, the simulator is also GPU-based, so CPU-to-GPU memory overhead is avoided; second, SCHURPA is amenable to parallelism, as discussed in Chapter 3. Dense linear algebra algorithms run on GPUs can achieve orders-of-magnitude acceleration over standard CPU implementations (Tomov et al., 2010), and this performance gap will only widen as the many-core paradigm continues to pervade contemporary programming.

Section 6.1 overviews CUDA, describing the C++ extension syntax. In Section 6.2, we give a taste of the custom functionality implemented on the GPU required by SCHURPA by detailing a symmetric-banded general matrix-matrix multiplication operation. Off-the-shelf GPU libraries were also used extensively in SCHURPA; we outline these in Section 6.3.

6.1 CUDA Programming on the GPU

CUDA (Compute Unified Device Architecture), introduced by NVIDIA in 2006, refers to a massively parallel programming model and architecture, along with an API to program GPUs using high-level programming languages such as C++ and Fortran. It is freely available (NVIDIA, 2012b) and requires a CUDA-capable NVIDIA GPU (GeForce 8 or greater, Tesla, Quadro, etc.). CUDA uses a SIMD (Single Instruction, Multiple Data) execution paradigm to run many concurrent threads (individual processing units) from a single invocation of a kernel: a function programmed in CUDA which is compiled for, and run on, a GPU. The many cores on a GPU chip allow it to perform thousands of tasks on large sets of data in parallel, whereas CPU programs execute sequentially (multi-core CPUs and hyper-threading attempt to improve this limitation). Due to the larger number of cores on GPUs, theoretical floating point throughput (GFLOP/s) is orders of magnitude greater than that of the CPU. Table 6.1 summarizes the differences between the CPU and GPU programming paradigms.

                     CPU      GPU
Number of cores      Several  Hundreds
Number of threads    Several  Thousands
Thread speed         Fast     Slow
Cache size           Large    Small

Table 6.1: CPU vs. GPU comparison.

SCHURPA was written using the extended C++ CUDA API. GPU memory, called device memory, is managed by the GPU; CPU, or host, memory can be copied to device memory and vice versa via the CUDA API. Host-to-device and device-to-host memory copies are slow, and we therefore aim to minimize them; a small example of staging data is given at the end of this section. As the simulator computes all data on the GPU, and device-to-device memory transfer is fast, we need not be concerned with any host memory transfer overhead. There are several types of GPU memory, including global, shared, register and local memory. Global memory is the largest, with the slowest read/write access; pointers passed to kernel functions are typically arrays stored in global memory. Shared memory is much smaller and is shared among the threads in a single processing block. Its access speed is much faster than global memory, and it can be thought of as a cache. Registers store local data for each individual thread and have the fastest access. Local memory is again restricted to a thread and stores any data that does not fit into register memory. There is also texture and constant memory, which we shall not discuss here.
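For instance, a host array is staged into device memory along the following lines (a minimal sketch; production code would check the returned cudaError_t values):

#include <cuda_runtime.h>
#include <vector>

int main() {
    std::vector<float> host(1 << 20, 1.0f);      // data in CPU (host) memory
    const size_t bytes = host.size() * sizeof(float);

    float* device = nullptr;
    cudaMalloc(&device, bytes);                  // allocate device memory
    // Slow host-to-device copy: done once, before launching kernels.
    cudaMemcpy(device, host.data(), bytes, cudaMemcpyHostToDevice);

    // ... launch kernels operating on `device` here ...

    // Slow device-to-host copy: done once, after the GPU work completes.
    cudaMemcpy(host.data(), device, bytes, cudaMemcpyDeviceToHost);
    cudaFree(device);
    return 0;
}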
6.1.1 CUDA Programming Model

Kernel functions are configured to execute threads partitioned into computational blocks of one, two or three dimensions, which themselves live on a two-dimensional grid (see Figure 6.1).

Figure 6.1: Computational grid of threads, partitioned into blocks, executed in parallel on the GPU. From the CUDA Programming Guide (NVIDIA, 2012b).

The number of threads per block is limited, as the threads share resources residing on the core executing the block. Blocks cannot communicate or synchronize with one another; this allows them to be run independently, in any order. The computational grid is conceptually useful for decomposing a problem domain, such as a fluid simulation, or two-dimensional data, such as a matrix. Kernel launches take two additional parameters, placed within "<<< >>>" brackets preceding the regular parameter list: one specifying the number of blocks on the two-dimensional grid, and one specifying the number of threads per block (both of CUDA's dim3 type). For example, suppose we want to add two matrices A ∈ R^{n×n} and B ∈ R^{n×n}. Each thread will compute one element of the resulting matrix C ∈ R^{n×n}. We decompose the matrices into 16 × 16 blocks mapped onto the grid and run the kernel addMatrices via the following syntax:

// assume N x N matrices A, B, C have been defined
dim3 threadsPerBlock(16, 16);
dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
               (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
addMatrices<<<numBlocks, threadsPerBlock>>>(A, B, C);

Listing 6.1: Calling a CUDA kernel.

In Listing 6.1 the number of blocks is computed (rounding up, so that N need not be a multiple of the block size) to ensure there are at least N² threads; the bounds check in the kernel discards the excess. Listing 6.2 shows the function body of the addMatrices kernel. The __global__ identifier specifies a kernel function, which must have a void return type.

__global__ void addMatrices(float A[N][N], float B[N][N], float C[N][N])
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row index
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column index
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

Listing 6.2: Body of a CUDA kernel.

CUDA provides built-in variables for the unique thread index and block index, given by threadIdx and blockIdx, respectively. Notice that in Listing 6.2 these indices are used to compute the global row and column indices. Each thread executes the kernel in parallel on the GPU, writing its result to the two-dimensional array C. The arrays are all stored in global memory on the GPU.

6.2 Custom Kernels for SCHURPA

Several operations not available in the current versions of the CUDA libraries (see Section 6.3) were required for the implementation of SCHURPA. Here we describe one kernel, which multiplies a symmetric banded matrix by a dense matrix, to give a flavor of the kernel programming process in CUDA. We denote the following operation symbgmm, for symmetric-banded general matrix-matrix multiplication:

C = BA,   (6.1)

where B ∈ R^{n×n} is a symmetric banded matrix, and A, C ∈ R^{n×m} are dense matrices. We define B to have at most ρ nonzero sub/super-diagonals, i.e., B_ik = 0 unless max(1, i − ρ) ≤ k ≤ min(n, i + ρ), so that Equation (6.1) can be rewritten as

C_ij = Σ_{k=1}^{n} B_ik A_kj = Σ_{k=max(1,i−ρ)}^{min(n,i+ρ)} B_ik A_kj.

Such an operation is required due to the banded structure of the contact and friction Hessians described in Chapter 4; currently, the CUBLAS library only supports general dense matrix-matrix multiplication.
A sequential version of symbgmm written in C++ is given in Listing 6.3. The bodies of the helper functions are omitted for brevity (see Appendix B). A straightforward extension to CUDA is given in Listing 6.4. As in the matrix-addition example of Listing 6.2, the two outer loops indexing rows i and columns j are replaced with a unique thread for each element C_ij. However, due to slow global memory access, this kernel is not ideal. For example, a row b_i^T is required by multiple threads to compute the dot product b_i^T a_j, yet it need not be read from global memory by each thread. Rewriting Equation (6.1) in block form yields

C_IJ = Σ_K B_IK A_KJ.   (6.2)

We map the computation of the s × s block C_IJ to a block on the computational grid. To avoid redundant memory fetching, the sub-matrices B_IK and A_KJ are loaded into shared memory for each block, with each thread loading one sub-matrix element in parallel. A thread then computes an element of the sub-matrix product C_S = B_IK A_KJ and stores the result in register memory. Summing all such products over K yields the desired block C_IJ, which is then written to device memory. Due to the banded structure of B we have

K = (k0 : k0 + s), ..., (kf : kf + s),

where k0 = max(1, I_1 s − ρ) and kf = min(n − s, I_1 s + ρ). A kernel using this shared-memory approach is given in Listing 6.5. The __shared__ prefix of the sub-matrix arrays declares them to be stored in shared memory. The __syncthreads() function synchronizes all threads of a given block, ensuring that all preceding code has been executed. The observant reader will notice that reading a sub-matrix element may fail, for example when requesting an element below the band of B, or when the block size s does not divide n or m. In such cases the sub-matrix element is set to 0, which produces the desired result. These details are abstracted away in the helper function bodies, given in Appendix B.

struct Matrix {
    int width;
    int height;
    int ld;             // leading dimension
    float* elements;
};

struct BandedMatrix : Matrix {
    int rho;            // number of subdiagonals
};

// helper functions (bodies given in Appendix B)
float getElement(const Matrix A, int i, int j);
float getElement(const BandedMatrix B, int i, int j);
void setElement(Matrix A, int i, int j, float value);

// computes C = B * A, where B is a symmetric banded matrix
void symbgmmHost(BandedMatrix B, Matrix A, Matrix C)
{
    float Cval;
    // loop through each element of C
    for (int i = 0; i < C.height; i++) {
        for (int j = 0; j < C.width; j++) {
            Cval = 0.0f;
            // compute the dot product b_i^T a_j over the band of B
            for (int k = max(0, i - B.rho); k <= min(B.width - 1, i + B.rho); k++)
                Cval += getElement(B, i, k) * getElement(A, k, j);
            setElement(C, i, j, Cval);
        }
    }
}

Listing 6.3: Sequential implementation of the symbgmm operation.
// computes C = B * A, where B is a symmetric banded matrix
__global__ void symbgmmNaive(BandedMatrix B, Matrix A, Matrix C)
{
    int i = blockDim.y * blockIdx.y + threadIdx.y;  // row index
    int j = blockDim.x * blockIdx.x + threadIdx.x;  // column index
    if (i >= C.height || j >= C.width)
        return;
    float Cval = 0.0f;
    // compute the dot product b_i^T a_j over the band of B
    for (int k = max(0, i - B.rho); k <= min(B.width - 1, i + B.rho); k++)
        Cval += getElement(B, i, k) * getElement(A, k, j);
    setElement(C, i, j, Cval);
}

Listing 6.4: Naive kernel of the symbgmm operation.

#define BLOCK_SIZE 16  // block size loaded into shared memory

// computes C = B * A, where B is a symmetric banded matrix
__global__ void symbgmmShared(BandedMatrix B, Matrix A, Matrix C)
{
    int blockRow = blockIdx.y;
    int blockCol = blockIdx.x;
    int row = threadIdx.y;
    int col = threadIdx.x;
    int i = blockRow * BLOCK_SIZE + row;  // row index
    int j = blockCol * BLOCK_SIZE + col;  // column index
    // Note: out-of-range threads must not return early, since every thread
    // of the block has to participate in the loads and __syncthreads()
    // calls below; the final write is guarded instead.
    float Cval = 0.0f;
    // loop through each nonzero block of row blockRow
    int startBlock = max(0, blockRow * BLOCK_SIZE - B.rho);
    int endBlock = min(B.width - 1,
                       blockRow * BLOCK_SIZE + B.rho + BLOCK_SIZE - 1);
    for (int m = startBlock; m <= endBlock; m += BLOCK_SIZE) {
        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];
        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
        // each thread loads an element of the sub-matrices
        // (getElement returns 0 outside the matrix or band)
        Bs[row][col] = getElement(B, i, m + col);
        As[row][col] = getElement(A, m + row, j);
        __syncthreads();  // ensure the sub-matrices are loaded
        // compute the sub-matrix product Bs * As
        for (int k = 0; k < BLOCK_SIZE; k++)
            Cval += Bs[row][k] * As[k][col];
        __syncthreads();  // ensure the product is computed before reuse
    }
    if (i < C.height && j < C.width)
        setElement(C, i, j, Cval);
}

Listing 6.5: Kernel of the symbgmm operation using shared memory.

Figure 6.2 compares the runtimes of the different symbgmm routines, along with NVIDIA's CUBLAS matrix multiplication for general matrices. The shared implementation significantly outperforms the naive implementation.

Figure 6.2: Runtimes of symbgmm, the symmetric-banded general matrix-matrix multiplication operation, as a function of the matrix size. For comparison, we also show the performance of CUBLAS (NVIDIA, 2012a) for general matrix-matrix multiplication.

Along with the symbgmm operation, several other custom kernels were written to perform the following operations:
• forming the contact and friction Hessians H^C and H^F, respectively,
• forming the free contact Hessian G^C, and
• forming the reduced friction Hessian G^F.
The details of these functions are left out of this thesis, but they all follow design principles similar to Listing 6.5.

6.3 CUDA Libraries

Several publicly available libraries were crucial to the efficacy of the SCHURPA GPU implementation. These include:
• CUBLAS (NVIDIA, 2012a), a GPU implementation of the BLAS API, developed by NVIDIA;
• CUSPARSE (NVIDIA, 2012c), a GPU library for manipulating sparse data and performing sparse matrix operations, developed by NVIDIA;
• CULA (Humphrey et al., 2010), a GPU implementation of the LAPACK API, developed by EM Photonics; and
• Thrust (NVIDIA, 2012d), a high-level templated interface analogous to the C++ STL, developed by Jared Hoberock and Nathan Bell.
Thrust was used for device memory management as well as vector operations such as reduction. CUSPARSE handled the sparse matrix data and operations, including the constraint Jacobian and tangent matrices in the simulator. CUBLAS provided the low-level functionality in the SCHURPA algorithm, such as device-to-device memory copies, dot products, and matrix-vector products. The highly optimized banded Cholesky solver in CULA was used for the factorization of the matrices G^C and G^F.

Chapter 7

Results

We demonstrate the effectiveness of our SCHURPA implementation on generic QPs and on the friction and contact QP subproblems arising in the staggered projections algorithm. We compare our method with a direct approach that computes a new factorization at each iteration of PDASM, which we call DIRECT. In Section 7.1 we randomly generate and solve QPs of fixed dimension. Runtimes are compared across CPU- and GPU-based implementations of DIRECT and SCHURPA. We show that as the initial working set A^0 approaches the optimal active set A*, SCHURPA significantly outperforms the direct version. We then leverage the staggered projections algorithm with SCHURPA to resolve frictional contact in the Eulerian solids simulator discussed in Chapter 4; the results are given in Section 7.2. All experiments were performed on an Intel i7 2.80 GHz CPU running a 64-bit Windows 7 OS, with an NVIDIA GeForce GTX 580 GPU. A Matlab interface was created using mex (Matlab executable) files to allow fast prototyping of the experiments as well as visualization.

7.1 Randomly Generated QPs

To investigate the warm-starting capabilities of SCHURPA, we randomly generated QPs of the form

minimize_x (1/2) x^T H x − c^T x  subject to x ≥ 0,   (7.1)

where x, c ∈ R^n and H ∈ R^{n×n} is a banded SPD matrix with bandwidth ρ. In our experiments, n = 8000 and ρ = 1000 were held fixed. The solution of each randomly generated QP was first computed to determine A*, and A^0 was then constructed by altering A*. Specifically, we reverse the inclusion of the first δ constraints to obtain A^0, so that |A^0 − A*| = δ, where A^0 − A* denotes the set difference.
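One simple way to generate such a test problem (an assumption on our part; the exact recipe used is not specified here) is to make a random symmetric band diagonally dominant, which guarantees the matrix is SPD:

#include <algorithm>
#include <cmath>
#include <cstdlib>
#include <vector>

// Random banded SPD matrix of size n with bandwidth rho, via diagonal
// dominance. Stored dense here for clarity; banded storage would be used
// in practice.
std::vector<float> randomBandedSPD(int n, int rho) {
    std::vector<float> H(static_cast<size_t>(n) * n, 0.0f);
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j <= std::min(n - 1, i + rho); ++j) {
            float v = static_cast<float>(std::rand()) / RAND_MAX - 0.5f;
            H[i * n + j] = H[j * n + i] = v;      // symmetric band
        }
    for (int i = 0; i < n; ++i) {                  // enforce SPD
        float rowSum = 0.0f;
        for (int j = 0; j < n; ++j)
            if (j != i) rowSum += std::fabs(H[i * n + j]);
        H[i * n + i] = rowSum + 1.0f;
    }
    return H;
}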
Notice that one factorization of SCHURPA is more expensive than one factor-  84  7.2. Frictional Contact in the Simulator PDASM runtimes for n =8000. ρ = 1000 3000 Direct (CPU) Schurpa (CPU) Direct (GPU) Schurpa (GPU)  2500  Runtimes (ms)  2000  1500  1000  500  0 0  5  10  15  20  25  30  35  40  45  50  Perturbation from optimal active set  Figure 7.1: Runtimes of the CPU and GPU implementations of the DIRECT and SCHURPA algorithms, applied to (7.1), as a function of |A0 − A∗ |. For these experiments the matrix size n = 8000 and bandwidth ρ = 1000 were held fixed. ization of DIRECT due to the pre-processing step of the block banded solver as explained in Section 5.3. Finally, we observe that the speedup SCHURPA attains over DIRECT is proportional to the number of iterations required by PDASM. For the case of |A0 − A∗ | ≤ 50 a speedup of over 2× is observed. These results show that SCHURPA can successfully take advantage of warmstarting, and our GPU implementation scales well as a function of |A0 − A∗ |. Compared to DIRECT, SCHURPA is significantly more efficient, especially when the number of PDASM iterations is large.  7.2  Frictional Contact in the Simulator  We now show the results of the SCHURPA-SP method given by Algorithm 4.2 applied to a simulation involving frictional contact in the simulator. The simulation involves a collision of 3 × 3 grid of cylinders, as depicted in Figure 7.3. Each cylinder has a length of 5m and diameter 3m, and moves toward the center with a speed of 4m/s. The frictional coefficient µ = 1 was used throughout the 85  7.2. Frictional Contact in the Simulator  Schurpa runtime for n =8000. ρ = 1000 400 factor time solve time other  350  Runtimes (ms)  300  250  200  150  100  50  0 0  10  20  30  40  50  Perturbation from optimal active set  Figure 7.2: Decomposition of the GPU-based SCHURPA runtimes, applied to (7.1), as a function of |A0 −A∗ |. For these experiments the matrix size n = 8000 and bandwidth ρ = 1000 were held fixed.  (a)  (b)  (c)  Figure 7.3: Cylinder collision simulation. (a) The initial setup of the cylinders. (b) Contact begins, denoting t = t0 = 0 seconds. (c) Resolution of the collision at t = T = 0.5 seconds.  86  |A0 − A∗ | 0 10 20 30 40 50 #f  1 1 1 1 1 1  f (ms)  279.574 282.66 278.669 281.972 284.377 283.504  44.5667 169.643 227.727 257.621 274.522 359.458  s (ms)  SCHURPA  1 15 31 42 54 72  #s  350.912 484.104 540.222 574.07 593.343 678.072  t (ms)  193.128 767.985 938.094 967.221 959.715 972.877  f (ms)  1 4 5 5 5 5  #f  45.9466 183.745 229.483 229.828 230.715 229.687  s (ms)  DIRECT  1 4 5 5 5 5  #s  263.926 1004.63 1229.05 1260.15 1253.28 1265.25  t (ms)  1 4 5 5 5 5  # Iterations  7.2. Frictional Contact in the Simulator  Table 7.1: Statistics of the GPU-based implementations of SCHURPA and DIRECT applied to (7.1). We denote factorization runtimes by f(ms) and factorization count by #f, solving the saddle point system runtimes by s(ms) and the number of solves by #s, and total runtimes by t. 87  7.2. Frictional Contact in the Simulator simulation. We used a constant time step of ∆t = 0.005 seconds, for a total time of T = 0.5 seconds. We compared the performance of DIRECT and SCHURPA by solving the contact and friction QP subproblems in SP using GPU-based implementations of both methods. Recall that Algorithm 4.2 requires the initial working sets A0c and A0f for contact and friction, respectively, and the quality of these estimates is crucial for effective warm-starting. 
For our warm-starting strategy, we chose a constraint c to be active if its corresponding grid cell Ω_c had a constraint that was active in the previous time step's optimal active set. When contact is first initiated, this strategy fails, as there are no constraints from the previous time step; more sophisticated warm-starting techniques that predict contact could remove this limitation. Figure 7.4 shows the proportions of correct indices in A^0_c and A^0_f over the course of the simulation. Once contact begins, the ratios stay above 0.8 throughout the simulation, yielding very good initial estimates. As a result, SP required at most 2 iterations, and SCHURPA required at most 8 iterations for contact and 11 iterations for friction throughout the simulation, demonstrating the utility of warm-starting both SP and SCHURPA. The warm-starting also reduces the number of solves required by Algorithm 4.2, shown for the contact and friction phases in Figure 7.6. Observe that near t = 0.35 seconds the correct-index ratios increase, which improves the warm-starting and, as a result, the number of solves decreases. If the number of solves required by SCHURPA exceeds a threshold max_solve, we reset the algorithm as described in Section 3.3. In our implementation we set max_solve = 500. Over the course of the simulation, resetting was required at eight time steps in the friction solving phase; these occurred near the beginning of the simulation, when the warm-starting strategy failed to adequately predict A*_f. In the contact solving phase restarting never occurred.

Figure 7.4: The proportion of correct indices in A^0_c and A^0_f over the course of the simulation.

Let ρ_C denote the bandwidth of the contact Hessian H^C; the ratio r_C = ρ_C/m, where m is the number of contacts, determines the sparsity of H^C. Over the simulation we observed max r_C = 0.48, min r_C = 0.04 and mean r_C = 0.18, corroborating our earlier claim that the bands are small to medium sized. Solving the contact and friction QPs requires factorizing and solving with the free Hessian G^C, given by Equation (5.21), and the reduced Hessian G^F, given by Equation (5.22), respectively. The number of contacts and the dimensions of G^C and G^F are given in Figure 7.5. These quantities peak at the time of maximum impact (≈ 0.1 seconds) and then decrease as the cylinders bounce apart (see Figure 7.3).

Figure 7.5: The number of contacts, dimensions of G^C and dimensions of G^F over the course of the simulation.

Figure 7.7 gives runtime comparisons of DIRECT and SCHURPA applied to the QP subproblems arising in SP. During the peak impact phase (≈ 0.1 seconds), SCHURPA outperforms DIRECT substantially, owing to its more efficient factorization phase. Notice that the solve runtimes for SCHURPA are only marginally higher than those of DIRECT, even though SCHURPA requires many solves per time step (see Figure 7.6); in contrast, DIRECT performed at most 8 and 11 solves in the contact and friction phases, respectively. This exemplifies the parallelization of the banded block solver.
Figure 7.6: The number of solves performed in SCHURPA-SP for the contact and friction phases over the course of the simulation.

Figure 7.7 gives runtime comparisons of DIRECT and SCHURPA applied to the QP subproblems arising in SP. During the peak impact phase (≈ 0.1 seconds), SCHURPA substantially outperforms DIRECT, owing to its more efficient factoring phase. Notice that the solve runtimes of SCHURPA are only marginally higher than those of DIRECT, even though SCHURPA requires many solves per time step (see Figure 7.6), whereas DIRECT performed at most 8 and 11 solves in the contact and friction phases, respectively. This exemplifies the parallelization of the banded block solver.

Figure 7.7: Runtimes of DIRECT and SCHURPA for the contact (left) and friction (right) solving phases over the course of the simulation. The total, factor and solve runtimes are given by the top, middle and bottom subfigures, respectively.

The total runtime speedup ratios of SCHURPA to DIRECT are given in Figure 7.8. We observe approximately a 4× and a 3× speedup during maximum impact for the contact and friction solving phases, respectively.

The presented results show the utility of SCHURPA applied to the contact and friction QP subproblems of SP. Equipped with a simple yet effective warm-starting technique, the number of iterations in both SCHURPA and SP remained low. The multiple solves required by SCHURPA were handled efficiently by the parallel block solver, yielding superior performance compared with DIRECT.

Figure 7.8: Total runtime speedup ratios of SCHURPA to DIRECT for the contact and friction phases over the course of the simulation.

Chapter 8

Conclusion

In this thesis we have developed a solution framework for frictional contact that incorporates the staggered projections algorithm, along with an efficient parallel implementation exploiting the underlying matrix structure of the QP subproblems. Doing so gave an Eulerian solids simulator the ability to simulate frictional contact dynamics.

We first argued the merits of using active-set methods for solving QPs arising in physical simulation, owing to their warm-starting capabilities. In particular, we desired a parallel active-set method to solve the QP subproblems of the staggered projections algorithm (Kaufman et al., 2008) to complement the parallel implementation of the simulator (Levin et al., 2011). We showed that there are two fundamental limitations to parallelizing classical ASMs and their implementations:

• classical ASMs may require many sequential iterations, and

• each iteration updates a factorization, an inherently sequential process.

To overcome these limitations, we introduced SCHURPA: a parallel implementation of PDASM using the Schur-complement method.
The super-linear convergence property of PDASM overcomes the first limitation, and the Schur-complement method for solving the saddle point systems arising in PDASM overcomes the second by circumventing the factorization updating process entirely.

We then reformulated the friction problem so as to be solvable by PDASM, and therefore by SCHURPA. The reformulation incorporated a simplified linearized friction cone as the feasible set of the friction QP; convergence of PDASM applied to the reformulation could then be proven. We integrated SCHURPA into staggered projections to produce SCHURPA-SP, which requires only a single factorization each for contact and friction to solve all QP subproblems arising in staggered projections.

Further investigation of the structure of the contact and friction QPs in the simulator showed that both problems may be phrased as banded SPD systems. A simple yet effective parallel block substitution method for banded SPD systems was designed and incorporated into a SCHURPA-SP implementation on the GPU using CUDA (NVIDIA, 2012b), imbuing the simulator with the ability to handle frictional contact.

The merits of SCHURPA were confirmed by the speedup attained over DIRECT, an implementation of PDASM that computes a new factorization at each iteration. Experiments on generic randomly generated QPs demonstrated that, when the initial working set is a relatively small perturbation of the optimal active set, SCHURPA excels over DIRECT in both CPU and GPU implementations. The GPU implementation showed better performance for larger perturbations due to the parallelizability of the block solver; such perturbations are to be expected for large problems arising in physical simulation, since the active set is unknown in advance and, unlike SQP methods utilizing warm starts, there is no active set to which the QPs are converging. Finally, we showed the improved performance of SCHURPA over DIRECT on a large-scale frictional contact simulation in the simulator.

Several avenues of future work could significantly improve SCHURPA's usability. More sophisticated warm-starting techniques could significantly reduce the runtime of SCHURPA. A heterogeneous CPU-GPU implementation would not only improve performance, but also broaden the algorithm's applicability to simulations not confined to a GPU implementation. Along these lines, SCHURPA could be integrated into other simulation frameworks; for example, rigid body simulations share many of the same properties as the Eulerian solids simulator. Integrating SCHURPA would simply require supplying the function that solves the initial saddle point system, and could be extremely beneficial. Finally, ideas from SCHURPA could be used in a hybrid iterative-direct solver for the saddle point systems arising in ASMs to solve large, sparse problems.
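To make the integration point above concrete, the following is a minimal, dense, sequential sketch of the bordered solve that SCHURPA builds on: the host application supplies only a routine that solves with the fixed initial saddle point matrix K0, and each subsequent saddle point system is handled through a small dense Schur complement. All names (solveK0, schurSolve, the Mat/Vec types) and the naive Gaussian elimination are illustrative assumptions, not the thesis implementation, which maintains and updates the Schur complement across PDASM iterations and performs the K0 solves with the parallel banded block solver.

    #include <functional>
    #include <vector>

    using Vec = std::vector<double>;
    using Mat = std::vector<Vec>;  // column-major: Mat[j] is column j

    // Solve the bordered system [K0 B; B^T 0][x; y] = [b; d] given only a
    // solver for K0. The Schur complement S = -B^T K0^{-1} B is small: one
    // row/column per working-set change since the initial factorization.
    void schurSolve(const std::function<Vec(const Vec&)>& solveK0,
                    const Mat& B, const Vec& b, const Vec& d,
                    Vec& x, Vec& y) {
        const size_t n = b.size(), k = d.size();
        Vec w = solveK0(b);                         // w = K0^{-1} b
        Mat S(k, Vec(k));                           // S = -B^T K0^{-1} B
        Vec r(d);                                   // r = d - B^T w
        std::vector<Vec> KinvB(k);
        for (size_t j = 0; j < k; ++j) KinvB[j] = solveK0(B[j]);
        for (size_t i = 0; i < k; ++i) {
            for (size_t j = 0; j < k; ++j) {
                double dot = 0;
                for (size_t t = 0; t < n; ++t) dot += B[i][t] * KinvB[j][t];
                S[j][i] = -dot;                     // element (i,j), column-major
            }
            for (size_t t = 0; t < n; ++t) r[i] -= B[i][t] * w[t];
        }
        // Solve S y = r by Gaussian elimination (no pivoting, for brevity).
        y = r;
        for (size_t p = 0; p < k; ++p)
            for (size_t i = p + 1; i < k; ++i) {
                double m = S[p][i] / S[p][p];
                for (size_t j = p; j < k; ++j) S[j][i] -= m * S[j][p];
                y[i] -= m * y[p];
            }
        for (size_t p = k; p-- > 0;) {
            for (size_t j = p + 1; j < k; ++j) y[p] -= S[j][p] * y[j];
            y[p] /= S[p][p];
        }
        // Recover the primal part: x = K0^{-1}(b - B y).
        Vec rhs(b);
        for (size_t j = 0; j < k; ++j)
            for (size_t t = 0; t < n; ++t) rhs[t] -= B[j][t] * y[j];
        x = solveK0(rhs);
    }

Because K0 is factored once, each working-set change costs only additional solves with the existing factorization plus dense work whose size is that of the (small) Schur complement.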
Bibliography

Bartlett R.A. and Biegler L.T. QPSchur: A dual, active-set, Schur-complement method for large-scale and structured convex quadratic programming. Optimization and Engineering, 7(1):5–32, 2006.

Bergounioux M., Ito K., and Kunisch K. Primal-dual strategy for constrained optimal control problems. SIAM Journal on Control and Optimization, 37(4):1176, 1999.

Betts J.T. A sparse nonlinear optimization algorithm. Journal of Optimization Theory and Applications, 82(3):519–541, 1994.

Brunssen S., Schmid F., Schäfer M., and Wohlmuth B. A fast and robust iterative solver for nonlinear contact problems using a primal-dual active set strategy and algebraic multigrid. International Journal for Numerical Methods in Engineering, 69(3):524–543, 2007.

Daryina A.N. and Izmailov A.F. Semismooth Newton method for quadratic programs with bound constraints. Computational Mathematics and Mathematical Physics, 49(10):1706–1716, 2009.

Davis T.A. and Hager W.W. Row modifications of a sparse Cholesky factorization. SIAM Journal on Matrix Analysis and Applications, 26(3):621–639, 2006.

Gill P.E., Golub G., Murray W., and Saunders M. Methods for modifying matrix factorizations. Mathematics of Computation, 28(126):505–535, 1974.

Gill P.E., Murray W., Saunders M.A., and Wright M.H. Maintaining LU factors of a general sparse matrix. Linear Algebra and its Applications, 88:239–270, 1987.

Gill P.E., Murray W., Saunders M.A., and Wright M.H. A Schur-complement method for sparse quadratic programming. In Reliable Numerical Computation, Oxford Sci. Publ., 113–138. Oxford University Press, New York, 1990.

Goldfarb D. and Idnani A. A numerically stable dual method for solving strictly convex quadratic programs. Mathematical Programming, 27(1):1–33, 1983.

Hintermüller M. The primal-dual active set method for a crack problem with non-penetration. IMA Journal of Applied Mathematics, 69(1):1–26, 2004.

Hintermüller M., Ito K., and Kunisch K. The primal-dual active set strategy as a semismooth Newton method. SIAM Journal on Optimization, 13(3):865–888, 2002.

Hüeber S. and Wohlmuth B. A primal-dual active set strategy for non-linear multibody contact problems. Computer Methods in Applied Mechanics and Engineering, 194(27):3147–3166, 2005.

Humphrey J.R., Price D.K., Spagnoli K.E., Paolini A.L., and Kelmelis E.J. CULA: Hybrid GPU accelerated linear algebra routines. In SPIE Defense, Security, and Sensing. International Society for Optics and Photonics, 2010.

Ito K. and Kunisch K. Lagrange multiplier approach to variational problems and applications. Society for Industrial and Applied Mathematics, 2008.

Jean M. The non-smooth contact dynamics method. Computer Methods in Applied Mechanics and Engineering, 177(3):235–257, 1999.

Kaufman D.M. Coupled principles for computational frictional contact mechanics. Ph.D. thesis, Rutgers, The State University of New Jersey, New Brunswick, 2009.

Kaufman D.M., Sueda S., James D.L., and Pai D.K. Staggered projections for frictional contact in multibody systems. ACM Transactions on Graphics (TOG), 27(5):164:1–164:11, 2008.

Kunisch K. and Rendl F. An infeasible active set method for quadratic problems with simple bounds. SIAM Journal on Optimization, 14(1):35–52, 2003.

Kunisch K. and Rösch A. Primal-dual active set strategy for a general class of constrained optimal control problems. SIAM Journal on Optimization, 13(2):321–334, 2002.

Lanczos C. The variational principles of mechanics. Dover Publications, 1986.

Levin D.I.W., Litven J., Jones G.L., Sueda S., and Pai D.K. Eulerian solid simulation with contact. ACM Transactions on Graphics (TOG), 30(4):36, 2011.

Moreau J. On unilateral constraints, friction and plasticity. New Variational Techniques in Mathematical Physics, 172–322, 1973.

Naumov M. Parallel solution of sparse triangular linear systems in the preconditioned iterative methods on the GPU. NVIDIA Technical Report, NVR2011-0, 2011.

Nocedal J. and Wright S. Numerical optimization. Springer Verlag, 1999.

NVIDIA. Cublas: http://docs.nvidia.com/cuda/cublas/index.html, 2012a.
NVIDIA. Cuda: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html, 2012b.

NVIDIA. Cusparse: http://docs.nvidia.com/cuda/cusparse/index.html, 2012c.

NVIDIA. Thrust: http://docs.nvidia.com/cuda/thrust/index.html, 2012d.

Polizzi E. and Sameh A.H. A parallel hybrid banded system solver: the SPIKE algorithm. Parallel Computing, 32(2):177–194, 2006.

Qi L. and Sun J. A nonsmooth version of Newton's method. Mathematical Programming, 58(1-3):353–367, 1993.

Stewart D.E. Rigid-body dynamics with friction and impact. SIAM Review, 42(1):3–39, 2000.

Stewart D.E. and Trinkle J.C. An implicit time-stepping scheme for rigid body dynamics with inelastic collisions and Coulomb friction. International Journal for Numerical Methods in Engineering, 39(15):2673–2691, 1996.

Stone J.E., Gohara D., and Shi G. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science and Engineering, 12(3):66–73, 2010.

Tomov S., Nath R., Ltaief H., and Dongarra J. Dense linear algebra solvers for multicore with GPU accelerators. In Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW), 2010 IEEE International Symposium, 1–8. IEEE, 2010.

Ulbrich M. Semismooth Newton methods for variational inequalities and constrained optimization problems in function spaces. Society for Industrial and Applied Mathematics, 2011.

Versteeg H. and Malalasekera W. An Introduction to Computational Fluid Dynamics: The Finite Volume Method. Prentice Hall, 2007.

Wright S.J. Primal-dual interior-point methods. Society for Industrial and Applied Mathematics, 1997.

Appendix A

Data Storage Formats

Dense matrices are stored in column-major order, which is defined for a matrix A as

    Aij = a[i + j ∗ lda],

where a is the associated pointer to device memory and lda is the leading dimension, or stride, of the allocated memory storing A. A symmetric banded matrix B with kb sub-diagonals is stored in a (kb + 1) × n array in which row i contains the ith sub-diagonal, beginning with the main diagonal, so that

    Bij = b[(i − j) + j ∗ ldb],   max(0, i − kb) ≤ j ≤ i,

where ldb is the leading dimension of B. Table A.1 summarizes the data storage formats for dense and banded matrices on the GPU.

Matrix   Type                Array   Data mapping
A        Dense               a       Aij = a[i + j ∗ lda]
B        Symmetric banded    b       Bij = b[(i − j) + j ∗ ldb],  max(0, i − kb) ≤ j ≤ i

Table A.1: Data storage formats for dense and symmetric banded matrices. For matrices A and B, a and b are their stored representations in GPU memory, respectively. lda and ldb are the leading dimensions of the arrays a and b, respectively, and kb is the number of sub/super-diagonals of the matrix B.
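To illustrate the mappings of Table A.1, here is a small host-side example; the array contents and sizes are hypothetical, and on the device the same index arithmetic is applied to pointers into GPU memory.

    #include <cassert>
    #include <vector>

    int main() {
        // Dense 3x4 matrix A in column-major order with lda = 3:
        // A(i,j) = a[i + j*lda]
        int lda = 3;
        std::vector<float> a(lda * 4, 0.0f);
        a[2 + 1 * lda] = 5.0f;                  // set A(2,1) = 5

        // Symmetric banded 4x4 matrix B with kb = 1 sub-diagonal, ldb = kb+1:
        // B(i,j) = b[(i-j) + j*ldb] for max(0, i-kb) <= j <= i
        int kb = 1, ldb = kb + 1;
        std::vector<float> b(ldb * 4, 0.0f);
        b[0 + 2 * ldb] = 7.0f;                  // main-diagonal entry B(2,2)
        b[1 + 2 * ldb] = 3.0f;                  // sub-diagonal entry B(3,2) = B(2,3)

        assert(a[2 + 1 * lda] == 5.0f);         // dense mapping
        assert(b[(3 - 2) + 2 * ldb] == 3.0f);   // banded mapping for B(3,2)
        return 0;
    }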
Appendix B

Additional Code

Listing B.1 contains the bodies of the helper functions used by the kernels of Listings 6.3, 6.4, and 6.5. If a memory access to anything other than a legal matrix element is attempted, 0 is returned, which pads the block with zeros. The resulting multiplication of the zero-padded sub-matrices in Equation (6.2) is then

    C_S = [ B_IK  0 ] [ A_KJ  0 ] = [ B_IK A_KJ  0 ]
          [ 0     0 ] [ 0     0 ]   [ 0           0 ],

obtaining the desired sub-matrix.

// helper functions
// prec is the simulator's scalar precision type (assumed float here)
float getElement(const Matrix A, int i, int j)
{
    prec val = 0;
    if (i < A.height && j < A.width)
        val = A[i + j * A.ld];
    return val;
}

float getElement(const BandedMatrix B, int i, int j)
{
    prec val = 0.0;
    // only the lower triangle is stored: if j > i, swap i & j
    int temp;
    if (j > i) {
        temp = j;
        j = i;
        i = temp;
    }
    int min_j = max(0, i - B.rho);
    if (i < B.height && j < B.width && min_j <= j)  // inside B's bounds and band
        val = B[i - j + j * B.ld];                  // banded storage mapping (Table A.1)
    return val;
}

void setElement(Matrix A, int i, int j, float value)
{
    if (i < A.height && j < A.width)
        A[i + j * A.ld] = value;
}

Listing B.1: Helper bodies for kernels.
