PARALLEL IMPLEMENTATION OF MULTIBODY DYNAMICS FOR REAL-TIME SIMULATION By Darrell Wong B.E. (Electrical) University of Auckland, 1981. M.A.Sc. (Electrical) University of British Columbia, 1985. A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY in THE FACULTY OF GRADUATE STUDIES ELECTRICAL ENGINEERING We accept this thesis as conforming to the required standard THE UNIVERSITY OF BRITISH COLUMBIA May 1991 © Darrell Wong, 1991 In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission. Department The University of British Columbia Vancouver, Canada DE-6 (2788) Abstract A multibody dynamics formulation has been developed for the purposes of real-time simula tion of large scale robotic mechanisms such as excavators. The formulation models the rigid body dynamics of any arbitrary tree structured mechanism, although at present the formula tion is restricted to single degree of freedom rotational joints. This formulation is an example of the orthogonal complement approach, which describes the dynamics by projecting an initial description of the primitive equations of motion (the derivatives of translational and angular momentum plus the kinematic equations) from angular and translational Cartesian coordinates to relative angles. In this thesis the approach was developed from Newtonian and Eulerian principles. Novel single cpu algorithms for inertia matrix and force vector formation have been implemented. Novel multiprocessor algorithms were implemented for the inertia matrix and the force vector on a 2d n("2+1) 4. n triangular mesh architecture. A feedforward systolic matrix so lution technique was also implemented. These algorithms are of 0(n) complexity, and together they form a parallel formulation which is more efficient than other parallel formulations in the literature for mechanisms with fewer than 15 degrees of freedom. A Caterpillar 215B excavator was simulated in real-time using an array of transputers, and teleoperation experiments were conducted to verify the formulation. Single cpu simulations of the PUMA 600 and a human torso were also conducted. ii Table of Contents Abstract ii List of Tables viList of Figures ix Acknowledgements xi 1 Introduction 1 1.1 Modeling Multibody Dynamics 1 1.1.1 Issues 2 1.1.2 Motivation 3 1.1.3 Scope1.2 The Principles of Dynamics 4 1.2.1 Lagrangian 5 1.2.2 Newton-Euler 9 1.3 Overview of the Thesis 10 2 Existing Formulations of the Equations of Motion 12 2.1 A Brief Literature Review 12.1.1 Coordinate selection 2 2.1.2 Reference point 13 2.1.3 Topology 4 iii 2.1.4 Closed chains 14 2.1.5 Other characteristics of multibody systems 15 2.1.6 Parallel formulations 12.2 Forward Dynamics Algorithms in Detail 16 2.2.1 The composite rigid body algorithm 7 2.2.2 Recursive 0(n) techniques 21 2.2.3 Orthogonal complement algorithms 23 2.2.4 The Newton Euler State Space orthogonal complement algorithm 27 2.2.5 Discussion 33 Efficient Computation using Body Coordinates 35 3.1 Referral of the Equations to Body Coordinates 6 3.1.1 Single Chain Algorithm 33.1.2 Branched systems algorithm 8 3.2 Inertia Matrix Calculation and Complexity 41 3.3 Force Vector Calculation and Complexity 5 3.3.1 0(n2) force vector algorithm and complexity 47 3.4 Equation Solving 49 3.5 Simulations3.5.1 Puma 600 manipulator 50 3.5.2 Human torso 7 3.6 Summary 61 4 A Parallel Dynamics Algorithm 66 4.1 Integrating the Inertia Matrix and Force Vector Computations 66 4.2 Parallel Calculation of H 68 iv 4.3 Parallel Calculation of [L mp) 72 4.4 Parallel Multiplication by HT4.4.1 Multiplying H and [L mp] by HT 75 4.5 Parallel Matrix Solver 80 4.5.1 The Jainandunsing feedforward algorithm 83 4.6 Summary of the Proposed Forward Dynamics Algorithm 89 4.7 Parallelism in Other Formulations 84.7.1 Lee and Chang 91 4.7.2 Amin-Javaheri and Orin 92 4.7.3 Fijany and Bejczy 4 4.7.4 Hwang, Bae, Haug and Kuhl 95 4.7.5 Complexity Comparison 6 4.8 Summary 99 5 The Real-Time Excavator Simulator 101 5.1 The UBC Teleoperation Project5.2 A Description of the Caterpillar 215B 102 5.3 Architecture and Hardware Considerations 104 5.4 Implementation of the Simulator 106 5.4.1 Multibody dynamics and distributed Runge Kutta 106 5.4.2 The actuators 109 5.4.3 The display and joystick interface Ill 5.5 Excavator Simulations 113 5.5.1 Timing measurements 115.5.2 JOINT and RESOLVED mode experiments 117 v 5.6 Summary 126 6 Conclusions and Further Work 127 6.1 Summary of the Thesis 126.2 Concluding Remarks 130 6.3 Contributions to the Literature 131 6.4 Topics Requiring Further Exploration 132 6.4.1 Multiple degree of freedom joints 136.4.2 A parallel architecture for the branched formulation 134 6.4.3 Closed chains 135 6.4.4 Actuators and friction effects 139 A Single Chain Dynamics 141 B Branched Chain Dynamics 143 Bibliography 145 vi List of Tables 2.1 Computational Complexity for single chain systems on one cpu 19 3.1 Complexity of Forward Dynamics for One Cpu 45 3.2 Computational Complexity for Inertia Matrix Assembly for One Cpu 45 3.3 Complexity of single cpu force vector algorithm 49 3.4 Denavit Hartenberg parameters for the PUMA 600 50 3.5 Complexity of complete forward dynamics for single chains on one cpu 60 4.1 Computational complexity for parallel H algorithm 71 4.2 H computation schedule for row 1 cpus 74.3 Schedule for calculating [L mp] 3 4.4 Redistributed computations and communications for [L4 m^p^] 74 4.5 Revised H computation (row 1) schedule 74.6 Revised schedule for calculating [L mp] 5 4.7 First four computations of HTH (rows 1 to 4) and HT[L) (row 0) (rotational components only) 79 4.8 Parallel computational complexity 90 4.9 Estimated computational complexity for Hwang's algorithm 96 5.1 Tasks required for the dynamics simulation 107 5.2 Parameters used in excavator simulation Ill 5.3 kp and kv for PD actuator spool control 113 vii 5.4 Theoretical and real timing estimates of one cycle of simulation on one cpu and the multiprocessor (Ziiz^fcll + n Cpus) 115 5.5 Timing estimates for calculating H 6 6.1 Bij for various types of joints 133 viii List of Figures 1.1 Open loop, fixed base, serial chain 6 1.2 Forces and torques on the ith body assuming all rotational joints 8 2.1 Body coordinate frames 28 2.2 Vectors describing the ith hinge constraint 31 3.1 A branched tree of five bodies 39 3.2 Block diagram of the open loop inverse dynamics controlled PUMA 600 simulation 50 3.3 Denavit Hartenberg coordinate description for the PUMA 600. Adapted from (Angeles 88) 51 3.4 0d, Odi 0d profiles for all joints of PUMA 600 model. Overlaid are the 9, 0 and 0 responses for joint 6 4 3.5 Td profiles for joints 1 to 6 generated by the inverse dynamics of the PUMA 600 model 55 3.6 Errors 9d -9,9d- 9, Bd - 6 profiles for joint 6 of PUMA 600 model 56 3.7 Human torso coordinate diagram 58 3.8 0<i, Odi 0d an(l profiles applied to all joints for human torso model. Overlaid are the reponses for joint 6 1 3.9 Td profiles for all 6 joints generated by the inverse dynamics for human torso model 62 3.10 Errors 0d — 0, 0d — 0, 0d — 0 profiles for joint 6 of human torso model 63 3.11 Graph of complexity estimates for formulations in Table 3.5. Algorithms are (A) - Angeles, (W) - Walker, (B) - Brandl, and (P) - proposed in thesis 65 ix 4.1 Triangular MIMD mesh array 67 4.2 Pipeline computation of A\_x, c\j, and />{• • in cpui,, (row 1) 69 4.3 Final distribution of c\ • and p) • vectors 70 4.4 Systolic array for solving the matrix equation Ax = b 86 4.5 Data shifting phase prior to solver 87 4.6 Generalised cube network 91 4.7 N=8 processor cube configuration. Adapted from Amin-Javaheri 1988 93 4.8 Computational complexity for the parallel algorithms of Table 4.8, assuming only M is calculated. L - Lee, F - Fijany, H - Hwang, W - Wong (proposed algorithm). 97 4.9 Computational complexity for the parallel algorithms of Table 4.8, including M and feedforward solver. L - Lee, F - Fijany, H - Hwang, W - Wong (proposed algorithm) 98 5.1 Initial coordinate arrangement for Caterpillar 215B and resolved mode parametersl03 5.2 Computer architecture for the excavator simulator 106 5.3 Voltage (v) to spool (x) relationship with deadband 110 5.4 The algorithm organisation between cpuo,o, the control transputer and the IRIS . 114 5.5 JOINT mode 6 response in the simulator 119 5.6 JOINT mode 6 response in the simulator 120 5.7 JOINT mode v and x response in the simulator 121 5.8 RESOLVED mode 6 response in the simulator 123 5.9 RESOLVED mode radius, cabin rotation, height, and bucket rotation responses in the simulator 124 5.10 RESOLVED mode v and x response in the simulator 125 6.1 Architecture for branched four body system 136 x Acknowledgements I would like to thank my supervisor Peter Lawrence for his encouragement and financial and technical support. My committee members, Dr. M.R. Ito, Dr. F. Sassani and Dr. C.W. da Silva have provided encouragement and guidance. To my coworkers, Nariman Sepehri, Dan Chan, Frank Wan, and Real Frenette I owe a great debt of gratitude for their technical advice and diligent proofreading. A special acknowledgement is owed to Ahmed Ibrahim for pointing me in the right direction. Lastly I would like to thank Serena Mar for her constant encouragement and faith, and my parents for their patience and support. xi Chapter 1 Introduction 1.1 Modeling Multibody Dynamics The dynamics of rigid body motion have been derived and analysed since the 18th century. Indeed, the field has been so well investigated it is now called classical dynamics. Until the early part of this century, dynamicists concentrated on the analytical mechanics of single bodies such as tops, in which the motion could be mathematically described in closed form [Whittaker 27]. Systems and motions of greater complexity were considered too difficult. The development of the computer in the 1950s and 60s, with its ability to numerically integrate, renewed interest in the dynamics of more complex multibody systems for which elegant closed form solutions for the equations of motion could not be obtained. Multibody computer simulations require the equations of motion in the form of an algorithm which generates the derivative of the system state from the known previous state. The state consists of joint positions and their velocities. The joint variables may be rotational (e.g. Euler angles), or translational (e.g. prismatic). The objective of the computer simulation is to continuously solve for the derivative variables (generally joint accelerations and joint velocities) and numerically integrate over time to determine the new state variables (joint displacements and velocities). In the fields of space dynamics, mechanism analysis, and vehicle dynamics, the equations of motion have been studied extensively. The formulations have steadily increased in capability, permitting the analysis and simulation of closed chain (kinematic loop) structures, flexible bod ies and advanced control techniques. This greater capability has unfortunately also resulted in significantly increased computational complexity (in its crudest sense, the number of multiplies and adds as a function of the size of the problem (in this case the number of degrees of freedom 1 Chapter 1. Introduction 2 n)). In addition, the automatic generation of the equations of motion has become more diffi cult. As a consequence, non-real-time simulation has been the norm, although supercomputers have trimmed computational delays. Online man-in-the-loop interaction, however, has not been possible except for hybrid electromechanical simulators e.g. aircraft training simulators. 1.1.1 Issues There are four major issues in multibody dynamics [Schwertassek 89] which continue to attract attention. 1. The basic issue is the choice of an approach for deriving the equations. This involves the principle of mechanics, the choice of state variables, and the description of system topology. Conceptual simplicity, efficiency and computational complexity are the main criteria for evaluation. i 2. A second issue is the incorporation of other types of dynamic behaviour, including such effects as flexibility and friction. These effects are secondary and dominated by the large scale rigid body rotations, but they can greatly increase the complexity of the equations and may often require much smaller step sizes for the integrator, resulting in greatly increased execution time. These characteristics become more important as advances are made in control theory, structural mechanics, and materials engineering, all of which will require models of light and flexible linkages. 3. The third issue is the choice of appropriate methods for dealing with closed chains. Con straint equations representing closed chain forces can lead to the addition of Lagrange multipliers, or to transformations to minimal sets of equations. The presence of Lagrange multipliers can lead to numerical instability, and thus coordinate partitioning and other reduction techniques have been proposed to stabilise the system and minimise the number of equations [Schwertassek 84]. A further problem arises when the closed chains open and close over time, leading to system representations which vary in form over time. These problems are numerical and theoretical in nature, and can greatly increase the complexity of the formulation. Chapter 1. Introduction 3 4. The fourth issue is real-time simulation. The need for shorter design cycles and man-in-the-loop simulation capability requires formulations that execute in real-time. Item one above approaches this through theoretical considerations. Another approach is to use multiple processors to generate and solve the equations. At present this area of research is in its infancy, and the issues of algorithm expandability, architectural complexity, and the integration of various phases in the computation are still research problems. Questions arise as to the most appropriate formulation for multiple processor implementation, and the best computer architecture to use. This thesis addresses these last two questions. 1.1.2 Motivation The robotics research group at the UBC Electrical Engineering Department is interested in applying telerobotic control techniques to robot linkages such as excavators. A simulator is required when experiments in control, sensing, kinematics, and hand controller strategies need to be conducted in a controlled setting, before major field tests with real systems begin. This thesis is concerned with the design and implementation of a real-time manipulator simulator of the Caterpillar 215B, an excavator-type machine. 1.1.3 Scope In addition to excavator-type machines, which are basically four link structures, mechanisms with complicated wrists (e.g. the FMS Timberjack feller buncher) and multilegged bodies (e.g. the Kaiser Spyder) have also been designed. Robots such as the proposed Special Purpose Dextrous Manipulator, designed in Canada for the Mobile Service platform on the space station, have multiple chains of links. Modelling these machines efficiently requires formulations of more complex multibody systems than single-branch chains. One aim of this thesis is to take a step towards the development of an algorithm that accommo dates branched (tree-like) systems. Initially only rotational degrees of freedom will be modelled. Prismatic (translational) joints are not included here, but are easily modeled ([Kim 86a] and Chapter 6). The thesis is limited to the dynamics of rigid body systems, since these large links do not flex appreciably. The formulation does not simulate closed chains, but this area of Chapter 1. Introduction 4 research will be discussed in the chapter on future work. Control issues will not be considered, but the dynamics of hydraulic actuation will be briefly considered in chapter 5, when the issue of interfacing to actuators and controllers is examined. The second aim of this thesis is to implement a real-time simulator for the dynamic equations of the Caterpillar 215B. Researchers in robotics have attempted to obtain real-time simulations of common robots by concentrating on computational efficiency and limiting the problem to simple serial systems (a single open kinematic chain), each joint having only one degree of freedom. Often only one phase of the computation is addressed (e.g. inertia matrix, force vector, or matrix solver), without consideration of the integration of these different phases of the formulation. Even these simplified algorithms have required multiprocessors, since they have are too computationally complex for real-time simulation with existing single cpu computers. Our approach is also to use a multiprocessor. The computer architecture, however, will implement a parallel formulation which is concerned with the integration and efficiency of all phases of the simulation. The proposed formulation is evaluated using the computational complexity of the dynamics algorithm and the simplicity and efficiency of the multiprocessor architecture as the criteria. Initially a single cpu formulation will be developed for branched chains. The parallel archi tecture, however, is specific for a single chain mechanism. In the discussion on future work, modifications for the inclusion of branched systems in the parallel architecture are examined. 1.2 The Principles of Dynamics The dynamics of a multibody system are determined by the kinematics, which describe the topology, and the kinetics, which describe the forces. Kinematics is focused on the geometric aspects of motion, excluding the forces involved. Link length and rotational or translational displacement and velocities are calculated to obtain the end-effector (or some link in between) position or velocity in Cartesian coordinates (forward kinematics), or conversely, to determine the joint displacements that produce the endpoint position (inverse kinematics). Kinetics is concerned with describing the balance of forces that influence body motion, such as gravity, actuator moments, and Coriolis forces due to the effects of angular velocity. Chapter 1. Introduction 5 Together, the kinematics and kinetics completely describe dynamic behaviour through the dif ferential equations of motion, in terms of joint variables, structural parameters, and the internal and external forces applied to the joints. Examples of the external forces which affect the mo tion are the joint torques exerted by the actuators, and forces from the environment such as contact forces. The dynamics can be obtained from two main approaches: Lagrangian, and Newton-Euler. The next two sections outline the fundamentals of these two approaches. Chapter 2 will discuss specific examples in greater detail. 1.2.1 Lagrangian The Lagrangian approach requires the designer to select a set of independent state variables q which can completely characterize the position and orientation of the system. This set is known as the set of generalized coordinates. For a simple manipulator with rotary or prismatic joints, relative joint angles and displacements are easy to measure, and thus make a good set of coordinates. Another choice is inertially referenced angles and displacements. Consider the single chain of n bodies in Fig. 1.1, with one degree of freedom per joint. Assume that the generalised coordinates are qi,...qn (measured relative to the previous link, or with respect to an inertial frame). The Lagrangian is denned as the difference between the kinetic (Ek) and potential (Ep) energy of the system. £(q,q) = Ek(q,q)-Ep(q). (1.1) The equations of motion are defined as: d dC dC . ( . dt^-dq~rFt , = 1'-B (L2) where F{ is the force acting in the direction of the coordinate. The Ek of a link i is given by Ek(i) = ^rriixjii + iicfjjti;.-. (1.3) J, is the link inertia tensor, and i, and tt>, are the 3x1 translational and angular velocity vectors of the centre of gravity of the ith body, measured relative to an inertial frame of reference. The velocities i, and w, are the Cartesian velocity coordinates of the ith body. The system vector Chapter 1. Introduction 6 B. 1 / z1 •A Figure 1.1: Open loop, fixed base, serial chain manipulator [x w] forms a nonminimum set of coordinates which may be reduced to a minimum set when the hinge constraints are added. Transformations from Cartesian coordinates to q or q are necessary because w (a 3 X 1 vector) is not a minimum representation and is generally not a measurable variable (unless it consists of a rotation about a single axis) since it is a function of up to three Euler-type angular rotation rates. One generalised set of q is rotational displacements (for a system with all rotational joints and no translational joints) which may be measured with respect to a global (inertial) frame or with respect to an adjacent link (relative). The transformation between [x, «?,-] and any other q is ii = Bj)q Wi = B$q (1.4) Chapter 1. Introduction 7 where Bj) and Bp* are translational and rotational 3xn Jacobian velocity transformations to the system vector q. The Jacobians for q — 0 can be derived from the following kinematic velocity and acceleration relations: Wi = «;,•_! + Zi$i Wi = Wi_i + Wi X Zidi + ZiOi (1.5) Xi = x,_i + - ki Xi = + (wi_i x /j_i) - (wi X fcj) ^ = x,_! -|- x + x x - X ki — Wi x x A:,- (1.6) where ki is the vector from the centre of gravity of a body to the lower hinge, and /,• is a similar vector to the upper hinge on a body (Fig. 1.2). zt is the unit vector describing the axis of rotation of the ith hinge, ki, /,, and z, are assumed to be constants when referred to their own coordinate frame, but in equations 1.5 and 1.6, they are time varying, as the equations have been implicitly referred to a common coordinate frame. The system kinetic energy can be written as Ek = \qTMq where M = fl(miBTTBT + BRTJiBR]) i1-7) where M is the n x n inertia tensor for the complete manipulator system. The potential energy for the system is due to gravity g, and is n Ev = Y,m>9Txi- (L8) When the Lagrangian is differentiated as per equation 1.2, the equation of motion for the ith. link is n TI n n E MH9j + E E hHkWk + E mi9TB%} = Fi i = 1, ...n (1.9) j=i j=i k=i j=i where Fi is the actuator torque, and B^} is the tth column of the B^p matrix appearing as a result of the differentiation of Ep with respect to dg,. hijk represents the terms that occur Chapter 1. Introduction 8 Figure 1.2: Forces and torques on the ith body assuming all rotational joints as a result of differentiating M with respect to dq. The first term of equation (1.9) represents the moment due to inertia, the second term represents the Coriolis and centrifugal moments, and the third term represents the moment due to gravity vector g. The equation describing the system is then M{q)q=F-h{q,q)-G{BT). (1.10) The equations must then be solved for q given that M and BT are functions of q, and h is a function of q and q, which are known from the previous time step. F is provided from an actuator algorithm. The benefit of working with the Lagrangian approach is that the forces of constraint between hinges do not have to be considered. The internal forces never explicitly appear in the equations because a kinetic energy argument is used to describe the formulation, and the kinematics are Chapter 1. Introduction 9 used to describe the constraints and transform to generalised coordinates. Although the deriva tion of the formulation is elegant, developers of computer codes have not used the complete Lagrangian equation for systems of significant complexity due to the computational inefficiency of the technique. Ibrahim [Ibrahim 88] and Silver [Silver 82], however, have shown that this is not necessarily true, by demonstrating that the principle of mechanics does not greatly influence the efficiency of the equations. It is the cost of performing the transformation to different q that determines the efficiency of the algorithms. 1.2.2 Newton-Euler The Newton-Euler approach uses Newton's Second Law of Motion and Euler's equation for rotational momentum to describe the kinetics of each body. Newton's Second Law, F = P = 4(mi) = ml (1.11) dV ' v ' describes the sum of the translational forces F on a body as equal to the rate of change of linear momentum, P. m and x are described as in the previous section. Rotational motion is described by Euler's equation for the inertial torque of a body, which describes the time rate of change of angular momentum: N = L = ^(Jw) = Jw + w x Jw (1.12) where L is the angular momentum, N is its derivative or moment, J is the link inertia (relative to its own coordinate frame), and w is the body's angular velocity (measured with respect to an inertial frame) defined in the previous section. Jw is known as the angular acceleration torque and w X Jw is known as the gyroscopic torque. The gyroscopic torque appears because the orientation of J varies with time, producing a non-zero time derivative. In the Newtonian approach, each link is isolated as a free body, and the forces and torques due to gravity, inertia, velocity, and actuation appear in translational and rotational equations as a balance of forces or moments (Fig. 1.2). The equation describing the translational forces on a link i is (Fig. 1.2): m,x, = 7i - 7,-+i + m,-0, (1.13) Chapter 1. Introduction 10 where the RHS is the sum of the translational forces about link i. 7,- and 7,4.1 are the forces applied to link i at the hinge points i and i + 1 by the attached links, which seek to constrain the body's motion. 17, is the acceleration vector due to gravity. The rotational torques are: Li = J{Wi + Wi x JiWi = Ri\i - Ri+i\i+1 + ki x 7, - /, x 7;+1 + r, - r,+1 (1-14) where is the actuator torque occurring about the axis of motion i. i?,A, (or i2,+iA,+i) are moments that act to oppose rotational motion in axes orthogonal to the axis of motion of hinge i (or hinge i+ 1). A:, x 7^ and /, x 7,+i are moments due to the translational hinge forces 7,- and 7i+i. All terms in (1.13) and (1.14) have been referred to a common coordinate frame. The above equations can be written for all the bodies in a system with the differentiated state variables x and w. As in the Lagrangian approach, x and w can be projected to a different set of states, such as Euler angle accelerations 9 or Euler parameter accelerations qp, and may be relatively or inertially measured. For one degree of freedom relative joint angles, Wi — f(zi9\, Z2&2, —Zi^i), where z, is the ith axis of rotation. The equations of motion, 1.13 and 1.14, are then a function of 6, and, due to w, functions of 9 and 9. The projection from [i w] to an alternative set of q such as 9 leads to the elimination of the internal constraint forces and moments 7 and A. Once the equations for all bodies are assembled and the constraint forces and moments are eliminated, the equations must then be rearranged to solve for 9. The formulations may be 0(n) computational complexity if the projection/elimination of the constraint forces is done recursively from link to link (e.g. the algorithms of Featherstone [Featherstone 83], Rodriguez [Rodriguez 88], Bae and Haug [Bae 87b], Roberson and Schwertassek [Roberson 88], and Brandl [Brandl 86]). If the projection is done as a global transformation, 0(n2) computational complexity (e.g. Hwang and Haug [Hwang 88]), or 0(n3) computational complexity is obtained (e.g. Angeles and Ma [Angeles 88], Kim and Vanderploeg [Kim 86a], and Hemami [Hemami 82]). o 1.3 Overview of the Thesis In the next chapter, Chapter 2, some of the formulations proposed by the research community are reviewed. The approaches taken once the Lagrangian or Newtonian primitive equations have Chapter 1. Introduction 11 been established are examined, and examples of parallel implementations are described. In the latter half of Chapter 2, the Newton Euler State Space method [Hemami 82], a formulation based on orthogonal complements, is presented. This method is the starting point for the development of the parallel formulation presented in this thesis in later chapters. In Chapter 3, the equations of motion are reorganised by referring them to body coordinates. This change illuminates the recursive nature of the equations, determining the structure of the algorithm, which then influences the multiprocessor architecture. The formulation is then generalised to branched systems and the equations are shown to be computable in an automated manner. Simulations of a PUMA 600 manipulator and a branched six degree of freedom human torso are presented to demonstrate the formulation. In Chapter 4, a multiprocessor algorithm for calculating the inertia matrix for simple serial chains (an open kinematic series of bodies) is obtained. A multiprocessor algorithm for the calculation of the driving force/torque vector is also developed which integrates well with the inertia matrix algorithm. Chapter 4 also describes the implementation of a parallel feedforward systolic matrix equation solver derived by Jainandunsing [Jainandunsing 89]. A parallel systolic architecture decomposes a problem into simple parallel tasks which are identical. These tasks may be run simultaneously on identical cpus, with intermediate results being passed between neighbours in a pipelined manner, analogous to the way blood is pulsed through the body. This solver completes the set of parallel algorithms used for the simulator. Chapter 5 describes the excavator simulator, including the computer architecture and the inter faces to a graphics computer and joystick control. Measurements of relative efficiency between single and multiple cpu simulations are given, and a teleoperation experiment is presented to demonstrate the usefulness of the simulator. Chapter 6 concludes the thesis with a discussion on results and future work. Ideas for extending the formulation and the architecture to deal with branched sysytems, multiple degree of freedom joints, and closed chains are presented. Chapter 2 Existing Formulations of the Equations of Motion In this chapter some of the formulations proposed by the research community are reviewed. A brief review of the development of multibody dynamics is presented by pointing out the most important characteristics required of a formulation. Following this, some of the most relevant approaches will be discussed, and examples of parallel implementations described. Finally the Newton Euler State Space method [Hemami 82], a formulation based on orthogonal complements, is presented. This method is the starting point for the parallel formulation developed in this thesis. 2.1 A Brief Literature Review In this section, a very brief review of the literature is presented. Following this, those formula tions most relevant to the thesis objectives are examined in more detail. 2.1.1 Coordinate selection Selecting the principle of mechanics, as discussed in the previous chapter, is just one step in the formulation process. The selection of the final generalised coordinates (e.g. relative or inertial, minimum or nonminimum set) also greatly influences the final form of the equations of motion. Transformations of the initial equations of motion from Cartesian coordinates [w x] to other coordinate spaces can reduce the number of equations in the final system to varying degrees, and give the user more insight into the mechanics of the formulation. The final equations may be easier to derive (but require a larger set of equations and coordinates) or harder to derive (but result in a smaller set of equations using a minimum set of coordinates). In the 1960s, Roberson and Wittenberg [Roberson 66] and Hooker and Margulies [Hooker 66] 12 Chapter 2. Existing Formulations of the Equations of Motion 13 developed algorithms for problems in spacecraft dynamics that were implementable on a com puter. Newton-Euler principles were used to form the primitive equations for individual bodies, and these primitive Cartesian equations were augmented by constraint equations and Lagrange multipliers, rather than eliminating the constraint forces by projective transformation. Haug [Haug 89] used inertially measured Cartesian generalised coordinates in the DADS software package, in which constraint forces were not eliminated, and the coordinates were not reduced to other state spaces such as relative angles. This choice generated large numbers of equations, but resulted in simpler definitions for constraints and forcing functions. Hooker [Hooker 70] and Jerkovsky [Jerkovsky 78] used transformations to relative joint variables when it was realised that they were easier to measure, more intuitive, and less prone to numerical error. This choice of coordinates resulted in a smaller but more complex set of equations. Examples of similar transforms are found in [Hemami 82], [Bae 88a], [Kim 86a], [Angeles 88], [Bae 87a], and in the work performed in this thesis. 2.1.2 Reference point When defining the equations of motion for a large system of bodies, both Roberson and Hooker used the augmented body approach, which defined a body barycenter as the point of application of body forces. The barycenter is the center of gravity of a body which has been augmented by point masses at the hinge positions around the body. These point masses are the cumulative masses of all the links in branches outboard from each hinge. The augmented body approach uses the system center of gravity as the system reference point. He and Goldenberg [He 89] have used this technique to computationally simplify the inverse dynamics (calculating the desired torques) for single chains. The barycentric approach has generally been supplanted in the literature by the direct path method introduced by Ho [Ho 74] and subsequently used by Jerkovsky and others. The direct path approach assumes a designated home body as a reference point, and then describes a body by relating it to home via the chain of bodies between them. This results in simple recursive equations for the kinematics and kinetics in which the point of application is at the hinge or the centre of gravity of a body. In this thesis the direct path method is used, with the home body being the first body in the chain. Chapter 2. Existing Formulations of the Equations of Motion 14 2.1.3 Topology Another requirement of a computer formulation is a systematic method for specifying the system topology. Graph theory, which mathematically expresses the topology as a matrix (or linked list) of connectivity, is a popular way of describing the connections and constraints among links. Roberson and Wittenberg used graph theory to define the topology of a system using an incidence matrix. The incidence matrix can be quite complex if the numbering of the bodies and hinges is arbitrary. The simplest incidence matrix uses a strict numbering system related to the direct path described by Ho [Ho 74]. In this method each body in a branch is labelled from a home body with an ascending integer, and each branch is completely labelled before another branch is begun. As a result, the ith row of the incidence matrix has a 1 in the jth column if the jtb. body lies on the path between body i and home. Most formulations, including the one developed in this thesis, now use some form of matrix or list to describe the topology (although roboticists have generally limited their simulations to robots without branches and consequently seldom use an incidence matrix). 2.1.4 Closed chains Closed chains present significant problems for researchers due to the difficulty in selecting appropriate coordinates among a large initial set (the coordinates produced when closed loops are cut). This is because of degrees of freedom which are lost or gained when the system forms geometries which cause the hinges to lock during motion . Numerous researchers have modeled closed chains, but no method has demonstrated significant advantages over others. The augmentation method, using Lagrange multipliers to represent the unknown closed chain constraint forces, plus constraint equations to represent the kine matic constraints, has been discussed by Nikravesh [Nikravesh 84], Orlandea [Orlandea 77], and Wittenberg [Wittenberg 77]. One method is through direct integration and solution of the augmented equations (a differential-algebraic set). The problem with this technique is that nu merical errors appear due to numerical integration inaccuracies, resulting in constraint equation violations. Numerical constraint stabilisation has been used by Baumgarte [Baumgarte 72] and Bayo [Bayo 88] to monitor the amount of violation and then numerically dampen the system to Chapter 2. Existing Formulations of the Equations of Motion 15 minimise the errors. State space reduction techniques have also been developed which minimise the number of independent coordinates by partitioning the initial set of coordinates into depen dent and independent sets u and v respectively. The constraint equations are then redefined so that functions of u coordinates become functions of v coordinates. Coordinates and velocities u, v, ii and v can be calculated, followed by the solution to the differential-algebraic equations. Separating q into il and v enables the algorithm to control the amount of numerical error. This technique was initially proposed by Wehage, Mani and Haug [Wehage 82] [Mani 84]. Kim and Vanderploeg [Kim 86b] refined it using QR decomposition of the constraint Jacobian to identify the independent coordinates. Another refinement of Mani's technique was made by Park and Haug [Park 86], who combined Baumgarte's numerical stabilisation and Mani's partitioning technique. This thesis does not consider the problem of closed chains, although in Chapter 6 an approach developed by Murray and Lovell [Murray 89] is suggested as a logical extension to the current thesis work. 2.1.5 Other characteristics of multibody systems Researchers such as Ho, Hughes [Hughes 79], Bodley [Bodley 78], Keat [Keat 83], Kim [Kim 88] and Ibrahim [Ibrahim 88] have developed formulations which include such characteristics as flex ible links and more sophisticated joint constraints in the context of space dynamics. Vehicle dynamics researchers such as Fuhrer [Fuhrer 89], and Rulka [Rulka 90] have developed algo rithms for simulating railguided vehicles and automobiles, which need to deal with friction and nonlinear wheel-rail forces. Characteristics such as these have not been considered in this thesis work as the emphasis has been on the development of the basic rigid body equations as a first effort. 2.1.6 Parallel formulations All the formulations described above were developed for single processors. Bae, Kuhl and Haug [Bae 88b] and Hwang, Bae and Haug [Hwang 88] developed and implemented parallel algorithms on an Alliant FX/8 shared memory multiprocessor. Bae's algorithm is an 0(n) computational complexity algorithm which uses one cpu for each topological branch (chain) in the system. Hwang's algorithm is an n cpu, 0(n + log2 n) computational complexity, parallel Chapter 2. Existing Formulations of the Equations of Motion 16 version of an 0(ra2) single cpu algorithm [Bae 88a]. Robotics researchers such as Lee and Chang [Lee 88], Fijany and Bejczy [Fijany 89] and Amin-Javaheri and Orin [Amin-Javaheri 88] have parallelised single chain linkages using the 0(n3) single cpu algorithm proposed by Orin and Walker [Walker 82] as the basis formulation. They derive 0(n + (log2 n"|) (Lee, Amin-Javaheri) or O(log2 n) (Fijany) complexity parallel algorithms that use a generalised cube or a hypercube computer architecture utilising n cpus. This brief survey, which does not reflect the contributions of many other researchers, shows that a variety of formulations have been derived. The most popular ones use relative joint coordinates as generalised coordinates, and connectivity matrices using the direct path method to describe the topology (and as a consequence, to describe the constraint and kinematic relationships). Modeling techniques for closed chain constraints and secondary characteristics such as flexibility and friction continue to be developed. Research on parallel solutions has only just begun, and will continue to evolve as computer hardware and algorithms change. In this thesis relative joint angles are used as generalised coordinates and a connectivity matrix is used to describe the topology and kinematics. Closed chain constraints are not dealt with, as the emphasis is on real-time performance achieved through parallel processing. 2.2 Forward Dynamics Algorithms in Detail This section discusses several of the most recent formulations, which are divided into three types, each a different approach to the transformation of the primitive equations of Chapter 1. The first approach is known as the composite rigid body method, which uses the Newton Euler inverse dynamics method derived by Luh, Walker and Paul [Luh 80], plus a technique for calculating the inertia matrix which uses partial sets of links at the open end of the chain as composite bodies. The second approach uses recursive kinematic and kinetic equations to form the dynamics in 0(n) time, by locally eliminating the hinge constraint forces. Accelerations and forces are locally projected from body to body across the joints. The third approach uses a global Jacobian velocity transformation to eliminate the hinge forces and project from [w x] to 9. This third approach is pursued in later chapters of this thesis. Chapter 2. Existing Formulations of the Equations of Motion 17 2.2.1 The composite rigid body algorithm formulations The most commonly used algorithm for single chained robots is the composite rigid body method of Walker and Orin [Walker 82] (method 3), derived for a single chain manipulator with one degree of freedom joints. The algorithm begins by establishing the Lagrangian form of equation 1.10 as the final form of the desired equations, in which the coordinates are relative joint angles and relative prismatic displacements. The computations required to obtain the elements of the matrices in equation 1.10 are then derived from the Newton-Euler inverse dynamics algorithm derived by Luh. The RHS of equation 1.10 (excluding F) are forces due to Coriolis, centrifugal and gravity effects, and are calculated by artificially setting q to zero in Luh's algorithm. Initially, the kinematic equations (1.5) and (1.6) are calculated, except that q\ (equivalent to 0i) is set to zero and is described through a vector p,, measured from the inertial coordinate frame to the hinge point of body i, rather than the centre of gravity of body i. The equations are written in a forward spatial recursion: Pi = Xi_i + Xi = pi — ki 'ii = pi — Wi X ki — Wi x Wi x ki where pi = — x — x u>,_i x (2.1) and di = ki — U is the vector from hinge i + 1 to hinge i across body i. In this thesis, equation (2.1) has been changed with respect to Luh's original equations to reflect the fact that in the notation used in this thesis, the ith coordinate frame is placed at the bottom hinge of link i, rather than at the top (in Luh's description). The acceleration due to gravity is the first fink's acceleration p\. The total force F, and total moment JV; about a body are then written as in equations (1.11) and (1.12). The hinge force /, and hinge moment n, are the force and moment exerted on link i by fink i — 1, and are given in a backward recursion as fi = Fi + /„ = fe ni = ni+1 + Ni - ki x Fi - di x fi+1; nn = ne or Ni = n,- - n,+i + fc,- x /,• - /, x (2.2) Chapter 2. Existing Formulations of the Equations of Motion 18 where the equation for Ni in (2.2) is rearranged to consist of hinge quantites only. /„ and nn are equal to the external force or moment at the manipulator tip. Equation (2.2) describes the moments and forces about the isolated body. /, is equivalent to 7, in equations (1.13) and (1.14), and n, is the sum of the moments due to i2,A,- + r, (with a single z axis rotation, i?,A, is about the x and y axes, while TV is about the z axis). The elements of the force vector fv (Coriolis, centrifugal and gravity forces) are projections of the moments rii about the hinge axes 2,: fv(i) = nfzi i = l,...n (2.3) The inertia matrix element Mij from equation (1.9) is calculated using the composite mass Mj, composite centre of mass ij, and composite inertia Jj of a series of rigid bodies (a"implies a composite variable). A composite rigid body is comprised of the subset of links j to n at the free end of the chain, where in this case n (no subscript) represents the number of bodies in the chain. The Mj, tj, and Jj can be computed recursively in a backward recursion according to the following equations: (2.4) dj) - rrijkj), (2.5t' = n,...l (2.6) Mj = Mj+1 + THj, Cj = -i-(MJ+1(ci+1 Mj and Jj = Jj+i + bj, with Mn = mn Cn = -K Jn = Jn) and bj = Mj+1((rjr3)I where = CJ+\ + dj - tj - —Cj — kj. rtf)) + h + m3 {irfr])I - (r*rf)) (2.7) (2.8) (2.9) tj is measured relative to the jfth coordinate frame (at hinge j). The calculation of a column j of M requires q to be artificially set to zero in the kinematic equations for all links other than j (qj = 1, a unit acceleration). In addition, the joint velocities q and gravitational forces are zero. The output of the inverse dynamics algorithm with qj = 1 Chapter 2. Existing Formulations of the Equations of Motion 19 and these other quantities set to zero results in the jth column of the inertia matrix Ai. Since joint j is the only joint in motion (links j to n are static), the total forces Fj and Nj on the composite body are Fj = MJ{ZJ x tj) (2.10) for rotational axes (prismatic joints are considered in Walker's paper but not included here). For i < j, Fi and JV, are zero, and /, and n, are computed in a backward recursion as ni — rii+i + di x and /_,• = Fj, rij — Nj + dj x Fj i = l,...j - 1; when i = j. (2.11) For i > j, fi and n, are zero. An element Mij in column j is then calculated by projecting the hinge moment n, onto the ith joint axis. Thus for the jth column, when ijj = 1, Mij = njzl i<j (2.12) where this value of n, must be recalculated for each value of j (each new column of M). Mij for a translational joint i can be calculated by projecting the hinge force /, onto z,-. For Walker's algorithm, the computational complexity of forming the inertia matrix is 0(n2), of forming the RHS is O(n), and of solving for Mq — RHS using Cholesky factorisation and backsubstitution is 0(n3) (see Table 2.1). Algorithm x + Walker [1982] Angeles [1988],[1989] Brandl [1986] Featherstone [1983] (no force) Bae [1988a] (no inversion) nJ . 27n^ , 577n AQ 6 ' 2 3 *V 7f + 21n2 + 53|n 91 250n - 222 199n - 198 3n2 + 186n - 36 + 8n2 + 10lln 64 n3 + 16n2 + 75n - 90 220n - 198 174n - 173 sni + iiin _ 78 Table 2.1: Computational Complexity for single chain systems on one cpu Chapter 2. Existing Formulations of the Equations of Motion 20 parallel implementations There have been three parallel implementations of Walker and Orin's forward dynamics al gorithm. Lee and Chang [Lee 88] reorganised Walker and Orin's algorithm for an n cpu, single-instruction multiple-data (SIMD) multiprocessor by dividing the computation into three parallelisable tasks - formation of the inertia matrix, formation of the force vector, and solution of the matrix equation. The first two tasks used n cpus, while the matrix solver may use an n("2+1) mesh systolic algorithm or an n cpu set-ordering technique. The computation of the force vector is carried out using n cpus connected in a generalised cube network [Lee 88]. The 3x3 rotation matrices A° which refer a body coordinate system (bcs) variable to the inertial frame are calculated by multiplying a cascaded sequence [Ai...Aj~2Aj-1]. Using the generalised cube network, the A)-1 i = l,...n can be calculated all at once on all n cpus, followed by 0( [log2 n]) time computation of A°, through the use of the recursive doubling algorithm (RDA). The RDA algorithm pairs the computations of the rotation matrices in each of log27i levels e.g. J4JJ42, Af.,43,, etc in level 1; followed by A2J42 and A\A\ in level 2 etc . The inertia J,, axis z,-, and vectors d{ and fc,-, all in body coordinates, are then referred to the inertial frame using the A®. This is done in constant time using all n cpus. Recursive kinematic equations 1.5, 1.6 and 2.1, and hinge force and moment equations 2.2 can be computed in O(|"log2n]) time using the RDA algorithm once they have been referred to the inertial frame. The force vector elements, projections of the hinge moments n, onto the rotation axes z;, are calculated in constant 0(1) time using all n cpus. The overall complexity for the force vector is thus O([log2 n]). The inertia matrix is obtained by first calculating Mj, tj, and Jj in O([log2 n~|) computations using the same computer architecture and RDA algorithm as for the force vector calculation. The fi and n, (i < j) (equation 2.11), for all j, are obtained in 0(n) time, and Mij (i < j), the axis projections of n, for a specific j, are obtained in 0(1) constant time. The matrix is solved in 0(n2) time with n cpus using a set-ordering technique, and in O(n) time when an "(n2+1) 2d grid and a parallel systolic Cholesky factorisation matrix solving algorithm is used. The total complexity for all three dynamics tasks is thus 0(n2 + n+ [log2 n]) for n cpus and 0(n+ [log2 n]) for n+ n("2+1) cpus (exact complexity polynomials for the parallel algorithms given in this chapter are presented in Chapter 4). Amin-Javaheri and Orin [Amin-Javaheri 88] developed a similar algorithm to Lee's, but for the Chapter 2. Existing Formulations of the Equations of Motion 21 inertia matrix alone. As in Lee's work, the algorithm was developed for both n and cpus, and could be computed in 0(n + riog2 n~|) time for an n cpu hypercube architecture, 0(n) time for a pipelined n cpu architecture, and 0(n) time for mesh architecture. Different parallel interpretations of Walker's algorithm were developed in which the calculation of Mij was calculated by row or diagonal rather than by column. The most efficient parallel version was the column algorithm, implemented on both the pipeline (0(n) time) and the hypercube (0(n + flog2nl) time). The hypercube column version was equivalent to Lee and Chang's. They also developed a VLSI processor on which the algorithms could be implemented. Fijany and Bejczy [Fijany 89] developed an algorithm based on the Composite Rigid Body Spatial Inertia algorithm. This algorithm is very similar to Walker's, except that the Cj, r!- and rj equations in Walker's algorithm are replaced by equations for computing the first and second moments of the mass, from which n, and f are easily derived. All equations are referred to the inertial frame. Computation of Mj, A°, and first and second moment of mass is accomplished in O([log2n~|) time with n cpus. The /,, the n,, the diagonal inertia matrix elements Mn, and the off-diagonal elements Mij are calculated in constant time using "(n2+1) cpus. Several architectures and algorithms were presented by Fijany, but the most efficient used cpus in a 2 dimensional mesh to calculate the inertia matrix in O([log2 n]) time. 2.2.2 Recursive O(n) techniques formulations The second approach calculates the acceleration vector in 0(n) time by generating the ex plicit form of the dynamic equations ie. 6i is obtained without matrix inversion. Vereschagin [Vereschagin 74] devised the first 0(n) algorithm by observing that hinge constraint equations could be generated recursively and combined with other equations to generate the accelera tions. The formulation applied to single chain open systems with prismatic and revolute joints. Armstrong [Armstrong 79] independently developed a similar but less efficient formulation for systems with spherical joints, which was additionally capable of modeling flexible links. Featherstone [Featherstone 83] independently redeveloped Vereschagin's method using different initial principles, and introduced the concepts of articulated-body inertias jf and bias forces Chapter 2. Existing Formulations of the Equations of Motion 22 Pi. Jt and Pi are the inertia and force which relate the applied spatial force /, = [/, n,] to the inertially measured absolute Cartesian acceleration a, = [x, tbi] fi = Jfhi + Pi (2.13) Initially, kinematic equations similar to 1.5 and 1.6 are calculated in a forward recursion to obtain values for x, and Si (st- is a 6 x 1 vector describing the ith hinge axis vector Zi and the vector denned by (x; + ki) x Zi, referred to the inertia! frame). Then, articulated-body inertias were defined in a backward recursion by Featherstone as «/, — Ji + Ji+i where j£ — Jn ft s-^sT ft *t+i *t'+l i = n, ...1 (2.14) and Ji = m,x, mil Ji — rrii(xi)2 —m,xt-(2.15) Here J, is denned as the spatial inertia matrix relative to the inertial frame (and is not related to the composite body inertia), and x, is the skew symmetric matrix of vector x; measured from the inertial frame to the centre of gravity of body i. All quantities are referred to the inertial frame. The third term in 2.14 locally projects the articulated-body inertia Jt^x about the (i + l)th axis s,+i. The bias force P, is defined as Pi = + where Pn = 0 Jj+lSj+ljQi+l ~ sJ+1Pj+i) n, ...1 (2.16) Qi+i is the (i + l)th scalar force vector element, consisting of the actuator torque or force, plus the Coriolis, centrifugal, and gravity effects, a, and relative joint accelerations $i can be found in a forward recursion di Oi = a,_! + SiOi = 0 Qi - sjjtqj.r - sJP, sr ft 5-.1 i = 1, ...n (2.17) (2.18) Chapter 2. Existing Formulations of the Equations of Motion 23 The complexity is shown in Table 2.1. Since Featherstone's algorithm appeared in the robotics literature, multibody dynamics researchers have applied these concepts to develop their own formulations from other basic principles. Bae and Haug [Bae 87a] used variational-vector cal culus as the basis for transforming Cartesian equations to relative coordinates and for defining the equations of motion. Brandl, Johanni and Otter [Brandl 86], using Newton-Euler princi ples, introduced modes of motion and modes of constraint to describe complex hinges. They also integrated the calculation of the force vector into their algorithm, whereas Featherstone assumed that the force vector Q would be calculated by an inverse dynamics algorithm such as Luh's. Rodriguez [Rodriguez 88] derived a linear operator technique, based on recursive es timation and Kalman filtering theory, to develop an 0(n) algorithm similar to Featherstone's. Rodriguez, like Featherstone, assumed the calculation of the force vector would be done by an inverse dynamics algorithm. Brandl's algorithm is the most efficient of all formulations for chains with greater than 6 degrees of freedom, regardless of approach (0(n) or otherwise) when simulated on a single cpu [Brandl 86] (Table 2.1), and is thus the most efficient algorithm for simulating redundant (n > 6) robots with one cpu. parallel implementations Bae and Haug [Bae 87a] and Rodriguez [Rodriguez 88] have shown that their 0(n) algorithms can be parallelised, but only one cpu can be used for each branch in the topology of the system. Fijany [Fijany 89] and Lee [Lee 88] have stated that within a single chain, parallelism cannot be achieved at the equation level using 0(n) algorithms. Since most robots are single chains, parallel versions of 0(n) algorithms are not useful unless more complicated branched robots need to be modelled (there is of course an abundance of branched systems outside the robotics field). 2.2.3 Orthogonal complement algorithms formulations The third approach, which is the most relevant to this thesis, is the orthogonal complement method. Orthogonal complement algorithms use a global Jacobian transformation matrix which Chapter 2. Existing Formulations of the Equations of Motion 24 projects the [w x] derivative state space to a reduced (smaller) independent set of differentiated state variables e.g. relative acceleration angles 9. The transformation matrix contains vectors which are tangential to the constraint manifold of the system i.e. the vectors are orthogonal to the coefficients orienting the hinge constraint forces 7 and A. As a result, the transformation matrix eliminates the hinge forces from the primitive equations. One example of this transformation matrix approach is the Newton Euler State Space formula tion developed by Hemami [Hemami 82]. This 0(n3) algorithm uses a two step transformation to eliminate the constraint forces (the translational hinge forces, and the rotational hinge con straint torques which are orthogonal to the axes of rotation) from the Newtonian equations of motion. The first step removes the translational acceleration coordinate x from the principal equations (1.13) and (1.14) by using equation (1.6) (repeated here in (2.19)), Xi = x,_i + Wi_i x /t-_i + u;,-_i x u>,-_i x — ibi x ki — Wi x Wi x ki (2.19) to derive a matrix which projects the system from x to the inertially measured angular ac celeration state w and simultaneously eliminates the translational hinge forces 7. Following this, a second step projects the system from w to either inertially measured [Hemami 82] or relatively measured [Buchner 86] Euler angle accelerations 0. This step utilises the angular velocity Jacobian from equation (1.4), which is a symbolic version of equation (1.5) (repeated here in (2.20)): wi = Wi-i + Wi x Z{9i + ZiOi (2.20) This reduced derivative state space 0 contains only the rotational degrees of freedom (assuming only rotational joints in the system), which are orthogonal to the rotational hinge constraint forces A of equation (1.14). As a result of this orthogonality, the A are eliminated by the projection and the equations appear in a final form similar to that derived by the Lagrangian approach. Note that at this point, translational quantities in (1.13) are referred to and measured in inertial (base) coordinates, while rotational quantities in (1.14) are measured and referred to relative coordinates. Another example of an orthogonal transformation was formulated by Bae, Hwang and Haug [Bae 88a] [Hwang 88], who developed a recursive orthogonal complement formulation for closed chain systems based on variational vector calculus. Instead of denning a matrix which globally Chapter 2. Existing Formulations of the Equations of Motion 25 (i.e. system-wide) projects the inertially measured and referred Cartesian derivative state [x w] to relative coordinates, they defined a local transformation which acts upon every element in a system 'accumulated inertia' matrix. A 6x6 link 'inertia' K{ was denned as an accumulation (i to n) of link masses and rotational link inertias measured with respect to the inertial frame. K{ = Mi + Ki+1 Kn = Mn mil i = n, ...1 where M, = — 771,2; i m,x,- J, - m,(x,)2 (2.21) (2.22) / is a 3x3 unit matrix, and M, is a rearranged version of Featherstone's spatial inertia matrix J,. Ki can only be accumulated if x,- and J, are referred to the inertial frame. For a simple open chain in which all joints are rotational and each has a single degree of freedom, each element of the system inertia matrix Mij is BjKiBj, where Bi = (2.23) (xi + ki)x Zi Zi The vector (x, + ki) describes the position of the ith hinge measured relative to and referred to the inertial frame. Bi locally projects the accumulated 'inertia' about the rotational axes of body i. The computational complexity of forming the inertia matrix plus the force vector is given in Table 2.1. The complexity of the matrix solver is excluded as it was not given. Note that all quantities must be referred to the inertial frame. Kim and Vanderploeg [Kim 86a] developed a formulation for closed chains using the Jacobian velocity transformation to relative joint angles proposed by Jerkovsky [Jerkovsky 78]. Although the Lagrangian kinetic energy expression is used as the principle for deriving the inertia matrix, rather than the Newtonian and Eulerian principles, the final transformation is the same as Hemami's. Kim and Vanderploeg also modeled closed chains using Lagrange multipliers and constraint Jacobians. Ibrahim [Ibrahim 88] developed a similar formulation to Kim's using the kinetic energy expression and a projection matrix. This technique was developed within a formulation which included flexible linkages and variable mass deployment. Angeles and Ma [Angeles 88] presented a formulation for a serial chain manipulator using Kane's approach to produce a natural orthogonal complement to the kinematic constraint equations. Chapter 2. Existing Formulations of the Equations of Motion 26 This natural orthogonal complement is combined with the Lagrangian kinetic energy expression to derive the formulation. The natural orthogonal complement matrix is the same as both Hemami's and Kim's final transformations, although an interesting feature of the algorithm is the referral of each body's variables to its own coordinates. This feature is exploited in the algorithm developed here in Chapter 3 to derive a similar algorithm to Angeles', but based on Hemami's method. Lilly and Orin have recently developed two algorithms which are related to the orthogonal complement concept [Lilly 91]. Lilly/Orin I is similar to those algorithms described above, using a recursive definition of the velocity Jacobian for chains of successively longer length to identify the system inertia matrix. The algorithm is 0(n3) complexity. Lilly/Orin III is similar to Bae, Hwang and Haug's algorithm [Bae 88a] [Hwang 88]. The variables, however, are referred to body coordinates. This algorithm is 0(n2) complexity. parallel implementations Hwang, Bae and Haug [Hwang 88] parallelised their algorithm for an n processor shared bus computer architecture. Generally speaking, Hwang's paper does not lend itself to a computa tional complexity analysis, as closed chain constraint equations are included (the complexity of single chain branches becomes only a small part of the problem). Our estimates of Hwang's al gorithm for a single chain (see Chapter 4) indicate that for the inertia matrix, 0((log2 n]) time is taken for the computation of the rotational matrices A° (using the RDA algorithm proposed by Lee), constant time is taken to refer the link inertias Ji to the inertial frame, O(|fog2n]) for calculating x, (referred to the inertial frame) and accumulating Ki, and 0(n) time is taken to locally project each accumulated 'inertia' Ki to relative coordinates (i.e. BjKiBj). The overall complexity for forming the inertia matrix is thus 0(n + [log2 n~|) using n cpus arranged in a hypercube. The architecture for the matrix solver, which is used in the following phase, is best suited to a nearest-neighbour 2d systolic mesh. As a result, both the hypercube and mesh architectures are theoretically required for implementing the simulator (it must be noted that Hwang chose to implement the algorithm on a shared bus, and consequently our estimate is for an ideal architecture). Lilly/Orin III has not been parallelised at this time. The forward dynamics algorithms proposed by Hemami, Kim and Vanderploeg, Ibrahim, and Chapter 2. Existing Formulations of the Equations of Motion 27 Angeles are all relatively similar. They have not yet been implemented on a parallel architecture, although Hemami and Zheng [Zheng 86] have developed an architecture for inverse dynamics using inertially measured coordinates. One of these algorithms would be a logical reference point for the development of an alternative parallel formulation. 2.2.4 The Newton Euler State Space orthogonal complement algorithm In this thesis the Newton Euler State Space formulation has been chosen as a possible algorithm for parallelisation. In the remainder of this chapter the equations of motion described by Hemami are developed using relative rather than inertially measured coordinates. In Chapter 3 these equations will be made more efficient by referring each body's equations to its own coordinate system. Consider the n link open chain linkage depicted in Figure 1.1. Each body is numbered B\ through Bn, with a home body labelled 1, inertial frame labelled 0 (coincident with joint 1), and all other bodies labelled with increasing integer value as the distance along the direct path between an arbitrary body and home increases. The hinge on body i nearest to the home body and along the direct path is labelled joint i, about which the coordinate rotates. Each of the n joints has one degree of rotational freedom and is driven by an ideal rotational torque actuator. A body coordinate system (bcs) is attached to the centre of gravity of each body. In Hemami's formulation, the bcs was aligned with the principal axes of the body. As a result, the body's inertia matrix J, referred to it's own coordinates, is d\ag[Jxx, Jyy, Jzz ]. This alignment is restrictive since the representation of axes of rotation not aligned with the principal axes is difficult. In this thesis, it has been found that a Denavit-Hartenberg-type axis orientation is more suitable, especially for single degree of freedom (dof) joints. In such a coordinate description (Fig. 2.1) the axis of rotation at joint i is assigned to be the Z{ axis of 6cs, and the x,- axis is the common perpendicular between the z,_i and z, axes, pointing in direction from z,_i to z,. Although the coordinate origin is drawn at the hinge point in Fig. 2.1, it is actually positioned at the centre of gravity of each body in this algorithm since most of the body quantities are defined relative to the centre of gravity of each body. It is drawn at the hinge since it is easy to graphically identify the direction of the axis of motion zi and the x, Chapter 2. Existing Formulations of the Equations of Motion axis when the coordinate frame is shown at the hinge. 28 Figure 2.1: Body coordinate frames For each body, the Newtonian forces and torques acting on the free body are shown as in Fig. 1.2. 7°0 represents a vector of translational hinge constraint forces measured in and referred to inertial coordinates; AJ, represents rotational constraint torques at joint t, measured in and referred to bcsi coordinates, which act in the axes of motion (x, and yi) other than the actual Chapter 2. Existing Formulations of the Equations of Motion 29 rotational degree of freedom (z,); r/t represents ideal actuator torques at joint t defined in and referred to 6cs,, rotating about axis Z{. Newtonian equations of motion for body i are: /«',:,« _ fi iLi'.jiJ _ it: At 0 , ni \t _\i ni+1 \i+l Jtwt,0 — Jt,0 T S.t^OVi.O 'i,t+lAOli+l,0 + •fti,i/,i,i /1«+l-at'+l,t'+l/Vr-l,«'+l +rli-AUiritl+i (2-24) m,z°0 = mig + 7°o - 7°+i,o (2.25Equation 2.24 describes the torques applied to the ith body in 6cs, coordinates, including rotational constraint torques, actuator torques, and moments due to the translational hinge constraint forces. (It should be noted that vectors appear in lower case, and matrices appear in upper case, or have a ~ over the variable to indicate that a skew-symmetric matrix has been created from a vector. The superscript on each variable indicates the bcs to which the variable has been referred. The subscripts may have a number of meanings depending on the type of variable). Parameters not already denned are: 1. w\0 is the angular acceleration of body i (first subscript) measured with respect to the inertial frame 0 (second subscript) and referred to frame i (superscript). 2. /,?0 is the gyroscopic torque — (w\<0 x JiW\0) [Hemami 82]. 3. R\ t describes those axes (measured with respect to, and referred to, ith coordinates) in which the ith body cannot rotate (with one dof (z axis) per joint, R\;, is a 3x2 matrix of two unit vectors in the x and y axes). 4. is the rotation matrix referring a vector in i — 1 to ith coordinates. 5. is the skew symmetric matrix of vector A;),, measured from the centre of gravity of body i to the ith hinge (inboard (see Fig. 1.2)) and referred to the ith coordinate frame. 6. is the skew symmetric matrix of vector /J^+i, measured from the centre of gravity of body i (first subscript) to the outboard hinge i + 1 on body i (second subscript) and referred to the ith coordinate frame. 7. J- is the inertia matrix of a body measured from its centre of gravity (first subscript) and referred to its own coordinates (superscript). Because it is not aligned with the principal axes, J- is not necessarily diagonal. Chapter 2. Existing Formulations of the Equations of Motion 30 Equation (2.25) describes the translational forces affecting the centre of gravity (cog) of body i in inertially measured and referred coordinates. m,<7 is the force due to gravity, where g is [0 0— 9.81]r. Xi is the length vector from the inertial coordinate origin to the cog of body i. Given the above equations for n bodies, a matrix equation can be developed to represent the entire system: Jw = f + Hif + Nt\ + ET Mx = g + #27 J 0 0 M w / Ni A + E + 7 + 'x 9 H2 0 0 (2.26) where H\ and H2 are coefficient matrices for the linear constraint forces; N\ is the coefficient for the rotational constraint torques; and E relates the actuator torques to the appropriate coordinate frames. / and g are vectors describing the system forces generated by the gyroscopic torques and gravity respectively, and J and M are block diagonal matrices J = [J\,J2,—J£] and M = [mil, m2I, ...TTINI]. An example of these coefficient matrices for a five body chain is presented in Appendix A. The internal constraint forces 7 and A may be eliminated by multiplying (2.26) by matrices which are orthogonal complements of [HT Hj]7 and Ni, respectively. First, £ is redefined as a function of w using translational kinematic hinge constraint equations. An equation for the ith hinge constraint is illustrated in Fig. 2.2. The constraint and its second derivative, referred to inertial coordinates, are: ^fi + A^ll-xllfi-Al1ltl. = 0 (2.27) %fl - Ci,o + A?(Mfl)%i - A?_1(tB|:11i0)2/::11fI- + 4%<o - ^zhMzU = o (2.28) Equation (2.27) describes the paths through bodies i and i — 1 for getting to the ith joint and (2.28) is (2.27) twice differentiated (w is the skew symmetric matrix derived from w). The differentiated hinge constraint equation for the whole system is: U 'x = Qw + Qw x = U-^Qw+U-^Qw (2.29) (2.30) Chapter 2. Existing Formulations of the Equations of Motion 31 Figure 2.2: Vectors describing the ith hinge constraint Appendix A describes the internal structure of the matrices U,Q,Qw, U 1Q and U 1. x can be eliminated from (2.26) by substituting (2.30), resulting in: J 0 0 M I / Ni E 0 w = + 7 + A + T — U~lQ 9 H2 0 0 MU^Qw (2.31) Chapter 2. Existing Formulations of the Equations of Motion 32 Hemami describes [IT {U~lQ)TY as the orthogonal complement to [HT Hj]T, with the rows of [IT {U~^Q)T'^r spanning the tangent space of the kinematic constraint surface. The columns of [B~i Hl]T, which are normal to this constraint surface, are orthogonal to the tangent space. Premultiplying (2.31) by [IT (U~lQ)T) thus eliminates 7, since it can be shown that H2 = UT and H\ = —QT, and so the result is zero. A second transformation can be used to eliminate A and project the system to relative joint angle acceleration space 0. Kinematic equation 1.5 is used to project w\0 to relative joint angle velocities. Equation 1.5 can be written as <fi = M-iwt\,o + <,i-i (2-32) where u>| is the angular velocity relative to the previous body, but referred to its own coordinates. w\ t_x is related through kinematic differential equations to the joint angle rates 0{ using the rotational sequence Body-3 xyz [Hughes 86]: Wix cos Oiy cos 9iz sin 0iz 0 Oix Wiy = — cos 9iy sin 0iz cos 0 Oiy sin Oiy 0 1 Oiz When limited to a single dof in the z axis, 0,x and 0{y are zero, as 0ix = a, (the twist angle constant) and 0iy = 0. Thus if 0i = the above equation simplifies to as in equation 1.5. Note that z), is the z axis of the ith body referred to itself. Thus for the whole system w = CO and ii; = CO + CO (2.35) where C is described in Appendix A. 0 is the system vector of <?,2 joint rates. Hemami [Hemami 82] and Zheng [Zheng 84] projected from w to the state space of inertially measured joint angle rates, while Buchner projected to the relatively measured joint rates 0 used here. Hemami [Hemami 82] has shown that CTNi — 0, the rows of of CT being orthogonal to the Chapter 2. Existing Formulations of the Equations of Motion 33 columns of N\. Substituting (2.35) into (2.31) to get a new transform, we get: with the final equations < CT {U-^QCf 0 MU^Qw The final form, (2.37), is the same as that developed using the Lagrangian approach. On the RHS of (2.37), (U~lQC)T g are the forces of gravity, CTEr are the actuator torques, and the remaining terms constitute the centrifugal and Coriolis forces. The LHS coefficient of 0 is the inertia matrix. The constraint elimination transformation of (2.36) is also Kim's velocity transformation matrix [Kim 86a] and Angeles' matrix of twists. In Hemami's method, the transformation is used principally to eliminate constraint forces, and secondly to project to relative coordinates, while for Kim and Angeles the matrix is used with the kinetic energy expression to project to relative coordinates, since the hinge constraint forces do not naturally appear. Although equation (2.37) has the same final form as the equations developed using Lagrangian techniques, it is important to realise that it is not the principles of physics that determine the efficiency of the computation. The procedure and organisation for the formation of the transformation matrices play a more significant role. 2.2.5 Discussion In this chapter a number of formulations have been discussed. In general, they form three major schools of thought: the 0(n2) composite rigid body method, which is popular in robotics; 0(n) local axis projection algorithms; and 0(n2) or 0(n3) orthogonal complement global projection algorithms. I C ce = U-lQC (2.36) f motion: J 0 C r -| U~lQC 9 = CT (U-'QCf 0 M f 9 J C9 + E MU-lQ 0 (2.37) Chapter 2. Existing Formulations of the Equations of Motion 34 Composite rigid body algorithms have been parallelised by several researchers. Parallel 0(ra + riog2n]) and O([log2n]) algorithms have been developed for n and 2l2±ll cpu architectures, respectively. One drawback of these parallel algorithms is that the architecture developed for calculating the matrices is not ideal for solving the matrix equation that follows. Recursive 0(n) techniques were originally developed by roboticists, but have since been adopted by others. Forces and moments are locally projected onto hinge axes in a recursive manner. 0(n) algorithms are faster on a single cpu for systems with more than six degrees of freedom, but as noted by Fijany, a single chain cannot be parallelised at the equation level using this approach (parallel matrix multipliers can be applied, however). Orthogonal complement algorithms use a global Jacobian transformation to project from Carte sian to other coordinates such as relative joint angles, and eliminate internal constraint forces. Bae, Hwang and Haug's algorithm [Bae 88a] is a variation in which the terms are grouped in a different manner so that the projection is locally performed on an 'accumulated inertia' ma trix Ki. With the exception of Hwang, Bae and Haug's work [Hwang 88] which used a shared bus architecture, the orthogonal complement algorithms developed by Hemami, Kim, Ibrahim, Angeles and Lilly have not been parallelised. In the next chapter, an efficient version of the Newton Euler State Space algorithm is developed, and in Chapter 4, a parallel version of this algorithm for a 2 dimensional mesh architecture is presented. Chapter 3 Efficient Computation using Body Coordinates The thesis has thus far provided an overview of the fundamentals and some current research in the area of multibody dynamics. This chapter focuses on the efficiency and complexity of the Newton Euler State Space formulation in the context of a single cpu simulation. In this chapter each individual equation in (2.37) is referred to its body coordinates, rather than the inertial frame used by Kim and Vanderploeg [Kim 86a], or the mixed frames pro posed by Hemami (in Hemami's approach i, equations were referred to the inertial frame, and ibi equations were referred to body coordinates). It is shown that this leads to a computa tionally efficient recursive definition of the matrix elements in which the final transformation matrix is the same as the natural orthogonal complement matrix presented by Angeles and Ma [Angeles 88]. In addition, a connectivity matrix is used to extend the formulation to include branched systems, and a novel algorithm to calculate the force vector is presented. The computational complexity of the overall formulation, which includes the cost of the inertia matrix, the force vector, and the matrix solver, is used as a metric to compare other formulations. Finally, two linkages are simulated which demonstrate the correctness of the formulation. A PUMA 600 is modeled using joint angle, velocity, and acceleration profiles suggested by Angeles and Ma [Angeles 88]. A six degree of freedom model of the human torso is simulated using the same profiles to demonstrate the branched algorithm. 35 Chapter 3. Efficient Computation using Body Coordinates 36 3.1 Referral of the Equations to Body Coordinates 3.1.1 Single Chain Algorithm In equation (2.30), i, measured with respect to the inertial frame, is also referred to the inertial coordinate system. If each element in the x system vector is instead referred to its own bcs, (2.30) can be modified to hcs = (U^QhcW + (U-lQw)hca (3.1) An example of (U~1Q)bcs for a five link single chain is contained in Appendix A. By referring (2.30) to body coordinates, the computation of the transformation matrix in equation (2.36), now modified to [CT ((U-1 Q)bcsC)T]T', becomes more efficient due to its recursive nature. As an example, consider the computation of matrix elements (2,1) and (3,1) in (U~1Q)bcsC: [(^-1g)6«c]2ll = and d) A\z A\A\z Ajv_j • • • A\z — A j d2 \ Z ~f- &2 2^1^ = (A^d^A^ 4- k^^A^z = (^i^i ~l~ ^2,2) x A\z = pl,i x A\z = (A?(pia - l\,2) + kl2) x A\z where p\ 1 — kl 1 't.t+i (3.2) (3.3) (3.4) where d]+11 is the vector measured across body i from hinge i + 1 to hinge i. p2 a Is the vector from the centre of gravity of body 2 to the lower hinge of body 1, in body 2 coordinates. Chapter 3. Efficient Computation using Body Coordinates 37 Similarly, element (3,1) is computed thus: [{U-lQ)hcsC]^ = -^2^1^2,1 -^2^3,2 ^3,3 0 ' Z A\z A^-.-Alz A\A\d\%lz + A%<P3jA]z + kl>zA\A\z (•^•2^1^2,1-^2^3 + y42d|2A3 + fc^J-A^iZ (A32A\d\x + A\dla + fcfi3) x A\A\z P%,\ x A\A\z (Alipli-iy + kl^xAlAlz (3.5) (3.6) In this case is the vector from the centre of gravity of body 3 to the lower hinge of body 1 (joint 1), in &CS3, while A\A\z can be denned as c\x. The (i, j)th term of [U^QbcaC] is therefore [(U^Q^C]^ or as a system, [(U~1Q)bcaC] where pjj Ph and c\j = [pxC] - kj- •• A • 2; (3.7) (3.8) (3.9) (3.10) where [p x C] is a symbolic representation. />} • is equivalent in magnitude to the vector used by Angeles [Angeles 88] but with a reversed direction, and c\ • is equivalent to Angeles' e,j. In this thesis, the order of the cross product is reversed to compensate for the change of direction in p1-. The 3xiV matrices for cj— and p\j x cj—, where i = n; j = l,...n; are the components of the standard end effector Jacobian. In Angeles' method, J and M are factorised to J1J and MTM where M,- = y/mll^xz- In this thesis, J and M remain unfactorised. Chapter 3. Efficient Computation using Body Coordinates 38 Now that we have efficiently defined (U QbcsC) to be (p X C), equation 2.37 is modified to: [CT {P x Cf) JC M(P x C) e = CT (pxCf f + 9bcs i- 1 J E C6 + M(p x C) 0 or MO = [HTH]0 = b (3.11) In equation (3.11), the ith component of g^ is g\0 — miA'0g, the force due to gravity of each body when it is referred to body coordinates. The transformation matrix H = [CT (px C)T] on the LHS of (3.11) was developed independently of Angeles' work. Another interesting feature of the H matrix is its recursive nature. The matrix can be considered as a series of N 3 x N Jacobian matrices, the ith one representing a manipulator with i links. This characteristic was also noted by [Lilly 91]. 3.1.2 Branched systems algorithm Consider the branched system shown in Fig. 3.1. The two branches consist of branchl = {1,2,3} and branch2 ={1,4,5}. Each new branch starts from the home body (1) and is completely labelled out to the tip before a new branch is started. If a branch already has some bodies previously labelled, the next unlabelled body is tagged with the lowest unused integer. Consider body 1. Joint jl connects body 1 to ground, joint j2 connects body 2 to body 1, and joint j4 connects body 4 to body 1. Define l\2 and l\ 4 as the vectors from the cog of body 1 to the hinge connecting bodies 2 and 4 to 1, referred to bcs\. The / vectors point to hinges that are further from the home body. k\ a is the vector from the cog of body 1 to the joint connecting body 1 to ground (joint 1), or in the general case the body which is one closer to home. Then, define d\x = kxl — l]2 as the vector from joint 2 to joint 1, across body 1, in bcs\. In general the vector across body i from joint j to joint i can be defined as = fcjt- — l\j where i = prev(j), the body that is one closer than j on the direct path to the home body. The orientation of each body's coordinate system (not shown in Fig. 3.1) remains the same as for the simple chain (e.g. the X4 axis is parallel to the common perpendicular between the Z\ Chapter 3. Efficient Computation using Body Coordinates 39 Figure 3.1: A branched tree of five bodies and 24 axes). From Fig. 3.1, equations of motion similar to (2.26) may be defined (see Appendix B). The structure of the matrices differ from the single chain configuration as extra elements arise for bodies which have more than one outward body attached (e.g. body 1). The elimination of constraint forces proceeds as before, with the statement of kinematic trans lational hinge constraint equations, referred to the inertial frame: *i,o + 4**1,1=° T° 4. A°h2 - r° - A0!1 -0 X2,Q 1 Ji2K2,2 xl,0 — v x3,0 + -^3*3,3 ~ ^2,0 — -^2^2,3 = ° r° 4- A°h4 r° - A0!1 - f) x4,0 + A4K4,4 ~ zl,0 Alll,4 — u Chapter 3. Efficient Computation using Body Coordinates 40 T° _j_ A°h5 -r° - A°l4 - 0 (3.12) Referring each equation to body coordinates and then differentiating twice yields the branched versions of matrices U,U~l,Q and Qw (denned in Appendix B). As in the single chain case, the translational constraint forces are eliminated due to the orthogonality between [Hx Hjbcaf and [IT ((U~1Q)bcsC)T]T, and the rotational constraints are eliminated by the orthogonality between C and N\ [Hemami 82]. To automate the derivation of the kinematic constraints and the equations of motion, it is necessary to obtain U and f-1. This is because the C, (U^Q)^ and (p x C) matrices (shown below) have structures identical to Z7-1. U-1 is a trivially generated connectivity matrix describing the bodies that appear along the path between a given body and the home body. For example, row 5 in U-1 (in Appendix B) has an I in positions 1, 4, and 5, indicating that the path from home to body 5 includes bodies 1, 4 and 5. It is also useful to identify adjacent bodies. Considering body 4, the body previous to it in the path is found by searching for a -I in the fourth row of U. This occurs in column 1. Once U'1 is generated, Hi and U are easily constructed. U and U~x can be incorporated into simple algorithms for determining other matrices e.g. the computation of Qw and (p x C). As an example of the underlying structure of the matrices, the (U^Q)^ (3N x 3N) and C (3N x N), when computed from U~l, Q, and the kinematics equations (1.5) and (1.6) (Appendix B), are: •^1^2,1 *2,2 ° -^2^1^2,1 -^2^3,2 ^3,3 A\d\x 0 0 A\A\d\x 0 0 0 0 0 K4,4 /14"5,4 0 0 0 0 K5,5 ,and C = z 0 0 0 0 A\z z 0 0 0 A\A\z A\z z 0 0 A\z 0 0z 0 A\A\z 0 0 A\z z (3.13) Post multiplying (U~1Q)bcs by C results in a 3N x N matrix similar to the (p x C) matrix Chapter 3. Efficient Computation using Body Coordinates 41 generated in equation 3.8 but with non-zero elements in the same positions as those in U : [U^QbcC] = [PxC] = Pn x cn 0 0 0 P\\ X c21 ^22 X c22 ° 0 P%\ X c31 P32 X c32 ^33 x c33 0 0 0 0 p\\ x <4i o P%\ X C51 0 0 0 P444 xc444 0 PS4 X Cg4 p|5 X c|5 (3.14) Calculating pj • requires knowledge of the path between bodies i and j, which is found in U and U-1. A recursive algorithm to generate cj—, /?} • and [(U~1)QbcsC]\ j that uses and f/_1 to take into account the system topology is: for j = 1 to N { for t = 1 to N { for p = 1 to i if ([/,> = -/) K0 = P; ph [U-'QbcsC]^ } } c .• = z Kl,l U'<3 Ap(i)Cp(i),J I < J i < j (3.15) Since U~l and U are trivially generated, arbitrarily branched systems can be modeled in an automated manner by utilising this algorithm. 3.2 Inertia Matrix Calculation and Complexity Simulation of the dynamics requires the formation of the terms in (3.11), followed by the linear equation solution of MO = 6. In this section the computational complexity of assembling Chapter 3. Efficient Computation using Body Coordinates 42 the inertia matrix M for the serial chain algorithm (on one cpu) is estimated. Because the complexity of the branched algorithm depends greatly on the system topology, it is not useful to estimate the cost of an arbitrarily branched system. The inertia matrix in (3.11) is factored into two rectangular matrices, HT = [CT (p x C)T] (Nx6N) and r JC M(p x C) H = (3.16) which is (6N xN), where (p x C) = U^QbcaC. The algorithm for calculating (3.16) is: for j = 1 to N { for i = j to N { ci,i = z-c ph [U-'QbcsC]^ = •™t-lc«-l,j M-x(pth - kid + tij Pli x <; i > 3 i > j Vrx-mi(p\,j X Clj) The computations in (3.17) can be decomposed into the following phases: (3.17) 1. Forming the rotation matrices A]^ which refer a vector in 6c5,_i to 6cs,: cos Bi cos a, sin di sin a,- sin 6i — sin di cos ai cos 0, sin a, cos 0, 0 — sin ai cos a, (3.18) Assuming the n (sin#,, cos#i) values have been calculated, and sin a,- and cos a,- are constants, this phase requires An multiplies and 0 adds for an n-body chain. Chapter 3. Efficient Computation using Body Coordinates 43 2. Calculating cj- -, which is the zj • axis referred to fees,- coordinates. Consider the system C matrix for a five body chain: C = z1 0 0 0 0 A2z1 z2 0 0 0 A>2Z2,2 z3,3 0 0 Alzl,l AA72 A2Z2,2 \4 3 A3Z3,3 z4,4 0 A5z1 Ah72 /i2z2,2 A5?3 A3Z3,3 z5 z5,5 where cj- = sjj = zand cj-,- = AJ.^J.Jj (3.19) Each cj- • in equation (3.19) is a matrix-vector product requiring 8x and 54-. The C matrix thus requires ("~2Kn~1)(gx,5+), This is because the first diagonal consists only of constant zjj vectors, while the second diagonal refers constant z\ vectors to 6cs,-+i, which is equivalent to just selecting the last column of the Aj+1 matrix. The other diagonals require the computation of (3.19). 3. Calculating pj- -, the displacement vector from the centre of gravity of body i to the jth hinge: [U-lQ}bcs = P = P\,I o P2,l P2,2 0 0 ^3,1 p\,2 ^3,3 0 0 0 PA,\ P4,2 Pi,3 P4,4 0 0 0 0 P5,l Ph,2 P\,3 P%,4 P5,5 (3.20) The first diagonal values of the p matrix (i = j) are p{- • = k]^, so no computations are necessary. The rest of the elements require (n)(^~1)(8x,ll+). ^ 4. Multiplying every element in the ith row of the C matrix by J\. jir\ . l,...i (3.21) The first diagonal, i = j, can be calculated off line as the multiplicands and the results are constant, and hence the cost will not be counted here. The rest of the matrix requires i2H»z!l(9x,6+). Chapter 3. Efficient Computation using Body Coordinates 44 5. Calculating the cross product of c\ • and p\j, and then scaling by the mass mj. mi(pij x citj) (3.22) The first diagonal can be calculated off line. The cost of the rest of this phase is (n)(21~1)(9x,3+). Note that we have multiplied by J? and m,, instead of J\ and y/m^ This is because the force vector algorithm (to be described later) use C and (p x C) rather than JC or M(p x C). 6. The final phase in the formation of the inertia matrix is to multiply CT by JC and (p X C)T by M(p X C) according to: Mik = £ (cj„)r • (J/4fc) + (4.. x cjit.f • (m^ x cj)fc)) (3.23) Since the inertia matrix M. is symmetric, only half of the matrix needs to be com puted. The complexity for this phase is then (^ + ^ + § - n)(6x,5+) - "(n~^(lx, 1+). The —n(6x,5+) appears because terms multiplying two first diagonal elements, being constants, can be calculated offline. The —n^"2~1^(lx, 1+) term is present because rrij{Kp>- • x Cj j) vectors have zeros as their last elements. The total computational complexity for forming the inertia matrix is the sum of the previous phases' complexities, and is presented in Table 3.1. The algorithm is slightly more efficient than Angeles' because some of the terms are calculated off line. The complexity polynomial given by Angeles also has minor errors, and these have been corrected in Table 3.1, which presents the complexity for a number of single cpu algorithms. Table 3.2 gives data on the costs for calculating the inertia matrix for linkages of various size for various inertia matrix algorithms. The algorithm Lilly III, developed by Lilly and Orin [Lilly 91] and similar to [Bae 88a], is the most efficient in Table 3.2. In this study however, the proposed algorithm will be parallelised because it posesses an architecturally attractive computational structure, compared to Hwang's parallel formulation discussed earlier. Chapter 3. Efficient Computation using Body Coordinates 45 Algorithm force vector inertia matrix solver Angeles X + 105n - 99 90n - 95 n3 + 39n± _ 33n + g 5n3 , 29n2 46n , r "fi" + ~ ~ + 0 nJ . 3nJ , 4n 6, 2 ' 3 n3 i 3n2 i n (j t 2 1" 3 Trrl + 21n2 + 539n _ gl n3 + 16n2 + 75n - 90 Proposed X + 3n2 + 77n - 44 3n2 + 62n - 43 n3 + 39n± _ 49„ + g 5n3 i 29n2 61n , c 6 "•" 2 3 ~r ° nJ , 3nJ 2n 6 f 2 3 HI 4. „2 _ 7n 6 6 Zgi + 24n2 + 2^-36 n3 + 3^i + sin _ 38 Walker X + 137n - 22 lOln- 11 12n2 + 56n - 27 7n2 + 67n - 53 nJ , Sn2 2n 6; "•" 2 3 £ + n2-I* nJ i 27V • 577n 4Q 6 ' 2 ' 3 ^ «3 + 8n2 + 10^ln 64 Brandl X + 250n - 222 220n - 198 Fijany X + 9^ + 232ln 181 4n2 + 88n - 137 Lilly I X + n3 + 22n2 - 35n + 12 n3 + 15n2 - 26n + 10 Lilly III X + 25ny | 49n 37 8n2 + 40n - 48 He X + 91n-97 86n - 96 Table 3.1: Complexity of Forward Dynamics for One Cpu Algorithm 4 5 6 7 Proposed Walker Angeles Fijany Lilly I Lilly III 286x, 209+ 389x, 327+ 318x, 229+ 353 x, 279+ 288 x, 210+ 261x, 240+ 498x, 370+ 555 x, 457+ 538 x, 395+ 509 x, 403+ 512 x, 380+ 398x, 352+ 779 x, 585+ 741 x, 601+ 827X, 615+ 664 x, 535+ 810x, 610+ 560 x, 480+ 1135x, 851+ 953 x, 759+ 1191X, 894+ 848 x, 675+ 1188x, 906+ 747 x, 624+ Table 3.2: Computational Complexity for Inertia Matrix Assembly for One Cpu 3.3 Force Vector Calculation and Complexity The RHS of equation (3.11), which includes the actuator forces or torques plus the effects of Coriolis, centrifugal, and gravitational forces (/„), appears to be computationally expensive when calculated according to equation (3.11). It is commonly calculated more efficiently using the 0(n) inverse dynamics algorithm proposed by Luh, Walker and Paul [Luh 80] (Chapter 2). An 0(n) algorithm which is more efficient than Luh's, proposed by Angeles and Ma [Angeles 89] (Table 3.1), uses Kane's generalised force approach. In this section a new algorithm based on Angeles' algorithm is proposed. Although the proposed Chapter 3. Efficient Computation using Body Coordinates 46 algorithm is of 0(n2) complexity rather than 0(n), it is slightly more efficient than Angeles' for systems with six or less degrees of freedom, as it makes use of the HT = [C (p x C)]T matrix, which has previously been calculated during the computation of the inertia matrix. In contrast, Angeles' algorithm does not make use of the previous inertia matrix calculations. Consider a chain of n bodies. Assuming there are no dissipative forces such as friction, the equation of torque is [Angeles 89]: /« = t[(^)T^o + (%T™3no ~ 9)] (3-24) and are known as the partial velocities, and are the jth 3xn Jacobian row matrices found within the C and (p x C) matrices, respectively. The angular momentum derivative Zj0 from (1.12), measured with respect to the inertial frame and referred to the jth frame, is: The translational acceleration Xj is the acceleration of the Xj vector measured from the inertial frame to the centre of gravity of body j. It may be replaced by pj, which as defined here includes the effect of the acceleration due to gravity (i.e. pj = Xj — g). Angeles' recursive equation for pj is: p£o = Al_1(fiZ\fi+qz\fixlll\j+^^ (3-26) An efficient algorithm for calculating p\0 and L\0 is [Angeles 89]: <o = M-iwtlo + JA (3-27) <o = A^wizlfl + w^xzi^ + zi^i (3.28) Let Wi = <b;i0 + (tZ>;,0)2. (3.29Then #:0 = Ai^Zlo + Wi-tfcl^-Wikli (3.30) Equations (3.27) and (3.28) are simply kinematic equation 1.5 referred to body coordinates. Equation (3.29) combined with (3.30) is an efficient method of representing (3.26), while (3.31) is (1.12) referred to body coordinates. Chapter 3. Efficient Computation using Body Coordinates 47 Angeles substituted the 3xn row matrices in C and (p x C) into (3.24) to get: and t = l,...n t = 1, ...n (3.32) (3.33) where (cj,)T = (cjit)TA« and (pj,- x cj„)TmJ^0 = (cj„-)r(pj,- x mjpji0)- Angeles showed that equation (3.33) is computable in a backward recursion in O(n) time (Table 3.1). 3.3.1 0(n2) force vector algorithm and complexity In this section we define a new algorithm for the force vector which makes use of the [CT (p x C)T] matrix. Equation (3.24) may also be written as fv = cT (pxcy L mp (3.34) where [L mp] is the vector of L{ and m^p,, i — 1, ...n. Thus it can be seen that the momentum derivatives are related to (3.11) by L f 0 J = H \ + — mp \ 9bcs M(p x C) cej (3.35) The complexity of computing equation (3.34) is 0(n2) due to the matrix-vector multiplication, and requires the initial computation of L and mp from equations (3.27) to (3.31), followed by the matrix computation of (3.34). The cost may be estimated as follows: 1. Calculating w\0 requires (n- l)(8x,6+) using (3.27). w\0 = z\ x0\ is a scalar assignment operation. 2. Calculating w\Q requires the last term in equation (3.28) to be ignored (the 0,- are set to zero to exclude the inertial terms). The second term is a cross product which only requires 2x since has only one nonzero element. Thus the complexity is (n - l)(10x,7+), with w\0 = 0. Chapter 3. Efficient Computation using Body Coordinates 48 3. Calculating Wi requires (ra - l)(6x,9+), plus lx for W\. 4. Calculating pi requires (ra — l)(26x,23+), plus 12x,9+ to calculate p|0 = -Alg - Wxk\A (3.36) where g = [0 0 — 9.81]T. The first term in (3.36) compensates for gravity by adding an additional acceleration to that induced by the angular velocity w°0. 5. Calculating m,p, requires ra(3x). 6. Calculating X, in (3.31) requires (ra - l)(24x, 18+) plus 15x,10+ for L\fi. 7. Calculating the matrix-vector product in equation (3.34). The first diagonal elements in C are z = [00 1]T, and so no cost is associated with dot products involving these terms. The complexity of (3.34) is "Y"~ 1)-(6x,5+) + ra(3x,3+)+ ra(n~1)(l+) = 3ra2(x,+). (3.37) (3.38) 2 The first term calculates the dot products Li Ttlipi for those computations not involving the first diagonal. The second term adds the com putations for [(p\i x cj-,)][m,p,], and the last term describes the operation in which the rotational and translational dot products are added together to form the force vector. The complexity for the proposed force vector algorithm is given in Table 3.1. Complexity values for multibody systems of various size are given in Table 3.3, which shows that the algorithm presented here is more efficient than Angeles and Walker's algorithms for systems of six dof or less, but less efficient than He's. For parallel formulations, however, the thesis will demonstrate in the next chapter that the algorithm developed here integrates very well with the parallel inertia matrix algorithm, making it a good choice for parallel implementation. The force vector for a branched system can be calculated if the [CT (pxC)T] matrix is generated using equations (3.13) and (3.15), and if the kinematic equations (3.27) and (3.28) take into account the branching. Chapter 3. Efficient Computation using Body Coordinates 49 dof Proposed Angeles Walker He 4 312x, 253+ 321 x, 265+ 526 x, 393+ 267 x, 248+ 5 416 x, 342+ 426 x, 355+ 663 x, 494+ 358 x, 334+ 6 526x, 437+ 536 x, 445+ 800 x, 595+ 449 x, 420+ 7 642 x, 538+ 636 x, 535+ 937 x, 696+ 540 x, 506+ 8 764 x, 645+ 741 x, 625+ 1074x, 797+ 631 x, 592+ 9 892 x, 758+ 846x, 715+ 1211 x, 898+ 722 x, 678+ 10 1026x, 877+ 951 x, 805+ 1348 x, 999+ 813 x, 764+ Table 3.3: Complexity of single cpu force vector algorithm 3.4 Equation Solving The Cholesky decomposition linear equation solver [Walker 82] was used as the matrix solver for the single cpu case, as the inertia matrix is positive definite and symmetric. The complexity of the Cholesky algorithm, which consists of a decomposition and subsequent backsubstitution, is given by Walker as: £ x = n3/6 + 3n2/2 - 2n/3 = n3/6 + n2-7n/6 (3.39) 3.5 Simulations In this section the simulations of two linkages are discussed. Models of a PUMA 600 six degree of freedom rotary jointed manipulator, and a six degree of freedom model of the human torso (trunk, head, and two upper and lower arms) have been implemented on single cpus. A single cpu simulation of the excavator will be presented in Chapter 5. The PUMA 600 was chosen as an example of a rotary manipulator which is commonly discussed in the literature. In this thesis, profiles of 9, 0, and 9 given by Angeles and Ma [Angeles 88] have been used to drive the inverse dynamics algorithm previously developed in this chapter in sec. 3.4. Although the desired profiles are relatively slow, and hence may not excite all the dynamics of the system, it was necessary to produce results which could be directly compared to those existing in the literature. The inverse dynamics then controls the manipulator model Chapter 3. Efficient Computation using Body Coordinates 50 Joint a.-(deg) a,(m) Mm) Initial 0, 1 0 0 0 0 2 -90 0 -0.149 0 3 0 0.432 0 0 4 90 0.02 -0.432 0 5 -90 0 0 0 6 90 0 -0.056 0 Table 3.4: Denavit Hartenberg parameters for the PUMA 600 in an open loop control mode. All parameters are in metric units, with mass in kg, inertia in kg • m2, length in m, and torque in Nm. The human torso was chosen as an example of a branched system. Open loop inverse dynamics control was used to simulate joint profiles similar to those used for the PUMA simulation. 3.5.1 Puma 600 manipulator A block diagram of the system appears in Fig. 3.2. The length parameters for the PUMA 600 are given in Table 3.4 (Denavit Hartenberg parameters [Angeles 1988]) and the mass and inertia parameters are given in equations (3.40) to (3.42). Fig. 3.3 is a diagram showing inverse dynamics Td PUMA 600 M0=G(9.e.Td) e e e Figure 3.2: Block diagram of the open loop inverse dynamics controlled PUMA 600 dynamics simulation the Denavit Hartenberg coordinate frames attached to the PUMA. In the coordinate system adopted in this thesis (and used in the following equations), joint i is labelled as the lower joint on link i. Because the coordinate frame is attached to the centre of gravity, the following conversions are required to generate k and / from the Denavit Hartenberg parameters and from Chapter 3. Efficient Computation using Body Coordinates 51 Angeles' paper: Chapter 3. Efficient Computation using Body Coordinates 52 d) V • bi+i sin a,+i -bi+i cos a,+i Ki,i "•+!,« resulting in and 73 J3 J,5 mi m2 m3 7714 m5 m6 = 10.521, jfcJJ = [0 0.054 0], l\T2 = 15.781, k%T2 = [-0.14 0 0], l22T3 = 8.767, ifc3^ = [0 0.197 0], /|J = 1.052, k\TA = [0 0 0.057], = 1.052, klf5 = [0 0.007 0], /fr = 0.351, 41 = [0 0 0.019]. .6 — [0 0.203 0] [0.292 0 0] [0.02 -0.235 0] [0 0 0.057] [0 -0.049 0] 5 — 1.6120 0 0 0. 0.5091 0 0 0 1.6120 3.3768 0 0 0 0.3009 0 0 0 3.3768 0.0735 0 0 0 0.0735 0 0 0 0.1273 72 T4 76 -0.4898 0 0 0 8.0783 0 0 0 8.2672 0.1810 0 0 0 0.1810 0 0 0 0.1273 0.0071 0 0 0 0.0071 0 0 0 0.0141 (3.40) (3.41) (3.42) integration algorithm The simulation used a fourth-order fixed step Runge-Kutta method described by [Press 88]. The step size was set at h = 0.01s, and the model was run for 10 seconds of simulation time. Desired values of 0j, 0\, and 0& were fed into the inverse dynamics algorithm so that the desired Chapter 3. Efficient Computation using Body Coordinates 53 actuation torques rd were available to the PUMA model at every function evaluation (four evaluations per step h). inverse dynamics The 0(n2) inverse dynamics algorithm developed earlier was implemented on one cpu. The desired profiles were generated according to the following equations: MO = (tw„i - sin(ti»Bt))/2 MO = (Wn - Wn COS(u>„r))/2 MO = wl sm(wnt)/2 i'=l,...6 (3.43) where wn = 2it/T, (T=10s) and t is the time variable. The torques r,d(i) are calculated in correspondence with the four time points required by the integration algorithm- once at f (operation [1]), twice at t + h/2 ([2]and [3]), and once at t + h ([4]): MO = f(ed(t), ed(t), 9d(t)) [i] rid(t + h/2) = f(9d(t + h/2), 9d(t + h/2), 9d(t + h/2)) [2,3] M* + h) = f{9d(t + h), 9d(t + h), 9d(t + h)) [4] (3.44) where the rd are calculated using the algorithm described in sec. 3.3.1, with 9{ = 9{d included in equation (3.28). forward dynamics The equations for the PUMA model are calculated on a separate cpu. The r,d are read in at every function evaluation, and are used as the actuator torques ra and added to the force vector fv to get b (fv is calculated using sec. 3.3.1, but 0, is set to zero). The simulation algorithm calculates /„, b and M and then solves for 9. The Runge-Kutta algorithm then integrates 9 to get 6 and again to get 9. After four evaluations, the algorithm calculates a weighted response for states 9, 9 and 9 at time point t + h (see Chapter 5, section 5.4.1 for an exact description). Chapter 3. Efficient Computation using Body Coordinates 54 time (s) Figure 3.4: 0d, 0d, 0d profiles for all joints of PUMA 600 model. Overlaid are the 0, 0 and 0 responses for joint 6. simulation results Fig. 3.4 shows the desired 0d, 0d, and 0d, and Fig. 3.5 shows the rd (labelled as tql to tq6) generated by the inverse dynamics algorithm in response to the desired profiles. The simulator response for joint 6 is also shown in Fig. 3.4, and is indistinguishable from the desired values. All of the links produce a similar response when overlaid. Fig. 3.6 shows the error curves for link 6, which show maximums of -1 x 10_4deg for 0, —5 x 10_5deg/s Chapter 3. Efficient Computation using Body Coordinates 55 I Time (second) Time (second) 1 Time ( second) Time ( second) Figure 3.5: rd profiles for joints 1 to 6 generated by the inverse dynamics of the PUMA 600 model for 0, and 6 x 10-5deg/s2 for 9. These results are similar to those obtained by Angeles and Ma [Angeles 88] (while their errors were similar in magnitude, their step size was larger Chapter 3. Efficient Computation using Body Coordinates 56 time (s) Figure 3.6: Errors 0d -0,0d- 6, 0d - 0 profiles for joint 6 of PUMA 600 model (thus requiring less computations)). Angeles' performance is slightly better (computational cost-wise) due to their using the fifth/sixth order DEVERK integration algorithm from IMSL, which calculates eight evaluations (rather than four) of the derivative before advancing one step h (this gives a better estimate). The time taken for 10 seconds of simulation was 71.12 seconds of computer time. This time was measured assuming no results were stored and that the stack was initialised within the internal RAM of the Inmos T800 transputer (a microprocessor designed for parallel processing). The time taken for one evaluation (inertia matrix formation, force vector formation, and matrix solution) is 17.2 ms. Theoretically, the complexity estimate from Tables 3.2, 3.3 and equation (3.39) for n=6 is Chapter 3. Efficient Computation using Body Coordinates 57 1391 x, 1087+ (also shown in Table 3.5 at the end of this chapter). If a multiply and an add costs 1 ^s and 0.5 ps respectively in a transputer, the theoretical computation time should have been 1.93 ms. Adding the cost of n (sin, cos) calculations at 7/xs each, the total increases to 2.02 ms. This estimate would meet the maximum evaluation time of 2.5 ms when a fourth order Runge Kutta and /i=0.01 are assumed. The actual computations are thus 17.2/2.02= 8.5 times slower due to superfluous computations required for loop counters and array indexing, external memory fetches for global variables, and the cost of entering and leaving subroutines (the cost of saving data on the stack). These costs are unavoidable unless the modularity of the formulation is removed by not using indexed arrays and by replacing all subroutines with inline code. 3.5.2 Human torso In this model of the torso, the head, trunk and left and right upper and lower arms are assumed to have one degree of freedom per joint. Fig. 3.7 describes the structure of the system, including the body coordinate systems. The mass and length parameters used by the formulation are: [0.15 0 0.15] [0 0 0.242] lis = [-0.15 0 0.15] m2 = 5.0, k%t2 = [0 0.12 0], m3 = 2.8, Jfc^3 = [0 0.2 0], /J4 = [0 - 0.25 0] m4 = 1.4, kj4 = [0 0.2 0], ms = 2.8, kj5 = [0 - 0.2 0], = [0 0.25 0] m6 = 1.4, fcje = [0 - 0.2 0], (3.45) J\, the trunk inertia, is obtained from [Khosravi 87] while the rest of the inertia parameters are found by approximating each limb as solid cylindrical links with radius r,- and length /, (and Chapter 3. Efficient Computation using Body Coordinates 58 for the head, a solid sphere of radius r2). The values are: r2 = 0.12, r3 = 0.05, r4 = 0.05, r5 = 0.05, r6 = 0.05, and /3 = 0.45, l4 = 0.4, l5 = 0.45, /6 = 0.4 Chapter 3. Efficient Computation using Body Coordinates 59 These values only approximate those given by Khosravi, since in his case the arms are modeled as one single link, as he was modeling a four link chain performing a somersault. Thus: 75 -J5 — 2.7 0 0 2T7127^ 5 0 0 0 0.40 0 , «/2 — 0 2m2r| 5 0 0 0 2.7 0 0 2m2r| 5 m3(r4 + ft) 0 77131-3 2 0 0 0 m5r5 0 0 74 -^4(4 + 5) 0 0 tn<r, 2 0 rn4(r-t + fe) 0 m5Ci- + fe) 0 -2 0 0 m5(^ + i|) The twist angles (in degrees) are: 76 _ 7716 rt 0 -2 ll 0 meC-f + ^|) 0 0 0 0 Mt4+h (3.46) ai = 0, a2 = -90, a3 = 90, a4 = 0, a5 = -90, a6 = 0 (3.47) connectivity matrix The branched formulation makes use of the following connectivity matrix U~l to describe the topology: r 1 1 1 1 0 1 10 11 1 0 0 0 1 1 0 0 0 1 1 c/-x = (3.48) integration, inverse dynamics and joint profiles The integration algorithm uses the fourth order fixed step Runge-Kutta method employed for the PUMA 600 model, except that the step size was set at h = 0.001, as larger values Chapter 3. Efficient Computation using Body Coordinates 60 dof Proposed Angeles Walker Brandl 3 367 x, 277+ 399 x, 306+ 631x, 513+ 528X, 462+ 4 630 x, 484+ 679 x, 530+ 907 x, 742+ 778 x, 682+ 5 969 x, 752+ 1029x, 810+ 1208x, 991+ 1028 x, 902+ 6 1391 x, 1087+ 1456x, 1152+ 1537x, 1261+ 1278 x, 1122+ 7 1903 x, 1495+ 1967x, 1562+ 1893 x, 1553+ 1528 x, 1342+ 8 2512 x, 1982+ 2569x, 2046+ 2279 x, 1868+ 1778X, 1562+ 9 3225 x, 2554+ 3269x, 2610+ 2694 x, 2207+ 2028 x, 1782+ 10 4049 x, 3217+ 4074 x,3260+ 3141 x, 2571+ 2278 x, 2002+ 11 4991 x, 3977+ 4991 x,4002+ 3619x, 2961+ 2528 x, 2222+ 12 6058 x, 4840+ 6027x, 4842+ 4131x, 3378+ 2778x, 2442+ Table 3.5: Complexity of complete forward dynamics for single chains on one cpu caused instability due to numerical error accumulation, leading to incorrect results. The inverse dynamics algorithm is the same as that derived earlier in the chapter, but adapted to suit the branched algorithm. The joint angle, velocity and acceleration profiles are similar to those suggested by Angeles and Ma for the PUMA 600, except that the period T of the profile is 5 seconds rather than 10. simulation results Fig. 3.8 illustrates the 9j, 6d and 0d profiles applied to each of the joints. Fig. 3.9 shows the torque profiles produced by the inverse dynamics, and Fig. 3.10 shows the errors between the desired and actual angle, velocity, and acceleration profiles for joint six. The results of the simulation of joint 6 are overlaid on top of the desired reponses in Fig. 3.8. The errors in joint 6 plotted in Fig. 3.10 indicate that the human torso model was simulated correctly although instability occurred when simulating long periods of time (approximately 5 seconds). This is due to the dynamics of the system, as the higher frequency dynamics are not adequately modeled by the integrator, due to step sizes that are too large. This problem can be overcome with smaller step sizes or more sophisticated integration algorithms. Chapter 3. Efficient Computation using Body Coordinates 61 time (s) Figure 3.8: Od, &d, &d and profiles applied to all joints for human torso model. Uveriaia are uie reponses for joint 6. 3.6 Summary This chapter has examined the complexity of the Newton Euler State Space algorithm for for ward dynamics when calculated on one cpu. By referring all equations to body coordinates, and taking advantage of the fact that the force vector calculation uses results previously cal culated from the inertia matrix calculation, efficient algorithms for the inertia matrix and the force vector have been presented. The algorithms incorporate a connectivity matrix which permits the modeling of branched mechanisms. In the case of a single chain mechanism, the transformation matrix [CT (p x C)T] is the same as Angeles and Ma's orthogonal complement matrix [Angeles 88]. Buchner [Buchner 86] also referred both the translational and rotational Chapter 3. Efficient Computation using Body Coordinates 62 Time (second) _tg3_ Time (second) _tg5_ Time (second) o 3 cr E Z ^—' s cr Time (second) tq4 Time (second) tq6 Time (second) Figure 3.9: TJ profiles for all 6 joints generated by the inverse dynamics for human torso model equations to body coordinates, but the equations were not computed in their most efficient form. As a result the simple recursive nature of the vector cross product (pjj X c\j) was not Chapter 3. Efficient Computation using Body Coordinates 63 time (s) time (s) 0.5 0 -0.5 -1 -1.5 0 5 time (s) Figure 3.10: Errors 9d — 9,9d — 0, 9d — 9 profiles for joint 6 of human torso model used. In addition, Buchner did not explore the representation of branched mechanisms. The complexity estimates in Tables 3.1 and 3.5 show that when modeling low degree of freedom (n < 6) mechanisms on one cpu, such as an excavator or some of the robots in use today, the algorithms proposed here are of comparable (although not superior) efficiency to those with which we have compared. Table 3.5 lists the number of multiplies and adds for chains of up to twelve dof, while Fig. 3.11 graphs the results from Table 3.5, assuming the cost of an add is half that of a multiply (the complexity is thus measured in /xs). Fig. 3.11 shows that the proposed algorithm (P) is very similar to Angeles' algorithms. Brandl's algorithm (B) eventually becomes more efficient due to the 0(n3) complexity of the matrix solver. In the last part of this chapter, the single chain and branched formulations were verified. The Chapter 3. Efficient Computation using Body Coordinates 64 formulations were implemented on single cpus and the PUMA 600 and a human torso were modeled. Open loop inverse dynamics simulation using joint profiles taken from the literature show good agreement with previous results. This thesis has thus far established the equations and complexity for the Newton Euler State Space formulation when implemented on a single cpu. In the next chapter a parallel version of this formulation will be presented which is superior in computational complexity for mechanisms with relatively large numbers of links. Chapter 3. Efficient Computation using Body Coordinates 65 9000 | | I | I I JU , 1 1 1 —I 3 4 5 6 7 8 9 10 11 12 n degrees of freedom Figure 3.11: Graph of complexity estimates for formulations in Table 3.5. Algorithms are (A) - Angeles, (W) - Walker, (B) - Brandl, and (P) - proposed in thesis Chapter 4 A Parallel Dynamics Algorithm In this chapter the algorithms described in Chapter 3 for the inertia matrix, force vector, and solver phases are parallelised and integrated together. New parallel algorithms using a two dimensional, triangular, Multiple Input Multiple Data (MIMD) cpu array (Fig. 4.1) with a systolic/wavefront approach have been developed for the inertia matrix and the force vector calculations. The first row in the array, row 0, calculates the force vector b (gravity, Coriolis, centrifugal and actuator torques). Simultaneously, rows 1 through n calculate the inertia matrix. The leftmost cpu, cpuo,o, connects to the host computer. Once this is accomplished, b{ resides in row 0 cpuo,; (i = 1, ..n), and the inertia matrix element Aiij = resides in cputJ. The data is then shuffled to the edge of the array, and the equation A49 = 6 is solved using a new feedforward systolic solver derived by [Jainandunsing 89] which is more efficient than previous algorithms. 4.1 Integrating the Inertia Matrix and Force Vector Computations The equations of motion derived in the previous chapter can be stated as [HTH)9 = H L + Ta = fv + Ta (4.1) mp or M9 = b (4.2) where ra is the vector of actuator torques. The calculation of both the inertia matrix M. and /„ (rQ is calculated separately) can be divided into two phases: 66 Chapter 4. A Parallel Dynamics Algorithm 67 Host SUN VME Bus cpu11 cpu12 cpu, 22 cpu13 cpu23 CPU33 col 0 col 1 col 2 col 3 col 4 cpu0fJ H>U01 cpu02 CVU03 row 0 cpuu cpU24 row 1 row 2 cpu^ cpu '44 row 3 row 4 Figure 4.1: The triangular 2d mesh array used to implement the dynamics algorithm 1. Calculating the 6n x n matrices H and H (H is a component of H), and the 6n x 1 momentum derivative vector [L mp]T. Row 0 of the array in Fig. 4.1 can be used to calculate the derivative vector, while simultaneously rows 1 to n are used to calculate HT and H. 2. Multiplying H and [L mp]T by H1 In this chapter the M and fv calculations are parallelised so that the computations occur in a balanced manner, where both the computations of H and [L mp] finish simultaneously. Chapter 4. A Parallel Dynamics Algorithm 68 This is necessary so that phase 2 can proceed in a synchronised manner. Following this, the array is reused to solve the matrix equation. This is easy to implement using the homogeneous architecture described here, in comparison to the architectures of [Lee 88] and [Fijany 89], which require different architectures for the equation formation and for the solver. 4.2 Parallel Calculation of H The calculation of H requires the recursive evaluation of equation (3.17): H = JC M(p x C) (4.3) The architecture in Fig. 4.1 can be used to compute equation (4.3) in parallel. The following notes are descriptions of the calculations carried out using this architecture: calculate A\_^ In each cpu in row 1 (Fig. 4.2), cpuit,, the rotation matrix AJ_1 is calculated from equation (3.18) (operation 1 in Table 4.1). The 0, are assumed to be resident. No communication is necessary, and the cost of computing sine and cosine functions (1 of each) are excluded from Table 4.1. Note that the Inmos T800 transputer processor can compute the sine and cosine functions, and thus it is not necessary for the host computer to send these values (assumed by [Lee 88]). calculate cj- • and p\j Calculation of cj • and pj- • (equations 3.19 and 3.20) proceeds using row 1 as a systolic pipeline, in which cj • and then pj • j = 1,...» (operations 2 and 3 in Table 4.1) are alternately calculated in cpui)t- (Fig 4.2). Once a cj- • vector is calculated, it is passed eastward (e) for the calculation of cj+j • by cpu^j+i according to equation (3.19). The vector is also sent southward (s) to the lower rows in the mesh in preparation for operations 4 and 5. The communication of c\j (send Chapter 4. A Parallel Dynamics Algorithm 69 i-1 C. i-1 1 e w 2 e w 3 e w 4 A3 A2 s s s s 11 1 11 21 2 21 2 ) 22 2 22 31 3 31 3 32 3 32 3 I 33 3 33 41 4 41 4 > 42 4 42 4 > 43 4 43 4 > 44 4 44 row 1 time Figure 4.2: Pipeline computation of A\_:, c|j, and p\ • in cpui>t (row 1) east and south; receive west) and the computation of p\; • is assumed to be simultaneous using the Inmos T800 transputer or the TMS320C40 as the cpu, as they both have DMA controllers for each communication link, enabling simultaneous communication and computation. Note that complete overlap is difficult to achieve with current transputer technology (the cpu chosen Chapter 4. A Parallel Dynamics Algorithm 70 for implementing these algorithms) as the amount of data to be transmitted is small, resulting in a communication setup time that becomes relatively significant. Theoretically, however, it has been assumed here that a complete overlap between the multiple communications and the computation is achievable. The time taken to send a vector of three 32-bit floating point numbers is about 12 microseconds. A matrix-vector product or cross product is considered here to be the smallest atom of computation, since the cost is about 12 microseconds (and therefore the computation can overlap the communication of a 3 X 1 vector). Fig. 4.3 shows the final distribution of elements in the part of the array used to calculate M (rows 1 to 4), and Table 4.2 is the schedule for operations 1 to 5 in Table 4.1, for a four link chain. Operations 1 to 11 in Table 4.2 represent atomic operation cycles. col 1 col 2 col 3 col 4 c! P1 11 11 w 2 2 e w 3 e w 4 C 41 A C 21 P 21 C31 n n n r S r S s 2 <» < 4 C22 J C32 C42 n n i s rS •x 3 4 4 C33 P 33 C43 n - s 4 C44 P4 44 row 1 row 2 row 3 row 4 Figure 4.3: Final distribution of c\j and p\j vectors Chapter 4. A Parallel Dynamics Algorithm 71 # operation cost comms cpus no. floats 1 4-1 4x 0 n 2 ii (n - l)(8x, 5+) 3n n(n+l) , 2 3 (n - l)(8x, 11+) 3n n(n+l) 2 4 9x, 6+ 0 n(n+l) 2 5 9x, 3+ 0 n(n+l) 2 6 iiVKk) n(3x, 2+) 0 n(n+l) 2 7 P\M x c{k n(6x, 3+) 0 n(n+l) 2 8 (p x C)TM(p x C) n(3x, 2+) 3n n(n+l) , 2 v 9 n(2+) n n(n+l) 2 Table 4.1: Computational complexity for parallel H algorithm cycle cpun CPU12 CPU 13 Cpu 14 COSt to CpU 14 1 A1 4x 2 C2,2 c3,3 0x,0+ 3 PL P2,2 P3,3 4 P4.4 0x,0+ 4 Jlcl,l -2 c2,l c3,2 4.3 8x,5+ 5 mi (Pi.i x cjfl) P2.1 P3.2 4 04,3 8x,ll+ 6 •^2C2,1 c3,l C4,2 8x,5+ 7 ™2(P'2,1 X C^) P4.2 8x,ll+ 8 J3C3,1 cii 8x,5+ 9 msipii x 40 8x,ll+ 10 74.4 •M C4,l 9x,6+ 11 m4(p\A x c^) 9x,6+ Table 4.2: H computation schedule for row 1 cpus calculate J\c\- and m,(p| • X c\j) All cpus cpu^, in rows 1 through n calculate equations (3.21) and (3.22) with no communications between the cpus (operations 4 and 5 in Table 4.1, and also the last two operations in each column of Table 4.2). The computational complexity of operations 1 to 5 in Table 4.1 is 5 = («-l)(16x,16+) +(22x,9+) 5 ]T = (16n + 6)x, (16ra-7)+ (4.4) Chapter 4. A Parallel Dynamics Algorithm 72 where the computations and communications of cj j and pj- j synchronise the cycle of computa tion, and all communications have been overlapped by computations. 4.3 Parallel Calculation of [L mp] In Chapter 3 an 0(n2) algorithm was presented for calculating the vector /„ of gravitational, centrifugal and Coriolis forces. In this section an 0(n) parallel algorithm for /„ is developed which integrates well with the parallel H algorithm just presented. The calculation of r0, the actuator torques, will not be be discussed until Chapter 5, since they are dependent on the type of drive mechanism employed, and can be considered as a separate computational problem. From equation (3.34), the generation of /„ reduces to a problem of premultiplying the vector [L mp] by H. In this section we calculate [L mp] in a parallel manner. Equations (3.27) to (3.31) can be calculated by row 0 in a systolic pipeline. The calculation of wiQi ^ioi WiU and p\, computed in cpuo,;, requires w\z\to, wj~j0, and p,_i to be sent from the left neighbour cpuo,,_i (h, ki, and p, are simplified forms of /|it+i, k}fi and pj0 used to simplify the notation in Table 4.3). Note that £, = 4- w\0 x a,-, where a,- = J\w\Q and bi = Jlw\Q (here 6, is a dummy variable, and is not related to equation 4.2). From Table 4.3, the computational complexity is: = (n- l)(18x,13+) + (63x,50+) = (18n + 45)x,(13n + 37)+ (4.5) L,mp The computational cost of w*i0 and w\Q, (n— l)(18x,13+), synchronises the computation cycle and dictates the complexity of the calculation. Although w]*\ Q is calculated two computational atoms after w\0 (due to the one atom delay in sending the w\0 vector to cpuo,t+i), the cost of sending w\0 to cpu0,,+i is hidden by the computation of u>)i0 in cpu0,i. 4.4 Parallel Multiplication by HT Once the schedules in Tables 4.2 and 4.3 have been completed, both the vector [L mp) and the rectangular matrix H must be multiplied by HT. Chapter 4. A Parallel Dynamics Algorithm 73 cycle cost to cpuoi cpuoi cost to cpu02 cpu02 CPU03 CpU04 1 4x A1 4x A Ai -^3 2 0x,0+ 3 0x,0+ 4 lx Wi 8x,6+ 5 9x,6+ 10x,7+ wlo 6 9x,6+ 6x,9+ W2 7 8x,ll+ Pi 9x,6+ -W2k2 8 9x,6+ ax = Jlu>l0 9x,6+ w2i2 w3 <0 9 9x,6+ h = Jlw{0 8x,ll+ fa -W3k3 ^4.0 10 9x,6+ Li,m\p\ 9x,6+ a2 = ^2^2,0 w3i3 W4 11 9x,6+ b2 = J'iw'io fa -W4k4 12 9x,6+ a3 = J3u>3.0 W4l4 13 b3 = Jgwlo P4 14 1>Z, a4 = Jiw\Q 15 64 = J4w\Q 16 L4,m4p4 Table 4.3: Schedule for calculating [L mp] The atoms of computation which determine the computation cycle in the inertia matrix algo rithm, the c|j, p\: • computations (16x,16+) are comparable to their counterparts in the force algorithm, the (18x,13-(-) w, and W{ computations. The constant overhead for calculating [i mp] (63x,50+), however, is significantly larger than the overhead for H (22x,9+). This results in idle time in the inertia matrix cpu array, where the a, j cpus wait for the 6,- cpus until both algorithms can proceed with the synchronised multiplication by HT. Minimisation of this idle time can be accomplished by passing some of the [L mp] calculations from row 0 to row 1. w\0 and w'i0 can be sent from row 0 to row 1 cpus, where a,- = J-WIQ and bi = JiW'iQ (18x,12+) can be calculated, relieving row 0 cpus of this burden. The results are then sent back to row 0. Although this requires the communication of vectors a,- and 6,- from row 1 to row 0, it is possible to hide this communication cost with computations. Table 4.4 presents a schedule of some of the communications and computations between cpuo3, cpuo4 and cpui4 (// indicates a parallel process, tx() and rx() indicates transmission and reception). The table shows that all communications are hidden by computations except in the case of cpuo,4-Tables 4.5 and 4.6 are the revised versions of Tables 4.2 and 4.3 (excluding communications) Chapter 4. A Parallel Dynamics Algorithm 74 CPU03 CpU04 CPU14 ^rx(j>2)//W3 ^X^)//^^ 10 j*/7tx(W3/3) - -»rx(W3/3)//-W4*:4//tx(u;Jio) - -*™(«'4j0)//n»4(p4.1XC4.l) 11 c3 = a3xw;^0,m3p3//tx(p3) -»• ->rx(fe)//Wy4//tx(<0) - ->rx(^0)//a4 = y44^0 12 X3 = c3 + 63 p4//rx(a4)<- <-tx(a4)//b4 = 13 c4 = a4xw4i0,m4p4//rx(b4) <-tx(64) 14 L4 — c4 \ 64 15 Table 4.4: Redistributed computations and communications for [L4 m4p\\ cpuu CPU12 CPU13 CPU14 COSt tO CpU 14 1 4 AJ 4 4x 2 <r c2,2 c3,3 r4 c4,4 0x,0+ 3 P2.2 A>3,3 P4.4 0x,0+ 4 c2,l c3,2 c4,3 8x,5+ 5 X Cl.l) P2.1 P3.2 P4.3 8x,ll+ 6 •7JX.0 «'2c2,l c3,l A c4,2 8x,5+ 7 m2(P2,l X P3.1 4 P4,2 8x,ll+ 8 «*2 w2,0 J3C3,1 4, 8x,5+ 9 T2w2 J2 W2,Q m3(p3,l X 4l) Ph 8x,ll+ 10 74-4 ''4 c4,l 9x,6+ 11 J3 w3,0 m*(p4.1 X C^j) 9x,6+ 12 9x,6+ 13 J$W4,0 9x,6+ Table 4.5: Revised H computation (row 1) schedule when di and 6, are calculated in row 1. The complexity for the revised H and [L rnp] algorithms prior to multiplying by H is 53 = (n-l)(16x,16+) + (40x,21+) (4.6) H for the H matrix, and J2 =(«-l)(18x,13+) + (45x,38+) (4.7) L,mp for the [L mp] calculation. The H algorithm in Table 4.5 requires each row 1 cpu to transmit 6, = J\w\Q to their counterpart cpu in row 0 after all computations have ended (operation 14 in Table 4.4). Chapter 4. A Parallel Dynamics Algorithm 75 cost to cpuoi cpuoi cost to CpU02 CpU02 CpU03 cpuo4 1 4x A1 4x A Al Ai 2 0x,0+ <o 3 0x,0+ <o 4 lx Wi 8x,6+ w'io 5 9x,6+ 10x,7+ w'io 6 9x,6+ Wrh 6x,9+ - W2 ^3,0 7 8x,ll+ Pi 9x,6+ -W2k2 ^3.0 8 9x,3+ ci,mipi 9x,6+ W2l2 w3 <o 9 3+ L\ = ci + h 8x,ll+ fa -W3k3 ™io 10 9x,3+ c2,m2fa W3l3 W4 11 3+ L2 = c2 + b2 fa -W4k4 12 c3, m3p3 W4l4 13 L3 = c3 + b3 fa 14 c4,m4p4 15 L4 = 04 + 64 Table 4.6: Revised schedule for calculating [L mp] 4.4.1 Multiplying H and [L mp] by HT By reorganising the computations as discussed above, the momentum derivative vector [L mp] and the H matrix are completed at approximately the same time. HT = [C (px C)]T can now be multiplied simultaneously with H and [L mp). The rectangular matrix JC and the vector L are premultipied with CT', and the matrix M(p x C) and vector mp are premultiplied with (p x Cf. For the inertia matrix calculations, the elements cj- •, (p\j X cj• •), J\c\ j and m,(pjj X c\j) are already in cpu.j (Fig. 4.3) and so there is no loading phase. The HTH multiplication can be accomplished by systolically sweeping H{j vectors (6x1) left in Table 4.7 according to the following algorithm: for i = 0 to n { 1. cpus in columns 1 to n — i calculate dot products with H vector and resident H vectors. 2. shift scalar dot product result in rows n to n — i up to row above and accumulate Chapter 4. A Parallel Dynamics Algorithm 76 3. shift H vectors in columns 1 to ra - i left (4. where the H vector in cpujjt is [Jjcj k mj(pj k x cj k)]T. Table 4.7 graphically illustrates the first four computations of this algorithm. Each matrix element represents a cpu, and the contents of each element represent a computation ( - indicates no computation). Note that the grid is an + ra triangular mesh, so the cpu in element (col 1, row 2) does not exist. Multiplying the momentum derivative vector [L mp] by H requires the [c\j (p\j x c\ j)]T vectors to be shifted into row 0 from row 1. This is done as part of the algorithm in equation 4.8. Table 4.7 shows row 0 [L] appended to the rows calculating H (in computation cycle 0). The vectors in the H matrix, Hitk, are then systolically shuffled leftward through the array, each computation phase involving dot products and/or accumulations. The first four computations (two cycles) of the algorithm are demonstrated in the sequence presented in Table 4.7 (only rotational momentum computations cT-(Je) and cT-L are shown). Between computation phases (1) and (2) in Table 4.7, the dot products calculated in (1) are also passed up from column 4 to column 3 in an accumulative sweep of the dot product results, and the Hij (the c}j in Table 4.7) are simultaneously passed left from rows 1 through ra to rows 0 through n — 1. Computation cycles (2) and (3), the accumulation (2) and the next dot product (3), are then computed as one phase, followed by another communication phase between (3) and (4). By simply passing H?j — [c\j (p x c\j)]T as 6 x 1 vectors, the communication phases are not hidden, as the BJjdot product must be completed before the results can be passed upward and accumulated. However, if the dot product computations and the vector communications are split up so that cj • and (p\j x cj- •) are sent separately as 3 x 1 vectors, and the dot products for each of these vectors are calculated separately, the communications can be hidden (although a 3 x 1 vector dot product computation is a less costly atomic action than a 3 x 1 vector communication). Furthermore, it is possible to partially or completely eliminate the communication of H{j by taking advantage of two facts: Chapter 4. A Parallel Dynamics Algorithm 77 1. c\j need not be transmitted left by rows 1 to n in Table 4.7 (south to north in Fig. 4.3), since it is already present in rows to the left, having passed through during operation 2 of Table 4.1. Thus, in cpu*,,- while (pj-j x cj j) is being received from the right, the dot product partial sum Pki,j-i = (cjj_1)T(^cj)jt) + (pjJ_1 x cjii_1fmi(pj,jt x cjijk) (6x,5+) (4.9) is calculated, and the moving sum of other partials TJjti 1S passed up from below, in the next operation, when cj • would normally be sent, (pjj+i x cjJ+1) can be sent and PkiJ = (c\,jf(Ji4,k) + (/>!,; x cljfmiipi, x <*) (4.10) computed. A moving sum ^2ki is one which originates in cpuj)tl and migrates to cpu,-^ after k cycles, accumulating the partial sums pki,j, j = 1, —k relevant to that cpu's matrix multiplication at each step. Previously in cpu,^, while (pj • xcjj) was being received, only the cjj(«7*cj k) computation (3x,2+) could simultaneously occur since cj — had to be received in the prior communica tion. With cj- • now resident, equation (4.9) can be computed at the time that (pj\j X cj-j) is transmitted, and the (pj- • x cjj) vectors can be sent at twice the rate since the cj j vectors are not interleaved. In fact, the vector transmission is more easily hidden since the cost of the dot products in equation 4.9 is (6x,5+), approximately as expensive as a 3x1 vector communication. The calculation of fv can similarly be improved by passing the cj j vector north to row 0 as it is being produced by row 1 cpus (in operation 3 of Table 4.1). The cj- • vectors will thus be present when phase two begins, and only (pj • x cj- •) needs to be passed in. 2. The communication of (pj• • x cj• •) can also be avoided by noting that the pj • have also passed through the array (rows 1 to n) due to operation 3 in Table 4.1 (and may also be passed from row 1 cpus to row 0 cpus during operation 3 (in a similar manner to c\ j) in note 1 above). However, the cross product (pjj X c\ A (operation 7) must then be computed at each cpu before the dot product in the second term of equation 4.9 is calculated. Chapter 4. A Parallel Dynamics Algorithm 78 The computational complexity of the systolic multiplication by H is n(6x,6+) (the extra n adds occur due to the accumulating moving sum). The overall complexity, assuming the cross product result of (pj • x c\j) is passed through the array, is then: £ = (n-l)(16x,16+) + n(6x,6+) + (40x,21+) M = (22n + 24)x,(22n + 5) + (4.11) 53 = («- l)(18x,13+) + n(6x,6+) + (45x,38+) = (24ra + 27)X, (19n + 25)+ (4.12) and if the p\ j X cj j vectors are calculated rather than passed in, a cost of ra(6x, 5+) is incurred, and so the overall cost is: = (n- l)(16x,16+) + Ti(6x,6+) + n(6x,3+) + (40x,21+) M = (28n + 24)x,(25n + 5)+ (4.13) = (n- l)(18x,13+) + n(6x,6+) + n(6x,3+) + (45x,38+) Sv = (30ra + 27)x,(22n + 25)+ (4.14) The communications for operation 9 in Table 4.1 (the accumulation of partial sums by the moving sum), represented by the vertical data shift in Table 4.7, must still be performed, but the cost is hidden by the dot product computations. The cost of the extra cross product p x c, done to avoid communicating, is computationally expensive, but may be worthwhile if the cost of communication is more than a vector cross product, or if there is difficulty in hiding the communications within the computations. Chapter 4. A Parallel Dynamics Algorithm 79 row 0 row 1 row 2 row 3 row 4 col 1 ii J\ ci,i 0 col 2 Lt J2 c2,l 75r5 J2 C2,2 col 3 L3 J3 c3,l 73-3 J3 c3,2 7 3 -3 •/3C3,3 col 4 U 74r4 •'4 c4,l 74r4 •'4 c4,2 •74^4,3 74r4 J4 c4,4 row 0 row 1 row 2 row 3 row 4 col 1 - (ci,i)J Ji ci,i 1 col 2 - (C2,l)J ^cll (C2,2)J ^2C2,2 col 3 - c3,l) ^3C3,1 (c3,2) ^3C3,2 (C3,3)J J£c3,3 col 4 - (c4,l) ^4C4,1 (c4,2) J*ci,2 (c4,3) ^4C4,3 row 0 row 1 row 2 row 3 row 4 col 1 - - 2 col 2 - - -col 3 - Ell = (C3,l)J ^3C3,1 E22 =4^C^?^ ^C3.2 +(^4,2) ^4C4,2 E33 = (C3j?y J3C3,3 col 4 - - - - -row 0 row 1 row 2 row 3 row 4 col 1 («i.i)Tii 3 col 2 (c2,,)Tia (c2,2)T-72 c|,l col 3 (C3,2)T^3 c3,l (C3,3)T^3 c3,2 _ col 4 (C4,2)T^4 c4,l (C4,3)T^4 c4,2 (c\,*)TJ*C*,S _ row 0 row 1 row 2 row 3 row 4 col 1 - - 4 col 2 - Ell = (c2,l)i ^2C2,1 E22 = (C2,2)J -^2C2,2 + col 3 E10 = «i)Ji3 E2I = (C3,2)J ^3C3,1 E32 = (C3,3)T</33C3,2 +(cj,3)r744ct2 -col 4 - - - - -Table 4.7: First four computations of HTH (rows 1 to 4) and HT[L] (row 0) (rotational components only) Chapter 4. A Parallel Dynamics Algorithm 80 4.5 Parallel Matrix Solver Once the inertia matrix M and the driving vector b (b = fv + ra) have been assembled, the system of equations must be solved for 0. An efficient feedforward wavefront matrix solver has been implemented for the equation M6 = b or, using matrix terminology, Ax = b (4-15) using a feedforward algorithm proposed by [Jainandunsing 89]. Although Jainandunsing's algo rithm is a systolic array, it can be used by the wavefront array employed here (a wavefront cpu such as the transputer is activated by the arrival of data, whereas a systolic cpu is synchronised by a global controller, and calculates regardless of the presence of data). For dense systems of equations (when most elements of A are nonzero), the direct method solver is the best method of solution. Direct methods can be divided into backsubstitution and feedforward methods. factorisation and backsubstitution method Backsubstitution methods compute an LU, QR or Cholesky factorisation of the A matrix, fol lowed by a backsubstitution to derive the solution vector x. The factorisation is performed by premultiplying the A matrix with embeddings of 2x2 rotations, which can be linear, trigono metric/orthogonal, or hyperbolic. The multiplication triangularizes A to an upper triangular matrix. For the linear rotation, the 2x2 matrix has the form: 1 0 (4.16) Chapter 4. A Parallel Dynamics Algorithm 81 which gives the following N x N embedding: ©.i(0) 1 0 (4.17) IN-J For an appropriate a.j, the product 0tJ(O)A eliminates the a,j element in the A matrix. A cascaded backward series of these embeddings N, ...i, (JV is the size of the matrix A) produces a product 0(0) = n.,,0,, ,-(0) = (niI10,,N(o))...(n^10,,2(o)) • (fijl^^o)). (4.18) When the otij are calculated such that the strictly lower part of A is eliminated, the resulting product 0(0) is X-1, the inverse of the factor L in the LU factorisation. When 0(0) is multiplied with J4, the result is the factor U ie. A 0(O)A If 0(0) Q(0)A LU e(0)LU. L~lLU = U. (4.19) Thus, if each cpu in the array has calculated the appropriate a,j = at-j^/ajj^ ana formed the 2x2 rotation (a,-j^ is the value of element after j successive rotations), then feeding A into the systolic array produces U. To solve Ax = b, let y = Ux and A = LU. Then, if Ax = 6, LUx = b. If Ly = b, y = L-Xb = 0(0)6 (4.20) Chapter 4. A Parallel Dynamics Algorithm 82 From equation (4.19) it can be observed that when b is premultiplied with 0(0) (ie when b is also fed into the systolic array), y is obtained, from which x may be calculated through the backsubstitution: U x — y x = U~ly (4.21) Once U = O(0)A and y = 0(0)6 are obtained, they are passed to the backsubstitution algo rithm, which is architecturally organised as a pipeline. This pipeline calculates the x vector in reverse order (xn,...a;i), with y and U fed simultaneously into the end and the side of the pipe respectively. Unfortunately, the elements of U, which are produced a column at a time starting with column 1, are in the reverse order to that desired by the backsubstitution. As a result, a systolic implementation cannot overlap the factorisation and backsubstitution, as the factorisation must be completed before the backsubstitution begins. Systolic implementations such as Liu and Young's [Liu 83] have a complexity of 3n — 2 cycles for the factorisation, and 4n — 3 cycles for the backsubstitution (loading, computation, unloading of solution), for a total of 7n — 5 cycles [Jainandunsing 89]. Each cycle requires (lx, 1+). feedforward method Feed forward techniques solve a system using LU factorisation without backsubstitution. Fad-deev [Faddeev 59] proposed an algorithm which augments the matrix A with an identity matrix —IN. The factorisation of the augmented matrix [AT — Iff] which is accomplished by feeding A and I into the augmented cpu array, produces 0(0) = (Z')_1, which is the inverse of L\ the lower factor in the factorisation result [L' U']. When b is also fed into the array after A, it is transformed by 0(0)6 = x to the solution vector x. While this algorithm is fast (complexity 3n cycles) it requires pivoting to maintain stability, which is difficult to implement for a systolic architecture. Additionally, the algorithm requires cpus for processing —/AT, demanding almost twice the number of cpus compared to other systolic algorithms. Chapter 4. A Parallel Dynamics Algorithm 83 4.5.1 The Jainandunsing feedforward algorithm The algorithm proposed by Jainandunsing [Jainandunsing 89] is a feedforward direct procedure involving a reorganisation of the backsubstitution into a feedforward sequence, so that the sequence of results from the factorisation (which is itself different from the previous algorithm) is immediately processed by the new feedforward phase that follows (in fact, the backsubstitution has been rephrased as an LU-type factorisation). The factorisation, instead of producing U and y starting from column 1 of U, produces [LT U~T] starting from row 1. The feedforward algorithm that follows accepts the [LT U~T] values in the order they are generated. Non-singular matrices are assumed. Variations have been developed by Jainandunsing for Cholesky factorisation (hyperbolic rota tions) and QR factorisation (orthogonal trigonometric rotations). Here, we use the LU variation because it uses linear rotations, which are cheaper to compute than hyperbolic or trigonometric rotations. Although the LU factorisation is susceptible to numerical stability problems due to roundoff errors [Jainandunsing 1989], it was chosen because the matrices dealt with in this thesis are positive definite and very small in size, for which numerical problems are unlikely to occur. The number of cpus required is also reasonable, as apart from the array for the A matrix, only one extra row of n cpus is needed (to process the b vector). Thus the architecture in Fig. 4.1 is ideal for this solver, since row 0 is already used to calculate 6 while rows 1 through n are used to calculate At. If instabilities are likely to occur, the QR factorisation variation should be chosen. For the QR factorisation, the computer architecture remains the same as for the LU factorisation, but the linear rotations are replaced by trigonometric rotations. factorisation Consider the augmented matrix AT IN • ^ matrix is premultiplied by U~T, it will be factorised to [LT U~T]. U~T can be defined as a cascaded backward series (N, ...1) of embedded rotations in a manner similar to the previous factorisation matrix 0(0). If the 2x2 Chapter 4. A Parallel Dynamics Algorithm 84 embedding is a premultiplier of the augmented matrix, it is 4-i 1 0 a iN-i ; i,j = 1,...N-1. (4.22) For j= 1, n^1^,,! is a cascaded backward series (£//v-i,i)(£0v-2,i)—(^i,i) of rotations which zeros the below-diagonal elements of the first column in AT. The appropriate choice of a.j is aii/ai!i- The resulting matrix is then A7^1). This can be repeated for j = 2, ...JV — 1 where atJ = a\ j*/ajJj such, that we get (n^.x^.iv-i)...^1^) (n^Xa) [AT IN] = [LT U~T] (4.23) The above multiplication yields a factorisation in which the rows of [LT U-T] are produced in row order starting from row 1. By producing ^LT U~T] in this order, the rows can be immmediately sent to the feedforward algorithm. backsubstitution redefined as a feedforward algorithm Assume that the factorisation has occurred and the factors |xr U T] are available. Consider Ax = b where A = LU. If Ly = b and x = U~xy, then [yT i] LT -bT (4.24) [yT i] u-T 0T = x (4.25) The matrix [L — b]T can be reduced to an upper triangular form [RT{0) 0]T in which the last row is zero (i.e. the — bT row is zeroed) by applying a sequence of rotations 0(0) = II^G.tO) Chapter 4. A Parallel Dynamics Algorithm 85 where 0,(0) = (4.26) 0(0) a, 1 ' LT ' R(0) -bT 0T (4.27) -R(O) is an upper triangular remainder matrix equal to TJ. Equations (4.23) and (4.26) both produce 0, suggesting that the last row of 0(0) is proportional to k[yT 1], where A; is a scalar proportionality constant (k=l for the LU method [Jainandunsing 89]). This allows equation (4.24) to be rewritten as r U~T since the last row of 0(0) is k[yT 1]. Thus, grouping (4.26) and (4.27), 0(0) = x (4.28) R(0) ? -bT 0T 0T xT 0(0) = , (4.29) [ -b T \ I 0 T \ where ? implies an unused result, interleaving Premultiplying the matrix [AT IN] by n^j1^! eliminates the first column of the strictly lower part of the augmented matrix. The output is the first row of [LT U~T], which is then fed directly into the rephrased 'backsubstitution' calculator. This feedforward 'backsubstitution' algorithm applies 0i(O) to the just-received first row of [LT U~T] and to the first element of vector [—bT 0], causing the first element b\ to be zeroed (the first 0 in the bottom row of the result matrix in equation (4.26)). Continued interleaving of the Il^^Uij and 0j(O) eventually results in the solution vector xT being produced after 4n cycles, x\ being first after 3n cycles. A systolic array for this algorithm is shown in Fig. 4.4. Chapter 4. A Parallel Dynamics Algorithm 86 •JJJJJJJJJB — 0 — — 1 0 — 0 0 0 — 0 1 0 0 0 0 0 0 ~b4 0 1 0 a44 -b3 0 0 834 a43 "b2 1 "24 a33 a42 a14 a23 a32 441 — a13 8 22 831 — a12 421 — — all — — _ 0....0 • 0....0 s e w e w e w e nB 11 12 13 14 04 e e WL e nm 22 23 24 03 0....0 s ,n s n w e 11 e 34 02 w 01 0....0 44 00 col 1 col 2 col 3 cd 4 ROW 0 Figure 4.4: Systolic array for solving the matrix equation Ax = b integration of formation and solving phases The elements of A and b are distributed throughout the array by the inertia matrix and force vector algorithms during calculation. The solver algorithm in Fig. 4.4 requires that A and 6 Chapter 4. A Parallel Dynamics Algorithm 87 be moved to the edges of the array before beginning. Fig. 4.5 shows the movement of the data to the cpus of row 1 e.g. A44, A34, A24 and A14 are shifted to cpux4. Similarly, A34, A33, A23 and A13 are shifted to cpui3. Note that Aij = Aji i.e. A = AT. This data movement phase requires n — 1 data shifts, each one moving a single float value. These communications cannot be hidden by computations. The b cpus (cpus 01 to 04 in row 0 of Fig. 4.5) are connected not w row 0 cpu 00 row 1 row 2 row 3 row 4 e w b2 w s n s n l22 n n e w b3 e w Ml s n l23 '33 e w °4 e w e « 111 s n 24 34 *44 Figure 4.5: Data shifting phase prior to solver only to cpus 11, 12, 13 and 14, but also to 44, 34, 24 and 14 respectively through north-to-east links (see Fig. 4.5). To derive a simple and efficient data shifting phase that integrates well Chapter 4. A Parallel Dynamics Algorithm 88 with the solver, the 6 (row 0) cpus are aligned as column '5' and placed next to the 4th column of cpus, as shown in Fig. 4.4. If this rearranged architecture is not used, an alternative data shifting algorithm (bringing the data in row t out to cpu,\i) is twice as long, since the elements a,j collected in cpu,4 must be moved to cpu4_,,4 after they are brought to the edge of the array, in order to get the data arrangement required in Fig. 4.4. Fig. 4.5 shows how the cpus are arranged in a cone-like mesh architecture. /* row_id is the row in which this cpu resides in the mesh */ for( i = 1; i <= twoN; i++ ) { Chanln( northin, (char *)&yold, 4 ) ; Chanln( westin, (char *)&xold, 4 ) ; if( i == row_id ) alpha = —yold/xold ; if( i < row id ) { xnew = 0.0 ; ynew = 0.0 ; } else { xnew = xold ; ynew = yold + alpha*xold ; } ChanOut( southout, (char *)&ynew, 4 ) ; ChanOut( eastout, (char *)&xnew, 4 ) ; } (4.30) producing and using 0 The solution vector x = 0 is produced at cpuoo (see Fig. 4.4), and the <?, are redistributed back to the 6, cpus (cpuo.i) via the west link of cpuoo to the east link of 64 as they are produced. The 0i are integrated by each cpu0,,' to obtain 0, (necessary for calculating the angular velocity), and integrated again to get 0,-, which is used by the graphics display of the dynamics simulator (further discussed in Chapter 5). The 0; has to be fed back along the row 0 pipeline to cpuo,o and on to the host or the graphics computer. Each cycle in the systolic algorithm requires Chapter 4. A Parallel Dynamics Algorithm 89 (lx,l+) for the rotation (a multiply and accumulate). One -f (equivalent in cost to a x) is necessary to calculate Q,J = ffljJj/ajj (or aoj = —^Vajj f°r the force vector cpus) once during the algorithm for each cpu. The C code for a cpu implementing this solver is given above in (4.30). The last value of 0, is produced after An cycles, in contrast to 7n — 5 cycles for the previous factorisation and backsubstitution algorithm. The communications of intermediate data and the rotation computations cannot be overlapped, and so the communication cost is An cycles (passing single floating point values), plus n cycles for the initial data shifting phase described earlier. 4.6 Summary of the Proposed Forward Dynamics Algorithm To complete one cycle of the forward dynamics algorithm, it is necessary to 1. Calculate the force vector fv 2. Calculate the inertia matrix M 3. Solve the matrix equation MO = fv + ra. This thesis has developed 0(n) multiprocessor algorithms for all three of these phases. The n cpus in row 0 and the n{w+*) cpus in rows 1 to n simultaneously calculate [L mp] and H, respectively, and then multiply by HT to compute /„ and M. These two phases require a 2d mesh architecture as data must be passed between nearest neighbours. Following this, the elements of reside in cpuo,- and Mij reside in cpu,j. This data is shifted to the edge (row 1 cpus cpui, and cpuo4) and the system is then solved using a new systolic solver derived by [Jainandunsing 89]. The costs associated with each phase of the proposed algorithm are given in Table 4.8 for the W (Wong) algorithm. 4.7 Parallelism in Other Formulations In this section a discussion on the parallel formulations mentioned in Chapter 2 is presented in which the computational complexity is considered in more detail. The complexities of the Chapter 4. A Parallel Dynamics Algorithm 90 /. M solver M 1 L X + 87 15|"log2 n] + 65 9\Sf-] +31[log2 nl + 116 SpfH + 28[log2 nl + 3r» + 92 7n - 5 7n - 5 9fZf-] + 31 [log2 nl + 7n + 111 5f^ii] + 28flog2 nl + lOn + 87 cpu n(c) n(c) (m) W X + 24n + 27 19n +25 22n +24 22n +5 4n+l 4n 26n + 25 26n + 5 cpu »(P) (m) ^ii+n(m) A ftS cpu 27n + 99 log2 n - 1 "(c) F X + cpu 48flog2 nl + 102 63[log2 n] + 66 SiS+il (m) H X + cpu 27[log2 n] + 6n+ 109 3l[log2 n] + 5n + 92 »(P) 7n - 5 7n - 5 (m) 27[log2 nl + 13n + 104 31flog2 nl + 12n + 87 Table 4.8: Parallel computational complexity inertia matrix, force vector, and solver algorithms developed in this thesis are summarised in Table 4.8. In the following discussion the communication complexity is discussed but not explicitly estimated in the Table. This is because the authors of the various formulations that are listed in Table 4.8 either excluded or completely included all communication costs, and did not consider the case when some communications could be hidden. Thus, to compare the algorithms under similar circumstances, a comparison is made solely on computational costs. It should also be noted that the complexity estimates in Table 4.8 are not exactly as presented in the relevant papers cited. It has been necessary to assume that extra cpus are available to compute the force vector values, so that this cost is masked by the inertia matrix. Additionally, in cases where only the inertia matrix formation is discussed, Liu's systolic solver has been assumed to calculate the matrix solution. In Table 4.8, (p), (c), and (m) indicate pipe, cube and mesh architectures respectively. L, W, A, F, and H refer to the Lee [Lee 88], Wong (this thesis), Amin-Javaheri [Amin-Javaheri 88], Fijany[Fijany 89], and Hwang [Hwang 88] formulations, while cpu refers to the number of cpus used. Chapter 4. A Parallel Dynamics Algorithm 91 4.7.1 Lee and Chang Lee and Chang [Lee 88] implemented their composite rigid body based force vector and inertia matrix algorithms on an n processor generalised cube network architecture. The force vector was computed in 0( [log2 n]) time and the inertia matrix algorithm in 0(n + riog2 n] ) time. To achieve this, the recursive doubling algorithm (RDA), which is used for many of the calculations (e.g. calculating the rotational matrices A0, the composite centre of mass c,-, and the composite body inertia J{ (z = l,...n)), computes each variable in O(|log2n]) time. It is implemented using the generalised cube as a |log2 ra] stage switching network for distributing the intermediate matrix data among the cpus (Fig. 4.6). The RDA algorithm computes A0 in O(|log2ra]) I N P U T O U T P U T STAGE Figure 4.6: Generalised cube network topology for n = 8. Adapted from [Lee 88] time by subdividing the computations into pairs of matrix-matrix multiply operations over [log2 n] stages. To hide the communications cost of sending or receiving A\ matrices, Aj must be divided into three 3x1 vectors, each sent one at a time so that the matrix-matrix Chapter 4. A Parallel Dynamics Algorithm 92 multiply can be computed as three matrix-vector multiplies. This only hides two of the three vectors, but the beginning of the next matrix-matrix computation can begin while the last result vector is being transmitted. This method of hiding communications also applies to the computation of J,-. Another instance of the RDA algorithm, however, the recursive vector addition of the composite body centre of mass c, (a 3 x 1 vector), cannot overlap communications with computations since the complete vector result (a computation 'atom') must be computed before being communicated to the next stage. Lee's n("2+1) mesh algorithm, which uses an n cpu cube architecture for the inertia matrix and an w(n+1) systolic mesh to solve the matrix equation, is presented in Table 4.8 as formulation (L). Lee uses Liu and Young's systolic Cholesky solver algorithm [Liu 83]. In (L) we assume that an extra n cpus is available to calculate the force vector simultaneously with the inertia matrix n cpu pipeline. Thus, Lee's architecture requires 2n cpus to form M and /„, each n forming a generalised cube network, followed by an mesh to solve the equations. The complexity of Lee's formulation is presented in Table 4.8. Note that we have assumed Jainandunsing's complexity estimate of 7n — 5 cycles for Liu's systolic Cholesky factorisation (plus backsubstitution) algorithm. Lee's inertia matrix complexity has an added 3(n — 1) additions which were left out in Lee's complexity polynomial (equation 25 in his algorithm). The last column of the table is the sum of the inertia matrix and solver complexities, since it is assumed that the force vector complexity is approximately the same as the inertia matrix complexity, and is therefore hidden when calculated simultaneously. 4.7.2 Amin-Javaheri and Orin Amin-Javaheri and Orin [Amin-Javaheri 88] developed a number of parallel algorithms for the inertia matrix, the most efficient of which used a hypercube architecture. The composite body variables c,, J, and Af were calculated in O([log2n]) time using the RDA algorithm and the architecture of Fig. 4.7. To reduce the communication delay in the RDA algorithm, com putations are duplicated by source and destination cpus so that the data for the next level of computation is at most only one hop from its destination. By doing this, the computational Chapter 4. A Parallel Dynamics Algorithm 93 001 sinB? _ cos 82 sin 83 , sin 82 _ | COS 63 , cos 62 H sinW4... sin82 _| COS04- cos O2 I sin05- sin 8g _| cos 05 • cos 62 I sin 86- sin 82 cos 96.. cos 82 sin 67 . • sin 8? f cos 87 • cos 82 I sin8g .. sin82_| cos98. . cos 82 I sinSg-- sin82 cos 8e ••• cos 62 H; 'j.2 f3.3 m f2.3 ^ I n3 3 \ "A3 f2.4 . n4,4 "3,4 "2,4 f2.5 f4,5 . n , , n4.5 " "2,5 f6.6 fc f5,6 _ '2.6 „ n . . "5,6 "2,6 b,b '7.7 m 67 » F?7 FC "7,7 "6,7 "2,7 8,8 f7,8 "7,8 f6,8 n6,8 j.3 [ZKHJ-4 •"j.6 ... "2.81 1 Figure 4.7: N=8 processor cube configuration. Adapted from Amin-Javaheri 1988. complexity is not increased (the cpus would be idle otherwise), but the communication com plexity is minimised to 2[log2n] time rather than the ]T),ti2"' i cost normally incurred by the hypercube (e.g. if (5,8) in Fig. 4.7 were to be calculated only in cpu 5, it would have to be passed to cpus 1,2,3 and 4 in three hops). Data has to be passed bidirectionally, so there is a factor of 2 in the communication cost (assuming only one bidirectional link between any pair of cpus). Once the fjj and njj of equation 2.8 have been calculated (see Fig. 4.7), /,-,_,- and n,j Chapter 4. A Parallel Dynamics Algorithm 94 (i < j), and column j of the inertia matrix are calculated by the jth cpu in 0(n) time without any further communications. A direct comparison of computational complexity is difficult with Amin-Javaheri's algorithm since the cost is not given as a function of multiplies and adds. The computational complexity polynomial (A) in Table 4.8 gives the number of 1 ps cycles. Since Amin-Javaheri and Lee have very similar algorithms, it is expected that Lee's complexity is also representative of Amin-Javaheri's algorithm (e.g. assuming multiplies and adds are 1 ps each, for n = 8 the compute times are 473 cycles for Lee and 512 cycles for Amin-Javaheri, which are very close). 4.7.3 Fijany and Bejczy Fijany, like Amin-Javaheri, presented an algorithm for just the inertia matrix [Fijany 89]. He used the RDA in a similar manner to Lee and Amin-Javaheri to compute the A0 and other composite body variables in O([log2n]) time. Following this, Fijany computed the inertia matrix terms in O([log2 n] + 1) rather than the 0(n) time of Lee, by using the RDA algorithm to compute P,j, the vector from hinge i to hinge j. P,j is used in the Composite Rigid-Body Spatial algorithm (but not in the Composite Body algorithm) according to the following equations: for i = n, ...1 for j = i — 1, ...1 Pij = PW + Pj+1J (4.31) Mj,i = zf(m + P^ x ft) (4.32) The computational complexity of equations (4.31) and (4.32) is O([log2 n\) on a hypercube and 0(1) on an n("2+1) mesh, respectively. The complete algorithm is mapped to a 2d n^"2+1^ nearest neighbour mesh. By choosing this architecture, the RDA communication cost increases to 0(Y^f2 2,_1), as the data must be passed from neighbour to neighbour in order to imitate the hypercube structure. This increases Chapter 4. A Parallel Dynamics Algorithm 95 the communication cost of the RDA by 150% [Fijany 89], but the communication architecture can be reused for matrix solving (whereas the n cpu hypercube architecture does not implement a systolic solver as efficiently as a mesh architecture). 4.7.4 Hwang, Bae, Haug and Kuhl The orthogonal complement algorithm proposed by Bae [Bae 88a] has been implemented on an Alliant FX/8 multiprocessor, an n cpu shared memory bus architecture. Because the for mulation deals with closed loop constraints, which require the identification of dependent and independent coordinates and the maintenance of constraints, the computations are much more complex than those for an open loop chain. As a consequence, the computational complexity polynomial was not given by Hwang and Bae [Hwang 88]. To directly compare the Hwang's algorithm we have estimated the performance of the algorithm for a single open chain in Table 4.9, making a number of assumptions. To take advantage of the fact that the variables are all referred to the inertial frame (and therefore A0 must be calculated) we have assumed a hypercube architecture and the RDA algorithm for calculating A0 and Ki. An "("+*) mesh has been added to solve the matrix systolically using Liu's algorithm, and the force vector is assumed to be simultaneously calculated on another n cpu cube (and therefore will not be counted in the complexity estimate). Table 4.9 presents the operations described in Chapter 2. The calculation of A0, x°0 and A', are performed using the RDA algorithm. Operation 10, Ki = 2~J?=i M{, requires that Mi be sent to neighbouring cpus. Only 10 values need to be sent in each direction for each data transfer of operation 10 because of the symmetry in M,-. It is possible to reduce the computations of operation 12 to (6x,5+) by distributing the (Bj Ki) and Bj vectors throughout the mesh. This, however, would require a complex algorithm to distribute the vectors, followed by another phase in which the inertia matrix elements must be brought to the edge of the mesh before the solver can begin. It is thus not worth the extra cost in communications to do this unless data communications becomes much cheaper, and so Chapter 4. A Parallel Dynamics Algorithm 96 # operation cost comms no. floats cpus 1 AU 4x 0 n 2 A°i riog2nl(27x,18+) 6[log2nl n 3 APJA? 36x,24+ 0 n 4 ... — AO 1 _ AOKi . v* — At-lli-l,i At Kt,i 18x,15+ 3 n 5 xi,0 ~ Xi-\,0 + V* 3riog2nl + 6flog2n] n 6 3x 0 n 7 rni(xlo)2 6x,3+ 0 n 8 J? " ™.(*?,o)2 9+ 0 n 9 6x,8+ 0 n 10 10flog2 nl + 20riog2 n] n 11 BjKi 36x,30+ 0 n 12 (Bj K~i)Bj n(6x,5+) 6(ra - 1) n 13 M-1 (7n-5)x,+ 7ra-5 n(n+l) 2 Table 4.9: Estimated computational complexity for Hwang's algorithm operation 12 is assumed to be executed by the n cpu cube architecture in 0(n) complexity. 4.7.5 Complexity Comparison Fig. 4.8 is a plot of the computational complexity for forming the inertia matrix alone (an + is 0.5 /is, half of the cost of a x, using an Inmos T800 transputer). The performance of the proposed 0(n) inertia matrix algorithm (W) is linear. The algorithm is more efficient at lower degrees of freedom than the other algorithms in Table 4.8. The exact crossover point is sensitive to assumptions made about the cost of implementation for each method. The implementation assumptions used in Section 4.7 indicate that the proposed algorithm is more efficient for up to 13 degrees of freedom (excluding communication costs). When the computational cost of the solver is added to the cost of the inertia matrix algorithm (Fig. 4.9), W apeears to be more efficient for chains of up to 15 degrees of freedom. It is assumed in algorithm W that the Jw and Jw computations are executed by row 1 cpus in algorithm W, and so their cost is included in the M estimate. Algorithm F is included in the figure by adding the complexity of Liu's solver. Chapter 4. A Parallel Dynamics Algorithm 97 Figure 4.8: Computational complexity for the parallel algorithms of Table 4.8, assuming only M is calculated. L - Lee, F - Fijany, H - Hwang, W - Wong (proposed algorithm). Chapter 4. A Parallel Dynamics Algorithm 98 Figure 4.9: Computational complexity for the parallel algorithms of Table 4.8, including M and feedforward solver. L - Lee, F - Fijany, H - Hwang, W - Wong (proposed algorithm). Chapter 4. A Parallel Dynamics Algorithm 99 4.8 Summary The orthogonal complement formulation derived in Chapter 3 has been parallelised. New parallel algorithms for the inertia matrix and the force vector have been developed, producing 0(n) parallel algorithms for single chain mechanisms which are conceptually very simple. The proposed algorithms map onto a 2d + n nearest-neighbour mesh architecture (an extra cpu is used to connect to the host). The connections are wrapped around to form a 3d cone-like architecture (this aids in the data shifting phase of the algorithm, when data must be routed to the edge of the array in preparation for the matrix solver). The algorithms make full use of the computational resources i.e. the pipeline used by the force vector algorithm and the triangular mesh used by the inertia matrix algorithm are reused by the systolic matrix solver. The systolic feedforward solver is very efficient, and does not require the storage and reordering of the intermediate results between factorisation and backsubstitution phases, which is required by the algorithm proposed by Liu. In comparison to other parallel dynamics formulations, which are of 0( flog2 n]) or 0( [log2 n] + n) complexity, the proposed 0(n) formulation is more efficient for mechanisms of 15 degrees of freedom or less (this will vary, however, depending on the assumptions made for each of the implementations). The O([log2 n] + n) algorithms are less efficient for low numbers of degrees of freedom because of the high initial cost of calculating the rotation matrices that refer each body's variables to the inertial frame. The proposed formulation uses nearest-neighbour com munications routing, which makes expansion of the architecture to accomodate larger n very simple. The O(flog2n] + n) algorithm proposed by Lee and Amin-Javaheri uses a hypercube architecture, which is not amenable to VLSI integration, and requires relatively complex com munication strategies to minimise the communication cost (this not only includes redundant computations in source and destination cpus, as discussed before, but also includes the addition of extra communication links when n > 16). The 0(|"log2n"|) algorithm proposed by Fijany uses a mesh architecture, but extra communication costs are incurred when the RDA algorithm is implemented. One drawback of the formulation proposed here is that the HT multiplication Chapter 4. A Parallel Dynamics Algorithm 100 phase in the inertia matrix and force vector algorithms requires data to be sent in the reverse direction (west to east and north to south) compared to that of the previous phase (when [L mp] and [C p x C] were calculated) (east to west and south to north). This means bidirectional links are necessary, in place of the simple unidirectional links which have been proposed by the designers of VLSI systolic arrays. Chapter 5 The Real-Time Excavator Simulator 5.1 The UBC Teleoperation Project This chapter describes the development of the real-time simulator for a Caterpillar 215B heavy-duty excavator. The 215B is typically used in primary industries such as forestry, mining, and construction. The objective of the UBC teleoperation project is to convert this type of machine, with minimum changes, to a semi-automated, human-supervisory control system. The simulator developed in this thesis is a fundamental part of this effort. Semi-automated control uses both human supervision and efficient robotic power for tasks in which these characteristics can simplify operations, such as tree harvesting and materials handling. Present operation of excavators and other similar hydraulic equipment is heavily influenced by the skill and smoothness of the human operator. Computerised joystick control can improve the ergonomics of the present lever per joint arrangement by, for example, having the joystick mapped to Cartesian-like workspace (the world coordinates are inertially measured x,y, and z) rather than joint space. The operator thus controls the end effector of the excavator, letting the computer calculate the control signals needed to achieve the desired endpoint motion. This design change can improve the ease of use, which should decrease the time necessary to attain a high level of skill, and simplify the low level decision making required from the operator. The excavator is modeled as a large system consisting of two subsystems - a complex hydraulic actuation system, and a rigid multibody mechanism. The multibody dynamics is represented as a four link mechanism forming a single chain, with each link having one rotary degree of freedom. Although most heavy-duty manipulators use piston hydraulics with a complex arrangement of coupled hoses and pressure sources, for this thesis we have simulated simple rotary hydraulics. 101 Chapter 5. The Real-Time Excavator Simulator 102 This is because the true actuation dynamics are highly nonlinear and interconnected, with a mixture of fast and slowly changing state variables, causing the system to be mathematically stiff and inefficient. Overcoming the problem of stiffness, so that the integration technique for the hydraulics can be matched to the integration algorithm used by the multibody dynamics, is a thesis topic outside the scope of this thesis. Another participant in the project has simulated the true hydraulics equations [Sepehri 90], and this work is currently being integrated into the simulator. As the excavator is a single chain mechanism, the 0(n) parallel algorithms developed in Chapter 4 are suitable for the simulator. In the next section, the mechanism is described based on information obtained from the specifications of the Caterpillar 215B and from [Sepehri 90]. The discussion then focuses on the hardware requirements necessary for the simulator, as de termined by the algorithms of Chapter 4 and by the computer architecture needed to integrate other components of the simulator, such as the graphics display and the controller. Next, the implementation of the simulator is discussed. This includes the organisation of a parallel fixed stepsize Runge Kutta integration algorithm, the development of equations for simple rotary hydraulic actuators, and the organisation of the input and output between the dynamics simulator and the display and joystick devices. Following this, some experiments will be presented which demonstrate the operation of the simulator, and its use in experiments on joint and resolved mode joystick control. A timing and complexity comparison is made between single and multiple cpu versions of the simulator as a measure of the effectiveness of the implementation. 5.2 A Description of the Caterpillar 215B The Caterpillar 215B is a tracked excavation mechanism with four rotary links named the cab, boom, stick and bucket (Fig. 5.1). The cab sits upon the base and swivels about a vertical axis. The operator sits within the cab, controlling the excavator with a pair of two degree of freedom joysticks. Each degree of freedom controls one link. The boom is the anatomical equivalent to Chapter 5. The Real-Time Excavator Simulator 103 the upper arm of the human body; it rotates about a horizontal axis and provides the ability to lift the bucket. The stick is analogous to the forearm, and is used with the boom when reaching. The bucket is the end link which is used to dig or carry the load, and is analogous to the human hand. Both the stick and the bucket also rotate about horizontal axes. Figure 5.1: Initial coordinate arrangement for Caterpillar 215B and resolved mode parameters Fig. 5.1 is a diagram showing the coordinate frames attached to the mechanism in its initial Chapter 5. The Real-Time Excavator Simulator 104 position. The coordinate frames are attached to the centre of gravity of each body. The cab, boom, stick, and bucket are labelled links 1 to 4, respectively. The parameters used for the simulation are in metric units of kg for mass, m for length, kg • m2 for inertia, and deg for angle: mi = 8031, ^ = [-1.05 0.04 -0.5], /£2 = [0.35 -0.12 0.5] m2 = 1830, k{2 = [-2.3 - 0.2 0], /£3 = [2.9 - 0.2 0] m3 = 688, A;33 = [-0.9 - 0.1 0], = [0.9 - 0.1 0] m4 = 512, kj4 = [-0.5 - 0.5 0] The twist angles are Q\ = 0, a2 = 90, a3 = 0, and a4 = 0, and the inertia matrices are: 8000 0 0 0 8000 0 - 0 0 15700 T2 -I3 -J3 — 10 0 0 0 600 0 0 0 600 74 i J4 100 0 0 0 15400 0 0 0 15400 130 0 0 0 130 0 0 0 130 4 _ 5.3 Architecture and Hardware Considerations Robot systems often have many diverse tasks that can run in parallel, such as the dynamics (in ertia matrix and force vector formation), the inverse kinematics, control algorithms, and sensor fusion algorithms (e.g. image understanding and touch sensing). Although some tasks such as image processing operate well using an SIMD (Single Instruction, Multiple Data) architecture, the simultaneous operation of many of the above tasks necessitates an MIMD (Multiple Instruc tion, Multiple Data) approach, so that different computational structures can be accomodated. As an example, image processing and dynamics are two tasks that can run in parallel, but may be processed by different computational units. Expandability, connectivity and global access to sensors are also important considerations. Expandability is an issue because mechanisms Chapter 5. The Real-Time Excavator Simulator 105 with other topologies will eventually be simulated. Connectivity must be considered so that communication latency is minimised. These requirements have directed our choice of computer architecture to a crossbar+bus based multiprocessor system. The crossbar, a 2d mesh of switches providing a non-conflicting path between any pair of processors, has a high data bandwidth for local task computations in which cpus are interconnected by the crossbar to perform a common task. The bus provides a path between clusters (each cluster forming around a small crossbar network and performing a different task e.g. dynamics). The architecture chosen for this project consists of a 32 bit global VME bus using a SUN 3/280 work station as the file-serving host running UNIX. A cluster of cpus made up of thirty four Inmos T800 transputers is connected together in a 2d mesh and to the host by a crossbar network. The crossbar is used to imitate the mesh as it has great flexibility in connectivity and can adopt architectures that are suited to many tasks. By forming a mesh the expansion to larger n is easily accomplished by adding new columns of n cpus. For the excavator simulation, the crossbar forms a cluster of fourteen transputers configured as a 2d triangular mesh (Fig. 5.2). Transputer cpuo,o connects the mesh to both the host and to another transputer which controls the spool response. The control transputer also connects to the display and joystick subsystems. A Silicon Graphics IRIS 4D70GT is used to present a real time, hidden-surface-removed display of the external environment as viewed through the front window of the excavator cab. The IRIS is connected to the control transputer via an RS232C cable, which is then converted to the RS422 standard for the transputer link. Two joysticks are connected by an A/D board to the IRIS. In normal operation (JOINT mode), each joystick has two degrees of freedom, while in Cartesian-like endpoint control, one joystick has three degrees of freedom and the other has one. Data from the joysticks is read by the IRIS computer and sent on the RS232C/RS422 link to the control transputer. Data is also passed from the mesh via cpuo,o through the control cpu to the IRIS. Chapter 5. The Real-Time Excavator Simulator 106 VME cpu cluster Figure 5.2: Computer architecture for the excavator simulator 5.4 Implementation of the Simulator 5.4.1 Multibody dynamics and distributed Runge Kutta In Fig. 5.2, row 0 is used to calculate the force vector fv (3.27 to 3.31, and 3.34), and rows 1 through n (n=4) are used to calculate the inertia matrix (3.17). cpurj,o provides a link to the Chapter 5. The Real-Time Excavator Simulator 107 # row 0 rows 1 to n Do j = 1 to 4 { 1 Calculate actuator torque ra no activity 2 Calculate [L mp] (eqs. 3.27 to 3.31 sec. 4.3) Calculate [JC M(p x C)] (eq. 3.17, sec. 4.2)) 3 Multiply by RT to get /„ (eq. 4.8, sec. 4.4.1) Multiply by BT to get M (eq. 4.8, sec. 4.4.1) 4 Move b = /„ + ra to cpuo,4 (Fig. 4.5) Move Mij to cpui,,- (Fig. 4.5) 5 Solve for 8, results reside in cpuo.o (Fig. 4.4) Solve for 9 (Fig. 4.4) 6 Move 8i from cpuo.o to cpuo,; no activity 7 } Integrate di to get 8 and 8 (5.1), and then transmit 8i from cpuo,; to cpui.i. no activity Receive 8i from cpuo,; 8 Send 9i from cpuo.i to cpuo.o and then to IRIS no activity 9 Send Xi from control cpu to cpuo.o and then to cpuo,; no activity Table 5.1: Tasks required for the dynamics simulation host for program development, down loading, and data storage. Table 5.1 lists the tasks performed by row 0, and rows 1 through 4. Tasks 1 to 7 are evaluated in a loop four times, and then the 8 are sent out to the display in task 8. The spool valve openings x, calculated by the control transputer from the joystick data, are sent to cpuo,o and on to the array (tasks 8 and 9). The calculation of the actuator torques ra will be discussed later. Tasks 2 and 3 use the algorithms described in sections 4.2, 4.3, and 4.4 of Chapter 4 to derive Al and /„. In this implementation of Al, the cross product (p\j x cjj) is calculated by each cpu before the dot product (method 2 in section 4.3), since the cost of communications using transputers is not small. Task 4 is the data shifting phase discussed in section 4.5 and shown in Fig. 4.5, which moves the elements of b and Af to the edge of the array. Task 5 uses the Jainandunsing systolic solver algorithm to calculate 6. Task 6 redistributes the since the values are resident in cpuo,o after the solver has finished. Chapter 5. The Real-Time Excavator Simulator 108 Task 7, the Runge Kutta integration algorithm, is distributed among the cpus of row 0. The differential state vector y for link i resides in cpuo.i, and consists of four variables: 0,, 0,-, and the input and output differential pressures p,n(i) an£l Pout(i) °f the rotary hydraulic actuator. At the end of task 6 in Table 5.1, the 0, are available in cput)o. The fourth order fixed step-size Runge Kutta algorithm presented by Press [Press 88] is then used to determine the new y = [0, 0, Pm(i) Pout(t')]- This is done by evaluating tasks 1 to 7 four times, each time calculating a new y from an updated state y for a partial step across the interval h — 0.01. At the end of the fourth evaluation, a weighted average of the four estimates of y is used to obtain the true y over the interval h. The Runge Kutta algorithm for each link is exactly the same as for the single cpu algorithm, except that it only applies to the states related to link i. Given an initial state y, the algorithm is: derivative(y, yx); for(i = 0;i < 4;i+ +) yt[i] = y[i] + f derivative(yt, yt); for(i = 0; i < 4; i + +) vt[*\ = y[i\ + iyt[i\, derivative(yt, ym); for(i = 0;i < 4;i+ +) { yt[i] = y[i] + hym[i}\ ym[i\ = yt[i] + jmM; } derivative(yt, yt); Chapter 5. The Real-Time Excavator Simulator 109 for(i = 0; i < 4; i + +) »[*'] = »[«] + £(»*[»] + »[*] + 2^mW); (5.1) The derivative(y, y) function calculates y from y. Task 8 is an output operation occurring once every step h if the state vectors y and y are recorded in a file, or once every ^jth of a second (5/i in this simulation) if the 6{ is sent to the display computer. To send this data out, the 0,-is passed out from row 0 cpus to cpuo.o, and then either sent out to the IRIS for display or to the host for storage. In task 9 x, is sent to cpun,o from the control transputer. 5.4.2 The actuators The actuators simulated in this thesis are simple rotary hydraulic ones, actuated independently through a valve controlled, hydraulic pressure system. The voltages v calculated by the control cpu from the joystick commands c are converted to spool openings x by adding a centre deadband of d = ± 0.005 cm and limiting the range to ± 0.05 cm (Fig. 5.3). The x values are sent to cpuo,o, and each cpu in row 0 (operation 1 of Table 5.1) reads x,-. A negative sign for x, indicates a reversal of the flow, which would cause the link to rotate in the reverse direction. Then, the previous time step's pressure p,n and pOXit are limited between Pe (the drain tank pressure) and Ps (the tank pressure). The flows are then calculated as follows: <Ps- Pin, 'Pin - Pe, 'Pout ~ -Pe; else Qout - kjx • \/Ps- pout; (5.2) where kj is the valve coefficient. The actuator torque ra for the link is then Ta = (Pin ~ Pout)^-G - Pm0 (5.3) Chapter 5. The Real-Time Excavator Simulator 110 deadband Figure 5.3: Voltage (v) to spool (x) relationship with deadband where G is the gear ratio of the actuator, Dm is a motor constant, the value 8 is the imperial-to-metric unit conversion factor, and (3M is a velocity damping constant inherent in the actuator. ra is added to /„ to get the force vector b. The differential pressures are proportional to the valve flows Q, the link velocity, and the hydraulic compliance and may be described as follows: Pin = (Qin-dDmG)(3v; Pout = (BDmG — Qout)Pv] (5.4) Thus, given the pressures and the values of the spool displacements, it is possible to calculate the flows, the actuator torques, and the differential pressures. Table 5.2 gives the parameter data for the simulation. Chapter 5. The Real-Time Excavator Simulator 111 range/value Pe (psi) 0.0 x (cm) ± 0.05 d (cm) ± 0.005 kf 26 Dm 0.7 Pv 1000 cab boom stick bucket Ps (psi) 4000 7000 4000 4000 G 307 545 220 250 An 100000 100000 30000 100 Table 5.2: Parameters used in excavator simulation 5.4.3 The display and joystick interface The joysticks and the display are interfaced to the dynamics computations by the control transputer and cpun,o- These cpus control the updating of the display and the actuation of the simulator. Fig. 5.4 and the following sections describe the procedures involved in the interface routines between the IRIS, the control transputer, and cpuo,o-the IRIS The IRIS displays the view from the cab of the excavator every ^th of a second using 0, data received from the interfacing cpus (cpun,n and the control transputer). It also samples the joysticks and then sends this data to the control transputer. The IRIS first reads the joysticks, obtaining four values. In JOINT mode, when each degree of freedom in the joystick represents a link rotation, the values represent angular velocity commands 0t- for the cab, boom, stick, and bucket links. In RESOLVED mode, the values represent the linear velocity commands along the radius axis (in a radial direction from the cab), the arc axis (tangential to the arc motion made by the end effector), in the z axis (in a Cartesian sense, the up-down motion of the end effector), and the angular velocity command of the bucket relative to the inertial frame (the ground plane). The velocity commands c (linear or rotational) are sent from the IRIS to the control trans puter through the RS232C line at 9600 baud. The control transputer runs an interrupt driven Chapter 5. The Real-Time Excavator Simulator 112 buffering routine for receiving the data. The IRIS then waits for the new values of 0, determined by the transputer array to be sent back from the control transputer. The are sent from cpuo,o to the control transputer which then sends these values on to the IRIS. Since the IRIS receives the angles using an interrupt driven buffer routine, the reception of the angles and the transmission of the joystick values can be accomplished simultaneously. the control transputer Once all four velocity command values have been received, the control transputer calculates the voltages (v). If the simulator is in JOINT mode, the commands are 8d, and so the voltage for each link is calculated using a PD controller according to the following equation: v = kv(8d -8) + kv(8d - 8). (5.5) The proportional gains kp and differential gains kv for each link of the excavator were chosen by trial and error to give a response similar to that obtained when the same PD controller was used on the excavator. The values chosen are given in the Table 5.3. If the simulator is in RESOLVED mode, the commands are linear, and so the inverse kinematics are calculated first to get the desired 8d. Once the 9d are obtained the same PD controller used in JOINT mode is used. The simulator 0 values are estimated by a simple differentiation of the 9 data sent from cpuo,o (this data can easily be read from the transputer array if desired, but the communication costs would increase). The 8d values are estimated by a simple one step integration using the previously estimated value of 8d and 8d. When u, has been calculated, it is transformed to the spool value a;, according to Fig. 5.3 and sent on to cpuo,o-When the angles for the next time instant have been calculated by the array, the data is sent from cpuo,o to the control transputer. Chapter 5. The Real-Time Excavator Simulator 113 cabin boom stick bucket kp 2.0 3.0 2.0 2.0 k 1.0 2.0 1.0 1.0 Table 5.3: kp and kv for PD actuator spool control the interface transputer cpuo.o The interface transputer cpuo,o, acts as the gateway between the display, the joysticks, the spool transputer, and the dynamics simulator. Initially it receives 0{ data from the cpus in row 0 through its east link in a pipelined manner. After all four values of 0{ have been collected they are sent through the south link to the control transputer and on to the IRIS, where they are used to display the excavator. 5.5 Excavator Simulations 5.5.1 Timing measurements Single cpu implementation A single cpu simulation of the excavator was initially performed as a reference from which to compare the parallel simulation. The graphics and joystick subsystems were disabled so that the dynamics computations alone were measured. One cycle, which is comprised of inertia matrix formation (equations 3.17 and 3.23), force vector formation (equations 3.27 to 3.31 and 3.38), simple actuator calculation (equations 5.2 to 5.4), and matrix solution (sec. 3.4), was taken as the best indicator of real-time performance. Results were not stored in a file and the stack pointer of the T800 was initialised to lie in the internal RAM so that all stack operations were faster than if the stack were in external memory. The results, given in Table 5.4, show that the theoretical and actual implementation percentages for each phase of the computation are very close, indicating that the complexity estimates are relatively correct. Note that the solver was more expensive in practice because an LU solver was actually chosen rather than the Cholesky solver since it was a proven and readily available Chapter 5. The Real-Time Excavator Simulator 114 IRIS initialization CONTROL CPU0,0 send 0 display 9 * read © initialize control send 9 —read 0 send 8 to array simulation loop Do Forever { get 6 from array -r-' read joystick c send joystick c —• read 0 ^ display 6 } read 9 * •"* send 9 ^ read joystick c calc calc v calc x send spool x send 9 ' read spool x send x to array 1 Figure 5.4: The algorithm organisation between cpuo,o, the control transputer and the IRIS subroutine. The implementation is a factor of 7.83 times slower than theoretical expectations (the theoretical estimate was made based on the multiply and add times of the transputer). It is suspected that the extra costs are due to array indexing and subroutine calls, as described in Chapter 3, section 3.5.1. Overhead costs in computations Since the actual computational costs exceed the theoretical costs by a factor of 6, it is worth while to investigate the source of these delays. The single cpu excavator simulation was used Chapter 5. The Real-Time Excavator Simulator 115 to time various routines. The algorithm in equation (3.17) for calculating H: for j = 1 to N { for i = j to N { [U-lQbcsC}itj PU = ci ^i-lC.-l,j t-1 -1,. t-1 ph x <i i > j i > j HI; = } Pr1 } was timed. The cost of an inner loop (calculating H\j) was found to be 275 /JS, in comparison to a theoretical minimum of 46.5 ps (a factor of 5.9 times slower). The operations involved in the loop can be found in the table below. The overall cost of calculating H was 2260 ps (including the cost of calculating A]^), while the theoretical estimate from equations (3.18) to (3.22) is 319.5 ps (n = 4). Thus the overall cost is 7.1 times slower for this phase. The actual cost for a matrix-vector multiply operation was found to be 65 ps, a factor of 5.4 times slower than the theoretical estimate of 12 /is. It was found that if the 9 multiplies and 6 adds required were calculated using normal variable names (i.e. arrayOO rather than Operation Actual time (ms) % Theory (x,+ ) Theory (ms) % inertia matrix 3.01 41.28 286 x,2094- 0.391 42.1 force vector 3.38 46.41 312x,253+ 0.439 47.26 actuator 0.35 4.50 44x,24+ 0.056 6.03 solver 0.56 7.74 32x,22+ 0.043 4.6 Total 1 cpu 7.28 100.0 674 x,5084- 0.929 100.0 Total + n cpus 1.353 100.0 143x,115+ 200.5/xs 100.0 Table 5.4: Theoretical and real timing estimates of one cycle of simulation on one cpu and the multiprocessor (n("2+1) -j. n cpus). Chapter 5. The Real-Time Excavator Simulator 116 operation type x,+ Theory (ps) Actual (/is) t2 = A\_x • tl p\j = n + kU m,- • f 3 vec subtract mat-vec multiply vec add mat-vec multiply mat-vec multiply cross product scalar-vec 3-9x,6+ 3+ 9x,6+ 9x,6+ 6x,3+ 3x 1.5 12 1.5 12 12 7.5 3 18 63 18 65 63 14 34 Table 5.5: Timing estimates for calculating H array[0][0] ) the cost was 14 ps, very close to the 12 ps estimate. When indexing was added (array[0][0]), the cost rose to 21 ps, due to the extra instructions the microprocessor has to perform to calculate the indexes into array. When the array value was addressed as a vector (necessary in order to write a general matrix-vector function that can deal with variable size arrays), explicit integer indexing in the subroutine was necessary to obtain the array element addresses. As a result the cost rose to 56 ps. The cost of calling the subroutine (saving the state on the stack, and returning the state when finished) was found to be 9 ^xs. From this investigation, it can be seen that cost reduction can be accomplished by making the matrix-vector functions very specific with respect to the size of the arrays used (e.g. 3x3 matrices). This will reduce the cost of this particular function by 66 %. Direct addressing of array elements is very efficient but the resulting code is very specific and not reusable. Similarly, removing the subroutine structure is not a good solution as the software will lose its generality and also become extremely difficult to debug. Parallel cpu implementation The parallel implementation was timed for one cycle, but information for each phase was un obtainable due to the difficulty of extracting times from individual cpus without a global clock as a reference. Another problem is the fact that the transputer array was fully connected into a cone mesh, and as a consequence no links were free to connect to the host and provide timing data. From Table 5.4 it can be seen that the overall time taken was 1.353 ms. This time is Chapter 5. The Real-Time Excavator Simulator 117 within the period required for real-time operation (2.5 ms for a stepsize of h = 0.01 s and a fourth order Runge Kutta). The parallel implementation is a factor of 6.75 times slower than theoretical expectations would indicate, again due to arrays, subroutine calls, and in the parallel case, the cost of communications. The single/parallel speedup factor was in practice 7.28/1.353= 5.38. Theoretically, this factor was (674x,508+)/(143x,115+)= 4.63. Interestingly, the actual speedup is better than the theoretical speedup in spite of the fact that communications costs were accounted for. Note that in this first implementation, it was decided that explicit (and therefore non-overlapped) communications should be used to simplify the programming and debugging. As a result, no communication costs are hidden. One explanation for the greater speedup factor is that there were fewer subroutine calls and smaller arrays than in the single cpu case. It appears that the cost of subroutine calls and array indexing is at least as costly as communications. Note that the addition of the IRIS and joystick subsystems slowed down the system so that when the cost of moving data back and forth was included, the time per cycle was increased to approximately 2.1 ms. This is due primarily to the slowness in the RS232C communication protocol. Real-time was achieved in spite of this, however, and the simulator is successfully running as a man-in-the-loop system. 5.5.2 JOINT and RESOLVED mode experiments Two experiments were performed using the parallel simulator as a substitute for the excavator. The concept of resolved mode teleoperated control was compared to joint mode control. In both cases PD control was used to control the spool value x to approach some desired motion. The experiment was designed by R. Frenette, a staff engineer. The following two subsections outline the results. Chapter 5. The Real-Time Excavator Simulator 118 JOINT mode control In joint mode control a profile of desired angular velocities 0d for each link is specified from a file (rather than read from the joysticks) and fed into the control transputer. The control transputer calculates the appropriate x to achieve 6d and then feeds x into the array. The resulting values of desired and actual 9 and 0, and v and x are recorded by the control transputer. Figs. 5.5, 5.6, and 5.7 show the appropriate plots for the cabin, boom, stick, and bucket. Fig. 5.5 shows the desired and actual 0. A steady state error is present due to the lack of an integrator in the PD controller and the deadband in the spool. Note that there is a steady state error in the beginning of the simulation for the boom and stick due to the force of gravity. Gravity loading decreases the actual angle (dotted line) because the link reacts to gravity, pulling the link downward, while the pressure in the hydraulic line is still building up from its initial value of zero. This results in an initial decrease in the angle. Once the line pressure has built up, the link has a steady state error. This initial error is not present in the cabin because its rotational axis is orthogonal to gravity. It is not present in the bucket plot because the bucket has a much smaller gravity load due to its smaller mass and a smaller inertia due to its shape, and as a consequence there is a smaller drop in the angle due to the lack of pressure in the lines. All plots in Fig. 5.5 show a steady state lag at the end of the simulation as the line pressures have already built up to their normal values. This particular error can be eliminated by adding integral control. Fig 5.6 shows the joint velocities for each of the links. The cabin demonstrates no oscillation at the start since it is orthogonal to gravity and is thus immune to the initial lack of pressure. At the end of the simulation, the velocity of the cabin oscillates due to the inertia of the machine. The inertia causes an overshoot followed by a reaction from the spool, resulting in an oscillatory response. The boom and stick oscillates initially due to gravity loading and the initial lack of pressure. At the end of the simulation the boom and stick velocities are smooth because the deadband in the spool causes the boom and stick spools to be closed. The bucket velocity has a small initial oscillation due to gravity. Chapter 5. The Real-Time Excavator Simulator 119 Time (second) Time (second) Figure 5.5: JOINT mode 6 response in the simulator. Fig. 5.7 shows the spool (x) and voltage (v, dotted) plots. The v curves are the desired drive voltages, and the x curves include the deadband in the response. The boom v attempts to give a positive drive voltage, but only a zero value of x is produced, because of the deadband. At the end of the simulation, a negative v is desired to close the steady state error, but the deadband causes x to be zero, which results in a zero 6. These plots have not yet been compared to those of the excavator, but initial subjective re sponses from users of the complete simulator (including the joystick and graphics) indicate that the simulator reponse is felt to be realistic. Chapter 5. The Real-Time Excavator Simulator cabin (des,act) o •a > boom (des,act) u •a •a > 5 10 15 Time (second) stick (des,act) 20 5 10 15 Time (second) ^ bucket (des,act) 5 10 15 Time (second) 5 10 15 Time (second) 20 Figure 5.6: JOINT mode 6 response in the simulator. Chapter 5. The Real-Time Excavator Simulator cabin 0.05 0 5 10 15 Time (second) stick 5 10 15 Time (second) o o o & 09 -0.05 0 0.05 o > /- N E "o o &. K3 -0.05 5 10 15 Time (second) bucket 5 10 15 Time (second) Figure 5.7: JOINT mode v and x response in the simulator. Chapter 5. The Real-Time Excavator Simulator 122 RESOLVED mode control In RESOLVED mode, the desired profile represents the linear velocity commands along the radius axis, the arc axis, the z axis, and the angular velocity of the bucket relative to the ground. The setup was similar to the JOINT mode experiment except that inverse kinematics calculations were necessary to derive the 0d from the linear velocity values. The experiment required an initial 5 seconds of no movement followed by 5 seconds in which the cabin rotated 15 degrees and the end effector moved from a radius of 5m to 6.5m at constant radial velocity while maintaining its original height above the ground. The bucket was required to maintain its angle. The last 5s required all links to maintain their positions. Figs. 5.8, 5.9, and 5.10 show the link reponses. Fig. 5.8 shows the joint angle reponse. The desired cabin angle (continuous line) is linear. The cabin response shows a significant lag. To maintain a constant end effector height, the boom and stick show a nonlinear joint angle response. As with the JOINT mode experiment, the boom and stick have an initial gravity loading response which when coupled with the lack of initial pressure causes the boom and stick to sag. At the end of the simulation, however, there is a lag due to the deadband in the spool. Fig. 5.9 shows the desired and actual integrals of the linear velocity curves. The radius, the cabin rotation, and the bucket behave in a similar manner to the results of the JOINT mode experiment, with their initial gravity loads and line pressure buildup responses, followed by final steady state lags. The Z curve indicating the height of the end effector shows an initial sag when the boom and stick sag, followed by a positive error due to the boom and stick lag response. The bucket sags initially due to the initial pressure, but eventually lags the desired response. Fig. 5.10 shows the desired v (dotted) and actual x (continuous) curves. As in the JOINT mode, the deadband in the x response has a large effect because it causes the spool to remain closed when the link has a steady state error and prevents a reduction in this error by the control algorithm. As the boom and stick extend, the spool opens wider to drive the link faster. This Chapter 5. The Real-Time Excavator Simulator Time (second) Time (second) Figure 5.8: RESOLVED mode 6 response in the simulator, is required to make the link travel at a constant radial velocity. Chapter 5. The Real-Time Excavator Simulator 124 radius (mm) (des,cat) 6000-4000 -5 10 15 Time (second) Z (mm) (des,cat) 5 10 15 Time (second) 20 20 -100 5 10 15 Time (second) bucket (dee) (des,cat) 5 10 15 Time (second) 20 20 Figure 5.9: RESOLVED mode radius, cabin rotation, height, and bucket rotation responses in the simulator. Chapter 5. The Real-Time Excavator Simulator 125 0.05 o ,—s E u N—' o P. ca -0.05 0 0.05 o > S—s E CJ o OH ca -0.05 5 10 15 Time (second) stick 5 10 15 Time (second) 0.05 boom > u o ca -0.05 0 0.05 o > o "o o PH ca -0.05 5 10 15 Time (second) bucket 5 10 15 Time (second) Figure 5.10: RESOLVED mode v and x response in the simulator. Chapter 5. The Real-Time Excavator Simulator 126 5.6 Summary In this chapter we have described an implementation of the excavator simulator. The complete computer architecture with joysticks, graphics, and simple actuators has been outlined. The formulation and computer architecture suggested in Chapters 3 and 4 were used to produce a system that met the real-time objective. Theoretical parallel time cost versus the actual parallel implementation cost showed the imple mentation was 6.75 times slower than expected. It is speculated that this inadequacy is due to array indexing and subroutine calls. A theoretical speedup of 4.63 was estimated for the parallel simulation over the single cpu simu lation, while in practice a factor of 5.38 was achieved. There were less subroutine calls and array index calculations in the parallel implementation than in the single cpu implementation, and it is assumed that this produced the better than expected parallel/single cpu speedup. This was in spite of the extra costs incurred by the parallel implementation due to link communications. Finally, JOINT mode and RESOLVED mode simulations were carried out to demonstrate the usefulness of the simulator and to help design the joystick controller. The results have been helpful in tuning the controls, and have been implemented on the excavator. Subjective observations have indicated that the behaviour of the simulator is close to the excavator, even though the true hydraulic system was not simulated. Future experiments, however, will require a more accurate model of the hydraulics. Chapter 6 Conclusions and Further Work 6.1 Summary of the Thesis A multibody dynamics formulation has been developed for the purposes of real-time simulation of large scale robotic mechanisms such as excavators. The formulation models the rigid body dynamics of any arbitrary tree structured mechanism (excluding the actuator dynamics, which have been examined in other theses [Sepehri 90]). It is limited at present to single degree of freedom rotational joints, although later in this chapter, the extension to multi-degree of freedom rotational and prismatic joints is described. Closed chains are also not included in the present formulation, but are discussed later in this chapter. The Newton Euler State Space method developed by Hemami [1982] is used as the basis for the formulation. This method is an example of the orthogonal complement approach, which describes the dynamics by projecting an initial description of the primitive equations of mo tion (the derivatives of translational and angular momentum, and the kinematic equations) from angular and translational Cartesian coordinates w and x to coordinates such as relative Euler angle accelerations 6, using a Jacobian velocity transform. The primitive equations are described using Newton's and Euler's equations for momentum derivatives. The internal con straint forces and torques in the hinges are also eliminated by this projection. Other examples of the orthogonal complement approach use the Lagrangian and kinetic energy to justify the projection [Angeles 1988], [Kim 1986]. Hemami [Hemami 82] initially developed the method using mixed coordinates. The transla tional forces were referred to the inertial frame and the rotational equations were referred to the body coordinate frame. By referring all equations to body coordinates, an efficient 0(n3) 127 Chapter 6. Conclusions and Further Work 128 algorithm was developed in this thesis for inertia matrix formation. Buchner [Buchner 86] also referred his translational force equations to body coordinates, but the equations were not devel oped to the recursive vector cross product form (pj • Xc\ •). As a consequence, the equations were computationally less efficient. The recursive nature of the kinematic and momentum derivative equations has been used in this thesis to derive a single cpu algorithm which is similar to An geles' algorithm when defined for a single chain. The algorithm is relatively efficient, although other algorithms [Lilly 91] are better for single cpu implementation. This technique of referring each equation to body coordinates was developed independently from Angeles and Ma [1988], who used the body-referred Jacobian transform equations for the dynamics of a single chain. Their work was developed using the Lagrangian approach and Kane's equations to identify the equivalent projection matrix. The work in this thesis has been developed using Newton-Euler principles, as proposed by Hemami and Buchner, but it continues those developments until the very efficient recursive cross product form is established. In addition, the single cpu algorithm models tree structured mechanisms by adding a connectivity matrix to describe the topology of the mechanism. A new 0(n2) single cpu algorithm for the force vector has been developed which has its origins in the 0(n) algorithm developed by Angeles, Ma and Rojas [1989]. The algorithm makes use of matrix calculations performed by the inertia matrix algorithm. The algorithm defines the force vector calculation as the multiplication of the HT projection matrix with the momentum derivative vector. As a consequence of using the H matrix (already calculated in the inertia matrix algorithm), this force vector algorithm is only useful when the forward dynamics are calculated. Simulations of the PUMA 600 manipulator and the human torso (with a head and two arms) using open loop inverse dynamics control appeared in Chapter 3 as examples of the veracity of the formulation. The need for real-time simulation has led to the development of a parallel formulation (Chapter 4) derived from the single cpu algorithms from Chapter 3. Multiprocessor algorithms were developed for inertia matrix formation, force vector formation, matrix solution, and a fourth order fixed step size Runge Kutta integration routine. These algorithms are of 0(n) complexity, Chapter 6. Conclusions and Further Work 129 and together they form a parallel formulation which is more efficient (Figs. 4.8 and 4.9) than the parallel formulations with which we have compared (given the implementation assumptions stated in section 4.7) The spatially recursive equations for calculating the vectors and forces required in the dynamics naturally suggest a pipeline approach for the computational archi tecture. This is most easily seen in the computations of row 0 and row 1 of the computer architecture. The inertia matrix algorithm uses a 2d "("+*) triangular nearest neighbour architecture, as all data communication in the algorithm is between nearest neighbours. The force vector algorithm uses an n cpu pipeline, and shares data with the inertia matrix cpu array. The matrix solver is a feedforward systolic algorithm [Jainandunsing 89] that calculates the solution in 4n cycles, compared to 7n — 5 cycles using Liu and Young's systolic Cholesky factorisation plus backsubstitution method [Liu 83]. The feedforward solver requires "("+*) + n cpus arranged in a triangular nearest neighbour array. A good fit is thus accomplished between the topologies necessary for equation formation and matrix solution, enabling a smooth integration between the formation and solving phases. The proposed algorithms have been implemented on a computer architecture consisting of a host cpu which controls a cluster of Inmos T800 transputers interconnected by a crossbar network. Both the formulation and the computer architecture are relatively general in nature, permitting more complex mechanisms to be modeled and other types of robotic tasks to be computed. A Caterpillar 215B excavator was simulated in real-time using fourteen transputers connected in a triangular, 2d nearest neighbour grid network. Another transputer was used to interface between the grid, the host computer, and the control transputer which modelled the spool valves of the actuators. The control transputer connected the interface transputer to the graphical display computer and the joysticks (Fig. 5.2). Single cpu and multiple cpu implementations were developed for the excavator. Simulations testing joint mode and resolved mode joystick control were successfully performed. Timing measurements of the excavator simulations showed that the parallel implementation was able to perform in real-time and that there was a significant speedup relative to the single cpu Chapter 6. Conclusions and Further Work 130 implementation. 6.2 Concluding Remarks At the beginning of this thesis the objective and scope of the study as well as the measures by which its success could be evaluated were stated. The objective was to develop and implement a real-time simulator for the Caterpillar 215B excavator. The scope extended beyond this objective to include a relatively general formulation which would permit implementations of more complex multibody systems. Initial limitations were to exclude closed chain mechanisms and model only single degree of freedom rotational joints. A parallel computer architecture was deemed the best methodology for achieving real-time speed. The thesis work can be divided into two tasks - the theoretical distribution of the computations, and the implementation of the theoretical distribution on a computer architecture. Two criteria were chosen to evaluate these tasks - computational complexity, an objective measure that can be used to compare the theoretical computational costs of other formulations; and the quality of the implementation, a subjective analysis of the computational architecture examining its expandability, homogeneity, efficiency, and how closely the computer architecture matches the ideal computational architecture. The recursive form of the equations for calculating the momentum derivative and kinematics dictates the theoretical distribution of the computations, naturally suggesting a pipeline/mesh architecture. In the case of 0(n + [log2 n\) formulations, the referral of the coordinates to the inertial frame destroys this data dependence, enabling these formulations to use the recursive doubling algorithm to get an efficient algorithm. We have shown, however, that the overhead in referring the equations to the inertial frame results in high initial costs, making our formulation more efficient for systems with moderate numbers of degrees of freedom. The ideal architecture has been directly implemented using the transputer array (whereas [Fijany 89] chose a mesh even though a hypercube was ideal). A direct implementation has Chapter 6. Conclusions and Further Work 131 enabled us to make conclusions about the theoretical distributed architecture. Since the im plementation architecture is a mesh, it can be easily expanded to model robots with varying numbers of links by adding a new column of cpus for each new degree of freedom. As the imple mentation architecture's structure is homogeneous, the architecture does not need to change as the formulation moves from the equation formation phase to the solving phase, thus obviating the need for using either reconfigurable architectures or two separate architectures. The implementation using Inmos T800 transputers has produced some useful information about tightly coupled mutiprocessor arrays. Timing measurements show that real-time performance is achievable for the excavator simulator in spite of the large overhead incurred due to subroutine calls and array index calculations (a factor of 6 slower). Inline code and direct array element addressing can solve this, but would do so with a loss of generality in the software. The commu nication costs can be quite significant and it is desirable to minimise the delays by overlapping the communications with computations (although discussed, in this first implementation the overlap technique was not used). 6.3 Contributions to the Literature The thesis has contributed to the literature in both of the areas described in the last sec tion. On the topic of equation development and the theoretical distribution of computations, a contribution has been made: — in the development of an efficient recursive set of equations for generating the cross product form of the projection matrix using body coordinates, from Newton-Euler rather than Lagrangian principles. — in deriving an efficient single cpu 0(n3) formulation of the inertia matrix for both single chain and tree structured mechanisms. — in deriving an efficient single cpu 0(n2) formulation of the force vector for both single chain and tree structured mechanisms, which must be used in the context of a forward dynamics algorithm as it uses results from the formation of the inertia matrix. Chapter 6. Conclusions and Further Work 132 — in deriving an 0(n) mesh structured multiprocessor formulation of the inertia matrix for single chain mechanisms. — in deriving an 0(n) pipeline multiprocessor formulation of the force vector for single chain mechanisms. The inertia matrix, force vector and solver algorithms together are more efficient than other algorithms for up to fifteen degrees of freedom (the relative efficiencies will depend on the details of the various implementations). The real-time implementation of the simulator has led to the following innovative aspects: — the development of a flexible computer architecture that is able to compute the dynamics as well as other tasks. — the development of a computational architecture which succesfully integrates the equation formation and solving phases. — the development of a complete real-time simulation facility for the UBC Teleoperation Lab oratory. — an implementation of Jainandunsing's feedforward systolic solver. 6.4 Topics Requiring Further Exploration In this thesis we have developed a new parallel formulation for rigid body dynamics. Although the model is limited to rigid bodies with a single degree of rotational freedom, it is a foundation upon which more complex behaviour can be added. The following sections consider some areas of further research that would be logical extensions: multiple degree of freedom joints, a parallel architecture for branched mechanisms, closed chain models, and actuators and friction. 6.4.1 Multiple degree of freedom joints Joints with more than one degree of freedom per joint can be included by modifying the Jacobian velocity transformation vector Hij = [c|- • p\jXc\j], which is a 6x 1 vector relating the Cartesian Chapter 6. Conclusions and Further Work 133 jth joint type size Floating base body Revolute joint Translational joint Cylindrical joint Universal joint Spherical joint 0 A) . 4 & J . Pi* x Cli/z \ o 1 C%i,3l* J c* ., 0 1 . PtJ X Ci,j/z clj/y J c' c' 1 . X C!,i/z Pi,i X ^/v J C»',j/^ ciJ/y ciJ/z Pij*chte PiJxchh Pi,ixch/x. 6x6 6 x 1 6 x 1 6x2 6x2 6x3 Table 6.1: Hij for various types of joints velocity vector [u>; i,] to the rotational coordinate The Hij used throughout the thesis represents the Jacobian vector for a tj, which is a single degree of freedom rotational joint. Kim and Vanderploeg [Kim 86a] have derived the Hij for other types of joints. Table 6.1 contains a modified form of Kim's matrices in which the Hij have been referred to the ith coordinate frame, so that the terms are applicable to the algorithms presented in this thesis. Note that we have denned c)^ = A'zj, cj-^ = A'jrj and c\^x = A)x3^ where z, y, and x are unit axes in the jth coordinate frame and Aj is a function of the appropriate 6Xiy< or z. In the computation of M for single degree of freedom joints, the Hij vectors are elements of the [p x C] and [C] matrices. For more complex joints, the development of the C and p x C matrices proceed with the matrices from Table 6.1 as the new Hij. This can be shown by developing new kinematic equations for the angular velocity, originally described in Chapter 1. These equations must be modified to include more than one degree of freedom in the joint. As an example, consider a body i in which the ith joint has rotational degrees of freedom in y and z axes. The angular velocity kinematics are: Mfi = M-iwtlo + Wi/z + vfr/y (6-1) <o = ^Li^iZl + w.-.o x ^"tf,-/, + wj>0 x yji9,-/v + ar'd'.y, + y5«i/lf (6.2) Chapter 6. Conclusions and Further Work 134 The A\_x matrices must also be redefined to include the sequence of 0,/y and 0,y2 rotations. Once the w and w vectors have been calculated, the L, mp and W vectors or matrices, used for calculating the force vector, are calculated in the same manner as in Chapter 3. Consider a single chain with n links and n joints, some of which have multiple degrees of freedom. The total number of degrees of freedom in the system is m (n < m) due to the extra degrees of freedom. Including these types of joints in the parallel formulation requires a re-examination of the distribution of the computations. One issue that must be considered is the timing of the pipelined sequence of operations. It may be necessary to add a number of cpus for each extra degree of freedom in the system if the time delay for any operation in the pipeline is to remain comparable to the other operations occurring simultaneously. A first examination of the problem indicates that for each extra degree of freedom, a row of n — i cpus would be added next to the ith row for the computation of H, and a column of i cpus would be added next to the zth column for the HTH computation. To compute the inertia matrix, it may be necessary to start with an array of m("+1) cpus, some of which will be idle during different phases of the computation. 6.4.2 A parallel architecture for the branched formulation To simulate branched topologies with the multiprocessor mesh, it is necessary to examine the single cpu algorithm developed in Chapter 3. From equation 3.14, z 0 0 0 0 A\z z 0 0 0 A^A\z A\z z 0 0 A\z 0 0 z 0 A\A\z 0 0 A\z z where the C matrix (as well as the pxC matrix) shows the same structure as the connectivity matrix U~l. One can imagine each row of the C matrix being produced by one of the row 1 cpus in the inertia matrix array. The pipeline in row 1 is broken up, however, since cpui^, calculating A\z, does not make use of any terms produced by the cpu in front of it in the Chapter 6. Conclusions and Further Work 135 pipeline (the third row in the C matrix). It receives its input vector from cpu^i since body 4 is connected to body 1 in the topological description. The pipeline must be altered to provide an extra link from cpuij to cpui^. The pipeline of cpus in the torque calculator (row 0 of the mesh) must also be altered to provide this connection. As a general rule, it is necessary for rows 0 and 1 to conform to the topology of the mechanism. Another consideration is the requirement for the HTH calculation and the systolic solver to use a full triangular nearest neighbour architecture. Connections must be provided between cpu,,3 and cput)4 in order that the terms may be accumulated in the HTH computation, and so that the systolic matric solver can perform its embedded rotations. Thus even the cpus which would contain 0 in the C matrix computation need to be present. It is possible to use the existing network topology to pass the data in rows 0 and 1, but this would require some cpus to transmit data destined for other cpus which are not relevant to their own operations, and this would cause difficulties in the timing of operations. A better solution would be to provide an extra common bus between rows 0 and 1 so that data that is not destined for an adjacent cpu can be routed via this bus to its destination. This would only be necessary during the [C p X C] computation phase in row 1 cpus, and the [L mp] computation for row 0 cpus. After this, the nearest neighbour connections would be used for the subsequent phases. At present this connectivity is not provided in the existing computational architecture, but it should not be difficult to implement, as no extra communication links are required from the transputer. The communication links across rows 0 and 1 need to be connected to a common bus such as the VME bus itself, or perhaps the secondary serial S-bus. Fig. 6.1 illustrates the architecture for a four body branched system. Note that the bus provides the path for cpus &i and an to connect to 63 and 013. 6.4.3 Closed chains The simulation of closed chain mechanisms can involve both analytical and numerical methods. The primary problem is the selection or identification of the set of independent coordinates Chapter 6. Conclusions and Further Work 136 Figure 6.1: Architecture for branched four body system that are used to describe the system. This problem is exacerbated when joints lock together or separate, causing a time varying topology in which the variables belonging to the set of independent coordinates change membership with time. Initially the loops in the closed chain system are cut to form a tree topology. The independent coordinates are then selected, and the constraints forces which close the loops are added to make the motion of the tree system consistent with the closed chain system. Chapter 6. Conclusions and Further Work 137 Murray and Lovell [1989], Ibrahim [Ibrahim 1988] and others have approached the problem by using a tree structure as the initial representation of the closed chain mechanism (loops are cut to form a tree). The inertia matrix and force vector for the tree are developed, using a set of independent coordinates representing the tree. A reduction of this set of coordinates is then made, so that the remaining subset of coordinates can represent the closed loop structure (there are less degrees of freedom in a closed loop system). The reduced set of coordinates may often be chosen by a logical examination of the topology, followed by the formation of the constraint equations used to close the loops in the tree structure (in order to be equivalent to the original closed chain system). In cases when the appropriate coordinates are not obvious, they may be found numerically (e.g. QR decomposition of a matrix representing the system constraints). Once the reduced set of coordinates is identified, a projection matrix is formed which projects from the set representing the tree to a subset representing the closed chain. The following equations describe the approach Murray and Gilbert used to formulate the closed chain dynamics of an arbitrary mechanism. Following this, we discuss how this method might be incorporated into the algorithms of Chapter 3. Consider a closed chain mechanism with m rigid bodies (one degree of freedom (dof) per joint) with n true dof (some of the dof are constrained by the closed loops, and are immobile, so n < m). The n one dof joints are driven by n actuator torques r, producing an n X 1 vector of joint displacements q. An open chain equivalent system (ignoring the constraint forces for now) has m dof and m X 1 joint displacements qr, one for each body. The equations of motion for this open chain system are (relating them to Chapters 2 and 3): where MT is the tree mechanism system inertia matrix and Hr is the force vector of Coriolis, centrifugal and gravity forces. To equate the open chain model to the closed chain model, the virtual work done must be equal, and so According to D'Alembert's principle, no net work can be done by the applied and actuating Tr = MT{qr)qT + Hr(qT,qr) (6.4) SqTr = SqTrr (6.5) Chapter 6. Conclusions and Further Work 138 forces for any virtual displacement consistent with the constraints, and so as the set of q is changed to qr, r changes to rr in order that the system virtual work remains the same. Murray shows that qT is a function of q, and more specifically, that the Jacobian from this relationship can be defined as qr = f(q) (6.6) qT = Gq (6.7Equation 6.7 represents the transformation from the open loop coordinate set to the closed loop subset. G is the m x n Jacobian constraint matrix describing the forces necessary to close the tree structured mechanism so that it is equivalent to the closed chain mechanism. Substituting Gq for qr in the equation of virtual work, it is possible to show that r = GTrr (6.8) The equations of motion can be rearranged to yield r = GT[Mr(qr)qT + Hr(qr,qr)} (6.9) The time derivative of qT is qr = Gq + Gq (6.10) The equation of motion can then be rewritten as T = GT[Mr(f(q))(Gq-rGq)-rHr(f(q),Gq)] (6.11) = Mcq+Hc (6.12where Me = GTMr(f(q))G (6.13) Hc = GTMr(f(q))Gq + GTHT(f(q),Gq) (6.14) In the above development, we assume collocated links (all measured joint coordinates are also ac tuated), whereas Murray and Gilbert also consider non-collocated manipulators. Non-collocated Chapter 6. Conclusions and Further Work 139 manipulators would project the system to yet another set of coordinates, say qm, which would produce a non-symmetric inertia matrix, as equations 6.6 and 6.9 would be functions of qm and Gm. If it is possible to measure all the qr and qr from the tree structured system, then Mr{qT) is obtained from the equations of Chapter 3, and so Mc can be computed as Me = GTHTMHG (6.15) where H is from Chapter 3 and M is the square matrix of Js and mis. Once G is computed, P = HG replaces H as the pre and post multiplier matrix transformation in the inertia matrix algorithm. The computation of Hc is more complicated due to the presence of the Gq term in 6.10. Gilbert proposes to calculate Hc from a closed chain inverse dynamics algorithm by setting q to zero in a similar way to the open loop inverse dynamics algorithm described by Walker and Orin [Walker 1982]. The incorporation of the above procedure into the algorithms developed in Chapters 2, 3, and 4 requires a number of issues to be addressed. These include the computation of Hc, the computation of G (which requires the identification of q), and the parallel computation of P and its effect on the computer architecture. Since the final size of the inertia matrix is n X n, the number of cpus used in the calculation of the open loop Mr (^±H) is more than that required by the systolic solver phase cpus). It is thus necessary to find the best way to compute P so that, if possible, only n(n2+1) cpus are needed for all phases of the computation. 6.4.4 Actuators and friction effects Currently, the hydraulic actuators are modeled as simple rotary actuators which are not coupled to each other. As a consequence, actuator i's equations are calculated in cput>o. Once a distributed version of the true actuation equations is developed, these can be calculated in row 0 cpus, or another row of cpus can be added on top of row 0. Friction terms, may be a function of the velocity 0, (e.g. viscous friction) or a function of reaction forces /„ and inertial terms. Viscous friction may be independently computed because each term is a function of only local Chapter 6. Conclusions and Further Work 140 information (e.g. the 0, variables are produced by row 0 cpus, and so they will be readily available for the friction computations). The resulting friction vector can simply be added to the b vector. Dry friction is more complex and involves the redistribution of variables produced by other cpus in the pipeline. Appendix A Single Chain Dynamics The matrices describing the equations of motion are (for a five body fixed chain): U = ^2-^2,2 0 0 0 0 ^2,2 '•^3^3,3 0 0 0 0 -#3,3 —A^R4 4 0 ;J = 0 0 0 #4,4 0 0 0 0 -#5,5 I 0 0 0 0 m\I -I / 0 0 0 m2I 0 --J / 0 0 ;M = nisi 0 0 -I / 0 mil 0 0 0 --J 4 m5I Qw = Q = +^i«o)2*i\i + ^o(^,0)2^33,3 -^(*33,0)2^,4 + W*U)2kU -A°4(wifl)%5 + A°6{wlfl)*kifi _ Al'klr 0 0 —•'^•1^1,2 A2k22 0 ;/ = W,0 x Jfw{fl mig wlo x J2w20 m2g X J^W^Q ;g = m3g w4,0 x Jfwio m4g w!,0 m5g 0 — ^2^2,3 ^3^,3 0 0 0 --^3^3,4 -^4^4,4 -A°JU 0 0 0 0 A°M,s 141 Appendix A. Single Chain Dynamics 142 u-1 = C = 7 0 0 0 0 7/000 7/700 / / / III z 0 z A\z A\z A\z I 0 / / 0 0 z \E = I -A\ 0 0 0 0 0 0 0 0 0 0 0 0 0 A\z A\z A\Z Z 0 A\z A\z A\z A\z z o o A\h\2 0 A\d\x A\d\2 -^3^3 ;//2 = 0 0 0 A2 0 0 I 0 0 I -Ai 0 0 I -/ 0 0 / -/ 0 0 / -/ 0 0 / 0 0 0 0 0 0 A\d\^ A2d32 A3d\3 At\kA^ 0 0 0 0 A°Atl A%d\2 A%d\3 A°4diA Afklf An algorithm for calculating [(t7-1Qtu)i>cs], is 0 0 0 [(U-xQw)bc3 ]i = [(U-lQw)bcs] i-i + {w\ -1 x2ii--1,0/ 'i--1 -l,t ~k\A 0 0 0 0 -^1^2,1 ^2,2 0 0 0 A\d2X A2 d\2 ^3,3 0 0 A\d\-y A2d32 ^3"4,3 K4,4 0 A\d\t-y A\d\2 •^4^5,4 ^5,5 7/i = -QT. i = l,...n (A.l) Appendix B Branched Chain Dynamics The following matrices are the coefficients for the multibody system represented in Fig. 3.1. <i -^#2,2 0 -A\R\<4 0 iVi = E = U = Q = o o 0 0 -A1 A2 I -I #2,2 ~ -^3^3,3 0 0 0 #3,3 0 0 0 0 0 0 0 -A\ I 0 0 0 0 0 1 0 -A2 0 0 I 0 0 -I I -10 0 0 0 0 0 0 0 1 0 0 0 0-7/ Afk\a 0 _AOJI 40L2 AV1,2 A2K2,2 0 -A%ll 0 0 0 0 0 0 #4,4 —AiRlj 0 #1,5 w\fi x Jlw\fl w2fl X ^2 W2,0 ™3,0 X ^3^3,0 J4w4fl ;/ = W 4,0 Qw = 0 0 5,0 x J5 wl,0 +A°«0)2*i\i -^«o)2<i\2 + A°2(wi0m2 ~A°2{wl 0)2>2)3 + A§(«>!)0)2*33,3 -^Ko)2<!,4 + ^«0)2^4 0 0 77l2g "l3g m4g m5g 2'2,3 -^3K3,3 0 0 0 0 /14K4,4 0 0 0 AO 75 jOp 143 Appendix B. Branched Chain Dynamics 144 7 0 0 0 0 7 7 0 0 0 U-1 = 7 7 7 0 0 7 0 0 7 0 7 0 0 7 7 ; H2 = UT, 771 = -Q3 Bibliography [Amin-Javaheri 88] M. Amin-Javaheri and D.E. Orin, Systolic Architectures for the Manipula tor Inertia Matrix. IEEE Trans. Systems, Man, and Cybernetics, Vol. 18, No. 6, November/December 1988, pp 939-951. [Angeles 88] J. Angeles and O. Ma, Dynamic Simulation of n-Axis Serial Robotic Ma nipulators using a Natural Orthogonal Complement. Int. J. Robotics Res., Vol. 7, No. 5, October 1988, pp 32-47. [Angeles 89] J. Angeles, 0. Ma, and A. Rojas, An Algorithm for the Inverse Dynamics of n-Axis General Manipulators Using Kane's Equations. Computers Math. Applic. Vol. 17, No. 12, 1989, pp 1545-1561. [Armstrong 79] W.W. Armstrong, Recursive Solution to the Equations of Motion of an n-Link Manipulator. Proc. 5th World Congress on the Theory of Machines and Mechanisms, Vol. 2, 1979, pp 1343-1346. [Bae 87a] D.S. Bae and E.J. Haug, A Recursive Formulation for Constrained Mechan ical System Dynamics Part I: Open Loop Systems. Mechanics of Structures and Machines, vol. 15, 1987, pp 359-382. [Bae 87b] D.S. Bae and E.J. Haug, A Recursive Formulation for Constrained Mechan ical System Dynamics Part II: Closed Loop Systems. Mechanics of Struc tures and Machines, vol. 15, 1987, pp 481-506. [Bae 88a] D.S. Bae, R.S. Hwang, and E.J. Haug, A Recursive Formulation for Real-Time Dynamic Simulation. Advances in Design Automation, DE-Vol. 14, ASME Design and Automation Conference, September 1988. [Bae 88b] D.S. Bae, J.G. Kuhl, and E.J. Haug, A Recursive Formulation for Con strained Mechanical System Dynamics Part III: Parallel Processor Imple mentation. Mechanisms, Structures and Machines, 16 (2), 1988. [Baumgarte 72] J. Baumgarte, Stabilisation of Constraints and Integrals of Motion in Dy namic Systems. Computer Methods in Applied Mechanics and Engineering, 1, 1972, pp 1-16. [Bayo 88] E. Bayo, J. Garcia de Jalon, and M.A. Serna, A Modified Lagrangian For mulation for the Dynamic Analysis of Constrained Mechanical Systems. Computer Methods in Applied Mechanics and Engineering, 71, 1988, pp 183-195. [Bodley 78] C. Bodley, A. Devers, A. Park, and H. Frisch, A Digital Computer Program for Dynamic Interaction Simulation of Controls and Structures (DISCOS). NASA Technical paper 1219, May 1978. 145 Bibliography 146 [Brandl 86] [Buchner 86] [Faddeev 59] [Featherstone 83] [Fijany 89] [Fuhrer 89] [Haug 89] [He 89] [Hemami 82] [Ho 74] [Hooker 66] [Hooker 70] [Hooker 73] • H. Brandl, R. Johanni, and M. Otter, A Very Efficient Algortihm for the Simulation of Robots and Similar Multibody Systems Without Inversion of the Mass Matrix. IFAC/IFIP/IMACS International Symposium on the The ory of Robots, Vienna, 1986. H.J. Buchner, Control of Robot Manipulators on Task Oriented Surfaces by Nonlinear Decoupling Feedback and Compensation of Certain Classes of Disturbances. PhD Thesis, Department of Electrical Engineering, Ohio State University, 1986. V.N. Faddeev, Computational Methods of Linear Algebra. Dover, New York, 1959. R. Featherstone, The Calculation of Robot Dynamics using Articulated-Body Inertias. Int. J. Robotics Res., Vol. 1, No. 1, pp 13-30, Spring 1983. A. Fijany and A. Bejczy, A Class of Parallel Algorithms for the Compu tation of the Manipulator Inertia Matrix. IEEE Trans. Robotics and Au tomation, Vol. RA-5, No. 5, October 1989, pp 600-615. C. Fuhrer, B. Leimkuhler, Stabilized Differential-Algebraic Formulation of the Equations of Motion of Constrained Mechanical Systems. NATO Ad vanced Research Workshop on Real-Time Integration Methods for Mechan ical System Simulation, Snowbird, Utah, August 1989. E.J. Haug, Computer-Aided Kinematics and Dynamics of Mechanical Sys tems Vol. I: Basic Methods. Allyn and Bacon, Massachusetts, 1989. X. He, and A.A. Goldenberg, An Algorithm for Efficient Computation of Dynamics of Robotic Manipulators. Advanced Robotics: 1989, Proceed ings of the 4th International Conference on Advanced Robotics, Columbus, Ohio, June 13-15, 1989. H. Hemami, A State Space Model for Interconnected Rigid Bodies. IEEE Transactions on Automatic Control, Vol. AC-27, No. 2, April 1982, pp 376-382. J. Ho, The Direct Path Method for Deriving the Dynamic Equations of a Multibody Flexible Spacecraft with a Topological Tree Configuration. AIAA Mechanics and Control of Flight Conference, paper no. 74-786, California 1974. W.W. Hooker and G. Margulies, The Dynamical Attitude Equations for an n-Body Satellite. J. Astronautical Sciences, 12 (1965), pp 123-128. W.W. Hooker, A Set of r Dynamical Attitude Equations for an Arbitrary n-Body Satellite having r Rotational Degrees of Freedom. AIAA Journal, Vol. 8, No. 7, July 1970, pp 1205-1207. W.W. Hooker, Equations of Attitude Motion of a Topological Tree of Bodies, The Terminal Members of which may be Flexible. Technical Report LMSC-D354938, November 1973, Lockheed Missiles and Space Company, Palo Alto, California. Bibliography 147 [Hughes 79] [Hughes 86] [Hwang 88] [Ibrahim 88] [Jainandunsing 89] [Jerkovsky 78] [Keat 83] [Khosravi 87] [Kim 86a] [Kim 86b] [Kim 88] [Lee 88] [Lilly 91] [Liu 83] P.C. Hughes, Dynamics of a Chain of Flexible Bodies. J. of the Astronau-tical Sciences, Vol. 27, 1979, pp 359-380. P.C. Hughes, Spacecraft Attitude Dynamics. John Wiley and Sons, New York,1986. R.S. Hwang, D.S. Bae, and E.J. Haug, Parallel Processing for Real-Time Dynamic System Simulation. Advances in Design Automation, DE-Vol. 14, ASME Design and Automation Conference, September 1988. A. E. Ibrahim, Mathematical Modelling of Flexible Multibody Dynamics with Application to Orbiting Systems. PhD Thesis, Department of Mechanical Engineering, University of British Columbia, April 1988. K. Jainandunsing and E.F. Deprettere, A New Class of Parallel Algorithms for Solving Systems of Linear Equations. SIAM J. Scientific and Statistical Computing, Vol. 10, No. 5, September 1989, pp 880-912. W. Jerkovsky, The Structure of Multibody Dynamics Equations. J. of Guid ance and Control, Vol. 1, No. 3, May-June 1978, pp 173-182. J. Keat, Dynamic Equations of Multibody Systems with Application to Space Structure Deployment. PhD Thesis, MIT, 1983. B. Khosravi, S. Yurkovich, and H. Hemami, Control of a Four-Link Biped in a Back Somersault Maneuver. IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-17, No. 2, March/April, 1987. S.S. Kim and M.J. Vanderploeg, A General and Efficient Method for Dy namic Analysis of Mechanical Systems using Velocity Transformations. ASME Journal of Mechanisms, Transmissions and Automation in Design, Vol. 108, June 1986, pp 176-182. S.S. Kim and M.J. Vanderploeg, QR Decomposition for State Space Repre sentation for Constrained Mechanical Dynamic Systems. ASME J. Mecha nisms, Transmissions and Automation in Design, Vol. 108, June 1986, pp 183-188. S.S. Kim and E.J. Haug, A Recursive Formulation for Flexible Multibody Dynamics Part I: Open Loop Systems. Computer Methods in Applied Me chanics and Engineering, 71, 1988, pp 293-314. C. S.G. Lee and P.R. Chang, Efficient Parallel Algorithms for Robot Forward Dynamics Computation. IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-18, No. 2, March/April 1988, pp 238-251. K.W. Lilly and D.E. Orin, Alternate Formulations of the Manipulator In ertia Matrix. The Int. J. Robotics Res., Vol. 10, No. 1, February 1991. P.S. Liu and T.Y. Young, VLSI Array Design Under Constraint of Limited I/O Bandwidth. IEEE Trans. Computers, Vol. C-32, No. 12, December 1983, pp 1160-1170. Bibliography 148 [Luh 80] [Mani 84] [Murray 89] [Nikravesh 84] [Orlandea 77] [Park 86] [Press 88] [Roberson 66] [Roberson 88] [Rodriguez 87] [Rodriguez 88] [Rulka 90] [Schwertassek 84] [Schwertassek 89] J.Y.S. Luh, M.W. Walker, and R.P. Paul, On-line Computational Scheme for Mechanical Manipulators. ASME J. Dynamical Systems, Measurement and Control. Vol. 102, No. 2, June 1980, pp 69-76. N.K. Mani and E.J. Haug, Use of Singular Valued Decomposition for Anal ysis and Optimization of Mechanical System Dynamics. Technical Report 84-13, University of Iowa, Iowa City, August 1984. J.J. Murray and G.H. Lovell, Dynamic Modeling of Closed-Chain Robotic Manipulators and Implications for Trajectory Control. IEEE Trans. Robotics and Automation, Vol. RA-5, No. 4, August 1989. P.E. Nikravesh, Some Methods for the Dynamic Analysis of Constrained Mechanical Systems: A Survey. Computer Aided Analysis and Optimiza tion of Mechanical System Dynamics, ed. E.J. Haug. Springer Verlag, Hei delberg, 1984. N. Orlandea, M.A. Chace, and D.A. Calahan, A Sparsity Oriented Approach to Dynamic Analysis and Design of Mechanical Systems, Part I and II. ASME J. Engr. Indust., Vol. 99, August 1977, pp 773-784. T.W. Park and E.J. Haug, A Hybrid Numerical Integration Method for Machine Dynamic Simulation ASME J. Mechanisms, Transmissions and Automation in Design, Vol. 108, 1966, pp 211-216. W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, Numerical Recipes in C. Cambridge University Press, Cambridge 1988. R.E. Roberson and J. Wittenberg, A Dynamic Formalism for an Arbitrary Number of Interconnected Rigid Bodies, with Reference to the Problem of Satellite Attitude Control. Proc. 3rd Congress IFAC (London 1966), Vol. 1, Book 3, Paper 46D. Butterworth, London. R.E. Roberson and R. Schwertassek, Dynamics of Multibody Systems. Springer Verlag, Berlin, 1988. G. Rodriguez, Kalman Filtering, Smoothing, and Recursive Robot Arm For ward and Inverse Dynamics. IEEE Journal of Robotics and Automation, Vol. RA-3, No. 6, December 1987. G. Rodriguez, Recursive Mass Matrix Factorization and Inversion. JPL publication 88-11, March 15, 1988. W. Rulka, SIMPACK, a Computer Program for Simulations of Large-Motion Multibody Systems, in Handbook of Multibody Systems, W. Schiehlen (ed.) Springer-Verlag, Berlin, 1990. R. Schwertassek and R.E. Roberson, A State Space Dynamical Representa tion for Multibody Mechanical Systems Part II: Systems with Closed Loops. Acta Mechanica 51, 1984, pp 15-29. R. Schwertassek and W. Rulka, Aspects of Efficient and Reliable Multibody System Simulation. Unpublished paper, 1989. Bibliography [Sepehri 90] [Silver 82] [Vereschagin 74] [Walker 82] [Wehage 82] [Whittaker 27] [Wittenberg 77] [Zheng 84] [Zheng 86] 149 N. Sepehri, Dynamic Simulation and Control of Teleoperated Heavy-Duty Hydraulic Manipulators. PhD Thesis, Department of Mechanical Engineer ing, University of British Columbia, 1990. W.M. Silver, On the Equivalence of Lagrangian and Newton Euler Dynam ics for Manipulators. Int. J. Robotics Res., Vol. 1, 1982, pp 118-128. A.F. Vereschagin, Computer Simulation of the Dynamics of Complicated Mechanisms of Robot-Manipulators. Engineering Cybernetics, No. 6, 1974, pp 65-70. M.W. Walker and D.E. Orin, Efficient Dynamic Computer Simulation of Robotic Mechanisms. ASME J. Dynamical Systems, Measurement and Con trol. Vol. 104, 1982, pp 205-211. R.A. Wehage and E.J. Haug, Generalised Coordinate Partitioning for Dimension Reduction in the Analysis of Constrained Dynamic Systems. ASME J. Mechanical Design, Vol. 104, 1982, pp 247-255. E.T. Whittaker, A Treatise on the Analytical Dynamics of Particles and Rigid Bodies. Cambridge University Press, Cambridge, 1927. J. Wittenberg, Dynamics of Systems of Rigid Bodies. Teubner Stuttgart, Germany, 1977. Y.F. Zheng, Modeling, Control and Computer Simulation of a Three Dimen sional Robotic System with Application to Biped Locomtion. PhD Thesis, Department of Electrical Engineering, Ohio State University, 1984. Y.F. Zheng and H. Hemami, Computation of Multibody Dynamics by a Multiprocessor Scheme. IEEE Trans. Systems, Man, and Cybernetics, Vol. SMC-16, No. 1, Jan. 1986, pp 102-110.
- Library Home /
- Search Collections /
- Open Collections /
- Browse Collections /
- UBC Theses and Dissertations /
- Parallel implementation of multibody dynamics for real-time...
Open Collections
UBC Theses and Dissertations
Featured Collection
UBC Theses and Dissertations
Parallel implementation of multibody dynamics for real-time simulation Wong, Darrell 1991-12-31
pdf
Page Metadata
Item Metadata
Title | Parallel implementation of multibody dynamics for real-time simulation |
Creator |
Wong, Darrell |
Publisher | University of British Columbia |
Date | 1991 |
Date Issued | 2011-03-12T00:06:24Z |
Description | A multibody dynamics formulation has been developed for the purposes of real-time simulation of large scale robotic mechanisms such as excavators. The formulation models the rigid body dynamics of any arbitrary tree structured mechanism, although at present the formulation is restricted to single degree of freedom rotational joints. This formulation is an example of the orthogonal complement approach, which describes the dynamics by projecting an initial description of the primitive equations of motion (the derivatives of translational and angular momentum plus the kinematic equations) from angular and translational Cartesian coordinates to relative angles. In this thesis the approach was developed from Newtonian and Eulerian principles. Novel single cpu algorithms for inertia matrix and force vector formation have been implemented. Novel multiprocessor algorithms were implemented for the inertia matrix and the force vector on a 2d [formula omitted] triangular mesh architecture. A feedforward systolic matrix solution technique was also implemented. These algorithms are of O(n) complexity, and together they form a parallel formulation which is more efficient than other parallel formulations in the literature for mechanisms with fewer than 15 degrees of freedom. A Caterpillar 215B excavator was simulated in real-time using an array of transputers, and teleoperation experiments were conducted to verify the formulation. Single cpu simulations of the PUMA 600 and a human torso were also conducted. |
Genre |
Thesis/Dissertation |
Type |
Text |
Language | eng |
Collection |
Retrospective Theses and Dissertations, 1919-2007 |
Series | UBC Retrospective Theses Digitization Project |
Date Available | 2011-03-11 |
Provider | Vancouver : University of British Columbia Library |
Rights | For non-commercial purposes only, such as research, private study and education. Additional conditions apply, see Terms of Use https://open.library.ubc.ca/terms_of_use. |
DOI | 10.14288/1.0101112 |
URI | http://hdl.handle.net/2429/32391 |
Degree |
Doctor of Philosophy - PhD |
Program |
Electrical and Computer Engineering |
Affiliation |
Applied Science, Faculty of Electrical and Computer Engineering, Department of |
Degree Grantor | University of British Columbia |
Campus |
UBCV |
Scholarly Level | Graduate |
Aggregated Source Repository | DSpace |
Download
- Media
- 831-UBC_1991_A1 W65.pdf [ 8.13MB ]
- Metadata
- JSON: 831-1.0101112.json
- JSON-LD: 831-1.0101112-ld.json
- RDF/XML (Pretty): 831-1.0101112-rdf.xml
- RDF/JSON: 831-1.0101112-rdf.json
- Turtle: 831-1.0101112-turtle.txt
- N-Triples: 831-1.0101112-rdf-ntriples.txt
- Original Record: 831-1.0101112-source.json
- Full Text
- 831-1.0101112-fulltext.txt
- Citation
- 831-1.0101112.ris
Full Text
Cite
Citation Scheme:
Usage Statistics
Share
Embed
Customize your widget with the following options, then copy and paste the code below into the HTML
of your page to embed this item in your website.
<div id="ubcOpenCollectionsWidgetDisplay">
<script id="ubcOpenCollectionsWidget"
src="{[{embed.src}]}"
data-item="{[{embed.item}]}"
data-collection="{[{embed.collection}]}"
data-metadata="{[{embed.showMetadata}]}"
data-width="{[{embed.width}]}"
async >
</script>
</div>
Our image viewer uses the IIIF 2.0 standard.
To load this item in other compatible viewers, use this url:
http://iiif.library.ubc.ca/presentation/dsp.831.1-0101112/manifest