UBC Theses and Dissertations: Multilevel Debugging of Parallel Message Passing Programs. Pedersen, Jan Bækgaard, 2003.

Multilevel Debugging of Parallel Message Passing Programs

by

Jan Baekgaard Pedersen

Cand. Scient. (M.Sc.), Department of Computer Science, Institute of Mathematics, University of Aarhus, Denmark, 1997

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy in THE FACULTY OF GRADUATE STUDIES (Department of Computer Science)

We accept this thesis as conforming

The University of British Columbia
July 2003

© Jan Baekgaard Pedersen, 2003

In presenting this thesis in partial fulfillment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the Head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Date: July 2003

Department of Computer Science
The University of British Columbia
2366 Main Mall
Vancouver, BC, Canada V6T 1Z4

Abstract

"I am not young enough to know everything" — James M. Barrie

"Errare humanum est" - to err is human (Hieronymus, Epistle 57, 12). This has been known throughout time, and it inevitably means that humans writing computer programs are bound to introduce errors. With computers operating in Frankenstein's Igor mode, 'Your wish is my command', executing instructions without questioning their validity, the errors introduced by humans are faithfully carried out. When parallel programming with message passing is added, an error in one process can spread like a virus to other processes through the messages they exchange. Much research has been done on debugging sequential programs, and most of these theories and results apply directly to parallel programs, but the set of potential errors increases dramatically in size when parallelism and message passing are introduced. Not only can a single process fail; sets of processes can deadlock, and computational errors can propagate from process to process, infecting otherwise correct processes. Correct programs can even stop working because of the underlying implementation of the message passing system.

We propose a framework for debugging parallel message passing programs: a multilevel approach that divides errors into separate groups at various levels, from the well known sequential errors, such as stray pointers and array-out-of-bounds accesses, to deadlock caused by incorrect message passing code, protocol errors, and buffer allocation problems. We show the validity of this approach by developing new debugging techniques and analyses, and by implementing these in Millipede, a prototype multilevel debugger written for C programs that use the PVM message passing system.
Contents

Abstract
Contents
List of Figures
List of Tables
Acknowledgements
1 Introduction
  1.1 The Debugging Problem
  1.2 Problem Definition
  1.3 Thesis Statement
  1.4 Contributions
2 Background and Related Work
  2.1 Background and Rationale
    2.1.1 The Parallel Programming Domain
  2.2 The Debugging Process
    2.2.1 Iterative debugging
    2.2.2 The Why, How and What of Errors
  2.3 Related Work
    2.3.1 Program Development Environments
    2.3.2 Visualization Tools
    2.3.3 Extension of Sequential Debuggers
    2.3.4 Replay Tools/Debuggers
    2.3.5 Relative Debuggers
    2.3.6 Language Support for Communication
    2.3.7 Summary of Related Work
  2.4 Top-down versus bottom-up debugging
  2.5 Multilevel Debugging
  2.6 Error Classification
  2.7 Tool Development
    2.7.1 Automation
  2.8 Tool Support for Parallel Program Development
3 Millipede - A Prototype Multilevel Parallel Debugger
  3.1 Design Criteria
  3.2 The legs of Millipede
    3.2.1 Overview
    3.2.2 Implementation
  3.3 The Sound of Little Legs Running
4 Sequential Debugging of Parallel Processes
  4.1 The Sequential Debugging Module
  4.2 Limitations
  4.3 Examples
    4.3.1 Division by Zero Error
    4.3.2 Memory Errors
  4.4 Implementation Details for the Sequential Debugging Module
  4.5 Summary
5 Message Debugging
  5.1 Interactive Message Debugging
  5.2 Message Queries
  5.3 User Defined Queries
  5.4 Built-in Message Queries
  5.5 Discussion
  5.6 Summary
6 Deadlock Detection and Correction
  6.1 Deadlock Detection and Correction
  6.2 Description of Problem
  6.3 The Algorithm
  6.4 Algorithm accuracy
  6.5 Message tags
  6.6 Summary
7 Protocol Conformance Checking
  7.1 Between Testing and Verification
  7.2 Protocol Checking and Verification
  7.3 Protocol Constraint Specification
    7.3.1 Protocol Contents
  7.4 The PCSL Grammar and Semantics
  7.5 Examples
    7.5.1 The Simplest Protocol
    7.5.2 Pipe-and-Roll Matrix Multiplication
    7.5.3 A Partial Differential Equation Solver
  7.6 Online Checking
    7.6.1 Strictness
  7.7 Offline Checking
  7.8 Protocol Prediction
  7.9 Implementation
  7.10 Discussion
    7.10.1 State Dependent Communication
  7.11 Summary
8 Buffer Allocation in Message Passing Programs
  8.1 Motivation and Related Work
  8.2 Buffer Allocation Problems
  8.3 The Nonblocking Buffer Allocation Problem
  8.4 Approximations of BAP using NBAP
  8.5 Discussion
  8.6 Summary
9 Conclusion and Future Work
  9.1 Conclusion
    9.1.1 The Sequential Level
    9.1.2 The Message Level
    9.1.3 The Protocol Level
    9.1.4 Summary
  9.2 Future Work
    9.2.1 Further Development
Bibliography
Appendix A A Complete Example of a Millipede Session
Appendix B The PCSL Grammar and Semantics
  B.1 The PCSL Grammar and Semantics
  B.2 A Complete Example Using PCSL/MOPED
Appendix C The MQL Grammar
  C.1 The Millipede Query Language Grammar
Appendix D Millipede Screen Shots
Appendix E Theoretical Framework for The Buffer Allocation Problems
  E.1 Definitions
    E.1.1 The Graph Based Framework
    E.1.2 Colouring the Communication Graph
  E.2 Useful Lemmas
  E.3 Buffer Allocation in Systems with Receive Side Buffers
    E.3.1 The Buffer Allocation Problem
    E.3.2 The Buffer Sufficiency Problem
    E.3.3 The Nonblocking Buffer Allocation Problem
  E.4 Proof of Correctness of the Nonblocking Buffer Allocation Algorithm
  E.5 Buffer Allocation in Systems with Send Side Buffers
  E.6 Buffer Allocation in Systems with Send and Receive Side Buffers
  E.7 Buffer Allocation in Channel Based Systems
    E.7.1 The Buffer Allocation Problem
    E.7.2 The Buffer Sufficiency Problem
    E.7.3 The Nonblocking Buffer Allocation Problem
  E.8 Summary

List of Figures

2.1 Using default buffers
2.2 Explicitly allocated buffers
2.3 The sequential versus the parallel programming domain
2.4 Top-down versus bottom-up debugging
2.5 Message passing widens the cause/effect chasm
3.1 Implementation of Millipede
3.2 Examples of redefined PVM functions
4.1 Example of the importance of logging PVM function return values
4.2 An application with communication library and Millipede
4.3 Sequential code with divide by zero
4.4 Using Gdb for sequential debugging
4.5 Source code with a memory error
4.6 Using Purify to locate a memory error
5.1 Missing value in a log file
5.2 Inspecting and changing message content
5.3 Executing the match query
5.4 MQL code for the match query
5.5 Executing the locate query
5.6 MQL code for the locate query
5.7 Executing the status query
5.8 Executing the dump query
6.1 A simple error
6.2 Executing the matching algorithm
6.3 All valid communication configurations in C2
6.4 The set B(ui, 1)
6.5 Example of increasing B sets
6.6 Example of overlapping B sets
6.7 Configurations reachable in k steps from the configuration 0011
6.8 Intersections in C2
6.9 Failure rate for the deadlock correction algorithm
6.10 Introducing message tags
7.1 Adding elements to the symbol table
7.2 How to check a message against a protocol line
7.3 Semantics for a PCSL line
7.4 Example of β[]{}() -> β[]{}()
7.5 Example of β[]{i}() -> β[]{(i+1)%n}(), for all i : 0 <= i && i < n-1
7.6 Algorithm for the master of the pipe-and-roll matrix multiplication
7.7 Algorithm for the slave of the pipe-and-roll matrix multiplication
7.8 The pipe-and-roll part of the matrix multiplication algorithm
7.9 The master algorithm for a differential equation solver
7.10 The slave algorithm for a differential equation solver
7.11 P1 - Version 1 of the protocol specification
7.12 P1' - Extended version 1 of the protocol specification
7.13 Graphical representation of P2
7.14 P2 - Version 2 of the protocol specification
7.15 P3 - Version 3 of the protocol specification
7.16 P3' - Extended version 3 of the protocol specification
7.17 The four different stages of the pipe operation
8.1 An unsafe communication graph
8.2 A general t-ring
8.3 Communication dependences
8.4 The NBAP algorithm
8.5 Communication graph for a 2 x 2 worker system
8.6 Executing the nbap command in Millipede
8.7 Detailed information about buffer requirements using nbap
8.8 Examples of worst and best case approximations
8.9 Introduction of epochs into a communication graph
8.10 Shortening the arrival intervals using epochs
8.11 Sub-epochs
8.12 Barriers using asynchronous communication
A.1 A complete sequential debugging session
B.1 The PCSL BNF grammar
B.2 Semantics for the PCSL grammar
B.3 A complete example using MOPED
C.1 The MQL grammar
D.1 The Millipede startup screen
D.2 Screen shot illustrating interactive message debugging
D.3 Screen shot showing the status monitor
E.1 Order of execution can cause deadlock
E.2 A communication graph with a 2-ring
E.3 Dependency cycle in G(S)
E.4 Construction of G
E.5 The construction of the components
E.6 The disperser component
E.7 vi,c+k is communication dependent on vi,c
E.8 Algorithm for nonblocking buffer assignment
E.9 A 2 x 2 worker process mesh
E.10 Nullifying send side token pools
E.11 Reduction from 3SAT to NBAPsr
E.12 The clause representation in epoch j
E.13 Simulating m tokens by m components

List of Tables

2.1 Dimension 1. Why is an error hard to find?
2.2 Dimension 2. How is an error found?
2.3 Dimension 3. What is the root cause of the error?
5.1 The Senders relation
5.2 The Receivers relation
5.3 The SentMessages relation
5.4 The ReceivedMessages relation
5.5 The built-in queries
6.1 Examples of the size of B(v,e)
6.2 Distances between valid configurations in V3 within C3
6.3 The rate of growth of ci
6.4 Number of valid configurations at different distances in Vn
6.5 Intersection sets for various values of k and k' in C2
6.6 Sizes of intersecting B sets
7.1 The MOPED strictness levels
7.2 Prediction table for the P2' protocol specification
9.1 Results for various buffer placement schemes
E.1 Results for various buffer placement schemes

Acknowledgements

"Kind words can be short and easy to speak, but their echoes are truly endless." — Mother Teresa of Calcutta

I would like to thank Alex Brodsky for taking such an active interest in the buffer allocation aspect of message passing programming. The work presented in Chapter 8 was done in close collaboration with him; many thanks for the hard work on the proofs. I would also like to thank my supervisor, Professor Alan Wagner, for his persistence in pushing the work on the buffer allocation problems, and for his financial support and ability to balance practice and theory in my dissertation. In addition, many thanks go to Professor Peter Welch from the University of Kent at Canterbury, Professor Dyke Stiles at Utah State University in Logan, Utah, and WOTUG for conference support on numerous occasions. I would also like to thank Doctor Bettina Speckmann for her help on proofreading the material presented in Chapter 6. In addition, many thanks are extended to Yvonne Coady for always telling me that I was on the right track, and to Chamath Keppitiyagama, Joon Suan Ong and Dmitry Brodsky for everyday support in the DSG. Also thanks to Professor Norm Hutchinson and Professor Kris De Volder for being on my committee. For financial support, acknowledgments and thanks go to Forskningsrådet (The Research Academy) in Denmark, and to Randers Reb.
Finally, I want to thank the most important people in my life: my parents, Karen and Finn, and my sister Lisa, for always supporting me, never questioning my decisions to remain in school for what seems like forever, and for all those airline tickets home for Christmas, and the motorcycle loan.

The University of British Columbia
July 2003

Chapter 1

Introduction

"If debugging is the art of removing bugs, then programming must be the art of inserting them." — Unknown

One of the most interesting and fastest growing fields in parallel and distributed computing is grid computing: providing computing on demand. Tom Hawk, grid computing general manager for IBM, states that businesses can improve utilization of their technology infrastructures by 30 percent or more by taking advantage of Grid technologies to enable on-demand computing [IBM02]. Hawk identifies five industries where Grids are likely to have the biggest near-term impact: Financial Services, which can harness Grids for derivative analysis, statistical analysis and portfolio risk analysis; Life Sciences, for cancer research, new drug discovery, protein folding and protein sequencing; Energy, for seismic analysis and reservoir analysis; Manufacturing, for mechanical design, process simulation, finite element analysis, and failure analysis; and Entertainment, for digital rendering. In addition, Hawk adds: "Each of these industries, while different, shares similar business challenges that can be addressed by the unique benefits that Grids can deliver, including on-demand computing, business transformation, data sharing and IT optimization."

Linking computers through Grids to aggregate their power promises to deliver the immense processing capabilities of supercomputers to new venues. For instance, a financial institution could use Grid computing to offer higher levels of service to its best customers for risk management or portfolio analysis. A pharmaceutical company could amalgamate the power of several supercomputers and make the data available to researchers, who access the Grid to collaborate in the development of new drugs. The use of Grid computing as an on-demand utility promises to deliver computing power on a pay-as-you-go basis, as accessible as electricity.

A similar system, the NetSolve client-server system [Net], enables users to solve complex scientific problems remotely. This system allows users to access both hardware and software computational resources distributed across a network. NetSolve searches for computational resources on a network, chooses the best one available, solves the problem using retry for fault tolerance, and returns the answers to the user.

If we turn our attention to more tightly coupled systems, such as large clusters, yet another set of problems becomes tractable. In 1991, the US Congress passed the High Performance Computing Act of 1991 (Public Law 102-194), which authorized the Federal High Performance Computing and Communications (HPCC) Program. One class of problems developed in conjunction with the HPCC Program was dubbed 'Grand Challenge Problems' by Dr. Ken Wilson of Cornell University, a physicist and Nobel laureate. Since then, various committees and government agencies have added others to the original list.
These problems are characterized as 'fundamental problems in science and engineering that have broad economic and/or scientific impact and whose solution can be advanced by applying high performance computing techniques and resources'. They address issues of great societal impact, such as biomedicine, the environment, economic competitiveness, and national security. One of the Grand Challenge problems is weather prediction, a task that relies heavily on the availability of computing power; tomorrow's weather report must be available before tomorrow arrives. To obtain such computing power, many systems make use of large clusters tightly coupled by a fast network. If we look at the Top 500 list of the fastest computers in the world [Top], an interesting trend appears: many of the fastest machines in the world today are indeed clusters made up of (fairly) ordinary PCs connected by a fast network.

Whether dealing with a system of loosely coupled home PCs or a tightly coupled cluster of high performance PCs in a research lab, the same important question arises: how do we program these large parallel systems, given that they do not share any memory and information must migrate from one process to another through the network? The most widely used technique for exchanging data on such systems is message passing. Message passing involves a sender explicitly sending a message, containing the data to be exchanged, to a receiver. Two of the most widely used message passing systems are PVM (Parallel Virtual Machine) [Gei94] and MPI (Message Passing Interface) [Don94]. Many of the fastest computers in the world today use some dialect of MPI, combined with Fortran or C.

1.1 The Debugging Problem

The need for debugging is present in any software development project. Programs have errors and bugs, and these need to be located and corrected. Many approaches have been suggested in the literature, but in practice the following two are most used. The first is the traditional and well known debugging by print statements: print statements, which display information on the screen or write it to a file, are inserted into the program being debugged. This approach is still widely used; in [Pan93c] it is estimated that up to 90% of programmers rely exclusively on print statements as their only means of debugging. Logging, or tracing, is a more pragmatic variant; instead of printing to the screen, log files are created. The programming language Java has classes for creating such log files or traces. The tracing can be switched on for debugging purposes and turned off when debugging has finished, so the programmer does not have to go through the source code to delete the print statements. The same effect can be achieved in C/C++ by using #ifdef ... #endif constructions and setting flags on the compiler command line.
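As a concrete illustration of the conditional compilation approach just described, the sketch below shows one common way to make debug output switchable from the compiler command line. The DEBUG flag and the dbg macro are illustrative assumptions, not code taken from the thesis.

#include <stdio.h>

/* Debug output is compiled in only when DEBUG is defined, e.g. with
   'cc -DDEBUG prog.c'; without the flag the dbg() calls vanish entirely,
   so no print statements have to be deleted by hand after debugging. */
#ifdef DEBUG
#define dbg(...) fprintf(stderr, __VA_ARGS__)
#else
#define dbg(...) ((void)0)
#endif

int main(void)
{
    int sum = 0;
    for (int i = 0; i < 10; i++) {
        sum += i;
        dbg("i = %d, sum = %d\n", i, sum);  /* present only under -DDEBUG */
    }
    printf("sum = %d\n", sum);
    return 0;
}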
The second approach uses debugging tools. Traditional sequential debugging tools are designed to easily and efficiently provide the information needed to debug a sequential program: they provide key views into the different components of such programs and allow the programmer to alter the state of the running program. Examples of such views are stack traces, variable values, and breakpoints. One of the key features of any sequential debugger is the ability to view and alter variable values, that is, the state of the program; the variables of a sequential program can be considered the key players, the core, of the program. Much time is spent verifying that variables have the correct values, and that the correct branches of if and switch statements are taken based on expressions that contain, among other things, variables.

Research such as [Eis97] has shown that one important reason errors are hard to find is that the cause and the effect are often separated by a great distance in the code; the errors are most often found through print statements or similar debugging techniques, and the root cause often turns out to be a memory problem, that is, pointer errors or other forms of memory corruption. This research is based on sequential programs, but as we discuss later, the cause/effect chasm widens significantly when message passing is introduced.

A number of tools for parallel debugging have been proposed throughout the years, but in general these tools are not widely used [Pan93b, Pan93a, Pan93c]. Some complain that the tools are hard and tedious to use, and fail to provide the information users really need and want. One major problem is the information overload that many of these tools suffer from [Pan99]. Too much information is presented to the programmer, making the debugging task difficult. Often, this information overload is caused by the tool trying to give the user a global view of the program, that is, taking a top-down approach to debugging without providing the views and filtering functions the user needs to easily find the information necessary to complete the debugging task.

All the problems and issues mentioned so far arise in the sequential programming domain. Once parallelism is introduced, not only do these problems become even more prevalent, but new problems appear. Not only does the programmer have to deal with asynchronously executing processes on multiple machines, but the message passing increases the distance between the cause of an error and its effect, that is, the location where the error is exhibited. Incorrect variable values may be communicated between processes, making an otherwise correct process behave incorrectly. This increase goes beyond the increase in distance in the code experienced in the sequential domain with, for example, function or procedure calls. When multiple processes execute simultaneously, the cause of an error can originate in one process while the effect manifests in another, thus increasing the distance in the code. Another aspect is time; when data values propagate from one process to another, the amount of time that passes from the cause of an error to its effect also increases. One example of the cause/effect chasm in time is deadlock. Consider a number of processes in a ring topology, each receiving from its left neighbour and sending to its right. If one of these processes sends its message to a wrong process, eventually all the processes block in receive calls, and the system deadlocks. Not only are the cause and the effect spread far apart in the code, but the time that elapses from the wrong send to the deadlock might be significant.

In summary, the amount of information available to the programmer when debugging n concurrently running processes is magnified, making the debugging task even more daunting.
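To make the ring example above concrete, the following sketch shows the communication step of one process in such a ring, written against the PVM calls introduced in Chapter 2. The rank and tids bookkeeping and the RING_TAG value are illustrative assumptions, not code from the thesis or from Millipede.

#include <stdio.h>
#include "pvm3.h"

#define RING_TAG 11   /* illustrative message tag */

/* One exchange for the process with the given rank in a ring of n processes;
   tids[] maps ranks to PVM task identifiers. */
void ring_step(int rank, int n, int tids[], int value)
{
    int right = tids[(rank + 1) % n];      /* intended receiver */
    int left  = tids[(rank + n - 1) % n];  /* expected sender   */
    int incoming;

    pvm_initsend(PvmDataDefault);
    pvm_pkint(&value, 1, 1);
    /* A typographical slip here, such as tids[(rank + 2) % n], delivers the
       message to the wrong process; the right-hand neighbour then blocks in
       its receive, and as the exchange is repeated the blocking spreads
       around the ring until the whole system is deadlocked. */
    pvm_send(right, RING_TAG);

    pvm_recv(left, RING_TAG);              /* blocks until a matching message arrives */
    pvm_upkint(&incoming, 1, 1);
    printf("rank %d received %d\n", rank, incoming);
}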
1.2 Problem Definition

In this dissertation we show the following:

• By decomposing the debugging task for parallel message passing programs into a number of different levels, each specifically concerned with one type of error (sequential, message content, protocol, etc.), it is possible to provide tools specifically tailored to finding these types of errors. At the same time, the amount of useless information given to the user is reduced. We refer to this technique as Multilevel Debugging.

• By extracting information about the messages and the protocol from the parallel program, we can focus on specific debugging problems. This has led to a number of new techniques, analyses, and algorithms that can assist the programmer greatly. Many of these can be incorporated into a multilevel debugging tool such as Millipede.

We have implemented Millipede, a prototype multilevel debugger, which implements the algorithms and analyses described in this thesis. In addition, the multilevel debugging strategy provided in Millipede is closely coupled with the development cycle of a parallel program, whether it is written from scratch or adapted from an existing sequential program.

Millipede follows the model 'You must crawl before you can walk', which in the parallel programming world translates into 'You must fix your pointer errors before you can pass messages'. In general, Millipede considers the various parts of a parallel message passing program as separate levels of the program. Each level has a specific type of information needed to find errors and correct them. The information useful to the programmer for locating and correcting errors in the sequential code might not be the most useful for tracking down errors in the communication patterns of the overall program. The three major levels that we propose, to which the multilevel debugging technique should adhere, arise from our definition of a parallel message passing program: a parallel message passing program consists of a number of sequential processes bound together by messages according to a protocol. Thus the three levels are these:

• Sequential code - the code executed by each process. This can be different pieces of code (e.g., a data driven pipeline computation where data passes from process to process) or a number of instances of the same piece of code (e.g., an SPMD program, Single Program Multiple Data, where a number of processors execute the same program, such as a processor farm).

• Messages - the individual messages passing from one sender to one receiver. A typical message passing program consists of a number of such messages; each message originates at a send event (or in some cases a multicast or broadcast event) and ends up at a receive event in another (possibly the same) process.

• Protocols - the collection of messages and the pattern in which they are exchanged make up the protocol of the program.

Each level can be decomposed into a number of sublevels. As we illustrate, the protocol level in particular contains several interesting analyses and algorithms. The levels can be viewed as a progressively broader view of the program as we 'climb up' through them. The main focus of the sequential level is the sequential code executed within one process.
The message level is concerned with the communication between two processes, while the protocol level focuses on the entire program and the overall pattern according to which messages are exchanged. While the levels may overlap, the type of information needed at each level expands to include more and more processes as the view expands. In addition to correcting errors after they have occurred and potentially crashed the program, the added information that can be collected about the protocol and the message passing patterns can also be used for preemptive debugging purposes. An example of such extra information is the protocol and message passing pattern, which can be used to predict deadlocks due to insufficient buffers.

The multilevel debugging strategy is a bottom-up approach; that is, the programmer uses tools specifically tailored to finding sequential errors in the straight line code of his program before moving on to fixing message content errors and protocol errors.

1.3 Thesis Statement

The points from the previous sections can be summarized as follows:

1. A decomposition of the parallel programming domain into three levels (sequential, messages, and protocol) leads to the multilevel methodology for debugging parallel message passing programs. This methodology also provides a guide and framework for developing and integrating new tools into the debugging system.

2. Extracting the information present in parallel message passing programs, such as message content and protocol information, facilitates the design of tools tailored to specific error types at different levels of the parallel programming domain, which further allows automation of a number of analyses that cannot easily be performed otherwise.

3. Such tools map errors back to the source code, and sometimes suggest how to correct them.

4. Such tools can be implemented with a simple command-line interface and require minimal configuration. In addition, no translation or rewriting of the source code is necessary, which makes them directly applicable to the source code.

1.4 Contributions

The literature suggests that tools are not widely used in the parallel programming community. Reasons for this include the wrong abstraction, complicated interfaces, and lack of focus on the problem at hand. This means that many programmers still rely on print statements as their only debugging tool. After reviewing many of the faults and shortcomings of tools in general, and more specifically of tools for debugging parallel message passing programs, we formulate a bottom-up debugging strategy, referred to as multilevel debugging. This strategy not only offers a methodology for debugging parallel message passing programs, but also serves as a design and implementation framework for new tools.

Using details about the three level decomposition of the parallel domain (sequential, messages, and protocol) and error types, we derive a number of general design goals for tools. Examples of these goals are navigation tools at different levels, views for key players, state displayed on request, and relations computed by the tool. In addition, a number of specific goals for debugging tools are derived. Examples include support for finding and correcting specific types of errors, applicability to source code with a mapping back to the error in the source, as well as providing tools that do not require any rewriting or transformation of the code.
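One common way to gather such message level information without any rewriting of the user's source is to interpose logging wrappers on the message passing calls. The sketch below is only an assumption-laden illustration of that general technique, not Millipede's actual mechanism; Chapter 3 describes the redefined PVM functions Millipede really uses.

#include <stdio.h>
#include "pvm3.h"

/* Hypothetical wrapper: records where in the source each send occurs,
   giving the mapping back to the source code, before performing the send. */
static int logged_pvm_send(int tid, int tag, const char *file, int line)
{
    fprintf(stderr, "pvm_send to tid %d, tag %d at %s:%d\n", tid, tag, file, line);
    return pvm_send(tid, tag);
}

/* Forcing this header onto the compiler command line makes every pvm_send
   in the user's code go through the wrapper, with no source changes. */
#define pvm_send(tid, tag) logged_pvm_send((tid), (tag), __FILE__, __LINE__)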
We present the Millipede Debugging System, a prototype of a multilevel debugger for parallel message passing systems, and show how such a system can be implemented for message passing systems such as PVM and MPI. With the basic framework in place, we implement a number of specific tools, all designed in accordance with our design goals. These tools target different levels of the parallel programming domain, and each is specifically tailored to an error type at one of the three levels. We show how specific error types can be located and mapped back to the source code.

One of the most important criticisms of existing tools is that the wrong granularity or abstraction often makes the tool either useless for a specific debugging task or a source of information overload. Information overload is an excess amount of information presented to the user at one time, which in turn makes the debugging task daunting and unnecessarily complicated. The granularity and level of abstraction of each of the tools we present is set in accordance with the specific error type that the tool is tailored for, thus eliminating much of the excess information not related to the specific debugging task.

We start at the lowest level, the sequential level, where we show how a sequential process can be extracted and debugged using existing sequential debugging tools. This allows the programmer to debug the sequential code of one process at a time, without having to worry about a number of processes running simultaneously. In addition, it enables the use of existing debuggers like Gdb [Gdb] and other tools such as Purify [Pur] and program profilers. By providing the user with the ability to use well known sequential tools on the sequential code of a parallel program, we provide a way of matching the abstraction of the tool to the abstraction of the debugging task. In addition, no rewriting of the code is needed to make use of these sequential tools. We demonstrate the usefulness of this approach by showing how a number of different sequential errors are located and corrected by extracting the process and using Gdb and Purify. We have provided a translation of the sequential level of the parallel programming domain into the sequential domain, thus covering all types of sequential errors by allowing the user to deploy any sequential tool at this level.

At the message level, we present two techniques related to debugging messages and their content. The first, referred to as interactive message debugging, allows the user to inspect and change the value of messages as they are sent or received. This technique, coupled with the sequential debugging module, allows for the debugging of one process from a parallel system while retaining the ability to inspect and correct the content of the messages, and also allows the user to perform unit testing of each of the processes during the development cycle. This means that separate parts of the system can be tested independently, which allows hypothesis testing that includes the message content to take place without the rest of the parallel system running. The second technique is a query language referred to as MQL (Millipede Query Language) that allows the user to write queries using a simple database language and a number of relations containing information about the messages. This provides structured access to the messages and their content, and allows for the computation of relations.
Access to messages is considered important as messages are the 'key players' at this level. In addition, these relations contain information that maps the content of the messages back to the source code, that is, the lines where data was packed or unpacked and messages sent and received. We demonstrate how to write simple queries to compute a number of useful relations.

At the last level, the protocol level, we implement three different tools. The first tool is an implementation of an algorithm for correcting deadlocks in message passing systems. We show that if a deadlock is caused by a small number of typographical errors in the send or receive calls, the presented algorithm can, with high probability, suggest the correct way of removing the deadlock. We formally argue for the validity and the accuracy of the algorithm by proving an upper bound for the error rate. By focusing on the 'deadlock error type' we have raised the level of abstraction such that an automatic analysis can be performed, and in addition, the algorithm will provide a way to correct the program to remove the deadlock. Again, information mapping the error back to the source is provided, as well as information about which lines to change and what changes to make.

The second tool at this level is a protocol conformance checking tool. We introduce a tool that allows the programmer to specify a number of constraints on the communication protocol of a parallel system in as much detail as he wants. The tool is used to specify constraints on the message passing pattern of the communication protocol, as opposed to its temporal and spatial aspects. The runtime system of Millipede reads the protocol specification, checks each message against it, and reports any errors. The constraint specification can start out very general and become increasingly detailed as the program, or the debugging effort, evolves. This tool can be used in connection with iterative hypothesis testing as well as during the development cycle. One main argument for this tool is its ability to bridge the gap between a protocol specification (verified or not) and an actual implementation; even if a protocol has been verified using verification tools, the risk of making mistakes while implementing it still prevails. This tool provides a way to subject an implementation of a protocol to a number of constraints that are automatically checked by the runtime system.

The motivation for the last tool at this level is the problem of guaranteeing k-safety (i.e., the problem of determining the buffer requirements for message passing systems). We investigated k-safety for four buffering schemes, and in all cases showed that the problem remains intractable. We showed that the related problem of computing the number of buffers needed to avoid deadlock and blocking sends is tractable. This algorithm provides an upper bound that can be used in combination with techniques for inserting synchronization points into the code to ensure k-safety for any k. We show a number of results for different buffer placement schemes (sender side buffers, receiver side buffers, etc.), some of which are proven intractable, and some tractable. We develop an algorithm for computing the number of buffers needed to avoid deadlock and blocking in a system with receive side buffers only. The original k-safety problem for all buffer schemes is intractable.
Using the algorithm we developed to avoid blocking, we describe techniques to approximate solutions to the k-safety problem.

The decomposition of the parallel programming domain into three levels and the study of error types induces a multilevel debugging strategy: a bottom-up approach where the error type determines what type of tool should be applied. Furthermore, it states that errors at lower levels should be attended to before turning to errors at higher levels. That is, pointer errors and array-out-of-bounds errors should be corrected before turning to debugging the protocol. In addition, this decomposition, based on levels and error types, provides a framework for developing new tools, which in turn drives the debugging process.

We showed that it is possible to develop tools according to the multilevel debugging methodology. As described in the following chapters, we have provided a number of such tools at all three levels. The tools all have simple user interfaces, and no rewriting of the source code is necessary to deploy them. In addition, each tool maps the errors in question directly back to the source code, and in some cases suggests how to remove the error.

The work in Chapters 4, 6, 7 and 8 has been published as conference papers in the parallel programming community: the sequential debugging module, presented in Chapter 4, has been published in [BW00]; the deadlock detection and correction work, described in Chapter 6, appears in [BW01a]; the protocol checking tool has been published in [BW01b]; and finally, parts of the joint work with Alex Brodsky on the buffer allocation problems, found in Chapter 8, have been published in [BBW02].

Chapter 2

Background and Related Work

"The power of accurate observation is commonly called cynicism by those who have not got it." - George Bernard Shaw

"If we begin with certainties, we will end in doubt. But if we begin with doubts and bear them patiently, we may end in certainty." — Francis Bacon

2.1 Background and Rationale

Debugging sequential programs can be a tedious and time consuming task. The time required can be greatly reduced by using some of the many tools developed for this task. Some of the more well known debugging tools include Gdb [Gdb] and Purify [Pur], as well as the various integrated development environments accompanying programming languages. Unfortunately, such tools are not as readily available in the parallel programming domain. To better understand the lack of tools and the limited use of existing tools, the next section briefly introduces some of the problems encountered when working in the parallel programming domain and some of the observations made about debugging in general.

2.1.1 The Parallel Programming Domain

Parallel programming involves a set of components that must each be considered when developing a parallel system. This set, which we regard as the parallel programming domain, includes, among others, the following aspects of the code: sequential code, interprocess communication, synchronization, and processor utilization. Understanding the issues involved with the components of this domain makes it easier to understand the source and manifestation of errors. This understanding is useful for determining the approach needed to efficiently debug parallel programs. In addition, it helps determine where to focus the debugging effort, depending on which component of the domain the programmer looks for errors in.
In [Fos95] a four stage model for constructing a parallel program, referred to as PCAM and representing the parallel programming domain, is suggested. The four components are:

1. Partitioning. The computation to be performed and the data it operates on are decomposed into small tasks.

2. Communication. The communication required to coordinate task execution is determined, and the appropriate communication structures and algorithms are defined.

3. Agglomeration. The task and communication structures defined in the first two stages of a design are evaluated with respect to performance requirements and implementation costs.

4. Mapping. Each task is assigned to a processor in a manner that attempts to satisfy the competing goals of maximizing processor utilization and minimizing communication costs.

The last two components, agglomeration and mapping, are mostly concerned with performance issues which, while important, are outside the scope of this dissertation. For the first two components, partitioning and communication, we propose the following breakdown:

1. Algorithmic changes. Most parallel programs begin life as a sequential program. If parallel algorithms are based on, or derived from, existing algorithms and/or programs, then a transformation from the sequential to the parallel domain must occur. The transformation of a sequential program into a parallel program typically consists of inserting message passing calls into the code and changing the existing data layout; for example, shrinking the size of arrays as data is distributed over a number of processes. However, if the sequential algorithm is not suitable for parallel implementation, a new algorithm must be developed. For example, the pipe-and-roll matrix multiplication algorithm [FJL+88] does not have a sequential counterpart.

2. Data decomposition. When a program is re-implemented, the data is distributed according to the algorithm being implemented. Whether it is the transformation of a sequential program or an implementation of a parallel algorithm from scratch, data decomposition is a nontrivial task that cannot be ignored when writing parallel programs, as not only correctness but also efficiency greatly depends on it.

3. Data exchange. As parallel programs consist of a number of concurrently executing processes, the need to exchange data inevitably arises. This problem does not exist in the sequential world of programming, where all the data is available in the process running the sequential program. However, in parallel programs, the need for data exchange is present. On a shared memory machine, the data can be read directly from memory by any process. There is still the problem of synchronized access to shared data to consider, but no sending and receiving of data is needed. When working with a cluster of processors, each having a separate memory, message passing becomes necessary. When message passing systems like PVM and MPI are used, the programmer is responsible for a number of different tasks: specifying the correct IDs of the involved processes, packing messages into buffers, using the correct functions to pack the data depending on its type, and assigning tags to the message. In part, the difficulty of using a message passing library like PVM lies in the low level of the interface of the message passing system. Figure 2.1 shows an example of the minimal number of steps needed to perform a send and a receive of two integers (stored in variables a and b, respectively).
Sender:
1: pvm_initsend(PvmDataDefault);
2: pvm_pkint(&a, 1, 1);
3: pvm_pkint(&b, 1, 1);
4: pvm_send(Receiver, 22);

Receiver:
5: pvm_recv(Sender, 22);
6: pvm_upkint(&myA, 1, 1);
7: pvm_upkint(&myB, 1, 1);

Figure 2.1: Simple PVM program that exchanges two values using default buffers.

As Figure 2.1 illustrates, a number of low level message passing library function calls must be performed to send a message. Line 1 initializes the message passing system to use the default send buffer with default data encoding. Line 2 packs the integer a into the send buffer according to the encoding scheme specified in the pvm_initsend call. pvm_pkint can be used to pack arrays of integers as well; the second parameter is the number of integers to pack, and the third is the stride. In this example, only one value is packed, so the number of values to pack and the stride are both 1. Finally, line 4 sends the message to a process with task ID Receiver with message tag 22. The message tag is an integer that allows the message passing system to differentiate message types. A typical use of the message tag is to use different values for different message types; this also allows the receiver to specify a specific type of message in receive calls, thus allowing messages to overtake each other in the message buffer queues at their destination.

The second part of Figure 2.1 is the code necessary to receive the message. Line 5 issues a receive call, which requests a message from a sender with the task ID Sender and message tag 22. This receive call is blocking: if no message is available from the correct sender with the correct tag, the receive call simply blocks until such a message arrives. The sender task ID and the tag can be specified as wild cards that match anything, but for reasons such as the ability to read and understand a program, wild cards should be used with care. More importantly, the quality of some of the analyses we introduce later will increase when the use of wild cards is reduced. There exist nonblocking, or timer controlled, versions of the receive call, but these add further complications to the code. Once the message has been passed to the process by the underlying message passing system, it must be unpacked into the destination variables. The pvm_upkint call is the exact opposite of the packing call; it unpacks values into variables. Again, the number of values and the stride can be specified.

The low level nature of a message passing library such as PVM or MPI (MPI does support packing complex data structures in one call, but sender, receiver, tags, and variables must still be specified) increases the risk of introducing errors: a wrong sender or receiver may be specified, wrong variables may be packed, or values may be packed or unpacked in the wrong order. These are problems that can occur when the default buffers are used. If the programmer allocates buffers explicitly, instead of using the default buffer, the issues of buffer management arise. A single send buffer is not always sufficient; the prototype debugger that we implemented in connection with this work makes heavy use of allocated buffers. Figure 2.2 shows an example using an allocated buffer. Only one send buffer can be active at any time during execution. This means that if a process uses more than one send buffer, explicit buffer handling is necessary. When creating a new send buffer, the old one must be saved so it can be restored later.
Sender:
1: newSendBuffer = pvm_mkbuf(PvmDataDefault);
2: oldSendBuffer = pvm_setsbuf(newSendBuffer);
3: pvm_pkint(&a, 1, 1);
4: pvm_pkint(&b, 1, 1);
5: pvm_send(Receiver, 22);
6: pvm_setsbuf(oldSendBuffer);
7: pvm_freebuf(newSendBuffer);

Receiver:
8: pvm_recv(Sender, 22);
9: pvm_upkint(&myA, 1, 1);
10: pvm_upkint(&myB, 1, 1);

Figure 2.2: PVM code for sending and receiving two values using explicitly allocated buffers.

Line 1 creates the new send buffer, and line 2 stores the old buffer and activates the new one. Lines 3 and 4 pack the data stored in variables a and b, and line 5 sends the message to Receiver with tag 22. Line 6 restores the previous send buffer and, finally, line 7 deletes the newly allocated send buffer. The receiver in Figure 2.2 is identical to the one in Figure 2.1. The pvm_recv call returns the identifier of the new active receive buffer; this is a buffer created by the underlying system. If the receiving process works with multiple buffers simultaneously, pvm_setrbuf can be used in a manner similar to pvm_setsbuf. The added complexity of managing send or receive buffers naturally increases the risk of introducing errors, and further complicates the use of the message passing libraries. Data exchange is concerned with the point to point communication of data between two processes, and not the overall communication structure of the entire program. Thus, for every data exchange, there is one send operation and at least one receive (if broadcast or multicast is used there can be multiple receivers).

4. Protocol specification. The protocol for a parallel system is defined as the content, order, and overall structure of the message passing between communicating processes. Along with the data exchange, the communication protocol of the program is a new concept introduced by parallelizing the algorithm.

Figure 2.3 shows a stylized representation of a sequential and a parallel program. As shown, a sequential program is depicted as a single box, representing the sequential code of the program. The parallel program is represented as a number of boxes, each consisting of three nested boxes. The innermost of these boxes represents the sequential program that each process in the parallel program executes. The sequential code of the parallel program can either be an adaption of the existing sequential program or a completely rewritten piece of code. The middle box represents the messages being sent and received in the system (the data exchange), and the outer box represents the protocol that the communicating processes must adhere to.

Figure 2.3: In the sequential programming domain we work with one (sequential) program, whereas in the parallel domain we encounter a number of parallel processes making up a parallel program. These parallel processes each execute a sequential program, but in addition send messages to other processes while adhering to a communication protocol. The messages and the protocol are represented as boxes enclosing the sequential program as new levels.

2.2 The Debugging Process

In this section we introduce the debugging problem and process, briefly present ideas about how to debug in general, describe the problems with current approaches, look at the purpose of our research, and explain how it differs from existing systems.
2.2.1 Iterative debugging

A well known approach to debugging was proposed by Araki, Furukawa and Cheng [AFC91]. They describe debugging as an iterative process of developing hypotheses and verifying or refuting them. They proposed the following four step process:

1. Initial hypothesis set. The programmer creates a hypothesis about the errors in the program, including the locations in the program where errors may occur, as well as a hypothesis about their cause, their behaviour, and the modifications needed to correct them.

2. Hypothesis set modification. As the debugging task progresses, the hypothesis set changes through the generation of new hypotheses, refinement, and the authentication of existing ones.

3. Hypothesis selection. Hypotheses are selected according to certain strategies, such as narrowing the search space and the significance of the error.

4. Hypothesis verification. The hypothesis is verified or discarded using one or more of four different techniques: static analysis; dynamic analysis (executing the program); semi-dynamic analysis (hand simulation and symbolic execution); and program modification.

If the errors have not been fixed after step four, the process is repeated from step two. In the above model, step four, hypothesis verification, is the focus of our research.

2.2.2 The Why, How and What of Errors

M. Eisenstadt describes in [Eis97] a 3-dimensional space in which sequential errors are placed according to certain criteria. This classification shows some interesting results, which we briefly summarize. 51 programmers were asked to participate in a study in which programming errors are placed into a 3-dimensional space. The 3 dimensions are:

• Dimension 1: Why is the error difficult to find?
• Dimension 2: How is the error found?
• Dimension 3: What is the root cause of the error?

We briefly describe the results of the survey with respect to each of the dimensions.

Dimension 1: Why is an error hard to find?

This first dimension is concerned with the difficulty of locating the problem, and is further divided into 5 subcategories:

1. Cause/effect chasm. Often the symptom of the error is far removed in space and time from the root cause, and this makes the cause hard to detect. Specific instances can involve timing or synchronization problems, bugs which are intermittent, inconsistent, or infrequent, and bugs which materialize 'far away' (e.g., thousands of iterations) from the actual place they were spawned.

2. Tools inapplicable or hampered. These are the so called 'Heisenbugs' [Gra86]. This covers bugs that go away when the debugging tool is switched on. Another type of bug in this category is referred to as 'context precludes', and covers the cases where memory constraints or other configuration issues make it impractical or impossible to use the debugging tool.

3. WYSIPIG (What You See Is Probably Illusory, Guv'nor). A piece of code is misconceived; it does not give the result that it looks like it should produce. An example is the number 010, which in Tcl does not equal the decimal value 10 (ten), but the octal value 8 (eight); the leading 0 makes the Tcl interpreter treat the value as octal.

4. Faulty assumption/model. The programmer does not understand the underlying system, model or environment. An example is assuming the stack grows up rather than down.

5. Spaghetti (unstructured) code. The code is hard to read.
This is typically reported as a reason when programmers work with code they did not write themselves.

Table 2.1 shows how the 51 answers are placed in the above categories for dimension 1.

Category                          No. of answers   Percentage
Cause/effect chasm                      15            29.4%
Tools inapplicable or hampered          13            23.5%
WYSIPIG                                  7            13.7%
Faulty assumption/model                  6            11.8%
Spaghetti code                           3             5.9%
No answer                                8            15.7%

Table 2.1: Dimension 1. Why is an error hard to find?

It is notable that over 50% of the cases are caused by the first two categories. As we see later on, the first category, the cause/effect chasm, is greatly amplified in the parallel programming domain, and the second category is, as we have already pointed out, one of the problems we are researching.

Dimension 2: How is an error found?

This dimension is concerned with how an error can be found, and it is divided into four categories:

1. Gathering data. The programmer discovers more using methods such as print statements and breakpoints. This category includes a number of subcategories:

• Step-and-study, which includes single stepping through the code using a debugger.
• Wrap-and-profile, where profiling information is collected by enclosing the suspect function inside another function that does the information collecting.
• Print-and-peruse, the most well known of the subcategories, involves inserting print statements and observing the output.
• Dump-and-diff, which involves comparing different versions of large amounts of gathered information (e.g., a true core dump).
• Conditional break and inspect, which includes the use of breakpoints.
• Specialist profile tool, which includes using standard tools, such as Purify, to locate memory leaks.

2. 'Inspeculation'. This term covers inspection of the code, hand simulation, and speculation. Speculation involves leaving the code to think about the problem, then later returning to try to correct it.

3. Expert recognized cliche. The programmer receives assistance from other people.

4. Controlled experiments. Once the cause of the error is better understood, specialized tools or approaches can be applied.

The placement of the answers in dimension 2 can be seen in Table 2.2.

Category                     No. of answers   Percentage
Gather data                        27            53%
'Inspeculation'                    13            25.5%
Expert recognized cliche            5             9.8%
Controlled experiments              4             7.8%
No answer                           2             3.9%

Table 2.2: Dimension 2. How is an error found?

An interesting, but not surprising, result is that data gathering (e.g., print statements) and hand simulation account for almost 78% of the techniques reported for locating errors. This result corroborates the result of Pancake [Pan94]: up to 90% of all debugging is done using print statements.

While the use of print statements is straightforward when working with sequential programs, their use in parallel programs is often more complicated. Often, processes run on remote processors, which makes redirecting output to the console difficult. Even when output can be redirected to the console, all processes write to the same window, making the interpretation of the output a challenging task. This is an example of the information overload problem mentioned earlier. Furthermore, the order of the output (i.e., the debugging information from the concurrently executing processes) is not the same for every run, as the processes execute asynchronously and only synchronize through message passing. A possible solution is to have each process write its output to a disk file.
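A minimal sketch of such per-process log files follows; the file naming scheme, based on the PVM task identifier, is an illustrative assumption rather than a prescribed convention.

#include <stdio.h>
#include "pvm3.h"

/* Each process opens its own log file so that output from concurrently
   running processes is never interleaved in one window or file. */
FILE *open_process_log(void)
{
    char name[64];
    sprintf(name, "debug-%d.log", pvm_mytid());
    return fopen(name, "w");
}

void log_event(FILE *log, const char *what, int value)
{
    fprintf(log, "%s: %d\n", what, value);
    /* Without this flush, output still sitting in the stdio buffer is lost
       if the process crashes; this is exactly the pitfall discussed next. */
    fflush(log);
}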
However, this introduces the problem of nonflushed file buffers; if a process crashes, the buffer might not be flushed, thus missing output written by the program. Of course this can be solved by inserting calls to flush the I/O buffers, but if these are missing, the programmer ends up spending time on debugging the code he added for debugging purposes! 2.2. The Debugging Process 19 In the worst case this can lead the programmer to believe that the process crashed somewhere between the last print statement that appears in the file, and the first one that does not. A lot of t ime can then be wasted looking for an error in a place where no error can be found. Dimension 3: What is the root cause of the error? This last dimension contains nine categories: 1. Memory: Memory is 'clobbered' or used up. This includes overwriting a reserved portion of the memory causing the system to crash, and array subscripts out of bounds. 2. Vendor: Compilers generate wrong code or the hardware is faulty (logic boards do not adhere to specifications or are broken). 3. Design logic: The logic design of the algorithm is wrong. Examples include cases forgotten or overlooked by the programmer. 4. Initialization: Covers wrong types, redefinition of the meaning of system keywords, or incorrectly initialization of a variable. 5. Variables: Wrong use of operators or variables. 6. Lexical: Lexical problem, bad parse or ambiguous syntax. These are trivial problems such as typographical errors. 7. Unsolved: As yet undetermined. 8. Language: Language, semantic ambiguities or misunderstandings. For example, 250K is not 250,000, but rather 256,000 (250*1,024). 9. Behaviour: Unanticipated behaviour by the user that makes the program behave in a unanticipated way. Table 2.3 shows that nearly 50% of the errors are caused by the first two categories. This also perfectly agrees with previous studies where tools and runtime systems are described as a source of errors [Pan94]. The classification used in dimension 3 is a mixture of deep plan analysis [Joh83, SSP85] and phenomenological analysis [Knu89]. Deep plan analysis states that many bugs can be accounted for by analyzing the high level abstract plans underlying specific programs, and by specifying both the possible fates that a plan component may undergo (i.e., missing or misplaced). An alternative phenomenological taxonomy can be found in [Knu89] where the root causes are divided into nine categories, all very similar to the ones in Table 2.3. 2.3. Related Work 20 Category No. of answers Percentage Memory 13 25.5% Vendor 9 17.7% Design logic 7 13.7% Initialization 6 11.8% Variables 4 7.8% Lexical 3 5.9% Unsolved 3 5.9% Language 2 3.9% Behaviour 2 3.9% No answer 2 3.9% Table 2.3: Dimension 3. What is the root cause of the error? 2.3 Related Work In this section we describe some of the existing approaches to parallel debugging and parallel system development. We try to point out any shortcomings these tools might have, and compare them with the theory of errors and debugging presented earlier. 2.3.1 Program Development Environments One approach to writing programs is to use an integrated development environment (IDE). Some well known examples in the sequential world include Visual C/C++, Visual Basic from Microsoft, and the Eclipse and the NetBeans IDEs for Java. Not only do these environments offer support for program development, but they come with built-in debuggers. 
The idea of developing pro-grams and complicated systems through a development environment also extends to the parallel programming domain. One of the most important tasks of a program development tool is to allow the user to develop programs in a structured way using some high-level abstraction, such as graphs. An important side effect of the structure and high-level abstraction is a lowered risk of introducing certain error types. For instance, certain tools, such as the P V M b u i l d e r tool [BB97], always create deadlock free message passing code by ensuring that all send calls are matched with receive calls and that the corresponding communication graph does not have cycles. This abstraction allows the user to concentrate on higher levels of the program design, for example, function or control, or data decomposition of the program, depending on which abstraction is adopted by the environment. However, this high level of abstraction restricts the user in which types of programs he can develop using P V M b u i l d e r . Programs with dynamic communication cannot be implemented. Often, the structure and abstraction level offers relatively easy debugging of certain types 2.3. Related Work 21 of errors within the environment, which of course, is a very desirable quality from a debugging point of view. Unfortunately, the concepts that make development environments desirable also have their disadvantages. An environment structured around a high level of abstraction is a good tool for program development, only if the abstraction of the task at hand matches that of the environment. For example, if the environment is structured around the data flow model and the program being implemented is structured according to the control flow model, then implementing such an algorithm becomes complicated and cumbersome. A concrete example of this problem emerges when trying to write a program that makes use of explicit message passing using a tool that supports the data flow model. That is, entire blocks of data flow between functional units in the program representation of the tool. Even if it is possible to implement the program, it wi l l be conceptually difficult and artificially structured. The problem with the structure and the abstraction of the tool not fitting that of the program being implemented is one of the most common reasons for not using such tools [Pan94]. Another reason is that users are often conservative and hesitant to learn new environ-ments [Pan94]. The nature of the generated code can contribute to a programmer's hesitation. If the tool supports a source to source transformation, for example, from the tool abstraction to C source code, this code can be hard to read or illogically structured because of the automated code generation. Even worse, sometimes no source code is available, and that limits a programmer's ability to apply other tools for further development and maintenance. This problem is apparent with the P V M b u i l d e r tool. Since the tool generates all the communication code when the user compiles the application, this code is often extremely hard to read. Debugging such code, or code produced by rewriting tools, is a daunting task. In the worst case, the generated code is virtually unreadable. Furthermore, data structures and functions not implemented by the user might be used by the generated code, adding yet another level of complication to the debugging task. 
Some examples of environments specific to developing parallel applications are described in the following. Examples of environments that have adopted the data flow model as a main abstraction are Code [NC92, NY93] and HeNCE [Don]. The abstraction is based on data flowing between functional units or entire processes of the system. Trapper [Hei97] is a CSP based too l -a Trapper procedure contains channel communication calls to read and write to the channels. The underlying implementation is hidden, however, there is stil l a need for the programmer to directly specify what to send and receive. It is the graphical user interface's task to link channels together, thus making sure there are no disconnected channel ends. Trapper is comparable to programming libraries like JCSP [WAF02] without a graphical user interface. However, such interfaces are currently being developed for JCSP Like Code, HeNCE, and Trapper, graphs are also used to represent programs by P V M b u i l d e r 2.3. Related Work 22 and VPE [ND94]. However, these two tools both allow explicit message passing. In these cases the abstraction adopted leans more towards the control flow model than the data flow model. Tools like Enterprise [SSS90] and Frameworks [SSG91] take a template-based approach to generating distributed applications. Programs are written as sequential procedures enclosed in templates. The templates hide all the distributed computing implementation details, such as communication and synchronization. The procedures themselves contain only a small amount of information as to how they interact with the rest of the system. Most of it is specified separately via templates. Al l of the tools strive to make parallel programming easier, that is, to reduce the number of errors, and take away much of the work with respect to explicit message passing from the user. Unfortunately, when this is the goal, the user's freedom and expressiveness is reduced; the safer the development tool is required to be the more restrictive it becomes, and the more limited the expressiveness becomes. In other words, the higher the level of abstraction and the more rigid the structure of a tool, the smaller the set of easily implementable programs. The greater the set of programs the user wishes to implement, the more general the environment must be. At one end of the spectrum, tools are specifically designed for a certain type of program with a very rigid structure. At the other end, 100% manually coded programs exist where the programmer himself supplies message passing calls using, for example, PVM or MPI. Many of the tools described fall in between these two extremes. However, the problem of picking the correct tool or trying to util ize familiar tools to solve the problem at hand remains. Picking the right tool might involve having to learn a new abstraction and a new tool. 2.3.2 Visualization Tools One class of tools that can be used not only for performance tuning, but also for debugging is the family of visualization tools. Visualization tools are categorized by their ability to provide the user with information about a program's behaviour. A typical tool offers a fixed set of views, each displaying different information about the system in various ways, such as graphs and charts. A visualization tool that supports message passing views can be used when the programmer searches for errors involving stray messages or simply erroneous protocol specifications. 
Such tools are also excellent if the programmer is trying to obtain a global view of the entire system. However, often global views are much too vast for a programmer to easily locate errors. These types of tools are faced with the very difficult task of providing a vast amount of information in an easy to understand way. This problem has been addressed by Pancake [Pan99], and some of the more serious issues pointed out include not only the difficulty with presenting large amounts of data, but also the inability to zoom into views, to extract lower level information, and to map 2.3. Related Work 23 these displays/views back to the source code which created the error. This is similar to having gathered data as in dimension 2, but the data is not directly under-standable, which is caused by the inapplicability of the tool (dimension 1). In addition, there are also problems associated with the amount of data that can be displayed, the type of displays, and at least for tuning, the problem of perturbing the execution of the program. The last problem can appear in any tool based on a software monitoring and runtime collection of information; however, it is more crit ical in the case of performance tuning and identifying performance problems. Another problem is the lack of user defined views. Many visualization tools support a limited preprogrammed set of views. We believe that this directly contradicts one of the design goals set by Eisenstadt in [Eis97]: the possibility of a tool that offers a view of what the user wants, when she wants it, not just the information that the programmer of the tool thinks might be useful to the user. Examples of such tools are Paradyn [MHC94], Vampir [NAW+96], and ParaGraph [HE93]. One visualization tool specific to PVM is XPVM [KG96], which uses the tracing facil it ies available in PVM 3.4, and offers a graphical user interface to dynamically visualize network status, util ization, message queues, and much more. These tools all use graphical representations to display program behaviour. 2.3.3 Extension of Sequential Debuggers In this section we look at the family of debuggers that are extensions of well known sequential debuggers. We divide this class of tools into two categories: debugging environments and N-version sequential tools. Debugging Environments A number of debugging environments exist that support parallel debugging. These environments are typically extensions of sequential debuggers. As a result, the set of operations available in these tools are wel l known because of familiarity with standard sequential debuggers. These include stepping, breakpoints and variable inspection. The biggest di f ference-and greatest strength-is that these tools operate on several processes at a t ime, thus allowing collective breakpoints over multiple pieces of source code and macro stepping, which allow several pro-cesses to step through one line of program code at the same time. The strength of these tools is their ability to control multiple processes at the same time. This can be a problem for the user; keeping track of a large number of processes simultaneously blurs the focus of the debugging task. Even though these tools support the common set of debugging activities, they all require the user to learn a new environment with its own graphical interface. Furthermore, the focus is on the sequential code, not on the entire parallel system. That is, the granularity cannot be 2.3. 
Related Work 24 varied, but is set to 'fine grained'; only the sequential debugging task is supported. In a sense, these parallel debugging environments can be said to suffer from the opposite problem as the visualization tools: the granularity is too fine and the focus is always on the source code. Such tools, therefore, get placed into the 'tool inapplicable or hampered' category of dimension 1. Examples of such debugging environments include DIWIDE [KLK99] and TotalView [Pal99]. The DIWIDE debugger is a parallel debugger that implements collective breakpoints and macro steps (collectively stepping over program parts). It allows the programmer to treat a collection of processes as one, and allows the user to easily issue global commands and set global breakpoints. TotalView, a commercial product, is a multiprocess, multithreaded tool for online source code debugging. N-version Debuggers A naive approach to parallel debugging involves the use of N copies of a sequential debugger like Gdb-one for each process. The disadvantage of providing N versions of a sequential tool is the overwhelming amount of information. In addition, the way in which this vast amount of information is presented to the user is often inappropriate for the task at hand [Pan99]; it is not as easy for the user to focus on one particular process in the system when attending to all of them. The complexity of the program development process alters drastically when parallelism is introduced, and the problems are heightened by the relative instability of current parallel runtime environments [Pan94]. This suggests that debugging a parallel program while al l the processes are running concurrently may be too difficult and a tool tailored more to specific processes in the system is more effective. In addition, the granularity cannot be varied and the user is left with the functionality of a sequential tool, which might not be applicable for a parallel debugging task. N-version debuggers also lack the ability to supply different views of whatever information the user might want. Although al l variables and program texts are available (which can be a great advantage when debugging low level sequential code), this information is spread over TV windows and not readily available for queries. It would take an overwhelming amount of time for the programmer to extract, collect and interpret the information available. In addition, if the focus is on one single process, the debugging views are not needed for the rest of the processes. An example is pdbx [Pdb] for the SP/2 , which is a front end for multiple instances of the UNIX debugger dbx running on multiple nodes on an IBM SP system. As wel l , p2d2 [Hoo96] is a graphical front end for multiple instances of the Gdb sequential debugger, and has been used successfully to debug systems with as many as 128 concurrent processes. It is possible to design a script for PVM to allow users to execute Gdb on every process spawned. However, that would require a script for each sequential tool being supported, and it is more difficult, in PVM, to conditionally 2.3. Related Work 25 spawn the debugger for a given set of processes, without having to rewrite the code in the original program. 2.3.4 Replay Tools/Debuggers Another major class of tools is the family of replay tools which allows the user to animate or replay the execution of a program [XWXS96, TSS96, KV97, Arv92, CFR95]. 
Replay tools collect information about the system as the program executes: messages are col lected, time stamped and saved on secondary storage for replay. When the tool replays the execution, information about message content and program state is retrieved from the disk. A replay tool is typically considered an offline tool, deployed when the program has finished executing. Many of these systems have a set granularity, thus focusing on, for example, the source code level, leaving the user helpless if debugging on a higher level is needed. The opposite can of course also be the case: when the focus is on the higher level of the system, mapping the error back to the lower level is difficult because the tool does not readily support debugging at a lower level. BUSTER [XWXS96] is one such 'post mortem' replay system; it allows the user to reexecute the debugged program in different modes, depending on the amount of control needed, without having to run the message passing system. A system like PVaniM [TSS96] is another PVM based graphical tool that supports both online debugging and post mortem visualization of a parallel execution. Other examples of tools that combine the online and offline strategies of debugging in a more integrated environment include MAD [KV97], PDT from the Annai toolset [CFR95], and PDM [Arv92]. These tools provide more integrated environments for debugging while providing higher level tools for finding and correcting specific errors, such as communication errors. MAD is a debugging environment based on event graphs and their manipulation. It allows debugging on various levels, from pattern of processes (groups of event graphs), to control flow graphs and source code. PDT is an interactive distributed source-level debugger for distributed memory parallel processes in the Annai toolset, and allows both online debugging and offline replay. The PDM system is a framework for detecting communication-related errors in concurrent Occam programs running on a Transputer network. The major disadvantage of these tools is the massive amount of information involved and the need to learn a new environment. The replay mechanism is extremely useful. However, unless other tools can be used in connection with a replay, it becomes virtually impossible to accomplish a specific debugging task, unless the tool specifically supports it. Again, if the replay mechanism is merged with some of the techniques described in earlier subsections of this chapter, a stronger and more flexible tool results. 2.3. Related Work 26 2.3.5 Relative Debuggers For the sake of completeness we wish to mention the concept of relative debugging. Relative debugging is a technique often used when porting programs to different architectures, thus allowing the execution of two different versions of the same program on two different machines at the same time. Guard [SA97, WA98] is such a debugger. It executes two different instances of the same program on two different machines, thus allowing the programmer to compare the contents of variables and more while the programs execute. Relative debugging is useful for the programmer who ports existing programs to different architectures or operating systems. The technique focuses on comparing two different instances of the same algorithm, where one is known to produce the desired result. Thus, this technique is not directly applicable when it comes to general (parallel) debugging. 
2.3.6 Language Support for Communication

A different approach altogether involves writing programs using languages with built-in support for parallelism. A well known example of this is the μC++ language [BS95] from the University of Waterloo. μC++ extends C++ with new language constructs to express parallelism and provides a runtime system that runs programs concurrently or in parallel, when appropriate hardware is available. However, the need to debug is still present. The work in [BK95, Kar95] describes debugging and performance tools for μC++ in greater detail. A debugging session for a program written in μC++ compares to those found in DIWIDE and TotalView: a front end to a number of instances of well known sequential debuggers, such as Gdb, attached to each process being debugged.

Other well known examples of languages that support communication include CML [RWZ88] (Concurrent ML) and Facile [GMP89a, GMP89b] (ML with higher order concurrent processes based on CCS), both functional languages. Often such languages are not considered to be platforms for implementing parallel applications. One of the strongest arguments is lack of speed, a well known side effect of functional programming languages. This is an unfortunate tradeoff, as functional programs are more easily verified by program verification tools, thus reducing the need for debugging.

2.3.7 Summary of Related Work

The following points summarize the problems with many of the existing tools:

• Restrictive interfaces that support only a number of predefined tasks.

• The data gathered need to be interpreted by the user to map the error back to the cause, which often renders the tool less useful. In other words, the cause/effect concept is not well supported. In [KV97], it is argued that the original source code is a good basis for debugging activities, since it contains the cause of the wrong behaviour.

• A fixed, often small-grained, number of tasks is supported. Fixed granularity in connection with restrictions on the interface makes debugging at higher levels almost impossible.

• Information overload: the amount of information presented can be so large that the time needed to find the relevant information becomes unmanageable.

What the user wants is not always available in any of the reviewed systems. Each of the systems has strong points and can be very useful for certain tasks. Unfortunately, applying different selections of tools from different toolsets is an impossible task; different user interfaces, different representations, different formats and so on make changing between tools for different debugging tasks difficult. This means that the user must choose only one, or at best a small number of tools, which might not be preferred for the debugging task.

2.4 Top-down versus bottom-up debugging

Figure 2.4: The top-down debugging strategy versus the bottom-up. Thin solid lines represent various error types. The thick solid line represents a typical top-down debugging tool (a general debugger) that presents all the information available to the user. The dotted thick line illustrates a bottom-up approach: the tool (a specialised debugger) is specifically designed for finding and correcting errors of type 2.

Many debugging tools and environments offer a global view of the entire program and leave it to the programmer to narrow the search space, including specializing the formation and testing of hypotheses, and searching for errors.
This approach poses one of the greatest problems with existing tools and environments. The set of visualization tools and environments do support a global problem identification and hypothesis-making process, however, they do not readily support the process of localizing the error and mapping it back to the source code. We believe that this is due to the information overload theory presented earlier and supported in [Pan99]. We refer to this method of debugging using a global view tool as a top-down approach to debugging. In Figure 2.4 the inner shape, containing various error types, represents the potential errors in a program. The figure has been divided into a number of parts, one for each type of error that can occur. The outer bold shape represents the typical method of debugging when using debugging tools: the top-down approach. A different approach is obtained by turning this well known method upside down. Instead of providing a global view of a program and allowing the user to look for any kind of error using just one tool, we propose that a bottom-up approach be adopted. Assume the user has made some hypothesis about the type of error, typically based on the report obtained when the error occurred. We propose the application of a tool specifically tailored to supporting hypothesis creation, verification and error search of the specific type of error, using the extra information that can be gathered from a parallel application. As mentioned, this includes information about messages, their content as wel l as protocol information, such as message exchange patterns. 2.5 Multilevel Debugging The purpose of this research is to examine a bottom-up approach, which we refer to as 'mult i level debugging', over the more conventional top-down approach described earlier. The focus is closely tied to the major points of the multidimensional analysis described earlier and the description il lustrated in Figure 2.3. In addition, we strive to develop tools and techniques that make use of the extra debugging information that can be extracted from the parallel program, and we show that new useful analyses can be done based on this information. To succeed in this task we believe that the following three points must be understood and shown to be manageable tasks: 1. Error classif ication. We wish to determine the various types of errors involved in parallel message passing programming and develop a methodology for efficient debugging of parallel message passing programs. A number of new types of errors arise when dealing with parallel message passing systems. We still believe that dimensions 1 and 2 can be applied as is, whereas dimension 3 must be extended to contain error types caused by message passing. 2.6. Error Classification 29 2. Tool development. Understanding these error types makes it possible to write specific tools that can greatly assist the programmer to more easily debug parallel message passing programs. 3. Automat ion. It is possible that some of these tools can be semi-automated to remove part of the burden of debugging from the programmer. By focusing on different error types in an isolated way, tasks that might have been intractable become tractable, and in some cases, it is possible to automate the debugging or correction process. If these three tasks can be accomplished, they wil l promote the writing of parallel message passing programs by allowing easy-to-use debugging tools that users wil l find useful. 
2.6 Error Classification

For studying the task of debugging parallel programs, Figure 2.3 illustrates a good starting point. An error in a parallel program can occur at any of the three different levels shown in Figure 2.3. The data decomposition can contain errors as well, but this thesis is not concerned with these types of errors. Data decomposition is a large separate subject that has been described in detail in books such as [Fos95, FJL+88]. We briefly discuss some of the types of errors that can be encountered at the three different levels.

The errors at the sequential level have already been described in the previous sections, but as mentioned, many of these errors occur in the parallel domain as well. In particular, it is worth noting that the cause/effect chasm mentioned in [Eis97] further widens, as the possibility for even greater distance between cause and effect arises when message passing is involved. When messages propagate from one piece of code to another through message passing, an incorrect value can occur and be used in a piece of otherwise correct sequential code. The distance between cause and effect in a sequential program is limited to the one process and its source code. In a parallel program, the distance can potentially be much greater as processes communicate. This increases the distance in both dimensions: space and time. The spatial distance increases as the cause and effect can now occur in different processes. With respect to time, the distance can potentially increase as well. When data is transmitted from one process to another, the time gap between the cause and the effect of an error becomes larger, as it takes substantially more time to pack, transmit, buffer and unpack data than it does to retrieve it from local memory. Combining a large spatial distance with an increased temporal distance further complicates locating the cause of the error.

Figure 2.5 illustrates this situation. Process A computes a bad value (0) for variable a and sends it to process B. Process B uses the value of a as a divisor and hence crashes. It immediately looks as if the error is caused by faulty division, but instead it is caused by a wrong computation by function f in process A. This error then propagates to process B through message passing. This example is typical of how the cause/effect chasm negatively affects message passing programs, because now errors can be propagated from one process to another through the network. Furthermore, the chasm widens when an error propagates through a number of processes before being detected.

    Process A:                                  Process B:
        a = f(...);                                 pvm_recv(...);
        pvm_pkint(buf, a);     {int: a=0}           pvm_upkint(buf, &a);
        pvm_send(...);         ----------->         b = c/a;   <- division by zero error

Figure 2.5: An illegal value 0 for the variable a was propagated from process A to process B, where it caused the program to crash. This example shows how the cause/effect chasm is widened by message passing.

The problem with wrong values in messages carries directly into debugging at the message level. Some of the errors that occur at this level are as follows:

• Wrong values (variables) sent/received. This is an example of one of the types of errors Knuth classifies as a type T (trivial typo) error in [Knu89].

• Too little/much data sent/received. This fits nicely into the M category (mismatch between modules); the programmer is unaware of the mismatch between packs and unpacks.

• Variables packed/unpacked in the wrong order; a sketch of such a mismatch is shown below.
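The following sketch (not taken from the thesis; the message tag, task id and values are assumptions made for illustration) shows a sender and receiver whose pack and unpack calls do not match. PVM performs no type or order checking on its own, so the receiver simply obtains corrupted values or a failed unpack rather than a clear error.

    /* Illustrative sketch only: a pack/unpack mismatch between two PVM
     * processes. MSG_DATA, dest_tid and the data values are assumptions
     * made for this example. */
    #include <pvm3.h>

    #define MSG_DATA 17

    void sender(int dest_tid)
    {
        int count = 4;
        double data[4] = {1.0, 2.0, 3.0, 4.0};

        pvm_initsend(PvmDataDefault);
        pvm_pkint(&count, 1, 1);      /* packs the count first ...          */
        pvm_pkdouble(data, 4, 1);     /* ... then the payload               */
        pvm_send(dest_tid, MSG_DATA);
    }

    void receiver(void)
    {
        int count;
        double data[4];

        pvm_recv(-1, MSG_DATA);
        pvm_upkdouble(data, 4, 1);    /* BUG: payload unpacked before count */
        pvm_upkint(&count, 1, 1);     /* yields garbage or a failed unpack  */
    }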
The highest level of debugging is in the communication protocol. One well known type of error is deadlock. Deadlock can occur for a number of different reasons: • If a process crashes due to an illegal computation, and if another process is blocked in a receive cal l , waiting for a message from the crashed process, a deadlock occurs. Alter-natively, if the process does not crash but sends a message to a wrong or a nonexisting receiver, the same receiver blocks, waiting for a message that never arrives. This scenario with missing messages can easily happen in a master/slave configuration, where the slaves communicate with each other. For example, in a ring pattern, if the process IDs are stored in an array and each slave uses a wrong index to access this array and thus sends its message to the wrong receiver, all the processes eventually block and 2.7. Tool Development 31 create a system wide deadlock, because none of the messages are delivered to the correct receivers. • A safe program is defined in [BD95a] as a program that does not require any buffers to complete. That is, communication is synchronous. Often asynchronous message passing is util ized to overlap computation with communication. Such programs are no longer safe, and can deadlock due to an insufficient number of buffers. We investigate this in detail in Chapter 8 and Appendix E. • Messages delivered to the wrong receiver. Depending on the implementation of the program, this can lead to deadlock, or if wild cards are used, the message can end up being delivered to the wrong receiver and potentially cause errors in that process. • Not only can messages be sent to the wrong receiver, but receivers can also attempt to receive a message from a process that is not sending. This can cause the process to block indefinitely, and at worst, cause a deadlocked process. Al l these points are potential pitfalls in the parallel programming domain. All these errors require debugging. Some of them are fairly easy to correct, while others are more problematic. 2.7 Tool Development In the previous section, we gave examples of the types of errors that can occur, and divided them into the three categories associated with the breakdown in Section 2.1.1: sequential errors, message errors, and protocol errors. These errors are conceptually different; sequential errors are errors found in the straight line code. Message errors are errors caused by or associated with messages: a message can contain wrong data, which can affect otherwise correct sequential code (this is an example of an overlap between errors at different levels). The message can be received by the wrong receiver, or at the wrong place in the correct process, which can more easily happen when wild cards are used extensively. The overall structure of the messages, the protocol, can contain errors as wel l , which can result in messages being sent to wrong receivers. Given the difficulty in having users adopt tools, we believe that in order to increase the usage of a new tool, it must be designed with the following goals in mind: • It is vital that the tool can be used directly on the source code. When using development tools, part of the final program code is generated or inserted by the tool. This means that no complete source code exists for the entire program. If the development tool does not support the debugging task, debugging becomes cumbersome and complicated. Some tools generate a complete source, but machine generated code is typically hard to read and 2.7. 
Tool Development 32 understand. An error in the generated code, rather than the user's code, is very difficult to find. If the user can simply recompile or relink with a debugging library to util ize the tool, the likelihood that the user wi l l adopt the tool is higher. • To promote the usability of the tool it must be easily executable, either from the command line or within a simple interface that does not require the user to learn a new environment. • Finally, the tool should enable users to find and, correct specific types of errors, depending on their manifestation. We believe correct tailoring of the tool is one of the most important goals. Not only does this reduce information overload, but also makes certain complicated and time, consuming debugging tasks easier. This is achieved by targeting a specific type of error, using the information extracted from the program and the messages. Examples include the ability to extract one process from a parallel system and debug it sequentially. We return to this issue in Chapter 4. These three goals are supported by the design goals proposed by Eisenstadt in [Eis97]. The most important ones are these: • Computable relations should be computed on request by the tool, not left to the user to deduce on his own. Examples of violations of this design criteria include the often limited number of views found in many visualization tools [MHC94, NAW+96, HE93]. • Displayable state should be displayed on request, not left to the user to draw or visualize. This design goal is parallel to the previous one, and many of the tools that do not meet the previous goal inherently do not meet this this goal either. • Views for 'key players' (important pieces of information) other than variables should be provided. This design goal is the most frequently violated when considering TV-version debuggers [Pdb, Hoo96]. These tools are designed for the sequential programming domain, and thus, do not offer easy access to information not indigenous to this domain. • A variety of navigation tools should be provided at different levels of granularity. Instead of locking the focus on the sequential code, or the code of a number of sequential processes in parallel, the user should be able to change the level at which the debugging takes place. Many replay tools/debuggers [XWXS96, TSS96, KV97, Arv92, CFR95] focus on one level, namely the sequential code, making it virtually impossible to change the focus during the debugging session to, for example, the protocol, or the messages. Thus, we propose a multi level tool whose modularity (levels) closely follows the error classi-fication mentioned in Section 2.1.1 and the above design goals. (See Figure 2.3). The last of the above points can be expanded into the two following design goals for such a tool: 2.7. Tool Development 33 • Conceptual modular izat ion. Depending on which type of error the user is trying to correct, an appropriate tool should be applied. As a result, certain parts of the tool are tailored specifically to finding and correcting errors of a specific type, which reduces the amount of extraneous information reported by the specific debugging task. • Extensibil i ty. The overall debugging tool should allow for easy extension as new tools are implemented and need to be added. 2.7.1 Automation If a certain task in a debugging session can be automated, then the tool should do so. This follows directly from the design goals in the previous section. 
For example, when trying to resolve deadlocks, it is possible to automate the search for a way to change the program to avoid a deadlock; at a higher level it is possible to automate the verification of the protocol of the system at runtime. Protocols can be specified in process algebras such as CSP [Hoa78, Ros93, Ros94], and verified using tools like FDR [For] from Formal Systems. However, in order to use these tools, the user must have a strong background in theory as the specification of a protocol is a complicated task. For CSP models, the protocol is checked for deadlocks, livelocks and fairness constraints. Even if the programmer uses a tool like FDR to check the protocol, the potential for errors is sti l l present. The protocol specification is not an implementation of the protocol. When the program is developed, the protocol must be implemented as wel l . Problems can arise when the implementation of the protocol does not match the specification. This problem is a specialization of a problem known in software engineering: guaranteeing that the implementation of a system adheres to the specification. We propose a protocol testing module, where the specification is much easier to write, and where the system simply checks all messages against this specification of the communication protocol. That is, the protocol is not verified but al l the messages in the system are checked against the specification. Many problems in the sequential domain are NP-ha rd or undecidable, hence the need for heuristics. The (debugging) problems considered in the parallel domain are no easier. We believe that by focusing on a particular type of error and developing heuristics directed towards this error type, it might be possible to raise the limit of what can be computed, and even automated. This means that by narrowing the search space, as shown in Figure 2.4, we can increase the size of the set of problems that can be solved (or semi-solved). One such example is the deadlock correction algorithm, described in Chapter 6. 2.8. Tool Support for Parallel Program Development 34 2.8 Tool Support for Parallel Program Development In previous sections we described the parallel programming domain, showed some of the numerous places where errors can occur, and described some of the errors. This discussion shows the importance of good tools for programmers working within the parallel programming domain. In this subsection we briefly describe some of the problems that exist with tools for program-ming, debugging or development. A parallel programming environment is an obvious tool to use when developing parallel pro-grams. In addition, these environments can greatly reduce the number of errors programmers make. A number of these tools have been developed over the years, and in Section 2.3 some of them are presented. Though many of these tools restrict the user, to avoid certain types of errors, the risk of errors in user code remains. These errors can cause subsequent errors in the generated code as wel l . Even when using tools or programming languages with built-in support for parallel programming, the problem of locating and correcting errors persists. Despite the obvious advantages found in many of these systems, Pancake argues that not many are widely adopted. In fact, it is claimed that "often only the developers of the tools end up using them in the end [Pan93b]". A number of reasons for this paradox is given: Steep learning curve. 
Many of the tools are advanced and offer a wide variety of functionality; they can be quite difficult and time consuming to learn. Difficult abstraction. The abstraction adopted by a tool, for example the way a program is represented, the way communication is specified, and its limitations, can be difficult to understand and familiarize oneself with. Restrictiveness. Many tools are so restrictive that they work against the programmer. One example is a tool that assures that any code created is deadlock free. This apparent advantage has a drawback: programs with dynamic communication can not be expressed using this tool. Conservatism. There tends to be general skepticism towards new tools or languages, especially if they require the user to learn a new language, or a new integrated tool. Given the difficulty with tool adaptation and the inherent conservatism that perpetuates the use of wel l known methods and tools, debugging is stil l unavoidable. Debugging tools can suffer from the same problems as the development environments. So when developing tools for debugging, the above points should be kept in mind as part of the design goal. That is, the tool should be easy to learn and use, and the abstraction adopted by the tool should not restrict one's ability to perform a specific debugging task. 2.8. Tool Support for Parallel Program Development 35 We attempt to avoid the four disadvantages mentioned earlier in the following ways: • By designing the tools to have a simple user interface; that is, without many different windows, menu bars and buttons. • The abstractions in our approach closely follow the natural abstraction found in a parallel program, namely, the three levels proposed. • As with any tool a certain restrictiveness is unavoidable, but by satisfying the design goals mentioned in Section 2.7, we believe that the tool becomes less restrictive and rigid. • General conservativeness is difficult to counter; learning any new tool requires some effort by the user, but we believe that by addressing the previous three subjects we can reduce the amount of inherited conservatism. Whether the user now wants to use the tool or not is a question of personal taste, not so much relying on steep learning curves, restrictiveness, or complicated abstractions. With a good understanding of error types, design goals, and the problems with existing tools, we have formulated a multi level debugging strategy and specified a number of design goals such a tool must satisfy. In the next chapter we describe the design of Mill ipede, a prototype multilevel debugging tool. Chapter 3 Millipede - A Prototype Multilevel Parallel Debugger "The point I have been patiently trying to make," Godwin said impatiently, " is that you expect far too much of a first sentence. Think of it as analogous to a good country breakfast: what we want is something simple, but nourishing to the imagination. Hold the philosophy, hold the adjectives, just give us a plain subject and perhaps a wholesome, nonfattening adverb or two." - Godwin to Danny Deck, Some Can Whistle In this chapter we present Millipede, a prototype multi level debugger designed in accordance with the ideas and methodologies presented in the previous chapters. We show how it is possible to write debugging and analysis tools specifically tailored to the different levels mentioned earlier. These tools are modules that are incorporated into Millipede. 
However, many of these can be executed outside of Millipede, given the information about message history and the communication protocol extracted by Millipede. 3.1 Design Criteria An aim of this dissertation is to show that debugging can be decomposed into several tools, each tailored to a specific error type, thus working on different levels of the program structure. These levels are sequential code, message passing, and communication protocol. As argued in Section 2.4, we propose a bottom-up technique referred to as 'mult i level debugging' (illustrated in Figure 2.4) and develop tools to support debugging according to this methodology. 36 3.2. The legs of Millipede 37 A number of such tools are useful for locating and correcting a number of different errors; this is illustrated by the overlapping error types within the specialized debugging tools in Figure 2.4. On the other hand, even though this approach might prove to be useful, situations could arise where such a strategy is not applicable. This could happen if an error occurs that is not covered by any of the tools. Given the problems with existing tools (see Section 2.3.7), as wel l as the apparent lack of tool usage, we formulated a multilevel debugging strategy. Using this as the foundation for the Millipede prototype, we can summarize a number of important design goals. 1. Access to source code debugging is vital as errors may be located in the sequential code. It should be possible to apply the user's favourite sequential debugging tool to debug the sequential part of the parallel program. 2. Access to delivered messages, as wel l as messages sti l l in the message queues. Not only do many of the modules make extensive use of this information, but it might also be useful for new modules that perform other forms of analyses on the unmatched (unreceived) messages. 3. Automation of complex parts of the debugging process. This technique, combined with relation-on-demand computations, reduces information overload. 4. Though not a direct design goal, extensibility is another important issue in the implementa-tion of a tool like Millipede. Extensibility of the message passing calls should be developed, thus allowing the user to add new functionality to existing message passing calls. This could become necessary when other modules are developed by other users of Millipede. In Chapter 6, we show how extensibility plays an important role in a tool like Millipede: The analysis described in Chapter 6 relies on the ability to extract information about messages. This information can easily be extracted from the runtime system's internal relations and tables, which is further described in Chapter 5. Using these relations and tables simplifies the implementation of other modules in Millipede. 3.2 The legs of Millipede The Millipede prototype debugging system is written for the PVM message passing system, and consists of the following main parts: • A core system built into the communication system consisting of wrapper functions for al l the communication calls. These new functions (_PVM_XXX) are added to the original com-munication library. They execute the Millipede debugging code and then call the original 3.2. The legs of Millipede 38 message passing functions (pvm_xxx). Figure 3.2 shows a few examples of these new func-tions. Functionality, such as writing and reading log files, and logging messages and protocol information, is implemented in these new functions. 
• A runtime system that consists of several separately executing processes, which allows the user to interact with the debugging system.

• A number of analysis tools/modules, which are described in greater detail in the following chapters. These tools are typically invoked by the runtime system as a result of a query or a command sent to the system by the user.

The current implementation of Millipede supports debugging parallel message passing programs that use PVM. Millipede uses PVM to communicate internally as well. However, porting Millipede to MPI, so that it also works for MPI programs, is straightforward.

3.2.1 Overview

As stated, Millipede is a prototype implementation of a tool that utilizes a multilevel debugging strategy. This implies that the implementation and design closely follow the layered approach proposed in the methodology. Figure 3.1 shows a graphical representation of the Millipede debugging system and how it interfaces with the application being debugged and the message passing system.

The parallel application has the message passing library linked into its executable. The current version of Millipede uses PVM. The Millipede Core System is a re-implementation of all the functions in the PVM library; that is, when a message passing function is called from the application, the Millipede version of that function is called. Any information that must be written to or read from log files is handled here, and the original PVM functions are called. In general, code is wrapped around the original call; some is executed before the call and some after. This code can be thought of as a logging aspect (as in aspect-oriented programming) [Asp03], that is, it is executed before and after the original message passing code. Depending on whether log files are written or read in a subsequent debugging session, different parts of this code are executed.

The core system communicates with the runtime system, informing it about messages sent and received. The information about the messages is kept in relations in the runtime system. The various modules and analysis tools are separate functions/processes that can obtain information from the runtime system about the messages and the protocol. These modules can then perform the analyses or tasks and send the results back to the runtime system.

Figure 3.1: The implementation of Millipede. The gray arrows represent communication within Millipede, and the black arrows represent communication between applications. The Application/Millipede Runtime/PVM boxes represent one process.

3.2.2 Implementation

In the following subsections we briefly describe the implementation of the wrapper functions and the runtime system. Details about the implementation of the modules can be found in the respective chapters.

Wrapper Functions

By redefining the PVM functions, as shown in Figure 3.2, the C compiler substitutes all PVM calls in the user code with calls to the equivalent underscore functions (e.g., _PVM_pkint instead of pvm_pkint). These functions then perform the Millipede debugging code (informing the runtime system about changes, prompting for input, printing output, writing log files, etc.). The implementation of these functions is linked into the original PVM library (libpvm3.a).
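As an illustration, the sketch below shows a plausible shape for one such underscore function. The logging helpers and the exact parameter list are assumptions made for this sketch; they are not Millipede's actual internals.

    /* A plausible shape for a Millipede-style wrapper around pvm_send().
     * The log_event() and log_result() helpers are assumptions made for
     * this sketch; they stand in for whatever bookkeeping the real
     * wrapper performs. */
    #include <pvm3.h>

    void log_event(const char *op, int tid, int msgtag, const char *file, int line);
    void log_result(const char *op, int result);

    int _PVM_send(int tid, int msgtag, const char *file, int line)
    {
        int result;

        /* code executed before the original call: record where in the
         * source the call was made and what it is about to do */
        log_event("send", tid, msgtag, file, line);

        result = pvm_send(tid, msgtag);   /* the original PVM call */

        /* code executed after the original call: record the outcome so a
         * later replay can reproduce it */
        log_result("send", result);
        return result;
    }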
When a PVM program is compiled, the redefinition is only included if the MILLIPEDE flag is set during compilation. If this flag is not set, the program executes like a normal PVM program; if the flag is set, the Millipede version of the functions is executed. This way of switching between normal and debugger execution is easy to manage, and does not require any rewriting of the program, just recompilation and re-linking. Even if the program is compiled with the -DMILLIPEDE option, the user can choose a normal execution by setting an environment variable.

    #define pvm_initsend(X)    _PVM_initsend(X, __FILE__, __LINE__)
    #define pvm_recv(X,Y)      _PVM_recv(X, Y, __FILE__, __LINE__)
    #define pvm_upkint(X,Y,Z)  _PVM_upkint(#X, X, Y, Z, __FILE__, __LINE__)
    #define pvm_pkint(X,Y,Z)   _PVM_pkint(#X, X, Y, Z, __FILE__, __LINE__)

Figure 3.2: Examples of redefined PVM functions.

The Runtime System

The core of the runtime system consists of the following three main parts:

1. A centralized message number administrator process is responsible for assigning unique numbers to all messages. These are not timestamps as defined by Lamport in [Lam78], but rather a unique marker for each message in the system. The Lamport timestamps impose a partial ordering of the messages using a happens-before relation, whereas the message numbers are merely used to identify messages. This is necessary for matching the sending process of a message with the receiving process. It also makes it possible for the user to easily distinguish messages when working with the Message Debugging Module.

2. A number of status windows where the system reports information about events and additional information requested by the user.

3. A driver/interface process in which the user interacts with the runtime system. It also maintains information about the number of running processes in order to report termination.

When a parallel program is running and log files are generated, all the information linking log files with program files is collected in a project file. This file contains all the information required for a module in Millipede to locate the needed log files. Refer to Appendix A for an example of a project file.

Within the Millipede runtime system each process has a message queue. In this queue all the information about the messages sent or received by the process is maintained. The modules described in the remaining chapters make heavy use of this information. Examples of how the modules use this information can be found in Chapters 4, 5, 6, 7, and 8.

3.3 The Sound of Little Legs Running

In accordance with the multilevel debugging strategy, Millipede has the following levels:

The Sequential Level - The module at this level facilitates the application of sequential debuggers, as well as other sequential analyses or profiling tools, to one single process extracted from the parallel application. Chapter 4 describes the Sequential Debugging Module in greater detail.

The Message Level - Here, we are concerned with messages sent between two processes. The current tool in Millipede at this level allows for the interactive inspection and debugging of messages, and supplies the user with a query language (MQL) for querying messages. Examples and an in-depth discussion of the Message Debugging Module can be found in Chapter 5.
The Protocol Level - This is the last of the three levels and also the level that provides an overview of the entire application with respect to the protocol. This level contains three modules. • The first module is the Deadlock Detection and Correction Module. This module is an example of using automation to reduce the amount of information presented to the user. When an application deadlocks, this module can be applied. The messages are analyzed and a suggestion for altering the source code to remove the deadlock is provided. More detail of the theory behind this analysis is given in Chapter 6. • The second module is the Protocol Conformance Checking Module. This module allows the user to specify a number of constraints on the protocol and have the runtime system check all messages against these constraints and report any violations. This module can be used to advantage in the program development cycle when implementing the protocol from a possibly verified specification. Examples, along with more detail on the implementation, are provided in Chapter 7. • The last of the three modules at the protocol level is the Buffer Allocation Analysis Module. This module performs an analysis on the message history; a graph based on messages is created and analyzed to determine the number of buffers needed to ensure efficient execution. By efficient we mean an execution that does not have any blocking send calls due to a lack of buffers. The algorithm is described in Chapter 8, along with a more detailed description of the general problem of determining buffer allocation in systems with different buffer allocation schemes. Theoretical results are derived for three problems with four different buffer allocation schemes. Although we present the multi level debugging strategy and the implementation of Millipede as having three distinct levels, a certain amount of overlap is present. For example, when using the Sequential Debugging Module to extract and execute one process from the parallel system, 3.3. The Sound of Little Legs Running 42 if the log file belonging to that process is either corrupt or incorrect, or if the user specifically requests it, the runtime system wil l prompt for valid data values for unpacked data. This feature is conceptually part of the Message Debugging Module, but has proven to be a useful addition in the Sequential Debugging Module as wel l . Millipede currently contains a total of five tools, but it provides a general infrastructure for incorporating more. Millipede uses a simple command line interface, and Appendix D contains screen dumps of the Millipede interface windows. Chapter 4 Sequential Debugging of Parallel Processes "A computer lets you make more mistakes faster than any invention in human history-with the possible exceptions of handguns and tequi la." - Mitch Ratliffe Following the multi level debugging methodology outlined in Chapter 2, errors should be fixed at the lowest levels before moving on to higher levels. The sequential code of each process constitutes the lowest level, so we first consider debugging code at the sequential level. The two main problems we address in this chapter are: • Providing the ability to extract a single process from a parallel system. • Allowing the user to take advantage of existing sequential debugging tools on the sequential code of the parallel program. The sequential debugging module allows the user to extract one process from the parallel system and replay the execution. 
It facilitates using any sequential tool to analyze or debug the process. The messages that the process would have received in the parallel execution are provided by the underlying Millipede runtime system. Extensive research and numerous tools have been developed in the area of sequential debugging. Instead of providing new tools that might be inadequate, the sequential debugging module enables the user to use existing sequential tools specifically tailored to finding and correcting the type of errors that arise in sequential code. 43 4.1. The Sequential Debugging Module 44 4.1 The Sequential Debugging Module To use the sequential debugging module the user must first compile the program and link the Millipede runtime system to her code. The program is then executed in parallel until an error occurs. The runtime system collects al l the messages that were sent throughout the execution and stores them in log files, one for each process of the parallel program. The runtime system writes the total content of the messages along with the return value of the communication cal l . The messages and their content as wel l as the return values of the message passing function, are captured to assure that the sequential re-execution is exactly the same. Handling nondeterminism is a problem for any debugging tool that attempts to locate errors by re-executing the code. In the case of the Millipede tool for executing a single process, we can classify nondeterminism into two types; the nondeterminism within the sequential code that does not affect message history and the nondeterminism that can affect either messaging. Nondeterminism which does not affect the messaging can be handled with the same techniques used for handling nondeterminism within a sequential program. For example in case of random numbers the seed to random number generator can be fixed. As long as it does not affect the message history the message history provided by Millipede wil l produce the same execution and hence exhibit the same error. It possible however for the nondeterminism in the program to affect the messaging. It may affect the ordering of messages or the even the contents of messages to be received. If this cannot be fixed, then it can result in an invalid replay of the message history and the tool itself may fai l . As in the previous case, it can be avoided by removing the nondeterminism to ensure the re-execution runs with the same message history. One particularly difficult type of error occurs when the nondeterminism is due to some type of timing measurements. Although messages are replayed in the correct order, it is impossible to ensure identical timings. In conclusion, what distinguishes these two types of nondeterminism is whether the replay is unsuccessful in exhibiting the error, or whether the tool itself fails because of changes to the expected message history. The message passing, and in particular, the order of the message receipt, is fixed. Therefore, it is not necessary to timestamp messages. It is necessary to capture the return values of all the message passing function calls, as the program behaviour may depend on these values. For example, nonblocking receive calls can return without receiving any data, and in order to replay the execution when the process is executed sequentially, these values must be stored as wel l . Consider the code in Figure 4.1. If no message is ready in the message system the process wil l execute the else part of the if-statement. 
This can happen a number of times, and each time the state of the program might change. In order to replay this behaviour when the program is re-executed with messages supplied from the log files, the value of the arrived variable must be stored each time pvm_nrecv is called, and then returned to the process. This ensures an execution that matches the original execution.

arrived = false;
while (!arrived) {
    arrived = pvm_nrecv(tid, msgtag);
    if (arrived)
        pvm_upkint(array, 10, 1);
    else
        // Do something else
}

Figure 4.1: If the arrived variable is not logged there is no way to tell how many times the loop is executed, thus perhaps putting the process in a state that does not match the state it was in when it logged the messages.

Figure 4.2 illustrates how an application's executable contains the original PVM functions when linked with the PVM runtime system only, and how it contains both the PVM and the Millipede runtime systems when the MILLIPEDE flag is set at compile time. The Millipede runtime system is compiled into the PVM library, but the replacement of the message passing calls only occurs when the MILLIPEDE flag is specified. Since the Millipede runtime system is not compiled with the MILLIPEDE flag, the calls to the original PVM functions are made from the _PVM_send function. Appendix A shows a complete example from compiling an application to extracting one process using Millipede, and debugging it sequentially using Gdb.

Application:
    pvm_send(...);

(Application)

Application:
    pvm_send(...);
PVM:
    int pvm_send(...) {
        // PVM code
    }

(Linked with PVM: -lpvm3)

Application:
    _PVM_send(...);
Millipede:
    int _PVM_send(...) {
        // Millipede code
        r = pvm_send(...);
        // Millipede code
        return r;
    }
PVM:
    int pvm_send(...) {
        // PVM code
    }

(Compiled with Millipede: -DMILLIPEDE -lpvm3)

Figure 4.2: An example of the different parts of an application; first, the application; second, the application linked with the PVM library, and third, the application and communication library with the Millipede runtime system added.

4.2 Limitations

One of the problems with this straightforward logging approach is that log files can potentially be very large. One solution is to try to reproduce the error with a smaller data set; however, this approach is not always feasible. A different approach is to use checkpointing. The idea is to save an image of the running process and then purge the log files. If the process needs to be debugged, it can be restarted from the checkpoint and the execution replayed from there using the messages in the log files. The problem with checkpointing is restoring state; the kernel state associated with the process must be consistent after restoration.

There has been work done on tools that allow for the checkpointing of the sequential state of a process. A lightweight library called 'save_world' was developed by Bennet Yee [Yee96] from the University of California in San Diego. Another well known system is Condor [LLM88], which also supports process migration. If such a checkpoint library were incorporated into Millipede, it would be possible to save and resume an execution at any given time, and at the same time purge log files and reduce their size considerably.
Two important benefits arise from checkpointing: first, a reduction in the disk space needed to save log files, as these are purged at checkpoints, and second, the time it takes to restart the process and re-execute it to the state of failure is reduced by using the most recent checkpoint as a starting point for debugging.

One remaining problem occurs if the cause and effect chasm spans the checkpoint. That is, if the source of the error is before a checkpoint, and if the manifestation is after, then all evidence of the source of the error will have been removed because the process was checkpointed and the log files purged. This could be solved by storing the purged log files and all checkpoint images on secondary storage until the debugging process has finished, if enough space is available.

The sequential debugging module is tailored to extract the messages of a parallel system and support the debugging of one sequential process based on the log files captured while the parallel system executes. However, there are classes of errors that are difficult to find using this approach. For example, if an error does not occur every time the program is run, and if log files are not generated during the execution that encounters the error, the Sequential Debugging Module is of no use.

Millipede is not thread safe; if it is used with a multithreaded program, there may be problems with the communication library and nondeterminism due to thread scheduling. In general, debugging multithreaded parallel programs introduces yet another type of concurrency that further complicates the debugging process. The scheduling of threads can result in nondeterminism that cannot be controlled by the user, and checkpointing with threads could potentially become a problem as well. Similarly, the MPI standard is not defined as thread safe; the two major public domain implementations of MPI, LAM and MPICH, are not thread safe.

4.3 Examples

The following examples illustrate the types of errors that can be found and corrected by using Millipede. As a basis for these tests we used a master/slave implementation of an iterative hyperbolic differential equation solver [FJL+88], which we seeded with two different errors. The first error is a division by zero caused by a variable value of zero propagating through a message. The second error is an out-of-bound indexing error; the array indexed is not big enough. The results show that the errors that occurred in the parallel execution are faithfully reproduced in the sequential execution of the process containing the error. Figures 4.3 and 4.5 show the relevant parts of the slave program containing the errors.

4.3.1 Division by Zero Error

Consider the code shown in Figure 4.3. The assignment causes a division by zero if the variable nproc equals zero.

pvm_upkint(&nproc, 1, 1);
pvm_upkint(tids, nproc, 1);
pvm_upkint(&n, 1, 1);
e = n % nproc;

Figure 4.3: Sequential code with divide by zero.

When the program is executed in parallel, the slave process executing the illegal division encounters an arithmetic exception and terminates. In this case, Gdb is used in combination with the sequential code and the messages that were extracted by Millipede. As shown in Figure 4.4, the error is easily located using Gdb. In contrast, to find this error using N versions of Gdb online, the programmer would have to single step each process to a point where the communication has occurred and the division by zero executed.
Tools, such as DIWIDE, that allow macro stepping, or tools, such as p2d2 or TotalView, that allow control of all debuggers at the same time, can be used in a similar way to locate the error. Such tools are good for master/slave or processor farm computations, as a number of similar processes are controlled as a unit, but this is not as easy for a pipeline computation or a parallel program which is functionally decomposed; the different processes execute different code, thus making it difficult to control all the processes in a collective manner.

(gdb) step
45        e = n % nproc;
Program received signal SIGFPE, Arithmetic exception
0xef4a86a8 in 0
(gdb) print n
4
(gdb) print nproc
0

Figure 4.4: Using Gdb to locate the error in the sequential code of a process from a parallel program.

4.3.2 Memory Errors

If a process in a parallel system contains a memory leak or memory error, often, especially in C, the most likely manifestation is that the process terminates because of an illegal memory reference. When executed in parallel, even if one process terminates, the others continue to execute until they crash, deadlock, or finish incorrectly. In sequential programming there are tools, such as Purify [Pur], that are effective for discovering memory reference errors. Purify links a runtime library to the process that tracks memory references and reports any illegal ones. Unfortunately, no parallel version of Purify exists. The sequential version does not easily lend itself to finding memory leaks in a parallel program. The only way to apply Purify to a parallel program is to apply one instance to each process.

By using Millipede to extract the process that crashed, along with the corresponding log file, it becomes possible to use Purify to find the offending code that contained the illegal memory reference. If a parallel system consists of a number of instances of the same program, for example a master/slave or a processor farm application, it suffices to apply Purify to a single instance of the slave processes in order to catch memory errors. This approach reduces the amount of information the programmer needs to consider during the debugging process, and may reduce the time needed to complete the debugging task.

Figure 4.5 shows a code fragment that indexes an array out of bounds. The x array is too small, and at index nodes+1 an index out of bound error occurs. This error can result in two different program behaviours: an incorrect result, or a segmentation fault where the process terminates abnormally. This error is easily detected by using a tool like Purify. Note, this error was introduced into the program when the parallel version was developed. Since the data must be distributed across a number of processes, this error might not have been present in the sequential version.

x = calloc(nodes, sizeof(double));
for (i = 1; i <= nodes; i++)
    x[i] = (1 * (start + i - 1)) / (n - 1);

Figure 4.5: Source code with a memory error: the x array is one element too short.

Figure 4.6 shows the output from running Purify on the extracted process. The output clearly marks the problematic array and specifies that the problem is an attempt to write past the end of the array. In addition, the line in which the array is allocated is printed out. It is now a simple problem to correct the error by either allocating a bigger array or changing the condition.
ABW: Array bounds write
This is occurring while in:
    main [Wave_slave.c:57]
      for (i = 1; i <= nodes; i++)
  ==>     x[i] = (1 * (start + i - 1)) / (n - 1);
Writing 8 bytes to 0xdc630 in the heap.
Address is 1 byte past end of a malloc'd block at 0xdc5a8 of 136 bytes.
This block was allocated from:
    malloc [rtlib.o]
    calloc [rtlib.o]
    main [Wave_slave.c:50]
  ==>     x = calloc(nodes, sizeof(double));

Figure 4.6: Using Purify in combination with Millipede to locate memory errors.

The advantage of having the extracted process is the ability to use tools like Purify that have not been ported to the parallel domain. These tools can be more effectively used with one process rather than trying to coordinate the use of N versions of them running at the same time. It is possible to extract one process and its corresponding log file, and then debug it as a sequential program.

4.4 Implementation Details for the Sequential Debugging Module

In this section we briefly explain some of the details of the implementation endemic to the Sequential Debugging Module. The Sequential Debugging Module consists of two parts:

1. The collection phase that intercepts the values and names of variables being packed (calls to the pvm_pkxxx functions) and the return values of all the message passing functions.

2. The replay phase that reads the log files written in the collection phase, when pvm_upkxxx functions are called.

The collection part is straightforward. When packing functions are called, the names and the values of the variables are written to the corresponding log file. All message passing functions also write their return values to the log file. During replay, the log files are read; that is, instead of performing a call to the PVM library, the log files are read, and the values are returned to the caller as if a message had arrived from the network. The majority of the code for this part of the tool is for checking that the values read from the log files are consistent with the message passing calls in the code. This is done by comparing the names of the variables in the pvm_upkxxx call with the names of the variables in the log file. If a mismatch between variable names or types is found, Millipede prompts the user for a value for those variables that were unsuccessfully unpacked.
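To illustrate the two phases, the following sketch shows what wrappers for pvm_pkint and pvm_upkint could look like. It is an assumption made for illustration rather than the actual Millipede source: the variables millipede_log and millipede_replay are invented names, only the first element of an array is logged, and the routing of the logged data from the sender's log to the receiving process' log file is omitted.

#include <stdio.h>
#include <string.h>
#include <pvm3.h>

static FILE *millipede_log;      /* per-process log file (invented name)   */
static int   millipede_replay;   /* 0 = collection phase, 1 = replay phase */

int _PVM_pkint(const char *name, int *p, int n, int stride,
               const char *file, int line)
{
    if (!millipede_replay)       /* collection: record name and value      */
        fprintf(millipede_log, "pkint %s:%d %s %d\n", file, line, name, *p);
    return pvm_pkint(p, n, stride);
}

int _PVM_upkint(const char *name, int *p, int n, int stride,
                const char *file, int line)
{
    if (!millipede_replay)       /* collection: behave like plain PVM      */
        return pvm_upkint(p, n, stride);

    /* replay: the value comes from the log file, not from the network,
       and the logged variable name is checked against the unpack call     */
    char logged[64];
    if (fscanf(millipede_log, "%*s %*s %63s %d", logged, p) != 2
        || strcmp(logged, name) != 0) {
        printf("%s:%d: no logged value for %s, please enter one: ",
               file, line, name);
        scanf("%d", p);
    }
    return 0;
}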
4.5 Summary

We have shown how it is possible to extract one process from a parallel system and debug it sequentially. After extraction, it can execute as a standalone UNIX process and be debugged using sequential debugging tools. This allows the user to correct errors in the sequential code of a parallel program as if it were a purely sequential program.

Debugging message content, and obtaining an overview of messages and message queues in the message passing system, are not supported by the Sequential Debugging Module. These problems belong to the message level of Millipede, and in the following chapters we introduce tools that provide the user with debugging capabilities at this level.

Chapter 5

Message Debugging

If we knew what it was we were doing, it would not be called research, would it? - Albert Einstein

Once errors are corrected in the sequential code, the focus turns to process interaction. This interaction happens through message passing. The purpose of the Message Debugging Module is to support the location and correction of errors which cause incorrect messaging and message content. We provide two tools to help in correcting the messaging:

• The first allows the user to interactively inspect and debug messages as the program runs. That is, as messages are delivered to the process by the message passing system, the user can inspect and change parts or all of the message without the need for a debugger, such as Gdb.

• The second is a simple query language to aid in querying the message history maintained by the runtime system.

5.1 Interactive Message Debugging

In the Sequential Debugging Module the focus is on the key players of sequential programming, for example, the variables and the control flow. When moving up the hierarchy, the next level is the one concerned with messages passed between two processes. As illustrated in Figure 4.3, an otherwise correct piece of sequential code can produce faulty results or even terminate with an error due to the cause/effect chasm widening through message passing; a wrong value can be sent from one process to another and cause an error.

The idea of interactive message debugging, or inspection, is to allow the user to choose one or more processes, and, while the parallel program executes, give the user separate views/windows for each of these processes. These views show the messages that are sent to the process and allow these to be changed. This is particularly useful if the user notices a message that contains a wrong value, but does not want to terminate the execution to correct the problem. The value is simply changed before it is delivered to the process and execution can continue uninterrupted.

At first, this might seem like a typical application of the N-version strategy. However, the difference is that in Millipede, the user chooses which processes he wishes to interact with at any time during the debugging process. These views can be turned on and off at will. This corresponds to the idea of computing relations on demand, and also attempts to reduce information overload. Naturally, the possibility of information overloading exists, especially if the user turns on too many views at once. However, by starting out by displaying a small amount of information in a few windows (or even no windows to begin with), and allowing the user to extend the view by opening new windows and closing down those he does not need anymore, it differs from the typical use of an N-version tool. One difference between the tool here and the N-version type of tools is that here the user adds views as they are needed rather than removing them from a complete set, which allows him to focus on the activities at hand.

If a receiving process attempts to unpack more data than the sender sent, Millipede will issue a warning with the line and the variable name that cannot be unpacked. In addition, in order not to terminate the debugging session, Millipede will prompt for a value for the variable. Figure 5.1 shows an example where the message contained too little data.

Figure 5.1 represents a window for one process; for each process observed (i.e., where the user has decided to perform interactive message debugging), a separate window appears. Figure D.2 in Appendix D shows a screen shot of an actual debugging session.
Receiving New Message:
Line  4: pvm_recv(4,0)              <ok>
Line  9: pvm_upkint(&a,1,1) = [2]   <ok>
Line 12: pvm_upkint(&b,1,1) = [?]   <error>
No value available for int: b. Please specify a value.
int: b = 78

Figure 5.1: No value for b was sent or read from the log files, so Millipede prompts the user for a value. Debugging can continue once a value has been specified.

This technique can also be utilized in conjunction with the sequential debugging module for testing purposes: if a replay file does not exist, the runtime system will prompt the user to supply values for all the incoming messages. This allows for the testing of a single process without running, or even having written, the code for the other processes of the parallel system. The user simply runs the program with the Millipede runtime system turned on, specifies no log file and debugs as usual, with the exception of having to specify values for unpacked variables for all incoming messages and return values for calls to procedures such as pvm_nrecv. It also provides a technique for interactively testing the program with extreme or incorrect values. In addition, this technique is particularly useful when prototyping a program, and it matches the idea of iterative program development proposed by Araki et al. [AFC91]. This is especially good for testing master/slave, processor farm and pipeline code.

Figure 5.2 shows an example of how to use the interactive message debugging module. Note, Millipede not only shows the unpacked values, but also the names of the variables that they are unpacked into, as well as the line number of the PVM communication call. Providing the user with information about the line numbers and variable names addresses the issue of mapping the display back to the source code.

An interesting problem with interactive message debugging is that it can exhibit the opposite of information overload; when a receive call expects a large message, it requires the user to type an unreasonable amount of data. The user has the option of replying f, and the system prompts for the name of a file containing the data to be used.

Receiving New Message:
Line 78: pvm_recv(-1,0)                                  <ok>
Line 81: pvm_upkint(&nproc,1,1) = [2]                    <ok>
Do you want to change this [y/n] ? y
int: nproc = 2. New value = 3
Line 82: pvm_upkint(tids,3,1) = [262151,262152,262153]   <ok>
Do you want to change this [y/n/f] ? f
Filename: tids.txt
Read: [262150, 262151, 262152]                           <ok>

Figure 5.2: A message with tag 0 is received from anyone (-1 is a wild card in the receive call), and two unpacking instructions are executed: one that unpacks one integer and one that unpacks three. If the messages are supplied from a replay file and an error or some type of inconsistency appears, the <ok> will be replaced with an error message and the user will be asked to provide a value that is compatible with the destination variables for the unpacking call.

5.2 Message Queries

We now focus on the second part of the Message Debugging Module: the message querying tool along with the Millipede Query Language (MQL), which allows the user to write SQL-like queries against an internal database of messages maintained by the Millipede runtime system.

One problem with existing tools is that only a fixed number of views are provided. The views, or queries, provided by tools are typically those that the developer of the tool considers useful. Since the developer and the user might have different foci or views, this is one of the contributing factors as to why tools are primarily used only by their developers [Pan93b].
Three of the design goals for the Millipede debugging system mentioned earlier are as follows:

• Displayable states should be displayed on request.
• Computable relations should be computed on request.
• Information overloading should be avoided.

Not only can state and computable relations be displayed on request, but queries can reduce the complexity of information gathering from the message history; the alternative would require filtering through a large quantity of data by hand. The problem with a fixed number of views is easily solved for the Message Debugging Module by giving the user a higher degree of freedom within the tool, that is, allowing users to define their own queries. If the message history is viewed as a large database, it is natural to develop an SQL-like language to facilitate queries.

5.3 User Defined Queries

The Millipede runtime system organizes the messages into four relations: Senders and Receivers, as shown in Tables 5.1 and 5.2, which contain the overall information about the messages, and SentMessages and ReceivedMessages, as shown in Tables 5.3 and 5.4, which contain the details about the messages.

MsgNo  Size  STID    SLine  SFile    STag
7      154   262152  118    slave.c  5

Table 5.1: The Senders relation.

MsgNo  RTID    RLine  RFile     RTag
7      262150  86     master.c  5

Table 5.2: The Receivers relation.

MsgNo  No  SLine  SType   SVarName  SCount  SValue
6      1   107    int     me        1
6      2   108    int     nodes     1
6      3   114    double  y[1]

Table 5.3: The SentMessages relation.

MsgNo  No  RLine  RType   RVarName       RCount  RValue
6      1   87     int     who            1
6      2   88     int     result_length  1
6      3   94     double  y[index]

Table 5.4: The ReceivedMessages relation.

Queries are implemented using the four relations and the query language. To illustrate the use of MQL, we implement two useful queries: locate, which locates messages sent between two program lines, and match, which matches the packing functions with the corresponding unpacking functions of a message. Figure C.1 in Appendix C shows the grammar for MQL.

The match Query

Since many variable values can be packed into one message, one of the easiest errors to make is to unpack the message in the wrong order or into the wrong variables. This means that values can potentially be swapped in the variables they are unpacked into. By querying the message passing system it is easy to verify in which order the values were packed, and in which order the values were unpacked. In addition, the names of the variables on both the sender and receiver side, and the line numbers, are shown. This query is performed by executing the match query.

Figure 5.3 shows the wanted output of the match query on message number 6. The upper half of the figure shows the packing routines and their line numbers, and the lower half shows the unpacking routines and their line numbers.

(0)MILLIPEDE> match(6)
Message number: 6
        - Sender -
File:   slave.c
Line 107    pvm_pkint(&me, 1, 1);
Line 108    pvm_pkint(&nodes, 1, 1);
Line 114    pvm_pkdouble(&y[1], nodes, 1);
Line 118    pvm_send(262150, 5);
        - Receiver -
File:   master.c
Line 86     pvm_recv(262151, 5);
Line 87     pvm_upkint(&who, 1, 1);
Line 88     pvm_upkint(&result_length, 1, 1);
Line 94     pvm_upkdouble(&y[index], result_length, 1);

Figure 5.3: By executing the match(6) query, Millipede will query the message queues for all packing and unpacking commands for message number 6.

The MQL code for the match query is shown in Figure 5.4.

define match(mno) as
begin
  print("Message number: %\n", mno);
  let SenderInfo be
    project select from Senders where (MsgNo == mno)
    over (SFile, STID, STag);
  let ReceiverInfo be
    project select from Receivers where (MsgNo == mno)
    over (RFile, RTID, RTag);
  print("\t - Sender -\n");
  display SenderInfo
    using "File:\t%\nLine %\tpvm_send(%,%);\n"
  § display project sort select from SentMessages
      where (MsgNo == mno) by (No)
    over (SLine, SType, SVarName, SCount)
    using "Line %\tpvm_pk%(%,%,1);\n";
  print("\t - Receiver -\n");
  display ReceiverInfo
    using "File:\t%\nLine %\tpvm_recv(%,%);\n"
  § display project sort select from ReceivedMessages
      where (MsgNo == mno) by (No)
    over (RLine, RType, RVarName, RCount)
    using "Line %\tpvm_upk%(%,%,1);\n";
  print("\t -\n");
end

Figure 5.4: The implementation of the match query in MQL.
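The relations in Tables 5.1 to 5.4 are filled in by the runtime system as the wrapped PVM calls execute. As a rough illustration (the layout below is an assumption made for illustration, not the actual Millipede data structure), a tuple of the Senders relation could be represented like this, with one row appended every time a wrapped pvm_send executes:

typedef struct sender_tuple {
    int    msg_no;               /* MsgNo: unique message number          */
    int    size;                 /* Size:  size of the message in bytes   */
    int    stid;                 /* STID:  task id of the sending process */
    int    sline;                /* SLine: line number of the send call   */
    char   sfile[64];            /* SFile: source file of the send call   */
    int    stag;                 /* STag:  message tag                    */
    struct sender_tuple *next;   /* the relation kept as a linked list    */
} sender_tuple;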
The locate Query

If the user believes that a message has been delivered to the wrong receiver, it is useful to search the message database for messages that match a specific sender and receiver line number. Such a query, locate, can easily be implemented using MQL. Figure 5.5 shows the expected output from the query locate(118,86); that is, it lists the messages sent from line 118 in some process and received in line 86 in a different process (in practice, the sender and the receiver could be the same process).

The Senders and the Receivers relations have the field MsgNo in common. By joining these two relations and selecting the tuples where the sender and the receiver lines match the values 118 and 86, we obtain a new relation containing the result. Figure 5.6 shows the implementation of the locate query.

(0)MILLIPEDE> locate(118,86)
Messages sent from line 118 to 86:

         Sender                          Receiver
No  Tid     File     Line  Tag   Tid     File      Line  Tag   Size
6   262229  slave.c  118   5     262228  master.c  86    5     152
7   262230  slave.c  118   5     262228  master.c  86    5     144

Figure 5.5: The locate(118,86) query queries the message passing system for messages sent from line 118 and received by a (different) process at line 86.

define locate(sl, rl) as
begin
  print("Messages sent from line % to %:", sl, rl);
  print("\n\n\t Sender\t\tReceiver\n");
  print(" ");
  print(" \n");
  print(" No\tTid \tFile \tLine \tTag \tTid \t");
  print("File \tLine \tTag \tSize \n");
  print(" ");
  print(" \n");
  let Msgs be join Senders with Receivers;
  display project
    select from
      select from Msgs where (SLine == sl)
    where (RLine == rl)
  over (MsgNo, STID, SFile, SLine, STag, RTID, RFile, RLine, RTag, Size)
  using "%\t%\t%\t%\t%\t%\t%\t%\t%\t%\n";
end

Figure 5.6: The query code for computing the locate query.

The match and the locate queries are two examples of the use of MQL. The user can write his own queries, ranging from very simple to arbitrarily complicated. In addition to the four mentioned relations, there are a few more bookkeeping relations. One such relation is the TIDS relation, which maps process IDs to message queue numbers. Its purpose is to allow the user to develop queries that contain more information and have better formatted output.

5.4 Built-in Message Queries

To keep MQL small, certain things such as arithmetic, tuple insertion and complex formatting have been excluded. Sometimes certain queries can be cumbersome to implement if formatting the output is important.
The status Query

Often it is useful to query the message passing system for its status, that is, obtain a list of the messages that have been delivered, the ones that are still in the system, and any outstanding receive calls. An outstanding receive call is a process that is blocked in a call to pvm_recv(), but has not yet received any data. Such a listing can be obtained by issuing the status query.

Figure 5.7 shows an example of using the status query. Millipede matches each send to a receive, and shows the file names and the line numbers of the message passing calls. For a pvm_send the first argument is the ID of the receiver, and the second argument is the message tag. For a pvm_recv the first argument is the ID of the sender, and the second argument is the message tag. Both the sender and the message tag in a receive call can be specified as -1, a wild card value, which matches any sender or message tag (a small sketch of this matching rule is given after Table 5.5).

(0)MILLIPEDE> status()
Msg No.  Command                 Line  File
1        pvm_bcast(262152, 0)      78  master.c
    <->  pvm_recv(262150, 0)       22  slave.c
1        pvm_bcast(262151, 0)      78  master.c
    <->  pvm_recv(262151, 0)       22  slave.c
2        pvm_send(262152, 11)      75  slave.c
    <->  pvm_recv(262151, 11)      89  slave.c
3        pvm_send(262151, 22)      80  slave.c
    <->  pvm_recv(262152, 22)      85  slave.c
4        pvm_send(262152, 11)      75  slave.c
    <->  pvm_recv(262151, 11)      89  slave.c
5        pvm_send(262151, 22)      80  slave.c
    <->  pvm_recv(262152, 22)      85  slave.c
6        pvm_send(262150, 5)      118  slave.c
    <->  pvm_recv(262151, 5)       86  master.c
7        pvm_send(262150, 5)      118  slave.c
    <->  pvm_recv(262152, 5)       86  master.c

Figure 5.7: Executing the status query produces a listing of matched and outstanding messages. In this example no messages or receive calls are outstanding; that is, all messages that were sent were received and the message system is "empty". Note, message number 1 occurs twice, implying that the sending process issued a multicast or a broadcast.

This query could, with a little effort, be implemented using MQL, but we believe that the status query is a query that the user might use often, so to provide an easy to read output we have implemented it as a built-in query. The problem with an MQL implementation of this query is trying to list the outstanding receives and the unreceived messages in the same relation as the messages that are already delivered. Since the final relation contains information about both a sender and a receiver for each message, but the outstanding receives do not have a sender part, and the unreceived messages do not have a receiver part yet, implementing the query in MQL is not straightforward, but still possible (possibly with a different structure of the output). Thus, for convenience the status query is built-in.

We presented MQL code for the match and the locate queries in the previous subsection. Table 5.5 shows a list of the currently available built-in queries, and in Appendix D a number of screen shots of actual query sessions are shown.

status   Displays all the messages that are delivered, that are still in the system, and all outstanding receive calls.
locate   Locates all messages sent between two specified line numbers.
match    Matches up packing and unpacking routines for a specific message number.
dump     Displays on a per process basis, in reverse order, all the messages ever sent.

Table 5.5: The built-in queries of the Message Debugging Module of Millipede.
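The matching rule used when pairing sends with outstanding receives, including the wild card value -1 mentioned above, can be written down directly. The sketch below is illustrative only; the function and parameter names are not Millipede's:

/* A receive matches a send when the source and the tag agree, where -1
 * in the receive call acts as a wild card for either field.             */
static int recv_matches_send(int recv_src, int recv_tag,
                             int send_tid, int send_tag)
{
    int src_ok = (recv_src == -1) || (recv_src == send_tid);
    int tag_ok = (recv_tag == -1) || (recv_tag == send_tag);
    return src_ok && tag_ok;
}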
The dump Query The last query listed in Table 5.5 is for convenience, provided as a built-in. The dump query allows users to obtain a complete listing of all messages sent and received as wel l as undelivered. Figure 5.8 shows an example of the dump query. (0)MILLIPEDE> dumpO Queue: 0 Filename: master.c T i d : 262150 Msg.No Send.Tid Recv.Tag Send Tag 6 262151 5 5 7 262152 5 5 Queue: 1 Filename: slav e . c T i d : 262152 Msg.No Send.Tid Recv.Tag Send Tag 4 262151 11 11 2 262151 11 11 1 262150 0 0 Queue: 2 F i1ename: slave . c T i d : 262151 Msg.No Send.Tid Recv.Tag Send Tag 5 262152 22 22 3 262152 22 22 1 262150 0 0 Figure 5.8: The dump query shows all messages, both the ones that have been delivered and the ones that are stil l in the message passing. 5.5 Discussion The idea of making public the information gathered by the runtime system about the messages, and representing them as relations in a database is a different approach to debugging. By allowing the user to compute relations when needed, we fulfill one of the important design goals for a debugging environment, namely computing relations on demand. However, there are certain limitations. Since everything is represented as relations, only queries that use the query language can be performed. However, the internal structure of the 5.6. Summary 61 Millipede runtime system can make use of these relations as wel l . We have seen two examples of built-in queries that use these relations. In particular, one of these is the status command that uses these relations to extract information but presents it in a way that is not easily implemented as a user query. Another important advantage of the relation/MQL part of the Message Debugging Module is that the proficient user should be able to extend Millipede himself by adding built-in queries into the runtime system if necessary. One possibility that would address the above mentioned limitation is to extend MQL to allow the creation of user defined relations and support an explicit tuple insertion, thus making it a fully functional database language. However, this extension requires the ability to do other forms of computations, for example, arithmetic, which in turn would seriously increase the size and complexity of the query language. An interesting technique for further investigation is the idea of message breakpoints. In sequential debugging, breakpoints are often used; by setting breakpoints the user can let the program run until the line containing the breakpoint is reached. This idea is extended to include collective breakpoints, that is, the ability to set a number of breakpoints in multiple processes. This could be further extended by abstracting away the line numbers; a receive statement might be called a number of times, but instead of using the line number as a breakpoint, the message number could be used. This would allow for greater flexibility for controlling the program execution during a debugging session. Combined with collective stepping and collective breakpoints, expressions such as break pO : 186, p2 :m45 could be allowed, p i : 186 being a breakpoint in line 186 of process p i , and p2:m45 being a breakpoint that is activated when message number 45 is delivered to process p2. 5.6 Summary We provided a simple expressive query language that can compute a large number of relations, and we believe that this shows that providing such services in a debugger is not only possible but also very useful. 
In addition, we introduced interactive message debugging which allows the user to inspect and change the message content during execution. In connection with the sequential debugging module, the interactive message debugging allows for unit testing of single parts of a system. This means, that these tools can be used in the development phase as wel l . In this chapter and the previous, we introduced the two lowest levels in the multi level debug-ging hierarchy. In addition, for each level we presented examples to illustrate the usefulness of the tools at these levels. Chapter 6 Deadlock Detection and Correction "Problems cannot be solved at the same level of awareness that created them." - Albert Einstein In the previous chapter we demonstrated a number of techniques that are useful for locating and correcting errors in messages sent between two processes. We now focus on the next—and f ina l - leve l in the multi level debugging hierarchy: the protocol level. At this level the focus shifts to include not only messages, but the communication protocol as wel l . In this chapter, and in Chapters 7 and 8, we present three different tools and techniques for performing debugging at the protocol level. Here, we present a tool for locating deadlocks, and suggest corrective measures to remove them. In Chapter 5 we introduced a number of relations that the programmer could use in conjunction with the Mill ipede Query Language. These relations are the foundation for the analysis presented here and in the following chapter. Millipede extracts protocol information from these relations; in this chapter we describe how such information can be used to perform a deadlock correction analysis. 6.1 Deadlock Detection and Correction Detecting and correcting communication errors in message passing programs is a difficult problem. Even simple communication errors are difficult to debug in a parallel environment with multiple processes exchanging large numbers of messages. Although there are visualization tools [KG96, KV97] to help users visualize the communication patterns of parallel programs, they do not directly support the detection and correction of errors based on the user's source code. In this chapter, we present an algorithm for correcting communication errors using delivered 62 6.2. Description of Problem 63 and undelivered messages. The algorithm is used to suggest corrective measures for removing communication errors introduced by users as typographical errors in message passing systems, such as PVM and MPI. This work focuses on the validity of the algorithm by proving that for a nontrivial number of errors the algorithm can suggest changes to correct these errors. The majority of this chapter is devoted to theoretically justifying the validity of using this algorithm for correcting errors. We use a counting argument to show that for less than n/2 errors, where n is the number of processes involved in the deadlock, the algorithm is able to identify a few potential corrections. This demonstrates the usability of the algorithm for debugging these types of communication errors. The algorithm we present not only works for statically specified communication, but can also be applied when the sender or receiver is specified through an index into an array or by a function cal l . It is then the programmer's job to go back and correct the array or function, to return what the algorithm suggests. We assume that these errors are independent and infrequent. 
The effectiveness of the technique decreases as the number of errors increases.

6.2 Description of Problem

The basic structure of send and receive calls in PVM and MPI is as follows:

send(buffer, receiver_node_ID, tag)
recv(buffer, sender_node_ID, tag)

Mistyping the node_ID or tag value results in a message that is either undelivered, or a message that is received by the wrong process. For example, consider the simple case of a single error, as shown in Figure 6.1. There is an error in the send call of process B in Figure 6.1: B attempts to send a message to A, but incorrectly sends it to someone else. Depending on whether the communication is synchronous or asynchronous, process B either blocks, eventually hanging the system, or terminates; in either case the result is an undelivered message in the system.

Process A              Process B
Send(buf, B, t1)       Recv(buf, A, t1)
Recv(buf, B, t2)       Send(buf, C, t2)   <-- ?

Figure 6.1: A simple error.

Using the message queues in Millipede, it is possible to extract both undelivered and recently delivered messages from the system, which can then be used in the analysis to correct deadlocks.

6.3 The Algorithm

For the sake of simplicity we do not consider message tags or wild cards in our initial analysis; however, we return to these cases later. We start this section with a number of definitions that are used in the next section.

Definition 6.3.1 Let S = (s_0, s_1, ..., s_{n-1}) be an ordered list of senders, where each s_i = (a, b) and a, b are integer process identifiers (ranks in MPI). Let R = (r_0, r_1, ..., r_{n-1}) be an ordered list of receivers, where each r_i = (a, b) and, again, a, b are process identifiers. For s_i = (a, b) ∈ S, a is fixed as the ID of the sending process, and for r_i = (a, b) ∈ R, b is fixed as the ID of the receiver.

Definition 6.3.2 A match between a sender s_i = (a_i, b_i) and a receiver r_j = (a_j, b_j) occurs when (a_i = a_j) ∧ (b_i = b_j).

The rationale behind the algorithm is as follows: find a set of permutations M = {π_s, π_r} where the number of fields that need to be changed in order to obtain a system without any unmatched sends and receives is minimal. This is always possible as there is a finite number of senders and receivers, and thus a finite number of different permutations. This means that one or possibly more permutations yield a minimum distance. Therefore, applying the changes induced by these permutations to the sends and receives in the program results in a program where all the messages are matched (assuming it does not deadlock for other causes as well). This means that all remaining undelivered messages can be delivered to a receiver and the program is deadlock free (we assume that the deadlock is not caused by insufficient buffer space in the buffering process). For an in-depth analysis of deadlocks due to buffer insufficiencies please refer to Chapter 8.

It is possible to reduce the problem to a bipartite matching problem [Pre92]. The approach is as follows: let G = (V, E) be a directed graph with weighted arcs constructed as follows:

• V = V_s ∪ V_r, where V_s represents processes sending a message, and V_r represents processes receiving a message. An element in V is composed of the process' ID and the call point of either the send or the receive function (e.g., the line number of the call).
• E is constructed as follows:

  - For each unmatched (undelivered) message m, do the following:
    * If m = (s, r) is an outstanding send (i.e., the message sent by sender s has not been received by receiver r), add arc (s, r), where s ∈ V_s and r ∈ V_r, to E with weight 2.
    * If m = (r, s) is an outstanding receive, add arc (r, s), where s ∈ V_s and r ∈ V_r, to E with weight 2.

  - Iterate backward through all successfully delivered messages (from newest to oldest) (u, v) and add arcs (u, v) and (v, u) with weight 2 to E if (u, v) or (v, u) does not already exist in E. The addition of arcs based on messages already delivered is done in the opposite order to that in which they were delivered. The order is defined using the < operator on the message numbers assigned to each message by the message number process (see page 40). This ordering is chosen in an attempt to involve only processes that communicated close in time to the occurrence of the deadlock.

  - Add arcs with weight 1 to E to make G a complete bipartite graph.

Now consider the induced undirected weighted graph Ḡ = (V̄, Ē), constructed in the following way:

• V̄ = V.

• Each pair of directed arcs (u, v) and (v, u) in E is replaced by one undirected arc (u, v) in Ē with weight equal to the sum of the weights of the two arcs (u, v) and (v, u).

Ḡ is the complete bipartite graph K_{n,n}, where n = |V_s| = |V_r|. The maximum bipartite graph matching algorithm can be used to obtain a maximal matching in Ḡ [CLR90]. This matching represents a system without a deadlock, as Ḡ is a complete bipartite graph and all nodes are involved in the matching. Since all senders are matched to a receiver, no messages are undelivered. Furthermore, this matching can be obtained by changing a minimum number of fields in the senders and receivers. This is because arcs representing actual messages and outstanding receive calls are favoured with weight 2 over added arcs with weight 1. Since a maximum flow is computed, as many of the weight 2 arcs as possible are in the result, and each node on the left is matched to exactly one node on the right, thus resolving the deadlock. The following lemmas show that a maximum matching implies a minimal change.

Lemma 6.3.3 Let G be as described above, and let f* be a maximal matching in the induced undirected graph Ḡ. The total number of changes needed for the system represented by G to resolve the deadlock is 4n − |f*|.

Proof: We have edges in Ḡ with three different weights: 2, 3, and 4. We need to consider the number of changes needed for each weight.

• For edges with weight 4 no changes are needed. These edges represent delivered messages.

• Edges with weight 3 represent an arc with weight 2 and one with weight 1. The arc with weight 2 represents either a posted send or an outstanding receive. When this edge is included in the matching, a change to the source of the arc with weight 1 is needed to make it match the source of the arc with weight 2. Thus 4 − 3 = 1 change is needed.

• Edges with weight 2 represent two arcs with weight 1 each. These arcs were both added to make Ḡ complete. This implies that the sender did not attempt to send to the receiver, and the receiver did not attempt to receive from the sender, so a change in both sender and receiver is needed; a total of 4 − 2 = 2 changes must be made.

This shows that the number of changes per send/receive pair is equal to 4 minus the weight of the edge. Since the sum of the edge weights in the matching is equal to |f*|, and the total number of send/receive pairs is n, the result follows. □
Lemma 6.3.4 Let G be as described above. A maximal matching f* on the corresponding graph Ḡ determines a minimum number of changes to the parallel system represented by G to resolve the deadlock.

Proof: Since the number of changes needed is 4n − |f*|, the minimal number of changes occurs when the second term of this expression is maximal. This second term is the weight of the matching, which means that the expression is minimal when the matching is maximal. □

The time complexity of this max-flow algorithm is O(|E| · |f*|) [CLR90], where |f*| is the size of the matching. Since Ḡ = K_{n,n}, |E| = n² and |f*| = n. Therefore, the time complexity is O(n³).

Example 6.3.5 Consider a simple example with three senders (S1, S2 and S3) and three receivers (R1, R2 and R3). Assume that S1 has sent a successfully delivered message to R1. Now assume that S2 is trying to send a message to R2, but R2 is expecting a message from S1. S3 is attempting to send to R2, and R3 has posted a receive for a message from S3. This results in a deadlock of R2 and R3, as no messages are ever delivered to these processes.

We can represent this scenario as a bipartite graph with senders on the left and receivers on the right. The graph labelled (a) in Figure 6.2 shows this deadlocked system. The algorithm requires a complete bipartite graph; the graph labelled (b) in Figure 6.2 illustrates the K_{3,3}, where the weights are shown as pairs (x, y), where x is the weight of the arc from the sender to the receiver, while y is the weight from the receiver to the sender. We now construct the graph Ḡ by replacing the arc pairs and adding their weights. The graph is labelled (c) in Figure 6.2. With three senders and three receivers there is a total of 6 different ways to combine the senders and receivers so that all senders are matched with a receiver. A maximal matching with weight 10 is given by matching S1 with R1, S2 with R2, and S3 with R3. The graph labelled (d) in Figure 6.2 shows the result of applying the max-flow algorithm to the complete bipartite graph, where the maximal matching is shown as solid lines.

Figure 6.2: (a) depicts a deadlocked configuration: R2 and R3 are not receiving any messages. (b) illustrates the complete directed bipartite graph G. (c) shows the induced complete undirected bipartite graph Ḡ. (d) gives the maximal matching in Ḡ.
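To make the matching step concrete, the following sketch computes the maximal matching for the weights of Example 6.3.5. It is not the Millipede implementation: for clarity it uses brute force over all permutations instead of a max-flow computation (adequate for the small n used here), and the weight matrix is simply written out by hand following the construction above (4 for a delivered message, 3 where one field must change, 2 where both must change).

#include <stdio.h>

#define N 3                        /* number of senders and receivers    */

static int weight[N][N] = {        /* edge weights for Example 6.3.5     */
    { 4, 3, 2 },                   /* S1: delivered to R1                */
    { 2, 3, 2 },                   /* S2: sent to R2, R2 expects S1      */
    { 2, 3, 3 },                   /* S3: sent to R2, R3 expects S3      */
};

static int used[N], cur[N], best[N], best_w = -1;

static void search(int s, int w)   /* match senders s, s+1, ..., N-1     */
{
    if (s == N) {                  /* all senders matched                */
        if (w > best_w) {
            best_w = w;
            for (int i = 0; i < N; i++)
                best[i] = cur[i];
        }
        return;
    }
    for (int r = 0; r < N; r++) {
        if (used[r])
            continue;
        used[r] = 1;
        cur[s] = r;
        search(s + 1, w + weight[s][r]);
        used[r] = 0;
    }
}

int main(void)
{
    search(0, 0);
    printf("weight of maximal matching: %d\n", best_w);
    printf("fields to change: %d\n", 4 * N - best_w);
    for (int s = 0; s < N; s++)
        printf("match S%d with R%d\n", s + 1, best[s] + 1);
    return 0;
}

For the example this reports a matching of weight 10 and 4n − 10 = 2 field changes, in agreement with the discussion above.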
6.4 Algorithm accuracy

In this section, we evaluate the effectiveness of the technique by showing that the algorithm does not frequently return an incorrect answer, or even more than one answer; there could be more than one way to correct a deadlock with a minimum number of field changes. To do this we need to introduce a model that describes a system of senders and receivers equivalent to the one used in the previous section. In the following, let n denote the number of senders and receivers, and k the number of errors in the system. First we define the following:

Definition 6.4.1 A communication configuration is a pair (S, R) (see Definition 6.3.1). Let C_n denote the set of all communication configurations with n senders and n receivers.

Definition 6.4.2 A send s = (a, b) is unmatched if for r = (c, b), c ≠ a. Equivalently, a receive r = (a, b) is unmatched if for s = (a, d), d ≠ b. We call a communication configuration valid if it has no unmatched sends or receives. The set of valid configurations in C_n is denoted by V_n.

Given a configuration (S, R) = ({s_0, ..., s_{n-1}}, {r_0, ..., r_{n-1}}) in C_n, with s_i = (a_i, b_i) and r_j = (a_j, b_j), the associated directed bipartite graph G = (V, E) is defined as the subgraph of the graph constructed in Section 6.3 that has the same node set but whose arc set contains only the arcs of weight 2.

Example 6.4.3 For a system with two senders and two receivers, Figure 6.3 shows the only two valid configurations.

Figure 6.3: Only two of the 16 configurations in C_2 are valid: all sends and receives are matched in both v_1 and v_2, thus making them valid configurations.

Definition 6.4.4 The valid communication configuration v ∈ V_n where s_i = r_i, ∀ i : 0 ≤ i < n, is called the correct configuration. There is only one correct configuration in V_n and we denote it by v_c.

The correct communication configuration is the configuration that the programmer intended to write. Lemma 6.4.15 shows that all valid configurations are equivalent, and without loss of generality, we can choose one of them to represent the correct configuration.

Definition 6.4.5 Let v ∈ C_n. B̄(v, i) is the set of all communication configurations obtainable by first removing k_1 + k_2 = j ≤ i arcs from v (k_1 arcs oriented from S to R and k_2 arcs in the opposite direction), and then adding k_1 new arcs oriented from S to R and k_2 in the opposite direction. The set of configurations that can be obtained by removing exactly i, and then adding exactly i, new arcs is denoted by B(v, i). This set can be computed in the following way: B(v, i) = B̄(v, i) \ B̄(v, i − 1), and B(v, 0) = B̄(v, 0) = {v}.

Example 6.4.6 Figure 6.4 shows B(v_1, 1) for the configuration v_1 from Figure 6.3.

Figure 6.4: The set B(v_1, 1) for configuration v_1 as shown in Figure 6.3.

Example 6.4.7 In Figure 6.5 consider an invalid configuration v ∈ C_n \ V_n. The boldface x marks v. The boxes mark valid configurations, and the rest, marked by x, are other invalid configurations v' ∈ C_n \ V_n. The solid line marks B̄(v, 0), the dashed line B̄(v, 1), and the dotted line B̄(v, 2).

In order to correct v such that it becomes a valid configuration by making a minimal number of changes (that is, moving as few arcs as possible to transform v into a valid configuration), we choose the first valid configuration found in the series of increasing sets B̄(v, 1), B̄(v, 2), .... In the example in Figure 6.5, a valid configuration is found in B̄(v, 1).

Figure 6.5: The set of elements bounded by the solid line is B̄(v, 0) (this set always contains only one element, namely v itself). The set bounded by the dashed line (including B̄(v, 0)) is B̄(v, 1), and B̄(v, 2) contains all the elements.

We show that for any invalid communication configuration in C_n, the probability that the first encountered valid communication configuration in the series of increasing sets B̄(v, 1), B̄(v, 2), ... is the correct communication configuration is high. In other words, if we introduce k errors into a valid communication configuration v ∈ V_n, then with high probability the algorithm will propose v to correct the error.

Lemma 6.4.8 The number of valid configurations in C_n, that is, the size of the set V_n, is n!.

Proof: For a configuration to be valid, each sender must send to a distinct receiver, and this receiver must receive from this sender. If s_i = (a_i, b_i), then a receiver r_j = (a_i, b_j) where b_i = b_j must exist. It is therefore sufficient to determine the number of different ways to order n senders. There are n! such ways. □
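Definition 6.4.2 translates directly into a small test for validity. The following sketch is illustrative C (not part of Millipede), using the representation of Definition 6.3.1 with s[i] holding the destination of sender i and r[j] the source that receiver j expects:

/* Returns 1 when the configuration is valid in the sense of
 * Definition 6.4.2: every send is matched by the receive posted at its
 * destination, and every receive is matched by the send posted at its
 * expected source.  All entries are assumed to lie in 0..n-1.           */
static int configuration_is_valid(const int s[], const int r[], int n)
{
    for (int i = 0; i < n; i++) {
        if (r[s[i]] != i)        /* send i is unmatched                  */
            return 0;
        if (s[r[i]] != i)        /* receive i is unmatched               */
            return 0;
    }
    return 1;
}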
In the following, we consider the set of configurations on n senders and n receivers, C_n, and the corresponding set of valid configurations V_n = {v_0, ..., v_{n!-1}}. Let k < n/2 be the number of errors in the system.

Example 6.4.9 Consider a system with 1 error; we need to consider the configurations obtainable by introducing one error into each v_i ∈ V_n. This is a set of sets:

B_1 = {B(v_0, 1), B(v_1, 1), ..., B(v_{n!-1}, 1)}.

If we know, for every system with one error, that

∩_{i=0}^{n!-1} B(v_i, 1) = ∩_{b ∈ B_1} b = ∅,

then B(v, 1) contains only one valid configuration, which must be the correct configuration.

Consider Figure 6.6. If two errors are introduced into the configuration v_1, then we have a configuration that is in the intersection of B(v_1, 2) and B(v_2, 2), and this invalid configuration can be corrected to either v_1 or v_2 by moving two arcs. Since there are two valid configurations, either may be the correct configuration.

Figure 6.6: ∩_i B(v_i, 1) = ∅, which means the correct configuration will always be found if only one error is present in the system. However, ∩_i B(v_i, 2) ≠ ∅, which means that if two errors are present, then a wrong valid configuration might be suggested as the correction to the deadlock.

In order to argue that the configuration obtained by moving a minimal number of arcs in any invalid configuration is the correct valid configuration, we must therefore show the following for all v_i, v_j ∈ V_n, i ≠ j:

|B(v_i, e) ∩ B(v_j, e)| / |B(v_i, e)|  and  |B(v_i, e) ∩ B(v_j, e)| / |B(v_j, e)|  are small  ∀ e ≤ k,   (6.1)

where small means an acceptably low fraction of wrongly proposed corrections. This is equivalent to showing the following:

|B(v_c, e) ∩ B(v_i, e)| / |B(v_c, e)|  is small  ∀ e ≤ k, ∀ v_i ∈ V_n \ {v_c},   (6.2)

where v_c is the correct configuration in C_n.

To simplify the description of the communication configurations we introduce the following notation. For each communication configuration in C_n (the size of C_n is n^{2n}), we assign a 2n-digit number s_1 r_1 s_2 r_2 ... s_n r_n (s_i, r_i ∈ {0, ..., n-1}) as follows: s_i equals the number of the receiver that sender number i is sending to, and r_i equals the number of the sender that receiver number i is trying to receive from.

Example 6.4.10 For example, using the two configurations in Figure 6.3 we obtain the representation v_1 = 0011 and v_2 = 1100. Figure 6.7 shows which configurations can be reached in k steps from the correct configuration v_c = 0011.

Figure 6.7: The solid lines connect configurations that are distance one apart. That is, if two configurations are connected directly by a line, one can be obtained from the other by moving one arc. In C_2 the maximum distance is four, so no two configurations can be more than distance four away from each other.

The following lemmas are needed to prove Equation 6.2.

Lemma 6.4.11 The number of configurations that can be obtained by moving i or fewer arcs in v, denoted by |B̄(v, i)|, is as follows:

|B̄(v, i)| = Σ_{j=0}^{i} C(2n, j) (n-1)^j.

Proof:

|B̄(v, i)| = |∪_{j=0}^{i} B(v, j)|            (6.3)
          = Σ_{j=0}^{i} |B(v, j)|             (6.4)
          = Σ_{j=0}^{i} C(2n, j) (n-1)^j,     (6.5)

where Equation 6.3 follows from Definition 6.4.5, Equation 6.4 follows from the fact that B(v, j) ∩ B(v, i) = ∅ if i ≠ j, and Equation 6.5 follows from the observation that B(v, j) is the set of configurations where we move exactly j arcs: we must choose j out of the 2n arcs to move.
Each of these arcs can be moved to any of the either n senders or receivers except for the one it pointed to originally, leaving n - 1 choices. This is done for a total number of j times. • Example 6.4.12 Table 6.1 shows a few examples of \ B(v, e) \. Recall that B{v, e) is the set of configurations that can be obtained by moving a maximum of e different arcs in configuration v. errors (e) n 0 1 2 3 4 5 6 2 1 5 11 15 16 3 1 13 73 233 573 665 729 Table 6 . 1 : | B(v,e) \ is the size of the sets that can be obtained by moving e or less arcs in a valid configuration v. Definit ion 6.4.13 The distance between two valid configurations in Vn, denoted as d(vi,Vj), is defined as follows: 2n d{Vi,Vj) = Y^\VU ± vi,h 1=0 where vt = Vi2... Vi2n, VJ = VJ1 VJ2 . . . Vj2n, and 1 if S is true. 0 otherwise. 6.4. Algorithm accuracy 74 and S is a relational expression. Example 6.4.14 The valid configurations in C3 are as follows V3 = {001122, 110022, 002211, 220011, 221100, 210210}. Table 6.2 shows the distances between these different valid configurations. 001122 110022 002211 122001 221100 210210 001122 0 4 4 6 4 6 110022 4 0 6 4 6 4 002211 4 6 0 4 6 4 122001 6 4 4 0 4 6 221100 4 6 6 4 0 4 210210 6 4 4 6 4 0 Table 6.2: Distances between valid configurations in V 3 c C3. The maximum distance between configurations in V3 is 6, and the number of valid configurations is also 6. Lemma 6.4.15 For any system Cn, a distance k, and a valid configuration v e Vn c Cn, the number of valid configurations in V„ with distance k does not depend on the choice of v. Proof: Permutations are automorphisms. • Lemma 6.4.16 The possible distances between valid configurations in V„ c C„ are 4 , 6 , . . . , 2n -2,2n. Proof: A necessary condition for a configuration to be valid is that {si,..., sn} = {ri,...,rn} = {0,... , n - 1}. Since all valid configurations are equivalent, consider vc = s\ris2r2,. • .,snrn. A minimum of two send/receive pairs must be switched to obtain a different valid configuration. This gives a minimum distance of 4. Now choose two send/receive pairs a, b to switch. There are three cases to consider: 1. Both pairs are of the form s^r, = ii, which means that either they have not been switched before, or that they have been switched back to their original state. When these pairs are switched, the distance increases by 4. 2. One of the pairs, say a, is of the form = ii, and the other one, b, is not. When a and b are switched a contributes distance 2 to the total distance, and b already contributed distance 2, so the total distance only increases by 2. 3. Neither a nor b are of the form sin = ii. Neither contribute further to the total distance by being switched. 6.4. Algorithm accuracy 75 Definit ion 6.4.17 Let V(v,m) be the set of valid configurations exactly distance m from the valid configuration v, that is, the set B(v, m) n V „ . Lemma 6.4.18 The size of V(v, m) for m = 2k is the following: \V(v,2k) |= Q c , where 2=0 Proof: Since V(v, 2k) is the set of configurations exactly distance m away from v, we start by choosing k of the n send/receive pairs to move; this can be done in (£) different ways. Since we are only interested in permutations that result in configurations exactly distance 2k away, we must multiply by the number of permutations of k elements that permute all k elements. We need to determine the number of permutations of k elements which have no fixed points. This problem was first proposed by the French mathematician Pierre Remond de Montmort in 1713 [Mon 13]. 
The answer is the alternating sum k\ <L~^L- This series of numbers is known as recontres numbers or derangements, m Example 6.4.19 c« is a fast growing series. Table 6.3 illustrates this by computing the first 11 values of c^. It should be clear, that the rapid growth in the number of valid configurations makes correcting systems with a large number of errors virtually impossible. i Ci 0 1 1 0 2 1 3 2 4 9 5 44 6 265 7 1,854 8 14,833 9 133,496 10 1,334,961 Table 6.3: The rate of growth of . Example 6.4.20 Table 6.4 gives an example of the number of valid configurations at given distances from another valid configuration. The columns for distance 0 and 2 are omitted as they are always 1 and 0, respectively. 6.4. Algorithm accuracy 76 Number of valid configurations at different distances. n 4 6 8 10 12 14 16 18 2 1 3 3 2 4 6 8 9 5 10 20 45 44 6 15 40 135 264 265 7 21 70 315 924 1,855 1,854 8 28 112 630 2,464 7,420 14,832 14,833 9 36 168 1,134 5,544 22,260 66,744 133,497 133,496 10 45 240 1,890 11,088 55,650 222,480 667,485 1,334,960 Table 6.4: The number of valid configurations at different distances in Vn. Distances 0 and 2 are omitted as they are always 1 and 0, respectively. Note, the last number in each row corresponds to c, for i = n. Consider the following two configurations: vx = 001122v 2 =002211. These two configurations differ in the last four positions, thus having a distance of four. To compute the intersection B(vi,2) n B(v2,2), we must find the configurations that can be reached from both vi and v2 by changing at most two positions in each. Since the distance between the two configurations is four, and we can change at most two positions in each configuration, it follows that we must change exactly two in each. Choose two fields in v\, say vii and vlj. Change these two positions to have the values of v2i andi> 2 j , and obtain i^ . We know that d(v'1,v2) = 2 . Now change the two positions in v2 that differ from v[, say v2, and v2m to have the values of vlt = v'u and vXm = v'lm, and obtain v'2. We now know that d(v[,v2) = 0 . The original distance is four and we must change two fields in each configuration. The number of different ways this can be done is (*) = 6 . The six configurations are as follows: 113333, 112222, 112323, 113233, 113232, 112332. The underlined positions are the fields changed in vi and the overlined fields are the ones changed in v2. According to Lemma 6.4.15, al l valid configurations are equivalent. Therefore, we can simply study the properties of the correct valid configuration vc of V „ . Example 6.4.21 Figure 6.8 illustrates the overlapping sets, the intersections can easily be seen by comparing the coloured areas with the solid lines. Table 6.5 shows the intersection sets for various values of k and k'. Note, the size of the intersecting sets shown in Table 6.5 can be found on the diagonal in Table 6.6 from the lower left to the upper right. Table 6.6 shows the sizes of the various intersections depending on different values of k and k'. All these values can be read from Figure 6.8. The upper left part is mainly zeros because 6.4. Algorithm accuracy 77 Figure 6.8: An illustration of which configurations are in which intersections when considering all the valid configurations in the C2 system. 
Constraint 6(1122, jfc) n 6(2211, *') (k < 4) A (k' < 0) (k < 3) A (k1 < 1) (k < 2) A (k1 < 2) (k < 1) A (jfc' < 3) (fc < 0) A (A' < 4) 2211 1211, 2111, 2212, 2221 2222, 1212, 1221, 2112, 2121, 1111 1112,1121, 1222, 2122 1122 Table 6.5: An example of the intersection B(1122, k) n 6(2211, k'), that is, the config-urations that can be transformed into 1122 by moving at most k arcs, and into 2211 by moving at most k! arcs. d(1122,2211) = 4. For example, it is impossible to find a configuration that can be turned into 1122 by moving two arcs, and into 2211 by moving one. 6.4. Algorithm accuracy 78 k' < 0 < 1 k' < 2 A;' < 3 k' < 4 k < 0 0 0 0 0 1 k < 1 0 0 0 4 5 k<2 0 0 6 10 11 k < 3 0 4 10 14 15 /c < 4 1 5 11 15 16 Table 6.6: The size of #(1122, k) n 5(2211, fc') for various values of k and fc'. We can now determine the number of elements in the intersections of the B sets in 6.2. Theorem 6.4.22 Let e be the number of errors in a communication system Cn. The number of configurations with e errors for which the maximum matching either suggests a wrong valid configuration or a set of valid configurations where the correct one is included is as follows: (6.6) \J % , e ) n % , e ) *£(")«E £ E 0(i,a,b,Co) t=2 ^ ' 6=0 a=max{6,e-6} c o =0 where and 0(i, a, b, c) = Q ( 2 i " ^ ) (n - 2 ) - ( ^ " J )^ (n - 1)-Cxy = (a - c0) + (6 - c0) - 2i cx = 2i — a + c0 cy = 2i — b + c0 under the constraints (cxy > 0 A c x > 0 A cv > 0). Proof: We wish to compute the total number of configurations in the intersections of the set B(vc,e) with the sets B{vj,e), and then compare that with the total number of configurations in the set B(vc,e) (cf. Equation 6.1). We start with Equation 6.7. | J % , e ) n % , e ) vjev (6.7) Equation 6.8 follows from Equation 6.7; since the maximum distance between a configuration v e B(vc,e) and a configuration v' G B(vi,e) for Vi e Vn\{vc} is 2e, we can thus restrict ourselves to consider valid configurations at distances 4,6,...,2e from vc. T>(vc,i) is the set of valid configurations exactly distance i from vc. • (J | J % , e ) n % , e ) ie{4,...,2e} Vj£V(vc,i) (6.8) 6.4. Algorithm accuracy 79 From Definition 6.4.5 we get the following: e e B(vc,e) = | J B(vc,a) and B(Vj,e) = \J B(vjtb). a=0 6=0 This gives the following result in Equation 6.9: U U [\jB(vc,a)) n ( \JB{,vjtb) ie{A,...,2e} VjeV(vc,i) \a=l .6=1 (6.9) Since the V sets are disjoint, we can exchange the unions with sums, and arrive at Equation 6.10: E E i=2 vjeV(va,2i) \jB(vc,a) )n(\jB(vj,b) \a=l Kb=l (6.10) For a configuration v e B(vc,a) n B(vj,b) with b = d(vj,v) and a = d(vc,v), if a < b, then vc wil l be reported as the correct communication configuration. Since we are counting the valid communication configurations that are incorrect corrections, we only consider cases where a > b. Additionally, if a + b < e then the intersection between B{vc, a) and B(vj,b) is empty. These three observations combined, yield Equation 6.11. E E i=2 VieV(vc,2i) (J (J % , o ) n % , i ) 6=0 a=max{b,e — 6} (6.11) Finally, by summing the sizes of the intersections B{vc, a) n B(vj,b), we obtain the quadruple sum shown in Equation 6.12, which is an upper bound for Equation 6.7. E E E E \B(vc,a)nB(Vj,b)\ i=2 Vje V(vc,2i) 6=0 a=max{b,e-b} (6.12) We now calculate the value of \B{vc,a)nB(vj,b)\ for values a and b where j = 2i. Let x = vc = xiX2-.-X2n arid y = VJ = yiy2---V2n with d{x,y) = j = Ii. We want to find configurations z = v = zxz2 ...z2n such that d(x,z) = a and •d(y,z) - b. 
Assume, without loss of generality, that xk = yk for j + 1 < k < 2n such that we get the following: j 2n—j X = x2 ... Xj X j + i . . . x2n y -- '- Zi . z2 . . . Zj+l . . . Z2n z = '• yi 2/2 • • • yj Xj+i . . . X2n 6.4. Algorithm accuracy 80 Now define the following: Cy = {Zk • 1 < k < j : zk = xk A zk ^ Vk) Cx = {zk •• 1 < k < j : Zk ^ xk A zk = Vk] CXy = {Zk • 1 < k < j : zk ^ xk A zk ^ Vk} Co = {Zk • j + 1 < k < 2n : zk ^ xk A zk ^ Vk} Cx — 1 Cx Cy — 1 Cy Cxy | CXy | Co = \c„ A ^-configuration must satisfy the following: + C0 since d(x, z) = a and d(y, z) = b. This results in the following 3 equations: (X 0Xy ~\- CX ~t~ CQ }) — CXy ~h Cy -f" C 0 — C-xy 0-x ~r" (6.13) The last equation follows from the fact that xk ^ yk for 1 < k < j. Setting a' = a - c0 and b' = 6 - c 0 we obtain 3 equations with 3 unknowns represented by the following matrix equation: f 1 1 0 \ f c x y \ ( a' \ 1 0 1 VI 1 1 / \ J J By inverting the matrix we can compute values for cxy, cx, and cy in the following way: ( Cxy \ Cx V cv J ( 1 1 - 1 W a' \ V 0 - 1 -1 0 1 / a' + b'-j \ V j-v 3-a! ( a + b - j - 2 c 0 \ j -b + c0 J \ j - a + Co / V V 3 J where the number of fields where z differs from x, but not from y, is cx; the number of fields where z differs from y, but not x, is cy; and the number of fields where z differs from both x and 2/ is c x y . We now look at the different ways of choosing fields in z that satisfy these constraints. Of the j fields where xk ^ yk, we must choose cx where z differs from x but not y. Of the remaining U - cx), we must choose cy fields. The rest of the cxy fields are prechosen. Of the (2n - j) fields where a; and y do not differ, we must choose c 0 fields where z differs. 6.4. Algorithm accuracy 81 The values of the Cx and Cy fields are chosen to be the values of the opposite string, that is, for the cx chosen fields the value is that of y and vice versa, for the cy chosen ones. The remaining Cxy fields can take any value except those of the corresponding fields in x and y, which (n - 2) choices for these cxy fields. The C0 fields can take any value except that of the corresponding fields of x and y, which are the same. There are (n - 1) different choices for these values. By substituting these values in Equation 6.12 on page 79, and summing over all valid values of c0 (i.e., values for c 0 that produce nonnegative values for cxy,cx and cy) we get the number of configurations with distance a to x, and distance b to y. The constraint (cx > 0 f\cy > 0Acxy > 0) assures valid values of cxy, cx and cy. By subtracting the two last equations of 6.13 from the first, we get c0 < (a + b + 2i)/2. Taking this into consideration, we arrive at the following result, as stated in Equation 6.6: | J % , e ) n % , e ) e e e ^ E E E E \B(vc,a)nB(Vj,b)\ i=2 VjeT>(vc,2i) 6=0 a=max{b,e-b} = Y,\V(vc,2i)\J2 E E 0(i,a,b,c0) i=2 6=0 a=max{b,e — b} co=0 a + b + 2i = E(")C 'E E E 0(i,a,b,Co) (6.14) i=2 ^ ' 6=0 a=max{b,e-b} co=0 where ct are the constants from Lemma 6.4.18, and and cxy,cx,cy are given as follows: cxy = (a - c0) + (b - c0) - 2i cx = 2i - b + c0 cy = 2i — a + c0 under the constraints (cxy > 0 A cx > 0 A cy > 0). • Using Equation 6.14 we can now compute an upper bound for the fraction in Equation 6.2. Figure 6.9 shows the estimated failure rate for the algorithm. An ambiguous correction is when more than one valid configuration is at the minimum distance. 6.5. 
Number of errors (e)

    n     e = 1        e = 2        e = 3        e = 4        e = 5
          W     A      W     A      W     A      W     A      W      A
    2    0.00  0.00   0.00  54.55
    3    0.00  0.00   0.00  24.66
    4    0.00  0.00   0.00  13.00
    5    0.00  0.00   0.00   7.38   4.74  32.93
    6    0.00  0.00   0.00   5.26   2.67  23.86
    7    0.00  0.00   0.00   3.75   1.64  17.91   7.76  53.79
    8    0.00  0.00   0.00   2.80   1.07  13.88   5.10  42.08
    9    0.00  0.00   0.00   2.17   0.74  11.05   3.51  33.67  11.33  88.85
    10   0.00  0.00   0.00   1.73   0.53   8.99   2.52  27.49   8.02  71.39

Figure 6.9: The failure rate for the algorithm in percent: incorrect suggested corrections (labelled W) and ambiguous corrections (labelled A).

6.5 Message tags

We now consider message passing that includes message tags. We do not formally analyze this case, but argue that the chance of the algorithm predicting the correct solution increases proportionally to the number of different message tags used. We do not have a polynomial time algorithm for the case where tags are considered, but the desired permutations can be obtained by computing a Hamming distance between all possible combinations of permutations of senders and receivers, and choosing the one or ones that give the smallest Hamming distance. This is an exhaustive search that is only feasible for small values of n; it has time complexity O(n!).

Figure 6.10 is a copy of Figure 6.3, where we have introduced message tags. S1 and R1 both send/receive with tag 11, and S2 and R2 both send/receive with tag 22. It is obvious that v2 is no longer a valid configuration, as there is a tag mismatch between S1 and R2 and between S2 and R1. In fact, v2 is now a configuration with at least 2 errors.

Figure 6.10: Introducing message tags.

By introducing message tags into a communication system, and by choosing them carefully, that is, in a meaningful way with respect to the message they are associated with, the risk of the algorithm predicting a wrong solution is greatly reduced. As an example, consider C2. As seen in the previous example, the 54.55% ambiguity rate has disappeared because v2 is no longer a valid configuration. This holds true if two errors are introduced in the sender or receiver IDs: the correct valid configuration v1 is distance 2 away, whereas v2 is distance 4 away. Similarly, if two errors are introduced into the tags, the correct valid configuration is distance 2 away, whereas the wrong valid configuration is distance 4 away.

A wild card or an 'any' value is a special value that matches any other value. Wild cards are often used when a receiver does not know the identity of the sender or the tag of the message, for example for dynamic communication in client/server type applications, or simply for convenience instead of the process ID. When wild cards are introduced into a communication system, the degree of freedom with respect to field values increases. This significantly decreases the success rate of an algorithm such as the one presented.

6.6 Summary

In this chapter, we presented an algorithm that proposes changes to message passing systems that have deadlocked due to a small number of typographical errors. If a small number of errors occur in an otherwise working message passing system, then we can correct these errors with high probability.

Many programmers make extensive use of wild cards in receive calls. This not only increases the risk of a message being accepted by a receiver that is not supposed to receive it, but also complicates the problem of discovering the source of the error.
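To make the exhaustive, tag-aware search described in Section 6.5 concrete, the following C sketch enumerates all permutations of receivers, computes the Hamming-style distance between each induced valid configuration and an observed (deadlocked) one, and keeps the closest candidate. The data layout, the example values, and all function names are ours, chosen for illustration; this is not the Millipede implementation, and collecting all permutations at the minimum distance (to detect ambiguous corrections) is omitted for brevity.

    /* Exhaustive O(n!) search for the valid configuration closest to an
     * observed one, taking message tags into account.  A valid configuration
     * corresponds to a permutation p: sender i sends to receiver p[i],
     * receiver p[i] expects sender i, and their tags agree. */
    #include <stdio.h>
    #include <limits.h>

    #define N 3   /* number of sender/receiver pairs (hypothetical example) */

    /* Observed configuration: receiver 2 wrongly expects sender 2. */
    static int send_to[N]   = {1, 2, 0};     /* receiver each sender targets */
    static int send_tag[N]  = {10, 20, 30};
    static int recv_from[N] = {2, 0, 2};     /* sender each receiver expects */
    static int recv_tag[N]  = {30, 10, 20};

    static int best_dist = INT_MAX;
    static int best_perm[N];

    /* Number of recorded fields that would have to change to turn the
     * observed configuration into the one induced by permutation p. */
    static int distance(const int p[N])
    {
        int i, d = 0;
        for (i = 0; i < N; i++) {
            if (send_to[i]      != p[i])          d++;  /* sender field   */
            if (recv_from[p[i]] != i)             d++;  /* receiver field */
            if (send_tag[i]     != recv_tag[p[i]]) d++; /* tag mismatch   */
        }
        return d;
    }

    /* Enumerate all permutations of {0,...,N-1} recursively. */
    static void search(int p[N], int used[N], int k)
    {
        int i;
        if (k == N) {
            int d = distance(p);
            if (d < best_dist) {
                best_dist = d;
                for (i = 0; i < N; i++) best_perm[i] = p[i];
            }
            return;
        }
        for (i = 0; i < N; i++) {
            if (!used[i]) {
                used[i] = 1; p[k] = i;
                search(p, used, k + 1);
                used[i] = 0;
            }
        }
    }

    int main(void)
    {
        int p[N], used[N] = {0};
        int i;
        search(p, used, 0);
        printf("minimum distance: %d\n", best_dist);
        for (i = 0; i < N; i++)
            printf("sender %d should send to receiver %d\n", i, best_perm[i]);
        return 0;
    }

In this hypothetical example the closest valid configuration is distance 1 away and corresponds to receiver 2 expecting sender 1, which is the kind of single-field correction the algorithm is intended to propose.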
In contrast, by carefully choosing message tags, and by associating different tags with different types of communication, the risk of wrong sends going through is substantially reduced. Furthermore, the ability to predict the correct communication configuration is greatly increased. We do not have a polynomial t ime algorithm for the case where message tags are considered; we believe that the problem can be reduced to a 3 dimensional matching problem, which is NP-comp le te . Introducing message tags make the problem more complex but on the other hand reduces the size of the overlapping B sets. The 0(n!) algorithm can easily be implemented such that message tags are taken into account. Chapter 7 Protocol Conformance Checking "The pure and simple truth is rarely pure and never simple" - Oscar Wilde In the previous chapter, we described the Deadlock Detection and Correction Module, which is the first tool at the protocol level of the Millipede multi level debugger. In this chapter, we investigate a technique referred to as Protocol Constraint Conformance Checking. A protocol constraint specification is an assertion-like specification of a protocol's behaviour that specifies a number of constraints that the protocol must conform to when executed. It is not a verification tool like SMV [CLM89, McM92] or FDR [For], nor is it directly comparable to assert statements in C, but rather a technique that allows the user to automate checking the behaviour of the protocol of a running parallel system by writing a specification file containing a number of constraints. These constraints are checked against the messages at runtime. We present the Protocol Constraint Specification Language (PCSL) and a Millipede tool Millipede Online Protocol Error Detection (MOPED) which executes the constraint conformance checking at runtime. 7.1 Between Testing and Verification The idea behind constraint conformance checking is to allow the user to write a specification of the behaviour of the protocol, and then, using information about actual messages, automatically check that the messages satisfy the constraints. It is important to distinguish the protocol constraint specification from the wel l known concept of constraint programming [Lel88]. A program written in a constraint programming language is a set of equations that are given to a constraint-satisfaction system which in turn, returns the values satisfying the constraints. Our approach does not generate a list of messages that satisfy the constraint system, but rather, using 84 7.1. Between Testing and Verification 85 message information, checks that the constraints are val id. In later sections we explain in detail how constraints are instantiated using message information and checked. We now argue why protocol constraints are useful in the debugging and development cycles of a parallel program. The communication protocol of a parallel message passing program starts as a specification; this specification can be anything from written prose to a detailed CSP description. One of the goals of such a specification is to serve as a starting point for the implementation of the protocol using, for example, C and PVM. A second goal is to serve as a specification that can be used for testing purposes. A number of different paths can be taken from the specification to the actual running imple-mentation. The most straightforward one is to simply implement the protocol. 
This leaves the user with the daunting task of having to test a protocol implementation that might contain errors and deadlocks. If the specification is more rigorous than just plain English, perhaps written in some verification language, using a verification tool to check that the protocol does not have deadlocks, livelocks, and race conditions is a natural choice. In Section 7.2, we briefly describe some of the advantages and disadvantages of this technique. Once a protocol specification is verified, it must stil l be implemented in the target language and message passing system. Errors may be introduced into this implementation as wel l ; this means that testing the implementation is sti l l necessary. Whenever the translation of the protocol from specification to implementation is done by hand, as with any implementation, the risk of introducing errors exists. We believe this is particularly true when the source (specification or verified protocol in some verification language) and the target (e.g., a C program) domains differ greatly. One of the closest relationships between a specification language and a programming language is between CSP [Hoa78] and Occam [May83], but even here the difference is stil l substantial. No matter which approach is taken, a substantial amount of testing is necessary. This is where protocol constraints can help; as a mixture of asserts and constraints, we allow the user to specify relationships between processes (through a constraint-like specification), and have the system check that the messages conform to the constraints through assert-like checks. We believe that the collection of constraint specifications in one file (rather than assert like statements associated with each message passing call) gives the user a faster and more complete overview of the entire constraint specification, as well as more tightly coupling the sending process with the receiving process. Since program development is often an iterative task, another main goal of the constraint system is to provide the ability to use it in connection with such a program development strategy. This means that the init ial specification can be very general, and as the implementation becomes more complex, or as discoveries are made about the protocol, the specification can be refined as 7.2. Protocol Checking and Verification 86 wel l . Naturally, there is no guarantee that the transcription of the protocol specification to the con-straint specification language is correct, but this language is not large nor complicated. Whether a protocol is formally verified or not, protocol constraints can aid in testing the implementation of a protocol. Many users are not familiar with CSP or other complex specification languages or verification tools. If this is the case, protocol verification is virtually impossible. However, though not constituting verification in the typical sense of the word, constraints enable users unfamiliar with verification tools to write simple protocol constraint specifications as the program is being developed, and MOPED checks messages against these when the program runs. We believe that this can assist the user in correcting errors in the implementation that might otherwise be difficult to find. 7.2 Protocol Checking and Verification For completeness, we include some information on protocol verification in this chapter. A common denominator for the tools mentioned in this section is the ability to check and verify protocols and perform model checking. 
Being able to check a protocol for deadlocks and fairness constraints is an important part of developing and debugging parallel programs. However, most existing tools require the protocol/model to be specified or implemented separately in the language of the tool, which means that the protocol must be re-implemented in the source language the application is written in. Some well known approaches to protocol specification include CSP [Hoa78], CTLV/u-calculus [CE81] and coloured Petri nets [Jen92]. Specifications written in CSP can be verified and checked using the FDR model checking tool [For]. FDR (Failures-Divergence-Refinement) allows for the checking of many properties of finite-state systems and the investigation of systems which fai l these checks. CSP allows a wide range of correctness conditions, including deadlock and livelock freedom, as wel l as general safety and liveness properties to be encoded and checked using FDR. A different approach to model checking is using CTL (Computational Tree Logic); systems using this abstraction include VIS [The96], Mur</» [Dil96] and SMV [CLM89, McM92]. The specification is typically translated into a BDD (Binary Decision Diagram), and various algorithmic techniques can be applied in order to verify statements about the model. All of these systems accept specifications written in different languages, none of which are compatible with standard C or C++. The SPIN [Hol97] system also falls into this category of tools, although it is based on LTL (Linear Temporal Logic) and not CTL. In [San99] two important problems are pointed out with these techniques; these are as follows: the specification languages are fairly low level, and the state space explosion problem is present. The approach to model checking with coloured Petri-nets is slightly different; the user has 7.3. Protocol Constraint Specification 87 to specify a graphical representation of the protocol and annotate it with code written in ML. A number of analyses can then be performed on the model by constructing a state space for the net. To transcribe a Petri-net model to C requires implementing the protocol based on a graphical representation and translating ML code to C. The risk of introducing errors is increased as the translation from a graphical representation and a functional specification must be performed manually. 7.3 Protocol Constraint Specification Before we start defining protocols, we introduce a few concepts and definitions. A group of processes is an ordered set of processes all spawned from the same pvm_spawn cal l . There can be several groups of the same program depending on the number of spawn calls. An instance is one process from a group. Each process in a group is given an instance number, starting at 0, each time a group is spawned. A l ine number is either a concrete line number containing a pvm_send, a pvm_recv or an identifier. If an identifier is used, Millipede wil l search the appropriate source file for comments of the form / * ( ( l i n e - l a b e l ) ) * / where l i n e - l a b e l is the identifier used in the specification of the protocol. 7.3.1 Protocol Contents To use the PCSL/MOPED, the user first writes a file containing a constraint specification that she wishes to check her program against. We refer to a protocol constraint specification as simply a protocol specification, or just as a specification. A protocol specification file consists of a number of lines that specify which sends can send to which receives. 
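To make the line-label convention described above concrete, the fragment below shows how a send and a receive in a C/PVM program might be annotated with the /* ((line-label)) */ comments that a protocol specification can refer to by name instead of by concrete line numbers. The function names, variables, and tag value are hypothetical; only the comment form and the PVM calls themselves are as described above.

    #include <pvm3.h>

    void send_row(int dest_tid, double *row, int n)
    {
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(row, n, 1);
        pvm_send(dest_tid, 42);        /* ((SendRow)) */
    }

    void recv_row(int src_tid, double *row, int n)
    {
        pvm_recv(src_tid, 42);         /* ((ReceiveRow)) */
        pvm_upkdouble(row, n, 1);
    }

A specification can then use the identifiers SendRow and ReceiveRow in the line-number positions, and Millipede substitutes the actual source line numbers when it reads the source files.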
One of the powerful features of the PCSL/MOPED module is the ability to start out by specifying a very general version of the protocol and checking it; as errors are detected and corrected, or as more knowledge about the protocol is gained, the specification can be specialized step by step. A protocol consists of a number of lines of the following form: pgnamei[e1}{e2}(e3) -> pgname2[ei]{e5}{ee) Each line can be followed by a number of quantif iers of the following form: V id : RelExpression; The first part states that a process created from program pgnamei with instance number e 2 in group e i may send from a send call in line e 3 to a receive call in a process created from a program pgname2 with instance number e 5 in group e 4 with a receive call in line e 6 . Values for e i , e 2 and e 3 can either be omitted, or be a number or an identifier. If e 3 is the identifier x y z , and pgnamei.c 7.4. The PCSL Grammar and Semantics 88 contains a pvm_send followed by a / * ( ( x y z ) ) * / comment, e 3 wi l l be substituted with the actual line number of the send call in the source file. If e i or e 2 are identifiers, or if e 3 is an identifier that does not match any / * ( ( . . . ) ) * / line in the source file, then these are bound to the group, the instance number, or the line number of the process who sent the message. If any or all of e i , e 2 , or e 3 are omitted, no check is done for the missing expression. This is equivalent to a wild card match. Values for e 4 , e 5 , or e 6 can either be expressions, identifiers, or be omitted. Again, if omitted, a wild card match is performed. If an expression is given, this expression is evaluated and matched to the actual values of the group, instance, and line number of the process that received the message. If a new identifier is introduced in any of e 4 , e 5 or e 6 , it is bound to the actual group, instance, or line number of the receiver of the message. If e 6 is an identifier, a similar replacement, as described for e 3 , wi l l take place. A quantifier introduces constraints on an identifier used in the e i , . . . , e 6 . These can be qualified by both lower and upper bounds or bound by other expressions. A message (sent from a sender to a receiver) is a tuple as follows: where Ps and Pr are the program names of the sender and receiver processes, GS,IS, and Ls denote the group, instance, and line of the sender, and Gr,Ir, and Lr denote those of the receiver. Ns and Nr are the total number of processes in group Gs and Gr. Ns and N r are reserved names in PCSL; at check time they contain the values of Ns and /V r of M. 7.4 The PCSL Grammar and Semantics For completeness, the BNF grammar of the protocol constraint language (PCSL) can be found in Appendix B.1. The expressions and the relational expressions of the grammar are a subset of the grammar for expressions in the C programming language with the square root function added. In Appendix B.1, the semantics for computing expressions and relational expressions are shown. The Greek symbol u denotes a symbol table that associates variables with values. Vari-able/value pairs can be added to the symbol table using the E function defined in Figure 7.1. M = (PS,P, {GS,IS,L. ),(Gr,Ir,Lr),Ns,Nr) a U {e = v} if e is an identifier, and e is not bound in a. error if e is an identifier, and e is already bound in a. a otherwise. Figure 7.1: Adding elements to the symbol table. With the symbol table a in place, we now turn our attention to the semantics of a single PCSL 7.4. 
The PCSL Grammar and Semantics 89 line. Recall the following appearance of a specification line L: /3[ei]{e2}(e3)-><5[e4]{e 5}(e6) :: Q where Q is a list of quantifiers. Such a line (referred to by L) is always checked with respect to a message M. The semantic function we create is named B. We say that a message M does not violate a specification line L, if <B[L]]M = true. We briefly explain how a message is checked against a protocol line in the following. Remembering that e i , e 2 and e 3 can either be left blank, a number (constant) or an identifier. We perform the following for each of these: • If ei is a number (ci), e{ is replaced by ait and the quantifier V a * : at = a is added to Q. • If ei is an identifier, replace all of its occurrences by a * . This step is not necessary, but it clarifies the following explanation. • If ei is left blank, replace the blank with ai, and add the quantifier V a , : true to Q. This transformation is applied to each protocol line such that any quantifiers associated with the sender side of a protocol line can be checked separately from the rest of the quantifiers (by looking up quantifiers that bind a , ) . For e 4 , e 5 , and e 6 , apply the following transformation: if ei is left blank, replace ei with 7,, and add the quantifier : true to Q. This transformation is done in order to avoid comparing numbers to empty expressions. The table in Figure 7.2 summarizes the following in depth explanation of how to check a message against a protocol line. If any of the checks past step 2 fai l , the protocol is violated by the message. Before any checking can be performed, we need to add information from the message M to the symbol table. Recall that e 1 ( e 2 and e 3 are al l replaced by Q i , a 2 and a 3 , respectively. We now use the S function to add the bindings ax = Gs, a 2 = Is, and a 2 = Ls to the symbol table a. We are now ready to check the protocol line against the actual message M. • The first step in checking a line against a message is to determine if the actual sender and receiver of the message match the program names specified in the line. The actual sender and receiver are Ps and Pr, and the sender and receiver specified in the protocol line are 0 and 6. Thus, the first check that must be performed becomes the following: (Ps = p A Pr = 8). If this evaluates to true the sender may send a message to the receiver. • Recall the transformation performed on the protocol line, that is, replacing or adding a l t a2, and a 3 for the expressions e i , e 2 , and e 3 . This transformation may result in up to three 7.4. The PCSL Grammar and Semantics 90 quantifiers a» : r i t which must be checked as wel l . Here, checked means that the values are within the boundaries of their definitions. r» is a relational expression, so the semantic function Tl is used in the following way: f\Q3q=yai:ri Tllrijcr. If this expression evaluates to true we know the sender part of the message matches the line. • If the first two steps of the check are true, the protocol line matches the sender; now we need to check if the receiver of the message matches the receiver part of the protocol line. First, if any of e 4 , e 5 or e 6 are identifiers, add these to the symbol table with the bindings of the receiver group, instance or line number, respectively, (any line number identifiers that existed in a / * ( ( . . . ) ) * / comment wil l have been replaced already). All other quantifiers are checked also using the Tl function as follows: f \ Q 5 q = V v . 
r {v, •) e a A fc[r]]cr. We can restrict the conjunction to only consider quantifiers where v ^ ait but to keep it simple we do not bother with this restriction as checking the sender quantifiers one more time does not change anything. • We now need to check that the actual receiver of the message may indeed receive it according to the specification. That entails checking three properties: The group, the instance, and the line number. - The expression e 4 is evaluated using the £ semantic function for evaluating expressions. Note, if e 4 is an identifier, it would have been inserted into the symbol table with the value Gr- The resulting value is then compared to the actual group number of the receiver, Gr: £|e4]]<7 = Gr. - A similar check involving the use of £ is performed on the instance number in the following way: £[e5]]<7 = IT. - Finally, the line number of the actual receiver is compared to the value we get by evaluating e 6 : £[e 6]]cr = LT. Note, if for example e 4 is left blank, it then gets replaced by 74, and the quantifier V7 4 : true is added to Q. This quantifier always evaluates to true; however, since e 4 = 74 is an identifier, the value Gr is associated with it; that is, before performing the quantifier check on the receiver part, the binding 74 = Gr is inserted into a. Now, when the check £[[74J<7 = GT is performed, it becomes trivially true because of the binding of 7 4 in the symbol table. If any of the checks after the second check fai l , an error must be reported, as the receiver should not have received the message, or the sender should not have sent the message to the receiver. Figure 7.2 shows the six steps of the checking algorithm. We can summarize the check by the semantic function B, which is shown in Figure 7.3. 7.5. Examples Step Check Comment 1 (Ps = pAPr= S) If false move on to the next line. If true continue. 2 AQ3?=Va ;:r, This checks if the sender part of the mes-sage matches the sender part of the proto-col line. If false move on to the next line. If true continue. 3 A Q 9 , = v „ ( ( V ) ^ A W Check the rest (all) of the quantifiers. If false report a quantifier error. If true continue. 4 £le4<r = Gr Check if the receiver group may receive this message. If false report a group error. If true continue. 5 £[e 5Jer = Ir Check if the receiver instance may receive this message. If false report an instance error. If true continue. 6 £{e6}a = Lr Check if the receiver line may receive this message. If false report a line error. If true protocol line is not violated by the message with respect to the semantics of Figure 7.3 Figure 7.2: Checking a protocol line takes six steps. If the check fails in the first 2 steps, it is because the line did not match the sender, so move on to the next. If any of the checks in the subsequent steps fai l , it constitutes an error; the sender is matched to the protocol line, but the receiver did not match the line. B[LJM = (P„ = p A Pr = 5) A I f\ ((v, •) e a A 7^r>) I A \QBq=Vv.r J (Sletja' = G P) A {£|e5]a' = Ir) A (£[ee]ff' = Lr) where a = E[e 3](L.)(S[e 2K/.)(E[ei](G.)0)) a' = E [ e 6 ] ( L r ) ( E [ e 5 ] ( J P ) ( £ [ e 4 K G P ) a ) ) Q = Vv0 : r 0 ; . . . ; Vw„ : rn; r» is a relational expression. Figure 7.3: Semantics for a PCSL line. 7.5 Examples In this section we present a number of examples of how to specify protocol constraints. 7.5. Examples 92 7.5.1 The Simplest Protocol As stated in the previous section, a protocol constraint specification can start out being very general. 
In Figure 7.4 the simplest possible protocol is shown.

Figure 7.4: Example of β[ ]{ }( ) -> β[ ]{ }( );

This protocol consists of only the following one line (we use the Greek symbol β as a shorthand notation to represent a program name):

    β[ ]{ }( ) -> β[ ]{ }( );

This line states that any β process can send to any other β process regardless of the group, instance or line number. The first part of the picture in Figure 7.4 shows that β processes communicate among themselves. The second part shows that any β process can communicate with any one β process. Lastly, the rightmost part shows that all sends in any β process may send to any receive in any β process, including itself.

We can specialize this very simple specification to represent a system where β process number i can send to another β process with instance number i + 1, and where process number n - 1 sends to instance number 0. In summary, we have the following:

    β[ ]{0}( ) -> β[ ]{1}( );
    β[ ]{1}( ) -> β[ ]{2}( );
    ...
    β[ ]{n-1}( ) -> β[ ]{0}( );

Alternatively, in short notation using a quantifier, we arrive at the following:

    β[ ]{i}( ) -> β[ ]{(i + 1)%n}( ) :: ∀ i : 0 <= i && i <= n - 1;

In Figure 7.5 this protocol is shown graphically with a fully quantified PCSL line.

Figure 7.5: Example of β[ ]{i}( ) -> β[ ]{(i + 1)%n}( ) :: ∀ i : 0 <= i && i <= n - 1;

7.5.2 Pipe-and-Roll Matrix Multiplication

Consider a more complex example that also includes the use of line numbers. The pseudo code for the pipe-and-roll matrix multiplication algorithm [FJL+88] is shown in Figure 7.6 (the master's code) and in Figure 7.7 (the slave's code). Processes communicate subblocks of a matrix in a two-dimensional grid, sending up and right to neighbor processes. (A graphical illustration of this protocol can be seen in Figure 7.8.)

    Let N*N be the number of processors
    Map concurrent computer on to array of N*N processors
    Distribute subblocks of A and B to processors
    Await subblock results in matrix C

Figure 7.6: Pseudo code for the master of the pipe-and-roll matrix multiplication algorithm.

    Initialize subblock matrix C to 0
    Receive subblocks A and B
    for i=0 to N-1 do {
        T = Pipe_A()
        C = C + T*B
        Roll_B()
    }
    Send subblock C to master

    Pipe_A() {
        Determine the source processor of the pipe
        Determine the last processor of the pipe
        if (this processor is the source processor) then
            Copy A to T
        else if (processor is not the source processor) then
            Receive T from processor on the left
        if (processor is not the last processor in pipe) then
            Send T to processor on the right
        return T
    }

    Roll_B() {
        Send B to processor above (with wrap around)
        Receive B from processor below
    }

Figure 7.7: Pseudo code for the slave of the pipe-and-roll matrix multiplication algorithm.

As we can see from the two functions Pipe_A and Roll_B, a process executing a pipe call can only send to the process to the right of it and receive from the process to the left of it, and when executing a Roll_B, it can only send to the process above it and receive from the process below it (assuming the processes are arranged in a grid of size N x N). Let us assume, for simplicity, that N = 4 in the following; that is, we are working with a 4 x 4 grid of processes. A process with instance j performing a Pipe_A operation can send to process (j + 1)%4, and a process j performing a Roll_B operation can send to process (j + 12)%16.
This can be expressed by the following two PCSL lines: Matrix[ ]{j}(SendPipe) -> Matrix[ + l)%A}(ReceivePipe) :: Vj : j < 16; Matrix{ ]{j}(SendRoll) -> Matrix[ + 12)%16}(ReceiveRoll) :: Vj : j < 16; The graphical representation can be seen in Figure 7.8. This only includes the communication between the worker processes (called Matrix). To add protocol specification lines to check communication between the master (Master) and 7.5. Examples 95 Matrix Matrix ' 0 a •O-•0--0-•0-Send / * Send Pipe */ Receive /* ReceivePipe */ Send /* Send Roll */ 1 fc- Receive/* SendRoll */ \ Figure 7.8: The pipe-and-roll part of the matrix multiplication algorithm. the slaves, add the following two lines to the specification file: Master[ ]{0}(SendParams) —> Matrix[ ]{ }(ReceiveParams); Matrix[]{}(SendResult) —\ Master[]{0}(ReceiveResult); Also, note that the group numbers are left out to simplify the description of the protocol. Limitations By inspecting the communication pattern in the program pseudo code, it becomes clear that the pipe communication does not need to wrap around, that is, the last processor in the pipe does not need to send anything to the source processor. The source processor of each round of pipes differs from the one in the previous round. It is not directly possible to specify a protocol that reflects such a communication pattern that depends on the state in the application. In Section 7.10 we describe a.way to resolve this problem and expand the set of protocols that can be specified. 7.5.3 A Partial Differential Equation Solver Let us consider a parallel master/slave program to solve a hyperbolic differential equation. There is one master process and n slave processes. Figure 7.9 shows the algorithm for the master, and Figure 7.10 for the slaves. Version 1 of the Protocol Constraint Specification The most general protocol, V\ (covering all sends) that we can specify for the master/slave system is illustrated in Figure 7.11. The V\ protocol contains 3 lines: 7.5. Examples 96 Send parameters to slaves 0 , . . , J f - l /* ((MS)) */ Repeat N times { Receive r e s u l t from sl a v e } /* ((MR)) */ Figure 7.9: Pseudo code for master algorithm for a differential equation. Receive parameters from master /* ((SR)) */ Repeat n times { i f ( i d > 0) t h e n Send to slave id - 1 i f ( i d < N-1) t h e n Send to slave id + 1 /* /* ((SI)) ( ( S 2 ) ) */ */ C a l c u l a t e i f ( i d > 0) t h e n Receive from slave id - 1 i f ( i d < N - 1) t h e n Receive from slave id + 1 } Send r e s u l t to the master /* /* ( (RD ) ( ( R 2 ) ) */ */ /* ((SS)) */ Figure 7.10: Pseudo code for slave algorithm for a differential equation solver. Master ^ Slave Master[]{}( ) Slave{}{}() Slave[}{}() Slave[}{}(); Master[]{}( ); Slave[}{}(); Figure 7.11: V\— Version 1 of the protocol specification. 1. Any master program can send to any slave program regardless of group, instance, or line number. 7.5. Examples 97 2. Any slave program can send to any master program regardless of group, instance, or line number. 3. Any slave program can send to any other slave program regardless of group, instance, or line number. Vx is not very useful; it does not specify anything about the communication between the slaves. First, we extend V\ for master group 0 (only one group of master programs is spawned, and this group contains only one process with instance 0). This changes the left part of the first line and the right part of the second line in Figure 7.11 to Master[0]{0}( ). 
Likewise, for the slaves, there is only one group of slaves spawned, so lines 1, 2, and 3 can be changed to Slave[0]{}(). Let V[ denote this version of the protocol specification, as shown in Figure 7.12. 1: Master[0}{0}{) -> Slave[0]{}(); 2: Slave[0}{}() -> M aster [0}{0}(); 3: Slave[0]{}() -> Slave[0}{}(); Figure 7.12: V[ - Extended version 1 of the protocol specification. Version 2 of the Protocol Constraint Specification By inspecting the code in Figure 7.10 we see that slave number i can send to slave number i + 1 if i < N - 1 (assuming the system has N slave processes), and slave number i can send to slave number i - 1 if i > 0. Figure 7.13 shows the protocol as a graphical representation. We can incorporate this into the protocol specification and arrive at the second version, which is shown in Figure 7.14. Note, that line 3 is split into line 3a (i sends to i + 1) and line 3b (i sends to i - 1). Also note the use of the two quantifier expressions following these lines. Version 3 of the Protocol Constraint Specification Looking closer at the lines 3a and 3b in Figure 7.14, and comparing these with the pseudo code in Figure 7.10, we see that the V? protocol specification does not specify that the send marked s i always sends to the receive marked R l , and that the send marked S2 always sends to the receive marked R 2 . If, by mistake, a message were delivered to the wrong receive, there wil l be a violation of the communication protocol, so we need to add this information to the specification. Thus, line 3a represents the message passed between send S i and receive R l , and line 3b represents the message passed between send S2 and receive R 2 . Adding this to the specification we obtain the third version, as shown in Figure 7.15. For completeness, we added line information about the parameter and result messages sent to and from the master. 7.6. Online Checking 98 Slave[0]{0} Slave[0]{1} Master[0]{0} <5 Slave[0]{2} Slave[0]{3} Slave[0]{4} Figure 7.13: Graphical representation of V2 - the second version of the protocol specification. 1: Master[0}{0}{) ->• Slave[0]{}(); 2: Slave[0}{}{) -»• • M a s i e r [0]{0}( ); 3a: 5/aue[0]{i}( ) ->• S7aue[0]{i + 1}( ) :: V i : i < n - 1; 3b: Slave[0]{i}{) -+ Slave[0]{i - 1}( ) :: V i : 0 < i; Figure 7.14: V2 — Version 2 of the protocol specification. 1: Master[0]{0}(MS) -> S/ovetOllJCSfi); 2: 5/aue[0]{}(55) -> Mosier[0]{0}(M JR); 3a: Slave[0\{i}{Sl) Slave[0}{i + 1}(RI) :: V i : i < n - 1; 3b: 5loue[0]{t}(52) -> 5/owe[0]{i - l}(iJ2) :: V t : 0 < i; Figure 7.15: - Version 3 of the protocol specification. Figure 7.16 shows an extended version of P3, where we added information about the instance of the slaves in lines 1 and 2. Furthermore, we added an upper bound for i in line 3b, and a lower bound for i in line 3a. Al l these changes do not change the protocol in any way, but allow the system to predict which sends/receives are legal. V3 can only be checked, not predicted (see Section 7.8 for more information on protocol prediction). 7.6 Online Checking MOPED can be used in two different modes: online or offline. The online mode checks the specification as the communication takes place; each message in the system is captured by Millipede and checked against the constraint specification. If an error occurs, that is, if a message 7.7. Offline Checking 99 1: Master[0}{0}(MS) 2: Slave[0}{i}(SS) 3a: Slave[0]{i}(Sl) 3b: Slave[0]{i}(S2) Slave[0}{i}(SR) :: V i : (0 < i) kk (i < n); -» Masier[0]{0}(Mi?) 
5/ove[0]{z + l } (m) 5Iave[0]{i- l}(i22) V i : (0 < i) k k (i < n); V i : (0 < i) k k (i < n - 1); V i : (0 < i) k k (i < n); Figure 7.16: - Extended version 3 of the protocol specification. violates the protocol specification, a message is displayed in the Millipede status window. When developing programs, this approach can be used incrementally, as shown in the example in Section 7.5.3. The first version of the specification can be very general, and then gradually refined until errors are discovered. Once the error is corrected, the specification can be further refined if the program stil l does not function correctly. 7.6.1 Strictness A protocol specification can be checked using different levels of strictness. When using the refinement technique, that is, starting out with a simple specification, some messages might not match any lines, thus violating the protocol. The user might not perceive this as a violation as the protocol is not fully specified; if this is the case, a lower level of strictness can be adopted. Table 7.1 shows the 3 different levels of strictness that MOPED currently supports. Level Description 1 0 or more protocol specification lines may match with respect to program name and sender quantifiers. 2 At least one protocol specification line must match with respect to program name and sender quantifiers. 3 Exactly one protocol specification line must match with respect to program name and sender quantifiers. Table 7.1: The MOPED Strictness levels. Strictness level 1 should be used when the protocol has not yet been fully specified, level 2 when the protocol is fully specified, but not uniquely (i.e., a message can match more than one protocol line), and level 3 if a ful l specification is given. 7.7 Offline Checking As described in the previous section, Millipede can check messages against the protocol specifica-t ion, while the program is running. However, if Millipede is generating log files while the program executes, the checking can also be performed offline. All the information needed to check the protocol can be extracted from the set of log files and the corresponding project file. 7.8. Protocol Prediction 100 7.8 Protocol Prediction As mentioned earlier, if all constraint lines are fully quantified with bounds for each variable, Millipede can generate a list of all possible valid send/receive combinations. For the example in Figure 7.16, the prediction table is shown in Table 7.2 (for n = 4). Sender Receiver Line Master[0}{0}(MS) -»• Slave[0]{0}(SR) -> Slave[0]{l}(SR) -> Slave[0]{2}(SR) -> Slave[0]{3}{SR) | Slave[0]{O}(SS) Slave[0}{0}(Sl) -> Master[0]{0}(MR) -> Slave[0]{l}(Rl) 2 3a Slave[0}{l}{SS) Slave[0]{l}(Sl) Slave[0]{l}(S2) -> Master[0}{0}(MR) -> Slave[0]{2}(Rl) -> Slave[0]{0}{R2) 2 3a 3b Slave[0}{2}(SS) Slave[0}{2}{Sl) Slave[0]{2}{Sl) -» Master{0}{0}(MR) -)• 5/awe[0]{3}(i?l) -> S/aue[0]{l}(.R2) 2 3a 3b Slave[0]{3}{SS) Slave[0}{3}(Sl) -> Mas£er[0]{0}(M.R) -> 5/awe[0]{2}(i?2) 2 3b Table 7.2: Prediction table for the V3 protocol specification. A prediction table can help determine if the protocol specified matches what the user had in mind. Naturally, there is always a risk that an error is present in the protocol specification; this is similar to the risk mentioned in Section 7.2 of introducing errors into the implementation of a protocol that is verified using a model checker. However, we believe that the number of errors introduced here should be considerably smaller than in the implementation stage. 
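Before turning to the implementation, the following simplified C sketch illustrates the six-step check of Figure 7.2 for a single message against a single protocol line. In MOPED the sender and receiver expressions and the quantifiers are evaluated over the parse tree and the symbol table; here they are reduced to concrete values, an ANY sentinel for blank (wild card) fields, and callbacks, so all type and field names are illustrative only and not the actual Millipede data structures.

    #include <string.h>

    #define ANY (-1)   /* stands in for a blank (wild card) field in a line */

    /* A message as captured by the runtime (cf. the tuple M in Section 7.3.1). */
    typedef struct {
        const char *sender_prog, *receiver_prog;   /* Ps, Pr             */
        int s_group, s_instance, s_line;           /* Gs, Is, Ls         */
        int r_group, r_instance, r_line;           /* Gr, Ir, Lr         */
    } message_t;

    /* One protocol line with the receiver expressions e4, e5, e6 already
     * evaluated to numbers (or ANY); the quantifier checks are abstracted
     * into callbacks that would consult the symbol table in the real tool. */
    typedef struct {
        const char *sender_prog, *receiver_prog;   /* beta and delta     */
        int (*sender_quantifiers_ok)(const message_t *);
        int (*all_quantifiers_ok)(const message_t *);
        int r_group, r_instance, r_line;
    } protocol_line_t;

    typedef enum { NO_MATCH, OK, QUANTIFIER_ERR,
                   GROUP_ERR, INSTANCE_ERR, LINE_ERR } check_t;

    /* The six steps of Figure 7.2.  Steps 1 and 2 only decide whether the
     * line applies to this sender; a failure in steps 3 to 6 is a protocol
     * violation that is reported to the user. */
    check_t check_line(const protocol_line_t *l, const message_t *m)
    {
        if (strcmp(m->sender_prog, l->sender_prog) != 0 ||
            strcmp(m->receiver_prog, l->receiver_prog) != 0)
            return NO_MATCH;                                      /* step 1 */
        if (!l->sender_quantifiers_ok(m))
            return NO_MATCH;                                      /* step 2 */
        if (!l->all_quantifiers_ok(m))
            return QUANTIFIER_ERR;                                /* step 3 */
        if (l->r_group != ANY && m->r_group != l->r_group)
            return GROUP_ERR;                                     /* step 4 */
        if (l->r_instance != ANY && m->r_instance != l->r_instance)
            return INSTANCE_ERR;                                  /* step 5 */
        if (l->r_line != ANY && m->r_line != l->r_line)
            return LINE_ERR;                                      /* step 6 */
        return OK;
    }

The strictness levels of Table 7.1 can then be expressed over the set of lines for which check_line returns something other than NO_MATCH.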
7.9 Implementation

As with all other Millipede modules, the MOPED module is a separate process that runs the protocol checking algorithm. MOPED parses the constraint specification file using a parser generated from the BNF grammar in Appendix B.1; this parser is generated using Flex [Pax98] and Bison [CSH02]. The parser returns a parse tree representing the specification, and messages can be checked against this specification by evaluating the tree using the information about the sender, receiver, group, and instance number. Messages are provided by the Millipede runtime system; when the module is run offline (i.e., the application is not currently running) the messages are extracted from the log files. The protocol specification lines are checked against a message one at a time, and depending on the strictness level, errors are reported to the user.

7.10 Discussion

A number of interesting extensions could be added to PCSL and the MOPED checking module in order to strengthen the quality of the checks performed. We briefly describe some of these in this section.

Since we are working with message passing systems, such as PVM and MPI, where all sends are annotated with a message tag, it should be possible to add information about message tags to a PCSL line. This means expanding

    β[ ]{ }( ) -> δ[ ]{ }( )

to

    β[ ]{ }( )< > -> δ[ ]{ }( )< >

where < > represents an expression that determines the message tag. To ease choosing from a small number of values, another useful feature would be to allow set expressions of the form e ∈ {v1, v2, ..., vn}.

So far, the focus has been on the senders and receivers of messages. However, errors also occur because the content of messages is incorrect. Another useful extension is to allow each PCSL line to be associated with one or more templates describing the structure of the message being sent and received. This can be achieved easily in Java by defining messages as objects and using the reflection mechanism (the instanceof operator) to determine the type of an incoming object. In Occam [May83] the notion of typed channels assures that the correct type of data is always received on a channel. However, in PVM and MPI the notion of channels does not exist, and static analysis and type checking of message passing are not always possible. A possible solution is to specify message content using a specification language such as XDR [Sri95] or XML [XML98] to define data types, or to use the MPI_Datatype mechanism, which associates a data type with every send and receive call.

The above mentioned extensions to PCSL/MOPED not only allow for a more refined protocol specification, which results in more rigorous checks, but also make it possible to check that messages contain the correct type of data. This last point again shows an example of an overlap between levels; at the protocol level we are also concerned with the content of the messages, which theoretically belongs to the Message Level.

7.10.1 State Dependent Communication

As the last part of this section, we briefly return to the problem stated in Section 7.5.2: defining a protocol that depends on values stored in the program at runtime. The example at hand is illustrated more clearly in Figure 7.17 (the roll part of the protocol is left out for clarity).
Depending on the program variable k, a number of processes do not send anything; this set of non-sending processes varies according to the row in which the process is located, as well as the number k.

Figure 7.17: The four different stages (k = 0, 1, 2, 3) of the pipe operation.

A simple formula that determines which instance numbers should not send, given a value k and a row number, is the following:

    (row + k + Ns - 1)%Ns + row * Ns

where Ns is the group size. We can write the following protocol specification line:

    Matrix[ ]{j}(SendPipe) -> Matrix[ ]{(j + 1)%4}(ReceivePipe)
        :: ∀ j : j >= 0 && j < Ns && j != (row + k + Ns - 1)%Ns + row * Ns;

This requires the values of the program variables k and row (row can be computed as Ns/4). These can be obtained by adding lines to the program in the following way:

    protocol_sym(row);
    protocol_sym(k);
    pvm_send(...);          /* ((SendPipe)) */

The program variables row and k are then packed and sent to the protocol checking module (and added to the log files), and inserted, along with Ns, into the symbol table σ at check time. Appendix B.2 shows an entire debugging session using the protocol specification language and the MOPED checking tool.

7.11 Summary

In this chapter we presented a tool at the protocol level that can assist the user in checking that the messages in a system adhere to a protocol specification. The specification of the protocol is comparable to assertions in C; if a message violates a line in the specification, the user is notified about the line as well as the offending message.

We have designed PCSL to be small in comparison to other specification languages, to avoid having the user learn a completely new language. We have also avoided adding temporal constraints to the language in order to reduce its complexity to the simpler task of matching message patterns. Adding temporal constraints is possible; however, the added expressiveness complicates the language compared to the simpler task of message matching.

A number of interesting extensions have been proposed; these extensions would provide an even more flexible tool to aid the user when developing and debugging protocols.

Chapter 8

Buffer Allocation in Message Passing Programs

"How extremely stupid not to have thought of that."
— Thomas Huxley on reading 'Origin of the Species'

"Just because a problem is NP-complete doesn't mean we can't try to solve it; as a matter of fact, those problems are the only interesting ones."
— My M.Sc. supervisor Peter Møller-Nielsen, The University of Aarhus

In the previous two chapters we described two tools at the protocol level of the Millipede tool. In this chapter, we present a number of theoretical results concerned with buffer allocation in message passing systems, plus a tool at the protocol level of Millipede. The motivation behind the work presented in this chapter is the simple question: can we determine the minimum number of buffers needed for an asynchronous message passing program to be guaranteed not to deadlock because of an insufficient number of buffers?
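As a small, concrete illustration of why this question matters, consider the following fragment (ours, not from the thesis), in which two PVM processes exchange blocks head-to-head using the buffered, nonblocking send and blocking receive discussed below. With at least one free send buffer per process the exchange completes; if both sends must wait for buffer space and none is available, both processes block in pvm_send and the pair deadlocks. The tag value and block size are arbitrary.

    #include <pvm3.h>

    #define BLOCK 1024
    #define TAG_EXCHANGE 7

    /* Head-to-head exchange with a partner process: send first, then receive.
     * Under buffered, nonblocking sends this completes as soon as one send
     * buffer per process is available; if both sends have to wait for buffer
     * space, both processes block here and the pair deadlocks. */
    void exchange(int partner_tid, double *mine, double *theirs)
    {
        pvm_initsend(PvmDataDefault);
        pvm_pkdouble(mine, BLOCK, 1);
        pvm_send(partner_tid, TAG_EXCHANGE);   /* may wait for a free buffer */

        pvm_recv(partner_tid, TAG_EXCHANGE);   /* blocks until matched       */
        pvm_upkdouble(theirs, BLOCK, 1);
    }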
This question leads to a detailed investigation of several buffer related questions under different buffer placement schemes [BBW02]. An example showing the importance of this problem is the following: assume a program has been developed and tested on a cluster of machines using a small problem set. When the program is later executed on a larger problem on a production machine, it deadlocks due to lack of buffers; because the bigger problem size leaves the production machine with fewer buffers available, an otherwise working program might deadlock.

However, the answer to the original question turns out to be 'no, not for systems like MPI and PVM,' but other useful discoveries and results emerge in the process. One in particular, referred to as the Nonblocking Buffer Allocation Problem, turns out to be tractable, and thus a useful tool to be included in Millipede. In addition, our ability to solve this problem enables us to compute approximations for the original, intractable, problem.

8.1 Motivation and Related Work

The multiprocess system that we consider is a collection of simultaneously executing independent asynchronous processes that compute by interspersing local computation and point-to-point message passing between processes; these are referred to as A-computations in [CMT96]. Such a system is equivalent to one with three different events, such as the one defined by Lamport [Lam78]: send events, receive events and internal events. As well, we only consider programs that are repeatable [CL94a, CL94b] when executed in an unrestricted environment, that is, programs with static communication patterns. While this narrows the class of programs we consider, the class of applications with static communication patterns is still considerable.

The message passing primitives considered are the traditional, asynchronous, buffered communications: the nonblocking send and the blocking receive, which are the standard primitives used in MPI and PVM. Cypher and Leu formally define the former as a POST-SEND immediately followed by a WAIT-FOR-BUFFER-RELEASE, and the latter as a POST-RECEIVE immediately followed by a WAIT-FOR-RECEIVE-TO-BE-MATCHED [CL94a, CL94b]. Informally, the send blocks until the message is copied out of the process into a send buffer; the receive blocks until the message has been copied into the receive buffer.

One aspect of portability introduced in the MPI standard [Don94] is that of a safe program. As defined in the standard, a program is safe if it requires no buffering, that is, if it is synchronous. Safe programs can be ported to machines with differing amounts of buffer space. Determining whether a system is buffer independent, that is, whether the system is 0-safe, was investigated in [CL94a, CL94b]. However, to demand that the program execute correctly with no buffering is restrictive. Buffering reduces the amount of synchronization delay and also makes it possible to offload communication to the underlying system or network components, thus overlapping communication and computation.

The notion of safety, as introduced in the MPI standard, underscores the concern that, when buffer resources are unknown, asynchronous communication can potentially deadlock the system. This notion is extended to k-safety, in order to better characterize the buffer requirements of the program, thus making it safe to take advantage of asynchronous communication. The definition of k-buffer correctness is introduced by Bruck et al.
[BDH+95] to describe programs that complete without deadlock in a message passing environment with k buffers per process. Similarly, Burns and Daoud [BD95b] introduce guaranteed envelope resources into LAM [GBV94], a public domain version of MPI. Guaranteed envelope resources, a weaker condition than k-safety, are used in LAM to reserve a guaranteed number of message header slots on the receiver side.

In our model, the interesting systems are buffer dependent, and require an unknown number of buffers to avoid deadlock. More recently in modern clusters, greater overlap of computation and communication is possible by offloading communication onto network interface cards. Unfortunately, most NICs have orders of magnitude less memory than the average host, which makes message buffers a limited resource. Thus, programs that use asynchronous message passing, and that execute correctly otherwise, might deadlock when executing on a system where parts of the message passing system have been offloaded to the NIC. These issues have been investigated in several papers [Don94, DHHW93, FBH+92, KW01].

Unfortunately, the value of k, for determining k-safety, is usually not known a priori. We have investigated the complexity of determining the minimum value of k for programs using asynchronous buffered communication with static communication patterns. A program is said to be k-safe if k buffers are enough to guarantee that the program never deadlocks due to insufficient buffers. In this chapter, we consider the following three problems, all related to determining buffer requirements for asynchronous message passing programs:

BAP - the Buffer Allocation Problem: What is the minimum number of buffers required to ensure deadlock free execution (i.e., determine k for k-safety)?

BSP - the Buffer Sufficiency Problem: Given a buffer assignment, can we determine whether or not the assignment is sufficient to avoid deadlock?

NBAP - the Nonblocking Buffer Allocation Problem: What is the minimum number of buffers needed to allow for an asynchronous execution (i.e., no send calls block)?

The complexity of these questions depends as well on the type of buffers provided by the system. We consider the following types of buffering schemes. In the first three schemes, the buffers are either allocated on the send side only, the receive side only, or mixed and allocated on both sides. Finally, we also consider schemes that allocate buffers on a per channel basis.

In the following section, we present the results of our investigation for the different buffer allocation schemes; we also present a tool in Millipede to assist the user in solving the Nonblocking Buffer Allocation Problem. The solution to this problem is an upper bound for the Buffer Allocation Problem, and we later return to how the user can use this tool to reduce the buffer requirements of a system by inserting barrier synchronizations.

Variations of these problems have been investigated by the operations research community [Ana89, Rei87, She75]. In these models, events or products are buffered between various stations in the production process; however, the arrival of these events is governed by probability
In our model, since processes are asynchronous, the time for a message to arrive is nondeterministic; that is, a message may take an arbitrarily long time to arrive and a process may take an arbitrarily long time to perform a send or a receive. To determine the minimum number of buffers, the execution of a system can be modeled using a (coloured) Petri net [Jen92]. In order to determine whether the system can reach a state of deadlock, the Petri net occurrence graph [HJJJ85] is constructed, and a search for dead markings is performed. However, the size of the occurrence graph is exponential in the size of the original Petri net. 8.2 Buffer Allocation Problems We now introduce a number of definitions to formalize the model we wi l l work with. Let S be a multiprocess system with n processes and Ei communication events occurring in process i; a communication event is either a send or a receive cal l . A communication graph of 5 is a directed acyclic graph G(S) = (V, A) where the set of vertices V = {viiC | 1 < i < n, 0 < c < (Ei + 1) corresponds to the communication events and the arc set A consists of two disjoint arc sets: the computation arc set P and the communication arc set C. Each vertex represents an event in the system: vertex vifi represents the start of process i, vertex vitC, 1 < c < Eit represents either a send or a receive vertex, and finally, vertex vit(Ei+i) represents the end of a process. An arc (w;,c, Vt,c+i) e P, 0 < c < Eit represents a computation within process i and an arc (vitS,vjit) e C represents a communication between different processes, i and j, where Wj, s is a send vertex, and vj<t is a receive vertex. Note, process arcs are drawn without orientation, but are always oriented downward. These communication graphs are comparable to the time-space diagrams-without internal events-noted in [Lam78]. A multiprocess system S is unsafe if a deadlock can occur due to an insufficient number of available buffers; if S is not unsafe, then 5 is said to be safe. Figure 8.1 shows an example of a system that is unsafe; with no buffers this system always deadlocks. A per-process buffer assignment is an n-tuple B = (bi,b2,... ,bn) of nonnegative integers representing the number of buffers for each process. Similarly, a per-channel buffer assignment is a g-tuple B - {h,b2,... ,bq),q = (£), representing the number of buffers allocatable by the application; ideally, as few buffers as possible should be allocated. Two natural decision problems arise from this optimization problem. Given a communication graph G(S) and a nonnegative integer k, the Buffer Allocation Problem (BAP) is deciding whether there exists a buffer assignment B such that S is safe and Yl h < In order to solve this problem we need to solve a simpler one. Suppose we are given a buffer assignment B and a communication graph G(S), the Buffer Sufficiency Problem (BSP) is deciding whether the assignment is sufficient to make 5 safe. 8.2. Buffer Allocation Problems 108 Figure 8.1: An unsafe system-without buffers this communication graph always dead-locks. In addition, we can require that no process in the system 5 should ever block on a send. Given a communication graph G(S) and a nonnegative integer k, the Nonblocking Buffer Allocation Problem (NBAP) is deciding whether there exists a buffer assignment B, such that no send in S ever blocks, and J^h < k. As we see later, this problem plays a key role in the buffer reduction technique we describe in Section 8.4. 
In Section E.1 in Appendix E, we formally present the graph framework that we use to prove our results. The first result that we prove is Theorem 8.2.1, concerning the Buffer Allocation Problem.

Theorem 8.2.1 Assuming buffers are allocated on the receiver side, the Buffer Allocation Problem (BAP) is NP-hard.

Proof: See page 147 in Appendix E.

Theorem 8.2.1 shows that determining the minimum number of buffers needed for a program to be k-safe, that is, determining the value k, is intractable. Thus, there is likely no polynomial time algorithm to solve this problem. To illustrate this problem, consider the graph shown in Figure 8.2. To assure deadlock free execution, such a graph must have at least one buffer. The buffer can be placed in either of the processes; however, the choice of the process might affect future buffer requirements.

The graph in Figure 8.2 is an example of a graph used throughout the proofs of both BAP and BSP in Appendix E; we refer to such a graph as a t-ring. A t-ring is a subgraph of a communication graph G(S), consisting of t > 1 processes, such that in each of the t processes there is a send vertex s_{i_j,c_j} and a receive vertex r_{i_j,d_j}, with c_j < d_j, 1 <= j <= t, and such that the arcs (s_{i_1,c_1}, r_{i_t,d_t}) and (s_{i_{j+1},c_{j+1}}, r_{i_j,d_j}), 1 <= j < t, are in A. The next step in our investigation is to consider a potentially easier problem, referred to as the Buffer Sufficiency Problem.

Figure 8.2: A general t-ring in G(S).

Theorem 8.2.2 Assuming buffers are allocated on the receiver side, the Buffer Sufficiency Problem (BSP) is coNP-complete.

Proof: See Section E.3.2 in Appendix E.

As Theorem 8.2.2 states, this problem is as hard as BAP; that is, in polynomial time we cannot verify that a buffer assignment is sufficient to avoid deadlocks. These two results mean that adding a polynomial time automated analysis to Millipede to determine the exact minimum number of buffers to assure k-safety is difficult. However, as the opening quote of this chapter states, just because a problem is NP-hard does not mean we cannot provide approximations and heuristics. In Section 8.4, we present a technique that utilizes barriers to reduce buffer requirements; by inserting barriers or making certain communications synchronous, the buffer requirements for NBAP are reduced. Since the result of NBAP is still an upper bound for BAP, this is a way of approximating the number of buffers required to assure safe execution in the original program, augmented with the barriers. We now turn to the Nonblocking Buffer Allocation Problem.

8.3 The Nonblocking Buffer Allocation Problem

We have shown the Buffer Allocation Problem to be intractable; interestingly, the problem of determining the minimum number of buffers needed to assure nonblocking sends for an asynchronous message passing program, referred to as the Nonblocking Buffer Allocation Problem (NBAP), is tractable. This means we have a trivial upper bound for BAP. However, examples exist where the result of the NBAP algorithm, used as an approximation for BAP, is an unbounded overestimation of the optimal solution for BAP. In Section 8.4 we return to such an example.

The NBAP problem is stated as follows for a multiprocess system S: does there exist a buffer assignment B such that no send in S ever blocks, and the sum of the b_i is at most k? To explain the algorithm we need to introduce a few definitions. Consider a communication graph G(S) and two vertices, v_{i,c+k} and v_{i,c}, in G(S).
Vertex v_{i,c+k} is communication dependent on vertex v_{i,c} if v_{i,c} is the start vertex, or if there exists a vertex v_{j,d}, j != i, such that there exists a path from v_{i,c} to v_{j,d} and the arc (v_{j,d}, v_{i,c+k}) is in A. See Figure 8.3. Vertex v_{i,c+k} is terminally communication dependent on vertex v_{i,c} if v_{i,c+k} is communication dependent on v_{i,c} and is not communication dependent on any of the vertices v_{i,c+l}, 0 < l < k.

Figure 8.3: v_{i,c+k} is communication dependent on v_{i,c}.

The algorithm for computing the minimum buffer assignment to assure nonblocking sends is shown in Figure 8.4. Section E.4 in Appendix E contains the proof of the correctness of the NBAP algorithm.

1. For each receive vertex v_{i,t}, determine its terminal communication dependency, vertex v_{i,c}, where t > c.
2. Set I_{i,t} = [c, t] to be the interval between vertex v_{i,c} and vertex v_{i,t}.
3. For each process i, compute b_i, the maximum overlap over all intervals I_{i,t}.
4. B = (b_1, b_2, ..., b_n) is the optimal nonblocking buffer assignment.

Figure 8.4: Algorithm for computing an optimal nonblocking buffer assignment.

The time between when a message can arrive at process i and when it is received by the receive call corresponding to vertex v_{i,c} is represented by the interval I_{i,c}. Each of these intervals must have a buffer to ensure nonblocking sends. Hence, the minimum number of buffers, b_i, is the maximum overlap over all intervals within process p_i. See page 155 for a detailed description of the specific techniques for computing the maximum overlap in polynomial time.

To illustrate the algorithm, consider an implementation of the parallel pipe-and-roll matrix multiplication algorithm as described in Figures 7.6 and 7.7 in Chapter 7. In this instance, let us consider a system with one master process and four slave processes arranged in a 2 x 2 grid.

Figure 8.5: Communication graph with buffer intervals for a 2 x 2 worker configuration of the pipe-and-roll matrix multiplication algorithm.

Figure 8.5 illustrates the communication graph created based on the messages sent when the algorithm is executed. P0 is the master process, and P1, ..., P4 are the slave processes. The dotted vertical lines represent the I intervals computed by the NBAP algorithm. Remember, the beginning of an interval signifies the earliest time during that process when the message, received by the receive vertex at the end of the dotted line, can arrive in the communication system in that process. Thus, a buffer must be available for this message during this interval.

Figures 8.6 and 8.7 illustrate the use of this algorithm in Millipede. Millipede maintains information to construct a communication graph based on the messages. This information can be extracted from the relations Senders and Receivers, or from a set of message log files. By executing the command nbap, the current program loaded into Millipede is analyzed. The corresponding communication graph is built, and the NBAP algorithm is applied. The output, as seen in Figure 8.6, is a line for each process with its buffer requirements. To further investigate the requirements of a single process, the user can execute the nbap command with the process ID of the process in question. Figure 8.7 shows the output from the command nbap 242167. The output contains a number of intervals; these are equivalent to the space between two communication nodes on the communication graph.
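Step 3 of the algorithm in Figure 8.4 is a standard maximum interval overlap computation. As a purely illustrative sketch, not taken from Millipede and with hypothetical names, the buffer requirement for one process can be computed by sweeping over the sorted interval endpoints:

    #include <stdlib.h>

    typedef struct { int start, end; } Interval;   /* one interval I = [c, t] */

    static int cmp_int(const void *a, const void *b)
    {
        return *(const int *)a - *(const int *)b;
    }

    /* Maximum number of intervals that overlap at any point; this is the
       buffer requirement b_i for a single process.  Ties at endpoints are
       counted as overlapping, which errs on the side of more buffers.     */
    static int max_overlap(const Interval *iv, int n)
    {
        int *starts = malloc((size_t)n * sizeof *starts);
        int *ends   = malloc((size_t)n * sizeof *ends);
        int i, s = 0, e = 0, open = 0, best = 0;

        for (i = 0; i < n; i++) {
            starts[i] = iv[i].start;
            ends[i]   = iv[i].end;
        }
        qsort(starts, (size_t)n, sizeof *starts, cmp_int);
        qsort(ends,   (size_t)n, sizeof *ends,   cmp_int);

        while (s < n) {                    /* sweep over the endpoints */
            if (starts[s] <= ends[e]) {
                open++;                    /* an interval opens        */
                if (open > best)
                    best = open;
                s++;
            } else {
                open--;                    /* an interval closes       */
                e++;
            }
        }
        free(starts);
        free(ends);
        return best;
    }

Sorting dominates the running time, so each process is handled in O(E_i log E_i) time, well within the polynomial bound claimed for the NBAP algorithm.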
Returning to the output in Figure 8.7, the number in the Line column represents the line number at which that interval ends, that is, the line number of either a send or a receive call.

    (0)|MILLIPEDE> nbap
    Master.c
      Group / Instance / Tid    / Buffers
        0   /    0     / 242165 /    4
    Slave.c
      Group / Instance / Tid    / Buffers
        0   /    0     / 242167 /    3
        0   /    1     / 242169 /    3
        0   /    2     / 242171 /    3
        0   /    3     / 242173 /    3

Figure 8.6: The result of executing the nbap command in Millipede.

    (0)|MILLIPEDE> nbap 242167
    File: Slave.c
      Interval   Line   No. Buffers
          1       78        2
          2       36        1
          3       64        2
          4       66        3
          5       36        2
          6       64        1
          7       66        1
          8      117        0
          9      end        0

Figure 8.7: By passing the nbap command a process identifier, detailed information about its buffer requirements is displayed.

Another important aspect of the NBAP algorithm is that if the system can provide the number of buffers suggested by the algorithm, all sends become nonblocking, which optimizes the ability to overlap communication and computation. In order to make efficient use of asynchronous message passing, it is important to ensure that no send operations block.

8.4 Approximations of BAP using NBAP

As shown, the Buffer Allocation Problem is NP-hard for all four buffer placement schemes. As mentioned, an obvious approximation to BAP is the result computed by the NBAP algorithm; not only does it guarantee that a program with that many buffers does not contain any blocking sends, but a side effect is that no deadlock due to insufficient buffers ever occurs. Unfortunately, this approximation is not always sufficient. Consider the two communication graphs shown in Figure 8.8. For the left graph, labelled (a), the NBAP algorithm suggests a number k, where k is the number of messages, in this case 8. However, the correct number of buffers for this graph to avoid deadlocks is 0; the graph represents a 0-safe program, which can be executed synchronously, thus requiring no buffers at all. For the graph labelled (b), the algorithm computes the value 1.
Again, this communication graph represents a program that is 0-safe, but the approximation of 1 buffer for (b) is a much tighter upper bound than 8 is for (a).

Figure 8.8: (a) shows an example for which the NBAP output is a worst case approximation. The graph in (b) is an example where NBAP is within 1 of the optimal result.

We attempt to counter this problem by introducing epochs. Intuitively, an epoch in a communication graph is any set of vertices that can be separated from the rest of the graph by two horizontal lines cutting the graph in three parts, such that exactly one vertical arc in each process is crossed by each line, and no communication arcs are crossed. Actually, all three parts, above the top line, between the lines, and below the bottom line, are epochs.

Consider a communication graph G(S) with n processes. A set e of n pairs of vertices (v_{i,c_i}, v_{i,k_i}), one pair for each process in G(S), represents an epoch if the following holds. Define E(e) as follows:

    E(e) = {v_{i,l} | (v_{i,c_i}, v_{i,k_i}) in e, i in [0, n), l in [c_i, k_i]}

such that for all v in E(e), if v is a send event, the corresponding receive event v_r is in E(e), or if v is a receive event, the corresponding send event v_s is in E(e). The vertices in the epoch are exactly the vertices in E(e). Figure 8.9 shows an example of dividing the communication graph from Figure 8.5 into three epochs.

The significance of an epoch is that it is self contained, that is, no communication within an epoch involves any communication events outside that epoch. Assume that we have computed the minimum number of buffers needed to assure nonblocking sends, that is, the result of the NBAP algorithm. We know this is an upper bound for BAP. An example of parts of such a graph is shown in the left part of Figure 8.10.

Figure 8.9: The communication graph for the matrix multiplication system with three epochs introduced.

Consider a process arc (v_{i,c}, v_{i,c+1}), and an epoch e where v_{i,c} is in E(e) and v_{i,c+1} is not in E(e), that is, the process arc crosses an epoch boundary. If the line dividing the communication graph (the epoch line) is replaced by a barrier synchronization, the intervals, representing buffer requirements, that start before the barrier can now be shortened to start at the barrier. This is illustrated in the right side of Figure 8.10. Thus, the more epochs the communication graph is divided into, the better the chances of reducing buffer requirements. However, barriers do force the processes to synchronize, which can lead to a significant slow down with respect to execution speed. This means there is a trade off between the number of buffers required and the number of barriers added; safety is traded for execution speed.

The barriers we suggest are equivalent to the epoch boundaries, and involve all the processes in the system. This might be an unnecessarily conservative approach. If a number of the processes require more buffers than others, it might be sufficient to focus on these processes. We can define sub-epochs as self contained sets, like E(e), that do not involve all n processes, but still ensure that both ends of a communication arc are included in the epoch for the involved processes. The boundaries of such sub-epochs can be replaced by barrier synchronization between the processes included in the epoch. Figure 8.11 shows an example of a sub-epoch.

Figure 8.10: By creating epochs, buffer requirements are reduced. Intervals no longer cross epoch boundaries.

An even less restrictive approach is to add synchronization points between just two processes. In practice, this is equivalent to making a message passing call synchronous. This is easily achieved in MPI by using the synchronous message passing calls rather than the asynchronous ones. Barriers are by definition synchronous, but our model assumes asynchronous communication. However, barriers can be simulated using asynchronous communication, as shown in Figure 8.12. Simulating a synchronous message passing call in the corresponding asynchronous communication graph can be done by adding an 'ack-like' message in the opposite direction of the original message, thus making the call and the added 'ack' look like a mini barrier between two processes only; a sketch of this idea in PVM is given below.

One of the important design goals for a tool like Millipede is to easily allow the user to map a problem back to the source code. Millipede provides this mapping through the nbap command. When nbap is executed with a process identifier, as seen in Figure 8.7, the buffer requirements for each interval are shown with corresponding line numbers. The intervals 2 through 6 are equivalent to the left side of Figure 8.10.
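As an illustration of the 'ack' scheme just described, the following sketch shows how a single PVM send could be made to behave synchronously by waiting for an explicit acknowledgement from the receiver. This is not Millipede or thesis code; the tag value and function names are hypothetical.

    #include "pvm3.h"

    #define ACK_TAG 9999   /* hypothetical tag reserved for acknowledgements */

    /* Sender side: send the data, then block until the receiver has
       acknowledged it, turning this call into a mini barrier between
       the two processes. */
    void send_sync(int dest_tid, int msgtag, int *data, int n)
    {
        int dummy;
        pvm_initsend(PvmDataDefault);
        pvm_pkint(data, n, 1);
        pvm_send(dest_tid, msgtag);
        pvm_recv(dest_tid, ACK_TAG);    /* wait for the 'ack' message */
        pvm_upkint(&dummy, 1, 1);
    }

    /* Receiver side: receive the data and immediately acknowledge it. */
    void recv_sync(int src_tid, int msgtag, int *data, int n)
    {
        int dummy = 0;
        pvm_recv(src_tid, msgtag);
        pvm_upkint(data, n, 1);
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&dummy, 1, 1);
        pvm_send(src_tid, ACK_TAG);     /* the 'ack' in the opposite direction */
    }

On the communication graph this adds exactly the reverse arc described above, so re-running nbap after such a change shows the correspondingly shortened buffer intervals.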
By adding a barrier immediately before the line T = Pipe_A() in Figure 7.7, and re-executing the nbap algorithm, the buffer requirements change; the maximum is now 2 instead of 3, as previously. With the information about the buffer requirements for each interval, it is easy to search the list for the intervals with the largest numbers, return to the source code for examination, and possibly insert barriers, or make some calls synchronous.

Figure 8.11: An illustration of the use of sub-epochs; epochs that do not involve all processes. Processes P1, P2, P3, and P4 all participate in a sub-epoch. Barrier synchronizations between these four processes can then be inserted at the epoch boundaries.

8.5 Discussion

A number of techniques can be amalgamated into a new analysis tool for the BAP approximation algorithm. Since the user might not want to work with the communication graph, certain automated tasks can be implemented:

• Communication in loops should be automatically recognized in the communication graph, which then could be 'rolled up'. Often, buffer requirements within loops differ only slightly, so by rolling up loops, it becomes clear where barriers can be placed so that they do not interfere with loops.

• Automatic recomputation of buffer requirements after inserting barriers or synchronous communication, without re-executing the program, would make the development/debugging cycle much more efficient and shorter.

Figure 8.12: Implementation of barrier synchronization using asynchronous communication. No process can cross the dashed line before every process is ready.

8.6 Summary

In this chapter we presented three problems related to buffer allocation in asynchronous message passing systems: the Buffer Allocation Problem, the Buffer Sufficiency Problem, and the Nonblocking Buffer Allocation Problem. We considered them under four different buffer placement schemes (send side buffers, receive side buffers, mixed send and receive side buffers, and buffers placed on communication channels), and proved that the most general problem, determining the minimum number of buffers needed to assure deadlock free execution, is intractable. We presented a polynomial time algorithm for computing the minimum number of buffers needed to assure that no send ever blocks. In addition, we showed how to use the NBAP algorithm along with barriers or synchronous communication calls to approximate the solution to the Buffer Allocation Problem. The majority of the work concerning buffer allocation is found in Appendix E, and while most of the problems presented are intractable (see Table 9.1), the results themselves are interesting, form a solid starting point, and offer insight into the problem for further research into heuristics and approximations.

Chapter 9

Conclusion and Future Work

"If you have an important point to make, don't try to be subtle or clever. Use a pile driver. Hit the point once. Then come back and hit it again. Then hit it a third time with a tremendous whack." - Winston Churchill

"Always do one thing less than you think you can do" - Bernard Baruch

9.1 Conclusion

In this dissertation we described a debugging strategy for parallel message passing programs, called multilevel debugging. To validate the thesis, we develop a number of tools and analyses to support it.
These tools are realized as modules in a prototype implementation of a multilevel debugger, called Millipede (Multilevel Interactive Parallel Debugger). Millipede is implemented for C programs that use the PVM message passing library.

The multilevel debugging strategy is based on a number of observations about parallel programming, debugging, and the shortcomings of existing tools. A parallel system decomposes naturally into three parts: sequential code, messages, and the overall communication protocol. Therefore, it is natural to tailor debugging of such systems to this structure; errors are classified according to the three levels in which they occur. We show that by providing tools specifically tailored to each level, certain tasks and analyses, which are otherwise tedious, error prone, and time consuming, become easier to accomplish. The narrower focus enables the automation of certain tasks, such as deadlock correction and protocol conformance checking. Additionally, the design goals include avoiding information overload, computing relations on request, and providing views for 'key players' at each level.

One main problem with existing tools, which we address with the multilevel debugging strategy, is their fixed granularity, or lack of support for certain debugging tasks. An example of this is the application of N versions of a sequential debugger, such as Gdb, to N different processes in a parallel system. We show that a bottom up approach with a number of specially tailored tools is useful for debugging parallel message passing programs. We verify this thesis by implementing a variety of tools that support the multilevel debugging strategy at each of the three levels.

9.1.1 The Sequential Level

At the sequential level, the lowest level, we reason that since a number of very good tools, such as Gdb and Purify, already exist, the missing functionality is the ability to apply these tools to a single process of a multiprocess system. To facilitate the use of existing debugging tools, we add message logging functionality to the runtime system of Millipede; these log files contain information about the messages in the system. By intercepting all message passing calls, the runtime system replays messages read from log files rather than from the network, facilitating the sequential debugging of one process of a parallel system, as if it were a stand-alone program. We successfully validated this approach by demonstrating how a number of sequential debugging tools can be applied to one process of a parallel system. One of the key points in the design of this tool is the reduction in the amount of information presented to the user, as well as assuring that the abstraction of the tool is correct.

Nondeterminism in the sequential code limits this tool's usefulness. If the sequential code does not receive the messages in the same order every time the code is executed, the order of the messages in the log files will be incorrect. However, similar problems exist in the sequential domain; errors can be hard to reproduce with nondeterministic code.

9.1.2 The Message Level

At the message level, the second level, it is difficult to perform debugging tasks that involve controlling and keeping track of messages. We counter this problem by providing two tools at this level. The first is an extension of the well known idea of inspecting and changing variables in a sequential program.
We extend that idea to the Message Level by providing functionality in the runtime system that allows the user to choose a number of processes on which to perform interactive message inspection; as messages are delivered to the process, the user can choose to intercept them for inspection. In addition, if errors occur during unpacking, the user is automatically informed and can focus his attention on correcting the problem. This technique can be used in connection with the Sequential Debugging Module as well; if no log files exist, a program can still be debugged sequentially, with messages provided manually by the user or from a file. This provides a way to test a single process of a multiprocess system, even if the rest of the system is not yet implemented.

An important result of this tool is the ability to view and manipulate messages as a unit; if the need arises, the user can investigate the source and destination variables of the values in the message. The level of abstraction can be tuned to match the user's needs. In addition, the tool will automatically notify the user about errors caused by unpacking too much data, immediately enabling him to identify the packing and unpacking routines that were involved with the message. One potential drawback with interactive message debugging is the potentially large amount of data the user must manually supply if the tool is being used for unit testing. We have provided the user with the ability to read values from a file; this avoids errors due to typing, but the problem of producing the data in the file still remains.

The second tool at the message level is a query language, the Millipede Query Language (MQL), which allows the user to form queries about the messages. If the user is interested in discovering certain facts about messages, it is an almost impossible task if he has to manually browse through a large set of messages. Using MQL he can easily form a query that selects the messages that he is interested in, thus computing relations when needed and reducing the information overload. Relations containing information about the messages are maintained by the runtime system, and at any time the user can form queries related to the messages and their content. We show a number of different useful queries as an example of the expressiveness of MQL. This tool is based on the design goal of 'computing relations on request'. It follows that this technique reduces the information overload that could otherwise appear when perusing the large relations containing information about the messages and their content. MQL is extensible, thus implementing support for more complex queries is straightforward. Queries in MQL are limited by the language itself, but also by the operations provided by the underlying database system. One possibility is to replace MQL with an implementation of the complete SQL specification and use a state-of-the-art relational database.

9.1.3 The Protocol Level

At the protocol level, the third of the three levels, we develop three different tools and perform an in-depth analysis of the problem of determining the minimum number of buffers needed to ensure that an asynchronous message passing program does not deadlock due to buffer insufficiencies.

The first tool is the Deadlock Detection and Correction Tool. When a parallel system deadlocks, a global overview is often required to gain the knowledge needed to remove the deadlock.
The tool automatically computes a possible solution, which is a number of changes to the source code that will remove the deadlock. We present an algorithm that does this; it is based on computing a maximal matching in a bipartite graph, where the nodes represent the senders and receivers in the parallel system. We show that the probability of the algorithm suggesting an incorrect way to resolve the deadlock is small. This tool allows the user to query the system for potential solutions. Again, a relation is computed automatically; computing this relation by hand can be time consuming and error prone. We show that the number of times the algorithm suggests a wrong solution, when the number of errors in the system is less than half the number of processes, is sufficiently small (between 0% and 11% when the number of processes is less than or equal to 10). This tool illustrates how certain tasks can be automated when the level of abstraction is raised. In addition to the automation of the task, the tool suggests a solution to remove the error, which is also made possible by the level of abstraction and automation.

The second tool at the protocol level is the Protocol Conformance Checking Tool. The idea we investigated is how to provide an easy way for the user to automate checking that the messages adhere to the specification of the intended protocol. The tool allows the user to write a simple specification of a protocol, and then have all messages checked against that specification as the program is running. This tool reduces the gap between a protocol specification (even a protocol specification that has been verified using verification tools) and implementation, by aiding the user in verifying that the implemented protocol matches the specification. An obvious downside with such a tool is that it is impossible to guarantee that the constraint specification the user provides matches the protocol. On the other hand, the specification language is small, and if used in an iterative fashion along with the program development, we have shown that it is a feasible approach for automatically verifying that messages adhere to the provided specification. One important issue is the tool's ability to map the error back to the source code. That is, once a message has violated the specification, information about the sender and the receiver, as well as the lines involved in packing and unpacking the message, is reported to the user. One of the limiting factors is that the tool does not take any temporal or timing issues into account; it is concerned with spatial issues only, that is, the message passing pattern.

The last tool at this level originates from this theoretical question: "in a system that uses asynchronous buffered nonblocking sends and blocking receives, can we determine the minimum number of buffers required by a message passing program such that it never deadlocks due to lack of buffers?" This problem is known as the k-safety problem, and it led to an in-depth investigation of three related questions: one, "can we compute a minimal buffer assignment needed to avoid deadlock?"; two, "can we verify that an assignment is sufficient?"; and three, "can we compute the minimal buffer assignment needed to avoid blocking sends?" We determine the complexities of all three questions; the first two are NP-hard, whereas the last one has a known polynomial time solution. We investigate these three problems, the Buffer Allocation Problem (BAP), the
Buffer Sufficiency Problem (BSP), and the Nonblocking Buffer Allocation Problem (NBAP), respectively, with four different buffer allocation schemes: buffers placed on the receiver's side, on the sender's side, on both sides, and on channels (under the assumption that we use channel-like communication). The result of this investigation is as follows:

• The Buffer Allocation Problem is intractable under all four buffer allocation schemes.

• The Buffer Sufficiency Problem is intractable for the receive side buffer and for the mixed buffer allocation schemes, tractable for the channel scheme, and conjectured to be tractable for sender side buffers.

• The Nonblocking Buffer Allocation Problem is tractable for all buffer placement schemes, except the mixed send and receive scheme.

Table 9.1 summarizes these results.

                              Buffer Placement
    Problem   Receive         Send      Send & Receive   Channel
    BAP       NP-hard         NP-hard   NP-hard          NP-complete
    BSP       coNP-complete   (P)       coNP-complete    P
    NBAP      P               P         NP-hard          P

Table 9.1: Results for the three problems under the four different buffer placement schemes.

In addition, we provide an implementation of the NBAP algorithm in Millipede, and show how this algorithm, combined with a number of techniques for inserting synchronization points into the code, can be used to reduce the number of buffers required to ensure k-safety. Since the number of buffers needed to ensure nonblocking sends (the result of the NBAP algorithm) is an upper bound for the original BAP problem, reducing the number of buffers necessary for NBAP provides an improved approximation for BAP. This work was done in collaboration with Alex Brodsky [BBW02].

9.1.4 Summary

The decomposition of the parallel programming domain into three levels led to a bottom-up approach to debugging, referred to as multilevel debugging. We have implemented a number of tools in accordance with this strategy. These tools are tailored to a specific error type at one of the three levels of the parallel programming domain. In addition to serving as a debugging framework, the multilevel strategy provides a guide for implementing new tools within the framework.

By implementing various tools we have shown that it is possible to utilize the extra information that can be extracted from a parallel program to raise the level of abstraction within each tool; in turn, this allows for automation and analyses that otherwise would be impossible. Furthermore, the decomposition and the narrowed focus on one specific error type for each tool has reduced the information overload that is evident in many existing tools, and allowed us to meet a number of design goals for tools in general, and for debugging tools specifically. These include views for 'key players', automatic relation computation, automated analyses, and automatic computation of possible solutions to errors.

All these tools have been implemented as modules in Millipede, a prototype multilevel debugger with a simple command line interface. Millipede does not require any rewriting or transformation, and is thus applicable directly to the source code, which is another important design goal.

9.2 Future Work

During the design and implementation of Millipede, a number of interesting issues and suggestions for improvements have arisen. The following sections describe some of the research directions future work can take, and also describe issues that should be addressed to make Millipede fully functional.
9.2.1 Further Development

The current version of Millipede is written for PVM, and not all PVM functions have yet been implemented. A first step is to complete the implementation of the Millipede runtime system to support all PVM functions. A natural next step is to implement a version of the runtime system for MPI.

Furthermore, providing a simple GUI for interacting with Millipede has certain advantages. One of the more important advantages is the ability to easily manage a number of windows through the use of tabs or cascading panes. A simple GUI can be implemented using Tcl/Tk, which provides callback functionality to C.

In Chapters 4 through 8 we describe in detail improvements and future work for each tool. We highlight some of the more important ones here. Since log files can become quite large, and since replaying a process from the start can take too long, one of the most important improvements for the runtime system is to add checkpointing and log file truncation. Netzer and Xu describe an efficient way to checkpoint and maintain log file consistency in [NX93]. Implementing a similar scheme for Millipede will improve its ability to debug long running applications with large message data sets.

The database system used in connection with MQL is written in C and is part of Millipede. By using an existing database that provides access through C functions, not only do we get a better database, but it also becomes easier to extend MQL with new functionality that might already be found in the native SQL dialect of the chosen database.

Breakpoints are well known in the sequential debugging domain; we believe that this concept can be extended to messages as well. Message breakpoints abstract away line numbers, and let the user control program execution based on the messages.

The algorithm for correcting deadlocks, described in Chapter 6, does not take message tags into account. An interesting extension to this work is to develop an algorithm that also considers message tags. In addition, a new analysis of the quality of such an algorithm should be performed.

For the Protocol Conformance Checking Tool, a number of simple improvements have been suggested in Section 7.10. Two examples are adding message tags to specification lines and allowing the user to write specification lines where communication is dependent on program state. An interesting extension to this module would be a graphic display showing the flow of the messages as the program executes. Such an interface can be combined with the message relations described in Chapter 5 to provide a graphical replay of the protocol.

Finally, the Buffer Sufficiency Problem for the send side buffer allocation scheme should be investigated. If we can show that the problem is tractable, an algorithm can be developed and added to Millipede. Further research should be done on an approximation heuristic for the Buffer Allocation Problem. We only suggest simple measures, but we believe that more work can be done to provide better approximations.

Bibliography

[AFC91] K. Araki, Z. Furukawa, and J. Cheng. A General Framework for Debugging. IEEE Software, pages 14-20, May 1991.

[Ana89] V. Anantharam. The optimal buffer allocation problem. IEEE Transactions on Information Theory, 35(4):721-725, 1989.

[Arv92] D. K. Arvind. On the detection of communication related errors in parallel programs. Parallel Computing, 18:1381-1392, 1992.

[Asp03] The AspectJ Team, Xerox Corporation, Palo Alto Research Center.
The AspectJ™ Programming Guide, 2003.

[BB97] B. B. Blendstrup and J. B. Pedersen. PVMbuilder - et grafisk værktøj til parallel programmering. Master's thesis, Aarhus Universitet, January 1997.

[BBW02] A. Brodsky, J. B. Pedersen, and A. Wagner. On the Complexity of Buffer Allocation in Parallel Message Passing Systems. In Communicating Process Architectures 2002. IOS Press, September 2002.

[BD95a] G. Burns and R. Daoud. Robust MPI Message Delivery with Guaranteed Resources. MPI Developers Conference at the University of Notre Dame, June 1995.

[BD95b] G. Burns and R. Daoud. Robust MPI Message Delivery with Guaranteed Resources. MPI Developers Conference at the University of Notre Dame, June 1995.

[BDH+95] J. Bruck, D. Dolev, C. Ho, M. Rosu, and R. Strong. Efficient message passing interface (MPI) for parallel computing on clusters of workstations. In 7th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 64-73, July 1995.

[BK95] P. A. Buhr and M. Karsten. µC++ Monitoring, Visualization and Debugging Annotated Reference Manual. Preliminary draft edition, November 1995.

[BS95] P. A. Buhr and R. A. Stroobosscher. µC++ Annotated Reference Manual, Version 4.4. Available via ftp from pig.uwaterloo.ca in pub/uSystem/uC++.gz. Department of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1 edition, September 1995.

[BW00] J. B. Pedersen and A. Wagner. Sequential Debugging of Parallel Programs. In Proceedings of the International Conference on Communications in Computing, CIC'2000. CSREA Press, June 2000.

[BW01a] J. B. Pedersen and A. Wagner. Correcting Errors in Message Passing Systems. In F. Mueller, editor, High-Level Parallel Programming Models and Supportive Environments, 6th International Workshop, HIPS 2001, San Francisco, CA, USA, volume 2026 of Lecture Notes in Computer Science, pages 122-137. Springer Verlag, April 2001.

[BW01b] J. B. Pedersen and A. Wagner. Protocol Verification in Millipede. In Communicating Process Architectures 2001. IOS Press, September 2001.

[CE81] E. M. Clarke and E. A. Emerson. Synthesis of synchronization skeletons for branching time temporal logic. Logic of Programs: Workshop, Yorktown Heights, NY, 131, May 1981.

[CFR95] C. Clemencon, J. Fritscher, and R. Rühl. Visualization, Execution Control and Replay of Massively Parallel Programs within Annai's Debugging Tool. In Proceedings of the High-Performance Computing Symposium, pages 393-405, July 1995.

[CL94a] R. Cypher and E. Leu. Repeatable and portable message-passing programs. In Proc. of the Symposium on the Principles of Distributed Computing (PODC), pages 22-31, 1994.

[CL94b] R. Cypher and E. Leu. The semantics of blocking and nonblocking send and receive primitives. In Proceedings of the 8th IEEE International Parallel Processing Symposium (IPPS), pages 729-735, 1994.

[CLM89] E. M. Clarke, D. E. Long, and K. L. McMillan. Compositional model checking. In Proceedings, Fourth Annual Symposium on Logic in Computer Science, pages 353-362. IEEE Computer Society Press, June 1989.

[CLR90] T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press, 1990.

[CMT96] B. Charron-Bost, F. Mattern, and G. Tel. Synchronous, asynchronous, and causally ordered communication. Journal of Distributed Computing, 9(4):173-191, 1996.

[Coo71] S. A. Cook. The complexity of theorem-proving procedures. In Proceedings of the 3rd Annual ACM Symposium on the Theory of Computing, pages 151-158, 1971.
[CSH02] R. Corbett, R. Stallman, and W. Hansen. Bison 1.35, May 2002. http://www.gnu.org/manual/bison/index.html.

[DHHW93] J. Dongarra, R. Hempel, A. Hey, and D. Walker. A proposal for a user-level, message-passing interface in a distributed memory environment. Technical Report TM-12231, ORNL, June 1993.

[Dil96] D. L. Dill. The Murφ Verification System. In 8th International Conference on Computer Aided Verification, pages 390-393, July/August 1996.

[Don] J. Dongarra et al. HeNCE: Users' Guide, Version 2.0. http://www.netlib.org/hence.

[Don94] J. Dongarra. MPI: A Message Passing Interface Standard. The International Journal of Supercomputers and High Performance Computing, 8:165-184, 1994.

[Eis97] M. Eisenstadt. My hairiest bug war stories. In The Debugging Scandal and What to Do About It - Communications of the ACM. ACM Press, April 1997.

[FBH+92] D. Frye, R. Bryant, H. Ho, R. Lawrence, and M. Snir. An external user interface for scalable parallel systems. Technical report, IBM Highly Parallel Supercomputing Systems Laboratory, November 1992.

[FJL+88] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, and D. Walker. Solving Problems on Concurrent Processors. General Techniques and Regular Problems, volume 1. Prentice Hall International, 1988.

[For] Formal Systems. FDR2. http://www.fsel.com.

[Fos95] I. Foster. Designing and Building Parallel Programs: Concepts and Tools for Parallel Software Engineering. Addison Wesley, 1995.

[GBV94] G. Burns, R. Daoud, and J. Vaigl. LAM: An Open Cluster Environment for MPI. In Supercomputing Symposium '94, June 1994.

[Gdb] Gdb - GNU Debugger. http://www.gnu.org/directory/gdb.html.

[Gei94] A. Geist et al. PVM: Parallel Virtual Machine. A User's Guide and Tutorial for Networked Parallel Computing. Prentice Hall International, 1994.

[GJ79] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, New York, 1979.

[GMP89a] A. Giacalone, P. Mishra, and S. Prasad. Facile: A Symmetric Integration of Concurrent and Functional Programming. Proceedings of the 1989 TAPSOFT Conference, 352, 1989.

[GMP89b] A. Giacalone, P. Mishra, and S. Prasad. Facile: A Symmetric Integration of Concurrent and Functional Programming. International Journal of Parallel Programming, 18(2), 1989.

[Gra86] J. Gray. Why do Computers Stop and What Can be Done About it? Proceedings of the 5th Symposium on Reliability in Distributed Software and Database Systems, pages 3-12, January 1986.

[HE93] M. T. Heath and J. A. Ethridge. ParaGraph: A Tool for Visualizing Performance of Parallel Programs. Technical report, Oak Ridge National Laboratories, 1993.

[Hei97] F. Heinze et al. Trapper: Eliminating Performance Bottlenecks in a Parallel Embedded Application. IEEE Concurrency, pages 28-37, July-September 1997.

[HJJJ85] P. Huber, A. M. Jensen, L. O. Jepsen, and K. Jensen. Reachability trees for high-level Petri nets. Theoretical Computer Science, 45:261-292, 1985.

[Hoa78] C. A. R. Hoare. Communicating Sequential Processes. Communications of the ACM, 21(8):666-677, August 1978.

[Hol97] G. J. Holzmann. The Spin Model Checker. IEEE Transactions on Software Engineering, 23(5):279-295, May 1997.

[Hoo96] R. Hood. The p2d2 Project: Building a Portable Distributed Debugger. In Proceedings of the ACM SIGMETRICS Symposium on Parallel and Distributed Tools (SPDT'96), pages 127-136, May 1996.
[IBM02] IBM Press Release. IBM Executive Says Grids Will Be A Breakthrough For Managing IT Efficiency, June 2002. http://www-916.ibm.com/press/prnews.nsf/jan/F23B8EA466B5569085256BDC0064024B.

[Jen92] K. Jensen. Coloured Petri Nets. Basic Concepts, Analysis Methods and Practical Use, volume 1. Springer Verlag, 1992.

[Joh83] W. L. Johnston. An Effective Bug Classification Scheme Must Take the Programmer into Account. Proceedings of the Workshop on High-Level Debugging, Palo Alto, California, 1983.

[Kar95] M. Karsten. A Multi-Threaded Debugger for Multi-Threaded Applications. Diplomarbeit, Fakultät für Mathematik und Informatik, Universität Mannheim, Deutschland edition, August 1995.

[KG96] J. A. Kohl and G. A. Geist. The PVM 3.4 Tracing Facility and XPVM 1.1. Proceedings of the 29th Annual Hawaii International Conference on System Sciences, pages 290-299, 1996.

[KLK99] P. Kacsuk, R. Lovas, and J. Kocacs. Systematic Debugging of Parallel Programs in DIWIDE Based on Collective Breakpoints and Macrosteps. In Proceedings of the 5th International Euro-Par Conference (Euro-Par'99), volume 1685 of Lecture Notes in Computer Science, pages 90-97. Springer Verlag, August 1999.

[Knu89] D. E. Knuth. The Errors of TeX. Software - Practice and Experience, 19(7):607-685, July 1989.

[KV97] D. Kranzlmuller and J. Volkert. Using Different Levels of Abstraction for Parallel Programming Debugging. In Proceedings of the 1997 IASTED (International Conference on Intelligent Information), pages 523-529, 1997.

[KW01] C. Keppitiyagama and A. Wagner. Asynchronous MPI messaging on Myrinet. In Proceedings of the 15th International Parallel and Distributed Processing Symposium (IPDPS'01). IEEE, 2001.

[Lam78] L. Lamport. Time, clocks and the orderings of events in a distributed system. Communications of the ACM, 21:558-565, 1978.

[Lel88] Wm Leler. Constraint Programming Languages - Their Specification and Generation. Addison-Wesley, 1988.

[LLM88] M. Litzkow, M. Livny, and M. Mutka. Condor: A Hunter of Idle Workstations. Proceedings of the 8th International Conference on Distributed Computing Systems, pages 104-111, June 1988.

[May83] D. May. OCCAM (language). ACM SIGPLAN Notices, 18(4):69-79, April 1983.

[McM92] Ken McMillan. Symbolic Model Checking: An Approach to the State Space Explosion Problem. PhD thesis, Carnegie Mellon University, 1992.

[MHC94] B. P. Miller, J. K. Hollingsworth, and M. D. Callaghan. The Paradyn Parallel Performance Measurement Tools and PVM. Environments and Tools for Parallel Scientific Computing, 1994.

[Mon13] P. de Montmort. On the game of thirteen. 1713. Reprinted in Annotated Readings in the History of Statistics, ed. H. A. David and A. W. F. Edwards, Springer Verlag, 2001, pp. 25-29.

[NAW+96] W. E. Nagel, A. Arnold, M. Weber, H. C. Hoppe, and K. Solchenbach. VAMPIR: Visualization and Analysis of MPI Resources. Supercomputer 63, XII(1):69-80, January 1996.

[NC92] P. Newton and J. C. Browne. The CODE 2.0 Graphical Parallel Programming Language. Proceedings of the ACM International Conference on Supercomputing, July 1992.

[ND94] P. Newton and J. Dongarra. Overview of VPE: A Visual Environment for Message-Passing, 1994. http://www.cs.utk.edu/~newton/vpe/vpe.html.

[Net] NetSolve. http://icl.cs.utk.edu/netsolve.

[NX93] R. H. B. Netzer and J. Xu. Adaptive message logging for incremental replay of message-passing programs. In Proceedings of the 1993 ACM/IEEE Conference on Supercomputing, pages 840-849. ACM Press, 1993.

[NY93] P. Newton and S. Y. Khedekar.
CODE 2.0 User Manual, March 1993.

[Pal99] Pallas GmbH. TotalView, 1999. http://www.pallas.de/pages/totalv.htm.

[Pan93a] C. M. Pancake. Graphical Support for Parallel Debugging. Software Support for Parallel Computation, pages 216-228, 1993.

[Pan93b] C. M. Pancake. Why Is There Such a Mis-Match between User Need and Parallel Tool Production? Keynote address, 1993 Workshop on Parallel Computing Systems: A Dialog between Users and Developers, April 1993.

[Pan93c] C. M. Pancake et al. Results of User Surveys Conducted on Behalf of Intel Supercomputer Systems Division, Two Divisions of IBM Corporation, and CONVEX Computer Corporation, 1989-1993.

[Pan94] C. M. Pancake. What Users Need in Parallel Tool Support: Survey Results and Analysis. Technical Report CSTR 94-80-3, Oregon State University, June 1994.

[Pan99] C. M. Pancake. Applying Human Factors to the Design of Performance Tools. In Proceedings of the 5th International Euro-Par Conference (Euro-Par'99), volume 1685 of Lecture Notes in Computer Science, pages 44-60. Springer Verlag, August 1999.

[Pax98] V. Paxson. Flex - a scanner generator, November 1998. http://www.gnu.org/manual/flex-2.5.4/flex.html.

[Pdb] pdbx and pedb: Parallel Program Debuggers. http://www.tc.cornell.edu/UserDoc/Software/PTools/pdbx.

[Pre92] O. Pretzel. Error-Correcting Codes and Finite Fields. Clarendon Press, 1992.

[Pur] Rational Purify for UNIX. http://www.rational.com/products.

[Rei87] M. Reiman. The optimal buffer allocation problem in light traffic. In IEEE Conference on Decision and Control, 1987.

[Ros93] A. W. Roscoe. Developing and verifying protocols in CSP. Proceedings of the Protocol Verification Workshop, Mierlo, The Netherlands, March 1993.

[Ros94] A. W. Roscoe. Model-Checking CSP. A Classical Mind, Essays in Honour of C.A.R. Hoare, 1994.

[RWZ88] B. K. Rosen, M. N. Wegman, and F. K. Zadeck. CML: a higher-order concurrent language. Proceedings of the ACM SIGPLAN'91 Conference on Programming Language Design and Implementation, January 1988.

[SA97] R. Sosic and D. Abramson. Guard: a Relative Debugger. Software - Practice and Experience, 27(2):185-206, February 1997.

[San99] A. A. Sane. Techniques for Developing Correct, Fast and Robust Implementation of Distributed Protocols. PhD thesis, University of Illinois at Urbana-Champaign, 1999.

[She75] T. Sheskin. Allocation of interstage storage along an automatic production line. AIEE Transactions, 8(1), 1975.

[Sri95] R. Srinivasan. XDR: External Data Representation Standard. RFC 1832, Sun Microsystems, August 1995.

[SSG91] A. Singh, J. Schaeffer, and M. Green. A Template-Based Approach to the Generation of Distributed Applications Using a Network of Workstations. IEEE Transactions on Parallel and Distributed Systems, 2(1):52-67, January 1991.

[SSP85] J. C. Spohrer, E. Soloway, and E. Pope. A Goal/Plan Analysis of Buggy Pascal Programs. Human-Computer Interaction, 1(2):163-207, 1985.

[SSS90] A. Singh, D. Szafron, and J. Schaeffer. Experience with parallel programming using code templates. Concurrency, Practice and Experience, 10:91-120, March 1990.

[The96] The VIS Group. VIS: A System for Verification and Synthesis. Proceedings of the 8th International Conference on Computer Aided Verification, 1102, July 1996.

[Top] The top 500 fastest computers. http://www.top500.org.

[TSS96] B. Topol, V. Sunderam, and J. Stasko. PVaniM 2.0, 1996. http://www.cc.gatech.edu/gvu/softviz/parviz/pvanimOL/pvanimOL.html.

[WA98] G. Watson and D. Abramson.
Finding Errors in Data Parallel Programs: A Case Study. May 1998. http://www.rdt.monash.edu.au/~greg/papers/sc98.html.

[WAF02] P. H. Welch, J. R. Aldous, and J. Foster. CSP networking for Java (JCSP.net). Lecture Notes in Computer Science, 2330, 2002.

[XML98] Extensible markup language (XML) 1.0. Technical Report REC-xml-19980210, W3C, February 1998.

[XWXS96] J. Xiong, D. Wang, W. Zheng, and M. Shen. BUSTER: An Integrated Debugger for PVM. Proceedings of the 1996 IEEE Second International Conference on Algorithms and Architectures for Parallel Processing, ICAPP '96, Singapore, pages 124-129, June 1996.

[Yee96] Bennet Yee. A Portable save_world Process Checkpointing Package, 1996. http://www.cs.ucsd.edu/users/bsy/fun.html.

Appendix A

A Complete Example of a Millipede Session

In this section, we illustrate a complete example of how to use Millipede to extract a single process from a parallel program, and how to debug such a process sequentially. We consider a master/slave application and extract one of the slave processes. The steps are as follows, and Figure A.1 shows how this session looks in Millipede:

1. First, we compile both the master program and the slave program with the -DMILLIPEDE option set. This ensures that the Millipede versions of the message passing calls execute when the parallel program executes (see the sketch following this list).

2. If the environment variable MILLIPEDE_RCM is set, log files are generated when the program executes. The program can execute as it normally would, or through Millipede. In this example we show an execution through Millipede.

3. Millipede is started and the name of the master program is passed as a parameter.

4. Using the Millipede command project <name>, we specify a project file, which contains all the information about the execution for future debugging purposes.

5. The parallel program can now be executed using the run command.

6. After exiting Millipede, we can unset the MILLIPEDE_RCM environment variable and set MILLIPEDE_REM. This instructs the Millipede runtime system that instead of writing log files when executing the message passing calls, log files are read.

7. We can now apply any sequential debugging tool; in this case we execute Gdb with one of the slave processes.

8. When the first PVM call executes, the Millipede runtime system prompts the user for the name of a log file from which the messages are supplied.

9. Sequential debugging now commences as if the program were a sequential program. The Millipede runtime system reads the log file each time a message passing call is made in the code, and supplies the program with values for the variables received through the messages. The programmer can debug, recompile, and re-execute the process with the message log until the errors have been corrected. If the programmer wishes to debug the same process with another set of messages, the program can be restarted with a different log file.
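How the -DMILLIPEDE option takes effect is internal to the Millipede runtime library; the fragment below is only a minimal sketch of one common way such call interposition can be arranged in C. The wrapper names millipede_recv and millipede_upkint are hypothetical and are not Millipede's actual API; only the PVM calls and environment variables are taken from the text above.

    /* Sketch only: when compiled with -DMILLIPEDE, PVM calls are redirected to
     * wrappers that either write message logs (MILLIPEDE_RCM set) or replay
     * messages from a log file (MILLIPEDE_REM set). */
    #ifdef MILLIPEDE
    int millipede_recv(int tid, int msgtag);               /* log or replay a receive      */
    int millipede_upkint(int *np, int nitem, int stride);  /* unpack from a logged message */

    #define pvm_recv(tid, msgtag)          millipede_recv((tid), (msgtag))
    #define pvm_upkint(np, nitem, stride)  millipede_upkint((np), (nitem), (stride))
    #endif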
    (1) gcc -g -DMILLIPEDE -I. -L$PVM_ROOT/lib/$PVM_ARCH/ -o Master Master.c -lpvm3
    (2) gcc -g -DMILLIPEDE -I. -L$PVM_ROOT/lib/$PVM_ARCH/ -o Slave Slave.c -lpvm3
    (3) setenv MILLIPEDE_RCM
    (4) pvm
        pvm> quit
        pvmd still running
    (5) Millipede Master
        WELCOME TO THE
        [Millipede ASCII-art banner]
        MULTI-LEVEL DEBUGGING SYSTEM
        (0)|MILLIPEDE> project Master-slave.prj
        Project file is 'Master-slave.prj'
        (0)|MILLIPEDE> run
        Program 'Master' terminated normally; 0 messages in 0 queues.
        (0)|MILLIPEDE> project
        Project
          Project name: Master-slave.prj
          1 Master.c process(es):  Group / Instance / Tid
                                     0  /    0      / 242165
          4 Slave.c process(es):   Group / Instance / Tid
                                     0  /    0      / 242167
                                     0  /    1      / 242169
                                     0  /    2      / 242171
                                     0  /    3      / 242173
        End of Project
        (0)|MILLIPEDE> exit
    (6) unsetenv MILLIPEDE_RCM
        setenv MILLIPEDE_REM
    (7) gdb Slave
        GNU gdb 5.2
        (gdb) break main
        Breakpoint 1 at 0x15af8: file Slave.c, line 19.
        (gdb) run
        19      mytid = pvm_mytid();
        (gdb) next
        Replay file name: Slave-242171.rpf

Figure A.1: A complete example of a debugging session using a sequential debugging tool on an extracted process.

Appendix B

The PCSL Grammar and Semantics

B.1 The PCSL Grammar and Semantics

This section contains the grammar and the semantics for the Protocol Constraint Specification Language (PCSL). Figure B.1 shows the BNF grammar for PCSL. Note, the symbol ε is not a symbol in the grammar, but simply means that the left hand side of the production can be substituted with nothing, that is, it can be left blank. Figure B.2 contains a natural semantics for the PCSL grammar.

    Protocol        ::= Commlist
    Commlist        ::= ε | Commlist Comm
    Comm            ::= LeftClass "->" RightClass Quantifiers ";"
    LeftClass       ::= Identifier "[" Index "]" "{" Index "}" "(" Index ")"
    RightClass      ::= Identifier "[" ClassExpression "]" "{" ClassExpression "}" "(" ClassExpression ")"
    Quantifiers     ::= ε | QuantifierList
    QuantifierList  ::= Quantifier | Quantifier "," QuantifierList
    Quantifier      ::= "forall" Identifier ":" RelExpression
    Index           ::= ε | Number | Identifier
    ClassExpression ::= ε | Expression
    Expression      ::= Expression "*" Expression | Expression "/" Expression
                      | Expression "+" Expression | Expression "-" Expression
                      | Expression "%" Expression | Expression "^" Expression
                      | "sqrt(" Expression ")" | "-" Expression | "(" Expression ")"
                      | Identifier | Number
    RelExpression   ::= Expression "<" Expression | Expression "<=" Expression
                      | Expression ">" Expression | Expression ">=" Expression
                      | Expression "=" Expression | Expression "!=" Expression
                      | RelExpression "&&" RelExpression | RelExpression "||" RelExpression
                      | "(" RelExpression ")" | "true" | "false"

Figure B.1: The PCSL BNF grammar.

    E[[Number]]σ     = Number
    E[[Identifier]]σ = σ(Identifier)
    E[[e1 * e2]]σ    = E[[e1]]σ * E[[e2]]σ
    E[[e1 / e2]]σ    = E[[e1]]σ / E[[e2]]σ
    E[[e1 + e2]]σ    = E[[e1]]σ + E[[e2]]σ
    E[[e1 - e2]]σ    = E[[e1]]σ - E[[e2]]σ
    E[[e1 % e2]]σ    = E[[e1]]σ mod E[[e2]]σ
    E[[e1 ^ e2]]σ    = exp(E[[e1]]σ, E[[e2]]σ)
    E[[sqrt(e)]]σ    = sqrt(E[[e]]σ)
    E[[-e]]σ         = -E[[e]]σ
    E[[(e)]]σ        = E[[e]]σ

    R[[true]]σ       = true
    R[[false]]σ      = false
    R[[e1 < e2]]σ    = E[[e1]]σ < E[[e2]]σ
    R[[e1 <= e2]]σ   = E[[e1]]σ <= E[[e2]]σ
    R[[e1 > e2]]σ    = E[[e1]]σ > E[[e2]]σ
    R[[e1 >= e2]]σ   = E[[e1]]σ >= E[[e2]]σ
    R[[e1 = e2]]σ    = E[[e1]]σ = E[[e2]]σ
    R[[e1 != e2]]σ   = E[[e1]]σ ≠ E[[e2]]σ
    R[[r1 && r2]]σ   = R[[r1]]σ ∧ R[[r2]]σ
    R[[r1 || r2]]σ   = R[[r1]]σ ∨ R[[r2]]σ
    R[[(r)]]σ        = R[[r]]σ

Figure B.2: Semantics for the PCSL grammar.
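As a small worked illustration of how these rules compose, consider the quantifier constraint 0<i && i<7 from the example in Section B.2, evaluated in an environment σ with σ(i) = 0:

    R[[0<i && i<7]]σ = R[[0<i]]σ ∧ R[[i<7]]σ
                     = (E[[0]]σ < E[[i]]σ) ∧ (E[[i]]σ < E[[7]]σ)
                     = (0 < 0) ∧ (0 < 7)
                     = false

This is exactly the quantifier violation that MOPED reports for i == 0 in Figure B.3.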
B.2 A Complete Example Using PCSL/MOPED

In this section we illustrate the use of the MOPED module in Millipede. Figure B.3 shows how to activate MOPED within Millipede. The protocol specification is the same as used in Chapter 7. A protocol specification file can be loaded when Millipede is in MOPED mode by using the command load. The list command displays the content of the specification currently loaded. Once the program executes, any violations of the protocol are reported. If the user wishes to know more about the message that violated the protocol, such information can be obtained through a query.

    (0)|MILLIPEDE> moped
    (0)|MILLIPEDE\MOPED> load "protocol.pcs"
    (0)|MILLIPEDE\MOPED> list
    1: master[0]0(MS)  -> slave[0]i(MR)   :: forall i (0<=i && i<=7);
    2: slave[0]i(SS)   -> slave[0]0(MR)   :: forall i (0<=i && i<=7);
    3: slave[0]i(S1)   -> slave[0]i+1(R1) :: forall i (0<=i && i<7);
    4: slave[0]i(S2)   -> slave[0]i-1(R2) :: forall i (0<i && i<=7);
    (0)|MILLIPEDE\MOPED> exit
    (0)|MILLIPEDE> run
    MOPED: Protocol violation:
       Quantifier error: (0<i && i<7) violated by i == 0 [ (0<0 && 0<7) ]
       in file..: slave.c
       group....: 0
       instance.: 0
       line.....: 34
       TID......: 242167
    (5)|MILLIPEDE>

Figure B.3: A complete example using MOPED to check all messages against a protocol specification written in PCSL.

Appendix C

The MQL Grammar

C.1 The Millipede Query Language Grammar

Figure C.1 shows the BNF grammar for the Millipede Query Language. All MQL definitions are stored in an internal function table in Millipede, and retrieved when needed. During execution of a query, a local environment is maintained, containing any intermediate relations created using let. This local environment, along with all the relations created during evaluation of a statement or application, is then removed. It is currently not possible to add any permanent relations to the runtime system.

    Command       ::= Application | Definition | Query
    Application   ::= Name "(" [ Arguments ] ")"
    Arguments     ::= Value | Value "," Parameters
    Definition    ::= "define" Name "(" [ Parameters ] ")" "as" Query
    Parameters    ::= Name | Name "," Parameters
    Query         ::= QueryElement | "begin" [ QueryList ] QueryElement "end"
    QueryList     ::= QueryElement | QueryElement ";" QueryElement
    QueryElement  ::= Query
                    | "print" "(" String { "," Value } ")"
                    | "display" Relation "using" String
                    | "let" Name "be" Relation
    Op            ::= "==" | "!=" | "<" | "<=" | ">" | ">="
    AttributeList ::= Name | AttributeList "," Name
    Relation      ::= Name
                    | "select from" Relation "where" Name Op Value
                    | "project" Relation "over" AttributeList
                    | "rename" Name "in" Relation "to" Name
                    | "join" Relation "with" Relation
                    | "union of" Relation "and" Relation
                    | "difference between" Relation "and" Relation
                    | "sort" Relation "by" AttributeList

Figure C.1: The Millipede Query Language BNF grammar.
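To make the grammar concrete, here is a small illustrative MQL definition and application written against Figure C.1. The relation name messages and the attribute size are hypothetical, and the operand of the where clause, as well as the placement of semicolons between query elements, follows the reconstruction of the grammar given above rather than any query from the thesis itself.

    define large(n) as
    begin
        let big be select from messages where size >= n;
        display sort big by size using "table"
    end

    large(1024)

The definition would be stored in Millipede's internal function table; the application on the last line evaluates it, and the intermediate relation big is discarded when the query finishes.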
Appendix D

Millipede Screen Shots

In this chapter we show a number of the most common windows in Millipede. Figure D.1 shows the start up screen of Millipede. This is the main window for interacting with the debugging system. Commands for controlling the execution are issued here. MQL scripts are loaded from here, and protocol specifications for the protocol assertion module can be controlled from this window as well. Figures D.2 and D.3 show an example of interactive debugging and the status monitor, respectively.

Figure D.1: The Millipede startup screen.

Figure D.2: Screen shot illustrating interactive message debugging.

Figure D.3: Screen shot showing the status monitor. This window shows what each process in the system is doing with respect to communication.

Appendix E

Theoretical Framework for The Buffer Allocation Problems

In this section we define the three buffer allocation problems formally, present the theoretical graph framework, and the proofs.

E.1 Definitions

Let S be a multiprocess system with n processes and E_i communication events occurring in process i; a communication event is either a send or a receive. A multiprocess system S is unsafe if a deadlock can occur due to an insufficient number of available buffers; if S is not unsafe, then S is said to be safe. Figure E.1 is an example of an unsafe system. The numbers above the graph in Figure E.1 represent the buffer assignment.

Figure E.1: Order of execution can cause deadlock.

A per-process buffer assignment is an n-tuple B = (b_1, b_2, ..., b_n) of nonnegative integers representing the number of buffers that can be allocated by each process. Similarly, a per-channel buffer assignment is a q-tuple B = (b_1, b_2, ..., b_q), q = (n choose 2), representing the number of buffers each channel in the system can allocate. Since buffers take up memory, which may be needed by the application, ideally, as few buffers as possible should be allocated. However, allocating too few buffers results in an unsafe system.

Buffer utilization is the nondeterministic phenomenon of interest in the system. Making the choice of when to use a buffer affects future choices. For example, in Figure E.1, using a buffer for communication 1 before communication 3 completes results in deadlock. Two natural decision problems arise from this optimization problem. Given a system S and a nonnegative integer k, the Buffer Allocation Problem (BAP) is to decide if there exists a buffer assignment B such that S is safe and ∑ b_i ≤ k. In order to solve this problem we need to solve a simpler one.
Suppose we are given a buffer assignment B and a system S; the Buffer Sufficiency Problem (BSP) is then to decide whether the assignment is sufficient to make S safe. Additionally, we can require that no process in system S should ever block on a send. Given a system S and a nonnegative integer k, the Nonblocking Buffer Allocation Problem (NBAP) is to decide whether there exists a buffer assignment B, such that no send in S ever blocks, and ∑ b_i ≤ k.

We model systems by using communication graphs, and executions of systems by colouring games on these graphs. Communication graphs can be derived from execution traces of a program. The following subsection defines the graph based framework used throughout this section.

E.1.1 The Graph Based Framework

A communication graph of S is a directed acyclic graph G = G(S) = (V, A) where the set of vertices V = {v_{i,c} | 1 <= i <= n, 0 <= c <= E_i + 1} corresponds to the communication events and the arc set A consists of two disjoint arc sets: the computation arc set P and the communication arc set C. Each vertex represents an event in the system: vertex v_{i,0} represents the start of process i, vertex v_{i,c}, 1 <= c <= E_i, represents either a send or a receive event, and vertex v_{i,E_i+1} represents the end of a process. An arc (v_{i,c}, v_{i,c+1}) in P, 0 <= c <= E_i, represents a computation within process i, and an arc (v_{i,s}, v_{j,t}) in C represents a communication between different processes, i and j, where v_{i,s} is a send vertex and v_{j,t} is a receive vertex (e.g. Figure E.2). Note, the process arcs are drawn without orientation for clarity; they are always oriented downward. Communication graphs are comparable to the time-space diagrams, without internal events, noted in [Lam78].

The ith process component G_i of G is the subgraph G_i = (V_i, A_i) where V_i = {v_{i,c} in V | 0 <= c <= E_i + 1} and A_i = {(v_{i,c}, v_{i,c+1}) in A | 0 <= c <= E_i}. The process component corresponds to a process in S. We construct communication graphs by connecting process components with arcs. Hence, it is more intuitive to treat a process component as a chain of send and receive vertices bound by a start and an end vertex. A channel is represented by a channel pair (G_i, G_j) of process components.

Figure E.2: An example of a communication graph with a 2-ring.

A t-ring is a subgraph of a communication graph G(S), consisting of t > 1 process components, such that in each of the t process components there is a send vertex s_{i_j,c_j} and a receive vertex r_{i_j,d_j}, c_j < d_j, 1 <= j <= t, such that the arcs (s_{i_1,c_1}, r_{i_t,d_t}) and (s_{i_{j+1},c_{j+1}}, r_{i_j,d_j}), 1 <= j < t, are in A. This definition is equivalent to the definition of a crown in [CMT96]. A t-ring represents a circular dependence of alternating send and receive events; see the example in Figure E.3. The shaded arcs in Figure E.3 show how each receive event depends on the preceding send event and each send event depends on the corresponding receive event. Thus, without an available buffer, there is a circular dependency that results in the system deadlocking.

Figure E.3: Dependency cycle in G(S).

To model the execution of a system S, we define a colouring game that simulates the execution of the system with respect to G(S).

E.1.2 Colouring the Communication Graph

Given a communication graph G(S), an execution of a corresponding system S is represented by a colouring game where the goal is to colour all vertices green; a green vertex corresponds to the completion of an event.
We use three colours to denote the state of each event in the system: a red vertex indicates that the corresponding event has not yet started, a yellow vertex indicates E.1. Definitions 144 that the corresponding event has started but not completed, and a green vertex indicates that the corresponding event has completed. Hence, a red vertex must first be coloured yellow before it can be coloured green; this corresponds to a traffic light changing from red, to yellow, to green. 1 We use tokens to represent buffer allocations. A buffer assignment of a process (or channel) is represented by a pool of tokens associated with the corresponding process component (respec-tively, the channel component). An instance of buffer utilization is represented by removing a token from a token pool and placing it on the corresponding communication arc. The colouring game represents an execution via the following rules. Initially, the start vertices of G are coloured green and all remaining vertices are coloured red; this is called the init ial colouring. send->yei A red send vertex may be coloured yellow if the preceding vertex is green-the send is ready. recv->yel A red receive vertex may be coloured yellow if the corresponding send vertex is yellow, and the preceding vertex (in the same process component) is green-both the send and the receive are ready. recvAye/ A red receive vertex may be coloured yellow if the corresponding send vertex is yellow, and a token from the corresponding token pool is placed on the incident communication a rc - the send is ready and a buffer is used. send^grn A yellow send vertex may be coloured green if the corresponding receive vertex is coloured yellow—the communication has completed from the sender's perspective. recv^grn A yellow receive vertex may be coloured green if both of its preceding vertices are green. If the incident communication arc has a token, the token is returned to its token poo l -a receive completes after the send completes. end-^yel A red end vertex may be coloured yellow if the preceding vertex is green. end^grn A yellow end vertex may be coloured green. Buffer util ization is represented by placing a token from the token pool on the selected arc, and colouring the corresponding receive vertex yellow. If no tokens are available, the rule cannot be invoked. A colouring of G, denoted by x , is a colour assignment to al l vertices, which can be obtained by repeatedly applying the colouring rules, starting from the init ial colouring. A colouring sequence £ = (xi» X2, •••) is a sequence of colourings such that each colouring is derived from the preceding one by a single application of one of the colouring rules. An execution of a multiprocess system 1 Naturally, we refer to a European traffic light. E.2. Useful Lemmas 145 S with buffer assignment B is represented by a colouring sequence on G(S). Each transition, from one colouring to the next, within a colouring sequence, corresponds to a state change of an event in the corresponding execution. Assuming that all events in the system are ordered, there is a correspondence between the colouring sequences on G(S) and the executions of system S. Using the correspondence between colouring sequences on G(S) and executions of system S, we reason about system 5 by reasoning about colouring sequences on G(S). We say that a colouring sequence completes if and only if the last colouring in the sequence comprises only green vertices. 
A colouring sequence deadlocks if and only if the last colouring in the sequence has one or more nongreen vertices and the sequence cannot be extended via the application of the colouring rules. A system S is safe if and only if every colouring sequence on the graph G(S) completes. We say that a colouring sequence blocks if there exists a sequence on G{S), ending with a colouring containing a yellow send vertex, that cannot be extended by applying rule recvAye/ to the corresponding receive vertex. A colouring sequence is block f ree if every prefix of the sequence does not block; a communication graph G, is block free if al l colouring sequences on it are also block free. If G(S)) is block free, then no send in S wi l l ever block during an execution. A token assignment, also denoted by B, is a list of nonnegative integers, denoting the number of tokens assigned to each token pool; the token assignment is the abstract representation of a buffer assignment. The number of tokens required depends on the number of times that rule recvAye/ may be invoked. If a token pool is empty, this means al l buffers are in use. E.2 Useful Lemmas The following lemmas are used throughout our proofs. Lemma E.2.1 characterizes the condi-tions under which a colouring sequence wil l deadlock. Lemma E.2.2 characterizes conditions under which a single colouring sequence may represent all possible colouring sequences. Finally, Lemma E.2.3 characterizes a class of communication graphs on which no colouring sequence wil l deadlock. Lemma E.2.1 (The t-Ring Lemma) Let G be a communication graph comprising a single t-ring. Any colouring sequence on G completes if and only if rule recvAye/ is invoked at least once. Proof: Assume by contradiction that there exists a complete colouring sequence £ that does not make use of rule recvAye/. Consider the first colouring in S where one of the send vertices is green; call the vertex s». Let rj be the corresponding receive vertex. According to rule send->grn, the vertex rj must be yellow. Since rule recvAye/ has not been applied, rule recv-^yel must have been invoked earlier in the sequence. By the definition of a t-ring, the send vertex Sj must be the predecessor of rj. Since the rule recv^yel was applied to r,-, Sj must be green. Hence, there is an earlier colouring in S with a green send vertex. This is a contradiction. E.2. Useful Lemmas 146 In the other direction, if rule recvAye/ is invoked on receive vertex rjt then rule send^grn can be invoked on the corresponding send vertex SJ, breaking the circular dependency. • Define the dependency graph of communication graph G = (V, A) to be H = (V, E) where al l process arcs in A are reversed in E and all communication arcs in A are bidirectional in E. Define the depth d(y) of a vertex v e V to be the length of the maximum length path in H from v to a start vertex. Lemma E.2.2 Let G be communication graph with a token assignment of 0. For any vertex v in G, if there exists a colouring sequence that colours vertex v green, there does not exist a colouring sequence that deadlocks before colouring v green. Proof: Proof by contradiction. Assume that there exist two colouring sequences, such that one colouring sequence colours a vertex green and the other deadlocks and does not colour the vertex green. Let v e V be such a vertex of minimum depth; that is, all vertices of lesser depth wil l be coloured green eventually by any colouring sequence on G. In order for a vertex to be coloured green, its component predecessor must be green. 
Since the depth of the predecessor is less than the depth of v, it can always be coloured green. Furthermore, since a send and its corresponding receive vertex are adjacent to each other, their depths differ by at most 1. Since v must be a communication vertex, by rules send->grn and recv^grn, the adjacent communication vertex t must be coloured yellow before v can be coloured green. If vertex t is of a lesser depth than v, then t must be green colourable in all colouring sequences; hence, v must also be green colourable. If t is at the same depth as v, then its component predecessor is at a lesser depth and must be green colourable, hence t is yellow colourable, and v is green colourable. If t is at a greater depth than v, the component predecessor of t, say u, is at the same or a lesser depth than t. If the latter, then u is green colourable and t is yellow colourable, otherwise, we apply the same argument to u first. Since there is no path from u to v in i l - b e c a u s e d(u) < d(v)-we need only recurse a finite number of times. • Lemma E.2.3 If G is a communication graph whose dependency graph is acyclic, then no colour-ing sequence on G will deadlock. Proof: Proof by contradiction. Assume that a colouring sequence deadlocks on G. Let v be the vertex of minimum depth that cannot be coloured green. If v is a send (receive) vertex, let u be the corresponding receive (send) vertex. Let vertex t be the component predecessor of vertex u and let vertex w be the component predecessor of vertex v. Since the dependency graph is acyclic, the depths of both t and w are less than the depths of u and v. Hence, both t and w may be coloured green based on our minimality assumption. However, then both u and v may be coloured green; this is a contradiction! If v is an end vertex, then it has only one predecessor, which is of a lesser depth, which leads to the same contradiction. • E.3. Buffer Allocation in Systems with Receive Side Buffers 147 E.3 Buffer Allocation in Systems with Receive Side Buffers In systems with receive side buffers, messages are buffered only by the receiver. Buffers are allocated by the receiving process when a message arrives, but cannot be received, and are freed when the message is received by the application. Analogously, when colouring a receive vertex of the corresponding communication graph, only a token belonging to the same process component may be used. We call this the receive side al location scheme. E.3.1 The Buffer Allocation Problem In order to prevent deadlock in distributed applications, the underlying system needs to allocate a sufficient number of buffers. Ideally, i t should be the minimum number required. Unfortunately, determining the number of buffers required to make the system safe is intractable. The corresponding graph-based decision problem is this: given a communication graph G and a positive integer.A;, determine if there is a token assignment of size k such that no colouring sequence deadlocks on G. We show that BAPr is N P - h a r d by a reduction of the wel l known 3SAT problem [Coo71] to B A P r . Recall the definition of 3SAT: determine if there exists a satisfying as-signment to A " = i ( a i v ^ V c j ) ' where ait bi} and c» are Boolean literals in {x1,xi,x2,x2,... ,xn, xn}. Theorem E.3.1 The Buffer Allocation Problem (BAPr) is NP-hard. Proof: Proof by reduction of 3SAT to BAPr. 
For any 3SAT instance F we construct a corresponding communication graph G such that for a token assignment of size n, any colouring sequence completes on G if and only if the corresponding variable assignment satisfies F. Let F be an instance of 3SAT with n variables and c clauses; the variables are denoted X\ , x2,..., xnt and the j th clause is denoted (a, V bj V CJ), where ahbj,Cj e {xx,xlt...,xn,xn}. The corresponding communication graph G comprises 2n + 1 process components: 2n of the components-cal led l i teral components-are labeled PXi and PSi, i = l...n, and correspond to the literals of F. The last component-cal led the barr ier component- is labeled Pbamer-Each process component is divided into c + 1 epochs, where each epoch is a consecutive sequence of zero or more vertices within the component. All epochs are synchronized, that is, the vertices of one epoch must be coloured green before any of the vertices in the next epoch may be coloured. To ensure this we use a barrier component; the j t h epoch of the barrier component, j = 0 , . . . , c, comprises 2n receive vertices, labeled qij, and 2n send vertices, labeled tij, I € {xi,xi,... ,xn,xn}. At the end of each epoch there is an arc from each of the literal components P ( , I e {xi,xi,...,xn,xn}, to the barrier component. Each arc emanates from vertex sij, called a barrier send vertex, and is incident on vertex qtj, where I e {xi,xly.. .,x„,xn} and j = 0 . . . c. These arcs are followed by arcs emanating from the barrier component to the literal components; the arcs emanate from vertices tij and are incident on vertices rt<j, called barrier E.3. Buffer Allocation in Systems with Receive Side Buffers 148 receive vertices. The barrier component has no cyclic dependencies. Hence, by Lemma E.2.3, no colouring sequence wil l deadlock on a barrier component. Epoch 0 fixes a token assignment corresponding to a variable assignment in 3SAT. Each pair of process components, PXi and PSi,i = l...n, forms a variable component, which corresponds to a variable. The two process components of a pair share a 2-ring; see Figure E.4. By Lemma E.2.1, at least one token must be assigned to either process component PXi or PXi to prevent all colouring sequences from deadlocking on G. Since only n tokens are available, each component pair can be assigned exactly one token. Finally, assigning the token to process component, PXi or PXi, corresponds to fixing variable Xi to true or false. The epoch terminates with a barrier send vertex s;,.,o, followed by a barrier receive vertex r j i i 0 , k e {xi,Xi}. Epoch j of each process component corresponds to the j th clause of F. The epoch of a process component Pi, I ^  o3, b}, c,—not labeled by a literal of the j th clause-contains only two vertices: the barrier send vertex stj and the barrier receive vertex rij. The three process components, Paj, Pbj, PCj, whose labels correspond to the literals in the j th clause share a 3-ring in the j th epoch; see Figure E.4. By Lemma E.2.1, to avoid deadlock, at least one of the three process components must have a token. If none of the components are assigned a token, all literals in the j t h clause are false. The epoch is terminated by the barrier send and the barrier receive vertices. A satisfying assignment on F satisfies at least one literal in every clause. A corresponding token E.3. Buffer Allocation in Systems with Receive Side Buffers 149 assignment assigns a token to the corresponding process component in each 3-ring—corresponding to the true l iteral. 
Hence, by Lemma E.2.1 none of the colouring sequences wil l deadlock on any of the clause component and any colouring sequence on G wi l l complete. For a falsifying assignment of F, there exists at least one clause comprising false literals. The corresponding token assignment fails to assign any tokens to the process components that share the corresponding 3-ring. Thus, by Lemma E.2.1 all colouring sequences wi l l deadlock in that clause component. Hence, for a token assignment of size n, any colouring sequence on G wi l l complete if and only if the corresponding assignment satisfies F. Since finding a token assignment of size n such that no colouring sequence on G deadlocks is as hard as finding a satisfying assignment for F, BAP r is NP -hard. • E.3.2 The Buffer Sufficiency Problem A potentially simpler problem involves verifying whether a given buffer assignment is sufficient to prevent deadlock. Formally, given a graph G and a token assignment on G, determine if none of the colouring sequences on G deadlock. This problem turns out to be intractable as wel l . We show that BSPr is coNP-complete by a reduction from the TAUTOLOGY problem [GJ79, Page 261] to BSPr. Given an instance of a formula in disjunctive normal form (DNF), V*=i Aj=i ai,j where O i j e . . . ,xn,xn}, the formula is a tautology if it is satisfied by all assignments. An assignment that falsifies F is a concise proof that the formula is not a tautology. We shall restrict our attention to 3DNF formulas, where each term has three literals: V L i ( f l i A h A Q ) . Theorem E.3.2 The Buffer Sufficiency Problem (BSPr) is coNP-complete. Proof: Let G be a communication graph along with a token assignment. If there exists a deadlocking colouring sequence on G, then the sequence itself is a certif icate. The sequence is at most twice the number of vertices in G. Hence, BSPr is in coNP. Let F be a 3DNF formula with t terms where each term has three literals. For any 3DNF formula F, we construct a communication graph G and fix a token assignment such that there is a colouring sequence on G that deadlocks if and only if the corresponding assignment falsifies F. The construction consists of four types of components that correspond to fixing an assignment, a term in the disjunction, the disjunction, and the interconnects between components. Each variable in F is represented by a variable component comprising three process compo-nents that are labeled PXi, PXi, and Qt. The latter, called the arbitrator component, comprises three receive vertices, labeled qit rXi, and rXi. The former two process components, called variable components, comprise two send vertices each. The first, labeled sXi (sSi), is adjacent to the corresponding receive vertex rXi (rXi) in the arbitrator component. The second, labeled txi (tXi), is adjacent to receive vertices in components called dispersers, described later. The E.3. Buffer Allocation in Systems with Receive Side Buffers 150 vertex qi in the arbitrator component is similarly adjacent to a vertex in a disperser component. The corresponding token assignment for each variable component assigns one token to Qi and no tokens to the other two components; see Figure E.5. The component has the following property: Figure E.5: The construction of the components. Property E.3.3 Let G be a communication graph that contains a variable component. Any colouring sequence on G may colour exactly one of the two vertices tXi or tXi yellow before vertex qi is coloured green. 
Proof: By rule send^yel, in order for tXi (tXi) to be coloured yellow, vertex sXi (sSi) must be coloured green. Hence, by rule send^grn, vertex rXj (rXi) must first be coloured yellow. Since vertex is red, vertex rXi (rXi) can only be coloured yellow via rule recvAye/ . However, there is only one token assigned to process component Qit hence rule recvAye/ may only be invoked once. • The j th term in the disjunction is represented by a term component comprising a process component, which is called the term component and labeled Pj. The first part of each term component consists of a send vertex SJ and a receive vertex ry, these vertices are part of a t-ring. In the first term component, P i , there is an additional send vertex labeled s d o n e ; these are described in the next paragraph. The second part of each term component consists of three receive vertices labeled r^aj, r2>bi, and r^Cj, where a,j,bj,Cj e {xx,xi,... ,xn,xn} correspond to the literals in the j th term; see Figure E.5. These receive vertices are adjacent to send vertices in components called dispersers, which are described later. The term components are used to construct a disjunction component. The disjunction component comprises t term components, where the first two vertices, SJ and rjt are part of a t-ring spanning all t components. Specifically, each send vertex SJ, j < t, is E.3. Buffer Allocation in Systems with Receive Side Buffers 151 adjacent to receive vertex rj+x and vertex st is adjacent to receive vertex n; see Figure E.5. Each term component is assigned one token. The disjunction component has the following property. Property E.3.4 Let G be a communication graph that contains a disjunction component. Any colouring sequence on G can colour rj} j e [1, t], green if and only if at least one of rk, k e [1, t], is coloured yellow before any rk,ak, rkibk, or rk,Ck are coloured yellow. Proof: By Lemma E.2.1, vertex rj can be coloured green, if and only if rule recvAye/ is invoked, colouring one of the receive vertices rk, k e [l,t], yellow. The rule may only be invoked if and only if a token is available. Since each term component only has one token assigned and since vertex rk precedes vertices rkiak, rktbk, and rk,Ch, a token is available if and only if none of the vertices rk,ak, rk>bk, and rktCk, are coloured yellow via rule recvAye/, before vertex rk is coloured yellow. • Once vertex rk, k e [l,t], is coloured yellow, all rjt j = l...t may be coloured green, and vertex s d o n e may be coloured yellow. We now describe how the components are connected together using disperse components. Let s be a send vertex and R be a set of receive vertices. An (s,.R)-disperser comprises \R\ + 1 process components: one master component, labeled Ms, and \R\ slave components labeled Sr, r e R. The master component comprises one receive vertex labeled rs, followed by \R\ send vertices labeled sr, r e R. Each send vertex is adjacent to the receive vertex on the corresponding slave component Sr. Each slave component has two vertices: a receive vertex qr, followed by a send vertex tr; see Figure E.6. The latter vertex is adjacent to the receive vertex r in some other component. None of the components are assigned any tokens. The following property of a disperser follows from Lemma E.2.3. sVl sV2 sx 0 0 0 0 Figure E.6: The disperser component. Property E.3.5 Let G be a communication graph containing an (s,R)-disperser. 
If a colouring sequence colours vertex rs yellow, then the colouring sequence can be extended to colour all vertices tr, r e R yellow. E.3. Buffer Allocation in Systems with Receive Side Buffers 152 Let RXi, i = l...n, be the set of receive vertices labeled rjiXi e Pj, j £ [l,t], and let RXi be similarly defined; recall that a,j,bj,Cj are simply l i teral place holders in the vertex labels rj,aj, rj,bj, rjtCj. Hence, a ( t^-R^J-d isperser connects send vertex tXi e PXi to vertices in RXi — belonging to the term components. Furthermore, let Q be the set of receive vertices qi (in the variable components), i = 1.. . n ; a (s d o n e , (?)-disperser connects vertex s d 0 n e to all variable components via receive vertices qi. The construction of G comprises n variable components and one disjunction component, composed of t term components; these are connected together by a (s d 0 ne ,Q)-disperser, and 2n (ta,#a)-dispersers, where a 6 {xx,xi,... ,xn,xn}. We claim that there exists a colouring sequence that deadlocks on G if and only if there is a falsifying assignment for formula F, that is, F is not a tautology. Suppose that F has a falsifying assignment x, that is every term in the disjunction is false because each term has a literal x, or xit which is false. To construct a colouring sequence on G that deadlocks, we construct a set of vertices U. The first half of the colouring sequence is a maximal colouring sequence involving only the vertices of U. The second half of the sequence may involve al l vertices in G. The resulting colouring sequence wil l always deadlock. Let X = {a e {xi,xi,..., xn,xn} \ a\x — 0}, which is the set of literals that are false, and let Z = {sa e Pa | o £ X) u {SJ | j = 1, . . . ,t}, which contains the set of send vertices from the variable components that are labeled by a true literal and the numbered send vertices in the disjunction component; the set Z contains the vertices which may not initially be coloured. Let U = V\Z be the rest of the vertex set. Consider a colouring sequence involving only vertices in U. By property E.3.3 any maximal colouring sequence wil l colour the vertices ta yellow (in the variable component), where a e X. Hence, by property E.3.5 the vertices tr (in the dispersers) wi l l be coloured yellow, where r G Uaex -Ra—the send vertices tr in the dispersers are adjacent to the receive vertices in Ra-Since a; is a falsifying assignment, every term contains a l i teral, which is falsified by x. Without loss of generality, let O j denote a literal that is false in the j th term; therefore, process component Pj contains a receive vertex r^ai, which is adjacent to the yellow send vertex trj (in the disperser). Since none of the vertices of the t-ring (in the disjunction component) are not in {/- they are stil l coloured red- the token belonging to component Pj is used to apply rule recvAye/ to colour vertex r^ai yellow. Since every term has a false l i teral, the colouring sequence colours a receive vertex r j a i , j = 1. . A in every term component Pj. After the sequence cannot be extended, allow al l vertices to be coloured; since vertices rjtai (in the term components), j = 1 . . . t, have been coloured yellow before vertex rj (in term component Pj), according to property E.3.4, the sequence wil l deadlock. If a colouring sequence on G deadlocks, according to property E.3.4, deadlock occurs only if there is a yellow vertex labeled rk,ak, rk,bk, or rk,Ck in each of the term components. Their E.3. 
Buffer Allocation in Systems with Receive Side Buffers 153 predecessors-vertices U, I e {xi,xi,... ,xn,xm}, in the dispersers-must be green. Since the colouring sequence is maximal, by property E.3.3 exactly one of tXi or tXi is red, thus this corresponds to a valid assignment: setting Xi = 0 if tXi is green, or Xi = 1 if tXi is green yields an assignment that falsifies F. Thus, a colouring sequence on G deadlocks if and only if the corresponding assignment falsifies F. Hence, BSPr is coNP-comple te . • Therefore, just determining whether a buffer assignment is sufficient is intractable, even one as simple as in the preceding example. Intuitively, the buffers of a process are assigned based on the behaviour of other processes; thus, buffer util ization is not locally decidable. Further, the order in which buffers are assigned is nondeterministic, exploding the search space of possible buffer utilizations. This phenomena, which our proofs rely on, is what we call buffer stealing. For example, in a system corresponding to the variable component (see Figure E.5), the first process to send its message gets the buffer, and the other process remains blocked until the arbitrator performs the receives. This stealing corresponds to fixing a value of a variable. Similarly, the system corresponding to the disjunction component allocates buffers for each of the term processes. However, if the buffer is stolen in al l terms, corresponding to a falsifying assignment, then the system wil l deadlock within the ring. For completeness, we note the following corollary: Corol lary E.3.6 The Buffer Allocation Problem (BAPr) is in S 2 P -Proof: By Theorem E.3.2, verifying that a token assignment is sufficient to prevent deadlock (BSPr) is coNP-comple te . Since we can nondeterministically guess a sufficient token assignment, the result follows. • E.3.3 The Nonblocking Buffer Allocation Problem In addition to the system being safe, we can require that no sending process ever blocks due to insufficient buffers on the receiving process. The Nonblocking Buffer Allocation Problem (NBAP r) is to determine the minimum number of buffers needed to achieve nonblocking sends. Formally, the corresponding decision problem is this: given a communication graph G and an integer k, determine if there exists a token assignment of size k such that no colouring sequence on G blocks. Recall that a colouring sequence does not block if, whenever a send vertex is coloured yellow, rule recvAye/ may be applied to the corresponding receive vertex. Let Pi and Pj, j ^  i, be two process components. Given two vertices, vitC and viit, in Pit t > c, vertex vitt is communicat ion dependent on vertex vi<c if viiC is the start vertex or if there exists a vertex vjid e Pj, such that there is a path from viiC to vjtd and the arc (vjtd,vi:t) is in A (see Figure E.7). Vertex vi>t is terminal ly communicat ion dependent on vertex vitC if vitt E.3. Buffer Allocation in Systems with Receive Side Buffers 154 is communication dependent on v»iC and is not communication dependent on the vertices viti, c < I <t. The algorithm depicted in Figure E.8 computes an optimal token assignment such that no colouring sequence on G can block. Figure E.7: vi>c+k is communication dependent on vi<c. 1. For each receive vertex viit determine its terminal communication depen-dency, vertex vitC, where t > c. 2. Set Iiit = [c, t] to be the interval between vertex vitC and vertex v;it. 3. For each process component Gi, compute bit the maximum overlap over al l intervals Iitt. 
4. B = (b_1, b_2, ..., b_n) is the optimal nonblocking token assignment.

Figure E.8: Algorithm for computing an optimal nonblocking buffer assignment.

Remark E.3.7 In a system corresponding to communication graph G, the time between a message arriving at process i and its receipt corresponds to the interval I_{i,t}. Each interval must have a buffer to ensure nonblocking sends. Hence, the minimum number of buffers, b_i, is the maximum overlap over all intervals within process p_i.

Computing the terminal communication dependencies for G can be done via dynamic programming in O(|V|n) time, where V is the vertex set of G and n is the number of process components. If there exists a path from vertex v_{i,c} to v_{j,d}, then there exists a path from v_{i,c} to all vertices v_{j,d+k}, k > 0. Associate with each vertex v_{i,c} an integer vector a_{i,c} of size n; a_{i,c}[j] = d means that there exists a path from v_{i,c} to v_{j,d}, and thus to v_{j,d+k}, k > 0. The vector a_{i,c} is computed by taking the element-wise minimums over the vectors of the vertices adjacent to v_{i,c}; this is simply a depth-first traversal of G. Since the number of arcs is bounded by 3|V|/2 and the pairwise comparison takes n steps, the traversal takes O(|V|n) time. Next, computing the O(|V|) intervals I_{i,t} requires one table lookup per interval. To compute the maximum overlap we sort the intervals and perform a sweep, keeping track of the current and maximum overlap; this takes O(|V| log |V|) time. Thus, the total complexity is O(|V|n + |V| log |V|) time. In the worst case, where n is close to |V|, this algorithm is quadratic. However, in practice n is usually fixed, in which case the |V| log |V| term dominates.
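The following is a minimal illustrative sketch, in C, of the sort-and-sweep step just described (step 3 of Figure E.8). It assumes the intervals I_{i,t} = [c, t] for one process component have already been derived from the terminal communication dependencies; the type and function names are illustrative and not taken from the Millipede implementation.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int start, end; } interval_t;   /* one receive interval I_{i,t} = [c, t] */

    static int cmp_int(const void *a, const void *b)
    {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Maximum overlap over m intervals: sort the endpoints and sweep,
     * keeping track of how many intervals are currently open. */
    static int max_overlap(const interval_t *iv, int m)
    {
        int *starts = malloc(m * sizeof *starts);
        int *ends   = malloc(m * sizeof *ends);
        int open = 0, best = 0, j = 0;

        for (int i = 0; i < m; i++) { starts[i] = iv[i].start; ends[i] = iv[i].end; }
        qsort(starts, m, sizeof *starts, cmp_int);
        qsort(ends,   m, sizeof *ends,   cmp_int);

        for (int i = 0; i < m; i++) {          /* sweep over interval start points */
            while (j < m && ends[j] < starts[i]) {
                j++;                           /* an earlier interval has closed   */
                open--;
            }
            open++;                            /* the current interval opens       */
            if (open > best)
                best = open;                   /* best is b_i, the buffers needed  */
        }
        free(starts);
        free(ends);
        return best;
    }

    int main(void)
    {
        /* Hypothetical receive intervals for one process component. */
        interval_t iv[] = { {0, 3}, {1, 4}, {2, 5}, {6, 7} };
        printf("b_i = %d\n", max_overlap(iv, 4));   /* prints b_i = 3 */
        return 0;
    }

Whether two intervals that merely touch at an endpoint count as overlapping is a modelling choice; the comparison ends[j] < starts[i] above treats them as overlapping, which errs on the side of allocating a buffer.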
E.4 Proof of Correctness of the Nonblocking Buffer Allocation Algorithm

Lemma E.4.1 Let G be a communication graph. For all vertices v_{i,c}, v_{j,d} in G, if v_{j,d} is a send vertex and there exists a path from vertex v_{i,c} to vertex v_{j,d}, then vertex v_{j,d} cannot be coloured yellow until vertex v_{i,c} is coloured green.

Proof: By rule send->yel, the predecessor of v_{j,d} must first be coloured green before v_{j,d} can be coloured yellow. Since rules send->grn and recv->grn imply that the predecessors of a green vertex must be green, the result follows. •

Corollary E.4.2 Let G, v_{i,c}, and v_{j,d} be as in Lemma E.4.1 and let v_{i,t} be the receive vertex corresponding to the send vertex v_{j,d}. Rule recvAye/ will never be applied to vertex v_{i,t} before vertex v_{i,c} is coloured green.

The preceding corollary implies that a token, which is needed to colour the receive vertex v_{i,t} yellow, need not be available until the vertex on which v_{i,t} is terminally communication dependent is coloured green. Hence, it is sufficient to ensure token availability just before colouring the respective send vertex green; this is also necessary.

Theorem E.4.3 Given G, let v_{i,c} be a send vertex and v_{i,t} be a receive vertex that is terminally communication dependent on vertex v_{i,c}. A token for the application of rule recvAye/ on arc (v_{j,d}, v_{i,t}) must be available as soon as vertex v_{i,c} is coloured green.

Proof: Let v_{j,d} be the send vertex corresponding to the receive vertex v_{i,t} and let Q = {v_{i,q} | c < q < t} be the set of vertices that are predecessors of v_{i,t}, but on which v_{i,t} is not communication dependent. Since v_{i,t} is not communication dependent on the vertices in Q, we can construct a colouring sequence on G that fixes the vertices in Q to be red, and colours vertex v_{j,d} yellow, making the application of rule recvAye/ possible in the next step. Since no progress is made in the ith process component after colouring vertex v_{i,c} green, the state of the associated token pool does not change until the application of rule recvAye/ to vertex v_{i,t}. Hence, when vertex v_{i,c} is coloured green, the token pool must have a token destined for arc (v_{j,d}, v_{i,t}). •

Thus, if a receive vertex r is terminally communication dependent on a send vertex s, then it is necessary and sufficient that a token, which is used to apply rule recvAye/ to receive vertex r, be available as soon as the send vertex s is coloured green; the start vertex may be thought of as a special send vertex. Since the interval corresponding to r begins when s is coloured green, and ends when r is coloured green, a token must be available for the recvAye/ rule, which can occur during this interval. Computing the maximum overlap of intervals yields the required number of tokens.

Example Use of the NBAPr Algorithm

To demonstrate the NBAPr algorithm we have implemented it, and analyzed the pipe-and-roll parallel matrix multiplication algorithm [FJL+88]. The program has one control process and a number of worker processes arranged in a two-dimensional mesh. We ran the NBAPr algorithm on meshes of size 2 x 2, 3 x 3 and 4 x 4. The communication graph for the smallest example, comprising four workers ordered in a 2 x 2 mesh, is depicted in Figure E.9. The corresponding optimal buffer assignment is listed in the second column of Table 1.

Figure E.9: The communication system for a 2 x 2 worker process mesh.

In this example, process 0 is the control process and processes 1 through 4 are the workers. The control process needs 4 buffers and the workers each need 3 to execute without blocking. The results obtained when executing the NBAPr algorithm on a 3 x 3 worker system are 9 buffers for the control process and between 4 and 5 buffers for the worker processes. For the 4 x 4 system the numbers are 16 for the control process and between 5 and 7 buffers for the workers.

    Proc. | Max overlap b_i | Overlap for intervals I_1 ... I_9
      0   |        4        | 0 0 0 0 4 3 2 1 0
      1   |        3        | 2 1 2 3 2 1 1 0 0
      2   |        3        | 3 2 1 2 1 1 1 0 0
      3   |        3        | 3 2 1 2 1 1 1 0 0
      4   |        3        | 2 1 2 3 2 1 1 0 0

Table 1. The result of running the NBAPr algorithm on the 2 x 2 worker example.

Approximating BAPr with NBAPr

The NBAPr algorithm is useful for determining a token assignment that prevents deadlock, that is, for approximating BAPr. Since a nonblocking colouring sequence does not deadlock, a token assignment determined by the NBAPr algorithm ensures that the graph is deadlock free. However, the token assignment may be far from optimal. A simple example of this phenomenon is a two process component graph comprising n arcs emanating from the first component and incident on the second. Such a graph requires zero tokens to avoid deadlock, but requires n tokens to be block free. Thus, the aforementioned token assignment may entail many more tokens than required.

E.5 Buffer Allocation in Systems with Send Side Buffers

In this section we consider the second of the four buffer placement strategies: send side buffers.
Buffers are now allocated on the sending process side if the receive is not ready to accept the message. Correspondingly, the token pool used when applying rule recvAye/ to the receive vertex of arc (s,r) belongs to the process component containing the send vertex s. We call this the send side al location scheme. The Buffer Allocation Problem (BAPS) remains intractable. The problem is conjectured to be NP-comple te (see the following paragraph). The NP-hardness follows from the observation that each t-ring in the construction in Theorem E.3.1 has to have a token assigned to a process component pair in order to prevent deadlock. It does not matter if the token is allocated from the token pool of the sending or the receiving process component. Hence, the reduction used in Theorem E.3.1 can be applied with no modification. We conjecture that the corresponding Buffer Sufficiency Problem (BSPs) is in P . This is because the relative order in which tokens from a particular token pool are util ized is invariant with respect to the colouring sequences. Hence, we believe that the determining sufficiency is similar to the nonblocking buffer allocation problem and hence is in P . If this is the case, BAP S is NP-comple te . The Nonblocking Buffer Allocation Problem (NBAPS) remains in P . The problem can be solved E. 6. Buffer Allocation in Systems with Send and Receive Side Buffers 158 by first reversing all arcs in the communication graph, swapping the start and end vertices, and then running the algorithm described in Figure E.8. E.6 Buffer Allocation in Systems with Send and Receive Side Buffers So far we have considered systems exclusively with send side or receive side buffers. In this section we investigate systems with buffers on both the send and the receive sides; many communication systems use per-host buffer pools for both receiving and sending messages. The choice of where to buffer the message-on the sender or on the receiver-increases the difficulty of determining the system's properties. We assume a lazy mechanism for utilizing buffers: first use a buffer from the sender's pool. If none is available, use a buffer from the receiver's pool. If neither is available, attempt to free a send side buffer by transferring its contents to a buffer belonging to the corresponding receiver. Intuitively, the system attempts to maximize buffer use, without attempting to predict the future. The corresponding colouring game allows tokens to be allocated from the pools belonging to both the sending component and the receiving component. Correspondingly, a lazy token util ization scheme is used: let (si,r,) be a communication arc from process component Pi to process component Pj. The following rules apply during the application of rule recvAye/ to vertex r j: 1. If a token belonging to component Pi is available, use it. 2. Otherwise, if a token belonging to component Pj is available, use it. 3. Otherwise, if a token belonging to component Pi is currently placed on arc (ti,rk), U e Pi, rk e Pk, and a token belonging to component Pk is available. Then the token on arc (ti,rk) may be replaced with the one belonging to Pk, freeing a token to be used in the current application of rule recvAye/. We call this the mixed allocation scheme. Not unexpectedly, the Buffer Allocation Problem (BAPSr) remains intractable within the mixed allocation scheme. This is because the receive side allocation scheme, which provides no choice of token pools, can be simulated within the mixed allocation scheme. 
Concretely consider the receive side allocation scheme analyzed in Section E.3: to simulate the receive side allocation scheme on communication graph G, within the mixed allocation scheme, each arc in G is replaced by the component illustrated in Figure E.10. Since vertex q cannot be coloured green until vertex r is coloured yellow, and component P' has no tokens, applying rule recvAye/ to r requires that Pj has an available token, regardless of whether Pi has an available token. Similarly, the Buffer Sufficiency Problem (BSPSr) within the mixed allocation scheme is also E.6. Buffer Allocation in Systems with Send and Receive Side Buffers 159 coNP-complete. The hardness follows from Theorem E.3.2 and the preceding argument. Since a colouring sequence also serves as a deadlock certif icate in this case, the coNP-completeness result follows. The interesting property of the mixed allocation scheme is that the Nonblocking Buffer Allo-cation Problem (NBAPsr) is intractable; the choice of token pools increases the search space of solutions exponentially! The reduction is from 3SAT. Theorem E.6.1 The Nonblocking Buffer Allocation Problem (NBAPsr) is NP -hard. Proof: Let F be an instance of 3SAT, comprising n variables, labeled xit i = 1... n, and c clauses. We construct a communication graph G such that there exists a token assignment of n + 2 tokens that prevents any colouring sequence from blocking on G if and only if the corresponding assignment satisfies F. The graph G comprises 2n + 3 process components: the first 2n are labeled PXi and PXi, i = 1... n, and the remaining three process components are labeled P, Qo and Qi, respectively. The graph is divided into c + 1 epochs: epoch 0 corresponds to the variable assignment, and epochs 1 through c correspond to clause evaluation. In epoch 0 each process component PXi contains a single send vertex Si that is adjacent to the receive vertex n located in epoch 0 of process component PXi. Process component Q0 (and Qi) contains four vertices: two receive vertices q 0 , i and g0 >2 (respectively 51,1 and 91,2), followed by two send vertices t 0 , i and £0,2 (respectively i i , i and * i > 2 ) . Finally, process component P contains eight vertices: two send vertices, s 0 , i and s 0 ) 2 , that are adjacent to vertices g 0 , i and q0y, two receive vertices, r 0 ] 1 and 7-0,2, that are adjacent to io, i and i 0 ,2; two more send vertices, s i , i and s 1 ) 2 , that are adjacent to c/i,i and QI j 2; and two more receive vertices, 7-1,1 and T-I,2, that are adjacent to <i,i and £ i , 2 . See Figure E.11. Epoch 0 has two important properties. Property E.6.2 Any token assignment must assign at least one token to either component PXi or PXi to prevent the colouring sequence from blocking after colouring vertex S J yellow. E.6. Buffer Allocation in Systems with Send and Receive Side Buffers 160 Property E.6.3 A token assignment on G having only n + 2 tokens must assign two tokens to process component P to prevent a colouring sequence from blocking after yellow colouring one of the send vertices s0,i, s0,2, s i , i or s i i 2 . Proof: Since n tokens must be allocated to the process components PXi or PXi, i = 1,..., n , this leaves only two tokens to be allocated. Since the colouring rule sequence send->yel, recvAye/ , send^grn, send-^yel, recvAye/can colour send vertices s 0 , i and s 0 ,2, or send vertices sXil and s i , 2 , component pairs (P,Q0) and (P,Qi) must each have two tokens between them. This can only happen by assigning the tokens to P. 
A corollary of these properties is that once a legal token assignment is made, no colouring sequence will block in epoch 0. The choice of allocating the token on P_{x_i} versus P_{¬x_i} corresponds to fixing the variable assignment.

Figure E.11: Reduction from 3SAT to NBAP_sr.

For j = 1 ... c, epoch j corresponds to the jth clause. Each epoch comprises two parts of six arcs each: the synchronization part and the evaluation part. Four process components are involved in an epoch: the three components, P_{a_j}, P_{b_j}, and P_{c_j}, whose labels are the literals in the jth clause, where a_j, b_j, c_j ∈ {x_1, ¬x_1, ..., x_n, ¬x_n}, and component P, which is involved in every epoch. Epoch j of component P_{a_j} comprises four vertices: receive vertex r_{a_j,j}, send vertex t_{a_j,j}, receive vertex r'_{a_j,j}, and send vertex t'_{a_j,j}. Process components P_{b_j} and P_{c_j} are analogously formed. In epoch j component P has 12 vertices, the first six of which are these: send vertex s_{a_j,j}, receive vertex q_{a_j,j}, send vertex s_{b_j,j}, receive vertex q_{b_j,j}, send vertex s_{c_j,j}, and receive vertex q_{c_j,j}. These are followed by three send vertices, s'_{a_j,j}, s'_{b_j,j}, and s'_{c_j,j}, and three receive vertices, q'_{a_j,j}, q'_{b_j,j}, and q'_{c_j,j}. Each vertex s_{i,j} is adjacent to vertex r_{i,j}, each vertex t_{i,j} is adjacent to vertex q_{i,j}, each vertex s'_{i,j} is adjacent to vertex r'_{i,j}, and each vertex t'_{i,j} is adjacent to vertex q'_{i,j}; see Figure E.11. For conciseness we drop the last index, j, when it is obvious from the context.

Epoch j has three important properties:

Property E.6.4 If vertex q'_{c_j} (in epoch j) is coloured green and vertex s_{a_{j+1}} (in epoch j + 1) is still red, then no tokens that belong to component P are assigned to arcs. The same applies to vertex pairs (q_{a_j}, s_{b_j}), (q_{b_j}, s_{c_j}), and (q_{c_j}, s'_{a_j}), also in epoch j.

Proof: All ancestors of q'_{c_j} must be coloured green and all descendants of s_{a_{j+1}} must be coloured red. This includes all vertices in G, except some vertices s_i and r_i in epoch 0, which are not adjacent to vertices in component P. Hence, the tokens belonging to P are not assigned to any arc. The same argument applies to the other vertex pairs. ■

Property E.6.5 A colouring sequence on G can block only when yellow colouring receive vertices r'_{a_j}, r'_{b_j}, r'_{c_j}, q'_{a_j}, q'_{b_j}, or q'_{c_j}.

Proof: As a corollary of Properties E.6.2 and E.6.3, no colouring sequence can block in epoch 0. Thus, we need only check that no colouring sequence can block in the first part of epoch j, j = 1 ... c. By Property E.6.4, if s_{a_j} is red and its predecessor is green, then no tokens of P are in use. Hence, to colour s_{a_j} green, a token is available to colour r_{a_j} yellow. Since vertex r_{a_j} is a predecessor of t_{a_j}, vertex r_{a_j} must be coloured green before t_{a_j} may be coloured yellow. Thus the token is freed before t_{a_j} is coloured green, and may be used to colour vertex q_{a_j} yellow after t_{a_j} is coloured yellow. A similar argument applies to the vertices r_{b_j}, q_{b_j}, r_{c_j}, and q_{c_j}. ■

Property E.6.6 A colouring sequence can block in epoch j if and only if none of the three process components, P_{a_j}, P_{b_j}, and P_{c_j}, has a token assigned.

Proof: For the 'if' direction consider a colouring sequence that colours vertex q_{c_j} green, but has not yet coloured vertex s'_{a_j} yellow. By definition, blocking does not occur if rule recv->yel may always be applied to colour a receive vertex yellow. To colour the send vertices s'_{a_j}, s'_{b_j}, and s'_{c_j}
yellow and then green, the receive vertices r'_{a_j}, r'_{b_j}, and r'_{c_j} must be coloured yellow via rule recv->yel. Since the receive vertices r'_{a_j}, r'_{b_j}, and r'_{c_j} are not ancestors of the send vertices s'_{a_j}, s'_{b_j}, and s'_{c_j}, none of the receive vertices need be coloured green before the send vertices are coloured yellow. However, component P has only two tokens, and none of components P_{a_j}, P_{b_j}, P_{c_j} have any. Hence, rule recv->yel can only be invoked twice, instead of the requisite three times. Thus, a colouring sequence can block in epoch j.

For the 'only if' direction we claim that if a literal component P_{a_j}, P_{b_j}, or P_{c_j} has a token, rule recv->yel can be invoked on any of the six receive vertices r'_{a_j}, r'_{b_j}, r'_{c_j}, q'_{a_j}, q'_{b_j}, and q'_{c_j}. Since r'_{a_j} is a predecessor of t'_{a_j}, r'_{a_j} must be coloured green before t'_{a_j}, and hence before q'_{a_j} is coloured yellow. Thus, the same token that was allocated upon the application of rule recv->yel to vertex r'_{a_j} may also be allocated upon the application of rule recv->yel to vertex q'_{a_j}; the same argument is applicable to vertices q'_{b_j} and q'_{c_j}. Applying rule recv->yel to vertices r'_{a_j} and r'_{b_j} uses the two tokens from component P. To colour vertex r'_{c_j} yellow there are three possible scenarios:

1. the colouring sequence has already freed one of the tokens, allowing it to be reused,

2. component P_{c_j} has a token, in which case it is used, or

3. component P_{a_j} (or P_{b_j}) has a token, in which case it replaces the token used to yellow colour vertex r'_{a_j} (or r'_{b_j}) and the freed token is used to colour vertex r'_{c_j}.

Since at least one of the components P_{a_j}, P_{b_j}, or P_{c_j} has a token, the claim is proven. ■

By Property E.6.6 a colouring sequence will block in epoch j if and only if none of the process components P_{a_j}, P_{b_j}, or P_{c_j} has a token, which corresponds to the jth clause having no literals that are true. Thus, a token assignment of size n + 2 prevents any colouring sequence on G from blocking if and only if the corresponding assignment satisfies F. ■

E.7 Buffer Allocation in Channel Based Systems

In channel based systems processes communicate via pairwise connections that are created at startup. Each connection, called a channel, is specified by its endpoints and is used by one process to send messages to the other. Each channel functions independently of the other channels in the system, and resources such as buffers are allocated on a per channel basis, rather than per process. Finally, channels behave like queues, that is, messages are removed from the channel in the same order that they are inserted.

Channels may either be unidirectional, comprising source and destination endpoints, or bidirectional, comprising two symmetric endpoints. In the former case, only the source process may insert messages into the channel and only the destination process may remove messages from it. A bidirectional channel is equivalent to two unidirectional channels, allowing both processes to insert and remove messages from the channel. Here we only consider unidirectional channels.

Except for buffer allocation, channel based communication does not differ from the previously described send/receive mechanism. In fact, an unbuffered channel communication is just a synchronous send/receive communication. Thus, we can derive similar results for channel based systems.
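As an aside, the channel abstraction described above (unidirectional, FIFO ordered, with a per channel buffer pool) can be pictured as a bounded ring buffer. The sketch below is purely illustrative: CHAN_CAP stands for the number of buffers assigned to the channel, and none of the names come from an actual message passing library.

```c
#include <stdbool.h>
#include <string.h>

#define CHAN_CAP 4     /* buffers (tokens) assigned to this channel */
#define MSG_SIZE 64    /* fixed message size, for simplicity        */

/* A unidirectional channel: the source inserts, the destination removes,
 * and messages leave in the order they entered (FIFO).                  */
typedef struct {
    char buf[CHAN_CAP][MSG_SIZE];
    int  head;      /* index of the oldest buffered message */
    int  count;     /* number of occupied buffers           */
} channel_t;

/* Source side: fails (the send would block) when every buffer is full. */
static bool chan_send(channel_t *c, const char *msg)
{
    if (c->count == CHAN_CAP)
        return false;
    memcpy(c->buf[(c->head + c->count) % CHAN_CAP], msg, MSG_SIZE);
    c->count++;
    return true;
}

/* Destination side: removes the oldest message, freeing its buffer. */
static bool chan_recv(channel_t *c, char *out)
{
    if (c->count == 0)
        return false;
    memcpy(out, c->buf[c->head], MSG_SIZE);
    c->head = (c->head + 1) % CHAN_CAP;
    c->count--;
    return true;
}
```

A failed chan_send is the situation the colouring game models when the channel's token pool is exhausted.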
In the corresponding colouring game, tokens are allocated to channels (component pairs) instead of to components. This change does not affect the properties used in our proofs; in fact, Lemma E.2.1 may be used unchanged. We call this the per channel allocation scheme.

E.7.1 The Buffer Allocation Problem

The corresponding Buffer Allocation Problem (BAP_c) is this: given a communication graph G and an integer k, determine whether there exists a token assignment of size k such that no colouring sequence deadlocks on G. Even though token utilization during the colouring of a communication graph is only dictated by the communication arcs within a process component pair, determining the number of tokens needed remains NP-hard. The proof is similar in spirit to Theorem E.3.1.

Theorem E.7.1 The Buffer Allocation Problem (BAP_c) is NP-hard.

Proof: We prove this by reducing 3SAT to BAP_c. For any 3SAT instance F we construct a corresponding communication graph G, polynomial in the size of F, such that for a token assignment of size n, any colouring sequence will complete on G if and only if the corresponding variable assignment satisfies F.

Let F be an instance of 3SAT on n variables and comprising c clauses. The construction is nearly identical to that in Theorem E.3.1, except for the components representing the clauses of F. The graph G has 2n process components that are labeled by the literals of F, P_{x_i} and P_{¬x_i}, i = 1 ... n. Each component comprises c + 1 epochs, where each epoch contains zero or two vertices. As in Theorem E.3.1, epoch 0 fixes a variable assignment. In epoch 0 each component has two vertices: a send vertex, labeled s_{x_i} (respectively s_{¬x_i}), and a receive vertex r_{x_i} (respectively r_{¬x_i}), i = 1 ... n. Vertex s_{x_i} is adjacent to vertex r_{¬x_i}, and vertex s_{¬x_i} is adjacent to vertex r_{x_i}; this is a 2-ring, identical to epoch 0 in Theorem E.3.1. Epoch 0 has the following property:

Property E.7.2 Any colouring sequence on G will deadlock in epoch 0 unless each process component pair has a token assigned to the token pool of either (P_{x_i}, P_{¬x_i}) or (P_{¬x_i}, P_{x_i}), i = 1 ... n. Thus, the token assignment must be of at least size n. (Follows from Lemma E.2.1.)

Property E.7.2 yields the following correspondence between assignments on F and token assignments of size n.

Property E.7.3 The corresponding token assignment of a variable assignment on F assigns a token to the channel (P_{x_i}, P_{¬x_i}) if x_i is true, or to (P_{¬x_i}, P_{x_i}) if x_i is false.

The jth epoch represents the jth clause of F, denoted (a_j, b_j, c_j), where a_j, b_j, c_j ∈ {x_1, ¬x_1, ..., x_n, ¬x_n}. The process components P_{a_j}, P_{¬a_j}, P_{b_j}, P_{¬b_j}, P_{c_j}, and P_{¬c_j} form a 6-ring, while the remaining components have no vertices in the jth epoch. Process component P_{a_j} has two vertices in the jth epoch: a send vertex, s_{a_j,j}, and a receive vertex, r_{a_j,j}; similarly, the other five components have a send and a receive vertex that are correspondingly named. The arcs linking the six components are these: (s_{a_j,j}, r_{¬a_j,j}), (s_{¬a_j,j}, r_{b_j,j}), (s_{b_j,j}, r_{¬b_j,j}), (s_{¬b_j,j}, r_{c_j,j}), (s_{c_j,j}, r_{¬c_j,j}), and (s_{¬c_j,j}, r_{a_j,j}). These form a 6-ring, as illustrated in Figure E.12.

Figure E.12: The clause representation in epoch j.

The key property of the jth epoch is this:

Property E.7.4 No colouring sequence on G will deadlock in the jth epoch if and only if at least one of the channels (P_{a_j}, P_{¬a_j}), (P_{¬a_j}, P_{b_j}), (P_{b_j}, P_{¬b_j}), (P_{¬b_j}, P_{c_j}), (P_{c_j}, P_{¬c_j}), (P_{¬c_j}, P_{a_j}) has a token. (Follows from Lemma E.2.1.)
A refined version of Property E.7.4 is more useful:

Property E.7.5 For any token assignment of size n such that no colouring sequence deadlocks on G in epoch 0, no colouring sequence on G will deadlock in the jth epoch if and only if at least one of the channels (P_{a_j}, P_{¬a_j}), (P_{b_j}, P_{¬b_j}), and (P_{c_j}, P_{¬c_j}) has a token.

Proof: By Property E.7.2, all token assignments that do not cause deadlock in epoch 0 only assign tokens to channels of the form (P_{x_i}, P_{¬x_i}) or (P_{¬x_i}, P_{x_i}). Hence, only channels (P_{a_j}, P_{¬a_j}), (P_{b_j}, P_{¬b_j}), and (P_{c_j}, P_{¬c_j}) can have a token. By Property E.7.4, no colouring sequence on G will deadlock in epoch j if one of these channels has a token. ■

We claim that given a token assignment of size n, any colouring sequence will complete on G if and only if the corresponding variable assignment satisfies F. If an assignment x satisfies F, then every clause has at least one literal that evaluates to true. By Property E.7.3, in each of the clause epochs at least one of the channels listed in Property E.7.5 will be allocated a token. Hence, by Property E.7.5 no colouring sequence will deadlock on G. If an assignment x does not satisfy F, then there is at least one clause in which all literals are false. Let (a_j, b_j, c_j) be the unsatisfied clause. By Property E.7.3, the corresponding token assignment will not assign a token to (P_{a_j}, P_{¬a_j}), (P_{b_j}, P_{¬b_j}), or (P_{c_j}, P_{¬c_j}); hence, by Property E.7.5, all colouring sequences will deadlock. Thus, BAP_c is NP-hard. ■

Since tokens are assigned on a per channel basis, token usage depends only on the two process components that comprise the channel. Consequently, the sufficiency of a token assignment can be verified in linear time. Thus, the easier problem BSP_c is in P, implying that BAP_c is NP-complete. We describe the verification algorithm and prove its correctness.

E.7.2 The Buffer Sufficiency Problem

To verify the sufficiency of a token assignment, perform a colouring on G: at each step of the colouring a vertex of G is coloured according to the rules in Section E.1. Using a queue to keep track of colourable vertices means that determining which vertex to colour next takes O(1) time. Since each vertex changes colour at most twice (the maximum length of any colouring sequence is 2|V| colourings), colouring a graph takes O(|V|) time. The token assignment is sufficient if and only if the colouring sequence completes.

The algorithm's correctness follows immediately from the following theorem: any colouring sequence on G completes if and only if some colouring sequence on G completes. Thus, a token assignment is sufficient if and only if some colouring sequence on G completes.

Theorem E.7.6 Let G be a communication graph and B a token assignment on G. Any colouring sequence on G completes if and only if a colouring sequence on G completes.

Proof: For any communication graph G, we construct a new graph G' where every token is simulated by a process component, the size of the corresponding token assignment is zero, and every colouring sequence on G corresponds to a colouring sequence on G', such that a colouring sequence on G completes if and only if the corresponding colouring sequence on G' completes. Since the token assignment on G' is zero, by Lemma E.2.2 a colouring sequence on G' completes if and only if every colouring sequence on G' completes. Hence, every colouring sequence on G completes if and only if a colouring sequence on G completes.
To simulate an m token channel (a channel that has been assigned m tokens), m process components are chained together. For each channel (P, Q) with m tokens, m process components P_1, P_2, ..., P_m are interspersed between P and Q. The channel (P, Q) is replaced with these channels: (P, P_1), (P_1, P_2), ..., (P_{m-1}, P_m), (P_m, Q). Each arc from P to Q is replaced by a chain of arcs from P -> P_1 -> P_2 -> ... -> P_{m-1} -> P_m -> Q. The replacement is illustrated in Figure E.13.

Figure E.13: Simulating m tokens by m components.

We claim that a colouring sequence, Σ, on G will deadlock if and only if the corresponding colouring sequence, Σ', on G' deadlocks. First, we construct the correspondence and argue its correctness. Second, we argue that sequence Σ deadlocks on G if and only if the corresponding sequence Σ' deadlocks on G'. Finally, we apply Lemma E.2.2 to prove our result.

Since the transformation is iterative (each m token channel is independent of the other channels), it is sufficient to derive the correspondence between the colouring sequence on G and the graph G' where a single m token channel has been replaced. Let (P, Q) denote the channel in G that is replaced in G'. Let (s_l, r_l) ∈ G, l = 1, 2, ..., denote the arcs from process component P to Q. The corresponding paths in G' are (s_l, r_{1,l}, s_{1,l}, r_{2,l}, s_{2,l}, ..., r_{m,l}, s_{m,l}, r_l), where each arc (r_{k,l}, s_{k,l}) is within process component P_k and each arc (s_{k,l}, r_{k+1,l}) is between process components P_k and P_{k+1}; the vertices s_l and r_l, l = 1, 2, ..., are called the fringe vertices.

A colouring sequence Σ can be represented as a sequence of differences (or moves), δ_i, between every two consecutive colourings χ_i and χ_{i+1}. The sequence Δ_Σ = δ_1 δ_2 ... is a sequence of colouring game moves δ_i = (v, colour) such that applying δ_i to colouring χ_i yields χ_{i+1}, the next colouring in Σ; Δ_Σ can be derived from Σ, and Σ can be derived from Δ_Σ and G. The sequence Δ_Σ comprises two types of moves: those that colour fringe vertices, called fringe moves, and those that do not, called normal moves.

Given a colouring sequence Σ on G, we transform it into the corresponding colouring sequence Σ' on G'. The transformation replaces some fringe moves in sequence Δ_Σ with sequences of moves, resulting in the corresponding move sequence Δ_Σ'. This sequence comprises normal moves and added moves; added moves are a mixture of fringe moves and moves on the vertices within the added components P_i. There are four types of fringe moves in Δ_Σ: colour a send vertex s_l yellow ((s_l, yel)), colour a send vertex s_l green ((s_l, grn)), colour a receive vertex r_l yellow ((r_l, yel)), and colour a receive vertex r_l green ((r_l, grn)). The transformation is performed in the order that the moves occur in sequence Δ_Σ.

• If δ_i = (s_l, yel), then no action is taken.

• If δ_i = (s_l, grn), we replace it with the sequence (r_{1,l}, yel), (s_l, grn), (r_{1,l}, grn), (s_{1,l}, yel), suffixed by the sequences (r_{j,l}, yel), (s_{j-1,l}, grn), (r_{j,l}, grn), (s_{j,l}, yel), j = 2 ... k-1, where k is the smallest integer such that the move (s_{k,l-1}, grn) has not yet been inserted into the move sequence Δ_Σ', that is, vertex s_{k,l-1} has not yet been coloured green.

• If δ_i = (r_l, yel), we remove it from the sequence; it is restored when we replace the move (r_l, grn).
• If δ_i = (r_l, grn), we replace this move with the sequence (r_l, yel), (s_{m,l}, grn), (r_l, grn), suffixed with the sequences (r_{g_j,h_j}, yel), (s_{g_j-1,h_j}, grn), (r_{g_j,h_j}, grn), (s_{g_j,h_j}, yel), j = 0 ... k-1, where g_j = m - j, h_j = l + 1 + j, and k is the smallest integer such that the move (s_{m-k,l+1+k}, yel) has not yet been inserted into the sequence, that is, vertex s_{m-k,l+1+k} has not yet been coloured yellow. Since the head of this sequence colours s_{m,l} green, r_{m,l+1} could be coloured yellow; if s_{m-1,l+1} is yellow, then s_{m-1,l+1} could be coloured green, followed by r_{m,l+1}, and finally s_{m,l+1} could be coloured yellow; this colouring cascades down the added process components.

It is important to note that each of the replacement sequences is maximal, that is, no additional valid colouring moves on the chain process components P_i, i = 1 ... m, may be suffixed to them. The new sequence has the form

Δ_Σ' = δ_1 ... δ_{h_1}, δ'_1 ... δ'_{g_1}, δ_{h_1+1} ... δ_{h_2}, δ'_{g_1+1} ... δ'_{g_2}, ...

where the runs δ_1 ... δ_{h_1}, δ_{h_1+1} ... δ_{h_2}, ... are normal moves and the runs δ'_1 ... δ'_{g_1}, δ'_{g_1+1} ... δ'_{g_2}, ... are added moves.

Since G is a contraction of G', all normal vertices are coloured by Δ_Σ' in the same order as in Δ_Σ. Recall that normal vertices are not adjacent to the process component chain, and hence are not affected by the transformation. While normal vertices within process components P and Q may depend on the order that the fringe vertices are coloured, the dependence is via process arcs, not communication arcs. Consequently, the normal vertices only depend on the order that the fringe vertices are coloured green. Fortunately, this order is preserved.

By inspection, the replacement sequences of moves are valid. Thus, the transformed sequence Δ_Σ' is valid. Additionally, all green colouring moves on fringe vertices are preserved by the transformation; a vertex is coloured green by Δ_Σ if and only if the corresponding vertex is coloured green by Δ_Σ'. The following property is key:

Property E.7.7 Δ_Σ deadlocks on G if and only if Δ_Σ' deadlocks on G'.

Proof: By contradiction, suppose that Δ_Σ deadlocks on G while Δ_Σ' can be extended, that is, another vertex colouring move may be suffixed to Δ_Σ'. Let v be the vertex that can be coloured by the extension. Vertex v may either be a normal vertex, a fringe vertex, or a vertex belonging to a process chain. The latter is impossible because every replacement sequence of moves is maximal.

If v is a normal vertex, then its predecessors are either normal vertices or fringe vertices that have been coloured green. Since the transformation preserves the colourings of normal vertices and the order in which vertices are coloured green, if Δ_Σ' can be extended by colouring v, then so can Δ_Σ, which is a contradiction.

If v is a fringe vertex, there are four cases: either v is a send vertex s_l being coloured yellow or green, or v is a receive vertex r_l being coloured yellow or green. The transformation does not affect moves that colour send vertices yellow, and such a colouring only depends on its process component predecessor being green. Hence, if the colouring can be suffixed to Δ_Σ', it can also be suffixed to Δ_Σ, resulting in a contradiction. If the extension colours the send vertex green, this means that the original sequence Δ_Σ can be extended by adding either the colouring (s_l, grn) or (r_l, yel)(s_l, grn), depending on whether r_l has been coloured yellow or not in the original sequence Δ_Σ; thus, it is a contradiction.
Similarly, if v is a fringe receive vertex being coloured green, this is not possible, because the transformation colours fringe receive vertices yellow, then green, by a single replacement sequence. Finally, if v is a fringe receive vertex r_l that can be coloured yellow, the original sequence Δ_Σ can be extended by the move (r_l, grn), because in the original sequence the corresponding send vertex s_l has already been coloured green. Thus, we have another contradiction. In the other direction, if the original sequence can be extended, then transforming the extension of the sequence Δ_Σ yields an extension to the presumably deadlocked sequence Δ_Σ'. Thus, Δ_Σ deadlocks on G if and only if Δ_Σ' deadlocks on G'. ■

A corollary of Property E.7.7 is that the colouring sequence Σ deadlocks if and only if the colouring sequence Σ' deadlocks. By Lemma E.2.2 a colouring sequence on G' completes if and only if all colouring sequences on G' complete. Hence, a colouring sequence on G completes if and only if all colouring sequences on G complete. ■

Corollary E.7.8 A colouring sequence on G completes if and only if the token assignment is sufficient.

E.7.3 The Nonblocking Buffer Allocation Problem

For the Nonblocking Buffer Allocation Problem, the algorithm derived in Section E.3.3 suffices with a small modification. Since the token pools are per channel, rather than per process component, the computation must be performed on a per pool basis. Hence, there is an additional factor of n in the runtime. Since each process may be using up to n channels, the runtime of the algorithm becomes O(|V|n^2 + |V|n log(|V|n)); the cost increases because the number of allocations to be made becomes quadratic in n.

E.8 Summary

As message passing becomes increasingly popular, the problem of determining k-safety plays an increasingly important role. The relevance of this problem grows as more and more functionality of message passing systems is offloaded to the network interface card, where limited buffer space is a serious issue. Even if message passing is kept in main memory, buffer space can still be limited due to the sometimes very large data sets used in many parallel and distributed programs. Unfortunately, determining k-safety is intractable.

We have shown that in the receive buffer model, determining the number of buffers needed to assure safe execution of a program is NP-hard, and that even verifying whether a number of assigned buffers is sufficient is coNP-complete. On the positive side, if we require that no send blocks, we provide a polynomial time algorithm for computing the minimum number of buffers. By allocating this number of buffers, safe execution is guaranteed. In addition, we have implemented the NBAP_r algorithm, and it is now part of the Millipede debugging system.

For systems with only send buffers, the Buffer Allocation Problem remains NP-hard. In addition, we conjecture that the Buffer Sufficiency Problem can be solved in polynomial time because the order of the sends in each process is fixed. The Nonblocking Buffer Allocation Problem for systems with only send buffers can be solved in polynomial time.

For systems with both send and receive buffers, the Buffer Allocation Problem as well as the Buffer Sufficiency Problem remain intractable. More interestingly, the Nonblocking Buffer Allocation Problem has become intractable.

For systems with unidirectional channel buffers, both the Buffer Sufficiency Problem and the Nonblocking Buffer Allocation Problem have polynomial time algorithms.
However, the Buffer Allocation Problem still remains an NP-complete problem. The results (and conjectures) are summarized in Table E.1.

                                  Buffer Placement
Problem     Receive          Send         Send & Receive     Channel
BAP         NP-hard          NP-hard      NP-hard            NP-complete
BSP         coNP-complete    (P)          coNP-complete      P
NBAP        P                P            NP-hard            P

Table E.1: Results for the three problems under the four different buffer placement schemes.
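As a closing illustration, the following C sketch drives the sufficiency check of Section E.7.2 as a plain fixed-point computation. The colouring rules of Section E.1 are abstracted behind a caller-supplied callback, so only the driver is shown, and every name here is invented for this sketch. The shortcut it relies on is Corollary E.7.8 together with Theorem E.7.6: under per channel buffers it does not matter which enabled rule fires first, so simulating a single colouring sequence decides sufficiency (no such shortcut can exist for receive side buffers, where the sufficiency problem is coNP-complete). For brevity the driver rescans every vertex after each pass, costing O(|V|^2) overall; the queue of colourable vertices described in Section E.7.2 reduces this to O(|V|).

```c
#include <stddef.h>

/* Vertex colours of the colouring game in Section E.1. */
enum colour { RED, YELLOW, GREEN };

struct vertex {
    enum colour col;
    /* a full implementation would also record the process-order
     * predecessor, the matching send/receive vertex, and the channel
     * whose token pool gates the recv->yel rule                        */
};

/* rule(v) is expected to apply one enabled colouring rule to v
 * (advancing RED -> YELLOW or YELLOW -> GREEN and moving tokens
 * accordingly) and to return nonzero exactly when v changed colour.    */
typedef int (*colour_rule)(struct vertex *v);

/* Greedy simulation of one colouring sequence; the token assignment is
 * sufficient if and only if every vertex ends up green.                */
int assignment_sufficient(struct vertex *v, size_t n, colour_rule rule)
{
    int progress = 1;
    while (progress) {                       /* repeat until a fixed point */
        progress = 0;
        for (size_t i = 0; i < n; i++)
            if (v[i].col != GREEN && rule(&v[i]))
                progress = 1;
    }
    for (size_t i = 0; i < n; i++)
        if (v[i].col != GREEN)
            return 0;   /* some vertex is stuck: assignment insufficient */
    return 1;
}
```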
