UBC Theses and Dissertations

A virtual machine approach to parallel debugging Lin, Kunhua 1993


Full Text


A Virtual Machine Approach to Parallel Debugging

by

KUNHUA LIN

B.Eng., Wuhan Technical University of S&M, 1983
M.Eng., Institute of Computing Technology, Academia Sinica, 1989

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN THE FACULTY OF GRADUATE STUDIES, DEPARTMENT OF COMPUTER SCIENCE

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA

July, 1993

© Kunhua Lin, 1993

In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Computer Science
The University of British Columbia
6356 Agricultural Road
Vancouver, Canada
V6T 1Z2

Abstract

Debugging is generally considered to be difficult. The increased complexity and nondeterminism of parallel programs make it even more difficult. This is one of the reasons that parallel machines are not widely used for computationally intensive applications, even though recent progress in VLSI technology has significantly reduced the cost of building these machines.

In this thesis, a two-phase approach to debugging parallel applications based on task-oriented virtual machines is proposed. In the first phase, a message-passing event history of the application is constructed and analyzed against the specifications of the application to automatically identify and localize fatal errors in terms of the task number, the processor identity and the user program process. In the second phase, the code segments identified in the first phase are debugged in a simulated environment using effective sequential debugging techniques.
This provides us with a technique to deal with the complexity of parallel programs through high-level abstractions and a divide-and-conquer strategy. Because it is based solely on the execution history, this approach avoids the non-reproducibility that results from the nondeterministic characteristics of a parallel program.

We have implemented a monitor and trace analyzer on a transputer-based multicomputer for two virtual machines, PFVM (Processor Farm Virtual Machine) and DCVM (Divide and Conquer Virtual Machine). The effectiveness and limitations of our approach are demonstrated and evaluated.

Table of Contents

Abstract
Table of Contents
List of Figures
Acknowledgment
Dedication

1 Introduction
  1.1 The Problems
  1.2 Motivation
  1.3 Thesis Overview

2 Overview
  2.1 Terminology and a Framework
  2.2 General Overview of Parallel/Distributed Debugging
  2.3 Display the Behaviours of Communicating Processes
  2.4 Replay
  2.5 Assertion Checking
  2.6 Summary

3 VM-based Applications and Their Message-passing Behaviour
  3.1 Introduction to VM-based Programming Environments
  3.2 Processor Farm Virtual Machine
    3.2.1 PFVM Interface
    3.2.2 Intermediate PFVM
  3.3 Divide and Conquer Virtual Machine
    3.3.1 DCVM Interface
    3.3.2 Intermediate DCVM
  3.4 Debugging VM-based Systems
    3.4.1 Classification of Errors
    3.4.2 Discussion

4 A Two-phase Approach to Debugging VM-based Applications
  4.1 Two-phase Debugging Approach
  4.2 Definitions
  4.3 Fault Models
  4.4 Summary

5 Implementation
  5.1 Overall Design of the System
  5.2 Instrumentation
  5.3 Event Tracing
    5.3.1 Environment
    5.3.2 System Structure of Tmon
    5.3.3 Modification to Tmon
  5.4 Trace Analysis
    5.4.1 Algorithm for Logical Time Stamping
    5.4.2 System Design
    5.4.3 User Interface

6 Experiments and Evaluation
  6.1 Experiments with Simulation
  6.2 Experience with Debugging an Actual PFVM Application
  6.3 Evaluation
7 Conclusions and Future Work
  7.1 Summary
  7.2 Future Work
    7.2.1 Enhancement and Integration
    7.2.2 Logical Time Stamping

Bibliography

List of Figures

2.1 A Debugging Process Model
2.2 Animation of Message-passing in Belvedere
3.3 Overall System Structure of a VM-based System
3.4 Overall System Structure of a Multi-VM System
3.5 Specification of an Example Program
3.6 Processor Farm Virtual Machine
3.7 IPFVM in a Chain
3.8 IPFVM in a Binary Tree
3.9 Divide and Conquer Virtual Machine
5.10 Overall Design of the System
5.11 System Design of deTrEK

Acknowledgment

First and foremost I would like to express my sincere thanks to both of my supervisors, Dr. Alan Wagner, without whose friendly guidance this thesis may never have been completed, and Dr. Samuel Chanson, whose advice has often gone far beyond academic research.

I wish to thank Mandeep Dhami for his friendship and for proofreading the thesis. Without his help, this thesis would be much less readable. I also wish to thank the other members of my research group, H. V. Sreekantaswamy, Sameer Mulye, Xiangdong Cai and David Feldcamp, for their discussions and friendship. My special thanks go to Norman Goldstein, who introduced me to DSRG's transputer system and its magic wires. Also I would like to thank Jeff Beis for providing me with a test case for the demonstration and evaluation of my tool.

I would also like to say thanks to all the people who have made my stay at UBC a rewarding experience through their friendship, in particular Grace Wolkosky, Ming Koon Lau, Roland Mechler, Marcelo Walter, Charles Mathieson, Helen Wong, Sijiang Zhang, Xun Li, Lianwen Zhang, Ruping Qi, and numerous others.

Finally, I am indebted to my wife, Caiping Zhang, for her love and constant support.

To my father, Limin Lin

Chapter 1

Introduction

Due to the progress of VLSI technology, the cost of building massively parallel computers has been significantly reduced.
However, the difficulty in programming these machines prevents them from being widely used. A programming environment based on virtual machines has been developed at UBC to ease parallel programming [Sre93, Fel92]. The focus of this thesis is to provide debugging support in such an environment. A new debugging approach is explored, and the design and implementation of a real-time monitor and post-mortem trace analyzer are presented.

1.1 The Problems

Debugging is a difficult job that requires considerable experience in program development because it relies on heuristic insights. There are several quite effective debuggers available for sequential code, such as "xcodecenter" and "dbx", which are based on the traditional debugging method: stop-and-look. However, this method is no longer sufficient, or even feasible, for parallel programs because of the following two characteristics inherent in all parallel programs.

Complexity

In a parallel system there are multiple physical processors and often multiple processes on each processor. Multiple processors and processes result in multiple threads of control which execute concurrently and interact with each other. Thus there is extensive communication and synchronization between them.

Understanding the execution of distributed systems with multiple threads of control is difficult. Even if we attached a sequential debugger, such as "dbx", to each process, the simultaneous coordination and management of such a large number of sequential debuggers would be a challenging task. Finally, because there is no single locus of control, problems such as the probe effect and error latency are more serious.

Non-determinism

Different behaviour or different results across different runs of the same program with the same input is referred to as nondeterminism. There are two sources of nondeterminism in message-passing systems. First, there is the nondeterministic behaviour inside a process.
This is caused by nondeterministic constructs in the programming language, such as ALT in OCCAM, and by access to real-time clocks. Second, there is the nondeterministic behaviour due to external events, such as messages, timers or interrupts. Asynchronous communication between processes is also a major source of nondeterminism. The order in which messages arrive at the destination may be different in different executions with the same input, and this order can affect the execution of the process that receives the messages. In addition, the runtime system that manages the message passing and processes is also a source of nondeterminism.

The major problem with nondeterminism is "non-reproducibility", a consequence of nondeterminism that may make it impossible to reproduce an erroneous state. Traditional debugging is based on repeated executions of the program: setting breakpoints, and observing and changing the values of variables to verify hypotheses about the program. Non-reproducibility of the execution makes this ineffective, as the repeated executions might not manifest the same behaviour. Furthermore, breakpoints are difficult to define because there is no single point of control.

1.2 Motivation

In general, to deal with the increased complexity of parallel programs one can develop higher-level abstractions and tools. One such approach at UBC is based on the definition of a virtual machine. A virtual machine (VM) is an abstract model of computation that encapsulates a parallel programming paradigm, such as processor farm, SPMD, etc. It is a restricted model of computation corresponding to a widely used strategy for parallelizing a program.

The objective of developing a VM-based programming environment is to ease programming and promote reusability while minimizing the loss of performance one might expect from using high-level abstractions.
Although the VM-based programming environment significantly reduces the complexity of parallel programming, it lacks debugging tools. The chief motivation for this thesis was to investigate tools to help the programmer understand the behaviour of, and debug, VM-based applications.

Certainly, one simplifies parallel programming by hiding the lower-level management of communication and process scheduling. For the task-oriented systems we have investigated, this has been accomplished in the VM-based environment, where the user is only required to supply a few sequential functions. Intuitively this can simplify debugging, as it is now possible to debug the sequential functions individually as sequential programs. However, there remain some errors that will not appear until the program runs on a parallel system. Inevitably, therefore, one has to debug the program in a parallel environment. There is the interesting question as to whether the restricted models of computation can be used to ease debugging and, if so, how.

1.3 Thesis Overview

To overcome the complexity and non-reproducibility of parallel programs we have investigated a two-phase hierarchical approach to debugging VM-based applications. It utilizes the restricted model of computation in the VM-based programming environment to divide the debugging into two phases. In the first phase, the execution of the system is monitored and a high-level message-passing history is recorded. The event history is then compared with the specification of the application to identify the location of errors at a level close to the user's conceptual model of programming. This process can be done automatically after the fault models for the virtual machines have been developed. The construction of these fault models is simplified because of the restricted model of computation on which it is based.
In the second phase, a simulation environment is provided, which takes as input the event history, to ensure that the errors located in the first phase are reproducible by repeated execution of the program. The debugging environment in the second phase can be implemented as a driver program that simulates the VM, along with a traditional debugger. As a result, programmers debug their code in a familiar sequential debugging environment using proven, effective debugging knowledge, without the non-reproducibility and complexity often associated with debugging parallel programs.

Chapter 2 reviews previous work in the area as it pertains to this thesis. After defining some basic terms, it introduces a general framework for debugging, and classifies and evaluates various approaches to the debugging of parallel programs based on this model. Approaches that relate closely to our method are discussed.

Chapter 3 introduces a programming environment based on models of two task-oriented virtual machines, a Processor Farm Virtual Machine (PFVM) and a Divide-and-Conquer Virtual Machine (DCVM). The message-passing behaviour of applications based on these two virtual machines is discussed. Finally, the problems of debugging a VM-based system are discussed, and possible errors are classified and analyzed.

Chapter 4 proposes a two-phase approach to debugging VM-based applications. Accordingly, fault models for PFVM and DCVM are developed for the automation of the first phase.

Chapter 5 describes our implementation work. First, a brief overview of the underlying hardware and software environment is given, followed by a description of the modifications that need to be made to Tmon [WJC93], a parallel performance monitor for the transputers. Next, the algorithm for generation of the vector logical time stamps is presented, followed by the design and implementation of the trace analyzer.
The chapter ends with a brief description of the user interface to the trace analyzer.

Chapter 6 demonstrates how the monitoring and trace analyzing tools can be used to debug VM-based applications. The experience of debugging an actual PFVM application is described.

Chapter 7 concludes by summarizing the contributions of this thesis, with suggestions for enhancements to the debugging tool and future research.

Chapter 2

Overview

In Section 2.1, we introduce some terminology and describe a general framework for debugging. A general overview of distributed/parallel debugging is given in Section 2.2. In the remaining sections, we discuss the various approaches that have been proposed for debugging parallel or distributed systems. Most approaches to parallel debugging have, as a first step, an event tracing of the program execution. A similar approach is proposed in this thesis. In the second step, depending on what is done with the trace, these approaches fall into three categories: displaying the behaviour, re-execution and assertion checking. Each of these approaches is discussed in Sections 2.3 to 2.5. Section 2.6 summarizes this chapter.

2.1 Terminology and a Framework

Debugging begins when the programmer discovers that the execution behaviour deviates from the specified behaviour of the program. This difference is usually noticed through simple problems such as an incorrect value printed or the program not terminating. This is referred to as a program failure, the result of an erroneous program state. An error is the part of an erroneous state that differs from a valid program state, such as an incorrect value for a variable or the program following an unexpected execution path. The invalid state transition that first caused an erroneous state is referred to as a fault. The manifestation of a fault produces errors in the program state, which in turn lead to program failure.
A bug is a fault and the corresponding error(s).

Naturally, debugging is the process of detecting errors, locating faults and correcting them. Most errors are detected by program testing, and most of the debugging time is spent locating faults. In locating faults, programmers first develop hypotheses about the errors and the faults, and verify or refute these hypotheses by examining the program. In correcting the faults, programmers again develop hypotheses about how to modify the program and once again verify or refute them. Based on this three-step process of debugging, Araki, Furukawa and Cheng [AFC91] proposed, as a general framework, the debugging process model shown in Figure 2.1. In this model, programmers begin with a set of hypotheses, modify that set, select hypotheses for verification, and verify or refute the selected hypotheses until bugs are fixed or, more generally, some level of confidence has been reached. An extended version of this framework is described below in more detail.

Hypothesis set generation. In Araki et al.'s model, hypotheses are related to the program source code, its specification and its behaviour. Programmers hypothesize about the errors in the program, including the locations in the program where errors may occur, the causes of the errors, what constitutes correct or incorrect behaviour, and modifications to correct the faults. These hypotheses include the facts proven so far about the properties of the program and its errors, and the programmer's empirical knowledge about the development and specification of the program. As most errors are detected by program testing, for simplicity, Araki et al. treat the error report as the initial hypothesis set.

Figure 2.1: A Debugging Process Model (a loop through hypothesis set generation, hypothesis set modification, hypothesis selection and hypothesis verification, repeated until the bugs are fixed)

However, communication and synchronization errors in parallel programs are difficult to detect by program testing.
Therefore, we consider error detection to also be an important part of debugging, and we extend the framework by including error detection in this step.

Hypothesis set modification. As the process of debugging proceeds, programmers modify the hypothesis set by creating new hypotheses, by further constraining or refining the existing hypothesis set, and by changing the authenticity of hypotheses based on the results of previous hypothesis verification.

Hypothesis selection. Randomly selecting a hypothesis to verify is not efficient. Instead, a hypothesis should be selected according to a strategy: using tactics to simplify the error condition, narrow the suspicious region, expand the certified region, change the point of view, or weigh the significance of each hypothesis.

Hypothesis verification. A hypothesis is verified by examining the program and its behaviour. As a result of these examinations, a hypothesis is either verified as true or false, or remains unverified. Even when a hypothesis is unverified, one may have obtained some information about the program that will help the debugging process.

Most of today's debugging tools provide mechanisms for observing program execution and helping programmers understand program behaviour. Source-level debugging, enforced execution control, backward tracing, and other techniques are often used to verify hypotheses. These tools support the verification of hypotheses that programmers develop, but do not support hypothesis generation or selection. A complete debugging tool must support every stage of the debugging process: hypothesis generation, modification, selection and verification. In evaluating debuggers for parallel programs, the following four criteria were considered.

• Adequacy: A parallel debugging tool should support all the steps in the process model of debugging.
Specifically, the debugger should address the complexity of parallel programs by presenting the execution behaviour of the program in an understandable way. The debugger should also eliminate nondeterministic execution so that errors are reproducible.

• Productiveness: Faults can potentially appear anywhere in a software system, and whereas some faults are relatively easy to detect, locate and correct, other faults are not. A good debugger should aid in finding and correcting all faults, especially those that are the most difficult to find.

• Feasibility: A debugger is a tool that is eventually implemented in some specific underlying system. It is important to ensure that the debugger is implementable in the targeted system.

• Efficiency: The overhead introduced by a debugger varies from one approach to another. Even though the acceptable range of overhead is relatively large compared to other programming activities, it should not be overwhelming.

2.2 General Overview of Parallel/Distributed Debugging

In recent years, interest in debugging has increased, particularly in parallel and distributed systems. McDowell and Helmbold [MH89] give an extensive survey of distributed debuggers developed up until 1988. They classify approaches to debugging into four groups: traditional debugging, event-based, graphic display and static analysis. The event-based and graphic display approaches both require, as a first step, the monitoring of the execution. While the execution is viewed as a sequence of events in the event-based approach, it is viewed as a flow of control and data in the graphic approach. Since both of these approaches require events to be recorded, we will use only three categories to describe these approaches.

1. Extending traditional debugging techniques to parallel programs: The most important feature of a typical sequential debugger is the capability to set breakpoints. Breakpoints allow the programmer to stop the execution and examine the state of the program.
A straightforward debugging technique for parallel programs is to treat the communicating processes as a set of sequential processes, and to attach a sequential debugger to each process. An obvious disadvantage of this technique is that it is too low-level. As a result, some debuggers incorporate high-level abstractions [Smi85]; however, it is still difficult to coordinate and synchronize these debuggers, especially as the number of concurrent processes grows larger. In addition, since this technique dynamically controls the execution of a parallel program, it can potentially change the behaviour of the program. In particular, because of nondeterminism, it may not be possible to re-exhibit the bugs. In conclusion, this technique does not address the complexity and the nondeterminism present in parallel programs.

2. Based on the event history of program execution: The approaches in this category view the execution of a parallel program as a sequence of events which, taken together, constitute an event history. After the generation of the event history by monitoring the execution, the history can be used in several ways. It can be presented to the user as debugging information, textually, as time-process diagrams, or as animation. The history can also be used to replay the execution of the program.

More sophisticated debuggers provide assertion checking facilities for behaviour comparison. This allows the user to specify a behaviour and compare it with the behaviour shown in the history. These approaches do not actively control the execution as a traditional sequential debugger does; rather, they are passive. Through high-level abstraction of the events these approaches can reduce the complexity of parallel programs, and, compared to sequential debugging, event-based monitoring significantly reduces interference with the execution behaviour of the program.
However, it is not possible to completely eliminate the probe effect, since probes still have to be inserted into the program and thereby affect the execution. The only way to make monitoring non-intrusive is hardware monitoring. These approaches are discussed in more detail in Sections 2.3 to 2.5.

3. Static analysis of the program: The approaches in this category are based on symbolic flow analysis of parallel programs. Static analysis can be used to detect errors such as race conditions and synchronization errors [MC89, MA87, McD89, TO80, Tay84]. It solves the problem of nondeterminism; however, it is limited to error detection. Its computational complexity also limits its use.

Because of the complexity and nondeterminism of parallel programs, traditional debugging is not sufficient for parallel debugging. Given the complexity, even if the facilities existed, it would be difficult for the programmer to manage and control the execution. The computational complexity of static analysis makes it infeasible. In contrast, approaches based on event histories do not actively control the execution during testing, and their abstraction can reduce the complexity of the debugging information. These advantages make event-based debugging more practicable and effective in parallel/distributed environments.

2.3 Display the Behaviours of Communicating Processes

One natural way of helping the programmer to understand the program is to show its behaviour, recorded as an event history. Highly parallel programs are often best understood in terms of logical patterns of interprocess communication [HC89]. In displaying communication interactions, time-process diagrams are often employed to provide a static after-image of communication over time. In most cases, a time line is drawn along one dimension, while individual processes are distributed across the other. Moving back and forth along the time line reveals the sequence of communications recorded during execution.
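The time-process layout described above can be sketched with a small, hypothetical renderer: given a trace of (logical time, sender, receiver) message events, it prints one text row per process and marks sends and receives along the time axis. Both the trace format and the ASCII rendering are illustrative assumptions for this sketch, not the layout used by any of the tools surveyed here.

```python
# Render a toy time-process diagram from a message-passing event history.
# Each event is (logical_time, sender, receiver); the trace format and the
# ASCII output are hypothetical, chosen only to illustrate the idea.

def time_process_diagram(events, processes, width):
    """Return one text row per process; 's' marks a send, 'r' a receive."""
    rows = {p: ["."] * width for p in processes}
    for t, src, dst in events:
        rows[src][t] = "s"   # sender's mark at logical time t
        rows[dst][t] = "r"   # receiver's mark in the same column
    return ["%s |%s" % (p, "".join(rows[p])) for p in processes]

trace = [(1, "P0", "P1"), (3, "P1", "P2"), (5, "P2", "P0")]
for line in time_process_diagram(trace, ["P0", "P1", "P2"], 8):
    print(line)
```

Scanning the columns left to right plays back the recorded communication, which is exactly the "moving along the time line" reading described above.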
Message passing is depicted by connecting the two communicating processes (see [FLM89]). Hough and Cuny [HC89, HC90] go further in this direction by animating the time-process information. By describing and animating abstract user-defined communication events, their debugger, Belvedere, helps users to compare the intended patterns with the patterns that occurred during execution. Belvedere is a trace-based, post-mortem debugger intended for distributed-memory architecture machines. A prototype has been implemented on a multiprocessor simulator.

To facilitate the identification of communication patterns, Belvedere allows the user to define these patterns as abstract events and to use perspectives that restrict the displayed behaviour. There are two perspectives, the process perspective and the consensus perspective. From the process perspective, only the events seen by a single process are animated. From the consensus perspective, high-level events consistent with those seen by all participating processes are animated. As shown in [HC90], Belvedere can be used to detect sequencing errors, missing communication errors and extraneous communication errors.

Hough and Cuny exploit the fact that most parallel programs exhibit recurring or uniform patterns of interprocess communication. Belvedere allows the user to specify the spatial arrangement of processes to reflect these logical patterns. Figure 2.2 shows a sample frame from an interaction animation, configured as a hypercube. Highlighted arrows represent SENDs, highlighted ports indicate messages received via GETs, and multiple arrowheads portray message queueing at a port.

Figure 2.2: Animation of Message-passing in Belvedere

A disadvantage of this type of animation is that it is not sufficient for more complex interactions, especially as the number of processes increases.
As Pancake and Utter [PU89] pointed out, the effectiveness of visualization is bounded by the degree to which the representations of run-time behaviour correlate with the language constructs used to incorporate parallelism, as well as with the logical framework adopted by the programmer. How to select a perspective remains a problem, even when the perspective technique can deal with the complexity. This approach also does not address the problem of displaying events for programs in which the patterns of interprocess communication are not regular. A further disadvantage of debugging parallel programs by portraying behaviours is that often many concurrently executing processes have the same, or similar, behaviour. Without any "filtering", the programmer faces an overwhelming amount of repetitive information. There also remains the question of accuracy: how accurately does the post-mortem animation of the abstracted events depict actual program behaviour?

The debugging tool developed by Caerts, Lauwereins and Peperstraete [CLP91] is also based on the animation of a program, represented hierarchically and graphically. Francioni and Gach [JMFA91] experimented with the use of sounds to portray the behaviours of program execution. As previously mentioned, a limitation of these approaches is that visualization or auralization only helps to detect errors. There is no support for locating faults or verifying hypotheses, as there is no tool in the debugger to show where these anomalies are located in the source program.

2.4 Replay

Because of the nondeterminism of parallel programs, successive executions of the same program may lead to different behaviours and even different results. To solve this problem, an approach called execution replay has been proposed [Els89, Wit89, Sto89, PL89, CMN91, LSS90, SS90].
In this approach, information about critical events, such as system calls and message passing, is recorded during the initial execution of the program. This information is then used to control the re-execution (replay) of the program so that it results in an equivalent execution. The traditional approach to debugging is often used during replay.

According to Leu, Schiper and Zramdini [LSS90], there are two types of execution replay techniques, "data driven" and "control driven". Data-driven replay techniques [PL89, Els89, Sto89] are relatively easy to implement for most common underlying communication systems. During replay, an individual process can be re-executed separately. However, the amount of information that needs to be recorded is very large, and the time to gather the information is significant. In terms of message passing, it requires not only the type and order of messages, but also their contents. The overhead of monitoring can significantly modify the initial execution and make the execution replay meaningless. In contrast, control-driven replay techniques [LSS90, FLM89, Wit89] limit the amount of information. They record only the relative order of the events pertaining to process interaction; the contents of the messages are regenerated during replay. Thus individual processes cannot be replayed individually; all the processes have to be replayed together.

In addition, some debuggers not only record message events and system calls, but also periodically save all variables at checkpoints. During replay, execution can be restarted from a checkpoint. This approach to debugging is called reverse execution [PL89, Wit89]. As a checkpoint usually contains a lot of information and checkpoints may have to be recorded often, this method increases the overhead of recording the execution history. To reduce this overhead, Choi, Miller and Netzer [CMN91, MC89] introduce a mechanism called incremental tracing.
It is based on the idea of need-to-generate and attempts to reduce the amount of information logged during execution. The gaps between the information gathered in the log and the information needed to do the flowback analysis are filled in incrementally by information obtained by statically analyzing the program. This information includes a static program dependence graph and a program database. They have introduced a technique called flowback analysis which allows the programmer to examine dynamic dependences in a program's execution history without having to re-execute the program. Their debugger, PPD (Parallel Program Debugger), works by showing the dynamic dependences among program objects, such as variables and procedures, through the dynamic program dependence graph.

We believe that the dependence graph may still be too complicated for the programmer to examine efficiently in flowback analysis. Simply showing the graph to the user is not enough, since the graph differs from the user's conceptual model of the problem. Therefore it is necessary to develop a high-level tool to access the information in the graph and present it to the user in a meaningful way. In addition, the algorithmic complexity of generating the dependence graph remains to be addressed and, with respect to the general framework, flowback analysis does not support the verification of the hypothesis set.

Replay techniques aim at reproducing the execution behaviour, usually in a traditional debugging environment. However, replay inevitably modifies the behaviour during the monitoring of the initial execution. It is possible to permanently embed monitoring probes into the program; however, this is only feasible when the overhead of monitoring is acceptable to the user.
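The control-driven record/replay idea discussed in this section can be sketched in a few lines: during the monitored run, only the order of message deliveries is logged (not their contents); during replay, receives are forced to follow the logged order, with the contents regenerated by re-executing the senders. The queue-based channel below is a deliberately simplified, hypothetical stand-in for a real message-passing runtime.

```python
# A minimal sketch of control-driven replay: log only the delivery order of
# messages during the first run, then force the same order on replay.
# The in-memory channel is a hypothetical stand-in for a real runtime.

class ReplayableChannel:
    def __init__(self, log=None):
        self.pending = []            # arrived messages, not yet delivered
        self.record = []             # delivery order observed in this run
        self.replay_log = list(log) if log else None

    def arrive(self, sender, payload):
        self.pending.append((sender, payload))

    def receive(self):
        if self.replay_log is not None:
            # Replay mode: deliver from whichever sender the log dictates.
            wanted = self.replay_log.pop(0)
            i = next(i for i, (s, _) in enumerate(self.pending) if s == wanted)
        else:
            i = 0                    # first run: nondeterministic arrival order
        sender, payload = self.pending.pop(i)
        self.record.append(sender)   # only sender identities are logged
        return sender, payload

# First run: arrival order happens to be B before A.
ch = ReplayableChannel()
ch.arrive("B", 42); ch.arrive("A", 7)
ch.receive(); ch.receive()
log = ch.record                      # ['B', 'A']

# Replay: even if A now arrives first, delivery follows the logged order.
ch2 = ReplayableChannel(log)
ch2.arrive("A", 7); ch2.arrive("B", 42)
assert [ch2.receive()[0] for _ in range(2)] == ["B", "A"]
```

Note how small the log is (one sender identity per receive); this is the trade-off the data-driven techniques give up by recording message contents as well.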
A final disadvantage of replay is that during replay, traditional debugging is used, where again there is little help in dealing with the complexity of parallel programs.

2.5 Assertion Checking

Typically, once an event history has been obtained, a programmer wants to test whether or not the history conforms to certain properties. If it does not conform, the programmer needs to find out what the differences are and where they occur. Based on this observation, Hseush, Kaiser and Ponamgi [HK90, HK88, PHK91] describe a "data-oriented" debugging approach for concurrent programs. This debugger provides data-path expressions (DPE) as a formal notation for the user to describe expected or unexpected execution behaviours of a program. Generally, a DPE has the form [<events>] {<debugging actions>} <operator>. The events (<events>) can be data events, such as variable X becoming equal to 0, control events, such as a function or procedure being called, or message events, such as sending or receiving a message. The debugging actions are the actions to perform when the event occurs, for example printing the value of a variable. These actions can also be used to change the execution path at run-time or to assert additional control, such as counting the number of times a specified variable is accessed. The operator specifies the temporal relation between adjacent events; sequencing (;), repetition (*), selection (|) and total concurrency (&&) are some of the relations. Operators are only allowed when there are DPEs following them. During execution, the debugger automatically compares the actual behaviours with those described in the DPE. As an implementation mechanism, they propose predecessor automata for event recognition. A predecessor automaton is a finite state automaton in which each transition is labeled with both a process event and a set of immediate predecessor events, on which the event is causally dependent.
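The essence of such assertion checking — scanning an actual event history and comparing it against an expected property — can be illustrated with a minimal sketch. The event format (process, type, task id) and the property checked, "every task sent is eventually received", are hypothetical, chosen only for illustration; they are not the DPE notation itself.

```python
# Minimal sketch of automatic assertion checking over an event history.
# Event format and the checked property are hypothetical illustrations.

def check_send_recv_matching(history):
    """Return the ids of tasks that were sent but never received."""
    pending = set()
    for proc, etype, tid in history:
        if etype == "send":
            pending.add(tid)
        elif etype == "recv":
            pending.discard(tid)
    return sorted(pending)

history = [
    ("manager", "send", 1),
    ("worker0", "recv", 1),
    ("manager", "send", 2),  # never received: an omitted message
]
print(check_send_recv_matching(history))  # -> [2]
```

A real checker would of course recognize far richer temporal patterns, but the principle is the same: the comparison with the specification is performed mechanically, not by eye.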
Using DPEs, the behaviour of a program can be described at a level of abstraction specified by the programmer. Users are free to debug their programs at the level of individual source code statements, at the level of interprocess activity, or any combination of the two.

Bates [Bat89] views a system's behaviour as a stream of primitive event occurrences and provides powerful operators for composing primitive events into high-level models that can be recognized by the debugger. In this debugger, behaviours are specified in an Event Definition Language which can be automatically compared with the actual behaviours. For more complicated patterns, it provides a graphical communication description language (GCDL) to describe the expected logical patterns of interprocess interactions. The language allows flexibility in describing patterns, in a top-down fashion, at all levels of abstraction. This approach is similar to that of the debugger developed by Francioni and Gach [FG90], except that Francioni et al.'s debugger provides a graphical user interface to display the expected communication patterns.

Rosenblum [Ros91] uses yet another language, called the Task Sequencing Language (TSL), to describe concurrent systems written in Ada. By providing meaningful comparison of behaviours, these debuggers help users to verify their hypotheses about the errors. However, the expressiveness of the specification language depends on the implementation. There also remains the problem of specifying the correct behaviours to be compared with the actual ones. Both of these assertion checking approaches assume reproducibility and do not address the problem of nondeterminism. Another disadvantage of these approaches is that the user has to learn another language in order to debug the program.
Finally, it is non-trivial to describe the behaviour of large complex problems, and this itself might introduce bugs!

Another debugger which provides assertion checking is that of Goldszmidt et al. [GKY89, GYK90]. It is an integrated system for debugging distributed programs written in a concurrent high-level language based on message-passing for interprocess communication (i.e., OCCAM). It provides a variety of user-interface, monitoring and analysis tools integrated around a uniform interprocess communication model. After collecting the events of a program execution, the available analysis tools are Queries, Assertion Checker and Replay. Queries provides a facility for the programmer to ask about the state of program execution at some point defined by a logical time stamp or some other condition. Assertion Checker allows temporal assertions to be checked against the event history. The Replay tool aids in visualizing the evolution of a computation by displaying a simulated replay of the events. The tools also include a Scheduler that accepts control commands from the user and accordingly schedules the running of application processes, and a Driver that enables the user to manually simulate an external environment for running a single process or a set of processes without the rest of the environment.

Their debugging tool integrates the traditional approach of stop-and-look debugging with an event based high-level approach in a simulated environment. A disadvantage is that in a real distributed environment tools such as the Driver and the Scheduler would be difficult to implement and their overhead unacceptable.
Finally, manually simulating a process environment is not feasible for debugging parallel application programs that are large and execute for a long time.

2.6 Summary

Traditional debugging approaches require active control of the execution of a program. This greatly changes the behaviour of a parallel program and may cause the bugs to disappear when a sequential debugger is simply attached to each process. Furthermore, it is difficult to simultaneously coordinate and manage a large number of sequential debuggers, even when nondeterminism is not a problem. The static analysis approach avoids the execution of the program, thereby eliminating the problem of non-reproducibility caused by nondeterminism. However, it is usually limited to error detection and the algorithmic complexity of the analysis is intractable. Event-based debugging approaches record the behaviour by monitoring the execution events. They do not control the execution activity. An event history can then be presented to the user for replay or assertion checking. Through high-level abstraction of events, these approaches do reduce the complexity of understanding the behaviour of parallel programs.

Debugging parallel/distributed programs is a difficult problem; there is no simple solution. As shown in the debugger developed by Goldszmidt [GYK90], a parallel/distributed debugger needs to integrate a variety of techniques. The traditional debugging approach remains attractive if the complexity of a parallel program can be reduced and the reproducibility of execution assured. To reproduce the execution, event-based monitoring and control driven replay are effective and feasible. Assertion checking provides a facility for high-level abstraction and automatic hypothesis verification, and is a promising technique to deal with this complexity.
It is desirable to integrate these techniques so that assertion checking localizes the errors at a high level of abstraction, after which the error prone parts can be debugged in a traditional debugging environment where reproducibility is assured by using the event history to control the re-execution.

Chapter 4 (sic)

Chapter 3

VM-based Applications and Their Message-passing Behaviour

A virtual machine is a restricted model of computation that corresponds to an ideal architecture for a specific parallel programming paradigm. It consists of a number of abstract processors and a communication network. An abstract processor performs specific operations on high-level abstract data units and interacts by message-passing in a network. The communication network provides the facility for these processors to communicate with each other. A Task-oriented Virtual Machine is a virtual machine with the following characteristics:

• a single input and output stream,

• abstract data units, called tasks, which can be processed independently, and

• the flow of tasks is the only communication in the network.

Since we consider only task-oriented virtual machines throughout this thesis, unless stated otherwise, they will simply be referred to as virtual machines (VMs).

The basic objective in developing a VM-based programming environment is to promote ease of use, and reuse, as far as is consistent with expressiveness and efficiency. Virtual machines are implemented as pluggable software components (i.e., templates) for use as basic building blocks in the construction of parallel applications. This allows users of the system to develop parallel applications without requiring detailed information about the underlying architecture and run-time system. Each template corresponds to a parallel programming paradigm (e.g., processor farm, divide and conquer, vector model).

In Section 3.1, we describe the concept of VM-based programming and its environment.
In Sections 3.2 and 3.3, two specific VMs, processor farm and divide and conquer, are described and their behaviour analyzed. In Section 3.4 we classify possible errors and discuss the problem of debugging these systems.

3.1 Introduction to VM-based Programming Environments

In a distributed memory multiprocessor system a parallel program must communicate via message-passing. A VM-based programming environment provides an easier way to program such systems since it hides the scheduling, load-balancing and other issues of distributed (parallel) computing.

As shown in Figure 3.3, a VM-based programming environment is a hierarchical system consisting of three layers: an application layer, a virtual machine implementation layer and an implementation machine layer.

Figure 3.3: Overall System Structure of a VM-based System

The middle layer of the system is the virtual machine implementation. It consists of two parts, the virtual machine interface and the Intermediate Virtual Machine (IVM). The interface is the language interface of the VM, while the IVM is the implementation of the VM run-time system. The interface is what the programmer sees while programming the system. The IVM is the combination of the VM's conceptual model and the topologies which are suitable for this virtual machine. Depending on the VM and the implementation machine there may be more than one IVM corresponding to a single VM. Multiple IVMs are provided so that the programmer can tune performance by changing the configuration without altering the program.

The top layer is the application: the user code built on the virtual machine.
The bottom layer is the implementation machine, which can be further divided into two sublayers: toolkit and target architecture. The toolkit sublayer is a set of primitives provided for the implementation of the virtual machine. It could be an operating system or a set of tools which supports the communication among processes in the system and provides other services such as input and output, loading the executable code into the system, etc. The target architecture is a distributed memory multiprocessor system. Each processor has its own private memory; there is no shared memory, thus the only form of communication is via message passing.

Figure 3.4: A VM-based System with Multiple Virtual Machines

An example of how a program can be specified is shown in Figure 3.5.

Specification of VM-based Application

begin
    Virtual_Machine
        type = PFVM;
    Interface
        input_stream = stdin;
        output_stream = stdout;
    Configuration
        topology = BTREE;
        total_tasks = 100;
        total_processors = 16;
    Programs
        data_generator = file1.c;
        comp_function = file2.c;
        result_receive = file3.c;
end

Figure 3.5: Specification of an Example Program

There are five parts to the specification. The Virtual_Machine part specifies the type of VM, which is the name of the virtual machine. The Interface part describes the input stream and the output stream of the program. The Configuration part gives the parameters that define the size of the problem and the virtual machine. These parameters affect the performance of the system. Depending on the type of VM, different topologies are possible; for example, a chain and a binary tree are possible topologies for the processor farm virtual machine (PFVM). The total_tasks is the total number of tasks in the system.
Together with the size of the tasks (i.e., granularity), it gives the total amount of work. The total_processors defines the number of processors in the system, and in the Programs part, the names of the files containing the user defined functions are specified.

Note that a single virtual machine may be too restricted for many problems. Thus a practical programming environment based on virtual machines, as shown in Figure 3.4, will generally have more than one VM. One extra layer, a VM Metalanguage, is added to the overall system structure between the application and the virtual machine. It provides a metalanguage for the programmer to "glue" the virtual machines into a single system which models the problem. Accordingly, facilities for interactions between virtual machines must also be present. In terms of debugging a multi-VM application, one possible approach is to first debug the individual VMs and then combine them together. The communication between VMs may introduce bugs into the final system; however, this problem is beyond the scope of this thesis.

A VM-based programming environment makes it easier to program a parallel system while maximizing performance. As previously mentioned, it also promotes software reuse. For details of these issues, interested readers are referred to Sreekantaswamy [Sre93].

3.2 Processor Farm Virtual Machine

Figure 3.6: Processor Farm Virtual Machine

Many scientific and engineering applications require repeated executions of the same program with different initial data (tasks) [Sre93]. In addition, tasks are processed independently of one another; there is no interaction in the execution of these tasks.
However, the execution paths may vary from one task to another even though the processors execute the same program.

These applications fit into a manager/worker computational model, where a manager process generates tasks for a set of worker processes, without specifying which process should perform the computation for a particular task. The manager is responsible for generating the tasks, distributing them to workers, and then collecting the results from the workers.

An ideal architecture for this kind of application is a Processor Farm Virtual Machine (PFVM), which consists of a manager processor and a "farm" of worker processors. The manager processor reads the initial data from the input stream, generates the tasks, collects the results and outputs them to the output stream. Each worker processor is a task processing unit. Workers receive tasks from the manager, execute them, and return the results. Each worker processor executes the same program (with a data dependent execution path) but works on different data.

3.2.1 PFVM Interface

Conceptually, a user can assume that the PFVM has as many worker processors as there are tasks. This conceptual model of programming a PFVM is shown in Figure 3.6. To use the PFVM, the user needs to supply the definitions of three sequential functions for task generation (data_generator()), result receiving (result_receiver()) and task computation (compute_fn()). The first function, data_generator(), executes on the manager and is used to decompose the problem into a sufficient number of tasks. After generating a task it calls do_task() to send the task to the worker processors. The second function, compute_fn(), is the user defined function invoked by the worker on each of the tasks. On each worker, compute_fn() receives tasks from the manager and, after computing the function, returns the result.
There is also another user function, result_receiver(), that runs on the manager node and receives results from the workers. In the implementation of TrEK [Sre93] there is a further function, init_master(), which performs some initialization and can be considered to be a part of the data_generator() function. User programs are linked with the VM code to produce an executable which is then loaded onto the processors.

Figure 3.7: IPFVM in a Chain

3.2.2 Intermediate PFVM

The target architecture used by the PFVM can be any arbitrary tree-connected network of processors. A chain and a binary tree are two common architectures. Their corresponding Intermediate Processor Farm Virtual Machines (IPFVM) are shown in Figures 3.7 and 3.8, where the nodes represent processors, the edges the communication links, and the arrows the direction of task flow.

Conceptually, the execution of the application on such a virtual machine can be viewed as a flow of tasks through the system. For example, in the IPFVM using a chain topology, in order for a task to be executed successfully it must first be passed from the manager processor to worker 0. After the task reaches worker 0, it can either be executed there or passed on to the next worker processor in the chain, etc. After a task has been executed, its result is passed back along the chain until it reaches the manager processor. From a message-passing viewpoint, the behaviour of such a system is completely defined by the sending and receiving of tasks between neighbouring processors. The result of a task must bear the same identity as the task itself, thus it is not necessary to distinguish between the task and its result.
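The PFVM user contract described above can be sketched sequentially. The function names data_generator, compute_fn and result_receiver come from the text; their bodies here are illustrative only, and run_pfvm() stands in for the farm with a single sequential worker rather than the real distributed run-time.

```python
# Sequential sketch of the PFVM user contract (illustrative bodies only).
# The user supplies a task generator, a per-task computation and a result
# receiver; the virtual machine moves the tasks between them.

def data_generator(initial_data):
    # Decompose the problem into independent tasks.
    for task in initial_data:
        yield task

def compute_fn(task):
    # User-defined computation, normally executed on some worker.
    return task * task

def result_receiver(results):
    # Collect results; in a real farm their arrival order is not fixed.
    return sum(results)

def run_pfvm(initial_data):
    # Every task flows generator -> worker -> receiver, as in the chain
    # IPFVM, but here strictly one task at a time.
    results = [compute_fn(t) for t in data_generator(initial_data)]
    return result_receiver(results)

print(run_pfvm([1, 2, 3]))  # -> 14
```

Because the tasks are independent, this sequential stand-in computes the same final value as a parallel farm would, which is precisely what makes sequential debugging of the user functions meaningful.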
Figure 3.8: IPFVM in a Binary Tree

3.3 Divide and Conquer Virtual Machine

The Divide and Conquer Virtual Machine comes from the well-known problem solving technique, divide and conquer. This technique is widely used in a variety of areas including graph theory, matrix computation, etc. In divide and conquer, a problem is recursively divided into several subproblems until they are small enough, whereupon each subproblem is solved directly. Subproblems are independent from each other. After all the subproblems have been solved, the final solution is obtained by recursively combining the results in the reverse order.

The Divide and Conquer Virtual Machine (DCVM) is the ideal architecture for such applications. As in the PFVM, it consists of a manager and several workers. The manager reads the initial data from the input stream and generates the tasks in the same manner as the PFVM manager does. However, a worker not only forwards tasks but also splits them along the way. Each task is split recursively to a certain level before being processed by a worker. Once the results are obtained, they are combined and forwarded to the result receiver. Figure 3.9 shows a DCVM in which each task is recursively split twice before being processed.

3.3.1 DCVM Interface

To use the DCVM, the user must define five sequential functions: a task generator (data_generator()), a result receiver (result_receiver()) and a task executor (compute_fn()), together with two additional functions, a task splitter (split_fn()) and a task joiner (join_fn()).

In programming the DCVM, the user defines the number of subtasks into which a task is split and the maximum number of splits. Note that the actual number of splits is decided dynamically by the system and will not exceed the maximum number. In general the number of splits depends on the topology of the virtual machine.
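The split/compute/join behaviour just described can be sketched sequentially. The user functions split_fn, compute_fn and join_fn are named in the text; their bodies here (halving a list and summing) are illustrative only, and the recursion depth stands in for the levels of the processor tree.

```python
# Sequential sketch of DCVM task processing: a task is recursively split
# (at most max_splits times), leaf subtasks are computed directly, and
# results are joined in the reverse order of the splits.

def split_fn(task):
    # Illustrative splitter: halve a list into two subtasks.
    mid = len(task) // 2
    return [task[:mid], task[mid:]]

def compute_fn(task):
    # Illustrative leaf computation.
    return sum(task)

def join_fn(results):
    # Illustrative joiner: combine subtask results.
    return sum(results)

def process(task, max_splits):
    if max_splits == 0 or len(task) <= 1:
        return compute_fn(task)
    return join_fn([process(sub, max_splits - 1) for sub in split_fn(task)])

print(process([1, 2, 3, 4], max_splits=2))  # -> 10
```

In the real system the number of splits is chosen dynamically, as noted above; the sketch simply bounds the recursion by max_splits.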
Conceptually, the DCVM may be of any degree or depth.

Figure 3.9: Divide and Conquer Virtual Machine

3.3.2 Intermediate DCVM

As in the PFVM, the target architecture used for the DCVM may be any tree-connected network, or spanning tree of a connected network, of processors. The Intermediate Divide and Conquer Virtual Machine (IDCVM) with the topology of a binary tree is shown in Figure 3.8. In contrast to the PFVM, every non-leaf worker processor has, in addition to the computation process, a splitting process and a joining process. The flow of tasks in the system is different from the PFVM: a task is no longer passed from one worker to the next; rather, subtasks proceed from level to level in the tree. When results are received, a node returns them to its parent.

3.4 Debugging VM-based Systems

Since applications built using a virtual machine consist of several sequential functions, once again it is natural to debug these functions in a sequential debugging environment. However, there may remain errors that cannot be found until the application runs in the target distributed environment. This is especially true for large systems, in which it may be too time-consuming to thoroughly test each function. There are also other errors that are environment-dependent. Memory errors, for example, will depend on the amount of available memory. In addition, there remains the problem of non-determinism. For example, the actual ordering of inputs to the user defined function result_receiver() in the PFVM is not known until the application runs in a distributed environment. This order may change from run to run. In conclusion, it is necessary to debug the application as a distributed program.
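The result-order nondeterminism just described can be made concrete with a sketch: the order in which workers happen to finish determines the input order of result_receiver(), so an order-sensitive receiver gives run-dependent answers. The two completion orders are written out explicitly here; in a real run they depend on timing.

```python
# Sketch of result-order nondeterminism in a processor farm.  The receiver
# here is order-sensitive on purpose, to expose the problem; compute_fn's
# body is illustrative only.

def compute_fn(task):
    return task * 10

def result_receiver(results):
    # Order-sensitive on purpose: keeps only the first result received.
    return results[0]

run_a = [compute_fn(t) for t in [1, 2, 3]]  # completion order of one run
run_b = [compute_fn(t) for t in [3, 1, 2]]  # another run, another order

print(sorted(run_a) == sorted(run_b))  # same set of results: True
print(result_receiver(run_a), result_receiver(run_b))  # -> 10 30
```

Both runs produce the same set of results, yet the receiver's answer differs, which is exactly why such errors may disappear or reappear between executions.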
3.4.1 Classification of Errors

In a VM-based programming environment, errors may appear in the following locations: the application (the user code), the VM implementation, the operating system under the VM, or the hardware system, as in the case of an incorrect configuration of the communication links. In this section, the types of errors possible in the system are investigated and classified.

In a distributed system based solely on message-passing, errors can be divided into the following three classes: errors in expressing parallelism, errors in message passing and errors inside a process. These errors are discussed below. Note that common symptoms, such as deadlock, are not listed as errors because deadlock may be due to a variety of errors, such as an omitted message (an error in message passing) or the untimely death of a process (an error in expressing parallelism).

Errors in message-passing. This class of errors is detected when the message-passing behaviour between processes, particularly between processes on different processors, is incorrect. Errors that are due to incorrect message passing include:

• Omitted messages. When an expected message does not arrive, the program often deadlocks.

• Unanticipated messages. When two processes communicate, it is possible that one process sends more messages than the receiving process expects. From the receiver's point of view, this is an error of unanticipated messages. This error may or may not cause a system failure, depending on the message-passing mechanism and the receiver's execution path. There are a number of possibilities. (a) If the message is a blocking send, the sender blocks forever since there is no corresponding receive call.
(b) If the message is a nonblocking send, and this continues to occur, messages eventually overflow the buffers in the destination processor, again leading to deadlock. (c) If the receiver is expecting more messages after the unanticipated message, it may treat this message as the expected one and incorrectly process the data. (d) If the message is a nonblocking send and the destination buffers do not overflow, the system may continue to run and finish successfully. In this case, the error manifests as un-received messages left in the system.

• Messages received in an unexpected order. A series of messages arriving in an unexpected order may cause the failure of the receiver of these messages. The symptoms are similar to the case of unanticipated messages.

Errors in expressing parallelism. This class of errors appears at the level of processes and is detectable by observing an individual process as a black box. From the programmer's point of view, these errors are incorrect process behaviour, such as too many or too few processes being created. Examples of errors in this class are:

• a process created when it should not have been,

• a process waiting for the completion of an unscheduled process,

• a process waiting for the completion of another process that is already guaranteed to have completed, and

• untimely process death.

Errors inside a process. Unlike the above two classes, this class of errors is common to both sequential and distributed programs. These errors can only be observed by examining the internal state of a process, such as the value of a variable. In a distributed program, these errors can manifest themselves as errors of the previous classes. A common error in this class, which is also difficult to debug, is the misuse of pointers.
Other errors in this class include:

• inconsistencies in the declarations of common data,

• variables not initialized properly, and

• misuse of variables, including pointers.

3.4.2 Discussion

In general, the faults that directly result in the first two classes of errors are algorithmic faults. They are often faults in the algorithms, such as incorrect synchronization between processes. The faults that result in the third class of errors are usage faults. Assuming that the underlying VM system is correct, the only faults in a VM-based application are usage faults. However, as mentioned, these usage faults may exhibit themselves as errors of the first two classes. Both the PFVM and the DCVM have a predictable message passing pattern: tasks always start at the manager, are forwarded to one or more workers where they are computed, and their results are returned along the same route. This pattern can be used to debug programs that use the PFVM or DCVM.

Chapter 4

A Two-phase Approach to Debugging VM-based Applications

Based on the analysis in the last chapter, we propose a two-phase approach to debugging a VM-based system. In the first phase, error identification and localization are automated at the conceptual level using fault models. The results from this phase are used to control the partial replay of the system in the second phase. In Section 4.1, we describe our two-phase approach. In Section 4.2, we define several abstractions needed to define the fault models. In Section 4.3, we define the fault models for programs that use the PFVM and DCVM. In the construction of these fault models it is assumed that the VM system is correct. Finally, in Section 4.4, a summary is presented.

4.1 Two-phase Debugging Approach

As discussed in the previous chapter, the PFVM and DCVM differ only in that the DCVM splits tasks before computation and combines results, whereas in the PFVM tasks can only be forwarded without being split.
Thus, when observed at the level of interprocessor communication, the PFVM may be viewed as a degenerate case of the DCVM in which a task is split into only one task, the task itself. Therefore, in the following discussion we will not distinguish between these two virtual machines. In the case of the PFVM we assume that there are pseudo splitting and joining processes.

In programming a virtual machine, the programmer can assume as many processors as required. A task can be split any number of times before being computed, and in each split the number of sub-tasks can be arbitrary. This defines a conceptual model of the distributed program which corresponds to the virtual machine layer of the environment. In this model, the logical computation of each task can be represented by a task graph in which each node represents one of the following processes: the generator, the splitter, the joiner, the executor (the computational process) or the post-processor (the process that collects and computes the results). The arcs between these nodes represent the communication between them. A formal definition is given in Section 4.3.

These processes still have to be mapped onto the target architecture to be executed. The execution of these processes can be viewed as a flow of tasks into the architecture. This flow defines an execution model of the distributed program which corresponds to the IVM layer of the environment. In this model each node is a processor. Each processor attempts to grab tasks from its parent. The processor keeps a task if it is currently idle (not executing a task); otherwise it splits the task into a number of subtasks and puts them into the output queue to be distributed to its children processors. Thus, at any given moment there is only one task being processed on a node.

By analyzing the task flow behaviour of the system at this level, we would like to map the execution model back to the conceptual model of the programmer.
By narrowing the errors down to the process executing a particular task, the programmer can debug the application in a sequential environment. This is consistent with the objectives of VM-based programming — to hide the details of distributed computing while providing the performance of a distributed (parallel) system. Based on this methodology we propose the following two-phase approach to debugging.

The first phase — Automatic monitoring and trace analysis

In this phase, the message passing history is traced and analyzed to locate possible errors as an initial guide for the programmer to debug in the second phase. As mentioned in the last chapter, we target those errors that can be identified from the message-passing pattern. As we collect only message passing events, it is not possible to pinpoint the exact location of an error at this abstraction level. Instead, errors are identified in terms of their associated process, task and processor. This information is then used in the second phase to locate the errors in the source code.

The second phase — Sequential debugging

In this phase, a simulated execution environment is provided so that the programmer can replay those parts of the program identified in the first phase. A sequential debugging environment is used to pinpoint the location of errors, and to verify and correct the faults.

As a result of the analysis in the first phase, the problem has been reduced to one of debugging a sequential segment of code. Since the second phase reduces to debugging sequential programs, we have concentrated on the first phase. In the following sections, we show that the first phase of debugging can be automated by building appropriate fault models for the VMs.

4.2 Definitions

In this section, we define the abstractions needed to build the fault models.
For simplicity, a processor in the execution model is called a node, and the master processor is always assumed to be the first node in the topology.

• An event, e, is the tuple (sid, type, did, tid, ptime, ltime), where

— sid is the node on which the event is initiated.

— type is either send or recv, which represents an abstract task send or task receive. Note that it does not directly correspond to a message-passing call in the program.

— did is the destination node when the type is send, or the source node when the type is recv.

— tid is the task associated with this event.

— ptime is the physical time stamp as recorded during the tracing. It linearly orders the events with the same sid.

— ltime is the vector logical time stamp, a temporal partial ordering that records the "happened-before" relation.

• A message-passing history, H, is a set of events.

• The event set of a task tid, E, is the set of all events in H which have the same tid, together with the following two pseudo events:

— A source event, e_s, which is defined as the tuple (FirstNode, recv, FirstNode, tid, PtimeStart, LtimeStart), where FirstNode is the identity of the first node in the VM topology, i.e., the node where the task generation part of the application runs. For example, the FirstNode of a binary tree DCVM is the root of the tree of processors. PtimeStart is an assigned physical time stamp which is smaller than the physical time stamps of all other events on FirstNode. Similarly, LtimeStart is a logical time stamp assigned so that this event happens before all the other events in this set. The source event is assumed to have been generated by the data generating process.

— A sink event, e_k, which is defined as the tuple (FirstNode, send, FirstNode, tid, PtimeEnd, LtimeEnd), where FirstNode is as defined for the source event. PtimeEnd is an assigned physical time stamp which is larger than the physical time stamps of all other events on FirstNode.
Similarly, LtimeEnd is a logical time stamp assigned so that this event happens after all the other events in this set. The sink event is assumed to be generated by the result receiving process after the result of the task is received.

• An event graph of a task, G(V, E), is a directed S-T (single source and single sink) graph where V is the event set of the task and E is a set of arcs (a, b). There is an arc (a, b) ∈ E whenever a, b ∈ V (a ≠ b) and b receives a message from a, or a and b are initiated from the same process and a happened before b. Thus, an event graph represents the causality relations between all the events related to a single task.

• A coalesced event graph of a task is a collapsed event graph obtained by combining all the vertices initiated by the same process into a single vertex. A vertex is considered completely coalesced if it coalesces the correct number of events; otherwise, it is called an incompletely coalesced vertex. The correct number of events for the data generator process is two: the first is the source event and the second is the send event that passes the task to the next processor. Similarly, the number of events for the post-processor process is also two: the first is the recv event which receives the result of this task and the second is the sink event. For the split process, the correct number of events is one more than the degree of the task graph; of these, one event receives the task and the other events send the sub-tasks to the next processors. Similarly, the correct number of events for the join process is also one more than the degree of the task graph.

• Of all events in the event set of a task, those associated with the vertices which are reachable from the source event and do not have any outgoing arcs are called frontier events. The frontier node of a frontier event is the node sid when the event is a recv, or the node did when the event is a send.
The frontier node is where the data flow of this task stops. The set of frontier events of an event graph is called the frontier set of the event graph.

4.3 Fault Models

As shown in the first section, the architecture and the task graph are the most essential parts of the specifications of a VM-based application. In the following, we give their formal definitions.

An architecture, A(Va, Ea), is a directed graph where Va is a set of processors and Ea is the set of arcs such that (n1, n2) ∈ Ea whenever n1, n2 ∈ Va (n1 ≠ n2) and there is a communication link from n1 to n2.

A task graph, T(Vt, Et), is a directed graph where Vt is the set of processes computing this task and Et is the set of arcs such that (p1, p2) ∈ Et whenever p1, p2 ∈ Vt (p1 ≠ p2) and there is message passing from p1 to p2. An arc (p1, p2) is labeled "Distributing" if the message passed is a task, or "Collecting" if the message passed is a result.

A task computation graph, C(Vc, Ec), is the directed graph obtained by combining an architecture A(Va, Ea) and a task graph T(Vt, Et). Vc is the set of vertices such that Vc ⊆ Va × Vt, and Ec is the set of arcs such that ((n1, p1), (n2, p2)) ∈ Ec if (n1, p1), (n2, p2) ∈ Vc ((n1, p1) ≠ (n2, p2)) and (n1, n2) ∈ Ea and (p1, p2) ∈ Et.

Given an architecture A and a task graph T, the family of task computation graphs, C, is the set of all correct task computation graphs.

Given an event graph and a task computation graph, the event graph is said to map onto the task computation graph if there is a one-to-one correspondence from its coalesced event graph to the task computation graph. If there is also a one-to-one correspondence from the task computation graph to the coalesced event graph, the event graph is said to match the task computation graph.

The specifications of a VM-based application define a family of task computation graphs.
Given a message history, as defined in the last section, every event graph will map onto a computation graph. Based on whether or not the event graph matches some computation graph, we have the following fault models.

• Success. Given the family of task computation graphs C defined by the specification of a VM-based application and an event history H which defines a set of coalesced event graphs S, the event history is considered a success if, for every event graph Et ∈ S corresponding to some task t, there exists a C ∈ C such that C is a match of Et. That is, the message-passing behaviour of the system is as expected. Note that this can only detect errors that are exhibited in the message passing; it is up to the programmer to detect the other types of errors by comparing the results of the tasks with those expected. Once such a task number is identified, the user can proceed to the second phase, where its event graph is used to control the replay.

• Failure. Given the family of task computation graphs, C, defined by the specification of a VM-based application and an event history, H, which defines a set of coalesced event graphs S, the event history is considered a failure if there is some event graph Et ∈ S, corresponding to a task t, for which there exists no C ∈ C such that C is a match of Et. In other words, the message-passing behaviour of the system does not match a correct execution because the computation of some task failed to complete.

For each such coalesced event graph, there are m errors in computing this task, where m is the number of frontier events of Et. For every frontier event in these event graphs, there is a fatal error located at (tid, pid, nid), where tid is the task number. If the associated vertex of this frontier event is not a completely coalesced vertex, then nid is the node associated with the vertex of the coalesced event graph and pid is the process associated with the vertex.
Otherwise, nid is the node associated with the destination vertex of the outgoing arc of this vertex in the task computation graph, and pid is the label of that outgoing arc.

The event graph of a task represents the causality relation between events associated with this task. Each frontier event in the graph represents a location where the computation of a certain sub-task fails to complete, that is, an error. The location of this error is the location of the next expected event.

4.4 Summary

By analyzing the message-passing behaviour of a VM-based program, we have developed fault models so that the first phase of debugging can be automated. As we have assumed that the underlying VM implementation, the operating system and the hardware are all correct, the fault model is able to detect errors in the application that manifest themselves as errors in the message-passing behaviour. Errors that do not cause any deviation from the message-passing behaviour but return an erroneous result cannot be detected by this technique. Once the task in which an error occurred has been identified, the programmer can use the second phase to sequentially debug the code.

Chapter 5

Implementation

In this chapter, we describe the design and implementation of the first phase of the debugging method proposed in the last chapter. Section 5.1 gives an overview of the system. Section 5.2 describes the event tracing and discusses some of the problems in collecting traces for the purpose of debugging. Section 5.3 describes the implementation of the fault model and the automation of the error identification and localization. Finally, in the last section we discuss the implementation, including event tracing and time stamping of events.

5.1 Overall Design of the System

Figure 5.10 depicts the overall structure of the system.
The implementation of virtual machines, in this case TrEK (Tree Execution Kernel), must first be instrumented by inserting special probes. It is then compiled and linked with the application, and the resulting executable is downloaded onto the transputers. As it executes, events are generated. A monitor, Tmon-de, collects these events into an event trace file. The trace analyzer, deTrEK, takes the event trace as input and produces a fault list. The second phase debugger has not been implemented. It would take as input the user code, TrEK, the fault list and the event trace, and sequentially simulate the VMs.

[Figure 5.10: Overall Design of the System]

The major parts of the implementation were Tmon-de, a debugging version of Tmon, the trace analyzer, deTrEK, and the instrumentation of TrEK.

5.2 Instrumentation

As described in Chapter 3, TrEK consists of several processes on each processor. To monitor the task flow between processors, special probes were inserted into the processes that manage the communication channels. In TrEK, the passing of a task or the result of a task is done by passing two messages: the size of the data and the data itself. A probe was inserted after these two communication calls. The probe generated an event, as defined in Chapter 4, which was then assigned a physical time stamp by the monitor.

As a result of this instrumentation, the event tracing is at a level that more closely corresponds to the virtual machine. The resulting events are a set of abstract events, each of which is a send or receive of a task (or its result), instead of a message. Tracing abstract events reduces the overhead of event tracing and the interference with the programs being monitored.
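The effect of such a probe can be pictured with the following sketch. The function name, the buffer layout and its capacity are all hypothetical, chosen only to show that one abstract event is recorded per task movement, not one per message.

```c
#include <assert.h>

#define TRACE_CAP 128           /* events per trace buffer (assumed size) */

typedef struct {
    int  sid, did;              /* initiating node and peer node       */
    int  is_send;               /* 1 = task send, 0 = task receive     */
    long tid;                   /* task moved by the two messages      */
} probe_event;

static probe_event trace_buf[TRACE_CAP];
static int         trace_len;

/* Hypothetical probe, called once after the size message and the data
   message that together move a task: it records a single abstract
   event rather than two message-level events.  Returns 0 on success,
   or -1 when the buffer is full and must be flushed first. */
int probe_task_event(int sid, int is_send, int did, long tid)
{
    if (trace_len == TRACE_CAP)
        return -1;
    probe_event *e = &trace_buf[trace_len++];
    e->sid = sid; e->did = did; e->is_send = is_send; e->tid = tid;
    return 0;
}
```

In the real system the monitor, not the probe, attaches the physical time stamp; the sketch leaves that field out for brevity.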
It also simplifies the filtering of the trace and the further processing needed for the analysis.

5.3 Event Tracing

Event tracing is an important part of any post-mortem debugging method. The accuracy of the results depends on the completeness and accuracy of the tracing mechanism. The objective of our design of the monitor was to obtain high accuracy and completeness while minimizing the probe effect and the overhead of generating the events, collecting them and downloading them to the host.

5.3.1 Environment

In this subsection, we give a brief description of the underlying instrumentation environment, including the hardware architecture, the operating system and TrEK.

Multicomputer Architecture

The target hardware architecture for our debugging system is a 75-node transputer-based system in the Department of Computer Science at UBC. The system consists of a Sun 4 workstation as the host and 75 processors. On each processor, or node, there is an IMS T800 transputer with 4 Kbytes of on-chip RAM, 4 bidirectional serial links and 1 to 16 Mbytes of local memory. The 75 nodes are interconnected through 10 programmable crossbar switches, and the system is connected to the host by an Sbus interface. There are currently four connections between the host and the transputers and two connections between the host and the crossbar switches, which are used to control the configuration of the crossbars, and thus the topology. Nodes that are not directly connected to the host can only communicate with the host through intermediate nodes. The interconnection topology of the nodes is statically reconfigurable by a process on the host that sends switch-setting commands to the crossbar switches.

Trollius Operating System

The multiprocessor toolkit in our implementation is the Trollius operating system [Bur88].
It is a parallel operating system designed for distributed-memory multicomputers, developed jointly at the Ohio State Supercomputing Centre and the Cornell Theory Centre. (Trollius is a trademark of the Ohio State University and the Cornell Research Foundation.)

Trollius provides a uniform programming environment that extends from the parallel processing units to the host. It enables the programmer to utilize the program development power of the UNIX operating system on the front end and the computational power of the parallel machine on the back end. Trollius provides a message-passing facility between processes on different nodes, as well as within the same node. There are two levels of message passing in Trollius. One is for intra-processor communication at the kernel level; it provides a blocking message-passing service to processes on the same node. The other is for inter-processor communication at the network level; it loosely resembles the OSI network model and provides datalink, network and transport layer services. The synchronization between the message sender and receiver is via arbitrary numbers called events, instead of the process identifier. Other features of Trollius include multitasking, access to the host's remote file system and process loading. These services are provided as a C library for process creation, process destruction, signal handling and access to the remote file system. Many of these services are also available as command programs which are executed from the UNIX shell. For a detailed description of Trollius, readers are referred to [Bur88] [Tro92b] [Tro92a].

The Tree Execution Kernel

The virtual machine implementation in our VM-based programming environment is the Tree Execution Kernel (TrEK) [Sre93]. It is a runtime kernel that controls and coordinates the activities of a virtual tree of processors. The kernel on each node is responsible for accepting tasks from its parent processor.
Upon receiving a task, it may, depending on the load, execute the task locally or pass it on to one of its children processors. The kernel is also responsible for returning the results to its parent, and eventually to the root processor that generated the tasks. Independent of the underlying topology of the target architecture, the kernel provides a virtual tree machine with the topology specified by the programmer. (UNIX is a trademark of AT&T.)

The task distribution strategy in TrEK is a scheme called flood-filling. In this scheme, a child processor greedily tries to grab tasks from its parent. A child processor keeps a task if it is currently idle (not executing a task); otherwise it splits the task into a certain number of subtasks and adds the subtasks to the output queue to be distributed to the children processors. The number of subtasks split from a task is controlled by the user program. Results are returned to the parent processor as soon as they become available. That is, there is no guarantee that the results will be returned in the order of the distribution of tasks; the whole system is not a FIFO queue.

In terms of implementation, TrEK is organized as a set of communicating processes. On a typical node, there are 8 high-priority processes that manage the 4 input and 4 output communication links and are responsible for task/result receiving and forwarding. In addition, there are 4 other processes: a data manager, a result manager, a split process and a join process. Briefly, the data manager is responsible for making task scheduling decisions, and the result manager gets the result from the user-defined computation process and passes it to the parent processor.
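The flood-filling decision the data manager makes can be summarized in outline. This is a deliberately simplified sketch: the real data manager also coordinates with the link processes, and the depth check below merely stands in for whatever splitting bound the user's split function enforces.

```c
#include <assert.h>

typedef enum { KEEP_TASK, SPLIT_AND_FORWARD } sched_action;

/* Sketch of flood-filling: an idle node keeps the incoming task and
   computes it locally; a busy node splits the task and queues the
   subtasks for its children, unless the task can no longer be split. */
sched_action flood_fill_decide(int busy, int split_depth, int max_depth)
{
    if (!busy)
        return KEEP_TASK;               /* idle: compute locally      */
    if (split_depth >= max_depth)
        return KEEP_TASK;               /* task cannot be split again */
    return SPLIT_AND_FORWARD;           /* feed the children          */
}
```

The greediness of the scheme lies in when this decision is made: a child asks for work as soon as it can accept any, which is why result ordering is not preserved.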
The split process and the join process invoke the user split and join functions, respectively, as required.

TrEK can support both the processor farm and the divide-and-conquer virtual machine. The processor farm can be viewed as a degenerate case of divide and conquer, where each task is only "split" into one subtask, the task itself; thus task splitting and result joining are unnecessary. A small change to the configuration file allows TrEK to be configured for either PFVM or DCVM. For a detailed description of TrEK, readers are referred to [Sre93].

5.3.2 System Structure of Tmon

Tmon [WJC93] is a parallel monitoring system implemented on the transputer-based multicomputer for the purpose of performance analysis and tuning. It uses a global interrupt mechanism for clock synchronization, to get relatively accurate timing and global state information. To minimize the interference with application communication and the performance degradation caused by monitoring, an adaptive reporting mechanism was developed to unload the trace data to the host when the communication and computation load of the local processor is detected to be under a predefined limit.

There are three major components in the system: data generation and collection, global control, and data analysis and display. One transputer in the network is distinguished as the master node for global control. All the nodes being monitored are treated as slave nodes. The application runs on these slave nodes, where the operations of data generation and collection are performed.

The master node is capable of simultaneously interrupting all the slave nodes in the system to perform clock synchronization and to generate sampling events regarding node usage. There are two processes on the master node: an interface process and a controller process. The interface process accepts monitoring commands from the user to start or stop a monitoring session.
The controller periodically generates the global interrupt signal to synchronize the activities of the slaves.

On each slave node, there are several processes that perform the data generation and collection of events. Event probes are inserted into the application to generate trace data. As they are generated, a meter process collects these event traces and puts them into the trace buffers. A buffer manager periodically checks the status of the buffers and processors, and flushes the buffers when they are full. There are two trace data buffers, organized as a double-buffering system, allowing the meter process to fill one buffer while the other is being flushed by the buffer manager. The trace data is sent to the host using the global message-passing services provided by the underlying operating system, Trollius. A backend process accepts the global interrupts for synchronization, performs clock synchronization and generates sampling events of node usage.

A collector process on the host runs as a daemon. It collects trace data from all of the slave nodes and sends the data to the data display, which can display the performance results graphically in real time. It also dumps the trace data to trace files for further data analysis.

There are basically three types of events monitored in Tmon. The first type are the node usage events, which report the status of the communication links and the CPU. The second type are the standard events of message passing and process creation and destruction, including message send, message receive, the calling of a message receive, the initiation of a process and the destruction of a process. The third type are the user-defined events. A simple command-line user interface is provided to enable or disable monitoring. Two functions are also provided in the library so that monitoring can be enabled or disabled from the user program.

5.3.3 Modification to Tmon

Performance Analysis vs. Debugging

There are some similarities and differences between monitoring for performance analysis and monitoring for debugging. Generally, the objective of monitoring is to understand the execution behaviour of the program. Both types of monitoring need to dynamically extract information about the execution of the program, and both attempt to minimize the interference with the program being monitored and the overhead of monitoring, which includes the following:

• CPU time for event generation in the program,

• CPU time for running the monitoring software, that is, for collecting the event trace,

• communication cost for transferring the trace data,

• memory for storing the trace data.

However, whereas performance analysis is concerned with the efficiency of the program, debugging is concerned with its correctness. As a result they require different information, and the issues they both address have different priorities. In debugging, the probe effect is more critical than the overhead introduced. As long as the overhead is within an acceptable range, the first priority is to minimize the probe effect. That is, the deviation of execution behaviour caused by the monitoring, in terms of correctness, should be minimized. Also, performance monitoring assumes the algorithmic correctness of the system, while monitoring for debugging is concerned with the anomalous termination of the program being monitored.

Modification

As seen in the last section, Tmon is designed for performance analysis. To modify it for debugging, we have made a number of changes.

• The global interrupt for clock synchronization has been disabled due to the unavailability of the hardware hooks. Thus there is no master node which enforces global control. Instead, the control is now distributed on the slave nodes. The monitoring on a slave is turned on and off by a local interface process, which accepts the control commands from the host.
As the control commands are sent via the message-passing mechanism provided by Trollius, the slave nodes are not stopped simultaneously. If the monitoring is turned off on a slave node, the execution of the application will be suspended when it tries to generate the next monitoring event. The execution of the application is resumed as soon as the monitoring is turned back on. In this way, turning the monitoring on and off indirectly controls the execution of the program.

Hardware clock synchronization is not used. The clocks are reset when the monitoring is turned on; after that, the clock on each processor progresses at its own speed. Thus clock drift is inevitable. Fortunately, we do not depend on the physical time stamps in our debugging method. As long as the events that occur on the same node keep a linear order, clock drift is not important.

• The unloading of trace data is controlled by a sequence of commands that is specified statically. Each command is in the format

<Time> <LowerBound> <UpperBound>

The command specifies that the trace data must be unloaded immediately to the host whenever there are more than <UpperBound> events in the buffer. If the number of events in the buffer is less than <UpperBound> but greater than <LowerBound>, and the CPU and communication links are not busy, the trace is unloaded. These two bounds are in effect until the clock reaches time <Time>, at which point it steps to the next command. Of course, these bounds are overridden by the buffer size if it is smaller.

These modifications to the monitor have kept the adaptive reporting of Tmon, while providing the facility to control the unloading of trace data in order to manage the loss. If this facility had not been implemented, up to 256 events could be lost when a node with two full buffers crashes.
This loss of information would have made the trace analysis far less accurate.

• A number of functions have been added to the trace collector on the host. To begin with, all the events are recorded into a single file by the trace collector, instead of being filtered into two different files. Afterwards, a program filters the data into the different files needed for performance analysis, visualization and debugging. As in the original Tmon, node usage events are recorded into one file and the standard events of Tmon into another. The format of those events is unchanged, and all the related tools for analyzing that data can be applied without any change. The user-defined events (probes) are written into a separate file. Also, the configuration file in Tmon is no longer generated by the slave nodes. Instead, it is generated by hostmon, which is invoked with the nodes explicitly specified on the command line. In addition, a number of statistics are shown when the monitoring session ends, including the total number of events received and the time stamps of the last event. This information tells the programmer how far the program executed before failure. The time stamps can also be used to modify the shipping control sequences for the next monitoring session.

• A simple mechanism for termination is provided. If the program executes successfully, the monitoring session ends when the process "hostmon" receives an "end" signal from each node being monitored. In the case of a node crash, the monitoring session can be aborted by sending a "quit" command from the monitoring controller on the command line. Then, depending on the extent of the crash, all remaining monitored data is flushed to the host.

5.4 Trace Analysis

5.4.1 Algorithm for Logical Time Stamping

In event tracing, events on the same processor are linearly ordered. However, events occurring on different processors are not. There does, however, exist a partial order between them.
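This partial order is exactly what vector logical clocks capture. As a generic illustration of the scheme (not the thesis code), the two update rules can be written as follows: every event increments the initiating processor's own clock component, and a receive additionally folds in the component-wise maximum of the stamp carried by the matching send.

```c
#include <assert.h>

#define N 4                     /* number of processors (example value) */

/* Rule 1: any event on processor p advances p's own component. */
void vc_tick(long vc[N], int p)
{
    vc[p] += 1;
}

/* Rule 2: a receive on processor p advances p's component and then
   takes the component-wise maximum with the sender's stamp, so the
   receive is ordered after the matching send. */
void vc_receive(long vc[N], int p, const long sender_vc[N])
{
    vc[p] += 1;
    for (int i = 0; i < N; i++)
        if (sender_vc[i] > vc[i])
            vc[i] = sender_vc[i];
}
```

The post-mortem algorithm described next applies these same rules while replaying the recorded message-passing events.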
We implemented the vector logical clocks of Fidge [Fid91] to keep a partial-order relation between events occurring on different processors.

Based on the dynamic algorithm of Fidge [Fid91], the following algorithm generates vector logical time stamps for the events after all the events have been recorded. It is a centralized, post-mortem algorithm, even though it can also work dynamically. The algorithm assumes asynchronous communication where the send is buffered and non-blocking and the receive is non-buffered and blocking (it also works with synchronous communication). The algorithm is based on a replay of the message-passing events. On each processor, there is a queue which stores future events in order and a list which contains the send events to be matched. The first event in the queue is the event that is pending. The list acts as a buffer for the incoming messages.

Input. The number of processors N and a file that contains all the abstract events, linearly ordered for the same sid. Each event has a type, which is either a send or a recv, and an initiating processor identity.

Output. A list of logically time-stamped abstract events.

Method. Suppose that the processor IDs are numbered from 0 to N - 1. Let S be a stack of processor IDs, Cltime be an array of current logical time stamps, one for each processor, Q be an array of queues for pending events to be time stamped, and L be an array of lists for send events which have been time stamped but remain to be matched with their corresponding recv events. The algorithm is shown below.

    /* initialization */
    for (i = 0; i < N; i++) {
        init_queue(Q[i]);
        init_list(L[i]);
    }
    init_stack(S);

    while (read_event(E)) {
        insert_queue(Q[E.sid], E);
        push_stack(S, E.sid);
        while (pop_stack(S, P)) {
            if (find_queue(Q[P], E1)) {
                if (E1 is a send) {
                    Cltime[P][P] += D;
                    E1.ltime = Cltime[P];
                    insert_list(L[E1.did], E1);
                    push_stack(S, E1.did);
                    push_stack(S, P);
                    output_event(E1);
                    delete_queue(Q[P]);
                } else if (match_search_list(L[P], E1, E2)) {
                    Cltime[P][P] += D;
                    Cltime[P] = maximum(E2.ltime, Cltime[P]);
                    E1.ltime = Cltime[P];
                    push_stack(S, P);
                    output_event(E1);
                    delete_queue(Q[P]);
                    delete_list(L[P], E2);
                }
            }
        }
    }

    /* output the remaining send events without matching receives */
    for (i = 0; i < N; i++)
        output(events in L[i]);

5.4.2 System Design

The trace analyzer, deTrEK, takes the event trace file produced by the monitor as its input and outputs an error list whenever one or more errors have occurred in the message-passing behaviour. As shown in Figure 5.11, the design of deTrEK consists of eight modules, arranged approximately in a hierarchy of three layers. The top layer, Main, is the main control module. The next layer consists of four modules that implement the major algorithms. The Pre-Processing module performs the conversion of task identities and the logical time stamping. The Event Graph Construction module constructs the event graphs for each task. The Event Graph Validation module validates the graphs before passing them to the Error Identification and Localization module, which identifies and localizes the errors exhibited in the event graphs. The three modules on the bottom, Queue, List, and Input and Output, provide the tool set for the other modules. Of them, Queue and List are abstract data types, an extended queue and a list; the Input and Output module performs the I/O functions. In our implementation, we did not generate the task computation graphs according to the specification implied by the fault models; generating that large number of task computation graphs would be inefficient. Instead, the event graph is directly checked against the properties of the task computation graphs.

The program begins with the pre-processing of the trace events. This includes sorting the events so that the events initiated from the same node appear in the order in which they occurred, which is required to ensure the correctness of the logical time stamping. It also converts the task IDs of the trace events into a format that is suitable for trace analysis. The events are then logically time stamped.
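The ordering requirement of this pre-processing step can be sketched with a comparison function. This is illustrative only: deTrEK's actual event records carry more fields than the two shown here.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int  sid;                   /* initiating node     */
    long ptime;                 /* physical time stamp */
} raw_event;

/* Order events by initiating node and then by physical time stamp, so
   that each node's events appear in occurrence order before logical
   time stamping. */
static int cmp_raw_event(const void *pa, const void *pb)
{
    const raw_event *a = pa, *b = pb;
    if (a->sid != b->sid)
        return a->sid < b->sid ? -1 : 1;
    if (a->ptime != b->ptime)
        return a->ptime < b->ptime ? -1 : 1;
    return 0;
}

void sort_trace(raw_event *ev, size_t n)
{
    qsort(ev, n, sizeof *ev, cmp_raw_event);
}
```

One design caveat: qsort gives no stability guarantee, so if two events on the same node could ever share a physical time stamp, a sequence-number tiebreak (or a stable sort) would be needed to preserve their recorded order.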
It includes sortingthe events so that the events initiated from the same node appear in the order in whichthey occurred. This is required to ensure the correctness of the logical time stamping.It also converts the task IDs of the trace events into a format that is suitable for traceanalysis. The events are then logically time stamped.After that, the message passing history is augmented by adding the two pseudo eventsfor each task as described in Chapter 4. The event graphs are then constructed. EachMainQueue Input and Output ListPre-Processingz Event GraphConstructionNEvent GraphValidationError Identificationand LocalizationChapter 5. Implementation^ 63event graph is represented by a structure which contains two queues; one for vertices andanother for arcs. Each vertex is represented by a structure which contains a pointer tothe associated event. Each arc is represented by a structure which contains pointers tothe two associated events, the source and destination vertex. The history is then scannedto build the vertex queues of event graphs. Task generation errors are detected at thistime. Finally, the arc queue for each event graph is generated.The validation of an event graph checks the topology of the graph according to thespecification of the application. This operation assures correctness of the tracing mech-anism and the application specifications. Without validation the correctness of erroridentification and localization can not be guaranteed.In the final step, the frontier events are extracted from each of the event graphs.These events form a list, and for each frontier event on the list, an error is identified andlocalized based on the position of the event in its event graph and its time relationshipwith the other frontier events.5.4.3 User InterfaceA simple command line interface to the trace analyzer, called deTrEK, is provided for theprogrammer. 
A number of parameters of the application, including the type of VM, the number of processors, the number of tasks and the topology, are specified on the command line. A man page for deTrEK is shown below.

• NAME

— deTrEK - A tool to analyze a trace file for the purpose of debugging.

• SYNTAX

— detrek [ -hvco ] [ -p <FilePref> ]

• DESCRIPTION

— deTrEK is a tool for analyzing the message-passing history of applications written for TrEK. It takes as input a history of message-passing events and the specifications of the application, and outputs a list of possible errors and their locations by analyzing the message-passing history for conformance to the specifications.

• OPTIONS

— -h Prints out a help message showing the options and their usage.

— -v Turns on the verbose mode; the default mode is off.

— -c Turns off the conversion of the task identity.

— -p <FilePref> Specifies the prefix of the two file names. If this option is not specified, the default prefix "data" is used.

The first file is the specifications of the application, "<FilePref>.spf". It is a common text file that includes the following 5-integer tuple representing the specifications of the application:

<NumProc> - the number of processors in the target architecture.

<Topology> - the type of topology: 1 represents a chain, 2 a binary tree, and 3 a ternary tree. The current implementation of this tool only supports these 3 topologies.

<NumTasks> - the number of tasks generated by the data generator.

<TaskDegree> - the degree of the task graph.

<TaskDepth> - the depth of the task graph.

For example, the specification of a DCVM-based application that runs on a binary tree of 16 nodes with 100 tasks, where each split generates 4 subtasks and the maximum number of splits is 5, will look like:

16 2 100 4 5

The second file is the abstract event history, "<FilePref>.tvm".
Each line of this file is an abstract event of the following format:
    <Ptime> <Sid> <Type> <Did> <MainTid> <SubTid>
    where <Ptime> is its physical time stamp; <Type> is its type, either send (send) or receive (recv); <Sid> is the initiating processor; <Did> is the destination processor for a send event or the source processor for a receive event; <MainTid> is an integer, the major task associated with this event; and <SubTid> is a bit sequence in hexadecimal which further identifies the sub-task numbering in the DCVM.
  - -o Outputs the error list into a file "<FilePref>.bug", instead of the default standard output. "<FilePref>.bug" is also a text file. Each line of the file reports an identified error in the following format:
    <Error> <TaskNo> <ProcNo> <Process>
    where <Error> is the type of fault, <TaskNo> is the task associated with that error, <ProcNo> is the processor where the error occurred, and <Process> defines the process in which the error occurred: one of the generating, splitting, joining, computing and post-processing processes.

Chapter 6

Experiments and Evaluation

6.1 Experiments with Simulation

To test and demonstrate the monitor and trace analyzer, we developed a general simulated application for a task-oriented virtual machine. The application takes as input a task description file and simulates the computation of generating, splitting, joining, computing and post-processing of the tasks. This is useful for testing since the load and error conditions can be easily controlled and varied. Actual debugging of real programs is reported later in this chapter.

A task description file is the definition of the computation of a task.
It includes the task number and the definitions of five program blocks in the following format:

    <taskno> <gen-seg> <comp-seg> <post-seg> <split-seg> <join-seg>

where <taskno> is the number of the task, <gen-seg> is the definition of the simulated computation generating the task, and <comp-seg>, <post-seg>, <split-seg> and <join-seg> are the definitions of the simulated computation doing the computing, post-processing, splitting and joining of the task, respectively.

Each segment is a tuple of 3 integers:

    <time> <space> <bug-id>

It simulates a segment of the program which occupies <space> bytes of memory and takes <time> microseconds to execute. In the simulation, a memory block of that size is allocated and an empty loop is executed an appropriate number of times.

For convenience, blank lines and lines starting with "#" are treated as comments. The numbering of tasks starts at 0. The total number of tasks in each execution depends on the command line parameters used when the executable code is loaded onto the nodes. By default, wherever their definitions are missing, all the tasks are defined to be the same as the most recently defined task. If the first task defined in the file is not task 0, all the tasks with smaller numbers are assumed to be built-in default tasks.

With this simulated application we can experiment with bugs in different processes and with different task granularities. In the following, we show a typical session of a test run.

1. First, the specification (dc.spf) of the application is constructed. Shown below is the specification of an application. It is intended to run on a binary tree with 8 nodes. There are 100 tasks in this application. Each task can be split into 2 sub-tasks, to a maximum depth of 5.

    8 2 100 2 5

2. Then a task description file (dc.dat) is constructed. It defines the granularity of each task and the bugs, if any, with their locations. The following describes three tasks: 0, 49 and 50.
In the computation of task 49 there is a fatal bug (Bug-id 4) which will cause the node to crash. All the other tasks are, by default, the same as the previous one.

    0  20 10 0  140 1000 0  10 10 0  100 10 0  100 10 0
    49 20 10 0  140 1000 4  10 10 0  100 10 0  100 10 0
    50 20 10 0  140 1000 0  10 10 0  100 10 0  100 10 0

3. The transputers are booted and loaded, and the monitor is started. The "state" command provided by Trollius is used to check whether the system processes on the nodes are still alive. This command reports the status of the Trollius processes and prints out the processes and their execution state. If this command hangs while trying to report the process status on some node, it means that the node is not responding and the system has failed. In our experiment, after a while, this command reported that nodes n3 to n7 had failed. The session was ended by forcing the monitor to quit.

    xanadu<1>spread -Sc bnail.btree_32.02
    Ohio Trollius 2.2 - The Ohio State University
    soldering...
    xanadu<2>loadgo n0-7 nodemon
    xanadu<3>hostmon n0-7 -e data &
    xanadu<4>loadgo n4-7 slave
    xanadu<5>loadgo n1-3 slave -- last
    xanadu<6>loadgo n0 master -- 8 100 dc.dat
    xanadu<7>monctrl -b n0-7
    ...
    xanadu<8>state n0-7
    NODE INDEX PID     KPRI KSTATE          PROGRAM
    n0   [18]  297985  0    BR (-91009, 0)  nodemon
    n0   [19]  366821  0    A  (-366821, 0) master
    n1   [18]  305141  0    BR (-91009, 0)  nodemon
    n1   [19]  361913  0    A  (-408885, 0) slave
    n2   [18]  305141  0    BR (-91009, 0)  nodemon
    n2   [19]  361913  0    A  (-408885, 0) slave
    n3   [18]  297985  0    BR (-91009, 0)  nodemon
    n3   [19]  361937  0    A  (-408909, 0) slave
    ^Z
    Suspended
    xanadu<9>kill %
    xanadu<10>monctrl -q

4. After forcing "hostmon" to quit, two files were left in the current directory, "data.config" and "data.all". Using our filtering tool "tvmfilter", we obtained the file "data.tvm", in which the events initiated on the same node are ordered and ready for input to
the trace analyzer, deTrEK, which generated the following.

    xanadu<11>tvmfilter -p data
    xanadu<12>detrek -cp data
                 Failure
    TASK SUBTASK NODE PROCESS
    49   0010    6    comp_fn()
    49   0110    7    comp_fn()
    49   0100    5    comp_fn()
    49   0000    4    comp_fn()
    50   0000    1    split_fn()/comp_fn()
    51   0000    2    split_fn()/comp_fn()
    51   0010    3    split_fn()/comp_fn()
    52   0000    1    Distributing
    53   0000    2    Distributing
    53   0010    3    Distributing

The trace analysis shows that task 49 was split into 4 sub-tasks which were executing on nodes 4, 5, 6 and 7, where they failed to complete. Task 50 failed in either the splitting process or the computing process on node 1, and task 51 was split into two sub-tasks which were either being split or computed when the failure occurred. Task 52 and the remaining tasks, which had been either split or not, were being distributed. This information allows us to focus on debugging the computational process of task 49. This process would be executed in a sequential environment when the second phase of the debugger is implemented. The debugging tool will use the information in the event graph to generate and split the task the correct number of times before feeding the sub-tasks to the process.

The output of the trace analyzer for another experiment of 1000 tasks running on a binary tree with 16 nodes, in which a fatal error is introduced in task 500, reveals the following information.
                 Failure
    TASK SUBTASK NODE PROCESS
    475  0000    1    split_fn()/comp_fn()
    476  0010    1    join_fn()
    476  0000    1    join_fn()
    477  0000    1    Distributing
    478  1110    7    join_fn()
    478  0110    7    join_fn()
    478  0010    3    Collecting
    478  0000    1    Collecting
    479  0000    1    Distributing
    480  0110    7    Distributing
    480  0010    12   Distributing
    480  1010    13   Distributing
    480  0100    10   Distributing
    480  1100    11   Distributing
    480  0000    8    Distributing
    480  1000    9    Distributing
    481  0000    1    Distributing
    482  0010    3    Distributing
    482  0100    5    Distributing
    482  0000    4    Distributing
    483  0000    1    Distributing
    484  0000    1    Distributing
    485  0000    1    Distributing
    486  0000    1    Distributing
    487  0000    0    data_generator()
    488  0000    0    data_generator()

Obviously, this did not give us any useful information for localizing the bug we had introduced. This was due to an improper unloading command sequence used during monitoring. When the system failed, "hostmon" did not receive the updated events from the nodes. Based on the statistics provided by "hostmon", we redefined the unloading command sequence so that the trace data on the nodes was flushed more often near the point of failure. The application was rerun and the following output from deTrEK was obtained.

                 Failure
    TASK SUBTASK NODE PROCESS
    500  0010    12   comp_fn()
    500  1010    13   comp_fn()
    500  0110    14   comp_fn()
    500  1110    15   comp_fn()
    500  0100    10   comp_fn()
    500  1100    11   comp_fn()
    500  1000    9    comp_fn()
    500  0000    8    comp_fn()
    501  0000    1    split_fn()/comp_fn()
    502  0010    12   Distributing
    502  1010    13   Distributing
    502  0110    14   Distributing

The fault was found to be located in the computing process of the subtasks of task 500, on nodes 8 to 15, exactly where we had introduced the bug.

Several other experiments introducing fatal errors revealed similar results. In most cases, deTrEK was able to point out the right process where the bug was introduced. In a few cases, where the runtime system was communication intensive, the information provided by deTrEK was less accurate because of the amount of trace data that was not received.
However, as most applications in the parallel computing environment are computationally intensive, this should not be a serious problem. Also, the unloading control sequence can be adjusted so that the overhead introduced by the monitor is kept within an acceptable range and the loss of trace data is minimized.

Experiments also show that there is no significant difference when several fatal errors are simultaneously introduced. The first fatal error that occurred will be detected by deTrEK, while the other errors will not appear. This conforms to the usual scenario of debugging, where the programmer examines the errors one by one in their order of occurrence.

Note that our simulated application did not allow introduction of bugs into the splitting process of a subtask. If a bug is ever introduced in the splitting process, all the splitting processes will crash while splitting this task or its sub-tasks. In fact, only the first splitting process will be executed, as the system will crash after that. This is also the case for the computing and the joining processes. However, it is expected that deTrEK will still be able to pinpoint the right process for a particular sub-task if it crashes, as the other successfully completed sub-tasks will be noted. For example, in the case of an error in the computation of a sub-task, all other sub-tasks will complete and the joining process leading to that sub-task will report failure. This can be tested by introducing a subtask number in the task description file and, accordingly, inside the process. The subtask number is compared with the actual subtask number before simulating the error.

6.2 Experience with Debugging an Actual PFVM Application

In this section, we describe our experience in debugging an actual PFVM application using our monitoring and trace analysis tools. The application was a course project done by Jeff Beis for CPSC 536.

The application is an image processing program.
The input to the program is an image with a number of segments. Each segment represents some part of an object. The program groups the segments belonging to one object together. It begins by loading the entire image on all the slaves. The master then sends out the segment number to be grouped, so in effect each segment number defines a task. When a slave receives a task, it attempts to group that segment with all the other segments in the image and returns the result.

When the program was run for 5 tasks using a chain of 6 nodes, as we did for the simulated problem, we obtained the following:

    xanadu<2>spread -vsc bnail.chain_16.3
    xanadu<5>loadgo n0 master -- 6 5
    Execution time = 7424 microseconds
    ***** Final Report *****
    Configuration : Linear Chain
    Total Nodes = 6
    Granularity = 1
    Total Buffers = 1
    Task Distribution:
    Node 0   0 Tasks
    Node 1   1 Tasks
    Node 2   1 Tasks
    Node 3   0 Tasks
    Node 4   0 Tasks
    Node 5   0 Tasks

As we can see from the report printed by TrEK, two tasks executed, on nodes 1 and 2. It appeared as if the program had finished successfully, as TrEK terminated successfully and the Trollius "state" command showed that all the nodes were alive.

    xanadu<6>state n0-5
    NODE INDEX PID     KPRI KSTATE          PROGRAM
    n0   [18]  263717  0    A (-263717, 0)  master
    n1   [18]  263521  0    A (-263521, 0)  slave
    n2   [18]  303177  0    A (-303177, 0)  slave
    n3   [18]  260265  0    A (-260265, 0)  slave
    n4   [18]  263521  0    A (-263521, 0)  slave
    n5   [18]  263521  0    A (-263521, 0)  slave

However, results were not being printed in a file. The error could have been that the data_generator() generated only two tasks. It could also have been that some compute_fn() on the slaves failed to return the results. It could also have been that the result_receiver() failed to receive the results. That is to say, the fault could have occurred in any of these three processes.

We recompiled the program with the monitoring option. No changes were made to the user program; the only change was in the compiling options in the Makefile.
We rebooted the transputers, started Tmon-de, and re-executed the program.

    xanadu<5>monctrl -bn 6
    Execution time = 620544 microseconds
    ***** Final Report *****
    Configuration : Linear Chain
    Total Nodes = 6
    Granularity = 1
    Total Buffers = 1
    Task Distribution:
    Node 0   0 Tasks
    Node 1   1 Tasks
    Node 2   1 Tasks
    Node 3   0 Tasks
    Node 4   0 Tasks
    Node 5   0 Tasks

As shown above, the final report is the same as the previous one except that the execution time has increased. At this point, we used the monitoring control command "monctrl" to end the monitoring session. The trace was collected in the file "group.all", and the configuration was recorded in the file "group.config". Using "tvmfilter" to filter the trace, we got "group.tvm" in a format ready for input to deTrEK. We then created a file specifying the application, "group.spf", and used deTrEK to analyze the trace file. We obtained the following log.

    xanadu<8>deTrEK -cp group
                 Failure
    TASK SUBTASK NODE PROCESS
    2    0       0    Collecting
    3    0       0    Collecting
    4    0       0    Collecting

The analysis showed that tasks 0 and 1 had been successfully generated and processed, and their results collected. In addition, the analysis indicated that tasks 2, 3 and 4 had been successfully generated and computed at the message passing level. Although we were not sure that the results of these tasks were correct, we now knew that data_generator() and compute_fn() were correct. Thus we focused our attention on the result_receiver() process. On examination, we found that it was basically a loop like:

    result_receiver()
    {
        for (i=0; i<Nsegs; i++) {
            msg = (user_result *)get_result();
            /* results printing and counting */
        }
    }

Thus we hypothesized that the loop was never executed, which led the system process for receiving results to exit before all the messages had arrived. We added a print statement in the above function as follows:
    result_receiver()
    {
        printf("Nsegs = %d\n", Nsegs);
        for (i=0; i<Nsegs; i++) {
            msg = (user_result *)get_result();
            /* results printing and counting */
        }
    }

The program was then recompiled and rerun. The printout showed that the variable Nsegs was 0. This verified our hypothesis that the loop was never executed, since the value should have been 5, the number of tasks. The fault was that this variable was not being initialized properly. A further check of the program showed that Nsegs was initialized in data_generator() as:

    data_generator()
    {
        Nsegs = TotalTasks;
        ...
    }

where TotalTasks was a global variable shared by all the processes. According to the specification of the PFVM, all the global variables should be initialized in init_master(), because the data_generator() and the result_receiver() are independent processes, each with its own copy of the shared data, which is initialized by init_master(). Thus initializing Nsegs in data_generator() has no effect in result_receiver(). In other words, Nsegs in result_receiver() was not being initialized. We moved the above statement from data_generator() to init_master(), recompiled, and ran the program once again.

    Total Execution time = 12572736 microseconds
    ***** Final Report *****
    Configuration : Linear Chain
    Total Nodes = 6
    Granularity = 1
    Total Buffers = 1
    Task Distribution:
    Node 0   0 Tasks
    Node 1   3 Tasks
    Node 2   2 Tasks
    Node 3   0 Tasks
    Node 4   0 Tasks
    Node 5   0 Tasks

It showed that all the tasks had been generated and computed, and their results received. A further check of the contents of the results showed that they were correct. The error had been located and corrected.

An uninitialized variable was one of the errors described in Chapter 3. In terms of the specification of parallelism, the error manifested itself in the untimely death of the result receiving process. In terms of message passing, the error appeared as an unanticipated message from the receiving process.
This example supports our approach of examining the high level behaviour (message passing) to help localize low level faults.

6.3 Evaluation

As the above experiments and actual experience show, deTrEK is able to locate the fault, its task number and its process. Even in the case where the system does not crash, as in the above example, the trace analysis was helpful in narrowing down the suspicious part of the program. On a Sun4 workstation, deTrEK took approximately three minutes to analyze the trace produced by the example shown in the first section (a DCVM application with 1000 tasks running on a binary tree with 16 nodes). It took only a few seconds to analyze the other traces shown above. Experience shows that the first phase of our two-phase debugging approach is practical and effective. It can be done automatically by the monitoring tool and the trace analyzer. Thus it strongly supports the hypothesis that effective parallel debuggers can be built using the concept of VM-based programming. More generally, restricted models of parallel computation simplify not only programming, but also debugging.

Referring back to the general framework presented in Chapter 2, our debugging approach supports not only hypothesis generation, but also hypothesis modification and selection, as our trace analyzer detects errors and localizes them at the same time. The error list gives the programmer a clear idea of which task and process to debug in the second phase, where hypothesis verification is to be done. By developing the fault models at the message passing level, the programmer is not required to manage the large number of processes in the system. Even though the second phase has not yet been implemented and evaluated, our initial experience with deTrEK has shown that the complexity of debugging is significantly reduced.

Our two-phase approach to debugging integrates the techniques of assertion checking and replay.
The assertion checking is done automatically. It does not require the programmer to learn yet another language to describe the behaviour. Its results can also be used to limit the size of the data to be replayed.

Our trace analysis is limited, however, to detecting and localizing fatal errors that result in deviation from the expected message passing behaviour.

Chapter 7

Conclusions and Future Work

7.1 Summary

This thesis has studied the debugging of parallel programs in a VM-based programming environment. Based on an analysis of the message passing pattern in the system, a hierarchical, two-phase approach to debugging is proposed. Fault models for two virtual machines, PFVM and DCVM, were developed to automate the first phase. The design and implementation of an event tracer and a trace analyzer are presented. The tools are then demonstrated and evaluated using a simulated application for the DCVM and an actual program for the PFVM.

We started with an analysis of the message passing behaviours of the task-oriented virtual machines, PFVM and DCVM. The possible errors in these systems were classified. These systems have a hierarchical structure. At the higher level, a conceptual model of the VM-based program is presented to the user. At the intermediate level, a set of configuration parameters is given to the user to choose an intermediate virtual machine on which the system may execute. The execution model of the application corresponds to the intermediate virtual machine level. Based on the execution model, fault models were developed for both the PFVM and the DCVM. The development of the fault models led to the design and implementation of a trace analyzer, deTrEK. The project also required the instrumentation of TrEK, and the revision and enhancement of Tmon. A simulated application was set up to validate the accuracy of the monitoring results and the trace analysis.
We also demonstrated how our debugger could be used to debug an actual application using the processor farm virtual machine.

In Chapter 2, we proposed several criteria for evaluating a debugging approach, namely adequacy, productiveness, feasibility and efficiency. In terms of adequacy, we have taken the strategy of using high level abstraction and divide-and-conquer to overcome the complexity of parallel programs. By taking advantage of the structured message passing behaviour, our hierarchical approach simplifies the debugging of a parallel program to that of a few specific sequential code segments. In addition, the trace analysis helps the programmer in both hypothesis generation and hypothesis selection. By collecting the message passing events during execution, we were able to reproduce the errors in a simulated environment. For productiveness, our trace analyzer is limited to detecting and localizing fatal errors which cause deviation from the expected message passing behaviour. Through the design and implementation of a monitor and a trace analyzer, we have shown that our approach is feasible. Efficiency is achieved by compile-time instrumentation of high level events and by post-mortem analysis.

Although the monitoring tool was designed for a transputer-based system and implemented under the Trollius Operating System, the instrumentation technique and the trace unloading strategy are applicable to a wide range of multicomputers. The hierarchical approach proposed in Chapter 4 can easily be adapted to any message passing system. Of course, the related fault models have to be developed accordingly. Our trace analyzer is an independent tool which can be used to analyze the event histories of the PFVM and DCVM, regardless of how they were implemented.

7.2 Future Work

7.2.1 Enhancement and Integration

Implementation of the Second Phase

The implementation of the second phase of our debugging approach is obviously part of the work to be done.
To do this, besides the message passing events between nodes, we require the following data:

• message passing events between processes on the same node,
• nondeterministic events internal to each process, for example, the value of a random number generated, a reading of the real time clock, etc.

The processes must be scheduled for re-execution in an order consistent with the message passing events as recorded. The replay of each process has to behave deterministically by reading the values recorded during the tracing stage, as shown in [MAP91].

Integration with the Environment

As a tool, the interface of the debugger can be significantly improved. Integration of our tools with PARSEC [Fel92] would greatly improve their ease of use. Our experience is that a general visualization tool is usually not very helpful. However, it might be helpful in both debugging and performance tuning by showing the event graph and its associated coalesced event graph.

The integration of debugging with performance modeling might reduce the number of test runs needed for monitoring. For example, the shipping control sequence could be defined based on the output of performance prediction. Furthermore, integrating the design of the virtual machines with the fault models will improve the modeling, increase the efficiency of monitoring, and simplify the debugging monitor and the second phase implementation. It could also lead to a fault-tolerant virtual machine, in the sense that the machine could continue the computation of other tasks when some tasks fail and cause a node crash. This, of course, requires more flexibility in the management of resources. However, the benefit is that only a part of the computation needs to be re-executed and debugged, without recomputing the other tasks.

General Virtual Machine Analyzer

To further explore the applicability of our approach to parallel debugging, more virtual machines should be investigated.
A language could be developed to describe the fault models of the virtual machines. Accordingly, a general assertion checker for that language could be designed and developed. After that, the development of a trace analyzer for a new virtual machine would be equivalent to the development of its fault models and their descriptions in that language.

7.2.2 Logical Time Stamping

Memory space and processor time are the two basic resources needed by a program. Of the two, time presents a more fundamental facet of distributed systems, namely the causality relation between events [Ray92]. The logical nature of time is of primary importance when designing or analyzing distributed systems. In 1978, Lamport [Lam78] first proposed a mechanism of logical time stamping for events in a distributed system.

Since then, a number of mechanisms for logical time stamping have been proposed. Raynal [Ray92] classified them as linear time, vector time and matrix time. Linear time is the one proposed by Lamport, where the time domain is the set of positive integers. The time stamp preserves the causality relation between two events: if an event a happens before an event b, then the logical time stamp of a is smaller than that of b. However, the converse is not necessarily true.

Vector logical time was developed independently by Fidge [Fid89, Fid91], Mattern [Mat88] and Schmuck [Sch88]. In this mechanism, time is represented by an n-dimensional vector of integers, where n is the number of sites involved in the communication in the system. It establishes a causality relation that corresponds to the values of the time stamps.

Matrix time goes a step further and uses an n x n matrix to represent the logical time. In addition to the properties of the vector clocks, each site knows that every other site knows its progress up to a certain local time.

From linear time to vector time to matrix time, the size of a time stamp increases from 1 to n to n^2.
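To make the linear scheme concrete, the update rules Lamport proposed can be sketched in C: each site keeps an integer counter, every local event increments it, a send carries the counter's new value, and a receive sets the counter past the maximum of the local value and the received stamp. This is a generic illustration of the published algorithm, not code from TrEK or deTrEK:

```c
#include <assert.h>

/* One Lamport clock per site: a single integer counter. */
typedef struct { long time; } LClock;

/* Any local event advances the clock by one. */
long lclock_tick(LClock *c) {
    return ++c->time;
}

/* A send event: tick, and attach the returned value to the message. */
long lclock_send(LClock *c) {
    return lclock_tick(c);
}

/* A receive event: take the maximum of the local clock and the
   stamp carried by the message, then advance past it. */
long lclock_recv(LClock *c, long msg_time) {
    if (msg_time > c->time)
        c->time = msg_time;
    return ++c->time;
}
```

Under these rules, if an event a happens before an event b, then a's stamp is smaller than b's; as noted above, the converse need not hold, which is the weakness the vector scheme repairs.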
Whether the time stamping is done dynamically, while a distributed system is executing, or statically, after the trace has been recorded, as we did, the number of events is usually large and the space needed to store the time stamps is significant. In the general case, it was shown by Charron-Bost [CB91] that the size cannot be less than n. Poset (partially ordered set) coordinates can be used as a logical time stamp [Ore62, WTT83], and Summers [Sum92] states that the lower bound on the size of logical time stamps is the dimension of the poset, where a point of the poset is an event and the binary relation is the "happen-before" relation between events.

Thus the question arises as to whether the restricted model of computation in task-oriented virtual machines can reduce the size of the logical time stamps required. If so, what are the new limits?

Formally, the problem can be stated as follows. Given a history of the message passing, we can define a poset, where a point of the poset is a send or receive event in the history and the binary relation between events is the "happen-before" relation. Given the specifications of an application, different runs of the system may produce different posets because of non-determinism. Therefore, the specifications define a family of posets, constrained by the restrictions of the VM's communication pattern. The minimal size of the logical time stamps required for these histories is the maximum dimension of the posets in this family.

It is easy to see that the above defined poset depends on n, the number of processors, and m, the number of tasks. We already know that it will not exceed the width of the poset, in this case n. Therefore, the bound is possibly less than n, and if m > n it should not depend on m. Determining the value of a tighter bound remains a challenging problem.

Bibliography

[AFC91] Keijiro Araki, Zengo Furukawa, and Jingde Cheng. A general framework for debugging.
IEEE Software, pages 14-20, May 1991.

[Bat89] Peter Bates. Debugging heterogeneous distributed systems using event-based models of behavior. Proceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, published in ACM SIGPLAN Notices, 24(1):11-22, January 1989.

[Bur88] G. D. Burns. Trollius operating system definition. Trollius documentation series, Ohio Supercomputer Centre, October 1988.

[CB91] Bernadette Charron-Bost. Concerning the size of logical clocks in distributed systems. Information Processing Letters, 39:11-16, July 1991.

[CLP91] C. Caerts, R. Lauwereins, and J. A. Peperstraete. A powerful high-level debugger for parallel programs. In Proceedings of the First International ACPC Conference on Parallel Computation, pages 54-64, Salzburg, Austria, September-October 1991.

[CMN91] Jong-Deok Choi, Barton P. Miller, and Robert H. B. Netzer. Techniques for debugging parallel programs with flowback analysis. ACM Transactions on Programming Languages and Systems, 13(4):491-530, October 1991.

[Els89] I. J. P. Elshoff. A distributed debugger for Amoeba. In Proceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, published in ACM SIGPLAN Notices, volume 24, pages 1-10, January 1989.

[Fel92] David Feldcamp. A hierarchical software development environment for performance oriented parallel programming. M.Sc. thesis, The University of British Columbia, December 1992.

[FG90] Joan M. Francioni and Michael Gach. Design of a communication modeling tool for debugging parallel programs. Personal communication, April 1990.

[Fid89] C. J. Fidge. Partial orders for parallel debugging. Proceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, published in ACM SIGPLAN Notices, 24(1):183-194, January 1989.

[Fid91] Colin John Fidge. Logical time in distributed computing systems. IEEE Computer, 24(8):28-33, August 1991.

[FLM89] Robert J. Fowler, Thomas J. LeBlanc, and John M. Mellor-Crummey.
An integrated approach to parallel program debugging and performance analysis on large-scale multiprocessors. Proceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, published in ACM SIGPLAN Notices, 24(1):163-173, January 1989.

[GKY89] German S. Goldszmidt, Shmuel Katz, and Shaula Yemini. Interactive blackbox debugging for concurrent languages. Proceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, published in ACM SIGPLAN Notices, 24(1):271-282, January 1989.

[GYK90] German S. Goldszmidt, Shaula Yemini, and Shmuel Katz. High level language debugging for concurrent programs. ACM Transactions on Computer Systems, 8(4):311-336, November 1990.

[HC89] Alfred A. Hough and Janice E. Cuny. Initial experiences with a pattern-oriented parallel debugger. Proceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Distributed Debugging, published in ACM SIGPLAN Notices, 24(1):195-205, January 1989.

[HC90] Alfred A. Hough and Janice E. Cuny. Perspective views: A technique for enhancing parallel program visualization. In Proceedings of the 1990 International Conference on Parallel Processing, pages II.124-II.132, University Park, PA, August 1990.

[HK88] Wenwey Hseush and Gail E. Kaiser. Data path debugging: Data-oriented debugging for a concurrent programming language. In Proceedings of the ACM SIGPLAN and SIGOPS Workshop on Parallel and Distributed Debugging, pages 236-247, Madison, Wisconsin, May 1988.

[HK90] Wenwey Hseush and Gail E. Kaiser. Modelling concurrency in parallel debugging. In Proceedings of the Second ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 11-20, Seattle, Washington, March 1990.

[JMFA91] Jay Alan Jackson, Joan M. Francioni, and Larry Albright. The sounds of parallel programs. In The Sixth Distributed Memory Computing Conference Proceedings, pages 570-577, 1991.

[Lam78] L. Lamport.
Time, clocks, and the ordering of events in a distributed system.Communication of ACM, 21(7), July 1978.[LSS90]^E. Leu, A. Schiper, and A. Sramdini. Execution replay on distributed memoryarchitectures. In Proceedings of the Second IEEE Symposium on Parallel andDistributed Processing 1990, pages 106-12, Dallas, Texas, December 1990.[MA87]^Charles E. McDowell and William F. Appelbe. Minimizing the complexity ofstatic analysis of parallel programs. In Proceedings of the Twentieth AnnualHawaii International Conference on System Sciences, pages 171-176, January1987.[MAP91] Michel Hurfin Michael Adam and Noel Plouzeau. Distributed debugging tech-niques. Foundations of Computing and Decision Sciences, 16(3-4):191-201,1991.[Mat88]^F. Mattern. Virtual time and global states of distributed systems. In RaynalCosnard, Quinton and Robert, editors, Conference of Parallel and DistributedAlgorithm, pages 215-226, North-Holland, 1988.[MC89]^Barton P. Miller and Jong-Deok Choi. A mechanism for efficient debuggingof parallel programs. Proceedings of the ACM SIGPLAN/SIGOPS Workshopon Parallel and Distributed Debugging, published in ACM SIGPLAN Notices,24(1):141-150, January 1989.[McD89] C. E. McDowell. A practical algorithm for static analysis of parallel programs.Journal of Parallel and Distributed Computing, 6(3), June 1989.[MH89]^Charles E. McDowell and David P. Helmbold. Debugging concurrent pro-grams. ACM Computing Surveys, 21(4):593-622, December 1989.[0re62]^0. Ore. In Theory of Graphs, American Mathematics Society ColloquiumPublication 38, Providence, R.I., 1962.[PHK91] M. Krish Ponamgi, Wenwey Hseush, and Gail E. Kaiser. Debugging multi-threaded programs with MPD. IEEE Software, pages 37-43, May 1991.Bibliography^ 87[PL89]^Douglas Z. Pan and Mark A. Linton. Supporting reverse execution of par-allel programs. 
In Proceedings of the ACM SIGPLAN/SIGOPS Workshopon Parallel and Distributed Debugging, published in ACM SIGPLAN Notices,volume 24, pages 124-9, January 1989.[PU89]^Cherri M. Pancake and Sue Utter. Models for visualization in parallel de-buggers. In Proceedings of Supercomputing '89, pages 627-636, Reno NV,November 1989.[Ray92] Michel Raynal. About logical clocks for distributed systems. ACM OperatingSystems Review, 26(0:41-48, January 1992.[Ros91]^David S. Rosenblum. Specifying concurrent systems with tsl. IEEE Software,pages 52-61, May 1991.[Sch88]^F. Schmuck. The use of efficient broadcast in asynchronous distributed sys-tems. Ph.D. Thesis TR88-928, Cornel University, 1988.[Smi85]^Edward T. Smith. A debugger for message-based processes. Software - Prac-tice and Experience, 15(10:1073-1086, November 1985.[Sre93]^H. Sreekantaswamy. Performance prediction modeling and its integration intomulticomputer programming environments. Ph.d. thesis, The University ofBritish Columbia, August 1993.[SS90]^J. Scholten and F. Sauer. On debugging in a parallel system. In IEEE TEN-CON'90: 1990 IEEE Region 10 Conference on Computer and CommunicationSystems, volume 1, pages 264-8, Hong Kong, September 1990.[Sto89]^Janice M. Stone. A graphic representation of concurrenct processes. InProceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Dis-tributed Debugging, published in ACM SIGPLAN Notices, volume 24, pages226-235, January 1989.[Sum92] James Alexander Summers. Precedence-preserving abstraction for distributeddebugging. M.math. thesis, The University of Waterloo, 1992.[Tay84]^R. N. Taylor. Debugging real-time software in a host-target environment.Technical Report 212, University of California at Irvine, 1984.[T080]^R. N. Taylor and L. J. Osterweil. Anomaly detection in concurrent softwareby static data flow analysis. IEEE Transaction of Software Engineering, 3,May 1980.Bibliography^ 88[Tro92a1 Trollius command reference. 
Trollius documentation series, Ohio Supercom-puting Center, The Ohio State University, March 1992.[Tro92b] Trollius library reference in C. Trollius documentation series, Ohio Supercom-puting Center, The Ohio State University, March 1992.[Wit89]^Larry D. Wittie. Debugging distributed C programs by real time replay.Proceedings of the ACM SIGPLAN/SIGOPS Workshop on Parallel and Dis-tributed Debugging, published in ACM SIGPLAN Notices, 24(1):57-67, Jan-uary 1989.[WJC93] A. Wagner, J. Jiang, and S. Chanson. Tmon — a transputer performancemonitor. Concurrency: Practice and Experience, 1993. to appear.[WTT83] Jr. William T. Trotter. Graphs and partially ordered set. In L. W. Beinekeand R. J. Wilson, editors, Selected Topics in Graph Theory 2, pages 237-268,Academic Press, London, August 1983.