Open Collections

UBC Theses and Dissertations

UBC Theses Logo

UBC Theses and Dissertations

A fault-tolerant building block for transputer networks for real-time processing Fei, Yueying 1993

Your browser doesn't seem to have a PDF viewer, please download the PDF to view this item.

Item Metadata


831-ubc_1993_spring_fei_yueying.pdf [ 3.81MB ]
JSON: 831-1.0065183.json
JSON-LD: 831-1.0065183-ld.json
RDF/XML (Pretty): 831-1.0065183-rdf.xml
RDF/JSON: 831-1.0065183-rdf.json
Turtle: 831-1.0065183-turtle.txt
N-Triples: 831-1.0065183-rdf-ntriples.txt
Original Record: 831-1.0065183-source.json
Full Text

Full Text

A FAULT-TOLERANT BUILDING BLOCK FOR TRANSPUTERNETWORKS FOR REAL-TIME PROCESSINGByYueying FeiB.A. Sc., Northeast Heavy Machinery Institute, P.R.China, 1982A THESIS SUBMITTED IN PARTIAL FULFILLMENT OFTHE REQUIREMENTS FOR THE DEGREE OFMASTER OF APPLIED SCIENCEinTHE FACULTY OF GRADUATE STUDIESDEPARTMENT OF ELECTRICAL ENGINEERINGWe accept this thesis as conformingto the required standard THE UNIVERSITY OF BRITISH COLUMBIAJanuary, 1993© Yueying Fei, 1993In presenting this thesis in partial fulfilment of the requirements for an advanceddegree at the University of British Columbia, I agree that the Library shall make itfreely available for reference and study. I further agree that permission for extensivecopying of this thesis for scholarly purposes may be granted by the head of mydepartment or by his or her representatives. It is understood that copying orpublication of this thesis for financial gain shall not be allowed without my writtenpermission.(Signature) Department of Electrical EngineeringThe University of British ColumbiaVancouver, CanadaDateDE-6 (2/88)AbstractSoftware Implementation of Multi-Processor Fault Tolerance for Real-Time processing isaddressed in this thesis with the research focused on:• Fault-Tolerant cells as building blocks that can survive concurrent transient physicalfaults and permanent failures in large parallel processing systems with potential forreal-time processing.• Efficient group communications for redundant data exchanges through multiplecommunication links that connect the group peers.• Transparent fault-tolerance.• On-Line Forward Fault-Repair using the live execution image from the non-faulty peer with a bounded delay.By systematically connecting the redundant processing modules, the architecture of-fers regularity and recursiveness which can be used as building blocks for construction offault-tolerant parallel machines.The communication service protocols take advantage of redundant linkages to ensurereliable and efficient message deliveries among the fault-tolerant abstract transputer peernodes through the concept of activity observation. The multiple redundant linkagesprovide a means for parallel communications. This is essential for redundant informationexchanges in fault-tolerance. The activity observation concept further reduces theeffort for reliable message delivery and simplifies the system design. As a result, messagesare dynamically and optimally routed when link failure or processor failure occurs.iiThrough the group communication mechanism underlying the platform, applicationprocesses on each FTAT peer node are transparent to details that they are replicated,repaired upon fault detections, and reintegrated after fault repair. Based on a dynamicTriple Modular Redundancy scheme, each application process can survive up to two con-current faults under the assumption that the probability of two faulty peer applicationshaving the same fault is very small.In a large interconnected network, the cost of fault-tolerance can be very expensivein terms of time and communication due to the cost of either synchronization or roll-back recovery. The use of redundant live execution images to repair the faulty moduleguarantees forward fault recoveries.111Table of ContentsAbstract^ iiList of Tables^ vList of Figures^ viAcknowledgement^ vii1 Introduction 11.1 The Overview Of The Thesis ^ 11.2 Major System Design Issues 31.2.1 Fault Type ^ 31.2.2 Parallel Processing System Architecture ^ 41.2.3 Multilink Group Communication and Real-Time Processing 51.2.4 Transparency of Fault Tolerance ^ 51.2.5 Forward Fault Repair ^ 61.3 Previous Related Work Review 81.3.1 Fault-Tolerant Architecture ^ 81.3.2 Group Communication Facilities 91.3.3 Transparency in Fault-Tolerance Architecture ^ 91.3.4 On-Line Forward Fault-Repair ^ 101.4 Our Approaches In Meeting The Requirements And Their Rationals 111.4.1 Multiple Modular Redundancy ^ 11iv^1.4.2^Multilink Group Communication ^^1.4.3^Transparency ^1.4.4^On-Line Forward Fault Repair ^1314152 System Architecture 162.1 General Assumptions ^ 162.2 System Hardware Architecture ^ 172.3 System Software Architecture 183 Highly Efficient Multiple-Link Group Communication 213.1 Communication System Structure ^ 213.2 Link Channel Communication Layer 213.3 The FTAT Group Communication Service Layer ^ 223.4 Group Communication ^ 233.5 Activity Observation 253.5.1^Operation When All Peers Are Available ^ 293.5.2^Operation When a Link Failure Occurs 293.5.3^Operation When Multiple Link Failure Occurs ^ 313.5.4^Operation When A Processor Failure Occurs 363.6 Timer Handling ^ 384 FTAT Services with Transparent Fault-Tolerance 404.1 Swap Operation Based on the Scatter&Gather Mechanism ^ 414.2 FTAT Services ^ 414.2.1^IRPCRead Service ^ 424.2.2^IRPC Write Service 434.3 Start of Service Operations ^ 43v4.4 Structurally Stated Buffer Set ^4.4.1^The Group Communication Buffer Consumption Requirements4.4.2^The Deadlocks And Deadlock Avoidance ^4.4.3^The Service Buffer Set Design ^444445464.5 Structured States ^ 474.6 Fault Masking 484.6.1^Selection of the Representative & State Swapping ^ 484.6.2^State Swapping ^ 494.7 Failure Handling 494.7.1^Link Failure ^ 504.7.2^Processor Failure 505 On-Line Forward Fault-Repair 545.1 Issues in Fault Repair^ 555.2 The Processing of The Fault Repair ^ 575.3 Handling Resource Rebinding Problems 595.4 Pointer Reference Stack ^ 606 Reliability Analysis 616.1 System Reliability ^ 616.2 Network Connection Reliability ^ 647 Petri Net Model of the FTAT Multilink Group Communication 687.1 Modeling Scatter & Gather Operations ^ 707.1.1^Boundedness ^ 727.1.2^Liveness 747.1.3^Link Failure Handling ^ 75vi7.2 Modeling the Complete Swap Service ^  767.2.1 Boundedness ^  777.2.2 Liveness  ^807.3 The FTAT Service Modeling ^  827.3.1 Boundedness  ^827.3.2 Liveness  ^858 Performance^ 888.1 Scatter and Gather ^  898.2 Swap ^  928.3 On-line Fault-Repair  ^928.4 Generic Communication Performance Measurements^ 939 Conclusions^ 94Bibliography^ 97viiList of Tables7.1 Incidence Matrix and Invariants ^  807.2 FTAT Incidence Matrix and Invariants ^  858.3 Generic Operation Performance Measurements ^  93viiiList of Figures2.1 Hardware Architecture  ^182.2 The system architecture of an FTAT ^  193.3 The relations among the FTAT peer modules during a scatter operation.^273.4 Observing undergoing activities  ^303.5 Lost packet claiming  ^323.6 Multiple link failure handling  ^333.7 Multiple link failure handling 1  ^353.8 Multiple link failure handling 2  ^376.9 Reliability vs. fault detection prob. c and successful repair prob. r.^.^656.10 The connection reliability vs. the link reliability  ^677.11 Scatter & Gather Pr/T net Model ^  727.12 The link failure handling part of Scatter & Gather Pr/T net ^ 737.13 The invariant part of the Scatter & Gather Pr/T net model ^ 747.14 Swap operation Pr/T net model ^  787.15 Swap operation Pr/T simplified net model ^  797.16 IRPCWrite Service Pr/T net model  837.17 IRPCRead Service Pr/T net model ^  847.18 IRPCWrite Service Pr/T simplified net model ^  868.19 Scatter without activity observation in a single link connection . . . .^898.20 Scatter with activity observation and a single link connection ^ 90ix8.21 Scatter with activity observation and a parallel link connection . . .^91xAcknowledgementI would like to express my sincere thanks to Dr. Mabo Robert Ito, my supervisor for hispatient guidance and persistent financial help during the course of my thesis study.I would also like to extend my thanks to Dr. Sam Chansom, Dr. Allan Wagner ofComputer Science Department for their precious opinions and help in doing the research.Thanks also go to Mr. A. Zaafrani, Mr. Robert Lam, Mr. Jefferey Chow of ElectricalEngineering Dept. for their help in editing up the thesis.My deepest thanks to my wife for her help and encouragement, who had to face withthe life alone much of the time and relieved me from most of the house work to allow mededicate myself to my studies.xiChapter 1IntroductionComputers have been widely applied in almost every aspect of our daily life due to therapid development of the computer hardware technologies and enormous decrease in theircosts. However, applications such as aircraft control, nuclear power station control, andspacecraft control etc. can not afford potential catastrophe in which human lives areat stake, caused by computer faults. Computer fault-tolerance has been an importantresearch area for many years and is the subject of this thesis.1.1 The Overview Of The ThesisIn this thesis research, we target a system that meets the following requirements:• Transparent protection to the applications from limited permanent module failures,transient and intermittent physical faults based on multiple modular redundancy.• A systematic structural architecture that easily fits into parallel machines.• Transparency to applications of the underlying fault-tolerant architecture to facil-itate parallel processing.• Deterministic system behavior for real-time processing.To meet the above requirements, we built an experimental system based on systemat-ically connected redundant modules with multiple linkages in hardware and a structural1Chapter 1. Introduction^ 2approach in software as building blocks. Such a building block is called Fault TolerantAbstract Transputer (FTAT). Through the designed multi-link group communicationmechanism, an application on top of the system gets transparent fault tolerance servicesthat encapsulate acceptance testing, fault masking, possible fault-repairing, reintegratingthe repaired module, and inter-FTAT replicated process message-passing. The systemis also abstracted with a consistent view of a transputer to its applications through themechanism, providing an easy model for programming fault tolerant parallel processingunder the system.The multi-link group communication is designed with the concept of activity ob-servation, gaining much improved communication reliability and good performance inredundant data distribution. Activity observation is explained in detail in chapter 3.Generally speaking, it uses extra ACKs as the drivers for reliable data distribution andefficient group synchronization.A dynamic Triple Modular Redundancy (TMR) scheme employed in the system makesit possible for an FTAT to be tolerant of two concurrent faults based on four transputers.The on-line forward fault-repair, described in detail in chapter 5, greatly extends thelife of the system, repairing unlimited sequential transient physical faults without check-pointing and roll-back recovery. It guarantees forward fault recovery and eliminates thenondeterminism due to roll-back recovery that is common to most current fault-tolerantdistributed systems. Our reliability analysis, in chapter 6, has shown a remarkable im-provement in systems reliability by the use of forward fault-repair.The interesting approach to process migration for volatile on-line fault-repair sepa-rates applications from their run-time environment at a minimal cost, making applica-tions relocatable during run-time.In this thesis, chapter 1 introduces the research by discussing the major design issues,reviewing the related work, and introducing our approaches and their rationals. ChapterChapter 1. Introduction^ 32 describes our system architecture. The design of the scatter and gather service forone-to-many and many-to-one pattern communication, which is the core of the multi-link group communication protocol, is discussed in chapter 3. The swap service andthe FTAT services, together with the scatter and gather service constituting the multi-link group communication mechanism, are discussed in detail in chapter 4. Chapter5 describes the fault-repair scheme in which the volatile run-time context is used toachieve forward fault-recovery. A brief reliability analysis is conducted on the multi-link group communication design and on forward fault-repair for its contribution to thesystem reliability in chapter 6. The multilink group communication design is formallyrepresented in Predicate/Transition net, a variation of Petri Net, and is informally shownto be correct. Chapter 8 discusses the performance results of the two major components,the scatter and gather service as well as the on-line forward fault-repair, in term of timeand amount of communication. These two components are the major contributions in thisresearch. The thesis is concluded in chapter 9, which gives a summary of our conclusions.1.2 Major System Design Issues1.2.1 Fault TypeThere are many fault types in computer systems such as permanent faults, transientfaults, software design errors, human operation errors, etc. The errors which result fromthese faults may have a catastrophic effect on the system.Hardware faults (physical faults) include short circuits between two leads, open cir-cuited transistor junctions, alpha particle impact on dynamic MOSFET memory cells[7][10]. These errors most often appear as changes in the patterns of zeros and ones thatrepresent the information. Other errors are time based; e.g., information fails to arriveor arrives at the wrong time [10]. Errors that are not detected and eliminated within theChapter 1. Introduction^ 4system are likely to lead to the failure of the systems. In addition, faults due to tempo-rary environmental conditions are hard to repair because the hardware is undamaged. Itis this attribute of transient faults that makes them of interest [7].It has been observed that more than 75% of faults occurring in a system are transientin nature [16], and it is known that more than half of all failures are intermittent [11].Therefore, the system reliability greatly depends on its recoverability from these faults.Permanent or solid failures should be removed, with related parts, from the system.For these reasons, transient physical faults are the major fault type to be dealt in thisthesis.1.2.2 Parallel Processing System ArchitectureIn recent years, the price of hardware has dropped to the point where non-self-checkedcomputation redundancy, that can guarantee the computation integrity, can be intro-duced into commercial machines. Spatial redundancy [7][10] provides the potential forefficient fault detection and recovery if the redundancy is fully exploited. To be tolerantof concurrent faults, a system usually needs more processing modules than double of thenumber of concurrent faults to be tolerated. However, among independent modules, twoconcurrent faults on different modules are unlikely to be the same as suggested in [22].The cost of fault tolerance in a large parallel processing system can be very expen-sive due to either synchronization or checkpointing and roll-back recovery. Since theprobability of system failure increases rapidly with the number of processors [7], it is im-portant to incorporate structural fault-tolerance in medium- to large-scale multiprocessorcomputers.Chapter 1. Introduction^ 51.2.3 Multilink Group Communication and Real-Time ProcessingReal-Time applications have strict requirements over the system's response to the ser-vice requests. Timeliness is the major characteristic for real-time systems. Asynchronousrecovery does not offer any timeliness characteristic as a recovery may cause a nondeter-ministic delay because of a chain of roll-backs [8][18] in worst cases. Synchronous oneswill have more overhead in synchronization and cooperation. The difficulty in efficientsynchronization [17] means additional cost in terms of development effort and operationoverhead. Efficient group communication technique should be developed for multiplemodular redundancy fault tolerance.It is shown in [5][21] that the more processors or computers are involved in coopera-tion to achieve a certain goal, the higher the overhead is. Redundant messages flowingamong these replicas are essential for fault-masking but are desired to be transferredat minimal cost, especially for masking redundancy. As a result of a structural fault-tolerance approach, the memberships of a group at a level in the structure can be welldefined. The communication among the members can also be simplified so as to minimizethe synchronization overhead.Multi-links for communication provides the potential for fault-tolerant communica-tion. Efficient multilink group communication in support of an architecture of structuralfault-tolerance in large parallel processing systems is a key issue to systems design.1.2.4 Transparency of Fault ToleranceIn many fault-recovery systems, the designers of applications are required to explicitlyprogram for recovery; e.g., by writing code to take checkpoints and to restore the internalstates from the checkpoints after detection of fault, error or a failure, and by organizingcomputations as transactions and periodically committing them. Transparent recoveryChapter 1. Introduction^ 6[12] on the other hand, hides the details of checkpointing and fault occurrences in theunderlying hardware from the programmer. The application development effort for fault-tolerance is thus greatly reduced.Parallel programs are non-trivial to design and develop on a large parallel processingmachine. It is even harder to program parallel processing with fault-tolerance require-ments. Also, the costs in the software overhead of fault-tolerance and the developmenteffort can be much higher than those in a single processor case. A structural developmentapproach structures the system with different levels of abstractions in a modular way.Hence, the effort paid for fault-tolerance in development and operation of the overallsystem can be greatly reduced.As system grows in sophistication, the advantages of transparent recovery increases.Transparent fault-tolerance service will be more significant for fault-tolerant parallel ma-chines in the near future. In such systems, a simple programming model provided bythe transparency to the architecture means great convenience for the developments offault-tolerant parallel-processing.1.2.5 Forward Fault RepairAs a support of the architecture transparency, faulty applications should be repaired andreintegrated into the system smoothly.On-line fault repair on most concurrent systems is done through system reconfigura-tion and can run into problems when spares are exhausted. Checkpointing and roll-backrecovery may cause a nondeterministic chain of roll-backs in a global system that requiresdeterministic behavior of a system and is not suitable for systems of time constraints.On-line fault-repair means that both the availability and reliability of the systemhave to be maintained during run-time even when transient faults or failures occur.Chapter 1. Introduction^ 7When high degree of redundancy is used for fault-tolerance, the redundant run-time in-formation should be exploited to the maximum to achieve an on-line forward fault-repair.Consequently no computation is wasted.In this research project we have addressed the problems of:• efficient multi-link group communication for redundant data distribution amongthe modules within a structurally built fault-tolerant cell.• transparent fault-tolerance with dynamic Triple Modular Redundancy (TMR).• on-line forward faulty process repair using the redundant live execution image.Section 2 describe the previous related work. Section 3 introduces our approach tothe systematically structured multiple-modular-redundancy fault-tolerance under real-time constraints, the transparency to the fault-tolerant architecture, and the on-lineforward fault repair.To aid in the discussion, we use the following terms consistently throughout thefollowing chapters:• Peer Module or Peer Node — a redundant module.• Peer system — the system on each redundant module.• System — the system that comprises the four redundant modules, called a Fault-Tolerant Abstract Transputer (FTAT), cooperating to provide a single transputerabstraction.• The overall system — the target system built with FTATs to form a fault-tolerantmultiprocessing system.Chapter I. Introduction^ 81.3 Previous Related Work ReviewThe previous work is reviewed under the separate topics to distinguish different areasaddressed.1.3.1 Fault-Tolerant ArchitectureThere are quite a few successful commercial fault-tolerant systems on the market today.Tandem's non-stop systems [26] package the CPU, memory, bus, and I/O controller asreplaceable modules operated by message-based distributed operating system. I/O devicecontrollers are dual-ported, accessible from either port. All the modules are connectedby the dual-bus called "dynabus" with on-line repair features. The approach eliminatesany single point vulnerability and is dynamic in its "primary and standby" architecture.However, it is difficult to build very large systems based on this architecture [26].Stratus [26] employs a "pair and spare" architecture. The major functions on thesystem are replicated four times in independent and identical modules in a hierarchicalway. At the first level, two modules form a paired function unit with each employingself-checking. Then the two function units work as "primary and standby" with on-linetransparent repair. In the system, self-checking is employed throughout the system; but"pair and spare" is applied only to the CPUs and memory. It is based on a ring typeof local area network, a message-based system. Nonetheless performance is degraded asthe number of processing modules increases because of communication bottleneck.Much other research has been done in the fault-tolerance area. An interesting ap-proach in [14] suggested that a high abstraction module based on a petri-net can beused as an observer. This abstract model can preserve the specification redundancythrough design diversities for software fault detection. Although not providing a fullfault-coverage, it is useful for detection of software design errors.Chapter 1. Introduction^ 9Based on transputers, the DRAT system [28] is implemented in a ring connection forOn-Line Transaction Processing (OLTP) applications. Each data processor has a recoveryprocessor associated with it. The recovery processor is connected in a separate link routefrom the data base processing path to avoid link traffic bottleneck. Nevertheless, theseparation of the recovery ring [28] from the data ring makes the system vulnerable tomultiple link failures and node failures. In addition, the system is also difficult to bereconfigured to mask processing module failures.The fault-tolerant processor described in [27] is built based on Triple Modular Redun-dancy (TMR). However, it is not designed to be used as building blocks for construction oflarge systems. It does not provide transparency to its underlying hardware architecture.The system can only mask faults by reconfiguration.1.3.2 Group Communication FacilitiesIn order to support replicated distributed programs [19], a replicated procedure callmechanism is developed. The client group and server group, which are termed troupes,can exchange messages among the troupe members on the collective unit basis throughthe call. The members act as a logical unit and are brought in step by the call. Thegroup communication mechanism deals with the design issues such as those in sendingand receiving between different troupes. These systems are based on a LAN, which isvulnerable to the communication link failure.1.3.3 Transparency in Fault-Tolerance ArchitectureTransparent recovery in distributed systems is recognized to be of great convenience sincea programmer does not have to explicitly "arrange" for taking checkpoints and recoveringfrom the consistent state backup.Chapter 1. Introduction^ 10Compiler assisted checkpointing [15] is a technique to provide transparent checkpoint-ing. The compiler generates sparse potential checkpoint code with adaptive checkpointingto reduce the size of the checkpoints. The volatile logging technique in [29] can hide theprogrammer from the fault-recoveries.1.3.4 On-Line Forward Fault-RepairOn-line fault-repair is attractive because it provides continuous services to the applicationeven in the face of failures, provided that the majority of the redundant modules still workcorrectly. Conventional fault-tolerant computer systems usually use a checkpointing androll-back recovery scheme or use primary and hot-standby to replace the faulty modules.In a system with a high degree of redundancy, it is desirable to exploit the redundancyto the maximum so as to provide forward fault-recoveries. The non-determinism due toroll-back and recovery can be eliminated if the non-faulty volatile states of the applicationcan be used to repair the transient faults in the redundant modules. Process migrationis an ideal technique for the on-line forward fault-repair.Process migration has been used mainly for load balancing [23][24][25] However,the technique has the potential to support fault-repair. Many systems that supportprocess migration are based on shared virtual memory [23] [24] [25]. The memory pagingmechanism supported by the hardware is very useful in reference rebinding problems[23]. On the other hand, the cost is very high for maintaining the shared virtual memoryon distributed systems and accessing disk I/O, since every object on such systems ispersistent and requires concurrency control to ensure its consistency under concurrentaccess [2].Highly Availability Transputing Systems [1] provides dynamic fault recovery for MIMDparallel machines. However, it does not provide forward fault recovery to avoid waste ofcomputation due to recoveries and retries.Chapter I. Introduction^ 111.4 Our Approaches In Meeting The Requirements And Their RationalsWe approach the major issues previously identified by the following:• To facilitate fault tolerant parallel processing,— a systematically structural architecture to yield regularity and recursiveness.— transparent fault tolerance service to provide an easy programming model.• To meet time constraints,— efficient multilink group communication for redundant data distribution toreduce communication overhead.—efficient group synchronization through cooperation among well defined pro-cessors of small sized group to reduce synchronization overhead.— forward fault recovery by the of the redundant live execution image to elimi-nate the nondeterminism due to roll-back recoveries.• To maintain high reliability,— a dynamic TMR to tolerate concurrent faults.— forward fault repair to tolerate sequential transient physical faults.We discuss our approaches to the design issues in separate topics for different areasin the same order as in section Multiple Modular RedundancyAs hardware cost decreases and microprocessing power soars (the recent DEC RISCmicroprocessor Alpha is rated 200MIPS), micro based parallel processing systems withChapter 1. Introduction^ 12fault tolerance will become in demand. This is not only because they can be built fromthe off-the-shelf chips, but also they offer much more powerful capabilities when they arebound with parallel processing techniques while paying moderate cost in performance forfault-tolerance and guaranteeing reliability.Since the probability of system failure increases very rapidly with the number ofprocessors [7], it is important to incorporate structural fault-tolerance into medium- tolarge-scale computer [30]. A natural way of building a large system is to build smallerunits first and then to construct bigger units from the smaller units in a hierarchicalfashion.Transputers are 32-bit RISC microprocessors with on-chip communication facilitiesas part of its architecture. They offer multiple links of high speed serial communicationchannels and can be built into parallel systems easily to provide nearly scalable processingpower. This architecture makes transputers suitable for interconnection in hardwarewithout high cost and provides the potential for fault-tolerance. DSP processors withmultiple on-chip communication links and high processing power have been developedby Texas Instruments.This may be an indication that multiple link processors will play a more and moreimportant role in parallel processing as well as in fault-tolerance in the future. We choosethe transputers because its architecture offers very good inter-connectivity for parallelmachine construction, the message passing based communication and its availability inthe department.The experimental system platform is based on the transputers hosted on a Sun work-station. Four transputers form a Fault-Tolerant Abstract Transputer (FTAT). Eachtransputer has four on-chip serial communication links that are bidirectional with sup-port from dedicated DMA controllers, at a data rate of 20Mbits/sec./link. An FTATis based on four transputers which are fully connected in a systematic way, which isChapter 1. Introduction^ 13described in detail in chapter 2. The modular group FTAT cooperates among its welldefined members to minimize the overhead in fault tolerance and synchronization.This abstract transputer FTAT can be used as a building block to build a large parallelprocessing system. It can also be used as a non-stop type of a computer system with thepotential for real-time processing. In our system, the software distributed over the fourredundant processor units hides all of the details of application replication, fault-masking,on-line forward fault-repair and reintegration after repair.On message-based distributed systems, fault-recovery may involve roll-backs of agroup of processes based on the checkpoints [8]. All the invocations to other processes,since the last consistent checkpoint, must be undone in order to eliminate the effect of afault. This has a drawback that the recovery may take a nondeterministically long delayand wastes a lot of computation time on the overall global system. This approach is notsuitable for real-time processing in a large parallel processing system.Our approach is to try to confine the fault within the module on a process basis toavoid expensive global recoveries. Before each output from an FTAT, fault detection isconducted and only the correct output is sent. Each application process is a fault-tolerantobject on the system. Inter-process communication is protected by error control in theprotocols. A faulty process is repaired by using the run-time context of a non-faulty peerinstead by undoing and then retrying of the last failed attempt. Hence, forward recoveryis guaranteed. Our on-line fault-repair mechanism repairs the faulty module on a processbasis instead of through pure reconfiguration.1.4.2 Multilink Group CommunicationA group communication mechanism is essential to enable the group members to commu-nicate and reach agreement on the state of the applications running on top of the systemdespite random communication delays and failures. The mechanism should make theChapter 1. Introduction^ 14behavior of the group indistinguishable from that of a non-redundant processing mod-ule [17] in order to implement the desired replication transparency. Multi-link groupcommunication provides an essential means for redundant data distribution.To achieve efficient multilink group communication, ACKs are broadcast to all theother peers in response to each data packet received, to create the transmission redun-dancy in the form of control packets and to spread the communication status of thereceiver for efficient synchronization. The redundant ACKs creates more opportunitiesfor the intended receiver to take active action. Packet loss can be detected promptlyand can be possibly claimed back in parallel. The low cost in using ACKs as "hints"for optimal packet loss claiming outperforms sender-initiated dynamic packet routing interm of both time and communication.1.4.3 TransparencyTo provide a simple programming model, the service abstractions designed are similarto the conversation construct in [13] for a distributed recovery block spanning severalprocesses. Applications need only to hand in the result or intermediate result for accep-tance testing and determine its "pace" to use (call) the service primitives to achieve thedegree of reliability in computation integrity and fault coverage [6].The system provides Read and Write calls which performs the communication func-tions to read from and write to external FTATs. The write call encloses a set of symmetricoperations among the four peer modules within an FTAT. It also provides a dummy callthat only invokes the fault-tolerance functions implicitly without actually sending outthe result to other FTATs. This allows computations to run on its own logical phases.Chapter 1. Introduction^ 151.4.4 On-Line Forward Fault RepairBy incorporating process migration techniques into the system, a faulty application pro-cess can be repaired by using the volatile execution image from a non-faulty one. Thisguarantees a forward fault repair. The approach to fault repair in our system is noveland interesting. The availability of the redundant modules subject to sequential faults isunlimited, thus the life time of the system is greatly extended.In process migration, one of the difficult problems is to cope with pointer referencesin the run-time context of a process. The problem is handled by the use of a pointerreference stack in conjunction with a set of resource management information tablesmaintained by the system. The architecture of a transputer with distributed memoryprovides only one linear address space without any memory paging mechanism support.The shared virtual memory solution is not practical if not impossible here.The difficult reference rebinding problems in process migration are handled throughthe system's resource management information and run-time facility of the so called"pointer reference stack". The pointer creations are tracked by the system all the waydown the current execution path. Our approach is to separate the applications from theirrun-time physical environment so that they can be moved around easily and effectivelyat minimal cost.Chapter 2System ArchitectureThis chapter discusses the fault-masking architecture and software architecture on ourFault-Tolerant Abstract Transputer (FTAT). By systematically connecting the redundantmodules through the multilinks on each processor, the architecture designed is regularand recursive. The resultant architecture preserves the original transputer architecture.The FTATs are easily to be constructed into large parallel processing systems withoutcreating a complex view of the system.The system software implements an abstraction of a transputer while providing fault-tolerance services transparently. The overall system yields low cost for fault tolerancedue to the avoidance of roll-back recovery and the small scale of synchronizations andcooperation among well defined group members.2.1 General AssumptionsIn order to focus on the issues identified, the system was designed under the followingassumptions:• The processing modules based are computers with private memory, 4 bidirectionalcommunication links and a copy of the operating system.• The application processes are statically created in the same order on each of theprocessing modules of an FTAT.16Chapter 2. System Architecture^ 17• Applications are allowed to use physical pointers residing only on the run-timestack.• The FTAT system software is assumed to have omission failure semantics [6]. Ar-bitrary failures [3][17] are not considered in the system design.• The application processes are cooperative as opposed to collaborative.The system is built via an structural approach for both hardware and software. It pro-vides a many-to-many inter-FTAT communication mechanism to carry out the redundantdata distribution with the omission failure semantics [17]. A set of group communicationprimitives, called the scatter and gather services, are furnished as generic primitives forthe reliable redundant data distribution and synchronization in the event of link failuresand processor failures.2.2 System Hardware ArchitectureThe hardware redundancy is based on four transputers so connected that from any in-dividual transputer the scan of others is straight forward. Each transputer has its nodenumber. Each board has four transputer nodes with their own private memory. Thetransputers are numbered from 0 to N sequentially in combination with a board number.Links 1 to 3 of a transputer are used as FTAT internal links leaving link 0 as a FTATlink. The relation between node numbers Peer#, board numbers FTAT#, and link num-bers InternalLink# in the configuration of the system can be defined by the followingformulas:FTAT# = [Peer# ± 4];InternalLink# = (RemotePeer# — LocalPeer#) mod 4;Grid InterconnectionChapter 2. System Architecture^ 18RernotePecr# = (InternalLink# + LocalPeer#) mod 4;FTATLink# = LocalPeer# mod 4;As we can see, the architecture offers regularity and recursiveness, which makes iteasy to build a parallel machine with the FTATs.The hardware architecture is shown on the left side in Figure 2.1. On the right isshown the an example of the architecture applied in a grid connection.Figure 2.1: Hardware Architecture2.3 System Software ArchitectureFigure 2.2 shows the programming model as seen by systems programmers. The pro-gramming model is defined as following: Each transputer within an FTAT is a peermodule. All four peer modules within one FTAT run an identical set of applicationApplicationProcessStated BufferSetLinkBuffer QueueFTATNetwork LayerChapter 2. System Architecture^ 19processes. We say that the set of applications are replicated over the peer modules. Eachof these replicated processes is called a replica process. The four replica processes arecollectively called a replicated process which is identified by a unique ID.Figure 2.2: The system architecture of an FTATThe system software implements the abstraction of an transputer with fault toler-ance over the four peer modules. This provides to application programmers a simpleprogramming model for developing fault tolerant parallel processing applications basedon FTATs just like on individual transputers, while the system takes care of the faulttolerance requirements within the FTATs. On the right side of Figure 2.1 is an exampleof part of a mesh connected application system based on the FTATs as building blocks.For each message packet, the sender is called the originator of the message packet,and the destination receiver the receiver of the packet. A peer module that has noticedChapter 2. System Architecture^ 20such a delivery is called the observer of the packet. These terms are used in the followingchapters using the definitions given here.Since a replicated process is a logical unit on the system, it must be identified uniquelyso that an operation in group can be recognized by the system which provides the services.The ID of each replica process is formed by concatenating the peer module ID and processID on the peer system. From the system assumption, the application processes are createdon all peers of an FTAT in the exactly same order. So the same sequence number is beassigned to the same replica process. This results in much simplified design: the processID can be used as the ID for the replicated process ID, since the replicated processidentification is no longer a major issue.The system is built on top of the hardware architecture defined above. The structuralapproach can greatly reduce the work to obtain overall system fault-tolerance when amultiprocessor or massively parallel system is built. The higher levels of a large system,constructed with the building blocks, need only to take care of the optimal routing ofmessages and the reliable delivery of packets between FTATs. Moreover, the systemensures that the received packet on an FTAT will not be lost due to peer module crashright after its reception. It is the receiving FTAT's responsibility to ensure the correctand optimal delivery of the message.The four transputers in an FTAT work in dynamic TMR under the assumption thattwo modules having the same fault simultaneously is very small due to the autonomyof each module. Each peer system must get one and only one vote in the majorityconsensus to assert "non-faulty". The output from the FTAT masks any fault in thesystem automatically. Once a fault is detected, the peer system sends a request forrepair to one of the non-faulty peers and the on-line fault-repair process is started.Chapter 3Highly Efficient Multiple-Link Group CommunicationThis chapter describes our contribution with respect to efficient redundant data distri-bution among group members through the multiple link connections. The concept ofactivity observation used in the protocol design provides receiver-initiated dynamicpacket rerouting. This may lead to possible parallel claim of lost packets via optimalroutes, low cost by using ACKs as rerouting drivers and better synchronization amongthe group peers.3.1 Communication System StructureThe system has 3 layers in its implementation of the communication services. The physi-cal layer provides a blocking call that requires an exact matching count of data transferredby both parties of the communication. Based on this layer the link layer communicationprimitives are developed. On top of the link layer communication services, the FTATgroup communication network layer packages three logical level of services to provide thetransparency to the underlying service details.3.2 Link Channel Communication LayerThe link layer service provides a reliable communication packet channel. It preserves thepacket sending order, employs flow and error control, and guarantees the correct receptionof the packets carried. It works with a watchdog that monitors the link status through21Chapter 3. Highly Efficient Multiple-Link Group Communication^22the associated watchdog timer. This layer is not discussed in detail since it is based onwell defined principles [9].The packet address header at the link layer is quite simple since it is a point-to-pointcommunication. One thing worth mentioning is the link status vector which is a bit-mapped link indicator packaged into an integer. During the packet exchanges, the linkstatus vector provides to adjacent peers about the link status of third party peers as wellas about themselves.3.3 The FTAT Group Communication Service LayerThe FTAT network layer provides the service primitives that enclose several rounds ofgroup communications, fault-detection, fault-masking, on-line fault-repair and reintegra-tion after repair. In addition, they bring the replica processes over the peer processorsinto synchronization. The layer has three logical levels packaged in it for parallelism.As mentioned earlier, the replica processes of an application must be uniquely iden-tified individually and collectively. The FTAT service calls on behalf of the replicaprocesses must also be uniquely identified. With these identifications, the group commu-nication among the peers can be recognized as either scatters or gathers, and processedaccordingly.Each FTAT layer packet has an address header consisting of 3 parts: the originator(original sender on the message), the source (the last sender which could be a forwarder),and the receiver (the eventual receiver). As part of message identification, each packethas a logical ID which consists of the message ID, process ID, and FTAT No.There are two types of packets flowing in an FTAT: message packets, and controlpackets. Control packets are those such as ACKs, NAKs, and SPECIAL; the latter isused for prevention of loop sending during dynamic packet rerouting when the destinationChapter 3. Highly Efficient Multiple-Link Group Communication^23is not reachable.To achieve reliable and efficient group communication through multiple link connec-tions, a concept of activity observation is devised. The concept of activity observa-tion is basically the use of extra ACKs from the recipients as the observations of scatteroperations. Here a group send (scatter), or a group scatter of the related data packet(s)is defined as a logical activity. The use of ACKs from the group member recipientsresults in low cost and high reliability in dynamic packet routing in the event of linkfailures. The concept also enables the system to detect peer application failure and toresult better timeliness in group synchronization.3.4 Group CommunicationA group communication mechanism is essential to enable the group members to commu-nicate and reach agreement on the state of the applications running on top of the systemdespite random communication delays and failures. The mechanism should make thebehavior of the group indistinguishable from that of a non-redundant processing module[17] in order to implement the desired replication transparency.Much research has been done recently on group communication. The V system [20][23]exploited the use of one-to-many group communication. Replicated procedure call in[25] was developed for many-to-many pattern calls including one-to-many call from eachgroup member and handling many-to-one call by each. The group communication servicesneeded in this thesis should be able to handle both one-to-many and many-to-one callsthrough the multiple links. However, the information distribution is among the groupmembers instead between different groups.Multi-link group communication provides an essential means for redundant data dis-tribution. A many-to-many communication call involves two subproblems: 1) a groupChapter 3. Highly Efficient Multiple-Link Group Communication^24broadcast from one-to-many through all the internal links; this is termed here as "scat-ter" of the message, and 2) the handling of a many-to-one call from the point of view ofa receiving group member; this termed as "gather". The peer processors in the FTATform a group. In [19], the issues related to the group send and receive are identified asfollows:• Scatter should be a blocking call so that the return point is the synchronizationpoint for the group.• Gather must solve the following two problems:I. The receiver must be able to distinguish unrelated call messages from onesthat are part of the same group call.2. When one call message of a group call arrives, the receiver must be able todetermine how many other call messages to expect as part of the same groupcall.Although the applications and services in our system are performed on the same setof machines instead of in the form of RPCs (remote procedure call), the same issuesalso apply. The multiple linkage among the group members makes the problem morecomplicated. A link failure in a local area network may result in a network partition.However in the multiple link case, a network can still be connected.A complete swap exchange involves the following steps among all peer modules of agroup through the internal links within an FTAT:• Each group member scatters the message to all other peer modules.• Each group member gathers the messages from all other peer members.Chapter 3. Highly Efficient Multiple-Link Group Communication^25• Each group member scatters the ACKs to all others in response to each data packetreceived.• Each group member gathers the ACKs from all others.Although the third step does not seem necessary here, it is designed on purpose for theconcept of activity observation.To detect these failures, normal techniques are based on timers. First, link failurecan be asserted when no communication is heard on a link for a period of certain time.Secondly, a processor failure exhibits failure of all links and vice versa.Arbitrary failures may mean that the faulty peer system is malicious. To simplifythe design and to concentrate on the issues identified in the introduction of this thesis,arbitrary failures in the system are not considered in our design. When this type of faultsoccurs, the protocol here may fail to work.In this research, the focus is on the highly efficient multi-link parallel group com-munication, the transparent fault-tolerance services and the on-line forward fault-repair.We assume that the FTAT network layer have the omission failure semantics [17]. Thisassumption can be implemented by employing self-checking technique to the FTAT net-work layer to avoid arbitrary failures. The Petri-Net based Observer suggested in [14]can be one of the candidate techniques.3.5 Activity ObservationThe concept of activity observation is designed to help the system to handle link fail-ures, processor failure, and peer process failure in group communication. These failuresmay affect the membership of the related peer in group communication and hence thecorrect operation of the protocol.Chapter 3. Highly Efficient Multiple-Link Group Communication^26The concept of activity observation works in the following way. When a peerinitiates a new logical activity, it scatters the data packet to the group peers. Thereceivers scatter the ACKs in response to a data packet packet in order to quickly spreadthe observation of the activity. Activity observation serves two purposes:• To achieve efficient and reliable packet delivery in the event of link failures orprocessor failure.• To help identify the application, whose participation is expected by the others ofthe group, from the list of applications running on the system.• To detect the failure of a peer process whose attention is expected by the otherpeers in group communication.Since a packet may go through a third, or even a fourth peer module, to reach thedestination in the face of link failures, the receiver must be able to quickly identify thelink through which to ACK the packet forwarded by the last sender. The ACK packetkeeps the originator and the logical ID of the message as well as the last sender. Theoriginator and the logical ID together identify the scattered message and the last sendertells where the ACK is to be returned to. It is very likely that the last sender remainsthe best route to ACK a claimed packet. This way, when an observed ACK is receivedthe observer gets enough detail about the activity of the observed peer application torespond correctly.Figure 3.3 describes the relations among the peers during a scatter operation andexplains the concept of activity observation. When a packet is scattered, each receivingpeer, upon the correct reception, will scatter an ACK to the other peers including theoriginator (original sender) and the other two peers which act as packet observers. Theextra two ACKs serve as the observation of the logical activity.Chapter 3. Highly Efficient Multiple-Link Group Communication^271. A complete logical swap in an1RPCWrite operation.2. Scattering of a logical swap operationfrom node 0. 3. ACKing a scatter from node 0and observation of ACKs.Figure 3.3: The relations among the FTAT peer modules during a scatter operation.Chapter 3. Highly Efficient Multiple-Link Group Communication^28Since each "scatter" is paired with "gathers" on the other sides, the pairs enforce thesynchronization among all the peers. At the receiving side, the "gather" is completedwhen all the expected packets have arrived from all the other peers. On the sending side,when all the direct ACKs have arrived at the originator a scatter operation is completed.If a peer module does not receive a packet due to link failure but received the relatedACK from another peer, the receiving peer can now actively claim the missing packetfrom the known recipient instead of waiting for originator to reroute the packet. Thelogical ID and the originator ID, carried in the observed ACK packet header, can identifywhich of the data packets gathered is being claimed.The use of the extra ACKs yields redundancy of a sending activity in the form of thecontrol packets. The status about the other peers in the group communication can beknown from these ACKs. In addition, the scatter and gather operations are all engagedwith all the internal links in a "symmetrically complementary" pattern, i.e., all receiverslisten to and the sender talks to all the working internal links. This form of balance helpsfunctional clarity of the scatter and gather services in the system design and enhancesthe timeliness of the group synchronization.It is necessary to keep the process executions on all the peers in close step for real-timeprocessing. The synchronized execution can produce better timeliness, better throughputand deterministic behavior of the group communication. For real-time systems, timelinessis most important. In contrast, in an asynchronous system a fault can not be detectedpromptly and an latent fault may result in great waste of computation due to a possiblechain of roll-back recovery.As a result, the following advantages are obtained:• low cost in transferring ACKs for packet rerouting.Chapter 3. Highly Efficient Multiple-Link Group Communication^29• optimal routing of a claimed packet from a known recipient with an adjacent work-ing link.• in case of multiple link failures, dynamic rerouting of a packet may be done inparallel since it is a receiver-initiated operation.• prompt detection of lost packets.• effective timeliness in synchronization, which is essential for real-time processing.3.5.1 Operation When All Peers Are AvailableAs mentioned earlier, scatter is an operation that broadcasts a data packet to the rest ofthe group members as receivers.When a peer scatters a packet, the receivers scatter ACKs through all internal workinglinks connecting the other 3 peers. The extra two ACKs would seem to be an extraoverhead in the operation. However since they can be broadcast in parallel, only a smallincrease (caused by two extra ACKs) in time is seen. Moreover, the redundant ACKshelp the distribution of data reliably and efficiently through the concept of activityobservation.At the receiver side, when all ACKs have arrived and the packet is received andACKed, the end of the gather operation in response to a scatter is triggered. On theother end, receipt of all direct ACKs by the originator, will trigger the end of the scatter.In this process, each receiving peer will receive 1 message and two observed ACKs, andthe originator will receive 3 direct ACKs in response to the scattered message.3.5.2 Operation When a Link Failure OccursOriginator^ Receiver/Packet ObserverScatter6‘'Receiver/Packet ObserverChapter 3. Highly Efficient Multiple-Link Group Communication^301. A complete logical swap exchangesthe same set of logical messages2. Node 0 is the originator of the scatteredmessage.Link Failure^ACK Observer3. Node 2 and 3 are the receivers of thescattered message, but node 1 is only anobserver.Figure 3.4: Observing undergoing activitiesChapter 3. Highly Efficient Multiple-Link Group Communication^31When a link failure occurs in an FTAT, such as the case shown in Figure 3.4, the watchdogwill detect it eventually after at most two probing periods by the link layer. After thescatter operation is started, the peer on the receiving end of the failed link may observethe ACK(s) from the adjacent receiving peer(s). When link failure is detected, the peersystem is ready to use the Observed ACK to claim a lost packet from a known recipientof the packet. The above process is illustrated in Figure 3.5.When the lost packet is eventually received, the direct ACK in response to the packetmust be received by the originator since it confirms the correct reception of the packetconcerned. Indirect ACKs are not important because they serve only as the observationof a reception for possible lost packet claim or identification of the application.Only direct ACKs must be delivered to the destination even in case of link failuresbecause of its importance to the originator. They represent memberships in the groupcommunication. The direct ACK is handled in the following way. The claimer of a lostpacket first returns the ACK to the forwarder since it is very likely that the latter stillhas a working link to reach the originator. Tithe forwarder becomes no longer reachable,the only link left is tried. Nonetheless, new link failure may occur and a known recipientmay not be able to reach the originator. In section 4.7.1, the further link failure handlingis discussed.In the service mechanism, there is a filter process that forwards any packet to anappropriate link if it is not destined to the peer module where it has currently arrived.Within the address header of a packet, the source ID part of the header tracks the lastsender as the packet gets delivered to peers along the route.3.5.3 Operation When Multiple Link Failure OccursMultiple link failures may result in a network partition. In the system, an even networkpartition is not considered in our design for simplicity. Processor failure is assumed whenPacket ObserverLost Claimer Lost ClaimerPacket Observer^ Packet Observer1. ACK Observer claims a lostpacket from ObserverPacket ObserverPacket Observer2. Observer forwards the packet inresponse to LostClaim requestPacket ObserverChapter 3. Highly Efficient Multiple-Link Group Communication^323. Observer waiting IPC WRITEoperation initiated from insideFigure 3.5: Lost packet claimingLostClaimerty,,rLostClaimerLostClaimerLostClaimer1. Node 0 scatters the PDU, Node 2^2. Node 1 and Node 3 observe andreceives and ACKs it. claim it.LostClaimer3. Lost claimers ACK the PDU received.LostClaimer4. Direct ACKs are returned to theoriginator.Chapter 3. Highly Efficient Multiple-Link Group Communication^33Figure 3.6: Multiple link failure handlingChapter 3. Highly Efficient Multiple-Link Group Communication^34all its internal links appear to have failed because an all-link failure exhibits the samebehavior as a processor failure does.Consider first the simple case of a processor's two links failing as shown in Figure 3.6.After the peer scatters a data packet to the available working links, only one peer receivesit. ACKs are then scattered by the recipient to all the working internal links in responseto the packet. Now if the other two links to the recipient are working, the other twopeers observe the reception through the ACKs.A simultaneous claims for the packet observed from both receiving peers are nowpossible. The recipient sends the data packet to the two peers in response to the receptionof the Lost Claim packets. The ACKs returned to the originator (direct ACK) are alsodirected to the recipient as a known peer reachable from the originator. This shows thatthis method may result in possibly better performance (parallel vs sequential) and henceis much better than single sender-controlled rerouting.When the ACKs are scattered from the lost claimer, the ACK destined to the orig-inator is first sent to the peer that provided the claimed packet. It is then forwardedby the peer who should have a working link to reach the originator, provided that nofurther link failure occurred so far. Tithe link between the two claimers is working, theobserved ACK from either should be received by the other. By now, both claimers shouldhave gathered the set of packets, i.e., one message packet and two observed ACKs. Theprevious recipient should now also have gathered the packets in response to the scatter,and the originator should have gathered all the direct ACKs.The second scenario in Figure 3.7 is the case of link failures which result in a singleline connection in the network. This is the worst case for the performance of the FTATprotocol. This time the third peer will first observe the reception of the packet. Asuccessful claim by the claimer from the second peer triggers the scatter of the claimer'sACKs to all the working internal links. The direct ACK is routed to the second peer.4LostClaimer LostClaimerChapter 3. Highly Efficient Multiple-Link Group Communication^351. Node 1 scatters PDU and Node 2receives and ACKs it.2. Only Node 2 receives PDU and Node 3observes and claims itLostClaimer LostClaimer 3. Node 3 receives and ACKs the PDU,^4. Node 1 receives the PDU and ACKsand Node 1 observes, and claims it. the reception of the PDUFigure 3.7: Multiple link failure handling 1Chapter 3. Highly Efficient Multiple-Link Group Communication^36Now suppose the fourth peer has observed the reception by the third. The aboveprocess applies again. This time, since there is no link connection between the secondand fourth peer modules, only the ACKs to the third peer and the originator are sentwhile the one to the second is dropped according to the protocol. Eventually the directACKs will reach the originator as long as the network remains connected.In Figure 3.8, the direct ACK is returned to a claimee but this time the link is notworking any longer. The algorithm is designed so that if any other link is still working,then the ACK is forwarded to that link, otherwise the ACK will have to be returned tothe link through which it came. The original ACK sender now will have to try the onlylink left to the peer to reach the originator.To prevent a possible loop in sending between the claimee and the claimer given thelink between them is still working, the ACK is turned into a special packet so that itcan be recognized and dropped if it is to be sent again. Since there is only one chanceleft, there is no need to send the ACK if the last attempt has failed.3.5.4 Operation When A Processor Failure OccursWhen a processor fails, we shall assume in this case that when a peer system has crashednothing can be heard from any of its links. A peer system should always be readyfor processor failures so that time is not wasted on idle waiting. Processor failures arealways checked first before a decision is made to wait any further for the events from thecorresponding peer module.The knowledge about the status of a peer is also available from the other peers. Thedetection of a processor failure is essential in order to stop meaningless waiting or sendingattempts. For this purpose, at the lowest layer, each time a packet (data packet, controlpacket, or probe packet) is exchanged, the current knowledge about link status of allpeers is attached to the header of the packet. Each peer periodically updates the linkLostClaimerLostClaimer4LostClaimerLostClaimer LostClaimer6-C..,LostClaimerChapter 3. Highly Efficient Multiple-Link Group Communication^371. Node 0 scatters the PDU, and Node 1^2. Node 3 observes and claims it.and 2 receives and ACKs it.3. The link to reach node 0 from node 2 has^4. Now the Direct ACK can reach thefailed. The Direct ACK is routed to node 1. originator.Figure 3.8: Multiple link failure handling 2Chapter 3. Highly Efficient Multiple-Link Group Communication^38status vectors as packets get exchanged with other peers. A failed peer processor will beexcluded from the group membership in the protocol operation until it is repaired.3.6 Timer HandlingIn the detection of a packet loss due to link failure or processor failure, timers are usuallyinvolved in a message-passing multi-computer system. In [4], timers associated the groupcommunication introduce difficulties in group synchronization. Packets may come inearly, or late within the time out period because of the autonomy of each peer and thetransputer communication protocols. If only one timer is used for group receiving, anearly packet can trigger the timer, but a late packet from the other two links can alsotrigger the time out [4]. A second timer is proposed so that a two phase timeout can beused to allow early packets or late packets resulting from the looseness in synchronization.Because ACKs are used extensively for activity observation, more timers would beneeded for each of the packet sent on each link. This would increase more complexity andoverhead in the synchronization management. In our system design, the use of the linkstatus eliminates the timers associated with "scatter". However, this requires that theunderlying communication service ensure the correct delivery of the packets once theyare on route unless the link has failed. This is achieved by our link layer service.To detect link failures and peer failures, a watchdog mechanism is built in the linklayer. There is a watchdog monitoring the link status. With the link status and theconcept of activity observation, the FTAT network layer handles link failure andprocessor failure in the communication.A watchdog timer is associated with each link and gets reset when a correct packet isreceived from the corresponding link. When a pending link output queue is empty, thewrite worker will send a probe packet to each link on a periodic basis. Once a link failureChapter 3. Highly Efficient Multiple-Link Group Communication^39is detected, the associated failure flag is turned on.Another set of timers is employed to detect peer processes' failures. Each applicationprocess has an associated timer. If a packet as an observation of a new FTAT servicearrives at a peer module, an associated timer for the identified process is started so as todetect the replica process failure to participate in the group communication.Chapter 4FTAT Services with Transparent Fault-ToleranceThe fault-tolerance services provide transparency to the underlying architecture, dataexchanges among the peer modules, and possible on-line forward fault-repairs. Thesystem implements dynamic TMR with a spare (hot-standby). The resultant system cansurvive up to two concurrent faults and unlimited sequential transient faults.Location transparency hides the hardware redundancy from programmers. Fault-transparency masks faults occurring in the underlying hardware. Replication trans-parency relieves the programmer from having to cope with the details of the cooperationand synchronization among the peer modules.The FTAT service primitives are based the scatter and gather services. For efficiencyand parallelism, the scatter and gather services are not implemented as service abstrac-tions in order to reduce the overhead in the cooperation and increase the parallelism inthe FTAT network layer.The synchronization points of scatter and gather operations trigger the state transi-tions in operation of the FTAT protocols. Application processes are synchronized at theapplication service interfaces. These interfaces enclose a series of scatters and gathersoperations, fault detection and fault-repair if any fault occurred. Special buffers withstructured states are designed to facilitate these services.40Chapter 4. FTAT Services with Transparent Fault-Tolerance^ 414.1 Swap Operation Based on the Scatter&Gather MechanismBased on scatter and gather operations, the swap operation (all-to-all exchange) isdefined. A swap operation involves symmetric scatter-and-gather by all peers in theFTAT and includes:• scattering the data packet.• receiving the related data packets from all other working peers.• scattering ACKs in response to the reception of each of the related data packets.• gatherings for all of the peers: one gather for the scatter and three gathers corre-sponding to scatters from the other peers.The synchronization point is the point where data packets and ACK packets from allother peers have arrived. Due to the autonomy of each peer, which may be driven bydifferent external events from the connected FTATs, each peer may be running differentapplication process at a particular time, so some peer may be the first in the initiationof a logical activity. The activity observation can tell the other peer systems to bringup the related application process to participate. Eventually the replica processes willbecome active as result of the buffer allocations and round robin service fashion.The swap service is formally defined in chapter 7 using Petri Nets. It is also shownto be correct.4.2 FTAT ServicesThe network layer FTAT communication manager is based on the scatter and gathermechanism described in the previous chapter. The cooperation among the four peersChapter 4. FTAT Services with Transparent Fault-Tolerance^ 42requires several rounds of information exchanges to ensure the correct operation of theprotocols. The system services provide two primitives: IRPCWrite and IRPCRead.For inter-FTAT packet routing, there is a routing table on each peer system thatrecords the current network distances to the other FTATs in the overall system. Thistable is dynamically established as inter-FTAT packets are exchanged. It is not an majorissue to be addressed and is not discussed in this thesis.4.2.1 IRPCRead ServiceFor IRPCRead, the reading peer, on behalf of an FTAT with an adjacent FTAT, scattersthe received packet to the other peers. It then waits for the ACKs to arrive from the otherpeers before ACKing the sending FTAT. This gives the sending FTAT an opportunity toreroute the packet in case the reading peer crashes right after the reception of the packet,thereby preventing packet loss. A receiving peer of an inter-FTAT packet is called "theinitiator" of the IRPCRead operation. It does not ACK the sending FTAT until all theother peers of the receiving FTAT have correctly received the packet.After the sending FTAT gets ACKed, all the peers of the receiving FTAT will startstate exchanges and attempt, in turn, to deliver the packet based on the representativeselection rule. The selection rule decides which next best peer to deliver a packet based onthe current routing table. After the selected peer has tried to deliver the packet, a round ofstate exchange (swap state packets) is performed to update the current distributed statesof the service until all attempts have failed or a delivery is successful. It is the receivingFTAT's responsibility to ensure the further delivery of the packet to the destination.If a packet reaches the destination FTAT, the peers will try to deliver the packet tothe receiving process. If it arrives at only an FTAT on its way to the destination, thepeers will try one after another in turn by the representative selection rule.Chapter 4. FTAT Services with Transparent Fault-Tolerance^ 434.2.2 IRPC Write ServiceIRPCWrite constitutes an output operation from an FTAT. The IRPCWrite service issimilar to the IRPCRead except that the first step is a many-to-many complete dataexchange among the peers of the outputting FTAT. A majority consensus is then per-formed on the data exchanged by all peers before the data is sent to another FTAT. Therest of operations are the same as those for IRPCRead.An IRPC Write packet becomes an IRPCRead at the next FTAT on its way to thedestination. This is natural since the next FTAT is the reader of the IRPCWrite.The FTAT services are based on the swap operations described in the previous section.Its formal definition of the protocol design is given and discussed in chapter 7.4.3 Start of Service OperationsWhen an application process calls an FTAT service for either Inter-FTAT communicationor fault testing, it gets blocked until the system finishes with the service and returnscontrol. Before an application is rescheduled a resumption point is taken for later repairor control resumption. For details about the resumption point, please refer to section 5.2.There are two important lists in the system: the application channel list and the bufferset list. They are all system resources. The channels in the application channel list blocksthe processes upon their calling for FTAT services. The buffer set list comprises of a listof grouped buffers for many-to-many pattern data exchange buffering.Depending on the service called, the application waits on its corresponding associatedchannel. A buffer set must be acquired before the request can be processed. A buffer setis a set of buffers that can hold all data packets as well as ACKs from all the other peersfor a group communication. The so called "structurally stated buffer sets (SS buffer) aredesigned for this purpose and is discussed in section 4.4. Only the service requests thatChapter 4. FTAT Services with Transparent Fault-Tolerance^ 44have acquired such buffer sets are processed by the system. These two stages are due tolimited buffer resources and deadlock avoidance in the buffer allocations.The peer system scans through the channel list looking for pending service requests.If such a request is found, a structurally stated buffer set (SS buffer set) for the serviceis assigned for it if any is available. The system then scans through the SS buffer sets toperform the actual service.Once a service request is in processing, there are two types of swapping of packets:data packets and state packets. The first step of an FTAT service is the data distributionfollowed by a series of the state exchanges. To the scatter and gather service, the datapackets and the state packets here are same.4.4 Structurally Stated Buffer SetSince there are limited system resources and the group communication services require alarge amount of memory, only a limited number of service requests can be processed atany one time. Such a limitation must allow the avoidance of possible deadlocks due toresource allocations at the system level.4.4.1 The Group Communication Buffer Consumption RequirementsEach transputer link is actually two separate channels, one for each direction. This avoidscommunication confusion, which results in failure of the link communication protocols.Concurrent in-bound and out-bound communications can be carried om the link. Eachcommunication on an FTAT link represents a logical operation. To allow concurrent in-bound and out-bound communications with other FTATs, separate buffers are necessary.To reach agreement on a group decision, synchronization among the peer applicationsmust be enforced at some point. The observation of activities again helps to identify theChapter 4. FTAT Services with Transparent Fault-Tolerance^ 45applications being served by other peer systems. Limiting the number of the requestsbeing processed concurrently is a way to enforce synchronization of the group memberson the services, provided that service fairness is ensured by the system.When each service primitive is called, the calling process is blocked on the correspond-ing channel. Once a service call is being processed, all peers must be able to buffer allnecessary information related to the service call for the purpose of fault-detection, mask-ing and loss claiming. This data redundancy requires more than four times of memorythan the single data packet does in an FTAT service.If all service requests are processed concurrently, the limited memory resource on eachpeer module becomes exhausted. Therefore it is impossible to process all service calls atthe same time. But what should be the minimum number of service requests that canbe processed by the system concurrently? This question is resolved in the next section4.4.2 The Deadlocks And Deadlock AvoidanceTo avoid deadlocks, each service request to be processed must be able to acquire a set ofbuffers distributed over all the peer modules. This means that the initiation of a systemservice by a peer should be honored by all the other peers in order to proceed with theservice request. Another interpretation is that all the peer systems should ultimatelyserve the same set of service requests in order to avoid cyclic waits among the peers.A partial acquisition of a buffer set in the system means that different service re-quests are being processed by different peer systems. Cyclically dependent waits areconsequently formed between parties, with different service requests being served. Adeadlock could result because the peer has to wait forever for the attention of all theother group peers for the service while some of them are expecting another service fromthe first peer!Due to the autonomy of each peer module, service requests from different applicationsChapter 4. FTAT Services with Transparent Fault-Tolerance^ 46may be processed on different peer systems. To see how the independence of the peersmay affect the system, imagine the following situation of an FTAT concerned in a meshconnected network. All the four adjacent FTATs are trying to pass something into theFTAT. Supposing that the FTAT peers all read at the same time, a SS buffer set isreserved for each of the abstract FTAT links on each of the four peers. If the system onlyhas one SS buffer set, each peer allocates the SS buffer set for the FTAT it is respondingto. A deadlock results from the above situation since no service initiative from a peer ishonored by all the other peers within the FTAT.To avoid deadlocks in buffer allocations, the solution is to ensure the same set ofservice requests are served by all the working peers. Since there are two concurrentchannel communications (In-bound and Out-bound) on each link, there will have to betwo separate distributed buffer sets.As can be seen easily, there are at least four possible concurrent service requestsbeing processed at a time for each type of service primitive in the system. Thereforethe minimum number of buffers on each peer is four for each type of service primitiveif only one service initiation is allowed on each peer. Through activity observation, theservice requests being processed by the other peers can be passively started. All thepeers eventually reach agreement on which service requests to process without muchnegotiation and any deadlock.4.4.3 The Service Buffer Set DesignAs mentioned before, different types of buffer sets are used for different services: namely,write buffers and read buffers. A Write buffer set require more memory than a readbuffer does because it has to be able to hold packets exchanged with all the other peers.In the system design, each peer system is allowed to initiate one service at a time foreach service type. All observed services through the concept of activity observationChapter 4. FTAT Services with Transparent Fault-Tolerance^ 47should also be processed during the same period. As a result, the processing of sameset of service requests is always guaranteed. Hence, up to 4 services of each type can bedealt with by the system concurrently.This buffer design helps the system to enforce synchronization among the FTAT peers.Because up to only four outstanding service requests can be processed at any time on thesystem, other processes are either blocked or in execution (ready state and eventuallywill be blocked). The absent processes, which have passively acquired a service of thesystem, will ultimately be brought up on each peer. Therefore, the peer systems arebrought into synchronization.In the design we assume that applications are responsible for avoiding deadlocks atits logical level.Service fairness is also very important to ensure the timeliness of the system. As men-tioned in section 4, there are two important lists in the system: the application channels,and the SS buffer sets. The requests from the applications should get fair services so thatthey can behave deterministically. The same is true for the structurally stated buffer sets.Consequently, both application channel scanning and buffer set scanning are processedin a round-robin fashion.4.5 Structured StatesAs the services proceed, different events can occur. These events have impact on thelater operations and have to be recorded as states. Because the three logical levels of theFTAT network layer are implemented as one, the states of the system are structured. Acollection of similar events can contribute to the occurrence of some (aggregate) event.For each data packet sent on an internal link, 3 ACKs are expected for it. For each datapacket received, two extra ACKs accompany it. The arrivals of the 3 direct ACKs triggerChapter 4. FTAT Services with Transparent Fault-Tolerance^ 48the event ACKGathered.The aggregate state should be efficiently triggered without constantly polling all in-dividual states. For this reason, the states of the FTAT services are structured. Forexample, the end of a complete swap is indicated by the event that all data and ACKpackets from all working peers have arrived.The structured state transitions are triggered by the possible inter-FTAT events,changes of link status and processor status. Whenever a new event for some state hasoccurred, it is checked to see if the transition condition is met for its aggregate state.The system handles the inter-FTAT events similar to responding to interrupts.4.6 Fault MaskingIn the system, we assume that the network layer is implemented with omission failuresemantics [17] so that the key issues identified in the introduction can be focused. Whenfaults are detected in any of the peers, they will show up in the distributed status afterthe first round of the states exchange. The peers then attempt to deliver the packet inturn according to the Representative Selection Rule explained below. The correct resultof the system is always present at any given time, provided that no more than two faultsoccur at the same time.4.6.1 Selection of the Representative & State SwappingOnce the majority consensus is done in the FTAT, the peers will select a representativeto deal with the next FTAT to deliver the output to the best possible route based on thedynamically established routing tables.The distributed algorithm to select the representative peer is based on the statesChapter 4. FTAT Services with Transparent Fault-Tolerance^ 49exchanged among the FTAT peers after the data packet distribution. A state packetcontains the result of the majority consensus for the application, and the distance toreach the destination FTAT from the peer. A peer is elected if its status is not marked"FAILED", its distance is the shortest and the module number the smallest.4.6.2 State SwappingAfter the majority consensus is done, the status of each of the replica processes is de-termined by the system. The first round of swap is performed with such information asthe outcome of the majority consensus and the distance to the destination FTAT. Thenthe representative peer currently selected will try to deliver the packet to the destinationon behalf of the FTAT, while all the others wait for the outcome of its attempt. Therepresentative peer broadcasts the outcome together with the distance to the destinationFTAT. This process is called state swapping because the data exchanged form thedistributed states of the system.After the attempt to deliver the packet, the status of the peer for the particular serviceis marked "FAILED" if the attempt fails due to link failure or processor failure. Sucha peer will be excluded from the set of candidates eligible to serve as the representativelater. The next one will then be selected. This process continues until an attempt issuccessful or all the peers have been used.Here the state "FAILED" means the failure of a delivery attempt instead of the failureof a peer process.4.7 Failure HandlingThe FTAT protocols must be able to provide correct services even in the face of linkfailures and processor failures. If the failures are left undetected, the group members inChapter 4. FTAT Services with Transparent Fault-Tolerance^ 50communication may be confused, and/or the group membership may not be consistentany longer among the peers. Consequently, the protocols will fail.4.7.1 Link FailureIn the system, the transputer links are monitored by a Watchdog with associated watch-dog timers. Each time a packet is received, the watchdog timer is reset. If the bufferqueue is empty, the writer worker process will send "probe" packets periodically. If thelink is dead, the timer will never get reset. Thus, eventually link failures will be detectedby the watchdog timer.Repairing a failed link involves resetting the link channel and rescheduling both pro-cesses. Normally a process blocked on a failed link never comes back. However, theblocked process has to be rescheduled in order to proceed. A simple link channel resetmay result in the loss of the process since its break point is stored in the link channelport address. To solve this problem, we save a resumption point before calling the com-munication instruction. When the failure is detected, the channel can be reset and theblocked process is rescheduled by the system from the resumption point saved.4.7.2 Processor FailureProcessor failure detection is not straight-forward since a conclusion can only be drawnbased on the link status from the other working peers. During each packet exchangebetween the peers, the distributed link statuses are passed to the receiving peers. Thislink status vector is maintained by the system watchdog and updated by the peer system.The entry for the local peer in the link status vector is maintained by the watchdog.Whenever a link failure is detected, the corresponding bit is set to indicate the failure.The other entries in the vector are updated by the peer system when such a vector isreceived from another peer. The entry for the provider peer of the vector can be directlyChapter 4. FTAT Services with Transparent Fault-Tolerance^ 51copied. However, if a link fails, no link status vector can be received from the link. Whena link failure is detected, the peer on the other side becomes "suspicious". In this case,the entry has to be worked out based on view of the third party peers. Only when allthe other peers shows that its link with the suspicious peer has failed, then the failure ofthe peer can be asserted.To ensure that the protocol works properly, the updating of the link status about anunreachable peer is designed as below:/**************** Link Status Vector Updating Algorithm ****************/; Peer is the sender of the currently received link status vector.; Local is the receiving peer.; LinkVector is the local Link Status Vector.; ReceivedVector is the received Link Status Vector.FOR ( PeerScan = 0 TO 3 )IF ( PeerScan = Peer ) CONTINUE;IF ( PeerScan = Local ) CONTINUE;link = (PeerScan - Local) MOD 4;IF ( LinkVector[Local] [link] = BAD )IF ( ReceivedVector[Peer] [(PeerScan-Peer) MOD 4] = BAD )LinkVector[PeerScan][(Peer-PeerScan) MOD 4] = BAD;ELSELinkVector[PeerScan] = ReceivedVector[PeerScan];AnotherPeer = Peer XOR PeerScan XOR Local;IF ( LinkVector[Local] [(AnotherPeer-Local) MOD 4] = BAD )Chapter 4. FTAT Services with Transparent Fault-Tolerance^ 52LinkVector[AnotherPeer][(PeerScan-AnotherPeer) MOD 4]= LinkVector[PeerScan][(AnotherPeer-PeerScan) MOD 4];ENDIFENDELSEENDIFENDFORUsing the link status vector, the status of a peer can be determined be the algorithmgiven below:/********** Node Failure Detection Algorithm *************/; Peer specifies the "suspicious" peer; Local is the receiving peer.; LinkVector is the local Link Status Vector.; NodeStatus is the Node Status Vector.NodeVector[Peer] = BAD;IF ( LinkVector[Local][(Peer-Local) MOD 4] = BAD )FOR ( PeerScan = 0 TO 3 )IF ( PeerScan = Peer ) CONTINUE;IF ( PeerScan = Local ) CONTINUE;link = (PeerScan - Local) MOD 4;IF ( LinkVector[PeerScan][link] = GOOD )IF ( LinkVector[PeerScan][(Peer-PeerScan) MOD 4] = GOOD )NodeVector[Peer] = GOOD;EXIT;Chapter 4. FTAT Services with Transparent Fault-Tolerance^ 53ELSEAnotherPeer = Peer XOR PeerScan XOR Local;IF ( LinkVector[PeerScan][(AnotherPeer-PeerScan) MOD 4] = GOOD )IF ( LINKVector[AnotherPeer][(Peer-AnotherPeer) MOD 4] = GOOD )NodeVect or [Peer] = GOOD;EXIT;END IFEND IFENDELSEEND IFENDFORELSENodeVect or [Peer] = GOOD;ENDELSETo see that these algorithms are necessary, let us examine the following scenario: thepeer concerned is trying to determine the link status of a third party peer A. Suppose thelink between the two has failed. Peer B is also not reachable from the peer concerned,but peer C is. The link status vector provided from peer C shows that the link betweenA and C is also dead, how the status of peer A can be worked out. The peer can notdetermine the status of peer A until all other working peers are consulted. If peer B seespeer A is alive, then peer A is alive since it is still connected to the system.Chapter 5On-Line Forward Fault-RepairThe on-line forward fault-repair in the system aims at exploiting the spatial redundancyto repair faulty application processes. Forward fault-recoveries for real-time systems areachieved by using the volatile data redundancy, without checkpointing and roll backs,to handle transient and intermittent physical faults. To accomplish such fault repair,some of the issues that are common to Process Migration [23][24][25] must be solved,as identified in the introduction. The challenge here is that the fault-repair has to beable to handle reference rebinding problems [23], without use of shared virtual memory.We have developed a scheme to separate the application processes from their run-timephysical environment so that their contexts are movable for fault-repair.For on-line fault-repair, both the availability and the reliability of the system are tobe maintained. In this research, the repair aims at not only reconfiguring the redundantmodules but also retaining the availability of the redundant modules, unless permanentfailures have occurred to the majority modules.The system recovers the faulty processes by using the redundant volatile data in therun-time environment, the resource management information during the run-time of thesystem and the Pointer Reference Stack (PRS). Some restrictions to the applicationson the system are imposed in order to focus on the development of the fault repairmechanism and reduce the complexity of the processing:• Processes are required to be loaded and created statically in the same order.54Chapter 5. On-Line Forward Fault-Repair^ 55• No global sharing data are allowed among the processes.• Global data variables for each application process have to be bound in a single datastructure.• Physical pointers, allowed in a process, are only those created on the run-timestack.• Physical pointers are not allowed in the heap allocations, but can be replaced by alogical pointer implementation.• Global physical pointers are prohibited in the programs.However these restrictions are not necessary for volatile fault-repair. Any of these re-strictions can be removed at more processing cost.5.1 Issues in Fault RepairFor fault-repair from the system level, the semantics of the high level logical faults aredifficult to know. As a result, it is hard to do an incremental fault recovery and fullrecovery has to be employed. In addition, the repaired process has to be reintergratedinto the system without interfering with smooth running of the system.Many fault-tolerant systems use check-pointing and roll-back schemes. As mentionedearlier, these types of systems pay a high cost in check-pointing by logging or journaling[7][8] through persistent storage. In addition, the system has to undo previous operationsthrough roll backs until the effect of a fault is totally eliminated. Such recoveries mayrequire a chain of global roll-backs when distributed group cooperations or collaborationsare involved. However, the time previously spent for the operations being undone iswasted.Chapter 5. On-Line Forward Fault-Repair^ 56To preserve the timeliness of the system, we use the volatile execution image torepair the faulty process. This approach is especially useful for fixing transient faultswhich cause only damage to the volatile states of a processing module.Using the run-time context of a process on one peer module to repair a faulty processon another is not simply a matter of copying the run-time context. A difficult problemthat needs to be solved is pointer reference rebinding [23]. In the V system [23] and theAccent system [25], shared virtual memory is used to deal with the pointer handling.A shared virtual memory implements disk-based (persistent) data objects. It is usu-ally based on a memory page-addressing mechanism so that the essential context canbe used immediately. A non-existent pointer reference can be "paged in" later from theshared virtual memory upon a page-fault interrupt. The data of the pointer reference isthen copied to the local memory cache.This seems to be a simple solution but it burdens the system in the following ways:• Shared virtual memory itself suffers expensive overheads in maintaining the dataconsistency between memory cache and disk storage distributed over the machines[2].• Since the run-time process contexts are also persistent data objects, copy-on-demand [23] may result in extra disk I/O access. In addition, the startup costsbecomes observable for intermittent disk I/O access if the reference straddles sev-eral memory pages due to the nature of copy-on-demand.• Shared virtual memory can cache a virtual memory page only on a page basis. Thismeans more intermittent remote virtual memory page requests to be processed,which takes more time than does consecutive loading of the context.Volatile fault-repair is difficult since the run-time pointer references are hard to copewith. When pointers are copied to another machine they become meaningless unlessChapter 5. On-Line Forward Fault-Repair^ 57both machine carry exact memory images. They may be pointing to different data instead of those that are bound to the process.5.2 The Processing of The Fault RepairThe fault repair is started by the peer system which has detected a fault in a peerapplication process. The peer sends a repair request to one of the non-faulty peers andthe repair is started. While the repair is be performed, the other peers wait until thefault repair is completed.The consistent volatile part of a replicated process can be provided by the maskingredundancy FTAT and can be used to fix a faulty peer process. While the correspondingmemory parts can be copied from one process to another, pointers have to be readjustedto point to the same logical data because of the different memory images on the differentpeers due to their autonomy. The problem is how the pointers can be handled correctlyand efficiently.By the assumed restrictions, each peer should have the same image for the code partas in the memories on all the peer modules. Since the global data is bound in a singlestructure and contiguously laid out in memory, the only things needed are the referenceand the size of the structure. All the global data of a process can be given by a singlereference, moved around and updated as one entity without having to know the sizesand semantics of particular data. The information about the global data is part of anentry in a process management table. Process management tables are maintained bythe system for the conventional process management [33]. All memory heap allocationsfor each process are maintained in its associated resource allocation table. The resourceallocation tables, common to operating systems, provides the sizes and addresses of heapallocationsChapter 5. On-Line Forward Fault-Repair^ 58A repairing table is prepared by the repairing peer when the repair is started. Thistable contains the information such as the address and the size of the run-time stack,information about the resource allocation table, the address and size of the global datafor each process, and information about the Pointer Reference Stack (PRS), etc.The table is then packaged and delivered to the faulty site, followed by the run-timestack, the global data for the process, the heap allocations and the PRS. The resourceallocations bound to the faulty process are released and then reallocated to be certain ofthe same resource allocations for both the faulty process and the repairing process. Thereceived copy overrides the corresponding items at the faulty site. The pointers used inthe process are then relocated to complete the repair.On a transputer, when a process is rescheduled, the microcoded scheduler of thetransputer pushes all the registers onto the stack of the process and the stack pointer,work space pointer [34], yields the resumption point of the peer process. A resumptionpoint of the repairing peer process is taken before the FTAT service is started as statedin section 4.3. Because the repaired process is consistent with the repairing process, thisresumption point is used for the repaired peer process to resume control.In essence, the correct handling of pointer references must ensure the following twoconditions exist between the two parties involved, after the fault-repair:• The corresponding referenced data in both parties should be consistent.• The corresponding pointers should point to the same logical data.As long as these conditions are met, the repaired process will be in a consistent stateswith the other peers, no matter what have happened to the pointers and their referenceddata in the past. Therefore no "undo" or roll-back is necessary.Chapter 5. On-Line Forward Fault-Repair^ 595.3 Handling Resource Rebinding ProblemsIn order to be able to move the process context around among the peers, the system hasto keep track of all the resource allocations, the birth and death of the pointers on thestack, in addition to those for the regular system management.From the assumed restrictions, use of global pointers are prohibited in applicationsfor simplicity. Pointers in heap allocations are not allowed but can be replaced by thelogical number implementations. The pointers on the run-time stack are dynamicallyestablished during the run-time of a program. Only pointers used in the current executionpath appear on the run-time stack. Other pointers vanish as the stack pops after returnfrom procedures.The resource allocation information is maintained in the resource allocation tables bythe system when a system resource is allocated to the application processes. There isone such table for each application process. Only memory heap resource is considered inour system.Only those pointers currently used along the process execution path are relocated.Hence, the number of pointers to be processed is minimal. The PRS, the process tables,and the resource allocation table, together with the assumed restrictions, facilitate theseparation of applications from the run-time physical environment.From the fault repair process, the first thing to be relocated is the PRS of the faultyprocess, because the PRS holds a copy from another peer. It gives the list of the addressesof the pointers used in the process. The local PRS overwritten by the received PRS mustcorrectly locate the pointers residing on the run-time stack. The addresses of the run-time stacks for both peer processes are available, so it is quite easy to relocate thesepointers by the difference of their address values.PRSs lists all the pointers that need to be relocated. These pointers may point toChapter 5. On-Line Forward Fault-Repair^ 60different type of memories, including the global data area, the run-time stack, and thememory heap allocations. They should be processed accordingly. Global data referencesand references of data on the run-time stack are simple. The references of data in theheap allocations have to be processed corresponding to the referenced allocations.Starting with scanning the received PRS, the referenced heap allocation is looked up inthe received resource allocation table for a pointer given by the PRS. The identification ofthe allocation also locates the corresponding local resource allocation. The correspondingpointer from the local PRS can be then relocated by the information about the localresource allocation and the allocation in the received resource allocation table.5.4 Pointer Reference StackThe PRS is designed for tracking the pointer creations during the program executionso that the system can use the information in the PRS to handle a minimal number ofpointers. The PRS records the pointer creations all the way along the process executionpath. It contains information such as the pointer value, the type of resources pointed to.Use of PRS can be automated through a compiler so that PRS operations can bedone transparently. The introduced overhead is not noticeable to a program since it isonly called twice (push and pop) per pointer per function call.Chapter 6Reliability Analysis6.1 System ReliabilityReliability modeling for the system is derived with the following system parameters andmodule parameters:System Parameters:c — the probability that a fault in the system is detected.r — the conditional probability that a fault is repaired after its detection in thesystem.Module Parameters:t — timeA — fault rate during power-on.The reliability analysis of the system is based on the assumptions given below:• All modules are identical and have the same reliability.• Fault distributions Rm(t) for all modules are exponential and identical, i.e. R, =et.• Fault detection and repair are done instantly.• All modules are independent of each other except when a repair occurs.61Chapter 6. Reliability Analysis^ 62• The time between two consecutive acceptance tests <T, where T is some constant.• All modules are symmetric with respect to each other since they are fully connected.• c and r are constant.To obtain the system reliability Rrc, we first calculate the system failure probabilityFcr. The system failure probability has three components. The first part F1 is caused byconcurrent faults that occur during T. The second F2 is the sequential fault effect thatoccurs over time t where t > T. The last one, F3, results from fault repairs, correspond-ing to the case when faults occur on two modules. One fault is detected while the otheris not and the undetected fault is on the repairer. The successful repair then results inthe propagation of the fault and ultimately the failure of the protocol.The derivation of F1The faults that occur during the time between two consecutive acceptance tests, T, aredefined as concurrent faults since they have to be dealt with concurrently. If more thantwo concurrent faults occur during the time period of T, the system crashes. F1 is theprobability that all four modules have faults or three out of the four modules have faultsduring T.This is can be obtained from the probability that faults occurred on all the peersduring a period T plus the probability that three faults occurred on any three of the fourpeers. The first term in Eq. 6.1 below determines the former while the latter is givenby the second term.F1 =(1 — Rm)4 + (4 ) [1 — Rni]3fen, if t < T^(6.1)3Chapter 6.^Reliability Analysis 63Substituting Rin with e'' in Eq. 6.1,F1 ,---- (I — cAt)4 + 4( [1 _ e—At]3e—At t < T (6.2)3The derivation of F2The sequential faults are repaired by the on line repair mechanism. Transient faultscan be tolerated and the faulty modules are made available again to the system afterbeing repaired. The failure probability F2 comes from the situation in which the systemhas suffered two sequential faults, but has been unable to detect or recover from thesefaults and a third fault occurs. In this case, the system can no longer perform majorityconsensus correctly based on the dynamic TMR scheme.Since all modules are symmetric to each other, each of the sequential fault occurrencesequences has the same probability. There may be 4! such sequences, so F2 is the sum ofall of the sequences. The fault detection probability c and the probability of successfulrepair r can change the system failure probability substantially, as can be seen below.F2 := 4![(1—c)-Fc(1—r)J2 I t —d (1—Em) f t —d (1—R,) f t —,d (1—Rm)Rmdt3dt2dti t > To dti^n dt2^a at3(6.3)where (1 — c)--1- c(1 — r) is the probability that a fault is not detected, or is detected butnot repaired on a peer module. After two such faults have occurred, a third module faultcrashes the system immediately due to the dynamic TMR scheme.Substituting II, with et in Eq. 6.3,F2 = [ ( 1 — c) + c (1 — r)]2[1 — 6e-2At -4- 8e-3At_ 3eitj if t> T^(6.4)—x,We can further simplify Eq. 6.4 and obtain the following:F2 = [ ( 1 — c) + c (1 — 012(1 — e-At)3(1 + 3e-Ai) t > T^(6.5)F3 = 2 x -i-1 ( 42c)cr(1 — Rm)211,2 t <T^(6.6)Chapter 6. Reliability Analysis^ 64The derivation of F3F3 is the probability that when faults occur on only two modules one of them is detectedwhile the other is not and the latter appears to have successfully repaired the formerusing its run-time context. This condition results in the propagation of the fault on thelatter and ultimately the failure of the protocol due to the dynamic TMR scheme. F3results from the cases when two and only two concurrent faults occur. It is obtainedas given below under the assumption that the repairer is selected from the non-faultymodules with equal probability:The system failure probability is:{ Fi + F3 if t < TF=^ (6.7)F2^if t > TThus the reliability is _FP = 1 — F. Here, Eq. 6.2 and Eq. 6.6 are probabilities over thetime period T. Eq. 6.5 reflects the contribution to the reliability from the on-line forwardfault repair. The higher values of the probability c and r, the lower the system failureprobability.Figure 6.9 shows that the reliability increases remarkably by the on-line forward faultrepairing even when the fault-detection probability and the fault repair probability arelower than 0.5. The curve in "*" in the figure is the reliability of one module.6.2 Network Connection ReliabilityThe probability that packet exchanges between a pair of nodes can still be conducted inthe event of link failures in the network is defined as the connection reliability R.10.9x->-,^.. , xxx,...^* )1( A( x^.^•. — --- i- - — -)K— 1411. 6. N5!4=AlOig * * * * * * * w v 100K .0.2^0.4^0.6^0.8^1^1.2^1.4^1.6^1.8^2time (sec.) 4X 1 0Figure 6.9: Reliability vs. fault detection prob. c and successful repair prob. r.Chapter 6. Reliability Analysis^ 65The connection reliability between any two peers of the network within an FTATis enhanced by employing the concept of activity observation. A packet sent from ascatter is transformed into three redundant signals: one original packet plus two extraACKs. This gives three chances for the receiver to know or receive the packet and takeproper actions. As a result, the reliability of the connection, between any two peers inthe system, is increased by the use of redundant ACKs.Let R1 be the reliability of a transputer link, and R, be the reliability of the connectionbetween the two peer across a link. Assume all transputer links have the same linkChapter 6. Reliability Analysis^ 66reliability RI. The probability of a connection failure is obtained and explained below:F, = (1 — R1)3 + 2(1 — R1)2/01 — R1)2 + R1(1 — R1)21^(6.8)Starting with the view from a node of a connection in the network, the first term isthe probability that all links of the concerned node in the connection have failed. Thesecond term is the probability that the other two nodes can not provide a route for theconnection due to their link failures, assuming that these two nodes are symmetric andindependent of each other as well as of the node pair of the connection.Eq. 6.8 can be further simplified to the following:F, = (1 — R1)3(1 + 2(1 — TO)Ri)^ (6.9)So the connection reliability R, is:R, = 1 — (1 — R1)3(1 + 2(1 — R?)Ri)^(6.10)There are five possible routes in all possible link failures provided that the systemis still connected. The extra ACKs, scattered in response to a data packet received,transform the transmission into four more opportunities to let the intended receiver takeproper active actions.Assuming that the link fault distribution is exponential, R1 is then e-A'. Figure 6.10shows the contribution from the activity observation scheme based on the multiple linkgroup communication. The dotted line is the link reliability RI, and the solid line is theconnection reliability R.Chapter 6. Reliability Analysis^ 670. 11000 2000 3000 4000 5000 6000 7000 8000 9000 10000timeFigure 6.10: The connection reliability vs. the link reliabilityChapter 7Petri Net Model of the FTAT Multilink Group CommunicationIn this chapter, the FTAT protocol is formally defined using Petri Nets. By playing withtokens in the net of the protocol modeled, one can derive all the behavior. Using certainproperties of Petri Nets, the protocol design is informally shown to be bounded and live.The implementation of the design demonstrates that it is functional. Consequently, thedesign is shown to be correct.Petri-Nets are a useful modeling tool for describing the dynamic behavior of con-current systems. A variation called the Predicate/Transition PetriNet (Pr/T Net) [31]allows the formal treatment of individual tokens, their changing properties and relations.A Pr/T net can greatly reduces the model size and is very useful in replicated/redundantprocessing modeling, especially for systems with multiple redundant or duplicated mod-ules.Pr/T net is developed from normal Petri Net theory. It differs from the latter inthat inscriptions are allowed to places (predicates), transitions, as well as arcs betweenplaces and transitions. The inscription can be a formal sum of tuples of individual to-kens/variables. Upon transition, individual variables inscribed to the arc are substitutedby the corresponding individual tokens that meet the condition inscribed of the transi-tion, removed from the predicate, or deposited into the predicate.The Pr/T net model of the system is structured into three levels of abstraction, i.e.,the scatter and gather, the swap, and the FTAT services. Each of these levels servesthe basis for the higher level, with the scatter and gather at the lowest in the model.68Chapter 7. Petri Net Model of the FTAT Multilink Group Communication^69This structure is followed in the analysis of the system model. The scatter and gatherservice is the processing core of the multilink group communication for the interactionsamong the group peers. The swap protocol is based on the scatter and gather services,and provides service to the IRPCWrite and IRPCRead protocols. Our discussion startswith the scatter and gather protocol to hide the higher levels from the details of the peerinteractions for easy analysis.First the following notational convention is used in the Pr/T net inscriptions:• fle,(< 60, el, e2, ...en) where 0 <i < n specifies all the elements with respect to allthe values of ei;• H(S), where S is a set, specifies all the elements available in the set.• S, specifies the set of all the elements in S excluding s.• IXI is a function that gives the number of elements in X where X is a set.The discussion of the system model assumes the following:• the protocol is built on top of a reliable packet channel. The channel guarantees thecorrect delivery of packets and preserves the order, in which they are sent, withoutany duplicates.• in order to concentrate on the protocols and simplify the analysis, the transferof packets through bidirectional link channels are abstracted by a single virtualpredicate (a place) to simplify the analysis.To aid in presenting the model in different layers, the following notation is used:• The predicates with a ground sign I are from other nets in a different level ofstructure.Chapter 7. Petri Net Model of the FTAT Multilink Group Communication^70• A packet token is used with the format < Sd, Rc, Or, Pk > whereSd - the sender.Rc - the receiver.Or - the originator of the relate data packet.Pk - the packet type which is the one of the following:pack - data packet.ACK - acknowledgement packet.claim - lost claim packet.special - an ACK packet being returned to its forwarder.7.1 Modeling Scatter & Gather OperationsThe following predicates are used in the protocol modeling:• Receivers: the receivers of a scatter operation.• Sender: the sender of a scatter operation.• PacketInQue: the sender(s) has sent the packets to the specified receivers throughthe proper channels. This is a virtual predicate to hide the direct link connections.The following is assumed: once a packet (a token) is in the predicate, it will fireto the proper predicate corresponding to the given destination provided the corre-sponding link has not failed. When a linkage has broken (link failure), a token (apacket) in this predicate disappears if it is routed to the failed link.• ReceivedPackets: the receiver has received the data packets from the sender.Chapter 7. Petri Net Model of the FTAT Multilink Group Communication^71• ObservedACKs: the receiver has received the ACKs from other receivers involvedin the same scatter.• DirectACKs: the sender has received the ACK(s) from receiver(s) of the scatteroperation.• Gathered: the receiver has received all the packets, from all the other peers relatedto a scatter operation (ACKs and the data packet).• ACKGathered: the sender has received all the ACKs in response to its scatteroperation.• LinkOK: the links are known to be working properly on the processor.• LinkBad: the link(s) are known to be dead on the processor.• NodeBad: the processing module(s) is known to have crashed.Figure 7.11, figure 7.12 and figure 7.13 together show the Pr/T net model of thescatter and gather protocol. The model depicts the behavior of the scatter and gatherservice in terms of the direct interactions among the redundant peer modules of an FTAT.It consists of three parts in a modular fashion. The main part, given in Figure 7.11,defines the behavior under no-failure conditions. The link failure handling part, shownin Figure 7.12, describes the behavior when link failures occur. The invariants part,defined in Figure 7.13, represents the invariant assertions about the system using the"dead" transitions [31]. "Dead" transitions are conceivable facts but impossible eventsof the model.In the figure, set L is the set of working links; set R is the set of peers of an FTATwhich is equal to P; and s E S = P. For ease of representation, the set notation is usedto show the individual tokens so as to fit in the diagram.PacketInQue ReceiversReceivedPacketsACKGathered PacketInQue ObservedACKtChapter 7. Petri Net Model of the FTAT Multilink Group Communication^72Figure 7.11: Scatter & Gather Pr/T net Model7.1.1 BoundednessDefinition: A P/T-net N is called bounded if MN is finite and there exists n E N suchthat for all M E [MN > and all s E SN, M(3) < n, where M is a marking, [MN > theset of all derivable markings of the net.First it is clear that there are finite number of predicates in the scatter and gatherPr/T-net model (SG net). Without considering the link failure handling part of thenet, the net NsG consists of single loops in one direction. PacketInQue is a virtualpredicate. Therefore the number of markings that can be derived from this net is finite,t12 Rc != r AND 12c = Ort13ReceivedPackets<r, S& o, claim>PacketInQue<Sd, r, o, claim>4^ObservedACKtllChapter 7. Petri Net Model of the FTAT Multilink Group Communication^73ReceiversLinkBadFigure 7.12: The link failure handling part of Scatter Sz Gather Pr/T neti.e., [Mivs, > is finite. We then need to prove that any marking M E [MN,G > is finite.Assume at the initial marking Mo we have:Mo =M(Sender)+M(Receivers)+M(LinkOK)= s+Rs-I-L;Then we follow the firings to the possible markings in NSG:=M(Receivers)+M(LinkOK)+M(PacketInQue);M2 =M(Receivers)+M(LinkOK)+M(PacketInQue)+M(ReceivedPackets);M3 =M(Receivers)+M(LinkOK)+M(PacketInQue)d-M(ReceivedPackets)+M(ObservedACKs)+M(DirectACKs);M4 =M(LinkOK)+M(PacketInQue)+M(ACKGathered)+M(ReceivedPackets)+M(ObservedACKs)+M(Gathered);M5 =M(ACKGathered)+M(Receivers)+M(LinkOK);At any time in the net,ReceivedPacketsChapter 7. Petri Net Model of the FTAT Multilink Group Communication^74Sender^ObservedACK^PacketInQueACKGathered^Receivers^GatheredFigure 7.13: The invariant part of the Scatter & Gather Pr/T net modelM(Sender) 5_ Isl;M(Receivers) 5_ IRs1;M(LinkOK) < ILI;M(PacketInQue) 5_ IsI * ILI + IRsI;M(ObservedACKs) <^* LI — ILI,M(Gathered) < IRs1 — ILi;Now it is very easy to see that any marking M E [MNsG > is a finite number n E N,since any finite linear combination of finite numbers is still finite number. Hence the netNsG is bounded.7.1.2 LivenessDefinition: Under any marking M E [MN >, if there always exists an enabled transition,then the net is called live.Deadlock can result when the set of transitions directed into a set of predicates belongsto, or equals the set of the transitions directed out from the same set of predicates, i.e.,-S C S. [32].Chapter 7. Petri Net Model of the FTAT Multilink Group Communication^75The possible deadlock can only result from the following two situations: 1) waiting fora direct ACK from a "dead" peer. 2) waiting an observed ACK on a channel of brokenlink.Sender sideThe sender gets blocked at DirectACKs predicate. Link failure handling can guaranteethe delivery of the data packet as well as the ACKs, unless the receiver crashes. TheNodeBad predicate can prevent the net from entering the "deadly" situation of waitingfor the response from that failed processing module. Consequently, the sender is alwayslive (deadlock free).Receiver sideA receiver depends on the sender as well as the other receivers. LinkBad predicateeliminates the dependency on a receiver for an observed ACK as soon as the failure ofthe link is detected. The Activity Observation concept is reflected in the link failurehandling part of the net model and guarantees the correct reception of the data packetprovided that the receiver concerned is still connected to the FTAT network. The linkstatus is monitored and maintained by the peer system periodically. Once a link failureis detected, the waiting on the channel of that link is terminated by depositing a tokenfor that link into ObservedACK. Therefore, the net for the receiver has always at leastone enabled transition M E [MNs, >. The system is thus deadlock free.7.1.3 Link Failure HandlingSince this part of the model, shown in Figure 7.12, is mainly responsible for correcttransfer of packets between peer systems involved in a group communication, it simplyChapter 7. Petri Net Model of the FTAT Multilink Group Communication^76shows the rerouting of data packets and ACKs when link failures occur. When an ob-served ACK is received, a lost claim packet is sent at the transition t12 to request thedata packet from the known recipient. After the data packet is received, the ACK is thenpassed to the recipient that provided the data packet.Deadlock occurs when an ACK, being forwarded, loops back and forth around theworking peers without a way to reach the originator. This condition is prevented by thedetection of the special packet. A special packet can only be sent to a new peer modulebut never returned. If there is no way for the packet to be forwarded, it is dropped, fromthe assumption about the PacketInQue. This can only happen when the originator isno longer connected to the system. When such failure is detected, the scatter can proceedas the transition t9 deposits the expected token in DirectACK in Figure 7.11.When a direct ACK is received and not addressed to the current receiver, it is routedto the best available link. This is reflected by the condition inscribed to the transitionti2 : Rc r AND Rc =--- Or.The availability of the data packet for possible lost claim is ensured by the transitiont6. As long as the receiver system has working links with the other receivers within theFTAT it will wait for the observed ACKs from them. This yields the opportunity fordetermining the status of other peers or providing the data packet upon request.On the other hand, if the link is broken, there is no need to wait since the purpose ofan observed ACK is for knowledge of link connections in the system. Transition t8 servesthis purpose.7.2 Modeling the Complete Swap ServiceIn [31][321, a special set of places, called S-invariant set, is defined. This set has thespecial property that the total joint token count in the net remains invariant during theChapter 7. Petri Net Model of the FTAT Multilink Group Communication^77transition firings of the net. It can be used to show the boundedness of a net. One ofits properties is, as a proofed theorem in [32], that if an net N is covered by S-invariantsthen the net is bounded.A P/T net is said to be covered by S-invariants if, for each place 3 e SN, there existsa positive S-invariant i of N with i(s) > 0. This is defined in [32].The model here describes the behavior of the complete swap service based on thescatter and gather services. Figure 7.14 shows the Pr/T net model of the swap serviceprotocol, Nswap. In the model, the following predicates are used:• SEMPTY: the buffer for swap is empty.• SACKOBSERVED: the ACK(s) observed is/are in the buffer.• SPDULOSTCLAIM: the buffer is expecting a lost packet from the claim.• SPDUOBSERVED: a data packet has arrived in the buffer.• SACKSCATTERED: ACKs have been scattered for the buffer.• SACKGATHERED: the buffer has gathered packets from all the peers.7.2.1 BoundednessFor a simple analysis, we can ignore the predicates from the scatter and gather net sincethey remain the same after the related transitions fire. The swap net does not changemarkings of the underlying net. A simplified net of the model is therefore shown inFigure 7.15.We now construct the incidence matrix [31] and try to find an S-invariant. Theinvariant vector i can be acquired by solving the linear equation: Ns' „„p x 7 = ii.•SACKOBSERVEDChapter 7. Petri Net Model of the FTAT Multilink Group Communication^78PacketInQueuer4079SPDUOBSERVEDSEMPTY tots<S, r, o, pack>-0SACKSCATTE<S. r• o' pack^ReceivedPackets<5,1, o, pacb•ED•ReceivedP I ckets *#*t4^S r;t2<s, r, o, ack>S AND o !=rit, 0 ct>ObservedACK•SPDULOSTCLAIIVILinkBadGatheredfl(S)•SACKGATHEREDFigure 7.14: Swap operation Pr/T net modelSACKOBSERVED SPDULOSTCLAIMSACKSCATTE ED'411SACKGATHERED(5toSPDUOBSERVEDSEMPTY-Chapter 7. Petri Net Model of the FTAT Multilink Group Communication^79Figure 7.15: Swap operation POT simplified net modelChapter 7. Petri Net Model of the FTAT Multilink Group Communication^80The incidence matrix and the invariant vector of the net Nswap are shown in Table 7.1where all the elements in the invariant vector are greater than zero. Using the definitionand the property stated at the begining of section 7.2, which are given by definition (a)and theorem (d) of Section 6.2 in [32], we know that the net is covered by S-invariantsand therefore draw the conclusion that it is bounded.Predicate/Transition 0 t1 t2 t3 t4 t5 t6 t7 InvariantSEMPTY -1 -1 4 1SACKOBSERVED 1 -1 -1 1SPDULOSTCLAIM 1 -1 1SPDUOBSERVED 1 1 1 -1 1SACKSCATTERED 1 -1 1SACKGATHERED 1 -4 1Table 7.1: Incidence Matrix and Invariants7.2.2 LivenessIf we were to ignore the predicates used from the underlying net as in Section 7.2.1, underany marking in the net, there is always at least one transition enabled in Ns„cip as canbe seen directly from Figure 7.15. However, the firings are dependent on the underlyingnet, as the net is driven by the token movements of the underlying net. Therefore weneed to prove that given this dependency, under any marking M E [M.— Swap >, there isalways at least one transition enabled in Nsw„p.First we know that the underlying SG net is deadlock free as proved in Section 7.1.2and 7.1.3 of this chapter. Secondly, the order that the predicates are marked with therelated tokens from the underlying net is in the same partial order as the possible firingsequence in this net. This can be shown by ordering the token movements in partial orderand comparing the orders from both nets. We start the ordering from the initial statesChapter 7. Petri Net Model of the FTAT Multilink Group Communication^81of the system, SEMPTY and Sender/Receivers in the direction of the transitions:• For Nsump, we have the partial order sequence: SEMPTY, {SACKOBSERVED,SPDUOBSERVED, SPDULOSTCLAIM}, SACKSCATTERED, SACK-GATHERED.• For the SG net: {Sender, Receivers}, PacketInQue, fReceivedPackets,PacketInQuel, {ACKGather, ObservedACK}, Gathered.Note that the elements in brackets are equal in the partial ordering. In the sequencefor the SG net, the PacketInQue is a virtual predicate and is ignored in our analy-sis. The unused predicates are eliminated from the sequence. Now we have: {Sender,Receivers}, {ReceivedPackets, ObservedACK}, Gathered.We assume at the start of the swap service, the initial predicate marking is set toSEMPTY, Sender, Receivers for the Swap net and the SG net, respectively. Alsowe assume that the firings of the enabled transitions in Nsv,„p is always performed beforethose in the SG net. This assumption is made to simplify the analysis of the model andcan easily be transformed into an implementation of the design.Under any marking M E i—Nrm swap >, there is at least one transition t is enabled aslong as the attached predicate from the SG net has the related token. By the previousassumption, (11,SwapNet is always marked before (t)sGNet. Since the partial ordering ofthe SG net reflects the firing sequence and the SG net is deadlock free, then (1)3GNetwill be eventually marked, i.e., t is enabled. Thus, there exists at least one transitionenabled under any marking M E [MNsw, > in the nets considered. Therefore, the Swapnet is then live, deadlock free.Chapter 7. Petri Net Model of the FTAT Multilink Group Communication^827.3 The FTAT Service ModelingThe FTAT services, IRPCRead and IRPC Write, are based on the Swap mechanism de-scribed in the earlier section. They provide the services for inter-FTAT communications.Using a similar method used in Section 7.2, the model is now shown deadlock free. Themodel, NFTAT, has two parts for the services modeled and is shown in Figure 7.16 andFigure 7.17.In the following analysis, only the net model for IRPCWrite service is shown. TheIRPCReac service model can be analyzed in a similar way.7.3.1 B oundednessNFTAT has a limited number of predicates in a single directed transition loop. The netis basically a single loop type of transitions, except between one particular section of theloop. Between predicates PDUGATHERED and STATEGATHERED there existsan embedded loop which may loop n+1 times, where n is the number of the workingpeer modules. This results from the transition condition All Stat -= DONE or All Stat= FAILED. This transition condition ensures that if all modules have tried and failed todeliver a packet, the service is aborted. Consequently, the number of possible markingsis finite.For each predicate in the net, the deposit and removal of a token is always balancedas can be seen from the model. Without considering the effects from Ns,,,,p, the incidencematrix and its invariant vector of the net are shown in Table 7.2. Since all the invariantsin the corresponding predicates are greater than zero, the net is said to be covered byS-invariants according to the definition (a) in Section 6.2 of [32]. By the theorem (d) ofSection 6.2 of [32], the net NF,' TAT is bounded.Now let us examine the effects stemming from N s ,, ap on the net NFTAT. It is obvious<b, pack>ApplicationPendingEMPTYNOTSENTYET 0___It5PDUSCATTERED2A CKOBSERVEDST ATESCATTERED,,1J Tilr Aor___...:(0_3ralt6SenS derMs)1b^SA CKGATHEREDPDUGATHEREDSACKOB ERVEDSACKGn()STATEGA HEREDBSERVEDFigure 7.16: IRPCWrite Service Pr/T net modelChapter 7. Petri Net Model of the FTAT Multilink Group Communication^83ExtFTAT<b, packit>PDUOBSERVEDNOTSENTYET 0___1bFINI HEDAC KOB SERVEDSACKOB ERVED^ der411Stat --, DOWeJ1Stat = FAILIEC SACKGAITHEREDhirico )ncob^ bSTATEGA HERED^STATES CATTERED^PDUGATHEREDACKSCATTEREDbCoFigure 7.17: IRPCRead Service Pr/T net modelChapter 7. Petri Net Model of the FTAT Multilink Group Communication^84Chapter 7. Petri Net Model of the FTAT Multilink Group Communication^85Predicate/Transition t0 ti t2 t3 t4 t5 t6 t7 tg t9 t10 tn. InvariantEMPTY -1 -1 -1 1 1PDUOBSERVED 1 1 -1 1ACKOBSERVED 1 -1 1NOTSENTYET 1 1 -1 1PDUSCATTERED 1 -1 1PDUGATHERED 1 -1 1 1STATESCATTERED 1 -1 1STATEGATHERED 1 -1 -1 1ApplicationPending 1 -1 1Table 7.2: FTAT Incidence Matrix and Invariantsthat Nswap does not change the total joint token count in the net NFTAT and nor doesNFTAT in Nswap. Net NFTAT is thus bounded.7.3.2 LivenessFor simplicity in the analysis, the swap net Nswap is ignored first. Figure 7.18 shows thenet model without Ns„,,p. In NFTAT) each predicate has an arc to some transition t, andfor any predicate with a marking greater than zero, the followed transition is enabledsince each transition t has only one predicate in its 1. As a result, NFTAT would be adeadlock free net without Nstuap connected.Now we examine NFTAT with Nstocip connected. We still assume that the transitionsenabled in NFTAT are always fired before those in the underlying nets until the NFTATtransitions are disabled due to the markings of the underlying nets. This assumption canbe easily translated into an implementation and greatly helps the structured analysis bymaking the modules less dependent on others and without distorting the model from thedesign.b<b, pack>1)0bl,b2b3<b, packNOTSENTYETFINI HED2ACKOBSERVEDt5PDUSCATTERED kbSTATEGA HERED^STATESCATTERED9PDUGATHERED"TrBSERVEDEMPTYChapter 7. Petri Net Model of the FTAT Multilink Group Communication^86Figure 7.18: IRPCWrite Service Pr/T simplified net modelChapter 7. Petri Net Model of the FTAT Multilink Group Communication^87Firstly, the firing sequence is ordered from the initial states of the system in par-tial order. The result ordering is EMPTY, {ACKOBSERVED, PDUOBSERVED,ApplicationPending}, NOTSENTYET, PDUSCATTERED, PDUGATHERED,whereas in Section 7.2.2 the ordering was SEMPTY, {SACKOBSERVED, SP-DUOBSERVED, SPDULOSTCLAIM}, SACKSCATTERED, SACKGATH-ERED. By the fact that NFTAT is a deadlock free net without Nswap connected, thenfor any marking M E [MN, > in NFTAT, there exists some transition t that is to beFTATenabled by some predicate in N sw„p. Thus we have (t)FTATNet > 0 and (1)Secondly, note that the firing sequence between to to t7 of NFTAT is in the samepartial ordering with that of N sp, i.e., for any tn in the FTAT net NF/ TAT, n—i '^tnand .tn^.tn+i in both nets respectively; the transition loop, t7 to to, is only dependenton SACKGATHERED from Nswap.Since Nsv,„p is deadlock free, as shown earlier in Section 7.2.2, eventually the (1)swapNetbecomes marked. That is, the transition t is enabled. Since t is some transition thatwould be enabled under any marking M E [MN, TAT >, we draw the conclusion that un-der any marking [MAT_-FTAT >, there exists some enabled transition t in the FTAT model.The FTAT model has thus been shown to be live.SwapNet^0.Chapter 8PerformanceIn this chapter, we analyze the performance of the major components of the system interm of time and amount of communication. The multilink group communication andthe on-line forward fault-repair are the major research issues addressed in this thesis.A brief analysis of the performance of the former is carried out quantitatively and theon-line forward fault-repair is analyzed qualitatively. The generic communication servicemeasurements are listed last.The performance analysis of the multi-link group communication demonstrates howthe concept of activity observation improves the performance in the communication.The on-line forward fault repair is analyzed to see how the system performance is en-hanced because of the repair and how the performance of the on-line forward repair itselfis improved by the use of volatile data redundancy as opposed to the use of virtual sharedmemory.The operations of the FTAT are mainly based on the scatter and gather mechanismbuilt for group communication over multiple link connections. The bidirectional commu-nication links are all supported by dedicated DMA controllers which allow concurrentbroadcast on all the links. The distribution of redundant messages are parallelized andmost of the operations are symmetric with respect to all the other peers within an FTAT.In this chapter, a simple performance analysis of the system is conducted. For themultilink group communication, the analysis of a worst performance case and a bestperformance case are given and contrasted with what the performance would be without88Trans1FACKtACKtTrans ACKtACKt 11/ACKtTransACKtPM 0PM 1PM 2PM 3Chapter 8. Performance^ 89TransTimeout Trans^TransFigure 8.19: Scatter without activity observation in a single link connectionactivity observation (A.0.).The following definitions are used in the performance analysis:• Ttrans — the transmission time for a data packet.• TACKtrans — the transmission time for an ACK packet.— the transmission time for a lost claim packet.• Ttimeout — the time period which if exceeded, results in a decision that a packet islost.8.1 Scatter and GatherA scatter operation consists a broadcast on all the FTAT internal links from a peer anda reception of all the ACK packets from the other peers. The packet transmission oneach link takes about the same amount of time since they are sent concurrently.Assuming that T4- orans is far greater than Tchiim, TACKtrans, and Treroute, then the majorcost is mainly contributed by Ttrans. In the following analysis, the packet processing timeand the channel setup time are all assumed to be negligible.• Single link failure:The time cost for a scatter without (A.0.) is:Chapter 8. PerformancePM 0^TransPM 1 ACK Trans ACKPM 2^ Claim^ ACK^Trans^ACKPM 3^ ClaimACKFigure 8.20: Scatter with activity observation and a single link connectionT Ttrans Ttirneout {Trans + TACKtransi * 2 > 3Ttrans^(8.11)With A.O. it is:T Ttrans Tclatrn Ttrans TACKtrans * 2^2Ttrans^(8.12)The overall time for the scatter operation with A.O. takes one less packet of datatransmission than it would without A.O. at the cost of a lost claim packet. Tdaimis the same as TACKtrans. An increase of the processing speed is about 30% in thiscase.• Single line connection in the system resulting from Multiple link failures:From Figure 8.19, the time cost for a scatter without A.O. is:T = Ttrans Ttirneout 4 * Ttrans + 3* TACKtrans^5Ttrans^(8.13)From Figure 8.20 with A.O. it is:^T --= Ttrans iTtrans TACKtrans ?Claim] * 2 + 3 * TACKtrans^3Ttrans^(8.14)Here, the scatter operation without A.O. would take 6 data packet transmissions inthe network. Considering that some of the transmissions are parallel in time, the90PM 3Figure 8.21: Scatter with activity observation and a parallel link connectionPM 2PM 1PM 0ACK Trans ACK 2 ACK 3ACKTransChapter 8. Performance91effective time consumption is about 5 times of a data packet transmission, whilethat for the scatter with A.O. is only about 3. This is a nearly 50% decrease incommunication traffic and a 40% increase in the processing speed.• Parallel lost claiming from the recipient with multiple link failures in the network:From Figure 8.19, the time cost for a scatter without A.O. is:T Ttrans + 4 * Ttrans + 3 * TAcKtrans] 5Ttrans^(8. 1 5)From Figure 8.21 with A.O. it is:T Ttrans TACKtrans Tclairn Trans + 3 * TAcKtrans 2Ttrans^(8.16)This is the most significant case with respect to performance. The scatter withoutA.O. would take 5 data packet transmissions while with A.O. it takes only 3 inactual transmission. The effective time consumed is 2 times of a packet transmissionbecause some of the transmission are in parallel. This results in a nearly 40%decrease in communication consumption and a 60% increase in the processing speed.Chapter 8. Performance^ 928.2 SwapWhen no failures occur in the system, a complete swap of information takes slightly moretime than a scatter operation since every peer in the group does the same operationssymmetrically.The extra costs are caused by the transmission of two extra ACKs over each link ineach direction and the processing time for these extra ACK packets from all the otherpeers. If the average data packet is far longer than ACK packets and the processing timeof ACK packets are negligible, then the cost is mainly due to the transmission of thedata packet.8.3 On-line Fault-RepairOn-line forward fault-repair involves selection of the repairer node, initiation of the repairrequest, packing of volatile data, delivering these data in packets, overwriting the contextof the faulty process and relocating pointers in the context. Volatile data of a processincludes global data, heap allocations, run-time stack, pointer reference stack, as well asthe resource management table.The repair service provides full context repair. Its performance is dependent on thesize of context. The process of packing and delivering is much faster than disk accessingsince memory access is faster than disk I/O, and the data communication (2.5MB/s) isfaster than disk I/O. Compared with commonly used process migration schemes such asthose used in the V system or the Sprite system[23][25], our approach has the followingadvantages:• no checkpointing and roll-back is necessary for fault recovery.• no disk I/O access is required.Chapter 8. Performance^ 93• no shared memory consistency maintenance overhead is suffered.• pointer adjustments is done in one batch.• pointer references rebinding are performed on a resource allocation basis instead ofa pointer basis.8.4 Generic Communication Performance MeasurementsThe generic communication operations have been measured with and without link fail-ures for the operations such as scatter & gather, swap and IRPCWrite. No significantdifference has been observed between the first two. However the measured results arenot the optimal performance for the protocol, realizing that the implementation can befurther tuned for best results.Table 8.3 shows the results of our implementation which can be further optimized.The processing overhead seems to be the major factor. However, our quantitative analysisshows an interesting potential performance gain for the multi-link group communicationprotocol. Our implementation is still subject to tuning to get the best performance.Measured Time\ Service Scatter Si Gather Swap IRPCWritewithout link failure 20ms 27 ms 128mswith link failure 21ms 30 ms 140msTable 8.3: Generic Operation Performance MeasurementsChapter 9ConclusionsThe FTAT architecture yields a systematic architectural approach to fault-tolerance sys-tems. The regularity and recursiveness of the architecture makes it easy to use thearchitecture as building blocks to build large multiprocessing systems.The multilink group communication mechanism provides redundant and parallel datadistribution and group synchronization. The concept of activity observation employedin the design greatly improves the communication mechanism to provide efficient andreliable data distribution in the face of link failures and processor failures. The majorgains can be summarized as follows:• Packet rerouting is initiated by the receiver in response to the observation of arecipient of the data packet in a group communication. A lost packet is always"claimed" from a known recipient and the route is always the optimal.• The low cost in transferring the ACKs, which used as "hints" for packet rerouting,greatly reduces the network operation overhead.• Because of the receiver-initiated dynamic rerouting, parallel "claims" of a lostpacket by multiple intended receivers are possible.• The elimination of the many timers that otherwise would be needed for reliableredundant data distribution, reduces the overhead and complexity of the protocolimplementation.94Chapter 9. Conclusions^ 95• The scatter and gather operation for the distribution of a data packet makes thesender and receivers symmetrically complementary, which yields better timelinessin group synchronization.The reliability analysis of the multilink group communication of the system shows thatthe connection reliability in the group communication between any pair of the modules isincreased. The concept of activity observation in the design of the protocol is shown tohave the potential of considerable performance gain and high communication efficiency.The transparency in fault-masking provides a simple programming model to the pro-grammers for fault tolerant applications, especially for parallel programming applica-tions. The dynamic TMR scheme used makes the system tolerant of two concurrentfaults without using more than four processors. The underlying mechanism of the on-line fault-repair makes the system tolerant of unlimited sequential transient physicalfaults with smooth reintergration of the repaired module, unless permanent failures oc-cur on majority modules. Our reliability analysis has shown that system reliability canbe improved remarkably with on-line forward fault-repair.The design of the multilink group communication protocol has been formally repre-sented by Petri Nets in a structured modular way. The nets are shown informally tobe bounded and deadlock free, and the implementation demonstrates that the designis functionally correct. Hence, the design is shown to be correct. The use of PredicateTransition Net which is a variation of Petri Net allows us to represent the model in simplediagrams and to give proofs in a simple way.The on-line fault-repair eliminates faults on the process basis without roll-back ofprevious operations. Some of the similar issues in process migration are approachedin a different way. By separating an application process from its run-time physical en-vironment at a minimal cost, the live execution image can be used to repair the faultyChapter 9. Conclusions^ 96process to guarantee forward fault recovery. This is very attractive for real-time process-ing because of the following:• No computation is wasted because the forward recovery occurs without any roll-back.• Pointer reference rebinding is performed on a resource allocation basis instead of apointer basis.• Using volatile redundancy saves the cost of accessing persistent I/O devices.• No shared virtual memory has to be used, thus concurrency control overhead iseliminated.There is still much work that can be done in the future:• The experimental system is subject to optimization for best performance.• Investigation is needed on how system's reliability relies on the acceptance test paceunder different module reliabilities. Since high fault coverage can be achieved atthe cost of communication and comparisons, this may provide a guide line for thebest performance while meeting the reliability requirements.• The system itself is vulnerable to faults, thus highly efficient self-checking tech-niques should be exploited to cause the system to have crash-failure semantics.• Efficient fault detection techniques of high coverage are needed for forward fault-repair to achieve high reliability.Bibliography[1] R. Beton, J. Kingdon and C. Upstil, "Highly Availability Transputing Systems",Proceedings of the World Transputer User Group Conference. April, 1991.[2] G. Coulouris and J. Dollimore, "Distributed Systems Concepts and Design",Addison-Wesley Publishers Ltd. 1988.[3] N. Brock, "Transputers and Fault-Tolerance", Proceedings of the World TransputerUser Group Conference, April 1991.[4] J. Standeven and M. Colley, "Hardware Voting of Transputers in Real-Time nMRFault-Tolerant Systems", Dept. of Computer Science, Univ. of Essex. April, 1990.[5] Q. Stout and B. Wager, "Intensive Hypercube Communication I: Prearranged Com-munication in Link-Bound Machines", The Univ. of Michigan Computing ResearchLaboratory, June 1987.[6] D. Rennels, "Fault-Tolerance Computing - Concepts and Examples", IEEE Trans.on Computers, Vol. C-33, No.12, December 1984.[7] D. Siewiorek, and R. Swarz, "The theory and practice of Reliable System Design",1982 by Digital Press.[8] R. Koo and S. Toueg, "Checkpointing and Rollback-Recovery for Distributed Sys-tems", IEEE Trans on Software Engineering, Vol. SE-13, No.1, January 1987.[9] A. Tanenbaum, "Computer Networks", Prentice Hall, Inc., 1988.[10] A. Avizienis, "The N Version Approach to Fault-Tolerant Software", IEEE Trans.on Software Engineering, Vol. SE-11, #12, December 1985.[11] K. Tomita, S. Tashiro and T. Kitamura, "A Highly Reliable Computer System - ItsImplementation and Result", Annu Int. Symp. on Fault-Tolerant Computing, 1980.[12] R. Strom, D. Bacon, and S. Yemini. "Volatile Logging in n-Fault-Tolerant Dis-tributed Systems", Annu Int. Symp. on Fault-Tolerant Computing, 1988.[13] D. Russell and M. Tiedeman, "Multiprocess Recovery Using Conversations", Annu.Int. Symp on Fault Tolerant Computing, 1979.97Bibliography^ 98[14] J. Ayache, P. Azema, and M. Diaz. "Observer: A Concept For On-Line Detection ofControl Errors In Concurrent Systems", Annu. Int. Symp on Fault Tolerant Com-puting, 1979.[15] C. Li, and W. Fuchs. "CATCH — Complier-Assisted Techniques for Checkpointing",20th Annu Int. Symp on Fault-Tolerant Computing, June 1990.[16] P. Velardi, and R. Iyer, "A Study of Software Failures and Recovery in the MVSOperating System", IEEE Trans. on Computers, Jun. 1984.[17] F. Christian, B. Dancey, and J. Dehn. "Understanding Fault-Tolerant DistributedSystems", Invited paper, 20th Annual International Symposium on Fault-TolerantComputing, June, 1990.[18] K. Kim and J. Yoon, "Approaches to Implementation of a Repairable DistributedRecovery Block Scheme", Annu Int. Symp. on Fault-Tolerant Computing, 1988.[19] E. Cooper, "Replicated Distributed Programs", Proceedings of 10th ACM Sympo-sium on Operating System Principles. December, 1985.[20] D. Cheriton and W. Zwaenepoel, "One-to-Many Interprocess Communication in theV-System", Report STAN-CS-84-1011, Department of Computer Science, StanfardUniversity, August 1984.[21] H. Stone. "High Performance Computer Architecture", Addison Wesley publishing,1987.[22] Y. Chen, T. Chen, "Implementing Fault-Tolerance via Modular Redundancy withComparison", IEEE Trans. On Reliability, Vol. 39, No.2, June, 1990[23] M. Theimer, K. Lantz and D. Cheriton, "Preemptable Remote Execution Facili-ties for the V-System", Proceedings of the 10th Symposium on Operating SystemPrinciples. December, 1985 .[24] F. Douglis, "Process Migration in the Sprite Operating System", Electrical Engi-neering and Computer Sciences. Univ. of California Berkeley, Technical Report, Feb.1987.[25] E. Zayas, "Attacking the Process Migration Bottleneck", Computer Science, CanegieMellon Univ. ACM SIGOP, 1987.[26] 0. Serlin, "Fault-Tolerant Systems in Commercial Applications", Computer, IEEE,August 1984.Bibliography^ 99[27] J. Ortiz, "Transputer Fault-Tolerant Processor", Proceedings of the Third Confer-ence of the North American Transputer Users Group, 1990.[28] R. Oates, J. Kerridge, "Adding Fault Tolerance to a Transputer-based ParallelDatabase Machine", Proc. of the World Transputer User Group Conference, 1991.[29] R. Strom, D. Bacon, S. Yemini, "Volatile Logging in n-Fault-Tolerant DistributedSystems", Annu Int. Symp. on Fault-Tolerant Computing, 1988.[30] S. Dutt, J. Hayes, "Some Practical Issues in the Design of Fault-Tolerance Multi-processors", IEEE Trans. on Computers, Vol.41, No.5, May 1992.[31] H. Genrich and K. Lautenbach, "System Modeling with High-Level Petri Nets",Theoretical Computer Science, 13, North-Holland Publishing Company, 1981.[32] J. Peterson, "Petri Net Theory and the Modeling of Systems", Prentice-Hall, Inc.,1981.[33] D. Corner, "Operating System Design: The XINU Approach", Prentice Hall, Inc.,1984.[34] INMOS, "The Transputer Databook", second edition, 1989.


Citation Scheme:


Citations by CSL (citeproc-js)

Usage Statistics



Customize your widget with the following options, then copy and paste the code below into the HTML of your page to embed this item in your website.
                            <div id="ubcOpenCollectionsWidgetDisplay">
                            <script id="ubcOpenCollectionsWidget"
                            async >
IIIF logo Our image viewer uses the IIIF 2.0 standard. To load this item in other compatible viewers, use this url:


Related Items