UBC Theses and Dissertations
A fault-tolerant building block for transputer networks for real-time processing
Fei, Yueying
This thesis addresses the software implementation of multiprocessor fault tolerance for real-time processing. The research focuses on:
• Fault-tolerant cells as building blocks that can survive concurrent transient physical faults and permanent failures in large parallel processing systems with potential for real-time processing.
• Efficient group communication for redundant data exchange over the multiple communication links that connect the group peers.
• Transparent fault tolerance.
• On-line forward fault repair, with a bounded delay, using the live execution image from a non-faulty peer.
Because the redundant processing modules are connected systematically, the architecture is regular and recursive, so the cells can serve as building blocks for constructing fault-tolerant parallel machines.
The communication service protocols take advantage of redundant links to ensure reliable and efficient message delivery among the fault-tolerant abstract transputer (FTAT) peer nodes through the concept of activity observation. The multiple redundant links provide a means of parallel communication, which is essential for the redundant information exchange that fault tolerance requires. Activity observation further reduces the effort needed for reliable message delivery and simplifies the system design. As a result, messages are dynamically and optimally routed when a link or processor fails.
Through the group communication mechanism underlying the platform, replication, repair upon fault detection, and reintegration after repair are all transparent to the application processes on each FTAT peer node. Based on a dynamic Triple Modular Redundancy (TMR) scheme, each application process can survive up to two concurrent faults, under the assumption that the probability of two faulty peer applications exhibiting the same fault is very small.
In a large interconnected network, fault tolerance can be very expensive in time and communication because of the cost of either synchronization or rollback recovery. Using redundant live execution images to repair the faulty module guarantees forward fault recovery.
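The idea of voting among three peers and repairing the disagreeing one from a live image can be illustrated with a minimal Python sketch. It is not the thesis implementation: the Replica class, the "result" state field, and the vote and forward_repair functions are hypothetical names introduced here. The sketch covers only the case where a majority of peers agree. When no majority exists it simply reports that, since the abstract does not say how the scheme decides in that situation.

```python
from collections import Counter
from copy import deepcopy


class Replica:
    """Hypothetical replica of an application process: a name plus its live state."""

    def __init__(self, name, state):
        self.name = name
        self.state = state

    def output(self):
        # The value this replica would exchange with its peers for comparison.
        return self.state["result"]


def vote(replicas):
    """Return (majority_value, disagreeing_replicas); value is None if no majority."""
    counts = Counter(r.output() for r in replicas)
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(replicas):
        # No strict majority, so voting alone cannot identify a healthy peer.
        return None, list(replicas)
    return value, [r for r in replicas if r.output() != value]


def forward_repair(replicas):
    """Overwrite each disagreeing replica with a copy of a healthy peer's state.

    This is forward repair in the sense that the faulty replica resumes from
    the current state of a healthy peer instead of rolling back to a checkpoint.
    """
    value, faulty = vote(replicas)
    if value is None:
        return None
    donor = next(r for r in replicas if r.output() == value)
    for r in faulty:
        r.state = deepcopy(donor.state)  # assumed stand-in for a live execution image
    return value
```

In this sketch the repaired replica receives an independent copy of the donor's state, so later changes on one peer do not leak into the other.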