
A paper on



Presented by:
G. Anjeneyulu, REGD NO: 05G31A0428; G. Caleb, REGD NO: 05G31A0429


Today, fault tolerance plays a major role in computational and real-time systems. It is an effective approach to handling hardware and software faults in real-time applications, and it makes application software more reliable. The main goal of designing and building fault-tolerant systems is to ensure that the system as a whole continues to function correctly, even in the presence of faults. Research on fault tolerance is ongoing in many fields. Fault tolerance has a great impact on embedded supercomputing systems, with which a system can achieve maximum performance while delivering fault-free results. Embedded supercomputing is essential for complex, computing-intensive scientific and industrial applications. We have therefore chosen fault-tolerant communication in embedded supercomputing as our topic. In this paper, we present some of the difficulties that commonly arise in embedded supercomputing and show how to recover from them using fault-tolerant communication. We describe synchronous and asynchronous fault-tolerant communication, briefly mention other communication issues in FT, and explain some of the recovery tools for synchronous and asynchronous communication. Finally, we illustrate the performance of fault-tolerant communication in embedded supercomputing.

Interprocessor communication, Synchronous communication, Asynchronous communication, Channel control thread, DIRnet, Thread, Mailbox, FT library


Embedded supercomputing is becoming indispensable for complex, computing-intensive scientific and industrial applications, and parallel systems are supplanting traditional uniprocessor platforms. Dependability and fault tolerance thus become critical to the performance of parallel systems. Multiprocessor cooperation, a parallel system's most powerful feature, can also be its fatal weakness. More processors mean more faults, and the failure of a single processor can crash the whole system. Failures are no longer just undesirable situations; depending on the application, they can be hazardous or even catastrophic.

A major factor is the communication system. Interprocessor communication, which coordinates processors and enhances their power, is key to a successful parallel system. Distributed-memory multiprocessor systems rely on message communication between nodes. Message-passing applications are based on either synchronous (blocking) or asynchronous (nonblocking) communication for the coherence of parallel tasks. In the synchronous mode, problems arise when communication links or communicating threads are in an erroneous state (broken links, threads in infinite loops, and so on). When such errors occur, communicating threads remain blocked, since communication cannot be initiated or completed. In asynchronous communication, problems arise when communicating threads are in an erroneous state, or when the mailbox mechanisms supporting asynchronous communication malfunction. Clearly, fault-tolerant communication mechanisms are key factors in parallel system dependability and can unlock a system's full potential.

The Esprit project EFTOS (Embedded Fault-Tolerant Supercomputing) develops a framework to integrate fault tolerance flexibly and easily into distributed, embedded, high-performance computing (HPC) applications. This framework consists of reusable FT modules acting at different levels.
Integration of this functionality into actual embedded applications has validated the approach and provided promising results.


Systems within the scope of the EFTOS project exhibit errors in the following categories: untested or unforeseen input values triggering software errors; electromagnetic interference causing hardware faults; propagation of errors through communication channels from one part of the system to other parts; errors propagating from one process to another and causing memory corruption, since processes run concurrently in the same memory space; loss of subsequent inputs because of a faulty input item; and failures to meet deadlines and time constraints. The EFTOS project developed FT modules that deal with the different types of errors in these categories. We were then able to determine where fault tolerance was required (processing and networking modules) and the steps needed to achieve it (detection, isolation, and recovery mechanisms). The levels at which FT is required are the thread, memory, node, system, message, and link levels.

The proposed FT framework, aimed at general embedded applications, operates at three layers. At the lowest layer, it consists of error detection tools (D-tools) and error recovery tools (R-tools). These parameterizable functions start dynamically during application execution. When a D-tool detects an error, it uses a standardized interface to pass specific information to the next higher layer, the detection-isolation-recovery network (DIRnet), a distributed control network. The DIRnet starts the R-tools, which recover the application after an error occurs. These tools can work either in combination with the higher layers or as standalone tools. At the middle layer, the DIRnet coordinates the D-tools and R-tools. This hierarchical network serves as a backbone for passing information among the application's FT elements, and it enables distributed action.

Several agents located in different nodes assist the DIRnet central manager. These agents interact with the D/R-tools in their field and take local recovery actions, including upon the DIRnet manager's request. The agents forward the resulting information to the DIRnet manager. Agents are not interconnected but nevertheless communicate through the DIRnet manager, which is responsible for their cohesion. At the highest layer, the D/R-tools and the DIRnet are combined into mechanisms that apply fault tolerance to processing or communication modules, and a custom language specifies the user's recovery strategy. The adaptation layer allows a generic definition of the FT library interface to both the underlying operating system and the target hardware.

Figure 1. Fault tolerance framework architecture (elements shown: R-tool, DIRnet, adaptation layer, operating system).
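The flow from a D-tool through a DIRnet agent to an R-tool and on to the central manager can be sketched roughly as follows. This is only an illustration in Python: the class names, the `ErrorReport` fields, the `"broken_link"` label, and the `restart_link` R-tool are all hypothetical and are not taken from the EFTOS library.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ErrorReport:
    """Information a D-tool passes upward (fields are hypothetical)."""
    node: str
    kind: str  # e.g. "broken_link"; an illustrative label only


@dataclass
class Manager:
    """Central DIRnet manager: collects the reports forwarded by agents."""
    log: List[ErrorReport] = field(default_factory=list)

    def notify(self, report: ErrorReport) -> None:
        self.log.append(report)


@dataclass
class Agent:
    """Per-node agent: starts a local R-tool, then informs the manager."""
    node: str
    manager: Manager
    r_tools: Dict[str, Callable[[ErrorReport], str]] = field(default_factory=dict)

    def report(self, report: ErrorReport) -> str:
        r_tool = self.r_tools.get(report.kind)
        outcome = r_tool(report) if r_tool else "no local recovery"
        # Agents are not interconnected; they talk only to the manager.
        self.manager.notify(report)
        return outcome


def restart_link(report: ErrorReport) -> str:
    """Hypothetical R-tool that pretends to restart a link."""
    return f"link restarted on {report.node}"


manager = Manager()
agent = Agent("node0", manager, {"broken_link": restart_link})
result = agent.report(ErrorReport("node0", "broken_link"))
```

Keeping the R-tools in a per-agent table mirrors the idea that recovery is taken locally, while the manager retains the global view.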





Figure 2: The DIRnet Architecture

FT can be implemented in both synchronous and asynchronous communication.

FAULT-TOLERANT SYNCHRONOUS COMMUNICATION:
In synchronous message passing, communication problems arise when communication links or communicating threads are in an erroneous state (broken links, threads in infinite loops, and so on). Because communication cannot be initiated or completed, the communicating threads remain blocked. There are two ways to avoid these situations:
1. The status of both the communication link and the communication partner is explicitly tested before messages are passed.
2. Communication is established normally, but time-out mechanisms are initiated to escape from problematic situations.
Naturally, these two approaches, including the message-delivery time-out, can be used in combination.
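The two approaches can be combined in a small sketch, written here in Python with the standard `threading` and `queue` modules. The `Channel` class, its `link_ok` flag, and the method names are assumptions made for illustration; they do not reproduce the EFTOS interface, and the sketch assumes a single sender per channel.

```python
import queue
import threading


class Channel:
    """Blocking channel: send() returns only once the receiver has the message."""

    def __init__(self) -> None:
        self._slot: queue.Queue = queue.Queue(maxsize=1)
        self._taken = threading.Event()
        self.link_ok = True  # stand-in for an explicit link-status test

    def send(self, msg: object, timeout: float) -> bool:
        """Return True on delivery, False if the link is down or time runs out."""
        if not self.link_ok:  # approach 1: test the status before sending
            return False
        self._taken.clear()
        self._slot.put(msg)
        if self._taken.wait(timeout):  # approach 2: bounded wait
            return True
        # Time-out: withdraw the undelivered message so the channel is reusable.
        try:
            self._slot.get_nowait()
        except queue.Empty:
            return True  # the receiver took the message at the last moment
        return False

    def receive(self, timeout: float) -> object:
        msg = self._slot.get(timeout=timeout)  # raises queue.Empty on time-out
        self._taken.set()
        return msg


ch = Channel()
received = []
t = threading.Thread(target=lambda: received.append(ch.receive(timeout=1.0)))
t.start()
delivered = ch.send("hello", timeout=1.0)      # a partner is waiting
t.join()
blocked = ch.send("nobody listening", timeout=0.05)  # no partner: escapes
```

Without the time-out, the second `send` would block forever, which is exactly the situation the text describes.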
Figure 3. The channel control thread cooperates with the DIRnet (elements shown: DIR agents, timer, CCT, ready signal, virtual link).

1. Channel control thread (CCT): Whenever two threads need to establish a communication channel, the initializing thread orders the creation of a special thread that will control the fault-tolerant communication. Using this separate thread makes it possible to return to a safe state from a blocked communication. The channel control thread (CCT), shown in Figure 3, handles time-outs and triggers recovery actions in cooperation with the DIRnet. The CCT is also responsible for handling isolation and recovery actions. The CCT and its actions are transparent to the application and are initiated only if a communication channel is defined to be fault tolerant.

2. Dual CCTs: This is an extension of the single-CCT implementation. Here a CCT is also maintained for the communication partner, that is, a symmetric pair of communication activities takes place on a single communication channel. Message-passing environments may treat a communication channel either as one object or as a symmetric pair of communication activities.

The two implementations just described can be made faster if the actual messages are sent directly from sender to receiver and not through the CCTs. When this is done, the CCTs don't need knowledge of the protocol used by the original channels. Thus the CCTs become pure control instances of the application's sending and receiving actions and, as a result, have a similar load. Synchronous communication on fault-tolerant links follows dedicated protocols, with the associated algorithms executed in the sending CCT.

FAULT-TOLERANT ASYNCHRONOUS COMMUNICATION:
Asynchronous communication is based on the mailbox concept, whereby the sending thread no longer hangs after sending its message.

Figure 7. Fault-tolerant asynchronous communication with monitoring task (elements shown: sender, mail, receiver, monitoring task).
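The mailbox concept with an error signal for a monitoring task might look roughly like this in Python. It is a single-threaded sketch: the `Mailbox` class, its `capacity` parameter, and the `error_signals` list are hypothetical names, and a real implementation would add locking for concurrent senders and receivers.

```python
from collections import deque
from typing import Deque, List, Optional


class Mailbox:
    """Bounded FIFO mailbox; a full mailbox is treated as faulty."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._mail: Deque[str] = deque()
        # Error signals that a monitoring task would watch (hypothetical).
        self.error_signals: List[str] = []

    def send(self, msg: str) -> bool:
        """Non-blocking send: the sender never hangs, even if the box is full."""
        if len(self._mail) >= self.capacity:
            self.error_signals.append(f"mailbox full, could not queue {msg!r}")
            return False
        self._mail.append(msg)
        return True

    def receive(self) -> Optional[str]:
        """Return the oldest message, or None if the mailbox is empty."""
        return self._mail.popleft() if self._mail else None


box = Mailbox(capacity=2)
results = [box.send(m) for m in ("a", "b", "c")]  # the third send overflows
first = box.receive()                              # FIFO: oldest message first
```

The sender gets an immediate result instead of blocking, and the overflow is recorded where a monitoring task can react to it.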

The message is stored in a buffer or mailbox; when the receiving thread is available, it retrieves the message from the mailbox. This way the sender is free to continue its tasks after sending its message. Figure 7 illustrates asynchronous communication.

OTHER COMMUNICATION ISSUES:
1. Synchronous communication: Communicating threads detect the channel status. If the channel is busy, it is checked periodically; otherwise, the channel is simply used. Such status checks can be provided as an extension to the select mechanism, which allows the definition of sending options along with the time-out option, and as an extra feature of the CCT implementation described under the message-delivery time-out. In the latter case, the receiver sends a Ready signal to the CCT just before blocking for communication. The CCT can issue a ConditionalSelect for this signal to check the receiver's status.

2. Asynchronous communication: A thread can be dedicated solely to monitoring whether or not the mailbox is empty. Errors raised by attempts to send mail to a faulty (full) mailbox are reported to the monitoring tasks. Monitoring tasks can then trigger actions by issuing a ConditionalSelect for the error signal. Asynchronous communication operates by means of mailboxes; their FIFO implementation guarantees that messages are transferred in the order in which they are queued. Therefore, no further check is necessary.

RECOVERY TOOLS:
1. Synchronous communication: These tools deal with both application and system errors. More specifically:
For a Send-Send fault, one of the messages is stored in a local buffer and communication switches temporarily to asynchronous mode.
For a Recv-Recv fault, dummy data is sent to one partner and communication continues.
For a Send/Receive-stop fault, the active communicating partner tries again to communicate, with no time-out.
For channel errors, the communication link is restarted.
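The four synchronous fault cases above map naturally onto a dispatch table. The following Python sketch only records which recovery steps would be taken; the `SyncFault` names and the step strings are illustrative labels, not identifiers from the FT library.

```python
from enum import Enum, auto
from typing import Dict, List


class SyncFault(Enum):
    SEND_SEND = auto()        # both partners try to send
    RECV_RECV = auto()        # both partners try to receive
    PARTNER_STOPPED = auto()  # Send/Receive-stop: one partner is inactive
    CHANNEL_ERROR = auto()    # the link itself is faulty


# Recovery steps per fault, as described in the list above.
RECOVERY_STEPS: Dict[SyncFault, List[str]] = {
    SyncFault.SEND_SEND: ["buffer one message locally",
                          "switch temporarily to asynchronous mode"],
    SyncFault.RECV_RECV: ["send dummy data to one partner",
                          "continue communication"],
    SyncFault.PARTNER_STOPPED: ["retry communication with no time-out"],
    SyncFault.CHANNEL_ERROR: ["restart the communication link"],
}


def recover(fault: SyncFault) -> List[str]:
    """Return a copy of the recovery steps for the given fault."""
    return list(RECOVERY_STEPS[fault])


steps = recover(SyncFault.SEND_SEND)
```

A table-driven design keeps the recovery policy in one place, so adding a fault class does not require touching the detection code.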

2. Asynchronous communication: These tools try to recover the application from a faulty state by executing specific actions, such as leaving the mail in the mailbox (no recovery), deleting the mail when the mailbox is full, resetting the receiver via an interrupt signal, resetting both the sender and receiver, or performing an application-dependent recovery action.

PERFORMANCE STUDY:
The following table shows the time overhead for the synchronous communication FT library.

                                CREATE    TRANSFER    BREAK
Fault tolerant communication    0.5       2.0         2.0
(In milliseconds)

Table 2: Time overhead of the synchronous communication FT library

Specifically, the table depicts the time taken to create a communication link, transfer a message through a link, and break a link. The create and break times are of minor importance, since these functions are performed only once. The transfer time becomes more important as the application becomes more communication intensive. Applications in which communication time is comparable to CPU time are ones that can benefit from the FT library.
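To see why the create and break times matter little, the total FT overhead over the lifetime of one link can be worked out directly. This short Python sketch assumes the table is read as 0.5 ms to create, 2.0 ms per transfer, and 2.0 ms to break; the function name and the message counts are chosen only for illustration.

```python
CREATE_MS = 0.5    # one-time cost to create a fault-tolerant link
TRANSFER_MS = 2.0  # cost per message transferred
BREAK_MS = 2.0     # one-time cost to break the link


def ft_overhead_ms(transfers: int) -> float:
    """Total FT overhead in milliseconds for one link and `transfers` messages."""
    return CREATE_MS + transfers * TRANSFER_MS + BREAK_MS


single = ft_overhead_ms(1)     # one message over the link
heavy = ft_overhead_ms(1000)   # a communication-intensive link
one_time_share = (CREATE_MS + BREAK_MS) / heavy
```

With a thousand transfers, the one-time create and break costs make up well under one percent of the total, so the per-transfer time dominates.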

CONCLUSION:
The communication FT framework we have described has been successfully integrated into real-time embedded high-performance computing applications. These are an image-processing module in an automatic mail-processing system developed by Siemens Elektrocom and a remotely controlled automation system for electric high-voltage substations operated by ENEL (the Italian electricity provider). Both systems proved more dependable when faults occurred, and overall system performance improved. System downtime decreased significantly, and the mean time between system reboots increased. It is also planned to port the FT framework to additional platforms and operating systems, providing and integrating standard mechanisms for node-to-node interoperability. Furthermore, researchers will consider FT middleware implementation using emerging standards, technologies, and industrial initiatives (such as CORBA) to guarantee the required level of dependability in object-oriented open distributed systems.
