
Dynamic-Hybrid Fault Detection Methodology


Ahmad Shukri Mohd Noor and Mustafa Mat Deris

A.S.M. Noor is with the Department of Computer Science, Faculty of Science and Technology, Universiti Malaysia Terengganu, Malaysia.
M.M. Deris is with the Faculty of Multimedia and Information Technology, Universiti Tun Hussein Onn Malaysia, 86400 Parit Raja, Batu Pahat, Johor Darul Takzim, Malaysia.
Abstract: Fault detection methodology is a crucial part of providing a scalable, dependable and highly available large distributed computing environment. The most popular technique used in detecting faults is the heartbeat mechanism, which monitors system resources consistently and at very short intervals. However, this technique has a weakness: it requires a period of time to detect a faulty node, thereby delaying the recovery actions to be taken. This paper presents a fault detection mechanism and service using a hybrid heartbeat mechanism with a dynamic maximum time allocation interval for each heartbeat message. The technique introduces the use of an index server for indexing transactions, and utilizes a dynamic hybrid heartbeat mechanism and a pinging procedure for fault detection. The evaluation outcome indicates that the hybrid heartbeat mechanism reduces the time taken to detect a fault by approximately 30% compared to an existing technique, and provides a basis for customizable recovery actions to be deployed.

Index Terms: Fault tolerance, distributed systems.

1 INTRODUCTION

An effective and efficient fault tolerance methodology is an important part of building reliable, dependable systems, and fault tolerant computing has thus become an active research area [1][2][3][4][5][6]. Fault detection is the first essential phase in developing any fault tolerance mechanism or fault tolerant system [7]. Fault detectors are therefore a very important part of fault-tolerant distributed systems and are widely used [8]; they provide information on faults of the components of these systems. Depending on the system architecture and the assumptions about the fault characteristics of components, one study found that fault-detection latencies cover from 55% to 80% of non-functional periods [9]. These non-functional periods occur when a system is unaware of a failure (failure-detection latency) and when a system attempts to recover from a failure (failure-recovery latency) [9][10]. Even though the development of fault detection mechanisms in large scale distributed systems is the subject of active research, it still has some gaps [11][12]. Matthew Gillen et al. introduced Scalable, Adaptive, Time-Bounded Node Failure Detection (NFDS) [13]. However, although network traffic and CPU load are unpredictable, their paper [13] does not discuss an algorithm for tailoring the detection time in the case of network congestion or CPU overload; it only compares, mathematically and experimentally, detection time, mistake rate, network bandwidth and message loss rate. One of the most popular fault detector services in grid computing is the Globus Heartbeat Monitor (HBM) [14]. HBM provides a fault detection service for applications developed with the Globus toolkit. It was developed under the assumption that both the grid generic server and the heartbeat monitor run reliably [15]. However, a few bottlenecks have been identified: it scales badly in the number of members being monitored [16], requires developers to implement fault tolerance at the application level [14], is difficult to implement [18], and has high overhead [17]. The Failure Detection and Recovery Service (FDS) [18] improves on the Globus HBM with early detection of failures in applications, grid middleware and grid resources. The FDS also introduces an efficient and low-overhead multi-layered distributed failure detection service that releases grid users and developers from the burden of grid fault detection and fault recovery. However, this technique has a weakness: it requires a period of time before detecting a fault, delaying the recovery actions to be taken. The problem arises because the status of each transaction is not indexed and the detector has to wait for a certain time interval. K. C. W. So and E. G. Sirer proposed a latency- and bandwidth-minimizing failure detector, a generic way to detect faulty nodes, but it does not provide flexibility (i.e., support for different types of applications) [12, 19]. L. Falai and A. Bondavalli presented an experimental evaluation of different estimations and safety margins of a distributed push failure detector [21]. Other research efforts, such as [20] and [21], did not treat size scalability, while still others are not adaptable in the presence of network changes [22, 23, 24]. The state of the art of failure detection is discussed briefly in [20] and [25].

In this paper, we propose the Dynamic Hybrid Heartbeat mechanism, which integrates a dynamic heartbeat detection mechanism with a pinging service. In this technique, all message transactions are indexed for reference and for further action during the recovery process. We call the mechanism hybrid because we integrate the heartbeat mechanism with a ping service. We focus on the problem of detecting fail-stop crash failures: failures in which a crash results in the component transitioning permanently to a state that allows other components to detect that it has failed (e.g., by ceasing to send periodic "i-am-alive" messages). This research focuses on improving the


failure detection methodology for large-scale distributed environments. This work aims at reducing the detection time, reducing false alarm warnings, scaling well, and providing the system with more accurate information for recovery and fault tolerance purposes. The rest of the paper is organized as follows: Section 2 presents the proposed failure detection mechanism architecture and illustrates the component functionalities for detecting failures as well as the failure detection workflow. Section 3 discusses the performance measurements of an existing model and the proposed model. Section 4 presents the findings. Finally, Section 5 concludes the paper.


2 DESIGN ARCHITECTURE
In this section we briefly describe the dynamic hybrid fault detection methodology. Figure 1 presents the Hybrid Fault Detection architecture, which comprises three main components: the Heartbeat Monitor (HBM), the Application Register (AR) and the Index Server (IS).

[Figure 1: the Index Server (IS), containing the Node Manager and the Index Manager, coupled to the Heartbeat Monitor (HBM) with its data receiver/collector/detector, which monitors the participating nodes n1 through n9.]

Fig. 1. Hybrid Fault Detection architecture
i) Node Manager (NM): In order to utilise the dynamic hybrid fault detection methodology (DHFDM), each participating node first needs to register with the NM. The NM provides a method to register the node with the Index Server by storing the information required for monitoring. Once a node registers with the NM, the information it provides is added and indexed in the Index Manager (IM), which receives a notification whenever a node is added to the index or changed since it was last indexed.

ii) Index Manager (IM): The IM is notified each time the NM registers a new node or updates node information since the last indexing; this information is automatically updated in the IM. During the monitoring process, the IM receives the log file for each of the heartbeat messages generated by the heartbeat monitor (HBM), containing the current status of each participating node. The log file format is [day/month/year:hour:minute:nodeIP:status] (parsed in the sketch after this list). The node IP is the IP address of the node as recorded by the NM. The status field refers to the current status of the application, for example 'inactive', 'active', 'done', 'failed' or 'exception', while the comment field carries additional information about the registered node, such as the availability of free space.

iii) Heartbeat Monitor (HBM): The HBM is responsible for monitoring the state of the registered nodes; when they differ from their usual state, it notifies the IM so that the necessary actions can be taken. Each node periodically sends a message indicating its aliveness to the HBM, which then generates a status that is sent to the Index Server to update the node's current status. The HBM supports heartbeat-mechanism failure detection: if the HBM does not receive a heartbeat within a certain duration HBmax, the node is considered failed. Consequently, the IM will provide sufficient information for the recovery process.
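As an illustration of the log format described in ii), the entries could be parsed as in the following minimal Python sketch (our own; the LogEntry type and field names are illustrative and not part of the paper):

from dataclasses import dataclass

@dataclass
class LogEntry:
    date: str      # day/month/year
    hour: int
    minute: int
    node_ip: str
    status: str    # e.g. 'inactive', 'active', 'done', 'failed', 'exception'

def parse_log_entry(line: str) -> LogEntry:
    # Format: [day/month/year:hour:minute:nodeIP:status]; the IP contains
    # no colons, so splitting on ':' yields exactly five fields.
    date, hour, minute, node_ip, status = line.strip("[]").split(":")
    return LogEntry(date, int(hour), int(minute), node_ip, status)

print(parse_log_entry("[22/10/2007:10:06:192.168.0.7:active]").status)   # active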


2.1 Dynamic Hybrid Failure Detection Workflow

An overview of the Dynamic-Hybrid fault detection methodology workflow is illustrated in Figure 2. A message generator is located in each node; it generates heartbeat messages carrying certain information and submits them to the heartbeat monitor server. The information comprises: node ID, message ID, sending time Si, interval time, and node volume.
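A minimal sketch of such a node-side message generator is given below (our own illustration; the JSON encoding, the UDP transport and the HBM address are assumptions, not details fixed by the paper):

import json
import socket
import time
import uuid

HBM_ADDR = ("hbm.example.org", 9999)    # hypothetical heartbeat monitor endpoint
NODE_ID = "nd1"
TI = 1.0                                # heartbeat interval Ti, in seconds

def heartbeat_loop() -> None:
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        message = {
            "node_id": NODE_ID,
            "message_id": str(uuid.uuid4()),    # unique message id
            "sending_time": time.time(),        # sending time Si (Ts)
            "interval": TI,                     # interval time
            "node_volume": 0.42,                # e.g. free-space information
        }
        sock.sendto(json.dumps(message).encode(), HBM_ADDR)
        time.sleep(TI)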


[Figure 2: workflow diagram; the monitored node(s), subject to node fail-stop, link crash and network partition, register and, while alive, emit heartbeats from their HB message generators to the data receiver; arrivals with Hi < Hmax are recorded as alive, Hi > Hmax triggers a ping, and a ping timeout confirms the failure.]

Fig. 2. Hybrid Fault Detection workflow diagram

Based on the Hybrid Fault Detection workflow, we developed the algorithm for node failure detection shown in Figure 3.
detect new monitorable node Nx
/* check whether Nx is registered with the index log */
if Nx in NM index then
    update index manager; N++;
/* heartbeat interval exceeds the maximum allocated: maybe a problem */
if Hi > HBmax then
    ping node
    if ping timeout then
        node Nx failed;
        update index manager; N--;
        call fault tolerant service;
    else
        Hi = HBmax;
elseif Hi < HBmax then
    Hi = HBmax;
endif
Fig. 3. Node failure detection algorithm
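For concreteness, the Fig. 3 decision logic could be rendered in Python roughly as follows. This is a sketch under our own assumptions: the ping helper (Unix-style ping; flags vary by platform) and the fault_tolerant_service hook are hypothetical stand-ins for the paper's components.

import subprocess

def ping(ip: str, timeout_s: int = 2) -> bool:
    # True if the node answers a single ICMP echo (Unix-style ping).
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), ip],
        stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def fault_tolerant_service(node_ip: str) -> None:
    print(f"starting recovery for {node_ip}")   # placeholder recovery hook

def check_node(node_ip: str, hi: float, hb_max: float, index: dict) -> None:
    # Apply Fig. 3: ping on an overdue heartbeat, confirm failure on timeout.
    if hi > hb_max:                      # heartbeat interval exceeded HBmax
        if not ping(node_ip):            # ping timeout: node confirmed failed
            index[node_ip] = "failed"    # update the index manager
            fault_tolerant_service(node_ip)
        else:
            index[node_ip] = "active"    # node alive; only its HB messaging failed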


At the HBM server side:
1. A data/message receiver receives the heartbeat messages generated by the message generators. Each node has its own data receiver, placed in a unique storage location in the HB monitor server and identified by node ID. For example, every time node x generates and sends a heartbeat message to the HB monitor, the data receiver folder allocated to it in the HB monitor receives the message.
2. A data collector (DC) fetches data from the data receivers. The DC keeps track of all registered heartbeat messages and records whenever a heartbeat arrives.
3. A detector is responsible for observing the state and actually identifying failed node(s) based on missing heartbeats. Since the detector knows the frequency at which heartbeats are generated by a registered node, it can infer missing heartbeats. It interprets these messages to determine the state (i.e., inactive, active, done, failed, exception) and summarizes this status information. We combine the heartbeat and ping techniques into a hybrid technique because a node that fails to send a heartbeat is not necessarily down; there are other possibilities that can cause a failure to send heartbeats, such as a malfunction of the HB message generator. The detector therefore runs a ping command against the node with missing heartbeats and, upon the ping response, generates the status for the node so that further action can be taken.
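A minimal sketch of the collector's bookkeeping in step 2 is shown below; the storage layout (a per-node list of arrival timestamps keyed by node ID) is our illustrative choice, not prescribed by the paper.

import time
from collections import defaultdict

class DataCollector:
    # Records heartbeat arrivals per node, as in step 2 above.
    def __init__(self) -> None:
        self.arrivals: dict[str, list[float]] = defaultdict(list)

    def record(self, node_id: str) -> None:
        self.arrivals[node_id].append(time.time())

    def last_arrival(self, node_id: str) -> float | None:
        times = self.arrivals[node_id]
        return times[-1] if times else None

collector = DataCollector()
collector.record("nd1")   # called whenever a heartbeat for node nd1 arrives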


2.2 Dynamic Heartbeat Detection Mechanism (DHDM)


This section elaborates in detail the failure detection process from a node to the heartbeat monitor and how DHDM sets the parameters and computes the timing. The nodes and the HB monitor (HBM) server are connected by a communication channel. A node sends heartbeat messages to the HBM, which uses them to draw conclusions about the node's status. The HBM manages a list of inter-arrival times B = {1.083s, 0.968s, 1.062s, 0.993s, 0.942s, 2.037s, ...}, and refers to this list as one of the parameters to calculate the maximum waiting time HBmax, the maximum inter-arrival time allocated for each heartbeat message. Since network bandwidth and CPU load are unpredictable, this list is the key for the HBM to dynamically tailor HBmax: the HBM always dynamically recalculates the HBmax failure detection time, which reduces the false alarm rate. Thus we consider the maximum failure detection time to be equal to the largest value in B:

HBmax = max{b : b in B}    (1)

Along with the list B, the HBM also stores the heartbeat interval time Ti and the checkpoint Tc, the time when the last heartbeat was received. At runtime, the HBM could, for instance, hold the following relevant information, which is the core input for drawing conclusions about a node's status:

B = [1.083s, 0.968s, 1.062s, 0.993s, 0.942s, 2.037s, . . .]
Ti = 1 second
Tc = 2007-10-22 10:06:02.434 (local clock)
Tr = 2007-10-22 10:06:02.919 (local clock)
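Equation (1) and the bounded sample base can be sketched as follows; the window size of 10000 mirrors the n = 10000 example given for Fig. 4, while the class and method names are our own.

from collections import deque

class DynamicHeartbeatMonitor:
    # Tracks inter-arrival times B and recomputes HBmax = max(B) dynamically.
    def __init__(self, window: int = 10000) -> None:
        self.B: deque[float] = deque(maxlen=window)  # oldest sample dropped when full
        self.Tc: float | None = None                 # checkpoint: last receipt time

    def on_heartbeat(self, tr: float) -> None:
        if self.Tc is not None:
            self.B.append(tr - self.Tc)              # inter-arrival time b = Tr - Tc
        self.Tc = tr

    def hb_max(self) -> float:
        # Equation (1): the maximum allocated inter-arrival time
        return max(self.B) if self.B else float("inf")

m = DynamicHeartbeatMonitor()
for t in [0.0, 1.083, 2.051, 3.113, 4.106, 5.048, 7.085]:
    m.on_heartbeat(t)
print(round(m.hb_max(), 3))   # 2.037, the largest observed inter-arrival time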


Ti: the heartbeat interval is the period between each message sent to the heartbeat monitor; normally this parameter is constant. Sending time Ts: the term Tr - Ts reflects the difference between the receipt time and the sending time of a HB message across the network, according to the HBM's local clock. The sending times are mainly influenced by the following environmental circumstances. Network delay: heartbeats sent over the network are affected by network bandwidth and load; therefore, in this work it is necessary to consider the network delay when determining the maximum allocated time HBmax for a node to send a HB. HB message processing delay: a node may send heartbeat messages later than it is supposed to, e.g. due to processing overload or a busy CPU; in many systems the variations of the inter-arrival times due to processing delays are negligible. Based on these data, the HBM generates a suspicion status which indicates whether a node has failed or not.
At the node (generating HB messages to send to the HBM):
- append the sending time Ts and a message id to every outgoing message
- if no message has been sent to the HBM within Ti, send a heartbeat message to the HBM

At the HBM server:
Tc = -1            /* checkpoint; -1 means no heartbeat received yet */
B = nil            /* B is initialised as an empty list */
upon receipt of message mj at time Tr:
    if HBt <= HBmax then          /* HB message is normal */
        if Tc == -1 then
            Tc = Tr
        else
            b = Tr - Tc           /* sampled inter-arrival time; alternatively b = Ti + (Tr - Ts) */
            Tc = Tr
            append b to B
        endif
    else
        HB message suspiciously failed

Fig. 4. The algorithm for the proposed heartbeat model.

Figure 4 presents our heartbeat mechanism algorithm in detail. A node sends heartbeat messages to the HBM every Ti. The HBM manages the list B (the sample base holding the sampled inter-arrival times) along with the checkpoint Tc, which is always set to the time when the last heartbeat was received, and the heartbeat interval Ti. This information is needed to compute a suspicion value. Whenever the HBM receives a heartbeat, it appends b = Tr - Tc (where Tr is the current receipt time) to B and sets Tc = Tr afterwards. Most failure detectors limit the size of B to a certain value n (e.g. n = 10000), causing the oldest sample to be deleted when a new one is inserted into B. In order to save a heartbeat message, our approach needs a consecutive id and a timestamp to be appended to an application message.

2.3 Integrating Ping with the Dynamic Heartbeat Methodology


A node periodically sends messages, which are treated as heartbeats, within the maximum time interval allowed. If the HBM does not receive a message after the time interval expires, the node is suspected of having a fault. However, sometimes a node cannot send the heartbeat message not because of a CPU or network problem but due to a messaging problem in the node or its message generator; in this case the node is active and operational. Thus, integrating the heartbeat mechanism with a pinging service can determine the node's status.
[Figure 5: timeline in which an expected heartbeat times out, log files are written, the HBM pings the node, and the fault is confirmed.]

Fig. 5. Ping with heartbeat mechanism

Figure 5 shows the flow of the fault detection service with the heartbeat mechanism. A node P periodically sends heartbeat messages to the heartbeat monitor (HBM) to report its status, such as 'alive', 'inactive', 'active', 'done' or 'exception'. When the HBM fails to receive a heartbeat message within a certain time interval, it pings the node, as the node is considered to have a problem. If the node does not reply to the ping before the ping timeout, the node is considered to be in a failed state, its status is updated in the Index Server and, finally, the proper recovery action is undertaken.


3 THE PERFORMANCE

In heartbeat performance, the most crucial aspect is the accuracy of the heartbeat reports, that is, the time taken to detect a failure as well as the number of false reports received. In a normal heartbeat operation, heartbeat messages are exchanged with the heartbeat monitor at regular intervals. The equation of a normal heartbeat operation without any failure is given by:

Tn = THM + Ti    (2)

where: i) n is the number of nodes in the grid environment; ii) Ti is the interval between each message sent to the heartbeat monitor; and iii) THM is the time taken for a message to arrive at the heartbeat monitor.
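As a worked example (our own arithmetic, using the parameter values of Table 1 below): with THM = 30 ms and Ti = 100 ms, a normal heartbeat cycle takes Tn = 30 + 100 = 130 ms.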


The THM and Ti parameters are set constant while the number of nodes in the grid environment, n, is varied to measure the effectiveness of the mechanism used in detecting faults. The number of nodes defines the size of the cluster, and it determines the inter-arrival times of the heartbeat messages as well as the time taken to detect a failed grid component.

Table 1 shows the time taken for declaring the failed nodes using the existing heartbeat mechanism. Table 2 shows the overall time taken to detect the failed nodes in a grid environment where the number of available nodes varies from 100 to 1000. We assume that 10% of the available nodes have failed.

TABLE 1. THE TIME TAKEN FOR DETECTING THE FAILED NODES USING THE FDS MECHANISM

            THM    Ti     TW     Tr     T
Time (ms)   30     100    100    20     250

TABLE 2. THE VALUE OF N NODES AND THE TIME TAKEN TO DETECT FAULTS USING THE FDS

n             100     200     500      600      700      1000
10% failure   10      20      50       60       70       100
T (ms)        25000   50000   125000   150000   175000   250000

3.1 The FDS Performance


The FDS performance and architecture for failure detection are illustrated in Figure 6.




[Figure 6: timeline of the existing heartbeat mechanism; node P sends heartbeats to the HM at interval Ti, and after THM, Tr and the waiting period TW a missing heartbeat moves the node from suspicious to confirmed failed.]

Fig. 6. Existing heartbeat mechanism


From Fig. 6, the application periodically sends heartbeat messages to the HM. If the HM does not receive the message within the time interval, the application is considered to have a problem, and the HM waits for the next interval. If the HM receives the message in the next interval, the Index Server is updated with the new status of the monitored node. However, if the HM does not receive the message after the second time interval, the application is declared failed. This takes a longer time to detect failures and to prove that the node is faulty. The equation of the FDS to detect a failure is given by:


Tn = THM + Ti + Tr + TW    (3)

where:


i) n is the number of nodes in the grid environment; ii) THM is the time taken for a message to arrive at the heartbeat monitor; iii) Ti is the interval between each message sent to the heartbeat monitor; iv) Tr is the timeout after which the heartbeat monitor realizes that it has not received the message from the node; v) TW is the waiting time of the heartbeat monitor before declaring that the node is dead; and vi) Tn is the time taken by the heartbeat monitor to declare that a node is dead.
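As a worked example (our own arithmetic, using the Table 1 values): Tn = 30 + 100 + 20 + 100 = 250 ms for a single node, and the totals reported in Table 2 correspond to n x Tn; for instance, for n = 700, 700 x 250 ms = 175000 ms.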

3.2 The Performance of the Proposed Model

Figure 7 shows the proposed mechanism, the dynamic hybrid fault detection methodology (DHFDM), which integrates the heartbeat mechanism with the ping mechanism in order to shorten the time taken to detect failed node(s).

[Figure 7: timeline of the proposed mechanism; after an expected heartbeat times out (THM, Ti, Tr), the HM pings the node and, on ping timeout, the fault is confirmed and logged.]

Fig. 7. Proposed heartbeat mechanism

The equation of the proposed model is given by:

Tn = THM + Ti + Tr + TP    (4)

where:


i) n is the number of nodes in the grid environment; ii) THM is the time taken for a message to arrive at the heartbeat monitor; iii) Ti is the interval between each message sent to the heartbeat monitor; iv) Tr is the timeout after which the heartbeat monitor realizes



that it has not received the message from the node; v) TP is the time taken by the heartbeat monitor to ping the suspected node; and vi) Tn is the time taken by the heartbeat monitor to declare that a node is dead.

Table 3 shows the time taken for declaring the failed nodes using DHFDM. Table 4 shows the overall time taken to detect the failed nodes in a grid environment where the number of available nodes varies from 100 to 1000. We assume that 10% of the available nodes have failed.

TABLE 3. THE TIME TAKEN FOR DETECTING THE FAILED NODES USING THE PROPOSED MECHANISM

            THM    Ti     Tr    TP    T
Time (ms)   30     100    20    24    174

TABLE 4. THE VALUE OF N NODES AND THE TIME TAKEN TO DETECT FAULTS USING THE DHFDM

n             100     200     500     600      700      1000
10% failure   10      20      50      60       70       100
T (ms)        17400   34800   87000   104400   121800   174000
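As a worked check (our own arithmetic, using the Table 3 values): Tn = 30 + 100 + 20 + 24 = 174 ms for a single node, and the totals in Table 4 correspond to n x Tn; for instance, for n = 700, 700 x 174 ms = 121800 ms, approximately 30% less than the 250 ms per node of the FDS.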

3.3 Performance Comparison

The comparison between the existing mechanism and the proposed mechanism is shown in Figure 8.

[Figure 8: plot of the time taken (ms) against the number of nodes (up to 1000) for FDS (existing mechanism) and DHFDM (proposed model).]

Fig. 8. The time taken for the DHFDM and the FDS mechanisms to detect 10% failed nodes versus the number of nodes available.

From Figure 8, when the number of nodes increases, the time taken to detect the failed nodes increases. However, DHFDM requires less time than FDS because, for each suspected failed node, DHFDM does not wait for the next interval; instead, DHFDM pings the node to force it to respond, whereas the FDS mechanism has to wait for the second interval to confirm the failure state. For example, when the number of nodes is 700, the DHFDM model needs only 121800 ms of processing time while the FDS model needs 175000 ms. The results show that the DHFDM model provides an efficiency gain of up to approximately 30% compared to the FDS model.

4 RESEARCH FINDINGS
The Dynamic-Hybrid fault detection methodology is designed for monitoring large distributed systems such as grid resources, middleware, applications and processes simultaneously. In this paper, the model integrates the heartbeat monitor with a pinging process and runs as a low-level service, so it does not affect system protocols and policies, thereby avoiding future problems. The proposed fault detection methodology has been simulated in a mathematical model, and it is shown that the model improves the communication overhead by shortening the time taken to detect a fault in a distributed environment as well as reducing the number of false alarms due to heartbeat communication delay. The overall findings are as follows: i) By considering a list of inter-arrival history data, the HBM dynamically recalculates the HBmax failure detection time, which reduces the false alarm rate. ii) Sometimes a node cannot send a heartbeat message due to a HB generator problem; in this case the node is active and operational. Integrating the heartbeat mechanism with a pinging service can determine the node's status, which also reduces the detection time and, again, the false alarm rate. iii) The proposed model is capable of improving the communication overhead by reducing the time taken to detect a fault. iv) The indexing of registered applications via the AR allows the construction of a reliable recovery mechanism, since the Index Server (IS) provides the required information to perform the recovery service for failed-state applications.


5 CONCLUSION

This paper demonstrates a methodology for detecting faults in a distributed computing environment as a basis for providing a scalable, highly available and reliable grid environment. The benchmarking has been conducted using the existing model, FDS, and the proposed model, DHFDM, with the same pre-defined parameters. The evaluation outcome shows that DHFDM is capable of


detecting faults, reducing communication overhead, and serving as a focal point for integrating fault detection with customizable recovery actions. Even though our focus is on fail-stop failures, this model is capable of detecting a large variety of failures such as process crash/hang failures, host crashes and link crashes. Developing and running applications in a large distributed environment poses major challenges due to the various failures encountered during execution. These failures are unpredictable, and the combination of all of them obscures fault detection and recovery actions. In detecting faults, the most important thing is to detect the failed nodes immediately, as delaying the detection delays the recovery process and the system may be down for a considerable period of time. The proposed fault detection service reduces the time taken to detect faults and provides a basis for recovery actions to be taken, as it stores all the node information in an index form in an index server.

REFERENCES

[1] Fiaz Gul Khan, Kalim Qureshi and Babar Nazir, "Performance evaluation of fault tolerance techniques in grid computing system," Computers & Electrical Engineering, vol. 36, no. 6, pp. 1110-1122, November 2010, Elsevier B.V.
[2] Jesús Montes, Alberto Sánchez and María S. Pérez, "Improving Grid Fault Tolerance by Means of Global Behavior Modeling," Ninth International Symposium on Parallel and Distributed Computing, pp. 101-108, 2010.
[3] Alexandru Costan, Ciprian Dobre, Florin Pop, Catalin Leordeanu and Valentin Cristea, "A fault tolerance approach for distributed systems using monitoring based replication," Proceedings of the 2010 IEEE 6th International Conference on Intelligent Computer Communication and Processing, pp. 451-458, 2010.
[4] Adrian Boteanu, Ciprian Dobre, Florin Pop and Valentin Cristea, "Simulator for fault tolerance in large scale distributed systems," Proceedings of the 2010 IEEE 6th International Conference on Intelligent Computer Communication and Processing, pp. 443-450, 2010.
[5] Naixue Xiong, Yan Yang, Ming Cao, Jing He and Lei Shu, "A Survey on Fault-Tolerance in Distributed Network Systems," 2009 International Conference on Computational Science and Engineering, vol. 2, pp. 1065-1070, 2009.
[6] Magdalena Balazinska, Jeong-Hyon Hwang and Mehul A. Shah, "Fault-Tolerance and High Availability in Data Stream Management Systems," Encyclopedia of Database Systems, Part 6, pp. 1109-1115, Springer, 2009.
[7] S. Hwang and C. Kesselman, "A Flexible Framework for Fault Tolerance in the Grid," Journal of Grid Computing, vol. 1, pp. 251-272, 2003.
[8] Benjamin Satzger, Andreas Pietzowski, Wolfgang Trumler and Theo Ungerer, "A Lazy Monitoring Approach for Heartbeat-Style Failure Detectors," 2008 Third International Conference on Availability, Reliability and Security, pp. 404-409, 2008.
[9] K. Mills, S. Rose, S. Quirolgico, M. Britton and C. Tan, in WOSP '04: Proceedings of the 4th International Workshop on Software and Performance, ACM, New York, NY, USA, 2004.
[10] C. Dabrowski, K. Mills and A. Rukhin, "Performance of Service-Discovery Architectures in Response to Node Failures," Proceedings of the 2003 International Conference on Software Engineering Research and Practice (SERP'03), CSREA Press, pp. 95-101, June 2003.
[11] Marcia Pasin, Stéphane Fontaine and Sara Bouchenak, "Failure Detection in Large-Scale Distributed Systems: A Survey," 6th IEEE Workshop on End-to-End Monitoring Techniques and Services (E2EMon 2008), Salvador, Bahia, Brazil, April 2008.
[12] Flavio Junqueira, "Coping with dependent failures in distributed systems," Ph.D. dissertation, University of California, San Diego, 2006, 301 pages, AAT 3208277.
[13] Matthew Gillen, Kurt Rohloff, Prakash Manghwani and Richard Schantz, "Scalable, Adaptive, Time-Bounded Node Failure Detection," Proceedings of the 10th IEEE High Assurance Systems Engineering Symposium (HASE '07), IEEE Computer Society, Washington, DC, USA, 2007.
[14] P. Stelling, I. Foster, C. Kesselman, C. Lee and G. von Laszewski, "A Fault Detection Service for Wide Area Distributed Computations," Proceedings of HPDC, pp. 268-278, 1998.
[15] Soonwook Hwang, "A Generic Failure Detection Service for the Grid," Ph.D. thesis, University of Southern California, 2003.
[16] R. van Renesse, Y. Minsky and M. Hayden, "A Gossip-Style Failure Detection Service," Technical Report TR98-1687, 1998.
[17] J. H. Abawajy and S. P. Dandamudi, "A Reconfigurable Multi-Layered Grid Scheduling Infrastructure," Proceedings of PDPTA'03, pp. 138-144, 2003.
[18] J. H. Abawajy, "Fault Detection Service Architecture for Grid Computing Systems," pp. 107-108, 2004.
[19] K. C. W. So and E. G. Sirer, "Latency and Bandwidth-Minimizing Failure Detectors," Proceedings of EuroSys 2007.
[20] X. Défago et al., "On the Design of a Failure Detection Service for Large Scale Distributed Systems," Proceedings of PBit 2003.
[21] L. Falai and A. Bondavalli, "Experimental Evaluation of the QoS of Failure Detectors on WAN," Proceedings of DSN 2005.
[22] I. Gupta, T. D. Chandra and G. S. Goldszmidt, "On Scalable and Efficient Distributed Failure Detectors," Proceedings of the ACM Symposium on PODC, August 2001.
[23] K. C. W. So and E. G. Sirer, "Latency and Bandwidth-Minimizing Failure Detectors," Proceedings of EuroSys 2007.
[24] S. Zhuang et al., "On Failure Detection Algorithms in Overlay Networks," Proceedings of the IEEE INFOCOM Conference 2005.
[25] N. Hayashibara, A. Cherif and T. Katayama, "Failure Detectors for Large Scale Systems," Proceedings of SRDS 2002.

Ahmad Shukri Mohd Noor received a B.Sc. in Computer Science from Coventry University in 1997 and an M.Sc. in High Performance Systems from University College of Science and Technology Malaysia. He is currently a lecturer in the Department of Computer Science, Faculty of Science and Technology, Universiti Malaysia Terengganu (UMT), and a Ph.D. candidate in computer science at UTHM. His research interests include distributed computing, data grids and application system development.

Dr. Mustafa Mat Deris received a B.Sc. from University Putra Malaysia in 1982, an M.Sc. from the University of Bradford, England, in 1989, and a Ph.D. from University Putra Malaysia in 2001. He is a professor of computer science in the Faculty of Information Technology and Multimedia, UTHM, Malaysia. His research interests include distributed databases, data grids, database performance issues and data mining. He has been appointed to the editorial boards of the International Journal of Information Technology (World Enformatika Society), the International Journal of Applied Science, Engineering and Technology (IJSET), the International Journal of Electrical, Computer, and Systems Engineering (IJECSE), the International Journal of Applied Mathematics and Computer Sciences (IJAMCS) and the Encyclopedia on Mobile Computing and Commerce, Idea Group, USA. He is also a reviewer for several international journals, such as the Journal of Parallel and Distributed Databases.
