
Fault Tolerant Algorithm for Replication Management in Distributed Cloud System


Rimmy Yadav
Research Scholar
Department of Computer Applications
Lovely Professional University, Phagwara, Punjab
Rimmy_yadav@ymail.com

Dr. Avtar Singh Sidhu
Dept. of Computer Science & IT
G.G.S Khalsa College, Sarhali, Taran Taran, India
avtarsidhu22000@gmail.com

Abstract— Distributed cloud based systems are pushing traditional processing systems into the background because of their increasing popularity. Fault tolerance is a rapidly growing challenge in distributed cloud based systems. Fault tolerance is the ability of a system to perform its function even in the presence of failures. A number of models and algorithms have been developed to make distributed cloud based systems fault tolerant. In this paper, a fault tolerant algorithm for replication management is developed to make the distributed cloud environment fault free, and it is considered to be very effective. The Fault Detector and Replication Manager (FDRM) finds the failed server at a particular distributed location and allocates a replicated server to accomplish the tasks.

Index Terms—Distributed Cloud System, Fault Tolerance, FDRM, Fault Tolerant Algorithm, Job Migration.

I. INTRODUCTION

Distributed cloud based frameworks are gaining increasing popularity over conventional processing systems. One of their conspicuous advantages is that a distributed system has the capacity to handle complex problems requiring heavy computation by dividing them into smaller problems. A distributed cloud based system provides various facilities to its intended clients, such as data, hardware, software and resource sharing; this distribution is not the same as the distribution of files. In addition, data and services are generally replicated to provide scalable and efficient solutions to numerous clients at different locations.

A distributed framework also helps to exploit parallelism to accelerate the execution of computation-hungry requests, for example neural-network training or other system modelling. Another advantage of distributed systems is that they mirror the global professional and social environment in which we live and work.

Fault tolerance is a rapidly growing challenge in distributed cloud based systems. Fault tolerance is the ability of a system to perform its role even in the presence of failures. Numerous algorithms and models have been developed by researchers to make distributed cloud based systems fault tolerant. Chandrasekar and Srimani [1] developed an algorithm based on self-stabilization which provides a built-in safeguard against any failures that occur in the distributed architecture. Hosseini and Kuhl [2] proposed a model in which the network nodes test certain other network facilities for the presence of failures. Rao and Malladi [3] developed and analysed an algorithm for building a fault tolerant distributed system; check-pointing and message logging methods are used in their work. Al-Jaroodi et al. [4] developed a delay tolerant and fault tolerant algorithm to handle the workload and the faults that occur in the distributed cloud milieu [5]. A priority based algorithm has also been developed to find the defective or faulty processors in a distributed system; in this algorithm messages are propagated in the backward direction to determine node failure information, and non-faulty processors send the information about failed nodes to other non-faulty processors only in the backward direction [6][7][8]. Two further algorithms deal with hardware failures occurring in the grid environment; they use a message passing technique to check whether the nodes are in a safe state, and this technique also helps to drop the failed server from the underlying architecture [9].

In this paper, a fault tolerant algorithm for replication management is developed; it is an effective algorithm that helps to make the distributed cloud environment fault free. The fault tolerant algorithm and the proposed architecture provide fault-free user transparency at a fast and effective level. The purpose of the FDRM is to find the unsuccessful server at a particular distributed location and allocate a replicated server to accomplish the tasks.

II. RELATED WORK

Bheevgade and Patrikar [7] developed a watchdog timer and the Sanchita algorithm to deal with faults in grid systems. In the Sanchita algorithm, the states of the running jobs are collected at a central (master) node and intermediate results are stored. Sapre, Garje and Mesharm [8] developed a built-in, user-transparent error detection mechanism covering processor crashes, node crashes and hardware failures. Li and McMillin [6] developed a priority based probe algorithm which works well for finding the faulty processors in a system; messages are propagated by the non-faulty processors in the backward direction to determine the faulty processors. Al-Jaroodi, Mohamed and Al Nuaimi [4] developed a delay-tolerant, fault-tolerant algorithm to minimize the communication delay caused by network failures or server failures in distributed cloud services.

Malladi and Rao [3] make use of check-pointing and message logging methods: an efficient coordinated check-pointing protocol is combined with limited sender-based pessimistic message logging. Zheng and Lyu [9] proposed an approach in which nodes test other network nodes to observe the vicinity of a fault; the motivation behind this model is a distributed algorithm which tries to permit all the network nodes to correctly and independently diagnose the condition, faulty or non-faulty, of all the network nodes and of the intermediate nodes and their communication facilities. Girault et al. [10] proposed a solution to automatically produce a fault tolerant distributed schedule of a given algorithm onto a given distributed architecture, where each operation of the algorithm is replicated on different processors. Calheiros and Buyya [11] proposed an algorithm that uses the idle time of provisioned resources and the budget surplus to replicate tasks. Zhou and Jiang [12] proposed a composite self-adaptable hierarchical fault tolerant scheme which effectively integrates and expands the ideas of centralized and distributed fault tolerant methods; it executes system fault tolerance in the sequence of 'first centralized then distributed'. Altaf et al. [13] developed a supervised Artificial Neural Network (ANN) approach to identify various fault types, such as Broken Rotor Bar (BRB), as well as the location of the fault event within an industrial motor network.

III. PROPOSED ARCHITECTURE

Communication in the distributed cloud environment is accomplished with the help of service oriented architecture protocols, the Common Object Request Broker Architecture (CORBA), etc. These techniques are deployed at the middleware layer of the distributed cloud environment. Fig. 1 shows the case in which the number of cloud clients grows dynamically; a client may be a single cloud user, a business organization or several users, and the resource requirements may be the same or different depending on how the cloud is used as a service.

In Fig. 1, while handling thousands of users, the users' requests are taken by the active server. In this case the replication mechanism is used: the cloud clients' application processes are copied and allocated to other servers, and by doing this load balancing can be handled easily. Various technologies have been developed for this purpose, such as the Hadoop MapReduce function [4], whose purpose is to divide the tasks and allocate them to other servers located at different locations; the Hadoop MapReduce function is implemented at the middleware layer of the distributed environment. However, if the demand and the number of users increase dynamically, there are chances of failure, mostly server failures, such as hard disk failures, Redundant Array of Independent Disks (RAID) failures, memory module failures, component replacements, etc. One of the major failures found in the distributed cloud environment is the hard disk failure.

To provide an effective level of user transparency even in the presence of failures, the FDRM shown in Fig. 2 helps to detect the failed server in the distributed cloud environment. The job migration policy adopted by the FDRM helps to migrate the resumed tasks, which were hampered during the failure, to the replicated server. In Fig. 2 an FDRM is deployed at the particular distributed location where the servers and replicated servers are working. The FDRM communicates with all the participating servers and replicated servers. The FDRM first sends an acknowledgment to all the servers; the message requests information such as the aliveness or status of the servers, and this acknowledgment is sent at regular intervals of time.

In response, a message is delivered by all the participating servers or replicated servers. In case a server does not send the aliveness message to the FDRM, the FDRM again sends an acknowledgment to all the servers at that distributed location in order to determine the state of the servers. If the server still does not send the message to the FDRM, the FDRM assumes a fault in that server; to deal with it, the failed server is dropped by the FDRM and a replicated server is allocated to the cloud users. If a replicated server fails, another replicated server is allocated to accomplish the tasks requested by the cloud clients. The FDRM uses the check-pointing technique and the job migration policy to make the distributed environment fault free and effective. In order to determine the failed server, the FDRM saves the address (state) of the server or the replicated server before and after the failure occurs. Job migration then helps to allocate the replicated server in place of the failed server.

Fig. 1. Cloud communication with multiple clients without FDRM.
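The acknowledgment exchange and failover behaviour described above is essentially a periodic heartbeat protocol. The following Python sketch illustrates one way such an FDRM loop could be organized; the class layout and the assumed server interface (ping, resume_from, id) are illustrative assumptions and not code from the paper.

```python
import time

class FDRM:
    """Minimal sketch of the Fault Detector and Replication Manager loop.
    The server objects and their 'ping'/'resume_from' methods are assumed."""

    def __init__(self, servers, replicas, interval=5.0, retries=1):
        self.servers = list(servers)      # active servers at this location
        self.replicas = list(replicas)    # spare replicated servers
        self.interval = interval          # seconds between acknowledgments
        self.retries = retries            # extra probes before declaring a fault
        self.master_file = {}             # saved states, keyed by server id

    def probe(self, server):
        """Send an acknowledgment and wait for the aliveness message."""
        try:
            return server.ping(timeout=self.interval)   # assumed API
        except Exception:
            return None

    def monitor_once(self):
        for server in list(self.servers):
            status = self.probe(server)
            if status is None:
                # Re-probe before deciding the server has failed.
                for _ in range(self.retries):
                    status = self.probe(server)
                    if status is not None:
                        break
            if status is None:
                self.handle_failure(server)
            else:
                # Save the reported state in the FDRM's master file.
                self.master_file[server.id] = status

    def handle_failure(self, failed):
        """Drop the failed server and allocate a replicated server."""
        checkpoint = self.master_file.get(failed.id)    # last saved state
        self.servers.remove(failed)
        if self.replicas:
            replica = self.replicas.pop(0)
            replica.resume_from(checkpoint)             # assumed job-migration hook
            self.servers.append(replica)

    def run(self):
        while True:
            self.monitor_once()
            time.sleep(self.interval)
```

In this reading, the acknowledgment sent at regular intervals maps to monitor_once, and the job migration policy maps to handle_failure, which hands the last saved state to a spare replica.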

Fig. 2. Cloud communication with multiple clients involving FDRM.

IV. PROPOSED ALGORITHM

The following algorithm helps to provide user transparency in spite of failure occurrences in the distributed cloud environment. The algorithm is given below:

If (Number of cloud clients is less)
{
    Allocate one cloud site to them.
}
Else
{
    Make replicas of user required applications.
    Copy and allocate these replicas to other servers to manage several clients.
}
//FDRM sends an acknowledgment to the participating servers and replicated servers.
If (Server/Replicated servers send messages in response to the acknowledgment)
{
    Save the states of the Server/Replicated server in FDRM's master file at regular intervals.
}
If (Server/Replicated server != Responding)
{
    Again FDRM sends an acknowledgment to all the participating servers/replicated servers.
    If (Server/Replicated server != Responding)
    {
        Save the address (state) of the failed server using the check-pointing mechanism. Drop that failed server.
        Allocate the free replicated server to the resuming user's application request.
    }
    Else
    {
        Continue to process the cloud client's needs.
    }
}

To detect the server failure, the actual implementation of the Fault Detector Algorithm (FDA) in this paper must be considered. The application is resumed on an available server/replicated server from the previously saved state data. The dependability of the application improves with the help of the FDA, but due to the extra computation overhead the performance suffers slightly. The need to store the data and the actual state of the server during the occurrence of a fault in the distributed cloud system environment is the other main disadvantage of the FDA. Some solutions to the above mentioned problems are suggested below:

• If the number of cloud clients is small, then it is beneficial to use one cloud site or one server at a time.
• In case of a dynamically increasing number of cloud clients, replicas of the user application processes are made and copied to other servers to make them replicated servers. The purpose behind using the FDRM is to maintain load balancing in the distributed cloud environment.
• Each server/replicated server in a particular distributed location must send an acknowledgment (aliveness message) to the FDRM deployed in that distributed location. This message contains information such as the current status, i.e. whether all the replicated servers are active or in working condition.
• If any of the servers does not send an active message to the FDRM, the fault detector assumes that a fault has occurred in that particular server.
• If the FDRM does not receive the aliveness message from any of its replicated servers, it allocates another replicated server at that particular distributed location for each failure it detects.

Fig. 3 depicts a set of nodes connected to each other via bidirectional edges. Node 0 is represented as a local node or client node which requires access to the resources provided by the distributed cloud. Node 4 is represented and works as the FDRM.
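The check-pointing and resume step used by the algorithm above can be pictured with a short sketch. The following Python illustration assumes a simple in-memory master file; CheckpointStore, resume_on_replica and the replica's restart method are invented names for illustration and do not appear in the paper.

```python
import json
import time

class CheckpointStore:
    """Sketch of the FDRM's master file: the last reported state of each
    server, saved at regular intervals (kept in memory here for simplicity)."""

    def __init__(self):
        self._states = {}

    def save(self, server_id, state):
        # Record the state together with the time it was captured.
        self._states[server_id] = {"state": state, "saved_at": time.time()}

    def last_state(self, server_id):
        entry = self._states.get(server_id)
        return entry["state"] if entry else None


def resume_on_replica(store, failed_server_id, replica):
    """Resume the hampered application on a replicated server from the
    previously saved state (the job-migration step of the algorithm)."""
    checkpoint = store.last_state(failed_server_id)
    if checkpoint is None:
        raise RuntimeError("no checkpoint available for %s" % failed_server_id)
    replica.restart(from_state=checkpoint)   # 'restart' is an assumed method
    return replica


# Example usage with a stand-in replica object.
class DummyReplica:
    def restart(self, from_state):
        print("resuming from:", json.dumps(from_state))

store = CheckpointStore()
store.save("server-2", {"pending_jobs": 3, "last_request": "req-41"})
resume_on_replica(store, "server-2", DummyReplica())
```

This also makes the stated disadvantage concrete: every save call costs storage and time, which is the computation overhead attributed to the FDA.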

Fig. 3. Set of connected nodes with FDRM.

In Fig. 4, to check the aliveness of all the participating nodes and servers, the FDRM sends them the acknowledgement, but not to node 0, because node 0 is the local node and it only requires access to the resources provided by the cloud environment.

Fig. 4. Acknowledgment sent by the FDRM.

Fig. 5. Messages sent to the FDRM in response to the acknowledgment.

In Fig. 5 all the nodes/servers respond to the FDRM by sending it a message to declare their aliveness. In Fig. 6, suppose node 2 fails before it sends the notification to the FDRM regarding its aliveness.

Fig. 6. Failure of node 2.

In Fig. 7 node 2 sends its data in duplicate form to its nearest adjacent nodes, i.e. nodes 1, 6 and 3, so that the FDRM can get the data from nodes 1, 6 and 3 while node 2 is in failure.

Fig. 7. Node 2 sending its data in replication to its adjacent nodes, i.e. nodes 1, 6, and 3.
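The behaviour of Figs. 6 and 7, where a failing node pushes a copy of its data to its neighbours so that the FDRM can still recover it, can be sketched as below. The seven-node edge set and the helper names are illustrative assumptions based on the figures, not code from the paper.

```python
# Bidirectional topology assumed from Fig. 3: node 0 is the client,
# node 4 acts as the FDRM, the remaining nodes are servers/replicated servers.
# The exact edges are an assumption for illustration.
TOPOLOGY = {
    0: [1, 4],
    1: [0, 2, 4],
    2: [1, 3, 6],
    3: [2, 4, 6],
    4: [0, 1, 3, 5],
    5: [4, 6],
    6: [2, 3, 5],
}

node_data = {n: {"payload": "data-%d" % n} for n in TOPOLOGY}
replicas_held = {n: {} for n in TOPOLOGY}   # copies a node holds for its neighbours


def replicate_to_neighbours(node):
    """Before failing, a node copies its data to every adjacent node,
    as node 2 does towards nodes 1, 6 and 3 in Fig. 7."""
    for neighbour in TOPOLOGY[node]:
        replicas_held[neighbour][node] = dict(node_data[node])


def recover_from_neighbours(failed_node):
    """The FDRM asks the failed node's neighbours for the replicated copy."""
    for neighbour in TOPOLOGY[failed_node]:
        copy = replicas_held[neighbour].get(failed_node)
        if copy is not None:
            return neighbour, copy
    return None, None


replicate_to_neighbours(2)                  # node 2 pushes its data outwards
source, data = recover_from_neighbours(2)   # e.g. node 3 supplies the copy
print("recovered node 2's data from node", source, ":", data)
```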

In Fig. 8 node 3 is sending the replicated data of node 2: the most nearby node, i.e. node 3, forwards the replicated data of node 2 to the FDRM.

In Fig. 9 node 4, i.e. the FDRM, is assumed to have failed. In this case node 4 sends the replication of its data to all nodes except node 0. The FDRM contains the replicated copies of the data of all participating nodes except node 0, so before failing it first sends the replicated data of itself and the replicated data of all the other nodes that it contains. Thus, when the FDRM fails, the other nodes perform the role of the FDRM themselves.

Fig. 8. Node 3 sending replicated data of node 2 to FDRM.

Fig. 9. Failure of FDRM.

V. IMPLEMENTATION

In this section we describe the simulation setup used to verify the fault tolerant algorithm for the distributed cloud environment. We conducted the simulation of the proposed scheme using Network Simulator version 2.0.

A. Simulation Model

Our simulation model consists of 6 nodes deployed in the distributed environment and connected bidirectionally.

B. Simulation Procedure

Simulation inputs for the proposed scheme are as follows: number of nodes (N) = 6, number of sink nodes = 1, packet sizes (P) = 64 bits, 128 bits, 512 bits, 1024 bits and so on, transmission of data in bits/sec. A rough sketch of this procedure is given at the end of this section.

Begin
1) Deploy the number of nodes randomly as in the distributed cloud environment.
2) Apply the proposed scheme to control the topology of the distributed cloud.
3) Compute the performance parameters.
4) Generate the graphs.
End

C. Performance Parameters

We have used the following parameters to evaluate the performance of our proposed fault tolerant scheme.

• Probability of Fault Tolerance: it measures how well the system is able to continue operating properly even in the presence of failures in the distributed cloud environment.
• Time Computation: as the number of nodes increases, the percentage of computation time increases in the topology of the distributed cloud environment.

D. Results and Discussion

Fig. 10 shows that the probability of fault tolerance varies with the number of nodes. By embedding the FDRM in the distributed cloud, the probability of faulty nodes is 10%, 20% and 30% respectively out of the 6 nodes, and the probability of fault tolerance then decreases accordingly. The time complexity of the proposed framework with the defined topology is O(n^2).

E. Conclusion

To make the distributed cloud environment more reliable and user transparent, an efficient algorithm, i.e. the fault detector and replication manager (FDRM), is proposed. The FDRM in the proposed architecture helps to manage the replication of data during failures, as shown in Fig. 10. The probability of fault tolerance is about 10% by employing the FDRM in the proposed architecture.
REFERENCES
[1] S. Chandrasekar and P. K. Srimani, “A new fault tolerant
distributed algorithm for longest paths in a DAG,” in Software
Reliability Engineering, 1993. Proceedings. Fourth International
Symposium on, 1993, pp. 202–206.
[2] S. H. Hosseini, J. G. Kuhl, and S. M. Reddy, “A diagnosis
algorithm for distributed computing systems with dynamic failure
and repair,” Computers, IEEE Transactions on, vol. 100, no. 3, pp.
223–233, 1984.
[3] U. Malladi, “Notice of Violation of IEEE Publication Principles
Design, analysis and performance evaluation of a new algorithm for
developing a fault tolerant distributed system,” in Parallel and
Distributed Systems, 2006. ICPADS 2006. 12th International
Conference on, 2006, vol. 1, p. 10–pp.
[4] J. Al-Jaroodi, N. Mohamed, and K. A. Nuaimi, “An efficient
fault-tolerant algorithm for distributed cloud services,” in Network
Cloud Computing and Applications(NCCA), 2012 Second Symposium
on, 2012, pp. 1–8.
[5] S. Hariri, A. Choudhary, and B. Sarikaya, “Architectural support
for designing fault-tolerant open distributed systems,” Computer, vol.
25, no. 6, pp. 50–62, 1992.
[6] P.-Y. Li and B. McMillin, “Fault-tolerant distributed deadlock
detection/resolution,” in Computer Software and Applications
Conference, 1993. COMPSAC 93. Proceedings., Seventeenth Annual
International, 1993, pp. 224–230.
[7] M. B. Bheevgade and R. M. Patrikar, Implementation of Fault
Tolerance Techniques for Grid Systems. INTECH Open Access
Publisher, 2009.
[8] B. Sapre, A. Garje, and B. B. Mesharm, “Fault Tolerant
Environment Using Hardware Failure Detection, Roll Forward
Recovery Approach and Micro rebooting For Distributed Systems”.
[9] Z. Zheng and M. R. Lyu, “A distributed replication strategy
evaluation and selection framework for fault tolerant web services,”
in Web Services, 2008. ICWS’08. IEEE International Conference on,
2008, pp. 145–152.
[10] A. Girault, H. Kalla, and Y. Sorel, “A scheduling heuristics for
distributed real-time embedded systems tolerant to processor and
communication media failures,” International Journal of Production
Research, vol. 42, no. 14, pp. 2877–2898, 2004.
[11] R. N. Calheiros and R. Buyya, “Meeting deadlines of scientific
workflows in public clouds with tasks replication,” Parallel and
Distributed Systems, IEEE Transactions on, vol. 25, no. 7, pp. 1787–
1796, 2014.
[12] H. Zhou and J. Jiang, “CSHFt: A Composite Fault- Tolerant
Architecture and Self-Adaptable Hierarchical Fault-Tolerant Strategy
for Satellite System,” in Distributed Computing and Applications to
Business, Engineering and Science (DCABES), 2011 Tenth
International Symposium on, 2011, pp. 333–337.
[13] S. Altaf, A. Al-Anbuky, and H. GholamHosseini, “Fault
diagnosis in a distributed motor network using Artificial Neural
Network,” in Power Electronics, Electrical Drives, Automation and
Motion (SPEEDAM), 2014 International Symposium on, 2014, pp.
190–197.

