Automating the fault tolerance process in Grid Environment

Published by: ijcsis on Nov 02, 2010
Copyright:Attribution Non-commercial

Inderpreet Chopra
Research Scholar, Thapar University, Computer Science Department, Patiala, India
inderpreet@thapar.edu
Maninder Singh
Associate Professor, Thapar University, Computer Science Department, Patiala, India
msingh@thapar.edu
Abstract:
Grids encourage the dynamic addition of resources, which are unlikely to benefit from manual management techniques because these are time-consuming, insecure and error-prone. A new paradigm of self-management is replacing the old manual system to begin the next generation of computing. In this paper we discuss the different self-healing approaches that current grid middleware uses and, after analyzing them, propose a new approach, the Self-healing Management Unit (SMU), which provides an automated way of dealing with failures.
Keywords: SMU, heartbeat
1. Introduction
In recent years the Grid, which facilitates the sharing and integration of large-scale, heterogeneous resources, has been widely recognized as the future framework of distributed computing [1, 2]. However, the increasing complexity of Grid services and systems demands correspondingly greater human effort for system configuration and performance management. These tasks are mostly done manually today, which makes them time-consuming, error-prone and even unmanageable for human administrators. Autonomic computing [4, 5], presented and advocated by IBM, suggests a desirable solution to this problem. The vision of autonomic computing is to design and build computing systems that possess inherent self-managing capabilities [6]. This paper describes a self-healing model, the SMU (Self-healing Management Unit), which applies autonomic computing to raise the level of automation and self-management in Grid computing systems well beyond what it is today. The SMU aims to:
- keep track of job submission and execution
- recover lost jobs
- administer the complexity of the grid
- check the efficiency of resource discovery
- keep all the services running.
2. Self-Healing Mechanisms
Self-healing [7] is the ability of a system to recover from faults that might cause some parts of it to malfunction. For a system to be self-healing, it must be able to recover from a failed component by first detecting and isolating the failed component, taking it offline, fixing it, and reintroducing the fixed or replacement component into service without any apparent overall disruption. A self-healing system also needs to predict problems and take actions to prevent a failure from having an impact on applications. The self-healing objective must be to minimize all outages in order to keep the system up and available at all times.
(IJCSIS) International Journal of Computer Science and Information Security, Vol. 8, No. 7, October 2010, http://sites.google.com/site/ijcsis/, ISSN 1947-5500
Faults in grids can occur for many different reasons. Some of the reasons we were able to identify are as follows:
- Hardware Faults: Hardware failures take place due to faulty hardware components such as CPU, memory, and storage devices [8].
- Application and Operating System Faults: Application and operating system failures occur due to application- or operating-system-specific faults such as memory leaks, deadlocks and inefficient resource management.
- Network Faults: In a grid, computing resources are connected over multiple, different types of distributed networks. As a result, physical damage or operational faults in the network are more likely [9]. The network may exhibit significant packet loss or packet corruption. Moreover, individual nodes in the network, or the whole network, may go down.
- Software Faults: Several highly resource-intensive applications run on the grid to perform particular tasks. Software failures such as unhandled exceptions or unexpected input can occur while these applications are running.
In addition to ad hoc mechanisms based on user complaints and log-file analysis, grid users have used automatic ways to deal with failures in their Grid environment. Various fault-tolerance mechanisms exist for this purpose. Some of these self-healing mechanisms are:
Application-dependent: Grids are increasingly used for applications requiring high levels of performance and reliability, so the ability to tolerate failures while exploiting resources effectively in a scalable and transparent manner must be an integral part of grid computing resource management systems. Support for the development of fault-tolerant applications has been identified as one of the major technical challenges to address for the successful deployment of computational grids [10]. To date, there has been limited support for application-level fault tolerance in computational grids. Support has consisted mainly of failure detection services or fault-tolerance capabilities in specialized grid toolkits. Neither solution is satisfactory in the long run. The former places the burden of incorporating fault-tolerance techniques in the hands of application programmers, while the latter only works for specialized applications. Even where fault-tolerance techniques have been integrated into programming tools, these have generally been point solutions, i.e., tool developers have started from scratch in implementing their solution and have neither shared nor reused any fault-tolerance code. A better way is to use a compositional approach in which fault-tolerance experts write algorithms and encapsulate them into reusable code artifacts, or modules.
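As a rough illustration of the compositional idea, a fault-tolerance algorithm can be written once and attached to any task without touching the task's own code. The sketch below is only an assumption about what such a module might look like: the names `with_retries` and `FlakyJob` are hypothetical and do not come from the paper or from any grid toolkit, and simple re-execution stands in for whatever algorithm a fault-tolerance expert would actually provide.

```python
import functools


def with_retries(max_attempts=3, retry_on=(Exception,)):
    """Hypothetical reusable module: re-run a task when it fails.

    The task itself contains no fault-tolerance logic; the policy
    (how many attempts, which errors count) lives in this wrapper.
    """
    def decorator(task):
        @functools.wraps(task)
        def wrapper(*args, **kwargs):
            last_error = None
            for _ in range(max_attempts):
                try:
                    return task(*args, **kwargs)
                except retry_on as error:
                    last_error = error  # remember the failure, then retry
            raise last_error  # every attempt failed
        return wrapper
    return decorator


class FlakyJob:
    """Stand-in for a grid job that fails a fixed number of times (illustrative)."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def run(self, value):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("simulated transient resource failure")
        return value * 2
```

The same wrapper could be applied to any other job, for example `with_retries(max_attempts=3, retry_on=(RuntimeError,))(job.run)`, which is the reuse that point solutions lack.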
Monitoring Systems: In this approach a fault-monitoring unit is attached to the grid. The base technique that most monitoring units follow is heartbeating. The heartbeating technique [11] is further classified into 3 types:
- Centralized heartbeating: sending heartbeats to a central member creates a hot spot, an instance of high asymptotic complexity.
- Ring-based heartbeating: sending heartbeats along a virtual ring suffers from unpredictable failure detection times when there are multiple failures, an instance of the perturbation effect.
- All-to-all heartbeating: sending heartbeats to all members causes the message load in the network to grow quadratically with group size, again an instance of high asymptotic complexity.
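The trade-offs among the three schemes can be made concrete with a small sketch. The function and class names here are illustrative assumptions, not part of the paper or of [11]; the message counts assume a group of n members in which every member sends one heartbeat per round to each of its targets, and the monitor models only the centralized scheme with a fixed timeout.

```python
def heartbeats_per_round(scheme, n):
    """Total heartbeat messages sent in one round by a group of n members."""
    if scheme == "centralized":
        return n - 1          # everyone reports to one central member (the hot spot)
    if scheme == "ring":
        return n              # each member sends to its successor on the ring
    if scheme == "all-to-all":
        return n * (n - 1)    # each member sends to every other member
    raise ValueError(f"unknown scheme: {scheme}")


class CentralHeartbeatMonitor:
    """Minimal centralized failure detector (hypothetical sketch).

    A member is suspected once no heartbeat has arrived from it
    for longer than `timeout` time units.
    """

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def record_heartbeat(self, member, now):
        self.last_seen[member] = now

    def suspected_failures(self, now):
        return sorted(
            member
            for member, seen_at in self.last_seen.items()
            if now - seen_at > self.timeout
        )
```

In the centralized case all n - 1 messages arrive at a single member, so the total is small but the load is concentrated; in the all-to-all case the total itself grows quadratically with group size.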
Checkpointing-recovery: Checkpointing and rollback recovery provide an effective technique for tolerating transient resource failures and for avoiding total loss of results. Checkpointing involves saving enough state information of an executing program on stable storage so that, if required, the program can be re-executed starting from the state recorded in the checkpoints. Checkpointing distributed applications is more complicated than checkpointing non-distributed ones. When an application is distributed, the checkpointing algorithm has to capture not only the state of all individual processes but also the state of all the communication channels. Checkpointing [12] is basically divided into 2 types:
- Uncoordinated checkpoint: In this approach, each process that is part of the system determines its local checkpoints individually. During restart, these checkpoints have to be searched in order to construct a consistent global checkpoint.
- Coordinated checkpoint: In this approach, the checkpointing is orchestrated such that the set of individual checkpoints always results in a consistent global checkpoint. This minimizes the storage overhead, since only a single global checkpoint needs to be maintained on
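The basic save-and-restart idea behind checkpointing can be sketched for a single, non-distributed process. This is a simplified illustration under stated assumptions: the function names, the JSON file format and the `fail_at` parameter are all hypothetical, a local file stands in for stable storage, and the sketch does not address the distributed case, where the state of communication channels must also be captured.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write the state to a temporary file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())  # push the data to disk before the swap
    os.replace(tmp_path, path)     # readers never see a half-written checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"next_item": 0, "total": 0}
    with open(path) as handle:
        return json.load(handle)


def sum_items(items, path, fail_at=None):
    """Sum the items, checkpointing after each one.

    `fail_at` simulates a transient failure before the item at that index.
    """
    state = load_checkpoint(path)
    for index in range(state["next_item"], len(items)):
        if index == fail_at:
            raise RuntimeError("simulated transient failure")
        state["total"] += items[index]
        state["next_item"] = index + 1
        save_checkpoint(path, state)
    return state["total"]
```

After a failure, calling `sum_items` again with the same checkpoint path resumes from the recorded state instead of starting over, so the work done before the failure is not lost.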