
A Client-centric Grid Knowledgebase

George Kola, Tevfik Kosar and Miron Livny


University of Wisconsin-Madison

September 23rd, 2004

Cluster 2004
San Diego, CA
Grid Trivia

 How many of you have submitted a job to Grid resources
and never heard back from it?
 How many of you have been frustrated by the inconsistent
behavior of some Grid resources?
• Some jobs complete successfully while others fail.
• Similar jobs perform completely differently.

... We did!

Goal: Prevent Unexpected Behavior in a Grid
 Learn from experience and prevent unexpected behavior
from recurring.
 Causes for unexpected behavior in a Grid:
• Black holes
• Resources with
– Faulty hardware
– Buggy or misconfigured software
• Extremely slow computational sites
• Memory leaks
etc.

Black holes
 Definition: “A black hole is a region of spacetime
from which nothing can escape, not even light.”
 If you send a light beam into a black hole, you never
hear back from it.
 You only recognize a black hole after you have encountered it.
Is that too late?
• No. You should learn from experience.


Black holes in the Grid


 Resources that accept jobs but never complete them
• You send a job to a resource, but never hear back from it.

Black hole examples from real life:
 In the WCER educational video processing pipeline:
• One pool accepted our jobs and ran them for a couple of
hours, but evicted them before completion.
• A machine accepted a job, but due to a memory leak it
kept throwing “shadow exceptions” and retrying the job
forever.
• Some third-party transfers (GridFTP, DiskRouter)
occasionally hung and never returned.
• A machine with a corrupted FPU caused errors: it
completed MPEG-1 encoding successfully but failed
MPEG-4 encoding.

The Grid is good, but not perfect

 Heterogeneous resources
 Multiple administrative domains
 Spans wide area networks
 Consists of commodity hardware and software

Prone to network, hardware, software, and middleware failures!

We cannot expect everything from the Grid or Grid middleware!

Take the Ethernet Approach
 A truly distributed (and very effective) access control
protocol to a shared service
 Client responsible for access control
 Client responsible for error detection
 Client responsible for fairness

Keep track of job/resource performance and failure
characteristics as observed by the client.
Use job/user log files collected at the client side
to build a Grid knowledgebase.

Grid Knowledgebase
 Parse user/job log files
 Load them into a database
 Aggregate experience of different jobs
 Interpret them
 Plan action
 Generate feedback to the scheduler as well as to
the user



[Figure: Grid job flow without a knowledgebase. Components shown: Planners, Job Descriptions, Job Queue, Match Maker, Job Scheduler, Grid Resources (Clusters, Storage Servers, Personal Computers), and Job Logs.]

[Figure: The same job flow with the Grid Knowledgebase added. Additional components shown: Job Parser, Database, Data Miner, Adaptation Layer, and Notification Layer.]
Database Schema

Field             Type
JobId             int
JobName           string
State             int
SubmitHost        string
SubmitTime        int
ExecuteHost       string[]
ExecuteTime       string[]
ImageSize         int[]
ImageSizeTime     int[]
EvictTime         int[]
Checkpointed      bool[]
EvictReason       string
TerminateTime     int[]
TotalLocalUsage   string
TotalRemoteUsage  string
TerminateMessage  string
ExceptionTime     int[]
ExceptionMessage  string[]

[Figure: job state diagram with the nodes User, Submit, Schedule, Execute, Suspend, Un-suspend, Evicted, Exception, Terminated Abnormally, and Terminated Normally. A final check, "Exit code = 0?", leads to Job Succeeded (yes) or Job Failed (no).]
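
As an illustration only, the schema fields could be held in a relational store along the following lines. This is a minimal sketch assuming SQLite; the two-table layout (`jobs`, `job_events`), the `EventType`, `Host`, and `Value` columns, and the helper functions are all hypothetical, and the slides do not say which database or layout the authors used. Array-valued fields become one row per event.

```python
import sqlite3

# Hypothetical layout: scalar fields live in `jobs`; array-valued fields
# (ExecuteHost, EvictTime, ImageSize, ...) become rows in `job_events`.
SCHEMA = """
CREATE TABLE jobs (
    JobId            INTEGER PRIMARY KEY,
    JobName          TEXT,
    State            INTEGER,
    SubmitHost       TEXT,
    SubmitTime       INTEGER,
    EvictReason      TEXT,
    TotalLocalUsage  TEXT,
    TotalRemoteUsage TEXT,
    TerminateMessage TEXT
);
CREATE TABLE job_events (
    JobId     INTEGER REFERENCES jobs(JobId),
    EventType TEXT,     -- e.g. 'Execute', 'Evict', 'ImageSize', 'Exception'
    EventTime INTEGER,  -- time unit is an assumption; the slides give none
    Host      TEXT,     -- set for 'Execute' events
    Value     TEXT      -- image size, exception message, checkpoint flag, ...
);
"""


def open_knowledgebase(path=":memory:"):
    """Create the tables and return an open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_event(conn, job_id, event_type, event_time, host=None, value=None):
    """Append one parsed log event for a job."""
    conn.execute(
        "INSERT INTO job_events VALUES (?, ?, ?, ?, ?)",
        (job_id, event_type, event_time, host, value),
    )


def hosts_for_job(conn, job_id):
    """Return the hosts a job executed on, in chronological order."""
    rows = conn.execute(
        "SELECT Host FROM job_events "
        "WHERE JobId = ? AND EventType = 'Execute' ORDER BY EventTime",
        (job_id,),
    )
    return [row[0] for row in rows]
```

Storing one row per event keeps the array fields queryable with plain SQL, at the cost of a join when a whole job record is needed.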

Difference from existing approaches
 Client view
 Use only job/user log files at the client side
• Many administrators do not want to share
resource/scheduler log files.
 We do not need to know everything going on in the
whole grid
• Scalable

What do we get?
 Collecting job execution time statistics
• Average job execution time
• Standard deviation
• Fit a distribution
 Detect and avoid black holes
• For a normal distribution:
– 99.7% of job execution times should lie between
(avg-3*stdev) and (avg+3*stdev)
– About 95.4% of job execution times should lie between
(avg-2*stdev) and (avg+2*stdev)
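
The limits above can be sketched in a few lines. This is a minimal example, not the authors' implementation: the function names and the `running` mapping are hypothetical, and it assumes the sample standard deviation, since the slides do not say which estimator was used.

```python
from statistics import mean, stdev


def execution_time_limits(times, k=3.0):
    """Return (low, high) = (avg - k*stdev, avg + k*stdev), clipping low at 0."""
    avg = mean(times)
    sd = stdev(times)  # sample standard deviation (an assumption)
    return max(0.0, avg - k * sd), avg + k * sd


def suspected_black_holes(running, limits):
    """Return ids of running jobs whose elapsed time exceeds the upper limit.

    `running` maps a job id to the minutes elapsed so far (hypothetical input).
    """
    _, high = limits
    return sorted(job for job, elapsed in running.items() if elapsed > high)
```

A job that has already run longer than the upper limit is a candidate for being killed and resubmitted elsewhere; choosing `k=2.0` instead trades more false alarms for faster detection.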


Detecting hanging transfers


[Figure: Transfer Time (T) vs Probability (t<T). Cumulative probability in percent plotted against transfer time T in minutes, for observed transfer times from 4.6 to 15.3 minutes.]
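
When the observed transfer times do not look normally distributed, a timeout can also be read directly from the empirical distribution. The sketch below is one way to do that under stated assumptions: the function name and the `coverage` parameter are hypothetical, and it uses P(t <= T) rather than the strict inequality.

```python
import math


def empirical_timeout(times, coverage=0.997):
    """Smallest observed time T such that at least `coverage` of samples are <= T."""
    if not times:
        raise ValueError("need at least one observed transfer time")
    if not 0.0 < coverage <= 1.0:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(times)
    # Rank of the first sample at which the empirical CDF reaches `coverage`.
    rank = math.ceil(coverage * len(ordered))
    return ordered[rank - 1]
```

A transfer still running past this value is slower than the chosen fraction of everything seen so far, which makes it a reasonable candidate for a hung transfer.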

Setting Execution Time Limits
 Avg = 7.8 min
 Stdev = 3.17 min
 For a normal distribution:
• 99.7%: [0 – 17.31 min] (avg-3*stdev = -1.71 min, clipped to 0)
• 95.4%: [1.46 min – 14.14 min]

What do we get? (2)
 Identifying misconfigured machines
• For example, find the set of machines that fail jobs with
I/O data size larger than 2 GB (e.g. due to OS limitations)
 Identifying factors affecting job run-time
 Bug hunting
• Job failures on certain inputs
• Memory leaks
– Scheduler logs image size regularly

Catching Memory Leaks
[Figure: job memory image size (MB) plotted against time.]
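
One simple way to flag a leak from logged image sizes is to fit a straight line and look at its slope. The sketch below uses an ordinary least-squares fit; the function names and the `min_slope` threshold are hypothetical, and the slides do not state which test the authors applied.

```python
def image_size_slope(samples):
    """Least-squares slope of image size over time (size units per time unit).

    `samples` is a list of (time, image_size) pairs taken from the job log.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_s = sum(s for _, s in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("need samples at two or more distinct times")
    cov = sum((t - mean_t) * (s - mean_s) for t, s in samples)
    return cov / var_t


def looks_like_leak(samples, min_slope):
    """Flag a job whose image size grows faster than `min_slope`."""
    return image_size_slope(samples) > min_slope
```

A steadily positive slope across the whole run is the pattern of interest; a job that allocates once and then stays flat yields a slope near zero over a long enough window.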

What do we get? (3)
 Application optimization
• How long does each step of an application/pipeline
take to execute?
 Adaptation
• Find resources that take least time to execute jobs
from a particular class
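
The adaptation step can be sketched as a ranking of resources by their observed mean execution time for one job class. The `(job_class, host, minutes)` record shape and the function name are hypothetical; a real scheduler hook would also need to weigh sample counts and recent failures.

```python
from collections import defaultdict
from statistics import mean


def rank_resources(history, job_class):
    """Return hosts sorted by mean execution time for `job_class`, fastest first.

    `history` is an iterable of (job_class, host, minutes) tuples for jobs
    that completed successfully.
    """
    per_host = defaultdict(list)
    for cls, host, minutes in history:
        if cls == job_class:
            per_host[host].append(minutes)
    # Ties are broken by host name so the ordering is deterministic.
    return sorted(per_host, key=lambda host: (mean(per_host[host]), host))
```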

Conclusions
 View of the Grid from the client side
 Job/user log files as main source of information
 Aggregate the experience of different jobs and pass
it on to future ones
 Helps in:
• Catching black holes
• Identifying faulty/misconfigured resources
• Bug tracking
• Statistics collection
 Future work:
• Merge experience of different clients


Thank you…
For more information, contact:

Tevfik Kosar
http://www.cs.wisc.edu/~kosart
kosart@cs.wisc.edu
