You are on page 1of 52

Understanding Issue Correlations:

A Case Study of the Hadoop System

Jian Huang
Xuechen Zhang† Karsten Schwan


Why Issue Study Matters?
Scalable distributed systems are complex [Yuan et al., OSDI’14]

Complicated System

2
Why Issue Study Matters?
Scalable distributed systems are complex [Yuan et al., OSDI’14]

+
Complicated System Error-prone

2
Why Issue Study Matters?
Scalable distributed systems are complex [Yuan et al., OSDI’14]

+ +
Complicated System Error-prone Hard to Debug

2
Why Issue Study Matters?
Scalable distributed systems are complex [Yuan et al., OSDI’14]

+ +
Complicated System Error-prone Hard to Debug
Issue Study

Issue Pattern
2
Why Issue Study Matters?
Scalable distributed systems are complex [Yuan et al., OSDI’14]

+ +
Complicated System Error-prone Hard to Debug
Issue Study

+
Issue Pattern Better Software & Debugging Tools
2
Hadoop: A Representative Distributed System

3
Hadoop: A Representative Distributed System
HDFS (Storage) MapRedue (Computation)
10
Number of Reported Issues

8
6
(x1000)

4
2
0
2008 2009 2010 2011 2012 2013 2014 2015
The Evolution of Apache Hadoop

3
Hadoop: A Representative Distributed System
HDFS (Storage) MapRedue (Computation)
10
Number of Reported Issues

8
6
……
(x1000)

4
2
0
2008 2009 2010 2011 2012 2013 2014 2015
The Evolution of Apache Hadoop

3
Hadoop: A Representative Distributed System
HDFS (Storage) MapRedue (Computation)
10
Number of Reported Issues

8
6
……
(x1000)

4
2
0
2008 2009 2010 2011 2012 2013 2014 2015
The Evolution of Apache Hadoop

Learn from issues – more than 6 years of experience.


3
What Can We Learn From Issues?
Related Work
[Gunawi et al., SoCC’14]
What Bugs Live in the Cloud?

[Lu et al., FAST’13]


A Study of Linux File System Evolution
……

4
What Can We Learn From Issues?
Related Work
[Gunawi et al., SoCC’14]
What Bugs Live in the Cloud?

[Lu et al., FAST’13]


A Study of Linux File System Evolution
……

Our Focus: Issue Correlations

Programming

Systems Tools

4
Our Findings

•  Half of the issues are independent


•  MapReduce issues tend to relate to YARN
•  One third of the issues have similar causes
•  ......

5
Our Findings

•  Half of the issues are independent


•  MapReduce issues tend to relate to YARN
•  One third of the issues have similar causes
•  ......

•  Memory: GC is still the No. 1 concern


Programming •  Storage: “99.99% of data reliability” is challenged
•  Programming: one third of them relate to interfaces
Systems Tools •  Tools: the logging in Hadoop is error-prone
•  ......
5
Methodology Used in Our Study
Hive … Pig Flume
HCatalog … Mahout Cascading

MapReduce HBase

HDFS
Hadoop Ecosystem

6
Methodology Used in Our Study
Hive … Pig Flume
HCatalog … Mahout Cascading
Computation MapReduce HBase

Storage HDFS
Hadoop Ecosystem

6
Methodology Used in Our Study
Hive … Pig Flume
HCatalog … Mahout Cascading
Computation MapReduce HBase

Storage HDFS
Hadoop Ecosystem

Closed Issues 2359 2340


Examined Issues 2180 2038

6
Methodology Used in Our Study
Hive … Pig Flume
HCatalog … Mahout Cascading
Computation MapReduce HBase

Storage HDFS
Hadoop Ecosystem

Closed Issues 2359 2340 Sampling Rate


89.8%
Examined Issues 2180 2038

6
Methodology Used in Our Study
Hive … Pig Flume
HCatalog … Mahout Cascading
Computation MapReduce HBase

Storage HDFS
Hadoop Ecosystem

Closed Issues 2359 2340 Sampling Rate


89.8%
Examined Issues 2180 2038
Sampling Period ~6 years 5 years 6
Methodology Used in Our Study
Issues

Description Patches Follow-up Source Code


Discussions Analysis

6
Methodology Used in Our Study
Issues

Description Patches Follow-up Source Code


Discussions Analysis

Labeling

IssueID Subcomponent Type Causes


Create/Commit Time CorrelatedIssueID ……
HPatchDB
6
Where Are the Correlated Issues From?

Do you know where I’m from?

#Correlated Issues 0 1 2 3 >=4


HDFS 94.7% 4.8% 0.5% - -
External
MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%
HDFS 52.7% 32.8% 9.1% 3.1% 2.3%
Internal
MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%
7
Where Are the Correlated Issues From?

External Correlation
correlated issues appear in other systems
A
Internal Correlation
correlated issues appear in the same system
B C

#Correlated Issues 0 1 2 3 >=4


HDFS 94.7% 4.8% 0.5% - -
External
MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%
HDFS 52.7% 32.8% 9.1% 3.1% 2.3%
Internal
MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%
7
Where Are the Correlated Issues From?

B C

#Correlated Issues 0 1 2 3 >=4


HDFS 94.7% 4.8% 0.5% - -
External
MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%
HDFS 52.7% 32.8% 9.1% 3.1% 2.3%
Internal
MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%
7
Where Are the Correlated Issues From?

A
A significant number of issues are independent.

B C

#Correlated Issues 0 1 2 3 >=4


HDFS 94.7% 4.8% 0.5% - -
External
MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%
HDFS 52.7% 32.8% 9.1% 3.1% 2.3%
Internal
MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%
7
Where Are the Correlated Issues From?

A
Half of them are from YARN.

B C

#Correlated Issues 0 1 2 3 >=4


HDFS 94.7% 4.8% 0.5% - -
External
MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%
HDFS 52.7% 32.8% 9.1% 3.1% 2.3%
Internal
MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%
7
Where Are the Correlated Issues From?

B C

#Correlated Issues 0 1 2 3 >=4


HDFS 94.7% 4.8% 0.5% - -
External
MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%
HDFS 52.7% 32.8% 9.1% 3.1% 2.3%
Internal
MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%
7
Where Are the Correlated Issues From?

Half of them are independent.


B C

#Correlated Issues 0 1 2 3 >=4


HDFS 94.7% 4.8% 0.5% - -
External
MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%
HDFS 52.7% 32.8% 9.1% 3.1% 2.3%
Internal
MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%
7
Where Are the Correlated Issues From?

B C

#Correlated Issues 0 1 2 3 >=4


HDFS 94.7% 4.8% 0.5% - -
External
MapReduce 79.3% 17.1% 2.8% 0.5% 0.3%
HDFS 52.7% 32.8% 9.1% 3.1% 2.3%
Internal
MapReduce 59.3% 32.7% 5.6% 1.3% 1.0%
7
How the Issues Are Correlated?

Do you know our relationship?

8
How the Issues Are Correlated?
Similar Causes
Issues have similar causes

Blocking Other Issues


Issues need to be fixed before fixing other issues

Fix on Fix
Issues are caused by fixing other issues

8
How the Issues Are Correlated?

26-33% of the issues have similar causes.

Similar Causes Blocking Other Issues Fix on Fix


40
Percentage (%)

30
20
10
0
HDFS MapReduce 8
How the Issues Are Correlated?

These issues that block others appear more


frequently in HDFS.

Similar Causes Blocking Other Issues Fix on Fix


40
Percentage (%)

30
20
10
0
HDFS MapReduce 8
How the Issues Are Correlated?

Mostly due to functional dependency.

Similar Causes Blocking Other Issues Fix on Fix


40
Percentage (%)

30
20
10
0
HDFS MapReduce 8
On the Issue Correlations with System Characteristics

Programming

Systems Tools

9
On the Issue Correlations with System Characteristics

27%

47% 26%
Programming

Systems Tools

9
How Issues Relate to Systems?

10
How Issues Relate to Systems?
100
90
80
70
Percentage (%)

60
50
40
30
20
10
0
HDFS MapReduce
security networking storage file system memory cache 10
How Issues Relate to Systems?
100
90
80
GC is still the No.1 concern,
70
Percentage (%)

memory-friendly objects are preferred.


60
50
40 •  LightWeightGSet Vs. java.util structure
30 •  Object cache for long lived object:
20 ReplicasMap, ReplicasInfo
10
0
HDFS MapReduce
security networking storage file system memory cache 10
How Issues Relate to Systems?
100 Many issues happened in file system like
90 EXT4 appear in Hadoop.
80
70
Percentage (%)

60
50 File system semantic:
40 namespace management, file permission,
30 consistency (e.g., fsck), etc.
20
10
0
HDFS MapReduce
security networking storage file system memory cache 10
How Issues Relate to Systems?
100 The statement of the 99.99% of data
90 reliability in cloud storage is challenged.
80
70
Percentage (%)

60 Issues in rack placement policy:


50 0.16% of blocks and their replicas are in
40 the same rack upon system upgrade.
30
20
10
0
HDFS MapReduce
security networking storage file system memory cache 10
How Issues Relate to Systems?
100
One quarter of networking issues cause
90
resource wastage.
80
Socket
70 Read a block:
Percentage (%)

leak !
60 Peer peer = newTcpPeer(dnAddr);
50 -  return newBlockReader(…)
+ try{
40 + reader = newBlockReader(…)
30 + return reader
20 + } catch (IOException ex) {
+ throw ex;
10 + } finally {
0 + if(reader == null) closeQuietly(peer);
HDFS MapReduce + }

security networking storage file system memory cache 10


How Issues Relate to Programming?
100
90
80
70
Percentage (%)

60
50
40
30
20
10
0
HDFS MapReduce
typo lock interface maintenance
11
How Issues Relate to Programming?
100
90
80
70
Percentage (%)

60
50 Half of them relate to code maintenance.
40
30
20
10
0
HDFS MapReduce
typo lock interface maintenance
11
How Issues Relate to Programming?
100
90 Mainly caused by interface changes.
80
70
Percentage (%)

60
50
40
30
20
10
0
HDFS MapReduce
typo lock interface maintenance
11
How Issues Relate to Programming?
100 5.6% of programming issues
90 are caused by typos !
80
70
Percentage (%)

60 A fsimage cannot be accessed due to:

50 -  elif [ “COMMAND” = “oiv_legacy” ] then


+ elif [ “$COMMAND” = “oiv_legacy” ] then
40
30
20
10
0
HDFS MapReduce
typo lock interface maintenance
11
How Issues Relate to Tools?
100
90
80
70
Percentage (%)

60
50
40
30
20
10
0
HDFS MapReduce
configuration debugging documents testing 12
How Issues Relate to Tools?
100
Logs are misleading:
90
incorrect, incomplete, indistinct output.
80
70
Percentage (%)

60
50
40
30
20
10
0
HDFS MapReduce
configuration debugging documents testing 12
How Issues Relate to Tools?
100
Logs are misleading:
90
incorrect, incomplete, indistinct output.
80
70
Percentage (%)

60
50 Accessing a non-exist file via WebHDFS,
40 FileNotFoundException is expected, but we get this

30
20
10
0
HDFS MapReduce Logs

configuration debugging documents testing 12


How Issues Relate to Tools?
100 A majority of configuration issues are
90 related to system performance.
80
70
Percentage (%)

60 59% of the 219 configuration parameters in


50 MapReduce are performance related.
40
30
20
10
0
HDFS MapReduce
configuration debugging documents testing 12
Conclusion

Programming

Systems Tools

Correlations Between Issues


1 Issues are independent; 33% of issues have similar causes, etc.

Correlations With System Characteristics


2 More efforts are required to achieve highly reliable distributed system

13
Thanks!
Jian Huang
jian.huang@gatech.edu

Xuechen Zhang† Karsten Schwan

Q&A

You might also like