I. INTRODUCTION
Over the last several years, there has been a substantial increase in the number of cloud data storage services being deployed to meet the requirements of data-intensive web applications and services. NoSQL systems [1] have demonstrated enormous scalability in satisfying the immense data storage and look-up requirements of these applications. One good example is Google's BigTable [2], a semi-structured database intended to handle petabytes of data. It is a sparse, persistent, multi-dimensional sorted distributed storage system built on top of the Google File System [3]. Many projects at Google, such as web indexing, store their data in BigTable.
Figure 1. HBase Architecture (HBaseClient, HBaseMaster, HRegionServer, ZooKeeper, HDFS).
1) InfiniBand: One of the main features of InfiniBand is Remote Direct Memory Access (RDMA). This feature allows software to remotely read or update the memory contents of another process without any software involvement at the remote side. It is very powerful and can be used to implement high-performance communication protocols [21]. InfiniBand has started making inroads into the commercial domain with the recent convergence around RDMA over Converged Enhanced Ethernet (RoCE) [22]. InfiniBand offers various software layers through which it can be accessed. They are described below:
(a) InfiniBand Verbs Layer: The verbs layer is the lowest access layer to InfiniBand. There are no intermediate copies in the OS. The verbs interface follows the Queue Pair (QP) communication model. Each QP has a certain number of work queue elements. Upper-level software using verbs can place a work request on a QP. The work request is then processed by the Host Channel Adapter (HCA). When the work is completed, a completion notification is placed on the completion queue. Upper-level software can detect completion by polling the completion queue or through asynchronous events (interrupts). Polling often results in the lowest latency. The OpenFabrics interface [13] is the most popular verbs access layer due to its applicability across various InfiniBand vendors.
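To make the work-request and completion-queue interaction concrete, the following minimal Java sketch mimics the post-then-poll pattern described above. The Endpoint interface and LoopbackEndpoint class are hypothetical stand-ins introduced only for illustration; they are not part of the verbs API, UCR, or HBase, and a real deployment would reach the verbs layer through a JNI-based native wrapper.

import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical endpoint modeled on the verbs Queue Pair (QP) idea:
// work requests are posted, and completions are detected by polling.
interface Endpoint {
    long postSend(ByteBuffer payload);   // post a work request, return its id
    long pollCompletion();               // poll the completion queue; -1 if empty
}

// Toy loopback implementation so the sketch runs; a real endpoint would
// hand the request to the HCA through a native (JNI) verbs wrapper.
class LoopbackEndpoint implements Endpoint {
    private long nextId = 0;
    private final Queue<Long> completionQueue = new ArrayDeque<Long>();

    public long postSend(ByteBuffer payload) {
        long id = nextId++;
        completionQueue.add(id);         // "complete" the request immediately
        return id;
    }

    public long pollCompletion() {
        Long id = completionQueue.poll();
        return (id == null) ? -1 : id;
    }
}

public class VerbsPatternSketch {
    public static void main(String[] args) {
        Endpoint ep = new LoopbackEndpoint();
        long wrId = ep.postSend(ByteBuffer.wrap("put-request".getBytes()));
        // Busy-poll the completion queue: lowest latency, at the cost of a core.
        long done;
        do {
            done = ep.pollCompletion();
        } while (done != wrId);
        System.out.println("work request " + wrId + " completed");
    }
}

Polling, as in the loop above, trades CPU cycles for latency; an event-driven variant would instead block until the HCA raises a completion interrupt.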
(b) InfiniBand IP Layer: InfiniBand software stacks, such as OpenFabrics, provide a driver for implementing the IP layer. This exposes the InfiniBand device as just another network interface available from the system, with an IP address. Typically, Ethernet interfaces are presented as eth0, eth1, etc.; similarly, IB devices are presented as ib0, ib1, and so on. This layer does not provide OS-bypass. It is often called IP-over-IB, or IPoIB for short; we will use this terminology in the paper. There are two modes available for IPoIB: datagram mode, implemented over Unreliable Datagram (UD), and connected mode, implemented over Reliable Connection (RC).
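Because IPoIB presents the IB device as an ordinary network interface, unmodified socket code runs over it; an application simply connects to an IP address assigned to ib0. The snippet below is a minimal sketch of this idea; the host name ib-server and port 9090 are hypothetical values chosen for illustration.

import java.io.OutputStream;
import java.net.Socket;

public class IpoibSocketSketch {
    public static void main(String[] args) throws Exception {
        // "ib-server" is assumed to resolve to an address configured on the
        // remote node's ib0 (IPoIB) interface. The code is plain TCP sockets,
        // which is why IPoIB requires no application changes; it is also why
        // IPoIB cannot bypass the OS network stack.
        Socket s = new Socket("ib-server", 9090);
        OutputStream out = s.getOutputStream();
        out.write("hello over IPoIB".getBytes());
        out.flush();
        s.close();
    }
}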
2) 10 Gigabit Ethernet: 10 Gigabit Ethernet was standardized in an effort to improve bandwidth in data-center environments. It was also realized that the traditional sockets interface may not be able to support high communication rates [23], [24]. Toward that end, the iWARP standard was introduced for performing RDMA over TCP/IP [25]. iWARP is very similar to the verbs layer used by InfiniBand, with the exception of requiring a connection manager. In addition to iWARP, there are also hardware-accelerated versions of TCP/IP available. These are called TCP Offload Engines (TOEs), which use hardware offload. The benefit of a TOE is that it maintains full socket streaming semantics while implementing them efficiently in hardware. We used 10 Gigabit Ethernet adapters from Chelsio Communications in our experiments.
[Figure: breakdown of time spent (us) per operation in Client Serialization, Client Processing, Server Serialization, Server Processing, Communication Preparation, and Communication, for 1GigE, IPoIB (16 Gbps), and 10GigE.]
Figure 3. General Design (Applications, HBaseClient APIs, UCR, IB Verbs).
Figure 4. [HBaseClient side: Applications, HBaseClient APIs, Helper, IB Reader, Thread Pool, UCR, IB Verbs; HBase RegionServer side: Network Selector, Reader, Thread Pool, Call Queue, Handler, Responder Thread, Response Queue, UCR, IB Verbs; annotated with numbered communication steps.]
D. Connection Management
[Figure: operation latency (us) vs. message size (1 byte to 16 KB) over 1GigE, 10GigE, IPoIB, and UCR-IB, for both the 16 Gbps and 32 Gbps IB networks.]
We also run experiments with multiple HRegionServers while varying the number of HBase clients; this workload is similar to real HBase deployment workloads.
A. Experimental Setup
We use two different clusters in our evaluations.
(1) Intel Clovertown Cluster (Cluster A): This cluster consists of 64 compute nodes, each with dual quad-core Intel Xeon processors operating at 2.33 GHz and 6 GB RAM. Each node is equipped with a ConnectX DDR IB HCA (16 Gbps data rate) as well as a Chelsio T320 10GbE Dual Port Adapter with TCP Offload capabilities. Each node runs Red Hat Enterprise Linux Server release 5.5 (Tikanga), with kernel version 2.6.30.10 and OpenFabrics version 1.5.3. The IB cards on the nodes are interconnected using a 144-port SilverStorm IB DDR switch, while the 10 GigE cards are connected using a Fulcrum FocalPoint 10 GigE switch.
(2) Intel Westmere Cluster (Cluster B): This cluster consists of 144 compute nodes with Intel Westmere-series dual quad-core Xeon processors operating at 2.67 GHz and 12 GB RAM. Each node is equipped with an MT26428 QDR ConnectX HCA (32 Gbps data rate) with a PCIe Gen2 interface. The nodes are interconnected using a 171-port Mellanox QDR switch. Each node runs Red Hat Enterprise Linux Server release 6.1 (Santiago) with kernel version 2.6.32-131 and OpenFabrics version 1.5.3. This cluster is less than two years old and represents the leading-edge technologies used in commodity clusters.
We use HBase version 0.90.3 and Hadoop version 0.20.2 in all our experiments. We use the following configuration: the Master node and the HDFS NameNode run on one compute node, while the HBase clients and HRegionServers run on separate compute nodes.
B. Micro-benchmark Evaluations
We design micro-benchmarks to evaluate the latency of HBase Put and Get operations. The benchmark issues Put or Get operations of a specified size and measures the total time taken for the operation to complete. We run this micro-benchmark in a single-region-server, single-client configuration. We keep only one record in the region server, so that all accesses are serviced from the server's memory store and no disk accesses are involved. The experiment is performed on both Cluster A and Cluster B. Due to space limitations, we report only the results for record sizes from 1 byte to 64 KB. It is to be noted that most Put/Get payload sizes in typical cloud applications fall within this range [27]. We also present a detailed profiling analysis of both the HBase client and the HRegionServer during a Put/Get operation; it provides insights into the factors that contribute to the overall operation latency.
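A minimal sketch of such a latency micro-benchmark against the HBase 0.90 client API is shown below. The table name, column family, qualifier, and iteration count are illustrative choices, not the exact parameters of our runs.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutGetLatencyBenchmark {
    public static void main(String[] args) throws Exception {
        int valueSize = Integer.parseInt(args[0]);       // record size in bytes
        int iterations = 1000;                           // illustrative choice

        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "latency_test"); // hypothetical table name

        // A single row, so every Get is served from the region server's memstore.
        byte[] row = Bytes.toBytes("row1");
        Put put = new Put(row);
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("q"), new byte[valueSize]);
        table.put(put);

        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            table.get(new Get(row));                     // one round trip per Get
        }
        long avgUs = (System.nanoTime() - start) / iterations / 1000L;
        System.out.println("Get " + valueSize + "B avg latency: " + avgUs + " us");
        table.close();
    }
}

The Put path can be timed the same way by moving the Put into the loop; in both cases the measured time spans client serialization, the network round trip, and server-side processing, which is what the profiling breakdown decomposes.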
Figure 7. [Panels (a) Put 1 KB and (b) Get 1 KB: time spent (us) in Client Serialization, Client Processing, Server Serialization, Server Processing, Communication Preparation, and Communication for 1GigE, 10GigE, IPoIB (16 Gbps), and UCR-IB (16 Gbps).]
Figure 8. Single Server, Multiple Clients, HBase Get on Cluster A, 1 KB Message Size: (a) Latency (us) and (b) Throughput (Ops/sec) as the number of clients is varied.
[Figures: latency (us) and throughput (Ops/sec) vs. number of clients (4, 8, 16) for 1GigE, 10GigE, IPoIB (16 Gbps), and UCR-IB (16 Gbps); panel (c) Throughput.]
In recent times, HBase and InfiniBand have been gaining traction in data-center deployments. Facebook deployed Facebook Messages using HBase, and their engineers indicate that network I/O plays an important role in overall system performance [36]. Oracle designed the Oracle Exalogic Elastic Cloud to integrate business applications, middleware, and software products [16]. The storage and compute resources in this cloud are integrated using its high-performance I/O back-plane on InfiniBand. However, these systems support protocols such as Sockets Direct Protocol (SDP), IP-over-InfiniBand (IPoIB), and Ethernet-over-InfiniBand (EoIB), and these do not make use of native InfiniBand. In our work, we rely on native InfiniBand using UCR. In our previous work [37], we evaluated HDFS performance on InfiniBand and 10 Gigabit Ethernet networks using different workloads. In this paper, we enable an RDMA-based communication channel in a Java-based system.
VII. CONCLUSION AND FUTURE WORK
In this paper, we took on the challenge of improving HBase data query performance by reducing communication time. We performed detailed profiling to reveal the communication overheads caused by Java sockets, and showed that communication plays an important role in overall operational latency. We identified this as an opportunity to take advantage of high-performance networks such as InfiniBand to improve HBase operation performance.
In order to integrate high-performance networks into HBase, we designed a hybrid, hierarchical communication mechanism that incorporates both conventional sockets and high-performance InfiniBand. By leveraging the high-throughput, low-latency InfiniBand network, we were able to substantially reduce the latency of HBase operations.
[Figures: latency (us) and throughput (Ops/sec) vs. number of clients (16, 32, 64, 128) for 1GigE, 10GigE, IPoIB (16 Gbps), and UCR-IB (16 Gbps); panel (c) Throughput.]
VIII. ACKNOWLEDGMENTS
This research is supported in part by a grant from the IBM OCR Program; U.S. Department of Energy grant #DE-FC02-06ER25755; National Science Foundation grants #CCF-0702675, #CCF-0833169, and #CCF-0916302; and an equipment grant from the Wright Center for Innovation, #WCI04-010-OSU-0.
REFERENCES
[8] J. Seligstein, "Facebook Messages," http://www.facebook.com/blog.php?post=452288242130.
[9] N. O'Neil, "Facebook Social Inbox," http://www.allfacebook.com/facebook-social-inbox-alwayson-messaging-with-people-you-care-about-2010-11.
[24] G. Liao, X. Zhu, and L. Bhuyan, "A New Server I/O Architecture for High Speed Networks," in Proceedings of the 17th International Symposium on High Performance Computer Architecture (HPCA '11), San Antonio, Texas, February 12-16, 2011.
[32] T. Dory, B. Mejias, P. V. Roy, and N.-L. V. Tran, "Measuring Elasticity for Cloud Databases," in Proceedings of the Second International Conference on Cloud Computing, GRIDs, and Virtualization, Rome, Italy, September 25-30, 2011.