
IBM Power Systems™
IBM Systems & Technology Group

AIX Troy performance over DDR IB, QDR IB and RoCE

Bernie King-Smith
IBM Systems & Technology Group
AIX Performance and Analysis

© 2011 IBM Corporation



Overview

Goal: Measure how the three fabrics supported by Troy compare to each other on POWER/AIX.

How well do Troy/pureScale operations scale on DDR IB, QDR IB and RoCE?

How well do QDR IB and RoCE perform compared to DDR IB using a simulated DTW operation mix?

How well does the CF scale going from 1 to 2 ports?

This is just a beginning, with questions yet to be answered.




System Configuration

[Diagram: four members (Member 1–4) and one CF, each a P730-IOC IBM server, connected over three networks: DDR IB through a SilverStorm 9024 switch, QDR IB through a Mellanox 5030 switch, and RoCE through a BNT G8124 switch]

4-member configuration.
Five P730-IOC CECs, each with 16 cores and 128 GB of memory.
DDR IB: Galaxy2 adapter.
QDR IB: Travis-IB adapter.
RoCE: Travis-EN adapter (PRQ).
Only the CF used both ports in the adapters for 2-port measurements.




Microb Single Port Latency

Latency was expected to be about 1 usec higher for QDR IB and RoCE than for DDR IB due to PCIe bus overhead vs. GX++. However, we currently see about a 2.2 usec increase.

RoCE latency was expected to be about 1 usec higher than QDR IB even though they use the same ConnectX chip. We currently see about a 2 usec increase.

Findings comparing DDR IB to RoCE ctraces:

Instruction counts for SLS, RAR and WARM are higher by 3.99%, 4.80% and 3.79% respectively.

All increases in instruction counts are in RoCE-specific code, specifically mxibQpPostSend.

Of the increased instruction counts, 30% are from the stamp_wqe routine, which places an end-of-list pattern at the end of the WQE list for the adapter read-ahead function. We are working with RoCE development on improving this routine.

RoCE currently uses up to 3 lwsync() calls per mxibQpPostSend call, which may be unnecessary: at the end of the routine we ring the doorbell, which does a full sync() anyway. (A sketch of this pattern follows the table below.)

[Chart: Troy operation single-connection latency, build 1146A_61ps111; latency (usec) by operation type for DDR IB, QDR IB and RoCE]

Latency (usec)   Galaxy2 DDR   QDR IB    RoCE
DIAGPTEST               6.91     9.11   11.08
RARPTEST               18.27    19.36   25.82
RARPTESTND              7.76     9.93   12.02
REGPTEST               23.22    25.65   30.82
SLSPTEST                8.77    10.68   12.95
WARMOPTEST             34.10    34.37   48.90
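
To make the stamp_wqe and lwsync() findings concrete, here is a minimal C sketch of a post-send path of this general shape. The queue structure, field layout and stamp pattern are hypothetical illustrations; only the names mxibQpPostSend, stamp_wqe, lwsync() and the doorbell sync() come from the trace findings above.

#include <stdint.h>
#include <string.h>

#define WQE_SIZE      64
#define STAMP_PATTERN 0xdeadbeefu          /* hypothetical end-of-list marker */

struct qp {
    uint8_t           *wqe_ring;           /* ring of send work-queue entries */
    unsigned           head, depth;
    volatile uint32_t *doorbell;           /* MMIO doorbell register */
};

static inline void lwsync(void) { __asm__ volatile("lwsync" ::: "memory"); }
static inline void hwsync(void) { __asm__ volatile("sync"   ::: "memory"); }

/* stamp_wqe: write the end-of-list pattern over the slot after the last
 * valid WQE so the adapter's read-ahead engine knows where the list stops.
 * Per the ctrace data, ~30% of the extra RoCE instructions land here. */
static void stamp_wqe(struct qp *qp, unsigned slot)
{
    uint32_t *p = (uint32_t *)(qp->wqe_ring + (size_t)slot * WQE_SIZE);
    for (unsigned i = 0; i < WQE_SIZE / sizeof(uint32_t); i++)
        p[i] = STAMP_PATTERN;
}

int mxibQpPostSend(struct qp *qp, const void *wqe)
{
    unsigned cur  = qp->head % qp->depth;
    unsigned next = (qp->head + 1) % qp->depth;

    memcpy(qp->wqe_ring + (size_t)cur * WQE_SIZE, wqe, WQE_SIZE);
    lwsync();                /* order the WQE body before the stamp ...    */

    stamp_wqe(qp, next);
    lwsync();                /* ... and the stamp before the doorbell ring */

    qp->head++;

    /* Ringing the doorbell is fenced with a full sync(), which already
     * orders every prior store; the lwsync() calls above may therefore
     * be redundant, which is the point raised in the findings. */
    hwsync();
    *qp->doorbell = qp->head;

    return 0;
}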




Microb Single Port Operation Rates

Small messages

QDR IB operation rates are 9.6 – 14.6% lower than DDR IB.

RoCE operation rates are 7.3 – 8.5% lower than QDR IB, using the same adapter chip in different modes.

The RegisterPage operation is not too critical to DB2 transaction performance, but it is much slower on the QDR IB and RoCE adapters.

[Chart: Troy single-port small-message rates, build 1146A_61ps111; operations per second by operation type (DIAGPTEST, RARPTESTND, REGPTEST, SLSPTEST) for DDR IB, QDR IB and RoCE]

Data operations

DDR IB and RoCE multiple-page write rates are link limited.

Read rates for DDR IB and RoCE are lower than writes because of additional per-page overhead.

QDR IB read and write rates are well below link limits. QDR IB is not considered strategic/tactical for Troy on POWER.

[Chart: Troy single-port data rates, build 1146A_61ps111; 4K pages per second for Read and Write on DDR IB, QDR IB and RoCE, with the DDR IB, QDR IB and RoCE link limits marked. A back-of-the-envelope derivation of these link limits follows.]
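
The link limits shown in the chart can be approximated from raw link speeds. A minimal sketch, assuming 4x IB links with 8b/10b encoding and 10 Gb Ethernet for RoCE; protocol headers and per-page overheads will push the real limits somewhat lower.

#include <stdio.h>

int main(void)
{
    /* usable data rates in bits per second */
    double ddr_ib = 16e9;   /* 4x DDR IB: 20 Gb/s signaling, 16 Gb/s data */
    double qdr_ib = 32e9;   /* 4x QDR IB: 40 Gb/s signaling, 32 Gb/s data */
    double roce   = 10e9;   /* 10 Gb Ethernet */

    /* divide by 8 bits per byte and a 4 KiB page */
    printf("DDR IB: ~%.0f pages/s\n", ddr_ib / 8 / 4096);   /* ~488,000 */
    printf("QDR IB: ~%.0f pages/s\n", qdr_ib / 8 / 4096);   /* ~977,000 */
    printf("RoCE:   ~%.0f pages/s\n", roce   / 8 / 4096);   /* ~305,000 */
    return 0;
}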




Microb DTW Paced Rate Traffic Simulation Results

DTW rate generation is based on member tuning to generate 215,000 operations per second on pureScale FP2 with DDR IB adapters.

DTW simulation:

30 SLS connections
14 RAR connections
1 WARMO connection
Inter-operation gap of 175 usec

(A rough check of this pacing against the single-port latencies follows the charts below.)

RoCE 1-port operation rate is 11.47% lower than DDR IB: 199,994 vs. 225,907 operations per second.

For 2 ports, RoCE is 12.68% lower than DDR IB: 379,937 vs. 435,904 operations per second.

Average latency for RoCE with 1 port is about 1.4x higher than DDR IB: 43.66 vs. 17.93 usec.

For 2 ports, RoCE latency is about 1.15x higher than DDR IB: 54.92 vs. 25.50 usec.

2-port scaling is 94% for DDR IB but only 90% for RoCE.

[Chart: Microb DTW profile performance operation rates, build 1146A_61ps111; total operations per second for 1 and 2 ports on DDR, QDR and RoCE]

[Chart: Microb DTW profile performance latency, build 1146A_61ps111; average operation latency (usec) for 1 and 2 ports on DDR, QDR and RoCE]
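
A rough sanity check of the paced-rate design, assuming each connection issues one operation, waits for completion, then idles for the 175 usec gap. The latencies are the DDR IB single-connection values from the latency table (SLSPTEST, RARPTEST, WARMOPTEST); contention is ignored, so this is an optimistic bound on the ~215,000 ops/s target.

#include <stdio.h>

int main(void)
{
    double gap = 175e-6;                    /* inter-operation gap (seconds) */
    struct { int conns; double lat; } mix[] = {
        { 30,  8.77e-6 },                   /* SLS connections  */
        { 14, 18.27e-6 },                   /* RAR connections  */
        {  1, 34.10e-6 },                   /* WARMO connection */
    };

    /* each connection completes one op per (gap + latency) interval */
    double total = 0.0;
    for (int i = 0; i < 3; i++)
        total += mix[i].conns / (gap + mix[i].lat);

    printf("expected paced rate: ~%.0f ops/s\n", total);    /* ~240,000 */
    return 0;
}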




Microb DTW Peak Rate Traffic Simulation Results

Peak rate generation is based on the same number of connections running SLS, RAR and WARMO operations unthrottled. Current results do not necessarily match the DTW profile percentages.

Peak rates give an indication of the operation rate and latency differences when the load is increased above the 215,000 operations per second of the DTW mix.

RoCE 1-port operation rate is 42.37% lower than DDR IB: 464,843 vs. 828,927 operations per second.

For 2 ports, RoCE is 46.74% lower than DDR IB: 640,531 vs. 1,202,565 operations per second.

Average latency for RoCE with 1 port is 43.73% higher than DDR IB: 78.51 vs. 53.60 usec.

For 2 ports, RoCE latency is about 87% higher than DDR IB: 138.65 vs. 74.28 usec.

2-port scaling is only 45% for DDR IB at peak traffic rates and 38% for RoCE, measuring scaling as the gain of the 2-port rate over the 1-port rate (see the check below).

[Chart: Microb peak profile performance operation rates, build 1146A_61ps111; total operations per second for 1 and 2 ports on DDR, QDR and RoCE]

[Chart: Microb peak profile performance latency, build 1146A_61ps111; average operation latency (usec) for 1 and 2 ports on DDR, QDR and RoCE]
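
The scaling figures can be reproduced directly from the operation rates above; 2-port scaling here is the fractional gain of the 2-port rate over the 1-port rate.

#include <stdio.h>

int main(void)
{
    double ddr1  = 828927,  ddr2  = 1202565;   /* DDR IB peak ops/s, 1 and 2 ports */
    double roce1 = 464843,  roce2 = 640531;    /* RoCE peak ops/s, 1 and 2 ports   */

    printf("DDR IB 2-port scaling: %.0f%%\n", (ddr2 / ddr1 - 1) * 100);    /* ~45% */
    printf("RoCE   2-port scaling: %.0f%%\n", (roce2 / roce1 - 1) * 100);  /* ~38% */
    return 0;
}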




Summary

Current assessment:

Currently DDR IB using the Galaxy2 adapter on POWER performs best.

RoCE is the worst-performing option due to much higher latency and limited link bandwidth.

Yet to be assessed:

In-depth analysis of the RoCE driver vs. the GX++ Galaxy driver.

Scaling up to 4 ports on the CF.

Comparison to Intel hardware for RoCE.




Backup charts

