
IBM Power Systems™
IBM Systems & Technology Group

AIX Troy performance over DDR IB, QDR IB and RoCE

Bernie King-Smith
IBM Systems & Technology Group
AIX Performance and Analysis

© 2011 IBM Corporation



Overview

Goal: Measure how the three fabrics supported by Troy compare to each other on POWER/AIX.

How well do Troy/pureScale operations scale on DDR IB, QDR IB and RoCE?

How well do QDR IB and RoCE perform compared to DDR IB using a simulated DTW operation mix?

How well does the CF scale going from 1 to 2 ports?

This is just a beginning, with questions yet to be answered.




System Configuration

[Diagram: four members (Member 1–4) and one CF, each a P730-IOC IBM server, connected over three networks: DDR IB through a SilverStorm 9024 switch, QDR IB through a Mellanox 5030 switch, and RoCE through a BNT G8124 switch]

4-member configuration.
Five P730-IOC CECs, each with 16 cores and 128 GB of memory.
DDR IB: Galaxy2 adapter.
QDR IB: Travis-IB adapter.
RoCE: Travis-EN adapter (PRQ).
Only the CF used both ports in the adapters for 2-port measurements.




Microb Single Port Latency

Latency was expected to be about 1 usec higher for QDR IB and RoCE than for DDR IB due to PCIe bus overhead vs. GX++. However, we currently see about a 2.2 usec increase.

RoCE latency was expected to be about 1 usec higher than QDR IB even though they use the same ConnectX chip. We currently see about a 2 usec increase.

Findings comparing DDR IB to RoCE ctraces:

Instruction counts for SLS, RAR and WARM are higher by 3.99%, 4.80% and 3.79% respectively.

All increases in instruction counts are in RoCE-specific code, specifically mxibQpPostSend.

Of the increased instruction counts, 30% are from the stamp_wqe routine, which places an end-of-list pattern at the end of the WQE list for the adapter read-ahead function. We are working with RoCE development on improving this routine.

RoCE currently uses up to 3 lwsync() calls per mxibQpPostSend call, which may be unnecessary: at the end of the routine we ring the doorbell, which does a full sync() anyway. (A sketch of this pattern follows the table below.)

[Chart: Troy operation single-connection latency, build 1146A_61ps111; latency (usec) by operation type for DDR IB, QDR IB and RoCE]

Latency (usec)   Galaxy2 DDR   QDR IB    RoCE
DIAGPTEST               6.91     9.11   11.08
RARPTEST               18.27    19.36   25.82
RARPTESTND              7.76     9.93   12.02
REGPTEST               23.22    25.65   30.82
SLSPTEST                8.77    10.68   12.95
WARMOPTEST             34.10    34.37   48.90
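
To make the stamp_wqe and lwsync() findings concrete, here is a minimal C sketch of a post-send path of this general shape. The queue structure, field layout and stamp pattern are hypothetical illustrations; only the names mxibQpPostSend, stamp_wqe, lwsync() and the doorbell sync() come from the trace findings above.

#include <stdint.h>
#include <string.h>

#define WQE_SIZE      64
#define STAMP_PATTERN 0xdeadbeefu          /* hypothetical end-of-list marker */

struct qp {
    uint8_t           *wqe_ring;           /* ring of send work-queue entries */
    unsigned           head, depth;
    volatile uint32_t *doorbell;           /* MMIO doorbell register */
};

static inline void lwsync(void) { __asm__ volatile("lwsync" ::: "memory"); }
static inline void hwsync(void) { __asm__ volatile("sync"   ::: "memory"); }

/* stamp_wqe: write the end-of-list pattern over the slot after the last
 * valid WQE so the adapter's read-ahead engine knows where the list stops.
 * Per the ctrace data, ~30% of the extra RoCE instructions land here. */
static void stamp_wqe(struct qp *qp, unsigned slot)
{
    uint32_t *p = (uint32_t *)(qp->wqe_ring + (size_t)slot * WQE_SIZE);
    for (unsigned i = 0; i < WQE_SIZE / sizeof(uint32_t); i++)
        p[i] = STAMP_PATTERN;
}

int mxibQpPostSend(struct qp *qp, const void *wqe)
{
    unsigned cur  = qp->head % qp->depth;
    unsigned next = (qp->head + 1) % qp->depth;

    memcpy(qp->wqe_ring + (size_t)cur * WQE_SIZE, wqe, WQE_SIZE);
    lwsync();                /* order the WQE body before the stamp ...    */

    stamp_wqe(qp, next);
    lwsync();                /* ... and the stamp before the doorbell ring */

    qp->head++;

    /* Ringing the doorbell is fenced with a full sync(), which already
     * orders every prior store; the lwsync() calls above may therefore
     * be redundant, which is the point raised in the findings. */
    hwsync();
    *qp->doorbell = qp->head;

    return 0;
}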




Microb Single Port Operation Rates

Small messages

QDR IB operation rates are 9.6 – 14.6% lower than DDR IB.

RoCE operation rates are 7.3 – 8.5% lower than QDR IB, using the same adapter chip in different modes.

The RegisterPage operation is not too critical to DB2 transaction performance, but it is much slower on the QDR IB and RoCE adapters.

[Chart: Troy single-port small-message rates, build 1146A_61ps111; operations per second by operation type (DIAGPTEST, RARPTESTND, REGPTEST, SLSPTEST) for DDR IB, QDR IB and RoCE]

Data operations

DDR IB and RoCE multiple-page write rates are link limited.

Read rates for DDR IB and RoCE are lower than writes because of additional per-page overhead.

QDR IB read and write rates are well below link limits. QDR IB is not considered strategic/tactical for Troy on POWER.

[Chart: Troy single-port data rates, build 1146A_61ps111; 4K pages per second for Read and Write on DDR IB, QDR IB and RoCE, with the DDR IB, QDR IB and RoCE link limits marked. A back-of-the-envelope derivation of these link limits follows.]
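
The link limits shown in the chart can be approximated from raw link speeds. A minimal sketch, assuming 4x IB links with 8b/10b encoding and 10 Gb Ethernet for RoCE; protocol headers and per-page overheads will push the real limits somewhat lower.

#include <stdio.h>

int main(void)
{
    /* usable data rates in bits per second */
    double ddr_ib = 16e9;   /* 4x DDR IB: 20 Gb/s signaling, 16 Gb/s data */
    double qdr_ib = 32e9;   /* 4x QDR IB: 40 Gb/s signaling, 32 Gb/s data */
    double roce   = 10e9;   /* 10 Gb Ethernet */

    /* divide by 8 bits per byte and a 4 KiB page */
    printf("DDR IB: ~%.0f pages/s\n", ddr_ib / 8 / 4096);   /* ~488,000 */
    printf("QDR IB: ~%.0f pages/s\n", qdr_ib / 8 / 4096);   /* ~977,000 */
    printf("RoCE:   ~%.0f pages/s\n", roce   / 8 / 4096);   /* ~305,000 */
    return 0;
}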




Microb DTW Paced Rate Traffic Simulation Results

DTW rate generation is based on member tuning to generate 215,000 operations per second on pureScale FP2 with DDR IB adapters.

DTW simulation:

30 SLS connections
14 RAR connections
1 WARMO connection
Inter-operation gap of 175 usec

(A rough check of this pacing against the single-port latencies follows the charts below.)

RoCE 1-port operation rate is 11.47% lower than DDR IB: 199,994 vs. 225,907 operations per second.

For 2 ports, RoCE is 12.68% lower than DDR IB: 379,937 vs. 435,904 operations per second.

Average latency for RoCE with 1 port is about 1.4x higher than DDR IB: 43.66 vs. 17.93 usec.

For 2 ports, RoCE latency is about 1.15x higher than DDR IB: 54.92 vs. 25.50 usec.

2-port scaling is 94% for DDR IB but only 90% for RoCE.

[Chart: Microb DTW profile performance operation rates, build 1146A_61ps111; total operations per second for 1 and 2 ports on DDR, QDR and RoCE]

[Chart: Microb DTW profile performance latency, build 1146A_61ps111; average operation latency (usec) for 1 and 2 ports on DDR, QDR and RoCE]
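
A rough sanity check of the paced-rate design, assuming each connection issues one operation, waits for completion, then idles for the 175 usec gap. The latencies are the DDR IB single-connection values from the latency table (SLSPTEST, RARPTEST, WARMOPTEST); contention is ignored, so this is an optimistic bound on the ~215,000 ops/s target.

#include <stdio.h>

int main(void)
{
    double gap = 175e-6;                    /* inter-operation gap (seconds) */
    struct { int conns; double lat; } mix[] = {
        { 30,  8.77e-6 },                   /* SLS connections  */
        { 14, 18.27e-6 },                   /* RAR connections  */
        {  1, 34.10e-6 },                   /* WARMO connection */
    };

    /* each connection completes one op per (gap + latency) interval */
    double total = 0.0;
    for (int i = 0; i < 3; i++)
        total += mix[i].conns / (gap + mix[i].lat);

    printf("expected paced rate: ~%.0f ops/s\n", total);    /* ~240,000 */
    return 0;
}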




Microb DTW Peak Rate Traffic Simulation Results

Peak rate generation is based on the same number of connections running SLS, RAR and WARMO operations unthrottled. Current results do not necessarily match the DTW profile percentages.

Peak rates give an indication of the operation rate and latency differences when the load is increased above the 215,000 operations per second of the DTW mix.

RoCE 1-port operation rate is 42.37% lower than DDR IB: 464,843 vs. 828,927 operations per second.

For 2 ports, RoCE is 46.74% lower than DDR IB: 640,531 vs. 1,202,565 operations per second.

Average latency for RoCE with 1 port is 43.73% higher than DDR IB: 78.51 vs. 53.60 usec.

For 2 ports, RoCE latency is about 87% higher than DDR IB: 138.65 vs. 74.28 usec.

2-port scaling is only 45% for DDR IB at peak traffic rates and 38% for RoCE, measuring scaling as the gain of the 2-port rate over the 1-port rate (see the check below).

[Chart: Microb peak profile performance operation rates, build 1146A_61ps111; total operations per second for 1 and 2 ports on DDR, QDR and RoCE]

[Chart: Microb peak profile performance latency, build 1146A_61ps111; average operation latency (usec) for 1 and 2 ports on DDR, QDR and RoCE]
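
The scaling figures can be reproduced directly from the operation rates above; 2-port scaling here is the fractional gain of the 2-port rate over the 1-port rate.

#include <stdio.h>

int main(void)
{
    double ddr1  = 828927,  ddr2  = 1202565;   /* DDR IB peak ops/s, 1 and 2 ports */
    double roce1 = 464843,  roce2 = 640531;    /* RoCE peak ops/s, 1 and 2 ports   */

    printf("DDR IB 2-port scaling: %.0f%%\n", (ddr2 / ddr1 - 1) * 100);    /* ~45% */
    printf("RoCE   2-port scaling: %.0f%%\n", (roce2 / roce1 - 1) * 100);  /* ~38% */
    return 0;
}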




Summary

Current assessment:

Currently DDR IB using the Galaxy2 adapter on POWER performs best.

RoCE is the worst-performing option due to much higher latency and limited link bandwidth.

Yet to be assessed:

In-depth analysis of the RoCE driver vs. the GX++ Galaxy driver.

Scaling up to 4 ports on the CF.

Comparison to Intel hardware for RoCE.




Backup charts

