
HP Gen8 technologies for low latency, high performance trading and exchanges

Patrick Greene, Solution Architect, HP
HPC on Wall Street, 9/19/12


Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Experience matters
HP ProLiant
#1 in x86 server market share
16+ years straight: 65 consecutive quarters in both factory revenue and units

#1 in blade server market share


5 years straight: 23 consecutive quarters in both factory revenue and units

HP's leadership in the datacenter has been built over years of innovation, experience and market leadership.
Source: IDC Worldwide Quarterly Server Tracker, August 2012. Includes Compaq ProLiant from Q1'96 through Q2'02

FSI-HPC Solutions for Capital Markets



Ultra Low Latency Systems for High Frequency Trading


- Fastest Xeon
- Performance tuning White Paper and HP-TimeTest utility
- HP/Mellanox TCP/UDP kernel bypass

Low power choices for grid computing

Open reference architecture for unstructured data


Quality infrastructure for IT cost reduction

Low Latency Systems Require Optimization at every layer in the Solution Stack
Use Cases
- Exchange Matching Engines
- Market Data Distribution
- High Frequency Algorithmic Trading
- Pre/Post Trade Analytics
- Real Time Enterprise Risk Management

Low Latency FSI Solution Stack


- Use Cases / Lines of Business
- Application Environment
- Messaging Middleware
- Server I/O Fabric
- High Speed Storage
- Integrated Acceleration
- X86-64 Server Architecture
- Firmware and Operating System
- Precision Timing
- Fab. Mgmt

Definitions:
- Solution: includes messaging middleware; in-house apps; design services
- System: integrated server/networking/storage infrastructure
- Components: specific servers/OS/switches/file system in the system

Optimized Form Factors to meet a variety of needs

DL rack-mount servers for expandability


- All top bin E5-2600 processors offered with 3DPC
- DL380p option for 25 disks in 2U 2P Gen8

BL systems with integrated networking


- Integrated chassis system for redundancy & TCO
- Gen8 NIC/Switch options leveraging PCIe Gen3

SL multi-node systems for scale-out grids


Optimized for performance, power and price at scale

ML mini-tower for ultimate expandability


ML350 model (rack mount or mini tower) for even more disk, 9 PCI slots!


HP Gen8 Servers (Sandy Bridge E5-2600)


DL380p 8SFF Model w/optional 8SFF hot swap drives

Three top bin Processors circled

8c 3.1GHz in HP Z820 workstation (4U with racking kit; no iLO4)


8c 2.9GHz and 4c 3.3GHz in DL380p (2U) and DL360p (1U)

130 watt 8c & 6c in BL460c (16 in 10U), SL230 (8 in 4U), and SL250 (4 in 4U)

Turbo Boost deserves a fresh look (e.g. +400 MHz)

Fastest Memory: ProLiant Gen8 DIMMs


Intel E5 (SB) = 4 memory channels per socket, so 2P servers have 8 channels with 2 or 3 DPC
8 dual-rank DIMMs are optimum if that meets your memory capacity requirements
Explanation: The memory bus is forced to idle for one clock when switching between ranks on the same DIMM, and to idle for 2 clocks when switching between ranks on different DIMMs. So 1DPC outperforms 2DPC at the same capacity and same number of ranks on the channel.
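
The rank-switching rule above can be turned into a small counting model. The Python sketch below is illustrative only: the function name `idle_clocks` and the (dimm, rank) access-list representation are assumptions made for this example, and a real memory controller reorders and interleaves requests in ways this model ignores.

```python
def idle_clocks(accesses):
    """Count forced idle bus clocks for a list of (dimm, rank) accesses
    on one channel, using the rule stated on the slide:
    1 idle clock when switching ranks on the same DIMM,
    2 idle clocks when switching to a rank on a different DIMM."""
    idle = 0
    for (prev_dimm, prev_rank), (dimm, rank) in zip(accesses, accesses[1:]):
        if dimm != prev_dimm:
            idle += 2          # switch to a rank on another DIMM
        elif rank != prev_rank:
            idle += 1          # switch between ranks on the same DIMM
    return idle


# Two ranks on the channel in both cases, accessed alternately (8 accesses).
one_dual_rank_dimm = [(0, 0), (0, 1)] * 4        # 1DPC, one dual-rank DIMM
two_single_rank_dimms = [(0, 0), (1, 0)] * 4     # 2DPC, two single-rank DIMMs

print(idle_clocks(one_dual_rank_dimm))     # 7
print(idle_clocks(two_single_rank_dimms))  # 14
```

With the same number of ranks on the channel, the 1DPC layout pays half as many idle clocks in this worst-case alternating pattern, which is the point the explanation makes.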

UDIMMs offer a 1 clock latency advantage when there is only 1 DIMM per Channel (DPC)
Unregistered DIMM (UDIMM) failure rates are higher, so use these judiciously
DIMM                         Description            1DPC (DDR3-)  2DPC (DDR3-)  3DPC (DDR3-)
8GB 2Rx4 PC3-12800R          1.5V DDR3-1600 RDIMM   1600          1600          1333 (1)
16GB 2Rx4 PC3-12800R         1.5V DDR3-1600 RDIMM   1600          1600          1333 (1)
4GB 2Rx8 PC3-12800E          1.5V DDR3-1600 UDIMM   1600          -             -
8GB 2Rx8 PC3-12800E (New)    1.5V DDR3-1600 UDIMM   1600          -             -
4 June, 2012
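
As a rough guide to what the DDR3-1600 and DDR3-1333 columns mean for throughput, the Python sketch below computes theoretical peak memory bandwidth. It assumes the standard 64-bit (8-byte) DDR3 data bus per channel and the 4 channels per socket stated above; the function name `peak_bandwidth_gbs` is made up for this example, and sustained bandwidth in practice is lower than these peak figures.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, channels, bytes_per_transfer=8):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).

    mega_transfers_per_s: DDR3 data rate, e.g. 1600 for DDR3-1600
    channels:             number of populated memory channels
    bytes_per_transfer:   8 for a 64-bit DDR3 channel (assumption stated above)
    """
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


per_socket_1600 = peak_bandwidth_gbs(1600, channels=4)   # one E5-2600 socket
two_socket_1600 = peak_bandwidth_gbs(1600, channels=8)   # 2P server
two_socket_1333 = peak_bandwidth_gbs(1333, channels=8)   # 2P server at 3DPC speed

print(f"{per_socket_1600:.1f} GB/s per socket at DDR3-1600")
print(f"{two_socket_1600:.1f} GB/s per 2P server at DDR3-1600")
print(f"{two_socket_1333:.1f} GB/s per 2P server at DDR3-1333")
```

Dropping from 1600 to 1333 at 3DPC costs roughly one sixth of peak bandwidth, which is the capacity-versus-speed trade-off the table implies.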

Why dual rank?


For the same memory speed and DIMM type, more ranks will result in lower loaded latency. We enable rank interleaving when dual-rank DIMMs are installed on a channel. So more ranks give the memory controller a greater capability to parallelize the processing of memory requests. This results in shorter request queues and therefore lower latency.

Platform Tuning Advice for Low Latency


Updated White Paper: Configuring and Tuning HP ProLiant Servers for Low-Latency Applications
Posted at http://h20000.www2.hp.com/bc/docs/support/SupportManual/c01804533/c01804533.pdf

Disable Power and CPU Monitoring SMI


Eliminate 8x/sec latency spike on managed servers from this System Management Interrupt (SMI) of magnitude >200 usec
Turns off P-state monitoring so server always runs at full speed

Consider Disabling Memory Pre-Failure Notification SMI


Eliminates an SMI that occurs once per 5 min for Gen8 and once/hour for G7. Correctable and uncorrectable memory error handling is unaffected by turning off notification of the # of correctable errors made

Do this with the new HPRCU, Conrep scripting tool or RBSU Advanced Menu
Conrep now available for Solaris too

See User Guide for ROM-Based Setup Utility (RBSU) for explanation of BIOS settings
Pub #347563-405 June, 2012 at: http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00191707/c00191707.pdf

Run HP-TimeTest utility v7.2 for a quick jitter check


Request free utility via e-mail to low.latency@hp.com. Include your company name, city/country, and HP sales rep/reseller if known so that the right regional person can respond.
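
For readers who want a feel for what a jitter check looks for before requesting the utility, here is a minimal Python sketch. It is not HP-TimeTest and does not reproduce its method: it simply spins on a clock and records any gap between consecutive reads that exceeds a threshold. The function name `find_spikes` and its parameters are assumptions for this example, and Python interpreter overhead means only thresholds of tens of microseconds or more are meaningful here.

```python
import time


def find_spikes(duration_s=1.0, threshold_usec=50.0):
    """Spin for duration_s seconds and return a list of
    (elapsed_seconds, gap_usec) for every gap between two consecutive
    clock reads that is larger than threshold_usec."""
    spikes = []
    start = prev = time.perf_counter_ns()
    end = start + int(duration_s * 1e9)
    while prev < end:
        now = time.perf_counter_ns()
        gap_usec = (now - prev) / 1000.0
        if gap_usec > threshold_usec:
            # Something (SMI, scheduler, interrupt, ...) stole the CPU.
            spikes.append(((prev - start) / 1e9, gap_usec))
        prev = now
    return spikes


spikes = find_spikes(duration_s=0.5, threshold_usec=50.0)
print(f"{len(spikes)} gaps above threshold")
for elapsed, gap in spikes[:5]:
    print(f"  t={elapsed:.3f}s gap={gap:.1f} usec")
```

Running this pinned to an isolated core, before and after changing the BIOS settings above, gives a coarse view of whether periodic spikes have gone away.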

The Benefit of Low Latency Tuning: minimized jitter


Plots of HP-TimeTest output (latency spikes in cycles and usec vs. elapsed time, 0-600 seconds):

Latency Spikes: Time History, DL380p Gen8, E5-2643 @ 3.300 GHz, RHEL 6.2/2.6.32-220.el6.x86_64, HP-TimeTest7.2
- With current LL tuning on SNB, we observe spikes <9 usec
- Jitter observed in 7-8 microsecond range
- Threshold set to 3 usec

Latency Spikes: Time History, DL380p Gen8, E5-2643 @ 3.300 GHz, RHEL 6.2/2.6.32-220.el6.x86_64, HP-TimeTest7.2, bootleg BIOS (06/22/2012)
- With prototype HP BIOS option for SNB memory power refresh (to be released mid-Oct '12), we observe spikes <3 usec!
- Jitter observed in 1.5-2.5 microsecond range!
- Threshold set to 1.5 usec
Why PCIe Gen3 matters...

ProLiant Gen8 servers with ConnectX-3 based adapters and VMA acceleration enable a 2 usec trading advantage!


VMA v6 - TCP Improved Capability In ConnectX-3


Connection Steering
- CX-2: MAC+IP per process in addition to Server MAC+IP
- CX-3: No additional MAC+IP. Use Server's MAC+IP
- Description: ConnectX-3 implements Flow Steering

Multithread support
- CX-2: QP per process. Multi-threaded applications will share same CQ
- CX-3: QP per thread/socket
- Description: ConnectX-3 Flow Steering enables finer performance tuning and optimizations

DHCP
- CX-2: Not supported
- CX-3: Supported

Bonding & HA
- CX-2: Not supported
- CX-3: Supported (Q1'12)

VLAN
- CX-2: Not supported
- CX-3: Supported

IP routing gateway
- CX-2: Single default GW is supported per process and requires per process configuration
- CX-3: Host stack routing table is supported
- Description: ConnectX-3 Flow Steering enables utilizing the host IP stack

CX-3 Introduces 40GbE!



HP/Mellanox Solution now accelerates TCP as well as UDP protocols


TCP Latency Improvement (Netperf 10GbE)
Chart: round-trip latency (usec, 0-10) vs. message size (16, 32, 64, 128, 256, 512, 1024 bytes) for three configurations:
- G7 X5687 3.6GHz ConnectX-2
- G7 X5687 3.6GHz ConnectX-3/VMA
- Gen8 E5-2690 2.9GHz ConnectX-3/VMA

Back-to-back configuration (no switch), round trip; Netperf v2.5.0; MTU size = 1470 bytes; RHEL 6.1; ConnectX-3 FW 2.10.2220; Driver: OFED-VMA 1.5.3-0008; VMA 6.1.6
Command line: netperf -n 16 -H <peer ip> -c -C -P 0 -t TCP_RR -l 10 -T 2,2 -- -r <message size>
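
The TCP_RR test above is a request/response ping-pong with a fixed message size. The Python sketch below shows the same measurement pattern over the loopback interface so the idea can be tried without a second host. It is not netperf and makes no use of VMA; the function names are invented for this example, and loopback numbers measured from Python say nothing about the NIC latencies in the chart.

```python
import socket
import threading
import time


def recv_exact(sock, size):
    """Read exactly size bytes, or raise if the peer closes first."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def echo_once(listener, size):
    """Accept one connection and echo fixed-size messages until it closes."""
    conn, _ = listener.accept()
    with conn:
        conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        try:
            while True:
                conn.sendall(recv_exact(conn, size))
        except OSError:
            pass  # client closed the connection; we are done


def tcp_round_trips_usec(size, iterations=1000):
    """Return a list of round-trip times in usec for size-byte messages."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))      # port 0: let the OS pick one
        listener.listen(1)
        server = threading.Thread(target=echo_once, args=(listener, size),
                                  daemon=True)
        server.start()
        payload = b"x" * size
        samples = []
        with socket.create_connection(listener.getsockname()) as client:
            # Disable Nagle so small messages are sent immediately.
            client.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
            for _ in range(iterations):
                start = time.perf_counter_ns()
                client.sendall(payload)
                recv_exact(client, size)
                samples.append((time.perf_counter_ns() - start) / 1000.0)
        server.join(timeout=5)
    return samples
```

Calling `tcp_round_trips_usec(64)` and taking the median of the returned list gives one point comparable in shape, though not in magnitude, to a point on the chart's x axis.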

Application Accelerator Options


FSI customers use accelerators for faster feed handlers, order execution engines, and compute-intensive risk & pricing calculations

ISS/HPC team helps certify accelerators in ProLiant


Computational accelerator partner:
- NVIDIA (SL2X0 Gen8 with Tesla cards)

FPGAs:

HFT accelerator solution partners:
- ActivFinancial (OEMs DL380)
- Tervela (OEMs DL380)
- Ulink (OEMs DL160)

Gen8 servers enhance our support for accelerators

DL380p risers now support double wide HL PCIe cards with aux power cable options at PCIe Gen3 speeds!

Rapid changes underway: FPGA vendors adding 10GbE; 10GbE vendors adding FPGAs; switches adding FPGAs

Application Programming for Low Latency


Determine how many cores your trading strategy requires
Can it run on 8 cores? If so, match up CPU+NIC per strategy

Maximize your application resources by doing the following:
1. Bind threads, interrupts and processes to cores using CPU_ID
/usr/bin/taskset -c 0,1 /usr/bin/numactl --localalloc ... (other command line options)
or use Red Hat tuna to do this with a GUI (in RHEL 5.5 MRG and RHEL 6.0 standard)
Beginning with Sandy Bridge on-chip PCIe controllers, bind NICs to cores for minimum QPI latencies

2. Program memory accesses for NUMA awareness
See: http://bizsupport2.austin.hp.com/bc/docs/support/SupportManual/c03261871/c03261871.pdf

3. Place communication function threads on adjacent cores
4. Use PCM to determine L3 cache misses & keep data in L3 cache
http://software.intel.com/file/41604
5. Compile with performance settings, use PGO, evaluate IPP / SSE 4.2 strings
http://software.intel.com/en-us/articles/using-avx-without-writing-avx-code/
6. Implement application-transparent multicast acceleration between nodes: link Mellanox's VMA v6 library to the application for kernel bypass over Ethernet and IB (HP now resells VMA)

THE GOLDEN TICKET: Above the noise.
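
Step 1 can also be done from inside a process rather than from the command line. The Python sketch below uses the Linux-only `os.sched_setaffinity` call, which restricts the calling process to a set of CPU IDs much as `taskset -c` does for a launched command. The helper name `pin_to_cpus` is an assumption for this example, and it does nothing about memory placement, so it is not a substitute for `numactl --localalloc`.

```python
import os


def pin_to_cpus(cpus):
    """Restrict the calling process to the given CPU IDs and return the
    resulting affinity set. Linux only: other platforms raise OSError."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity API not available on this platform")
    os.sched_setaffinity(0, set(cpus))   # pid 0 means the calling process
    return os.sched_getaffinity(0)


# Example (commented out so that importing this sketch has no side effects):
#   pin_to_cpus({0, 1})   # same CPU list as the taskset example above
```

Which CPU IDs share a socket with a given NIC is system specific; on Linux the NIC's NUMA node is typically visible under /sys, and the chosen cores should come from that node to avoid QPI hops.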

FSI-HPC Solutions for Capital Markets



Ultra low latency systems for High Frequency Trading

Low power choices for grid computing
- SL200s servers with GPU options
- Moonshot program for ARM, Atom, Phi

Open reference architecture for unstructured data

Quality infrastructure for IT cost reduction


Demonstrating the value of SL6500 servers


Built on ProActive Insight Architecture
SL230s
HPC optimized for maximum performance, efficiency and density
- Purpose-built for HPC performance at scale
- Up to 1 integrated I/O Accelerator
- Maximum speed FDR IB FlexibleLOM
- Multi-node 1/2U density and efficiency
- Enhanced, simple front serviceability
- Rack level power management
- Industry Leading Mgmt with Insight Control*

SL250s
HPC optimized for efficiency and density, with balanced GPU performance
- Purpose-built for HPC performance at scale
- Up to 3 integrated GPUs
- Maximum speed FDR IB FlexibleLOM
- Multi-node 1U density and efficiency
- Enhanced, simple front serviceability
- Rack level power management
- Industry Leading Mgmt with Insight Control*

GPUDirect RDMA for Peer-to-Peer I/O


GPU Direct RDMA (previously known as GPU Direct 3.0)

Enables peer to peer communication directly between HCA and GPU


Dramatically reduces overall latency for GPU to GPU communications by bypassing the host CPU's memory
Diagram labels: System Memory, CPU, PCI Express 3.0, GPU, GDDR5 Memory, Mellanox HCA, Mellanox VPI

Availability: GPUDirect RDMA requires CUDA 5.0 and MLNX_OFED driver changes (beta 9/12 with expected GA by 12/12).

HP/Nvidia Gen 8 GPU Starter Kit V2.0 in Americas


Configuration:

- 1 DL380 control node w/ E5-2670 8 core 2.6GHz 115W CPUs, 64 GB RAM and 2x 600 GB HDD
- 1 SL6500 enclosure
- 4 SL250s 2U server trays w/ E5-2670 8 core 2.6GHz 115W CPUs, 64 GB RAM, 600 GB HDD, 2 Nvidia M2090 GPU modules
- Mellanox IB 4x QDR 36 port managed switch
- HPN ProCurve 2910 24 port 10/100/1000 Ethernet switch
- RHEL, CMU, Linux Value Pack
- Rack and infrastructure
- Hardware/Software Integration

Intended uses:
- Development environment for commercial, enterprise, Higher Ed, ISVs
- CUDA programming environment
- Proof-of-concept environment for channel partners

End-user Price ~$70K


Contact HP Sales for detailed BOM

Throughput-bound applications pervade the trading lifecycle


Data Storage and Analysis
- Historical market data
- Firm-wide log consolidation
- Data Publishing for on-demand large analytics

Post Trade Analysis and Compliance
- Full trade history logs and analytics
- Venue latencies
- Transaction Costs
- Risk Analytics

Strategy Execution
- Matching
- Execution
- Online Risk Management
- Other latency sensitive apps

Strategy Development and Testing
- Back Testing
- Optimization
- Search

A New Era of Extreme Scale Computing


From tens of servers per rack sharing nothing to thousands sharing everything

HP Project Moonshot Infrastructure


HP Converged Infrastructure

Federated
Management, Fabric, Storage Networking, Power/Cooling

HP Redstone
Server Development Platform

HP Discovery Lab
Proof of Concept Lab

HP Pathfinder Program
Partner Collaboration


HP Redstone Server Development Platform


Perfect for development and testing with unparalleled density, flexibility, and simplicity
ProLiant SL 6500s chassis
Up to 72 servers in a single 1U tray

Shared SL 6500 scalable system enclosure
- Pooled power: 4 common slot power supplies
- Shared cooling: 8 shared fans, N+1, rear-serviceable
- Integrated, configurable network fabric with up to 16 10Gb uplinks

Up to 288 servers
- 18 quad node compute cartridges per server tray
- 4 trays in a single 4U chassis
- Calxeda EnergyCore quad-core ARM SoCs w/4MB L2 cache
- Up to 4GB ECC (up to 1333MHz) memory per server
- Integrated management

Shared and configurable storage
- Diskless or up to 4 SATA drives (1 drive cartridges) per server
- Up to 192 SSD or 96 2.5" SFF HDD per enclosure

HP Redstone Development Platform Server tray
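
The density figures on this slide follow from simple multiplication. The short Python check below uses only numbers quoted on the slide (18 cartridges per tray, quad-node cartridges, 4 trays per chassis); the constant names are made up for this example.

```python
CARTRIDGES_PER_TRAY = 18   # "18 quad node compute cartridges per server tray"
NODES_PER_CARTRIDGE = 4    # "quad node"
TRAYS_PER_CHASSIS = 4      # "4 trays in a single 4U chassis"

servers_per_tray = CARTRIDGES_PER_TRAY * NODES_PER_CARTRIDGE
servers_per_chassis = servers_per_tray * TRAYS_PER_CHASSIS

print(servers_per_tray)      # 72
print(servers_per_chassis)   # 288
```

Both results match the slide's "up to 72 servers in a single 1U tray" and "up to 288 servers" per 4U chassis.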

Breakthrough Savings and Simplicity


Energy, cost and space savings move the industry to new infrastructure
Traditional x86: $3.3M
- 400 servers
- 10 racks
- 20 switches
- 1,600 cables
- 91 kilowatts

HP Redstone Server: $1.2M
- 1,600 servers
- 1/2 rack
- 2 switches
- 41 cables
- 9.9 kilowatts

Result: 89% less energy, 94% less space, 63% less cost, 97% less complexity

Select hyperscale web, and data analytics applications show tremendous promise
Based on weighted average performance projections for workloads such as web serving, memcached, and Data Analytics. Cost estimates include infrastructure, space, and power and cooling costs over three years.
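
The headline percentages can be approximately reproduced from the rounded figures on the slide. The Python sketch below does that arithmetic. Two assumptions are worth flagging: the dictionary keys and the helper `percent_less` are invented for this example, and mapping "complexity" to the cable count is a guess, since the slide does not say how complexity was measured. Small differences from the slide's numbers (for example space and cost) are expected because the slide's inputs are rounded.

```python
def percent_less(before, after):
    """Percentage reduction going from before to after."""
    return 100.0 * (1.0 - after / before)


traditional = {"cost_musd": 3.3, "racks": 10, "switches": 20,
               "cables": 1600, "kilowatts": 91}
redstone = {"cost_musd": 1.2, "racks": 0.5, "switches": 2,
            "cables": 41, "kilowatts": 9.9}

savings = {key: percent_less(traditional[key], redstone[key])
           for key in traditional}

for key, value in savings.items():
    print(f"{key:>10}: {value:.1f}% less")
```

Energy (about 89%) and cables (about 97%) line up with the slide; cost comes out near 63.6% and rack space at 95%, close to the quoted 63% and 94%.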


FSI-HPC Solutions for Capital Markets



Ultra low latency systems for High Frequency Trading


Low power choices for grid computing

Open reference architecture for unstructured data
- Scalable Hadoop clusters with CMU
- Analysis with Vertica and Autonomy

Quality infrastructure for IT cost reduction



What is Hadoop?
Your data is going unstructured
- The digital universe will expand by almost half in 2012 - 90% of that data is unstructured
- Traditional systems are not designed to analyze unstructured data
- Hadoop is designed specifically to extract business value from unstructured data

Risk Modeling

Fraud Detection

Sentiment Analysis

Customer Retention

Web Mining

Financial Services

Government

Retail

Telecom

Media


How does Hadoop fit into existing BI ecosystems?


Click Stream Analysis using Hadoop, Vertica and Autonomy
- Navigation paths
- Time per page
- Products Browsed
- Data Assimilation
- Multi-dimensional analysis
- Predictive analysis
- Geographical analysis
- User segmentation
- Software testing
- Market research

Data Consolidation, Aggregation


Transformation into structured data

HP Insight CMU

Unstructured Click Stream Data

Hadoop Distributions (Cloudera, MapR, Hortonworks)
Operating System
HP Converged Infrastructure

Ad hoc SQL Compliant Analytics

Business Users

Vertica
Meaning Based Analytics

Autonomy IDOL

Consulting Services


HP offers the shortest route to Hadoop success


Open strategy that combines Hadoop with advanced analytics and management
Deploy in days, not months
Seamless analytics

- Scale to thousands of nodes with the push of a button
- Manage with single pane of glass
- Optimize with real time and 3-D historical views of compute resources
- Perform end to end analytics

Insight cluster management utility
Consulting Services
Leading Distributions
Choice of solutions


HP HyperStorage Server
Address the explosion of data permeating the data center
ProLiant SL 4500
Shared SL 4500 HyperStorage chassis
- Pooled power: 4 HP common slot power supplies
- Shared cooling: 10 shared fans, N+1, rear-serviceable
- Shared management: reduced cabling with single iLO port

Most dense storage available in market today
- Up to 60 LFF drives in a single chassis giving a total of 180 TB of available storage

Multiple configurations available
- Single node (180TB storage): single server model gives the most dense storage solution for massive data stores
- Dual node (2 x 75TB storage): dual server provides an optimal mix of high density storage and compute
- Triple node (3 x 45TB storage): triple server gives users optimal mix of storage and compute for working inside large unstructured datasets
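
The capacity figures for the three configurations imply a drive count per node. The Python sketch below derives it under one loudly stated assumption: every bay holds the same drive, whose size is inferred as 180 TB / 60 drives = 3 TB. The slide does not state the drive size or the per-node drive counts, so the results are illustrative only, and the variable names are invented for this example.

```python
TOTAL_TB = 180          # single node: "180 TB of available storage"
TOTAL_DRIVES = 60       # "up to 60 LFF drives in a single chassis"

tb_per_drive = TOTAL_TB / TOTAL_DRIVES   # inferred, not stated on the slide

# (nodes, TB per node) as quoted on the slide
configs = {"single": (1, 180), "dual": (2, 75), "triple": (3, 45)}

for name, (nodes, tb_per_node) in configs.items():
    drives_per_node = tb_per_node / tb_per_drive
    total_tb = nodes * tb_per_node
    print(f"{name:>6}: {drives_per_node:.0f} drives/node, {total_tb} TB total")
```

Under that assumption, adding server nodes trades drive bays for compute: total capacity falls from 180 TB to 150 TB to 135 TB as the node count rises.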

HP ProLiant SL4500 Solution Efficiency


Three Node vs. Traditional Similar Deployment


SL45xx Overview and Features


Designed for Density
- First HP ProLiant server built purely with storage intensive applications in mind
- Densest HDD option in HP ProLiant portfolio
- Various configurations allow customers to optimize for their unique data center needs


FSI-HPC Solutions for Capital Markets



Ultra low latency systems for High Frequency Trading

Low power choices for grid computing

Open reference architecture for unstructured data

Quality infrastructure for IT cost reduction


- ProActive Insight Architecture
- Performance Optimized Datacenters

HP ProLiant Gen8:

The World's Most Self-Sufficient Servers


With HP ProActive Insight architecture:
- Integrated Lifecycle Automation
- Dynamic Workload Acceleration
- Automated Energy Optimization
- Proactive Service & Support

3X admin productivity improvement
6X performance increase for the most demanding workloads
70% more compute per watt
66% faster time to problem resolution

Serviceability with Quality: www.youtube.com/watch?v=AZw-LG-oyDU



Designed to Simplify, Integrate and Automate your Infrastructure

HP ProActive Insight Architecture

iLO4 Management Engine

HP Smart Storage

HP FlexNet Adapters
Insight Online
Virtual Connect
Sea of Sensors 3D
ProLiant Operating Environment
Datacenter Smart Grid

Integrated Lifecycle Automation / Dynamic Workload Acceleration / Automated Energy Optimization / ProActive Service and Support

Gen8 Smart Array Innovations


Increased performance, data availability and storage capacity
Faster access to data
- Up to 2X performance improvement*
- 2X Write Cache (up to 2 GB)

Address explosive data growth
- 2X # of Drives supported (up to 227)

Minimize data loss
- Long term data retention with Flash Backed Write Cache standard

External model with SAS cable connectors for extending the RAID set to JBODs

Reduce initial setup time
- 95% reduction in parity initialization, from several days to 5 hours**

*256KiB, sequential write, RAID 5 with 15K SAS drives; performance will vary based on configuration
** HP R&D; validation information TBD

Thank you

Low.latency@hp.com
