
RFD tNavigator

Performance Benchmarking and Profiling


May 2016
Note
• The following research was performed under the HPC Advisory Council activities
– Participating vendors: Intel, Dell, Mellanox
– Compute resource - HPC Advisory Council Cluster Center
• The following was done to provide best practices
– tNavigator performance overview
– Understanding tNavigator communication patterns
– Ways to increase tNavigator productivity
– MPI library comparisons
• For more information, please refer to:
– http://www.dell.com
– http://www.intel.com
– http://www.mellanox.com
– http://www.rfdyn.com/technology/

RFD tNavigator
• tNavigator
– Developed by the research and product development teams of Rock Flow Dynamics
– Designed for running dynamic reservoir simulations on engineers’ laptops, servers, and HPC clusters.
– Written in C++ and designed from the ground up to run parallel acceleration algorithms on multicore and manycore
shared and distributed memory computing systems.
– Employs the Qt graphical libraries, which makes the system truly multiplatform.
– By taking advantage of the latest computing technologies such as NUMA, Hyper-Threading, and hybrid MPI/SMP parallelism, the performance of tNavigator far exceeds that of industry-standard dynamic simulation tools.
– License pricing does not depend on the number of cores employed in shared-memory computing systems.
• One distinctive feature is interactive user control of the simulation run
– Users can not only monitor every step of the reservoir simulation at runtime
– but also directly interrupt and change the simulation's configuration with a mouse click.

Objectives

• The presented research was done to provide best practices


– tNavigator performance benchmarking
• CPU performance comparison

• MPI library performance comparison

• Interconnect performance comparison

• System generations comparison

• The presented results will demonstrate


– The scalability of the compute environment/application
– Considerations for higher productivity and efficiency

Test Cluster Configuration
• Dell PowerEdge R730 32-node (1024-core) “Thor” cluster

– Dual-Socket 16-Core Intel E5-2697A v4 @ 2.60 GHz CPUs (Power Management in BIOS set to Maximum Performance)

– Memory: 64GB DDR4 2133 MHz; Memory Snoop Mode in BIOS set to Home Snoop; Turbo enabled

– OS: RHEL 6.5, MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW stack

– Hard Drives: 2x 1TB 7.2K RPM SATA 2.5” in RAID 1

• Mellanox ConnectX-4 100Gb/s EDR InfiniBand Adapters

• Mellanox Switch-IB SB7700 36-port 100Gb/s EDR InfiniBand Switch

• Mellanox ConnectX-3 FDR InfiniBand, 10/40GbE Ethernet VPI Adapters

• Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet Switch

• MPI: Intel MPI 5.1.3

• Application: RFD tNavigator v4.2.3-1177-g638ceab

• Benchmark dataset: SpeedTestModel

PowerEdge R730
Massive flexibility for data intensive operations
• Performance and efficiency
– Intelligent hardware-driven systems management
with extensive power management features
– Innovative tools including automation for
parts replacement and lifecycle manageability
– Broad choice of networking technologies from GigE to IB
– Built-in redundancy with hot-plug, swappable PSUs, HDDs and fans
• Benefits
– Designed for performance workloads
• from big data analytics, distributed storage, or distributed computing where local storage is key,
to classic HPC and large-scale hosting environments
• High performance scale-out compute and low cost dense storage in one package
• Hardware Capabilities
– Flexible compute platform with dense storage capacity
• 2S/2U server, 6 PCIe slots
– Large memory footprint (Up to 768GB / 24 DIMMs)
– High I/O performance and optional storage configurations
• HDD options: 12 x 3.5” or 24 x 2.5”, plus 2 x 2.5” HDDs in the rear of the server
• Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch

RFD tNavigator Performance – Ethernet vs InfiniBand

• InfiniBand delivers superior scalability and performance compared to Ethernet


– EDR InfiniBand provides higher performance and better scalability than Ethernet
– EDR InfiniBand delivers 35-72% higher performance than 10/40GbE
– InfiniBand continues to scale as node and process counts increase

[Chart: EDR InfiniBand vs. 10GbE and 40GbE, 32 MPI processes per node; EDR gains of 35%, 63%, and 72%; higher is better]

RFD tNavigator Performance – Processes Per Node

• Each tNavigator MPI process spawns multiple OpenMP worker threads onto CPU cores
– Typical case: launch 1 process per node (PPN=1), which then spawns threads to utilize all cores
– We compare against 2 PPN, where each process spawns threads only within its own CPU socket
– Up to 20% gain in performance was seen (a minimal hybrid launch sketch follows the chart below)

[Chart: 1 PPN vs. 2 PPN, showing gains of 13% and 20%; higher is better]
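
The following is a minimal hybrid MPI+OpenMP sketch (not tNavigator source code) that illustrates the 2 PPN layout described above: one MPI rank per CPU socket, each spawning one OpenMP worker thread per core of that socket. The launch line in the header comment assumes Intel MPI options (-ppn, -genv, I_MPI_PIN_DOMAIN=socket) and 16-core sockets as used on the Thor cluster; adjust for other environments.

/*
 * Hybrid MPI+OpenMP sketch: one rank per socket, one thread per core.
 *
 * Assumed launch (Intel MPI, 16-core sockets):
 *   mpirun -ppn 2 -genv I_MPI_PIN_DOMAIN=socket \
 *          -genv OMP_NUM_THREADS=16 ./hybrid_hello
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* Request thread support, since OpenMP threads run inside each rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    /* Each rank spawns its OpenMP worker threads; with
       I_MPI_PIN_DOMAIN=socket they remain on the rank's own socket. */
    #pragma omp parallel
    {
        printf("rank %d of %d: thread %d of %d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

With PPN=1 the same code would be launched with -ppn 1 and OMP_NUM_THREADS=32, letting a single rank's threads span both sockets.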

RFD tNavigator Performance – CPU Processor

• The “Broadwell” CPU provides more cores per socket than the “Haswell” family CPU
– The additional ~14% of CPU cores (16 vs. 14 per socket) translates into a corresponding ~14% increase in performance
– Haswell: the E5-2697 v3 is equipped with 14 cores per CPU and typically runs at 2.6 GHz
– Broadwell: the E5-2697A v4 is equipped with 16 cores per CPU and typically runs at 2.6 GHz

[Chart: Haswell (E5-2697 v3) vs. Broadwell (E5-2697A v4), showing a 14% gain; higher is better]

RFD tNavigator Performance – File system

• tNavigator demonstrates the need for a capable parallel file system


– A parallel file system such as Lustre, which supports RDMA transport, is a good alternative to NFS
– NFS causes performance degradation at scale
– The degradation appears at 4+ nodes and impacts scalability at around 8+ nodes
– The RamFS option is not a recommendation; it is included to demonstrate the ideal case where I/O is not a bottleneck

[Chart: NFS vs. Lustre vs. RamFS, 16 MPI processes per socket, showing gains of 29% and 195%; higher is better]

RFD tNavigator Performance – I/O

• Writing simulation results can have an impact on performance


– In both cases the results are written to shared memory locally on each node
– An 8% performance difference is seen

[Chart: effect of writing results, 16 MPI processes per socket, showing an 8% difference; higher is better]

RFD tNavigator Profiling – % MPI Communications

• The majority of MPI time is spent in MPI_Allreduce

– Imbalance is seen across some MPI processes for MPI_Allreduce
– MPI_Allreduce accounts for 70% of MPI time (14% of wall time); the sketch below illustrates where such frequent, tightly synchronizing reductions typically arise
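
As a hypothetical illustration (not taken from tNavigator source), the sketch below shows the kind of per-iteration global reduction that typically makes MPI_Allreduce dominate MPI time in implicit reservoir simulators: each rank contributes only a few scalars (for example a residual norm), so each call moves just 8 bytes yet synchronizes all ranks, which is also where per-rank imbalance becomes visible in the profile.

/*
 * Sketch of a small, frequent all-reduce: every rank contributes one double
 * (8 bytes), and every rank must wait at the call, so slow or imbalanced
 * ranks show up as MPI_Allreduce time in the profile.
 */
#include <mpi.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Stand-in for a rank's locally computed residual contribution. */
    double local_sq = (rank + 1) * 1.0e-6;
    double global_sq = 0.0;

    /* One 8-byte all-reduce per iteration of the hypothetical solver loop. */
    MPI_Allreduce(&local_sq, &global_sq, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("global residual norm = %g\n", sqrt(global_sq));

    MPI_Finalize();
    return 0;
}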

RFD tNavigator Profiling – % MPI Communications

• The majority of data-transfer messages are medium sized, except for:
– MPI_Allreduce, which has a large concentration (70% MPI, 16% wall) at small sizes (4, 8, and 128 bytes)
– MPI_Bcast, which is concentrated at 4 bytes (13% MPI, 3% wall)
– MPI_Waitall, whose calls are 0-byte calls (8% MPI, 2% wall)

[Chart: MPI message-size distribution per MPI call]

RFD tNavigator Summary
• tNavigator integrates the latest technologies, which enables higher performance
– tNavigator uses NUMA, Hyper-Threading, and hybrid MPI/SMP parallelism to achieve higher scaling
• tNavigator performs well with the right set of hardware components
– Network: InfiniBand delivers up to 72% higher performance compared to Ethernet
– CPU: a 14% increase in CPU cores translates directly into a 14% increase in performance
– Running an additional MPI process per node can improve performance by up to 20%
– File system: tNavigator demonstrates the need for a capable parallel file system
• A parallel file system such as Lustre, which supports RDMA transport, is a good alternative to NFS
• NFS causes performance degradation at scale
• Writing results can have an impact on performance

Thank You
HPC Advisory Council

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and
completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.
