Abstract— The HPC interconnect is a crucial component of any HPC machine, and its performance is one of the contributing factors to the overall performance of an HPC system. The most popular interface for connecting the Network Interface Card (NIC) to the CPU is PCI Express (PCIe). With denser core counts in compute servers and steadily maturing fabric speeds, there is a need to maximize packet data movement throughput between system memory and the fabric interface network buffers, so that the rate at which applications (CPU) generate data is matched by the rate at which the fabric consumes it. Thus, PCIe throughput for small and medium messages (Programmed Input/Output) needs to be improved to keep pace with core processing rates and fabric speeds. There is scope for this improvement in increasing the payload size of the PCIe Transaction Layer Packet (TLP). Traditionally, the CPU issues memory writes in 8-byte units (the TLP payload), underutilizing the PCIe bus since the overhead is large compared to the payload. Write combining can increase the TLP payload size, leading to more efficient utilization of the available bus bandwidth and thereby improving overall throughput.

This work evaluates the performance that can be gained by using the Write Combine Buffers (WCB) available on Intel CPUs for the send side interface of an HPC interconnect. These buffers aggregate the small (usually 8-byte) memory mapped I/O stores into larger PCIe Transaction Layer Packets (TLP), which leads to better bus bandwidth utilization. It is observed that this technique improves peak PIO bandwidth by 2x compared to normal PIO, and that up to 4096 bytes, write combine enabled PIO outperforms DMA.

Keywords— HPC interconnect; write-combining; network I/O; PCI bus bandwidth.

I. INTRODUCTION

A cluster based HPC system is mainly comprised of three components: high end compute servers, a high performance fabric interconnect, and highly optimized software. Compute servers are densely populated with high core counts, large system memory and efficient I/O buses. System software, parallel programming libraries, installation and execution scripts, statistics monitoring daemons, etc. are bundled in the software suite of an HPC system. The metrics for evaluating the performance of the fabric interconnect are bandwidth in bytes per second, latency in microseconds and message rate in millions of messages per second. Optimizations at all levels and in all components of the HPC interconnect are required to achieve competitive performance. The send side interface (between host and NIC), the fabric, and the receive side interface (between NIC and host) broadly constitute any HPC fabric interconnect. This work targets improvement of the send side interface.

The function of the send side interface is to pass application data (which may or may not be packetized) to the Network Interface Card (NIC), which sends the packetized data through the fabric to the destination node. There are two ways this I/O is performed in a system: Programmed Input/Output (PIO) and Direct Memory Access (DMA). Applications adopting message passing as their Inter Process Communication (IPC) mechanism generally have two types of messages to share: latency sensitive small messages (in bytes) and bandwidth hungry bulk transfers (in megabytes). Choosing the I/O mode according to the message size can improve performance. The threshold at which the I/O mode is switched from PIO to DMA should be set optimally to gain maximum performance. This work also derives this threshold for an HPC interconnect.
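For illustration, a minimal sketch of such size-based switching is shown below. The helper functions nic_pio_copy() and nic_dma_post() are hypothetical placeholders rather than interfaces defined by this work, and the threshold constant merely echoes the 4096-byte crossover observed in the abstract.

#include <stddef.h>

#define PIO_DMA_THRESHOLD 4096   /* bytes; placeholder for the derived crossover */

void nic_pio_copy(const void *buf, size_t len);   /* CPU writes payload into NIC (PIO)     */
void nic_dma_post(const void *buf, size_t len);   /* NIC fetches payload from memory (DMA) */

static void send_message(const void *buf, size_t len)
{
    if (len <= PIO_DMA_THRESHOLD)
        nic_pio_copy(buf, len);   /* small message: low-latency PIO path        */
    else
        nic_dma_post(buf, len);   /* bulk transfer: bandwidth-oriented DMA path */
}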
The Write Combine (WC) technique allows non-temporal streaming data to be stored temporarily in intermediate buffers and released later in a single burst, instead of being written in small chunks to the destination. The destination may be the next level cache, system memory or I/O memory. Intel CPUs contain special buffers, called write combining store buffers, to ease the L1 data cache miss penalty. When a memory region is defined as WC memory type, locations within that region are not cached and coherency is not enforced; speculative reads are allowed and writes may be delayed. Thus, WC is most suitable for applications that require strongly ordered un-cached reads but weakly ordered streaming stores, such as graphics applications and network I/O.
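To make the mechanism concrete, the sketch below copies a message into a NIC PIO region through write combining; it is a simplified illustration under assumptions, not the implementation evaluated in this work. It assumes wc_dst points to NIC memory that the driver has already mapped with the WC memory type, issues 8-byte non-temporal stores (_mm_stream_si64) that the processor gathers in its write combining buffers, and finally executes an sfence so that partially filled buffers are flushed and ordered before the NIC is notified.

#include <stddef.h>
#include <emmintrin.h>   /* _mm_stream_si64, _mm_sfence (x86-64) */

/* Copy len bytes (assumed to be a multiple of 8) into a WC-mapped NIC region. */
static void wc_pio_copy(volatile void *wc_dst, const void *src, size_t len)
{
    volatile long long *d = (volatile long long *)wc_dst;
    const long long    *s = (const long long *)src;

    /* 8-byte non-temporal stores; the CPU accumulates them in its write
     * combining buffers and emits larger PCIe TLPs (ideally 64 bytes). */
    for (size_t i = 0; i < len / 8; i++)
        _mm_stream_si64((long long *)&d[i], s[i]);

    /* Flush any partially filled WC buffers and order the stores before
     * signalling the NIC (e.g. by ringing a doorbell register). */
    _mm_sfence();
}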
II. RELATED WORK

Similar work on improving I/O bus bandwidth has been done by Steen Larsen and Ben Lee, who compared the throughput achieved using DMA-based descriptors versus write-combined PIO [1]. The present work differs from Larsen's work [1] in the following ways. First, this work deploys direct user I/O, i.e. the user application can PIO write data directly without going through the kernel, saving the context switch penalty. The issue with bypassing the kernel is the sharing problem, which