
2020 IEEE International Conference on Autonomic Computing and Self-Organizing Systems Companion (ACSOS-C) | DOI: 10.1109/ACSOS-C51401.2020.00027

Evaluating Performance of Parallel Matrix Multiplication Routine on Intel KNL and Xeon Scalable Processors

Thi My Tuyen Nguyen, School of Computer Science and Engineering, Soongsil University, Seoul, South Korea, nguyentuyen0406@soongsil.ac.kr
Yoosang Park, School of Computer Science and Engineering, Soongsil University, Seoul, South Korea, yspark@soongsil.ac.kr
Jaeyoung Choi, School of Computer Science and Engineering, Soongsil University, Seoul, South Korea, choi@ssu.ac.kr
Raehyun Kim, Department of Mathematics, University of California, Berkeley, Berkeley, US, rhkim79@berkeley.edu

Abstract— In high-performance computing, the xGEMM routine is the core Level 3 BLAS operation that performs matrix-matrix multiplication. The performance of Parallel xGEMM (PxGEMM) is significantly affected by two major factors: first, the flop rate that can be achieved by computing the matrix-matrix multiplication on each node, and second, the communication cost of broadcasting sub-matrices to the other nodes. In this paper, an approach to improve and adjust the PDGEMM routine for modern Intel processors, Knights Landing (KNL) and Xeon Scalable Processors (SKL), is proposed. The approach consists of two methods that address the factors mentioned above. First, an improvement of the computation part of PDGEMM is proposed, based on a blocked matrix-matrix multiplication algorithm whose block sizes are chosen to fit the KNL and SKL architectures. Second, a communication routine based on MPI is proposed to overcome the default settings of BLACS, which handles the communication, and thereby improve its time-wise cost efficiency. The proposed PDGEMM achieves performance similar to that of the PDGEMM from ScaLAPACK and Intel MKL for smaller matrices on 16 Intel KNL nodes. Furthermore, the proposed PDGEMM achieves better performance for smaller matrices than the PDGEMM from ScaLAPACK and Intel MKL on 16 Xeon Scalable processor nodes.

Keywords— Parallel matrix-matrix multiplication, Parallel BLAS, ScaLAPACK, Intel Knights Landing, Intel Xeon Scalable, AVX-512

I. INTRODUCTION

Over the last decade, the description of what is called high-performance computing has altered dramatically. In 1988, an article published in the Wall Street Journal titled "Attack of the Killer Micros" described how computing systems made up of many small, inexpensive processors would soon make large supercomputers obsolete. This vision has come true in some ways, but not in the way the original proponents of the "killer micro" theory envisioned. High-performance computing now runs on an extensive range of systems, from desktop computers to large parallel processing systems. The study of high-performance computing provides a chance to revisit computer architectures and enhance them to improve their performance.

High-performance computing systems are attracting increasing interest because they provide new levels of computational capability for a growing range of applications. For the past few years, as software developers, we have had the opportunity to observe the development of computational software, that is, the libraries built for high-performance computing (HPC) [1]. These libraries are used as design tools and building blocks by engineers, developers, and researchers in many different areas. In recent years, numerical libraries for parallel computation have become available and have become a standard in the computing community. They include the Basic Linear Algebra Subprograms (BLAS) [2], Parallel BLAS (PBLAS) [3], the Parallel Engineering and Scientific Subroutine Library (PESSL) [4], the Scalable Linear Algebra Package (ScaLAPACK) [5], and the Intel Math Kernel Library (MKL) [6]. PBLAS is a part of the ScaLAPACK library. Parallel General Matrix-Matrix Multiplication (PxGEMM) is an important routine for high-performance computing; it is included in ScaLAPACK/PBLAS, PESSL/PBLAS, and Intel MKL, but its performance varies between these libraries. The PxGEMM routines are available in several precision types: single (S), double (D), complex (C), and double complex (Z). In this paper, the parallel double-precision general matrix-matrix multiplication (PDGEMM) is considered. The source code of the PDGEMM routine from MKL is not provided; it is shipped only as binaries that are compatible with every Intel processor. On the other hand, the performance of PDGEMM from PESSL is not as good as that of PDGEMM from ScaLAPACK, and PESSL supports parallel processing applications only on IBM UNIX SP systems and clusters of IBM UNIX servers. Therefore, this paper presents a method for optimizing the performance of PDGEMM from ScaLAPACK specifically for the Intel Knights Landing (KNL) and Xeon Scalable (SKL) processors.

The main contribution of this paper is to improve the performance of PDGEMM by addressing the following constraints of the existing PDGEMM:
(i) Improving the performance of the local matrix-matrix multiplication at each node by using a blocked matrix-matrix multiplication algorithm and setting the block size of the computation differently from the default one.
(ii) Reducing communication costs by replacing the BLACS library with the MPI library.

II. RELATED WORK

An important part of parallel computing is the mechanism for transmitting data between nodes. Parallel Virtual Machine (PVM) [7] and the Message Passing Interface (MPI) [8] are commonly used libraries for exploiting computational power in a parallel environment. PVM provides a portable heterogeneous environment that uses a cluster of machines, communicating through sockets over TCP/IP, as a single parallel computer. MPI is a message-passing interface for distributed-memory systems that enables program portability among different parallel machines. However, the performance gains and losses obtained through parallel processing with MPI and PVM differ: in an evaluation on a cluster-based parallel computing architecture, MPI achieved faster performance than PVM [9]. MPI is therefore suitable for providing the message-passing layer of the Basic Linear Algebra Communication Subprograms (BLACS) [10], a message-passing library designed for linear algebra [11]. BLACS, in turn, forms the basic communication layer of ScaLAPACK and PESSL.

ScaLAPACK is a set of libraries for high-performance linear algebra routines on distributed-memory computer systems. PESSL is a scalable mathematical subroutine library. Both PESSL and ScaLAPACK include PBLAS. PBLAS performs the global computation of a parallel program, using BLAS for the computation within a process and BLACS for the communication between processes. PBLAS provides the same three levels of routines as BLAS. The Level 3 PBLAS routine PDGEMM in ScaLAPACK performs better than that of PESSL [12]. On the other hand, Intel MKL is a library optimized for Intel processors, so the performance of PDGEMM from Intel MKL is better than that of ScaLAPACK. However, Intel MKL is not revealed at the source-code level; therefore, this paper proposes a method to optimize the performance of PDGEMM from ScaLAPACK and compares it with MKL on KNL and SKL.

The performance of PDGEMM is significantly affected by two major factors: first, the flop rate that can be achieved by computing the matrix-matrix multiplication on each node, and second, the communication cost of broadcasting sub-matrices to the other nodes. The first factor depends on the performance of the sequential DGEMM routine in each node. Several BLAS libraries provide DGEMM, such as OpenBLAS [13], ATLAS [14], BLIS [15], and Intel MKL BLAS. However, our previous works [16, 17] proposed a DGEMM that achieves better performance than the DGEMM of OpenBLAS, ATLAS, and BLIS on both KNL and SKL. Additionally, the proposed DGEMM reaches 90% of the performance of the DGEMM from Intel MKL BLAS on KNL, and on SKL it outperforms Intel MKL when the matrix size is less than 6,000. Besides that, the performance of the matrix multiplication computed at each node also depends on the block size of the matrix; details are presented in the next section.

Secondly, the method of communication between nodes also affects the performance of PDGEMM. As mentioned above, BLACS is used for communicating data in ScaLAPACK. However, PDGEMM is written in C but with Fortran 77 interfaces, and BLACS requires the data to be packaged before MPI is used for the actual communication. Therefore, PDGEMM's use of BLACS for communication is less effective than using MPI directly, and having PDGEMM use MPI helps to reduce the idle time of the interconnected processors during the communication phase.

III. OPTIMIZATION STRATEGIES

A. High-performance DGEMM routine

High-performance DGEMM routines are essential to improve the performance of the PDGEMM routine. We have implemented a DGEMM routine based on the blocked matrix-matrix multiplication algorithm proposed by Goto and van de Geijn [18] and Gunnels et al. [19]. By blocking the matrices, it is possible to maximize data reusability while minimizing the communication cost. To optimize the proposed DGEMM routines, we focus on two main factors: the kernel implementation and the parameter optimization.

To implement the DGEMM routine, we need to build a packing kernel that packs the sub-matrices into a contiguous array and an inner kernel that performs the matrix-matrix computation between the sub-blocks of the source matrices. Since both kernels have a significant impact on the overall performance, we exploit the Intel AVX-512 instructions in both kernels.

The performance of our DGEMM routines depends highly on the register block sizes, the cache block sizes, the prefetching distances, the loop unrolling depth, and the parallelization scheme. As we observed in our previous study, the optimal blocking parameters and memory usage can differ even between processors with the same memory structure, so an auto-tuning approach can be used to search for the optimal parameters. Hence, we use the auto-tuner of our previous work [20] to generate optimal DGEMM routines for the given systems.
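As an illustration, the following minimal sketch shows the loop structure of such a blocked DGEMM for column-major storage (computing C += A*B). The block sizes MC/KC/NC and the scalar inner kernel are placeholders standing in for the auto-tuned parameters and the AVX-512 micro-kernel described above; this is a sketch of the technique, not the tuned routine itself.

#include <stdlib.h>
#include <string.h>

#define MC 336    /* placeholder row-block size of A          */
#define KC 336    /* placeholder depth of the packed panels   */
#define NC 4096   /* placeholder column-block size of B       */

/* Copy an mc x kc block of column-major A into a contiguous buffer. */
static void pack_A(int mc, int kc, const double *A, int lda, double *Ap) {
    for (int j = 0; j < kc; ++j)
        memcpy(&Ap[(size_t)j * mc], &A[(size_t)j * lda], mc * sizeof(double));
}

/* Copy a kc x nc block of column-major B into a contiguous buffer. */
static void pack_B(int kc, int nc, const double *B, int ldb, double *Bp) {
    for (int j = 0; j < nc; ++j)
        memcpy(&Bp[(size_t)j * kc], &B[(size_t)j * ldb], kc * sizeof(double));
}

/* Inner kernel on packed panels; the tuned routine replaces this triple loop
 * with an AVX-512 micro-kernel working on MR x NR register blocks.          */
static void inner_kernel(int mc, int nc, int kc, const double *Ap,
                         const double *Bp, double *C, int ldc) {
    for (int j = 0; j < nc; ++j)
        for (int p = 0; p < kc; ++p)
            for (int i = 0; i < mc; ++i)
                C[i + (size_t)j * ldc] += Ap[i + (size_t)p * mc]
                                        * Bp[p + (size_t)j * kc];
}

/* C(m x n) += A(m x k) * B(k x n), all column-major. */
void dgemm_blocked(int m, int n, int k, const double *A, int lda,
                   const double *B, int ldb, double *C, int ldc) {
    double *Ap = malloc((size_t)MC * KC * sizeof(double));
    double *Bp = malloc((size_t)KC * NC * sizeof(double));
    for (int jc = 0; jc < n; jc += NC) {              /* loop over column blocks of B */
        int nc = (n - jc < NC) ? n - jc : NC;
        for (int pc = 0; pc < k; pc += KC) {          /* loop over the shared dimension */
            int kc = (k - pc < KC) ? k - pc : KC;
            pack_B(kc, nc, &B[pc + (size_t)jc * ldb], ldb, Bp);
            for (int ic = 0; ic < m; ic += MC) {      /* loop over row blocks of A */
                int mc = (m - ic < MC) ? m - ic : MC;
                pack_A(mc, kc, &A[ic + (size_t)pc * lda], lda, Ap);
                inner_kernel(mc, nc, kc, Ap, Bp, &C[ic + (size_t)jc * ldc], ldc);
            }
        }
    }
    free(Ap);
    free(Bp);
}

Packing each block of B once and reusing it across all row blocks of A is what provides the data reuse mentioned above; the auto-tuner's job is to pick MC, KC, NC (and the micro-kernel shape) for the caches of the target processor.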
In addition, there is a critical issue related to nested-loop parallelization. In another previous study [21], we observed relatively low performance caused by the parallelization scheme and the thread affinity. This performance loss generally happens because of inefficient cache use: if we do not consider the cache structure of the target machine, especially the shared caches, it is difficult to achieve optimal cache usage. Therefore, if the given system has a cache shared by multiple threads, the total number of threads is split and bound to the corresponding loops of the nested loop, and each loop's thread count must divide the number of threads allocated. Moreover, we need to ensure that the logical thread teams are properly placed on the physical cores.
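This thread-binding requirement can be illustrated with a small nested-OpenMP toy program (the split of 64 threads into an outer team of 4 and inner teams of 16 is only an assumed example; the actual factors would come from the auto-tuner and the loop trip counts):

#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_max_active_levels(2);             /* enable nested parallel regions */
    #pragma omp parallel num_threads(4) proc_bind(spread)
    {
        int outer = omp_get_thread_num();     /* each outer thread owns a block of one loop */
        #pragma omp parallel num_threads(16) proc_bind(close)
        {
            int inner = omp_get_thread_num(); /* inner team shares a cache and splits another loop */
            printf("outer %d, inner %d\n", outer, inner);
        }
    }
    return 0;
}

Running it with OMP_PLACES=cores keeps each logical team on neighboring physical cores, which is the placement requirement described above.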

Unlike the assumption in our previous studies [16, 17], the source matrices of the PDGEMM routine are stored in column-major order. We therefore search for the optimal parameters and generate our proposed DGEMM routine for the column-major case. By iterating over the matrices in the same way, we achieve performance similar to that of the row-major version on the KNL and SKL environments.

B. Inter-node communication

1. Blocked matrix computation and distribution

The PDGEMM routine from MKL performs better than the PDGEMM from ScaLAPACK; however, MKL must provide compatibility for every Intel processor. The blocking factor KB can therefore be set differently depending on the computer architecture, such as KNL or SKL. Figure 1 shows the performance of PDGEMM with different matrix sizes on the corresponding KNL and SKL architectures. The tested matrix size is set to 60,000 for 4 nodes and to 144,000 for 16 nodes. Better PDGEMM performance is achieved when the KB value is set to 336 for KNL, as shown in Figure 1.(a) and (b). Moreover, Figure 1.(c) and (d) show that the best result of the proposed PDGEMM is achieved when KB is set to 384 for SKL. Therefore, good PDGEMM performance can be achieved when architecture-specific KB values are applied. Such blocking factors are commonly used not only in PDGEMM but also in LU factorization [22].

2. Communication cost

Another impact was found during this process: the communication method may also affect performance. For example, the matrix multiplication algorithm performs a large number of broadcast operations on blocked matrices (or sub-matrices). As long as MPI routines are used to communicate between nodes, the data needs to be packed into buffers before being sent. PDGEMM originally calls the collective BLACS C routines, such as Cgebr2d or Cgebs2d, to broadcast sub-matrices across the nodes. These routines pack the data inside BLACS and then call MPI routines to communicate between the processors. Therefore, we focused on optimizing both the packing and the broadcasting operations.
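For reference, a broadcast of one sub-block along a process row through the BLACS C interface looks roughly like the following sketch (the double-precision members of the family, Cdgebs2d/Cdgebr2d, are assumed here; the block shape and grid coordinates are illustrative only):

extern void Cdgebs2d(int ictxt, char *scope, char *top,
                     int m, int n, double *A, int lda);
extern void Cdgebr2d(int ictxt, char *scope, char *top,
                     int m, int n, double *A, int lda, int rsrc, int csrc);

/* Broadcast an mb x kb block of A within the caller's process row. */
void bcast_block_row(int ictxt, int myrow, int mycol, int root_col,
                     double *blk, int mb, int kb, int lld) {
    if (mycol == root_col)
        Cdgebs2d(ictxt, "Row", " ", mb, kb, blk, lld);      /* owner sends    */
    else
        Cdgebr2d(ictxt, "Row", " ", mb, kb, blk, lld,
                 myrow, root_col);                           /* others receive */
}

Internally these calls pack the block and then invoke MPI, which is the overhead targeted below.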

First, we implemented the packing kernel using the Intel Cilk Plus library, and it shows better performance than the BLACS routines. However, since the packing takes up only about 4% of the whole communication time, the main progress was made by improving the broadcasting routine. Rather than using the MPI broadcast routine directly, we developed our own communication routine that broadcasts the data along a ring topology in both the column and row directions. By communicating in an asynchronous fashion, we can maximize the effect of overlapping communication with computation. Furthermore, we can avoid a large overhead that is observed in the BLACS routines of both Intel MKL and netlib [23].
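A minimal sketch of such a non-blocking ring broadcast along one dimension of the process grid is shown below. It assumes comm is a per-row (or per-column) communicator created elsewhere; the function name and tag are our own, and the actual routine additionally interleaves the waits with the local DGEMM calls.

#include <mpi.h>

/* Start a ring broadcast of `count` doubles from `root` over `comm`.
 * The returned request is completed by the caller after overlapping work. */
void ring_bcast_start(double *buf, int count, int root,
                      MPI_Comm comm, MPI_Request *req) {
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);
    int next = (rank + 1) % size;
    int prev = (rank - 1 + size) % size;

    if (rank == root) {
        /* the root only forwards its block to its neighbor in the ring */
        MPI_Isend(buf, count, MPI_DOUBLE, next, 0, comm, req);
    } else {
        /* receive from the previous process in the ring ... */
        MPI_Recv(buf, count, MPI_DOUBLE, prev, 0, comm, MPI_STATUS_IGNORE);
        /* ... and forward it, unless the next process is the root */
        if (next != root)
            MPI_Isend(buf, count, MPI_DOUBLE, next, 0, comm, req);
        else
            *req = MPI_REQUEST_NULL;
    }
}

The caller runs the local block computation and only then calls MPI_Wait(req, MPI_STATUS_IGNORE) before reusing the buffer, which is what overlaps the communication with the computation.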
Figure 2 shows the impact of the communication method for different matrix sizes, with block size KB=336 on 4 KNL nodes and KB=384 on 4 SKL nodes. In Figure 2, lower is better, since it shows the elapsed time.

IV. EXPERIMENTS

To perform the experiments of this study, two scalable environments are used to compare the performance of PDGEMM obtained from combinations of MKL and ScaLAPACK. The experiments are performed on the facilities provided by the Korea Institute of Science and Technology Information (KISTI) [24], which provides both a KNL and an SKL environment. For KNL, the hardware is configured with cache mode as the memory mode and quadrant mode as the cluster mode. The advantage of testing on these environments is that both target machines provide the Intel AVX-512 instructions. Table 1 shows a summary of the target machines' architectures.

Table 1. Architecture Summary of the target Machines

The remaining settings of the proposed PDGEMM are ScaLAPACK version 2.0.2, BLAS from Intel MKL, and Intel MPI 19.0.5 from Intel Parallel Studio 2019 Update 5 on both systems. Three test applications are compared: first, the proposed PDGEMM adapted from ScaLAPACK, which uses our blocked matrix-matrix multiplication at each node; second, the PDGEMM adapted from ScaLAPACK that uses DGEMM from Intel MKL BLAS for the matrix multiplication at each node; and third, a simple application that calls the PDGEMM routine of MKL.

To execute PDGEMM, each global matrix that is to be distributed across the process grid must be assigned an array descriptor. The descriptor is initialized by calling descinit_():

descinit_(DESC, M, K, MB, KB, RSRC, CSRC, ICTXT, MXLLDA, INFO)

where DESC is the array descriptor of the distributed matrix to be set; M is the number of rows of the distributed matrix; K is the number of columns of the distributed matrix; MB and KB are the blocking factors used for distributing the rows and columns of the matrix; RSRC and CSRC indicate the process row and column over which the first row and column of the matrix are distributed, respectively; ICTXT is the BLACS context handle, indicating the global context of the operation on the matrix; MXLLDA is the leading dimension of the local array storing the local blocks of the distributed matrix; and INFO is the output status.
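For illustration, the following sketch shows how the descriptors could be set up and PDGEMM invoked from C through the Fortran interfaces, matching the argument list above. The grid creation, local allocations, and error handling are omitted, numroc_ supplies the local leading dimensions, and the wrapper and variable names are our own:

extern void descinit_(int *desc, const int *m, const int *n, const int *mb,
                      const int *nb, const int *irsrc, const int *icsrc,
                      const int *ictxt, const int *lld, int *info);
extern int  numroc_(const int *n, const int *nb, const int *iproc,
                    const int *isrcproc, const int *nprocs);
extern void pdgemm_(const char *transa, const char *transb, const int *m,
                    const int *n, const int *k, const double *alpha,
                    const double *a, const int *ia, const int *ja,
                    const int *desca, const double *b, const int *ib,
                    const int *jb, const int *descb, const double *beta,
                    double *c, const int *ic, const int *jc, const int *descc);

void run_pdgemm(int ictxt, int myrow, int nprow, int m, int n, int k,
                int mb, int kb, double *A, double *B, double *C) {
    int izero = 0, ione = 1, info;
    int desca[9], descb[9], descc[9];
    double one = 1.0, zero = 0.0;

    /* local leading dimensions (should be clamped to at least 1 in real code) */
    int lldA = numroc_(&m, &mb, &myrow, &izero, &nprow);
    int lldB = numroc_(&k, &kb, &myrow, &izero, &nprow);

    /* DESC, M, K, MB, KB, RSRC, CSRC, ICTXT, MXLLDA, INFO */
    descinit_(desca, &m, &k, &mb, &kb, &izero, &izero, &ictxt, &lldA, &info);
    descinit_(descb, &k, &n, &kb, &kb, &izero, &izero, &ictxt, &lldB, &info);
    descinit_(descc, &m, &n, &mb, &kb, &izero, &izero, &ictxt, &lldA, &info);

    /* C = A * B on the process grid */
    pdgemm_("N", "N", &m, &n, &k, &one, A, &ione, &ione, desca,
            B, &ione, &ione, descb, &zero, C, &ione, &ione, descc);
}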

In the PDGEMM from MKL, the value used to measure performance is MB=KB=1,000. Table 2 summarizes the three PDGEMM environments that are compared:
• The proposed PDGEMM, built with ScaLAPACK, which uses the blocked matrix-matrix multiplication at each node, communicates data via our MPI routine, uses the high-bandwidth memory on KNL, and sets the blocking factor KB to 336 for KNL and 384 for SKL. (①)
• A PDGEMM built on ScaLAPACK with a configuration similar to the proposed PDGEMM, but using DGEMM from Intel MKL BLAS for the matrix computation at each node. (②)
• The PDGEMM routine built into MKL. (③)

1. Experiment results for System 1 (KNL)

The Intel Xeon Phi processor supports high-bandwidth memory (HBM). HBM is up to five times faster (≥ 400 GB/s) than DDR4 memory (≥ 90 GB/s). To evaluate the performance of the proposed PDGEMM on System 1, the buffer memory of the blocked matrices A and B is placed in HBM through the memkind library whenever the matrix blocks are small enough to be stored there.
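A minimal sketch of such an allocation through memkind's hbwmalloc interface is given below (the helper, its fallback policy, and the in_hbm flag are illustrative, not the code used in the paper):

#include <stdlib.h>
#include <hbwmalloc.h>

/* Allocate a packing buffer in MCDRAM when possible, otherwise in DDR4. */
static double *alloc_block(size_t nelems, int *in_hbm) {
    if (hbw_check_available() == 0) {                 /* 0 means HBM is usable */
        double *p = hbw_malloc(nelems * sizeof(double));
        if (p) { *in_hbm = 1; return p; }             /* placed in MCDRAM */
    }
    *in_hbm = 0;
    return malloc(nelems * sizeof(double));           /* fallback to DDR4 */
}

static void free_block(double *p, int in_hbm) {
    if (in_hbm) hbw_free(p);                          /* must match the allocator used */
    else        free(p);
}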
We demonstrate the performance on the KNL environment in Figure 3. Figure 3.(a) and (b) show the results on 4 nodes (P×Q=2×2) and 16 nodes (P×Q=4×4), respectively. The parameters for these experiments are set to KB=336 for environments ① and ② and KB=1,000 for the PDGEMM routine from MKL, with M=N=K. The PDGEMM of the proposed environment ① shows higher performance than ③, i.e., 25% higher on 4 nodes for M=N=K=12,000 and 63% higher on 16 nodes for M=N=K=20,000. On the other hand, if we compare the results of the proposed environment ① with environment ②, the performance is similar when the matrix size is smaller than 20,000 on 16 nodes, and the proposed environment ① shows 70% better performance than environment ② on 4 nodes.

2. Experiment results for System 2 (SKL)

For SKL, the results are shown in Figure 4, which presents the results on 4 nodes (P×Q=2×2) and 16 nodes (P×Q=4×4), respectively. The parameter KB is set to 384 for environments ① and ② and to 1,000 for environment ③, with M=N=K. The results in Figure 4.(a) show that environment ① achieves 61% better performance than environment ③ on 4 SKL nodes at a matrix size of M=N=K=24,000. Furthermore, Figure 4.(b) shows that environment ① achieves 75% better performance than environment ③ on 16 nodes at a matrix size of M=N=K=20,000. On the other hand, environment ① achieves better performance than environment ② for small matrices, up to around 8,000, on 4 SKL nodes. Apart from the 4-node case, environment ① reaches 98% of the performance of environment ② on 16 SKL nodes at a matrix size of 20,000.

V. CONCLUSIONS

In this paper, we presented a method to improve the performance of the PDGEMM routine. The proposed approach contains two improvement factors, computation and inter-node communication, for the modern scalable KNL and SKL processors. The proposed PDGEMM achieves better performance than the PDGEMM from ScaLAPACK on both KNL and SKL for relatively small matrices. For future work, the suggested approach will be adapted to further dense matrix routines such as LU, QR, and Cholesky factorization.

ACKNOWLEDGEMENT

This work was supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Korea government (MSIT) (NRF-2015M3C4A7065662).

REFERENCES

[1] "High-Performance Computing (HPC)." Available: https://www.nics.tennessee.edu/computing-resources/what-is-hpc/.
[2] BLAS. Available: http://www.netlib.org/blas/.
[3] PBLAS. Available: http://www.netlib.org/pblas/.
[4] S. Filippone, "Parallel libraries on distributed memory architectures: The IBM Parallel ESSL," in Applied Parallel Computing: Industrial Computation and Optimization, J. Waśniewski, J. Dongarra, K. Madsen, and D. Olesen, Eds. Springer, 1996.
[5] ScaLAPACK. Available: http://www.netlib.org/scalapack/.
[6] Intel MKL. Available: https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html/.
[7] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam, PVM: Parallel Virtual Machine: A Users' Guide and Tutorial for Networked Parallel Computing. Cambridge, MA: MIT Press, 1994.
[8] R. Hempel, "The MPI standard for message passing," in High-Performance Computing and Networking, W. Gentzsch and U. Harms, Eds. Springer, 1994, pp. 247-252.
[9] S. Sampath, B. B. Sagar, and B. R. Nanjesh, "Performance evaluation and comparison of MPI and PVM using a cluster based parallel computing architecture," 2013 International Conference on Circuits, Power and Computing Technologies (ICCPCT), Nagercoil, 2013, pp. 1253-1258.
[10] BLACS. Available: http://www.netlib.org/blacs/.
[11] D. Walker, W. Sawyer, and V. Deshpande, "An MPI implementation of the BLACS," in Proc. MPI Developers Conference, Notre Dame - South Bend, vol. 1, 1996, pp. 0195.
[12] K. Petter and S. Patrik Skogqvist, "BLACS, PBLAS, PESSL and ScaLAPACK Libraries on the IBM SP," 1997.
[13] X. Zhang, Q. Wang, and S. Werber, OpenBLAS. Available: http://www.openblas.net/.
[14] R. C. Whaley, A. Petitet, and J. Dongarra, "Automated empirical optimizations of software and the ATLAS project," Parallel Computing, vol. 27, 2001, pp. 3-35, doi: 10.1016/S0167-8191(00)00087-9.
[15] F. G. Van Zee and R. A. Van De Geijn, "BLIS: a framework for rapidly instantiating BLAS functionality," ACM Transactions on Mathematical Software, vol. 41, 2015, p. 14, doi: 10.1145/2764454.
[16] R. Lim, Y. Lee, R. Kim, and J. Choi, "An implementation of matrix-matrix multiplication on the Intel KNL processor with AVX-512," Cluster Computing, vol. 21, 2018, pp. 1785-1795, doi: 10.1007/s10586-018-2810-y.
[17] R. Lim, Y. Lee, R. Kim, J. Choi, and M. Lee, "Auto-tuning GEMM kernels on the Intel KNL and Intel Skylake-SP processors," Journal of Supercomputing, vol. 75, 2019, pp. 7895-7908, doi: 10.1007/s11227-018-2702-1.
[18] K. Goto and R. A. van de Geijn, "Anatomy of high-performance matrix multiplication," ACM Transactions on Mathematical Software (TOMS), vol. 34, 2008, pp. 1-25, doi: 10.1145/1356052.1356053.
[19] J. A. Gunnels, G. M. Henry, and R. A. Van De Geijn, "A family of high-performance matrix multiplication algorithms," in Proc. International Conference on Computational Science, Springer, 2001, doi: 10.1007/3-540-45545-0_15.
[20] R. Kim, J. Choi, and M. Lee, "Optimizing parallel GEMM routines using auto-tuning with Intel AVX-512," in Proc. International Conference on High Performance Computing in Asia-Pacific Region, 2019, doi: 10.1145/3293320.3293334.
[21] R. Lim, Y. Lee, R. Kim, and J. Choi, "OpenMP-based parallel implementation of matrix-matrix multiplication on the Intel Knights Landing," in Proc. Workshops of HPC Asia, 2018, pp. 63-66, doi: 10.1145/3176364.3176374.
[22] Recommended value of block size for Intel processors. Available: https://software.intel.com/content/www/us/en/develop/documentation/mkl-linux-developer-guide/top/intel-math-kernel-library-benchmarks/intel-distribution-for-linpack-benchmark/configuring-parameters.html/.
[23] Netlib. Available: https://www.netlib.org/.
[24] KISTI. Available: https://www.kisti.re.kr/.
