ZAKI A Machine Learning Based Process Mapping Tool For SPMV Computations On Distributed Memory Architectures

SPECIAL SECTION ON DISTRIBUTED COMPUTING INFRASTRUCTURE FOR CYBER-PHYSICAL SYSTEMS
Received May 25, 2019, accepted June 9, 2019, date of publication June 17, 2019, date of current version July 3, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2923565
ZAKI+: A Machine Learning Based Process Mapping Tool for SpMV Computations on Distributed Memory Architectures
SARDAR USMAN 1, RASHID MEHMOOD 2, IYAD KATIB 1, AND AIIAD ALBESHRI 1
1 Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
2 High Performance Computing Center, King Abdulaziz University, Jeddah 21589, Saudi Arabia
Corresponding author: Rashid Mehmood (rmehmood@kau.edu.sa)

This project was funded by the Deanship of Scientific Research (DSR) at King Abdulaziz University, Jeddah, under grant number
RG-11-611-40. The authors, therefore, acknowledge with thanks DSR for technical and financial support.
ABSTRACT Smart cities and other cyber-physical systems (CPSs) rely on various scientific, engineering,
business, and social applications that provide timely intelligence for their design, operations, and
management. Many of these scientific and analytics applications require the solution of sparse linear
equation systems, where sparse matrix-vector (SpMV) product is a key computing operation. Several factors
determine the performance of parallel SpMV computations, including matrix characteristics, storage formats,
and the rising complexity and heterogeneity of computer systems. There is a pressing need for new ways
of exploiting parallelism, and mapping data and applications to the computing resources. We propose here
ZAKI+ , a data-driven machine-learning approach, allowing users to automatically, effortlessly, and speedily
obtain the best configuration (the data distribution, the optimal number of processes, and mapping strategy)
and performance for the execution of the parallel SpMV computations on distributed memory machines.
We train and test the tool using three machine learning methods—decision trees, random forest, and Xtreme
boosting—and nearly 2000 real-world matrices obtained from 45 application domains, including computer
vision and robotics. ZAKI+ provides optimal process mapping and outperforms the MPI default mapping
policy by a factor of 4.24. This is the first work where the sparsity structure of matrices has been exploited
to predict the optimal mapping of processes and data in distributed-memory environments by using different
base and ensemble machine learning methods. Various CPSs comprise compute-intensive machine learning
applications, such as the SpMV, and hence, the process and data mapping contributions of this paper would
be of paramount impact for the CPSs.
INDEX TERMS Cyber-physical systems, SpMV, sparse linear algebra, sparse matrices, machine learning,
MPI, process affinity, compressed sparse row (CSR), decision trees, random forest, Xtreme boosting, parallel
computing, high performance computing (HPC), smart cities analytics, exascale systems, OpenMPI.
I. INTRODUCTION
Cyber-Physical Systems (CPS) comprise ‘‘interacting digital, analog, physical, and human components engineered for function through integrated physics and logic. These systems will provide the foundation of our critical infrastructure, form the basis of emerging and future smart services, and improve our quality of life in many areas’’ [1], including transportation, healthcare, smart grid, disaster management and many other areas. A CPS uses the Internet of Things (IoT), big data, cloud, fog, and edge computing, artificial intelligence, high performance computing, and other cutting-edge technologies to provide the foundations for smart cities and societies [2]–[8]. Smart cities appear as ‘‘the next stage of urbanization, subsequent to the knowledge-based economy, digital economy, and intelligent economy’’, aiming to ‘‘not only exploit physical and digital infrastructure for urban development but also the intellectual and social capital as its core ingredient for urbanization’’. A smart society is an extension of the smart cities concept, ‘‘a digitally-enabled, knowledge-based society, aware of and working towards social, environmental and economic sustainability’’ [4].

The associate editor coordinating the review of this manuscript and approving it for publication was Wei Yu.
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
VOLUME 7, 2019 81279
S. Usman et al.: ZAKI+: Machine Learning-Based Process Mapping Tool for SpMV Computations
Smart cities and other cyber-physical systems rely on various scientific, engineering, business, and social applications that provide timely intelligence for their design, operations and management. Many of these scientific and analytics applications require the solution of sparse linear equation systems. Examples of such applications that have been specifically applied to smart city settings include computational fluid dynamics (CFD) [9], [10], computer vision or computer graphics [11]–[13], robotics problems [14], 2D/3D problems [15], thermal problems [16], [17], acoustics problems [18], [19], operational research [20]–[22], healthcare [23], and networking [24], [25]. Other examples of smart city applications include life sciences [26], smart farming [27], transportation [28]–[33], autonomous vehicles [34], graph computations [35], and social media analytics [36], [37].
Sparse matrix-vector product (SpMV) is the most important and time-consuming kernel for the iterative solution of sparse linear equation systems. The SpMV operation has been categorized as one of the seven dwarfs, i.e., seven numerical methods of significant importance [38]. It is a memory-bound operation compared to other, compute-intensive algebra kernels such as dense matrix-vector multiplication. High performance computing (HPC) typically exploits parallel computing features of the underlying software and hardware infrastructure to solve large problems faster. HPC has been applied to SpMV/linear algebra [39]–[42] and other problems for several decades. Big data and data-driven approaches [35], [36], [43], [44] have been used relatively recently in scientific computing to address HPC-related challenges, and this has given rise to the convergence of HPC and big data [45], [46]. Moreover, artificial intelligence (AI) is increasingly being used to improve big data, HPC, scientific computing, and other problem domains. This trend has given rise to the convergence of big data, HPC and AI. This paper attempts to contribute to this convergence and applies it (the convergence of the three areas) to the area of SpMV computations.
Several factors affect the performance of SpMV computations [47]. These include matrix characteristics, storage formats, software implementations, and hardware platforms. The matrix characteristics include eigenvalues, definiteness, the number of non-zero values in the matrix, and the sparsity pattern. The storage formats include Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Coordinate format, Diagonal, Hybrid, Blocked CSR, Extended BCSR, Compact Modified Sparse Row (CMSR), and many others [42], [48]–[50]. The choice of matrix storage format depends on the characteristics of the matrix itself. The software implementations include, among others, Intel MKL [51], Trilinos Project [52], CUSPARSE [53], and CUSP [54]. The characteristics of the hardware platforms that could affect the SpMV performance include the DRAM bandwidth, cache hierarchy, the available parallelism in the hardware, and others. A range of hardware architectures is being used for SpMV implementations, including CPUs [55]–[57], MIC [40], GPUs [39], [58], and other architectures [59], [60].
Performance optimization of an application on distributed-memory multicore architectures is challenging due to the heterogeneity and diversity of architectures. Modern machines have a range of shared and distributed memory, and hybrid architectures, with several hierarchies involving non-uniform communication latencies [61], [62]. The sparsity pattern of the matrix affects the performance of SpMV computations, particularly in the case of distributed-memory implementations (due to higher variations in communication delays), resulting in load imbalance that causes both computation and communication overheads [39]. The goal of mapping sparse matrices and vectors to processors in a distributed-memory parallel environment is to minimize the overall communication (number of sent messages, communication volume per processor, synchronization costs) and to provide computational load balance. The sparsity pattern of a matrix is unknown before runtime. The manual process of trial-and-error experimentation to find the best mapping of processes and data on a given architecture and resources for computing SpMV is time-consuming and frustrating. More importantly, it requires a complete search of the node and processor core space to find the optimal configuration for the computation. The structure of the matrix is not regular, and therefore the whole manual process needs to be repeated numerous times for each matrix.
The challenges related to the mapping of data and processes onto distributed memory architectures are not specific to SpMV computations alone. Various cyber-physical systems will comprise compute-intensive machine learning applications, such as SpMV, and these will need to be optimally mapped onto the underlying cyber-physical and exascale computing infrastructure. Cyber-physical systems in the future will comprise an ecosystem of digital infrastructures that are able to work together and enable dynamic real-time interactions between various CPS subsystems. Technologies such as big data, pervasive, cloud and fog computing, as well as the increasingly complex demands of smart cities and societies, are likely to transform the future of computing infrastructures. The trend would be the integration of computing at exascale (and beyond) with big data technologies and the provision of on-demand service-oriented high performance computing together with the required data, AI and other applications. The mapping of data and processes (related to smart applications) onto the underlying cyber-physical converged infrastructure therefore would be of paramount importance.

A. AIM AND CONTRIBUTIONS
In our earlier work, we proposed ZAKI, which predicts the optimal number of processes for SpMV computations of an arbitrary sparse matrix on a distributed memory machine [63]. In this paper, we extend the earlier tool and propose ZAKI+, a data-driven and machine-learning approach to predict the optimal mapping of the processes and data (i.e., matrix partitions) on the underlying distributed
memory machine architecture for SpMV computations of an arbitrary sparse matrix. The aim herein is to allow application scientists to automatically, effortlessly, and speedily obtain the best configuration (including the matrix/data distribution, optimal number of processes, and mapping strategy), and hence the best performance, for the execution of the SpMV computations for a given sparse matrix (see Figure 1).
FIGURE 1. A high-level depiction of the proposed SpMV performance optimization tool ZAKI+.
ZAKI+ involves three phases: data preparation, training, and testing. Data preparation includes sparse matrix feature extraction, SpMV kernel execution with a varying number of processes and five process mapping strategies, choosing the minimum execution time for each mapping strategy, and selecting the optimal mapping of the data and processes for each matrix in our dataset. We have used the SuiteSparse matrix collection [64] as our dataset, comprising 1838 (randomly) selected sparse matrices associated with 45 application domains.
Firstly, the sparse matrices in the dataset are converted to the CSR format. The SpMV computations are performed 2000 times for each of the 1838 matrices for the whole range of processes (cores on multiple nodes), varying between 1 and 384 (see Section 4). The sequential, minimum, and average execution times are recorded for each sparse matrix in the dataset. The average time of the 2000 SpMV executions is used to avoid any anomalies. The labeled data set includes the sparse matrix features along with the optimal mapping strategy that gives the minimum execution time for the matrix. This labeled data set is divided into training and testing datasets, containing a randomly selected 90% and 10% of the matrices from the dataset, respectively. The training dataset is used to train the predictive model using three machine learning algorithms: Decision Trees, Random Forest, and Xtreme Boosting. The trained predictive model is tested on the test data using a generic classification accuracy metric. The proposed model is trained off-line once and requires no further training at the actual matrix execution or prediction time. The execution times for the predicted optimal configuration of SpMV computations are compared with the average execution times of the MPI default mapping policy; ZAKI+ provides a 4.24 times aggregated speedup over the MPI default mapping policy with average parallel execution times.
This paper makes the following contributions. We:
• propose, implement, and evaluate a machine learning tool that allows users to automatically obtain the best configuration (including the optimal mapping of the processes and data), and hence the best performance, for SpMV computations of a given sparse matrix on a distributed-memory machine.
• train and test the tool using nearly 2000 real-world matrices obtained from 45 application domains including CFD, computer vision, and robotics.
• perform in-depth performance modeling and evaluation using different machine learning techniques and visualizations.
• provide a first-ever detailed comparative analysis of multiple MPI process mapping strategies (Node binding, Latency binding, Bandwidth binding, and Cyclic binding) for SpMV computations. The methodology to use multiple mapping strategies for prediction is itself a novel contribution of this paper.
To the best of our knowledge, this is the first work of its kind where the sparsity structure of matrices has been exploited to predict the optimal mapping of the processes and data in distributed memory environments by using different base and ensemble machine learning methods. ZAKI is an Arabic word, which means ‘‘smart’’.
The rest of the paper is organized as follows. Section 2 provides background information related to SpMV and machine learning algorithms. The literature survey is presented in Section 3. Section 4 introduces the methodology of our proposed technique. Section 5 gives detailed experimental results and analysis of the different mapping strategies for SpMV parallelization. The prediction results and analysis of the ZAKI+ tool are given in Section 6. We conclude in Section 7 and give future directions.

II. BACKGROUND
This section gives a brief overview of SpMV, the different process mapping techniques, and the machine learning algorithms used in this paper. Table 1 lists the basic symbols used in this paper.
TABLE 1. Symbol table.

A. SpMV
Compressed Sparse Row (CSR) is generally the most commonly used sparse matrix storage format and can be used to
store any matrix without relying on the structure of the matrix. Figure 2 shows an example of how a sparse matrix is stored in CSR format. A CSR-based sparse matrix is stored using three arrays: the val array of size nnz stores the actual nonzero elements, the col array of size nnz stores the column indices of the nonzero elements, and the rowptr array of length M + 1 stores the pointers to the first nonzero element of each row in the col and val arrays.
FIGURE 2. CSR representation.
SpMV can be formally defined as
$$y = Ax$$
where A is an M × N sparse matrix, x is an N × 1 dense input vector, and y is the resulting M × 1 dense vector. The resulting vector y of size M is obtained by multiplying matrix A of order M × N by vector x of size N. The ith element of y is the inner product of the ith row of matrix A and vector x:
$$y_i = \langle A_i, x \rangle = \sum_{j=0}^{N-1} A_{ij} x_j, \quad 0 \le i \le M - 1$$
Algorithm 1 represents the parallel SpMV operation using the CSR format.

Algorithm 1 SpMV Algorithm
SpMV computation using the CSR scheme
Procedure SpMV(A::in, x::in, y::out)
    A: input matrix in CSR format (arrays val, col, rowptr)
    x: input vector
    y: output vector
    for i = 0 to M − 1 do
        y[i] = 0
        for k = rowptr[i] to rowptr[i + 1] − 1 do
            y[i] = y[i] + val[k] ∗ x[col[k]]
        end for
    end for
end Procedure

B. DATA DISTRIBUTION
Load balancing and minimizing communication cost are considered key optimizations in distributed parallel SpMV computation. We have used a one-dimensional data distribution, which reduces the complexity of the load balancing algorithm. Over the years, numerous 2D partitioning algorithms have been proposed, but these algorithms are complex and are more suitable for shared memory systems [65].
A, x and y (see Table 1) need to be distributed for a parallel implementation of SpMV. Figure 3 shows how these three elements are distributed among processes. Each of the three colors represents one process and the data local to that process. The lightly colored areas of the matrix illustrate the parts of the distributed matrix that have to be multiplied by remote vector entries (vector entries that are not local to a process and are hosted on other processes), whereas the fully colored areas of the matrix are multiplied with the vector entries local to the process itself. This partitioning allows communication to be overlapped with computation: the remote vector entries are communicated while the local computations are performed.
FIGURE 3. SpMV data distribution.

C. PROCESS MAPPING
The way parallel processes are mapped to processors (or cores) can affect application performance significantly due to non-uniform communication costs. Processes sharing lots of data could be placed physically close to each other to reduce the communication cost and, ultimately, the overall execution time of the application. Binding a process or thread to a specific core can improve the performance of the code by increasing the percentage of local memory accesses. Some OpenMP runtime libraries and MPI libraries may also perform certain placements by default. In cases where the placements by the kernel or the MPI or OpenMP libraries are not optimal, one can try several methods to control the placement in order to improve the performance of an application by maximizing data locality.
Manual placement of the individual processes in a parallel job, referred to as ‘‘process placement’’ or ‘‘process affinity’’, is a time-consuming process. The programming model of MPI is flat: each process can communicate directly with other processes. The exchanges can be irregular, which means that a given MPI process will not necessarily communicate with all the other MPI processes and that the amount of data exchanged between consecutive messages may vary. As a consequence, the physical location of the MPI processes influences application communication costs. MPI implementations either provide their own run-time systems for launching and monitoring the individual processes in a parallel application or use a back-end parallel run-time environment that supports this functionality. MPI implementations sometimes rely on back-end parallel runtime environments provided by job schedulers and resource managers to provide process placement functionality for parallel jobs.
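To tie the pieces of this section together, the following is a minimal Python sketch, not the authors' implementation, of the CSR kernel of Algorithm 1 combined with a one-dimensional row-block distribution. It lists, for each process, the remote vector entries (the lightly colored areas of Figure 3) that would have to be communicated, and whose cost depends on where the processes are placed. The function names (`dense_to_csr`, `spmv_csr`, `block_ranges`, `remote_entries`) and the even block split of rows and of x are assumptions made for illustration only; the paper does not specify its partitioning code.

```python
def dense_to_csr(dense):
    """Return the CSR arrays (val, col, rowptr) of a dense row-major matrix."""
    val, col, rowptr = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                val.append(a)    # nonzero value
                col.append(j)    # its column index
        rowptr.append(len(val))  # where the next row starts in val/col
    return val, col, rowptr


def spmv_csr(val, col, rowptr, x):
    """Compute y = A x for A stored in CSR (the kernel of Algorithm 1)."""
    m = len(rowptr) - 1
    y = [0.0] * m
    for i in range(m):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += val[k] * x[col[k]]
    return y


def block_ranges(n, nprocs):
    """Split indices 0..n-1 into nprocs contiguous half-open blocks.

    Assumption: an even split; block sizes differ by at most one.
    """
    base, extra = divmod(n, nprocs)
    ranges, start = [], 0
    for p in range(nprocs):
        size = base + (1 if p < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def remote_entries(col, rowptr, n_cols, nprocs):
    """For each process, list the x indices its rows need but do not own."""
    m = len(rowptr) - 1
    row_blocks = block_ranges(m, nprocs)    # rows of A (and entries of y) per process
    x_blocks = block_ranges(n_cols, nprocs)  # entries of x per process
    needed = []
    for (r0, r1), (x0, x1) in zip(row_blocks, x_blocks):
        remote = set()
        for k in range(rowptr[r0], rowptr[r1]):
            if not (x0 <= col[k] < x1):
                remote.add(col[k])
        needed.append(sorted(remote))
    return needed


if __name__ == "__main__":
    A = [[4, 0, 0, 1],
         [0, 3, 0, 0],
         [0, 2, 5, 0],
         [1, 0, 0, 6]]
    val, col, rowptr = dense_to_csr(A)
    print(spmv_csr(val, col, rowptr, [1, 2, 3, 4]))
    print(remote_entries(col, rowptr, n_cols=4, nprocs=2))
```

For this 4 × 4 example with two processes, the product is [8.0, 6.0, 19.0, 25.0]; process 0 needs the remote entry x[3] and process 1 needs x[0] and x[1]. In a real MPI code these entries would be exchanged while each process multiplies its local block.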
Mapping determines the number of processes to be launched and the hosts on which they run, so as to facilitate efficient interaction among processes and improve the overall performance of the application. We have experimented with the OpenMPI default process affinity as a reference point and used four different affinity options, i.e., by node, latency binding, cyclic binding, and bandwidth binding.
FIGURE 4. By node.
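As a rough illustration of how such options can differ, the sketch below simulates two placement orders for a two-host allocation with 4 and 2 slots, the allocation used as the example in this section. It does not call Open MPI and makes no claim about its internals: `map_by_node` and `map_packed` are hypothetical helper names, and only the host on which each rank lands is modeled, not socket or core binding.

```python
def map_by_node(slots, nranks):
    """Round-robin over hosts: consecutive ranks alternate between hosts
    for as long as each host still has a free slot."""
    free = dict(slots)
    placement = []
    while len(placement) < nranks:
        progressed = False
        for host in slots:
            if free[host] > 0 and len(placement) < nranks:
                placement.append(host)
                free[host] -= 1
                progressed = True
        if not progressed:
            raise ValueError("not enough slots for the requested ranks")
    return placement


def map_packed(slots, nranks):
    """Packed order: fill every slot of one host before moving to the next."""
    placement = []
    for host, n in slots.items():
        placement.extend([host] * n)
    if len(placement) < nranks:
        raise ValueError("not enough slots for the requested ranks")
    return placement[:nranks]


def adjacent_same_host(placement):
    """Count pairs of consecutive ranks (r, r + 1) that share a host."""
    return sum(1 for a, b in zip(placement, placement[1:]) if a == b)


if __name__ == "__main__":
    slots = {"A": 4, "B": 2}  # host name -> available slots
    for name, mapper in (("by node", map_by_node), ("packed", map_packed)):
        placement = mapper(slots, 6)
        print(name, placement, adjacent_same_host(placement))
```

With these inputs the round-robin order places ranks R0 to R5 on A, B, A, B, A, A, so only one pair of neighboring ranks shares a host, whereas the packed order gives A, A, A, A, B, B and four such pairs. This is one simple way to see why the placement order matters when neighboring ranks exchange the most data.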
1) BY NODE
Mapping by node assigns processes by iterating over nodes. The ranking of MPI processes is implicitly set to node and the default binding unit is socket. As an example, consider the case where host A has 4 slots and host B has 2 slots available for processes to run. The ranks are ordered so that they alternate between nodes and, with binding to a smaller unit, i.e., socket, subsequent processes iterate over the sockets of each host. The process ranks are represented by R0-R5 and closed brackets represent sockets with the available number of cores. The process with rank 0 is assigned to node A, rank 1 to node B, and so on. As the default binding is socket, processes are bound to the entire socket [66].

2) LATENCY BINDING
Latency binding, also known as packed latency, results in the fastest communication between adjacent ranks by distributing processes on cores until all available cores are consumed. Binding and ordering of ranks are also implicitly set to core (same as set by the mapping). As shown in Figure 5, processes are assigned by core until all 4 slots available in node A are consumed, and then the assignment moves to node B [66].
FIGURE 5. Latency binding.

3) CYCLIC BINDING
Cyclic binding is created by mapping processes to sockets and distributing ranks by core. As shown in Figure 6, ranks 0 and 1 are assigned to one socket and ranks 2 and 3 are assigned to the second socket of node A. When the available 4 slots are consumed, the assignment moves to the second node.

4) BANDWIDTH BINDING
Mapping processes to sockets, and ranking and binding them by core so that the core assignments span all available sockets, creates bandwidth binding. The final binding unit, i.e., core, is smaller than the unit defined in the mapping, which results in cores being iterated within the sockets and closely indexed ranks being placed near each other to maximize the cache and memory bandwidth [66].

D. MACHINE LEARNING
1) DECISION TREES
As the name suggests, decision trees use a tree-like model for decisions; they are a most commonly used tool in data mining and are widely adopted in machine learning. Decision trees make use of greedy algorithms, e.g., Hunt’s algorithm, to make the best possible decision at each node, but do not consider the global optimum. The depth of the tree is an important parameter, and decision tree algorithms are often prone to over-fitting with increasing tree depth. Nodes represent decisions, and edges, which are binary (True/False), represent the possible options from one node to another. The process of traversing the tree from a decision node to a terminal/leaf node is easily interpretable, and thus decision trees can be used for feature engineering. Decision tree models often suffer from over-fitting due to outliers and irregularities in the data; as the algorithm goes deeper, the test set error increases and the prediction accuracy drops. Pre-pruning and post-pruning are the two most widely used approaches to tackle over-fitting. In pre-pruning, the splitting of a node is stopped once some threshold value is reached, while in post-pruning the complete tree is formed and, if it suffers from over-fitting, pruning is then carried out. Cross-validation is normally used to test whether splitting a node improves the model. If accuracy suffers from further expansion of a node, that node is considered to be a leaf node [67].

2) ENSEMBLE MACHINE LEARNING METHODS
Ensemble machine learning methods improve the learning results by combining results from numerous base models to produce an optimal model with improved predictive performance compared to a single model. These predictive models use various machine learning algorithms to improve predictions (stacking) by decreasing variance (bagging) and bias
(boosting) [68]. There are different ensemble methods, e.g., bagging and random forest.
Bagging, also known as bootstrap aggregation, reduces the variance of an estimate by taking the average of multiple estimates. Multiple sub-samples of data are extracted randomly from the given data set and a decision tree is formed for each data sample [69]. The results of the multiple decision trees are aggregated into an optimal predictor.
FIGURE 6. Cyclic binding.
FIGURE 7. Bandwidth binding.

3) RANDOM FOREST
Random forests (RF) are bagged decision tree models and are most commonly used due to their flexibility and ease of use; they can target both classification and regression problems [70]. The forest is made up of an ensemble of decision trees trained with the bagging method, which is generally associated with combining multiple learning models to improve the overall score. RF uses a random subset of features at each split of a tree to minimize the correlation between trees. RF can handle binary, numerical and categorical features without much pre-processing effort. RF can be parallelized, handles high dimensionality, offers fast training and prediction, and is robust to outliers, but it has a tendency to over-fit and demands tuning of hyper-parameters.

4) XGBOOST
XGBoost, which stands for extreme gradient boosting, is a machine learning technique that combines several weak learning models (typically decision trees) to produce more stable and accurate results. It builds on the concept of gradient boosting but is more stable and resilient to over-fitting. It is a highly flexible open source tool to target ranking, classification and regression problems [71].

III. LITERATURE SURVEY
Selecting the correct mapping scheme has a significant impact on performance, and the physical location of the MPI processes influences application communication costs. Emani et al. [72] proposed an adaptive mapping of parallelism in the presence of external workload by combining compile-time knowledge with dynamic workload information, and used supervised learning to automatically build a portable heuristic for choosing the right number of threads for each parallel section of the target program and the optimal mapping to the available resources. Wang et al. [73] proposed profile-driven parallelism detection and machine-learning-based mapping, mitigating the shortcomings of static analysis and replacing traditional target-specific mapping with a machine-learning-based mapping mechanism. They applied machine-learning-based offline prediction for each parallel loop candidate to decide the optimal parallel mapping strategy and used OpenMP annotations to generate the parallel code.
Jeannot et al. [74] addressed the data locality problem by proposing a process placement policy. They gathered the communication pattern of the MPI application and also modeled the target architecture. Based on the information gathered, they define a placement policy that is enforced when the application is launched. Castro et al. [75] proposed a machine-learning-based approach to automatically choose an appropriate thread mapping strategy for STM applications considering the features of the applications, STM system and platform.
Exploiting the sparsity structure of the matrix, efficient storage formats, matrix reordering, use of accelerators, and data structure and code reorganization are some of the key issues targeted for optimization of SpMV. There is a trade-off between balanced workload distribution and minimal communication when selecting an efficient data mapping method for SpMV. Mansour et al. [76] proposed a data mapping method for SpMV, derived from the checkerboard method (blocks of rows and columns are assigned to a 2D mesh of processors), on Network-on-Chip (NoC) to minimize communication cost without sacrificing the balanced workload distribution. NoC was introduced to overcome the shortcomings of bus-based on-chip interconnects with a packet-switched network architecture. They have also proposed an FPGA-based architecture for the proposed data mapping methodology. The performance of SpMV is heavily dependent on the structure of the matrix, which may result in drastic performance variations. A matrix with an irregular structure results in a noticeable number of cache misses, and performance is further degraded by load imbalance. The specificities of the unknown input matrix need to be considered at runtime to optimize the performance of SpMV. Kislal et al. [62] proposed a cache-aware SpMV optimization methodology primarily focusing on mapping (iterations to cores in the target multicore architecture), scheduling
(execution order of loop iterations) and data layout reorganization.
Karakasis et al. [77] investigated the performance-energy trade-offs in SpMV by exploiting the execution configuration, i.e., core frequency and thread placement, that yields the optimal performance-energy trade-off. Randomly filling up all cores of a shared-memory machine to cater for memory-bound applications may not be a suitable solution in terms of performance and energy. Thread placement affects the performance of memory-bound applications on modern multicore architectures and can give comparable performance with a low energy budget. Fujino and Nanri [78] proposed a technique for an optimal balance between computation and communication for parallel SpMV in a distributed environment, called Balancing-CET (Balancing Communication and Execution Technique). The estimated time of execution is measured from the previous iteration, and the communication time is estimated by a linear performance model based on the amount of data to be sent and received by each process.
Sparse matrix multiplication usually achieves only a small fraction of the peak performance of a modern processor. Distributed solutions for sparse matrix multiplication lead to significant network communication, and network bandwidth is usually the bottleneck. The distributed solution also imposes challenges in achieving load balancing. Zheng et al. [79] explore a solution that scales sparse matrix dense matrix multiplication (SpMM) on a multi-core machine with commodity SSDs and perform SpMM in semi-external memory (SEM) by keeping one or more columns of a dense matrix in memory while the sparse matrix is accessed from external memory. They demonstrated that the SEM solution uses the resources of a multi-core machine well and achieves performance that exceeds the state-of-the-art in-memory implementations.
Expert programmers can implement effective mapping, but the manual process is expensive and error-prone. The performance of SpMV is heavily dependent on the structure of the matrix, which is unknown before run time; this motivates the idea of using machine learning for its optimization. There is very little work on the use of machine learning for the optimization of SpMV, and most of the efforts are dedicated to automated format selection based on the sparse matrix features.

stored in a database along with the candidate set of optimal mapping strategy S. Each entry u ∈ U and associated solution s ∈ S in the training set consists of the sparse feature vector of u and the optimal mapping strategy s.

A. DATA SET
The dataset is created using mostly square matrices from the SuiteSparse collection and comprises more than 1800 matrices. The matrices are chosen to make sure that we target applications from multidisciplinary domains. Table 4 lists the application domains of the selected matrices. Matrices are selected from 45 application domains and their names are listed in column 2. Count gives the total number of matrices selected in each application domain. The maximum number of matrices for a single domain (i.e., 164) belongs to the Subsequent Circuit Simulation problem domain, while Duplicate Optimization Problem, Directed graph and Directed Weighted Random Graph have only a single matrix each. The minimum and maximum numbers of rows for each application domain are listed in Column 4 (Min. Rows) and Column 5 (Max. Rows), respectively. The minimum number of rows for any single matrix in our data set is 5, which belongs to Directed weighted graph, while the maximum number of rows is 27993600 (Optimization problem). Similarly, the minimum nnz (see symbol Table 1) is 19 and the maximum nnz is 401232976. The last column shows the image of a randomly selected matrix from each application domain.

B. SPARSE MATRIX FEATURES
We tabulate some of the important features in Table 2. The first column lists the feature names and the second column their descriptions. The third column lists the formulae used to obtain the numerical quantity for each of those features. The last column lists the computational complexity. The first 6 features, i.e., number of rows, columns, nnz, density, and mean nnz in rows and columns, have a computational complexity of Θ(1). The more complex features require a full scan of the matrix and thus have a higher computational complexity of Θ(M), and their standard deviations have a complexity of Θ(2M). The costs associated with feature extraction can be amortized, as it is part of the pre-processing step and is done only once for the set of matrices.
IV. ZAKI+: METHODOLOGY AND DESIGN C. FEATURES CHARACTERISTICS

ZAKI+ incorporates supervised machine learning model to Table 3 lists the feature analysis of chosen sparse matrices.
predict the best process mapping strategy for a given matrix Min and Max columns list the minimum and maximum
based on sparse matrix features related to the distribution numerical quantity for each feature. The last two columns
of nonzero elements. All the relevant features are extracted show the average and standard deviation of selected fea-
along with the best mapping strategy for each matrix. The tures, respectively. The experiments are carried out with
best mapping is selected based on the least execution time more than 1800 matrices and number of rows ranges from
among all the candidate-mapping schemes i.e. by node, BW, minimum 5 to 2.799360e+07 and nnz with minimum 19 to
LB, Def, CB (see symbol Table 1). The labeled data set 4.012330e+08 maximum.
(Sparse matrix features, mapping scheme) is given as input
to machine learning algorithm to build a ML based predictive D. DATA PREPARATION
model. The model predicts the best mapping scheme and is The dataset is prepared by executing each matrix with differ-
validated with unseen matrices. The set of sparse matrices U ent mapping schemes and least execution time is recorded.
VOLUME 7, 2019 81285

TABLE 2. Sparse matrix features.

Algorithm 2 Data Preparation

Algorithm for data preparation
A: input matrix
f: output features
n: number of matrices
$A'$: matrix in CSR format
Ŵ: {Default, Latency, Bandwidth, Cyclic, Node}
np: number of processes, where np ≤ number of cores
1: Function FeaturesSPMV (A :: in, f :: out, $Exetime_{avg}$ :: out, ŵ :: out)
2: for j = 1 to j ≤ n do
3:   Convert matrix to CSR: $A'_j \leftarrow A_j$
4:   $f_j \leftarrow$ features of $A'_j$
5:   for k = 1 to k ≤ np do
6:     for i = 1 to i ≤ 2000 do
7:       calculate Executiontime[i] ← call SpMV($A'_j$, x)
8:     end for
9:     $Avgtime_i = \sum_{i=1}^{2000} Execution\_time_i / 2000$
10:    Calculate $Avgtime_i$ for all np
     end for
11:  Choose minimum of $Avgtime_j$ for each $A'_j$
12:  Avgmintime → Minimum of $Avgtime_j$
13: end for
14: for each mapping scheme ŵ ∈ Ŵ
15:   Repeat 5 to 13
16:   choose min of $Exetime''_{avg}$, where $Exetime''_{avg}$ is the min of Avgmintime for each ŵ
17: S{f, ŵ} where ŵ ∈ Ŵ
18: end Function
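Lines 3 and 4 of Algorithm 2 (CSR conversion and feature extraction) can be illustrated with a minimal pure-Python sketch. The function names `coo_to_csr`, `basic_features` and `row_nnz_std` are hypothetical and are not taken from ZAKI+. Only a handful of features in the spirit of Table 2 are shown, and the population standard deviation is an assumption, since the exact formulae are those of Table 2.

```python
import math


def coo_to_csr(n_rows, entries):
    """Convert (row, col, value) triples into CSR arrays (row_ptr, col_idx, values)."""
    row_ptr = [0] * (n_rows + 1)
    for r, _, _ in entries:          # count nonzeros per row
        row_ptr[r + 1] += 1
    for r in range(n_rows):          # prefix sum gives the start offset of each row
        row_ptr[r + 1] += row_ptr[r]
    col_idx = [0] * len(entries)
    values = [0.0] * len(entries)
    next_slot = row_ptr[:-1]         # slicing copies: next free slot in each row
    for r, c, v in entries:
        k = next_slot[r]
        col_idx[k], values[k] = c, v
        next_slot[r] += 1
    return row_ptr, col_idx, values


def basic_features(n_rows, n_cols, row_ptr):
    """Features that need no scan of the matrix once its dimensions and nnz are known."""
    nnz = row_ptr[-1]
    return {
        "rows": n_rows,
        "cols": n_cols,
        "nnz": nnz,
        "density": nnz / (n_rows * n_cols),
        "mean_nnz_per_row": nnz / n_rows,
        "mean_nnz_per_col": nnz / n_cols,
    }


def row_nnz_std(row_ptr):
    """Population standard deviation of nonzeros per row (needs a pass over all rows)."""
    counts = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    mean = sum(counts) / len(counts)
    return math.sqrt(sum((c - mean) ** 2 for c in counts) / len(counts))


# Toy 3x3 matrix with three nonzeros (illustrative data only).
row_ptr, col_idx, values = coo_to_csr(3, [(0, 0, 1.0), (2, 1, 3.0), (0, 2, 2.0)])
features = basic_features(3, 3, row_ptr)
features["row_nnz_std"] = row_nnz_std(row_ptr)
```

The split into two functions mirrors the distinction drawn in the text between cheap features and features that require scanning the matrix.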
TABLE 3. Feature analysis.

We have experimented with 16 nodes, and each node has 2 sockets with 12 cores each. Algorithm 2 shows how the data set is created to train the machine-learning model. Matrices are read in matrix market format and converted to CSR, and the selected set of features is extracted. The SpMV operation is performed 2k times to avoid anomalies, and the least average execution time among all the mapping schemes, i.e., by node, BW, LB, Def, CB (see symbol Table 1), is used to define the class (label) of each matrix. Each matrix is executed with different numbers of processes, and for each number of processes the SpMV operation is performed 2k times. The first for loop at line 2 of Algorithm 2 traverses the matrices in our data set. Each matrix is converted to CSR format and the selected features are extracted, as shown in lines 2 to 4. The second for loop at line 5 traverses np (see Table 1); the upper limit of np is 384. The experiments are performed on 16 nodes with 24 cores each (a total of 384 cores), and thus each matrix can be scaled up to 384 processes. The next for loop traverses the 2k iterations of SpMV kernel execution, and the average execution time $Avgtime_i$ is recorded for every np, as shown in lines 6 to 10. The least average execution time Avgmintime is then recorded for each matrix, as shown in lines 11 and 12.

For each mapping scheme ŵ, the same process is repeated for all the matrices. Once we have the least average execution time recorded for all the matrices in our dataset for each mapping scheme ŵ, the next step is to choose the minimum execution time, i.e., $Exetime''_{avg}$, among all the mapping schemes recorded for each matrix, as shown in lines 14 to 16. The class or label of each matrix is defined by the mapping scheme that attains this minimum $Exetime''_{avg}$. The labeled dataset, comprising the sparse matrix features f and the mapping scheme ŵ, is used to train our machine-learning model.

TABLE 4. Application domains.

FIGURE 8. 2k iterations of SpMV execution of selected matrices on a single node.

FIGURE 9. 2000 iterations of SpMV execution of selected matrices on multiple nodes.
We have experimented with 16 nodes with 24 cores each, and each SpMV operation is performed 2000 times to avoid anomalies; the average execution time over the 2k iterations is used. Figure 8 and Figure 9 show the execution time per iteration for selected matrices on a single node (with 24 processes) and on multiple nodes (8 nodes with 190 processes), respectively. The x-axis shows the iteration number (up to 2k) and the y-axis shows the execution time on a logarithmic scale.
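The labeling procedure of this subsection (average over the iterations, minimum over np, then minimum over the mapping schemes) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `label_matrix`, the nested-dictionary layout of the timings and the toy numbers are all hypothetical, and the measurements themselves would come from the MPI runs described above.

```python
from statistics import mean


def label_matrix(timings):
    """Pick the label for one matrix.

    timings maps scheme -> {np: [per-iteration SpMV times]}.
    Returns (best scheme, its best np, its least average time).
    """
    best_per_scheme = {}
    for scheme, by_np in timings.items():
        # Average over the iterations for every process count.
        avg_by_np = {n: mean(ts) for n, ts in by_np.items()}
        # Least average time over all process counts for this scheme.
        best_np = min(avg_by_np, key=avg_by_np.get)
        best_per_scheme[scheme] = (avg_by_np[best_np], best_np)
    # The label is the scheme whose least average time is smallest.
    label = min(best_per_scheme, key=lambda s: best_per_scheme[s][0])
    best_time, best_np = best_per_scheme[label]
    return label, best_np, best_time


# Toy timings for one matrix: two schemes, two process counts, two iterations each.
toy = {
    "Def": {2: [4.0, 6.0], 4: [3.0, 3.0]},
    "Node": {2: [2.0, 2.0], 4: [2.5, 3.5]},
}
label, best_np, best_time = label_matrix(toy)
```

Returning the process count together with the scheme keeps the information needed to reproduce the best run, even though only the scheme is used as the class label.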
E. MACHINE LEARNING MODEL

ZAKI+ predicts the best mapping strategy for an unseen matrix by exploiting the sparsity pattern through feature extraction. Our proposed predictive model is divided into three main phases, as shown in Figure 10. Matrices in matrix market format are first converted to CSR format and all the relevant features are extracted. The SpMV operation is performed 2k times and the average execution time is recorded across all nps (see Table 1). The same process is repeated for all mapping schemes and the minimum of the average execution time is recorded. The best time is selected from the minimum average execution times among all mapping schemes (Def., Node, CB, LB and BW), and it is used to label our data set.

FIGURE 10. The ZAKI+ predictive model.

The selected set of features, along with the best mapping strategy, serves as input for the machine-learning algorithm. The selected features are listed in Table 2 and the feature analysis is presented in Table 3. The data set is first divided into testing and training sets. The extracted features are scaled to standardize the numerical values, thus preventing the machine-learning algorithm from giving high weight to higher values and lower weight to lower values. We have used standardization, which involves rescaling the numerical values of features to mean = 0 and variance = 1, as represented by the following formula.

$$X_{new} = \frac{X_i - X_{mean}}{\text{standard deviation}} \qquad (1)$$

Algorithm 3 shows how the training phase is performed. The scaled data set $S'$ is divided into training data $S'_{train}$ and testing data $S'_{test}$, as shown in lines 2 to 5. The ML model is trained on $S'_{train}$ and evaluated with $S'_{test}$. We have used Decision Tree (DT) and the ensemble ML algorithms Random Forest (RF) and Extreme Boosting (XG). For any unseen matrix, our predictive model predicts the optimal mapping strategy, which gives the best performance.

Algorithm 3 Training and Testing

Algorithm Training
$S'$ = Scaled data set
$x'$ = Scaled features
Input:
x = Sparse matrix features (f)
y = Label (ŵ), where ŵ ∈ Ŵ is the mapping scheme
Output:
Trained Model
Model Prediction
1: Function Training (x, y :: in, model :: out)
2: Training set $S = (x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
3: Feature scaling $S' \leftarrow X' = \frac{X_i - X_{mean}}{\text{standard deviation}}$
4: Training set $S' = (x'_1, y_1), (x'_2, y_2), \ldots, (x'_n, y_n)$
5: Split $S'$ into test set $S'_{test}$ and training set $S'_{train}$
6: Model.fit($S'_{train}$)
7: Model.predict($S'_{test}$)
8: Accuracy = Number of correct predictions / Total number of input samples
9: end Function

F. FEATURES SCALING
Feature scaling is a pre-processing step that standardizes highly varying feature values/magnitudes to a fixed range, to prevent the machine-learning algorithm from giving high weight to high values and vice versa. There are numerous feature-scaling techniques, but the most common are standardization and Min-Max normalization. We have used standardization, which rescales each feature to mean = 0 and variance = 1.

$$X_{new} = \frac{X_i - X_{mean}}{\text{standard deviation}} \qquad (2)$$

G. EVALUATION METRICS
1) CLASSIFICATION ACCURACY
Different evaluation metrics can be used for classification problems. The classification accuracy is defined as the ratio of the number of correct predictions to the total number of input samples.

$$Accuracy = \frac{\text{Number of correct predictions}}{\text{Total number of input samples}} \qquad (3)$$

2) PERFORMANCE GAIN
To compare the performance gain of the proposed mechanism, we have used the Geometric Mean of Normalized Performance (GMNP), defined as

$$GMNP = \left( \prod_{i=1}^{N} \frac{Best_i}{Predicted_i} \right)^{1/N} \qquad (4)$$

where N is the number of matrices in a test set, $Best_i$ is the minimum time of the ith matrix among all the

schemes, and $Predicted_i$ is the time of the mapping scheme selected by our machine-learning model.

H. EXPERIMENTAL PLATFORM
The experiments have been performed on the Aziz supercomputer at King Abdul Aziz University, Saudi Arabia. Aziz has 496 nodes with 11,904 computing cores. It includes 380 standard compute nodes (9120 cores) with 96 GB of memory (4 GB per core) and 112 high memory compute nodes (2688 cores) with 256 GB (10.6 GB per core) for applications that require large memory for their execution. Each compute node has dual-socket Intel Xeon E5-2695v2 12-core processors running at 2.4 GHz. The compute cluster has a total memory capacity of 66 TB and a peak performance of nearly 230 TFlops. The InfiniBand interconnect provides highly scalable, high-speed and low-latency data transportation, configured in a full bisectional bandwidth non-blocking network fabric with zero copy operations using Remote Direct Memory Access (RDMA). It provides 40 Gbps full bisectional bandwidth communication between any two end points. Table 5 lists the software specifications.

TABLE 5. Software specifications.

V. ANALYSIS OF MAPPING STRATEGIES FOR SPMV PARALLELIZATION
The optimal number of processes and where those processes should be mapped have a great impact on the performance of the parallel application, as is evident in Figure 10.

Let np denote the total number of processes, p denote a single process and $t_i$ represent the execution time of the ith matrix, where i ranges from 1 to N and N is the total number of matrices in our data set. The execution time of each matrix is recorded for different numbers of processes, and as long as the execution time kept decreasing with an increasing number of processes, we continued scaling up each matrix. Depending on the size of the matrix and its sparsity pattern, there comes a breakeven point where the execution time starts to increase with increasing parallelism, as the communication between processes starts to dominate the overall time; this is our stopping criterion for further scaling. Each matrix is executed 2k times for different numbers of processes, and the minimum/maximum execution time is selected by choosing the minimum/maximum among all the np (see Table 1) for that matrix, as shown in the following equations. The average time is calculated by taking the mean execution time of each matrix executed with different numbers of processes. Here $t_k$ is the execution time of the ith matrix with k processes.

$$t^{i}_{min} = \min(t_1, t_2, \ldots, t_{np}) \qquad (5)$$

$$t^{i}_{avg} = \frac{1}{np} \sum_{k=1}^{np} t_k \qquad (6)$$

The sequential and minimum execution times are represented by $t^{i}_{s}$ and $t^{i}_{min}$. The speedup of the ith matrix is calculated with the following equation.

$$Speedup_i = \frac{t^{i}_{s}}{t^{i}_{min}} \qquad (7)$$

The speedup for the entire data set is calculated as follows.

$$Speedup = \frac{\sum_{i=1}^{N} t^{i}_{s}}{\sum_{i=1}^{N} t^{i}_{min}} \qquad (8)$$

$$Speedup_{avg} = \frac{\sum_{i=1}^{N} t^{i}_{avg}}{\sum_{i=1}^{N} t^{i}_{min}} \qquad (9)$$

Minimum execution time and speedup are calculated using equations 5 and 7 respectively. $t^{i}_{avg}$ is the average time calculated by taking the mean execution time of each matrix executed with different numbers of processes, as shown in equation 6. The same process is repeated for all the mapping schemes to get the minimum for each mapping policy, i.e., Node, BW, LB, CB, Def(min), for each matrix. The term Def used in this paper refers to Def(min), which is calculated as the minimum execution time across all np (see Table 1) with the default MPI mapping scheme. Speedup and average speedup are calculated using equations 8 and 9 respectively for the entire data set.

The best time is calculated by choosing the minimum time among all the minimums calculated for each matrix in each mapping scheme. Let $Best_i$ denote the best time of the ith matrix, and let $Node_i$, $BW_i$, $LB_i$, $CB_i$ and $Def_i$ be the minimum times for the ith matrix in each of these mapping schemes. The best time for the ith matrix is calculated using equation 10.

$$Best_i = \min(Node_i, BW_i, LB_i, CB_i, Def_i) \qquad (10)$$

The Best time calculated here is used to define a label for each matrix to train our machine-learning model.

Figure 11 shows the execution time comparison (on log scale) of choosing the optimal mapping strategy, i.e., Best, with Def(min) ($t^{i}_{min}$ with default mapping) and the four mapping schemes, including serial time and default average time ($t^{i}_{avg}$ with default mapping), for the entire data set. The y-axis shows the execution time on a logarithmic scale and the x-axis shows the different mapping schemes, i.e., Node, BW, LB, CB, Def(min), serial execution time, Def(avg) and Best time (see equation 10).

With more than 1800 matrices in our data set, the sequential execution took almost 132291.1 seconds (36.75 hrs.) and the average execution time was 16421.58 seconds (4.5 hrs); this is reduced to 5521.83 seconds (1.53 hrs.) by using only the optimal number of processes with the default mapping scheme. Changing the mapping scheme further reduced the execution time: to 4341.43 sec (1.2 hrs) with CB, 4195.61 sec (1.16 hrs) with LB, 4167.4 sec (1.15 hrs) with BW, and 3944.86 sec (1.09 hrs) with Node.
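The aggregate speedups of equations 8 and 9 are ratios of summed execution times, so they can be checked directly against the totals quoted in this paragraph. The sketch below is only an arithmetic illustration; the dictionary layout and the function name `aggregate_speedup` are assumptions made here, and the numbers are the whole-data-set totals given in the text.

```python
# Whole-data-set totals in seconds, as quoted in the paragraph above.
totals = {
    "serial": 132291.1,
    "Def(avg)": 16421.58,
    "Def(min)": 5521.83,
    "CB": 4341.43,
    "LB": 4195.61,
    "BW": 4167.4,
    "Node": 3944.86,
}


def aggregate_speedup(reference_total, scheme_total):
    """Ratio of two summed execution times, in the form of equations 8 and 9."""
    return reference_total / scheme_total


schemes = ["Def(min)", "CB", "LB", "BW", "Node"]
# Equation 8: summed serial time over summed minimum time of each scheme.
vs_serial = {s: aggregate_speedup(totals["serial"], totals[s]) for s in schemes}
# Equation 9: summed default average time over summed minimum time of each scheme.
vs_def_avg = {s: aggregate_speedup(totals["Def(avg)"], totals[s]) for s in schemes}

for s in schemes:
    print(f"{s:9s} vs serial: {vs_serial[s]:.2f}x   vs Def(avg): {vs_def_avg[s]:.2f}x")
```

Under these totals, Def(min) comes out at roughly 24 times faster than serial execution and mapping by node at roughly 33.5 times.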

Our approach achieved the best performance and reduced the execution time to 3792.11 sec for the entire data set, as shown in Figure 11.

FIGURE 11. Execution time comparison for entire data set.

Figure 12 shows the speedup achieved by the different mapping schemes, i.e., by Node, Def(min), LB, BW, CB (see Table 1) and Def(avg), against serial execution. Def(min) achieved almost 24× speedup against sequential execution. All the mapping schemes used in this paper have outperformed the default MPI mapping scheme, Def(min) and Def(avg), by a big margin. Mapping by node achieved almost 33.5× speedup and outperformed the others; LB and BW achieved almost identical speedups of 31.5× and 31.7× respectively. CB binding achieved 30.4× speedup. Best, shown in Figure 12, achieved almost 35× speedup and outperformed all others.

FIGURE 12. Speedup achieved against serial execution.

Figure 13 illustrates the speedup achieved by Best and the other mapping schemes against Def(avg). By node achieved the highest speedup with 4.1×, followed by BW and LB with 3.9×; CB, with 3.7×, achieved the lowest among the four mapping schemes. Best outperforms all others here and achieves 4.3× speedup against Def(avg).

FIGURE 13. Speedup against Def(avg).

All the mapping schemes here performed surprisingly well compared to Def(min) and Def(avg), and the margin of speedup difference between these schemes is quite narrow.

Figure 14 lists the average slowdown when each mapping scheme is used for the entire data set rather than choosing the individual best. Def(avg) shows the worst performance among all and by Node shows the best. All four mapping schemes used here outperformed both Def(min) and Def(avg).

FIGURE 14. Average slow down against best.

Figure 15 plots the execution time comparison of the different mapping schemes, i.e., Def, Node, CB, BW, LB, serial execution and Best (see equation 10), for all the matrices in our data set, sorted by rows from minimum to maximum. The y-axis shows the execution time in seconds on a logarithmic scale (stacked for ease of visualization) and the x-axis shows the different mapping schemes.

Figure 16 plots the speedup comparison of the different mapping schemes, i.e., Def, Node, CB, BW, LB (see Table 1), Best (see equation 10) and serial execution, for all the matrices in our data set, sorted by rows from minimum to maximum. Figure 17 plots the execution time of the matrices sorted by nnz (min to max). The times on the y-axis are plotted using a logarithmic scale and therefore do not show a steep rise in the plots. The y-axis shows the execution time in seconds (stacked for ease of visualization) and the x-axis shows all the matrices sorted by nnz. Figure 18 plots the speedup achieved by the different mapping schemes against serial execution. Speedup in Figure 16 and Figure 18 shows the speedup of Best against the serial while speedup_def, speedup_node,

speedup_BW, speedup_CB and speedup_LB show the speedup of Def, Node, BW, CB and LB respectively (see Table 1).

FIGURE 15. Execution time comparison of different mapping schemes for all matrices in the data set sorted by rows (min to max).

FIGURE 16. Speedup comparison of different mapping schemes on all matrices in the data set sorted by rows (min to max).

FIGURE 17. Execution time comparison of different mapping schemes for all matrices in the data set sorted by nnz (min to max).

FIGURE 18. Speedup comparison of different mapping schemes for all.

Matrices in our data set are chosen from multidisciplinary domains to avoid bias towards a specific kind of application and to target a variety of application domains. We have experimented with matrices chosen from 45 application domains, as listed in Table 4. Figure 19 (a, b, c) shows the execution time comparison of Best_Time with minimum, maximum and serial execution for different application domains on a logarithmic scale. The x-axis shows the numbers in the same order as listed against each application domain in Table 4. Max_time is a worst-case scenario representing the default maximum, and Def represents the default minimum execution time. The y-axis shows the execution time on a logarithmic scale. Best_time here is the minimum time selected from multiple mapping schemes, calculated using equation 10.

Figure 19 (d, e, f) shows the speedup achieved by choosing the best mapping strategy against serial execution, represented in the figure as Speedup, with the default mapping scheme represented as Speedup_Def and the worst-case scenario as Speedup_max. The numbers on the x-axis represent the application domains in the same order as listed in Table 4. The y-axis shows the speedup on a logarithmic scale.

VI. ZAKI+ PREDICTION RESULTS AND ANALYSIS
This section presents the evaluation of ZAKI+ in predicting the best mapping strategy for SpMV computation in a distributed memory environment.

A. PREDICTIVE MODELS COMPARISON

Figure 20 shows the overall distribution of mapping schemes. With mapping by node, almost 37% of the matrices recorded the least execution time, clearly outperforming the others, followed by LB with 27%. BW and Def have shown almost the same result of 12%, while CB binding has shown the least performance, with only 9.34% of the matrices in our data set showing the least execution time with CB.

We have experimented with three machine learning models: Decision Tree (DT), Random Forest (RF) and extreme boosting (XG). Figure 21 shows the prediction accuracy of the three algorithms, where XG has outperformed the others with the limited set of learning parameters that we have experimented with. The full feature set contains all the features, while the basic features have the lowest computational complexity and do not require a full scan of the matrix, as shown in Table 2. XG outperformed the other two models and has shown 74.45% accuracy with the full feature set, compared to RF with 70.1% and DT with 64.67%. With the basic features, RF and DT have shown almost the same accuracy of ≈ 61%, and XG performed better with ≈ 64% accuracy.

FIGURE 19. Application domains execution time comparison (a, b, c) and speedup (d, e, f).

B. PERFORMANCE GAIN
The performance gain of ZAKI+ and the other mapping strategies is presented in Figure 22. We have used the Geometric Mean of Normalized Performance (GMNP), calculated using equation 4, to compare the performance.

As shown in Figure 22, ZAKI+ has achieved almost 98% of the performance of the Best time for the test set. All mapping schemes have performed well compared to the default. Node, LB and BW have shown almost the same performance and are slightly better than CB. Def, being the worst, has shown almost 68% of the maximum possible performance, i.e., Best, which can be achieved by choosing the right mapping scheme each time for all the matrices in a test set.

The high performance gain is due to the fact that even the mispredictions choose a mapping scheme with an execution time very close to the best, as is evident in Figure 23, which plots the best, default and predicted results. The x-axis shows the matrices in our test set sorted with respect to their

execution time (minimum to maximum) and the y-axis shows the execution time in seconds on a logarithmic scale. It is observed that almost all the predictions and mispredictions are very close to the best available option, as the performance difference between the mapping schemes is very narrow, resulting in a high performance gain.

Figure 24 shows the average execution time comparison of the different mapping schemes with ZAKI+ for the entire data set of 1838 matrices. As is evident in Figure 24, ZAKI+ took 3875.91 seconds and outperformed all others by a big margin. Node(avg) performed marginally better than the other mapping schemes with 11716.23 seconds. ZAKI+ performed almost 3× better than Node(avg), 3.1× better than BW(avg), 3.2× better than LB(avg), 3.3× better than CB(avg) and 4.2× better than Def(avg).

FIGURE 20. Class distribution.

FIGURE 21. Prediction accuracy.

FIGURE 22. Geometric mean of normalized performance of proposed mechanism.

FIGURE 23. Performance difference between best, predicted and Def.

FIGURE 24. Performance comparison of ZAKI+ with different mapping schemes.

VII. CONCLUSION
In this paper, we proposed ZAKI+, a data-driven and machine-learning approach to predict the optimal mapping of processes and data (i.e., matrix partitions) on the underlying distributed memory machine architecture for SpMV computations of an arbitrary sparse matrix. It allows application scientists to automatically, effortlessly, and speedily obtain the best configuration (including the matrix/data distribution, optimal number of processes, and mapping strategy), and hence the best performance, for the execution of the SpMV computations for a given sparse matrix. We have used 1838 real-world sparse matrices associated with 45 application domains to train and test the tool. The execution times for the predicted optimal configuration of SpMV computations are compared with the average execution times of the MPI default mapping policy; ZAKI+ provides 4.24 times aggregated speedup over the MPI default mapping policy with average parallel execution times. We have provided a first-ever detailed comparative analysis of multiple MPI process mapping strategies (Node binding, Latency binding, Bandwidth binding, and Cyclic binding) for SpMV computations. The methodology to use multiple mapping

strategies for prediction is itself a novel contribution of this paper. This is the first work of its kind where the sparsity structures of matrices have been exploited to predict the optimal mapping of the processes and data in distributed memory environments by using different base and ensemble machine learning methods.

It is observed that changing the mapping scheme has increased the performance for the entire data set compared to the MPI default mapping strategy (shown in Figure 15 to Figure 18). Node and LB have shown the best performance, while BW and CB came a close second in terms of speedup against Def. Most of the matrices in our data set have shown the best performance with Node, followed by LB (Figure 20). The least execution time is chosen among all these mapping schemes and labeled accordingly, to train the machine learning model. The performance of all studied mapping schemes is better than Def, and the performance differences within these mapping schemes are quite small, resulting in the high performance gain of 97.79% of the ideally attainable performance for our ZAKI+ tool.

In the future, we will enhance the proposed techniques by incorporating additional relevant features and increasing the dataset, both in terms of the number and size of sparse matrices and application domains. We are also planning to extend it to the hybrid MPI/OpenMP programming model, where MPI is responsible for inter-node communication and OpenMP is used for fine-grained parallelism. Moreover, we plan to extend our tool to incorporate energy efficiency optimization of SpMV computations. We will also extend our proposed tool, ZAKI+, to work for dense matrix vector multiplication.

It was mentioned that the manual process of trial and error experimentation to find the best mapping of processes and data on a given architecture and resources for computing SpMV is time-consuming and frustrating, needing a complete search of the node and processor core space to find the optimal configuration for the computation. The ability to provide the optimal process configuration speedily is a crucial advantage of the ZAKI+ tool. Our future work will look into analyzing the time profile of the predictive model with the aim of developing a real-time tool for process mapping prediction.

The current trend in ICT is towards the convergence of big data, HPC and AI. This paper has attempted to contribute to this convergence and has applied it to the area of SpMV computations. The challenges related to the mapping of data and processes onto distributed memory architectures are not specific to SpMV computations alone. Various cyber-physical systems will comprise compute intensive machine learning applications, such as SpMV, and these will need to be optimally mapped onto the underlying cyber-physical and exascale computing infrastructure. CPSs in the future will comprise an ecosystem of digital infrastructures that are able to work together and enable dynamic real-time interactions between various CPS subsystems. Technologies such as big data, pervasive, cloud and fog computing, as well as the increasingly complex demands of smart cities and societies, are likely to transform the future of computing infrastructures. The trend would be the integration of computing at exascale (and beyond) with big data technologies and provision of on-demand service-oriented high performance computing together with the required data, AI and other applications. The mapping of data and processes (related to smart applications) onto the underlying cyberphysical converged infrastructure therefore would be of paramount importance. This area of convergence is in its infancy [38] and our future work will provide more in-depth proposals and analysis on these aspects of the SpMV computations.

ACKNOWLEDGMENT
The experiments performed in this paper were executed on the Aziz supercomputer being managed by the HPC Center at the King Abdul-Aziz University.

REFERENCES
[1] National Institute of Standards and Technology. Cyber-Physical Systems. Accessed: May 23, 2019. [Online]. Available: https://www.nist.gov/el/cyber-physical-systems
[2] W. Yu, F. Liang, X. He, W. G. Hatcher, C. Lu, J. Lin, and X. Yang, "A survey on the edge computing for the Internet of Things," IEEE Access, vol. 6, pp. 6900–6919, 2018.
[3] J. Lin, W. Yu, N. Zhang, X. Yang, H. Zhang, and W. Zhao, "A survey on Internet of things: Architecture, enabling technologies, security and privacy, and applications," IEEE Internet Things J., vol. 4, no. 5, pp. 1125–1142, Oct. 2017.
[4] R. Mehmood, B. Bhaduri, I. Katib, and I. Chlamtac, Eds., Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018.
[5] R. Mehmood, S. See, I. Katib, and I. Chlamtac, Eds., Smart Infrastructure and Applications: Foundations for Smarter Cities and Societies (EAI/Springer Innovations in Communication and Computing). Cham, Switzerland: Springer, 2019.
[6] R. Mehmood, F. Alam, N. N. Albogami, I. Katib, A. Albeshri, and S. M. Altowaijri, "UTiLearn: A personalised ubiquitous teaching and learning system for smart societies," IEEE Access, vol. 5, pp. 2615–2635, 2017.
[7] F. Alam, R. Mehmood, I. Katib, N. N. Albogami, and A. Albeshri, "Data fusion and IoT for smart ubiquitous environments: A survey," IEEE Access, vol. 5, pp. 9533–9554, 2017.
[8] F. Alam, R. Mehmood, I. Katib, and A. Albeshri, "Analysis of eight data mining algorithms for smarter Internet of Things (IoT)," Procedia Comput. Sci., vol. 58, pp. 437–442, Jan. 2016.
[9] M. V. Tabib, A. Rasheed, and T. P. Uteng, "Methodology for assessing cycling comfort during a smart city development," Energy Procedia, vol. 122, pp. 361–366, Sep. 2017.
[10] G. Triscone, "Computational fluid dynamics as a tool to predict the air pollution dispersion in a neighborhood - A research project to improve the quality of life in cities," Int. J. Sustain. Dev. Plan., vol. 11, no. 4, pp. 546–557, Aug. 2016.
[11] C. G. García, D. Meana-Llorián, B. C. P. G-Bustelo, J. M. C. Lovelle, and N. Garcia-Fernandez, "Midgar: Detection of people through computer vision in the Internet of Things scenarios to improve the security in Smart Cities, Smart Towns, and Smart Homes," Future Gener. Comput. Syst., vol. 76, pp. 301–313, Nov. 2017.
[12] A. S. Montemayor, J. J. Pantrigo, and L. Salgado, "Special issue on real-time computer vision in smart cities," J. Real-Time Image Process., vol. 10, no. 4, pp. 723–724, Dec. 2015.
[13] E. Estrada, R. Maciel, A. Ochoa, B. Bernabe-Loranca, D. Oliva, and V. Larios, "Smart City visualization tool for the Open Data georeferenced analysis utilizing machine learning," Int. J. Combinat. Optim. Problems Inform., vol. 9, no. 2, pp. 25–40, 2018.
[14] A. Rahman, J. Jin, A. Cricenti, A. Rahman, M. Palaniswami, and T. Luo, "Cloud-enhanced robotic system for smart city crowd control," J. Sens. Actuator Netw., vol. 5, no. 4, p. 20, Dec. 2016.

[15] D. G. Aliaga, “3D design and modeling of smart cities from a computer graphics perspective,” ISRN Comput. Graph., vol. 2012, Dec. 2012, Art. no. 728913.
[16] R. Gade, “Thermal imaging systems for real-time applications in smart cities,” Int. J. Comput. Appl. Technol., vol. 53, no. 4, pp. 291–308, 2016.
[17] M. Akcin, A. Kaygusuz, A. Karabiber, S. Alagoz, B. B. Alagoz, and C. Keles, “Opportunities for energy efficiency in smart cities,” in Proc. 4th Int. Istanbul Smart Grid Congr. Fair (ICSG), Apr. 2016, pp. 1–5.
[18] M. Zappatore, A. Longo, and M. A. Bochicchio, “Crowd-sensing our smart cities: A platform for noise monitoring and acoustic urban planning,” J. Commun. Softw. Syst., vol. 13, no. 2, pp. 53–67, Jun. 2017.
[19] J. P. Bello, C. Mydlarz, and J. Salamon, “Sound analysis in smart cities,” in Computational Analysis of Sound Scenes and Events, Cham, Switzerland: Springer, 2018, pp. 373–397.
[20] R. Mehmood, R. Meriton, G. Graham, P. Hennelly, and M. Kumar, “Exploring the influence of big data on city transport operations: A Markovian approach,” Int. J. Oper. Prod. Manag., vol. 37, no. 1, pp. 75–104, Jan. 2017.
[21] R. Mehmood and G. Graham, “Big data logistics: A health-care transport capacity sharing model,” Procedia Comput. Sci., vol. 64, pp. 1107–1114, Oct. 2015.
[22] R. Mehmood and J. A. Lu, “Computational Markovian analysis of large systems,” J. Manuf. Technol. Manage., vol. 22, no. 6, pp. 804–817, Jul. 2011.
[23] S. Altowaijri, R. Mehmood, and J. Williams, “A quantitative model of grid systems performance in healthcare organisations,” in Proc. Int. Conf. Intell. Syst., Modelling Simulation, Jan. 2010, pp. 431–436.
[24] R. Mehmood, R. Alturki, and S. Zeadally, “Multimedia applications over metropolitan area networks (MANs),” J. Netw. Comput. Appl., vol. 34, no. 5, pp. 1518–1529, 2011.
[25] T. E. H. El-Gorashi, B. Pranggono, R. Mehmood, and J. M. H. Elmirghani, “A data mirroring technique for SANs in a metro WDM sectioned ring,” in Proc. Int. Conf. Opt. Netw. Design Modeling (ONDM), Mar. 2008, pp. 1–6.
[26] E. Alamoudi, R. Mehmood, A. Albeshri, and T. Gojobori, “DNA profiling methods and tools: A review,” in Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018, pp. 216–231.
[27] A. Khanum, A. Alvi, and R. Mehmood, “Towards a semantically enriched computational intelligence (SECI) framework for smart farming,” in Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018, pp. 247–257.
[28] M. Aqib, R. Mehmood, A. Alzahrani, I. Katib, and A. Albeshri, “A deep learning model to predict vehicles occupancy on freeways for traffic management,” in Proc. Int. J. Comput. Sci. Netw. Secur., vol. 18, no. 12, pp. 246–254, 2018.
[29] M. Aqib, R. Mehmood, A. Albeshri, and A. Alzahrani, “Disaster management in smart cities by forecasting traffic plan using deep learning and GPUs,” in Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018, pp. 139–154.
[30] Y. Arfat, “Enabling smarter societies through mobile big data fogs and clouds,” Procedia Comput. Sci., vol. 109, pp. 1128–1133, May 2017.
[31] J. Schlingensiepen, R. Mehmood, F. C. Nemtanu, and M. Niculescu, “Increasing sustainability of road transport in European cities and metropolitan areas by facilitating autonomic road transport systems (ARTS),” in Proc. Sustain. Automot. Technol. 5th Int. Conf. (ICSAT), 2014, pp. 201–210.
[32] M. Aqib, R. Mehmood, A. Alzahrani, I. Katib, A. Albeshri, and S. M. Altowaijri, “Rapid transit systems: Smarter urban planning using big data, in-memory computing, deep learning, and GPUs,” Sustainability, vol. 11, no. 10, p. 2736, May 2019.
[33] M. Aqib, R. Mehmood, A. Alzahrani, I. Katib, A. Albeshri, and S. M. Altowaijri, “Smarter traffic prediction using big data, in-memory computing, deep learning and GPUs,” Sensors, vol. 19, no. 9, p. 2206, May 2019.
[34] F. Alam, R. Mehmood, and I. Katib, “D2TFRS: An object recognition method for autonomous vehicles based on RGB and spatial values of pixels,” in Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 224, 2018, pp. 155–168.
[35] Y. Arfat, R. Mehmood, and A. Albeshri, “Parallel Shortest Path Graph Computations of United States Road Network Data on Apache Spark,” in Int. Conf. Smart Cities, Infrastructure, Technol. Appl., vol. 2017, pp. 323–336.
[36] S. Suma, R. Mehmood, and A. Albeshri, “Automatic event detection in smart cities using big data analytics,” in Proc. Int. Conf. Smart Cities, Infrastruct., Technol. Appl. (SCITA) Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 224, 2017, pp. 111–122.
[37] S. Suma, R. Mehmood, N. Albugami, I. Katib, and A. Albeshri, “Enabling next generation logistics and planning for smarter societies,” Procedia Comput. Sci., vol. 109, pp. 1122–1127, May 2017.
[38] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick, “A view of the parallel computing landscape,” Commun. ACM, vol. 52, no. 10, pp. 56–67, Oct. 2009.
[39] T. Muhammed, R. Mehmood, A. Albeshri, and I. Katib, “SURAA: A novel method and tool for Loadbalanced and coalesced SpMV computations on GPUs,” Appl. Sci., vol. 9, no. 5, p. 947, Mar. 2019.
[40] H. Alyahya, R. Mehmood, and I. Katib, “Parallel sparse matrix vector multiplication on intel MIC: Performance analysis,” in Smart Societies, Infrastructure, Technologies and Applications - SCITA (Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018, pp. 306–322.
[41] M. Kwiatkowska, D. Parker, and Y. Zhang, “Dual-processor parallelisation of symbolic probabilistic model checking,” in Proc. IEEE Comput. Soc. 12th Annu. Int. Symp. Modeling, Anal., Simulation Comput. Telecommun. Syst. (MASCOTS), Oct. 2004, pp. 123–130.
[42] R. Mehmood and J. Crowcroft, “Parallel iterative solution method for large sparse linear equation systems,” Univ. Cambridge, Cambridge, U.K., Tech. Rep. UCAM-CL-TR-650, 2005.
[43] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Netw. Appl., vol. 19, no. 2, pp. 171–209, Apr. 2014.
[44] E. Alomari and R. Mehmood, “Analysis of tweets in Arabic language for detection of road traffic conditions,” in Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018, pp. 98–110.
[45] S. Usman, R. Mehmood, and I. Katib, “Big data and HPC convergence: The cutting edge and outlook,” in Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering), vol. 2248. Cham, Switzerland: Springer, 2018, pp. 11–26.
[46] R. Farber. (2018). The Convergence of Big Data and Extreme-Scale HPC. HPC Wire. Accessed: Nov. 1, 2011. [Online]. Available: https://www.hpcwire.com/2018/08/31/the-convergence-of-big-data-and-extreme-scale-hpc/
[47] M. Grossman, C. Thiele, M. Araya-Polo, F. Frank, F. O. Alpak, and V. Sarkar, “A survey of sparse matrix-vector multiplication performance on large matrices,” Aug. 2016, arXiv:1608.00636. [Online]. Available: https://arxiv.org/abs/1608.00636
[48] R. Mehmood, “Disk-based techniques for efficient solution of large Markov chains,” Ph.D. dissertation, Dept. School Comput. Sci., Univ. Birmingham, Birmingham, U.K., 2004.
[49] R. Mehmood, D. Parker, and M. Kwiatkowska, “An efficient BDD-based implementation of Gauss-Seidel for CTMC analysis,” Dept. School Comput. Sci., Univ. Birmingham, Birmingham, U.K., Tech. Rep. CSR-03-13, 2013.
[50] R. Mehmood, “A survey of out-of-core analysis techniques in stochastic modelling,” Dept. School Comput. Sci., Univ. Birmingham, Birmingham, U.K., Tech. Rep. CSR-03-7, Aug. 2003.
[51] (2018). Intel Math Kernel Library (Intel MKL) | Intel Software. Accessed: Mar. 24, 2019. [Online]. Available: https://software.intel.com/en-us/mkl
[52] The Trilinos Project. Accessed: Mar. 24, 2019. [Online]. Available: https://trilinos.org/publicRepo/
[53] CUSP. Accessed: Mar. 24, 2019. [Online]. Available: https://cusplibrary.github.io/
[54] cuSPARSE. Accessed: Mar. 24, 2019. [Online]. Available: https://developer.nvidia.com/cusparse
[55] X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey, “Efficient sparse matrix-vector multiplication on x86-based many-core processors,” in Proc. 27th Int. ACM Conf. Int. Conf. Supercomput. (ICS), 2013, pp. 273–282.
[56] R. Mehmood, J. Crowcroft, and J. M. H. Elmirghani, “A parallel implicit method for the steady-state solution of CTMCs,” in Proc. 14th IEEE Int. Symp. Modeling, Anal., Simulation (MASCOTS), Sep. 2006, pp. 293–302.
[57] A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, and R. Vuduc, “Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures,” in Proc. IEEE Int. Symp. Parallel Distrib. Process. (IPDPS), Apr. 2010, pp. 1–12.
[58] M. Maggioni and T. Berger-Wolf, “Optimization techniques for sparse matrix–vector multiplication on GPUs,” J. Parallel Distrib. Comput., vols. 93–94, pp. 66–86, Jul. 2016.
[59] H. Anzt, V. Heuveline, J. I. Aliaga, M. Castillo, J. C. Fernández, R. Mayo, and E. S. Quintana-Ortí, “Analysis and optimization of power consumption in the iterative solution of sparse linear systems on multi-core and many-core platforms,” in Proc. Int. Green Comput. Conf. Workshops, 2011, pp. 1–6.
[60] A. Khajeh-Saeed, S. Poole, and J. B. Perot, “A comparison of multi-core processors on scientific computing tasks,” in Proc. Innov. Parallel Comput., Found. Appl. GPU, Manycore, Heterogeneous Syst. (InPar), San Jose, CA, USA, 2012.
[61] X. Feng, H. Jin, R. Zheng, K. Hu, J. Zeng, and Z. Shao, “Optimization of sparse matrix-vector multiplication with variant CSR on GPUs,” in Proc. IEEE 17th Int. Conf. Parallel Distrib. Syst. (ICPADS), Dec. 2011, pp. 165–172.
[62] O. Kislal, W. Ding, M. Kandemir, and I. Demirkiran, “Optimizing sparse matrix vector multiplication on emerging multicores,” in Proc. IEEE 6th Int. Workshop Multi-/Many-Core Comput. Syst. (MuCoCoS), Sep. 2013, pp. 1–10.
[63] S. Usman, R. Mehmood, I. Katib, A. Albeshri, and S. M. Altowaijri, “ZAKI: A smart method and tool for automatic performance optimization of parallel SpMV computations on distributed memory machines,” Mobile Netw. Appl., to be published.
[64] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1–25, Nov. 2011.
[65] S. Lee and R. Eigenmann, “Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems,” in Proc. 22nd Annu. Int. Conf. Supercomput. (ICS), 2008, pp. 195–204.
[66] IBM Knowledge Center - Understanding MPI Process Placement and Affinity. Accessed: Mar. 20, 2019. [Online]. Available: https://www.ibm.com/support/knowledgecenter/en/SSZTET_10.2/admin/smpi02_proc_affinity_placement.html
[67] R. Saxena. How Decision Tree Algorithm Works. Accessed: Mar. 20, 2019. [Online]. Available: http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
[68] V. Smolaykov. Ensemble Learning to Improve Machine Learning Results. Accessed: Mar. 20, 2019. [Online]. Available: https://blog.statsbot.co/ensemble-learning-d1dcd548e936
[69] E. Lutins. Ensemble Methods in Machine Learning: What are They and Why Use Them? Accessed: Mar. 20, 2019. [Online]. Available: https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f
[70] N. Donges. The Random Forest Algorithm - Towards Data Science. Accessed: Mar. 20, 2019. [Online]. Available: https://towardsdatascience.com/december-edition-80d8992a0fc
[71] K. Nishida. Introduction to Extreme Gradient Boosting in Exploratory. Accessed: Mar. 20, 2019. [Online]. Available: https://blog.exploratory.io/introduction-to-extreme-gradient-boosting-in-exploratory-7bbec554ac7
[72] Z. Wang, M. F. P. O’Boyle, and M. K. Emani, “Smart, adaptive mapping of parallelism in the presence of external workload,” in Proc. IEEE/ACM Int. Symp. Code Gener. Optim. (CGO), Feb. 2013, pp. 1–10.
[73] Z. Wang, G. Tournavitis, B. Franke, and M. F. P. O’Boyle, “Integrating profile-driven parallelism detection and machine-learning-based mapping,” ACM Trans. Archit. Code Optim., vol. 11, no. 1, p. 2, 2013.
[74] E. Jeannot, G. Mercier, and F. Tessier, “Process placement in multicore clusters: Algorithmic issues and practical techniques,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 4, pp. 993–1002, Apr. 2014.
[75] M. Castro, L. F. W. Góes, C. P. Ribeiro, M. Cole, M. Cintra, and J.-F. Mehaut, “A machine learning-based approach for thread mapping on transactional memory applications,” in Proc. 18th Int. Conf. High Perform. Comput., Dec. 2011, pp. 1–10.
[76] A. Mansour, J. Götze, W.-C. Hsu, and S.-J. Ruan, “Sparse matrix-vector multiplication: A data mapping-based architecture,” in Proc. 15th Int. Conf. Parallel Distrib. Comput., Appl. Technol., 2014, pp. 152–158.
[77] V. Karakasis, G. Goumas, and N. Koziris, “Exploring the performance-energy tradeoffs in sparse matrix-vector multiplication,” in Proc. Workshop Emerg. Supercomput. Technol. (WEST)-ICS, 2011, pp. 1–6.
[78] S. Fujino and T. Nanri, “Parallelized balancing communication and execution for sparse matrix-vector multiplication,” Trans. Jpn. Soc. Simul. Technol., vol. 7, no. 2, pp. 37–41, 2015.
[79] D. Zheng, D. Mhembere, V. Lyzinski, J. T. Vogelstein, C. E. Priebe, and R. Burns, “Semi-external memory sparse matrix multiplication for billion-node graphs,” IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 5, pp. 1470–1483, May 2017.

SARDAR USMAN received the M.S. degree in network system engineering from the University of Plymouth, U.K., and the Ph.D. degree in computer science from King Abdul Aziz University, Saudi Arabia, in 2019. His research interests include networking, high-performance computing, and machine learning.

RASHID MEHMOOD is currently a Research Professor of big data systems and the Director of research, training, and consultancy with the High Performance Computing Centre, King Abdulaziz University, Saudi Arabia. He has gained qualifications and work experience from universities in U.K., including Cambridge University and Oxford University. He has 23 years of academic and industrial experience in computational modeling, simulations, and design using computational intelligence, big data, and high-performance computing. His broad research aim is to develop multi-disciplinary science and technology to enable a better quality of life and smart economy with a focus on real-time intelligence and dynamic (autonomic) system management. He has published over 150 research papers, including six edited books. He is a Founding Member of the Future Cities and Community Resilience (FCCR) Network, a member of ACM and OSA, and the former Vice-Chairman of the IET Wales SW Network. He has organized and chaired international conferences and workshops, including EuropeComm 2009, Nets4Cars 2010–2013, SCE 2017–2019, SCITA 2017, and HPC Saudi 2018. He has led and contributed to academia-industry collaborative projects funded by EPSRC, EU, U.K. regional funds, and Technology Strategy Board U.K. with the value of over 50 million Euro.

IYAD KATIB received the B.S. degree in statistics/computer science from King Abdul Aziz University, in 1999, and the M.S. and Ph.D. degrees in computer science from the University of Missouri-Kansas City, in 2004 and 2011, respectively. He is currently an Associate Professor with the Computer Science Department and the current Vice Dean and the College Council Secretary of the Faculty of Computing and Information Technology (FCIT), King Abdulaziz University (KAU), where he is also the Director of the High Performance Computing Center. His current research interests include computer networking and high-performance computing.

AIIAD ALBESHRI received the M.S. and Ph.D. degrees in information technology from the Queensland University of Technology, Brisbane, QLD, Australia, in 2007 and 2013, respectively. He has been an Assistant Professor with the Computer Science Department, King Abdulaziz University, Jeddah, Saudi Arabia, since 2013. His current research interests include security and trust in cloud computing and big data.