Received May 25, 2019, accepted June 9, 2019, date of publication June 17, 2019, date of current version July 3, 2019.
Digital Object Identifier 10.1109/ACCESS.2019.2923565
ABSTRACT Smart cities and other cyber-physical systems (CPSs) rely on various scientific, engineer-
ing, business, and social applications that provide timely intelligence for their design, operations, and
management. Many of these scientific and analytics applications require the solution of sparse linear
equation systems, where sparse matrix-vector (SpMV) product is a key computing operation. Several factors
determine the performance of parallel SpMV computations, including matrix characteristics, storage formats,
and the rising complexity and heterogeneity of computer systems. There is a pressing need for new ways
of exploiting parallelism, and mapping data and applications to the computing resources. We propose here
ZAKI+, a data-driven machine-learning approach, allowing users to automatically, effortlessly, and speedily
obtain the best configuration (the data distribution, the optimal number of processes, and mapping strategy)
and performance for the execution of the parallel SpMV computations on distributed memory machines.
We train and test the tool using three machine learning methods—decision trees, random forest, and Xtreme
boosting—and nearly 2000 real-world matrices obtained from 45 application domains, including computer
vision and robotics. ZAKI+ provides optimal process mapping and outperforms the MPI default mapping
policy by a factor of 4.24. This is the first work where the sparsity structure of matrices has been exploited
to predict the optimal mapping of processes and data in distributed-memory environments by using different
base and ensemble machine learning methods. Various CPSs comprise compute-intensive machine learning
applications, such as the SpMV, and hence, the process and data mapping contributions of this paper would
be of paramount impact for the CPSs.
INDEX TERMS Cyber-physical systems, SpMV, sparse linear algebra, sparse matrices, machine learning,
MPI, process affinity, compressed sparse row (CSR), decision trees, random forest, Xtreme boosting, parallel
computing, high performance computing (HPC), smart cities analytics, exascale systems, OpenMPI.
I. INTRODUCTION
Cyber-Physical Systems (CPS) comprise ‘‘interacting digital, analog, physical, and human components engineered for function through integrated physics and logic. These systems will provide the foundation of our critical infrastructure, form the basis of emerging and future smart services, and improve our quality of life in many areas’’ [1], including transportation, healthcare, smart grid, disaster management and many other areas. A CPS uses Internet of Things (IoT), big data, cloud, fog, and edge computing, artificial intelligence, high performance computing, and other cutting-edge technologies to provide the foundations for smart cities and societies [2]–[8]. Smart cities appear as ‘‘the next stage of urbanization, subsequent to the knowledge-based economy, digital economy, and intelligent economy’’, aiming to ‘‘not only exploit physical and digital infrastructure for urban development but also the intellectual and social capital as its core ingredient for urbanization’’. Smart society is an extension of the smart cities concept, ‘‘a digitally-enabled, knowledge-based society, aware of and working towards social, environmental and economic sustainability’’ [4].

The associate editor coordinating the review of this manuscript and approving it for publication was Wei Yu.
This work is licensed under a Creative Commons Attribution 3.0 License. For more information, see http://creativecommons.org/licenses/by/3.0/
VOLUME 7, 2019 81279
S. Usman et al.: ZAKI+: Machine Learning-Based Process Mapping Tool for SpMV Computations
Smart cities and other cyber physical systems rely on various scientific, engineering, business, and social applications that provide timely intelligence for their design, operations and management. Many of these scientific and analytics applications require the solution of sparse linear equation systems. Examples of such applications that have been specifically applied to smart city settings include computational fluid dynamics (CFD) [9], [10], computer vision or computer graphics [11]–[13], robotics problems [14], 2D/3D problems [15], thermal problems [16], [17], acoustics problems [18], [19], operational research [20]–[22], healthcare [23], and networking [24], [25]. Other examples of smart city applications include life sciences [26], smart farming [27], transportation [28]–[33], autonomous vehicles [34], graph computations [35], and social media analytics [36], [37].
Sparse matrix-vector product (SpMV) is the most important and time-consuming kernel for the iterative solution of sparse linear equation systems. The SpMV operation has been categorized as one of the seven dwarfs, i.e. seven numerical methods of significant importance [38]. It is a memory bound operation compared to other compute intensive algebra kernels such as dense matrix-vector multiplication. High performance computing (HPC) typically exploits parallel computing features of the underlying software and hardware infrastructure to solve large problems faster. HPC has been applied to SpMV/linear algebra [39]–[42], and other problems for several decades. Big data and data-driven approaches [35], [36], [43], [44] have been used relatively recently in scientific computing to address HPC related challenges, and this has given rise to the convergence of HPC and big data [45], [46]. Moreover, artificial intelligence (AI) is increasingly being used to improve big data, HPC, scientific computing, and other problem domains. This trend has given rise to the convergence of big data, HPC and AI. This paper attempts to contribute to this convergence and applies it (the convergence of the three areas) to the area of SpMV computations.
Several factors affect the performance of SpMV computations [47]. These include matrix characteristics, storage formats, software implementations, and hardware platforms. The matrix characteristics include eigenvalues, definiteness, number of non-zero values in the matrix and the sparsity pattern. The storage formats include Compressed Sparse Row (CSR), Compressed Sparse Column (CSC), Coordinate format, Diagonal, Hybrid, Blocked CSR, Extended BCSR, Compact Modified Sparse Row (CMSR), and many others [42], [48]–[50]. The choice of matrix storage format is dependent on the characteristics of the matrix itself. The software implementations include, among others, Intel MKL [51], Trilinos Project [52], CUSPARSE [53], and CUSP [54]. The characteristics of the hardware platforms that could affect the SpMV performance include the DRAM bandwidth, cache hierarchy, the available parallelism in the hardware, and others. A range of hardware architectures is being used for SpMV implementations including CPUs [55]–[57], MIC [40], GPUs [39], [58], and other architectures [59], [60].
Performance optimization of an application on distributed-memory multicore architectures is challenging due to the heterogeneity and diversity of architectures. Modern machines have a range of shared and distributed memory, and hybrid architectures, with several hierarchies involving non-uniform communication latencies [61], [62]. The sparsity pattern of the matrix affects the performance of SpMV computations, particularly in the case of distributed-memory implementations (due to higher variations in communication delays), resulting in load imbalance that causes both computation and communication overheads [39]. The goal of mapping sparse matrices and vectors to processors in a distributed memory parallel environment is to minimize the overall communication (number of sent messages, communication volume per processor, synchronization costs) and provide computation load balance. The sparsity pattern of a matrix is unknown before runtime. The manual process of trial and error experimentation to find the best mapping of processes and data on a given architecture and resources for computing SpMV is time-consuming and frustrating. More importantly, it requires a complete search of the node and processor core space to find the optimal configuration for the computation. The structure of the matrix is not regular and therefore the whole manual process needs to be repeated numerous times for each matrix.
The challenges related to the mapping of data and processes onto distributed memory architectures are not specific to SpMV computations alone. Various cyber physical systems will comprise compute intensive machine learning applications, such as SpMV, and these will need to be optimally mapped onto the underlying cyber physical and exascale computing infrastructure. Cyber physical systems in the future will comprise an ecosystem of digital infrastructures that are able to work together and enable dynamic real-time interactions between various CPS subsystems. Technologies such as big data, pervasive, cloud and fog computing, as well as the increasingly complex demands of smart cities and societies, are likely to transform the future of computing infrastructures. The trend would be the integration of computing at exascale (and beyond) with big data technologies and provision of on-demand service-oriented high performance computing together with the required data, AI and other applications. The mapping of data and processes (related to smart applications) onto the underlying cyberphysical converged infrastructure therefore would be of paramount importance.

A. AIM AND CONTRIBUTIONS
In our earlier work, we have proposed ZAKI that predicts the optimal number of processes for SpMV computations of an arbitrary sparse matrix on a distributed memory machine [63]. In this paper, we extend the earlier tool and propose ZAKI+, a data-driven and machine-learning approach to predict the optimal mapping of the processes and data (i.e., matrix partitions) on the underlying distributed
memory machine architecture for SpMV computations of an arbitrary sparse matrix. The aim herein is to allow application scientists to automatically, effortlessly, and speedily obtain the best configuration (including the matrix/data distribution, optimal number of processes, and mapping strategy), and hence the best performance, for the execution of the SpMV computations for a given sparse matrix (see Figure 1). ZAKI+ involves three phases: data preparation, training, and testing. Data preparation includes sparse matrix feature extraction, SpMV kernel execution with varying number of processes and five process mapping strategies, choosing the minimum execution time for each mapping strategy, and selecting the optimal mapping of the data and processes for each matrix in our dataset. We have used the SuiteSparse matrix collection [64] as our dataset, comprising 1838 (randomly) selected sparse matrices associated with 45 application domains.

FIGURE 1. A high-level depiction of the proposed SpMV Performance optimization tool ZAKI+.

Firstly, the sparse matrices in the dataset are converted to the CSR format. The SpMV computations are performed 2000 times for each of the 1838 matrices for the whole range of processes (cores on multiple nodes), varying between 1 and 384 (see Section 4). The sequential, minimum, and average execution times are recorded for each sparse matrix in the dataset. The average time of the 2000 SpMV executions is used to avoid any anomalies. The labeled data set includes sparse matrix features along with the optimal mapping strategy that gives the minimum execution time for the matrix. This labeled data set is divided into the training and testing datasets, containing randomly selected 90% and 10% of the matrices from the dataset, respectively. The training dataset is used to train the predictive model using three machine learning algorithms: Decision Trees, Random Forest, and Xtreme Boosting. The trained predictive model is tested on the test data using a generic classification accuracy metric. The proposed model is trained off-line once and requires no further training at the actual matrix execution or prediction time. The execution times for the predicted optimal configuration of SpMV computations are compared with the average execution times of MPI default mapping policy; ZAKI+ provides 4.24 times aggregated speedup over the MPI default mapping policy with average parallel execution times.
This paper makes the following contributions. We:
• propose, implement, and evaluate a machine learning tool that allows users to automatically obtain the best configuration (including the optimal mapping of the processes and data), and hence the best performance, for SpMV computations of a given sparse matrix on a distributed-memory machine.
• train and test the tool using nearly 2000 real-world matrices obtained from 45 application domains including CFD, computer vision, and robotics.
• perform in-depth performance modeling and evaluation using different machine learning techniques and visualizations.
• provide a first-ever detailed comparative analysis of multiple MPI process mapping strategies (Node binding, Latency binding, Bandwidth binding, and Cyclic binding) for SpMV computations. The methodology to use multiple mapping strategies for prediction is itself a novel contribution of this paper.
To the best of our knowledge, this is the first work of its kind where the sparsity structure of matrices has been exploited to predict the optimal mapping of the processes and data in distributed memory environments by using different base and ensemble machine learning methods. ZAKI is an Arabic word, which means ‘‘smart’’.
The rest of the paper is organized as follows. Section 2 provides background information related to SpMV and machine learning algorithms. The literature survey is presented in Section 3. Section 4 introduces the methodology of our proposed technique. Section 5 gives detailed experimental results and analysis of the different mapping strategies for SpMV parallelization. The prediction results and analysis of the ZAKI+ tool are given in Section 6. We conclude in Section 7 and give future directions.

II. BACKGROUND
This section gives a brief overview of SpMV, different process mapping techniques and machine learning algorithms used in this paper. Table 1 lists the basic symbols used in this paper.

A. SpMV
Compressed Sparse Row (CSR) is generally the most commonly used sparse matrix storage format and can be used to
(Execution order of loop iterations) and data layout reorganization.
Karakasis et al. [77] investigated the performance energy trade-offs in SpMV by exploiting the execution configuration, i.e. core frequency and thread placement, that yields the optimal performance energy trade-off. Randomly filling up all cores of a shared memory machine to cater to memory bound applications may not be a suitable solution in terms of performance and energy. Thread placement affects the performance of memory bound applications on modern multicore architectures and can give comparable performance with a low energy budget. Fujino and Nanri [78] proposed a technique for an optimal balance between computation and communication for parallel SpMV in a distributed environment and called their technique Balancing-CET (Balancing Communication and Execution Technique). The estimated time of execution is measured from the previous iteration and the communication time is estimated by a linear performance model with the amount of data to be sent and received by each process.
Sparse matrix multiplication usually achieves only a small fraction of the peak performance of a modern processor. Distributed solutions for sparse matrix multiplication lead to significant network communication and network bandwidth is usually the bottleneck. The distributed solution also imposes challenges in achieving load balancing. Zheng et al. [79] explore a solution that scales sparse matrix dense matrix multiplication (SpMM) on a multi-core machine with commodity SSDs and perform SpMM in semi-external memory (SEM) by keeping one or more columns of a dense matrix in memory while the sparse matrix is accessed from external memory. They demonstrated that the SEM solution uses the resources of a multi-core machine well and achieves performance that exceeds the state-of-the-art in-memory implementations.
Expert programmers can implement effective mapping but the manual process is expensive and error prone. The performance of SpMV is heavily dependent on the structure of the matrix, which is an unknown entity before run time; this motivates the idea of using machine learning for its optimization. There is very little work on the use of machine learning for the optimization of SpMV and most of the efforts are dedicated to automated format selection based on the sparse matrix features.

stored in a database along with the candidate set of optimal mapping strategy S. Each entry u ∈ U and associated solution s ∈ S in the training set consists of sparse features vector of u and optimal mapping strategy s.

A. DATA SET
The dataset is created using mostly square matrices from the SuiteSparse collection and comprises more than 1800 matrices. The matrices are chosen to make sure that we target applications from multidisciplinary domains. Table 4 lists the application domains of the selected matrices. Matrices are selected from 45 application domains and their names are listed in column 2. Count gives the total number of matrices selected in each application domain. The maximum number of matrices for a single domain (i.e. 164) belongs to the Subsequent Circuit Simulation problem domain, while Duplicate Optimization Problem, Directed graph and Directed Weighted Random Graph have only a single matrix each. The minimum and maximum number of rows for each application domain are listed in Column 4 (Min. Rows) and Column 5 (Max. Rows) respectively. The minimum number of rows for any single matrix in our data set is 5, which belongs to Directed weighted graph, while the maximum number of rows is 27993600 (Optimization problem). Similarly, the minimum nnz (see symbol Table 1) is 19 and the maximum nnz is 401232976. The last column shows the image of a randomly selected matrix from each application domain.

B. SPARSE MATRIX FEATURES
We tabulate some of the important features in Table 2. The first column lists the feature names along with the description in the second column. The third column lists the formulae used to get the numerical quantity for each of those features. The last column lists the computation complexity. The first 6 features, i.e. number of rows, columns, nnz, density, mean nnz in rows and columns, have computation complexity of Θ(1). The more complex features require a full scan of the matrix and thus have higher computation complexity of Θ(M), and their standard deviation has complexity Θ(2M). The costs associated with feature extraction can be amortized, as it is a part of the pre-processing step and is only done once for the set of matrices.
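As a minimal sketch of how such structural features might be derived from a matrix held in CSR form, the Python function below works on the CSR row-pointer array only. The function name `csr_row_features`, the dictionary keys, the use of the population standard deviation, and the particular features chosen are illustrative assumptions; they are not the exact feature set of Table 2.

```python
import math


def csr_row_features(n_rows, n_cols, row_ptr):
    """Row-based structural features of a sparse matrix stored in CSR.

    row_ptr is the CSR row-pointer array of length n_rows + 1, so that
    row_ptr[r + 1] - row_ptr[r] is the number of nonzeros in row r.
    Names and feature selection are hypothetical, for illustration only.
    """
    if len(row_ptr) != n_rows + 1:
        raise ValueError("row_ptr must have n_rows + 1 entries")

    # Cheap features: available without scanning the rows.
    nnz = row_ptr[-1] - row_ptr[0]
    density = nnz / (n_rows * n_cols)
    mean_nnz_row = nnz / n_rows

    # Features that need one pass over the rows.
    row_nnz = [row_ptr[r + 1] - row_ptr[r] for r in range(n_rows)]
    variance = sum((c - mean_nnz_row) ** 2 for c in row_nnz) / n_rows

    return {
        "rows": n_rows,
        "cols": n_cols,
        "nnz": nnz,
        "density": density,
        "mean_nnz_row": mean_nnz_row,
        "min_nnz_row": min(row_nnz),
        "max_nnz_row": max(row_nnz),
        "std_nnz_row": math.sqrt(variance),
    }


# A 3 x 4 example whose rows hold 2, 1 and 3 nonzeros respectively.
features = csr_row_features(3, 4, [0, 2, 3, 6])
```

Column-based counterparts would additionally need the CSR column-index array, which is why they fall into the more expensive group of features.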
FIGURE 10. The ZAKI+ predictive model.

nps (see Table 1). The same process is repeated for all mapping schemes and the minimum of the average execution time is recorded. The best time is selected from the minimum average execution times among all mapping schemes (Def., Node, CB, LB and BW), and is used to label our data set.
The selected set of features along with the best mapping strategy serves as input for the machine-learning algorithm. The selected features are listed in Table 2 and the feature analysis is presented in Table 3. The data set is first divided into testing and training sets. The extracted features are scaled to standardize the numerical values, thus preventing the machine-learning algorithm from giving high weight to higher values and lower weight to lower values. We have used standardization, which involves rescaling the numerical values of features to mean = 0 and variance = 1, and is represented by the following formula.
\[ X_{new} = \frac{X_i - X_{mean}}{\text{standard deviation}} \tag{1} \]
Algorithm 3 shows how the training phase is performed. The scaled data set S′ is divided into training data S′_train and testing data S′_test as shown in line 2 to 5. The ML model is trained on S′_train and evaluated with S′_test. We have used Decision tree (DT) and the ensemble ML algorithms Random Forest (RF) and Extreme Boosting (XG). For any unseen matrix our predictive model predicts the optimal mapping strategy, which gives the best performance.

minimum execution time i.e. Exetime″_avg, among all the mapping schemes recorded against each matrix as shown in line 14 to 16. The class or label of each matrix is defined by using the minimum of Exetime″_avg. The labeled dataset comprising sparse matrix features f and mapping scheme ŵ is used to train our machine-learning model.
We have experimented with 16 nodes with 24 cores each, and each SpMV operation is performed 2000 times to avoid anomalies; the average execution time of the 2k iterations is used. Figure 8 and Figure 9 show execution iterations for selected matrices on a single node (with 24 processes) and on multiple nodes (8 nodes with 190 processes) respectively. The x-axis shows the number of iterations, i.e. 2k, and the y-axis represents the execution time on a logarithmic scale.

F. FEATURES SCALING
Feature scaling is a pre-processing step for standardizing highly varying feature values/magnitudes into a fixed range, to prevent the machine-learning algorithm from giving high weight to high values and vice versa. There are numerous feature-scaling techniques but the most common techniques are standardization and Min-Max Normalization. We have used standardization to standardize features. The standardization involves rescaling of a feature value to mean = 0 and variance = 1.
\[ X_{new} = \frac{X_i - X_{mean}}{\text{standard deviation}} \tag{2} \]

G. EVALUATION METRICS
1) CLASSIFICATION ACCURACY
Different evaluation metrics can be used for classification problems. The classification accuracy can be defined as the ratio of the number of correct predictions to the total number of input samples.
\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{total number of input samples}} \tag{3} \]
2) PERFORMANCE GAIN
To compare the performance gain of the proposed mechanism we have used the Geometric Mean of Normalized Performance (GMNP), defined as
\[ GMNP = \left( \prod_{i=1}^{N} \frac{Best_i}{Predicted_i} \right)^{1/N} \tag{4} \]
where N is the number of matrices in a test set, Best_i is the minimum time of the ith matrix among all the
schemes, and Predicted_i is the mapping scheme selected by our machine-learning model.

H. EXPERIMENTAL PLATFORM
The experiments have been performed on the Aziz supercomputer at King Abdul Aziz University, Saudi Arabia. Aziz has 496 nodes with 11,904 computing cores: 380 standard compute nodes (9120 cores) with 96 GB (4GB per core) and 112 high memory compute nodes (2688 cores) with 256 GB (10.6GB per core) for applications that require large memory for their execution. Each compute node has dual-socket Intel Xeon E5-2695v2 12-core processors running at 2.4GHz. The compute cluster has a total memory capacity of 66TB and a peak performance of nearly 230 TFlops. The Infiniband interconnect provides highly scalable, high speed and low latency data transportation, configured in a full bisectional bandwidth non-blocking network fabric with zero copy operations using Remote Direct Memory Access (RDMA). It provides 40Gbps full bisectional bandwidth communication between any two end points. Table 5 lists the software specifications.

TABLE 5. Software specifications.

V. ANALYSIS OF MAPPING STRATEGIES FOR SPMV PARALLELIZATION
The optimal number of processes and where those processes should be mapped has a great impact on the performance of the parallel application, as is evident in Figure 10.
Let np denote the total number of processes, p denote a single process and t_i represent the execution time of the ith matrix, where 0 < i < N and N is the total number of matrices in our data set. The execution time of each matrix is recorded against different numbers of processes, and if the execution time kept on decreasing with an increasing number of processes, we continued scaling up each matrix. Depending on the size of the matrix and its sparsity pattern, there comes a breakeven point where the execution time starts increasing with increasing parallelism as the communication between processes starts to dominate the overall time; this is our stopping criterion for further scaling. Each matrix is executed 2k times for different numbers of processes, and the minimum/maximum execution time is selected by choosing the minimum/maximum among all the np (see Table 1) against that matrix as shown in the following equations. The average time is calculated by taking the mean execution time of each matrix executed with different numbers of processes.
\[ t_i^{min} = \min(t_1, t_2, \ldots, t_{np}) \tag{5} \]
\[ t_i^{avg} = \frac{\sum_{i=1}^{np} t_i}{np} \tag{6} \]
The sequential and minimum execution times are represented by t_i^s and t_i^min. The speedup of the ith matrix is calculated with the following equation.
\[ Speedup_i = \frac{t_i^{s}}{t_i^{min}} \tag{7} \]
Speedup against the entire data set is calculated as follows
\[ Speedup = \frac{\sum_{i=1}^{N} t_i^{s}}{\sum_{i=1}^{N} t_i^{min}} \tag{8} \]
\[ Speedup_{avg} = \frac{\sum_{i=1}^{N} t_i^{avg}}{\sum_{i=1}^{N} t_i^{min}} \tag{9} \]
Minimum execution time and speedup are calculated using equations 5 and 7 respectively. t_i^avg is the average time calculated by taking the mean execution time of each matrix executed with different numbers of processes as shown in equation 6. The same process is repeated for all the mapping schemes to get the minimum of each mapping policy, i.e. Node, BW, LB, CB, Def(min), against each matrix. The term Def used in this paper refers to Def(min), which is calculated as the minimum execution time across all np (see Table 1) with the default MPI mapping scheme. Speedup and average speedup are calculated using equations 8 and 9 respectively for the entire data set.
The best time is calculated by choosing the minimum time among all the minimums calculated for each matrix in each mapping scheme. Let Best_i denote the best time of the ith matrix, and let Node_i, BW_i, LB_i, CB_i, and Def_i be the minimum times for the ith matrix in each of these mapping schemes. The best time for the ith matrix is calculated using equation 10.
\[ Best_i = \min(Node_i, BW_i, LB_i, CB_i, Def_i) \tag{10} \]
The Best time calculated here is used to define a label for each matrix to train our machine-learning model.
Figure 11 shows the execution time comparison (on log scale) of choosing the optimal mapping strategy, i.e. Best, with Def(min) (t_i^min with default mapping) and the four mapping schemes, including serial time and default average time (t_i^avg with default mapping), for the entire data set. The y-axis shows the execution time on a logarithmic scale and the x-axis shows the different mapping schemes, i.e. Node, BW, LB, CB, Def(min), serial execution time, Def(avg) and Best time (see equation 10).
With more than 1800 matrices in our data set, sequential execution took almost 132291.1 seconds (36.75 hrs), the average execution time was 16421.58 seconds (4.5 hrs), and this is reduced to 5521.83 seconds (1.53 hrs) by using only the optimal number of processes but with the default mapping scheme. Changing the mapping scheme helped to reduce the execution time further: with CB the time is reduced to 4341.43 sec (1.2 hrs), with LB to 4195.61 sec (1.16 hrs), with BW to 4167.4 sec (1.15 hrs), and with Node to 3944.86 sec (1.09 hrs).
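The quantities in equations (5) to (10) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `summarize_matrix` and `aggregate_speedups`, the dictionary layout, and all timing values below are made up for the example and are not measurements from the paper. Each scheme maps to a list of execution times, one per process count tried.

```python
def summarize_matrix(t_seq, times_by_scheme, default="Def"):
    """Per-matrix statistics following equations (5), (6), (7) and (10).

    times_by_scheme maps a mapping-scheme name to the execution times
    recorded for each process count. All names here are hypothetical.
    """
    # Equation (5), applied to every mapping scheme.
    mins = {scheme: min(times) for scheme, times in times_by_scheme.items()}
    # Equation (10): the scheme with the smallest minimum becomes the label.
    label = min(mins, key=mins.get)
    default_times = times_by_scheme[default]
    return {
        "t_seq": t_seq,
        "t_min": mins[default],                            # Def(min)
        "t_avg": sum(default_times) / len(default_times),  # equation (6)
        "speedup": t_seq / mins[default],                  # equation (7)
        "best": mins[label],
        "label": label,
    }


def aggregate_speedups(rows):
    """Data-set level speedups following equations (8) and (9)."""
    total_min = sum(r["t_min"] for r in rows)
    speedup = sum(r["t_seq"] for r in rows) / total_min
    speedup_avg = sum(r["t_avg"] for r in rows) / total_min
    return speedup, speedup_avg


# Made-up timings (seconds) for two matrices and two schemes.
row_a = summarize_matrix(8.0, {"Def": [4.0, 2.0, 3.0], "Node": [3.5, 1.5, 2.5]})
row_b = summarize_matrix(2.0, {"Def": [1.0, 0.5], "Node": [1.0, 0.8]})
speedup, speedup_avg = aggregate_speedups([row_a, row_b])
```

Note that the aggregate speedups are ratios of summed times rather than means of per-matrix ratios, so matrices with long execution times dominate the result.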
FIGURE 15. Execution time comparison of different mapping schemes for all matrices in the data set sorted by rows (min to max).
FIGURE 18. Speedup comparison of different mapping schemes for all.
FIGURE 19. Application domains execution time comparison (a,b,c) and speedup (d,e,f).
B. PERFORMANCE GAIN
The performance gain of ZAKI+ and the other mapping strategies is presented in Figure 22. We have used the Geometric Mean of Normalized Performance (GMNP), calculated using equation 4, to compare the performance.
As shown in Figure 22, ZAKI+ has achieved almost 98% of that of Best time for the test set. All mapping schemes have performed well compared to default. Node, LB and BW have shown almost the same performance and are slightly better than CB. Def, being the worst, has shown almost 68% of the maximum possible performance, i.e. Best, which can be achieved by choosing the right mapping scheme each time for all the matrices in a test set.
The high performance gain is due to the fact that even the mispredictions choose a mapping scheme with execution time very close to the best, as is evident in Figure 23, which plots the best, default and predicted results. The x-axis shows the matrices in our test set sorted with respect to their
execution time (minimum to maximum) and the y-axis shows the execution time in seconds on a logarithmic scale. It is observed that almost all the predictions and mispredictions are very close to the best available option, as the performance difference between different mapping schemes is very narrow, resulting in high performance gain.
Figure 24 shows the average execution time comparison of different mapping schemes with ZAKI+, for the entire data set of 1838 matrices. As is clearly evident in Figure 24, ZAKI+ took 3875.91 seconds and outperformed all others by a big margin. Node(avg) performed marginally better than other mapping schemes with 11716.23 seconds. ZAKI+ performed almost 3× better than Node(avg),

FIGURE 22. Geometric mean of normalized performance of proposed mechanism.

VII. CONCLUSION
In this paper, we proposed ZAKI+, a data-driven and machine-learning approach to predict the optimal mapping of processes and data (i.e., matrix partitions) on the underlying distributed memory machine architecture for SpMV computations of an arbitrary sparse matrix. It allows application scientists to automatically, effortlessly, and speedily obtain the best configuration (including the matrix/data distribution, optimal number of processes, and mapping strategy), and hence the best performance, for the execution of the SpMV computations for a given sparse matrix. We have used 1838 real-world sparse matrices associated with 45 application domains to train and test the tool. The execution times for the predicted optimal configuration of SpMV computations are compared with the average execution times of MPI default mapping policy; ZAKI+ provides 4.24 times aggregated speedup over the MPI default mapping policy with average parallel execution times. We have provided a first-ever detailed comparative analysis of multiple MPI process mapping strategies (Node binding, Latency binding, Bandwidth binding, and Cyclic binding) for SpMV computations. The methodology to use multiple mapping
strategies for prediction is itself a novel contribution of this paper. This is the first work of its kind where the sparsity structures of matrices have been exploited to predict the optimal mapping of the processes and data in distributed memory environments by using different base and ensemble machine learning methods.
It is observed that changing the mapping scheme has increased the performance for the entire data set compared to the MPI default mapping strategy (shown in Figure 15 to Figure 18). Node and LB have shown the best performance while BW and CB came a close second in terms of speedup against Def. Most of the matrices in our data set have shown best performance with Node, followed by LB (Figure 20). The least execution time is chosen among all these mapping schemes and labeled accordingly, to train the machine learning model. The performance of all studied mapping schemes is better than Def, and the performance differences within these mapping schemes are quite small, resulting in the high performance gain of 97.79% of the ideally attainable performance for our ZAKI+ tool.
In the future, we will enhance the proposed techniques by incorporating additional relevant features and increasing the dataset, both in terms of the number and size of sparse matrices and application domains. We are also planning to extend it further to the hybrid MPI/OpenMP programming model, where MPI is responsible for inter-node communication and OpenMP is used for fine-grained parallelism. Moreover, we plan to extend our tool to incorporate energy

are likely to transform the future of computing infrastructures. The trend would be the integration of computing at exascale (and beyond) with big data technologies and provision of on-demand service-oriented high performance computing together with the required data, AI and other applications. The mapping of data and processes (related to smart applications) onto the underlying cyberphysical converged infrastructure therefore would be of paramount importance. This area of convergence is in its infancy [38] and our future work will provide more in-depth proposals and analysis on these aspects of the SpMV computations.

ACKNOWLEDGMENT
The experiments performed in this paper were executed on the Aziz supercomputer being managed by the HPC Center at the King Abdul-Aziz University.

REFERENCES
[1] National Institute of Standards and Technology. Cyber-Physical Systems. Accessed: May 23, 2019. [Online]. Available: https://www.nist.gov/el/cyber-physical-systems
[2] W. Yu, F. Liang, X. He, W. G. Hatcher, C. Lu, J. Lin, and X. Yang, ‘‘A survey on the edge computing for the Internet of Things,’’ IEEE Access, vol. 6, pp. 6900–6919, 2018.
[3] J. Lin, W. Yu, N. Zhang, X. Yang, H. Zhang, and W. Zhao, ‘‘A survey on Internet of things: Architecture, enabling technologies, security and privacy, and applications,’’ IEEE Internet Things J., vol. 4, no. 5, pp. 1125–1142, Oct. 2017.
[4] R. Mehmood, B. Bhaduri, I. Katib, and I. Chlamtac, Eds., Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018.
efficiency optimization of SpMV computations. We will also [5] R. Mehmood, S. See, I. Katib, and I. Chlamtac, Eds., Smart Infras-
extend our proposed tool, ZAKI+ to work for dense matrix tructure and Applications: Foundations for Smarter Cities and Societies
(EAI/Springer Innovations in Communication and Computing). Cham,
vector multiplication. Switzerland: Springer, 2019.
It was mentioned that the manual process of trial and error [6] R. Mehmood, F. Alam, N. N. Albogami, I. Katib, A. Albeshri, and
experimentation to find the best mapping of processes and S. M. Altowaijri, ‘‘UTiLearn: A personalised ubiquitous teaching and
learning system for smart societies,’’ IEEE Access, vol. 5, pp. 2615–2635,
data on a given architecture and resources for computing 2017.
SpMV is time-consuming and frustrating, needing a complete [7] F. Alam, R. Mehmood, I. Katib, N. N. Albogami, and A. Albeshri, ‘‘Data
search of the node and processor core space to find the opti- fusion and IoT for smart ubiquitous environments: A survey,’’ IEEE
Access, vol. 5, pp. 9533–9554, 2017.
mal configuration for the computation. The ability to provide
[8] F. Alam, R. Mehmood, I. Katib, and A. Albeshri, ‘‘Analysis of eight data
the optimal process configuration speedily is a crucial advan- mining algorithms for smarter Internet of Things (IoT),’’ Procedia Comput.
tage of the ZAKI+ tool. Our future work will look into ana- Sci., vol. 58, pp. 437–442, Jan. 2016.
lyzing the time profile of the predictive model with the aim [9] M. V. Tabib, A. Rasheed, and T. P. Uteng, ‘‘Methodology for assessing
cycling comfort during a smart city development,’’ Energy Procedia,
to develop a real-time tool for process mapping prediction. vol. 122, pp. 361–366, Sep. 2017.
The current trend in ICT is towards the convergence of big [10] G. Triscone, ‘‘Computational fluid dynamics as a tool to predict the air
data, HPC and AI. This paper has attempted to contribute to pollution dispersion in a neighborhood—A research project to improve
the quality of life in cities,’’ Int. J. Sustain. Dev. Plan., vol. 11, no. 4,
this convergence and has applied it to the area of SpMV com- pp. 546–557, Aug. 2016.
putations. The challenges related to the mapping of data and [11] C. G. García, D. Meana-Llorián, B. C. P. G-Bustelo, J. M. C. Lovelle,
processes onto distributed memory architectures are not spe- and N. Garcia-Fernandez, ‘‘Midgar: Detection of people through computer
cific to SpMV computations alone. Various cyber-physical vision in the Internet of Things scenarios to improve the security in Smart
Cities, Smart Towns, and Smart Homes,’’ Future Gener. Comput. Syst.,
systems will comprise compute intensive machine learn- vol. 76, pp. 301–313, Nov. 2017.
ing applications, such as SpMV, and these will need to be [12] A. S. Montemayor, J. J. Pantrigo, and L. Salgado, ‘‘Special issue on real-
optimally mapped onto the underlying cyber-physical and time computer vision in smart cities,’’ J. Real-Time Image Process., vol. 10,
no. 4, pp. 723–724, Dec. 2015.
exascale computing infrastructure. CPSs in the future will [13] E. Estrada, R. Maciel, A. Ochoa, B. Bernabe-Loranca, D. Oliva, and
comprise an ecosystem of digital infrastructures that are able V. Larios, ‘‘Smart City visualization tool for the Open Data georeferenced
to work together and enable dynamic real-time interactions analysis utilizing machine learning,’’ Int. J. Combinat. Optim. Problems
Inform., vol. 9, no. 2, pp. 25–40, 2018.
between various CPS subsystems. Technologies such as big
[14] A. Rahman, J. Jin, A. Cricenti, A. Rahman, M. Palaniswami, and T. Luo,
data, pervasive, cloud and fog computing, as well as the ‘‘Cloud-enhanced robotic system for smart city crowd control,’’ J. Sens.
increasingly complex demands of smart cities and societies, Actuator Netw., vol. 5, no. 4, p. 20, Dec. 2016.
[15] D. G. Aliaga, “3D design and modeling of smart cities from a computer graphics perspective,” ISRN Comput. Graph., vol. 2012, Dec. 2012, Art. no. 728913.
[16] R. Gade, “Thermal imaging systems for real-time applications in smart cities,” Int. J. Comput. Appl. Technol., vol. 53, no. 4, pp. 291–308, 2016.
[17] M. Akcin, A. Kaygusuz, A. Karabiber, S. Alagoz, B. B. Alagoz, and C. Keles, “Opportunities for energy efficiency in smart cities,” in Proc. 4th Int. Istanbul Smart Grid Congr. Fair (ICSG), Apr. 2016, pp. 1–5.
[18] M. Zappatore, A. Longo, and M. A. Bochicchio, “Crowd-sensing our smart cities: A platform for noise monitoring and acoustic urban planning,” J. Commun. Softw. Syst., vol. 13, no. 2, pp. 53–67, Jun. 2017.
[19] J. P. Bello, C. Mydlarz, and J. Salamon, “Sound analysis in smart cities,” in Computational Analysis of Sound Scenes and Events, Cham, Switzerland: Springer, 2018, pp. 373–397.
[20] R. Mehmood, R. Meriton, G. Graham, P. Hennelly, and M. Kumar, “Exploring the influence of big data on city transport operations: A Markovian approach,” Int. J. Oper. Prod. Manag., vol. 37, no. 1, pp. 75–104, Jan. 2017.
[21] R. Mehmood and G. Graham, “Big data logistics: A health-care transport capacity sharing model,” Procedia Comput. Sci., vol. 64, pp. 1107–1114, Oct. 2015.
[22] R. Mehmood and J. A. Lu, “Computational Markovian analysis of large systems,” J. Manuf. Technol. Manage., vol. 22, no. 6, pp. 804–817, Jul. 2011.
[23] S. Altowaijri, R. Mehmood, and J. Williams, “A quantitative model of grid systems performance in healthcare organisations,” in Proc. Int. Conf. Intell. Syst., Modelling Simulation, Jan. 2010, pp. 431–436.
[24] R. Mehmood, R. Alturki, and S. Zeadally, “Multimedia applications over metropolitan area networks (MANs),” J. Netw. Comput. Appl., vol. 34, no. 5, pp. 1518–1529, 2011.
[25] T. E. H. El-Gorashi, B. Pranggono, R. Mehmood, and J. M. H. Elmirghani, “A data mirroring technique for SANs in a metro WDM sectioned ring,” in Proc. Int. Conf. Opt. Netw. Design Modeling (ONDM), Mar. 2008, pp. 1–6.
[26] E. Alamoudi, R. Mehmood, A. Albeshri, and T. Gojobori, “DNA profiling methods and tools: A review,” in Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018, pp. 216–231.
[27] A. Khanum, A. Alvi, and R. Mehmood, “Towards a semantically enriched computational intelligence (SECI) framework for smart farming,” in Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018, pp. 247–257.
[28] M. Aqib, R. Mehmood, A. Alzahrani, I. Katib, and A. Albeshri, “A deep learning model to predict vehicles occupancy on freeways for traffic management,” in Proc. Int. J. Comput. Sci. Netw. Secur., vol. 18, no. 12, pp. 246–254, 2018.
[29] M. Aqib, R. Mehmood, A. Albeshri, and A. Alzahrani, “Disaster management in smart cities by forecasting traffic plan using deep learning and GPUs,” in Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018, pp. 139–154.
[30] Y. Arfat, “Enabling smarter societies through mobile big data fogs and clouds,” Procedia Comput. Sci., vol. 109, pp. 1128–1133, May 2017.
[31] J. Schlingensiepen, R. Mehmood, F. C. Nemtanu, and M. Niculescu, “Increasing sustainability of road transport in European cities and metropolitan areas by facilitating autonomic road transport systems (ARTS),” in Proc. Sustain. Automot. Technol. 5th Int. Conf. (ICSAT), 2014, pp. 201–210.
[32] M. Aqib, R. Mehmood, A. Alzahrani, I. Katib, A. Albeshri, and S. M. Altowaijri, “Rapid transit systems: Smarter urban planning using big data, in-memory computing, deep learning, and GPUs,” Sustainability, vol. 11, no. 10, p. 2736, May 2019.
[33] M. Aqib, R. Mehmood, A. Alzahrani, I. Katib, A. Albeshri, and S. M. Altowaijri, “Smarter traffic prediction using big data, in-memory computing, deep learning and GPUs,” Sensors, vol. 19, no. 9, p. 2206, May 2019.
[34] F. Alam, R. Mehmood, and I. Katib, “D2TFRS: An object recognition method for autonomous vehicles based on RGB and spatial values of pixels,” in Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 224, 2018, pp. 155–168.
[35] Y. Arfat, R. Mehmood, and A. Albeshri, “Parallel Shortest Path Graph Computations of United States Road Network Data on Apache Spark,” in Int. Conf. Smart Cities, Infrastructure, Technol. Appl., vol. 2017, pp. 323–336.
[36] S. Suma, R. Mehmood, and A. Albeshri, “Automatic event detection in smart cities using big data analytics,” in Proc. Int. Conf. Smart Cities, Infrastruct., Technol. Appl. (SCITA) Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol. 224, 2017, pp. 111–122.
[37] S. Suma, R. Mehmood, N. Albugami, I. Katib, and A. Albeshri, “Enabling next generation logistics and planning for smarter societies,” Procedia Comput. Sci., vol. 109, pp. 1122–1127, May 2017.
[38] K. Asanovic, R. Bodik, J. Demmel, T. Keaveny, K. Keutzer, J. Kubiatowicz, N. Morgan, D. Patterson, K. Sen, J. Wawrzynek, D. Wessel, and K. Yelick, “A view of the parallel computing landscape,” Commun. ACM, vol. 52, no. 10, pp. 56–67, Oct. 2009.
[39] T. Muhammed, R. Mehmood, A. Albeshri, and I. Katib, “SURAA: A novel method and tool for Loadbalanced and coalesced SpMV computations on GPUs,” Appl. Sci., vol. 9, no. 5, p. 947, Mar. 2019.
[40] H. Alyahya, R. Mehmood, and I. Katib, “Parallel sparse matrix vector multiplication on intel MIC: Performance analysis,” in Smart Societies, Infrastructure, Technologies and Applications—SCITA (Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018, pp. 306–322.
[41] M. Kwiatkowska, D. Parker, and Y. Zhang, “Dual-processor parallelisation of symbolic probabilistic model checking,” in Proc. IEEE Comput. Soc. 12th Annu. Int. Symp. Modeling, Anal., Simulation Comput. Telecommun. Syst. (MASCOTS), Oct. 2004, pp. 123–130.
[42] R. Mehmood and J. Crowcroft, “Parallel iterative solution method for large sparse linear equation systems,” Univ. Cambridge, Cambridge, U.K., Tech. Rep. UCAM-CL-TR-650, 2005.
[43] M. Chen, S. Mao, and Y. Liu, “Big data: A survey,” Mobile Netw. Appl., vol. 19, no. 2, pp. 171–209, Apr. 2014.
[44] E. Alomari and R. Mehmood, “Analysis of tweets in Arabic language for detection of road traffic conditions,” in Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering), vol. 224. Cham, Switzerland: Springer, 2018, pp. 98–110.
[45] S. Usman, R. Mehmood, and I. Katib, “Big data and HPC convergence: The cutting edge and outlook,” in Smart Societies, Infrastructure, Technologies and Applications (Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering), vol. 2248. Cham, Switzerland: Springer, 2018, pp. 11–26.
[46] R. Farber. (2018). The Convergence of Big Data and Extreme-Scale HPC. HPC Wire. Accessed: Nov. 1, 2011. [Online]. Available: https://www.hpcwire.com/2018/08/31/the-convergence-of-big-data-and-extreme-scale-hpc/
[47] M. Grossman, C. Thiele, M. Araya-Polo, F. Frank, F. O. Alpak, and V. Sarkar, “A survey of sparse matrix-vector multiplication performance on large matrices,” Aug. 2016, arXiv:1608.00636. [Online]. Available: https://arxiv.org/abs/1608.00636
[48] R. Mehmood, “Disk-based techniques for efficient solution of large Markov chains,” Ph.D. dissertation, Dept. School Comput. Sci., Univ. Birmingham, Birmingham, U.K., 2004.
[49] R. Mehmood, D. Parker, and M. Kwiatkowska, “An efficient BDD-based implementation of Gauss-Seidel for CTMC analysis,” Dept. School Comput. Sci., Univ. Birmingham, Birmingham, U.K., Tech. Rep. CSR-03-13, 2013.
[50] R. Mehmood, “A survey of out-of-core analysis techniques in stochastic modelling,” Dept. School Comput. Sci., Univ. Birmingham, Birmingham, U.K., Tech. Rep. CSR-03-7, Aug. 2003.
[51] (2018). Intel Math Kernel Library (Intel MKL)|Intel Software. Accessed: Mar. 24, 2019. [Online]. Available: https://software.intel.com/en-us/mkl
[52] The Trilinos Project. Accessed: Mar. 24, 2019. [Online]. Available: https://trilinos.org/publicRepo/
[53] CUSP. Accessed: Mar. 24, 2019. [Online]. Available: https://cusplibrary.github.io/
[54] cuSPARSE. Accessed: Mar. 24, 2019. [Online]. Available: https://developer.nvidia.com/cusparse
[55] X. Liu, M. Smelyanskiy, E. Chow, and P. Dubey, “Efficient sparse matrix-vector multiplication on x86-based many-core processors,” in Proc. 27th Int. ACM Conf. Int. Conf. Supercomput. (ICS), 2013, pp. 273–282.
[56] R. Mehmood, J. Crowcroft, and J. M. H. Elmirghani, “A parallel implicit method for the steady-state solution of CTMCs,” in Proc. 14th IEEE Int. Symp. Modeling, Anal., Simulation (MASCOTS), Sep. 2006, pp. 293–302.
[57] A. Chandramowlishwaran, S. Williams, L. Oliker, I. Lashuk, G. Biros, and R. Vuduc, “Optimizing and tuning the fast multipole method for state-of-the-art multicore architectures,” in Proc. IEEE Int. Symp. Parallel Distrib. Process. (IPDPS), Apr. 2010, pp. 1–12.
[58] M. Maggioni and T. Berger-Wolf, “Optimization techniques for sparse matrix–vector multiplication on GPUs,” J. Parallel Distrib. Comput., vols. 93–94, pp. 66–86, Jul. 2016.
[59] H. Anzt, V. Heuveline, J. I. Aliaga, M. Castillo, J. C. Fernández, R. Mayo, and E. S. Quintana-Ortí, “Analysis and optimization of power consumption in the iterative solution of sparse linear systems on multi-core and many-core platforms,” in Proc. Int. Green Comput. Conf. Workshops, 2011, pp. 1–6.
[60] A. Khajeh-Saeed, S. Poole, and J. B. Perot, “A comparison of multi-core processors on scientific computing tasks,” in Proc. Innov. Parallel Comput., Found. Appl. GPU, Manycore, Heterogeneous Syst. (InPar), San Jose, CA, USA, 2012.
[61] X. Feng, H. Jin, R. Zheng, K. Hu, J. Zeng, and Z. Shao, “Optimization of sparse matrix-vector multiplication with variant CSR on GPUs,” in Proc. IEEE 17th Int. Conf. Parallel Distrib. Syst. (ICPADS), Dec. 2011, pp. 165–172.
[62] O. Kislal, W. Ding, M. Kandemir, and I. Demirkiran, “Optimizing sparse matrix vector multiplication on emerging multicores,” in Proc. IEEE 6th Int. Workshop Multi-/Many-Core Comput. Syst. (MuCoCoS), Sep. 2013, pp. 1–10.
[63] S. Usman, R. Mehmood, I. Katib, A. Albeshri, and S. M. Altowaijri, “ZAKI: A smart method and tool for automatic performance optimization of parallel SpMV computations on distributed memory machines,” Mobile Netw. Appl., to be published.
[64] T. A. Davis and Y. Hu, “The University of Florida sparse matrix collection,” ACM Trans. Math. Softw., vol. 38, no. 1, pp. 1–25, Nov. 2011.
[65] S. Lee and R. Eigenmann, “Adaptive runtime tuning of parallel sparse matrix-vector multiplication on distributed memory systems,” in Proc. 22nd Annu. Int. Conf. Supercomput. (ICS), 2008, pp. 195–204.
[66] IBM Knowledge Center—Understanding MPI Process Placement and Affinity. Accessed: Mar. 20, 2019. [Online]. Available: https://www.ibm.com/support/knowledgecenter/en/SSZTET_10.2/admin/smpi02_proc_affinity_placement.html
[67] R. Saxena. How Decision Tree Algorithm Works. Accessed: Mar. 20, 2019. [Online]. Available: http://dataaspirant.com/2017/01/30/how-decision-tree-algorithm-works/
[68] V. Smolaykov. Ensemble Learning to Improve Machine Learning Results. Accessed: Mar. 20, 2019. [Online]. Available: https://blog.statsbot.co/ensemble-learning-d1dcd548e936
[69] E. Lutins. Ensemble Methods in Machine Learning: What are They and Why Use Them? Accessed: Mar. 20, 2019. [Online]. Available: https://towardsdatascience.com/ensemble-methods-in-machine-learning-what-are-they-and-why-use-them-68ec3f9fef5f
[70] N. Donges. The Random Forest Algorithm—Towards Data Science. Accessed: Mar. 20, 2019. [Online]. Available: https://towardsdatascience.com/december-edition-80d8992a0fc
[71] K. Nishida. Introduction to Extreme Gradient Boosting in Exploratory. Accessed: Mar. 20, 2019. [Online]. Available: https://blog.exploratory.io/introduction-to-extreme-gradient-boosting-in-exploratory-7bbec554ac7
[72] Z. Wang, M. F. P. O’Boyle, and M. K. Emani, “Smart, adaptive mapping of parallelism in the presence of external workload,” in Proc. IEEE/ACM Int. Symp. Code Gener. Optim. (CGO), Feb. 2013, pp. 1–10.
[73] Z. Wang, G. Tournavitis, B. Franke, and M. F. P. O’Boyle, “Integrating profile-driven parallelism detection and machine-learning-based mapping,” ACM Trans. Archit. Code Optim., vol. 11, no. 1, p. 2, 2013.
[74] E. Jeannot, G. Mercier, and F. Tessier, “Process placement in multicore clusters: Algorithmic issues and practical techniques,” IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 4, pp. 993–1002, Apr. 2014.
[75] M. Castro, L. F. W. Góes, C. P. Ribeiro, M. Cole, M. Cintra, and J.-F. Mehaut, “A machine learning-based approach for thread mapping on transactional memory applications,” in Proc. 18th Int. Conf. High Perform. Comput., Dec. 2011, pp. 1–10.
[76] A. Mansour, J. Götze, W.-C. Hsu, and S.-J. Ruan, “Sparse matrix-vector multiplication: A data mapping-based architecture,” in Proc. 15th Int. Conf. Parallel Distrib. Comput., Appl. Technol., 2014, pp. 152–158.
[77] V. Karakasis, G. Goumas, and N. Koziris, “Exploring the performance-energy tradeoffs in sparse matrix-vector multiplication,” in Proc. Workshop Emerg. Supercomput. Technol. (WEST)-ICS, 2011, pp. 1–6.
[78] S. Fujino and T. Nanri, “Parallelized balancing communication and execution for sparse matrix-vector multiplication,” Trans. Jpn. Soc. Simul. Technol., vol. 7, no. 2, pp. 37–41, 2015.
[79] D. Zheng, D. Mhembere, V. Lyzinski, J. T. Vogelstein, C. E. Priebe, and R. Burns, “Semi-external memory sparse matrix multiplication for billion-node graphs,” IEEE Trans. Parallel Distrib. Syst., vol. 28, no. 5, pp. 1470–1483, May 2017.
SARDAR USMAN received the M.S. degree in network system engineering from the University of Plymouth, U.K., and the Ph.D. degree in computer science from King Abdul Aziz University, Saudi Arabia, in 2019. His research interests include networking, high-performance computing, and machine learning.
RASHID MEHMOOD is currently a Research Professor of big data systems and the Director of research, training, and consultancy with the High Performance Computing Centre, King Abdulaziz University, Saudi Arabia. He has gained qualifications and work experience from universities in U.K., including Cambridge University and Oxford University. He has 23 years of academic and industrial experience in computational modeling, simulations, and design using computational intelligence, big data, and high-performance computing. His broad research aim is to develop multi-disciplinary science and technology to enable a better quality of life and smart economy with a focus on real-time intelligence and dynamic (autonomic) system management. He has published over 150 research papers, including six edited books. He is a Founding Member of the Future Cities and Community Resilience (FCCR) Network, a member of ACM and OSA, and the former Vice-Chairman of the IET Wales SW Network. He has organized and chaired international conferences and workshops, including EuropeComm 2009, Nets4Cars 2010–2013, SCE 2017–2019, SCITA 2017, and HPC Saudi 2018. He has led and contributed to academia-industry collaborative projects funded by EPSRC, EU, U.K. regional funds, and Technology Strategy Board U.K. with the value of over 50 million Euro.
IYAD KATIB received the B.S. degree in statistics/computer science from King Abdul Aziz University, in 1999, and the M.S. and Ph.D. degrees in computer science from the University of Missouri-Kansas City, in 2004 and 2011, respectively. He is currently an Associate Professor with the Computer Science Department and the current Vice Dean and the College Council Secretary of the Faculty of Computing and Information Technology (FCIT), King Abdulaziz University (KAU), where he is also the Director of the High Performance Computing Center. His current research interests include computer networking and high-performance computing.
AIIAD ALBESHRI received the M.S. and Ph.D. degrees in information technology from the Queensland University of Technology, Brisbane, QLD, Australia, in 2007 and 2013, respectively. He has been an Assistant Professor with the Computer Science Department, King Abdulaziz University, Jeddah, Saudi Arabia, since 2013. His current research interests include security and trust in cloud computing and big data.