

Energy-Aware Distributed Simulation for Mobile Platforms: An Empirical Study
FAHAD MAQBOOL1, MEESAM R. NAQVI2, AND ASAD W. MALIK3
1National University of Sciences and Technology, School of Electrical Engineering and Computer Science (e-mail: fmaqbool.mscs15seecs@seecs.edu.pk)
2National University of Sciences and Technology, School of Electrical Engineering and Computer Science (e-mail: snaqvi.mscs15seecs@seecs.edu.pk)
3National University of Sciences and Technology, School of Electrical Engineering and Computer Science (e-mail: asad.malik@seecs.edu.pk)
Corresponding author: Asad W. Malik (e-mail: asad.malik@seecs.edu.pk).


I. INTRODUCTION
The field of Parallel and Discrete Event Simulation (PDES) systems has evolved significantly since its birth in the 1970s and 80s [1]. However, it still remains an active area of research because of its applications in domains ranging from military, manufacturing, communication networks, computer systems, VLSI design and design automation to air traffic and road traffic systems. Simulation of a system may have several objectives, including: (i) understanding the behavior of a system; (ii) obtaining estimates of the performance of a system; (iii) guiding the selection of design parameters; and (iv) validation of a model. The availability of low-cost microcomputers has introduced simulation to many real-life applications.

One kind of discrete simulation is the fixed time increment, or time-stepped, approach; the other kind is the discrete-event method. At the basic level, a Discrete Event Simulation (DES) consists of an ordered list of events called the event-list, an event execution function, and state variables that represent the state of the system being modeled. The DES algorithm repeatedly performs the following steps: (1) it removes the event with the minimum simulation time from the event-list; (2) it evaluates the event message and executes the event execution function accordingly; (3) it modifies the state variables; and (4) it generates further events.

Traditional simulation systems are sequential; however, many practical simulations, e.g. in engineering applications, consume several hours (and even days) on a sequential machine. Parallel and distributed computing environments can be used to reduce the execution time of such simulation programs. Parallel and Discrete Event Simulation (PDES) reduces the overall execution time of a simulation by executing it on parallel or distributed computers. Thus, parallel and distributed simulation systems have emerged with the concept of utilizing high-performance computing infrastructure to efficiently execute complex simulation models [2].

PDES is a collection of processes running in parallel that interact through messages. These messages are used to encapsulate the events and to drive the simulation among the different processes. Events that simultaneously run on multiple computing systems need to be synchronized. Synchronization (or time management) algorithms are used to ensure that events are processed in the correct order, adhering to the local causality constraint. The local causality constraint ensures that a parallel or distributed simulation produces the same results as a sequential execution.

Fig. 1 shows the classification of traditional synchronization algorithms. There are two basic categories of time flow management:
• Time stepped approach, where the simulation time is evenly spaced along a sequence of equal sized time steps or intervals.
• Event driven approach, where the simulation time does not progress in time steps but only when something interesting happens, referred to as an "event."

Time stepped approaches are limited to specific applications. The PDES community widely adopted event driven approaches, which are categorized into two well-known synchronization mechanisms: (1) conservative and (2) optimistic. In the conservative approach, causality errors are strictly avoided by applying strategies to determine when it is safe to process an event. In the optimistic approach, the simulation continues until a causality error is detected, and the error is then handled by applying a rollback mechanism for recovery and later re-execution of the rolled-back events. There are two types of conservative mechanisms: synchronous and asynchronous. Synchronous algorithms require global synchronization to compute the Lower Bound on Timestamp (LBTS), thus all LPs proceed in a synchronous fashion. On the other hand, asynchronous algorithms do not use a global synchronization mechanism and the LPs proceed asynchronously.

FIGURE 1. Types of Synchronization Algorithms

The field of parallel and distributed simulation has been strongly influenced by emerging technologies. These technologies include massively parallel systems, Cloud computing, GPU computing, embedded computing systems and sensor networks. With the development of technology, the use of electronic devices is growing rapidly. Applications and services are hosted inside the Cloud and can be accessed through any device connected to the internet. In addition, with the inception of the internet-of-things (IoT), heterogeneous devices can become part of a grid network. Consequently, the available computation platforms have changed as compared to the conventional cluster environment. This advancement in technology has introduced new challenges for the research community. In a Cloud computing environment, the workload on a single node can affect the whole simulation process executing on the system. This can cause longer wait times for event execution, which can give rise to the straggler message problem [3]–[5].

Similarly, with IoT, any computing device can become part of a network to share its computing resources and store data. Conventional simulation protocols fail to perform efficiently on such networks or devices. Most sensors and smart-phones are resource constrained and cannot keep a large amount of data or perform large computations. Furthermore, the sending and receiving of data also costs in terms of energy and data transfer rate. Because of the ever growing demand for mobile devices and embedded systems in the microelectronics market, many modern applications are particularly designed to execute on mobile platforms. Overall performance in terms of energy consumption is the primary design factor for such applications. Therefore, to improve the performance and energy efficiency of modern computing applications and platforms, it is important to accurately estimate their power and energy consumption and to identify critical parameters and modules for further optimizations.

Execution of large-scale distributed simulation over mobile and embedded devices opens new research areas that need to be explored. Such research fields include energy-aware distributed simulation and dynamic data-driven applications. Many aspects of such applications have already been the focus of the research community. These systems can be useful in many applications such as manufacturing, telecommunications, preparation for inclement weather, defense, intelligent transportation systems and crisis management systems.

In battery operated systems, energy consumption is a major concern; however, minimizing power consumption may not always result in low energy consumption. Decreasing the frequency at which the CPU operates results in low power consumption, but it increases the total time required to complete the task. Thus, computations require more time to complete. The problem becomes more complex in an environment where heterogeneous devices are participating in distributed simulations.

In traditional parallel and distributed simulation, logical processes (LPs) are mapped onto different systems or processing cores. These LPs communicate with each other by exchanging timestamped messages. However, process mapping on heterogeneous or resource constrained devices can affect the performance of the entire simulation. Some simulation algorithms need a significantly large storage capacity to store the history of processed events, and more computation is required to undo out-of-order execution. This increases the number of memory accesses, the sending and receiving of new event messages and the execution of events, which requires a significant amount of energy. Thus, resource constrained devices are not suitable for traditional simulation algorithms.
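For the synchronous conservative mechanism described earlier (all LPs jointly computing the Lower Bound on Timestamp before advancing), one round of LBTS computation can be pictured as a global minimum reduction. The sketch below is illustrative only: the names local_clock and lookahead are ours, one LP per MPI rank is assumed, and messages in transit are ignored for brevity.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_clock = 10.0 * (rank + 1);  /* toy per-LP clock value  */
    double lookahead   = 5.0;                /* minimum timestamp delta */
    double bound       = local_clock + lookahead;
    double lbts;

    /* Global synchronization: every LP contributes its bound and all   */
    /* receive the minimum, i.e. the Lower Bound on Timestamp (LBTS).   */
    MPI_Allreduce(&bound, &lbts, 1, MPI_DOUBLE, MPI_MIN, MPI_COMM_WORLD);

    /* It is now safe to process pending events with timestamp < lbts.  */
    printf("rank %d: LBTS = %.1f, events below it are safe\n", rank, lbts);

    MPI_Finalize();
    return 0;
}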

Traditional PDES protocols are well tested and designed for distributed systems and Cloud infrastructure, but they still need to be tested for handheld, mobile and IoT devices. Moreover, traditional frameworks are not designed to support mobile or handheld devices that have memory and energy constraints. Therefore, a thorough analysis of traditional PDES frameworks is required in order to migrate resource hungry modules to the cloud, to be accessed through well-defined services.

As a case study, we initially performed the instrumentation of a PDES framework (Rensselaer's Optimistic Simulation System - ROSS) on a desktop computing environment to measure power, CPU usage, energy and memory consumption using the PHOLD benchmark. This case study includes the results of serial, parallel conservative and parallel optimistic approaches. This allows us to understand the resource utilization of the simulation approaches before adopting the best synchronization model for handheld, embedded and IoT devices. The objective is to precisely identify the modules that are resource hungry, so that those can be executed on cloudlets and traditional frameworks. Accurately identifying these modules allows simulations to be adapted to handheld devices with little modification.

Using these results, we proposed SEECSSim, a distributed simulation suite designed to work on mobile and embedded devices. It includes the core synchronization algorithms as classified in Fig. 1. The proposed suite includes the Chandy-Misra-Bryant (CMB) null message algorithm, Time-Stepped, Tree Barrier, Time Warp and Time Warp with the Wolf algorithm (wolf calls). SEECSSim will help researchers in selecting suitable algorithms for mobile and embedded systems. A correctly selected energy efficient algorithm can exploit the true potential of embedded systems for highly scalable parallel and distributed simulation.

The rest of the paper is organized as follows. Section II covers the literature review. Some of the algorithms in SEECSSim are discussed in Section III. Section IV covers the results and discussion. Finally, Section V concludes the paper.

II. LITERATURE REVIEW
Despite the need for profiling the power and performance of simulation protocols and creating energy-aware PDES platforms, there are only a few articles that address this issue. There exists substantial literature related to energy-aware computing in the domains of High-Performance Computing (HPC), mobile computing platforms and Wireless Sensor Networks (WSN). Moreover, different profiling methods, techniques, and tools have been proposed over the years. In this section, we briefly discuss some related contributions on performance analysis and energy-aware simulation for HPC systems as well as mobile and embedded systems.

Many tools have been developed over the years for profiling performance and power consumption, to generate, analyze and visualize the data of systems and applications from functional components. Tuning and analysis utilities [6]–[10] provide support for instrumentation and performance visualization of parallel applications. Isci et al. [11] presented an approach to estimate power consumption using performance logs. Similarly, the authors in [12] provided a framework called PowerPack that can be used for energy profiling and analysis of parallel applications on multi-core processors. The authors in [13] indicated that power profiles always correspond to the characteristics of the application, and that increasing the number of nodes results in more power consumption and does not always result in better performance. The authors in [14] analyzed power consumption in relation to software components. Their analysis shows that in different software scenarios, power consumption on a general-purpose computer system can vary from 12% to 20%. The authors in [15] and [16] focused on low power embedded systems to analyze and benchmark HPC applications for their energy consumption.

There is substantial work available on power-aware computing, and various techniques have been deployed to reduce power consumption [17]–[19]. Computing the energy consumed by each machine instruction is one way to profile the energy consumption of all functional components. Tiwari et al. [20] discussed functional level energy profiles obtained by computing the energy consumed by each machine instruction. However, the proposed work is limited to function level energy marking, whereas communication between processes also holds a major portion in parallel and distributed simulation. In a distributed simulation, different techniques are used to reduce the communication delay between processes [21]. One such approach is to utilize the different cores available in a physical system [22]. However, the synchronization algorithms incur overhead which is difficult to reduce.

Most of the existing work on energy-efficient computing has been done for the HPC environment [23]–[26], where different techniques are used to optimize the use of energy, such as DVFS, process migration, task consolidation and Dynamic Power Management (DPM). R. Child et al. [27] explored the features of Dynamic Voltage and Frequency Scaling (DVFS) to enhance performance and reduce power consumption by repeatedly reducing the operating frequencies of the cores. The authors investigated energy efficiency through DVFS for the Time Warp simulation algorithm. The proposed study was conducted over physical systems using the MPI version of the warped TW simulator.

Similarly, G. Tom et al. [28] described the integration of an energy-aware module to simulate the energy consumption of distributed systems. The authors provided an overview of energy-aware simulations and described the DVFS simulation tools required for obtaining accurate simulations in terms of power consumption.

This work is mostly related to the energy of cloud systems; therefore, the authors explored DVFS and cloud simulators in detail. Moreover, they also added DVFS features in one of the cloud simulators.

Communication is a common performance bottleneck for fine grained parallel applications like ROSS. The authors in [29] and [30] discussed techniques for improving network performance by reducing lock contention and overlapping communication. Jagtap et al. [31] analyzed the performance of the ROSS simulation framework on two different platforms and compared a multi-threaded implementation with an MPI based implementation. Erazo et al. [32] presented a case study to profile the energy consumption of distributed simulation and tested it on their PRIME simulator [33]. The authors of PRIME Sim concluded that using more nodes to achieve parallelism results in a significant increase in energy consumption.

In [34], Fujimoto shared future research challenges for PDES. These challenges include large-scale simulation of complex networks, exploiting GPUs, Cloud computing exploitation, composable simulation and the energy consumption of PDES. In PDES, energy consumption is less explored with respect to other aspects. The minimization of power consumption through a change in clock rate (DVFS) can eventually increase the overall energy required to complete the task. Schemes such as DVFS are more suited to data centers and supercomputers. Therefore, for IoT, power-aware and energy-aware techniques are more important design considerations for PDES over handheld devices, as energy consumption is based on many factors such as network communication, memory usage, etc.

Traditional simulations are designed for cluster environments. With the advancement of technology, the easy availability of infrastructure-as-a-service offers a flexible computing environment on a pay-as-you-go model. New PDES techniques have been proposed for such cloud architectures. In [4] the authors studied the execution of a conservative algorithm over various configurations of Amazon EC2. The objective was to assess the suitability of the cloud platform for distributed discrete event simulation. For a conservative algorithm, null messages play a significant role in the performance of the whole simulation system. Therefore, the authors tested various variations of synchronization algorithms, such as Chandy-Misra-Bryant, timeout based null message sending, deadlock avoidance based null messages, on-demand null messages, timeout protocols, etc. The results showed that the timeout and blocking protocols performed better in a cloud environment.

Cloud provides a multi-tenant paradigm; therefore, execution of optimistic PDES over the cloud results in a large number of rollbacks. This is because systems are not equally loaded in terms of the number of jobs. Moreover, some tasks require more computation whereas others need more communication. Similarly, Asad et al. [3] proposed a PDES model for the cloud environment to improve the performance of optimistic parallel simulation by dynamically defining barrier points, to reduce the total number of rollbacks. The experimental section shows a significant gain in terms of performance over traditional optimistic simulation.

Similarly, Yihua Wu et al. [35] presented a BOINC based system for the Cloud environment to execute parallel and distributed simulation over private cloud infrastructures. BOINC is a middleware developed for volunteer and grid computing. It is an open-source framework designed to support task distribution and result gathering in a client-server model. However, the use of a private cloud is not a recommended option due to its cost and other management factors. The main reason for adopting private clouds is the sensitive nature of simulation results.

In the context of distributed simulation over mobile/embedded devices, Biswas et al. [36] discussed techniques to create power profiles. The energy consumption of the simulation model, engine, computations and communications is separated in order to understand the energy consumption of each aspect of the simulation. They also presented a comparative analysis of the energy consumed by the Chandy-Misra-Bryant and YAWNS algorithms. Similarly, Malik et al. [37] analyzed the energy consumption of the Time Warp protocol over smart phones. Moreover, on-line distributed simulation, such as a traffic prediction system, requires a significant amount of energy. Neal et al. [38] analyzed the energy consumption of data driven traffic simulations on mobile devices. Online traffic prediction requires a significant amount of energy; therefore, understanding the energy consumption at various levels helps in optimizing the use of resources. The authors presented an empirical investigation of modules such as data transmission, gathering and traffic computations. Fujimoto et al. [39] presented detailed work on power efficient distributed simulation. The authors covered a few conservative and optimistic synchronization algorithms along with a discussion on energy efficient distributed simulations. The main objective of their work is to analyze the power consumption of various distributed simulation techniques along with profiling of the simulation engine, application, and communication. The experiments were conducted on multiple configurations, such as a Jetson TK1 development board and a quad-core LG Nexus 5 cellular phone.

In this study we have presented a detailed study of existing traditional algorithms on handheld devices. The analysis includes the execution time, CPU and memory usage, energy consumption and event rate. This article will serve as a guideline for the PDES community, especially new researchers, to help them in selecting the right protocols for embedded systems keeping in view the resource constraints.

The next sections cover some background, the experimental setup for desktop and mobile devices, their results and discussion.

III. BACKGROUND
A brief description of Parallel and Distributed Simulation platforms, especially Rensselaer's Optimistic Simulation System (ROSS), is presented in this section. We have selected ROSS to present a case study for the instrumentation of PDES systems with different synchronization algorithms.

TABLE 1: Commonly Used PDES Frameworks

1. GloMoSim (Qualnet) - Language: C-based PARSEC; Simulation model: Hybrid. Developed at the University of California, Los Angeles, the Global Mobile system Simulator (GloMoSim) is a set of library modules developed for parallel execution of wireless network simulations [40]. It is an extensible simulator implemented on shared memory and distributed memory systems.

2. DaSSF - Language: C++, Java; Simulation model: Conservative. DaSSF (Dartmouth Scalable Simulation Framework) is a Scalable Simulation Framework (SSF) for discrete-event simulation [41]. The SSF is designed to achieve interoperability with other SSF compliant frameworks. DaSSF is capable of simulating very large-scale networks with tens of thousands of complex nodes.

3. ARTIS + GAIA - Language: C with Java bindings; Simulation model: Hybrid. ARTIS (Advanced RTI System) is a middleware for Parallel and Distributed Simulation (PADS) that can simulate complex systems [42]. It provides a set of simple services to simulate massively populated system models. ARTIS uses an adaptive approach and makes use of physical allocation of LPs for efficient execution and communication. It supports different communication systems such as shared memory, MPI and network communication. GAIA (Generic Adaptive Interaction Architecture), built on top of ARTIS, is also an adaptive middleware that dynamically reallocates the LPs to optimize the simulation execution [43]. GAIA uses a migration policy to partition and allocate the interacting components over many LPs dynamically.

4. LUNES - Language: C; Simulation model: Conservative. LUNES (Large Unstructured NEtwork Simulator) is an agent-based simulator to model complex large-scale networks [44]. Its modular approach separates the phases of topology creation, protocol simulation and performance analysis. LUNES uses dynamic model partitioning and the simulation middleware services provided by the GAIA and ARTIS frameworks respectively.

5. ScipySim - Language: Python; Simulation model: Conservative. ScipySim is a distributed simulator for simulating heterogeneous systems, developed using the SciPy scientific computing platform [45]. It is based on the generalized Kahn theory of heterogeneous system semantics. It was designed to provide basic simulation capability to develop simulations using Python.

6. ROOT-Sim - Language: C; Simulation model: Optimistic. ROOT-Sim (The ROme OpTimistic Simulator) is an MPI based parallel simulation platform developed using C/POSIX technology [46]. To achieve high scalability and performance it uses a set of optimized protocols to minimize the run-time overhead.

7. SPaDES/Java - Language: Java; Simulation model: Optimistic. SPaDES/Java (Structured Parallel Discrete-event Simulation in Java) is a process oriented parallel simulator [47]. It was designed to isolate the synchronization and parallelization implementation details. It supports both sequential and parallel simulations.

8. ErlangTW - Language: Erlang; Simulation model: Optimistic. ErlangTW is a parallel and distributed simulator based on Time Warp synchronization [48]. Erlang is a functional and concurrent programming language specifically designed to build distributed systems. The ErlangTW simulation model can be executed on single-core processors, shared memory multiprocessors and distributed memory clusters.

9. GO-Warp - Language: Go; Simulation model: Optimistic. The GO-Warp simulator, implemented using the Go programming language, is also based on Time Warp synchronization [49]. It uses Samadi's algorithm for Global Virtual Time (GVT) computation. Using Go's concurrent execution and inter-process communication mechanisms, LPs are allowed to proceed with execution without blocking.

Few features of the ROSS framework are discussed and explained. In Table 1, we have briefly described different PDES frameworks to give an understanding of how some of these frameworks have been designed previously. Software and hardware tools and techniques to perform instrumentation and power consumption analysis are discussed in the last part of the section.

A. PARALLEL DISCRETE EVENT SIMULATION (PDES)
A discrete event framework is a system where state changes (events) happen at discrete occurrences in time, and events take zero time to happen. It is accepted that nothing (interesting) happens between two consecutive events; that means no state change happens in the system between the events. Such frameworks, which can be categorized as discrete event frameworks, can be modeled using Discrete Event Simulation (DES) systems.

A Discrete Event Simulation (DES) comprises events and Logical Processes (LPs). LPs are agent entities in a simulation system and store the simulation system states using state variables. A DES continues the execution by sending event messages between LPs to communicate state changes. It is important to execute events while adhering to the local causality constraint; if the simulation system does not respect causal ordering, a causality error occurs. This causality error is a logical error that occurs if an event with a higher time-stamp value is executed before an event with a smaller time-stamp [50]. In a discrete event simulation, a scheduler is the core part of the simulation engine. Hence, the performance of a discrete event simulation is directly influenced by the working of the scheduler. A DES can be implemented using sequential and parallel techniques. In a sequential approach, events are executed in time-stamp order in a serial manner. In this simplest approach the main focus is to execute a simulation model without concern for performance in terms of fast execution. While there can be many Logical Processes (LPs) in a sequential DES, all events are stored in a single priority queue ordered by their virtual timestamp and executed one by one as an ordered sequence [51].

Similarly, Distributed and Parallel Discrete Event Simulation (PDES) provides the capability of executing a single discrete event simulation program on multiple cores using parallel computing [52]. There are two main categories of time management/scheduling mechanisms that PDES widely follows: conservative and optimistic. In the conservative approach, causality errors are strictly avoided by applying strategies to determine when it is safe to process an event. In the optimistic approach, the simulation continues until a causality error is detected and the situation is then recovered by means of a rollback mechanism and later re-execution of the rolled-back events.

B. RENSSELAER'S OPTIMISTIC SIMULATION SYSTEM (ROSS)
In this study, we have used ROSS, a PDES framework that is based on the Message Passing Interface (MPI). ROSS is a high-performance and extremely modular PDES system that uses a small (and constant) amount of memory to keep state variables and execute events [53]. Its modular implementation, use of reverse computation, Kernel Processes and Fujimoto's GVT (Global Virtual Time) algorithm make it a state-of-the-art simulation system for performing experimental studies. Continuous analysis shows that ROSS outperforms the well-known GTW (Georgia Tech Time Warp) system. In this study, we analyze the performance and power consumption of the ROSS simulation implementation under the classical PHOLD simulation model benchmark. While ROSS is based on an optimistic scheduling approach, it also provides implementations for both the conservative and sequential approaches. Our study includes performance and power consumption analysis of the sequential, conservative and optimistic approaches.

IV. INSTRUMENTATION OF ROSS
We have used various well-known tools for carrying out an extensive analysis of the ROSS simulation system. These tools include a PIN-based tool developed by Intel to profile performance and other related metrics. PIN tools provide a dynamic binary instrumentation framework to perform instrumentation on IA-32 and x86-64 architectures. The framework can be used to analyze applications in user space and to perform instrumentation on compiled binary files. Therefore, there is no need to perform re-compilation to use PIN-based tools.

Moreover, to analyze the amount of time that the ROSS simulation system spends on locks/waits and to calculate average CPU usage, we have used VTune Amplifier. These results are then used to estimate the amount of speedup for the parallel version as compared to the serial one. Intel SoC Watch is used to get temperature statistics for all CPU cores while running the ROSS simulation. The Allinea Forge MAP tool is used to get insight into the power, energy and memory consumption of ROSS on the PHOLD benchmarking model with different numbers of logical processes (LPs).

A. EXPERIMENTAL SETUP
In this section, we describe the hardware setup we used to run simulations, along with a brief introduction of the benchmarking applications used to perform power profiling and performance analysis of ROSS. We used the PHOLD model to analyze the resource usage of ROSS on a shared memory multiprocessor under a saturated (parallelism more than processors) workload.

For this study, a 4th generation Intel® Core™ i7-4790 processor (Intel Haswell family) at 3.6 GHz (Hyper-Threading supported) with 4 cores, 8 threads, 8 MB cache, 8 GB RAM, 5 GT/s DMI2 and the Ubuntu 14.04.3 operating system with kernel version 3.19 has been used.


1) PHOLD Benchmark Model:
To evaluate the performance of synchronization protocols used in parallel and distributed simulation systems, many different benchmark models have been used. PHOLD is one of the most widely used benchmark applications [54], [55]. PHOLD (Parallel HOLD) is a parallel version of the HOLD model, a synthetic benchmark used for performance analysis of sequential discrete event list simulation algorithms. The PHOLD model consists of N fully connected logical processes (LPs). The simulation model starts with a number of objects known as Logical Processes (LPs), with each LP having a fixed number of events. The event execution function sends a message to another LP that is selected uniformly among all the logical processes in the simulation. On receiving this message, each LP sends another event message to a neighboring LP. In this way, the total number of events in the simulation remains constant. The message size, message population, timestamp increment and the message routing probabilities can be varied to test the simulation system [56]. The different profiling tools that we have used are briefly described here:

2) Energy, Memory Consumption - Allinea MAP:
Allinea MAP is a profiling tool designed for a wide range of applications, including parallel, single-threaded and multi-threaded (Pthread, OpenMP and MPI) applications that are based on Fortran, C and C++ [57]. Through analysis of any target application it pinpoints the bottlenecks in the code execution and keeps logs for power, energy and memory consumption traces.

3) CPU Usage - Intel® VTune Amplifier:
Intel® VTune™ Amplifier is a profiling tool that is used to optimize code for better performance by profiling the CPU usage of the system. It provides a user-friendly interface to analyze and obtain results using enriched performance insights. It helps application developers to develop code that is more threaded, scalable, vectorized and tuned. We have used VTune™ Amplifier to check the average CPU usage and the amount of time that the ROSS framework spends on locks and waits in a parallel setup.

4) CPU Cores Temperature - Intel® SoC Watch:
Intel SoC Watch is a command-line utility designed under the umbrella of the Intel energy profiler (a set of data collector tools) [58]. We have used this utility to study the temperature profiles of the CPU cores during execution of the ROSS framework for different numbers of LPs.

V. PRELIMINARY RESULTS AND ANALYSIS OF CASE STUDY
This section contains in-depth results obtained from the analysis of simulation runs using various different analysis tools. All the results, including CPU usage, CPU temperature, wait time, power, energy and memory consumption, are discussed in sub-sections. The results reported in this section are averaged over three independent simulation runs on the specified system, using the profiling tools mentioned in the previous section.

A. PHOLD EXECUTION
Results obtained by executing the PHOLD benchmark model in the ROSS framework with varying numbers of logical processes are discussed in this section. To determine how a component, or more specifically a synchronization algorithm, of a simulation engine should be developed for mobile devices, we have done a comparative analysis of synchronization algorithms in terms of their CPU usage, memory consumption, total execution time, and energy and power consumption. These results helped us to determine which simulation model is efficient in terms of the parameters that are used in the performance analysis. Using these results we can not only develop efficient algorithms but also improve the performance of existing algorithms. Moreover, simulation components that are resource hungry can be redesigned or moved to cloudlets. Detailed results for the serial execution and the parallel conservative and optimistic simulation algorithms are presented in this section.

Results specific to the PHOLD benchmark in the ROSS framework are presented in Tables 2, 3 and 4 respectively. In all simulations, linear mapping is used between logical and physical processes. The total number of events is kept constant and is also used as the stopping condition for the simulation. As expected, serial execution takes more time as compared to parallel conservative and parallel optimistic simulation with varying numbers of LPs. For 1024 LPs, the sequential simulation took 34.827 seconds to complete, whereas the conservative and optimistic parallel approaches took 21.4 and 24.7 seconds respectively. For 524288 LPs, the sequential execution took 8.72 hours, whereas parallel conservative took 4.04 hours and parallel optimistic execution took 5.99 hours. The results showed that the conservative simulation execution outperforms the other techniques when running simulations with different numbers of LPs. As discussed earlier, in optimistic simulation there are out-of-order event executions that cause some events to roll back and then re-execute. Thus the total number of events is larger than the number of committed events, which results in reduced performance as compared to the conservative approach. Other functions that are responsible for the increased execution time are GVT computations, fossil collection and reverse computation.

Other interesting results specific to PHOLD execution in the ROSS framework are the memory usage and wastage in the case of the serial and both parallel techniques. For execution with 1024 LPs, serial execution used a total of 12,080 MB, parallel conservative used 11,528 MB and optimistic used 29,768 MB. Similarly, for 524,288 LPs serial execution used a total of 388,176 MB, parallel conservative used 105,552 MB and optimistic used 123,792 MB to complete the simulation. This trend is similar to the case of total execution time, and the reason is the same: every LP must implement the rollback mechanism, therefore the LPs need to maintain the history of all events to handle straggler/transient and anti-messages.
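For reference, the PHOLD behaviour described in Section IV-A (each handled event picks a destination LP uniformly at random and schedules exactly one new event, so the event population stays constant) can be sketched in plain C as below. This is an illustrative reconstruction, not actual ROSS code; the stand-in names merely mirror the roles of phold_event_handler, tw_event_new/tw_event_send and tw_rand_exponential that appear in the function-level profiles of Tables 5-7.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define NUM_LPS 8                               /* N fully connected LPs  */

static void send_event(int dest_lp, double ts)  /* stub: just report it   */
{
    printf("schedule event for LP %d at t=%.3f\n", dest_lp, ts);
}

static double rand_exponential(double mean)     /* timestamp increment    */
{
    double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return -mean * log(u);
}

/* One PHOLD event: pick a destination uniformly among all LPs and        */
/* schedule exactly one new event, keeping the event population constant. */
static void phold_handler(int self_lp, double now)
{
    int    dest = rand() % NUM_LPS;             /* uniform destination     */
    double ts   = now + rand_exponential(1.0);  /* exponential increment   */
    (void)self_lp;
    send_event(dest, ts);
}

int main(void)
{
    phold_handler(0, 0.0);                      /* handle one toy event    */
    return 0;
}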

TABLE 2: Results - Serial Execution of PHOLD with Varying Number of LPs


LPs
1024 2048 4096 8192 16384 32768 65536 131072 262144 524288
Running Time (Seconds) 34.827 72.272 148.343 303.874 658.453 1556.235 3516.398 7768.777 3730.151 31381.415
Event Rate (million events/sec) 2.94 2.83 2.76 2.70 2.49 2.11 1.86 1.69 1.76 1.67
Memory Allocated (MB) 12080 12816 14288 17232 23120 34896 58448 105552 58448 388176
Memory Wasted (MB) 533 341 469 213 213 213 213 212 213 210
Total Events Processed (billions) 0.102 0.205 0.410 0.819 1.638 3.277 6.554 13.107 6.554 52.428
Efficiency (%) 100 100 100 100 100 100 100 100 100 100

Some important parameters for the simulation synchronization schemes are reported in Tables 2, 3 and 4. The term efficiency is defined as the ratio of committed events to total events. As there is no rollback mechanism in the serial and parallel conservative approaches, both have 100% efficiency. On the other hand, in the optimistic approach, committed events are always fewer than total events because of the rollback mechanism; therefore, the efficiency of the parallel optimistic synchronization approach is always less than 100%. Moreover, for optimistic synchronization, simulation frameworks typically store event histories in order to resolve issues that occur due to causality errors. Storing event histories increases memory usage over time, and the memory must be reclaimed periodically to reduce the run-time memory requirement of the simulation framework. This reclamation process is commonly known as fossil collection and presents an overhead to the simulation execution. The high memory wastage is also evident from the large number of fossil collection attempts in optimistic execution to reclaim memory.

The total events processed in all three versions of execution ranged from 0.102 to 52.432 billion events with increasing numbers of LPs. Another important factor to measure the performance efficiency of the simulation system is the event rate. With an increasing number of LPs, the number of events to be processed increases and thus the communication overhead increases, which results in a decreased event rate. The event rate (events/second) is the lowest for the serial version (ranging from 2.94 to 1.67 million events/second) and highest for the conservative version (ranging from 4.78 to 3.60 million events/second), while the optimistic version lies in the middle (ranging from 4.14 to 2.52 million events/second). The reason behind the decreasing event rate is the increased number of LPs and thus the scheduling and communication overhead between the LPs. So far, these results suggest that the conservative approach is better, but we need to investigate the detailed device-specific results to reach a conclusion.

In parallel optimistic simulation, Global Virtual Time (GVT) is one of the core concepts and is also used for fossil collection; GVT acts as a barrier point in the past for rollback, and guarantees that no process can roll back to a timestamp smaller than this GVT value. It is similar to the concept of Lower Bound Timestamp (LBTS) in conservative simulation [59]. The GVT computations for the optimistic approach (ranging from 0.10 to 51.20 million computations) are slightly fewer than for the conservative approach (ranging from 0.20 to 51.40 million computations). This is because in the conservative approach more GVT computations are performed to avoid causality errors.

The rollback concept is only used in optimistic execution; a rollback occurs whenever a causality error is detected. Extensive rollbacks affect the efficiency of the simulation engine. There are two types of rollbacks: primary rollbacks and secondary rollbacks. A primary rollback occurs whenever an LP receives an event with a time-stamp lower than its local time. Primary rollbacks transitively propagate secondary rollbacks to other processes to revert the effects of previously sent messages [60]. The rollback results for optimistic execution indicate that as we increase the number of LPs there is a considerable decrease in rollbacks (ranging from 133218 rollbacks for 1024 LPs down to only 421 rollbacks for 524288 LPs). This is due to the fact that executions slow down because of the increased number of LPs.

B. CPU USAGE - INTEL VTUNE AMPLIFIER RESULTS
This section covers the results for wait time spent on locks, along with average CPU usage, to estimate the speedup provided by the parallel versions of the simulation as compared to the serial version.

1) Wait Time:
Wait time is the time spent by the simulation code on locks and waits in parallel execution. Results show the wait time of the serial simulation execution for different numbers of LPs. It is clearly visible that in serial execution the waiting time is very limited; at maximum the wait time is 0.007 seconds, as sequential execution does not have to wait for other results.

For the parallel versions, results show that optimistic execution spent more time on locks and waits as compared to the conservative execution. This increase in wait time in the case of optimistic execution (as we increased the number of LPs) is because of the rollback mechanism that increases the number of total events to be processed. The greater the number of total events to be processed, the greater the synchronization and scheduling overhead. It is also noted that the difference between the wait times of the parallel conservative and parallel optimistic models is less than 80 seconds, and at the same time the CPU usage in parallel optimistic is better compared to the other approaches (CPU usage results are discussed next).
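The primary and secondary rollbacks described above in Section V-A can be pictured with the small sketch below. It is purely illustrative: the types and the saved-state list are our own simplifications, and ROSS itself largely relies on reverse computation rather than state saving.

#include <stdio.h>

#define MAX_SAVED 64

typedef struct {
    double ts;          /* simulation time at which the state was saved */
    int    value;       /* the (toy) state variable itself              */
} saved_state_t;

typedef struct {
    double        local_clock;            /* local virtual time          */
    int           state;                  /* current state variable      */
    saved_state_t history[MAX_SAVED];     /* checkpoints since last GVT  */
    int           n_saved;
} lp_t;

/* Primary rollback: a straggler event arrived with ts < local_clock.    */
static void rollback(lp_t *lp, double straggler_ts)
{
    /* discard checkpoints newer than the straggler and restore the last */
    /* one that is not newer                                              */
    while (lp->n_saved > 0 &&
           lp->history[lp->n_saved - 1].ts > straggler_ts)
        lp->n_saved--;

    if (lp->n_saved > 0) {
        lp->state       = lp->history[lp->n_saved - 1].value;
        lp->local_clock = lp->history[lp->n_saved - 1].ts;
    }

    /* In a full implementation, anti-messages would be sent here for    */
    /* every event this LP generated after straggler_ts; the receiving   */
    /* LPs would then perform secondary rollbacks [60].                  */
    printf("rolled back to t=%.2f (state=%d)\n",
           lp->local_clock, lp->state);
}

static void on_event(lp_t *lp, double ts)
{
    if (ts < lp->local_clock)               /* causality violation?       */
        rollback(lp, ts);
    lp->local_clock = ts;
    lp->state++;                            /* toy state update           */
    if (lp->n_saved < MAX_SAVED)            /* checkpoint for later undo  */
        lp->history[lp->n_saved++] = (saved_state_t){ ts, lp->state };
}

int main(void)
{
    lp_t lp = { 0.0, 0, { { 0 } }, 0 };
    on_event(&lp, 1.0);
    on_event(&lp, 3.0);
    on_event(&lp, 2.0);                     /* straggler: triggers rollback */
    return 0;
}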

TABLE 3: Results - Parallel Conservative Execution of PHOLD with Varying Number of LPs
LPs
1024 2048 4096 8192 16384 32768 65536 131072 262144 524288
Running Time (Seconds) 21.432 42.233 83.687 169.071 375.403 770.323 1598.350 3513.903 7212.069 14542.773
Event Rate (million events/sec) 4.78 4.85 4.89 4.85 4.36 4.25 4.100 3.73 3.63 3.60
Memory Allocated (MB) 11528 11712 12080 12816 12873 17232 23120 34896 58448 105552
Memory Wasted (MB) 677 629 533 341 469 213 213 213 213 212
Total Events Processed (billions) 0.102 0.205 0.410 0.819 1.638 3.277 6.554 13.107 26.214 52.428
Total GVT Computations (million) 0.200 0.300 0.500 0.900 1.700 3.300 6.504 12.920 25.751 51.394
Efficiency (%) 100 100 100 100 100 100 100 100 100 100

TABLE 4: Results - Parallel Optimistic Execution of PHOLD with Varying Number of LPs
LPs
1024 2048 4096 8192 16384 32768 65536 131072 262144 524288
Running Time (Seconds) 24.727 50.468 100.051 216.820 510.869 1197.870 1187.070 5065.395 10401.267 21598.115
Event Rate (million events/sec) 4.141 4.058 4.094 3.778 3.207 2.735 2.760 2.588 2.520 2.427
Memory Allocated (MB) 29768 29952 30320 31056 32528 35472 35472 53136 76688 123792
Memory Wasted (MB) 869 821 725 533 149 405 405 405 405 404
Fossil Collect Attempts (million) 0.408 0.808 1.609 3.210 6.410 12.811 25.611 51.212 102.410 204.811
Total Events Processed (billion) 0.104 0.207 0.412 0.822 1.641 3.280 3.280 13.110 26.217 52.432
Total GVT Computations (million) 0.102 0.202 0.402 0.802 1.603 3.203 3.203 12.803 25.603 51.203
Total Roll Backs 133218 79114 45681 23980 12143 5983 5970 1665 716 421
PrimaryRoll Backs 105890 63101 35860 19571 10561 5541 5448 1578 681 398
SecondaryRoll Backs 27328 16013 9821 4409 1582 442 522 87 35 23
Efficiency (%) 98.13 98.96 99.43 99.69 99.84 99.90 99.92 99.98 99.99 99.99
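As a quick consistency check on Table 4 (and on the efficiency definition recalled in Section V-A), the committed events can be recovered from the totals. For 1024 LPs the optimistic run processed about 0.104 billion events in total at 98.13% efficiency, i.e.

committed events ≈ 0.104 × 0.9813 ≈ 0.102 billion

which matches the 0.102 billion events of the corresponding serial run in Table 2; the difference is rolled-back (uncommitted) work.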

FIGURE 2. Average CPU usage for parallel optimistic and conservative simulation of the PHOLD model for an increasing number of LPs

2) Average CPU Usage:
Average CPU usage defines the CPU utilization, which depends on the total number of processor cores used to run the simulation. It shows the concurrency level of the code being executed. In the case of serial execution the average CPU usage was constant, as it uses only one processor, whereas the results for the parallel versions of the simulations are presented in Figure 2. Results indicate that the average CPU usage for the optimistic simulation was higher as compared to the conservative simulation. This is due to the optimism that allows LPs to execute events on availability. The optimistic simulation also utilizes more CPU due to its mechanism of handling straggler and anti-messages.

C. ENERGY, MEMORY CONSUMPTION AND FUNCTIONAL LEVEL TIME ANALYSIS - ALLINEA MAP RESULTS
In this section, the results obtained from Allinea Forge's MAP utility in terms of CPU power, energy and memory consumption are reported, along with functional level time. While power and energy are related concepts, energy consumption and power consumption are not the same. One can simply define power as the energy consumed per unit of time (the rate of energy consumption), as given in Equation 1:

Power = Energy / Time    (1)

Desktop computers and similar devices that have a constant power supply are power constrained devices, while mobile devices and other battery operated devices are energy constrained. Therefore, we have computed results for both energy and power consumption. System statistics are obtained with varying numbers of LPs.

1) Energy Consumption
In this section, the results obtained from Allinea Forge's MAP utility in terms of CPU energy consumption are reported while PHOLD is being executed. Figure 3 shows the energy consumption results for sequential, parallel conservative and optimistic execution for different numbers of LPs ranging from 1024 to 524288. The optimistic execution results show the highest energy consumption, not only because it uses all the available physical cores but also because it performs a lot more computations as compared to the conservative execution. Interestingly, the results show that the sequential execution, which initially did not use many CPU resources, still consumed a lot more energy in the end because of its longer execution time.
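Equation 1 is typically applied in the opposite direction when interpreting profiler output: given periodic power samples, the energy of a run is the integral of power over time. The following short C sketch is illustrative only (it is not how Allinea MAP works internally, and the sample values are hypothetical); it accumulates energy from sampled power readings with the trapezoidal rule and recovers the average power from it.

#include <stdio.h>

/* Integrate sampled power (watts) over time (s) to obtain energy (J),   */
/* following Power = Energy / Time.                                      */
static double energy_from_samples(const double *power_w,
                                  int n_samples, double dt_s)
{
    double joules = 0.0;
    for (int i = 1; i < n_samples; i++)        /* trapezoidal rule        */
        joules += 0.5 * (power_w[i - 1] + power_w[i]) * dt_s;
    return joules;
}

int main(void)
{
    /* hypothetical 1 Hz power samples for a 5 s run                      */
    double samples[] = { 20.0, 35.0, 40.0, 38.0, 36.0, 22.0 };
    double e = energy_from_samples(samples, 6, 1.0);
    printf("energy = %.1f J, average power = %.1f W\n", e, e / 5.0);
    return 0;
}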

FIGURE 3. Energy consumption of sequential, parallel conservative and optimistic approaches for varying number of LPs

FIGURE 4. Power consumption of sequential, parallel conservative and optimistic approaches for varying number of LPs

2) Power Consumption
The CPU power consumption is a significant portion of the overall power consumed by desktop computers. It is represented in watts and is a combination of the electrical energy used by the CPU while performing various tasks per unit time and the energy dissipated in the form of heat during the course of execution. Here, we discuss the total CPU power consumption, whereas temperature statistics are discussed later. Figure 4 shows that for smaller numbers of LPs, such as 1024 and 2048, the CPU power consumption results show a similar trend for the conservative and optimistic approaches, but overall the conservative approach consumed more power, while the CPU power consumption of the serial version is very low compared to both parallel versions (conservative and optimistic) as it uses a single core, which eventually results in a higher execution time.

3) Memory Usage
Memory usage is also an important parameter in the design of parallel applications like ROSS. Memory usage is highest for optimistic simulations and is lowest in the case of serial executions. Figure 5 shows the memory usage of the sequential, parallel conservative and optimistic approaches with varying numbers of LPs.

FIGURE 5. Memory usage of sequential, parallel conservative and optimistic approaches for varying number of LPs

4) Functional Level Execution Time
In this section, we present execution time results of the core functions for the serial, parallel conservative and parallel optimistic simulation executions. Tables 5, 6 and 7 contain functional level execution time results, respectively, for varying numbers of logical processes (1024, 2048, 32768, 262144 and 524288). These tables contain the functional hierarchy and the time spent on the execution of some major functions and their corresponding individual child functions. In all three simulation versions the total events are kept constant and linear mapping is used between logical and physical processes. Functional level percentage times for the sequential simulation execution for different numbers of LPs are given in Table 5. Results show that the serial simulation with 1024 LPs took 34.8 seconds, but on increasing the number of LPs to 524288, the serial simulation took 8.72 hours to complete. tw_scheduler_sequential is the main executing function, responsible for event processing, memory management and virtual time computation. The parallel versions of the PHOLD simulations showed more interesting results.

Table 6 contains percentage execution time results for the simulation functions, with the portion of compute time, for the parallel conservative execution.

TABLE 5: Functional Level Execution Time for Serial Version of ROSS


LPs
Functions (% Time) 1024 2048 32768 262144 524288
tw_run 100 99.9 99.9 100 99.9
tw_scheduler_sequential 100 99.9 99.9 100 99.9
phold_event_handler 89 86 76 77 80
tw_rand_exponential 35 33 23 22 21
tw_event_new 21 21 20 19 20
rng_gen_val 14 14 15 19 18
tw_event_send 12 11 13 11 15
Others 7 7 5 6 6
Others 11 14 24 23 20
Others 0 0.1 0.1 0 0.1
Execution Time (sec) 34.8 72.3 1556.2 3730.2 31381.4

TABLE 6: Functional Level Execution Time for Parallel Conservative Version of ROSS
LPs
1024 2048 32768 262144 524288
Functions (% Time) Total MPI Total MPI Total MPI Total MPI Total MPI
tw_run 99.7 43 99.8 40 99.9 32 100 27 100 27
tw_scheduler_conservative 99.5 43 99.8 40 99.9 32 100 27 100 27
phold_event_handler 59 14 64 14 55 12 58 10 58 11
tw_event_send 21 14 22 14 18 12 19 10 21 11
tw_rand_exponential 15 - 16 - 13 - 12 - 11 -
tw_event_new 14 - 15 - 14 - 11 - 12 -
rng_gen_val 6 - 8 - 8 - 13 - 12 -
Others 3 - 3 - 2 - 3 - 2 -
tw_net_read 17 15 17 14 16 12 16 11 16 11
service_queues 17 15 17 14 16 12 16 11 16 11
Others 0 - 0 0 0 0 0 0 0 0
tw_gvt_step2 14 14 9 12 10 8 6 6 6 5
MPI_Allreduce 13 13 8 11 10 8 5 5 5 5
Others 1 1 1 1 0 - 1 1 1 -
Others 9.5 - 9.8 - 18.9 - 20 - 20 -
Others 0.3 - 0.2 - 0.1 - 0 - 0 -
Execution Time (sec) 21.4 42.2 770.3 7212.1 14542.8

The total execution time of the simulation code for 1024 LPs was 21.4 seconds, out of which the parallel execution time was 9.2 seconds. Similarly, when the number of LPs is increased to 524288, the total execution time was about 4.04 hours with a parallel execution time of 1.09 hours. This gives us a rough idea about the degree of parallelism that MPI-based ROSS provides as compared to the sequential execution, as the execution time is decreased to half in the case of the parallel conservative executions. However, it is interesting to note that for the parallel version the degree of parallelism tends to decrease as we increase the number of LPs. This decrease in parallel execution time is because there is an earliest time tag, known as Global Virtual Time (GVT) (or LBTS in the case of conservative synchronization), associated with the unprocessed pending events. This virtual time value acts as a lower bound on the simulation time and guarantees that there will be no event message with a time-stamp lower than this GVT value. These GVT computations need to be done in an essentially sequential fashion for rollbacks, as no process can roll back to a timestamp smaller than the GVT value [59]. This trend can be seen in Tables 6 and 7: as the number of LPs increases there is a decrease in the parallel execution time of the GVT calculation function, and thus more time is spent in sequential GVT calculation due to rollbacks.

Table 7 contains functional level percentage execution time results for the parallel optimistic simulation along with the execution times for different numbers of LPs. The total execution time of the simulation code for 1024 LPs was 24.7 seconds, out of which the parallel execution time was 9.14 seconds. Similarly, for 524288 LPs the total execution time was about 6 hours with a parallel execution time of about 1.38 hours. Similar to the case of the conservative approach, serial GVT computations that need to be performed in a sequential manner are used to find a time in the past for which it is guaranteed that there will be no rollback before this time. For this reason, in Tables 6 and 7 it can be seen that there is a considerable increase in the computation time of the tw_gvt_step function as the number of LPs increases.
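As a concrete reading of the "decreased to half" observation above, the end-to-end speedup of the parallel conservative runs over the serial runs follows directly from Tables 2 and 3:

speedup (1024 LPs) = 34.827 s / 21.432 s ≈ 1.6
speedup (524288 LPs) = 31381.4 s / 14542.8 s ≈ 2.2

so, on this 4-core machine, the conservative version roughly halves the running time only at the larger LP counts.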

More rollbacks cause more reverse computation overhead. The use of reverse computation in parallel discrete event simulation is to reduce state saving. For energy constrained systems we had to make sure that there are no excessive rollbacks that may cause longer execution times. Sometimes a rollback starts a chain of rollbacks; primary rollbacks transitively cause secondary rollbacks to reverse the effect of previously sent messages [60]. In conservative simulation, on the other hand, causality errors are avoided by performing more GVT computations, which is why the execution time of the conservative approach is better than that of the optimistic algorithm.

D. CPU TEMPERATURE - INTEL® SOC WATCH RESULTS
In this section, we discuss the temperature statistics for each core while the PHOLD application is being executed. For serial execution, it was observed that the temperature of a specific core remained higher as compared to the others. This was due to the fact that serial execution utilizes only a single core. The statistics showed that in the serial execution the minimum average temperature is 36.1°C and the maximum average temperature goes up to 54.4°C.

FIGURE 6. CPU temperature statistics for the PHOLD model for different numbers of LPs

For the parallel executions (both conservative and optimistic) it was seen that the temperature goes high on all four cores. This is due to the parallelism in conservative and optimistic execution, as both utilize all available cores. The minimum and maximum average temperatures in the case of conservative execution were 52°C and 73.2°C respectively. Similarly, in optimistic execution the minimum and maximum average temperatures were 52.5°C and 73.2°C.

It is evident from the temperature results that with more CPU cores in use, more energy will be dissipated as heat and thus the average CPU temperature will increase. The maximum average temperature for the serial version is comparable to the minimum average temperature of the parallel versions. The maximum average temperature of the parallel versions was 18.8°C higher than that of the serial version. The reason behind the high energy consumption for optimistic and conservative execution is that a significant part of the energy is dissipated in the form of heat.

E. ANALYSIS OF ROSS FRAMEWORK
We have presented in-depth instrumentation results for the serial, parallel conservative and optimistic approaches. This helped us to determine the critical parameters that we need to focus on while designing synchronization algorithms for mobile platforms. A serial execution of the simulation model will consume few resources but it will take a lot more time, and thus will not be efficient and effective for most real-time mobile applications. For applications that need to perform less extensive simulation tasks and have no need to provide real-time results, the serial execution model can be used effectively.

For the parallel conservative and optimistic simulation models, a good strategy could be to migrate resource intensive functions (or modules) to cloudlets for their execution. In this analysis, the execution times of core functions are listed in Tables 5, 6 and 7. This kind of detailed analysis is the first step in the adoption of any simulation framework on mobile handheld devices. Moreover, a second step would be to evaluate the migration cost of compute and energy intensive code to cloudlets. In some cases, this evaluation can be done before running the simulation (e.g. based on information derived from previous runs of the simulator), but often it needs to be executed at run-time. In fact, the decision to migrate or not to migrate a simulation module is based on many parameters, some of which can be unknown or unpredictable a priori. For this reason, heuristic approaches for the dynamic (and adaptive) reallocation of these modules could be very promising. The next sections of the paper cover our proposed simulation suite, its detailed results and performance analysis for mobile platforms.

VI. SEECSSIM – PROPOSED SIMULATION SUITE
In this section, we propose a simulation suite for mobile devices that includes the different types of synchronization algorithms discussed in Fig. 1. SEECSSim, the proposed simulation suite, covers different types of distributed simulation algorithms: i) the time-stepped approach, ii) synchronous conservative, iii) asynchronous conservative, and iv) optimistic simulation. A brief discussion of all these techniques and their algorithms is presented in the following sections.

Time-Stepped Model - In a time-stepped simulation model, time advances in fixed intervals that are provided to all the logical processes, or to the one requesting the time interval value. At any point in wall clock time, all the logical processes are at the same logical time in the simulation. Typically, the entire simulation time is divided into time steps of equal size. This simulation model also requires a barrier synchronization mechanism to ensure that all processes complete their execution in a time step before going forward to the next time step by a fixed interval ∆t. The time-stepped algorithm is presented as Algorithm 1. In this algorithm, each process maintains a process logical time Tp, the current simulation time. This synchronization approach
TABLE 7: Functional Level Execution Time for Parallel Optimistic Version of ROSS
Number of LPs                 1024          2048          32768         262144        524288
Functions (% Time)          Total  MPI   Total  MPI   Total  MPI   Total  MPI   Total  MPI
tw_run 99.8 37 99.9 38 99.9 26 99.9 23 99.9 23
tw_scheduler_optimistic 99.6 37 99.7 38 99.9 26 99.9 23 99.9 23
tw_sched_batch 60 13 58 13 54 8 52 7 50 7
phold_event_handler 52 13 51 13 39 8 37 7 38 7
tw_event_send 18 13 18 13 13 8 13 7 13 7
tw_event_new 11 - 11 - 9 - 8 - 8 -
rng_gen_val 7 - 5 - 7 - 7 - 8 -
tw_rand_exponential 13 - 14 - 9 - 8 - 7 -
Others 3 - 3 - 1 - 0 - 1 -
tw_gvt_step2 18 11 20 12 28 8 28 7 29 8
tw_pe_fossil_collect 7 0 8 0 19 - 20 0 20 0
MPI_Allreduce 10 10 11 11 8 8 7 7 7 7
Others 1 1 1 1 1 0 0 0 2 1
tw_net_read 20 12 21 13 17 9 20 8 21 9
service_queues 20 12 21 13 17 9 20 8 21 9
test_q 9 2 9 2 9 0.8 12 1.2 13 1
recv_begin 11 10 12 11 9 8 8 7.1 8 7
Others 0 - 0 - 0 0.2 0 0 0 1
tw_kp_rollback_to 1.4 0.6 0.8 0.2 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1
tw_event_rollback 1.2 0.6 0.8 0.2 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1
Others 0.2 - 0 - 0 0 0 0 0 0
Others 0 - 0 - 0 0 0 0 0 1
Others 1.6 - 0.7 - 0.9 1 0 1 0 0
Others 0.2 0 0.1 0 0.1 0 0.1 0 0.1 0
Execution Time (sec) 24.7 50.5 1197.9 10401.3 21598.1

VI. SEECSSIM – PROPOSED SIMULATION SUITE
In this section, we propose a simulation suite for mobile devices that includes the different types of synchronization algorithms discussed in fig. 1. SEECSSim, the proposed simulation suite, covers the following types of distributed simulation algorithms: i) time-stepped, ii) synchronous conservative, iii) asynchronous conservative and iv) optimistic simulation. A brief discussion of these techniques and their algorithms is presented in the following sections.

Time-Stepped Model - In a time-stepped simulation model, time advances in fixed intervals that are provided to all the logical processes, or to the one requesting the time interval value. At any point in wall-clock time, all logical processes are at the same logical time in the simulation. Typically, the entire simulation time is divided into time steps of equal size. This simulation model also requires a barrier synchronization mechanism to ensure that all processes complete their execution in a time step before moving forward to the next time step by the fixed interval ∆t. The time-stepped algorithm is presented as Algorithm 1. In this algorithm, each process maintains a process logical time Tp, the current simulation time. This synchronization approach is most appropriate for models where simulation events are frequent and dense. However, in simulation models where simulation events are less frequent, performance may suffer, as it might be difficult to define a correct time step. For real-time interactive systems, an optimization is achieved by maintaining some information about future events: the future event list is shared with the destination node, which can generate time-advancement requests based on its own future event list [61].

    initialization Tp = 0;
    while Tp < Tend do
        calculate state at Tp;
        send self-generated output for Tp;
        send output complete marker;
        request advance to Tp;
        update time Tp;
        repeat:
            receive(message);
            if message ≠ grant_advance(Tend) then
                receive all messages;
                receive input complete marker;
                process input;
                send output in response to this new input;
                send output complete marker;
            else
                Simulation end time is reached
            end
        Tp ← Tp + ∆t
    end
Algorithm 1: Time-Stepped algorithm
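As a concrete illustration of Algorithm 1, the following is a minimal Java sketch of a time-stepped logical process driven by a shared barrier. It is a simplified reading of the algorithm, not the SEECSSim implementation, and all class and method names (TimeSteppedLp, computeStateAt) are our own.

    import java.util.concurrent.BrokenBarrierException;
    import java.util.concurrent.CyclicBarrier;

    // Minimal time-stepped LP: all LPs advance together in fixed increments of dt,
    // and a barrier guarantees that no LP enters step k+1 before every LP has
    // finished step k.
    final class TimeSteppedLp implements Runnable {
        private final CyclicBarrier barrier;   // shared by all LPs in the simulation
        private final double dt;               // fixed time step (delta t)
        private final double endTime;          // Tend
        private double tp = 0.0;               // process logical time Tp

        TimeSteppedLp(CyclicBarrier barrier, double dt, double endTime) {
            this.barrier = barrier;
            this.dt = dt;
            this.endTime = endTime;
        }

        private void computeStateAt(double t) {
            // model-specific state update and output generation for time t
        }

        @Override
        public void run() {
            try {
                while (tp < endTime) {
                    computeStateAt(tp);   // process all events of the current step
                    barrier.await();      // "request advance": wait until every LP is done
                    tp += dt;             // advance to the next time step
                }
            } catch (InterruptedException | BrokenBarrierException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

Each LP runs on its own thread and the CyclicBarrier is created with the total number of LPs, which mirrors the barrier synchronization step that Algorithm 1 relies on.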
Synchronous Conservative Model - There are multiple centralized and decentralized algorithms in the synchronous conservative approach that implement a global synchronization mechanism. These algorithms include the distributed snapshot, grid-based, tree barrier, broadcast and centralized barrier algorithms. We have used a centralized tree barrier approach, in which the logical processes are organized as the nodes of a binary tree. Each LP processes events until it reaches a barrier point. Once an LP reaches a barrier point, it sends a signal to its parent LP, but only after it has received signals from both of its child processes (if they exist). This continues until the root node receives the signal message. Once the root node receives the signal, it knows that every LP has reached the barrier point. Once all LPs are synchronized, the root (centralized) node broadcasts a message to release the barrier [62]. In this way, centralized control is achieved. The algorithm for the centralized tree barrier approach is presented as Algorithm 2.

    initialization Tp = 0;
    Ni = time of next event in LPi;
    LAi = lookahead of LPi;
    while Tp < Tend do
        Enqueue NewMessages(InQ);
        if IsMsg(InQ) ≤ LBTS then
            // process safe events
            Msg = Dequeue NextMsg(InQ);
            Process Msg;
            Barrier synchronization;
        else
            No safe event, compute LBTS;
        end
    end
    Compute LBTS (Tree Barrier);
    LBTS = min(Ni + LAi);
    if Node = Root then
        send new LBTS to all LPs;
    else
        send min(NextEventTime + LA) to Parent Process;
    end
Algorithm 2: Tree Barrier Algorithm
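To illustrate the LBTS reduction performed in Algorithm 2, the sketch below shows in Java how each node of the barrier tree can combine its own bound (next event time plus lookahead) with the bounds reported by its children before passing the minimum to its parent. It is our own simplified rendering, not the SEECSSim code: in the actual protocol these values travel as messages between LPs rather than as recursive calls, and the names (TreeBarrierNode, reduceLbts) are illustrative.

    // Simplified tree-barrier LBTS reduction: each node forwards the minimum of
    // (its next event time + lookahead) and its children's bounds up the tree;
    // the value obtained at the root is the new LBTS to broadcast.
    final class TreeBarrierNode {
        private final TreeBarrierNode left;    // null for leaf nodes
        private final TreeBarrierNode right;   // null for leaf nodes
        private final double nextEventTime;    // Ni
        private final double lookahead;        // LAi

        TreeBarrierNode(TreeBarrierNode left, TreeBarrierNode right,
                        double nextEventTime, double lookahead) {
            this.left = left;
            this.right = right;
            this.nextEventTime = nextEventTime;
            this.lookahead = lookahead;
        }

        // Called when this node reaches its barrier point; returns the bound that
        // would be sent to the parent (or, at the root, the new global LBTS).
        double reduceLbts() {
            double bound = nextEventTime + lookahead;       // local contribution
            if (left != null)  bound = Math.min(bound, left.reduceLbts());
            if (right != null) bound = Math.min(bound, right.reduceLbts());
            return bound;                                   // min(Ni + LAi) over the subtree
        }
    }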

Asynchronous Conservative Model - We have implemented the Chandy–Misra–Bryant (CMB) algorithm as an asynchronous conservative approach. In an asynchronous conservative model, LPs communicate through messages with increasing time-stamps. Each process maintains a separate queue for each of its incoming channels (C). Time-stamped messages guarantee that, at each LP, the time-stamp of the last message received on an incoming link is a lower bound on the time-stamp of any event message (E) that can be received later. However, deadlock can occur if the time-stamp of the unprocessed events is greater than the lower bound of an empty queue; in this situation there is no certainty about whether or not an event is safe to process. Null messages (special control messages) can be sent to the other LPs in order to recover from the deadlock condition. On receiving a null message, an LP can advance its local clock time, but a null message will not cause an LP to change its state variables or generate new events [63]. The Chandy–Misra–Bryant (CMB) algorithm is presented as Algorithm 3. Each LP maintains a process logical time Tp and a lower bound on every incoming channel Tch. On receiving a null message with time-stamp Tnm from an LP, the receiving LP is assured that there will be no messages with a time-stamp smaller than that of the null message.

    initialization Tp = 0;
    while Tp < Tend do
        forall Cin do
            Enqueue NewMessages(InQ);
            Update channel time Tch;
        end
        Select the incoming queue InQ with smallest channel time Tch;
        if IsMsg(InQ) then
            Eim = Dequeue NextMsg(InQ);
            Tp = Tim;
            process Eim;
            Enqueue NewMessages(OutQ);
            forall Cout do
                ReleaseMessagesUpTo(Tp + Tcl);
                SendNullMessage(Tp + Tcl);
            end
        else
            Simulation end time is reached
        end
    end
Algorithm 3: CMB Null Message Algorithm
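The following Java sketch is an illustrative reading of Algorithm 3, not the SEECSSim source: each incoming channel carries messages in non-decreasing time-stamp order, its "clock" is the head event's time (or the last promised bound when the queue is empty), and after every step the LP promises its neighbours, via null messages, that nothing earlier than Tp plus the lookahead will follow. The class and method names (CmbChannel, CmbLp, step) are assumptions made for the example.

    import java.util.ArrayDeque;
    import java.util.List;

    // A single incoming channel of a CMB logical process.
    final class CmbChannel {
        final ArrayDeque<Double> queue = new ArrayDeque<>(); // buffered real events (FIFO)
        private double promisedBound = 0.0;                  // bound carried by the last message

        void onRealMessage(double t) { queue.addLast(t); promisedBound = t; }
        void onNullMessage(double t) { promisedBound = Math.max(promisedBound, t); }
        double clock() { return queue.isEmpty() ? promisedBound : queue.peekFirst(); }
    }

    final class CmbLp {
        double tp = 0.0;              // process logical time Tp
        final double lookahead;       // Tcl in Algorithm 3
        final List<CmbChannel> in;    // one queue per incoming channel
        final List<CmbChannel> out;   // channels towards neighbouring LPs

        CmbLp(double lookahead, List<CmbChannel> in, List<CmbChannel> out) {
            this.lookahead = lookahead; this.in = in; this.out = out;
        }

        // One iteration of the main loop: process the safest pending event, if any,
        // then tell every neighbour that nothing earlier than Tp + lookahead will follow.
        void step() {
            CmbChannel safest = in.get(0);
            for (CmbChannel c : in) if (c.clock() < safest.clock()) safest = c;

            if (!safest.queue.isEmpty()) {
                tp = safest.queue.pollFirst();  // safe: no channel can deliver anything earlier
                // ... model-specific event handling, enqueueing of new output events ...
            }
            for (CmbChannel c : out) c.onNullMessage(tp + lookahead);
        }
    }

The null messages are what keeps the receivers' channel clocks moving when no real traffic is flowing, which is exactly the deadlock-avoidance role described above; their volume is also the main overhead of CMB discussed later in the results.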
Optimistic Model – The well-known optimistic Time Warp (TW) synchronization mechanism is also included in the proposed suite. In the Time Warp mechanism, each LP starts executing events independently, without coordinating with the other LPs. Thus, processes initially execute events without synchronizing, until they detect an out-of-order execution, i.e. a causality error. Once a causality error is detected, it is recovered using a rollback mechanism, and the rolled-back events are re-executed if they are not annihilated. On detecting the error, a process sends anti-messages to cancel, or roll back, the execution of the event messages caused by the out-of-order execution. The Time Warp mechanism needs to keep the list of processed messages in order to be able to send anti-messages, and this uses an extravagant amount of memory. To reclaim memory, TW uses the Global Virtual Time (GVT) mechanism, which serves as a floor for the virtual times to which any process can ever roll back again. Every process reports the minimum time-stamp among all of its unprocessed events, partially processed events and anti-messages to the coordinating process. After the GVT calculation, each process reclaims the memory used to store processed events whose time-stamps are smaller than the GVT [64]. The algorithm for the Time Warp mechanism is presented as Algorithm 4.
    initialization Tp = 0;
    while Tp < Tend do
        unProcessedMsgQ.Enqueue(buffer.outAll);
        incomingMsg = unProcessedMsgQ.Dequeue;
        switch incomingMsg.type do
            case GVT_Message: do
                // GVT: Global Virtual Time; LVT: Local Virtual Time
                submit LVT and wait for GVT computation;
                do FossilCollection;
            end
            case anti_Message: do
                do process anti_Message;
            end
            case normal_Message: do
                if anti_Message has already arrived then
                    do Annihilation;
                end
                if incomingMsg is straggler then
                    do Rollback;
                else
                    set the LVT to incomingMsg.getTimeStamp;
                    processedMsgQ.add(incomingMsg);
                    LP.execute(incomingMsg.getEvent);
                    do StateSaving;
                    foreach event in model.out do
                        outMsgQueue.Enqueue(newMessage(event));
                        sendMessages;
                    end
                end
            end
        end
        Simulation end time is reached
    end
Algorithm 4: Time Warp Algorithm
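To make the rollback and fossil-collection steps of Algorithm 4 more concrete, the following Java sketch shows one possible way to undo optimistically executed events when a straggler arrives and to discard history once the GVT is known. It is illustrative only, not the SEECSSim implementation; the names (TimeWarpLp, onStraggler, fossilCollect, sendAntiMessage) are our own assumptions.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Minimal sketch of Time Warp rollback handling: a straggler message forces the
    // LP to restore the latest saved state older than the straggler, undo the
    // optimistically processed events, and cancel their output via anti-messages.
    final class TimeWarpLp {
        static final class ProcessedEvent {
            final double timestamp;
            final Object savedState;                        // snapshot taken before execution
            final java.util.List<Object> sentMessages;      // output generated by this event
            ProcessedEvent(double t, Object s, java.util.List<Object> sent) {
                timestamp = t; savedState = s; sentMessages = sent;
            }
        }

        private final Deque<ProcessedEvent> processed = new ArrayDeque<>(); // newest last
        private double lvt = 0.0;                                           // local virtual time
        private Object currentState = new Object();

        void onStraggler(double stragglerTime) {
            // Undo, newest first, every event executed at or after the straggler time.
            while (!processed.isEmpty() && processed.peekLast().timestamp >= stragglerTime) {
                ProcessedEvent e = processed.pollLast();
                currentState = e.savedState;          // restore the saved state
                for (Object msg : e.sentMessages) {
                    sendAntiMessage(msg);              // cancel the optimistic output
                }
            }
            lvt = stragglerTime;  // the straggler is now the next event to (re)execute
        }

        // Fossil collection: once GVT is known, no rollback can reach below it,
        // so the corresponding saved states and message copies can be discarded.
        void fossilCollect(double gvt) {
            while (!processed.isEmpty() && processed.peekFirst().timestamp < gvt) {
                processed.pollFirst();
            }
        }

        private void sendAntiMessage(Object positiveMessage) {
            // transport-specific: deliver an annihilating copy to the original receiver
        }
    }

The per-event state snapshots held in the processed list are the source of the memory pressure discussed in the results below, and they are released only during fossil collection.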
are shown in fig. 8. The time step is kept fixed for all the
simulation run with varying number of LPs. Figure shows
total number of events processed along with total number
of time steps or intervals ∆t that each simulation used to
complete. Similarly, fig. 9 shows the results for executing
Tree Barrier algorithm with different number of LPs. As
simulations proceed it is important to note the total number
of LBTS computations along with total number of events
processed. Tree barrier is different from time stepped ap-
proach as it computes new barrier point each time reaching
a barrier point. Thus tree barrier algorithm takes slightly
more time to complete execution because of these LBTS
computations. Though fig. 8 and fig. 9 look similar but there
is slight difference in the total execution time. For example,
tree barrier took 1397 seconds for 1024 LPs while time-
stepped completed execution for same number of LPs in 1055
FIGURE 7—Simulation topology seconds.
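To make the benchmark workload concrete, the following is a minimal Java sketch of a PHOLD-style event handler for one LP: each processed event schedules exactly one new event with an exponentially distributed time-stamp increment, routed either back to the same LP or to a randomly chosen neighbour, so the message population stays constant. This is our own illustration of the benchmark described above, not the SEECSSim or ROSS code, and the names (PholdLp, NewEvent, handleEvent) are assumptions.

    import java.util.Random;

    final class PholdLp {

        static final class NewEvent {
            final int destinationLp;
            final double timestamp;
            NewEvent(int destinationLp, double timestamp) {
                this.destinationLp = destinationLp;
                this.timestamp = timestamp;
            }
        }

        private final int lpId;
        private final int numLps;               // N fully connected LPs
        private final double meanIncrement;     // mean of the exponential time-stamp increment
        private final double remoteProbability; // probability of routing to a remote LP
        private final Random rng;

        PholdLp(int lpId, int numLps, double meanIncrement,
                double remoteProbability, long seed) {
            this.lpId = lpId;
            this.numLps = numLps;
            this.meanIncrement = meanIncrement;
            this.remoteProbability = remoteProbability;
            this.rng = new Random(seed);
        }

        NewEvent handleEvent(double eventTime) {
            // exponential time-stamp increment, as in the PHOLD workload
            double increment = -meanIncrement * Math.log(1.0 - rng.nextDouble());
            int dest = rng.nextDouble() < remoteProbability ? rng.nextInt(numLps) : lpId;
            return new NewEvent(dest, eventTime + increment);
        }
    }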
FIGURE 8—Time-Stepped Execution Model on Mobile Platform

Results for executing the Time-Stepped synchronization algorithm on the mobile platform with different numbers of LPs are shown in fig. 8. The time step is kept fixed for all simulation runs while the number of LPs is varied. The figure shows the total number of events processed along with the total number of time steps, or intervals ∆t, that each simulation needed to complete. Similarly, fig. 9 shows the results for executing the Tree Barrier algorithm with different numbers of LPs. As the simulations proceed, it is important to note the total number of LBTS computations along with the total number of events processed. Tree Barrier differs from the time-stepped approach in that it computes a new barrier point each time a barrier point is reached. Thus, the Tree Barrier algorithm takes slightly longer to complete because of these LBTS computations. Although fig. 8 and fig. 9 look similar, there is a slight difference in the total execution time: for example, Tree Barrier took 1397 seconds for 1024 LPs, while the time-stepped approach completed its execution for the same number of LPs in 1055 seconds.

FIGURE 9—Tree Barrier Execution Model on Mobile Platform

FIGURE 10—The execution of CMB Null Message algorithm over mobile platform

Execution results of the Chandy–Misra–Bryant Null message algorithm with varying numbers of logical processes are shown in fig. 10. The figure shows the total events processed and the total LBTS computations, along with the total number of null messages. The number of null messages is almost equal to the total number of events processed for the PHOLD execution, and the total number of LBTS computations is almost the same as the number of LBTS computations in Tree Barrier. CMB performed better than Tree Barrier and the time-stepped approach in terms of execution time.

FIGURE 11—Time Warp Execution on Mobile Platform

Results of the Time Warp algorithm with varying numbers of LPs are shown in fig. 11. The total number of events processed, the total number of rolled-back events and the total number of GVT computations are plotted for varying numbers of LPs. As the number of LPs increases, the total number of rollback events does not grow as rapidly as the number of null messages does in the CMB algorithm.

FIGURE 12—Event Rate for CMB NULL Message, Time-Stepped, Tree Barrier and Time Warp Algorithms

Comparison results of the event rate for all of the above-mentioned algorithms are shown in fig. 12. The event rate gives the total number of events processed per unit of time. Fig. 12 shows that the TW and CMB null message algorithms achieve a better event rate than the Tree Barrier and time-stepped approaches. In the case of TW, this is due to its optimistic behavior: most of the time it continues execution without any synchronization. CMB is also better because it exchanges null messages to obtain the LBTS value of an unknown link and otherwise keeps executing events.
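For reference, one way such a per-run event rate can be derived is sketched below: count the committed events and divide by the elapsed wall-clock time of the run. This is an assumption on our part about how the metric can be computed; the profiling hooks inside SEECSSim may differ, and the class name (RunMetrics) is illustrative.

    // Simple per-run metric collector: events committed divided by wall-clock time.
    final class RunMetrics {
        private long committedEvents = 0;
        private long startNanos;
        private long endNanos;

        void start() { startNanos = System.nanoTime(); }
        void onEventCommitted() { committedEvents++; }
        void stop() { endNanos = System.nanoTime(); }

        double elapsedSeconds() { return (endNanos - startNanos) / 1e9; }

        double eventRatePerSecond() {
            double seconds = elapsedSeconds();
            return seconds > 0 ? committedEvents / seconds : 0.0;
        }
    }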
B. DEVICE SPECIFIC RESULTS
This section includes the resource-utilization results for executing these synchronization algorithms on the mobile platform. The main objective is to analyze the CPU usage, memory consumption, energy consumption and the amount of time spent to complete each simulation run. The results include the total number of events processed, the number of LBTS/GVT computations and the total execution time. The energy analysis results are obtained using the Trepn profiler tool, which is used to measure the power consumption and performance of the different synchronization algorithms. The Trepn profiler (https://developer.qualcomm.com/software/trepn-power-profiler) is a product of Qualcomm: a diagnostic tool that lets developers profile the performance and power consumption of Android applications running on mobile devices. The results for energy consumption, memory consumption and percentage of CPU usage are included to analyze the performance of the synchronization algorithms. We expect that this study will also help researchers choose the appropriate synchronization algorithm for mobile/embedded systems. A further objective of this study is to show that we can use the existing algorithms on mobile platforms with little modification where required; that is, building new algorithms for mobile or embedded devices from scratch, without thoroughly analyzing the existing ones, may not be fruitful.

FIGURE 13—Average CPU Usage – CMB NULL Message, Time-Stepped, Tree Barrier and Time Warp Algorithms

1) CPU Usage
The average CPU usage of all four synchronization algorithms is shown in fig. 13. The time-stepped approach consumed the least CPU resources, whereas the TW approach consumed the most. The reason for the excessive CPU utilization of the Time Warp algorithm is that it needs to process more events than the other approaches. Moreover, during rollbacks, more events pile up in the input queue ready to execute, so more CPU work is required. Since TW allows events to execute without abiding by the local causality constraint, it has to roll back out-of-order executions and then re-execute them, causing more computation. GVT computations and fossil-collection attempts are also responsible for the greater CPU utilization. The CPU usage of the CMB algorithm is lower than that of Time Warp but higher than that of Tree Barrier; this is because of the number of NULL messages it has to generate and transmit in order to process a single event. Similarly, the lookahead value also has an impact on the performance of CMB. The time-stepped approach performed better than the others. However, it is important to note that the term CPU utilization does not mean the same thing for mobile devices as it does for desktop systems. For a desktop computer, higher CPU utilization can be regarded as better CPU utilization, since a consistent power supply is always available. On the other hand, for mobile devices, more CPU usage means more energy consumption, which drains the limited battery more rapidly.

FIGURE 14—Memory Consumption – CMB NULL Message, Time-Stepped, Tree Barrier and Time Warp Algorithms

2) Memory Consumption
Memory consumption is one of the important parameters that need to be considered for embedded or mobile devices, which have limited memory. As discussed in the previous section, the TW algorithm initially executes events without synchronizing with the other processes; in order to perform rollbacks, each LP has to save its state variables and processed events. This state-saving mechanism requires a significant amount of memory compared to the other techniques discussed here. The other approaches consume less memory and are very close to each other in terms of their memory consumption. The memory comparison for all algorithms is shown in Fig. 14.
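Alongside an external profiler such as Trepn, memory and (on devices that expose it) battery energy can also be sampled from inside the application itself. The sketch below is not part of SEECSSim and was not used for the reported numbers; it only illustrates, with standard Android APIs, how such samples could be taken before and after a simulation run. The class name (DeviceProbe) is our own.

    import android.content.Context;
    import android.os.BatteryManager;
    import android.os.Debug;

    // Illustrative on-device measurement helpers.
    final class DeviceProbe {
        private final BatteryManager batteryManager;

        DeviceProbe(Context context) {
            this.batteryManager =
                    (BatteryManager) context.getSystemService(Context.BATTERY_SERVICE);
        }

        // Java-heap bytes currently in use by the process.
        long usedHeapBytes() {
            Runtime rt = Runtime.getRuntime();
            return rt.totalMemory() - rt.freeMemory();
        }

        // Native-heap bytes allocated by the process.
        long nativeHeapBytes() {
            return Debug.getNativeHeapAllocatedSize();
        }

        // Remaining battery energy in nanowatt-hours; some devices do not support
        // this counter and report a sentinel value instead.
        long batteryEnergyNanoWattHours() {
            return batteryManager.getLongProperty(BatteryManager.BATTERY_PROPERTY_ENERGY_COUNTER);
        }
    }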
FIGURE 15—Energy Consumption – CMB NULL Message, Time-Stepped, Tree Barrier and Time Warp Algorithms

3) Energy Consumption
The energy consumed by the different synchronization algorithms is presented in Fig. 15. Here energy is given in milliwatt-hours: the watt-hour (Wh) is a unit of energy equivalent to one watt (1 W) of power expended for one hour (1 h) of time, so a milliwatt-hour is 1/1000 Wh (symbolized mWh). In the figure, the number of LPs and the amount of energy are plotted on a logarithmic scale. It is important to consider the energy consumption of the algorithms relative to each other rather than their individual energy consumption. Since energy is computed by multiplying power by time, if the execution time needed to complete the simulation increases, its energy consumption also increases. Battery-operated mobile devices are energy constrained, so their design requirement is to minimize the total amount of energy consumed to complete the given computation task. Parallel and distributed simulations that need to be executed within deadlines require more energy and power at the same time. Following the trend seen for CPU usage, Time Warp is energy intensive compared to the other algorithms, whereas the Time-Stepped and Tree Barrier approaches consume less energy. The energy consumption of CMB is nearly the same as that of the Time-Stepped and Tree Barrier approaches.

FIGURE 16—Total Execution Time – CMB NULL Message, Time-Stepped, Tree Barrier and Time Warp Algorithms

4) Total Execution Time
The previous results clearly show that the performance of TW is below par compared to the other techniques discussed in this paper. The time-stepped approach is better among all the synchronization algorithms (except for execution time); however, it cannot exploit the true parallelism. In distributed systems, performance is usually measured in terms of speedup and execution time. As shown in Fig. 16, the total execution time for the Tree Barrier and Time-Stepped approaches is higher than for the other two approaches. The simulations using the Chandy–Misra–Bryant algorithm and the Time Warp algorithm perform better in terms of execution time. Thus, we can conclude that the CMB algorithm is adequate in terms of execution time as well as energy consumption.

Time Warp is regarded as one of the most efficient and most extensively used algorithms in distributed simulation, but it consumes a lot of energy, which is limited on mobile or embedded devices. The energy consumption and CPU utilization of the TW algorithm can be improved using techniques that help to reduce the number of rollbacks. One well-known technique is Wolf Calls [65]. In the Wolf Calls protocol, when a logical process detects that a straggler message has been received in its past, it sends a control message to all the LPs, causing them to stop processing until the error is removed; these control messages are called wolf calls. A better way to improve the performance of the wolf call algorithm is to make sure that only those LPs to which the error may have spread stop processing. Other techniques can be adopted as well, such as lazy and re-lazy cancellation, and reverse computation. The proposed SEECSSim simulation suite includes one of these optimization techniques, namely Wolf Calls. The results in fig. 18 suggest that the energy consumption of Time Warp is improved considerably with an increasing number of logical processes. It is important to consider that this improvement in energy consumption is achieved at the expense of more execution time; however, this does not greatly reduce the performance of the simulation system, as the execution time of Time Warp with the Wolf Calls protocol is still better than that of the Tree Barrier and Time-Stepped approaches, as shown in the figure.

FIGURE 17—Total Execution Time – CMB NULL Message, Time-Stepped, Tree Barrier and Time Warp Algorithms

FIGURE 18—Total Execution Time – CMB NULL Message, Time-Stepped, Tree Barrier and Time Warp Algorithms
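The wolf-call broadcast described above can be pictured with the short Java sketch below: the LP that detects a straggler freezes every LP, performs the recovery, and then releases them. It is only an illustration of the idea, not the SEECSSim implementation (which may freeze only the affected subset of LPs, as suggested above); the names (WolfCallController, raiseWolfCall) are assumptions.

    import java.util.List;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Minimal sketch of the wolf-call control protocol.
    final class WolfCallController {

        static final class Lp {
            final AtomicBoolean frozen = new AtomicBoolean(false);

            void onFreeze() { frozen.set(true); }    // stop scheduling new events
            void onResume() { frozen.set(false); }   // optimistic execution continues
            boolean mayProcessEvents() { return !frozen.get(); }
        }

        private final List<Lp> allLps;

        WolfCallController(List<Lp> allLps) { this.allLps = allLps; }

        // Called by the LP that received a straggler message in its past.
        void raiseWolfCall(Runnable rollbackAndCancel) {
            for (Lp lp : allLps) lp.onFreeze();   // wolf call: halt every LP
            rollbackAndCancel.run();              // undo the erroneous computation
            for (Lp lp : allLps) lp.onResume();   // release the LPs once the error is gone
        }
    }

Restricting the freeze broadcast to only the LPs reachable from the erroneous message would replace allLps with the affected subset, which is the refinement mentioned in the text.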
VIII. CONCLUSION
It is worth noting that, in comparison with traditional systems, handheld devices provide a limited amount of computational resources. Therefore, the completion time of simulations on handheld devices usually increases in comparison to traditional systems. This is due to the fact that handheld devices use ARM-based processors that are designed to optimize energy consumption [66], [67]. In other words, it is not fair to compare traditional execution architectures (e.g. desktops, servers) with handheld devices considering only the amount of time that is necessary to complete the simulation runs. A more comprehensive approach is to take into account both the execution time and the energy consumption of the simulation runs.

REFERENCES
[1] R. M. Fujimoto, R. Bagrodia, R. E. Bryant, K. M. Chandy, D. Jefferson, J. Misra, D. Nicol, and B. Unger, "Parallel discrete event simulation: The making of a field," 2017.
[2] R. M. Fujimoto, Parallel and distributed simulation systems. Wiley New York, 2000, vol. 300.
[3] A. W. Malik, A. J. Park, and R. M. Fujimoto, "An optimistic parallel simulation protocol for cloud computing environments," SCS M&S Magazine, vol. 4, pp. 1–9, 2010.
[4] K. Vanmechelen, S. De Munck, and J. Broeckhove, "Conservative distributed discrete event simulation on amazon ec2," in Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012). IEEE Computer Society, 2012, pp. 853–860.
[5] Y. Wu, J. Cao, and M. Li, "Private cloud system based on boinc with support for parallel and distributed simulation," in Dependable, Autonomic and Secure Computing (DASC), 2011 IEEE Ninth International Conference on. IEEE, 2011, pp. 1172–1178.
[6] I. Zhukov, C. Feld, M. Geimer, M. Knobloch, B. Mohr, and P. Saviankou, "Scalasca v2: Back to the future," in Tools for High Performance Computing 2014. Springer, 2015, pp. 1–24.
[7] A. Knüpfer, H. Brunst, J. Doleschal, M. Jurenz, M. Lieber, H. Mickler, M. S. Müller, and W. E. Nagel, "The vampir performance analysis tool-set," in Tools for High Performance Computing. Springer, 2008, pp. 139–155.
[8] A. D. Malony and S. Shende, "Performance technology for complex parallel and distributed systems," in Distributed and parallel systems. Springer, 2000, pp. 37–46.
[9] A. D. Malony, S. Shende, R. Bell, K. Li, L. Li, and N. Trebon, "Advances in the tau performance system," in Performance analysis and grid computing. Springer, 2004, pp. 129–144.
[10] S. S. Shende and A. D. Malony, "The tau parallel performance system," International Journal of High Performance Computing Applications, vol. 20, no. 2, pp. 287–311, 2006.
[11] C. Isci and M. Martonosi, "Runtime power monitoring in high-end processors: Methodology and empirical data," in Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 2003, p. 93.
[12] R. Ge, X. Feng, S. Song, H.-C. Chang, D. Li, and K. W. Cameron, "Powerpack: Energy profiling and analysis of high-performance systems and applications," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 5, pp. 658–671, 2010.
[13] X. Feng, R. Ge, and K. W. Cameron, "Power and energy profiling of scientific applications on distributed systems," in 19th IEEE International Parallel and Distributed Processing Symposium. IEEE, 2005, pp. 34–34.
[14] G. Procaccianti, L. Ardito, M. Morisio et al., "Profiling power consumption on desktop computer systems," in International Conference on Information and Communication on Technology. Springer, 2011, pp. 110–123.
[15] L. Stanisic, B. Videau, J. Cronsioe, A. Degomme, V. Marangozova-Martin, A. Legrand, and J.-F. Méhaut, "Performance analysis of hpc applications on low-power embedded platforms," in Proceedings of the conference on design, automation and test in Europe. EDA Consortium, 2013, pp. 475–480.
[16] N. Rajovic, A. Rico, J. Vipond, I. Gelado, N. Puzovic, and A. Ramirez, "Experiences with mobile processors for energy efficient hpc," in Proceedings of the Conference on Design, Automation and Test in Europe. EDA Consortium, 2013, pp. 464–468.
[17] S. Hua and G. Qu, "Approaching the maximum energy saving on embedded systems with multiple voltages," in Proceedings of the 2003 IEEE/ACM international conference on Computer-aided design. IEEE Computer Society, 2003, p. 26.
[18] M. Curtis-Maury, A. Shah, F. Blagojevic, D. S. Nikolopoulos, B. R. De Supinski, and M. Schulz, "Prediction models for multi-dimensional power-performance optimization on many cores," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques. ACM, 2008, pp. 250–259.
[19] C. Lively, V. Taylor, X. Wu, H.-C. Chang, C.-Y. Su, K. Cameron, S. Moore, and D. Terpstra, "E-amom: an energy-aware modeling and optimization methodology for scientific applications," Computer Science-Research and Development, vol. 29, no. 3-4, pp. 197–210, 2014.
[20] V. Tiwari, S. Malik, A. Wolfe, and M.-C. Lee, "Instruction level power analysis and optimization of software," in VLSI Design, 1996. Proceedings., Ninth International Conference on. IEEE, 1996, pp. 326–328.
[21] R. M. Yoo, C. J. Hughes, K. Lai, and R. Rajwar, "Performance evaluation of intel® transactional synchronization extensions for high-performance computing," in High Performance Computing, Networking, Storage and Analysis (SC), 2013 International Conference for. IEEE, 2013, pp. 1–11.
[22] K. Kumar, P. Rajiv, G. Laxmi, and N. Bhuyan, "Shuffling: a framework for lock contention aware thread scheduling for multicore multiprocessor systems," in Parallel Architecture and Compilation Techniques (PACT), 2014 23rd International Conference on. IEEE, 2014, pp. 289–300.
[23] M. Curtis-Maury, A. Shah, F. Blagojevic, D. S. Nikolopoulos, B. R. De Supinski, and M. Schulz, "Prediction models for multi-dimensional power-performance optimization on many cores," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques. ACM, 2008, pp. 250–259.
[24] X. Feng, R. Ge, and K. W. Cameron, "Power and energy profiling of scientific applications on distributed systems," in Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International. IEEE, 2005, pp. 10–pp.
[25] S. Hua and G. Qu, "Approaching the maximum energy saving on embedded systems with multiple voltages," in Proceedings of the 2003 IEEE/ACM international conference on Computer-aided design. IEEE Computer Society, 2003, p. 26.
[26] C. Lively, V. Taylor, X. Wu, H.-C. Chang, C.-Y. Su, K. Cameron, S. Moore, and D. Terpstra, "E-amom: an energy-aware modeling and optimization methodology for scientific applications," Computer Science-Research and Development, vol. 29, no. 3-4, pp. 197–210, 2014.
[27] R. Child and P. A. Wilsey, "Using dvfs to optimize time warp simulations," in Proceedings of the Winter Simulation Conference. Winter Simulation Conference, 2012, p. 288.
[28] T. Guérout, T. Monteil, G. Da Costa, R. N. Calheiros, R. Buyya, and M. Alexandru, "Energy-aware simulation with dvfs," Simulation Modelling Practice and Theory, vol. 39, pp. 76–91, 2013.
[29] R. M. Yoo, C. J. Hughes, K. Lai, and R. Rajwar, "Performance evaluation of intel® transactional synchronization extensions for high-performance computing," in 2013 SC-International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE, 2013, pp. 1–11.
[30] K. K. Pusukuri, R. Gupta, and L. N. Bhuyan, "Shuffling: a framework for lock contention aware thread scheduling for multicore multiprocessor systems," in Proceedings of the 23rd international conference on Parallel architectures and compilation. ACM, 2014, pp. 289–300.
[31] D. Jagtap, N. Abu-Ghazaleh, and D. Ponomarev, "Optimization of parallel discrete event simulator for multi-core systems," in Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International. IEEE, 2012, pp. 520–531.
[32] M. A. Erazo and R. Pereira, "On profiling the energy consumption of distributed simulations: A case study," in Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing. IEEE Computer Society, 2010, pp. 133–138.
[33] J. Liu, "The prime research," 2007.
[34] R. M. Fujimoto, "Research challenges in parallel and distributed simulation," ACM Transactions on Modeling and Computer Simulation (TOMACS), vol. 26, no. 4, p. 22, 2016.
[35] Y. Wu, J. Cao, and M. Li, "Private cloud system based on boinc with support for parallel and distributed simulation," in Dependable, Autonomic and Secure Computing (DASC), 2011 IEEE Ninth International Conference on. IEEE, 2011, pp. 1172–1178.
[36] A. Biswas and R. Fujimoto, "Profiling energy consumption in distributed simulations," in Proceedings of the 2016 annual ACM Conference on SIGSIM Principles of Advanced Discrete Simulation. ACM, 2016, pp. 201–209.
[37] A. W. Malik, I. Mahmood, and A. Parkash, "Energy consumption of traditional simulation protocol over smartphones: an empirical study (wip)," in Proceedings of the Summer Computer Simulation Conference. Society for Computer Simulation International, 2016, p. 23.
[38] S. Neal, R. Fujimoto, and M. Hunter, "Energy consumption of data driven traffic simulations," in Winter Simulation Conference (WSC), 2016. IEEE, 2016, pp. 1119–1130.
[39] R. M. Fujimoto, M. Hunter, A. Biswas, M. Jackson, and S. Neal, "Power efficient distributed simulation," in Proceedings of the 2017 ACM SIGSIM Conference on Principles of Advanced Discrete Simulation. ACM, 2017, pp. 77–88.
[40] L. Bajaj, M. Takai, R. Ahuja, K. Tang, R. Bagrodia, and M. Gerla, "Glomosim: A scalable network simulation environment," UCLA Computer Science Department Technical Report, vol. 990027, no. 1999, p. 213, 1999.
[41] J. H. Cowie, D. M. Nicol, and A. T. Ogielski, "Modeling the global internet," Computing in Science & Engineering, vol. 1, no. 1, pp. 42–50, 1999.
[42] L. Bononi, M. Bracuto, G. D'Angelo, and L. Donatiello, "Artis: a parallel and distributed simulation middleware for performance evaluation," in International Symposium on Computer and Information Sciences. Springer, 2004, pp. 627–637.
[43] G. D'Angelo, "The simulation model partitioning problem: an adaptive solution based on self-clustering," Simulation Modelling Practice and Theory (SIMPAT), vol. 70, pp. 1–20, 2017. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1569190X16302350
[44] G. D'Angelo and S. Ferretti, "Lunes: Agent-based simulation of p2p systems," in High Performance Computing and Simulation (HPCS), 2011 International Conference on. IEEE, 2011, pp. 593–599.
[45] A. I. McInnes and B. R. Thorne, "Scipysim: towards distributed heterogeneous system simulation for the scipy platform (work-in-progress)," in Proceedings of the 2011 Symposium on Theory of Modeling & Simulation: DEVS Integrative M&S Symposium. Society for Computer Simulation International, 2011, pp. 89–94.
[46] A. Pellegrini, R. Vitali, and F. Quaglia, "The rome optimistic simulator: core internals and programming model," in Proceedings of the 4th International ICST Conference on Simulation Tools and Techniques. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2011, pp. 96–98.
[47] Y. M. Teo and Y. K. Ng, "Spades/java: object-oriented parallel discrete-event simulation," in Simulation Symposium, 2002. Proceedings. 35th Annual. IEEE, 2002, pp. 245–252.
[48] L. Toscano, G. D'Angelo, and M. Marzolla, "Parallel discrete event simulation with erlang," in Proceedings of the 1st ACM SIGPLAN workshop on Functional high-performance computing, ser. FHPC'12. New York, NY, USA: ACM, 2012, pp. 83–92. [Online]. Available: http://doi.acm.org/10.1145/2364474.2364487
[49] G. D'Angelo, S. Ferretti, and M. Marzolla, "Time warp on the go," in Proceedings of the 5th International ICST Conference on Simulation Tools and Techniques, ser. SIMUTOOLS '12. ICST, Brussels, Belgium: ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), 2012, pp. 242–248.
[50] E. Mikida, N. Jain, L. Kale, E. Gonsiorowski, C. D. Carothers, P. D. Barnes Jr, and D. Jefferson, "Towards pdes in a message-driven paradigm: A preliminary case study using charm++," in Proceedings of the 2016 annual ACM Conference on SIGSIM Principles of Advanced Discrete Simulation. ACM, 2016, pp. 99–110.
[51] ROSS, "Rensselaer's optimistic simulation system," https://carothersc.github.io/ROSS, 2017, accessed March 20, 2017.
[52] R. M. Fujimoto, "Parallel discrete event simulation," Communications of the ACM, vol. 33, no. 10, pp. 30–53, 1990.
[53] C. D. Carothers, D. Bauer, and S. Pearce, "Ross: A high-performance, low-memory, modular time warp system," Journal of Parallel and Distributed Computing, vol. 62, no. 11, pp. 1648–1669, 2002.
[54] K. S. Perumalla, "Scaling time warp-based discrete event execution to 10^4 processors on a blue gene supercomputer," in Proceedings of the 4th international conference on Computing frontiers. ACM, 2007, pp. 69–76.
[55] D. W. Bauer Jr, C. D. Carothers, and A. Holder, "Scalable time warp on blue gene supercomputers," in Proceedings of the 2009 ACM/IEEE/SCS 23rd Workshop on Principles of Advanced and Distributed Simulation. IEEE Computer Society, 2009, pp. 35–44.
[56] R. M. Fujimoto, "Performance of time warp under synthetic workloads," 1990.
[57] Allinea, "Allinea-map," http://www.allinea.com/products/map, 2017, accessed April 2, 2017.
[58] Intel, "Intel® soc watch," https://software.intel.com/en-us/node/589913, 2017, accessed March 29, 2017.
[59] W. B. Rouse and K. R. Boff, Organizational Simulation. Wiley-Interscience, 2005.
[60] K. S. Perumalla, Introduction to reversible computing. Chapman and Hall/CRC, 2013.
[61] K. Shenoy, "Techniques for optimizing time-stepped simulations," 2004.
[62] R. Garg, V. K. Garg, and Y. Sabharwal, "Efficient algorithms for global snapshots in large distributed systems," IEEE Transactions on Parallel and Distributed Systems, vol. 21, no. 5, pp. 620–630, 2010.
[63] K. M. Chandy and J. Misra, "Distributed simulation: A case study in design and verification of distributed programs," IEEE Transactions on software engineering, no. 5, pp. 440–452, 1979.
[64] D. R. Jefferson, "Virtual time," ACM Transactions on Programming Languages and Systems (TOPLAS), vol. 7, no. 3, pp. 404–425, 1985.
[65] V. Madisetti, J. Walrand, and D. Messerschmitt, "Wolf: A rollback algorithm for optimistic distributed simulation systems," in Simulation Conference Proceedings, 1988 Winter. IEEE, 1988, pp. 296–305.
[66] J. W. Smith and A. Hamilton, "Massive affordable computing using arm processors in high energy physics," in Journal of Physics: Conference Series, vol. 608, no. 1. IOP Publishing, 2015, p. 012001.
[67] S. Ryu and G. Ganis, "The proof benchmark suite measuring proof performance," in Journal of Physics: Conference Series, vol. 368, no. 1. IOP Publishing, 2012, p. 012020.

FAHAD MAQBOOL is a graduate student at the NUST School of Electrical Engineering and Computer Science, Pakistan. He holds a bachelor's degree in Telecommunication and Networks from COMSATS Institute of Information Technology, Pakistan. His research interests include parallel and distributed simulation and energy-efficient mobile computing.

SYED MEESAM RAZA NAQVI is currently a master's student at NUST, SEECS (School of Electrical Engineering and Computer Science), Pakistan. He has a bachelor's degree in Telecommunication and Networking. His research interests include cloud computing, energy-efficient computing, machine learning and prognostic health management (PHM). He can be reached at snaqvi.mscs15seecs@seecs.edu.pk.

ASAD W. MALIK is an Assistant Professor at the NUST School of Electrical Engineering and Computer Science, Pakistan. His research interests include parallel and distributed simulation, cloud computing, Internet of Things (IoT) and large-scale networks.
