DO NOT COMPLETELY REDESIGN PDES FRAMEWORK FOR SMART DEVICES:

AN EMPIRICAL STUDY TO BENCHMARK EXECUTION TIME AND ENERGY
CONSUMPTION OF ROSS FRAMEWORK AT CORE FUNCTION LEVEL

Fahad Maqbool
Syed Meesam Raza Naqvi
Asad W. Malik

Department of Computing
School of Electrical Engineering and Computer Science (SEECS)
National University of Sciences and Technology (NUST)
Islamabad, Pakistan
(fmaqbool.mscs15seecs, snaqvi.mscs15seecs, asad.malik)@seecs.edu.pk

ABSTRACT
Parallel and distributed simulation systems require time management algorithms to ensure that executions
are synchronized. Without synchronization procedures, it is difficult to utilize the hardware platform
efficiently. Moreover, traditional synchronization algorithms were not designed for mobile devices, and
executing state-of-the-art algorithms on mobile devices consumes significant energy. Before applying energy
optimizations to traditional algorithms, it is therefore beneficial to analyze function-level energy and time
consumption, which can then guide reductions in overall energy consumption. As an insight, energy profiling
at the function level can help researchers modify implementations for energy-constrained devices. In this
paper, we analyze Rensselaer's Optimistic Simulation System (ROSS) and benchmark its energy and time
consumption at the function level. The experimental section reports the function-level energy and time
consumption of the conservative and optimistic synchronization algorithms while the PHOLD benchmark
is executed.
Keywords: ROSS, Performance, PHOLD, Power Profiling, Time Warp

1 INTRODUCTION
In a parallel and distributed simulation system, a simulation is executed on a computing system with multiple
processors. Events that run simultaneously on multiple computing systems need to be synchronized. Syn-
chronization (or time management) algorithms are used to ensure that events are processed in the correct order,
thus guaranteeing repeatability (Fujimoto 2001). Traditionally, two types of synchronization algorithms are
used: conservative and optimistic. Moreover, synchronization primitives on shared objects, such as locks and
waits, are applied in the design of parallel discrete event simulations (PDES) to ensure synchronization. In
most cases, locks are managed by the operating system; therefore, PDES frameworks have limited control
over shared locks.
It is estimated that in high performance computing (HPC) systems, the cost of powering the computing and
HVAC (Heating, Ventilation and Air Conditioning) infrastructure is growing rapidly (Scaramella 2006). Energy
is thus a primary concern, and researchers are exploring ways to make HPC systems power
efficient (Elmore et al. 2016). In parallel and distributed simulation systems, power consumption has become
an area of increasing concern in recent years. One of the reasons is the adoption of IoE (Internet of Everything)
in every domain. Traditional data-driven simulation codes executed on embedded mobile devices fail to
utilize resources efficiently; such data-driven simulations are used for monitoring, simulation and prediction.
In recent years, the computing power of mobile devices has improved substantially, but their usability is
determined by the power consumption of the running applications. In order to use mobile devices for running
parallel discrete event simulations, it is essential to reduce their power consumption. So far, very few articles
focus on understanding and developing techniques to reduce power consumption by eliminating the overheads
of acquiring shared resources.
Underutilized cores can be switched to a low-power mode using Dynamic Voltage and Frequency Scaling
(DVFS) or Dynamic Power Management (DPM) techniques. However, this usually results in increased
execution time and consequently increased energy consumption for a given computation. It should be noted
that decreasing energy usage and minimizing power consumption are not the same thing (Unsal 2003).
Battery-operated mobile devices are energy constrained, so their design requirement is to minimize the total
energy consumed to complete a given computation task. Parallel and distributed simulations that must be
executed within a deadline require both more energy and more power at the same time. The factors that
affect energy consumption include the electrical specification of the underlying hardware and read/write
operations on system memory/ROM. Moreover, the system usage characteristics of applications, i.e., the
complexity of the computation, the amount of inter-process communication and the synchronization algorithm,
are other important factors that can be analyzed and improved. Improving energy consumption without
diminishing performance is challenging, as performance and power consumption are strongly dependent on
one another. Thus, fine-grained profiling can be used to identify where, when and how energy is consumed
by the running application. In this paper, we build an energy profile of ROSS at the function level and also
measure memory usage and other critical parameters as a step towards energy optimization of distributed
simulations. The main reason for selecting the ROSS simulator is that it provides both conservative and
optimistic synchronization algorithms. We used the PIN tool (Intel VTune Amplifier) and Intel SoC Watch
to profile performance and temperature and to measure the amount of time spent on synchronization objects.
Moreover, the Allinea Forge MAP tool is used to profile power, energy and memory consumption. All
statistics are collected while the PHOLD benchmark model is executed with different numbers of logical
processes (LPs).

2 RELATED WORK
There is very limited literature that focuses on energy optimization for distributed simulations. For
high performance computing (HPC) systems, various algorithms, techniques and tools have been developed
to optimize performance and energy consumption. In (Child and Wilsey 2012), the authors presented a
DVFS-based technique to reduce power consumption and enhance the performance of Time Warp simulations.
Guérout et al. (Guérout et al. 2013) presented an overview of energy-aware simulation and described DVFS
simulation tools that can be used to measure power consumption.
Different profiling tools and methods have been developed over time to generate, analyze and visualize
data from functional components; for example, the tuning and analysis utilities developed by (Zhukov et al.
2015), (Knüpfer et al. 2008), (Malony and Shende 2000), (Malony et al. 2004) and (Shende and Malony 2006)
provide instrumentation and performance visualization support for parallel applications.
Isci and Martonosi (Isci and Martonosi 2003) presented an approach to estimate power consumption using
performance logs. Ge et al. (Ge et al. 2010) developed a framework called PowerPack that can be used for
energy profiling and analysis of parallel applications on multi-core processors. For fine-grained parallel
applications like ROSS, communication is a major performance bottleneck; Yoo et al. (Yoo et al. 2013) and
Pusukuri et al. (Pusukuri, Gupta, and Bhuyan 2014) have discussed techniques for improving network
performance by reducing lock contention and overlapping communication.

Biswas and Fujimoto (Biswas and Fujimoto 2016b) discussed techniques to create power profiles of parallel
and distributed simulations in the context of mobile distributed simulations. The energy consumed by
application code and by the simulation engine, and the energy used for computation versus the energy
required for communication, are separated to understand the power consumption of each aspect of the
simulation system. Similarly, Biswas and Fujimoto (Biswas and Fujimoto 2016a) presented a comparative
analysis of the energy consumed by the Chandy/Misra/Bryant algorithm and the YAWNS algorithm. Malik
et al. (Malik, Mahmood, and Parkash 2016) analyzed the energy consumption of an optimistic simulation
protocol on smart phones. Tiwari et al. (Tiwari et al. 1996) described how computing the energy consumed
by each machine instruction can be used to profile the energy consumption of all functional components;
however, that implementation does not separate the energy consumed by computation from that consumed
by communication links.
Jagtap et al. (Jagtap, Abu-Ghazaleh, and Ponomarev 2012) studied the performance of the ROSS simulator
on two different platforms to determine the effects of a multi-threaded implementation on performance,
comparing it with the MPI-based version. Feng et al. (Feng et al. 2005) indicated that power profiles
correspond to the characteristics of the application, and that simulation at large scale can result in higher
power consumption. Procaccianti et al. (Procaccianti et al. 2011) reported that in various scenarios, power
consumption on a general-purpose computer system can increase by 12% to 20%, corresponding to an
overhead of 2 to 12 watts.

3 BACKGROUND
In this section, we briefly cover Parallel Discrete Event Simulation (PDES) and Rensselaer's Optimistic
Simulation System (ROSS), the framework used in this study.

3.1 Parallel Discrete Event Simulation (PDES)

A Parallel Discrete Event Simulation (PDES) consists of the execution of Logical Processes (LPs) on parallel
processors. The LPs are the representative entities in a simulation and store the simulation state. A Discrete
Event Simulation (DES) proceeds by executing events in increasing timestamp order. Executing events in
timestamp order is critical for producing accurate results; out-of-order execution of a simulation program
leads to causality errors (Mikida et al. 2016). Schedulers are therefore a core part of a Discrete Event
Simulation system and directly affect its performance. A DES can be implemented sequentially or using
parallel techniques. In the sequential approach, events are executed in timestamp order in a serial manner.
It is the simplest approach to executing a simulation model, with very little concern for performance. While
there can be many Logical Processes (LPs) in a sequential DES, all events are stored in a single priority
queue ordered by their virtual timestamp and executed one by one as an ordered sequence (ROSS 2016).
On the other hand, Parallel Discrete Event Simulation (PDES) provides the capability to execute a discrete
event simulation program on multiple cores (Fujimoto 1990a). PDES techniques are widely used in computer
science, engineering, military applications, economics, etc. PDES follows two main categories of
synchronization mechanisms: conservative and optimistic. In the conservative approach, causality errors are
strictly avoided by applying strategies that determine when it is safe to process an event. In the optimistic
approach, the simulation proceeds until a causality error is detected; the error is then handled by a rollback
mechanism for recovery, and the rolled-back events are later re-executed.
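As a concrete illustration of the sequential case, the following minimal C sketch (our own illustration, not
code taken from ROSS; all names are hypothetical) executes events from a single priority queue in increasing
timestamp order:

/* Minimal sequential DES loop (illustrative sketch only). Events are kept in
 * a binary min-heap ordered by virtual timestamp and executed one by one.  */
#include <stdio.h>

typedef struct { double ts; int lp; } event_t;   /* timestamp + target LP    */

static event_t heap[1024];
static int     heap_size = 0;

static void push(event_t e) {                    /* sift-up insertion        */
    int i = heap_size++;
    while (i > 0 && heap[(i - 1) / 2].ts > e.ts) {
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = e;
}

static event_t pop(void) {                       /* remove smallest timestamp */
    event_t top = heap[0], last = heap[--heap_size];
    int i = 0;
    for (;;) {
        int c = 2 * i + 1;
        if (c >= heap_size) break;
        if (c + 1 < heap_size && heap[c + 1].ts < heap[c].ts) c++;
        if (heap[c].ts >= last.ts) break;
        heap[i] = heap[c];
        i = c;
    }
    heap[i] = last;
    return top;
}

int main(void) {
    double now = 0.0;
    push((event_t){ 1.0, 0 });                   /* seed the event list       */
    while (heap_size > 0) {
        event_t e = pop();                       /* smallest timestamp first  */
        now = e.ts;                              /* advance virtual time      */
        printf("LP %d handles event at t=%.2f\n", e.lp, now);
        if (now < 5.0)                           /* schedule a follow-up event */
            push((event_t){ now + 1.0, (e.lp + 1) % 4 });
    }
    return 0;
}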

3.2 Rensselaer's Optimistic Simulation System (ROSS)

ROSS is the PDES system used in this study. It is based on the Message Passing Interface (MPI), a parallel
programming paradigm that works by exchanging messages across distributed memory. ROSS is a
high-performance and extremely modular system that requires only a small amount of memory (Carothers,
Bauer, and Pearce 2002). Its modular implementation, use of reverse computation, Kernel Processes and
Fujimoto's GVT (Global Virtual Time) algorithm make it a state-of-the-art simulation system for experimental
studies. In this study, we analyzed the performance and power consumption of the ROSS implementation
under the classical PHOLD simulation model benchmark. While ROSS is based on the optimistic scheduling
approach, it also provides implementations of the conservative and sequential approaches. Our study includes
performance and power consumption analyses of all three approaches.

3.3 PHOLD Model

The PHOLD model is the parallel version of the HOLD model used for the performance analysis of sequential
event list algorithms (Fujimoto 1990b). PHOLD is widely used for benchmarking the performance of PDES
systems; the model contains N fully connected LPs that communicate with each other through messages. In
this study, we used the PHOLD model to analyze the performance of ROSS on a shared-memory
multiprocessor under a saturated workload. We executed ROSS with different numbers of LPs and analyzed
the PHOLD benchmark results; the findings are discussed in the results section.
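To make the workload concrete, the sketch below shows the general shape of a PHOLD-style forward event
handler. It is our own reconstruction based on the function names reported in Tables 1-3 (phold_event_handler,
tw_rand_exponential, tw_event_new, tw_event_send); the signatures are approximations of the ROSS API and
may differ between ROSS versions, tw_rand_integer is recalled from the ROSS API rather than taken from
the tables, and the state and message types are placeholders.

/* Sketch of a PHOLD-style event handler; signatures are approximate and the
 * types are illustrative placeholders, not the actual ROSS/PHOLD sources.  */
#include "ross.h"                       /* assumes a ROSS installation       */

typedef struct { long processed; } phold_state;    /* illustrative LP state  */
typedef struct { long sender;    } phold_message;  /* illustrative payload   */

static double mean_delay = 1.0;         /* mean exponential inter-event time */
static long   total_lps  = 1024;        /* set from the command line in a real model */

void phold_event_handler(phold_state *s, tw_bf *bf, phold_message *m, tw_lp *lp)
{
    (void)bf; (void)m;
    s->processed++;

    /* Timestamp increment: one of the internal rng_gen_val calls that
     * Table 1 attributes to the random distribution functions.             */
    tw_stime offset = tw_rand_exponential(lp->rng, mean_delay);

    /* Random destination among the fully connected LPs.                    */
    tw_lpid dest = (tw_lpid) tw_rand_integer(lp->rng, 0, total_lps - 1);

    /* Allocate and send the new event; these correspond to the
     * tw_event_new and tw_event_send rows in Tables 1-3.                    */
    tw_event *e = tw_event_new(dest, offset, lp);
    tw_event_send(e);
}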

4 METHODOLOGY
In this study, we analyzed ROSS at the function level: we identified each function's execution time and its
overall contribution to the simulation time. Finally, we correlated the time taken by individual functions with
energy consumption for different configurations of logical processes (LPs).

4.1 Experimental Configuration

In this section, we describe the hardware platform used to perform power profiling and performance analysis
of the ROSS system. For this study, we used a 4th generation Intel® Core i7-4790 processor with
8 MB cache from the Intel Haswell family, with 8 GB RAM and a Linux operating system. This processor
consists of 4 cores and 8 threads, with a 3.60 GHz base frequency. It dissipates 84 watts of average power
with all cores actively operating at base frequency under a heavy workload. This value, known as the
Thermal Design Power (TDP), can be used to estimate the relative power consumption of running applications.
The processor also provides thermal monitoring technologies with several thermal management features, and
an on-die Digital Thermal Sensor (DTS) measures each core's temperature. The classic PHOLD model is
used as the benchmark with a variable number of LPs. Each simulation run used a different number of LPs,
ranging from 1024 to 524288, for each of the serial, conservative and optimistic mechanisms.

4.2 Benchmarking Applications

We used different profiling applications (the Allinea MAP tool and Intel SoC Watch) to analyze different
parameters while PHOLD is executed. These tools are briefly described below:
Allinea MAP: Allinea MAP is a powerful profiling tool designed for Fortran, C and C++ (Allinea 2016). It
can profile a wide range of applications, including parallel, single-threaded and multi-threaded applications,
and it works with Pthreads, OpenMP and MPI. Allinea MAP analyzes code for bottlenecks down to the
source-line level and also tracks power, energy and memory consumption. In our study, the Allinea MAP
tool is used to observe the power, energy and memory consumption of ROSS for different numbers of LPs
while executing the PHOLD benchmark.
Intel® SoC Watch: Intel SoC (System on Chip) Watch is an energy profiler under the umbrella of Intel
System Studio (a set of data collection tools) (Intel 2016). It provides detailed reports containing system
states (C, P and GT states), showing idle-state CPU data for the running application and hardware resources.
This utility is used to analyze the temperature of the CPU cores during execution of the ROSS simulation.
The results obtained are discussed in the next section.

5 RESULTS AND DISCUSSIONS


5.1 Allinea Map (Power and Energy Profiles)

Power and energy are related concepts; we define energy as the product of power and time, so as simulation
time increases, energy consumption also increases. Energy is measured in joules.

Energy = Power × Time (1)

Power is defined as the energy consumed per unit time (the rate of energy consumption), as given in
Equation 2. Power is measured in watts.
Power = Energy/Time (2)
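As a rough, illustrative sanity check (our own back-of-the-envelope estimate, not a measured result),
Equation 1 bounds the energy of a run by the processor's 84 W TDP: the 34.8-second serial run of
Section 5.2 could consume at most about 84 W × 34.8 s ≈ 2923 J, and the measured energy is lower
whenever the average power stays below the TDP, as it does for the single-core serial executions.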
This section contains the results for CPU power and energy consumption obtained from Allinea Forge's
MAP utility. Relative to other components, CPU power consumption is the most significant portion of the
overall power consumption of a system; it is the sum of the electrical energy used by the CPU during
computation and the energy dissipated in the form of heat. Figure 1 depicts the CPU power and energy
consumption for serial, parallel conservative and parallel optimistic execution with 1024 and 2048 LPs. The
parallel conservative simulation draws more power than the optimistic approach, while the CPU power
consumption of the serial version is very low compared to both parallel versions because only a single core
is used throughout execution. The energy usage of the optimistic execution is the highest due to its longer
execution time and higher resource consumption (parallel execution).

Figure 1: CPU Power and CPU Energy consumption for PHOLD Model Simulation on 1024 and 2048 LPs

Figure 2: CPU Power and CPU Energy consumption for PHOLD Model Simulation on 32768 LPs

Figure 2 shows the CPU power and energy consumption for serial, parallel conservative and parallel
optimistic execution of the PHOLD model with 32768 LPs. The execution time of the serial version is higher,
which is why it consumes almost the same amount of energy as the conservative simulation. The energy
consumption is highest for the optimistic approach. The power consumption of the serial execution is still
very low compared to the conservative and optimistic executions. Figure 3 shows the power and energy
consumption for 262144 and 524288 LPs for the three simulation versions. Here the trend remains almost
the same.

Figure 3: CPU Power and CPU Energy consumption for PHOLD Model Simulation on 262144 and 524288
LPs

5.2 Allinea Map (Functional Level Execution Time)

In this section, we discuss the results concerning the time spent in each function in the serial, conservative
and optimistic simulation runs. Further, we present the CPU power and energy consumption over the
complete simulation run. Results are obtained and presented as the number of logical processes is varied over
1024, 2048, 32768, 262144 and 524288. In all simulations, a linear mapping is used between logical
processes and physical processes. In all three simulation techniques, the total number of events is kept
constant. Detailed results of the percentage of time taken by each function for the serial, parallel conservative
and parallel optimistic versions of the simulation are shown in Tables 1, 2 and 3, respectively.
Table 1: Functional Level Execution Time for Serial Version of ROSS
Number of LPs
Functions (% Time) 1024 2048 32768 262144 524288
tw_run 100 99.9 99.9 100 99.9
tw_scheduler_sequential 100 99.9 99.9 100 99.9
phold_event_handler 89 86 76 77 80
tw_rand_exponential 35 33 23 22 21
tw_event_new 21 21 20 19 20
rng_gen_val 14 14 15 19 18
tw_event_send 12 11 13 11 15
Others 7 7 5 6 6
Others 11 14 24 23 20
Others 0 0.1 0.1 0 0.1
Execution Time (sec) 34.8 72.3 1556.2 3730.2 31381.4

Table 2: Functional Level Execution Time for Parallel Conservative Version of ROSS
Number of LPs
1024 2048 32768 262144 524288
Functions (% Time) Total MPI Total MPI Total MPI Total MPI Total MPI
tw_run 99.7 43 99.8 40 99.9 32 100 27 100 27
tw_scheduler_conservative 99.5 43 99.8 40 99.9 32 100 27 100 27
phold_event_handler 59 14 64 14 55 12 58 10 58 11
tw_event_send 21 14 22 14 18 12 19 10 21 11
tw_rand_exponential 15 - 16 - 13 - 12 - 11 -
tw_event_new 14 - 15 - 14 - 11 - 12 -
rng_gen_val 6 - 8 - 8 - 13 - 12 -
Others 3 - 3 - 2 - 3 - 2 -
tw_net_read 17 15 17 14 16 12 16 11 16 11
service_queues 17 15 17 14 16 12 16 11 16 11
Others 0 - 0 0 0 0 0 0 0 0
tw_gvt_step2 14 14 9 12 10 8 6 6 6 5
MPI_Allreduce 13 13 8 11 10 8 5 5 5 5
Others 1 1 1 1 0 - 1 1 1 -
Others 9.5 - 9.8 - 18.9 - 20 - 20 -
Others 0.3 - 0.2 - 0.1 - 0 - 0 -
Execution Time (sec) 21.4 42.2 770.3 7212.1 14542.8

Table 1 shows the time consumption (percentage) of each function for sequential execution. The total
execution time is 34.8 seconds for 1024 LPs and 31381.4 seconds for 524288 LPs. The tw_scheduler family
of functions is responsible for event processing, managing memory and computing virtual time;
tw_scheduler_sequential is the variant executed in the case of serial execution. It is noted that
phold_event_handler takes a smaller share of time as the number of LPs increases because of
tw_rand_exponential, while rng_gen_val takes more time to execute. One reason
may be that some ROSS random distribution functions make multiple internal calls to the rng_gen_val
function. Moreover, the wait time for all functions in the serial version is no more than 0.1 second. As the
number of LPs increases, memory allocation increases, but memory wastage decreases steadily.
In the case of the parallel conservative simulation, Table 2 characterizes the time spent executing each
function, further broken down into compute time and MPI calls. The total execution time is 21.4 seconds for
1024 LPs and 14542.8 seconds for 524288 LPs. A higher MPI percentage means more time spent in MPI
calls such as MPI_Send, MPI_Reduce and MPI_Barrier. As the number of LPs increases, there is a
considerable reduction in the share of time spent in MPI calls. In distributed simulation, the Global Virtual
Time (GVT) is the earliest time tag associated with the unprocessed pending events; the GVT acts as a
barrier for rollbacks, since no process can roll back to a timestamp smaller than the GVT value (Rouse 2005).
The same improving trend is observed for the GVT computations as the number of LPs increases.
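The role of the GVT as a rollback barrier also determines when memory can be reclaimed. The following
minimal C sketch (our own illustration, not ROSS code; all names are hypothetical) shows the
fossil-collection rule implied by this definition: any checkpoint older than the GVT can never be rolled back
to again and can therefore be freed.

/* Illustrative fossil collection: state saved before the GVT can be freed,
   because no LP can ever roll back below the GVT (hypothetical names).     */
#include <stddef.h>

typedef struct saved_state {
    double              ts;        /* virtual time of this checkpoint        */
    struct saved_state *next;      /* next (newer) checkpoint in the list    */
} saved_state;

/* Walk the checkpoint list from oldest to newest and release everything
   strictly older than the current GVT; return the new list head.           */
saved_state *fossil_collect(saved_state *oldest, double gvt,
                            void (*release)(saved_state *))
{
    while (oldest != NULL && oldest->ts < gvt) {
        saved_state *dead = oldest;
        oldest = oldest->next;
        release(dead);             /* e.g., return the block to a free pool  */
    }
    return oldest;                 /* first checkpoint at or above the GVT   */
}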
Table 3: Functional Level Execution Time for Parallel Optimistic Version of ROSS
Number of LPs
1024 2048 32768 262144 524288
Functions (% Time) Total MPI Total MPI Total MPI Total MPI Total MPI
tw_run 99.8 37 99.9 38 99.9 26 99.9 23 99.9 23
tw_scheduler_optimistic 99.6 37 99.7 38 99.9 26 99.9 23 99.9 23
tw_sched_batch 60 13 58 13 54 8 52 7 50 7
phold_event_handler 52 13 51 13 39 8 37 7 38 7
tw_event_send 18 13 18 13 13 8 13 7 13 7
tw_event_new 11 - 11 - 9 - 8 - 8 -
rng_gen_val 7 - 5 - 7 - 7 - 8 -
tw_rand_exponential 13 - 14 - 9 - 8 - 7 -
Others 3 - 3 - 1 - 0 - 1 -
tw_gvt_step2 18 11 20 12 28 8 28 7 29 8
tw_pe_fossil_collect 7 0 8 0 19 - 20 0 20 0
MPI_Allreduce 10 10 11 11 8 8 7 7 7 7
Others 1 1 1 1 1 0 0 0 2 1
tw_net_read 20 12 21 13 17 9 20 8 21 9
service_queues 20 12 21 13 17 9 20 8 21 9
test_q 9 2 9 2 9 0.8 12 1.2 13 1
recv_begin 11 10 12 11 9 8 8 7.1 8 7
Others 0 - 0 - 0 0.2 0 0 0 1
tw_kp_rollback_to 1.4 0.6 0.8 0.2 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1
tw_event_rollback 1.2 0.6 0.8 0.2 <0.1 <0.1 <0.1 <0.1 <0.1 <0.1
Others 0.2 - 0 - 0 0 0 0 0 0
Others 0 - 0 - 0 0 0 0 0 1
Others 1.6 - 0.7 - 0.9 1 0 1 0 0
Others 0.2 0 0.1 0 0.1 0 0.1 0 0.1 0
Execution Time (sec) 24.7 50.5 1197.9 10401.3 21598.1

The results for the parallel optimistic and conservative executions of the PHOLD model show that the
threads' waiting time on synchronization objects (locks and waits) is higher for the optimistic execution than
for the conservative execution. This is due to the occurrence of rollbacks, which increase the total number of
events to be processed (scheduling overhead). The difference between the waiting times of the conservative
and optimistic models is 80 seconds on average. However, the average CPU usage of the parallel optimistic
execution is better than that of the other approaches.
Table 3 presents the execution time breakdown for the optimistic simulation with different numbers of LPs.
The total execution time is 24.7 seconds for 1024 LPs and 21598.1 seconds for 524288 LPs. Here, we can
pinpoint the function tw_gvt_step2, which takes considerably more time as the number of LPs increases; this
is because of rollbacks and reverse computation overhead. As it takes more time to execute, it consumes
more energy. The GVT-related time for the optimistic simulation is slightly lower than for the conservative
approach, because the conservative approach performs more GVT computations to avoid causality errors.
Rollbacks occur in the optimistic execution when a causality error is detected; a primary rollback transitively
propagates secondary rollbacks to other LPs in order to revert the effects of previously sent messages
(Perumalla 2013).
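To make the rollback mechanism concrete, the following sketch (our own Time Warp-style illustration, not
ROSS code; restore_checkpoint and send_antimessage are hypothetical helpers that are only declared here)
shows how a straggler event in an LP's past triggers a primary rollback and how anti-messages cancel
previously sent events, causing secondary rollbacks at their destinations:

/* Illustrative Time Warp rollback sketch (hypothetical names, not ROSS code). */
typedef struct sent_record {
    double              send_ts;   /* virtual time at which the event was sent */
    int                 dest_lp;   /* destination LP of the original message   */
    struct sent_record *next;
} sent_record;

typedef struct {
    double       now;              /* LP's current virtual time                */
    sent_record *sent;             /* output queue of previously sent events   */
} lp_state;

/* Assumed helpers (prototypes only): restore the most recent checkpoint taken
   at or before `ts`, and deliver an anti-message cancelling a sent event.    */
void restore_checkpoint(lp_state *lp, double ts);
void send_antimessage(int dest_lp, double send_ts);

/* A straggler event older than the LP's clock violates causality: roll the LP
   back and cancel every message that was sent "from the future".             */
void on_event(lp_state *lp, double event_ts)
{
    if (event_ts < lp->now) {                       /* causality error detected */
        restore_checkpoint(lp, event_ts);           /* primary rollback         */
        for (sent_record *r = lp->sent; r; r = r->next) {
            if (r->send_ts > event_ts)              /* sent after the straggler */
                send_antimessage(r->dest_lp, r->send_ts);  /* secondary rollbacks */
        }
        lp->now = event_ts;
    }
    /* ...process the event and advance lp->now as usual...                   */
}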

5.3 Intel® SoC Watch (Temperature Profiles)

This section presents the results for the average CPU temperature under the serial, parallel conservative and
parallel optimistic approaches. As Figure 4 shows, the CPU temperature remains lower in the serial approach
than in the conservative and optimistic approaches for different numbers of LPs. The reason is that serial
execution utilizes only a single core. In the serial execution, the minimum temperature of a core is 36.1°C
and the maximum temperature is 54.4°C. For the conservative simulation, the minimum and maximum core
temperatures are 52°C and 73.2°C, respectively. Similarly, in the optimistic simulation, the minimum and
maximum temperatures of a CPU core are 52.5°C and 73.2°C. On average, the maximum temperature of the
parallel versions is 18.8°C higher than that of the serial version, which is why a major portion of the energy
is dissipated as heat in both the optimistic and conservative executions.

Figure 4: CPU Temperature Statistics for Serial, Conservative and Optimistic Execution of PHOLD Model
on Different Number of LPs

6 CONCLUSION
With the introduction of the Internet of Things (IoT), small mobile devices are contributing significantly to
all kinds of systems and applications. In the current era, a main focus of researchers is to propose new
algorithms and techniques that can be used to run simulations on power-constrained devices. In this paper,
we have analyzed an existing simulation framework with the belief that knowing its energy and time
consumption at the function level can help modify the implementation for power-constrained devices instead
of designing new frameworks. Another motivation is that a number of frameworks already exist, and it is
difficult to re-implement all of them for portability. Therefore, we have benchmarked the ROSS framework
in terms of energy and execution time at the function level. In the future, based on this study, we intend to
adopt different techniques to reduce the execution time of these functions in order to port ROSS to small,
energy-constrained devices.

REFERENCES
Allinea 2016. “Allinea”. http://www.allinea.com/products/map. Accessed Oct. 10, 2016.
Biswas, A., and R. Fujimoto. 2016a. “Energy consumption of synchronization algorithms in distributed
simulations”. Journal of Simulation, pp. 1–11.
Biswas, A., and R. Fujimoto. 2016b. “Profiling Energy Consumption in Distributed Simulations”. In Pro-
ceedings of the 2016 annual ACM Conference on SIGSIM Principles of Advanced Discrete Simulation,
pp. 201–209. ACM.
Carothers, C. D., D. Bauer, and S. Pearce. 2002. “ROSS: A high-performance, low-memory, modular Time
Warp system”. Journal of Parallel and Distributed Computing vol. 62 (11), pp. 1648–1669.
Child, R., and P. A. Wilsey. 2012. “Using DVFS to optimize time warp simulations”. In Proceedings of the
Winter Simulation Conference, pp. 288. Winter Simulation Conference.
Elmore, R., K. Gruchalla, C. Phillips, A. Purkayastha, and N. Wunder. 2016. “Analysis of Application Power
and Schedule Composition in a High Performance Computing Environment”. Technical report, NREL
(National Renewable Energy Laboratory), Golden, CO, United States.
Feng, X., R. Ge, and K. W. Cameron. 2005. “Power and energy profiling of scientific applications on dis-
tributed systems”. In 19th IEEE International Parallel and Distributed Processing Symposium, pp. 34–
34. IEEE.
Fujimoto, R. M. 1990a. “Parallel discrete event simulation”. Communications of the ACM vol. 33 (10), pp.
30–53.
Fujimoto, R. M. 1990b. “Performance of Time Warp under synthetic workloads”. In Proceedings of the
1990 SCS Multiconference on Distributed Simulation, pp. 23–28.
Fujimoto, R. M. 2001. “Parallel simulation: parallel and distributed simulation systems”. In Proceedings of
the 33rd conference on Winter simulation, pp. 147–157. IEEE Computer Society.
Ge, R., X. Feng, S. Song, H.-C. Chang, D. Li, and K. W. Cameron. 2010. “Powerpack: Energy profiling and
analysis of high-performance systems and applications”. IEEE Transactions on Parallel and Distributed
Systems vol. 21 (5), pp. 658–671.
Guérout, T., T. Monteil, G. Da Costa, R. N. Calheiros, R. Buyya, and M. Alexandru. 2013. “Energy-aware
simulation with DVFS”. Simulation Modelling Practice and Theory vol. 39, pp. 76–91.
Intel 2016. “Intel® SoC Watch”. https://software.intel.com/en-us/node/589913. Accessed Dec. 2, 2016.
Isci, C., and M. Martonosi. 2003. “Runtime power monitoring in high-end processors: Methodology and
empirical data”. In Proceedings of the 36th annual IEEE/ACM International Symposium on Microar-
chitecture, pp. 93. IEEE Computer Society.
Jagtap, D., N. Abu-Ghazaleh, and D. Ponomarev. 2012. “Optimization of parallel discrete event simulator
for multi-core systems”. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th
International, pp. 520–531. IEEE.
Knüpfer, A., H. Brunst, J. Doleschal, M. Jurenz, M. Lieber, H. Mickler, M. S. Müller, and W. E. Nagel. 2008.
“The vampir performance analysis tool-set”. In Tools for High Performance Computing, pp. 139–155.
Springer.
Malik, A. W., I. Mahmood, and A. Parkash. 2016. “Energy consumption of traditional simulation pro-
tocol over SmartPhones: an empirical study”. In Proceedings of the summer Computer Simulation
Conference- SCSC 2016, ISBN: 978-1-5108-2424-9, Number 23, pp. 23:1 – 23:4, ACM.
Malony, A. D., and S. Shende. 2000. “Performance technology for complex parallel and distributed sys-
tems”. In Distributed and parallel systems, pp. 37–46. Springer.
Malony, A. D., S. Shende, R. Bell, K. Li, L. Li, and N. Trebon. 2004. “Advances in the TAU performance
system”. In Performance analysis and grid computing, pp. 129–144. Springer.
Mikida, E., N. Jain, L. Kale, E. Gonsiorowski, C. D. Carothers, P. D. Barnes Jr, and D. Jefferson. 2016.
“Towards PDES in a Message-Driven Paradigm: A Preliminary Case Study Using Charm++”. In Pro-
ceedings of the 2016 annual ACM Conference on SIGSIM Principles of Advanced Discrete Simulation,
pp. 99–110. ACM.
Perumalla, K. S. 2013. Introduction to reversible computing. Chapman and Hall/CRC.
Procaccianti, G., L. Ardito, M. Morisio et al. 2011. “Profiling power consumption on desktop computer
systems”. In International Conference on Information and Communication on Technology, pp. 110–
123. Springer.
Pusukuri, K. K., R. Gupta, and L. N. Bhuyan. 2014. “Shuffling: a framework for lock contention aware
thread scheduling for multicore multiprocessor systems”. In Proceedings of the 23rd international con-
ference on Parallel architectures and compilation, pp. 289–300. ACM.
ROSS 2016. “Rensselaer’s Optimistic Simulation System”. https://carothersc.github.io/ROSS. Accessed
Sep. 20, 2016.
Rouse, W. B., and K. R. Boff. 2005. Organizational Simulation. Wiley-Interscience.
Scaramella, J. 2006. “Worldwide server power and cooling expense 2006-2010 forecast”. Market analysis,
IDC Inc.
Shende, S. S., and A. D. Malony. 2006. “The TAU parallel performance system”. International Journal of
High Performance Computing Applications vol. 20 (2), pp. 287–311.
Tiwari, V., S. Malik, A. Wolfe, and M. T.-C. Lee. 1996. “Instruction level power analysis and optimization
of software”. In Technologies for wireless computing, pp. 139–154. Springer.
Unsal, O. S. 2003. System-level power-aware computing in complex real-time and multimedia systems. Ph.
D. thesis, University of Massachusetts Amherst.
Yoo, R. M., C. J. Hughes, K. Lai, and R. Rajwar. 2013. “Performance evaluation of Intel® transactional
synchronization extensions for high-performance computing”. In 2013 SC-International Conference for
High Performance Computing, Networking, Storage and Analysis (SC), pp. 1–11. IEEE.
Zhukov, I., C. Feld, M. Geimer, M. Knobloch, B. Mohr, and P. Saviankou. 2015. “Scalasca v2: Back to the
Future”. In Tools for High Performance Computing 2014, pp. 1–24. Springer.

AUTHOR BIOGRAPHIES
FAHAD MAQBOOL is a graduate student at NUST School of Electrical Engineering and Computer Sci-
ence, Pakistan. He holds a bachelor’s degree in Telecommunication and Networks from COMSATS Institute
of Information Technology, Pakistan. His research interests include parallel and distributed simulation and
energy efficient mobile computing. His email address is fmaqbool.mscs15seecs@seecs.edu.pk.
SYED MEESAM RAZA NAQVI is currently a master's student at NUST, SEECS (School of Electrical
Engineering and Computer Science), Pakistan. He has a bachelor's degree in Telecommunication and
Networking. His research interests include cloud computing, energy efficient computing, machine learning
and prognostic health management (PHM). He can be reached at snaqvi.mscs15seecs@seecs.edu.pk.
ASAD W. MALIK received his PhD in Computer Software Engineering from NUST, Pakistan. He is an
Assistant Professor at NUST School of Electrical Engineering and Computer Science, Pakistan. His research
interests include parallel and distributed simulation, cloud computing, and large-scale networks. His email
address is asad.malik@seecs.edu.pk.
