
A Study of the Performance of a Cloud Datacenter Server

Khaleel Mershad, Hassan Artail, Senior Member, IEEE, Mazen A.R. Saghir, Senior Member, IEEE, Hazem Hajj, and Mariette Awad

K. Mershad, H. Artail, H. Hajj, and M. Awad are with the Department of Electrical and Computer Engineering, American University of Beirut, Beirut 1107 2020, Lebanon. E-mail: {kwm03, hartail, hh63, ma162}@aub.edu.lb.
M.A.R. Saghir is with the Electrical and Computer Engineering Program, Texas A&M University at Qatar, Doha, Qatar. E-mail: mazen.saghir@qatar.tamu.edu.

Abstract—In a previous work, we presented a system which combines active solid state drives and reconfigurable FPGAs (which we
called reconfigurable active SSD nodes, or simply RASSD nodes) into a storage-compute node that can be used by a cloud datacenter
to achieve accelerated computations while running data intensive applications. To hide the complexity of accessing RASSD nodes
from applications, we proposed in another work a middleware framework which handles all low-level interactions with the hardware.
The Middleware Server (MWS), which manages a group of RASSD nodes, has the role of bridging the connection between a client and
the nodes. In this paper, we present extensions to the MWS to enable it to operate within a collaborative cloud environment, and we
develop a model to evaluate the performance of the collaborative MWS. This model represents a study of the utilization of three
hardware resources of the MWS: CPU, memory, and network interface. For each, we derive the parameters that affect its operations,
and propose formulas for its utilization. The results describe the capacity of an MWS, and hence can be used to decide on the number of
MWSs in a collaborative cloud datacenter.

Index Terms—Cloud computing, big data, cloud collaboration, middleware, FPGA, hardware acceleration

1 INTRODUCTION

CLOUD providers who use distributed systems to handle Big Data analysis must cope with stringent and evolving requirements, mostly concerning fast system response. In [1], we proposed a reconfigurable active solid state drives (RASSD) system which deals with such requirements by employing FPGAs connected to SSDs that operate as processing nodes deployed near various data locations. This setup, through which FPGAs are connected directly to SSDs, avoids the data transfer delays associated with network connections. The FPGA is used to reconfigure the hardware to suit the requirements of the data-intensive processing applications. The design proposed in [1] assumes that data is generated and collected in dispersed locations through third-party applications, and then stored on RASSD nodes, which form an integrated platform that may contain hundreds or thousands of nodes physically located in distributed geographical sites, each representing a cloud datacenter.

In order to interface client-level applications with the RASSD system, we proposed in [2] a middleware framework which hides the complexity of accessing data from user applications, and allows application developers to focus instead on the application business logic. Hence, a high level application, such as a mobile cloud healthcare service, needs only to interface with the middleware through a set of application programming interfaces (APIs) that determine the services the middleware provides and the underlying functionality of the low-level hardware. These APIs are compatible with existing cloud APIs, and are designed for interoperability between various cloud platforms. The middleware abstracts the distribution and security requirements of the low-level system modules and makes them appear as though the application is interfacing with a local centralized system.

In order to achieve high performance in terms of response time and functionality, the middleware manages the data processing on the RASSD nodes through special pieces of code (drivelets) that are sent by the Middleware Server (MWS) to FPGAs, along with FPGA hardware configuration files (bitstreams). Another important responsibility of the proposed middleware architecture lies in the automatic management of applications' flows, where the MWS uses an intelligent script-parsing mechanism to turn one general request from the client into a sequence of operations that are executed via RASSD-specific commands sent to FPGAs to generate the required results. The design proposed in [2] illustrated how the MWS configures and accelerates the RASSD hardware to suit the needs of demanding applications. The MWS interprets the application workflow and requests, maps each operation to hardware configurations, and processes jobs on the reconfigurable FPGA nodes.

By using hardware accelerators, a datacenter administrator can take advantage of the existence of RASSD nodes and the possibility of executing bitstreams to enhance the overall performance of the datacenter. For example, a conventional datacenter may be receiving several jobs that render it

heavily loaded. In such a scenario, it might be obliged to


postpone the execution of some jobs in order to avoid fail-
ures or crashing. For such cases, hardware accelerators can
play a critical role in releasing much needed computing
resources that can be used to serve other jobs in the queue.
Generally, using hardware accelerators to execute suitable
tasks benefits the datacenter from multiple perspectives.
First, jobs that are run as hardware accelerators finish faster,
and will therefore improve the overall throughput of the
datacenter. Second, moving a processing task from general-
purpose computing nodes to hardware accelerators will
result in improved energy efficiency, which reduces the
operating costs. Third, adding RASSD nodes to run hard-
ware accelerators can reduce the number of general-pur-
pose computing resources needed in the datacenter. This in
turn reduces capital, operating, and maintenance costs.
We presented in [2] two use-cases (Epidemic monitoring and k-means clustering) that illustrate two data-intensive applications which benefit from the RASSD MWS. Moreover, we described our implementation of a RASSD MWS prototype, which consisted of a server running the MWS project code and connected to a set of Xilinx Virtex 6 FPGAs, and presented the results of integrating and running the two mentioned use-cases on the implemented prototype. That work however lacked a mathematical model that analyzes the various performance aspects of the MWS. In this paper, we fill this gap by presenting an analytical model which focuses on analyzing the utilization of the MWS from three different perspectives: CPU usage, memory storage, and network bandwidth consumption. In our analysis, we study the various parameters that affect each resource, including the average size of jobs submitted by users, the number of users' requests, the number of MWSs in the system, the average number of FPGAs, etc. We deduce the various loads on the MWS under different scenarios, and accordingly derive a theoretical measure of the capacity of the system.

To our knowledge, the work in this paper is the first attempt to analyze the performance of a cloud datacenter that supports hardware acceleration on FPGAs. An analysis related to cloud datacenter performance was provided in [3]. It was however focused on analyzing the total delay and cost of a Hadoop MapReduce job, and described the dataflow and cost information at the fine granularity of phases within the Map and Reduce tasks. The authors divided the Map task into five different phases: Read, Map, Collect, Spill and Merge; and the Reduce task into four phases: Shuffle, Merge, Reduce and Write. For each of the nine phases, the delay and cost (resource usage) were estimated. The analysis depended on three sets of parameters: job input data, resources available in the Hadoop cluster (e.g., CPU, I/O), and configuration parameters (both cluster-wide and job-level). The authors then used the derived estimates to deduce the total MapReduce job execution time. Compared to [3], our model is more general, as it estimates the average utilization of a datacenter's compute-nodes controller rather than the job itself, and is more focused and simpler, as we use a much smaller set of parameters, as we illustrate in Section 3.

Next, we present a brief description of the RASSD system with a more detailed coverage of the MWS, while focusing on the elements that we use in our analysis.

2 OVERVIEW OF THE RASSD SYSTEM

The RASSD system, as detailed in [2] and built upon in [10], is composed of three layers: application, middleware, and hardware. The application layer represents the client applications that issue requests. The middleware layer abstracts the low-level details of the RASSD hardware and enables data-intensive applications to use these devices through a set of APIs to achieve high levels of performance. The hardware layer consists of the geographically-distributed RASSD nodes that store and process data. The application layer includes the data-intensive application along with the Client Middleware (CLM), the middleware layer comprises the Middleware Server, while the hardware layer contains the RASSD nodes. Each MWS is connected to a group of RASSD nodes via a LAN (i.e., geographically collocated), and each RASSD node comprises one FPGA board connected to one or more SSD devices over a PCIe interconnect. Applications running on PCs and smartphones can be clients desiring to run different data processing tasks. Fig. 1 provides an overall picture of the system.

Fig. 1. High level view of the distributed RASSD system.

Data-intensive applications usually include several complex functionalities and tasks for pre-processing, classifying, processing and/or post-processing the data. For each application, each task is mapped to a drivelet (C code) that implements the required tasks. Drivelets are parameterized software modules designed to run on the RASSD's FPGA MicroBlaze microprocessor to accomplish data processing functions on identified data groups stored on the RASSD nodes' SSDs. Some parts of the drivelets may represent time-consuming functions, where the highest percentage of the drivelet time is spent. These functions can be turned into hardware accelerators (bitstreams) that exploit the reconfigurable FPGA logic fabric to customize computations and achieve significant speedups. For details on integrating drivelets and bitstreams within the RASSD OS, the reader can refer to [2].

2.1 Middleware Components
The RASSD MWS plays the role of a mediator between the cloud application and the RASSDs. The middleware's responsibilities can be summed up in the following tasks:

- Wait for new requests from clients
- Process the requests and prepare the jobs to be performed by the RASSD nodes
- Delegate jobs to the appropriate RASSD nodes
- Keep track of the different jobs being processed
- Send data sharing requests to other MWSs (when such requests are received from FPGAs), and process requests received from other MWSs
- Aggregate the results and send them to the clients
- Keep track of the different "alive" (operating) nodes in the system and the data residing on them.

The middleware design comprises three main entities: the Client Local Middleware (CLM), the Middleware Server, and the Data Site Schema (DSS). Fig. 3 depicts the middleware architecture. The CLM is the middleware entity that is in direct contact with the cloud application, and resides on the client machine, constantly listening for new client requests. For every request, the CLM generates a corresponding RASSD job file, and contacts the Data Site Schema to get the needed information about the distribution of the job data on the RASSD nodes. The CLM next sends the necessary commands to the concerned MWSs, indicating the processing needed on the specified data, and waits for the results. Once the CLM receives the results, it aggregates them (if more than one MWS is involved) and sends the final results to the application. This aggregation is a second-level aggregation, as the MWSs are also responsible for aggregating data from the RASSD nodes they are connected to.

The MWS is designed to serve many clients simultaneously, and MWSs are typically deployed on dedicated machines that are geographically close to their RASSD nodes. Upon receiving a request, an MWS contacts the DSS to get information about the RASSD nodes on which the concerned data resides, and then assigns the job's commands to them while supplying them with the IDs of the input data. The results that are obtained by the MWS from the different nodes are aggregated at the MWS level and then sent to the requesting CLM. While this scenario depicts the general role of the MWS, the next section details several possible complete scenarios, where more complex MWS operations are described.

Finally, the Data Site Schema is composed of the databases that contain the metadata and information needed to locate the needed data files. The DSS is therefore used to guide the CLMs in locating the concerned MWSs, to inform the MWSs about the RASSD nodes holding the needed data, and to bind the processing functionalities in the client's job to their corresponding "drivelet" and "bitstream" files. Hence, the DSS is involved in every preparatory step of the jobs to be sent to the MWSs and to the RASSD nodes. However, it should be noted that caching at both the CLM and MWS levels is employed to reduce the number of trips to the DSS sites.

2.2 Middleware Collaboration
The RASSD MWS is part of a collaborative cloud framework (which we call co-Cloud) comprising one or more datacenters, and hence, the MWS may also need to communicate with MWSs of other datacenters. This extends the capabilities of a datacenter in satisfying demanding user applications and accessing data that may not be found locally. It also expands the pool of hardware accelerators available to each datacenter to be the union of all accelerators found in the different datacenters.

The co-Cloud framework, which is illustrated in Fig. 2, comprises datacenters cooperating to provide integrated services to clients, and therefore includes components that enable communications among them for distributing client jobs and for aggregating the results before sending them to the clients. As shown in the figure, co-Cloud comprises several datacenters, each representing a standalone network. The co-Cloud network is managed by a Cloud Master, which is responsible for receiving requests from clients, preparing and distributing jobs among Cloud Middleware Servers, monitoring the execution of jobs and the communications between MWSs, sending reports and results to the client, analyzing runtime statistics, invoking the generation of new hardware acceleration bitstream files (hardware accelerators), and managing the utilization of bitstream files based on the needs of client jobs and the availability of accelerators.

Fig. 2. Components of the co-Cloud framework.

As shown in Fig. 2, each datacenter includes:

- an MWS, which, in addition to the functions that were described earlier, plays the role of a Job Handler responsible for communicating with MWSs of other datacenters for sharing job data and intermediate results among different datacenters. It also periodically updates the Cloud Master about the status of the datacenter.
- an Index Node that maintains metadata and locations of the data files stored in the datacenter.
- a Bitstream Node that holds the hardware accelerator bitstream files previously used in that datacenter.
- an FPGA Driver, which acts as a Bitstream Handler responsible for loading bitstreams into selected FPGAs and initiating the execution of hardware accelerators.
- several Workstations, each of which contains a Task Handler and a Data Store that cooperate to execute a certain processing function on specific data. Workstations process tasks that do not require hardware acceleration.
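To make the datacenter composition above easier to picture, the following minimal sketch models the listed components as plain data types. All class and field names are our own illustrative assumptions; the paper does not publish the implementation's types.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Workstation:
    # A Task Handler plus a local Data Store, for tasks needing no acceleration.
    data_store: Dict[str, bytes] = field(default_factory=dict)

@dataclass
class Datacenter:
    mws_address: str                  # the MWS, which also acts as the Job Handler
    index_node: Dict[str, List[str]]  # file ID -> metadata/locations in this datacenter
    bitstream_node: Dict[str, bytes]  # accelerator name -> previously used bitstream file
    fpga_driver_address: str          # Bitstream Handler that loads and launches bitstreams
    workstations: List[Workstation] = field(default_factory=list)

@dataclass
class CoCloud:
    cloud_master_address: str         # receives jobs, splits them, monitors execution
    datacenters: List[Datacenter] = field(default_factory=list)
```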

As stated earlier, data files will exist in various datacenters, and hence when executing a global job, each MWS will need to know the part of the job data it should process. For this, all Index Nodes will hold duplicate copies of the data lists stored in all datacenters. The Cloud Master will use these lists to assign to each datacenter the portion of the data it is responsible for working on. The assignment is done according to two criteria: the data presence in the datacenters, and the resources available in each one. During a job execution, each MWS handles its part of the job in parallel with other MWSs.

When the Cloud Master receives a job from a client, it adds the job to a queue, prompting a thread TE to query the Index Node for the datacenters on which the data related to the job resides, to split the job into tasks, and to distribute these tasks according to the hardware resources and data availability in each datacenter. Each hardware accelerator has its own metadata stored in the Bitstream Node, containing statistics obtained from previous executions of the accelerator about the amount of software and hardware resources it consumed. In case the data resides in a datacenter that does not have sufficient hardware resources, thread TE will attempt to find another datacenter that contains the required resource, and if it does, it moves the data to it. When data is to be moved from a datacenter C1 to another datacenter C2, TE specifies this data as part of the sub-jobs that are sent to C1 and C2. Once the sub-job of C1 reaches the point where the data from C2 is needed, its MWS will contact that of C2 to organize the data transfer process, which could occur while other parts of the job are being executed. The Job Handler of C1 is the element responsible for distributing the received data to the Workstations and SSDs within C1. More details about the operations of TE are found in Section 2.4.

In case there might be intermediate results that need to be moved from a datacenter (e.g., C1) to another (e.g., C2), the workstation of C1 (which produces the results) will need to prompt its Job Handler to notify the MWS of C1, which in turn contacts the MWS of C2, and afterwards sends it the data packets. Next, the MWS of C2 forwards the packets to its Job Handler, which in turn sends the data to C2's Workstations and SSD drives.

2.3 Middleware Operations

2.3.1 At the CLM
The CLM determines from the request the application it is servicing. Each application has a script stored on the DSS, depicting the overall set of operations that could be performed within this application. The client application describes the job as a set of pseudo-code commands, where each command is a task in the job, and sends the pseudo-code to the CLM. The CLM runs a tool that produces the job's flow file from the pseudo-code commands by transforming each command into one or more operations in the flow. The manner in which the flow detection mechanism works provides application developers with flexibility, since the generated flow scripts can vary from one job to another, even within the same application. An example of the first three lines of a flow is as follows:

Operation_ID; Number_of_Input_Files; Input_File_1; Input_File_2; . . .; Output_File
Initialization; 1; State_simple.txt; Out_of_(1).txt
Obtain_Visits; 2; Out_of_(1).txt; graph_simple.txt; Out_of_(2).txt

In this flow script example, the first line indicates that the application developer is providing the operation to be performed (name of the function) and the input and output files' names. The CLM knows from each flow line the set of input files that will be used in the corresponding operation. Next, the CLM divides the job into several sub-jobs, where each one is targeted towards a single MWS, as follows: for each operation in the job flow, the CLM gets from the DSS the RASSD nodes that have the input data files. The CLM defines the set of MWSs that are connected to one or more RASSDs that have data of operation O1, and adds O1 to the sub-job of each MWS in the set. That is, the sub-job of an MWS M1 combines all operations that are partially or totally executed on one or more RASSD nodes connected to M1. After creating the sub-jobs of each MWS on which part of the job will be processed, the CLM sends the sub-jobs to their MWSs. We note that intermediate data produced by one sub-job might be the input to another sub-job. Hence, MWSs need to cooperate and share results to produce the final results. Also, in many cases, a certain sub-job S1 might be an intermediate step to another sub-job S2, meaning that S2 will wait until S1 has finished and will continue after that from where it stopped. Hence, the final results of S2 will be returned to the CLM, while the results produced by S1 will be discarded after being consumed by S2.
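To illustrate the flow-file format shown above, here is a small parsing sketch. It assumes semicolon-separated fields laid out as in the example (operation ID, number of input files, the input files, then the output file); the function and type names are hypothetical, not from the CLM implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FlowOperation:
    op_id: str          # e.g., "Initialization" or "Obtain_Visits"
    inputs: List[str]   # input file names for this operation
    output: str         # output file name, e.g., "Out_of_(1).txt"

def parse_flow_line(line: str) -> FlowOperation:
    """Parse one flow line: Operation_ID; N; Input_1; ...; Input_N; Output_File."""
    fields = [f.strip() for f in line.split(";")]
    n_inputs = int(fields[1])
    return FlowOperation(op_id=fields[0],
                         inputs=fields[2:2 + n_inputs],
                         output=fields[2 + n_inputs])

# The two example flow lines from the text:
ops = [parse_flow_line("Initialization; 1; State_simple.txt; Out_of_(1).txt"),
       parse_flow_line("Obtain_Visits; 2; Out_of_(1).txt; graph_simple.txt; Out_of_(2).txt")]
```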

2.3.2 At the MWS
Once an MWS receives a sub-job, it proceeds to preparing the tasks to be executed on the RASSD nodes. Each task performed by an FPGA is represented at the MWS as an FPGA command. Hence, the MWS transforms each flow operation in the sub-job into a list of FPGA commands. Then the MWS combines the FPGA commands of each RASSD node together to produce the list of commands that will be executed by this RASSD node (more on this in the next section). Each RASSD node is associated with a queue at the MWS, which stores all the jobs of this node.

One of the MWS responsibilities is keeping track of the nodes involved in the processing of a given job, and making sure it receives the results from these nodes before it aggregates them. The aggregation depends on the type of application that is being handled, and is determined with the help of the DSS. There are scenarios however where the results of a job are saved on the RASSD nodes instead of being shipped to the MWS. This is needed when more than one processing operation should be performed on certain data consecutively, in which case the node is instructed to save the results until it is instructed by the MWS to perform another operation on them. Eventually the results would be sent back to the MWS to be aggregated with other data. The MWSs send status reports to the CLM to enable it to keep track of the status of each operation in the flow file, and perform aggregation once it receives the final results from all MWSs.

Fig. 4. Examples of sequential and parallel FPGA commands.

2.4 MWS Functional Threads
We now delve deeper into the functionality of the MWS in order to make the description of the model easier to understand. We refer to the main thread that executes on the MWS as TC. It runs all the time and continuously listens to requests from CLMs. When TC receives a request, it opens a new thread Tr and hands over the request to it. Tr will be responsible for all operations related to the request, including communicating with the CLM.

Each Tr will do the following tasks (a code sketch of the step-4 dispatch logic appears further below):

1. Create a list of FPGA commands by dividing the request into tasks, each of which will be executed on an FPGA via an FPGA command. Some tasks are sequential, while others are parallel. Some will start immediately, while others will wait until other tasks finish.
2. Determine from the DSS the RASSD node which contains the data of each task. Hence, each task will be assigned to the RASSD node that contains its data. If a task depends on intermediate results produced by a given RASSD node, Tr will assign the task to this node.
3. Next, Tr prepares the list of tasks for each RASSD node. Some of the involved commands could execute in parallel on the RASSD node, while others could execute sequentially with other commands and with sets of commands in parallel areas (Fig. 4).
4. Tr creates a thread Tn for each node and sends its list of commands to Tn. All Tn threads run in parallel. The following summarizes the operations of a Tn thread:
   a) Tn sends sequential commands to the RASSD node one after another (as the RASSD finishes a command and replies with an ACK to Tn, the latter sends the next command). If the commands that follow are supposed to run in parallel, Tn first sends the number of the commands in the set to the RASSD node, and then adds a flag to each command that it sends to the node, one after another. When the RASSD node receives such a command, it starts executing it, and as it receives the next command, it opens a new thread and executes the command in it, and so on for all commands in the parallel set. When all parallel commands finish, the node sends an ACK to Tn.
   b) If Tn receives a NACK for a given command, it resends the command to the RASSD node, and if the error persists, Tn will just forget about it. Whenever Tn receives a reply, it forwards it to Tr, which in turn uses this information to update reports to the client about the progress of the job. Depending on whether NACKs are received or not, Tr could abort or re-execute one or more parts of the job.

In addition to the Tr and Tn threads, each MWS will run a separate thread TE that will listen to requests from other MWSs for data exchange, as was discussed in Section 2.3. If for some reason a RASSD node tries to execute a command that it does not have data for, it sends to Tn a request that includes the ID of the required data (e.g., a file name or the results of another task in the job), thus prompting Tn to forward the command to Tr, which contacts the DSS to find the location of the data, if the data comes from outside the job. If the required data is the result of another task ta in the job, Tr will search the sub-job flow file to find the RASSD node to which the required task (ta) was assigned. If ta was assigned to a RASSD node Rn that is connected to this same MWS, Tr examines the list of update reports from Rn to see if ta was executed. If yes, Tr sends the location of ta's data to Tn, which fetches the data and sends it to the RASSD node that requested it. If ta has not been executed yet, Tr asks Tn to wait (along with the RASSD node) until ta is executed and the data becomes ready. If ta is executed by a RASSD node that is connected to another MWS, Tr will contact the DSS to find the concerned MWS, then it will connect to the TE of that MWS and send to it the ID of ta. Once ta is executed on the other MWS, its TE will send the data to Tr, which will send it to the RASSD node.

Each Tn that finishes saves its results and sends a notification to Tr, waits for an ACK from Tr, and then closes. When the last Tn finishes, Tr aggregates the results that were saved by all Tn into one final result (the final result of one MWS), and sends it to the CLM. Along with the results, Tr will send the final status report. In some cases, Tr will

start aggregating the results as soon as it receives the first two from the first two finished nodes. This however depends on the type of the aggregation function and whether it is possible to aggregate results gradually or all results should be present in order for aggregation to start.
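The Tn behavior described in steps 4a and 4b above can be sketched as follows. The transport callables (send_to_node, wait_reply, notify_tr) are hypothetical placeholders for the MWS's internal-network I/O and the Tn-to-Tr channel; the retry policy (one resend on NACK, then give up) follows the text.

```python
from typing import Callable, Dict, List

def tn_dispatch(groups: List[Dict],
                send_to_node: Callable[[Dict], None],
                wait_reply: Callable[[], str],
                notify_tr: Callable[[str], None]) -> None:
    """Sketch of a Tn thread. 'groups' alternates sequential and parallel
    command sets, e.g., {"parallel": False, "commands": [cmd1, cmd2]}."""
    for group in groups:
        if not group["parallel"]:
            for cmd in group["commands"]:
                reply = ""
                for _ in range(2):           # original send plus one resend on NACK
                    send_to_node(cmd)
                    reply = wait_reply()     # node ACKs when the command completes
                    if reply == "ACK":
                        break
                notify_tr(reply)             # Tr updates the client's progress report
        else:
            # Announce the size of the parallel set, then send flagged commands;
            # the node runs each in its own thread and ACKs once when all finish.
            send_to_node({"parallel_count": len(group["commands"])})
            for cmd in group["commands"]:
                send_to_node({**cmd, "parallel_flag": True})
            notify_tr(wait_reply())
```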
Another continuous thread at each MWS is TM, which monitors the status of each RASSD node that is connected to the MWS. When a RASSD node is processing one or more tasks, it sends to TM continuous heartbeats that contain metadata of the data files that reside on the RASSD nodes and the description of each file, along with the resources that are currently used by the RASSD node plus its free resources. TM extracts useful reports from the heartbeats and saves them in the DSS, where they will be kept for a short time (e.g., a few hours) in order to generate statistics about the loads on the RASSD nodes.

3 MWS UTILIZATION MODEL

Given the critical role of the MWS, we evaluate its ability to provide reliable services. The resources that influence a server's operations are: memory, processor, and network. It was stated in [5] that for smooth server operations: 1) memory utilization must be below 85 percent to avoid page faults and swap operations, 2) processor utilization must stay below 75 percent to make room for the kernel and other software to operate with no effect on the server operations, and 3) network utilization should be kept under 50 percent to prevent queuing delays at the network interface. Before proceeding with the analysis, we summarize in Table 1 the symbols that we define in the rest of this paper, along with those few that we already defined.

TABLE 1
Symbols Used in the Analysis and Their Definitions

Symbol      Definition
Rn          RASSD node
Tf          transmission delay
Tjob        MWS total time for a single sub-job execution
λ           number of arriving requests (requests/sec)
μN          number of requests that the network can handle
μp          number of requests the processor can serve per second
c           thread context switch time
fur         update report frequency
MADC        size of an aggregation data chunk
MADR        size of an aggregation result
MRNCL       size of a RASSD node command list file
MSJFF       size of a sub-job flow file
MT          MWS total available memory
Mu          total memory used by the MWS
Magg        size of memory required for aggregation
NAL         number of aggregation levels
NBSJ        number of bytes generated by the MWS per sub-job
Ncpt        number of commands per task
NMWS        number of MWSs in the system
NODP        average number of data pieces produced by a task
NP          average number of concurrently served users
Nt          number of tasks per job
NRN         number of FPGAs connected to one MWS
NTED        number of tasks per job requiring external data
NTED/MWS    average number of tasks from other sub-jobs requiring local data
NTED-MWS    average number of tasks within a sub-job requiring external data
O1          operation type
Pet         probability of an error occurring while executing a task
S1          sub-job of an MWS
SDS         size of packets TX/RX by the MWS for data sharing, per job
SDSRP       size of a data sharing request packet
SIl         traffic size between MWS and DSS for tasks needing external data
Ssur        size of a single update report (UR)
Std         traffic size between MWS and DSS for data locations
SOD         size of output data created by a job
SID         size of input data to a job
ta          time to access a random database tuple
Ta          average access time for a single database tuple
Tagg        average single sub-job total aggregation time
TC          main MWS thread
Tcm         time to send a command
Tdsh        MWS delay for data sharing of a single sub-job
TDSS        time to get data locations of the sub-job tasks from the DSS
TE          thread that listens to requests from other MWSs
Terr        total time for error management for a single sub-job
TFPGA       average time to send all commands in a sub-job
Tj          total time to execute a job
TM          monitoring thread at each MWS
Tn          per-node thread created by Tr
TPRL        time to parse a sub-job (flow file)
Tr          main thread for the sub-job at the MWS
TrDSRP      delay to receive and forward a DSRP
Tsagg       time for a single aggregation
TsDSRP      delay to send a DSRP packet
ρM          memory utilization
ρN          network utilization
ρP          processor utilization

In our analysis, we make a few assumptions in order to carry out the analysis and arrive at representative scalability measures. First, even though certain jobs that target our system consume few resources and finish quickly while others consume more resources and take a longer time to finish, we will give all sub-jobs (fragments of jobs that run on individual FPGAs) the same priority, and set the time it takes a sub-job to complete to the average service time. These assumptions are consistent with those in [6], and they are reasonable because even if the jobs vary in service time, the sub-jobs are expected to vary much less.

The second assumption that we make is that we model the processor (CPU) and network performances using queuing theory. Considering processor performance, it is well established that an M/G/1-RR (round robin) queuing model is suitable [6]. It is designed for round-robin systems (like operating systems) and is generic, as it requires the mean and variance without the full distribution of the service time. This model assumes that requests to the processor follow a Poisson distribution. The distribution of the inter-arrival time between requests (clients' jobs in our system) is exponential with mean rate λ requests/sec. Since requests are assumed to have the same priority with low size variations, the queuing model is reduced to M/G/1-PS (processor sharing). The special features of PS are its simplicity, no requirement of knowledge of job sizes, and fairness (in particular, the expected response time of a job is directly proportional to its size) [14]. With respect to the network utilization, the requests (clients' jobs) to the network card can be modeled by a Poisson process, where the service time is constant, basically equal to the transmission delay, so an M/D/1 queuing model is appropriate to be applied [7].

We assume that at full utilization, the processor can serve μp requests per second. Thus, by queuing theory and
Little's Theorem, the processor utilization is ρP = λ/μp. The memory utilization ρM is the amount of memory used by the server, Mu, divided by the total memory MT: ρM = Mu/MT. Finally, the network utilization ρN is the number of arriving requests λ over the number that can be handled, μN: ρN = λ/μN, where the maximum number of requests that can be handled is equal to the number of requests that will consume all the network bandwidth.
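These three definitions, together with the operating limits from [5], can be captured in a few lines; this is a sketch with argument names of our choosing, not code from the system.

```python
def mws_health(lam: float, mu_p: float, m_used: float, m_total: float,
               mu_n: float) -> dict:
    """rho_P = lambda/mu_p, rho_M = Mu/MT, rho_N = lambda/mu_N."""
    rho = {"cpu": lam / mu_p, "memory": m_used / m_total, "network": lam / mu_n}
    # Smooth-operation limits from [5]: CPU < 0.75, memory < 0.85, network < 0.5.
    rho["smooth"] = rho["cpu"] < 0.75 and rho["memory"] < 0.85 and rho["network"] < 0.5
    return rho
```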
3.1 Memory Allocated
The MWS will allocate memory for sub-job configuration and error management, metadata of I/O data to sub-job tasks, periodic heartbeat data, and data aggregation. The components that constitute the significant core are the sub-job configuration memory and the data aggregation memory. The remaining components pose negligible memory utilization. For this reason, we only carry out the analysis for sub-job configuration and data aggregation and arrive at a representative expression.

The MWS will reserve a certain space in memory for aggregating the intermediate results of sub-jobs. Note that this memory is not equal to the size of the intermediate results, since these are saved on SSDs, and the aggregation functions do not need to transfer them all to memory. Rather, the MWS transfers a chunk of data to memory, does the aggregation, and then transfers the next chunk, and so on. Hence the size of memory depends on the size of the aggregation chunk and the aggregation results. After aggregating a new chunk, the new aggregation result produced overwrites the previous aggregation result.

Suppose that the average size of an aggregation data chunk is equal to MADC bytes, and the average size of an aggregation result (intermediate or final) is equal to MADR; then the total memory for an aggregation operation is equal to MADC + MADR. Each sub-job might require a different number of aggregation levels, according to the structure of the job. Most simple jobs require a single aggregation level, while more complex jobs with a hierarchical distribution of tasks might require two or more aggregation levels. In general, jobs that require only a single Reduce stage will need a single aggregation level to aggregate the Reduce results. On the other hand, jobs that require multiple Reduce stages, such as jobs that calculate the shortest path in a graph, require multiple aggregation levels depending on the number of Reduce stages. Suppose that the average number of aggregation levels over all jobs is equal to NAL; then the total memory for aggregation, and approximately the memory utilization on the MWS, can be expressed as:

Mu ≈ Magg = NP × NAL × (MADC + MADR),   (1)

where NP is the average number of sub-jobs that will be running simultaneously, which can be calculated using the processor utilization queuing model. In [9], it was proved that an expression for the average number of served concurrent users can be found by using the average number of requests in the processor, and it is NP = λ/(μP − λ).

3.2 MWS Communications
We begin the network utilization analysis by reviewing the MWS operations that involve communications with external entities (CLMs and other MWSs). We exclude the communications with RASSD nodes since they are made on a separate internal network. For each studied operation or process, we determine the total number of bytes that are sent or received by the MWS per sub-job.

3.2.1 Sub-Job Flow File and Data Sharing
Initially, the MWS receives from a CLM a sub-job flow file whose size is MSJFF. Hence, the MWS first receives MSJFF bytes from the CLM. Next, when the CLM divides the job into sub-jobs, it tries to group the tasks within each sub-job so they all require data in one of the FPGAs connected to the MWS assigned this sub-job. This can be achieved to a certain extent, but in many cases, the input data to a task is not known. Hence, when the MWS prepares the lists of commands, it can determine the number of tasks that require data from an external source. If the average number of tasks per job that require external data is NTED, and considering a uniform distribution of the job among the various MWSs, we can deduce that the average number of tasks within a sub-job that require external data is equal to NTED-MWS = NTED/NMWS. Also, the average number of tasks from other sub-jobs that will depend on data from this MWS (supposing that dependencies are uniform among all MWSs) is also equal to NTED/MWS = NTED/NMWS. In other words, NTED-MWS tasks within the sub-job will require data saved on RASSDs of other MWSs, and NTED/MWS tasks executed by RASSDs of other MWSs will require data from the RASSDs of this MWS. Hence, we can say that for each job, each MWS will send NTED-MWS data sharing request packets to the TE of other MWSs, and the TE of this MWS will receive NTED/MWS data sharing request packets from the TE's of other MWSs.

Assuming that the size of a data sharing request packet (DSRP) is SDSRP, the total size of packets sent and received by an MWS for data sharing per job is equal to:

SDS = 2 × SDSRP × (NTED / NMWS).   (2)

3.2.2 Update Reports to Clients
The Tr thread, which is the main thread for the sub-job at the MWS, will frequently combine the tasks' update reports from all RASSD nodes and send a general update report (UR) to the CLM. Suppose the update report frequency is fur, and the average size of a single UR is Ssur bytes; and noting that the UR will contain the ID, description, and status of each task that is running or has newly finished at an FPGA, the size of an update report will depend on the number of tasks per job (Nt). We will use this information to estimate Ssur later on. From the stated description, we deduce that the total size of UR packets sent by the MWS to the CLM is equal to fur × Ssur bytes/second. To calculate the total size of data sent due to update reports for the whole job, we define the total job execution time Tj. Hence, the average number of UR bytes per job sent from an MWS to the CLM is:

SUR = fur × Ssur × Tj bytes.   (3)
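For a sense of the magnitudes involved, substituting the default evaluation values given later in Tables 2 and 3 (Nt = 1,000, NMWS = 100, NTED = 0.7 Nt, SDSRP = 128 bytes, fur = 0.02 updates/sec, Ssur = 10 KB taken as 10,000 bytes, and Tj = 1,800 sec) into (2) and (3) gives:

```latex
S_{DS} = 2 \times 128 \times \frac{0.7 \times 1000}{100} = 1792 \ \text{bytes},
\qquad
S_{UR} = 0.02 \times 10000 \times 1800 = 360000 \ \text{bytes} = 360 \ \text{KB}.
```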

3.2.3 Locations of Tasks' Data
After the MWS receives the sub-job from the CLM, it assigns each task (i.e., flow) in the sub-job to one or more FPGAs which have access to the task data. If the MWS is caching the data location(s) (as we explained in Section 2.1), it can determine directly the IDs of those FPGAs; else, it needs to contact the DSS to find the data locations. We will consider the worst-case scenario, in which the MWS is not caching task data locations, causing it to send a request to the DSS with the description of each task and receive a reply with the description and location of the data, or with the description alone if the location is not known. If the description and the location each occupy a single string, then for each task in the sub-job, the MWS and the DSS will exchange three strings of data, or 12 × 3 = 36 bytes. As we stated before, the average number of tasks per sub-job is Nt/NMWS, and if we consider the size of packet headers, which is 20 bytes, then the data location request packet will have a size of 20 + 12 Nt/NMWS bytes and the corresponding reply packet will have a size of 20 + 24 Nt/NMWS bytes. Hence, the total size of the packets exchanged between the MWS and DSS for data locations is:

Std = (40 + 36 Nt/NMWS) bytes.   (4)

Another case in which the MWS contacts the DSS is when it needs to know the locations of data for intermediate tasks that require data from an external source (e.g., another MWS), which we previously described in Section 3.2.1. In such cases, the identity of the data that is required by these tasks is not known until their execution. Similar to (4), the MWS and DSS will exchange three strings of data for each such task. The number of tasks that require external data, from Section 3.2.1, is NTED. Hence, the total size of the packets exchanged between the MWS and the DSS for such tasks is equal to:

SIl = (40 + 36 NTED/NMWS) bytes.   (5)

From the previous derivations, we can deduce that the average total number of bytes generated (sent or received) by the MWS due to a single sub-job is equal to:

NBSJ = MSJFF + SDS + SUR + Std + SIl bytes.   (6)

In order to derive the network utilization, we deduce from (6) that the total bandwidth required by a single sub-job is equal to (8/1,000) NBSJ Kbps. If we consider a total available bandwidth of B Kbps, then the network interface can serve a maximum of μN = 1,000B/(8 NBSJ) requests per second. Therefore, the network utilization will be:

ρN = (8 λ NBSJ)/(1,000 B).   (7)

3.3 Time Required for Various Sub-Job Tasks
In order to calculate the processor utilization, we need to find the percentage of the processor time that is used in executing sub-jobs. First, we calculate the average time spent by the processor in executing a single sub-job, which is the total time in which the processor is busy with executing a certain task or operation related to the sub-job. As we did in the previous sections, we divide the execution of the sub-job into separate parts, and calculate the average time for executing each part. We consider the time required for opening the new threads of the sub-job (Tr and the Tn's) and sending the lists of commands to the Tn threads negligible. The first delay we calculate is the time required to get the location of each task from the DSS.

3.3.1 Time to Acquire Task Locations
We assume that the DSS saves the tasks' descriptions and data locations in a dedicated database, and the tasks are indexed according to the application and then according to the task operation (for example, Operation_ID in Section 2.3.1). Hence, we can deduce that the time needed to access the locations of tasks is the average time needed to access a database tuple (using the indexes), multiplied by the average number of tasks per sub-job [11]. Suppose that the average time needed to access a single database tuple is Ta, and the average time needed to transfer a packet (request or reply) from the MWS to the DSS (or vice versa) is Tf; then the total time for the MWS to get the locations of the data of the sub-job tasks from the DSS is:

TDSS = (Nt/NMWS) Ta + 2Tf.   (8)

3.3.2 Time to Process FPGA Command Lists
After the MWS receives the locations of data required by each task, it uses this information to create the list of commands for each FPGA. We denote by TPRL the average time needed to parse the sub-job (flow file) and the information received from the DSS regarding the task data locations, in order to distribute the tasks among FPGAs and create the list of commands for each FPGA.

3.3.3 Time to Send Commands to FPGAs
We define the time needed to send a single command to an FPGA as the time between the instant Tn receives an ACK from the FPGA, which indicates that the FPGA is waiting for the next command, and the instant at which Tn finishes putting the next command on the internal network line that connects it to the FPGA. Note that this is different from the delay of executing the command at the FPGA, since the Tn thread sends the command and waits for an ACK to send the next command. However, the MWS can execute other tasks while waiting for the FPGA's reply. Hence, the only concerned delay at the MWS is that of sending the command to the FPGA.

Suppose the average time to send a command is equal to Tcm, and the average number of commands per task is equal to Ncpt; then the average time to send all commands in a sub-job equals the average number of tasks per sub-job times the average number of commands per task, times the average time to send a single command:

TFPGA = (Nt/NMWS) × Ncpt × Tcm.   (9)
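As a quick numeric check using the default values from Tables 2 and 3 (Nt = 1,000, NMWS = 100, Ncpt = 10, Tcm = 10 µs), and keeping Ta and Tf symbolic, (8) and (9) evaluate to:

```latex
T_{DSS} = \frac{1000}{100}\,T_a + 2T_f = 10\,T_a + 2T_f,
\qquad
T_{FPGA} = \frac{1000}{100} \times 10 \times 10^{-5}\,\text{s} = 1 \ \text{ms}.
```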

3.3.4 Time to Handle Errors
The MWS caches previously encountered errors and their solutions. However, whenever a new error that is not cached occurs, the MWS needs to send the error description to the DSS and then receive and apply the solution. Suppose that the probability of an error occurring while executing a certain task is Pet, and that in order to solve an error, Tr contacts the DSS to check for the error and its solution; then each time an error occurs, Tr sends an error packet to the DSS, waits for the database to retrieve the solution, and waits to get the packet. Hence, the total time to handle a single error can be averaged as Ta + 2Tf, making the total time for error management for a single sub-job:

Terr = (Nt/NMWS) × Pet × (Ta + 2Tf).   (10)

3.3.5 Time Required for Data Transfer
In case one of the FPGAs connected to MWS M1 needs data from another MWS M2, the TE of M1 will send a data sharing request packet to M2 and handle the data transfer process. In case an FPGA of M2 needs data that is saved in one of the FPGAs of M1, the TE of M1 will receive and process the request. Suppose that the average delay to send a DSRP is TsDSRP and the average delay to receive and forward a DSRP is TrDSRP + TsDSRP; then, using the derivation we made in Section 3.2.1 about the number of tasks that require external data being equal to the number of external tasks that require local data, both equal to NTED/NMWS, the delay experienced by the MWS for data sharing of a single sub-job can be derived:

Tdsh = TsDSRP × (NTED/NMWS) + (TrDSRP + TsDSRP) × (NTED/NMWS)
     = 2 × TsDSRP × (NTED/NMWS) + TrDSRP × (NTED/NMWS)   (11)
     = (2TsDSRP + TrDSRP) × (NTED/NMWS).
3.3.6 Aggregation Time
Suppose that a single aggregation (Section 3.1) requires an average time equal to Tsagg, the number of tasks that produce output data is Ot = 0.9 Nt/NMWS, the average number of data pieces produced by a task is NODP, and the average size of memory needed to save a single data piece is MDP; then the average size of the output data produced by a single task is equal to NODP × MDP. Now, for a single aggregation level, the total size of data to be aggregated is equal to Ot × NODP × MDP. From Section 3.1, the average size of an aggregation data chunk is equal to MADC, and so the number of chunks to be aggregated is (Ot × NODP × MDP)/MADC. From this, we derive the average total aggregation time for a single sub-job as:

Tagg = NAL × ((Ot × NODP × MDP)/MADC) × Tsagg,   (12)

where NAL is the number of aggregation levels, as defined in Section 3.1. Combining the results from Sections 3.3.1 to 3.3.6, the total time an MWS spends in executing a single sub-job can be approximated as:

Tjob = TDSS + TPRL + TFPGA + Terr + Tdsh + Tagg + c(k + 1),   (13)

where c is the context switch time of a thread, i.e., the average time needed to store and restore the state so that execution can be resumed later from the same point [12], and k is defined in Section 3.4 as k = pc NRN. From (13), the maximum number of sub-jobs that can be served by the MWS per second is equal to μp = 1/Tjob. Hence, we deduce the equation for the processor utilization as:

ρp = λ (TDSS + TPRL + TFPGA + Terr + Tdsh + Tagg + c(k + 1)).   (14)

Having calculated the main parameters of the three utilization factors in (1), (7), and (14), we examine the elements of each equation, aiming at defining the constants and the variables. Some of these parameters either have a specified constant value or can be assigned an average value. We depend on previous literature works and the experiments we performed using the RASSD prototype in [2] to determine the values of such parameters.
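A small sketch assembling (13) and (14) from the component delays of Sections 3.3.1-3.3.6; the parameter names mirror the paper's symbols, while the functions themselves are our own illustrative composition, not published code.

```python
def t_job(t_dss: float, t_prl: float, t_fpga: float, t_err: float,
          t_dsh: float, t_agg: float, c: float, k: float) -> float:
    """Eq. (13): average MWS processing time for one sub-job, in seconds."""
    return t_dss + t_prl + t_fpga + t_err + t_dsh + t_agg + c * (k + 1)

def rho_p(lam: float, tjob: float) -> float:
    """Eq. (14): processor utilization. Since mu_p = 1/Tjob, rho_p = lam * Tjob."""
    return lam * tjob
```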

3.4 Variables and Constants
Here we explain how we determined the values of the parameters used in the evaluation of the analysis which we present in Section 4.1, and then present in Tables 2 and 3 a summary of these values. The values of the fixed parameters were obtained experimentally, as part of the prototypical evaluation we present in Section 4.2.

TABLE 2
Values of Constant Analysis Parameters

Param    Description                                       Value
Ta       time to access a random database tuple            46 ms
Tf       transmission delay                                negligible
TPRL     time to parse a sub-job (flow file)               4 ms
Tcm      time to send a command                            10 µs
Pet      prob. of error occurring while executing a task   0.05
TsDSRP   delay to send a DSRP packet                       4 µs
TrDSRP   delay to receive and forward a DSRP               8 µs
NTED     # tasks/job requiring external data               0.7 Nt
NAL      number of aggregation levels                      1.5
Tsagg    time for a single aggregation                     10 ms
c        thread context switch time                        30 µs
MRNCL    size of a RASSD node command list file            10 KB
MADC     size of an aggregation data chunk                 110 KB
MADR     size of an aggregation result                     110 KB
SDSRP    size of a data sharing request packet             128 bytes
fur      update report (UR) frequency                      0.02 updates/sec
Ssur     size of a single UR                               10 KB

We start with the parameters whose values were computed experimentally based on averages derived from running three real-world applications that we describe in Section 4.2. The system settings are described in that same section, and a related point is the fact that custom counters were implemented using the Performance Monitor, which gave time in numbers of CPU cycles and allowed for measuring time at the nanosecond level. The time to access a random database tuple (ta) had an average value of 2.2 milliseconds, while the time needed to access a DSS record averaged 46 ms. In this regard, it is noteworthy to point out that these figures are consistent with the results reported in [8].

We found the average processing time needed to generate the FPGA command lists from a random sub-job to be about 4 ms, and this is the value we assign to TPRL, whereas the time to send a command (Tcm) was around 10 microseconds. As for the average time to transmit a packet (TsDSRP), it was close to 4 microseconds, and the one to receive a packet (TrDSRP) was 8 µs. With regard to Tsagg, we found that it takes an average of 10 ms to aggregate a pair of data chunks with sizes averaging 110 KB each. On the other hand, the average context switch of a thread, c, was inferred to have a value of 30 µs. Finally, the average total time to execute a job, Tj, varied from 1 to 10,000 seconds, with an average of 1,800 seconds.

The number of tasks requiring external data, NTED, varied among the three applications that we implemented. In the matrix multiplication (MM) application, the required data was readily located, while in the string matching (SM) and K-means applications more than one level or layer of execution was required. The average value was NTED = 0.7 Nt. As for the aggregation memory, Magg, our system always aggregates intermediate results in pairs of data chunks. The collected statistics illustrated that a chunk averaged 110 KB in size, and the result of the aggregation (joining records) had a similar size. Hence, MADC = MADR = 110 KB. With respect to the error probability Pet, the records collected from the experiments we conducted for the evaluation in Section 4.2 indicated that nearly 3 percent of the commands returned errors. Because of this we give Pet a value of 0.03. It is worth mentioning in this regard that the errors related mostly to communication issues with the RASSD nodes, data problems, and timeouts.

The remaining parameters are related to network utilization and are calculated as follows: the data sharing request packet, which contains the packet headers, the ID of the MWS, and metadata of the requested file, had a size equal to 128 bytes (i.e., SDSRP = 128 bytes). For the update reports, the MWS sent update reports across all three applications at a rate equal to 0.02. Hence, fur = 0.02. On the other hand, the size of the update report, which contains the IDs and status of running tasks, had an average size close to 10 KB. Therefore, Ssur = 10 KB.

TABLE 3
Values of Variable Analysis Parameters

Param    Description                              Range           Default
Nt       number of tasks per job                  1-10,000        1,000
NMWS     number of MWSs in the system             10-1,000        100
Ncpt     number of commands per task              1-100           10
SOD      size of output data created by a job     1 KB-100 MB     1 MB
SID      size of total input data to a job        1 KB-1 GB       10 MB
NRN      # FPGAs connected to one MWS             1-30            5
MT       MWS total available memory               0.5 GB-100 GB   6 GB
Tj       total time to execute a job              1-10,000 sec    1,800 sec

Next, we tackle the parameters whose values we vary for the sake of studying the performance of the MWS analytically in different scenarios. We state each parameter and the range within which its value is varied, plus its default value. The first variable is Nt, which we will vary between 1 and 10,000, with a default of 1,000; NMWS between 10 and 1,000, with a default of 100; Ncpt between 1 and 100, with a default of 10; SOD between 1 KB and 100 MB, with a default of 1 MB, while SID will be varied between 1 KB and 1 GB, with a default of 10 MB. The average number of FPGAs connected to an MWS, NRN, will be varied between 1 and 30, with a default of 5, while pc (k = pc NRN) will be set to 0.5. This will result in varying k between 1 and 15, with a default of 3. The MWS total available memory MT will range between 500 MB and 100 GB, with a default of 6 GB.

3.5 Utilization Factors
We now use the definitions and values that we calculated in the previous section, after substituting them in their corresponding equations, to develop expressions from which we can measure the different loads on the MWS under various conditions. To start with, after substituting the values of Ta, TPRL, Tcm, Pet, TsDSRP, TrDSRP, NTED, MDP, MADC, NAL, Tsagg, and c in (13), Tjob is found to be equal to:

Tjob = 0.004 + (Nt/NMWS)(0.0065112 + 10^-5 Ncpt) + 10^-7 (SOD/NMWS) + 3 × 10^-5 (k + 1).   (15)

Hence, the processor utilization ρp = λ Tjob can be calculated in terms of the variables λ, Nt, NMWS, Ncpt, SOD, and k, after substituting the value of Tjob from (15). ρp must be less than 0.75, or else the processor will be the bottleneck and will limit the MWS's scalability. Next, the total memory usage of the MWS processes, Mu, can be calculated after substituting the values of NP, NAL, MADC, and MADR, which results in the following equation:

Mu = (248 × 10^3 × λ)/(14 − λ) bytes.   (16)

The memory utilization ρM, which is equal to Mu/MT, will be dependent on λ and MT, and must be below 0.85. Finally, in order to calculate the utilization of the external network interface ρN, we first calculate the equation of NBSJ by substituting the values of MSJFF, NTED, SDSRP, fur, Ssur, and Pet from the previous section in (6), which leads to:

NBSJ = 10^5 + 822.4 (Nt/NMWS) + 200 Tj.   (17)

Therefore, the network utilization ρN can be calculated by substituting the values of NBSJ and B in (7), which gives us the equation of ρN (which should be below 0.5):

ρN = (λ/B) × (8/1,000) × (10^5 + 822.4 (Nt/NMWS) + 200 Tj).   (18)
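Putting (15) through (18) together, the following sketch evaluates the three utilization factors with the constants exactly as printed above; the helper and its argument names are ours, and eq. (16) is used as printed, i.e., it assumes the default sub-job service rate (μp ≈ 14 requests/sec) and is valid for λ < 14.

```python
def mws_utilizations(lam, n_t, n_mws, n_cpt, s_od, k, m_t, t_j, b_kbps):
    """Evaluate eqs. (15)-(18): returns (rho_p, rho_M, rho_N)."""
    # Eq. (15): average processing time of one sub-job, in seconds.
    t_job = (0.004 + (n_t / n_mws) * (0.0065112 + 1e-5 * n_cpt)
             + 1e-7 * (s_od / n_mws) + 3e-5 * (k + 1))
    rho_p = lam * t_job                                  # should stay below 0.75
    m_u = 248e3 * lam / (14 - lam)                       # eq. (16), bytes; lam < 14
    rho_m = m_u / m_t                                    # should stay below 0.85
    n_bsj = 1e5 + 822.4 * (n_t / n_mws) + 200 * t_j      # eq. (17), bytes per sub-job
    rho_n = (lam / b_kbps) * (8 / 1000) * n_bsj          # eq. (18); below 0.5
    return rho_p, rho_m, rho_n

# Defaults from Table 3: Nt=1000, NMWS=100, Ncpt=10, SOD=1 MB, k=3, MT=6 GB, Tj=1800 s.
rho = mws_utilizations(lam=10, n_t=1000, n_mws=100, n_cpt=10,
                       s_od=1e6, k=3, m_t=6e9, t_j=1800, b_kbps=128)
```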

Fig. 5. Processor, memory, and network utilization factors of the MWS while varying the input request rate λ.

Fig. 6. CPU utilization versus # tasks/job and versus # servers.

As we have stated above, we consider a default value of the bandwidth B equal to 128 Kbps. Hence, ρN will depend on the variables λ, Nt, NMWS, and Tj. In the next section, we study each of the three utilizations by varying each of the parameters on which it depends while setting the other parameters to their default values.

4 RESULTS

4.1 Analytical Results
We start by calculating the different values of the three utilization factors ρM, ρN, and ρp while varying the input request rate λ between 1 and 25 requests per second, and setting all other variables to their default values. The results are shown in Fig. 5 for all three utilizations. We notice that the processor utilization is the most critical among the three, as it reaches its full capacity at 14 req/sec and its good-performance limit, which is 0.75, at 10 req/sec. On the other hand, the network utilization reaches its full capacity at 22 req/sec and its good-performance limit, which is 0.5, also at 10 req/sec. As for the memory utilization, it remains well below its good-performance limit, which is 0.85, for all values of λ, which reflects that the MWS memory will remain relaxed and under-utilized even when the request rate increases. Hence, we can deduce that, considering the average default values that we stated in Sections 3.5 and 3.6, the main factors that affect the MWS performance are its processor and its network link. Also, from Fig. 5, we can deduce that for a request rate λ less than or equal to 10, the MWS performance will remain fine. As λ increases beyond 10, the MWS performance starts degrading. As λ reaches 15 req/sec, the MWS processor will not be able to handle all requests simultaneously and queuing of new requests will start. As λ reaches 23 req/sec, the network connection will not be able to accept all incoming requests and new requests will be queued at the network interface.

Next, we study the factors that affect each of the three utilizations. We start with ρp, which depends on the variables Nt, NMWS, Ncpt, SOD, and k. From (15), we deduce that k is multiplied by a very small constant, which makes its impact on ρp very small. Hence, we focus on varying the other four variables for different values of λ. In Fig. 6-left, we notice that ρp increases exponentially as Nt increases, and as λ increases, fewer tasks per job can be handled without queuing. For example, when λ is equal to five requests per second, the processor will remain non-loaded for as long as the number of tasks per job is 2,000 tasks or less. The utilization follows an opposite trend when NMWS is varied (Fig. 6-right). For example, if only 10 MWSs are operating in the system, then the largest λ that will lead to a sustained "good" performance is 1 req/sec. In all, the above results can serve as a guide to decide on the number of middleware servers in the cloud datacenter given the expected request rate and the expected job size.

On the other hand, Fig. 7-left shows that varying Ncpt has little significance on ρp, which is expected since Ncpt only affects the time needed by the MWS to send the commands to the FPGAs. In the figure, we see that when λ = 10, ρp increases from 0.7 to 0.8 as Ncpt increases from 1 to 100. Also, when λ = 13, ρp increases from 0.9 to 1, which reflects that the MWS is barely affected by the increase in the average number of FPGA commands. Finally, Fig. 7-right shows the effect of varying SOD on processor utilization; here, the processor performance starts getting impacted appreciably only for large output sizes. For example, when λ = 5, ρp increases from 0.35 to 0.85 as the size of the output data (SOD) is increased from 5 to 100 MB, thus reflecting the high impact of the size of the output data on performance.

Fig. 7. CPU utilization versus # commands/task and versus size of data.
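As a concrete illustration of using these results as a provisioning guide, the following sketch (our own, with illustrative default values) searches for the smallest number of middleware servers that keeps ρp within its 0.75 limit for an expected request rate and job size.

```python
def t_job(n_t, n_mws, n_cpt=10, s_od=1e6, k=1):
    # Eq. (15) as reconstructed; constants verbatim, defaults illustrative.
    return (0.004 + (n_t / n_mws) * (0.0065112 + 1e-5 * n_cpt)
            + (s_od / n_mws) * 1e-7 + 3e-5 * (k + 1))

def min_servers_for_cpu(lam, n_t, limit=0.75, n_max=1000):
    """Smallest N_MWS with rho_p = lam * T_job <= limit, else None."""
    for n_mws in range(1, n_max + 1):
        if lam * t_job(n_t, n_mws) <= limit:
            return n_mws
    return None

# Example: an expected 10 jobs/sec with 1,000 tasks per job.
print(min_servers_for_cpu(lam=10, n_t=1000))
```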
Fig. 8. NIC utilization versus Nt, Tj, and NMWS.

Fig. 9. Memory utilization for different request rates.

In the following, we study the factors that affect ρN. From (17), we notice that ρN depends on Nt, NMWS, and Tj. The influence of these three variables on ρN is illustrated in Fig. 8. In the left graph, we notice that ρN remains
approximately constant for small values of Nt, but as Nt increases above 500 tasks/job, ρN starts increasing. However, its increase remains within a tolerated limit. For a high value of Nt (10,000 tasks/job), ρN is equal to 0.55 when λ = 10 and to 0.82 when λ is equal to 15, which can be considered acceptable values. However, to stay below the 0.5 limit, Nt must be kept below 3,000 tasks/job when λ = 10.

As for the effect of NMWS, we notice that ρN decreases as NMWS increases, which is expected. From Fig. 8-right, we deduce that a number of MWSs between 200 and 300 will result in the best network utilization. Concerning the 0.5 load threshold, we notice the same point in the left graph, namely that a value of λ above 10 (given the default values of the other parameters) will always result in a high load on the network card, and therefore lead to packet buffering. For λ = 10, at least 30 middleware servers are required. Hence, NMWS should be limited to a value between 30 and 200, and its exact value can be calculated according to the values of the other parameters and to the average λ. For example, if it is expected that 10 requests (jobs) arrive every second, and the average number of tasks per job is very high, then it is better to set NMWS to its high limit, i.e., 200.
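The same search can be run on the network side within the [30, 200] range argued above; the sketch below is ours, and the caller supplies the workload parameters, since ρN is sensitive to all of them.

```python
def rho_n(lam, n_mws, n_t, t_j, b_kbps):
    """Eq. (18) as reconstructed; b_kbps is the per-MWS bandwidth B in Kbps."""
    n_bsj = 1e5 + 822.4 * n_t / n_mws + 200.0 * t_j   # Eq. (17), bytes per job
    return (8.0 * lam / (1000.0 * b_kbps)) * n_bsj

def smallest_fleet(lam, n_t, t_j, b_kbps, limit=0.5):
    """First N_MWS in the [30, 200] range that keeps rho_N at or below the
    0.5 threshold, mirroring the bounds argued above; None if none qualifies."""
    for n_mws in range(30, 201):
        if rho_n(lam, n_mws, n_t, t_j, b_kbps) <= limit:
            return n_mws
    return None
```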
The final parameter that affects network utilization is Tj, which is varied in Fig. 8-middle between 1 and 10,000 seconds. It shows that to stay within the good-performance limit, a certain maximum value of Tj should be set for each value of λ: 3,000 seconds for λ = 5, 1,800 seconds for λ = 10, 700 seconds for λ = 20, and 290 seconds for λ = 30.
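These caps follow from fixing ρN in (18) at its 0.5 threshold and solving for Tj (the constant 0.5 × 1,000/8 collapses to 62.5):

$$T_j^{\max} = \frac{1}{200}\left(\frac{62.5\,B}{\lambda} \;-\; 10^5 \;-\; \frac{822.4\,N_t}{N_{MWS}}\right).$$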
The final utilization factor which we examine is ρM. From (16), we see that ρM depends on the request rate λ and on MT. We presented in Fig. 5 the memory utilization obtained when λ is varied. We present here the effect of using different total memory sizes MT on the memory utilization. Fig. 9 shows that when MT is less than 2 GB, the memory utilization will be higher than its good-performance limit. However, any value of MT higher than 5 GB ensures a good memory performance.
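The 0.85 threshold can equivalently be read as a memory-sizing rule: since ρM = Mu/MT, good memory performance requires

$$M_T \;\geq\; \frac{M_u(\lambda)}{0.85},$$

where Mu(λ) is the memory demand of (16); the smallest safe MT therefore grows with the request rate.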
4.2 Comparison with Experimental Results
In this section, we compare the model's results to those obtained from testing a real cluster. For this, we built a cloud prototype that consisted of five MWS servers that were connected via a high-speed router to form a sample cloud cluster. Each MWS has an Intel i7 processor with 8 GB of RAM running at 3.4 GHz, and is connected via a switch to two Xilinx Virtex 6 FPGAs. The MapReduce jobs that were run by the MWSs on the FPGAs represented three applications which we have used in [4] and [13], and are part of the Phoenix benchmark [15], which offers a variety of workloads with emphasis on communication and computation. These applications are String Matching (SM), K-means (KM), and Matrix Multiplication (MM). For the string matching application, a master document representing a research publication was searched for a keyword to return a list of sentences that contain it, if any. The string, whose length ranged between 5 and 22 characters, was randomly selected from a list of 48 keywords, whereas the selected document was one of 25 conference and journal papers ranging in length from 5,112 to 11,674 words. As for the K-means application, we used a dataset containing 20 million points in a 2D space, whereas the number of centroids was randomly set to a value between 2 and 32. Finally, the matrix multiplication application calculates the product of two input matrices containing integer values and randomly picked from a set of two pairs, with dimensions of (100 × 150, 150 × 100) and (400 × 400, 400 × 400), respectively.

The string matching application is a series of MapReduce jobs, where each job is assigned a keyword and a buffer that is used to hold a single line from the searched document. The encapsulated Map tasks look for keyword matches in the assigned buffers, and the results are sent to a Reduce task for the overall ordering of matches. For K-means clustering, the 2D points are grouped into vectors that are assigned to Map tasks. Each task is responsible for computing the Euclidean distances between the various data points in its assigned data vector and the K centroids. The distances are sent to a Reduce task to recalculate the nearest centroids. Finally, in matrix multiplication, each Map task multiplies a row in the left matrix and a column in the right matrix, whereas the Reduce task performs the addition.
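For concreteness, a minimal single-process Python sketch of this string-matching job structure is shown below; in the real system the Map and Reduce tasks execute as drivelets on the FPGAs, so the functions and data here are purely illustrative.

```python
def map_task(keyword, line_no, line):
    """Emit (line_no, sentence) pairs for lines that contain the keyword."""
    return [(line_no, line.strip())] if keyword in line else []

def reduce_task(matches):
    """Order the collected matches by their position in the document."""
    return sorted(matches, key=lambda m: m[0])

def string_matching_job(keyword, document_lines):
    # One Map task per buffered line, then a single Reduce over all matches.
    intermediate = []
    for i, line in enumerate(document_lines):
        intermediate.extend(map_task(keyword, i, line))
    return reduce_task(intermediate)

doc = ["cloud datacenters host many servers",
       "the middleware server acts as the broker"]
print(string_matching_job("server", doc))
```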
The input data to the applications were distributed uniformly among the 10 FPGAs. To each MWS, a laptop is connected via an Ethernet cable, representing a client that sends MapReduce jobs to the MWS. During the testing time, each client randomly chooses one of the three applications, formulates a MapReduce job with random but realistic input parameter values, and sends the job to the MWS to which it is connected. The MWS, after receiving the job from the client, distributes it to one or more FPGAs, including those attached to other MWSs. This operation is repeated by each client according to the desired value of λ in the testing scenario.

In order to make a fair comparison between the mathematical and the prototype results, we saved all parameters that were used when running the testing scenarios, and used them in the analysis equations. For each MapReduce job that is executed during a testing scenario, we save to a log file the job parameters, inputs, and outputs; and at the end of the testing scenario, we calculate the average values for all jobs that were executed in the scenario. For example, we found that the average job input and output sizes in the testing scenarios were about 9.67 MB and 122 KB, respectively, and hence, we set the values of SID and SOD in the mathematical model's equations to 10,139,730 and 124,928 bytes, respectively. Similarly, based on the average values calculated from the testing scenarios, we found that the average number of tasks per job was equal to 20, the average number of commands per task was equal to 10, and the average time for a job to finish execution was equal to 1,391 seconds. Hence, we set Nt to 20, Ncpt to 10, and Tj to 1,391. Finally, it is important to note that the five MWSs shared a network link of 1 Mbps, and therefore, we considered that the bandwidth of each MWS will be 200 Kbps when calculating the value of B in (18).
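Plugging these measured averages into (15) provides a quick check of the model on the CPU side; the short script below is ours, with the small k term arbitrarily set to k = 1.

```python
# Measured averages from the testing scenarios (Section 4.2).
n_t, n_cpt, t_j = 20, 10, 1391
s_od, n_mws = 124_928, 5

t_job = (0.004 + (n_t / n_mws) * (0.0065112 + 1e-5 * n_cpt)
         + (s_od / n_mws) * 1e-7 + 3e-5 * (1 + 1))   # Eq. (15), with k = 1
for lam in (10, 20, 23, 30):
    print(lam, round(lam * t_job, 2))                # rho_p = lambda * T_job
```

With these inputs, ρp = λTjob crosses the 0.75 good-performance limit near λ ≈ 23 and approaches 1 near λ ≈ 30, matching the model curves discussed next.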
The prototype results were calculated using the Performance Monitor tool that is part of Microsoft Windows, but with custom counters that we built to track only the related application network traffic, plus the CPU and memory utilizations. On each MWS, we ran the Performance Monitor tool during the testing scenario to save, to a custom data collector set, the amounts of CPU and memory utilization and network bandwidth consumed by each running process. The data collector collects and saves this data at constant time intervals. After the testing scenario finishes, we identify from the output of the data collector set the processes that are related to our system. Then we calculate the total CPU, memory, and network bandwidth consumed by these processes, and the average utilization of the CPU, memory, and network link during the testing scenario. We performed nine testing scenarios, in which λ was varied between 1 and 40, with intervals of 5 (i.e., 1, 5, 10, 15, ...). The calculated results, alongside the analysis equations' results for the corresponding parameters of this section, are shown in Fig. 10.

Fig. 10. Utilizations: Analytical versus experimental.

In Fig. 10-left, we notice the same trend in the prototype and the mathematical results of the CPU utilization, but the prototype results are more affected by λ. The error margin between the two results is about 0.15 between λ = 1 and λ = 25. After λ = 30, both results converge to 1 and the processor will be working at full utilization. The good-performance limit is achieved by the prototype at λ = 18, while with the mathematical model it is reached at λ = 23. The small error margin between the prototype and the mathematical analysis results of Fig. 10-left proves that our mathematical model depicts, to a high level of accuracy, the CPU utilization of the MWS. On the other hand, with respect to the network utilization in Fig. 10-middle, the error margin is about 0.1 between λ = 1 and λ = 10, but increases to 0.25 for values of λ between 10 and 35. After λ = 35, both results converge to 1, and the network link will be working at full utilization. The network good-performance limit is achieved by the prototype at λ = 13 and by the model at λ = 22. The error margin between the prototype and the mathematical model is higher for the network utilization results than for the CPU utilization results. However, it can be considered an acceptable error margin, especially since the network links are subjected to various activities that are not considered in the analytical model, such as packets due to routing and security protocols. Most importantly, both prototype and model results for the network utilization converge at near values of λ (40 and 45). Finally, Fig. 10-right shows the memory utilization results, where we notice that the model's memory utilization increases at a low rate as λ increases between 1 and 40. On the other hand, the prototype's memory utilization increases at a higher rate. However, this rate decreases as λ increases. After λ = 30, we notice that the prototype results start converging. The error margin is the highest here, with a value of 0.22. This is due to the fact that the processes that are run by the RASSD system on the MWS might require additional memory that is not accounted for in the analysis, such as the memory needed by the OS, and the memory required for the processing of temporary variables and temporary data during execution.
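The error margins quoted here come from point-by-point comparison of the two curves; the sketch below shows that bookkeeping with NumPy, using placeholder arrays rather than the measured data.

```python
import numpy as np

# Placeholder curves (NOT the measured data): one utilization value per
# tested request rate, clipped at full utilization.
lam = np.array([1, 5, 10, 15, 20, 25, 30, 35, 40])
model = np.clip(0.033 * lam, 0.0, 1.0)   # e.g., rho_p from Eq. (15)
proto = np.clip(0.042 * lam, 0.0, 1.0)   # hypothetical prototype readings

margin = np.abs(proto - model)
print("max error margin:", margin.max().round(2))
print("converged points:", lam[np.isclose(proto, model, atol=0.02)])
```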
Overall, the small differences between the prototype and the analytical model's results, and the fact that they converge at similar values of λ, prove the accuracy of our model in depicting the CPU, memory, and network utilizations.

5 CONCLUDING REMARKS
In this paper, we presented a model that analyzes the performance of the Middleware Server of a cloud datacenter. The aim was to calculate the utilization of the processor, memory, and network interface, where the obtained results proved that memory utilization stays in most cases within the acceptable limit. Processor utilization was mostly affected by the overall jobs' characteristics, such as the number of tasks in the job and the size of the job's output data. As for network utilization, we found that it is mostly affected by the job arrival rate and the average job execution time, which in turn is a function of the job characteristics. The presented results can be employed to decide on the number of middleware servers in a cloud datacenter, knowing the characteristics of its target applications, like the rate of jobs submitted to the center and the characteristics of these jobs, including size and number of sub-jobs.
ACKNOWLEDGMENTS
This work has been supported by a generous grant from the Qatar National Research Fund (QNRF) under Grant Number NPRP 09-1050-2-405. K.W. Mershad is the corresponding author.

REFERENCES
[1] N. Abbani, A. Ali, D. Al Otoom, M. Jomaa, M. Sharafeddine, H. Artail, H. Akkary, M. Saghir, M. Awad, and H. Hajj, "A distributed reconfigurable active SSD platform for data intensive applications," in Proc. IEEE 13th Int. Conf. High Perform. Comput. Commun., Sep. 2011, pp. 25-34.
[2] M. Jomaa, K. Mershad, N. Abbani, Y. Sharaf-Dabbagh, B. Romanous, H. Artail, M. Saghir, H. Hajj, H. Akkary, and M. Awad, "A mediation layer for connecting data-intensive applications to reconfigurable data nodes," in Proc. Int. Conf. Comput. Commun. Netw., Nassau, Bahamas, Jul. 2013, pp. 1-9.
[3] H. Herodotou, "Hadoop performance models," arXiv preprint arXiv:1106.0940, 2011.
[4] A. Ali, M. Jomaa, B. Romanous, M. Sharafeddine, M. Saghir, H. Akkary, H. Artail, M. Awad, and H. Hajj, "An operating system for a reconfigurable active SSD processing node," in Proc. 19th Int. Conf. Telecommun., Jounieh, Lebanon, Apr. 2012, pp. 1-6.
[5] F. Almari, P. Zavarsky, R. Ruhl, D. Lindskog, and A. Aljaedi, "Performance analysis of Oracle database in virtual environments," in Proc. 26th Int. Conf. Advanced Inf. Netw. Appl. (AINA), 2012, pp. 1238-1245.
[6] J. Cao, M. Andersson, C. Nyberg, and M. Kihl, "Web server performance modeling using an M/G/1/K PS queue," in Proc. 10th Int. Conf. Telecommun., Tahiti, Papeete, Feb. 2003, pp. 1501-1506.
[7] O. Brun and J.-M. Garcia, "Analytical solution of finite capacity M/D/1 queues," J. Appl. Probability, 2000, pp. 1092-1098.
[8] K. Curran and C. Duffy, "Understanding and reducing web delays," Int. J. Netw. Manage., vol. 15, no. 2, pp. 89-102, 2005.
[9] Cisco Systems, Inc., "Design best practices for latency optimization," Financial Services Technical Decision Maker White Paper, 2007.
[10] K. Mershad, A. Kaitoua, H. Artail, M. Saghir, and H. Hajj, "A framework for multi-cloud cooperation with hardware reconfiguration support," in Proc. IEEE 9th World Congress Serv., Santa Clara, CA, USA, Jun. 2013, pp. 52-59.
[11] S. Larsen, P. Sarangam, and R. Huggahalli, "Architectural breakdown of end-to-end latency in a TCP/IP network," in Proc. 19th Int. Symp. Comput. Archit. High Perform. Comput., Gramado, RS, Brazil, Oct. 2007, pp. 195-202.
[12] J. Mogul and A. Borg, "The effect of context switches on cache performance," in Proc. 4th Int. Conf. Architectural Support Programming Lang. Operating Syst. (ASPLOS-IV), Santa Clara, CA, USA, 1991, vol. 26, no. 4, pp. 75-84.
[13] J. Regehr, R. Alastair, and K. Webb, "Eliminating stack overflow by abstract interpretation," ACM Trans. Embedded Comput. Syst., vol. 4, no. 4, pp. 751-778, 2005.
[14] B. Nikhil, "Analysis of the M/G/1 processor-sharing queue with bulk arrivals," Oper. Res. Lett., vol. 31, no. 5, pp. 401-405, 2003.
[15] R. Yoo, A. Romano, and C. Kozyrakis, "Phoenix rebirth: Scalable MapReduce on a large-scale shared-memory system," in Proc. IEEE Int. Symp. Workload Characterization, Austin, TX, USA, Oct. 2009, pp. 198-207.

Khaleel Mershad received the BE degree with high distinction in computer engineering and informatics from Beirut Arab University, Lebanon, in 2004, and the ME and PhD degrees in computer and communications engineering from the American University of Beirut (AUB), in 2007 and 2012. He worked as a postdoc at AUB in 2012 and 2013, where he worked on several research projects in cloud computing and vehicular ad hoc networks. He is currently a full-time researcher at Qatar University, where he is doing research in data management in the cloud. His research interests include cloud computing, mobile ad hoc networks, data management and availability, parallel and distributed computing, and inter-vehicle communications.

Hassan Artail obtained the BS degree with high distinction and the MS degree in electrical engineering from the University of Detroit in 1985 and 1986, respectively, and the PhD degree from Wayne State University in 1999. He is a professor at the American University of Beirut (AUB) doing research in Internet and mobile computing. Before joining AUB, he was a supervisor at Chrysler, where he worked in system development for vehicle testing. He has published more than 180 papers in reputable journals and conference proceedings, and received several awards for excellence in research. He is a senior member of the IEEE.

Mazen A. R. Saghir received the BE degree in computer and communication engineering from the American University of Beirut (AUB), and the MASc and PhD degrees in electrical and computer engineering from the University of Toronto. He is an associate professor of electrical and computer engineering at Texas A&M University at Qatar. His research interests include reconfigurable computing, computer architecture, and embedded systems design. He is a senior member of the IEEE.

Hazem Hajj received the BE degree in ECE from the American University of Beirut (AUB) in 1987 with distinction, and the PhD degree from the University of Wisconsin-Madison in 1996. He is an associate professor at AUB, and was a principal engineer at Intel, where he led research and development for manufacturing automation and received several awards. His interests include data mining, energy-aware computing, and high-performance computing.

Mariette Awad received the PhD degree in electrical engineering in 2007. She is an assistant professor at the American University of Beirut (AUB). Prior to her academic position, she was with the IBM System and Technology group in Vermont as a wireless product engineer. She has received several business awards and holds multiple patents from her work at IBM. Her research interests include machine learning, data mining, image recognition, and ubiquitous computing.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.