IEEE TRANSACTIONS ON CLOUD COMPUTING, VOL. 5, NO. 4, OCTOBER-DECEMBER 2017
Abstract—In a previous work, we presented a system which combines active solid state drives and reconfigurable FPGAs (which we
called reconfigurable active SSD nodes, or simply RASSD nodes) into a storage-compute node that can be used by a cloud datacenter
to achieve accelerated computations while running data-intensive applications. To hide the complexity of accessing RASSD nodes
from applications, we proposed in another work a middleware framework which handles all low-level interactions with the hardware.
The Middleware Server (MWS), which manages a group of RASSD nodes, has the role of bridging the connection between a client and
the nodes. In this paper, we present extensions to the MWS to enable it to operate within a collaborative cloud environment, and we
develop a model to evaluate the performance of the collaborative MWS. This model represents a study of the utilization of three
hardware resources of the MWS: CPU, memory, and network interface. For each, we derive the parameters that affect its operations,
and propose formulas for its utilization. The results describe the capacity of a MWS, and hence can be used to decide on the number of
MWSs in a collaborative cloud datacenter.
Index Terms—Cloud computing, big data, cloud collaboration, middleware, FPGA, hardware acceleration
1 INTRODUCTION
Little’s Theorem, the processor utilization is ρ_P = λ/μ_P. The memory utilization ρ_M is the amount of memory used by the server M_u divided by the total memory M_T: ρ_M = M_u/M_T. Finally, the network utilization ρ_N is the number of arriving requests over the number that can be handled μ_N: ρ_N = λ/μ_N, where the maximum number of requests that can be handled is equal to the number of requests that will consume all the network bandwidth.

3.1 Memory Allocated
The MWS will allocate memory for sub-job configuration and error management, metadata of IO data to sub-job tasks, periodic heartbeat data, and data aggregation. The components that constitute the significant core are the sub-job configuration memory and the data aggregation memory. The remaining components pose negligible memory utilization. For this reason, we only carry out the analysis for sub-job configuration and data aggregation and arrive at a representative expression.

The MWS will reserve a certain space in memory for aggregating the intermediate results of sub-jobs. Note that this memory is not equal to the size of the intermediate results, since these are saved on SSDs, and the aggregation functions do not need to transfer them all to memory. Rather, the MWS transfers a chunk of data to memory, does the aggregation, then transfers the next chunk, and so on. Hence the size of memory depends on the size of the aggregation chunk and the aggregation results. After aggregating a new chunk, the new aggregation result produced overwrites the previous aggregation result.

Suppose that the average size of an aggregation data chunk is equal to M_ADC bytes, and the average size of an aggregation result (intermediate or final) is equal to M_ADR; then the total memory for an aggregation operation is equal to M_ADC + M_ADR. Each sub-job might require a different number of aggregation levels, according to the structure of the job. Most simple jobs require a single aggregation level, while more complex jobs with a hierarchical distribution of tasks might require two or more aggregation levels. In general, jobs that require only a single Reduce stage will need a single aggregation level to aggregate the Reduce results. On the other hand, jobs that require multiple Reduce stages, such as jobs that calculate the shortest path in a graph, require multiple aggregation levels depending on the number of Reduce stages. Suppose that the average number of aggregation levels over all jobs is equal to N_AL; then the total memory for aggregation, and approximately the memory utilization on the MWS, can be expressed as:

M_u ≈ M_agg = N_P · N_AL · (M_ADC + M_ADR),  (1)

where N_P is the average number of sub-jobs that will be running simultaneously, which can be calculated using the processor utilization queuing model. In [9], it was proved that an expression for the average number of served concurrent users can be found by using the average number of requests in the processor, and it is N_P = λ/(μ_P − λ).

3.2 MWS Communications
We begin the network utilization analysis by reviewing the MWS operations that involve communications with external entities (CLMs and other MWSs). We exclude the communications with RASSD nodes since they are made on a separate internal network. For each studied operation or process, we determine the total number of bytes that are sent or received by the MWS per sub-job.

3.2.1 Sub-Job Flow File and Data Sharing
Initially, the MWS receives from a CLM a sub-job flow file whose size is M_SJFF. Hence, the MWS first receives M_SJFF bytes from the CLM. Next, when the CLM divides the job into sub-jobs, it tries to group the tasks within each sub-job so they all require data in one of the FPGAs connected to the MWS assigned this sub-job. This can be achieved to a certain extent, but in many cases, the input data to a task is not known. Hence, when the MWS prepares the lists of commands, it can determine the number of tasks that require data from an external source. If the average number of tasks per job that require external data is N_TED, and considering a uniform distribution of the job among the various MWSs, we can deduce that the average number of tasks within a sub-job that require external data is equal to N_TED-MWS = N_TED/N_MWS. Also, the average number of tasks from other sub-jobs that will depend on data from this MWS (supposing that dependencies are uniform among all MWSs) is also equal to N_TED-MWS = N_TED/N_MWS. In other words, N_TED-MWS tasks within the sub-job will require data saved on RASSDs of other MWSs, and N_TED-MWS tasks executed by RASSDs of other MWSs will require data from the RASSDs of this MWS. Hence, we can say that for each job, each MWS will send N_TED-MWS data sharing request packets to the TEs of other MWSs, and the TE of this MWS will receive N_TED-MWS data sharing request packets from the TEs of other MWSs.

Assuming that the size of a data sharing request packet (DSRP) is S_DSRP, the total size of packets sent and received by an MWS for data sharing per job is equal to:

S_DS = 2 · S_DSRP · (N_TED/N_MWS).  (2)

3.2.2 Update Reports to Clients
The Tr thread, which is the main thread for the sub-job at the MWS, will frequently combine the tasks' update reports from all RASSD nodes and send a general update report (UR) to the CLM. Supposing the update report frequency is f_ur and the average size of a single UR is S_sur bytes, and noting that the UR will contain the ID, description, and status of each task that is running or has newly finished at an FPGA, the size of an update report will depend on the number of tasks per job (N_t). We will use this information to estimate S_sur later on. From the stated description, we deduce that the total size of UR packets sent by the MWS to the CLM is equal to f_ur · S_sur bytes/second. To calculate the total size of data sent due to update reports for the whole job, we define the total job execution time T_j. Hence, the average number of UR bytes per job sent from an MWS to the CLM is:

S_UR = f_ur · S_sur · T_j bytes.  (3)
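To make the model concrete, the memory and traffic formulas (1)-(3) can be sketched in Python. This is a minimal illustration, not code from the paper: the function names and sample values are our own, and we take N_P = λ/(μ_P − λ), the standard M/M/1 number-in-system expression, as the queuing-model result cited from [9].

```python
# Illustrative sketch of Eqs. (1)-(3); all names and numbers are
# assumptions for demonstration, not values from the paper.

def mem_aggregation(lam, mu_p, n_al, m_adc, m_adr):
    """Eq. (1): M_u ~ M_agg = N_P * N_AL * (M_ADC + M_ADR),
    taking N_P = lam / (mu_p - lam) (M/M/1 number in system)."""
    n_p = lam / (mu_p - lam)  # avg. sub-jobs served concurrently
    return n_p * n_al * (m_adc + m_adr)

def data_sharing_bytes(s_dsrp, n_ted, n_mws):
    """Eq. (2): S_DS = 2 * S_DSRP * (N_TED / N_MWS) bytes per job."""
    return 2 * s_dsrp * n_ted / n_mws

def update_report_bytes(f_ur, s_sur, t_j):
    """Eq. (3): S_UR = f_ur * S_sur * T_j bytes per job."""
    return f_ur * s_sur * t_j
```

For example, with a hypothetical λ = 10 req/sec, μ_P = 14 req/sec, two aggregation levels, a 4 MB chunk, and a 1 MB result, `mem_aggregation(10, 14, 2, 4e6, 1e6)` gives N_P = 2.5 and about 25 MB of aggregation memory.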
MERSHAD ET AL.: A STUDY OF THE PERFORMANCE OF A CLOUD DATACENTER SERVER 597
3.2.3 Locations of Tasks' Data
After the MWS receives the sub-job from the CLM, it assigns each task (i.e., flow) in the sub-job to one or more FPGAs which have access to the task data. If the MWS is caching the data location(s) (as we explained in Section 3.2.2), it can determine directly the IDs of those FPGAs; else, it needs to contact the DSS to find the data locations. We will consider the worst case scenario, in which the MWS is not caching task data locations, causing it to send a request to the DSS with the description of each task and receive a reply with the description and location of data, or with the description alone if the location is not known. If the description and the location each occupy a single string, then for each task in the sub-job, the MWS and the DSS will exchange three strings of data, or 12 × 3 = 36 bytes. As we stated before, the total number of tasks per sub-job is N_t/N_MWS, and if we consider the size of packet headers, which is 20 bytes, then the data location request packet will have a size of 20 + 12·N_t/N_MWS bytes and the corresponding reply packet will have a size of 20 + 24·N_t/N_MWS bytes. Hence, the total size of the packets exchanged between the MWS and DSS for data locations is:

S_td = (40 + 36·N_t/N_MWS) bytes.  (4)

Another case in which the MWS contacts the DSS is when it needs to know the locations of data for intermediate tasks that require data from an external source (e.g., another MWS), which we previously described in Section 3.3.1. In such cases, the identity of the data that is required by these tasks is not known until their execution. Similar to (4), the MWS and DSS will exchange three strings of data for each such task. The number of tasks that require external data from Section 3.3.1 is N_TED. Hence, the total size of the packets exchanged between the MWS and the DSS for such tasks is equal to:

S_Il = (40 + 36·N_TED/N_MWS) bytes.  (5)

From the previous derivations, we can deduce that the average total number of bytes generated (sent or received) by the MWS due to a single sub-job is equal to:

N_BSJ = M_SJFF + S_DS + S_UR + S_td + S_Il bytes.  (6)

In order to derive the network utilization, we deduce from (6) that the total bandwidth required by a single sub-job is equal to (8/1,000)·N_BSJ Kbps. If we consider a total available bandwidth of B Kbps, then the network interface can serve a maximum of μ_N = 1,000·B/(8·N_BSJ). Therefore, the network utilization will be:

ρ_N = (8·λ·N_BSJ)/(1,000·B).  (7)

3.3 Time Required for Various Sub-Job Tasks
In order to calculate the processor utilization, we need to find the percent of the processor time that is used in executing sub-jobs. First, we calculate the average time spent by the processor in executing a single sub-job, which is the total time in which the processor is busy with executing a certain task or operation related to the sub-job. As we did in the previous sections, we divide the execution of the sub-job into separate parts, and calculate the average time for executing each part. We consider the time required for opening the new threads of the sub-job (Tr and the Tn's) and sending the lists of commands to the Tn threads negligible. The first delay we calculate is the time required to get the location of each task from the DSS.

3.3.1 Time to Acquire Task Locations
We assume that the DSS saves the task descriptions and data locations in a dedicated database, and that the tasks are indexed according to the application and then according to the task operation (for example, Operation_ID in Section 2.2.1). Hence, we can deduce that the time needed to access the locations of tasks is the average time needed to access a database tuple (using the indexes), multiplied by the average number of tasks per sub-job [11]. Suppose that the average time needed to access a single database tuple is T_a, and the average time needed to transfer a packet (request or reply) from the MWS to the DSS (or vice versa) is T_f; then the total time for the MWS to get the locations of the data of the sub-job tasks from the DSS is:

T_DSS = (N_t/N_MWS)·T_a + 2·T_f.  (8)

3.3.2 Time to Process FPGA Commands Lists
After the MWS receives the locations of the data required by each task, it uses this information to create the list of commands for each FPGA. We denote the average time needed to parse the sub-job (flow file) and the information received from the DSS regarding the task data locations, in order to distribute the tasks among FPGAs and create the list of commands for each FPGA, as T_PRL.

3.3.3 Time to Send Commands to FPGAs
We define the time needed to send a single command to an FPGA as the time between the instant Tn receives an ACK from the FPGA, which indicates that the FPGA is waiting for the next command, and the instant at which Tn finishes putting the next command on the internal network line that connects it to the FPGA. Note that this is different from the delay of executing the command at the FPGA, since the Tn thread sends the command and waits for an ACK to send the next command. However, the MWS can execute other tasks while waiting for the FPGA's reply. Hence, the only delay of concern at the MWS is that of sending the command to the FPGA.

Suppose the average time to send a command is equal to T_cm, and the average number of commands per task is equal to N_cpt; then the average time to send all commands in a sub-job equals the average number of tasks per sub-job times the average number of commands per task, times the average time to send a single command:

T_FPGA = (N_t/N_MWS)·N_cpt·T_cm.  (9)
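The traffic and delay terms above compose directly. The following is a minimal Python sketch of Eqs. (6)-(9); the function names and any sample values are our own illustrative assumptions, not the paper's.

```python
# Illustrative sketch of Eqs. (6)-(9); parameter values used in any
# example are assumptions for demonstration only.

def bytes_per_subjob(m_sjff, s_ds, s_ur, s_td, s_il):
    """Eq. (6): N_BSJ = M_SJFF + S_DS + S_UR + S_td + S_Il (bytes)."""
    return m_sjff + s_ds + s_ur + s_td + s_il

def network_utilization(lam, n_bsj, b_kbps):
    """Eq. (7): rho_N = 8 * lam * N_BSJ / (1000 * B),
    i.e., lam divided by mu_N = 1000 * B / (8 * N_BSJ)."""
    return 8 * lam * n_bsj / (1000 * b_kbps)

def t_dss(n_t, n_mws, t_a, t_f):
    """Eq. (8): T_DSS = (N_t / N_MWS) * T_a + 2 * T_f (seconds)."""
    return (n_t / n_mws) * t_a + 2 * t_f

def t_fpga(n_t, n_mws, n_cpt, t_cm):
    """Eq. (9): T_FPGA = (N_t / N_MWS) * N_cpt * T_cm (seconds)."""
    return (n_t / n_mws) * n_cpt * t_cm
```

For instance, with a hypothetical N_BSJ of 1,600 bytes per sub-job, B = 128 Kbps, and λ = 10 req/sec, `network_utilization(10, 1600, 128)` evaluates to 1.0, i.e., the link is saturated.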
Fig. 5. Processor, memory, and network utilization factors of the MWS while varying the input request rate λ.

As we have stated above, we consider a default value of the bandwidth B equal to 128 Kbps. Hence, ρ_N will depend on the variables λ, N_t, N_MWS, and T_j. In the next section, we study each of the three utilizations by varying each of the parameters on which it depends while setting the other parameters to their default values.

4 RESULTS
4.1 Analytical Results
We start by calculating the different values of the three utilization factors ρ_M, ρ_N, and ρ_p while varying the input request rate λ between 1 and 25 requests per second, while setting all other variables to their default values. The results are shown in Fig. 5 for all three utilizations. We notice that the processor utilization is the most critical among the three, as it reaches its full capacity at 14 req/sec, and its good-performance limit, which is 0.75, at 10 requests/sec. On the other hand, the network utilization reaches its full capacity at 22 req/sec and its good-performance limit, which is 0.5, also at 10 requests/sec. As for the memory utilization, it remains well below its good-performance limit, which is 0.85, for all values of λ, which reflects that the MWS memory will remain relaxed and under-utilized even when the request rate increases. Hence, we can deduce that when considering the average default values that we stated in Sections 3.5 and 3.6, the main factors that affect the MWS performance are its processor and its network link. Also, from Fig. 5, we can deduce that for a request rate less than or equal to 10, the MWS performance will remain fine. As λ increases beyond 10, the MWS performance starts degrading. As λ reaches 15 req/sec, the MWS processor will not be able to handle all requests simultaneously and queuing of new requests will start. As λ reaches 23 req/sec, the network connection will not be able to accept all incoming requests and new requests will be queued at the network interface.

Next, we study the factors that affect each of the three utilizations. We start with ρ_p, which depends on the variables N_t, N_MWS, N_cpt, S_OD, and k. From (25), we deduce that k is multiplied by a very small constant, which makes its impact on ρ_p very small. Hence, we focus on varying the other four variables for different values of λ. In Fig. 6-left, we notice that ρ_p increases exponentially as N_t increases, and as λ increases, fewer tasks per job can be handled without queuing. For example, when λ is equal to five requests per second, the processor will remain non-loaded for as long as the number of tasks per job is 2,000 tasks or less. The utilization follows an opposite trend when N_MWS is varied (Fig. 6-right). For example, if only 10 MWSs are operating in the system, then the largest λ that will lead to a sustained "good" performance is 1 req/sec. In all, the above results can serve as a guide to decide on the number of middleware servers in the cloud datacenter given the expected request rate and the expected job size.

On the other hand, Fig. 7-left shows that varying N_cpt has little significance on ρ_p, which is expected since N_cpt only affects the time needed by the MWS to send the commands to the FPGAs. In the figure, we see that when λ = 10, ρ_p increases from 0.7 to 0.8 as N_cpt increases from 1 to 100. Also, when λ = 13, ρ_p increases from 0.9 to 1, which reflects that the MWS is barely affected by the increase in the average number of FPGA commands. Finally, Fig. 7-right shows the effect of varying S_OD on processor utilization. Here, the processor performance starts getting impacted appreciably. For example, when λ = 5, ρ_p increases from 0.35 to 0.85 as the size of the output data (S_OD) is increased beyond 5 MB up to 100 MB, thus reflecting the high impact of the size of the output data on performance.

Fig. 7. CPU utilization versus # commands/task and versus size of data.

In the following, we study the factors that affect ρ_N. From (7), we notice that ρ_N depends on N_t, N_MWS, and T_j. The influence of these three variables on ρ_N is illustrated in Fig. 8. In the left graph, we notice that ρ_N remains
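The sizing guideline mentioned in the results (choosing the number of middleware servers from the expected request rate and job size) can be illustrated with a small sketch. This is our own extrapolation, not the paper's procedure: it uses the network-utilization formula (7), and assumes, hypothetically, that requests split evenly across servers and that the per-sub-job byte count N_BSJ stays fixed as N_MWS varies.

```python
# Sketch of a sizing loop: find the smallest number of MWSs whose
# per-server network utilization (Eq. (7)) stays under the
# good-performance limit. Inputs and the even-split assumption are
# illustrative only.

def rho_n(lam, n_bsj, b_kbps):
    """Eq. (7): network utilization of a single MWS."""
    return 8 * lam * n_bsj / (1000 * b_kbps)

def min_mws(total_lam, n_bsj, b_kbps, limit=0.5, cap=1000):
    """Smallest N_MWS such that each server's share total_lam / N_MWS
    keeps rho_N <= limit; returns None if cap servers are not enough."""
    for n in range(1, cap + 1):
        if rho_n(total_lam / n, n_bsj, b_kbps) <= limit:
            return n
    return None
```

With a hypothetical aggregate rate of 20 req/sec, N_BSJ = 800 bytes, and B = 128 Kbps per server, `min_mws(20, 800, 128)` returns 2: one server would run at ρ_N = 1.0, while two servers each sit at the 0.5 good-performance limit.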