A PROJECT REPORT
Submitted by
SUDARSAN S 813817104093
of
BACHELOR OF ENGINEERING
IN
MARCH 2021
Abstract
Academic and workplace environments provide a dedicated workstation to each user. Most
users don't use their CPUs to their full potential; even at the busiest times, a
considerable number of CPU cycles are wasted. These wasted cycles could be harnessed
as compute resources. One such idea is to execute processes remotely on an idle
workstation while performing I/O on the user's workstation, which maintains transparency.
In this article, we present a new implementation method for remote execution and a
replication-based fault-tolerance mechanism.
TABLE OF CONTENTS
1 Introduction
2 Literature review
3 Proposed system
4 Testing
4.1 Setup
4.2 Proof of concept for remote execution
5 Future works
5.1 Support for child forks
5.2 Support for socket calls
5.3 Hybrid fault tolerance
5.4 Async syscall batching
Reference
List of tables
List of figures
Chapter 1
Introduction
A study of workstations showed that they were idle 70% of the time. Even at the busiest
times, a considerable number of CPU cycles were being wasted. Various approaches have
been proposed in the literature to harness these wasted cycles. One such idea is to
execute processes remotely on idle workstations while remaining transparent to the
program and its user. Remote execution pools the compute resources of the nodes in a
LAN, so the computers in the LAN appear as a single virtual uni-processor.
In remote execution, the computation of a process is done on the idle workstation while
its I/O is performed on the issue workstation; this maintains transparency without
giving up the efficiency of executing programs remotely.
1.2 Compute resource pooling
This idea goes by many names: remote process execution, compute resource
pooling, and uni-processor virtualization. These terms refer to the same thing: if a
node is able to execute the processes of other nodes, it does compute resource pooling.
Users can span their jobs out to other nodes in the LAN with the help of middleware. By
executing a process remotely, a node has virtually added another node's compute
resources to its own; hence the name compute resource pooling.
1.3 Functional components
All remote execution systems address three main functions: the remote execution
environment, scheduling, and fault tolerance. The remote execution environment provides
the process executing on the idle node with the issue node's environment, so the
execution remains transparent. Another crucial component is finding the best-fit idle
node for the job: the scheduling module's performance is critical since it determines
the efficiency of the entire execution, and it has to find the best-fit node in minimal
time. Finally, when more than one node is involved in a system, node failure is the
norm, so the system must be fault-tolerant.
Executing a process on another node is not a big deal: just copy the executables to
the desired node and run them there. The challenge lies in the "being transparent" part.
The remote execution must be transparent to the process and to the user (node) who
(which) executes the process. That is, the execution must behave exactly as if it were
executed on the local node itself. Let's see how we can create such an environment.
In the Fig 1.1 C code snippet, X is the data and its initial value (state) is 0. Then
X += 5 changes its state to 5, as shown in Fig 1.2. On the whole, a process starts with
an initial state and executes a series of state transitions. The node on which it runs
matters only when the process executes system calls, like the "printf" in the snippet.
Redirecting these system calls to the node that issued the process makes the remote
execution transparent; syscall redirection to the issue node is the basis of the remote
execution environment.
An ideal scheduling algorithm is expected to allocate the best idle node for a
remote process in optimal time. The algorithm must maintain a database of alive and idle
nodes in the network. It may also store additional data points about the nodes, like
their resources, percentage of job faults, etc. These additional data points help in
finding the best node for execution. In practice, though, finding the best fit may not
be optimal, as spending compute power and time to crunch data points is not productive.
We are not going to explore scheduling algorithms much; more about them can be found in
chapter 4 of book [8].
Nodes and networks are prone to failures, and this system shouldn't stop functioning if
a node or network link fails. A system's ability to tolerate node or network faults is
known as fault tolerance. The modules that might be affected by faults are scheduling
and remote execution. Since the scheduling algorithm is not strongly affected by
failures, the real challenge comes with remote execution. There is an issue if the idle
node which is executing the job fails: the issue node might reissue the job, but that
wastes time and other resources as well. It could be argued that node failures in a LAN
are rare, but these systems also stop any remote job when the user of the node returns.
Existing remote execution
systems use a checkpointing mechanism for remote execution fault tolerance. In this
report, we discuss the implementation of the two functionalities and the design of the
entire system that implements this remote execution, and in Chapter 4 the testing of
the two functionalities.
Chapter 2
Literature review
Content in the review compares and evaluates remote environment and fault-tolerance
aspects from papers like Remote unix[2], Butler system[3], Condor[4], and DAWGS[5]. In
addition to that, this review also proposes a new fault-tolerance mechanism as an
alternative to the conventional checkpoint-and-restore mechanism. The new method is
more efficient than the latter in terms of storage, network bandwidth consumption, and
execution time.
In this review, the term 'issue node' or 'issue workstation' refers to the user's
workstation that issues a job/process. 'Remote node' refers to the node that does the
computation of the job issued by the user. 'System' or 'middleware' refers to the
software that provides the remote execution and fault tolerance.
The papers differ in the degree of transparency, the I/O supported, and the
fault-tolerance mechanism. Butler and DAWGS support standard I/O and data files,
whereas Condor and Remote Unix expect users to link their libraries with the program;
Condor limits its support to data files alone, and Remote Unix expects users to provide
files for standard I/O. The Butler system supports GUI programs but requires the Andrew
File System [6] and a special window server (GUI server).
The limitation of these papers is that they expect a homogeneous UNIX setup and
don't support programs that spawn child processes or use Inter-Process Communication.
Those features could be enabled as well, with the necessary syscall redirection and
bookkeeping.
These systems prioritize the user of the idle workstation: on his/her return, the
remote process executing on that node should be suspended or moved to another idle
machine. The system also requires a mechanism for handling workstation crashes and
restarts. All papers except the Butler system use checkpointing as the fault-tolerance
mechanism. When a remote process executes, the middleware captures the process state
(stack, register values, files opened, etc.) at regular intervals. If the process or
its node fails, the job is restarted from the last checkpoint, which ensures the
eventual completion of jobs. The checkpoint itself must be stored in a fault-tolerant
way!
The time taken for remote execution in this method is Tr = X + Sh + M + R + (X/a)C + F.
'X' refers to the actual process execution time, 'Sh' to the time taken for finding
idle nodes, 'M' to the initial process migration, 'R' to returning the execution
results, and 'C' is the checkpointing overhead. Since checkpoints are taken at
intervals of 'a', there are X/a checkpoints over the total execution. 'F' is the time
required for process migration after a failure. This method also has the overhead of
redoing computation: in case of a process failure, restarting from the previous
checkpoint makes the process redo up to 'a' time of work (DAWGS captures checkpoints
every 30 min, so if the process fails just shy of the next checkpoint, almost 30 min of
work must be redone). Besides the time overhead, storage and network bandwidth
are consumed as well. Systems like DAWGS move the checkpoint data to the issue node;
alternatively, checkpoints can be kept on the remote machine itself and transferred
only when process migration is required, but that is not fault-tolerant against a crash
of the remote node itself.
Though checkpointing is an effective method for fault tolerance, it also frustrates
users who issue interactive jobs, because of the redo and the time lag for process
migration after a failure. It may be argued that most users of such a system issue
compute-bound processes, but the Butler paper's month-long job report states that a
considerable number of interactive jobs are executed remotely as well. Let's see
whether a better method can be proposed to overcome checkpointing's shortcomings.
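To make the comparison concrete, here is a small C sketch of the two cost formulas with hypothetical numbers (X = 120 min, a = 30 min; the other terms are illustrative values, not measurements from any of the papers):

```c
/* Remote execution time under checkpointing:
 *   Tr = X + Sh + M + R + (X/a)*C + F
 * X: execution time, Sh: scheduling time, M: initial migration,
 * R: result return, a: checkpoint interval, C: per-checkpoint overhead,
 * F: post-failure migration time. All values in minutes. */
double tr_checkpoint(double X, double Sh, double M, double R,
                     double a, double C, double F) {
    return X + Sh + M + R + (X / a) * C + F;
}

/* Under active replication the checkpoint terms vanish:
 *   Tr = X + Sh + M + R */
double tr_replication(double X, double Sh, double M, double R) {
    return X + Sh + M + R;
}
```

With X = 120 and a = 30 there are four checkpoints; at C = 0.5 min each, checkpointing adds two minutes even in the failure-free case, plus up to a = 30 min of redo after a failure.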
Nodes and networks are prone to failures, and this system shouldn't stop functioning if
a node or network link fails; this ability to tolerate node or network faults is known
as fault tolerance. The modules that might be affected by faults are scheduling and
remote execution. Since the scheduling algorithm is not dependent on any single node,
the real concern comes with remote execution. There is an issue if the idle node
executing the job of another node fails: the issue node might reissue the job, but that
wastes time and compute power as well. Existing systems use a checkpointing mechanism
for fault tolerance; instead, we propose to use replication. If the system can
replicate the process on multiple machines and make sure it appears as a single process
to the user, then fault tolerance can be achieved with less overhead in network,
storage, and time. This may consume more computing power, but it is better to spend
computing power than time, network, and storage.
Whenever the clones make a system call, it is redirected to the issue node, and the
shadow program makes sure that each system call is executed at most once. Each system
call is logically timestamped [7] by the respective remote node's middleware and sent
to the issue node for execution. On receiving the system call, the shadow process
inspects the timestamp. If it is new, it executes the call, caches the result along
with the timestamp, and replies with the return value. If not, it returns the cached
value for the respective timestamp. This at-most-once system call implementation makes
the multiple executions transparent to the user. Fig 3.3 shows active replication in
action.
If the user returns to a remote node, the process clone executing on that node can be
suspended or killed. This won't affect the other clones; they continue executing
without interruption, and the user experiences little to no delay in the execution.
This method provides faster execution, since now Tr = X + Sh + M + R only, and it also
eliminates the time delay caused by re-execution after failures. Users get a better
experience when they export interactive processes. Fig 3, taken from the DAWGS paper,
shows local vs. remote job execution, with the x-axis varying the probability of the
user reclaiming the idle workstation. We can see the remote execution time increase
exponentially with more process migration. The active replication strategy will have
'almost' linear execution time irrespective of the probability of reclaim (this changes
only if all clones fail).
This method has disadvantages too. The checkpoint method is able to guarantee eventual
completion in all cases, but the active replication method may fail to complete the
process if all clones get killed. The probability of process loss is minimized by
making more clones; however, more clones mean occupying more nodes per process. This
blocks more workstations, and the process may spend more time waiting to get scheduled.
We have two fault-tolerance mechanisms, checkpointing and active replication, each with
its own pros and cons. Checkpointing provides eventual completion of the process, while
active replication provides faster completion in case of failures. Active replication
consumes more compute power than the former but is better in storage, bandwidth
consumption, and time. Checkpointing is favorable for compute-bound jobs, whereas
active replication is ideal for interactive jobs. A hybrid of the two could be made
that guarantees eventual completion of jobs with less storage, network, and time
overhead.
2.3 Conclusion
Condor and Remote Unix link remote-execution stubs into the syscall libraries of the
program: whenever the process makes a syscall, the remote execution stub executes
instead. Butler needs a distributed file system to function, which can also serve as
common storage for all nodes, whereas DAWGS uses kernel hooks to reroute the syscalls.
All systems provide standard file I/O, with the exception of Condor, which supports
only data-file I/O (standard I/O can't be used in the process). Butler provides the
added advantage of remotely executing GUI programs. All systems except Butler use
checkpointing for fault tolerance.
Chapter 3
Proposed system
The following subsections will discuss the implementation of the two main
functionalities and the possible modules and communication among them; the code lives
in the GitHub repo [9]. That repo also contains containers with proofs of concept for
remote execution and replicated fault tolerance.
As discussed earlier, to provide the local environment in the remote node, the
process's syscalls must be executed by the issue node: syscalls made by the remote
process must be captured, routed back to the issue node, and their results returned.
Fortunately, UNIX provides a less messy way of capturing the syscalls of a process!
Before that, let's review syscall semantics.
Syscall semantics
Every system call follows fixed register semantics when called. During a system call,
the program sets the AX register to the syscall number, then raises the interrupt
defined for syscalls. The kernel checks the AX register and the remaining registers for
the arguments and executes the syscall in kernel mode. The return value is placed in
the AX register for the calling process to read. Table 1.1 shows the register values
for a file I/O syscall in UNIX.
Table 1.1 : Register values for the sys_write syscall

AX | System call | BX              | CX              | DX
1  | sys_write   | unsigned int fd | const char *buf | size_t count
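The sys_write row of Table 1.1 can be exercised from C through the syscall(2) wrapper, which loads the syscall number into AX (RAX on x86-64) and the arguments into the remaining registers before trapping into the kernel. This is a sketch for illustration; the report's stubs intercept these calls via ptrace rather than issuing them directly:

```c
#include <unistd.h>
#include <sys/syscall.h>

/* Invoke sys_write directly: SYS_write fills the AX slot of Table 1.1
 * and fd/buf/count fill the argument registers. The byte count written
 * comes back in AX as the return value. */
long raw_write(int fd, const void *buf, unsigned long count) {
    return syscall(SYS_write, fd, buf, count);
}
```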
Ptrace
Processes in UNIX are represented in a tree format: the init process is the parent of
all processes; it spawns several child processes, each executing a program. Similarly,
the program submitted for remote execution is made to run as a child of a middleware
stub. UNIX gives parents certain privileges over their child processes; the "ptrace"
syscall is one of them.

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);

Above is the ptrace syscall signature. The first parameter defines the action to be
performed. In the second parameter, "pid", the tracer (parent) passes the tracee's
(child's) pid. The final two parameters are used to read (write) from (to) the tracee's
memory.
PTRACE_TRACEME is called by the child immediately after the fork, allowing the parent
to trace it. PTRACE_SYSCALL intercepts a syscall from the child as well as the return
of that syscall from the OS. PTRACE_GETREGS and PTRACE_SETREGS get and set the register
values of the child process; in those actions, the void *data parameter accepts the
regs struct, which has a field for every general-purpose register available in the
architecture. This ties the system to UNIX, since ptrace is available only in
UNIX-based OSes. Moreover, this method has a considerable overhead: when the tracee
makes a syscall, the OS checks whether that particular process is being traced and, if
so, sends a resume signal to the waiting tracer process. There are several other ways
to capture and reroute syscalls, like the kernel hooks used in DAWGS. Kernel hooks
modify the interrupt routine for syscall interrupts, so the installed routine executes
whenever any syscall occurs. The problem with this approach is that the system calls of
all processes are intercepted, traced or not.
Capturing syscalls
Ptrace provides an elegant way to capture the syscalls of a child process. In our case,
the remote process is executed as a child of a stub (the idle stub), which traces it
and is stopped at every syscall the child makes.
Whenever the remote process executes a system call, the kernel signals the stub. The
stub reads the remote process's AX register value to identify the syscall and,
depending on the syscall, reads the contents of the other relevant registers. If a
register contains a value, we retrieve it as-is; if it holds a memory location, the
stub needs to peek at the child's memory at that address.
The syscall type and its parameters are then sent to the stub executing on the issue
node through a dedicated connection. On receiving the packet, the issue stub executes
the syscall with the received values. After execution, the desired values (the return
value and any other output values) are sent back to the idle stub, which places the
return value in the AX register and the other values at the appropriate memory
addresses of the child process, as the syscall semantics dictate.
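The capture loop described above can be sketched with ptrace as follows. This is a minimal illustration, not the repo's actual stub: it only prints the syscall numbers where the real idle stub would forward them to the issue node.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `path` as a traced child and report each syscall it makes.
 * PTRACE_SYSCALL stops the child at both syscall entry and exit, so each
 * call is reported twice; the real stub would send the request at entry
 * and write the issue node's reply into AX at exit. */
void trace_syscalls(const char *path) {
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL); /* let the parent trace us */
        execl(path, path, (char *)NULL);
        _exit(1);
    }
    int status;
    waitpid(child, &status, 0);                /* child stops at exec */
    while (!WIFEXITED(status)) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL); /* run to next stop */
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        /* orig_rax is the "AX" of Table 1.1: the syscall number */
        fprintf(stderr, "syscall %llu\n",
                (unsigned long long)regs.orig_rax);
    }
}
```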
3.1.2 Replicated fault-tolerance
The syscalls need to be logically timestamped at the idle stub; the pseudocode below
illustrates the logic. The remote process must be deterministic, so the syscalls made
by all clones are identical in order and parameter values. If the logical timestamp of
a received syscall is greater than the current logical timestamp at the issue node, the
syscall is executed on the issue node and the returned values are stored in memory,
associated with that logical timestamp. If the received timestamp is less than the
current local timestamp, the stored value for the received timestamp is sent back. This
makes each syscall execute at most once despite the multiple clones.
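Since the pseudocode figure is not reproduced here, a minimal C sketch of the issue-side logic follows (the names shadow_execute/exec_syscall and the flat cache layout are illustrative, not the repo's code):

```c
#define MAX_CALLS 1024

/* Return value cached per logical timestamp already executed. */
static long cache[MAX_CALLS];
static int  seen[MAX_CALLS];

/* Stand-in for actually performing the syscall on the issue node. */
static long exec_syscall(long sysno, long arg) { return sysno + arg; }

/* At-most-once execution: a syscall with an unseen timestamp runs and is
 * cached; a duplicate from another clone just replays the cached value. */
long shadow_execute(int ts, long sysno, long arg) {
    if (!seen[ts]) {
        cache[ts] = exec_syscall(sysno, arg);
        seen[ts] = 1;
    }
    return cache[ts];
}
```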
3.2 System architecture
In this section, we are going to discuss the additional modules, its functionalities
and the communication among them. These additional modules help the remote execution
This system is given as service so the user could access it from CLI. An architecture
is a device so it would support such calls and functions of the system. Fig 3.1 provides
the design of architecture and the table the communication between modules.
Table 3.1 : Communication legend

Sno | Description
1   | Daemon initiates the idle script with cmd-line args containing the path to the executable and the issue system's addr to connect
2   | Issue script sends a UDP request to the local daemon process with the path to the executable and the number of replicas; the daemon replies with the port to listen on
3   | Issue script initiates the syscall stub with cmd-line args [port and no. of replicas]
4   | A dedicated connection for rerouting syscalls for that process
Daemon script
The daemon script runs in the background on all nodes that want to participate in this
system. It listens on udp:8000 for heartbeats from members, resource requests, and
remote execution requests from local users. It performs the following functions:
scheduling, membership-list maintenance, and triggering the idle or issue scripts on
demand.
Idle script
It's the parent script on the idle node, triggered by the daemon with the address of
the issue system and the path to the executable when a node requests resources. It
connects to the provided address and starts executing the executable as a child under
ptrace. On a syscall from the child, it diverts the call to the issue system's stub.
Issue script
The issue script is the main initiator of this system. Users who want remote execution
run a command in the CLI that triggers the issue script, which takes the executable
file path and the number of replicas. It then pings the local daemon to negotiate a
remote resource for the execution. On getting a reply with the port to listen on, the
issue script activates the syscall stub.
Syscall stub
The syscall stub is another entity on the issue system; it listens for syscall requests
from the remote processes. It is triggered by the issue script with a port to listen on
and the number of replicas to connect with.
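A skeleton of the daemon's UDP listener follows, for illustration only; the actual wire format and dispatch logic live in the repo [9] and are assumptions here:

```c
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind udp:8000 (as stated in the text) and receive one datagram.
 * A real daemon would loop forever, classifying each message as a
 * heartbeat, a resource request, or a local remote-execution request. */
int daemon_recv_one(char *buf, unsigned long cap) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0)
        return -1;
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8000);
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(sock);
        return -1;
    }
    ssize_t n = recvfrom(sock, buf, cap, 0, NULL, NULL);
    close(sock);
    return (int)n;
}
```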
Chapter 4
Testing
We will test the remote execution and the fault-tolerance mechanism in this chapter.
The test is done on an x86-64 Intel CPU with a 64-bit Ubuntu operating system. First,
we will go through the setup process, and then we will test the functionalities.
4.1 Setup
The code for the main functionalities is present in the GitHub repo [9]. Clone it
from GitHub. Once cloned, go to the directory and run "make stdout". It will compile all
the necessary files and place them in the tmp folder. Explore the Makefile for more
options. Since this section tests only the proofs of concept for remote execution and
replicated fault tolerance, we assume that an idle node has been chosen and the
executables have been transferred.

cd Computational-Resource-Pooling
make stdout

We require two terminals to simulate the idle and issue machines. After running make,
the idle, issue, and sample executables are placed in the ./tmp dir. Change directory
to tmp in both terminals.
The issue stub must be executed before the idle stub. To run it, use the following
command; the "-p" parameter is the TCP port on which the stub listens for syscalls from
the idle stub.

./issue -p <port>

The idle stub expects the parameters "-h" and "-p" for the issue node's address and
port (in this case -h is 127.0.0.1 and -p is the same port passed to the issue
executable), along with the executable to run.
Fig 4.1 : Remote execution of a stdout program which prints into the console
Replicated fault-tolerance testing is the same as above, but more idle stubs need to be
executed, one per replica.
Chapter 5
Future works
Currently, jobs that fork child processes can't be remotely executed. This caveat can
be overcome by masking IPC and through bookkeeping, and doing so might even increase
parallelism in job execution. If a remote process spawns a child, the scripts must
recognize it from the syscall it makes: fork and clone are the child-creation calls. On
such events, the script can identify the child creation and make the children run on
separate computers.
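Detecting those events at the stub could look like the following check on the intercepted syscall number (an illustrative helper, not code from the repo):

```c
#include <sys/syscall.h>

/* fork, vfork, and clone are the child-creation syscalls the stub would
 * need to intercept before supporting jobs that spawn children. */
int is_child_creation(long sysno) {
    return sysno == SYS_fork || sysno == SYS_vfork || sysno == SYS_clone;
}
```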
2. Hybrid fault tolerance
As discussed earlier, both checkpointing and active replication have their own pitfalls
and merits: checkpointing guarantees eventual completion under all scenarios, while
active replication saves storage, time, and network bandwidth. Combining the two, a
hybrid algorithm could deliver the best of both worlds: eventual completion with less
resource consumption.
With every redirected syscall paying a full network round trip, a syscall-heavy process
becomes much more inefficient. So we can support async syscalls and batch them as well.
For example, the write syscall could be made async; async syscalls can then be batched
and sent to the issue node in a single message.
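A minimal sketch of such a batching buffer (the names and the flat byte buffer are illustrative):

```c
#include <string.h>

#define BATCH_CAP 4096

/* Async-able syscalls (e.g. write) are appended here and shipped to the
 * issue node in one message instead of one round trip per call. */
static char batch[BATCH_CAP];
static unsigned long batch_len = 0;

/* Queue one payload; returns 0 if batched, -1 if the buffer is full. */
int batch_write(const void *buf, unsigned long n) {
    if (batch_len + n > BATCH_CAP)
        return -1;
    memcpy(batch + batch_len, buf, n);
    batch_len += n;
    return 0;
}

/* A real flush would send `batch` over the stub connection; this one
 * just returns how many bytes a single network message would carry. */
unsigned long batch_flush(void) {
    unsigned long n = batch_len;
    batch_len = 0;
    return n;
}
```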
Reference
4. Litzkow, M. J., Livny, M., and Mutka, M. W. Condor - A hunter of idle workstations.
Proc. 1988 Conference on Distributed Computing Systems, pp. 104-111.
6. Howard, J.H.; Kazar, M.L.; Nichols, S.G.; Nichols, D.A.; Satyanarayanan, M.;
Sidebotham, R.N. & West, M.J. (February 1988). "Scale and Performance in a
Distributed File System". ACM Transactions on Computer Systems. 6 (1): 51–81.
7. Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed System.
Communications of the ACM, 21(7), July 1978, pp. 558-565.