
COMPUTE RESOURCE POOLING

A PROJECT REPORT

Submitted by

SUDARSAN S 813817104093

BAGATH IBRAHIM A 813817104302

in partial fulfillment for the award of the degree

of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE AND ENGINEERING

SARANATHAN COLLEGE OF ENGINEERING

ANNA UNIVERSITY : CHENNAI 600 025

MARCH 2021

Abstract

Academic institutions and workplaces provide dedicated workstations to each user. Most users do not use their CPUs to their full potential; even at the busiest times, a considerable number of CPU cycles are wasted. These idle compute resources could be used by compute-intensive applications. Various approaches have been proposed in the literature to harness the wasted cycles. One such idea is to transparently execute processes on idle workstations: computation is done on the idle workstation while I/O is performed on the user's workstation, which maintains the transparency. In this report, we present a new implementation method for remote execution and a new mechanism for fault tolerance.

TABLE OF CONTENTS

      Abstract
      List of tables
      List of figures
1     Introduction
1.1   Why remote execution is required
1.2   Compute resource pooling
1.3   Functional components
1.3.1 Remote execution environment
1.3.2 Scheduling algorithm
1.3.3 Fault tolerance
1.4   Rest of the report
2     Literature review
2.1   Comparison in remote execution
2.2   Comparison in fault tolerance
2.3   Conclusion
3     Proposed system
3.1   Implementation of main functionalities
3.1.1 Remote execution environment
3.1.2 Replicated fault-tolerance
3.2   System architecture
4     Testing
4.1   Setup
4.2   Proof of concept for remote execution
5     Future works
5.1   Support for child processes
5.2   Support for socket programs
5.3   Hybrid fault tolerance
5.4   Async and batching of syscalls
      References

List of tables

Table no    Description

Table 2.1   Comparison of existing literature
Table 3.1   Semantics for some syscalls
Table 3.2   Communication legend

List of figures

Figure no   Description

Fig 1.1     Compute resource pooling
Fig 1.2     Example C stub
Fig 1.3     Computation
Fig 2.1     Replication in storage systems
Fig 2.2     Replicated remote execution
Fig 3.1     Rerouting syscalls
Fig 3.2     Flowchart for write syscall redirection
Fig 3.3     Modules and communication
Fig 4.1     Remote execution of a stdout program which prints into the console

Chapter 1

Introduction

1.1 Why remote execution is required

A study by Mutka and Livny [1] states that workstations (computer nodes) in academic institutions and workplaces are under-utilized. Their observation of a network of workstations showed that they were idle 70% of the time. Even at the busiest times, a considerable number of CPU cycles were being wasted. Various approaches have been proposed in the literature to harness the wasted cycles. One such idea is to execute processes remotely on idle workstations while remaining transparent to the program and its user. Remote execution pools the compute resources of the nodes in a LAN, so the computers in the LAN appear as a single virtual uni-processor.

In remote execution, the computation of a process is done on the idle workstation and its I/O on the user's workstation (which maintains the transparency). Questions might arise regarding the efficiency of executing programs remotely; studies have shown that remote execution achieves better CPU time than local execution [3].

1.2 Compute resource pooling

Fig 1.1 Compute resource pooling

This idea goes by many names: remote process execution, compute resource pooling, and uni-processor virtualization. These terms refer to the same thing: if a node is able to execute its processes on other machines, it aggregates compute resources, i.e., it does compute resource pooling. Users can span their jobs out to other nodes in the LAN with the help of middleware. By executing a process remotely, a node virtually adds the other nodes' compute resources to its own; hence, compute resource pooling.

1.3 Functional components

All remote execution systems address three main functions:

1. Remote execution environment
2. Scheduling
3. Fault tolerance

The remote execution environment provides the process executing on the idle node with the issue node's environment, so the execution remains transparent. Another crucial component of the system is finding the best-fit idle node for a job; the scheduling module's performance is critical since it determines the efficiency of the entire execution, and it has to find the best-fit node in minimal time. When more than one node is involved in a system, node failure is the norm, so the system must be fault-tolerant. We will look at these components in detail in the upcoming chapters.

1.3.1 Remote execution environment

Fig 1.2 : Example C stub

Executing a process on another node is not a big deal: just copy the executable to the desired node and run it there. The challenge lies in the "being transparent" part. The remote execution must be transparent to the process and to the user (node) who (which) executes it, i.e., the execution must behave exactly as if it were executed on the local node itself. Let's see how we can create such an environment.

Fig 1.3 : Computation

In the C code snippet of Fig 1.2, X is the data. Its initial value (state) is 0. Then the computation X += 5 (a state change) occurs. This change of state is what we call computation, as illustrated in Fig 1.3. On the whole, a process starts with an initial state and executes a series of instructions (computation). This computation is the same irrespective of the node on which it executes. The node matters only when the process executes system calls, like the printf in the snippet. Redirecting these system calls to the node that issued the process makes the remote execution transparent: syscall redirection to the issue node provides a local execution environment on the remote node.
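The original Fig 1.2 stub is available only as an image; the following is a minimal reconstruction of the same idea, with the variable name and printed format assumed:

#include <stdio.h>

int main(void) {
    int X = 0;          /* initial state */
    X += 5;             /* computation: pure state change, node-independent */
    printf("%d\n", X);  /* a write system call underneath: must be executed on the issue node */
    return 0;
}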


1.3.2 Scheduling algorithm

An ideal scheduling algorithm is expected to allocate the best idle node for a remote process in optimal time. The algorithm must maintain a database of the alive and idle nodes in the network. It may also store additional data points about the nodes, such as their resources, the percentage of job faults, etc. These additional data points help in finding the best node for execution. In practice, however, searching for the absolute best fit may not be worthwhile, as spending compute power and time to crunch data points is itself unproductive. We will not explore scheduling algorithms in depth; more on this can be found under processor allocation in chapter 4 of Andrew Tanenbaum's Distributed Operating Systems book [8].

1.3.3 Fault tolerance

Nodes and networks are prone to failures, and the system should not stop functioning if a node or network link fails. A system's ability to tolerate node or network faults is known as fault tolerance. Two functional modules require fault tolerance: scheduling and remote execution. We are going to use a scheduling algorithm that is inherently fault-tolerant, which we will look at in chapter 3 of this thesis. For remote execution, the issue arises when the idle node executing a job fails. The issue node might reissue the job, but that wastes time and other resources as well. It could be argued that node failures in a LAN are rare, but these systems also stop a remote job whenever the user of the idle node returns. Existing remote execution systems use a checkpointing mechanism for remote execution fault tolerance. In this thesis, we will provide a new, more efficient mechanism.

1.4 Rest of the report

The rest of the report is organized as follows. In chapter 2, we compare the remote execution and fault-tolerance functionalities of similar systems. In chapter 3, we discuss the implementation of the two functionalities and the design of the entire system (the functionalities of the various modules and the communication among them that implements remote execution). In chapter 4, we discuss the testing of the two functionalities. Chapter 5 outlines future work.

Chapter 2

Literature review

This review compares and evaluates the remote execution environment and fault-tolerance aspects of papers such as Remote Unix [2], the Butler system [3], Condor [4], and DAWGS [5]. In addition, it proposes a new fault-tolerance mechanism as an alternative to the conventional checkpoint-and-restore mechanism. The new method is more efficient than the latter in terms of storage, network bandwidth consumption, and execution time.

In this review, the term 'issue node' or 'issue workstation' refers to the user's workstation that issues a job/process. 'Remote node' refers to the node that does the computation of the job issued by the user. 'System' or 'middleware' refers to the setup/application that provides the remote execution environment, the scheduler, and fault tolerance.

2.1 Comparison in remote execution

The papers differ in the degree of transparency, the I/O supported, and the prerequisites of the environment. Systems like DAWGS provide complete transparency and support standard I/O and data files. Condor and Remote Unix, by contrast, expect users to link their programs against special libraries; Condor limits its support to data files alone, while Remote Unix expects users to provide files for standard I/O. The Butler system supports GUI programs but requires the Andrew File System [6] and a special window server (for GUI applications) to function.

The limitation of these papers is that they expect a homogeneous UNIX setup and do not support programs that spawn child processes or use inter-process communication. Those features could be enabled as well with the necessary syscall redirection and bookkeeping.

2.2 Comparison in fault tolerance

These systems prioritize the user of the idle workstation: on his/her return, the remote process executing on that node should be suspended or moved to another idle machine. The system also requires a mechanism for handling workstation crashes and restarts. All papers except the Butler system use checkpointing as the fault-tolerance mechanism. While a remote process executes, the middleware captures the process state (stack, register values, open files, etc.) at regular intervals. If the process terminates, the latest checkpoint is restored on another machine. This mechanism ensures the eventual completion of jobs. The checkpoint itself must be stored in a fault-tolerant way!

The time taken for remote execution with this method is Tr = X + Sh + M + R + (X//a)C + F, where 'X' is the actual process execution time, 'Sh' is the time taken to find idle nodes, 'M' is the initial process migration time, 'R' is the time to return the execution results, and 'C' is the checkpointing overhead; since checkpoints are taken every 'a' time units, there are X//a checkpoints over the whole execution. 'F' is the time required for process migration after a failure. This method also carries the overhead of redoing computation: if the process fails, restarting it from the previous checkpoint forces it to redo up to [0, a] of computation (DAWGS captures checkpoints every 30 min, so if the process fails just shy of the next checkpoint, almost 30 min must be redone). Besides the time overhead, storage and network bandwidth are consumed as well. Systems like DAWGS move the checkpoint data to the issue node, which consumes network bandwidth, whereas Condor saves the checkpoint on the remote machine itself and transfers it only if process migration is required; this, however, is not resilient against remote node crashes.
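As a hedged illustration with assumed numbers (not taken from any of the surveyed papers): suppose X = 120 min, a = 30 min, C = 2 min, Sh + M + R = 6 min, and no failure occurs (F = 0). Then

Tr = 120 + 6 + (120 // 30) * 2 = 120 + 6 + 8 = 134 min,

and a crash just before a checkpoint would additionally force up to a = 30 min of recomputation on top of F.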

Though checkpointing is an effective method for fault tolerance, it frustrates users who issue interactive jobs because of the redo work and the time lag of process migration after a failure. One might argue that most users of such a system issue compute-bound processes, but the Butler paper's report of one month of jobs states that a considerable share of the submitted jobs were interactive, and memory-intensive interactive applications were executed remotely. Let's see whether a better method can be proposed to overcome checkpointing's shortcomings.

Replicated process execution for fault-tolerance

As stated earlier, nodes and networks are prone to failures, and the system should not stop functioning when a node or network link fails. The modules in this system that can be affected by faults are scheduling and remote execution. Since the scheduling algorithm does not depend on a single node or network link to work properly, it is inherently fault-tolerant. For remote execution, the issue arises when the idle node executing another node's job fails. The issue node might reissue the job, but that wastes time and compute power as well. Existing systems use a checkpointing mechanism for remote execution fault tolerance.

Fig 2.1 : Replication in storage systems

Distributed storage systems use replication to overcome faults, and we can use replication similarly. If the system can replicate the process on multiple machines and make sure it appears as a single process to the user, then fault tolerance can be achieved with less overhead in network, storage, and time. This might consume more computing power, but it is better to spend computing power than time, network, and storage.

When a user exports a process, it is made to execute on multiple idle systems simultaneously (clones). The computation of the clones proceeds independently. On a system call, the clones revert to the issue node, and the shadow program makes sure that each system call is executed at most once. Each system call is logically timestamped [7] by the respective remote node's middleware and sent to the issue node for execution. On receiving the system call, the shadow process inspects the timestamp: if it is new, it executes the call, caches the result along with the timestamp, and replies with the return value; if not, it returns the cached value for that timestamp. This at-most-once system call implementation makes the multiple executions transparent to the user. Fig 2.2 shows active replication in remote process execution.

If the user returns to a remote node, the process clone executing on that node can be suspended or killed. This action does not affect the other clones; they continue executing without interruption, and the user experiences little to no delay in the execution. This method provides faster execution, since now Tr = X + Sh + M + R only, and it also eliminates the time delay of re-execution after failures. Users get a better experience when they export interactive processes. Figure 3 of the DAWGS paper shows local versus remote job execution, with the x-axis varying the probability of the user reclaiming the idle workstation; remote execution time increases exponentially with more process migration. The active replication strategy would have almost linear execution time irrespective of the probability of reclaim (this changes only if all clones fail).

Fig 2.2 : Replicated remote execution

This method has disadvantages too. The checkpoint method guarantees eventual completion in all cases, whereas active replication might fail to complete the process if all clones get killed. The probability of process loss is minimized by making more clones, but more clones means occupying more nodes per process: more workstations are blocked, and more of the process's time is spent waiting for an idle workstation on which to execute.

Comparison with checkpoint mechanism

We now have two fault-tolerance mechanisms, checkpointing and active replication, each with its own pros and cons: checkpointing provides eventual completion of the process, while active replication provides faster completion in the presence of failures. Active replication consumes more compute power than the former but is better in storage, bandwidth consumption, and time. Checkpointing is favorable for compute-bound jobs, whereas active replication is ideal for interactive jobs. A hybrid of the two could be built that guarantees eventual completion of jobs with less storage, network, and time overhead.

2.3 Conclusion

Systems like Remote Unix and Condor achieve transparent execution by relinking the program's syscall libraries: whenever the process makes a syscall, the remote execution stub executes instead. Butler needs a distributed file system to function, which also acts as common storage for all nodes. DAWGS, in contrast, uses kernel hooks to reroute the syscalls. All systems provide standard file I/O, with the exception of Condor, which supports only data-file I/O (standard I/O can't be used in the process). Butler provides the added advantage of remotely executing GUI programs. All systems except Butler provide checkpoint-based fault tolerance.

Table 2.1 : Comparison of existing literature

Chapter 3

Proposed system

The following subsections discuss the implementation of the two main functionalities and the modules and communication among them that provide these functionalities.

3.1 Implementation of main functionalities

The implementation of the main functionalities can be viewed in the linked GitHub repo [9]. That repo also contains containers for a proof of concept of remote execution and replicated fault-tolerance. We will look at testing in the next chapter.

3.1.1 Remote execution environment

Fig 3.1 : Rerouting syscalls semantics

As discussed earlier, to provide the local environment on a remote node, the process's syscalls must be executed by the issue node. Syscalls made by the remote process must be captured, routed back to the issue node, and their results returned. Fortunately, UNIX provides a relatively clean way to capture the syscalls of a process! Before continuing, let's look at syscall semantics.

Syscall semantics

Though a syscall looks similar to a normal procedure call, it follows different semantics. To make a system call, the program sets the AX register to the corresponding syscall id and the BX, CX, and DX registers to the arguments of that syscall. It then raises the interrupt defined for syscalls. The kernel reads the AX register for the syscall id and the remaining registers for the arguments and executes the syscall in kernel mode. The return value is placed in the AX register for the calling process to read. Table 3.1 shows the register values for the file I/O syscalls in UNIX.

Table 3.1 : Semantics for some syscalls

AX   System call   BX                     CX                 DX
0    sys_read      unsigned int fd        char *buf          size_t count
1    sys_write     unsigned int fd        const char *buf    size_t count
2    sys_open      const char *filename   int flags          int mode
3    sys_close     unsigned int fd        -                  -
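As a small illustration of the sys_write row, one way to invoke it from C is through the generic syscall() wrapper. This is only a sketch of the calling convention on a 64-bit Linux system, not part of the middleware:

#define _GNU_SOURCE
#include <unistd.h>
#include <sys/syscall.h>

int main(void) {
    const char msg[] = "hello\n";
    /* syscall id 1 corresponds to sys_write in Table 3.1:
       fd = 1 (stdout), buf = msg, count = 6 */
    long ret = syscall(SYS_write, 1, msg, sizeof(msg) - 1);
    /* ret is the value the kernel placed in AX: bytes written, or -1 on error */
    return ret < 0 ? 1 : 0;
}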

Ptrace

Processes in UNIX are arranged in a tree: the init process is the ancestor of all processes, and every process is spawned as a child of some parent. Similarly, the program submitted for remote execution is made to run as a child of a middleware stub. UNIX gives parents certain privileges over their child processes: the ptrace syscall allows a parent (the tracer) to trace its children (the tracees). Though ptrace provides various functions, we use only syscall tracing and register and memory peeking.

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);

Above is the ptrace syscall signature. The first parameter defines the action to be performed. In the second parameter, pid, the tracer (parent) passes the tracee's (child's) pid. The final two parameters are used to read from (or write to) the tracee's memory. We use the following actions provided by __ptrace_request.

PTRACE_TRACEME is called by the child immediately after the fork, allowing the parent to trace it. PTRACE_SYSCALL resumes the child and stops it both when it enters a syscall and when the syscall returns from the OS. PTRACE_GETREGS and PTRACE_SETREGS get and set the register values of the child process; for these actions the void *data parameter takes a regs struct with fields for all the general-purpose registers of the CPU. PTRACE_PEEKDATA reads the value at an address in the tracee's memory (PTRACE_POKEDATA writes it).
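The listing below is a minimal, hedged sketch of how these actions fit together; it is not the repo's actual stub. The program to trace is assumed to be passed as argv[1], and x86-64 register names are assumed:

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) return 1;
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* let the parent trace us */
        execvp(argv[1], &argv[1]);               /* run the submitted program */
        return 1;                                /* exec failed */
    }
    int status;
    waitpid(child, &status, 0);                  /* child stops at exec */
    while (!WIFEXITED(status)) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);  /* run until syscall entry/exit */
        waitpid(child, &status, 0);
        if (WIFSTOPPED(status)) {
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            printf("syscall id (AX): %llu\n",
                   (unsigned long long)regs.orig_rax);  /* x86-64: orig_rax holds AX */
        }
    }
    return 0;
}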

Limitations with ptrace

This method of rerouting syscalls has a few limitations. It is platform dependent, so it can't work in a heterogeneous environment; it requires a UNIX-based operating system, since ptrace is available only on UNIX-like systems. Moreover, this method has a considerable overhead: when the tracee makes a syscall, the OS checks whether that particular process is being traced, and if so, it sends a resume signal to the waiting tracer process. There are other ways to capture and reroute syscalls, such as the kernel hooks used in DAWGS, which modify the interrupt routine for the syscall interrupt so that the installed routine executes whenever any syscall occurs. The problem with that approach is that system calls from all processes execute the routine.

Capturing syscalls

Ptrace provides an elegant way to capture the syscalls of a child process. In our case, the remote process is executed as a child of the stub running on the idle node (the idle stub), and that stub traces the remote process for syscalls.

Fig 3.2 : Flowchart for write syscall redirection

Whenever the remote process executes a system call, the kernel signals the stub. The stub reads the remote process's AX register value to identify the syscall and, depending on the syscall, reads the contents of the other relevant registers. If a register contains a plain value, it is retrieved as-is; if it holds a memory address, the stub peeks into the remote process's memory to get the desired value(s).

Rerouting syscalls and executing in the issue node

The syscall type and its parameters are sent to the stub executing on the issue node over a dedicated connection. On receiving the packet, the issue stub executes the syscall with the received values. After execution, the desired values (the return value and any other output values) are sent back to the idle stub. That stub places the return value in the AX register and the other values at the memory addresses of the child process given in the syscall parameters. Then the child process is signaled to continue executing.
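Continuing the hedged sketch from the ptrace section (again, not the repo's actual code), the idle stub's handling of a write syscall stop could look roughly like this. send_to_issue_stub() is an assumed helper standing in for the dedicated connection, and the headers from the previous sketch plus <string.h> are assumed:

/* assumed: executes write() on the issue node and returns its result */
long send_to_issue_stub(int sock, int fd, const char *buf, long len);

void redirect_write(pid_t child, int sock) {
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, child, NULL, &regs);   /* at the syscall-entry stop, orig_rax == 1 (write) */

    long len = regs.rdx;                          /* count argument */
    char buf[4096];
    for (long i = 0; i < len && i < (long)sizeof(buf); i += sizeof(long)) {
        long word = ptrace(PTRACE_PEEKDATA, child, (void *)(regs.rsi + i), NULL);
        memcpy(buf + i, &word, sizeof(long));     /* copy the buffer out of the tracee */
    }

    long ret = send_to_issue_stub(sock, (int)regs.rdi, buf, len);

    regs.orig_rax = -1;                           /* suppress the local write */
    ptrace(PTRACE_SETREGS, child, NULL, &regs);

    ptrace(PTRACE_SYSCALL, child, NULL, NULL);    /* run to the syscall-exit stop */
    waitpid(child, NULL, 0);

    ptrace(PTRACE_GETREGS, child, NULL, &regs);
    regs.rax = ret;                               /* place the issue node's return value in AX */
    ptrace(PTRACE_SETREGS, child, NULL, &regs);
}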

3.1.2 Replicated fault-tolerance

Replicated fault-tolerance is much simpler to implement than checkpointing. The syscalls need to be logically timestamped at the idle stub; the sketch below shows the handling at the issue stub.

The remote process must be deterministic: the syscalls made by all clones must be identical in order and in parameter values. If the logical timestamp of a received syscall is greater than the issue node's current logical timestamp, the syscall is executed on the issue node and the returned values are stored in memory under that logical timestamp. If the received timestamp is less than or equal to the current local timestamp, the stored value for that timestamp is sent back instead. This provides transparent fault tolerance.
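A minimal, hedged sketch of that at-most-once handling at the issue stub; struct syscall_req, execute_syscall(), and send_reply() are illustrative assumptions rather than the repo's actual code:

#define MAX_TS 65536

struct syscall_req;                               /* opaque request description (assumed) */
long execute_syscall(struct syscall_req *req);    /* assumed: runs the syscall locally */
void send_reply(int clone_sock, long ret);        /* assumed: replies to the clone */

static long cached_ret[MAX_TS];                   /* return value per logical timestamp */
static long current_ts = 0;                       /* highest timestamp executed so far */

/* Called for every syscall request arriving from any clone. */
void handle_request(long ts, struct syscall_req *req, int clone_sock) {
    if (ts > current_ts) {
        long ret = execute_syscall(req);          /* first clone to reach this timestamp */
        cached_ret[ts % MAX_TS] = ret;
        current_ts = ts;
        send_reply(clone_sock, ret);
    } else {
        /* a slower clone replaying an already-executed syscall: return the cached value */
        send_reply(clone_sock, cached_ret[ts % MAX_TS]);
    }
}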

3.2 System architecture

In this section, we discuss the additional modules, their functionalities, and the communication among them. These additional modules let the remote execution service run in the background.

Fig 3.3 Modules and communication

The system is provided as a service, so the user can access it from the CLI. The architecture is designed to support such calls and the functions of the system. Fig 3.3 shows the design of the architecture, and the table below lists the communication between the modules.

Table 3.2 : Communication legend

Sno  Description
1    Daemon initiates the idle script with cmd line args containing the path to
     the executable and the issue system's addr to connect to.
2    Issue script sends a UDP request to the local daemon process with the path
     to the executable and the number of replicas; the daemon replies with the
     port to listen on.
3    Issue script initiates the syscall stub with cmd line args [port and no of
     replicas].
4    A dedicated connection for rerouting syscalls for that process.
5    Inter-daemon communication that takes care of membership, scheduling, and
     file migration.

Daemon script

The daemon script runs in the background on every node that wants to participate in the system. It listens on udp:8000 for heartbeats from members, resource requests, and remote execution requests from local users. It performs the following functions: scheduling, membership list maintenance, and triggering the idle or issue scripts on demand.
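As a hedged illustration of the daemon's listening loop: udp:8000 is taken from the text above, while the message tags and the handling shown are assumptions, not the repo's actual protocol:

#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = INADDR_ANY;
    addr.sin_port = htons(8000);                 /* udp:8000 as described above */
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    char msg[1024];
    struct sockaddr_in peer;
    socklen_t plen = sizeof(peer);
    for (;;) {
        ssize_t n = recvfrom(sock, msg, sizeof(msg) - 1, 0,
                             (struct sockaddr *)&peer, &plen);
        if (n <= 0) continue;
        msg[n] = '\0';
        /* dispatch on an assumed message prefix */
        if (strncmp(msg, "HEARTBEAT", 9) == 0)
            printf("membership update from %s\n", inet_ntoa(peer.sin_addr));
        else if (strncmp(msg, "EXEC", 4) == 0)
            printf("remote execution request from a local user: %s\n", msg + 4);
        else if (strncmp(msg, "RESOURCE", 8) == 0)
            printf("resource request from a peer daemon\n");
    }
}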

Idle script

The idle script is the parent script on the idle node. It is triggered by the daemon, with the issue system's address and the path to the executable, when another node requests resources. It connects to the provided address and starts executing the executable as a traced child (via ptrace). When the child makes a syscall, the script diverts it to the issue system's stub.

Issue script

The issue script is the main initiator of the system. A user who wants remote execution runs a command in the CLI that triggers the issue script with the executable's file path and the number of replicas. The script then pings the local daemon to negotiate a remote resource for the execution. On receiving a reply with the port to listen on, the issue script activates the syscall stub.

Syscall stub

The syscall stub is another entity on the issue system; it listens for syscall requests from the remote processes. It is triggered by the issue script with a port to listen on and the number of replicas to connect with.

Chapter 4

Testing

In this chapter we test the remote execution and fault-tolerance mechanisms. The tests were done on an x86-64 Intel CPU running a 64-bit Ubuntu operating system. First, we go through the setup process and then test the functionalities.

4.1 Setup

The code for the main functionalities is present in the GitHub repo [9]. Clone it from GitHub, change into the directory, and execute "make stdout". This compiles all the necessary files and places them in the tmp folder (explore the Makefile for more options). Since this section only tests the proof of concept for remote execution and replicated fault tolerance, we assume that an idle node has already been chosen and the executables have been transferred.

git clone https://github.com/bergstartup/Computational-Resource-Pooling

cd Computational-Resource-Pooling

make stdout

4.2 Proof of concept for remote execution

We use two terminals to simulate the idle and issue machines. After executing make, the idle, issue, and sample executables are placed in the ./tmp dir. Change directory to tmp in both terminals.

The issue stub must be executed before the idle stub. To execute it, run the following command; the "-p" parameter is the TCP port on which the stub listens for syscalls from the idle stub.

./issue -p <port>

The idle stub expects the parameters "-h" and "-p" for the issue node's address and port (in this case -h is 127.0.0.1 and -p is the same port passed to the issue executable) and "-e" for the path of the remote process executable.

./idle -h <IP_of_issue_node> -p <issue_port> -e <exec_path>

Fig 4.1 : Remote execution of a stdout program which prints into the console

Testing replicated fault-tolerance is the same as above, except that more than one idle stub must be executed.

Chapter 5

Future works

5.1 Support for child processes

Currently, jobs that fork child processes can't be remotely executed. This caveat can be overcome by masking IPC and through bookkeeping, and doing so might also increase parallelism in job execution. If a remote process spawns a child, the scripts must learn of it from the syscall it makes: fork and clone are the calls for child creation. On such events the script can identify the child creation, make the children run on separate computers, and mask the IPCs as well to enable parent-child communication. A small sketch of the detection step follows.
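The fragment below is a hypothetical extension of the earlier tracer sketch (x86-64 syscall numbers from <sys/syscall.h> are assumed); what to do with the new child is left as a comment:

/* hypothetical fragment inside the tracer loop, after PTRACE_GETREGS into `regs` */
if (regs.orig_rax == SYS_fork ||
    regs.orig_rax == SYS_vfork ||
    regs.orig_rax == SYS_clone) {
    /* child creation detected: the middleware could negotiate another idle node
       for the new child here and mask the parent-child IPC */
}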

5.2 Support for socket programs

This is a concept similar to gypsy servers [6]. If the syscalls related to sockets are masked, then the computation on received packets can be done elsewhere, and the service could be migrated transparently. This enables several useful applications, for example:

1. Web server migration with zero downtime

2. Acting as a gateway to a service

5.3 Hybrid fault tolerance

As discussed earlier, both checkpointing and active replication have their own pitfalls and strengths: checkpointing guarantees eventual completion under all scenarios, and active replication saves storage, time, and network bandwidth. Combining the two into a hybrid algorithm could give the best of both worlds: eventual completion and low resource consumption.

5.4 Async and batching of syscalls

In general, a syscall is time intensive, and rerouting it makes it even more expensive. We could therefore support asynchronous syscalls and batch them. For example, the write syscall could be made async; async syscalls can then be batched and rerouted all at once.

References

1. M. W. Mutka and M. Livny, "Profiling Workstations' Available Capacity for Remote Execution," Performance '87, Proceedings of the 12th IFIP WG 7.3 Symposium on Computer Performance, Brussels, Belgium, December 7-9, 1987.

2. M. Litzkow, "Remote Unix," Proceedings of the 1987 Summer USENIX Conference, Phoenix, Arizona, June 1987.

3. D. Nichols, "Using Idle Workstations in a Shared Computing Environment," Proceedings of the Eleventh ACM Symposium on Operating Systems Principles (SOSP '87), Association for Computing Machinery, New York, NY, USA, 1987, pp. 5-12.

4. M. J. Litzkow, M. Livny, and M. W. Mutka, "Condor - A Hunter of Idle Workstations," Proceedings of the 1988 International Conference on Distributed Computing Systems, 1988, pp. 104-111.

5. H. Clark and B. McMillin, "DAWGS - A Distributed Compute Server Utilizing Idle Workstations," Proceedings of the Fifth Distributed Memory Computing Conference, Charleston, SC, USA, 1990, pp. 732-741.

6. J. H. Howard, M. L. Kazar, S. G. Menees, D. A. Nichols, M. Satyanarayanan, R. N. Sidebotham, and M. J. West, "Scale and Performance in a Distributed File System," ACM Transactions on Computer Systems, 6(1):51-81, February 1988.

7. L. Lamport, "Time, Clocks, and the Ordering of Events in a Distributed System," Communications of the ACM, 21(7):558-565, July 1978.

8. A. S. Tanenbaum, Distributed Operating Systems.

9. https://github.com/bergstartup/Computational-Resource-Pooling (GitHub repo)

