A PROJECT REPORT
Submitted by
SUDARSAN S 813817104093
of
BACHELOR OF ENGINEERING
IN
MARCH 2021
Abstract
Academic and workplace environments provide a dedicated workstation to each user. Most
users don't use their CPUs to their full potential; even at the busiest times, a
considerable number of CPU cycles are wasted. These wasted cycles could be harnessed
as compute resources. One such idea is to execute processes remotely on an idle
workstation while performing I/O on the user's workstation, which maintains transparency.
In this article, we present a new implementation method for remote execution and a
replication-based fault-tolerance mechanism.
TABLE OF CONTENTS
1 Introduction
2 Literature review
3 Proposed system
4 Testing
4.1 Setup
4.2 Proof of concept for remote execution
5 Future works
5.1 Support for child forks
5.2 Support for socket calls
5.3 Hybrid fault tolerance
5.4 Async syscall batching
Reference
List of tables
List of figures
Chapter 1
Introduction
A study of workstations showed that they were idle 70% of the time. Even at the busiest
times, a considerable number of CPU cycles were being wasted. Various approaches have
been proposed in the literature to harness these wasted cycles. One such idea is to
execute processes remotely on idle workstations while remaining transparent to the
program and its user. Remote execution pools the compute resources of the nodes in a
LAN, so the computers in the LAN appear as a single virtual uni-processor.
In remote execution, the computation of a process is done on the idle workstation while
its I/O is performed on the issue workstation; this maintains transparency without
giving up the efficiency of executing programs remotely.
1.2 Compute resource pooling
This idea goes by many names: remote process execution, compute resource
pooling, and uni-processor virtualization. These terms refer to the same thing: if a
node is able to execute the processes of other nodes, it does compute resource pooling.
Users can span their jobs out to other nodes in the LAN with the help of middleware. By
executing a process remotely, a node has virtually added another node's compute
resources to its own; hence the name compute resource pooling.
1.3 Functional components
All remote execution systems address three main functions: the remote execution
environment, scheduling, and fault tolerance. The remote execution environment provides
the process executing on the idle node with the issue node's environment, so the
execution remains transparent. Another crucial component is finding the best-fit idle
node for the job: the scheduling module's performance is critical since it determines
the efficiency of the entire execution, and it has to find the best-fit node in minimal
time. Finally, when more than one node is involved in a system, node failure is the
norm, so the system must be fault-tolerant.
Executing a process on another node is not a big deal: just copy the executables to
the desired node and run them there. The challenge lies in the "being transparent" part.
The remote execution must be transparent to the process and to the user (node) who
(which) executes the process. That is, the execution must behave exactly as if it were
executed on the local node itself. Let's see how we can create such an environment.
In the Fig 1.1 C code snippet, X is the data and its initial value (state) is 0. Then
X += 5 changes its state to 5, as shown in Fig 1.2. On the whole, a process starts with
an initial state and executes a series of state transitions. The node on which it runs
matters only when the process executes system calls, like the "printf" in the snippet.
Redirecting these system calls to the node that issued the process makes the remote
execution transparent; syscall redirection to the issue node is the basis of the remote
execution environment.
An ideal scheduling algorithm is expected to allocate the best idle node for a
remote process in optimal time. The algorithm must maintain a database of alive and idle
nodes in the network. It may also store additional data points about the nodes, like
their resources, percentage of job faults, etc. These additional data points help in
finding the best node for execution. In practice, though, finding the best fit may not
be optimal, as spending compute power and time to crunch data points is not productive.
We are not going to explore scheduling algorithms much; more about them can be found in
chapter 4 of book [8].
Nodes and networks are prone to failures, and this system shouldn't stop functioning if
a node or network link fails. A system's ability to tolerate node or network faults is
known as fault tolerance. The modules that might be affected by faults are scheduling
and remote execution. Since the scheduling algorithm is not strongly affected by
failures, the real challenge comes with remote execution. There is an issue if the idle
node which is executing the job fails: the issue node might reissue the job, but that
wastes time and other resources as well. It could be argued that node failures in a LAN
are rare, but these systems also stop any remote job when the user of the node returns.
Existing remote execution
systems use a checkpointing mechanism for remote execution fault tolerance. In this
report, we discuss the implementation of the two functionalities and the design of the
entire system that implements this remote execution, and in Chapter 4 the testing of
the two functionalities.
Chapter 2
Literature review
Content in the review compares and evaluates remote environment and fault-tolerance
aspects from papers like Remote unix[2], Butler system[3], Condor[4], and DAWGS[5]. In
addition to that, this review also proposes a new fault-tolerance mechanism as an
alternative to the conventional checkpoint-and-restore mechanism. The new method is
more efficient than the latter in terms of storage, network bandwidth consumption, and
execution time.
In this review, the term 'issue node' or 'issue workstation' refers to the user's
workstation that issues a job/process. 'Remote node' refers to the node that does the
computation of the job issued by the user. 'System' or 'middleware' refers to the
software that provides the remote execution and fault tolerance.
The papers differ in the degree of transparency, the I/O supported, and the
fault-tolerance mechanism. Butler and DAWGS support standard I/O and data files,
whereas Condor and Remote Unix expect users to link their libraries with the program;
Condor limits its support to data files alone, and Remote Unix expects users to provide
files for standard I/O. The Butler system supports GUI programs but requires the Andrew
File System [6] and a special window server (GUI server).
The limitation of these papers is that they expect a homogeneous UNIX setup and
don't support programs that spawn child processes or use Inter-Process Communication.
Those features could be enabled as well, with the necessary syscall redirection and
bookkeeping.
These systems prioritize the user of the idle workstation: on his/her return, the
remote process executing on that node should be suspended or moved to another idle
machine. The system also requires a mechanism for handling workstation crashes and
restarts. All papers except the Butler system use checkpointing as the fault-tolerance
mechanism. When a remote process executes, the middleware captures the process state
(stack, register values, files opened, etc.) at regular intervals. If the process or
its node fails, the job is restarted from the last checkpoint, which ensures the
eventual completion of jobs. The checkpoint itself must be stored in a fault-tolerant
way!
The time taken for remote execution in this method is Tr = X + Sh + M + R + (X/a)C + F.
'X' refers to the actual process execution time, 'Sh' to the time taken for finding
idle nodes, 'M' to the initial process migration, 'R' to returning the execution
results, and 'C' is the checkpointing overhead. Since checkpoints are taken at
intervals of 'a', there are X/a checkpoints over the total execution. 'F' is the time
required for process migration after a failure. This method also has the overhead of
redoing computation: in case of a process failure, restarting from the previous
checkpoint makes the process redo up to 'a' time of work (DAWGS captures checkpoints
every 30 min, so if the process fails just shy of the next checkpoint, almost 30 min of
work must be redone). Besides the time overhead, storage and network bandwidth
are consumed as well. Systems like DAWGS move the checkpoint data to the issue node;
alternatively, checkpoints can be kept on the remote machine itself and transferred
only when process migration is required, but that is not fault-tolerant against a crash
of the remote node itself.
Though checkpointing is an effective method for fault tolerance, it also frustrates
users who issue interactive jobs, because of the redo and the time lag for process
migration after a failure. It may be argued that most users of such a system issue
compute-bound processes, but the Butler paper's month-long job report states that a
considerable number of interactive jobs are executed remotely as well. Let's see
whether a better method can be proposed to overcome checkpointing's shortcomings.
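To make the comparison concrete, here is a small C sketch of the two cost formulas with hypothetical numbers (X = 120 min, a = 30 min; the other terms are illustrative values, not measurements from any of the papers):

```c
/* Remote execution time under checkpointing:
 *   Tr = X + Sh + M + R + (X/a)*C + F
 * X: execution time, Sh: scheduling time, M: initial migration,
 * R: result return, a: checkpoint interval, C: per-checkpoint overhead,
 * F: post-failure migration time. All values in minutes. */
double tr_checkpoint(double X, double Sh, double M, double R,
                     double a, double C, double F) {
    return X + Sh + M + R + (X / a) * C + F;
}

/* Under active replication the checkpoint terms vanish:
 *   Tr = X + Sh + M + R */
double tr_replication(double X, double Sh, double M, double R) {
    return X + Sh + M + R;
}
```

With X = 120 and a = 30 there are four checkpoints; at C = 0.5 min each, checkpointing adds two minutes even in the failure-free case, plus up to a = 30 min of redo after a failure.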
Nodes and networks are prone to failures, and this system shouldn't stop functioning if
a node or network link fails; this ability to tolerate node or network faults is known
as fault tolerance. The modules that might be affected by faults are scheduling and
remote execution. Since the scheduling algorithm is not dependent on any single node,
the real concern comes with remote execution. There is an issue if the idle node
executing the job of another node fails: the issue node might reissue the job, but that
wastes time and compute power as well. Existing systems use a checkpointing mechanism
for fault tolerance; instead, we propose to use replication. If the system can
replicate the process on multiple machines and make sure it appears as a single process
to the user, then fault tolerance can be achieved with less overhead in network,
storage, and time. This may consume more computing power, but it is better to spend
computing power than time, network, and storage.
Whenever the clones make a system call, it is redirected to the issue node, and the
shadow program makes sure that each system call is executed at most once. Each system
call is logically timestamped [7] by the respective remote node's middleware and sent
to the issue node for execution. On receiving the system call, the shadow process
inspects the timestamp. If it is new, it executes the call, caches the result along
with the timestamp, and replies with the return value. If not, it returns the cached
value for the respective timestamp. This at-most-once system call implementation makes
the multiple executions transparent to the user. Fig 3.3 shows active replication in
action.
If the user returns to a remote node, the process clone executing on that node can be
suspended or killed. This won't affect the other clones; they continue executing
without interruption, and the user experiences little to no delay in the execution.
This method provides faster execution, since now Tr = X + Sh + M + R only, and it also
eliminates the time delay caused by re-execution after failures. Users get a better
experience when they export interactive processes. Fig 3, taken from the DAWGS paper,
shows local vs. remote job execution, with the x-axis varying the probability of the
user reclaiming the idle workstation. We can see the remote execution time increase
exponentially with more process migration. The active replication strategy will have
'almost' linear execution time irrespective of the probability of reclaim (this changes
only if all clones fail).
This method has disadvantages too. The checkpoint method is able to guarantee eventual
completion in all cases, but the active replication method may fail to complete the
process if all clones get killed. The probability of process loss is minimized by
making more clones; however, more clones mean occupying more nodes per process. This
blocks more workstations, and the process may spend more time waiting to get scheduled.
We have two fault-tolerance mechanisms, checkpointing and active replication, each with
its own pros and cons. Checkpointing provides eventual completion of the process, while
active replication provides faster completion in case of failures. Active replication
consumes more compute power than the former but is better in storage, bandwidth
consumption, and time. Checkpointing is favorable for compute-bound jobs, whereas
active replication is ideal for interactive jobs. A hybrid of the two could be made
that guarantees eventual completion of jobs with less storage, network, and time
overhead.
2.3 Conclusion
Condor and Remote Unix link remote-execution stubs into the syscall libraries of the
program: whenever the process makes a syscall, the remote execution stub executes
instead. Butler needs a distributed file system to function, which can also serve as
common storage for all nodes, whereas DAWGS uses kernel hooks to reroute the syscalls.
All systems provide standard file I/O, with the exception of Condor, which supports
only data-file I/O (standard I/O can't be used in the process). Butler provides the
added advantage of remotely executing GUI programs. All systems except Butler use
checkpointing for fault tolerance.
Chapter 3
Proposed system
The following subsections will discuss the implementation of the two main
functionalities and the possible modules and communication among them; the code lives
in the GitHub repo [9]. That repo also contains containers with proofs of concept for
remote execution and replicated fault tolerance.
As discussed earlier, to provide the local environment in the remote node, the
process's syscalls must be executed by the issue node: syscalls made by the remote
process must be captured, routed back to the issue node, and their results returned.
Fortunately, UNIX provides a less messy way of capturing the syscalls of a process!
Before that, let's review syscall semantics.
Syscall semantics
Every system call follows fixed register semantics when called. During a system call,
the program sets the AX register to the syscall number, then raises the interrupt
defined for syscalls. The kernel checks the AX register and the remaining registers for
the arguments and executes the syscall in kernel mode. The return value is placed in
the AX register for the calling process to read. Table 1.1 shows the register values
for a file I/O syscall in UNIX.
Table 1.1 : Register values for the sys_write syscall

AX | System call | BX              | CX              | DX
1  | sys_write   | unsigned int fd | const char *buf | size_t count
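The sys_write row of Table 1.1 can be exercised from C through the syscall(2) wrapper, which loads the syscall number into AX (RAX on x86-64) and the arguments into the remaining registers before trapping into the kernel. This is a sketch for illustration; the report's stubs intercept these calls via ptrace rather than issuing them directly:

```c
#include <unistd.h>
#include <sys/syscall.h>

/* Invoke sys_write directly: SYS_write fills the AX slot of Table 1.1
 * and fd/buf/count fill the argument registers. The byte count written
 * comes back in AX as the return value. */
long raw_write(int fd, const void *buf, unsigned long count) {
    return syscall(SYS_write, fd, buf, count);
}
```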
Ptrace
Processes in UNIX are represented in a tree format: the init process is the parent of
all processes; it spawns several child processes, each executing a program. Similarly,
the program submitted for remote execution is made to run as a child of a middleware
stub. UNIX gives parents certain privileges over their child processes; the "ptrace"
syscall is one of them.

long ptrace(enum __ptrace_request request, pid_t pid, void *addr, void *data);

Above is the ptrace syscall signature. The first parameter defines the action to be
performed. In the second parameter, "pid", the tracer (parent) passes the tracee's
(child's) pid. The final two parameters are used to read (write) from (to) the tracee's
memory.
PTRACE_TRACEME is called by the child immediately after the fork, allowing the parent
to trace it. PTRACE_SYSCALL intercepts a syscall from the child as well as the return
of that syscall from the OS. PTRACE_GETREGS and PTRACE_SETREGS get and set the register
values of the child process; in those actions, the void *data parameter accepts the
regs struct, which has a field for every general-purpose register available in the
architecture. This ties the system to UNIX, since ptrace is available only in
UNIX-based OSes. Moreover, this method has a considerable overhead: when the tracee
makes a syscall, the OS checks whether that particular process is being traced and, if
so, sends a resume signal to the waiting tracer process. There are several other ways
to capture and reroute syscalls, like the kernel hooks used in DAWGS. Kernel hooks
modify the interrupt routine for syscall interrupts, so the installed routine executes
whenever any syscall occurs. The problem with this approach is that the system calls of
all processes are intercepted, traced or not.
Capturing syscalls
Ptrace provides an elegant way to capture the syscalls of a child process. In our case,
the remote process is executed as a child of a stub (the idle stub), which traces it
and is stopped at every syscall the child makes.
Whenever the remote process executes a system call, the kernel signals the stub. The
stub reads the remote process's AX register value to identify the syscall and,
depending on the syscall, reads the contents of the other relevant registers. If a
register contains a value, we retrieve it as-is; if it holds a memory location, the
stub needs to peek at the child's memory at that address.
The syscall type and its parameters are then sent to the stub executing on the issue
node through a dedicated connection. On receiving the packet, the issue stub executes
the syscall with the received values. After execution, the desired values (the return
value and any other output values) are sent back to the idle stub, which places the
return value in the AX register and the other values at the appropriate memory
addresses of the child process, as the syscall semantics dictate.
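The capture loop described above can be sketched with ptrace as follows. This is a minimal illustration, not the repo's actual stub: it only prints the syscall numbers where the real idle stub would forward them to the issue node.

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `path` as a traced child and report each syscall it makes.
 * PTRACE_SYSCALL stops the child at both syscall entry and exit, so each
 * call is reported twice; the real stub would send the request at entry
 * and write the issue node's reply into AX at exit. */
void trace_syscalls(const char *path) {
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL); /* let the parent trace us */
        execl(path, path, (char *)NULL);
        _exit(1);
    }
    int status;
    waitpid(child, &status, 0);                /* child stops at exec */
    while (!WIFEXITED(status)) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL); /* run to next stop */
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        /* orig_rax is the "AX" of Table 1.1: the syscall number */
        fprintf(stderr, "syscall %llu\n",
                (unsigned long long)regs.orig_rax);
    }
}
```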
3.1.2 Replicated fault-tolerance
The syscalls need to be logically timestamped at the idle stub; the pseudocode below
illustrates the logic. The remote process must be deterministic, so the syscalls made
by all clones are identical in order and parameter values. If the logical timestamp of
a received syscall is greater than the current logical timestamp at the issue node, the
syscall is executed on the issue node and the returned values are stored in memory,
associated with that logical timestamp. If the received timestamp is less than the
current local timestamp, the stored value for the received timestamp is sent back. This
makes each syscall execute at most once despite the multiple clones.
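Since the pseudocode figure is not reproduced here, a minimal C sketch of the issue-side logic follows (the names shadow_execute/exec_syscall and the flat cache layout are illustrative, not the repo's code):

```c
#define MAX_CALLS 1024

/* Return value cached per logical timestamp already executed. */
static long cache[MAX_CALLS];
static int  seen[MAX_CALLS];

/* Stand-in for actually performing the syscall on the issue node. */
static long exec_syscall(long sysno, long arg) { return sysno + arg; }

/* At-most-once execution: a syscall with an unseen timestamp runs and is
 * cached; a duplicate from another clone just replays the cached value. */
long shadow_execute(int ts, long sysno, long arg) {
    if (!seen[ts]) {
        cache[ts] = exec_syscall(sysno, arg);
        seen[ts] = 1;
    }
    return cache[ts];
}
```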
3.2 System architecture
In this section, we are going to discuss the additional modules, its functionalities
and the communication among them. These additional modules help the remote execution
This system is given as service so the user could access it from CLI. An architecture
is a device so it would support such calls and functions of the system. Fig 3.1 provides
the design of architecture and the table the communication between modules.
Table 3.1 : Communication legend

Sno | Description
1   | Daemon initiates the idle script with cmd-line args containing the path to the executable and the issue system's addr to connect
2   | Issue script sends a UDP request to the local daemon process with the path to the executable and the number of replicas; the daemon replies with the port to listen on
3   | Issue script initiates the syscall stub with cmd-line args [port and no. of replicas]
4   | A dedicated connection for rerouting syscalls for that process
Daemon script
The daemon script runs in the background on all nodes that want to participate in this
system. It listens on udp:8000 for heartbeats from members, resource requests, and
remote execution requests from local users. It performs the following functions:
scheduling, membership-list maintenance, and triggering the idle or issue scripts on
demand.
Idle script
It's the parent script on the idle node, triggered by the daemon with the address of
the issue system and the path to the executable when a node requests resources. It
connects to the provided address and starts executing the executable as a child under
ptrace. On a syscall from the child, it diverts the call to the issue system's stub.
Issue script
The issue script is the main initiator of this system. Users who want remote execution
run a command in the CLI that triggers the issue script, which takes the executable
file path and the number of replicas. It then pings the local daemon to negotiate a
remote resource for the execution. On getting a reply with the port to listen on, the
issue script activates the syscall stub.
Syscall stub
The syscall stub is another entity on the issue system; it listens for syscall requests
from the remote processes. It is triggered by the issue script with a port to listen on
and the number of replicas to connect with.
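A skeleton of the daemon's UDP listener follows, for illustration only; the actual wire format and dispatch logic live in the repo [9] and are assumptions here:

```c
#include <arpa/inet.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind udp:8000 (as stated in the text) and receive one datagram.
 * A real daemon would loop forever, classifying each message as a
 * heartbeat, a resource request, or a local remote-execution request. */
int daemon_recv_one(char *buf, unsigned long cap) {
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0)
        return -1;
    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8000);
    if (bind(sock, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(sock);
        return -1;
    }
    ssize_t n = recvfrom(sock, buf, cap, 0, NULL, NULL);
    close(sock);
    return (int)n;
}
```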
Chapter 4
Testing
We will test the remote execution and the fault-tolerance mechanism in this chapter.
The test is done on an x86-64 Intel CPU with a 64-bit Ubuntu operating system. First,
we will go through the setup process, and then we will test the functionalities.
4.1 Setup
The code for the main functionalities is present in the GitHub repo [9]. Clone it
from GitHub. Once cloned, go to the directory and run "make stdout". It will compile all
the necessary files and place them in the tmp folder. Explore the Makefile for more
options. Since this section tests only the proofs of concept for remote execution and
replicated fault tolerance, we assume that an idle node has been chosen and the
executables have been transferred.

cd Computational-Resource-Pooling
make stdout

We require two terminals to simulate the idle and issue machines. After running make,
the idle, issue, and sample executables are placed in the ./tmp dir. Change directory
to tmp in both terminals.
The issue stub must be executed before the idle stub. To run it, use the following
command; the "-p" parameter is the TCP port on which the stub listens for syscalls from
the idle stub.

./issue -p <port>

The idle stub expects the parameters "-h" and "-p" for the issue node's address and
port (in this case -h is 127.0.0.1 and -p is the same port passed to the issue
executable), along with the executable to run.
Fig 4.1 : Remote execution of a stdout program which prints into the console
Replicated fault-tolerance testing is the same as above, but more idle stubs need to be
executed, one per replica.
Chapter 5
Future works
Currently, jobs that fork child processes can't be remotely executed. This caveat can
be overcome by masking IPC and through bookkeeping, and doing so might even increase
parallelism in job execution. If a remote process spawns a child, the scripts must
recognize it from the syscall it makes: fork and clone are the child-creation calls. On
such events, the script can identify the child creation and make the children run on
separate computers.
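Detecting those events at the stub could look like the following check on the intercepted syscall number (an illustrative helper, not code from the repo):

```c
#include <sys/syscall.h>

/* fork, vfork, and clone are the child-creation syscalls the stub would
 * need to intercept before supporting jobs that spawn children. */
int is_child_creation(long sysno) {
    return sysno == SYS_fork || sysno == SYS_vfork || sysno == SYS_clone;
}
```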
2. Hybrid fault tolerance
As discussed earlier, both checkpointing and active replication have their own pitfalls
and merits: checkpointing guarantees eventual completion under all scenarios, while
active replication saves storage, time, and network bandwidth. Combining the two, a
hybrid algorithm could deliver the best of both worlds: eventual completion with less
resource consumption.
With every redirected syscall paying a full network round trip, a syscall-heavy process
becomes much more inefficient. So we can support async syscalls and batch them as well.
For example, the write syscall could be made async; async syscalls can then be batched
and sent to the issue node in a single message.
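A minimal sketch of such a batching buffer (the names and the flat byte buffer are illustrative):

```c
#include <string.h>

#define BATCH_CAP 4096

/* Async-able syscalls (e.g. write) are appended here and shipped to the
 * issue node in one message instead of one round trip per call. */
static char batch[BATCH_CAP];
static unsigned long batch_len = 0;

/* Queue one payload; returns 0 if batched, -1 if the buffer is full. */
int batch_write(const void *buf, unsigned long n) {
    if (batch_len + n > BATCH_CAP)
        return -1;
    memcpy(batch + batch_len, buf, n);
    batch_len += n;
    return 0;
}

/* A real flush would send `batch` over the stub connection; this one
 * just returns how many bytes a single network message would carry. */
unsigned long batch_flush(void) {
    unsigned long n = batch_len;
    batch_len = 0;
    return n;
}
```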
Reference
4. Litzkow, M. J., Livny, M., and Mutka, M. W. Condor - A hunter of idle workstations.
Proc. 1988 Conference on Distributed Computing Systems, pp. 104-111.
6. Howard, J.H.; Kazar, M.L.; Nichols, S.G.; Nichols, D.A.; Satyanarayanan, M.;
Sidebotham, R.N. & West, M.J. (February 1988). "Scale and Performance in a
Distributed File System". ACM Transactions on Computer Systems. 6 (1): 51–81.
7. Leslie Lamport. Time, Clocks, and the Ordering of Events in a Distributed System.
Communications of the ACM, 21(7), July 1978, pp. 558-565.