You are on page 1of 18

a M 0 E 6 .

a
he Amoeba project is a re- choice. Finally, there is a single, group of CPUs that can be dynami-

T
search effort aimed at un- system-wide file system. The files in cally allocated as needed, used, and
derstanding how to con- a single directory may be located on then returned to the pool. For ex-
nect multiple computers in different machines, possibly in dif- ample, the make command might
a seamless way [16, 17, 26, ferent countries. There is no con- need to do six compilations; so six
27, 3 11. The basic idea is to cept of file transfer, uploading or processors could be taken out of the
provide the users with the illusion downloading from servers, or pool for the time necessary to do
of a single powerful timesharing mounting remote file systems. A the compilation and then returned.
system, when, in fact, the system is file’s position in the directory hier- Alternatively, with a five-pass com-
implemented on a collection of archy has no relation to its location. piler, 5 x 6 = 30 processors could
machines, potentially distributed The remainder of this article will be allocated for the six compila-
among several countries. This re- describe Amoeba and the lessons tions, further gaining speedup.
search has led to the design and we have learned from building it. Many applications, such as heuristic
implementation of the Amoeba dis- In the next section, we will give a search in artificial intelligence (AI)
tributed operating system, which is technical overview of Amoeba as it applications (e.g., playing chess),
being used as a prototype and vehi- currently stands. Since Amoeba use large numbers of pool proces-
cle for further research. In this arti- uses the client-server model, we will sors to do their computing. We cur-
cle we will describe the current state then describe some of the more rently have 48 single board VME-
of the system (Amoeba 4.0), and important servers that have been based computers using the 68020
show some of the lessons we have implemented so far. This is fol- and 68030 CPUs. We also have 10
learned designing and using it over lowed by a description of how wide- VAX CPUs forming an additional
the past eight years. We will also area networks are handled. Then processor pool.
discuss how this experience has in- we will discuss a number of applica- Third are the specialized servers,
fluenced our plans for the next ver- tions that run on Amoeba. Mea- such as directory servers, file serv-
sion, Amoeba 5.0. surements have shown Amoeba to ers, database servers, boot servers,
Amoeba was originally designed be fast, so we will present some of and various other servers with spe-
and implemented at the Vrije our data. After that, we will discuss cialized functions. Each server is
Universiteit in Amsterdam, and is the successes and failures we have dedicated to performing a specific
now being jointly developed there encountered, so that others may function. In some cases, there are
and at the Centrum voor Wiskunde profit from those ideas that have multiple servers that provide the
en Informatica, also in Amsterdam. worked out well and avoid those same function, for example, as part
The chief goal of this work is to that have not. Finally we conclude of the replicated file system.
build a distributed system that is with a very brief comparison be- Fourth are the gateways, which
transparent to the users. This con- tween Amoeba and other systems. are used to link Amoeba systems at
cept can best be illustrated by con- Before describing the software, different sites and different coun-
trasting it with a network operating however, it is worth saying some- tries into a single, uniform system.
system, in which each machine re- thing about the system architecture The gateways isolate Amoeba from
tains its own identity. With a net- on which Amoeba runs. the peculiarities of the protocols
work operating system, each user that must be used over the wide-
logs into one specific machine-his mmhnlcal overvlew 06 area networks.
home machine. When a program is Amoeba All the Amoeba machines run
started, it executes on the home Syetem Arrhltectare the same kernel, which primarily
machine, unless the user gives an The Amoeba architecture consists provides multithreaded processes,
explicit command to run it else- of four principal components, as communication services, I/O, and
where. Similarly, files are local un- shown in Figure I. First are the little else. The basic idea behind the
less a remote file system is explicitly workstations, one per user, on kernel was to keep it small, to en-
mounted or files are explicitly cop- which users can carry out editing hance its reliability, and to allow as
ied. In short, the user is clearly and other tasks that require fast in- much as possible of the operating
aware that multiple independent teractive response. The worksta- system to run as user processes (i.e.,
computers exist, and must deal with tions are all diskless, and are pri- outside the kernel), providing for
them explicitly. marily used in intelligent terminals flexibility and experimentation.
In contrast, users effectively log that do window management,
into a transparent distributed sys- rather than as computers for run- OOJects ancl Capal8lll?les
tem as a whole, rather than to any ning complex user programs. We Amoeba is an object-based system.
specific machine. When a program are currently using Sun-Ss, It can be viewed as a collection of
is run, the system-not the user- VAXstations and X-terminals as objects, each of which contains a set
decides upon the best place to run workstations. of operations that can be per-
it. The user is not even aware of this Second are the pool processors, a formed. For a file object, for exam-

CCYY”IIICITIC”CCF~“LACYIDcccmber 199O/Vo1.33, No.12


ple, typical operations are reading,
writing, appending, and deleting.
The list of allowed operations is
defined by the person who designs
the object and who writes the codes
to implement it. Eloth hardware
and software objects exist.
Associated with each object is a
capability [S], a kind of ticket or key
that allows the holder of the capa-
bility to perform some (not neces-
sarily all) operations on that object.
For example, a user process might
have a capability for a file that per-
mits it to read the tile, but not to
modify it. Capabilities are protected
cryptographically to prevent users
from tampering with them.
Each user process owns some col-
lection of capabilities, which to-
gether define the set of objects it
may access and the types of opera-
tions that may be performed on
each. Thus capabilities provide a
unified mechanism for naming,
accessing, and protecting objects.
From the user’s perspective, the
function of the operating system is
to create an environment in which
objects can be created and manipu- shown in Figure 2. It is 128 bits tion scheme is used. When a server
lated in a protected way. long and contains four fields. The is asked to create an object, it picks
This object-based model visible first field is the senrer port, and is an available slot in its internal ta-
to the users is implemented using used to identify the (server) process bles, and puts the information
remote procedure call [5]. Associ- that manages the obiect. It is in ef- about the object in there along with
ated with each object is a server fect a 4%bit random number a newly generated 4%bit random
process that manages the object. chosen by the server. number. The index for the table is
When a user proces:s want to per- The second field is the object put into the object number field of
form an operation on an object, it number, which is used by the server the capability, the rights bits are all
sends a request message to the to identify which of its objects is set to 1, and the newly generated
server that manages the object. The being addressed. Together, the random number is put into the
message contains the capability for server port and object number check field of the capability. This is
the object, a specification of the uniquely identify the object on an owner capability, and can be
operation to be performed, and any which the operation is to be per- used to perform all operations on
parameters the operation requires. formed. the object.
The user, known as the client, then The third field is the rights field, The owner can construct a new
blocks. After the server has per- which contains a bit map telling capability with a subset of the rights
formed the operation, it sends back which operations the holder of the by turning off some of the rights
a reply message that unblocks the capability may perform. If all the bits and then XOR-ing the rights
client. The combination of sending bits are Is, all operations are field with the random number in
a request message, blocking, and allowed. However, if some of the the check field. The result of this
accepting a reply message forms bits are OS, the holder of the capa- operation is then run through a
the remote procedure call, which bility may not perform the corre- (publicly known) one-way function
can be encapsulated using stub rou- sponding operations. Since the to produce a new 4%bit number
tines, to make the entire remote operations are usually coarse that is put in the check field of the
operation look like a local proce- grained, 8 bits is sufficient. new capability.
dure call. (For other possibilities see To prevent users from just turn- The key property required of the
w31). ing all the 0 bits in the rights field one-way function, f, is that given
The structure of a capability is into 1 bits, a cryptographic protec- the original 4%bit number, N (from

December WW”o1.33, N~.~~ICOYYUWICITIOWSO~T”LMY


the owner capability) and the un- written, and the offset in the file. arise if a server crashes part way
encrypted rights field, R, it is easy The request buffer contains the through the remote operation.
to compute C =f(N XOR R), but data to be written. A reply header Under all conditions of lost mes-
given only C it is nearly impossible contains an error code, a limited sages and crashed servers, Amoeba
to find an argument tof that pro- area for the result of the operation guarantees that messages are deliv-
duces the given C. Such functions (8 bytes), and a capability field that ered at most once. If status 3 is re-
are known [9]. can be used to return a capability turned, it is up to the application or
When a capability arrives at a (e.g., as the result of the creation of run time system to do its own fault
server, the server uses the object an object, or of a directory search recovery.
held to index into its tables to locate operation).
the information about the object. It The primitives for doing remote Remote PrOCeUUre CU118
then checks to see if all the rights operations are listed below: A remote procedure call actually
bits are on. If so, the server knows consists of more than just the re-
get-request(req-header,
that the capability is (or is claimed quest/reply exchange described
req-buffer, req-size)
to be) an owner capability, so it just above. The client has to place the
compares the original random put-reply(rep-header, capability, operation code, and pa-
number in its table with the con- rep-buffet, rep-size) rameters in the request buffer, and
tents of the check field. If they on receiving the reply it has to un-
do-operution(req-header,
agree, the capability is considered pack the results. The server has to
req-buffer, req-size,
valid and the desired operation is check the capability, extract the
rep-header,rep-buffer,
performed. operation code and parameters
rep-size)
If some of the rights bits are 0, from the request, and call the ap-
the server knows that it is dealing When a server is prepared to accept propriate procedure. The result of
with a derived capability, so it per- requests from clients, it executes a the procedure has to be placed in
forms an XOR of the original ran- get-request primitive, which causes it the reply buffer. Placing parame-
dom number in its table with the to block. When a request message ters or results in a message buffer is
rights field of the capability. This arrives, the server is unblocked and called marshalling, and has a non-
number is then run through the the formal parameters of the call to trivial cost. Different data repre-
one-way function. If the output of get-request are filled in with the in- sentations in client and server also
the one-way function agrees with formation from the incoming re- have to be handled. All of these
the contents of the check field, the quest. The server then performs steps must be carefully designed
capability is deemed valid, and the the work and sends a reply using and coded, lest they introduce un-
requested operation is performed if put-reply. acceptable overhead.
its rights bit is set to 1. Due to the On the client side, to invoke a To hide the marshalling and
fact that the one-way function can- remote operation, a process uses message passing from the users,
not be inverted, it is not possible for do-operation. This action causes the Amoeba uses stub routines [5]. For
a user to “decrypt” a capability to request message to be sent to the example, one of the file system
get the original random number in server. The request header con- stubs might start with:
order to generate a false capability tains the capability of the object to
with more rights. be manipulated and various param- int read-file(file-cap, offset,
eters relating to the operation. The nbytes, buffer, bytes-read)
Remote O~emtlons caller is blocked until the reply is capability-t *file-cap;
The combination of a request from received, at which time the three long offset;
a client to a server and a reply from rep- parameters are filled in and a long *nbytes;
a server to a client is a remote opera- status returned. The return status char *buffer;
tion. The request and reply mes- of do-operation can be one of three long *bytes-read;
sages consist of a header and a possibilities:
buffer. Headers are 32 bytes, and This call read nbytes starting at off-
1. The request was delivered and
buffers can be up to 30 kilobytes. A set from the file identified by
has been executed.
request header contains the capa- file-cap into buffer. If returns the
bility of the object to be operated 2. The request was not delivered or number of bytes actually read in
on, the operation code, and a lim- executed (e.g., server was down). bytes-read. The function itself re-
ited area (8 bytes) for parameters to turns 0 if it executed correctly or an
3. The status is unknown.
the operation. For example, in a error code otherwise. A hand-writ-
write operation on a file, the capa- The third case can arise when the ten stub for this code is simple to
bility identifies the file, the opera- request was sent (and possibly even construct: it will produce a request
tion code is write, and the parame- acknowledged), but no reply was header containing file-cap, the op-
ters specify the size of the data to be forthcoming. That situation can eration code for readfile, offset,

COYYUNlCATlONSOFTNSACN/December 1990/Vo1.33,NdZ 49
and nbytes, and invoke the remote can produce stub routines automat- one or more threads that run in
operation: ically [33]. The read-jile operation parallel. All the threads of a process
could be part of an interface (called share the same address space, but
do-operation(req-hdr, req-buf,
a class in AIL) whose definition each one has a dedicated portion of
req-bytes, rep-hdr,
could look something like this: that address space of use as its pri-
buf, rep-bytes);
vate stack, and each one has its own
class simple-file-server [ 100.. 199]{
Automatic generati,on of such a program counter. From the
read-file(*,
stub from the procedure header programmers’s point of view, each
in unsigned offset,
above is impossible. Some essential thread is like a traditional sequen-
in out unsigned nbytes,
information is missing. The author tial process, except that the threads
out char buffer
of the handwritten stub uses several of a process can communicate using
[nbytes:NBYTES]);
pieces of derived infcormation to do shared memory. In addition, the
the job. write-file(*,...); threads can (optionally) synchro-
nize with each other using mutexes
The buffer is used only to re- From this specification, AIL can
or semaphores.
ceive information from the file generate the client stub of the ex-
server; it is an output parameter, The purpose of having multiple
ample above with the correct mar-
threads in a process is to increase
and should not be sent to the shalling code. It can also generate
performance through parallelism,
server. the server main loop, containing
the marshalling code correspond- and still provide a reasonable se-
The maximum length of the mantic model to the programmer.
ing to the client stubs. The AIL
buffer is given in the nbytes pa- For example, a file server could be
specification tells the AIL compiler
rameter. The actual length of programmed as a process with mul-
that the operation codes for the
the buffer is the returned value tiple threads. When a request
simple-file-server can be allocated
if there is no error and zero comes in, it can be given to some
in the range 100 to 199; it tells
otherwise. thread to handle. That thread first
which parameters are input param-
File-cup is special:; it defines the eters to the server and which are checks an internal (software) cache
to see if the needed data are pres-
service that must carry out the output parameters from the server,
ent. If not, it performs remote pro-
remote operation. and it tells that the length of buffer
cedure call (RPC) with a remote
is at most NBYTES (which must be
The stub generator does not disk server to acquire the data.
a constant) and that the actual
know what the server’s opera-
length is nbytes. While waiting for the reply from
tion code for read-file is. This the disk, the thread is blocked and
The Bullet File Server, one of the
requires extra information. But,
file servers operational in Amoeba, will not be able to handle any other
to be fair, the human stub writer requests. However, new requests
inherits this interface, making it
needs this extra information too. can be given to other threads in the
part of the Bullet File Server inter-
In order to be able to do auto- face: same process to work on while the
matic stub generation, the inter- first thread is blocked. In this way,
class bullet-server [200..299] {
faces between client and servers multiple requests can be handled
inherit simple-file-server; simultaneously, while allowing each
have to contain the information
crest-file(*,...); thread to work in a sequential way.
listed above, plus information
about type representation for all The point of having all the threads
language/machine combinations AIL supports multiple inheritance share a common address space is to
used. In addition, the interface so the Bullet server interface can make it possible for all of them to
specifications have to have an in- inherit both the simple file inter- have direct access to a common
heritance mechanism which allows face and, for instance, a capability cache-something that is not possi-
a lower-level interface to be shared management interface for restrict- ble if each thread is its own address
by several other interfaces. The ing rights on capabilities. space.
readfile operation, for instance, Currently, AIL generates stubs The scheduling of threads within
will be defined in a low-level inter- in C, but Modula stubs and stubs in a process is done by code within the
face which is then inherited by all other languages are planned. AIL process itself. When a thread
file-server interfaces, the terminal- stubs have been designed to deal blocks, either because it has no
server interface, and the segment- with different representations- work to do (i.e., on a get-request) or
server interface. such as byte order and floating- because it is waiting for a remote
The Amoeba Interface Lan- point representation-on client reply (i.e., on a do-operation), the
guage (AIL) is a language in which and server machines. internal scheduler is called, the
the extra information for the gen- thread is blocked, and a new thread
eration of efficient stubs can be mreaus can be run. Thus threads are effec-
specified, so that the AIL compiler A process in Amoeba consists of tively co-routines. Threads are not

50 December 199O/Vo1.33, Na.l2/COMYUNICATIONS OF TNE ACM


pre-empted, that is, the currently In the following sections we will The third component describes the
running thread will not be stopped discuss the Amoeba memory state of each thread of control:
because it has run too long. This server, process server, tile server, stack pointer, stack top and bottom,
decision was made to avoid race and directory server, as examples program counter, processor status
conditions. There need be no worry of typical Amoeba servers. Many word, and registers. Threads can be
that a thread, when halfway others exist as well. blocked on certain system calls (e.g.,
through updating some critical get-request); this can also be de-
shared table, will be suddenly me Memory anu Process scribed. The fourth component is a
Server
stopped by some other thread start- list of ports for which the process is
In many applications, processes
ing up and trying to use the same a server. This list is helpful to the
need a way to create subprocesses.
table. It is assumed that the threads kernel when it comes to buffering
In Unix, a subprocess is created by
in a process were all written by the incoming requests and replying to
same programmer and are actively the fork primitive, in which an exact
port-locate operations.
copy of the original process is
cooperating. That is why they are in A process is created by executing
made. This process can then run
the same process. Thus the interac- the following steps.
for a while, attending to house-
tion between two threads in the
keeping activities, and then issue an 1. Get the process descriptor for
same process is quite different from
exec primitive to overwrite its core the binary from the file system.
the interaction between two threads
image with a new program. 2. Create a local segment or a file
in different processes, which may
In a distributed system, this and initialize it to the initial envi-
be hostile to one another and for
model is not attractive. The idea of ronment of the new process.
which hardware memory protec-
first building an exact copy of the The environment consists of a
tion is required and used. Our eval-
process, possibly remotely, and set of named capabilities (a
uation of this approach is discussed
then throwing it away again shortly primitive directory, as it were),
later.
thereafter is inefficient. Conse- and the arguments to the pro-
Seruers quently, Amoeba uses a different cess (in Unix terms, argc and
The Amoeba kernel, as we de- strategy. The key concepts are seg- ar.674.
scribed, essentially handles commu- ments and process descriptors. 3. Modify the process descriptor to
nication and some process manage- A segment is a contiguous chunk make the first segment the envi-
ment, and little else. The kernel of memory that can contain code or ronment segment just created.
takes care of sending and receiving data. Each segment has a capability 4. Send the process descriptor to
messages, scheduling processes, that permits its holder to perform the machine where it will be exe-
and some low-level memory man- operations on it, such as reading cuted.
agement. Everything else is done by and writing. A segment is some-
When the processor descriptor
user processes. Even capability what like an in-core file, with simi-
arrives at the machine where the
management is done entirely in lar properties.
process will run, the memory server
user space, since the cryptographic A process descriptor is a data
there extracts the capabilities for
technique discussed earlier makes it structure that provides information
the remote segments from it, and
virtually impossible for users to about a stunned process, that is, a
fetches the code and data segments
generate counterfeit capabilities. process not yet started or one being
from wherever they reside by using
All of the remaining functions debugged or migrated. It has four
the capabilities to perform READ
that are normally associated with a components. The first describes the
operations in the usual way. In this
modern operating system environ- requirements for the system where
manner, the physical locations of all
ment are performed by servers, the process must run: the class of
the machines involved are irrele-
which are just ordinary user pro- machines, which instruction set,
vant.
cesses. The file system, for exam- minimum available memory, use of
Once all the segments have been
ple, consists of a collection of user special instructions such as floating
filled in, the process can be con-
processes. Users who are not happy point, and several more. The sec-
structed and the process started. A
with the standard file system are ond component describes the lay-
capability for the process is re-
free to write and use their own. out of the address space: number of
turned to the initiator. This capa-
This situation can be contrasted segments and, for each segment,
bility can be used to kill the process,
with a system like Unix’“, in which the size, the virtual address, how it
or it can be passed to a debugger to
there is a single file system that all is mapped (e.g., read only, read-
. . stun (suspend) it, read and write its
applications must use, no matter write, code/data space), and the
memory, and so on.
how inappropriate it may be. In capability of a file or segment con-
[24] for example, the numerous taining the contents of the segment. me File Seruer
problems that Unix creates for As far as the system is concerned, a
database systems are described at Unix is a registered trademark of AT&T Bell file server is just another user pro-
great length. Laboratories. cess. Consequently, a variety of file

COYU”Y,O.TlOYL OF T”E ACM/December 19901Vol.33, No.12 51


servers have been written for DMA operation and stored contig- It is important to realize that the
Amoeba in the course of its exis- uously in the cache. (Unlike con- directory server simply provides a
tence. The first one, Free Univer- ventional file systems, there are no mapping function. The client pro-
sity Storage System (FUSS) [ 151 was “blocks” used anywhere in the file vides a capability for a directory (in
designed as an experiment in man- system.) In the treat-file operation order to specify which directory to
aging concurrent access using opti- one can request the reply before search) and a string, and the direc-
mistic concurrency control. The the file is written to disk (for speed), tory server looks up the string in
current one, the bullet server was or afterwards (to know that it has the specified directory and returns
designed for extremely high per- been successfully written). the capability associated with the
formance [30, 31, 321. When the bullet server is booted, string. The directory server has no
The decrease in the cost of disk the entire “i-node table” is read into knowledge of the kind of object
and RAM memories, over the past memory in a single disk operation that the capability controls.
decade has allowed us to use a radi- and kept there while the server is In particular, it can be a capabil-
cally different design from that running. When a file operation is ity for another directory on the
used in Unix and most other oper- requested, the object number field same or a different directory
ating systems. In particular, we in the capability is extracted, which server-a file, a mailbox, a data-
have abandoned the idea of storing is an index into this table. The entry base, a process capability, a segment
files as a collection of fixed-size disk thus located gives the disk address capability, a capability for a piece of
blocks. All files are s.tored contigu- as well as the cache address of the hardware, or anything else. Fur-
ously, both on the disk and in the contiguous file (if present). No disk thermore, the capability may be for
server’s main memory. While this access is needed to fetch the an object located on the same ma-
design wastes some disk space and “i-node” and at most one disk access chine, a different machine on the
memory due to fragmentation is needed to fetch the file itself, if it local network, or a capability for an
overhead, we feel that the enor- is not in the cache. The simplicity of object in a foreign country. The
mous gain in performance more this design trades off some space nature and location of the object is
than offsets the small extra cost of for high performance. completely arbitrary. Thus the ob-
having to buy, say, an 800 MB disk jects in a directory need not all be
instead of a 500 MB disk in order to The Dlrectwy Seffe on the same disk, for example, as is
store 500 MB worth of files. The bullet server does not provide the case in many systems that sup-
The bullet server is an immuta- any naming services. To access a port “remote mount” operations.
ble file store. Its principal opera- file, a process must provide the rel- Since a directory may contain
tions are read-file and create-jile. evant capability. Since working with entries for other directories, it is
(For garbage collection purposes 128-bit binary numbers is not con- possible to build up arbitrary direc-
there is also a delete,file operation.) venient for people, we have de- tory structures, including trees and
When a process issues a read@ re- signed and implemented a direc- graphs. As an optimization, it is
quest, the bullet server can transfer tory server to manage names and possible to give the directory server
the entire file to the client in a sin- capabilities. a complete path, and have it follow
gle RPC, unless it is larger than the The directory server manages it as far as it can, returning a single
maximum size (30,000 bytes), in multiple directories, each of which capability at the end.
which case multiple RPCs are is a normal object. Stripped down Actually, directories are slightly
needed. The client can then edit or to its barest essentials, a directory more general than just simple map-
otherwise modify the file locally. maps ASCII strings onto capabili- pings. It is commonly the case that
When it is finished, the client issues ties. A process can present a string, the owner of a file may want to have
a createfile RPC to make a new ver- such as a file name, to the directory the right to perform all operations
sion. The old version remains intact server, and the directory server re- on it, but may want to permit others
until explicitly deleted or garbage turns the capability for that file. read-only access. The directory
collected. It should be noted that Using this capability, the process server supports this idea by struc-
different versions of a file have dif- can then access the file. turing directories as a series of
ferent capabilities, so they can co- In Unix terms, when a file is rows, one per object, as shown in
exist, allowing for the straightfor- opened, the capability is retrieved Figure 3.
ward implementation of source from the directory server for use in The first column gives the string
code control systems. subsequent read and write opera- (e.g., the file name). The second
The files are stored contiguously tions. After the capability has been column gives the capability that
on disk, and are cached in the file fetched from the directory server, goes with that string. The remain-
server’s memory (Icurrently 12 subsequent RPCs go directly to the ing columns each apply to one user
Mbytes). When a relquested file is server that manages the object. The class. For example, one could set up
not available in this memory, it is directory server is no longer in- a directory with different access
loaded from disk in a single large volved. rights for the owner, the owner’s

December 199O/Vo1.33, No.l2/COYYUNl~TIOYSOCT”EliCY


group, and others, as in Unix, but
other combinations are also possi-
ble.
The capability for a directory
specifies the columns to which the
holder has access as a bit map in
part of the rights field (e.g., 3 bits).
Thus in Figure 3, the bits 001 might
specify access to only the other col-
umn. Earlier we discussed how the
rights bits are protected from tam-
pering by use of the check field.
To see how multiple columns are
used, consider a typical access. The
client provides a capability for a
directory (implying a column) and a
string. The string is looked up in
the directory to find the proper
row. Next, the column is checked each object. Thus when a process ficing performance [29]. In partic-
against the (singleton) bit map in looks up an object, it can retrieve ular, it is undesirable that the fast
the rights field, to see which col- the entire set of capabilities for all local RPC be slowed down due to
umn should be used. Remember the copies. If one of the objects is the existence of wide-area commu-
that the cryptographic scheme pre- unavailable, the others can be tried. nication. We believe this goal has
viously described prevents users The technique is similar to the one been achieved.
from modifying the bit map, hence used by Eden [20]. In addition, it is The Amoeba world is divided
accessing a forbidden column. possible to instruct the system to into domains, each domain being an
Then the entry in the selected automatically generate replicas and interconnected collection of local
row and column is extracted. Con- store them in the capability set, thus area networks. The key aspect of a
ceptually this is just a capability, freeing the user from this adminis- domain (e.g., a campus), is that
with the proper rights bits turned tration. broadcasts done from any machine
on. However, to avoid having to In addition to supporting repli- in the domain are received by all
store many capabilities, few of cation of user objects, the directory other machines in the domain, but
which are ever used, an optimiza- server is itself duplicated. Among not by machines outside the do-
tion is made, and the entry is just a other properties, it is possible to main.
bit map, b. The directory server can install new versions of the directory The importance of broadcasting
then ask the server that manages server by killing off one instance of has to do with how ports are located
the object to return a new capability it, installing a new version as the in Amoeba. When a process does an
with only those rights in b. This new replacement, killing off the other RPC with a port not previously
capability is returned to the user (original) instance, and installing a used, the kernel broadcasts a locate
and also cached for future use, to second replacement also running message. The server responds to
reduce calls to the server. the new code. In this way bugs can this broadcast with its address,
The directory server supports a be repaired without interrupting which is then used and also cached
number of operations on directory service. for future RPCs.
objects. These include looking up This strategy is undesirable with
capabilities, adding new rows to a Wide-Area Amoeba a wide-area network. Although
directory, removing rows from di- Amoeba was designed with the idea broadcast can be simulated using a
rectories, listing directories, inquir- that a collection of machines on a minimum spanning tree [7] it is
ing about the status of directories local area network (LAN) would be expensive and inefficient. Further-
and objects, and deleting direc- able to communicate over a wide- more, not every service should be
tories. There is also provision for area network with a similar collec- available worldwide. For example, a
performing multiple operations in tion of remote machines. The key laser printer server in the physics
a single atomic action, to provide problem here is that wide-area net- building at a university in Califor-
for fault tolerance. works are slow and unreliable, and nia may not be of much use to cli-
Furthermore, there is also sup- use protocols such as X.25, TCP/IP, ents in New York.
port for handling replicated ob- and 0%; they do not use RPC. The Both of these problems are dealt
jects. The capability field in primary goal of the wide-area net- with by introducing the concept of
Figure 3 can actually hold a set of working in Amoeba has been to publishing. When a service wishes to
capabilities for multiple copies of achieve transparency without sacri- be known and accessible outside its

COYYUN~TlONSOCTNC~N/Dccembcr 19901Vol.33, No.12


own domain, it contacts the Service from the link processes is to com- then to run Unix and MINIX [25]
for Wide-Area Networks (SWAN) pletely isolate the technical details compilers and other utilities on top
and asks that its port be published of the wide-area network in one of it.
in some set of domains. The SWAN kind of process, and to make it eas- Using a special set of library pro-
publishes the port by doing RPCs ier to have multiway gateways, cedures that do RPCs with the
with SWAN processes in each of which would have one type of link Amoeba servers, it has been possi-
those domains. process for each wide-area network ble to construct an emulation of the
When a port is published in a type to which the gateway is at- Unix system call interface-which
domain, a new process called a tached. was dubbed Ajax-that is good
server agent is created in that do- It is important to note that this enough that about 100 of the most
main. The process typically runs on design causes no performance deg- common utility programs have
the gateway machine, and does a radation for local communication. been ported to Amoeba. The
get-request using the remote server’s An RPC between a client and a Amoeba user can now use most of
port. It is quiescent until its server is server on the same LAN proceeds the standard editors, compilers, file
needed, at which time it comes to at full speed, with no relaying of utilities and other programs in a
life and performs an RPC with the any kind. Clearly there is some per- way that looks very much like Unix,
server. formance loss when a client is talk- although in fact it is really Amoeba.
Now let us consider what hap- ing to a server located on a distant A sessionserver has been provided to
pens when a process tries to locate a network, but the limiting factor is handle state information and do
remote server whose port has been normally the bandwidth of the fork and exec in a Unix-like way.
published. The process’ kernel wide-area network, so the extra
broadcasts a locate, which is re- overhead of having messages being Parallel Make
trieved by the server agent. The relayed several times is negligible. As shown in Figure 1, the hardware
server agent then builds a message Another useful aspect of this on which Amoeba runs contains a
and hands it to a link process on the design is its management. To start processor pool with several dozen
gateway machine. The link process with, services can only be published processors. One obvious applica-
forwards it over the wide-area net- with the help of the SWAN server, tion for these processors in a Unix
work to the server’s domain, where which can check to see if the system environment is a parallel version of
it arrives at the gateway, causing a administration wants the port to be make [lo]. The idea here is that
client agent process to be created. published. Another important con- when make discovers that multiple
This client agent then makes a nor- trol is the ability to prevent certain compilations are needed, they are
mal RPC to the server. The set of processes (e.g., those owned by stu- run in parallel on different proces-
processes involved here is shown in dents) from accessing wide-area sors.
Figure 4. services, since all such traffic must Although this idea sounds sim-
The beauty of this scheme is that pass through the gateways, and var- ple, there are several potential
it is completely transparent. Nei- ious checks can be made there. Fi- problems. For one, to make a single
ther user processes nor the kernel nally, the gateways can do account- target file, a sequence of several
know which processes are local and ing, statistics gathering, and commands may have to be exe-
which are remote. The communica- monitoring of the wide-area net- cuted, and some of these may use
tion between the chent and the work. files created by earlier ones. The
server agent is completely local, solution chosen is to let each com-
using the normal RPC. Similarly, nppllcatlonr mand execute in parallel, but block
the communication between the cli- Amoeba has been used to program when it needs a file being made but
ent agent and the server is also a variety of applications. We will not yet fully generated.
completely normal. Neither the cli- now describe several of them, in- Other problems relate to techni-
ent nor the server knows that it is cluding Unix emulation, parallel cal limitations of the make program.
talking to a distant process. make, traveling salesman, and For example, since it expects com-
Of course, the two agents are alpha-beta search. mands to be run sequentially,
well aware of what is going on, but rather than in parallel, it does not
they are automatically generated as Unix Emulatlen keep track of how many processes it
needed, and are not visible to users. One of the goals of Amoeba was to has forked off, which may exceed
The link processes .are the only make it useful as a program devel- various system limits.
ones that know about the details of opment environment. For such an Finally, there are programs, such
the wide-area network.. They talk to environment, one needs editors, as yacc [ 1 l] that write their output
the agents using RPC, but to each compilers, and numerous other on fixed name files, such as y.tab.c.
other using whatever protocol the standard software. It was decided When multiple yacc’s are running in
wide-area network requires. The that the easiest way to obtain this the same directory, they all write to
point of splitting off the agents software was to emulate Unix and the same file, thus producing gib-

December 199O/Vo1.33, No.~~/COYLIUNIWTIONSOFT”EICY


berish. All of these problems have
been dealt with by one means or
another, as described in [Z].
The parallel compilations are
directed by a new version of make,
called amake. Amake does not use
traditional makefiles. Instead, the
user tells it which source files are
needed, but not their dependen-
cies. The compilers have been mod-
ified to keep track of the observed
dependencies (e.g., which files they
in fact included). After a compila-
tion, this information goes into a
kind of minidatabase that replaces
the traditional makefile. It also keeps to be visited include New York, AlpRa-Beta sea?eR
track of which flags were used, Sydney, Nairobi, and Tokyo. The Another application that we have
which version of the compiler was coordinator might tell the first slave programmed in parallel using
used, and other information. Not to investigate all paths starting with Amoeba is game playing using the
having to even think about London-New York; the second alpha-beta heuristic for pruning
makefiles, not even automatically slave to investigate all paths starting the search tree. The general idea is
generated ones, has been popular with London-Sydney; the third the same as for the traveling sales-
with the users. The overhead due slave to investigate all paths starting man. When a processor is given a
to managing the database is negligi- with London-Nairobi; and so on. board to evaluate, it generates all
ble, but the speedup due to paral- All of these searches go on in paral- the legal moves possible starting at
lelization depends strongly on the lel. When a slave is finished, it re- that board, and hands them off to
input. When making a program ports back to the coordinator and others to evaluate in parallel.
consisting of many medium-sized gets a new assignment. The alpha-beta heuristic is com-
files, considerable speedup can be The algorithm can be applied monly used in two-person, zero-
achieved. However, when a pro- recursively. For example, the first sum games to prune the search
gram has one large source file and slave could allocate a processor to tree. A window of values is estab-
many small ones, the total time can investigate paths starting with Lon- lished, and positions that fall out-
never be smaller than the compila- don - New York - Sydney, another side this window are not examined
tion time of the large one. processor to investigate London- because better moves are known to
New York-Nairobi, and so forth. At exist. In contrast to the traveling
Wte mavellng SaIesman some point, of course, a cutoff is salesman problem, in which much
PPoRlem needed at which a slave actually of the tree has to be searched,
In addition to various experiments does the calculation itself and does alpha-beta allows a much greater
with the Unix software, we have not try to farm it out to other proc- pruning if the positions are evalu-
also tried programming some ap- essors. ated in a well-chosen order.
plications in parallel. Typical appli- The performance of the algo- For example, on a single ma-
cations are the traveling salesman rithm can be greatly improved by chine, we might have three legal
problem [ 131 and alpha-beta search keeping track of the best total path moves A, B, and C at some point. As
[ 141 which we briefly describe here. found so far. A good initial path a result of evaluating A we might
More details can be found in [3]. can be found by using the “closest discover that looking at its siblings
In the traveling salesman prob- city next” heuristic. Whenever a in the tree, B and C was pointless.
lem, the computer is given a start- slave is started up, it is given the In a parallel implementation, we
ing location and a list of cities to be length of the best total path so far. would do all at once, and ultimately
visited. The idea is to find the If it ever finds itself working on a waste the computing power de-
shortest path that visits each city partial path that is longer than the voted to B and C. The result is that
exactly once, and then return to the best-known total path, it immedi- much parallel searching is wasted,
starting place. Using Amoeba we ately stops what it is doing, reports and the net result is not that much
have programmed this application back failure, and asks for more better than a sequential algorithm
in parallel by having one pool pro- work. Initial experiments have on a single processor. Our experi-
cessor act as coordinator, and the shown that 75% of the theoretical ments running Othello (Reversi) on
rest as slaves. maximum speedup can be achieved Amoeba have shown that we were
For example, suppose the start- using this algorithm. The rest is lost unable to utilize more than 40% of
ing place is London, and the cities to communication overhead. the total processor capacity avail-

COYYUNlCATlONSOFTREAC,CY/December 199O/Vo1.33, No.12


able, compared to 75% for the trav- the test for 4 bytes, only a header made varies by about a factor of 3.
eling salesman problem. was sent and no data buffer. On the On all distributed systems of this
other hand, on the Sun, a special type running on fast LANs, the
PerFormance optimization is available for the protocols are largely CPU bound.
Amoeba was designed to be fast. local case, which we used. Running the system on a faster
Measurements show that this goal In Figure 5 we illustrate the CPU (but the same network) deft-
has been achieved. In this section, delay and the bandwidth of these nitely improves performance, al-
we will present the results of some eight cases, both for local processes though not linearly with CPU MIPS
timing experiments we have done. (two distinct processes on the same because at some point the network
These measurements were per- machine) and remote processes saturates (although none of the sys-
formed on Sun 3/6Os (20 MHz (processes on different machines). tems quoted here even come close
68020s) using a 10 Mbps Ethernet. The delay is the time as seen from to saturating it). As an example, in
We measured the performance for the client, running as a user pro- [3 l] we reported a null RPC time of
three different configurations: cess, between the calling of, and 1.4 msec, but this was for Sun 3/5Os.
returning from, the RPC primitive. The current figure of 1.1 set is for
Two user processes running on
The bandwidth is t.he number of the faster Sun 3/6Os.
Amoeba.
data bytes per second that the client In Figure 6 we have not cor-
Two user processes running on
receives from the server, excluding rected for machine speed, but we
Sun OS 4.0.3 but using the
headers. The measurements were have at least made a rough estimate
Amoeba primitives, which were
done for both local RPCs, where of the raw total computing power
added to the Sun Kernel.
the client and server processes were of each system, given in the fifth
Two user processes running on
running on the same processor, column of the table in MIPS (Mil-
Sun OS 4.0.3 and using Sun
and for remote RPCs over the lions of Instructions Per Second).
RPC.
Ethernet. While we realize that this is only a
The latter two were for comparison The interesting comparisons in crude measure at best, we see no
purposes only. We ran tests for the these tables are the comparisons of other way to compensate for the
local case (both processes on the pure Amoeba RPC and pure Sun fact that a system running on a 4
same machine) and for the remote OS RPC both for short communica- MIPS machine (Dorado) or on a 5
case (each process on a separate tions, where delay is critical, and CPU multiprocessor (Firefly) has a
machine, with communication over long ones, where bandwidth is the significant advantage over slower
the Ethernet). In all cases commu- issue. A 4-byte Amoeba RPC takes workstations. As an aside, the Sun
nication was from process to pro- 1.1 msec, v. 6.7 msec for Sun RPC. 3/60 is indeed faster than the Sun
cess, all of which were running in Similarly, for 8 Kbyte RPCs, the 3175; this is not a misprint.
user mode outside tbe kernel. The Amoeba bandwidth is 721 Kbytes/ Cedar’s RPC is about the same as
measurements represent the aver- set, v. only 325 Kbytes for the Sun Amoeba’s although it was imple-
age values of 100,000 trials and are RPC. The conclusion is that Amoe- mented on hardware that is 33%
highly reproducible. ba’s delay is six times better and its faster. Its throughput is only 30%
For each configuration (pure throughput is twice as good. of Amoeba’s, but this is partly due
Amoeba, Amoeba primitives on While the Sun is obviously not to the fact that it used an early ver-
Unix, Sun RPC on IJnix), we tried the only system of interest, its wide- sion of the Ethernet running at 3
to run three test cases: a 4-byte spread use makes it a convenient megabitsjsec. Still, it does not even
message (1 integer), an 8 Kbyte benchmark. We have looked in the manage to use the full 3 megabits/
message, and a 30 Kbyte message. literature for performance figures sec.
The 4-byte message test is typical from other distributed systems and The x-Kernel has a 10% better
for short control messages, the have shown the null-RPC latency throughput than Amoeba, but the
8-Kbyte message is typical for read- and maximum throughput in Fig- published measurements are
ing a medium-sized file from a ure 6. kernel-to-kernel, whereas Amoeba
remote file, and the 30-Kbyte test is The RPC numbers for the other was measured from user process to
the maximum the current imple- systems listed in Figure 6 are taken user process. If the extra overhead
mentation of Amoeba can handle. from the following publications: of context switches from kernel to
Thus, in total we should have nine Cedar [5], x-Kernel [19], Sprite user and copying from kernel buff-
cases (three configurations and [18], V [6], Topaz [22], and Mach ers to user buffers are considered,
three sizes). However, the standard 1191. (to make them comparable to the
Sun RPC is limited to 8K, so we The numbers shown here cannot Amoeba numbers), the x-kernel
have measurements for only eight be compared without knowing performance figures would be re-
of them. It should also be noted about the systems from which they duced to 2.3 msec for the null RPC
that the standard Amoeba header were taken, since the speed of the with a throughput of 748 kbytes/sec
has room for 8 bytes of data, so in hardware on which the tests were when mapping incoming data from
: :.
&ay (msec) 8, I_ * BaricjiMth (KbyteS/S8c)
base 1. .case 2 case3 ; case 1 ease 2 case 3
j
._. f4’bvtesf~ f8 Kbl (30 Kb)
~,-n n.

rjuie Amoeba looal

#re fitioeba remote

UNIX driver local

UNfX driver remote

Sun RPC local

Sun RPC remote

(a) (b)

FlGUR6 5. RK bchrecn USer RrO@Ssesin three wmnOn aScS for three diffennt SVStemS. local RKS are RKS In nhidi the dlent and
setwr are running as different gmcessesbut on the same processor.
Remote RKs are bmmn different machines. (1) Dclav in msec. (b) Band-
width In Rbytevsr. The Unir dfiver ImPIemeIttsRmoelIa RKS and Rmoebl protocol under Sun Unix.

Null RPC Throughput Estimated implementation


System Hardware
in msec. in Kbytesls CPU WPS Notes

Amoeba Sun 3f60 1.1 620 Measured user-to-user


Cedar Dorado 1.1 250 Custom microcode
x- Kernel Sun 3l75 1.7 660 Measured kernel-to-kernel
v Sun 3/75 2.5 546 Measured user-to-user
Topaz Firefly 2.7 587 Consists of 5 VAX CPUs
Sprite Sun 3l75 2.6 720 Measured kernel-to-kernel
Mach Sun 3/60 11.0 ? Throughput not reported

FlGURE 6. latency and thmugllput nttll arlous distributed ogerating svstems.

kernel to user and 575 kbyteslsec cess and taken out by the kernel for obtained from a paper published in
when copying it (L. Peterson, pri- transmission. Since the 68020 pro- May 1990 [ 191 and applies to Mach
vate communication). cessor has eight 4-byte data regis- 2.5, in which the networking code is
Similarly, the published Sprite ters, up to 32 bytes can be trans- in the kernel. The Mach RPC per-
figures are also kernel-to-kernel. ferred this way. formance is worse than any of the
Sprite does not support RPC at the Topaz RPC was obtained on other systems by more than a factor
user level, but a close equivalent is Fireflies, which are VAX-based of 3 and is 10 times slower than
the time it takes to send a null mes- multiprocessors. The performance Amoeba. A more recent measure-
sage from one user process to an- obtained in Figure 6 can only be ment on an improved version of
other and get a reply, which is obtained using several CPUs at Mach gives an RPC time of
4.3 msec. The user-to-user band- each end. When only a single CPU 9.6 msec and a throughput of
width is 170 kbytes/sec [34]. is used at each end, the null RPC 250,000 bytes/set (R. Draves, pri-
V uses a clever technique to im- time increases to 4.8 msec and the vate communication).
prove the performance for short throughput drops to 313 kbytes/ Like Amoeba itself, the bullet
RPCs: the entire message is put in sec. server was designed with fast per-
the CPU registers by the user pro- The null RPC time for Mach was formance as a major objective. Next

CCYYUYICATICYS CCTnE ACM/December 199OlVo1.33, No.12 57


Next we present some measurements of what has been achieved. The measurements were made between a Sun 3/60 client talking to a remote Sun 3/60 file server equipped with a SCSI disk. Figure 7 gives the performance of the bullet server for tests made with files of 1 Kbyte, 16 Kbytes, and 1 Mbyte. In the first column the delay and bandwidth for read operations is shown. Note that the test file will be completely in memory, and no disk access is necessary. In the second column a create and a delete operation together is measured. In this case, the file is written to disk. Note that both the create and the delete operations involve disk requests.

The careful reader may have noticed that a user process can pull 813 kbytes/sec from the bullet server (from Figure 7), even though the user-to-user bandwidth is only 783 kbytes/sec (from Figure 5). The reason for this apparent discrepancy is as follows: As far as the clients are concerned, the bullet server is just a black box. It accepts requests and gives replies. No user processes run on its machine. Under these circumstances, we decided to move the bullet server code into the kernel, since the users could not tell the difference anyway, and protection is not an issue on a free-standing file server with only one process. Thus the 813 kbytes/sec figure is user-to-kernel for access to the file cache, whereas the 783 kbytes/sec one is user-to-user, from memory to memory without involving any files. The pure user-to-kernel bandwidth is certainly higher than 813 kbytes/sec, but some of it is lost to file server overhead.

To compare the Amoeba results with the Sun NFS file system, we have measured reading and creating files on a Sun 3/60 using a remote Sun 3/60 file server with 16 Mbytes of memory running SunOS 4.0.3. Since the file server had the same type of disk as the bullet server, the hardware configurations were, with the exception of extra memory for NFS, identical to those used to measure Amoeba. The measurements were made at night under a light load. To disable local caching on the Sun 3/60 we locked the file using the Sun Unix lockf primitive while doing the read test. The timing of the read test consisted of repeated measurement of an lseek followed by a read system call. The write test consisted of consecutively executing creat, write and close. (The creat has the effect of deleting the previous version of the file.) The results are depicted in Figure 8.

Observe that reading and creating 1 Mbyte files results in lower bandwidths than for reading and creating 16 Kbyte files. This effect is due to the Bullet server's need to do more complex buffer management with large files. The Bullet file server's performance for read operations is two to three times better than the Sun NFS file server. For create operations, the Bullet file server has a constant overhead for producing capabilities, which gives it a relatively better performance for large files.

Evaluation
In this section we will take a critical look at Amoeba and its evolution and point out some aspects that we consider successful and others that we consider less successful. In areas where Amoeba 4.0 was found wanting, we will make improvements in Amoeba 5.0, which is currently under development. The following discussion lists these improvements.

One area where little improvement is needed is portability. Amoeba started out on the 680x0 CPUs, and has been easily moved to the VAX and Intel 80386. SPARC and MIPS ports are underway. The Amoeba RPC protocol has also been implemented as part of MINIX [25] and as such is in widespread use around the world.

Objects and Capabilities
On the whole, the basic idea of an object-based system has worked well. It has given us a framework which makes it easy to think about the system. When new objects or services are proposed, we have a clear model to deal with and specific questions to answer. In particular, for each new service, we must decide what objects will be supported and what operations will be permitted on these objects. The structuring technique has been valuable on many occasions.

The use of capabilities for naming and protecting objects has also been a success. By using cryptographically protected capabilities, we have a unique system-wide fixed-length name for each object, yielding a high degree of transparency. Thus it is simple to implement a basic directory as a set of (ASCII string, capability) pairs. As a result, a directory may contain names for many kinds of objects, located all over the world, and windows can be written on by any process holding the appropriate capability, no matter where it is. We feel this model is conceptually both simpler and more flexible than models using remote mounting and symbolic links such as Sun's NFS. Furthermore, it can be implemented just as efficiently.

We have no experience with capabilities on huge systems (thousands of simultaneous users). On one hand, with such a large system, some capabilities are bound to leak out, compromising security. On the other hand, capabilities provide a kind of firewall, since a compromised capability only affects the security of one object. It is difficult at this point to say whether such fine-grained protection is better or worse in practice than more conventional schemes for huge systems.

We are also satisfied with the low-level user primitives. In effect there are only three principal system calls (get_request, put_reply, and do_operation), each easy to understand. All communication is based on these primitives, which are much simpler than, for example, the socket interface in Berkeley Unix, with its myriad of system calls, parameters, and options.
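To make the shape of this interface concrete, the sketch below shows how a server loop and a client call might look in C. The article does not give the actual Amoeba library bindings, so the types, names, and argument orders here are assumptions for illustration only, not the real API.

/* Illustrative sketch only: hypothetical C bindings for the three Amoeba
 * primitives named above.  The real Amoeba library types and signatures
 * are not shown in the article and may differ from these. */
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t bytes[16]; } capability;  /* a 128-bit Amoeba 4.0 capability */

typedef struct {
    capability cap;       /* capability of the object the request concerns */
    uint32_t   command;   /* operation code chosen by the service          */
    uint32_t   status;    /* filled in by the server for the reply         */
} header;

/* Assumed kernel primitives (declarations only). */
extern long get_request (const capability *svc_port, header *hdr, void *buf, size_t len);
extern long put_reply   (const header *hdr, const void *buf, size_t len);
extern long do_operation(header *hdr, const void *req, size_t req_len,
                         void *rep, size_t rep_len);

enum { CMD_READ = 1 };    /* hypothetical operation code for this sketch */

/* Server side: each thread blocks in get_request, performs the operation
 * on the object named by hdr->cap, and answers with put_reply. */
void serve_forever(const capability *svc_port)
{
    char buf[8192];
    header hdr;

    for (;;) {
        long n = get_request(svc_port, &hdr, buf, sizeof buf);
        if (n < 0)
            continue;                        /* ignore malformed requests */
        /* ... look up hdr.cap, execute hdr.command on the object ...     */
        hdr.status = 0;                      /* success                   */
        put_reply(&hdr, buf, (size_t)n);     /* the reply unblocks the client */
    }
}

/* Client side: do_operation sends the request and waits for the reply,
 * so a remote operation looks like an ordinary procedure call. */
long read_object(const capability *obj, void *data, size_t len)
{
    header hdr = { .cap = *obj, .command = CMD_READ, .status = 0 };
    return do_operation(&hdr, NULL, 0, data, len);
}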

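The NFS read test described above (lock the file with lockf to defeat local caching, then time repeated lseek and read calls) can be written as a short C program along the following lines. The file name, transfer size, and iteration count are placeholders, and error handling is omitted; only the overall structure reflects the procedure in the text.

/* Sketch of the NFS read benchmark described above.  Path, size, and
 * loop count are placeholders; error handling is omitted for brevity. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    const char  *path  = "/nfs/testfile";    /* placeholder test file        */
    const size_t size  = 16 * 1024;          /* e.g. the 16 Kbyte test case  */
    const int    iters = 1000;               /* repeated measurement         */
    char *buf = malloc(size);
    int   fd  = open(path, O_RDWR);

    lockf(fd, F_LOCK, 0);                    /* lock the whole file: defeats
                                                the client's local caching   */
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < iters; i++) {
        lseek(fd, 0L, SEEK_SET);             /* rewind ...                   */
        read(fd, buf, size);                 /* ... and read the whole file  */
    }
    gettimeofday(&t1, NULL);
    lockf(fd, F_ULOCK, 0);

    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_usec - t0.tv_usec) / 1e3;
    printf("%.2f msec per %zu-byte read\n", ms / iters, size);

    close(fd);
    free(buf);
    return 0;
}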


Amoeba 5.0 will use 256-bit capabilities, rather than the 128-bit capabilities of Amoeba 4.0. The larger Check field will be more secure against attack. Other security aspects will also be tightened, including the addition of secure, encrypted communication between client and server. Also, the larger capabilities will have room for a location hint which can be exploited by the SWAN servers for locating objects in the wide-area network. Third, all the fields of the new 256-bit capability will be aligned at 32-bit boundaries, which potentially may give better performance.

FIGURE 7. Performance of the Bullet file server for read operations, and create and delete operations together. (a) Delay in msec. (b) Bandwidth in Kbytes/sec.

FIGURE 8. Performance of the Sun NFS file server for read and create operations. (a) Delay in msec. (b) Bandwidth in Kbytes/sec.
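The capability formats discussed above can be pictured as C structures. The 128-bit division into port, object, rights, and check fields follows the published Amoeba descriptions; the 256-bit layout below is only one possible arrangement consistent with the paragraph above (a larger check field, a location hint, and 32-bit alignment of every field), not the actual Amoeba 5.0 design.

/* Capability layouts pictured as C structures.  The Amoeba 4.0 split is
 * as published by the Amoeba group; the 256-bit layout is hypothetical,
 * chosen only to match the properties described in the text. */
#include <stdint.h>

struct amoeba4_cap {            /* 128 bits in total                         */
    uint8_t port[6];            /* 48-bit server port (names the service)    */
    uint8_t object[3];          /* 24-bit object number within the service   */
    uint8_t rights;             /*  8-bit rights field                       */
    uint8_t check[6];           /* 48-bit check field protecting the rights  */
};

struct amoeba5_cap {            /* hypothetical 256-bit layout               */
    uint32_t port[2];           /* 64-bit server port                        */
    uint32_t object;            /* 32-bit object number                      */
    uint32_t rights;            /* 32-bit rights field                       */
    uint32_t location;          /* 32-bit location hint for the SWAN servers */
    uint32_t check[3];          /* 96-bit cryptographic check field          */
};

_Static_assert(sizeof(struct amoeba4_cap) == 16, "4.0 capability is 128 bits");
_Static_assert(sizeof(struct amoeba5_cap) == 32, "5.0 sketch is 256 bits");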

Remote Procedure Call
For the most part, RPC communication is satisfactory, but sometimes it gives problems [28]. In particular, RPC is inherently master-slave and point-to-point. Sometimes both of these issues lead to problems. In a UNIX pipeline, such as:

pic file | eqn | tbl | troff >outfile

for example, there is no inherent master-slave relationship, and it is not at all obvious if data movement between the elements of the pipeline should be read driven or write driven.

In Amoeba 4.0, when an RPC transfers a long message it is actually sent as a sequence of packets, each of which is individually acknowledged at the driver level (stop-and-wait protocol). Although this scheme is simple, it slows the system down. In Amoeba 5.0 we will only acknowledge whole messages, which will allow us to achieve higher bandwidths than shown in Figure 5.

Because RPC is inherently point-to-point, problems arise in parallel applications like the traveling salesman problem. When a process discovers a path that is better than the best known current path, what it really wants to do is send a multicast message to a large number of processes to inform all of them immediately. At present this is impossible, and must either be simulated with multiple RPCs or finessed.

Amoeba 5.0 will fully support group communication using multicast. A message sent to a group will be delivered to all members, or to none at all. A higher-level protocol has been devised to implement 100% reliable multicasting on unreliable networks at essentially the same price as RPC (two messages per reliable broadcast). This protocol is described in [12]. There are many applications (e.g., replicated databases of various kinds) which are simplified by reliable broadcasting. Amoeba 5.0 will use this replication facility to support fault tolerance.

Although not every LAN supports broadcasting and multicasting in hardware, when it has this capability (e.g., Ethernet), it can provide an enormous performance gain for many applications. For example, a simple way to update a replicated database is to send a reliable multicast to all the machines holding copies of the database. This idea is obvious and we should have realized it earlier and put it in from the start.
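The replicated-database update just described can be sketched as follows. Both rpc_update and grp_send are hypothetical names, since the article does not show the actual Amoeba interfaces; the sketch only contrasts per-replica RPCs with a single all-or-nothing multicast.

/* Hypothetical interfaces standing in for the mechanisms described above;
 * neither function name comes from the Amoeba libraries. */
#include <stddef.h>

typedef struct { int key; int value; } update_t;

extern int rpc_update(int replica, const update_t *u);       /* one point-to-point RPC  */
extern int grp_send(int group, const void *msg, size_t len);  /* reliable multicast:
                                                                 delivered to every group
                                                                 member or to none at all */

/* Amoeba 4.0 style: the client loops, sending one RPC per replica.
 * A crash part-way through leaves some copies updated and some not. */
int update_with_rpcs(const int *replicas, int n, const update_t *u)
{
    for (int i = 0; i < n; i++)
        if (rpc_update(replicas[i], u) < 0)
            return -1;
    return 0;
}

/* Amoeba 5.0 style: one reliable multicast reaches all the machines
 * holding copies, and the all-or-none guarantee keeps them consistent. */
int update_with_multicast(int db_group, const update_t *u)
{
    return grp_send(db_group, u, sizeof *u);
}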
Although it has long since been corrected, we made a truly dreadful decision in having asynchronous RPC in Amoeba 2.0. In that system the sender transmitted a message to the receiver and then continued executing. When the reply came in, the sender was interrupted. This scheme allowed considerable parallelism, but it was impossible to program correctly. Our advice to future designers is to avoid asynchronous messages like the plague.



Memory and Process Management
Probably the worst mistake in the design of the Amoeba 4.0 process management mechanisms was the decision to have threads run to completion, that is, not be preemptable. The idea was that once a thread started using some critical table, it would not be interrupted by another thread in the same process until it logically blocked. This scheme seemed simple to understand, and it was certainly easy to program.

Problems arose because programmers did not have a very good concept of when a process blocked. For example, to debug some code in a critical region, a programmer might add some print statements in the middle of the critical region code. These print statements might call library procedures that performed RPCs with a remote terminal server. While blocked waiting for the acknowledgement, a thread could be interrupted, and another thread could access the critical region, wreaking havoc. Thus the sanctity of the critical region could be destroyed by putting in print statements. Needless to say, this property was very confusing to naive programmers.

The run-to-completion semantics of thread scheduling in Amoeba 4.0 also prevents a multiprocessor implementation from exploiting parallelism and shared memory by allocating different threads in one process to different processors. Amoeba 5.0 threads will be able to run in parallel. No promises are made by the scheduler about allowing a thread to run until it blocks before another thread is scheduled. Threads sharing resources must explicitly synchronize using semaphores or mutexes.
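To make the change concrete, here is a small example of the explicit locking that preemptable threads require. POSIX threads are used purely as a stand-in for whatever mutex primitive the Amoeba 5.0 library provides; the table and worker routines are invented for the illustration.

/* Under run-to-completion scheduling this table needed no lock as long as
 * no thread blocked inside the critical region; a stray print statement
 * that performed an RPC could let another thread in.  With preemptive
 * Amoeba 5.0 threads the table must be guarded explicitly.  POSIX threads
 * stand in for the Amoeba primitives here. */
#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static int table[64];
static int entries;

static void add_entry(int value)
{
    pthread_mutex_lock(&table_lock);      /* enter the critical region      */
    table[entries] = value;
    /* A debugging printf (which may block in an RPC to a terminal server)
     * is now harmless: the lock, not the scheduler, protects the table. */
    printf("adding entry %d\n", entries);
    entries++;
    pthread_mutex_unlock(&table_lock);    /* leave the critical region      */
}

static void *worker(void *arg)
{
    for (int i = 0; i < 10; i++)
        add_entry((int)(long)arg * 100 + i);   /* arg distinguishes the two workers */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, (void *)1L);
    pthread_create(&t2, NULL, worker, (void *)2L);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}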
Another problem concerns the lack of timeouts on the duration of remote operations. When the memory server is starting up a process, it uses the capabilities in the process descriptor to download the code and data. It is perfectly legal for these capabilities to be for somebody's private file server, rather than for the bullet server. However, if this server is malicious and simply does not respond at all, a thread in the memory server will just hang forever. We probably should have included service timeouts, although doing so would introduce race conditions.

Finally, Amoeba does not support virtual memory. It has been our working assumption that memory is becoming so cheap that the saving derived from using virtual memory with its added complexity is not worthwhile. Most workstations have at least 4M RAM these days, and will have 32M within a couple of years. Simplicity of design and implementation and high speed have always been our goals, so we really have not yet decided whether to implement virtual memory in Amoeba 5.0.

In a similar vein, we do not support process migration at present, even though the mechanisms needed for supporting it already exist. Whether process migration for load balancing is an essential feature or just another frill is still under discussion.

File System
One area of the system which we think has been eminently successful is the design of the file server and directory server. We have separated it into two distinct parts: the bullet server, which just handles storage, and the directory server, which handles naming and protection. The bullet server design allows it to be extremely fast, while the directory server design gives a flexible protection scheme and also supports file replication in a simple and easy-to-understand way. The key element here is the fact that files are immutable, so they can be replicated at will, and copies regenerated if necessary.

The entire replication process takes place in the background (lazy replication), and is entirely automatic, not bothering the user at all. We regard the file system as the most innovative part of the Amoeba 4.0 design, combining high performance with reliability, robustness, and ease of use.

An issue that we are becoming interested in is how one could handle databases in this environment. We envision an Amoeba-based database system that would have a very large memory for an essentially "in-core" database. Updates would be done in memory. The only function of the disk would be to make checkpoints periodically. In this way, the immutability of files would not pose any problems.

A problem that has not arisen yet, but might arise if Amoeba were scaled to thousands of users, is caused by the splitting of the directory server and file server. Creating a file and then entering its capability into a directory are two separate operations. If the client should crash between them, the file exists but is inaccessible. Our current strategy is to have the directory server access each file it knows about once every k days, and have the bullet server automatically garbage collect all files not accessed by anyone in n days (n >> k). With our current setup and reliable hardware, this is not a problem, but in a huge, international Amoeba system it might become one.

Wide-Area Networking
We are also pleased with the way wide-area networking has been handled, using server agents, client agents, and the SWAN. In particular, the fact that the existence of wide-area networking does not affect the protocols or performance of local RPCs at all is crucial. Many other designs (e.g., TCP/IP, OSI) start out with the wide-area case, and then use this locally as well. This choice results in significantly lower performance on a LAN than the Amoeba design, and no better performance over wide-area networks.

One configuration that was not adequately dealt with in Amoeba 4.0 is a system consisting of a large number of local area networks interconnected by many bridges and gateways. Although Amoeba 4.0 works on these systems, its performance is poor, partly due to the way port location and message handling is done.



In Amoeba 5.0, we have designed and implemented a completely new low-level protocol called the Fast Local Internet Protocol (FLIP), that will greatly improve the performance in complex internets. Among other features, entire messages will be acknowledged instead of individual packets, greatly reducing the number of interrupts that must be processed. Port location is also done more efficiently, and a single server agent can now listen to an arbitrary number of ports, enormously reducing the number of quiescent server agents required in the gateways for large systems.

One unexpected problem that we had was the poor quality of the wide-area networks that we had to use, especially the public X.25 ones. Also, to access some machines we often had to traverse multiple networks, each with their own problems and idiosyncrasies. Our only insight to future researchers is not to blindly assume that public wide-area networks will actually function correctly until this has been experimentally verified.

Unix Emulation
The Amoeba 4.0 Unix emulation consists of a library and a session server. It was written with the goal of getting most of the Unix software to work without having to expend much effort on our part. The price we pay for this approach is that we will never be able to provide 100% compatibility. For example, in a capability-based system, it is very difficult to get the whole concept of user-ids and group-ids right. Our view of protection is totally different.

Furthermore, Amoeba is essentially a stateless system. This means that it is virtually impossible to get right the various subtle properties of Unix relating to how files are shared between parent and child. In practice we can live with this, but for someone who demands binary compatibility, our approach has some shortcomings.

Parallel Computing
Although Amoeba was originally conceived as a system for distributed computing, the existence of the processor pool with dozens of CPUs close together has made it quite suitable for parallel computing as well. That is, we have become much more interested in using the processor pool to achieve large speedups on a single problem. To program these parallel applications, we are currently engaged in implementing a language called Orca [4].

Orca is based on the concept of globally shared objects. Programmers can define operations on shared objects, and the compiler and run-time system take care of all the details of making sure they are carried out correctly. This scheme gives the programmer the ability to atomically read and write shared objects that are physically distributed among a collection of machines without having to deal with any of the complexity of the physical distribution. All the details of the physical distribution are completely hidden from the programmer. Initial results indicate that close to linear speedup can be achieved on some problems involving branch and bound, successive overrelaxation, and graph algorithms. For example, we have redone the traveling salesman problem in Orca and achieved a ten-fold speedup with 10 processors (compared to 7.5 using the non-Orca version described earlier). Alpha-beta search in Orca achieves a factor of six speedup with 10 processors (compared to four without Orca). It appears that using Orca reduces the communication overhead, but it remains true that for problems with many processes and a high interaction rate (i.e., small grain size), there will always be a problem.
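The flavor of the shared-object model can be suggested with a C sketch. Orca's real syntax is different and is not reproduced here, and the sh_* calls are hypothetical stand-ins for the Orca run-time, which executes each operation on a shared object atomically no matter which machine invokes it.

/* Conceptual sketch only, not Orca syntax.  A branch-and-bound TSP worker
 * shares one global "best bound" object; the hypothetical sh_* calls stand
 * in for the Orca run-time, which applies each operation atomically and
 * hides the physical distribution of the object. */
#include <limits.h>

typedef struct { int bound; } shared_min;

extern void sh_init(shared_min *m);                    /* bound = INT_MAX               */
extern int  sh_read(const shared_min *m);              /* current global best bound     */
extern void sh_update(shared_min *m, int candidate);   /* bound = min(bound, candidate) */

/* Each worker prunes against the shared bound and publishes improvements;
 * the run-time propagates the new bound to every other worker, which is
 * exactly the multicast that plain point-to-point RPC could not express. */
void consider_tour(shared_min *best, int tour_length)
{
    if (tour_length < sh_read(best))
        sh_update(best, tour_length);
}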
Performance
Performance, in general, has been a major success story. The minimum RPC time for Amoeba is 1.1 msec between two user-space processes on Sun 3/60s, and interprocess throughput is over 800 kbytes/sec. The file system lets us read and write files at about the same rate.

User Interface
Amoeba originally had a homebrew window system. It was faster than X-windows and, in our view, cleaner. It was also much smaller and easier to understand. For these reasons we thought it would be easy to get people to accept it. We were wrong. Technical factors sometimes play second fiddle to political and marketing ones. We have abandoned our window server and switched to X windows.

Security
An intruder capable of tapping the network on which Amoeba runs can discover capabilities and do considerable damage. In a production environment some form of link encryption is needed to guarantee better security. Although some thought has been given to a security mechanism [26], it was not implemented in Amoeba 4.0.

Two potential security systems have been designed for Amoeba 5.0. The first version can only be used in friendly environments where the network and operating system kernels can be assumed secure. This version uses one-way ciphers and, with caching of argument/result pairs, can be made to run virtually as fast as the current Amoeba. The other version makes no assumptions about the security of the underlying network or the operating system. Like MIT's Kerberos [23], it uses a trusted authentication server for key establishment and encrypts all network traffic. We hope to install both versions and investigate the effects on performance of the system. We are researching the problems of authentication in very large systems spanning multiple organizations and national boundaries.



Comparison With Other Systems
Amoeba is not the only distributed system in the world. Other well-known ones include Mach [1], Chorus [21], V [6], and Sprite [18]. Although a comprehensive comparison of Amoeba with these would no doubt be very interesting, it is beyond the scope of this article. Nevertheless, we would like to make a few general remarks.

The main goal of the Amoeba project differs somewhat from the goals of most of the other systems. It was our intention to develop a new operating system from scratch, using the best ideas currently available, without regard for backward compatibility with systems designed 20 years ago. In particular, while we have written a library and server that provide enough Unix compatibility that over 100 Unix utilities run on Amoeba (after relinking with a special library), total compatibility has never been a goal. Although from a marketing standpoint, not aiming for complete compatibility with the latest version of Unix may scare off potential customers with large existing software bases, from a research point of view, having the freedom to selectively use the good ideas from Unix and reject the bad ones is a plus. Some other systems take a different viewpoint.

Another difference between Amoeba and other systems is our emphasis on Amoeba as a distributed system. It was intended from the start to run on a large number of machines. One comparison with Mach is instructive on this point. Mach uses a clever optimization to pass messages between processes running on the same machine. The page containing the message is mapped from the sender's address space to the receiver's address space, thus avoiding copying. Amoeba does not do this because we consider the key issue in a distributed system to be the communication speed between processes running on different machines. That is the normal case. Only rarely will two processes happen to be on the same physical processor in a true distributed system, especially if there are hundreds of processors; therefore we have put a lot of effort into optimizing the distributed case, not the local case. This is clearly a philosophical difference.

Conclusion
The Amoeba project has clearly demonstrated that it is possible to build an efficient, high-performance distributed operating system on current hardware. The object-based nature of the system, and the use of capabilities, provides a unifying theme that holds the various pieces together. By making the kernel as small as possible, most of the key features are implemented as user processes, which means that the system can evolve gradually as needs change and we learn more about distributed computing.

Amoeba has been operating satisfactorily for several years now, both locally and to a limited extent over a wide-area network. Its design is clean and its performance is excellent. By and large we are satisfied with the results. Nevertheless, no operating system is ever finished, so we are continually working to improve it. Amoeba is now available. For information on how to obtain it, please contact Tanenbaum, preferably by electronic mail at AST@CS.VU.NL.

References
1. Accetta, M., Baron, R., Bolosky, W., Golub, D., Rashid, R., Tevanian, A., and Young, M. Mach: A new kernel foundation for Unix development. In Proceedings of the Summer Usenix Conference (Atlanta, GA, July 1986).
2. Baalbergen, E.H., Verstoep, K., and Tanenbaum, A.S. On the design of the Amoeba configuration manager. In Proceedings of the 2nd International Workshop on Software Configuration Management, ACM, N.Y. (1989).
3. Bal, H.E., Van Renesse, R., and Tanenbaum, A.S. Implementing distributed algorithms using remote procedure call. In Proceedings of the National Computer Conference, AFIPS (1987), pp. 499-505.
4. Bal, H.E., and Tanenbaum, A.S. Distributed programming with shared data. IEEE Conference on Computer Languages, IEEE (1988), pp. 82-91.
5. Birrell, A.D., and Nelson, B.J. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2 (Feb. 1984), 39-59.
6. Cheriton, D.R. The V distributed system. Commun. ACM 31 (March 1988), 314-333.
7. Dalal, Y.K. Broadcast protocols in packet switched computer networks. Ph.D. dissertation, Stanford Univ., 1977.
8. Dennis, J., and Van Horn, E. Programming semantics for multiprogrammed computation. Commun. ACM 9 (March 1966), 143-155.
9. Evans, A., Kantrowitz, W., and Weiss, E. A user authentication scheme not requiring secrecy in the computer. Commun. ACM 17 (Aug. 1974), 437-442.
10. Feldman, S.I. Make-A program for maintaining computer programs. Software-Practice and Experience 9 (April 1979), 255-265.
11. Johnson, S.C. Yacc-yet another compiler compiler. Bell Labs Tech. Rep., Bell Labs, Murray Hill, N.J., 1978.
12. Kaashoek, M.F., Tanenbaum, A.S., Flynn Hummel, S., and Bal, H.E. An efficient reliable broadcast protocol. Oper. Syst. Rev. 23 (Oct. 1989), 5-19.
13. Lawler, E.L., and Wood, D.E. Branch and bound methods: A survey. Oper. Res. 14 (July 1966), 699-719.
14. Marsland, T.A., and Campbell, M. Parallel search of strongly ordered game trees. Comput. Surv. 14 (Dec. 1982), 533-551.
15. Mullender, S.J., and Tanenbaum, A.S. A distributed file service based on optimistic concurrency control. In Proceedings of the Tenth Symposium on Operating System Principles (Dec. 1985), pp. 51-62.
16. Mullender, S.J., and Tanenbaum, A.S. The design of a capability-based distributed operating system. Comput. J. 29 (Aug. 1986), 289-299.
17. Mullender, S.J., van Rossum, G., Tanenbaum, A.S., van Renesse, R., and van Staveren, H. Amoeba-A distributed operating system for the 1990s. IEEE Comput. 23 (May 1990), 44-53.
18. Ousterhout, J.K., Cherenson, A.R., Douglis, F., Nelson, M.N., and Welch, B.B. The Sprite network operating system. IEEE Comput. 21 (Feb. 1988), 23-36.



19. Peterson, L., Hutchinson, N., O'Malley, S., and Rao, H. The x-kernel: A platform for accessing Internet resources. IEEE Comput. 23 (May 1990), 23-33.
20. Pu, C., Noe, J.D., and Proudfoot, A. Regeneration of replicated objects: A technique and its Eden implementation. In Proceedings of the 2nd International Conference on Data Engineering (Feb. 1986), pp. 175-187.
21. Rozier, M., Abrossimov, V., Armand, F., Boule, I., Gien, M., Guillemont, M., Hermann, F., Kaiser, C., Langlois, S., Leonard, P., and Neuhauser, W. CHORUS distributed operating system. Comput. Syst. 1 (Fall 1988), 299-328.
22. Schroeder, M.D., and Burrows, M. Performance of the Firefly RPC. In Proceedings of the Twelfth ACM Symposium on Operating System Principles, ACM, N.Y. (Dec. 1989), 83-90.
23. Steiner, J.G., Neuman, C., and Schiller, J.I. Kerberos: An authentication service for open network systems. In Proceedings of the Usenix Winter Conference, USENIX Assoc. (1988), pp. 191-201.
24. Stonebraker, M. Operating system support for database management. Commun. ACM 24 (July 1981), 412-418.
25. Tanenbaum, A.S. A Unix clone with source code for operating systems courses. Oper. Syst. Rev. 21 (Jan. 1987), 20-29.
26. Tanenbaum, A.S., Mullender, S.J., and Van Renesse, R. Using sparse capabilities in a distributed operating system. In Proceedings of the Sixth International Conference on Distributed Computer Systems, IEEE (1986), 558-563.
27. Tanenbaum, A.S., and Van Renesse, R. Distributed operating systems. Comput. Surv. 17 (Dec. 1985), 419-470.
28. Tanenbaum, A.S., and Van Renesse, R. A critique of the remote procedure call paradigm. In Proceedings of Euteco '88 (1988), pp. 775-783.
29. Van Renesse, R., Tanenbaum, A.S., Van Staveren, H., and Hall, J. Connecting RPC-based distributed systems using wide-area networks. In Proceedings of the Seventh International Conference on Distributed Computing Systems, IEEE (1987), pp. 28-34.
30. Van Renesse, R., Tanenbaum, A.S., and Wilschut, A. The design of a high-performance file server. In Proceedings of the Ninth International Conference on Distributed Computer Systems, IEEE (1989), pp. 22-27.
31. Van Renesse, R., Van Staveren, H., and Tanenbaum, A.S. Performance of the world's fastest distributed operating system. Oper. Syst. Rev. 22 (Oct. 1988), 25-34.
32. Van Renesse, R., Van Staveren, H., and Tanenbaum, A.S. Performance of the Amoeba distributed operating system. Software-Practice and Experience 19 (March 1989), 223-234.
33. Van Rossum, G. AIL-A class-oriented stub generator for Amoeba. In Proceedings of the Workshop on Experience with Distributed Systems, J. Nehmer, Ed., Springer Verlag, N.Y., 1990. To be published.
34. Welch, B.B., and Ousterhout, J.K. Pseudo devices: User-level extensions to the Sprite file system. In Proceedings of the Summer USENIX Conference (June 1988), pp. 37-49.

CR Categories and Subject Descriptors: C.2.4 [Computer-Communications Networks]: Distributed Systems-distributed applications, distributed databases, network operating systems; D.4.8 [Operating Systems]: Performance-measurements.
General Terms: Design, Experimentation, Performance
Additional Key Words and Phrases: Computer networks, experience

About the Authors:
ANDREW S. TANENBAUM is the principal architect of three operating systems (TSS-11, MINIX, and Amoeba) as well as the chief designer of the Amsterdam Compiler Kit. He is currently a professor of computer science at the Vrije Universiteit in Amsterdam where he does research in the areas of operating systems, networks, and distributed systems. He is also the author of 3 widely used textbooks and over 60 published papers.

ROBBERT van RENESSE is a researcher in the computer science department at the Vrije Universiteit as well as a fellow of the Royal Dutch Academy of Sciences. He is presently working on management of distributed systems to improve their robustness, performance, and scalability.

HANS van STAVEREN is one of the implementors of the Amoeba distributed operating system. His primary research interests include network protocols and kernel efficiency.

GREGORY J. SHARP has spent the past five years working on the Amoeba project, first developing a window system, then its kernel and file system. His research interests include operating systems and user interfaces.

SAPE J. MULLENDER heads the distributed systems and computer networks research group at the Centrum voor Wiskunde en Informatica. He is also one of the designers of the Amoeba distributed operating system. Mullender's research interests include high performance distributed computing and the design of scalable fault-tolerant services.

JACK JANSEN joined the distributed systems group at CWI in 1985 after teaching computer science for several years. His professional interests include kernel programming and process management.

GUIDO van ROSSUM joined the Amoeba project three years ago, creating its RPC interface specification language (AIL) and its Unix emulation facility (Ajax), and has worked on system integration and user interface issues. His current research topics include prototyping languages and user interfaces for power users.

Authors' Present Address for Tanenbaum, van Renesse, van Staveren, and Sharp: Department of Mathematics and Computer Science, Vrije Universiteit, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands; ast@cs.vu.nl, cogito@cs.vu.nl, sater@cs.vu.nl, gregor@cs.vu.nl.
For Mullender, Jansen, and van Rossum: Centrum voor Wiskunde en Informatica, Kruislaan 413, 1098 SJ Amsterdam, The Netherlands; sape@cwi.nl, jack@cwi.nl, guido@cwi.nl.

VAX is a trademark of Digital Equipment Corporation.

This research was supported in part by the Netherlands Organization for Scientific Research (NWO) under grant 125-30-10. The research at Centrum voor Wiskunde en Informatica was supported in part by a grant from Digital Equipment Corporation.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1990 ACM 0001-0782/90/1200-0046 $1.50

