
Design and Implementation of a Network Layer for Distributed Programming Platforms

Anna Neiderud

Master's thesis at: Civilingenjörsprogrammet i Elektroteknik, Royal Institute of Technology

Conducted at: Swedish Institute of Computer Science

Supervisors: Per Brand, Erik Klintskog, Swedish Institute of Computer Science

Examiner: Seif Haridi, Department of Teleinformatics, Royal Institute of Technology

Stockholm, February 15, 2000

Abstract
Mozart is a distributed programming platform for a multi-paradigm language named Oz. Its network layer is a message passing service to higher layers, which run protocols to maintain the state of distributed entities. This thesis presents the design, implementation and evaluation of a new network layer for Mozart. After a review of the problems of the old network layer, a new model was designed and implemented. The solutions provided are a more efficient usage of file descriptors and similar resources, less fragmented sending of data, leaner memory usage, improved multiplexing over communication channels and a monitor mechanism for error handling. The distribution models of a few other languages were also studied, and the demands on a message passing service were investigated. It turns out that the many different aspects of a multi-paradigm language also demand more of such a service. Still, a small comparative test shows that the performance of the network layer of Mozart is competitive with, or even higher than, the performance of Java's RMI, which uses a more simplistic message passing service.

Table of Contents
1 Introduction
2 Distribution models of Mozart and other Distributed Languages
  2.1 Conventional distribution models
  2.2 Network Transparency
3 Network layers
  3.1 The network layer of Java's RMI
  3.2 The network layer of Mozart
4 Problems addressed with the new design
5 Design and implementation of the new Network Layer
  5.1 Definition of the new model
  5.2 Connecting to a remote site
  5.3 Passing Messages
  5.4 Priorities
  5.5 Acknowledgements and retransmission
  5.6 Garbage collection and references
  5.7 Resource caching
  5.8 Error handling
  5.9 Logging
6 Evaluation
  6.1 Evaluation of the performance of the new vs. the old design
  6.2 A comparison with Java's RMI
7 Future work
8 Conclusions
Appendix A: Evaluation test code
Appendix B: Interfaces
References


1 Introduction
Today more and more applications are becoming distributed. Somehow they work, but anyone who has tried to distribute an application knows that this still involves much extra programming that is of no interest to the application itself. More precisely, it still involves getting into low-level networking, or possibly using some abstraction that scarcely hides the networking. Mostly this is because network tools and abstractions have been designed with the idea of providing a nice interface directly to the low-level communication. A different viewpoint is to design an environment that easily distributes a centralized application without explicitly defining any communication. Mozart is a multi-paradigm language that provides distribution from the application programmer's point of view. Of course, the low-level networking needs to be taken care of here too, and the creative parts of this thesis project (Section 5) are about the design and implementation of new networking support for Mozart. Prior to that, different approaches to providing distribution abstractions are studied (Section 2), and their demands on network services are discussed (Section 3). Section 4 describes the problems addressed with the new design.

2 Distribution models of Mozart and other Distributed Languages


2.1 Conventional distribution models
Conventional distribution models refer to models providing interfaces directly to communication. They often offer an elegant interface to communication, but they also make the design of the application dependent on the design of the distribution structure.

2.1.1 Sockets
The first naive approach to building distributed applications is to connect processes via sockets [9]. This requires a lot of explicit handling of the communication. The application programmer needs to define a protocol for communication, which can lead to tedious debugging. In addition, this approach duplicates work done when writing other distributed applications. It is therefore desirable to separate the actual communication from the application. One model for doing so is RPC.

2.1.2 RPC
Remote Procedure Call (RPC) [10] is a simple and elegant solution that allows applications to transparently invoke procedures remotely. A client process makes a local procedure call to a stub procedure that takes care of any arguments and sends a request to a remote stub, which in turn carries out the call and sends the result back. It is known at compile time exactly where communication may take place, and sockets to serve procedure calls may be set up when a program starts.
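The stub mechanism described above can be sketched in Python. This is a schematic illustration only, not RPC or Mozart code: all class names are hypothetical, an in-process Transport stands in for the socket pair, and pickle stands in for argument marshaling.

```python
import pickle

class Transport:
    """Stands in for a socket pair; delivers request bytes to a remote stub."""
    def __init__(self, skeleton):
        self.skeleton = skeleton
    def call(self, request_bytes):
        return self.skeleton.handle(request_bytes)

class Skeleton:
    """Server-side stub: unmarshals the request and carries out the call."""
    def __init__(self, impl):
        self.impl = impl
    def handle(self, request_bytes):
        name, args = pickle.loads(request_bytes)
        result = getattr(self.impl, name)(*args)
        return pickle.dumps(result)

class Stub:
    """Client-side stub: marshals arguments and sends a request."""
    def __init__(self, transport):
        self.transport = transport
    def __getattr__(self, name):
        def proxy(*args):
            reply = self.transport.call(pickle.dumps((name, args)))
            return pickle.loads(reply)
        return proxy

class Adder:
    def add(self, a, b):
        return a + b

stub = Stub(Transport(Skeleton(Adder())))
print(stub.add(2, 3))  # the call looks local but goes through marshaling
```

The point of the pattern is visible in the last line: the caller sees an ordinary procedure call, while the stub pair hides the communication.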


2.1.3 Network Objects or RMI
An extension of RPC is Remote Method Invocation (RMI), provided by the Network Objects of Modula-3 [1] and by Java's RMI [2]. This model uses special objects whose methods can be invoked over the network. As for RPC, stubs are generated, but they are proxies for remote objects instead of procedures. One important difference between RPC and RMI is that while a procedure is known at compile time, an object can be dynamically allocated at runtime. Therefore it is not known until the program is run for how many objects (possibly zero) communication is needed. The special network objects cannot migrate and are thus stationary. When these objects are arguments or results of a method call, they are passed by reference. All other objects in the system are serialized and copied instead. This means that the semantics change when objects are passed as arguments to a remote method rather than a local one. Network objects give object-oriented semantics to distributed programming, but the programmer still needs to make a static choice of which objects to make remotely available. This means the application has to be designed tightly together with the distribution structure. If the distribution structure is to change, other objects must become network objects and the application must be rewritten.

2.2 Network Transparency
There are approaches where the programmer does not need to explicitly make objects or entities accessible remotely, and when accessing an object or entity does not need to know whether it is on a remote machine. This is called network transparency, and two languages that adopt this approach are the object-oriented language Obliq and Mozart.

2.2.1 Obliq
Obliq is described by its inventor, Luca Cardelli, as "a language with distributed scope" [3]. This means that any scope a computation can reach when created will still be reachable when the computation migrates across the network. This gives the programmer a chance to view the memory of all interacting machines as a shared memory. The concept of a distributed scope is very powerful: it can separate the distribution structure from the application functionality completely. Obliq achieves distributed scoping by letting all objects be Modula-3 Network Objects [1]. They can therefore easily be transparently referenced and accessed across the network. Unfortunately, they also inherit the property of being stationary, which avoids problems with state duplication. This strongly limits the flexibility of the distribution structure, and thus the possibility to tune performance by making some objects mobile. Luca Cardelli stresses that objects can migrate, but this is achieved only through explicit tricks by the programmer.

2.2.2 Mozart
The models up to Network Objects or RMI (Sections 2.1.1-2.1.3) provide abstractions that should make it easier to define communication in distributed programming. A different approach is to relieve the programmer from defining the communication. Instead, one can identify how a centralized application can be distributed from the application programmer's point of view. Such a solution must maintain the semantics, as the solution provided by Obliq (Section 2.2.1) does. During the development of distribution in Mozart, the following five concerns have been identified as the ones to consider when building a distributed application [4]:

1. Application functionality: what the application does if all effects of distribution are disregarded.
2. Distribution structure: the partitioning of the application over a set of sites¹.
3. Open computing: the ability for independently written applications to interact with each other in interesting ways.
4. Fault tolerance: the ability for the application to continue providing its service despite partial failures.
5. Resource control and security: the ability for the application to continue providing its service despite intentional interference.

Since the application functionality (1) constitutes the desired result of programming, the approach in Mozart is to separate the other four concerns (2-5) from it. To begin with, the separation of the application functionality (1) and the distribution structure (2) is addressed. An application should behave in the same way independently of the distribution structure, but the distribution structure should be controllable. To accomplish this, Mozart too provides distributed lexical scoping, but without limits on where entities reside. Having a distributed scope means it must be possible to distribute all of Mozart's entities. These are objects as well as other types of entities such as records, cells, variables and ports. The different properties of these imply different distribution strategies. First, Mozart distinguishes between stateful and stateless data. Stateless data can simply be copied when another site acquires a reference. Stateful data is managed via different distribution algorithms: some stateful entities have a stationary state by default and some have a mobile state. Second, some algorithms are defined to do asynchronous communication, while others use synchronous communication.
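The stateless/stateful distinction above can be sketched as a simple classification. The entity kinds and strategy names here are illustrative assumptions, not Mozart's actual internal tags.

```python
# Hypothetical classification of entities into distribution strategies,
# following the stateless/stateful distinction described in the text.
def strategy(entity):
    kind, _payload = entity
    if kind in ("record", "procedure"):   # stateless: copy on reference
        return "copy"
    if kind == "port":                    # stateful, stationary state
        return "stationary"
    if kind in ("cell", "object"):        # stateful, mobile state
        return "mobile"
    raise ValueError(f"unknown entity kind: {kind}")

print(strategy(("record", {"a": 1})))  # copy: no state to keep consistent
print(strategy(("cell", 42)))          # mobile: a protocol manages the state
```

The design point is that the choice of algorithm follows from the entity's properties, not from the application code.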

3 Network layers
All of the above-mentioned distribution models need some means of communication in the form of a message passing system. This is referred to as a network layer throughout this report. A first approach to designing a network layer might be to use what the operating system or some internet protocol provides directly, but these do not provide all that is demanded of the communication. Typical issues to decide on are:

- use of synchronous or asynchronous message passing
- how to maintain connections, with respect to single or multiple connections between each virtual machine and limited resources such as file descriptors or port numbers
- what communication protocol to use
- how to reflect errors to higher layers

¹ The word site is used to denote processes that are not necessarily on the same machine.

Here the demands on the network layers of Java's RMI and Mozart are discussed, and the solutions chosen are briefly described.

3.1 The network layer of Java's RMI
As mentioned in Section 2.1.3, RMI in Java uses the standard mechanism with stubs and skeletons to communicate with remote objects. On the client side, a stub object acts as a placeholder for the remote object. Whenever a method is called, the stub is responsible for (1) establishing a connection to a remote skeleton object, (2) marshaling arguments and passing them to the skeleton, (3) waiting for the result and (4) unmarshaling and returning the result or throwing an exception. On the server side, the skeleton listens for incoming calls and performs a local method invocation on the actual object [2]. This pattern of communication implies synchronous message passing. The actual communication is separated from the stub and skeleton to allow a customizable choice of communication protocols. The chosen architecture is shown in Figure 1.

Figure 1: Architecture of client and server in Java's RMI [5].

The bottom layer, "Distributed Computing Services", is where connections are set up and used. Whenever a stub needs a connection to a skeleton, a SocketFactory is requested for a socket. The default SocketFactory can be replaced by the application, and that way custom sockets may be used. The RMI protocol requires one connection for every method invocation. However, when one invocation is finished, its connection will be cached for a period of time, ready to be used by some other call to the same virtual machine [8]. Using one connection per method invocation means that limited resources translate into a limit on the number of simultaneous (incoming or outgoing) method calls. When resources are low, currently unused connections can be closed, but others must be maintained. Since method calls are synchronous, all errors can immediately be reflected back to where the method call was made, as an exception.
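The connection caching described above can be sketched roughly as follows. ConnectionCache, its ttl parameter and the tuple "connections" are hypothetical stand-ins for RMI's real socket handling.

```python
import time

class ConnectionCache:
    """Keeps released connections around for `ttl` seconds for reuse."""
    def __init__(self, ttl=15.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self.idle = {}       # target VM -> (connection, time released)
        self.created = 0     # how many fresh connections were opened

    def acquire(self, target):
        entry = self.idle.pop(target, None)
        if entry is not None:
            conn, released = entry
            if self.clock() - released < self.ttl:
                return conn                  # reuse the cached connection
        self.created += 1
        return ("conn", target, self.created)  # stand-in for a new socket

    def release(self, target, conn):
        # Called when an invocation finishes; the connection stays usable.
        self.idle[target] = (conn, self.clock())

cache = ConnectionCache(ttl=10.0)
c1 = cache.acquire("vm-a")
cache.release("vm-a", c1)
c2 = cache.acquire("vm-a")
print(c1 is c2)  # True: the idle connection was reused within the ttl
```

A real implementation would also close cached connections when file descriptors run low, as the text notes.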

3.2 The network layer of Mozart
In contrast to RMI, the network layer of Mozart does not have one single protocol to serve, but several different ones taking care of different types of entities. These protocols include sending messages to one or more sites, synchronously or asynchronously. To efficiently serve these various needs, the network layer of Mozart implements an asynchronous message passing service. For comparison with the architecture of RMI described in Figure 1, the architecture of distributed Mozart can be viewed as in Figure 2.
Figure 2: Architecture of distributed Mozart. (Two sites, each a stack of Centralized Mozart, Distribution Protocols and Network Layer, communicating via the OS.)

To utilize allocated network resources as well as possible, there will always be at most one (virtual) connection to every site. This is depicted in Figure 3. Communication is multiplexed over the shared connections. Since a typical Mozart application runs several threads, several messages to one site can be collected and sent in one chunk, which will lower overhead on each message. When resources are low, physical connections can be temporarily closed to let other connections be opened. To upper layers, this is only visible as higher latency.
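Collecting several messages into one chunk might look like the following sketch, with a length prefix so the receiver can split the chunk again. The names and the framing are illustrative, not Mozart's actual wire format.

```python
class Connection:
    """One shared connection per remote site; batches queued messages."""
    def __init__(self):
        self.queue = []
        self.sent_chunks = []

    def enqueue(self, payload: bytes):
        self.queue.append(payload)

    def flush(self):
        # Send everything queued so far as one chunk, each message
        # prefixed with its length so the receiver can split the chunk.
        chunk = b"".join(len(m).to_bytes(4, "big") + m for m in self.queue)
        self.queue.clear()
        self.sent_chunks.append(chunk)

def split_chunk(chunk: bytes):
    messages, i = [], 0
    while i < len(chunk):
        n = int.from_bytes(chunk[i:i + 4], "big")
        messages.append(chunk[i + 4:i + 4 + n])
        i += 4 + n
    return messages

conn = Connection()
for m in (b"hello", b"from", b"three threads"):
    conn.enqueue(m)
conn.flush()  # one write instead of three lowers per-message overhead
print(split_chunk(conn.sent_chunks[0]))  # [b'hello', b'from', b'three threads']
```

The single flush models the point made above: several threads' messages to one site share one physical write.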


Figure 3: At one site (one Mozart Virtual Machine) threads can share entities, and entities can share connections to other sites (SiteA, SiteB, SiteC). There is always at most one connection per pair of sites.

Since asynchronous communication is used, errors can be asynchronous as well as synchronous. Only some errors can be reflected back as an exception where the call was made. Other errors can be discovered at application level e.g. by monitoring an entity. The network layer will then reflect the errors to these monitors.
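A monitor mechanism of this kind could be sketched as follows. The class names and the "tempFail" fault label are assumptions for illustration, not Mozart's actual API.

```python
class Monitor:
    """Registered at application level to be told about an entity's faults."""
    def __init__(self):
        self.reports = []
    def report(self, entity, fault):
        self.reports.append((entity, fault))

class NetworkLayer:
    def __init__(self):
        self.monitors = {}            # entity -> list of monitors
    def monitor(self, entity, mon):
        self.monitors.setdefault(entity, []).append(mon)
    def connection_lost(self, entity):
        # An asynchronous error: no call is waiting, so instead of an
        # exception the fault is reflected to the registered monitors.
        for mon in self.monitors.get(entity, []):
            mon.report(entity, "tempFail")

net = NetworkLayer()
m = Monitor()
net.monitor("cell@siteB", m)
net.connection_lost("cell@siteB")
print(m.reports)  # [('cell@siteB', 'tempFail')]
```

The contrast with RMI is the delivery path: a synchronous call can raise an exception at the call site, while an asynchronous fault must find its way to whoever registered interest.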

4 Problems addressed with the new design


Distribution in Mozart is implemented and part of the current release (Mozart 1.0). The protocols used are mature and well working, but the network layer has become hard to maintain and alter, which makes improvements hard to add. The creative parts of this thesis are about redesigning and reimplementing the network layer to achieve a better structure, and to implement some improvements while preparing for others. Here follows a description of the problems in the old design addressed by the new design, before the new design is defined in Section 5.

Modularity
In the old design, almost the entire network layer had to be defined for each type of transport medium used (TCP and shared memory were the implemented versions). This led to code duplication and made it hard to experiment with other transport media, such as UDP. The new design singles out transport-specific code in separate modules and defines an interface to be used by any new such module. The common code is also divided into separate modules and cleaned up, hopefully making it easier to maintain.

Resources
Each virtual connection (in the network layer using TCP) used two physical connections, one for reads and one for writes. That kept the protocols for initiating and closing communication very simple, but also used two file descriptors where one would have sufficed. Since file descriptors are limited per process in most operating systems, the new design only uses one physical connection per virtual connection. This should improve performance with many simultaneous connections at one site.

Nagle algorithm
When sending much data asynchronously, the operation can benefit from letting some data buffer up before sending, rather than sending many small messages. This is provided by the Nagle algorithm in TCP, but enabling it leaves no control, and for some protocols data would be sent too late. Therefore, the new design adds its own buffering. In addition, it provides a possibility to schedule distribution I/O activities in the future.

Memory usage
Many messages produced by the protocols contain Oz terms that need to be marshaled. They exist in memory on the Oz heap. In the old design, everything is marshaled when the message is constructed. This causes terms in not-yet-sent messages to be stored twice: on the Oz heap, plus in a larger marshaled format. Since the number of stored, not-yet-sent messages increases with the new buffering strategies, the new design postpones marshaling until just before the send, and limits the amount of currently marshaled data.

Multiplexing
In the old design, one large message could monopolize the channel, since it had to be completely transferred before any other message was started. By using a new suspendable marshaler, large messages can be divided into smaller parts, which other messages can interleave. To achieve interleaving, priority queuing of messages is also introduced.

Error handling
To give the programmer not only network transparency but also network awareness, mechanisms telling the programmer about the state of the network are needed. For TCP communication, the old design forwards the errors returned by TCP to upper layers. These errors can then be interpreted as the remote site being permanently or temporarily (possibly forever) down. The temporary fault typically appears after a delay of several minutes, but for many applications a mechanism that both can report temporary faults within a custom amount of time and can monitor the throughput would be desirable. Therefore the new design provides probing, i.e. probes (pings) are sent and the round trip is measured.
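The combination of priority queuing and a suspendable marshaler can be illustrated schematically. MAX_PART and the two priority levels are arbitrary choices for the sketch, and byte slicing stands in for suspendable marshaling.

```python
from collections import deque

HIGH, LOW = 0, 1
MAX_PART = 4  # bytes sent per turn (tiny, for illustration only)

class Sender:
    """Serves priority queues in order; a large message is sent in parts
    so it cannot monopolize the channel (a stand-in for the suspendable
    marshaler)."""
    def __init__(self):
        self.queues = {HIGH: deque(), LOW: deque()}
        self.wire = []                       # (name, part) pairs "sent"

    def enqueue(self, prio, name, data):
        self.queues[prio].append((name, data, 0))

    def send_one_part(self):
        # Higher priorities go first; a partially sent message is
        # requeued and resumes where it was suspended.
        for prio in (HIGH, LOW):
            if self.queues[prio]:
                name, data, offset = self.queues[prio].popleft()
                self.wire.append((name, data[offset:offset + MAX_PART]))
                if offset + MAX_PART < len(data):
                    self.queues[prio].append((name, data, offset + MAX_PART))
                return True
        return False

s = Sender()
s.enqueue(LOW, "big", b"x" * 10)   # would monopolize the channel unsplit
s.send_one_part()                  # first part of "big" goes out
s.enqueue(HIGH, "ping", b"hi")     # arrives while "big" is in transfer
while s.send_one_part():
    pass
print([name for name, _ in s.wire])  # ['big', 'ping', 'big', 'big']
```

The high-priority "ping" interleaves between the parts of "big" instead of waiting for the whole transfer, which is exactly the behavior the suspendable marshaler enables.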

5 Design and implementation of the new Network Layer

5.1 Definition of the new model
In the old model, the distributed subsystem is divided into a distribution layer and a network layer. The distribution layer maintains information on where different sites are physically located. The network layer passes the messages. This structure is kept for the new model, but the network layer is now divided into one communication layer and one transport layer. This gives the layering shown in Figure 4.

Figure 4: Architecture of the new model.

Figure 4 also shows the different objects used in the new design. The boxes mark controllers that exist whenever distribution is running. Controllers are responsible for creating and garbage collecting the multiple objects of each layer. The ovals mark object instances on each layer that are unique for every remote peer.

5.1.1 Distribution layer
The objects marked S1-S6 represent site-objects. In Mozart the word site is used to denote processes that might reside on different machines, and in this model site-objects are used as references to remote sites that are known to the local site. A site-object contains all the information necessary to connect to a remote site, such as the physical location and what transport medium to use when connecting to it. When a reference is passed to some other site, the information in the site-object is marshaled and transmitted. When information about a site is received and unmarshaled, a site-object is created in the system if it does not already exist.

5.1.2 Communication layer
Every ComObject (marked C1, C2) takes care of the communication for one site-object. It acquires a channel, sets up properties for the communication over that channel, keeps track of when a channel can be closed, queues messages, assures that messages reach their destination or are reported unsent, and provides probes for network monitoring.

5.1.3 Transport layer
A TransObject (marked T1, T2) provides an abstraction for some means of reliable physical communication. Several different types of TransObjects can coexist in the system, providing different types of communication between sites depending on their relative physical location. When a ComObject needs a TransObject, it requests one from a special "Physical Media Mediator", which with the help of the site-object decides on what type of transport medium to use. The different TransControllers are kept by this mediator and used to extract the TransObject. The connection is then established with the connection procedure described in [6]. Resources are also controlled by the TransControllers. If too many are in use, ComObjects will be queued until a resource is free. A ComObject is always guaranteed to sooner or later get a TransObject, i.e. resources are handled fairly, in the same sense as threads [7]. To begin with, one TransObject is implemented to use TCP and one to use shared memory, but it is left open to add communication via UDP or something else.

5.1.4 Messages and MsgContainers
Messages are stored in MsgContainers that act as transporters of information. MsgContainers are created by the distribution protocols when a message is to be sent, or by the transport layer when a message has been received. Besides storing the type and content of a message, a MsgContainer can also store information on when the message was sent. All the knowledge of how to marshal and unmarshal a message of a certain type, and what to do at garbage collection, is also kept in the MsgContainer.

5.2 Connecting to a remote site
ComObjects "connect" to remote sites by getting a TransObject (physically connected) and running an open protocol to set some properties for the communication. Connections are initiated when a ComObject is ordered to send; a remote peer will accept the connection. Being an initiator or an acceptor are two cases that need to be treated differently, because an accepting ComObject cannot know who is trying to connect to it until the remote peer has introduced itself.

5.2.1 Building the architecture as an initiating site
Building the architecture as an initiating site is rather straightforward. The course of action is described by Figure 5 and Figure 6. If a connection to the remote peer has previously been present, a site-object and perhaps a ComObject may already exist; the corresponding steps are then simply left out.


Figure 5: The SiteController, ComController and TransController are present in the system. When communication is needed, a Site is given (1) by the SiteController. This Site in turn contacts (2) the ComController to get a ComObject.
Figure 6: The ComObject asks (3) a TransController for a TransObject. If there are not enough resources, the TransController will queue this ComObject; otherwise a physical channel will be established. On success the ComObject is handed back a TransObject ready to communicate. The ComObject will run the open protocol described below using the TransObject and then start sending the upper-layer messages as requested.

5.2.2 Building the architecture as an accepting site
Whenever a connection is accepted, a TransObject is created and handed over to the ComController as described in Section 5.2.3. The ComController creates an anonymous ComObject, since the remote site is not yet known. This ComObject gets to run the open protocol, which will determine whether another ComObject to this site exists. This is shown in Figure 7.


Figure 7: After an accept, a working TransObject is handed to the ComController (1). The ComController creates an anonymous ComObject that is handed the TransObject (2). Then the open protocol is started.

In those cases where the remote site is known prior to the accept, a ComObject may or may not already exist for that site. If one does exist, that ComObject is to be used, since it may contain queued messages and valuable information from prior connections; the anonymous one is discarded. If the old ComObject also has a TransObject, the two sites must be trying to connect to each other simultaneously, and the open protocol described below determines which channel is to be used. The old ComObject then adopts the TransObject of that channel.

5.2.3 Connect and accept with the "Physical Media Mediator"
This design uses a yet-to-be-designed "Physical Media Mediator" for the establishment of a physical channel. (The implementation has a stub for this.) The idea of the "Physical Media Mediator" is to make it possible to control from the Oz level what transport medium to use. It should work both for standard and for added-on transport layers. The architecture is described by Figure 8.


Figure 8: A schematic view of a future architecture with a "Physical Media Mediator", linking the Oz-level connection and accept procedures, the site-object and ComObject, and the TransControllers. The current implementation replaces the connection and accept procedures plus the "Physical Media Mediator" with a stub.
Every site-object has one connection procedure and one accept procedure. When a reference to a site is handed out, the connection procedure to be used to connect to that site is included. Locally, every site runs an accept procedure dedicated to accepting connections from the connection procedure. The interface to the network layer is given through a built-in. A brief description of what happens in the "Physical Media Mediator" and the built-in at connect and accept is given below.

Connect
1. The ComObject asks the "Physical Media Mediator" for a TransObject. The site is referenced.
2. The "Physical Media Mediator" chooses what TransController is to be used, based on the site and its connection procedure, and asks for a TransObject. The request might be queued for a while until a resource is available.
3. A TransObject is granted and handed back to the "Physical Media Mediator". The "Physical Media Mediator" calls the connection procedure to get a physical channel. It then blocks.
4. A channel is handed over from the connection procedure, and the "Physical Media Mediator" "puts" it in the TransObject. The TransObject is then handed over to the requesting ComObject.

Accept
1. The accept procedure gets a physical channel and decides on what TransController to use. The TransController is asked if there are enough resources for accept. The decision on what TransController to use, and the channel, are handed down to the "Physical Media Mediator".
2. The "Physical Media Mediator" tries to get a TransObject and "puts" the channel in it. The TransObject is then handed to the ComController, who creates the anonymous ComObject and gets the open protocol started.

Disconnect
1. The ComObject hands back the TransObject to the "Physical Media Mediator". It can then disconnect the physical channel and return the TransObject to the corresponding TransController.

5.2.4 Open protocol
The open protocol is used to negotiate some communication properties and to assure that both sites are using the same perdioversion² and know whom they are talking to. It also determines what channel to use in case both sites initiate communication simultaneously. It consists of three regular messages created and sent by the ComObjects (see also Figure 9):
1. The anonymous ComObject sends PRESENT(Version, Site).
2. The initiating ComObject sends NEGOTIATE(Version, Site, ChannelInfo).
3. The anonymous ComObject sends NEGOTIATE_ANSWER(ChannelInfo).
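The three-message exchange can be sketched as plain functions over message tuples. The version number and the message representation are assumptions for illustration, not the actual wire format.

```python
PERDIO_VERSION = 3  # hypothetical version number

def accept_side(my_site):
    # 1. The anonymous ComObject presents itself first, since it does
    #    not yet know who connected to it.
    return ("PRESENT", PERDIO_VERSION, my_site)

def initiator_side(present_msg, my_site, channel_info):
    kind, version, _remote_site = present_msg
    assert kind == "PRESENT"
    if version != PERDIO_VERSION:
        return None  # version mismatch: close the channel, no abort message
    # 2. Now the acceptor learns who its initiator is.
    return ("NEGOTIATE", PERDIO_VERSION, my_site, channel_info)

def accept_answer(negotiate_msg, chosen_channel_info):
    kind, _version, _initiator, _channel_info = negotiate_msg
    assert kind == "NEGOTIATE"
    # 3. The acceptor answers with the channel decision.
    return ("NEGOTIATE_ANSWER", chosen_channel_info)

p = accept_side("S2")
n = initiator_side(p, "S1", "chan-7")
a = accept_answer(n, "chan-7")
print(a)  # ('NEGOTIATE_ANSWER', 'chan-7')
```

Returning None on a version mismatch mirrors the text's observation that no abort message can be relied upon: the channel is simply closed.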
Figure 9: The course of the open protocol in the regular case: S1 (initiating, after handover) and S2 (anonymous, after accept) exchange the present, negotiate and negotiate answer messages.

At any point, either side can close the channel without sending any prior warning. Sending a specific abort message has been considered, but whenever there is a version mismatch (the other site might not even be an ozengine³) that technique cannot be used. Therefore the transport layer must always detect a lost channel, and a lost channel during the open protocol will be interpreted as aborting the entire connection. A timer can also close the channel, i.e. if open messages are not received fast enough the other site might not be an ozengine, might have crashed, or might be a hostile host. If no connection succeeds, the ComObject will keep on trying to connect until the site removes it. (The site is responsible for knowing when another site is permanently unreachable, or when a decision to give up the connection has been made from the Oz level.) The message exchange implies some states on a ComObject. The course of action is shown in Figure 10.

² Perdioversion is a version related to the marshaling format of the messages; having the same perdioversion means being able to read each other's messages.
³ An ozengine is a Mozart Virtual Machine.
Figure 10: States derived from the open protocol (S1 initiating, S2 anonymous). The numbered states are: 1 Closed, 2 Wait for handover, 3 Wait for present, 4 Wait for negotiate, 5 Wait for negotiate answer, 6 Working.

At reconnect, or when two sites initiate a connection simultaneously, one or both sites may have both an anonymous and a regular ComObject. Once the NEGOTIATE message arrives at the anonymous ComObject, it can be detected whether that is the case. If so, the anonymous ComObject has to examine the old ComObject and make decisions based on the state of that ComObject. The table below shows what decision to make (some of the states will be described below):

State of old ComObject: Closed, Problem, Wait for handover, Wait for remote, Wait for present, Closing weak, Closing wait for disconnect
  Decision: Adopt the new connection. Close any connection of the old ComObject.
  Comment: Either the old ComObject is not opening a channel, or no NEGOTIATE has yet been sent in an ongoing open protocol. Therefore, the anonymous ComObject at the other site, corresponding to the old ComObject, does not know its initiator.

State of old ComObject: Wait for negotiate answer
  Decision: The decision is made based on precedence. The ComObject of the "larger" site wins.
  Comment: The other site is in the same position (its anonymous ComObject receives NEGOTIATE while its old ComObject is in state Wait for negotiate answer). Precedence is used so that both sites choose the same connection to keep.

State of old ComObject: Working, Closing hard
  Decision: Close the connection of the anonymous ComObject.
  Comment: An already working connection is of course better. If resources are low, let a closing ComObject wait for a while.

State of old ComObject: Wait for negotiate
  Decision: -
  Comment: Illegal state for the old ComObject since it cannot be anonymous.
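The table can be condensed into a small decision function. This is a sketch under stated assumptions: the state names mirror the table, but `SiteId` ordering and the exact direction of the "larger site wins" rule are simplifications of the engine's precedence comparison.

```cpp
// Illustrative resolution of a double connection, keyed by the old
// ComObject's state as in the decision table above.
enum class ComState { CLOSED, PROBLEM, WAIT_HANDOVER, WAIT_REMOTE,
                      WAIT_PRESENT, CLOSING_WEAK, CLOSING_WAIT_DISCONNECT,
                      WAIT_NEGOTIATE_ANSWER, WORKING, CLOSING_HARD };

enum class Decision { ADOPT_NEW, KEEP_OLD };

Decision resolveDoubleConnection(ComState oldState,
                                 long localSite, long remoteSite) {
    switch (oldState) {
    case ComState::WORKING:
    case ComState::CLOSING_HARD:
        // An already working connection wins; close the anonymous one.
        return Decision::KEEP_OLD;
    case ComState::WAIT_NEGOTIATE_ANSWER:
        // Both sites are in the same position; a total order on sites
        // makes both pick the same connection ("larger" site wins).
        return (localSite > remoteSite) ? Decision::KEEP_OLD
                                        : Decision::ADOPT_NEW;
    default:
        // No NEGOTIATE has been sent yet: adopt the new connection.
        return Decision::ADOPT_NEW;
    }
}
```

Because both sites evaluate the same ordering with the roles of local and remote swapped, exactly one of the two competing channels survives.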

ChannelInfo

The ChannelInfo contains values of various parameters that can be set for the communication. The following parameters exist:

Last received: MsgNum of the last received message, zero if this is a new connection. When the connection is reopened, this gives the other site a chance to resend non-received messages.
MsgAckTimeout: Maximum time for the receiver to wait before acknowledging a message.
MsgAckLength: Maximum number of messages for the receiver to receive before acknowledging.
HasNeed: Tells whether this site has a need for the channel. The information is used at garbage collection and to know whether it is necessary to reopen a connection closed due to lack of resources.
Buffer size: Size of the ByteBuffers.

The first four are definite, but the buffer size needs to be equal on both sides for good performance with big messages. Therefore the initiating side gives a suggestion; the accepting side can either accept this or choose a smaller size, but never a larger one (which gives the negotiation a natural end). When the open protocol is run, the ByteBuffer is already in use. An effect of this is that the ByteBuffer must always be large enough to fit a received open-protocol message, and it must be possible to change the size at runtime. The decision on what size the buffer should have is completely up to the transport layer used, and the corresponding TransObject should be asked for this.

5.2.5 Close protocol

A channel can be closed either because it is no longer needed, which is detected at garbage collection, or because one site is running out of resources and needs to give the resources to another ComObject for a while. When garbage collection is run, the other side is given a chance to keep the channel, but when resources are out, the connection must be closed. This implies two different close messages:
1. CLOSE_WEAK, sent at garbage collection
2. CLOSE_HARD, sent at lack of resources
Possible responses are:
1. CLOSE_ACCEPT
2. CLOSE_REJECT, may only be a response to CLOSE_WEAK
A ComObject that is being closed will also interpret one of the close messages as CLOSE_ACCEPT. After sending a CLOSE_HARD or CLOSE_WEAK, the state of the ComObject changes to closing hard or closing weak. No more messages will be sent until a CLOSE_REJECT is received or the channel has been closed and reopened. At the remote peer the state will be closing wait for disconnect if it accepts the close. If the close is accepted, the close initiator will close the channel and the remote peer should detect the disconnection as a lost channel. On both sides, the TransObject is handed back to the TransController, but the ComObject will remain until removed from the site.
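The ChannelInfo fields and the buffer-size rule can be sketched as a plain struct. Field names and types are illustrative, not the engine's actual declarations; only the "never larger" rule is taken directly from the text above.

```cpp
#include <cstdint>

// Sketch of the parameters exchanged in the open protocol.
struct ChannelInfo {
    uint32_t lastReceived;   // MsgNum of last received message, 0 if new
    uint32_t msgAckTimeout;  // max time before acknowledging (e.g. ms)
    uint32_t msgAckLength;   // max messages before acknowledging
    bool     hasNeed;        // does this site need the channel?
    uint32_t bufferSize;     // suggested (initiator) / accepted (acceptor)
};

// The accepting side may keep the suggestion or shrink it, never grow
// it, which is what ends the negotiation after one round trip.
uint32_t negotiateBufferSize(uint32_t suggested, uint32_t ownPreferred) {
    return suggested < ownPreferred ? suggested : ownPreferred;
}
```

Since the answer can only shrink the value, the initiator can adopt the answered size without any further exchange.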


5.2.6 States of the ComObject Figure 11 shows the full state model of a ComObject:
Figure 11: The full state graph of the ComObject. X:s mark places where the ComObject can decide to reject communication. For readability, three states (Closed, Wait for handover, Wait for remote) occur twice. The states are grouped as: opening (Closed, Problem, Wait for remote, Wait for handover, Wait for present, Wait for negotiate (anonymous), Wait for negotiate answer), Working, and closing (Closing weak, Closing hard, Closing wait for disconnect).

For upper layer messages, the state determines whether a channel is open or not. Outgoing messages will be transmitted in state working only. Incoming messages are also received in states closing weak and closing hard. Communication layer specific messages can be received and transmitted whenever a TransObject is present. The different protocols must assure that they are only sent when they make sense.

5.3 Passing Messages

The transport layer is responsible for the actual transportation of messages. This can be done in different ways depending on what type of media is used and thus on what type of TransObject is used. This section first describes general aspects of the message passing and then the details of the message passing of the TCP-TransObject.


5.3.1 General demands on the transport layer

The communication layer expects the transport layer to provide a fully reliable service. Messages may only be lost when a channel is lost. Messages sent should be received at the remote peer in the same order. As discussed in section 5.4, some priority levels could be defined non-FIFO. In this case, an extra parameter needs to be added to the communication that allows these messages to be received out of order. When a ComObject has something to send, it will order its TransObject to deliver. The TransObject should then pull messages when it is allowed to send. The scheduler in the Mozart Virtual Machine [7] could be used to schedule when each TransObject can run. For now a more naive approach is used, where it is checked at each thread switch whether I/O is possible, and if it is, the TransObject will be invoked. When a message is received by the TransObject, it should be put in a MsgContainer and handed up to the ComObject. The TransObject is also responsible for sending acknowledgement numbers, which are provided when a message to be sent is pulled.

5.3.2 ByteBuffers

Since most media are assumed to transmit serialized consecutive data, a MsgContainer provides one method that marshals a message into a ByteBuffer and one that unmarshals it from a ByteBuffer. To make marshaling and unmarshaling independent of when data is sent from, or received to, the ByteBuffer, it is convenient to have a continuous ByteBuffer. Therefore, the ByteBuffer is implemented as a circular structure. The marshaler and the read handler, which write to the ByteBuffer, must assure that the ByteBuffer is not overfilled. This requires a suspendable marshaler. A suspendable marshaler can stop when the ByteBuffer is full, and continue later where it left off. This assumes the unmarshaler can handle unmarshaling of the fragments produced by such a marshaler.
It is expected that each TransObject uses exactly one ByteBuffer for incoming and one for outgoing data. This limits the amount of memory used by each TransObject.

5.3.3 Big messages

One of the problems of the old distribution engine was that one large transfer was allowed to monopolize the channel. The complete message had to be sent before any other message could be sent. In addition, the complete message had to be received before unmarshaling could be done, which made the memory of the distribution engine temporarily grow. Therefore, the new model introduces the concept of marshaling and sending only parts of a big message at a time. This is made possible through the suspendable marshaler mentioned in the previous section. The unmarshaler requires a full fragment to be received before beginning to work, and therefore the ByteBuffer of the receiving side must be at least large enough to fit the complete contents of the ByteBuffer at the sending side.
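The circular ByteBuffer described above can be sketched in a few lines. This is a minimal single-byte-at-a-time sketch (the real buffer works on larger regions); the class and method names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// Sketch of a circular ByteBuffer: the marshaler/read handler writes
// at the tail while the transport/unmarshaler reads at the head, and
// indices wrap around so the used region stays logically continuous.
class ByteBuffer {
    std::vector<unsigned char> buf;
    std::size_t head = 0, tail = 0, used = 0;  // head = read, tail = write
public:
    explicit ByteBuffer(std::size_t size) : buf(size) {}
    std::size_t freeSpace() const { return buf.size() - used; }
    std::size_t available() const { return used; }
    // Returns false instead of overfilling; a suspendable marshaler
    // would stop here and continue later where it left off.
    bool put(unsigned char b) {
        if (used == buf.size()) return false;
        buf[tail] = b;
        tail = (tail + 1) % buf.size();
        ++used;
        return true;
    }
    bool get(unsigned char& b) {
        if (used == 0) return false;
        b = buf[head];
        head = (head + 1) % buf.size();
        --used;
        return true;
    }
};
```

The wrap-around is what lets marshaling and transmission proceed independently: as soon as the transport has drained some bytes, the marshaler can reuse that space without moving data.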


5.3.4 Specifics of the TCP transport layer

When the TCP-TransObject is initialized, it registers a read handler with the I/O-handler, and when it is ordered to deliver, it registers a write handler. The I/O-handler will then give the control back to the TransObject at certain times. These are the times when messages will be marshaled and written, and when messages will be read and unmarshaled. Messages are marshaled from MsgContainers into frames. The contents of the frames are stored in ByteBuffers. One message can be put in one or more frames. One or more frames can be put in each ByteBuffer. On every chance the TransObject gets to write, it will try to write up to one full ByteBuffer.

Frame format

A frame looks as follows (the numbers indicate the size of each field in bytes):
Ctrl (1) | Ack (4) | FrameSize (4) | MT (1) | CF (1) | <CONTENT> | T (1)

Ctrl: A control byte only used for debugging.
Ack: The number of the last received (numbered) message. (See 5.5)
FrameSize: The size of the current frame.
MT: MessageType, the type as defined by either the protocol layer or the communication layer.
CF: Continuation Field, tells whether the frame is a continuation of a previous frame or not.
T: Trailer, tells whether this is the last frame of the message or not.

The chosen structure means the first nine bytes must always be received before trying to use any data. Then the frame size can be compared to the received amount, and a frame can possibly be unmarshaled. The acknowledgement is received earlier since it is independent of the message, and may be waited for by a probe (see section 5.8). Depending on the CF, <CONTENT> can have two different formats:
CF = first: <Data>

CF = continued: MsgNum (4) | <Data>

MsgNum: Tells the number of the message that this frame belongs to.
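Writing and reading the nine-byte prefix (Ctrl, Ack, FrameSize) can be sketched as follows. Big-endian encoding is an assumption of this sketch, not something the text specifies, and the function names are invented here.

```cpp
#include <cstdint>
#include <vector>

// Sketch of marshaling/unmarshaling the fixed 9-byte frame prefix.
struct FramePrefix {
    uint8_t  ctrl;       // debugging byte
    uint32_t ack;        // number of the last received message
    uint32_t frameSize;  // size of the current frame
};

void writePrefix(std::vector<uint8_t>& out, const FramePrefix& p) {
    out.push_back(p.ctrl);
    for (int s = 24; s >= 0; s -= 8) out.push_back((p.ack >> s) & 0xff);
    for (int s = 24; s >= 0; s -= 8) out.push_back((p.frameSize >> s) & 0xff);
}

// Returns false until all nine bytes are present, mirroring the rule
// that no frame data may be used before the prefix is complete.
bool readPrefix(const std::vector<uint8_t>& in, FramePrefix& p) {
    if (in.size() < 9) return false;
    p.ctrl = in[0];
    p.ack = p.frameSize = 0;
    for (int i = 1; i <= 4; ++i) p.ack = (p.ack << 8) | in[i];
    for (int i = 5; i <= 8; ++i) p.frameSize = (p.frameSize << 8) | in[i];
    return true;
}
```

Once the prefix is decoded, the receiver compares `frameSize` to the bytes available and only then attempts to unmarshal the frame body.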

5.4 Priorities

Messages can be delivered on five different priority levels, where levels 1 and 5 are reserved for special system messages and levels 2 through 4 are used for all other messages. Levels 2-4 are scheduled like threads, but on the number of bytes sent instead of on time. The relation between these priority levels can be altered from application level.

Level 5: Send as fast as possible
Level 4: High priority

Level 3: Medium priority
Level 2: Low priority
Level 1: Send when no messages are waiting on levels two through five. At regular intervals, messages on this level will be moved to level two to ensure throughput.
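Scheduling levels 2-4 on bytes sent instead of time could be sketched as below. This is a sketch under stated assumptions: the weights, the per-level byte accounting, and the "furthest below its share goes next" rule are illustrative choices, not the engine's documented algorithm.

```cpp
#include <cstddef>
#include <deque>

// Sketch of picking the next message: level 5 always first, then
// levels 4-2 scheduled on bytes already sent relative to a weight,
// then level 1 when nothing else is waiting.
struct Message { int level; std::size_t bytes; };

class PriorityQueues {
    std::deque<Message> q[6];          // index 1..5 used
    double sent[6]   = {0, 0, 0, 0, 0, 0};
    double weight[6] = {0, 0, 1, 2, 4, 0};  // relative shares for 2..4
public:
    void enqueue(const Message& m) { q[m.level].push_back(m); }
    // Fills m and returns true if something can be sent.
    bool next(Message& m) {
        if (!q[5].empty()) { m = q[5].front(); q[5].pop_front(); return true; }
        int best = 0;
        for (int l = 4; l >= 2; --l)
            if (!q[l].empty() &&
                (best == 0 || sent[l] / weight[l] < sent[best] / weight[best]))
                best = l;
        if (best == 0) {
            if (q[1].empty()) return false;   // nothing at all to send
            best = 1;                          // background level
        }
        m = q[best].front();
        q[best].pop_front();
        sent[best] += m.bytes;                 // byte-based accounting
        return true;
    }
};
```

Accounting in bytes rather than dispatch counts is what keeps a level sending many small messages from starving a level sending few large ones.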

Properties of levels 5 and 1

Whenever there is a message queued on level 5, it will be sent next. To avoid starvation of other levels, protocols must assure these messages cannot monopolize the channel. Messages that are allowed are messages from the open protocol, messages from the close protocol, messages from the reference protocol and explicit acknowledgements, i.e. currently all messages produced on the communication layer. Messages on this level can be sent even if the ComObject is in some state other than Working, provided the outgoing channel is open as defined in 5.2.6. Level 5 is connection dependent. As soon as a connection is lost, the queue of this level is cleared. Level 1 is for messages that can be sent whenever there is space left.

Levels 4-2 and FIFO

For the time being all levels are FIFO, but since not all protocol layer protocols require FIFO, it would be desirable to define one or two levels to be non-FIFO. This is especially interesting when considering the use of some non-reliable transport media.

5.5 Acknowledgements and retransmission

Since the transport layer is expected to provide a reliable transfer, the only case when messages can be lost is when the connection is lost. Therefore, the only time that messages need to be retransmitted is at reconnection. To be able to do so, messages must be stored at the sender until they are acknowledged. Messages from the protocol layer are implicitly numbered by the ComObjects at the sending and at the receiving side. Whenever such a message is sent, an acknowledgement number telling the number of the last received message is attached to the frame. Messages produced by the ComObject itself are not desirable to acknowledge. They are part of protocols that themselves reply to each other. It would be wrong to acknowledge an explicit acknowledgement, and it is not possible for an anonymous ComObject to know the number of the last received message from the unknown site.
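The implicit numbering, the cumulative acknowledgement, and the resend decision at reconnect can be sketched as a small send buffer. This is an illustrative sketch; the class name and interface are invented, and only the numbering/ack-until-stored behaviour is taken from the text.

```cpp
#include <cstddef>
#include <iterator>
#include <map>
#include <cstdint>

// Sketch: protocol-layer messages get consecutive numbers and are kept
// until the attached acknowledgement number covers them. At reconnect,
// everything after the peer's "Last received" (from ChannelInfo) is
// retransmitted.
template <typename Msg>
class SendBuffer {
    std::map<uint32_t, Msg> unacked;  // msg number -> stored message
    uint32_t nextNum = 1;
public:
    uint32_t send(const Msg& m) {     // number and store the message
        unacked[nextNum] = m;
        return nextNum++;
    }
    void acknowledged(uint32_t upTo) {  // ack covers all messages <= upTo
        unacked.erase(unacked.begin(), unacked.upper_bound(upTo));
    }
    // Messages to retransmit after reconnecting, given the peer's
    // lastReceived from ChannelInfo.
    std::size_t toResend(uint32_t lastReceived) const {
        return std::distance(unacked.upper_bound(lastReceived),
                             unacked.end());
    }
};
```

Because acknowledgements are cumulative, one ack number piggybacked on any frame releases every stored message up to that number.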
If a connection is lost, the information in the ChannelInfo can be used at reconnection to determine whether messages were also lost. This approach means that a big message that was almost completely transmitted in a number of frames needs to be resent in full if one frame is lost, since only the complete message is acknowledged. However, with a reliable transport media this will rarely happen, and the implementation of the TransObject is free to add an extra acknowledgement schema if the protocols or media used are likely to cause lost messages.

5.6 Garbage collection and references

In order to know when ComObjects (and their channels) can be garbage collected, they need to keep track of their own and the remote site's need for the channel. The local need is defined by the union of existing queued messages and the presence of references via sites. A local reference is set to true once the protocol layer orders the ComObject to send, and is set to false when the ComObject is asked if it can be garbage collected. Similarly, a remote reference is set to true when a remote protocol layer message is received, and is cleared by a communication layer message at remote garbage collection.

5.6.1 Reference protocol

The only message exchange needed is when a channel is opened and for clearing a reference. Therefore, a field in the ChannelInfo declares the need when a channel is opened, and one message, the CLEAR_REFERENCE message, clears a remote reference.

5.6.2 Garbage collection

When Mozart does garbage collection, the distributed subsystem needs to be checked in two phases. The structure that might be garbage collected is shown in Figure 12.

Figure 12: The structure that might be garbage collected: a Site referencing a ComObject (with the flags hasLocalRef and hasExternalRef), which in turn references a TransObject.

1. All queued or unacknowledged MsgContainers need to be traversed, since they can hold references to Oz entities and therefore are roots for garbage collection. In particular, they can hold references to sites, which is what necessitates the two phases. MsgContainers are found through the ComController, which maintains a list of all (including anonymous) ComObjects.
2. The site table is traversed. All sites that were not marked need to check if they can be garbage collected. This is done by asking a present ComObject if it can be garbage collected. If the ComObject does not have a need, and does not have a TransObject, it will answer yes and will be collected. If the ComObject still has a TransObject but no need, it will send a C_CLEAR_REFERENCE message. If it also thinks that the remote peer has no need, it will send a C_CLOSE_WEAK message. Then it will answer no. If it is closeable, the close protocol will close the connection, and at the next garbage collection the ComObject will be collected too.

5.7 Resource caching

Using a transport media is often connected to some limited resource. In the case of TCP, this resource is file descriptors; with UDP it is port numbers, and with shared memory it is the number of shareable memory pages. In order not to put a limit on the number of connected sites, the available resources have to be shared fairly between connections. In this model, this is done by a mechanism called resource caching. A positive side effect of resource caching in this way is that memory usage is also limited. TransObjects contain ByteBuffers that may be large, and this way there will always be a limited number of ByteBuffers. As mentioned before, when establishing a connection an initiating ComObject has to ask the TransController for a TransObject. This gives the TransController a chance to control the number of simultaneously used TransObjects and thus the number of used resources. If there are not enough resources at the moment, the ComObject will be put in line to get a TransObject as soon as possible. When accepting a connection from another site, it is not possible to postpone getting a TransObject, since the other site will most likely give up before one is retrieved. Therefore, the TransController will keep a number of resources in order to be able to grant them to incoming requests. The number of resources available can be set from Oz level as a weak and a hard limit. The weak limit may be temporarily exceeded for incoming requests, but the hard limit is definite.
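The weak/hard limit policy can be sketched as a small controller. This is a sketch under stated assumptions: the names and the exact release/handover behaviour are illustrative, not the engine's actual TransController code.

```cpp
#include <cstddef>
#include <deque>

// Sketch of resource caching: connects queue up once the weak limit is
// reached, while accepts may exceed it up to the definite hard limit.
class TransController {
    std::size_t inUse = 0;
    std::size_t weakLimit, hardLimit;
    std::deque<int> waitingConnects;   // queued ComObject ids
public:
    TransController(std::size_t weak, std::size_t hard)
        : weakLimit(weak), hardLimit(hard) {}
    // A connect gets a resource only below the weak limit;
    // otherwise it is put in line and must wait for a handover.
    bool requestConnect(int comObjectId) {
        if (inUse < weakLimit) { ++inUse; return true; }
        waitingConnects.push_back(comObjectId);
        return false;
    }
    // An accept may temporarily exceed the weak limit.
    bool requestAccept() {
        if (inUse < hardLimit) { ++inUse; return true; }
        return false;                  // the hard limit is definite
    }
    // A returned resource is handed to the first waiting ComObject,
    // if any; only otherwise does the usage count actually drop.
    void release() {
        if (!waitingConnects.empty()) waitingConnects.pop_front();
        else --inUse;
    }
    std::size_t used() const { return inUse; }
};
```

In this sketch the waiting ComObject would be notified out of `release()`; how that callback is wired up is left out.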
When the available resources run out, the TransController will start a timer and try to preempt the resources of some ComObjects. The ComObjects then have to nicely close their connections and hand back the TransObject to the TransController, which can grant it to the first waiting ComObject.

5.7.1 Preemption

A big issue for efficiency with e.g. servers is whom to preempt, how many, and when. This is a scheduling problem and is not yet addressed, only prepared for. For now a naive model is adopted: whenever a resource for accept is granted and the weak limit is exceeded, or a connect is requested and the weak limit is reached, a timer is started. The timer will preempt the resources of one ComObject at regular intervals. The decision on which one to preempt is made as follows: the list of running ComObjects is traversed and the first one that can be closed (currently defined as being in state working) will be chosen, with the following prioritizing:

1. Has empty buffers and no queued messages.
2. Has empty buffers but queued messages.
3. Has non-empty buffers.

The basic advantage of this model is its simplicity. Disadvantages are plenty: No respect is given to incoming messages. Thus the one connection on which a message resolving a suspension in this engine is about to arrive might be closed, or a server might close an active client instead of a client that just forgot to disconnect. The empty-buffer criterion might not be relevant for all transport mediums. Clients behind firewalls should be treated separately, since it might be impossible to reconnect from the outside. No guarantees are given on how long a connection can stay up before it will be closed again.
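The naive victim selection can be sketched as a ranking function over the running ComObjects. This is an illustrative sketch: the struct and its flags are invented here, and picking the globally best-ranked candidate is a simplification of traversing the list.

```cpp
#include <vector>

// Sketch of the preemption decision: among ComObjects in state
// working, prefer (1) empty buffers and no queued messages, then
// (2) empty buffers but queued messages, then (3) non-empty buffers.
struct ComInfo {
    int  id;
    bool working;        // only working ComObjects are closeable
    bool buffersEmpty;
    bool hasQueuedMsgs;
};

int rank(const ComInfo& c) {  // lower rank => preempt first
    if (c.buffersEmpty && !c.hasQueuedMsgs) return 1;
    if (c.buffersEmpty)                     return 2;
    return 3;
}

// Returns the id of the ComObject to preempt, or -1 if none is
// closeable.
int chooseVictim(const std::vector<ComInfo>& all) {
    int best = -1, bestRank = 4;
    for (const ComInfo& c : all)
        if (c.working && rank(c) < bestRank) {
            best = c.id;
            bestRank = rank(c);
        }
    return best;
}
```

A more refined policy (e.g. least recently used, as in the old engine) would only change `rank`, which is why the decision is easy to replace later.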

5.8 Error handling

A new fault model for the Mozart platform is currently being discussed. It is therefore not completely clear what error handling mechanisms need to be provided by the network level. The main support, however, will be the probing mechanism described in 5.8.1. In the cases where it can be definitely known that a remote site is permanently down, this needs to be reported to upper layers. Those cases are when the open protocol determines the remote site to run the wrong perdioversion or to be a different site than expected, or possibly when the transport media reports that the site is permanently down. The new fault model will give the application programmer some way to close the communication with a remote site. The importance of being able to do so is to free resources such as memory and any other resources connections use.

5.8.1 Probing

To be able to give the application programmer full control over communication, some type of roundtrip measurement must be done. Probes will essentially put a simple ping message in the channel, mark the time and check when it is acknowledged. In order to receive an accurate time measurement, the remote ComObject must be instructed to acknowledge messages as soon as they come in. To avoid sending too many extra messages on an already congested channel, ping messages will only be sent when no other messages are sent. When probing is turned on and regular messages are sent, these are simply marked with a time.

5.9 Logging

A possibility to log all incoming and outgoing messages is convenient for tuning and debugging. Logging should be controlled from the application programmer level with a specified output file. This output file can then be used with special tools written in Oz to graphically study the behavior of the network layer.
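The probing mechanism of section 5.8.1 can be sketched as below. The class and its interface are illustrative; the clock is passed in explicitly only to keep the sketch deterministic and testable.

```cpp
#include <cstdint>

// Sketch of a probe: a ping (or a regular outgoing message, when one
// is available) is stamped with the send time, and the roundtrip is
// measured when the matching acknowledgement arrives.
class Probe {
    int64_t sentAt  = -1;   // -1 = no measurement in flight
    int64_t lastRtt = -1;   // -1 = no measurement yet
public:
    bool pending() const { return sentAt >= 0; }
    // Mark an outgoing message; only one measurement is in flight at
    // a time, so further marks are ignored until the ack arrives.
    void markSent(int64_t nowMs) {
        if (!pending()) sentAt = nowMs;
    }
    void onAck(int64_t nowMs) {
        if (pending()) {
            lastRtt = nowMs - sentAt;
            sentAt = -1;
        }
    }
    int64_t roundtripMs() const { return lastRtt; }
};
```

Since the acknowledgement number sits in the fixed frame prefix, the ack can be read as soon as nine bytes have arrived, which is what makes the measurement reasonably accurate even while a large message body is still in transit.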


6 Evaluation

This part serves to evaluate performance changes of the network layer between the old and the new design, and also to make a small comparison of distributed programming in Mozart versus Java's RMI.

6.1 Evaluation of the performance of the new vs. the old design

To fully evaluate the performance changes, one would have to study how the performance has changed for each of the distributed entities as well as for larger applications performing various different tasks. Such a study is too large to fit inside this thesis project, and therefore only the performance of the network layer is measured and compared. Questions to be answered are:
1. Throughput: Has the throughput of small/large chunks of data changed? Is the throughput tunable by tuning the buffer size?
2. Simultaneous connections: Has the capability and performance of being simultaneously connected to a number of sites changed?
3. Multiplexing: Has the buffering improved performance when more threads are communicating with the same site simultaneously?
4. Memory usage: Has the memory usage changed?
The suspendable marshaler (see Section 5.3.3) is not yet done, and this affects some of the testing. The buffer size has to be set high, and therefore test 4 would show a very high memory usage. Test 4 is therefore left out for the time being. It is also not possible to test tuning as mentioned in test 1.
A generic test measuring roundtrip times was constructed. It consists of one server and a number of clients. The server listens on a port for incoming messages and simply sends these messages back. It also triggers the clients to start and measures the total time from the trigger until all clients are done. The server can also poll its memory usage. Every client sends lists of a specified size (containing the integer 1) and waits for them to come back. The clients measure the total time for a number of iterations. Specifiable parameters are the number of clients, the size of the lists, the number of iterations, and memory polling on/off. The code of the test can be found in Appendix A. For the "old" engine, the prerelease version of Mozart 1.1 is used. The "new" engine is the one described in the design.
All tests are run over a LAN at off-peak hours. The server runs on a workstation and all clients on a CPU server. Comparative tests are run in a sequence of new, old, new, old to verify reproducible results.

1. Throughput

Throughput is measured by starting one client and sending larger and larger lists for a number of iterations. The average roundtrip time is recorded and plotted in Figure 13. Iterations: 100.


Figure 13: The size of the list (x, 0-1000 elements) versus the average roundtrip time (y, milliseconds) for the new and old engines.

This indicates that the performance has improved for all sizes. It should be noted that these results depend on the number of iterations. At 1000 iterations the differences vanish for list sizes below about 300 elements, but remain for larger lists. At fewer iterations the differences increase.

2. Simultaneous connections

The list size is set rather small, and the number of clients is gradually increased. The total times at the server and the clients are measured and plotted. An attempt is also made to find an upper limit on the number of clients one server can handle simultaneously. The engine by default allows thirty file descriptors to be used simultaneously (as the weak limit described in Section 5.7). Results are displayed in Figure 14 and Figure 15. List size: 100, Iterations: 100.

Figure 14: The number of clients (x, up to 100) versus the total time at the server (y) in seconds, for the new and old engines.

Figure 15: The number of clients (x, up to 100) versus the average roundtrip time at each client (y) in milliseconds, for the new and old engines.

With many clients, the old engine test only rarely came to a result, explaining the lack of data in that graph. With the new engine, tests with up to 170 clients were successful. Beyond that, the operating system did not allow more processes on the CPU server. Further testing with more machines will be conducted in the future. When viewing the total time of the server (Figure 14), it is evident that the performance has improved, but the second graph (Figure 15) is more puzzling. The strange shape of the curves can be explained by the way the test is written. All clients introduce themselves and then wait for the server to respond before they start their timers. This is to make sure that they really run simultaneously. With many clients, the response from the server will be delayed until there is a connection to a particular client. This means a client can starve for some time and then still measure a short time (also see the code in Appendix A). The fact that the clients running with the old engine get a lower time comes from the preemption decision. The old engine has a more sophisticated algorithm that closes the least recently used connection. The new engine still makes a very naive decision, as described in Section 5.7. The conclusion is that performance has improved, but resource scheduling can be further refined. It should be considered that different applications might favor clients getting done fast, while others might want to let clients use resources fairly.

3. Multiplexing

The roundtrip time when several threads at one client send to the same server is measured and compared to the results for simultaneous connections above. Since all threads in one ozengine use the same communication channel, this tests effects on performance due to messages being sent in one or more chunks. The result is shown in Figure 16. List size: 100, Iterations: 100.
Figure 16: The number of clients or threads (x, up to 100) versus the total time at the server (y) in seconds, for old threaded, new threaded, old, and new.

As a measure of how much the multiplexing improved, the ratio between the results for multiple clients and multiple threads was calculated for a few values:

Number of clients/threads:  10    30   60
Old engine:                 0.99  1.3  1.3
New engine:                 0.79  1.4  1.5

The results show that multiplexing has not improved much. This is an effect of the naive scheduling algorithm currently used (described in Section 5.3.1), which lets TransObjects run at every thread switch. This means the messages produced by each thread at the client side will still be marshaled and sent in separate parts. The small improvement that can be observed comes from the server side, where marshaling and sending will be done only when the (one) reading thread suspends waiting for more data. The fact that the threaded test runs faster than the multiple client test must be due to the operating system having to swap different processes in and out. How the scheduling algorithm can be altered to use I/O resources more efficiently should be further investigated in the future.

6.2 A comparison with Java's RMI

To compare Mozart to Java's RMI, one can either compare the performance of the message passing services of both platforms with test examples producing similar messaging, or compare the performance of the whole systems by using typical efficient programming styles for each language. Here the performance of the message passing services is compared. A comparison of the second kind can be found in [11]. The test is run with the server process on one workstation and all clients on one CPU server. The Java version is Blackdown JDK 1.2.2.

Intensive remote method invocation

A simple server with one object having one method, "sayHi(Name)", is set up. A number of clients connect and measure the amount of time needed to invoke the remote method a large number of iterations. The Java implementation is straightforward, but in the case of Mozart the server object can be either mobile or static. A comparison using a static object gives the difference in the message passing services, and therefore that is used here. The result is presented in Figure 17:
Figure 17: Number of clients (x, up to 30) versus average total time for 100 iterations measured at the clients (y) in milliseconds, for Java and Mozart.

The graph shows a higher performance of the network layer of Mozart with many clients. Additional testing with only one client shows that Java's RMI has a slightly higher performance (about 2% lower total time) than the network layer of Mozart.

Running tests with more than thirty clients was not possible, since the Java test reported OutOfMemoryException.

7 Future work

Most of the design described in Section 5 was implemented during this thesis project, but some parts remain, and additional work can be done to fully utilize it. Work for the future:
- Design and implement the "Physical Media Mediator" (Section 5.2.3).
- Implement more transport layers. To begin with, adapt the old "virtual sites" that use shared memory when sites reside on the same machine.
- Add scheduling of when TransObjects should run, as discussed in Sections 5.3.1 and 6.1.
- Assign different priorities to different types of messages to fully use the model described in Section 5.4.
- Improve the preemption decision discussed in Section 5.7.1.
- Implement an interface for the application programmer to use the probes discussed in Section 5.8.1.
- Do further testing as soon as the suspendable marshaler is done (Section 6.1).

Conclusions
It was possible to implement a new network layer with more complex features and still retain, or even improve on, the performance of the existing and working network layer of Mozart. It was also possible, with simple means, to reach higher scalability than the network layers of Java and the old Mozart. The maximum amount of memory needed for message passing can be known in advance. Much remains to be done to refine this model, further improve its performance, and make it dynamically changeable.
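The claim that the maximum memory needed for message passing is known in advance follows from the fixed buffer sizes and the TransController's resource limits. A back-of-the-envelope sketch in C++, with all constants invented for illustration (they are not the values used in Mozart):

```cpp
#include <cassert>

// Sketch of why the memory bound is computable in advance. All three
// constants are invented for illustration; they are NOT the values
// used in Mozart.
const int kBufferSize         = 64 * 1024; // bytes per buffer (assumed)
const int kBuffersPerTransObj = 2;         // one send + one receive buffer
const int kMaxTransObjs       = 50;        // weak resource limit (assumed)

// Upper bound on buffer memory used by message passing at any time:
// every open TransObject owns a fixed number of fixed-size buffers.
long maxMessagePassingMemory() {
    return (long)kMaxTransObjs * kBuffersPerTransObj * kBufferSize;
}
```

With the assumed figures the bound evaluates to 6 553 600 bytes; the point is only that the bound is a product of constants known when the system starts.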

Appendix A: Evaluation test code


Here is the Oz code used in Section 6.1:

server.oz:

functor
import
   Application Pickle Connection System Property
   DPPane(getNetInfo) at 'x-oz://boot/DPPane'
define
   S
   P={NewPort S}
   {Pickle.save {Connection.offerUnlimited P} '/home/annan/tmp/ticket'}
   [SN SSize SIt DoPoll]={Application.getArgs plain}
   N={String.toInt SN}
   Size={String.toInt SSize}
   It={String.toInt SIt}
   proc {Poll}
      {System.show memory(active:{Property.get gc}.active
                          heap:{Property.get memory}.heap
                          net:{FoldL {DPPane.getNetInfo $}
                               fun {$ Ack E} Ack+E.nr*E.size end 0})}
      {Delay 10000}
      {Poll}
   end
   Dict={NewDictionary}
   StartT EndT
   In={NewCell 0}
   Trig
   Done={NewCell 0#0.0}
in
   thread
      {ForAll S
       proc {$ Msg}
          case Msg
          of _#X#RP then
             if {Access In} < N then raise error end end
             {Send RP X}
          [] hi(Name Ret Token) then Old New in
             % A ref to a port is saved to keep the channel from
             % being gc:ed
             {Dictionary.put Dict Name Token}
             Ret=trig(Trig)
             {Exchange In Old New}
             if Old<N then
                New=Old+1
                if New==N then
                   {System.show allin}
                   Trig=unit
                   StartT={Property.get time}
                end
             else
                {System.show more_than(N)}
             end
          [] done(Name AvRtt) then Cur New Tot NewTot in
             {Dictionary.put Dict Name unit} % Cancel need - for gc
             {Exchange Done Cur#Tot New#NewTot}
             New=Cur+1
             NewTot=Tot+AvRtt
             if New==N then
                EndT={Property.get time}
                {System.show alldone}
                {System.gcDo} {Delay 2000} {System.gcDo}
                {System.show serverresult(number:N size:Size iterations:It
                                          time:EndT.total-StartT.total
                                          avrtt:NewTot / {Int.toFloat N})}
                {Application.exit 0}
             end
          end
       end}
   end
   if DoPoll=="poll" then thread {Poll} end end
end

client.oz:

functor
import
   Application Pickle Connection
   System(show showInfo) Property(get) Fault
define
   {Fault.defaultDisable _}
   {Fault.defaultEnable [permFail] _}
   [SName SSize SIterations]={Application.getArgs plain}
   Name={String.toInt SName}
   Size={String.toInt SSize}
   It={String.toInt SIterations}
   P MyPort AvRtt
   fun {GetPort Times}
      try
         {Connection.take {Pickle.load '/home/annan/tmp/ticket'}}
      catch X then
         if Times =< 100 then
            {Delay 100}
            {GetPort Times+1}
         else
            {System.show givingUp(Name X)}
            {Application.exit 0} nil
         end
      end
   end
   proc {DoSend P Msg}
      try {Send P Msg}
      catch system(dp(conditions:tempfail|_ ...)) then
         {Delay 1000}
         {DoSend P Msg}
      end
   end
   fun {Fun St X} /*T1 T2*/ L Snext in
      try
         L = {Map {MakeList Size} fun{$ _} 1 end}
         {DoSend P Name#L#MyPort}
         St=L|Snext
      catch X then
         {System.show otherEx(Name X)}
         {Application.exit 1}
      end
      Snext
   end
in
   {Delay 5000}
   P={GetPort 0}
   {Wait P}
   local Ret TmpS Tmp={NewPort TmpS} in
      {Send P hi(Name Ret Tmp)}
      case Ret
      of trig(Trig) then
         proc {W Times}
            try {Wait Trig}
            catch system(dp(conditions:tempfail|_ ...)) then
               if Times=<100 then
                  {Delay 100} {W Times+1}
               else
                  {System.show givingUpWait(Name)}
                  {Application.exit 0}
               end
            [] X then
               if Times=<100 then
                  {System.show waitEx(Name X)}
                  {Delay 100} {W Times+1}
               else
                  {System.show givingUpWait(Name)}
                  {Application.exit 0}
               end
            end
         end
      in
         {W 0}
      end
   end
   local T1 T2 in
      {Property.get time T1}
      {ForThread 1 It 1 Fun {NewPort $ MyPort} _}
      {Property.get time T2}
      AvRtt={Int.toFloat T2.total-T1.total}
            / ({Int.toFloat It} * {Int.toFloat Size})
   end
   {Send P done(Name AvRtt)}
   {Application.exit 0}
end


Appendix B: Interfaces

Site
For ComObject:
ComObj *setComObj(ComObj *newComObj): Tells the site about an anonymous ComObject. If the site does not have a ComObject, newComObj is taken and NULL is returned. If it has one, newComObj is ignored and the old one is returned.

ComController
For Site:
ComObj *newComObj(DSite *site, int recCtr): Returns a fresh ComObject.
void deleteComObj(ComObj *comObj): Deletes the ComObject (and removes any current activities).

ComObject
For Site:
void send(MsgContainer *, int priority): Hands down a MsgContainer to be sent. Returns immediately; the message is later pulled by the distribution layer. If a connection is not yet established, one will be opened now.
void installProbe(int lowerBound, int higherBound, int interval): Installs a probe with the given limits.
Bool canBeFreed(): A question that implicitly tells the ComObject that no local references exist. If the ComObject is done, true is returned; otherwise false is returned, possibly after sending some messages to the other side to tell that this side is done.

For TransController:
void preemptTransObj(): Tells that the connection has to be temporarily taken down and the TransObject has to be handed back to the TransController.

For TransObject:
MsgContainer *getNextMsgContainer(int &acknum): Gives the MsgContainer of the next message to be sent. The number of the last received message is put in acknum.
void msgPartlySent(MsgContainer *): Stores a message that was only partially marshaled and sent.
void msgSent(MsgContainer *): Tells that this message was sent. It can now be put in the unacknowledged list.
void msgAcked(int num): Tells that message number num was acknowledged.
MsgContainer *getMsgContainer(): Gives a clean MsgContainer to be filled with an incoming message.
void msgPartlyReceived(MsgContainer *): Stores a message that was only partially received and unmarshaled.
MsgContainer *getMsgContainer(int num): Gives the previously stored MsgContainer for partially received message number num.
Bool msgReceived(MsgContainer *): Hands up a fully received message. The ComObject should return whether it wishes to continue with this buffer.
void connectionLost(void *info): Tells that the connection was lost.
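As an illustration of how a TransObject is intended to drive this interface when sending, the following C++ sketch fills a bounded buffer from the queued messages and reports partial sends back. MsgContainer, MockComObj, and deliverOnce are simplified stand-ins invented here; only the method names and their calling order come from the interface description above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <queue>
#include <string>

// Simplified stand-in for the thesis type (illustrative only).
struct MsgContainer {
    std::string payload;   // pretend marshaled bytes
    size_t sent = 0;       // how much has gone out so far
};

// A mock ComObj exposing the calls a TransObject makes when sending.
struct MockComObj {
    std::queue<MsgContainer*> sendQueue;
    std::queue<MsgContainer*> unacked;

    MsgContainer *getNextMsgContainer(int &acknum) {
        acknum = 0;  // would be the number of the last received message
        if (sendQueue.empty()) return nullptr;
        MsgContainer *m = sendQueue.front();
        sendQueue.pop();
        return m;
    }
    // Partially sent: stored to be continued (re-queued here for simplicity).
    void msgPartlySent(MsgContainer *m) { sendQueue.push(m); }
    // Fully sent: moves to the unacknowledged list.
    void msgSent(MsgContainer *m) { unacked.push(m); }
};

// One round of a deliver loop: fill a bounded buffer from queued
// messages, reporting partial sends back to the ComObj.
size_t deliverOnce(MockComObj &c, char *buf, size_t bufSize) {
    size_t used = 0;
    int acknum;
    while (MsgContainer *m = c.getNextMsgContainer(acknum)) {
        size_t left = m->payload.size() - m->sent;
        size_t room = bufSize - used;
        size_t n = left < room ? left : room;
        std::memcpy(buf + used, m->payload.data() + m->sent, n);
        m->sent += n;
        used += n;
        if (m->sent < m->payload.size()) {  // buffer full mid-message
            c.msgPartlySent(m);
            break;
        }
        c.msgSent(m);
        if (used == bufSize) break;
    }
    return used;
}
```

The sketch shows the essential contract: the TransObject pulls messages, and a message that does not fit in the current buffer is handed back via msgPartlySent instead of blocking.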

"Physical Media Mediator"
For ComObject:
void doConnect(ComObj *comObj): Initiates a connection. Eventually comObj will be handed back a TransObject with an established physical channel.
void handback(ComObj *comObj, TransObj *transObj): Hands back transObj owned by comObj. This closes the physical channel and hands the transObj back to the TransController using transObjFreed.

For TransController:
void transObjReady(TransObj *transObj): Hands back a TransObject after a request from doConnect.

TransController
These methods are implemented by the parent TransController:
For "Physical Media Mediator":
TransObj *getTransObj(): Returns a fresh TransObject to be used at accept if resources are available, otherwise returns NULL. Since the ComObject is not known at this point, the ComObject must report to the TransController as soon as it gets the returned TransObject.
void getTransObj(ComObj *comObj): Tells that a TransObject is wanted for comObj. When resources are available, transObjReady is called to hand over the TransObject.
void transObjFreed(ComObj *comObj, TransObj *transObj): Hands back transObj to let someone else have it.
void comObjDone(ComObj *comObj): Cancels any request for a TransObject for comObj.

For ComObject:
void addRunning(ComObj *comObj): Tells which ComObject is running after accept, as described at the first getTransObj above.
void switchRunning(ComObj *anon, ComObj *old): Called when an anonymous ComObject hands over a TransObject to an old ComObject.
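The resource handout described by getTransObj and transObjFreed amounts to a small queueing policy: hand over a TransObject if the limit allows, otherwise queue the request until one is freed. The C++ sketch below is illustrative only; TransControllerSketch and its members are invented names, not the thesis implementation.

```cpp
#include <cassert>
#include <queue>

// Minimal stand-ins for the thesis types (illustrative only).
struct TransObj {};
struct ComObj { TransObj *trans = nullptr; };

class TransControllerSketch {
    int inUse = 0;
    const int weakMax;              // weak resource limit
    std::queue<ComObj*> waiting;    // ComObjects waiting for a TransObject
public:
    explicit TransControllerSketch(int max) : weakMax(max) {}

    // Corresponds to getTransObj(ComObj*): hand over now, or queue.
    void getTransObj(ComObj *c) {
        if (inUse < weakMax) {
            ++inUse;
            c->trans = new TransObj();  // transObjReady in the real design
        } else {
            waiting.push(c);
        }
    }
    // Corresponds to transObjFreed: recycle for the next waiter.
    void transObjFreed(TransObj *t) {
        if (!waiting.empty()) {
            ComObj *next = waiting.front();
            waiting.pop();
            next->trans = t;            // reuse instead of delete/new
        } else {
            delete t;
            --inUse;
        }
    }
    int resourcesInUse() const { return inUse; }
};
```

Reusing a freed TransObject for the next waiter, rather than destroying and recreating it, is what makes the cache in Section 5.7 cheap.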

These methods must be implemented by any TransController:
For (parent) TransController:
virtual TransObj *newTransObj(): Gives a new TransObject.
virtual void deleteTransObj(TransObj *transObj): Deletes a TransObject.
virtual int getMaxNumOfResources(): Gives the hard upper limit on the resources for this TransController's type of TransObject.
virtual int getWeakMaxNumOfResources(): Gives the weak upper limit on the resources for this TransController's type of TransObject.

TransObject
These methods must be implemented by any TransObject:
For "Physical Media Mediator":
void setSite(DSite *site): Sets the site that this TransObject belongs to.
void setOwner(ComObj *comObj): Sets the ComObject that this TransObject belongs to.

For ComObject:
void init(): Refreshes the TransObject to be as new.
void *close(): Closes the TransObject.
void deliver(): Tells the TransObject that there are messages to be delivered.
TransController *getTransController(): Gives the TransController of this TransObject.

For TransController:
Bool hasEmptyBuffers(): Asked to make decisions on what TransObject to preempt.
int getBufferSize(): Should report what buffer size is used for ChannelInfo.

MsgContainer
void setMessageType(MessageType mt)
MessageType getMessageType()
void setImplicitMessageCredit(DSite *s)
DSite *getImplicitMessageCredit()
void put_<MessageType>(<Arguments specified by MessageType>): Puts the contents of a specific MessageType inside the MsgContainer.


void get_<MessageType>(<Arguments specified by MessageType>): Gets the contents of a specific MessageType from the MsgContainer. (The arguments are references.)
void marshal(ByteBuffer *byteBuffer): Marshals the contents of the MsgContainer to the ByteBuffer.
void unmarshal(ByteBuffer *byteBuffer): Unmarshals the message in the byteBuffer and stores the retrieved contents in the MsgContainer.
void gcMsgC(): Takes care of garbage collection of the contents of this MsgContainer.

Flags and flag operations:
MSG_HAS_MARSHALCONT
MSG_HAS_UNMARSHALCONT
void setFlag(int f)
int getFlags()
Bool checkFlag(int f)
void clearFlag(int f)

For ComObject only:
void setMsgNum(int msgNum): Sets the number of the message.
int getMsgNum(): Gets the number of the message.
void setSendTime(int sendTime): Sets the send time.
int getSendTime(): Gets the send time.
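The flag operations listed above suggest a plain bitmask. A minimal C++ sketch follows; the numeric values of the two continuation flags are assumed for illustration and may differ from the real constants.

```cpp
#include <cassert>

// Assumed bit values for the two continuation flags (illustrative).
enum {
    MSG_HAS_MARSHALCONT   = 0x1, // marshaling continues in a later buffer
    MSG_HAS_UNMARSHALCONT = 0x2  // unmarshaling continues in a later buffer
};

// Minimal bitmask implementation matching the listed flag operations.
class MsgFlags {
    int flags = 0;
public:
    void setFlag(int f)         { flags |= f; }
    void clearFlag(int f)       { flags &= ~f; }
    bool checkFlag(int f) const { return (flags & f) != 0; }
    int  getFlags() const       { return flags; }
};
```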


References
[1] Andrew Birrell, Greg Nelson, Susan Owicki, Edward Wobber. Network Objects. Research Report 115, Digital Equipment Corporation, Systems Research Center, 1995.
[2] Java Remote Method Invocation Specification, JDK 1.2. Sun Microsystems, 1998.
[3] Luca Cardelli. A Language with Distributed Scope. Digital Equipment Corporation, Systems Research Center, 1995.
[4] Seif Haridi, Peter Van Roy, Per Brand, Christian Schulte. Programming Languages for Distributed Applications, 1998.
[5] Java Remote Method Invocation. Available at: http://java.sun.com/marketing/collateral/rmi_ds.html, 1999.
[6] Connection procedure, being designed and described by Konstantin Popov and Erik Klintskog.
[7] Ralf Scheidhauer. Design, Implementierung und Evaluierung einer virtuellen Maschine für Oz. Dissertation, Universität des Saarlandes, 1998.
[8] Frequently Asked Questions - RMI and Object Serialization. Available at: http://java.sun.com/products/jdk/1.2/docs/guide/rmi/faq.html, January 2000.
[9] Andrew S. Tanenbaum. Computer Networks, ISBN 0-13-394248-1, 1996.
[10] Andrew S. Tanenbaum. Distributed Operating Systems, ISBN 0-13-143934-0, 1995.
[11] Jan Tängring. Mozart: koncis och snabb. Datateknik 3.0, No 3, 1999.
